Abstract
Quantitative structure–activity relationship (QSAR) modeling is a powerful tool for drug discovery, yet the lack of interpretability of commonly used QSAR models hinders their application in molecular design. We propose a similarity-based regression framework, topological regression (TR), that offers a statistically grounded, computationally fast, and interpretable technique to predict drug responses. We compare the predictive performance of TR on 530 ChEMBL human target activity datasets against the predictive performance of deep-learning-based QSAR models. Our results suggest that our sparse TR model can achieve equal, if not better, performance than the deep-learning-based QSAR models and provide better intuitive interpretation by extracting an approximate isometry between the chemical space of the drugs and their activity space.
Introduction
Quantitative structure–activity relationship (QSAR) models have become an essential tool in pharmaceutical discovery, especially in the virtual screening for hits and lead optimization stages^{1}. Experimental characterization of candidate molecules is expensive and time-consuming. As a relatively easy-to-implement alternative, QSAR models could be a valuable tool for assisting chemists by providing design ideas to prioritize their experiments. QSARs are usually supervised machine learning models that describe the connections between chemical structures and their biological activities, such as their potency, physicochemical properties, pharmacokinetic properties, or environmental effects^{2}. QSAR models enable in silico structural design by providing property predictions from machine-readable representations of the chemical structure, thereby helping generate and prioritize design ideas. This technique has been widely applied in virtual screening and lead optimization with a fair amount of success^{1,3}.
In QSAR methods, chemical substances must first be transformed into machine-comprehensible mathematical representations. Three commonly used representations are (a) vectors such as classical molecular descriptors or molecular fingerprints (FPs), (b) graphs, and (c) strings such as Simplified Molecular Input Line Entry System (SMILES). Classical molecular descriptors^{4} encode a specific computed or measured attribute of the molecule into a single number, for instance, the count of bonds, atoms, functional groups, or physicochemical characteristics, and are often used in combination to form feature vectors. PaDEL^{5}, Mordred^{6}, and RDKit are examples of popular descriptor-calculation software packages for numerically representing the chemical structure and molecular characteristics. Extended-connectivity fingerprints (ECFPs)^{7} are an example of a topological fingerprint computed using a variant of the Morgan algorithm that encodes chemical substructures by atom neighborhoods using a high-dimensional sparse bit-string representation. The graph representation, on the other hand, characterizes 2D chemical structures as graphs, with atoms as vertices and bonds as edges. SMILES specifies a notation for representing the chemical graphs of molecules as strings of characters.
Once the chemical structures are represented using a suitable protocol, a predictive method is chosen to connect the structural information with the functional properties. For instance, if the chemical structures are represented as strings or graphs, deep-learning methods are often used for prediction due to their ability to perform embedded feature extraction. Chemprop^{8}, in particular, has turned out to be a popular method that uses directed message-passing neural networks to learn molecular representations directly from the graphs to predict the properties of molecules. This method has been shown to excel at antibiotic discovery^{9,10} and lipophilicity prediction^{11}, indicating its potential as a QSAR model. With the rise in popularity of large language models and the attention mechanism, the use of SMILES strings has been increasingly investigated for their potential embedded feature extraction, predictive performance, and interpretability. For example, the authors of^{12} pretrained a transformer-based network through masked SMILES recovery, and offered the pretrained model for transfer learning onto specific tasks. Similarly, Transformer-Convolutional Neural Network (CNN)^{13} applies the transformer architecture to canonicalize SMILES string inputs and enables transfer learning of the model onto specific activity prediction tasks.
QSAR models are often developed for their predictive performance. However, the effectiveness of QSAR models, as a computational tool assisting molecular discovery and design, could be greatly improved by enhancing their domain-specific interpretability. Model interpretability, usually defined as the ability to explain predictions in a human-understandable way^{14}, typically consists of computing feature importance scores^{15,16,17,18}, using influence functions to identify training instances most responsible for the prediction^{19}, developing locally interpretable models to approximate global black-box algorithms^{20,21,22}, and generating counterfactuals^{23,24}. For example, standard shallow learners, like Random Forests (RF) and Support Vector Machines (SVM), are often used in QSAR modeling to offer feature importance scores^{25}. However, molecular interpretability is largely based on the interpretability of the underlying molecular representation. For instance, ALogP can be used as an important classical descriptor that plays a key role in determining the solubility of a molecule. However, a target value of ALogP cannot be mapped back to a precise chemical structure. When using interpretable fingerprints, the foregoing feature importance scores could potentially map prediction contributions onto the molecule to visualize which substructures positively or negatively impacted the prediction^{25,26,27}. Although feature importance measures increase the explanatory power of machine learning models, caution must be taken when these scores are invoked on molecules outside the applicability domain of the model, as prediction importance does not always translate to biological relevance^{28}. Locally interpretable models can be fitted to explicate predictions of black-box models. For instance, SHapley Additive exPlanations (SHAP) offers a model-agnostic method for calculating prediction-wise feature importance^{21,22}.
Since this technique usually indicates which features contributed most to the model prediction for a specific test instance, it may not always lead to actionable design ideas. Thus, the authors of^{24} proposed Molecular Model Agnostic Counterfactual Explanations (MMACE) to generate counterfactual explanations that would help answer the question: what changes will result in an alternate outcome, regardless of the underlying model used? These methods are based on the model’s knowledge and, therefore, may be influenced by chance correlation, rough response surfaces, and overfitted models, leading to disappointing results^{29}. Recent advances in the attention mechanism of deep learners offer some explanatory power^{30}. For instance, the method of^{31} uses Layer-wise Relevance Propagation to provide structural interpretation of nodes and edges (atoms and bonds), Transformer-CNN incorporates Layer-wise Relevance Propagation to calculate individual atom contributions influencing the predictions, and the method of^{32} uses saliency maps to highlight the substructures closely related to the model output. These maps are analogous to the foregoing feature importance concept and have similar drawbacks in terms of deriving actionable insights for the design of new molecules.
Similarity-based methods^{33} (k-nearest neighbor (KNN), kernel regression^{34,35}, and pairwise kernel method^{36}) provide natural intuitive interpretation at the instance level by directly providing the training instances that influenced the model’s prediction the most. For example, read-across is a popular alternative property prediction technique that finds the most similar chemicals to the query chemical. Numerous publicly available tools use some variants of read-across techniques to aid chemists with design ideas^{37}. These tools allow chemists to assess the potential of the selected analogous neighbors to infer properties of the query chemical. Additionally, similarity-based methods allow informative visualizations through network graphs derived from the similarities. Network-like Similarity Graphs (NSG)^{38} were developed to guide lead optimization in drug discovery and have often been used to display the complex activity landscapes and the relationships between chemicals within a target set in 2D. Expanding this to drug-target interactions, methods like Similarity Ensemble Approach (SEA)^{39} and Chemical Similarity Network Analysis Pulldown (CSNAP)^{40} enable visualization of drug-target interaction networks and the prediction of off-target drug interactions, which have led to deeper investigations into drug polypharmacology and the discovery of off-target drug interactions^{41,42}. As we show later, these chemical similarity networks allow the clustering of similar molecules, which enables practitioners to mine regions of desired activity for innovative design ideas and potential leads. In addition to providing prediction-wise training instance importance, these graph structures are directly compatible with Laplacian Scores^{43} for global feature importance, which have been used in QSAR modeling for feature selection^{44,45}.
Since SHAP and MMACE are model-agnostic, they can also be paired with similarity-based QSAR models to allow prediction-wise feature importance and the generation of unseen counterfactuals. Thus, similarity-based methods can provide multiple layers of interpretability on top of the commonly applied chemical similarity interpretation and visualization methods listed above.
However, a problem in similarity-based QSAR is that most QSAR methods assume that similar structures lead to similar activities. This assumption is often violated in chemical structure modeling due to the prevalence of activity cliffs (ACs)^{46}, which are pairs of compounds with similar molecular structures but with a large difference in potency against their target^{47}. The existence of ACs often causes QSAR models to fail, especially in the lead optimization stage^{48}, and limits the prediction performance across the drug landscape, leading to the use of network-based methods to interpret and analyze their behavior^{38,49}. One way to use similarity-based methods in the presence of ACs is to learn the similarity metric from the data itself, instead of choosing a similarity metric a priori. Large margin nearest neighbor^{50} is a very popular algorithm for supervised metric learning when the response variable is categorical. For continuous response variables, Metric Learning Kernel Regression (MLKR)^{51} is perhaps the most popular algorithm to estimate the similarity metric. Metric learning techniques offer good explanatory power because once the metric is learned, the chemical space of molecules is approximately isometric to the activity space, resulting in smoother structure–activity landscapes as shown in^{52}. Consequently, under the learned metric, high-activity molecules are clustered relatively tightly in the chemical space and, therefore, that space could be mined for new molecules. Figure 1 depicts this phenomenon using various projection methods, namely Generative Topographic Mapping (GTM)^{53}, Multidimensional Scaling (MDS), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), to show the interpolated activity landscapes of the protein target Coagulation factor XIII, or CHEMBL4530, in 2 dimensions and compare them with an MLKR-based representation.
Observe that, except for MLKR, none of the methods were able to separate two chemically-similar-but-functionally-different molecules, CHEMBL208650 and CHEMBL2086502, which have a Tanimoto similarity between ECFP4 fingerprints of 0.70 but a target difference of 2.61. This is to be expected because, in their original form, GTM, MDS, t-SNE, and UMAP are all unsupervised techniques and do not incorporate the activity information in their projections. MLKR, on the other hand, is a supervised metric learning method, which allows it to incorporate the target activity information, resulting in smoother activity landscapes.
In this paper, we develop an MLKR-inspired, regression-based technique, topological regression (TR), that models the distance in the response space using the distances in the chemical space. TR essentially builds a parametric model to determine how pairwise distances in the chemical space impact the weights of nearest neighbors in the response space. Observe that, unlike metric learning techniques, TR does not attempt to learn a metric in the response space, nor does it attempt to provide a lower dimensional projection like MDS or GTM. Rather, TR simply estimates the weights of nearest neighbors. In comparison to traditional modeling methods, like RFs and SVMs, which are dependent on a predefined fingerprint, TR can accommodate non-metric systems and does not crucially require coordinates for each instance. As we will show in the subsequent sections, TR can work on the similarities between training molecules, such as those computed from molecular kernels^{54,55}, thereby circumventing the problem of featurization of molecules. Since our primary use-case scenario is QSAR in the lead identification/optimization process, where the contiguity of high-activity molecules plays a significant role, we perform a large-scale comparison on 530 ChEMBL biological targets. We use RF, ChemProp, and Transformer-CNN as baseline models and show that TR matches the performance of Transformer-CNN at a significantly lower computational cost. We also observe, empirically, that TR produces numerically superior predictive performance as compared to the other competing methods. Additionally, both MLKR and TR produce reasonably contiguous areas of high activity, thereby identifying a relatively compact high-activity chemical space.
Results
Model performance comparison on ChEMBL datasets
We apply our TR method with Gaussian kernel neighbor weighting on 530 ChEMBL datasets under both random split and scaffold split. As explained earlier, we use the ECFP4 TC distance as input to TR to predict the activity values. We use 80% of all the instances in each dataset for training and the remainder for testing. For the construction in the section “multivariate construction of topological regression”, where I^{*} ∩ I = ∅, we use 20% of the training instances as anchor points and the remaining 80% of the training set for neighborhood training. We denote this method as TR* in the results. For the approach described in the section “univariate construction of topological regression”, which has no disjointedness requirement, we use 50% of the training instances as anchor points, capped at a maximum of 2000 instances to limit computation time; those results are denoted as TR. Finally, to reduce the sensitivity of results to anchor point selection, and to improve generalization error, different random sets of anchor points were sampled to create an ensemble of TR models (see the section “ensemble topological regression”). We denote this method as Ensemble TR and used t = 15, μ_{k} = 0.6, and \({\sigma }_{k}^{2}=0.2\) for the subsequent results.
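The random-split bookkeeping for the TR variant described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `split_and_pick_anchors`, its parameter names, and the fixed seed are hypothetical, and only the 80% training fraction, the 50% anchor fraction, and the 2000-anchor cap come from the text.

```python
import random

def split_and_pick_anchors(n_samples, train_frac=0.8, anchor_frac=0.5,
                           max_anchors=2000, seed=0):
    """Random train/test split, then sample anchor indices from the training set.

    Hypothetical helper: the fractions and the cap mirror the values quoted
    in the text; everything else is an illustrative assumption.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(train_frac * n_samples))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    # Cap the number of anchors to keep computation manageable.
    n_anchors = min(int(round(anchor_frac * n_train)), max_anchors)
    anchor_idx = rng.sample(train_idx, n_anchors)
    return train_idx, test_idx, anchor_idx

# 677 is the median dataset size reported in the Methods.
train_idx, test_idx, anchor_idx = split_and_pick_anchors(677)
```

An ensemble in the spirit of Ensemble TR would repeat the anchor sampling with different seeds and average the resulting predictions; how the anchor fraction is drawn for each ensemble member is not reproduced here.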
The average Spearman correlation and NRMSE for each method (RF, MLKR with KNN, ChemProp, TCNN, TCNN with augmentation, TR*, TR, and Ensemble TR) on both splitting scenarios are shown in Table 1. Figure 2 compares each method using boxplots showing the distribution of the performances for both random and scaffold splitting. As expected, TR* is unable to achieve performance comparable to the competing methods as the model is being constrained by the disjointedness requirement. When we relax this requirement, we observe that TR’s predictive performance improves considerably and is only numerically inferior to TCNN with augmentation. Finally, when we incorporate an ensemble of TR models, the predictive performance of Ensemble TR is essentially as good as that of TCNN with augmentation. If we invoke the law of parsimony, our conceptually straightforward, and mathematically less complex, topological regression approach appears to be more appealing as compared to competing deep learning techniques.
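For readers who want to reproduce the two reported metrics on their own predictions, the following dependency-free sketch computes a Spearman correlation (with tied values sharing their average rank) and an NRMSE. The normalization used for NRMSE is not stated in this section, so dividing the RMSE by the standard deviation of the observed values is an assumption made here; the function names are illustrative.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))

def nrmse(y_true, y_pred):
    """RMSE divided by the standard deviation of y_true (assumed normalization)."""
    n = len(y_true)
    mean = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    var = sum((t - mean) ** 2 for t in y_true) / n
    return math.sqrt(mse / var)
```

Under this normalization, a model that always predicts the mean of the observed values scores an NRMSE of 1, which gives the metric a convenient reference point.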
Computational comparison on ChEMBL datasets
To illustrate the computational efficiency of TR and Ensemble TR, we report each competing method’s average training time, testing time, and peak RAM consumption across all 530 datasets. These results are shown in Table 2. For a fair comparison, and to provide the best-suited hardware for each model, we trained the deep learning models on systems with GPUs, as the training of deep-learning-based models is better optimized on GPU-based systems. Since the pretrained TCNN model was released and used for fine-tuning, the reported TCNN time does not include pretraining time. From the results, we observe that TR and Ensemble TR result in the fastest training times and significantly less peak RAM consumption. For testing, TR takes more time than MLKR because TR employs RBF kernels, whereas MLKR simply uses 5-NN predictions after transformation; however, TR still results in faster test times than TCNN. These results demonstrate the computational efficiency of TR.
Interpreting TR
Inspection of the regression coefficients in B demonstrates how TR offers more flexibility as compared to standard KNN. Recall that W_{K,m}, K ∈ I^{*}, m ∈ I, quantifies the impact of Y_{K} on Y_{m}. In an ordinary KNN inverse distance weighting scheme, as the distance between the Kth instance and the mth instance increases in the chemical space, W_{K,m} decreases, i.e., \(\frac{\partial }{\partial {d}_{K,m;X}^{2}}{W}_{K,m} \, < \, 0\). However, for TR, \(\frac{\partial }{\partial {d}_{K,m;X}^{2}}{W}_{K,m}={W}_{K,m}{b}_{KK}\). Since W_{K,m} > 0 by construction, \({{\rm{sign}}}\left(\frac{\partial }{\partial {d}_{K,m;X}^{2}}{W}_{K,m}\right)\) depends only upon sign(b_{KK}). Hence, TR can push molecules that are close in the chemical space far apart in the response space. This implies that the prediction generation process for TR can be interpreted in the same vein as that used by KNN, except that, unlike KNN, TR searches for the nearest set of anchor points in the response space.
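As a worked illustration of why the sign is free in TR, suppose (purely for exposition; the intercept a_{K} and this single-term form are assumptions, not the full model specified in the Methods) that the weight is log-linear in the squared chemical distance, \({W}_{K,m}=\exp ({a}_{K}+{b}_{KK}{d}_{K,m;X}^{2})\). Differentiating with respect to the squared distance gives

\[
\frac{\partial {W}_{K,m}}{\partial {d}_{K,m;X}^{2}}={b}_{KK}\exp \left({a}_{K}+{b}_{KK}{d}_{K,m;X}^{2}\right)={W}_{K,m}{b}_{KK},
\]

which is negative when b_{KK} < 0 (weights decay with distance, as in KNN) and positive when b_{KK} > 0 (more distant anchors receive larger weights). By contrast, an inverse distance weight \({W}_{K,m}=1/{d}_{K,m;X}^{2}\) has derivative \(-1/{d}_{K,m;X}^{4}\), which is negative for every pair of distinct molecules.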
We use the chemical space of the drugs targeting Phospholipase D2 (ChEMBL ID: CHEMBL2734) to demonstrate this phenomenon. In Fig. 3 we seek to predict the response corresponding to the molecule CHEMBL492559 (denoted by a red star, pChEMBL = 6.73) in the test set. Based on similarity in the chemical space, standard KNN finds three molecules, CHEMBL492558, CHEMBL492704, and CHEMBL492588, as nearest neighbors, under a 5-fold cross-validation protocol, and makes predictions based on the average of the activities of these three molecules. However, the target molecule is almost at the edge of a high-activity region. Therefore, naive KNN identifies two neighbors, CHEMBL492704 and CHEMBL492588, from the nearby low-activity region (across the cliff) and only one neighbor, CHEMBL492558, from the ideal high-activity region. This happens because the high-activity region in the neighborhood of the target molecule is sparsely populated. In contrast, since TR directly incorporates Y in the learning, it identifies three cross-cliff molecules, CHEMBL494008, CHEMBL4581260, and CHEMBL1254736, that have greater weights in predicting the response associated with the target molecule as compared to CHEMBL492704 and CHEMBL492588. Observe that all three molecules identified by TR as nearest neighbors (CHEMBL494008, CHEMBL4581260, CHEMBL1254736) are in relatively high-activity regions. By presenting structures from diverse scaffolds that exhibit similar activities, TR not only enhances prediction reliability but also aids in the identification of key spatial structural characteristics influencing the activities. The presented structures can be further validated with structural chemical methods such as structural alignment or docking simulations.
To further illustrate this point across the entire dataset, rather than for one particular test molecule, we generated KNN-graphs depicting the predictions of the various similarity-based methods, with the color indicating the activity elicited by the molecules. To do so, each training and test sample was represented as a node, and the predicted neighbors were considered as the connecting edges. These graphs are analogous to NSGs; in fact, just like NSGs, the edges were only included if the similarity was greater than a fixed cutoff TC and if the molecules were predicted as one of the nearest neighbors. Therefore, the number of neighbors and the cutoff TC control the connectedness of the network graphs: more connections are established with a larger number of nearest neighbors and lower cutoff similarities, until the graph is complete. We used 5 nearest neighbors and the mean similarity of the entire target dataset as the cutoff TC for each competing method for all subsequent network graphs, meaning that at most 5 connections per molecule would be established, and only if their similarities were greater than the fixed cutoff TC. An example of these KNN-graphs, depicting the test nearest neighbors of a single CV fold of the dataset CHEMBL2734, is included in Fig. 4. Additional figures depicting the training predictions, testing predictions, and molecules within the most active cluster are included in the supplementary document. Notice that the predicted TR neighbors are similar in response value, leading to more homogeneous activity throughout the clusters, whereas KNN and MLKR both result in clusters containing diverse activity values. To quantify this variability, we included the average within-cluster standard deviations for each method in the figure, where a low within-cluster standard deviation denotes a more homogeneous cluster.
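The graph construction and the homogeneity measure described above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' implementation: neighbors are ranked directly by the supplied similarity matrix (as plain KNN would do, whereas each method in the paper supplies its own predicted neighbors), clusters are taken to be connected components, singleton clusters are skipped, and the population standard deviation is used. All function names and the toy data are hypothetical.

```python
import math

def knn_graph(similarity, k, cutoff):
    """Undirected edges from each node to at most k nearest neighbors
    whose similarity exceeds the cutoff."""
    n = len(similarity)
    edges = set()
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: similarity[i][j], reverse=True)[:k]
        for j in neighbors:
            if similarity[i][j] > cutoff:
                edges.add((min(i, j), max(i, j)))
    return edges

def connected_components(n, edges):
    """Clusters of the graph, each returned as a sorted list of node indices."""
    adjacency = {i: set() for i in range(n)}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, cluster = [start], []
        while stack:
            node = stack.pop()
            cluster.append(node)
            for nb in adjacency[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(cluster))
    return clusters

def mean_within_cluster_sd(clusters, activity):
    """Average population standard deviation of activity over non-singleton clusters."""
    sds = []
    for cluster in clusters:
        if len(cluster) < 2:
            continue  # a singleton carries no spread (assumed convention)
        vals = [activity[i] for i in cluster]
        m = sum(vals) / len(vals)
        sds.append(math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals)))
    return sum(sds) / len(sds) if sds else 0.0

# Toy data: two groups of mutually similar molecules with made-up activities.
S = [[1.0, 0.8, 0.7, 0.1, 0.2],
     [0.8, 1.0, 0.6, 0.2, 0.1],
     [0.7, 0.6, 1.0, 0.1, 0.1],
     [0.1, 0.2, 0.1, 1.0, 0.9],
     [0.2, 0.1, 0.1, 0.9, 1.0]]
activity = [7.0, 7.2, 6.8, 5.0, 5.4]
clusters = connected_components(len(S), knn_graph(S, k=2, cutoff=0.5))
homogeneity = mean_within_cluster_sd(clusters, activity)
```

Lower values of `homogeneity` correspond to clusters whose members elicit similar activity, which is the property the figure uses to compare methods.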
To systematically show this behavior across all 530 datasets, we calculated the average within-cluster standard deviation from the foregoing test prediction KNN-graphs for the competing methods. Figure 5 depicts these results in the form of a line graph across all 530 datasets. Clearly, TR systematically produces lower within-cluster standard deviation compared to KNN and MLKR, resulting in higher levels of homogeneous activity within the clusters. If we envision activity cliffs to be a phenomenon that induces a strong outlier within an otherwise homogeneous cluster, then it stands to reason that by measuring within-cluster homogeneity we can infer the presence of cliffs in that cluster. Higher levels of within-cluster homogeneity essentially smooth out activity cliffs, resulting in more relevant similarity-based predictions and providing practitioners with instance-wise similar molecules for lead optimization.
Since TR results in more homogeneous clusters, the clusters themselves can be more meaningfully mined by chemists for innovative design ideas, potential target leads, and lead optimization pathways. For example, clustering can be performed on the training data, and the most active cluster may contain molecules with specific features that practitioners can use to guide designs and future experiments. The same can be done with the least active cluster to see which molecular features to avoid and provide further insights. Furthermore, the most active training cluster can be mined for lead molecules that have other desired characteristics, such as low toxicity or ease of production. Analogous to NSGs, the training clusters can also be used to visualize lead optimization pathways. Figure 6 depicts a lead optimization pathway in the most active cluster of target protein complex Integrin alpha4/beta7 (CHEMBL278) with (a) the TR KNN-graph obtained from the training data of a single CV fold, (b) the most active cluster depicted as a minimum spanning tree with the minimum spanning path between the most active and least active molecules depicted in red, and (c) 5 example molecules from the lead optimization pathway connecting the most active and least active molecules in the cluster. These pathways can be traversed by chemists to envision what changes resulted in specific behaviors, allowing them to easily analyze the current state of a target dataset and discover potential design ideas (additional figures representing optimization pathways for various target datasets are provided in the Supplementary document). If we envision an untested molecule as an additional node in Fig. 6a, the TR method could directly produce the set of edges radiating from that node (via the model for W) that would enable one to assess how the untested molecule relates to the previously tested molecules.
This could enable greater trust in the predictions as the chemist could easily visualize how the new sample relates to known molecules. Additionally, these graphs fit directly with Laplacian Scores for feature selection, allowing global feature importance to be calculated in a routine fashion. Lastly, when paired with SHAP or MMACE, which are model-agnostic, TR would be able to efficiently generate instance-wise feature importance and unseen counterfactuals, adding further layers to TR’s interpretability.
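The pathway extraction described for Fig. 6 (a minimum spanning tree over a cluster, followed by the tree path between the most and least active molecules) can be sketched with Prim's algorithm and a depth-first search. This is a minimal sketch under stated assumptions: the edge weight is taken to be a distance such as 1 − TC, the cluster is small enough for a dense matrix, and the function names and toy values are hypothetical.

```python
def minimum_spanning_tree(distance):
    """Prim's algorithm on a dense symmetric distance matrix.
    Returns the tree as an adjacency dictionary."""
    n = len(distance)
    in_tree = {0}
    adjacency = {i: [] for i in range(n)}
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        _, i, j = min((distance[i][j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        adjacency[i].append(j)
        adjacency[j].append(i)
        in_tree.add(j)
    return adjacency

def tree_path(adjacency, start, goal):
    """Unique path between two nodes of a tree, found by depth-first search."""
    stack = [(start, [start])]
    visited = {start}
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nb in adjacency[node]:
            if nb not in visited:
                visited.add(nb)
                stack.append((nb, path + [nb]))
    return []

# Toy cluster of four molecules: pairwise distances and made-up activities.
D = [[0.0, 0.2, 0.5, 0.9],
     [0.2, 0.0, 0.3, 0.8],
     [0.5, 0.3, 0.0, 0.4],
     [0.9, 0.8, 0.4, 0.0]]
activity = [8.1, 7.4, 6.9, 5.2]
tree = minimum_spanning_tree(D)
most_active = max(range(len(activity)), key=lambda i: activity[i])
least_active = min(range(len(activity)), key=lambda i: activity[i])
pathway = tree_path(tree, most_active, least_active)
```

Walking along `pathway` visits structurally adjacent molecules in order of decreasing similarity to the starting lead, which is the sense in which such a path can be read as a lead optimization pathway.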
Discussion
In this paper, we have developed a statistical methodology, topological regression (TR), to perform similarity-based regression and demonstrated how it can be used for QSAR modeling. We tested TR on regression tasks with 530 ChEMBL human targets and compared it with a traditional RF, Nearest Neighbors, a metric learning algorithm (MLKR), and two deep learning methods, ChemProp and Transformer-CNN. Empirically, we observed that TR or ensemble TR compared favorably against all competing methods in terms of predictive accuracy on the scaffold split and achieved comparable performance with TCNN on the random splitting at a much lower computational cost. Most importantly, TR provides explainability, visual interpretability, and theoretical justifiability in the form of testable adequacy and optimal model size.
The performances of RF, TCNN, ChemProp, and MLKR are mostly interpreted in a comparative sense. The usual measures employed to assess the performance of these models (NRMSE, MAE) have unbounded support and hence do not offer information about the goodness-of-fit. TR, on the other hand, relies completely on multivariate general linear models: geographically weighted regression when extracting W_{i,j} from the drug response, and standard regression theory when modeling W_{i,j}. For both of these techniques, rigorous tests for goodness-of-fit exist^{56,57}. Since the standard coefficient of determination offers an immediate goodness-of-fit statistic for linear models (or transformed linear models), we compute the training R-squared values (using (7)) for all 530 ChEMBL datasets considered in this paper. The average R-squared turns out to be 0.8396. Evidently, our conceptually straightforward parametric linear model has sufficient power to explain variation in W_{i,j}. Turning to predictive adequacy, we compute the prediction interval for the W’s (using extracted W’s as targets) in the cross-validation set. Once again, the linear model specification allows us to compute the prediction interval analytically. We then compute the coverage of these prediction intervals across all folds. Ideally, we would like to see the coverage of the prediction interval achieve the nominal level. Across all folds of all 530 datasets, the coverage of the 95% prediction interval is 94.3%. Clearly, the model specified in (7) is adequate for prediction purposes as well. These results provide empirical justification for the adequacy of the TR model.
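The two adequacy checks discussed above reduce to simple summaries once fitted values and interval bounds are available. The sketch below shows only those summaries; it does not derive the analytic prediction intervals themselves, and the helper names, the toy weights, and the fixed half-width of 0.025 are all made up for illustration.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def interval_coverage(observed, lower, upper):
    """Fraction of observed values that fall inside their prediction interval."""
    inside = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return inside / len(observed)

# Toy held-out weights, fitted values, and intervals (illustrative numbers only).
observed = [0.10, 0.25, 0.40, 0.55]
fitted = [0.12, 0.22, 0.41, 0.54]
lower = [f - 0.025 for f in fitted]
upper = [f + 0.025 for f in fitted]
fit_quality = r_squared(observed, fitted)
coverage = interval_coverage(observed, lower, upper)
```

A well-calibrated 95% prediction interval should give a coverage close to 0.95 when the check is run over many held-out values, which is the comparison the text makes against the reported 94.3%.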
Given the small to moderate sample size in ChEMBL datasets, model complexity has a significant impact on prediction performance. For ChemProp or TCNN-type deep learners, regularization of network weights, dropout layers, and ablations are standard procedures to control model complexity. However, these measures are ad hoc and their theoretical properties are not well established. For standard KNN (or even in MLKR), the number of neighbors determines the model size. However, we need to fix the number of nearest neighbors a priori and tune that quantity via cross-validation. TR, on the other hand, offers a theoretically appropriate way to choose neighborhood size and hence model complexity. In TR, the anchor points play the role of neighbors and |I^{*}| determines the size of the coefficient matrix B. Consequently, changing |I^{*}| yields sequences of nested models, and hence standard model selection techniques, for instance, AIC or BIC, could be used to identify the appropriate size of I^{*} without resorting to cross-validation. Since AIC and BIC automatically penalize model complexity for a given sample size, we can arrive at an optimal model complexity for TR.
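To make the model selection idea concrete, the sketch below scores a sequence of candidate anchor-set sizes with the Gaussian-likelihood forms of AIC and BIC, written up to an additive constant as n ln(RSS/n) + 2p and n ln(RSS/n) + p ln(n). It is illustrative only: the residual sums of squares are hypothetical numbers, and equating the parameter count p with the number of anchors is a simplifying assumption rather than the exact parameter count of B.

```python
import math

def gaussian_aic(rss, n, p):
    """AIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * p

def gaussian_bic(rss, n, p):
    """BIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + p * math.log(n)

# Hypothetical residual sums of squares for nested models with a growing
# number of anchors (treated here as the parameter count p).
rss_by_anchor_count = {10: 50.0, 20: 40.0, 40: 36.0, 80: 35.0}
n = 200  # hypothetical number of training observations

best_by_aic = min(rss_by_anchor_count,
                  key=lambda p: gaussian_aic(rss_by_anchor_count[p], n, p))
best_by_bic = min(rss_by_anchor_count,
                  key=lambda p: gaussian_bic(rss_by_anchor_count[p], n, p))
```

With these made-up numbers AIC prefers 20 anchors while BIC, whose penalty grows with ln(n), prefers 10; in both cases the fit improvement from adding anchors is weighed against the added complexity without a cross-validation loop.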
Furthermore, TR provides an intuitive explanation of its predictive mechanism based on nearest neighbors in the response space as shown through KNN graphs in the section “interpreting TR”. This explanation could be gleaned from MLKR as well. However, the computational complexity associated with semidefinite programming, required in MLKR, is considerable if the dimension of the input space is high. TR, on the other hand, directly learns the weights associated with neighboring responses, and, by a suitable transformation, estimates the parameters in an unconstrained fashion. This leads to a significant reduction in computational expense as reported in Table 2.
Finally, the visual representation of TR’s predictive mechanism could provide design ideas and allow fast knowledge-based model validation. We anticipate that our framework will have practical value in drug discovery or other QSAR tasks and assist in designing new molecules more effectively.
Methods
Data description and problem motivation
We begin with a description of the datasets that we use to illustrate the comparative performances of the competing models. We offer a brief description of the ChemProp, Transformer-CNN, and MLKR methods and then outline the motivation behind developing the TR framework.
Dataset
Since our focus is on QSAR modeling in the lead optimization phase of drug discovery, we choose to assess the performance of competing models on well-curated datasets with single target bioactivity. For this purpose, we downloaded data from the ChEMBL database^{58} following the extraction protocol of^{59}. This involved selecting only ‘SINGLE PROTEIN’ or ‘PROTEIN COMPLEX’ human targets with confidence scores of 9 and 7, respectively. Additionally, only pChEMBL values, which are comparable bioactivity measures of half-maximal response (IC50, XC50, EC50, etc.) on a negative logarithmic scale, were selected. We refer the readers to^{59} for further data extraction details.
In the cleaning phase, we first removed the datasets that were too small to train ChemProp and Transformer-CNN. Within each dataset, we further removed instances with duplicated SMILES and instances with chemically invalid SMILES strings that could not be converted to RDKit molecules. After cleaning, 530 datasets on various human target bioactivities remained. Sample sizes ranged from 100 to 7890, with a median of 677. The various target activities, referred to as pChEMBL values, were used as the univariate response variable.
Although several representative descriptors and fingerprints (for example: RDKit descriptors, Mordred^{6}, ECFP4^{7}) are available, we mainly focus on the ECFP4 representation for similarity-based predictive models because, empirically, this representation offered the best predictive performance. We relegate the results demonstrating the superior predictive performance of the ECFP4 representation to the Supplementary Material. We calculate folded ECFP4 fingerprints using RDKit’s implementation of the Morgan algorithm with a radius of 2 atoms and a bit size of 1024. Since the output of this representation system is binary, we use the Tanimoto coefficient (TC) as a measure of similarity and 1 − TC as a measure of distance for TR. No standardization steps were required as RDKit was used to extract ECFP4 fingerprints. The ECFP4 fingerprints were used to train the RF model, whereas ChemProp used the SMILES string inputs to internally extract the graph representations and Transformer-CNN directly used the SMILES strings.
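For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below represents each fingerprint as a set of on-bit indices so that it runs without RDKit; the bit indices are made up, and returning a similarity of 1 for two empty fingerprints is a convention assumed here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention for two all-zero fingerprints
    return len(a & b) / union

def tanimoto_distance(bits_a, bits_b):
    """Distance used as input to TR in the text: 1 - TC."""
    return 1.0 - tanimoto(bits_a, bits_b)

# Made-up on-bit indices within a 1024-bit folded fingerprint.
fp_a = {3, 17, 256, 512, 900}
fp_b = {3, 17, 256, 640}
similarity = tanimoto(fp_a, fp_b)          # 3 shared bits / 6 distinct bits
distance = tanimoto_distance(fp_a, fp_b)
```

In practice the on-bit sets would come from a fingerprinting library such as RDKit; that step is omitted to keep the example dependency-free.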
ChemProp
We used ChemProp as a baseline model because of its demonstrated utility in drug discovery. ChemProp is a full-fledged Graph Convolutional Neural Network model that takes 2D representations of molecules as predictors. We employed ChemProp’s Bayesian hyperparameter optimization, which optimizes the hidden size, depth, dropout, and the number of feed-forward layers, and trained the model for 100 epochs for all datasets.
Transformer-CNN
We also used Transformer-CNN (TCNN) as a baseline model, as it is self-proclaimed to be a Swiss-army knife for QSAR modeling. TCNN is pretrained on over 17 million pairs of strings for the task of SMILES canonicalization. The output of the transformer encoder is then used to generate model-acquired FPs, which are used for downstream prediction through task-trained TextCNN and convolutional highway layers. In addition, the architecture enables data augmentation by ensembling the results from multiple non-canonical SMILES for each sample. Lastly, the architecture contains practically no hyperparameters and enables learning rate scheduling and early stopping, limiting the need for hyperparameter optimization. This mixture of large-scale pretraining, sample augmentation, and a string-size-agnostic architecture results in a powerful prediction model. We followed the TCNN instructions and trained the model on the SMILES strings, with and without augmentation, for at most 35 epochs, as learning rate scheduling and early stopping were employed.
Metric Learning Kernel Regression
The purpose of metric learning is to find a distance metric for a specific task through supervised learning. The learned metric can subsequently be used in KNN regression or kernel regression for generating predictions and visualizations. For regression tasks, MLKR^{51} finds the Mahalanobis metric that minimizes the cumulative leave-one-out CV error \(\mathcal{L}=\sum_{i}(Y_{i}-\hat{Y}_{i})^{2}\), where Y_{i} is the numeric response variable of the ith training sample and \(\hat{Y}_{i}=\frac{\sum_{j\ne i}Y_{j}W_{ij}}{\sum_{j\ne i}W_{ij}}\), with W_{ij} being the weights associated with Gaussian kernels. In particular, the Mahalanobis matrix M can be decomposed as M = L^{T}L, where L is the transformation matrix used to obtain the learned metric. After L is learned from the data, the original coordinate system of the predictor space X is transformed into the new coordinate system given by LX. Thus, MLKR learns a global space transformation, which can be used to calculate the distance in the response space. Then KNN regression or similarity-based kernel regression can be performed to provide predictions and interpretation.
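The leave-one-out objective above can be sketched in a few lines. This is only an illustration of the loss for a fixed transformation L, not the MLKR optimizer itself; it assumes the Gaussian kernel takes the unit-bandwidth form W_ij = exp(−‖Lx_i − Lx_j‖²), and the function names are hypothetical.

```python
import math


def transform(L, x):
    """Apply the linear map L (list of rows) to the feature vector x."""
    return [sum(l_rc * x_c for l_rc, x_c in zip(row, x)) for row in L]


def mlkr_loo_loss(X, y, L):
    """Cumulative leave-one-out squared error of Gaussian-kernel regression.

    Assumption: W_ij = exp(-||L x_i - L x_j||^2) (unit kernel bandwidth).
    """
    Z = [transform(L, x) for x in X]
    loss = 0.0
    for i in range(len(Z)):
        num = den = 0.0
        for j in range(len(Z)):
            if j == i:
                continue  # leave the ith sample out of its own prediction
            d2 = sum((a - b) ** 2 for a, b in zip(Z[i], Z[j]))
            w = math.exp(-d2)
            num += y[j] * w
            den += w
        # Note: den can underflow to 0 if all transformed points are far apart.
        y_hat = num / den
        loss += (y[i] - y_hat) ** 2
    return loss


X = [[0.0], [1.0], [2.0]]
y = [0.0, 3.0, 6.0]
print(mlkr_loo_loss(X, y, [[0.0]]))  # all weights equal: plain leave-one-out mean
print(mlkr_loo_loss(X, y, [[2.0]]))  # stretched space: nearer neighbors dominate
```

A metric-learning routine would search over L to make this loss small; here L is simply supplied by hand to show how the loss responds to the transformation.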
However, in order to compute distances, we first need to characterize the molecules in a fashion such that distances can be computed. As mentioned, we focus on ECFP4 fingerprints, which thus form the initial coordinate system supplied to MLKR to learn the transformation and produce a new coordinate system such that the predictor space is approximately isometric to the response space. Figure 7 illustrates this phenomenon. In the left panel, we computed the pairwise Tanimoto distances among all the molecules targeting Mitogen-Activated Protein Kinase 12 (ChEMBL ID: CHEMBL1908389) using ECFP4 features and projected them into 2D MDS space. The intensity of the pixels indicates the response each molecule elicited. In the right panel, we used the distance metric learned from MLKR to generate the 2D coordinates. Observe how the two molecules, CHEMBL3727733 and CHEMBL3729567, which appeared to be neighbors in the chemical space, were pushed apart after the MLKR transformation. Additionally, we observe a smoother spatial trend in the image produced after the MLKR transformation, which allows us to use KNN or kernel regression (with purely distance-dependent kernel elements) for prediction purposes.
Comparison procedure
To compare model performances, we design two types of data splits: (a) random split and (b) scaffold split. The random split uses 5-fold cross-validation with 80% training and 20% testing in each fold, and the scores of the five folds are averaged to give the final score. In drug discovery, new structures are often proposed by editing the scaffold of a known good candidate. Predictions are more likely to fail across scaffolds due to greater chemical dissimilarities. The scaffold split ensures that the training and test samples belong to different Murcko scaffolds, mimicking scenarios in which predictions for a new structure with a different scaffold are sought. Since full-blown cross-validation is not feasible with scaffold splits, we use a single holdout set comprising approximately 20% of the data points for each ChEMBL dataset. We use Spearman ρ and Normalized Root Mean Square Error (NRMSE) to compare the candidate models’ predictive capabilities. In the section “Results”, we compare these two metrics obtained from ChemProp with those obtained from MLKR-KNN under both splitting scenarios for all 530 ChEMBL datasets and observe that MLKR-KNN offers numerically superior performance compared with ChemProp, even though MLKR is not directly a regression technique.
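The evaluation ingredients described above can be sketched as follows. Two points are assumptions rather than details stated in this section: NRMSE is taken here as the RMSE normalized by the spread of the observed responses, i.e. sqrt(Σ(y − ŷ)² / Σ(y − ȳ)²), and the scaffold holdout is built by assigning whole scaffold groups greedily (largest first) to the training set until roughly 80% is reached. The scaffold keys are assumed to be precomputed strings (for example Murcko scaffold SMILES), and all function names are hypothetical.

```python
import math
from collections import defaultdict


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    a, b = average_ranks(y_true), average_ranks(y_pred)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def nrmse(y_true, y_pred):
    """Assumed normalization: error relative to predicting the mean response."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return math.sqrt(sse / sst)


def scaffold_holdout(scaffold_keys, test_fraction=0.2):
    """Split indices so that no scaffold appears in both train and test."""
    groups = defaultdict(list)
    for idx, key in enumerate(scaffold_keys):
        groups[key].append(idx)
    train, test = [], []
    train_cap = (1 - test_fraction) * len(scaffold_keys)
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        else:
            test.extend(members)
    return train, test
```

Under this normalization, an NRMSE of 1 corresponds to a model that is no better than predicting the mean response, which makes scores comparable across datasets with different activity ranges.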
This empirical observation motivates us to develop TR based on a distance formulation and thereby make the MLKR-type strategy amenable to statistical inference. We observe that in the MLKR procedure, considerable effort is undertaken to ensure that the transformed space is indeed a metric space. However, for prediction, a weighted averaging of the responses from nearest neighbors is performed. Notice that symmetry and nonnegativity are the only two conditions required for those weights (W_{ij}). Therefore, we contend that we can work directly with the W_{ij}s instead. We then proceed to show that, under a suitable distributional specification, an explicit estimator of E(W_{ij}) can be obtained. Since the estimand is an expectation, standard statistical theory (delta method, residual bootstrap) can be brought to bear to assess the statistical properties of this estimator. To the best of our knowledge, such statistical assessment of the estimates produced by vanilla MLKR is not available.
Multivariate construction of topological regression
Topological regression (TR) is a similarity-based regression framework that connects the distances in the chemical space with nonnegative weights appearing in nearest-neighbor regression defined on the response space. The model is illustrated in Fig. 8. More specifically, we specify a multivariate regression model for the weights W_{ij} and derive a closed-form expression for the estimator of E(W_{ij}) under an inverse distance-weighting scheme. Subsequently, we also discuss an approximate estimator of the foregoing quantity when the weighting is done using a Gaussian kernel.
Let \(\mathcal{D}\) represent the set of all training points. First, we partition \(\mathcal{D}\) into a set of K anchor points and \(N=|\mathcal{D}|-K\) neighborhood-training points. Let \(I^{*}=\{i_{1}^{*},i_{2}^{*},\ldots,i_{K}^{*}\}\) be the set of indices associated with the anchor points and I = {i_{1}, i_{2}, . . . , i_{N}} be the indices associated with the neighborhood-training points, with I ^{*} ∩ I = ϕ and ∣I ^{*}∣ < ∣I∣. Let \(Y_{i_{j}}, i_{j}\in I\) be the response associated with the i_{j}th instance in the set I. Our goal is to express \(Y_{i_{j}}\) as a linear combination of the responses \(Y_{i_{j}^{*}}\) belonging to the set I ^{*}, i.e.
where \(W_{i_{l}^{*}i_{j}}\) is a nonnegative weight that determines the contribution of the response associated with the lth point in I ^{*} towards the response associated with the jth point in I. Such nonnegative weights are fairly common in distance-weighted regression. For instance, in geographically weighted spatial regression models, the weights are often specified in terms of Gaussian kernels, i.e., \(W_{i_{l}^{*}i_{j}}=\exp(-\beta d_{i_{l}^{*},i_{j}}^{2})\), with d^{2}(. ) being a squared Euclidean distance and β > 0 controlling the smoothness of the random field.
Neighborhood training model
Customarily, the weights are expressed as a deterministic function of the distances in the predictor space. In standard KNN regression, we assume that distance in the predictor space is proportional to the distance in the response space. In metric learning, a transformation of the predictor space is learned such that there is an approximate isometry between the transformed predictor space and the response space. In TR, we instead write a formal statistical model to connect \({W}_{{i}_{j}^{*}{i}_{j}}\) with the squared Euclidean distances in the predictor space in the following fashion:
We define the weights
and since we have I ^{*} and I to be disjoint and the responses could be assumed to be absolutely continuous, we can define
with the entries in \(\tilde{W}\), i.e., \({(\tilde{W})}_{{i}_{j}^{*}{i}_{j}}\) being real quantities. Define the squared Euclidean distance matrix in the predictor space as
We define a simple multivariate linear regression model connecting \(\tilde{W}\) with D_{X}. Consider the mth row of \(\tilde{W}\). Observe that, this row consists of the weights used to express the mth response in I using all the responses in I ^{*}. We envision this row to be a set of repeated measurements taken on the mth point in I from the vantage points in I ^{*}. Thus, denoting the K elements in the mth row of \(\tilde{W}\) by \({\tilde{W}}_{.,m}=({\tilde{W}}_{1,m},{\tilde{W}}_{2,m},\cdots \,,{\tilde{W}}_{K,m})\), the corresponding row of predictors in D_{X} by \({D}_{.,m;X}=({d}_{1,m;X}^{2},{d}_{2,m;X}^{2},\cdots \,,{d}_{K,m;X}^{2})\), and the matrix of regression coefficients by
we arrive at the following regression model
with \(\boldsymbol{\epsilon}=(\epsilon_{1},\epsilon_{2},\cdots,\epsilon_{K})\sim\mathcal{N}_{K}(0,\Sigma)\). Now, assuming mutual independence across the N rows of \(\tilde{W}\) and since N > K (by construction), we can obtain the MLEs of B and Σ. Let \(\hat{B}\) and \(\hat{\Sigma}\) denote their respective estimates. Then, for a new query point, we can compute \((d_{1,query;X}^{2},d_{2,query;X}^{2},\cdots,d_{K,query;X}^{2})\) and, using \(\hat{B}\), obtain the predictions \((\tilde{W}_{1,query},\tilde{W}_{2,query},\cdots,\tilde{W}_{K,query})\). However, observe that (1) requires (W_{1,query}, W_{2,query}, ⋯ , W_{K,query}) to generate a prediction for the query point, and simply exponentiating the output, \(\hat{\tilde{W}}\), of (6) will yield a biased estimate of W because \(E(W)=E(e^{\tilde{W}})\ne e^{E(\tilde{W})}\) due to Jensen’s inequality. Therefore, we use the properties of the multivariate lognormal distribution to improve the estimate of W in the following way:
Clearly \(\boldsymbol{W}_{.,m}=e^{\tilde{\boldsymbol{W}}_{.,m}}\), where the exponent is taken coordinate-wise, with \(\tilde{\boldsymbol{W}}_{.,m}\sim\mathcal{N}_{K}(\boldsymbol{\mu}_{.,m},\Sigma)\) and \(\mu_{j,m}=b_{0j}+b_{1j}d_{1,m}^{2}+b_{2j}d_{2,m}^{2}+\cdots+b_{Kj}d_{K,m}^{2}\). Then the usual relationship between the expectation of a lognormal variate and the moment-generating function of its normal counterpart can be used to show that \(E(W_{j,m})=E(e^{\tilde{W}_{j,m}})=\exp(\mu_{j,m}+\Sigma_{jj}/2)\). Additionally, it is fairly straightforward to show that the covariance matrix of W_{.,m} is given by Var(W_{.,m}) = diag(E(W_{.,m}))(e^{Σ} − 11^{T})diag(E(W_{.,m})). Consequently, an estimator of W_{j,query} is given by \(\hat{W}_{j,m^{*}}=\hat{E}(W_{j,query})=\exp(\hat{\mu}_{j,query}+\hat{\Sigma}_{jj}/2)\) and the corresponding estimator of the covariance matrix is \(\widehat{\mathrm{Var}}(\boldsymbol{W}_{.,query})=\mathrm{diag}(\hat{\boldsymbol{W}}_{.,query})(e^{\hat{\Sigma}}-\boldsymbol{1}\boldsymbol{1}^{T})\mathrm{diag}(\hat{\boldsymbol{W}}_{.,query})\). The estimated covariance matrix is positive definite as long as \(\hat{\Sigma}\) is positive definite. Furthermore, since \(\hat{B}\) is asymptotically normally distributed, we can obtain a conservative estimate of the pointwise prediction interval of \(\boldsymbol{W}_{.,m^{*}}\) using the parametric bootstrap technique outlined in ref. ^{60}.
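The lognormal correction described above can be illustrated numerically. The sketch below evaluates E(W_j) = exp(μ_j + Σ_jj/2) and the entrywise covariance E(W_i)E(W_j)(exp(Σ_ij) − 1) for given values of μ and Σ; the numbers are made up for illustration and the function names are hypothetical.

```python
import math


def expected_lognormal_weights(mu, sigma):
    """E(W_j) = exp(mu_j + Sigma_jj / 2) when log W ~ N_K(mu, Sigma)."""
    return [math.exp(m + sigma[j][j] / 2.0) for j, m in enumerate(mu)]


def lognormal_weight_covariance(mu, sigma):
    """Cov(W_i, W_j) = E(W_i) E(W_j) (exp(Sigma_ij) - 1), entry by entry."""
    ew = expected_lognormal_weights(mu, sigma)
    k = len(mu)
    return [
        [ew[i] * ew[j] * (math.exp(sigma[i][j]) - 1.0) for j in range(k)]
        for i in range(k)
    ]


# Illustrative (made-up) predicted log-weight means and error covariance.
mu_query = [0.0, -1.0]
sigma_hat = [[1.0, 0.3], [0.3, 0.5]]

naive = [math.exp(m) for m in mu_query]  # plain exponentiation, biased low
corrected = expected_lognormal_weights(mu_query, sigma_hat)
print(naive, corrected)
print(lognormal_weight_covariance(mu_query, sigma_hat))
```

The corrected weights always exceed the naive ones by the factor exp(Σ_jj/2), which is the size of the bias that Jensen’s inequality introduces.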
Extraction of W
In the above discussion, we have used \(\log(\boldsymbol{W})\) as the target of the multivariate regression in (6). However, the W are not observed; they are parameters that appear in the distance-weighted regression in the response space (1). Hence, we first need to extract these weights. A naive option is to set the weights \(W_{i_{j}^{*},i_{j}}\) as the inverse of the squared Euclidean distance in the response space between points in I and I ^{*}, i.e., \(W_{i_{j}^{*},i_{j}}=1/d_{i_{j}^{*},i_{j};Y}^{2},\ i_{j}^{*}\in I^{*},\ i_{j}\in I\). In this configuration, we can simply supply \(1/d_{i_{j}^{*},i_{j};Y}^{2}\) in the LHS of (6). We still recover a closed-form expression for \(\hat{E}(W)\) because the lognormal distribution is closed under the inverse transformation.
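A minimal sketch of this naive extraction step, assuming a univariate response so that the squared Euclidean distance in the response space reduces to (Y_anchor − Y_neighbor)². The function name is hypothetical, and the row/column orientation (one row per neighborhood-training point, one column per anchor) follows the description of the rows of \(\tilde{W}\) given earlier.

```python
import math


def log_inverse_distance_weights(y_anchor, y_neighbor):
    """Regression targets log W = log(1 / d^2) = -log(d^2).

    Returns an N x K matrix (list of rows): row m holds the log-weights that
    relate the mth neighborhood-training response to each of the K anchors.
    """
    rows = []
    for y_m in y_neighbor:
        row = []
        for y_a in y_anchor:
            d2 = (y_a - y_m) ** 2
            if d2 == 0.0:
                # Identical responses would give an infinite weight; the text
                # treats responses as continuous, so ties are not expected.
                raise ValueError("anchor and neighbor responses coincide")
            row.append(-math.log(d2))
        rows.append(row)
    return rows


anchors = [5.0, 7.0]        # responses of the K anchor points
neighbors = [6.0, 9.0]      # responses of the N neighborhood-training points
print(log_inverse_distance_weights(anchors, neighbors))
```

Because log(1/d²) = −log(d²), modeling the log-weights as normal is equivalent to modeling log(d²) as normal with the sign of the mean flipped, which is why the closed-form lognormal expectation still applies.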
Univariate construction of topological regression
The requirement I ^{*} ∩ I = ϕ in the previous section induces a delicate trade-off. If we increase the number of anchor points, the neighborhood training model becomes overparametrized. If, on the other hand, we decrease the number of anchor points, there may not be enough anchor points to reliably estimate the response, especially in isolated regions of high activity.
One possible solution is to bring the distances among the anchor points themselves into the neighborhood training model. However, that conflicts with the above theoretical development because each point in I ^{*} can be observed from the remaining K − 1 points in I ^{*}, and hence we do not have a K × K covariance matrix. Additionally, because of the symmetry constraint (W_{i,j} = W_{j,i}), we can only work with the triangular matrix of weights associated with points within I ^{*}. Thus, if we forego the above multivariate log-linear regression construction (6) and view TR purely as a least-squares optimization problem, we can use K(N − K) + K(K − 1)/2 equations to obtain the least-squares estimates of the coefficient matrix B. In this scenario, the first K(N − K) equations are obtained by varying m from 1, 2, . . . N in (6). The remaining K(K − 1)/2 equations connect the \(\tilde{W}_{i_{j}^{*},i_{j^{\prime}}^{*}}\) with the instances in I ^{*}. More specifically, dropping the subscript i and simply denoting the K elements in I ^{*} as {1^{*}, 2^{*}, 3^{*}, ⋯ , K^{*}}, we have the following system of equations:
\(\hat{B}\) can be obtained by minimizing the error sum of squares. Additionally, if we assume the error terms are iid \(\mathcal{N}(0,\sigma^{2})\), we can easily obtain \(\hat{\sigma}^{2}\) from the residuals. Now, when a query instance comes in with known chemical features, we can compute \(\boldsymbol{d}_{.,query}^{2}=[d_{1^{*},query}^{2},d_{2^{*},query}^{2},\cdots,d_{K^{*},query}^{2}]\) in the chemical space and obtain \(\hat{\tilde{\boldsymbol{W}}}_{.,query}=\boldsymbol{d}_{.,query}^{2}\hat{B}\). Then an estimator of the neighborhood weights for the query point is given by \(\hat{\boldsymbol{W}}_{.,query}=\exp(\hat{\tilde{\boldsymbol{W}}}_{.,query}+\hat{\sigma}^{2}/2)\).
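To make the fit-then-correct step concrete, the sketch below works through the simplest special case of a single anchor (K = 1), where the least-squares problem reduces to a simple linear regression of the log-weight on one squared distance plus an intercept. This is an illustrative reduction, not the full multi-anchor solver; the function names are hypothetical, and the residual variance uses n − 2 degrees of freedom (an assumption; a maximum-likelihood estimate would divide by n).

```python
import math


def fit_log_weight_model(d2, log_w):
    """Simple least squares of log-weights on squared chemical distances.

    Returns (intercept, slope, sigma2_hat) for the single-anchor case.
    """
    n = len(d2)
    mx, my = sum(d2) / n, sum(log_w) / n
    sxx = sum((x - mx) ** 2 for x in d2)
    sxy = sum((x - mx) * (y - my) for x, y in zip(d2, log_w))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(d2, log_w)]
    sigma2_hat = sum(r * r for r in residuals) / (n - 2)  # assumed n - 2 dof
    return intercept, slope, sigma2_hat


def predict_weight(d2_query, intercept, slope, sigma2_hat):
    """Bias-corrected weight: exp(predicted log-weight + sigma2_hat / 2)."""
    return math.exp(intercept + slope * d2_query + sigma2_hat / 2.0)


# Made-up training pairs: squared distance to the anchor, observed log-weight.
d2_train = [0.1, 0.4, 0.5, 0.8, 0.9]
log_w_train = [1.1, 0.3, 0.2, -0.7, -0.8]
b0, b1, s2 = fit_log_weight_model(d2_train, log_w_train)
print(predict_weight(0.3, b0, b1, s2))
```

With K anchors the same idea applies column by column, with each log-weight regressed on all K squared distances rather than one.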
Additionally, since the W’s in this case are univariate, we have the flexibility to write \(W_{i_{j}^{*},\neg i_{j}^{*}}=\exp(-\beta d_{i_{j}^{*},\neg i_{j}^{*};Y}^{2})\) with β > 0 and replace the W’s in the LHS of (7) by \(\log(d_{i_{j}^{*},\neg i_{j}^{*};Y}^{2})\). Each d^{2} then has a univariate lognormal distribution. To obtain an estimator of \(E(W_{i_{j}^{*},\neg i_{j}^{*}})\), we first observe that
\(E(W_{i_{j}^{*},\neg i_{j}^{*}})=E\left[\exp(-\beta d_{i_{j}^{*},\neg i_{j}^{*};Y}^{2})\right]\qquad (8)\)
is the Laplace transform of the lognormal distribution. Although there is no closed-form solution of (8), ref. ^{61} derives a sharp approximation of (8) for β > 0 using Lambert’s W function. We therefore propose the following Monte Carlo procedure to estimate \(E(W_{i_{j}^{*},\neg i_{j}^{*}})\):

a. Fit a standard geographically weighted regression with Gaussian kernel in the response space and extract \(\hat{\beta}\)^{62}.
b. Fit the model (6) with \(\log(d_{i_{j}^{*},\neg i_{j}^{*};Y}^{2})\) in the LHS and obtain \(\hat{\mu}_{i_{j}^{*},\neg i_{j}^{*}}\) and \(\hat{\sigma}^{2}\).
c. Draw R iid replicates of \(d_{i_{j}^{*},\neg i_{j}^{*};Y}^{2}\) from lognormal\((0,\hat{\sigma}^{2})\).
d. For each realization, compute \(\exp(-\hat{\beta}\,d_{i_{j}^{*},\neg i_{j}^{*};Y}^{2(r)}e^{\hat{\mu}_{i_{j}^{*},\neg i_{j}^{*}}})\).
e. Then the Monte Carlo estimator of the LHS of (8) is given by \(\hat{E}(W_{i_{j}^{*},\neg i_{j}^{*}})=\frac{1}{R}\sum_{r=1}^{R}\exp(-\hat{\beta}\,d_{i_{j}^{*},\neg i_{j}^{*};Y}^{2(r)}e^{\hat{\mu}_{i_{j}^{*},\neg i_{j}^{*}}})\).
While this Monte Carlo approximation works well when β and σ are small, it fails to explore the tail region as β → ∞. Hence, if \(\hat{\beta }\) is large, an efficient importance sampler, derived in^{61}, should be used.
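Steps c to e of the procedure above can be sketched as follows, taking \(\hat{\mu}\), \(\hat{\sigma}^{2}\), and \(\hat{\beta}\) as already estimated in steps a and b. This is plain Monte Carlo, not the importance sampler of ref. ^{61}; the function name, the default number of draws, and the seed argument are assumptions made for illustration.

```python
import math
import random


def mc_gaussian_kernel_weight(mu_hat, sigma2_hat, beta_hat, n_draws=10000, seed=0):
    """Monte Carlo estimate of E[exp(-beta * d^2)] with d^2 ~ lognormal(mu, sigma^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2_hat)
    scale = math.exp(mu_hat)
    total = 0.0
    for _ in range(n_draws):
        z = rng.lognormvariate(0.0, sigma)        # step c: draw from lognormal(0, sigma^2)
        total += math.exp(-beta_hat * z * scale)  # step d: rescale by e^mu, apply kernel
    return total / n_draws                        # step e: average over the R draws


# Illustrative (made-up) parameter values.
estimate = mc_gaussian_kernel_weight(mu_hat=0.0, sigma2_hat=1.0, beta_hat=1.0)
plug_in = math.exp(-1.0 * math.exp(0.0 + 1.0 / 2.0))  # exp(-beta * E[d^2])
print(estimate, plug_in)
```

Comparing the estimate with the plug-in value exp(−β E[d²]) shows how far the expectation of the kernel can sit from the kernel of the expectation when σ² is not small.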
Ensemble topological regression
The above construction in (7) allows relaxing the disjointedness requirement I ^{*} ∩ I = ϕ to include the anchor points as neighborhood training points and allows modeling the \(\tilde{W}\)’s through least-squares optimization. However, by construction, ∣I ^{*}∣ < ∣I∣, meaning that not all training points can be included as anchor points; otherwise, the least-squares model becomes overparameterized and overfits the training data, leading to poor generalization performance. Since a subset of the available training set must be selected as anchor points, the results may be sensitive to the selected anchor points. To average out the effect of anchor point selection, one can randomly sample multiple sets of anchor points and ensemble the results from each set. To this end, we introduce Ensemble TR, which samples t sets of anchor points independently and averages the predictions from the resulting t TR models. The percentage of training instances to include as anchor instances can be viewed as a hyperparameter, so t percentages are sampled from a Gaussian distribution \(\mathcal{N}(\mu_{k},\sigma_{k}^{2})\), with μ_{k} being the mean percentage of training instances to include as anchor instances and \(\sigma_{k}^{2}\) being the requested variance of the t percentages. To ensure that the percentage values are valid and to prevent over- or underfitting, the sampled percentages are clipped to the range [30%, 90%]. This leaves the user with three parameters: the number of models (t), the mean percentage of training samples to include as anchor instances (μ_{k}), and the variance of the percentages \((\sigma_{k}^{2})\). Ensemble TR maintains its computational efficiency because \(D_{X}^{N\times N}\) can be calculated once up front, and the t matrices \(D_{X}^{N\times K}\) can be easily sampled from \(D_{X}^{N\times N}\). This means that once distances are calculated, only t multitask linear regression models must be solved and RBF kernels applied to their outputs to generate predictions, leading to fast run times.
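The anchor-sampling and averaging logic of Ensemble TR can be sketched as follows. Only the sampling of anchor percentages, the clipping to [30%, 90%], the drawing of anchor index sets, and the final averaging are shown; fitting each TR model is left out. The function names, the seed argument, and the rounding of the anchor count are assumptions made for illustration.

```python
import random


def sample_anchor_sets(n_train, t, mean_pct, var_pct, seed=0, low=0.30, high=0.90):
    """Draw t anchor index sets whose sizes follow clipped Gaussian percentages."""
    rng = random.Random(seed)
    std = var_pct ** 0.5
    anchor_sets = []
    for _ in range(t):
        pct = rng.gauss(mean_pct, std)
        pct = min(high, max(low, pct))          # clip to the valid range
        k = max(1, int(round(pct * n_train)))   # assumed rounding rule
        anchor_sets.append(sorted(rng.sample(range(n_train), k)))
    return anchor_sets


def ensemble_average(predictions_per_model):
    """Average the predictions of the t models, query point by query point."""
    t = len(predictions_per_model)
    return [sum(col) / t for col in zip(*predictions_per_model)]


anchor_sets = sample_anchor_sets(n_train=200, t=4, mean_pct=0.6, var_pct=0.04)
print([len(s) for s in anchor_sets])

# Each anchor set would select K columns of the precomputed N x N distance
# matrix; here two made-up prediction vectors stand in for fitted TR models.
print(ensemble_average([[6.1, 7.4], [5.9, 7.0]]))
```

Because every anchor set indexes into the same precomputed distance matrix, the distance computation is not repeated for each of the t models.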
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The ChEMBL datasets used in this study are available in the ChEMBL database (https://www.ebi.ac.uk/chembl/)^{58}. The code to extract the 530 ChEMBL datasets is provided in the code repository. Source data are provided with this paper.
Code availability
Sample data files and Python code to regenerate the TR figures and results are openly provided at https://github.com/Ribosome25/TopoReg_QSAR, which is archived in Zenodo under the identifier https://doi.org/10.5281/zenodo.10929477^{63}.
References
1. Neves, B. J. et al. QSAR-based virtual screening: advances and applications in drug discovery. Front. Pharmacol. 9, 1275 (2018).
2. Kwon, S., Bae, H., Jo, J. & Yoon, S. Comprehensive ensemble in QSAR prediction for drug discovery. BMC Bioinformatics 20, 1–12 (2019).
3. Cherkasov, A. et al. QSAR modeling: where have you been? Where are you going to? J. Medicinal Chem. 57, 4977–5010 (2014).
4. Grisoni, F., Ballabio, D., Todeschini, R. & Consonni, V. Molecular descriptors for structure–activity applications: a hands-on approach. Methods Mol. Biol. 1800, 3–53 (2018).
5. Yap, C. W. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 32, 1466–1474 (2011).
6. Moriwaki, H., Tian, Y.-S., Kawashita, N. & Takagi, T. Mordred: a molecular descriptor calculator. J. Cheminform. 10, 1–14 (2018).
7. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inform. Modeling 50, 742–754 (2010).
8. Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inform. Modeling 59, 3370–3388 (2019).
9. Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702.e13 (2020).
10. Liu, G. et al. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat. Chem. Biol. 19, 1342–1350 (2023).
11. Isert, C., Kromann, J. C., Stiefl, N., Schneider, G. & Lewis, R. A. Machine learning for fast, quantum mechanics-based approximation of drug lipophilicity. ACS Omega 8, 2046–2056 (2023).
12. Wang, S., Guo, Y., Wang, Y., Sun, H. & Huang, J. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction. In: Proc. 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 429–436 (IEEE, 2019).
13. Karpov, P., Godin, G. & Tetko, I. V. Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J. Cheminform. 12, 1–12 (2020).
14. Doshi-Velez, F. & Kim, B. Towards a rigorous science of interpretable machine learning. Preprint at https://arxiv.org/abs/1702.08608 (2017).
15. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In: International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 3319–3328 (PMLR, 2017).
16. Nembrini, S., König, I. R. & Wright, M. N. The revival of the Gini importance? Bioinformatics 34, 3711–3718 (2018).
17. Altmann, A., Toloşi, L., Sander, O. & Lengauer, T. Permutation importance: a corrected feature importance measure. Bioinformatics 26, 1340–1347 (2010).
18. Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M. SmoothGrad: removing noise by adding noise. Preprint at https://arxiv.org/abs/1706.03825 (2017).
19. Koh, P. W. & Liang, P. Understanding black-box predictions via influence functions. In: International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 1885–1894 (PMLR, 2017).
20. Ribeiro, M. T., Singh, S. & Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In: Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ed Krishnapuram, B.) 1135–1144 (ACM, Digital Library, 2016).
21. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inform. Process. Syst. 30 (2017).
22. Rodríguez-Pérez, R. & Bajorath, J. Interpretation of compound activity predictions from complex machine learning models using local approximations and Shapley values. J. Medicinal Chem. 63, 8761–8777 (2019).
23. Mothilal, R. K., Sharma, A. & Tan, C. Explaining machine learning classifiers through diverse counterfactual explanations. In: Proc. 2020 Conference on Fairness, Accountability, and Transparency 607–617 (2020).
24. Wellawatte, G. P., Seshadri, A. & White, A. D. Model agnostic generation of counterfactual explanations for molecules. Chem. Sci. 13, 3697–3705 (2022).
25. Marchese Robinson, R. L., Palczewska, A., Palczewski, J. & Kidley, N. Comparison of the predictive performance and interpretability of random forest and linear models on benchmark data sets. J. Chem. Inform. Modeling 57, 1773–1792 (2017).
26. Polishchuk, P. Interpretation of quantitative structure–activity relationship models: past, present, and future. J. Chem. Inform. Modeling 57, 2618–2639 (2017).
27. Balfer, J. & Bajorath, J. Visualization and interpretation of support vector machine activity predictions. J. Chem. Inform. Modeling 55, 1136–1147 (2015).
28. Sheridan, R. P. Interpretation of QSAR models by coloring atoms according to changes in predicted activity: how robust is it? J. Chem. Inform. Modeling 59, 1324–1337 (2019).
29. Shoombuatong, W. et al. Towards the Revival of Interpretable QSAR Models. Advances in QSAR Modeling: Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences 3–55 (Springer, 2017).
30. Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Medicinal Chem. 63, 8749–8760 (2019).
31. Baldassarre, F. & Azizpour, H. Explainability techniques for graph convolutional networks. Preprint at https://arxiv.org/abs/1905.13686 (2019).
32. Weber, J. K. et al. Simplified, interpretable graph convolutional neural networks for small molecule activity prediction. J. Comput.-Aided Mol. Des. 36, 391–404 (2021).
33. Ding, H., Takigawa, I., Mamitsuka, H. & Zhu, S. Similarity-based machine learning methods for predicting drug–target interactions: a brief review. Briefings Bioinform. 15, 734–747 (2014).
34. Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W. & Kanehisa, M. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24, 232–240 (2008).
35. Gajewicz-Skretna, A., Furuhama, A., Yamamoto, H. & Suzuki, N. Generating accurate in silico predictions of acute aquatic toxicity for a range of organic chemicals: towards similarity-based machine learning methods. Chemosphere 280, 130681 (2021).
36. Jacob, L. & Vert, J.-P. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 24, 2149–2156 (2008).
37. Patlewicz, G., Helman, G., Pradeep, P. & Shah, I. Navigating through the minefield of read-across tools: a review of in silico tools for grouping. Comput. Toxicol. 3, 1–18 (2017).
38. Wawer, M., Peltason, L., Weskamp, N., Teckentrup, A. & Bajorath, J. Structure activity relationship anatomy by network-like similarity graphs and local structure activity relationship indices. J. Medicinal Chem. 51, 6075–6084 (2008).
39. Keiser, M. J. et al. Relating protein pharmacology by ligand chemistry. Nat. Biotechnol. 25, 197–206 (2007).
40. Lo, Y.-C. et al. Large-scale chemical similarity networks for target profiling of compounds identified in cell-based chemical screens. PLoS Comput. Biol. 11, 1004153 (2015).
41. Lounkine, E. et al. Large-scale prediction and testing of drug activity on side-effect targets. Nature 486, 361–367 (2012).
42. Keiser, M. J. et al. Predicting new molecular targets for known drugs. Nature 462, 175–181 (2009).
43. He, X., Cai, D. & Niyogi, P. Laplacian score for feature selection. Adv. Neural Inform. Process. Syst. 18 (2005).
44. Sheikhpour, R., Sarram, M. A., Gharaghani, S. & Chahooki, M. A. Z. Feature selection based on graph Laplacian by using compounds with known and unknown activities. J. Chemometrics 31, 2899 (2017).
45. Valizade Hasanloei, M. A., Sheikhpour, R., Sarram, M. A., Sheikhpour, E. & Sharifi, H. A combined Fisher and Laplacian score for feature selection in QSAR based drug design using compounds with known and unknown activities. J. Comput.-Aided Mol. Des. 32, 375–384 (2018).
46. Cruz-Monteagudo, M. et al. Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? Drug Discov. Today 19, 1069–1080 (2014).
47. Stumpfe, D., Hu, H. & Bajorath, J. Evolving concept of activity cliffs. ACS Omega 4, 14360–14368 (2019).
48. Maggiora, G. M. On outliers and activity cliffs – why QSAR often disappoints. J. Chem. Inform. Modeling 46, 1535–1535 (2006).
49. Hu, H. & Bajorath, J. Simplified activity cliff network representations with high interpretability and immediate access to SAR information. J. Comput.-Aided Mol. Des. 34, 943–952 (2020).
50. Weinberger, K. Q., Blitzer, J. & Saul, L. Distance metric learning for large margin nearest neighbor classification. Adv. Neural Inform. Process. Syst. 18 (2005).
51. Weinberger, K. Q. & Tesauro, G. in Artificial Intelligence and Statistics (eds Meila, M. & Shen, X.) 612–619 (PMLR, 2007).
52. Kireeva, N. V., Ovchinnikova, S. I., Kuznetsov, S. L., Kazennov, A. M. & Tsivadze, A. Y. Impact of distance-based metric learning on classification and visualization model performance and structure–activity landscapes. J. Comput.-Aided Mol. Des. 28, 61–73 (2014).
53. Horvath, D., Marcou, G. & Varnek, A. In (ed Roy, K.) Advances in QSAR Modeling: Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences 167–199 (Springer Verlag, 2017).
54. Fröhlich, H., Wegner, J. K., Sieker, F. & Zell, A. Kernel functions for attributed molecular graphs – a new similarity-based approach to ADME prediction in classification and regression. QSAR Combinatorial Sci. 25, 317–326 (2006).
55. Mohr, J. A., Jain, B. J. & Obermayer, K. Molecule kernels: a descriptor- and alignment-free quantitative structure–activity relationship approach. J. Chem. Inform. Modeling 48, 1868–1881 (2008).
56. Charlton, M., Fotheringham, S. & Brunsdon, C. Geographically Weighted Regression Vol. 2, White paper (National Centre for Geocomputation, National University of Ireland Maynooth, 2009).
57. Johnson, R. A. & Dean, W. W. et al. Applied Multivariate Statistical Analysis, 5th edn. (Prentice Hall, NJ, 2002).
58. Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, 945–954 (2017).
59. Bosc, N., Atkinson, F., Felix, E., Gaulton, A., Hersey, A. & Leach, A. R. Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J. Cheminform. 11, 1–16 (2019).
60. Carroll, R. J. & Ruppert, D. Prediction and tolerance intervals with transformation and/or weighting. Technometrics 33, 197–210 (1991).
61. Asmussen, S., Jensen, J. L. & Rojas-Nandayapa, L. On the Laplace transform of the lognormal distribution. Methodol. Comput. Appl. Probab. 18, 441–458 (2016).
62. Fotheringham, A. S., Brunsdon, C. & Charlton, M. Geographically Weighted Regression: the Analysis of Spatially Varying Relationships (John Wiley & Sons, 2003).
63. Zhang, R., Nolte, D., Sanchez-Villalobos, C., Ghosh, S. & Pal, R. Topological Regression as an interpretable and efficient tool for Quantitative Structure-Activity Relationship Modeling. Zenodo https://doi.org/10.5281/zenodo.10929477 (2024).
Acknowledgements
This work was supported in part by the National Science Foundation under Grants Nos. 2007903 (received by R.P.) and 2007418 (received by S.G.) and Leidos Biomed/NCI under contract 22X049. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or Leidos Biomed/NCI. The authors acknowledge the High Performance Computing Center (HPCC) at Texas Tech University for providing computational resources that have contributed to the research results reported within this paper. http://www.hpcc.ttu.edu.
Author information
Contributions
R.Z., D.N., S.G., and R.P. formulated the problem and conceived the experiments; R.Z., D.N., and C.S. conducted the experiments; R.Z., D.N., C.S., S.G., and R.P. analyzed the results. All authors reviewed the manuscript. R.Z. conducted this work while he was working at Texas Tech University; he is currently working at Merck Inc.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Martin Vogt, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, R., Nolte, D., Sanchez-Villalobos, C. et al. Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling. Nat Commun 15, 5072 (2024). https://doi.org/10.1038/s41467-024-49372-0
DOI: https://doi.org/10.1038/s41467-024-49372-0