Augmented drug combination dataset to improve the performance of machine learning models predicting synergistic anticancer effects

Liu, Mengmeng; Srivastava, Gopal; Ramanujam, J.; Brylinski, Michal

doi:10.1038/s41598-024-51940-9

Published: 18 January 2024

Scientific Reports volume 14, Article number: 1668 (2024)

Abstract

Combination therapy has gained popularity in cancer treatment as it enhances the treatment efficacy and overcomes drug resistance. Although machine learning (ML) techniques have become an indispensable tool for discovering new drug combinations, the data on drug combination therapy currently available may be insufficient to build high-precision models. We developed a data augmentation protocol to unbiasedly scale up the existing anti-cancer drug synergy dataset. Using a new drug similarity metric, we augmented the synergy data by substituting a compound in a drug combination instance with another molecule that exhibits highly similar pharmacological effects. Using this protocol, we were able to upscale the AZ-DREAM Challenges dataset from 8798 to 6,016,697 drug combinations. Comprehensive performance evaluations show that ML models trained on the augmented data consistently achieve higher accuracy than those trained solely on the original dataset. Our data augmentation protocol provides a systematic and unbiased approach to generating more diverse and larger-scale drug combination datasets, enabling the development of more precise and effective ML models. The protocol presented in this study could serve as a foundation for future research aimed at discovering novel and effective drug combinations for cancer treatment.

Introduction

Developing effective anticancer therapies is an important yet challenging task. Most currently available treatments employ a monotherapy, i.e., using a single drug to treat a particular disease^1,2. Although widely used, monotherapies are known to suffer from certain problems, such as acquired drug resistance and prominent side effects^1,3. In contrast, combination therapies utilizing multiple pharmaceuticals to simultaneously target several biological processes generally have greater chances of overcoming these issues⁴. Not surprisingly, combination therapies against complex diseases, such as cancer, are attracting significant attention. Nonetheless, the need to explore all possible drug combinations within a vast pharmacological space is a major obstacle to finding those drug combinations that exhibit synergistic effects. Accurate computational methods to select the most promising therapeutic candidates for experimental testing can greatly facilitate the discovery of effective drug combinations.

Approaches utilizing machine learning (ML) are well suited to predict drug synergistic effects. Supervised learning techniques require large-scale experimental data to train models predicting effective drug combinations. These datasets differ with respect to the number of drugs and cell lines. For instance, A Large Matrix of Antineoplastic Agent Combinations from the National Cancer Institute (NCI-ALMANAC) contains 5232 drug pairs tested against 60 cancer cell lines⁵. Another resource provides drug responses measured for a panel of 39 cancer cell lines and 22 experimental drugs in all possible pairwise combinations and in combination with 16 approved drugs, totaling 583 compound pairs⁶. Other datasets are focused on a specific cell line; for example, 1833 bioactive drugs at 5 μM were tested in combination with temozolomide at 400 μM against the human glioblastoma cell line T98G⁷. Furthermore, 1327 drug combinations from the CeMM library of unique drugs (CLOUD) dataset containing 308 prodrugs and active drugs⁸ were found effective against a human chronic myeloid leukemia cell line KBM-7⁹.

Meta-datasets collect and standardize the results of individual drug combination screening studies in order to enable a more efficient utilization of these data resources. For instance, DrugComb is an open-access data portal to 739,964 combinations of 8397 drugs tested on 2320 cell lines from 33 tissues^10,11. It quantifies the degree of drug-drug interactions over the full dose–response matrix with several synergy scores: Bliss independence (BLISS), Highest single agent (HSA), Loewe additivity (LOEWE), and Zero interaction potency (ZIP)^12,13,14. SYNERGxDB is a comprehensive dataset compiled from nine individual datasets containing 22,507 pairwise combinations of 1977 drugs tested on 151 cell lines from 15 tissues¹⁵. Similar to DrugComb, SYNERGxDB also provides standardized synergy scores, BLISS and ZIP. Finally, Dialog for Reverse Engineering Assessments and Methods (DREAM) Challenges partnered with AstraZeneca and the Sanger Institute to compile a dataset of 20,483 synergy scores for 910 drug combinations involving 118 anticancer drugs tested against 85 cancer cell lines¹⁶. This dataset also provides a quality assessment score for each combination, ranging from −3 to 1, where 1 indicates a synergy between drugs in the combination. Along with the synergy data for drug combinations, the AZ-DREAM Challenges data comprise various molecular data, such as mutations, copy number variation, gene expression, and the tissue of origin. These datasets offer unparalleled opportunities to develop highly accurate ML models to predict drug synergistic effects.

Since the performance of supervised ML strongly depends on the quality, quantity, and the contextual subject of training data, the data scarcity problem is one of the most common challenges in developing robust ML models. To overcome this difficulty, data augmentation techniques are widely employed to expand the volume of available data. For instance, classical augmentation methods, such as image flipping, image rotation, noise injection, kernel filters, random erasing, and image mixing, are frequently used in the medical image analysis domain^17,18,19,20,21,22. Data augmentation techniques gaining attention in the medical time series analysis domain²³ include the time domain augmentation²⁴, the time–frequency domain augmentation²⁵, decomposition-based methods^26,27, statistical generative models^28,29, and learning-based methods^30,31,32,33. In addition, more advanced deep learning-based augmentation techniques, including the feature space augmentation^34,35, generative adversarial networks (GAN)-based augmentation^36,37,38,39,40, the neural style transfer^41,42, and meta-learning schemes^43,44,45, have been proposed.

To combat overfitting in a neural network architecture with 60 million parameters for image recognition, two types of data augmentation were employed, label-preserving transformations and altering the intensities of the RGB channels in training images using Principal Component Analysis⁴⁶. Indeed, these data augmentation techniques significantly reduced overfitting and improved performance, leading to the reduction in the top-1 error rate by more than 1%. CutMix is an interesting augmentation technique that combines regions from different images to create augmented samples⁴⁷. CutMix improves model generalization by encouraging localization, providing diverse training examples, and enhancing model robustness against input corruption, as well as out-of-distribution detection performances. Augmenting training data with bilingual lexicon information was demonstrated to improve the performance of machine translation models on low-resource and unsupervised languages⁴⁸. Three main types of lexical augmentation employed are codeswitching, lexical prompting, and raw token-pair training. Extensive experimentation results show that applying any of these augmentations to monolingual data yields substantial improvements, and that they can be combined for even greater effect.

Although image, language, and sequential data augmentation methods are well established, these approaches are, in principle, unsuitable to generate the heterogeneous data of cellular and molecular features for drug synergy prediction with supervised ML. On that account, a variety of domain-specific techniques have been developed. For instance, the fact that multiple simplified molecular-input line-entry system (SMILES) strings represent the same molecule was used to augment a molecular dataset of chemical species⁴⁹ using the SMILES enumeration⁵⁰. Further, data augmentation utilizing multiple SMILES representations for a single compound was demonstrated to enhance the prediction accuracy of various molecular properties, such as solubility, lipophilicity, and bioactivity, irrespective of the specific machine learning model employed or the size of the dataset⁵¹. Another study doubled the size of a training dataset to predict anticancer drug synergism based on NCI-ALMANAC by generating duplicates with the reverse order of drugs⁵². Data up-sampling was also applied to increase the number of minor class instances for phenotype-based virtual screening of anticancer drug combinations⁵³. Finally, an example of a deep learning-based data augmentation technique is the uniform graph convolutional network (UGCN)⁵⁴. It employs a drug representation based on atomic interactions within organic compounds rather than hand-crafted features, such as molecular fingerprints, and string-based features, such as SMILES. UGCN can be used to augment chemical data by randomly sampling multiple complementary graphs for a single drug.

Despite the encouraging results reported for the abovementioned data augmentation techniques for drug synergy prediction, many existing methods are either too general (up-sampling) or consider only drug structural information (SMILES enumeration and UGCN). To address these issues, we devised a new augmentation approach combining the drug chemical similarity with the system-level information on drug-target interactions. This approach employs a novel similarity metric, the drug action/chemical similarity (DACS) score, taking into account not only the chemical characteristics of drugs, but also their molecular targets. Applying the DACS score to augment the AZ-DREAM Challenges data with new compounds from PubChem⁵⁵ significantly increased the size and diversity of the training dataset for drug synergy prediction. To the best of our knowledge, this methodology represents the first systematic and effective protocol to augment a synergy dataset simultaneously utilizing the information on drug chemical structures and their protein targets. As a proof of concept, the augmented dataset was used to train several ML models demonstrating a higher accuracy of drug synergy prediction compared to those models trained on the original AZ-DREAM Challenges data.

Results

Similarity measure for cellular responses to drug treatment

During the data augmentation, new drug combinations are generated by replacing drugs with those molecules triggering similar pharmacological responses. The similarity of pharmacological effects of two drugs is quantified by the Kendall τ correlation coefficient between pIC₅₀ values for the monotherapy treatments of multiple cancer cell lines. A positive value of Kendall τ indicates that two drugs have similar pharmacological effects in terms of the inhibition of the cancer growth, whereas a negative correlation and the lack of correlation point to different cellular responses to drug treatment. This concept is illustrated in Fig. 1 for crizotinib, a tyrosine kinase inhibitor used for the treatment of non-small cell lung carcinoma (NSCLC)⁵⁶, paired with six other anti-cancer drugs. Figures 1A–C are examples of a positive correlation between crizotinib and everolimus (Kendall τ of 0.50), entinostat (Kendall τ of 0.44), and perifosine (Kendall τ of 0.42), respectively. Everolimus, a derivative of sirolimus with cell proliferation and immunosuppressive properties, is used in combination with other anticancer agents for the treatment of kidney and breast cancer, and neuroendocrine tumors of gastrointestinal and lung origins⁵⁷. Entinostat, a benzamide derivative with the antineoplastic activity, and perifosine, an allosteric AKT inhibitor with the antiglycolytic activity, are used for the treatment of NSCLC^58,59. According to the analysis of pIC₅₀ values against multiple cancer cell lines, these three drugs have similar profiles to that of crizotinib, i.e., they inhibit the growth of the same cancer cell lines and are ineffective against the same group of cell lines as well.

In contrast, the cellular responses to crizotinib are uncorrelated with those to adavosertib (Fig. 1D, Kendall τ of −0.06), vinorelbine (Fig. 1E, Kendall τ of −0.03), and capivasertib (Fig. 1F, Kendall τ of −0.01). Adavosertib is a tyrosine kinase WEE1 inhibitor used to improve the outcome in triple-negative breast cancer⁶⁰, vinorelbine is an agent to treat NSCLC and breast cancer⁶¹, and capivasertib is an AKT inhibitor used in the treatment of breast cancer⁶². Since these drugs have uncorrelated pharmacological effects, they cannot be used to replace crizotinib during the data augmentation process. The analysis of cellular responses with the Kendall τ is versatile and can be applied whenever two drugs have been tested on at least two common cell lines; otherwise, the value of the Kendall τ is set to 0. The similarities of pharmacological effects between crizotinib and everolimus, entinostat, perifosine, adavosertib, vinorelbine, and capivasertib were calculated based on 7 + 2, 9 + 0, 7 + 0, 9 + 0, 0 + 13, and 9 + 10 common (breast + lung) cell lines, respectively.
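
The similarity measure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the dictionary layout (cell line name mapped to pIC₅₀), and the example values are all hypothetical. The tie-corrected variant (τ_b) is used, and pairs of drugs with fewer than two common cell lines receive a value of 0, as stated in the text.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Tie-corrected Kendall tau (tau-b) for two equal-length sequences."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


def pharmacological_similarity(pic50_a, pic50_b):
    """Kendall tau-b over the cell lines on which both drugs were tested.

    Returns 0.0 when fewer than two cell lines are shared.
    """
    common = sorted(set(pic50_a) & set(pic50_b))
    if len(common) < 2:
        return 0.0
    return kendall_tau_b([pic50_a[c] for c in common],
                         [pic50_b[c] for c in common])


# Hypothetical pIC50 profiles (illustrative values only, not from the study)
drug_a = {"MCF7": 6.1, "A549": 5.2, "T47D": 7.0}
drug_b = {"MCF7": 5.9, "A549": 4.8, "T47D": 6.5, "H1299": 5.0}
print(pharmacological_similarity(drug_a, drug_b))  # 1.0 (identical ranking)
```

A rank correlation is a natural choice here because it compares which cell lines respond more strongly to each drug without requiring the two drugs to have comparable absolute potencies.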

Relation between drug similarity and pharmacological effects

Next, we investigate how similar two drugs need to be in order to trigger similar pharmacological effects. This analysis is performed for all $\binom{98}{2} = 4753$ possible pairs of 98 drugs in the AZ-DREAM Challenges dataset. Pharmacological responses are quantified with the Kendall τ correlation coefficient, whereas the drug similarity is measured with two metrics. The first score is the drug chemical similarity calculated as the Tanimoto coefficient (TC) between FP2 fingerprints⁶³. Figure 2 (solid blue line) shows that, as expected, the fraction of drug pairs with the positive Kendall τ increases with the increasing chemical similarity and reaches a value of 1.0 for the TC threshold of 0.6. The second metric is the drug action similarity computed as the Matthews correlation coefficient (MCC)⁶⁴ between target proteins in the protein–protein interaction (PPI) network from the IHP-PING dataset⁶⁵. Similar to the TC, the fraction of drug pairs with the positive Kendall τ also increases with the increasing MCC reaching 1.0 for the MCC threshold of 0.6 (Fig. 2, dashed purple line). For comparison, increasing the threshold for a random similarity does not increase the fraction of drug pairs with the positive Kendall τ (Fig. 2, dotted black line).
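
The two similarity metrics can be illustrated with a short sketch. Several things here are assumptions made only for illustration: fingerprints are represented as sets of on-bit indices rather than actual FP2 fingerprints, and the MCC is computed by treating each drug's targets as a binary vector over all proteins in the network, which may differ from how the study derives it from the IHP-PING data. The function names are hypothetical.

```python
from math import comb, sqrt


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def matthews_cc(targets_a, targets_b, all_proteins):
    """MCC between two target sets viewed as binary vectors over all_proteins.

    Assumption: a protein is 'positive' for a drug if it is one of its targets.
    """
    a, b = set(targets_a), set(targets_b)
    tp = len(a & b)                       # targeted by both drugs
    fp = len(b - a)                       # targeted by drug B only
    fn = len(a - b)                       # targeted by drug A only
    tn = len(set(all_proteins) - a - b)   # targeted by neither drug
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Number of unordered pairs among 98 drugs, as quoted in the text
print(comb(98, 2))                                   # 4753

print(tanimoto({1, 2, 3}, {2, 3, 4}))                # 0.5
print(matthews_cc({"A", "B"}, {"A", "B"}, "ABCD"))   # 1.0
```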

Drug action/chemical similarity score

Analyses presented above demonstrate that both chemical and drug action similarities can be used for data augmentation. However, their combination could potentially cover a larger chemical space than individual similarities while ensuring that the pharmacological profiles of drugs selected for augmentation are highly similar to those of their parent molecules. Therefore, we combined TC and MCC into a new metric, the drug action/chemical similarity (DACS) score. Figure 3 shows the relation between the DACS score and the fraction of drug pairs with the positive Kendall τ as the spatial heatmap in two dimensions corresponding to the individual similarities. The dark blue section in the upper left corner of the heatmap corresponds to the area of a low positive correlation, whereas the light blue section shows the combination of individual similarities resulting in a high positive correlation. The DACS score can be represented as a quarter circle in Fig. 3 (dashed black line). For example, above a DACS threshold of 0.6, as many as 85.7% of drug pairs have a positive Kendall τ correlation.
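
This section does not state the formula used to combine TC and MCC. Because a constant DACS value is said to trace a quarter circle in the TC/MCC plane, one plausible reading is a Euclidean norm of the two similarities. The sketch below adopts that reading purely as an assumption; clamping negative inputs to zero (so that only the positive quadrant contributes) is a further assumption, and the function name is hypothetical.

```python
from math import hypot


def dacs_score(tc, mcc):
    """Assumed form of the DACS score: Euclidean norm of (TC, MCC).

    Negative similarities are clamped to 0 so that a strongly negative MCC
    is not mistaken for a high similarity (an assumption of this sketch).
    """
    return hypot(max(tc, 0.0), max(mcc, 0.0))


# Under this assumption, a pair with TC = 0.36 and MCC = 0.48 lies on the
# quarter circle of radius 0.6 in the TC/MCC plane.
print(round(dacs_score(0.36, 0.48), 6))  # 0.6
```

With this form, a pair can exceed a given threshold through high chemical similarity, high target similarity, or a moderate amount of both, which matches the stated motivation for combining the two metrics.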

Dataset augmentation with DACS

The DACS metric is used as a guide to find the optimal number of new instances to be generated for the synergy dataset according to a procedure presented in Fig. 4. Each instance in the AZ-DREAM Challenges dataset consists of a pair of drugs targeting a cell line with a particular synergy score (Fig. 4A, drug pair 1:2). During the augmentation procedure, candidate molecules to replace one drug in a pair are identified in the STITCH database⁶⁶ (Fig. 4B, drugs 3, 4, and 5). Next, DACS scores against the drug to be replaced are calculated (Fig. 4C) and those molecules having scores larger than a cutoff are selected (Fig. 4D, drugs 3 and 5). The original drug is then replaced by the selected molecules to create augmented pairs (Fig. 4E, drug pairs 3:2 and 5:2). This procedure is repeated for the second drug in the original pair creating more augmented instances (Fig. 4F, drug pairs 1:6).
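
The substitution procedure in Fig. 4 can be sketched as follows. The function and argument names are hypothetical, the `dacs` callable stands in for whatever similarity computation is used, and carrying the parent instance's cell line and synergy score over to each augmented instance is an assumption inferred from the statement that augmented pairs are annotated with synergy scores.

```python
def augment_instance(drug_a, drug_b, cell_line, synergy, candidates, dacs, cutoff):
    """Create augmented instances by replacing one drug of a pair at a time.

    dacs(parent, candidate) returns the similarity between the drug being
    replaced and a candidate substitute; only candidates scoring above
    `cutoff` are used. The cell line and synergy score are inherited from
    the parent instance (assumption).
    """
    augmented = []
    for cand in candidates:
        if cand in (drug_a, drug_b):
            continue  # skip candidates already present in the pair
        if dacs(drug_a, cand) > cutoff:
            augmented.append((cand, drug_b, cell_line, synergy))
        if dacs(drug_b, cand) > cutoff:
            augmented.append((drug_a, cand, cell_line, synergy))
    return augmented


# Toy example mirroring Fig. 4: pair 1:2, candidates 3 to 6 (scores are made up)
scores = {(1, 3): 0.70, (1, 4): 0.30, (1, 5): 0.60, (2, 6): 0.80}
toy_dacs = lambda parent, cand: scores.get((parent, cand), 0.0)

new_pairs = augment_instance(1, 2, "cell_line_X", 25.0, [3, 4, 5, 6], toy_dacs, 0.53)
print([(a, b) for a, b, _, _ in new_pairs])  # [(3, 2), (5, 2), (1, 6)]
```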

The selection of a cutoff for DACS scores between the original drug to be replaced and the candidate substitute compounds is critical to create high-quality augmented instances. On that account, we conducted an analysis of the fraction of new drugs having similar pharmacological profiles to their parent molecules and the number of new instances that can be obtained from the STITCH database at different DACS similarity thresholds. Figure 5 shows that these two quantities are inversely related: increasing the DACS similarity threshold raises the chance that substitute compounds trigger similar pharmacological responses (dashed purple line) but, at the same time, leaves fewer molecules available to augment the dataset (solid blue line). The intersection point marked by a dotted black line in Fig. 5 represents the DACS cutoff of 0.53, at which the majority of substitute drugs (82%) have similar pharmacological profiles to their parent molecules and as many as 42,225 new drugs can be obtained from the STITCH database to augment the synergy dataset. Applying this threshold to replace one molecule in a drug pair in the AZ-DREAM Challenges dataset of 8798 instances produces an augmented dataset of 6,016,697 drug pairs annotated with synergy scores against various cancer cell lines.

Ideally, the distribution of synergy values across the augmented dataset should be the same as for the AZ-DREAM Challenges dataset. Figure 6 shows that these two distributions indeed are similar; the average synergy score ± standard deviation is 9.9 ± 26.1 for the AZ-DREAM Challenges dataset and 12.1 ± 28.5 for the augmented dataset. In addition, we compare various physicochemical properties of drugs present in the original and augmented dataset to those calculated for a set of 27,385 molecules selected randomly from the STITCH database⁶⁶. Indeed, the original and augmented drugs have similar octanol–water partition coefficient (logP, 3.6 ± 2.0 and 3.8 ± 1.8), the number of hydrogen bond donors (HBD, 2.0 ± 1.2 and 2.0 ± 1.6) and acceptors (HBA, 6.8 ± 2.6 and 5.8 ± 2.4), and the Quantitative Estimate of Druglikeness⁶⁷ (QED, 0.48 ± 0.18 and 0.49 ± 0.20). For comparison, logP, HBD, HBA, and QED for random molecules are 3.2 ± 2.4, 1.9 ± 1.9, 5.0 ± 2.7, and 0.50 ± 0.22, respectively. These analyses demonstrate that the augmented dataset does not contain artifacts, such as molecules with certain physicochemical properties, that could potentially bias the training of machine learning models toward a particular effect (either synergism or antagonism).

Drug synergy prediction with machine learning

Finally, we investigate whether training machine learning models on the augmented data achieves a better classification performance than training on the original AZ-DREAM Challenges dataset. Four state-of-the-art machine learning methods are employed: Logistic Regression (LR)^68,69, Support Vector Machines (SVM)^70,71, Random Forest (RF)⁷², and Gradient Boosting Trees (GBT)⁷³. Following the original publication¹⁶, drug pairs having synergy scores higher than 20 are labelled synergistic and those having synergy scores lower than −20 are labelled antagonistic. First, we performed a fivefold cross-validation by randomly splitting the dataset into 5 subsets. Note that the augmented data are only used to train machine learning models, which are then validated against AZ-DREAM Challenges instances. Table 1 shows the classification performance evaluated with several metrics. Encouragingly, the performance of classifiers is improved when models are trained on the augmented data and the random-split validation is employed. For instance, the area under the receiver operating characteristic plot (AUC) increased from 0.802 to 0.809 for RF and from 0.859 to 0.863 for GBT classifiers.
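
The labelling rule can be written down directly. In this sketch the function name is hypothetical, and returning `None` for scores between the two thresholds (so that such instances can be left out of training and validation) is an assumption; strict inequalities follow the wording of the paragraph above.

```python
def label_synergy(score, threshold=20.0):
    """Map a synergy score to a class label.

    Returns 1 (synergistic) for score > threshold, 0 (antagonistic) for
    score < -threshold, and None for ambiguous scores in between.
    """
    if score > threshold:
        return 1
    if score < -threshold:
        return 0
    return None


scores = [35.2, -41.0, 4.7, 20.0]
print([label_synergy(s) for s in scores])  # [1, 0, None, None]
```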

Table 1 Performance of machine learning in the prediction of drug synergistic effects.

Although a random-split cross-validation is often used to assess the performance of drug synergy predictors¹⁶, it leads to a significant overlap between training and validation subsets because those instances involving similar cell lines are present in both sets. Consequently, the trained model is going to have only a weak ability to generalize to unseen data, even though the validation accuracy may seem high. In order to mitigate this issue and more reliably evaluate the performance of machine learning trained on drug synergy data, we conducted a tissue-based cross-validation in which each fold comprises a particular tissue (or a group of tissues). This protocol has been shown to eliminate the overlap between training and validation subsets allowing for an unbiased assessment of the capabilities of machine learning to extract the information from input data⁷⁴.
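
A tissue-based split can be sketched as a generator that holds out one tissue per fold. The record layout (`"tissue"` and `"augmented"` keys) and the function name are hypothetical. Consistent with the text, augmented instances are used for training only, and every instance from the held-out tissue, original or augmented, is excluded from the training set.

```python
def tissue_based_folds(instances):
    """Yield (held_out_tissue, train_indices, validation_indices) per tissue.

    Training uses all instances from the other tissues (original and
    augmented); validation uses only original instances of the held-out tissue.
    """
    tissues = sorted({inst["tissue"] for inst in instances})
    for held_out in tissues:
        train = [i for i, inst in enumerate(instances)
                 if inst["tissue"] != held_out]
        valid = [i for i, inst in enumerate(instances)
                 if inst["tissue"] == held_out and not inst["augmented"]]
        yield held_out, train, valid


# Toy dataset: two tissues, each with one original and one augmented instance
data = [
    {"tissue": "breast", "augmented": False},
    {"tissue": "breast", "augmented": True},
    {"tissue": "lung", "augmented": False},
    {"tissue": "lung", "augmented": True},
]
folds = list(tissue_based_folds(data))
print(folds)  # [('breast', [2, 3], [0]), ('lung', [0, 1], [2])]
```

Grouping several tissues into one fold, as the text allows, would only require mapping each tissue to a fold identifier before the split.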

Table 1 and receiver operating characteristic plots presented in Fig. 7 show that applying the more rigorous tissue-based validation protocol decreases the performance of machine learning predicting drug synergistic effects. However, this evaluation is more reliable because it better mimics a real scenario in which machine learning is applied to predict drug synergistic effects for unseen data, i.e., drug combinations against cell lines originating from tissues that have not been used to train the classifier. With this cross-validation protocol, machine learning trained on the augmented data yields even higher improvements in terms of the classification accuracy compared to models trained on the original AZ-DREAM Challenges dataset. For example, the AUC increased from 0.647 to 0.685 for RF and from 0.688 to 0.734 for GBT classifiers.

Table 2 shows AUC scores for each tissue fold and tree-based models trained on both the original and the augmented datasets. The comparison of AUC scores reveals that incorporating the augmented data into the training process systematically improves the classification performance regardless of the tissue type. In general, these findings indicate that incorporating augmented data can provide enhanced information for training machine learning models in a more effective manner.

Table 2 Area under the receiver operating characteristic plot (AUC) scores for each fold in the tissue-based cross-validation.

Classification of instances with ambiguous synergy scores

The robustness of ML models stems from the foundation laid by the quality of the training data, ensuring that they can effectively handle diverse and complex scenarios with a high degree of accuracy. When a machine learning model encounters instances with ambiguous labels, it adapts by making predictions that are less confident for such cases. To illustrate this phenomenon, we evaluate the capability of the trained GBT model to handle instances with unclear class labels by assessing its performance across a spectrum of synergy scores. The GBT model was selected because its performance in fivefold cross-validation against instances with reliable synergy scores $\ge 20$ (synergistic cases) and $\le -20$ (antagonistic cases) is better than those of LR, SVM, and RF. Figure 8 shows the distribution of prediction probabilities reported by the GBT model for drug combinations selected from the AZ-DREAM Challenges dataset with a varying degree of synergy scores with the corresponding statistics reported in Table 3.

Table 3 Statistics for the distribution of prediction probabilities across varying degrees of drug synergy.

Including ambiguous labels represented by synergy scores close to 0 lowers the confidence, and the model attempts to reflect this uncertainty in its predictions. For instance, Fig. 8A shows that the median (Q₂) prediction probability is 0.981 when the most ambiguous positive cases with the synergy score of $>0$ are included, while it is as high as 0.999 when the model is applied to only the most reliable positive cases with the synergy score of $\ge 20$. This trend can also be observed for negative instances (Fig. 8B), for which the median prediction probability increases from 0.248 for the most ambiguous cases with the synergy score of $<0$ to 0.687 for the most reliable cases with the synergy score of $\le -20$. Another indication of the lack of strong prediction confidence when instances having unclear labels are included is the increased spread of prediction probabilities. Indeed, wider interquartile ranges (Q₃-Q₁) are observed when ambiguous positive cases are considered compared to those obtained for the most reliable drug combinations only. For negative cases, Q₂ and Q₃ values decrease as more unclear instances are included, meaning there is a concentration of prediction probability towards the lower values, which signifies the declined prediction confidence for those instances and a diminished level of assurance in the ability to assign accurate classifications by the model.

Evaluation against “unseen” data

To further evaluate the generalizability of a model trained on the AZ-DREAM Challenges augmented data, we conducted the performance evaluation against an independent dataset of 250 drug combinations selected from DrugCombDB⁷⁵. It is important to note that since drugs in this set are chemically dissimilar to those in the AZ-DREAM Challenges dataset, DrugCombDB instances can be regarded as “unseen” data. In this analysis, two GBT models were trained, one using the original AZ-DREAM Challenges data and the other using both the original and augmented instances. A GBT model trained solely on the original data correctly classified only 76/250 drug combinations (12 synergistic and 64 antagonistic) yielding the accuracy of 0.30 and a high false positive rate (FPR) of 0.73. In contrast, a GBT model that incorporated augmented data during training correctly predicted 141/250 drug combinations (11 synergistic and 130 antagonistic) achieving a much higher accuracy of 0.56 and a significantly lower FPR of 0.45. This improved performance by employing augmented instances highlights the importance of data augmentation techniques in enhancing the ability of machine learning models to generalize to new drug synergy data. Through exposure to a comprehensive and diverse dataset, the model acquired improved pattern recognition capabilities and achieved more accurate classifications, resulting in an enhanced reliability for drug synergy predictions in a real-world application scenario.
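
The accuracies quoted above follow directly from the reported counts, as the short check below shows. The helper names are hypothetical. The false positive rate is included only as a definition: the number of antagonistic instances in the 250-combination set is not stated in this section, so the reported FPR values are not recomputed here.

```python
def accuracy(correct, total):
    """Fraction of correctly classified instances."""
    return correct / total


def false_positive_rate(false_positives, true_negatives):
    """FPR = FP / (FP + TN), i.e. the share of negatives predicted positive."""
    return false_positives / (false_positives + true_negatives)


# Original data only: 12 synergistic + 64 antagonistic correct out of 250
print(round(accuracy(12 + 64, 250), 2))    # 0.3
# Original + augmented data: 11 synergistic + 130 antagonistic correct
print(round(accuracy(11 + 130, 250), 2))   # 0.56
```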

Discussion

In this study, we devised a data augmentation protocol to solve the data scarcity problem in predicting synergistic effects of anti-cancer drug combinations with machine learning models. The augmentation protocol expands the synergy dataset by replacing a compound in a drug combination instance with another molecule having highly similar pharmacological effects. This is achieved through the use of the DACS similarity metric between two drugs, which incorporates both chemical structure and drug action similarities. Compared to existing techniques used in synergy data augmentation, such as the upsampling⁵³, the SMILES enumeration⁵⁰, and the reverse order of drugs⁵², which essentially duplicate the existing data points, our approach expands the dataset by including new, unbiased instances. As a result, this augmentation methodology not only enriches the available data points, but also enhances the diversity of the data, which is highly beneficial to improve the generalizability of machine learning models. Additionally, in contrast to other augmentation approaches involving a learning process⁵⁴, our method generates data points in a shorter amount of time.

While random-split cross-validation is frequently utilized for data partitioning, it may lead to tissue-level overlap and elevate the possibility of model overfitting, particularly when dealing with data containing multiple cell lines from the same tissue. The reason for this is that those instances involving similar cell lines tend to have comparable feature representations, such as gene expression profiles and the gene-disease association. The overlap is likely to occur when these instances are present in both the training and validation sets⁷⁶. In such cases, the trained model may exhibit a strong performance due to the presence of overlapping data, but it will not perform well on novel, unseen data. Consequently, the true performance of the model may be overestimated, and the model may fail to generalize to other datasets. On the other hand, a tissue-based cross-validation can effectively eliminate the data overlap issue. By excluding all instances originating from a validation tissue from the training set for each fold, the generalizability of a machine learning model can be properly evaluated.

Tree-based models (RF and GBT) employed in this study are robust, interpretable, and widely adopted by AZ-DREAM Challenges participants¹⁶. These models have the ability to deal with complex non-linear input–output relationships and can handle sizable datasets to a certain degree. Neither tree-based models nor other classifiers like LR and SVM are designed to exploit intricate relationships between features. This limitation is especially notable when dealing with heterogeneous features, including protein–protein interactions, gene expression levels, and drug-protein associations. In such cases, these models may struggle to find the optimal decision boundaries, generally leading to an unsatisfactory performance. Neural networks, on the other hand, are better equipped to handle diverse data types and can learn complex relationships between features with hidden layers and non-linear activation functions. This ability to integrate multiple heterogeneous data into a single model can often result in an improved performance compared to tree-based models. Our future research will concentrate on exploring this aspect.

The augmentation protocol devised in this study is not limited to anti-cancer drug data and can be used to expand other synergy datasets as well; it has the potential to become a universal tactic for enhancing datasets in drug discovery and related fields. This could result in a greater amount of data being accessible and ultimately lead to better research results. Furthermore, the newly developed drug similarity measure, the DACS score, improves the way drug similarity is assessed. By integrating both structural and target similarities, DACS provides a more exhaustive and inclusive perspective on drug similarity compared to traditional methods that only examine a single aspect, such as the chemical similarity. By offering a more holistic approach to analyzing and evaluating the similarities between drugs, DACS can help improve the accuracy and efficiency of the drug discovery process.

Deep learning, with its ability to dissect complex data and reveal underlying patterns and relationships, has become a pivotal tool in the field of pharmacology and drug development^77,78. The varied and comprehensive synergy dataset created in this study has the potential to significantly aid deep learning models by offering a diverse range of data for training purposes. The utilization of sufficient data enables deep learning algorithms to recognize intricate relationships and connections among cellular, molecular, and biological system-level features, thereby elevating the precision and efficacy of synergistic effect predictions. Moreover, an extensive and varied dataset reduces the risk of overfitting, a common issue where models become too reliant on limited training data and struggle to generalize to new data. Thus, the utilization of a comprehensive synergy dataset can lead to more robust and dependable deep learning models and ultimately, more advanced outcomes in drug discovery and related fields.

In addition to being used in deep learning-based drug discovery, the proposed anti-cancer drug synergy dataset has the potential to facilitate other applications, such as drug repositioning, drug target identification, toxicity analysis, the modeling of drug interactions, systems pharmacology, and precision medicine. By providing valuable insights into the interactions between drugs, targets, and biological systems, the synergy data can contribute to the development of more effective and safer pharmaceutics. Overall, the wide-ranging possibilities arising from this study may have significant implications for the drug discovery and development field. Ultimately, this could result in the creation of novel therapeutic approaches for a range of diseases.

Methods

Similarity of drug pharmacological effects

The Kendall τ rank correlation coefficient is employed to measure the ordinal association between the pharmacological effects of two drugs against a set of cell lines. First, common cell lines targeted by both drugs are identified and two lists ranked by pIC₅₀ values for monotherapy treatments are calculated. Next, the value of the Kendall τ accounting for ties $\left({\tau }_{b}\right)$^79,80 is computed:

$${\tau }_{b}=\frac{{n}_{c}-{n}_{d}}{\sqrt{\left({n}_{c}+{n}_{d}+{n}_{1}\right)\left({n}_{c}+{n}_{d}+{n}_{2}\right)}}\tag{1}$$

where ${n}_{c}$ is the number of concordant cell line pairs (ranked in the same order in both drug lists), ${n}_{d}$ is the number of discordant cell line pairs (ranked in opposite orders in the two drug lists), ${n}_{1}$ is the number of pairs tied only in the first list, and ${n}_{2}$ is the number of pairs tied only in the second list. A ${\tau }_{b}$ of $+1$ indicates a perfect positive association, i.e., the two drugs have the same pharmacological effects in terms of inhibiting cancer growth across multiple common cell lines. A value of $-1$ indicates a perfect negative association, i.e., opposite pharmacological effects, and a value of $0$ indicates the lack of any association. The Kendall τ coefficient is calculated when pIC₅₀ values are available for monotherapy treatments of at least two common cell lines; otherwise, it is set to $0$.
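The counting behind Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name, the dictionary input format, and the cell line names are hypothetical, and returning 0 when the denominator vanishes (all values tied in one list) is an assumption the text does not address.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(pic50_a, pic50_b):
    """Kendall tau-b between two drugs over their common cell lines.

    pic50_a, pic50_b: dicts mapping a cell line name to a monotherapy pIC50.
    Returns 0.0 when fewer than two common cell lines are available.
    """
    common = sorted(set(pic50_a) & set(pic50_b))
    if len(common) < 2:
        return 0.0

    n_c = n_d = n_1 = n_2 = 0
    for u, v in combinations(common, 2):
        da = pic50_a[u] - pic50_a[v]
        db = pic50_b[u] - pic50_b[v]
        if da == 0 and db == 0:
            continue            # tied in both lists: not counted in Eq. (1)
        if da == 0:
            n_1 += 1            # tied only in the first list
        elif db == 0:
            n_2 += 1            # tied only in the second list
        elif (da > 0) == (db > 0):
            n_c += 1            # concordant pair
        else:
            n_d += 1            # discordant pair

    denom = sqrt((n_c + n_d + n_1) * (n_c + n_d + n_2))
    # Assumption: an undefined coefficient (zero denominator) is reported as 0.
    return (n_c - n_d) / denom if denom > 0 else 0.0


# Hypothetical pIC50 profiles; only cell_1..cell_3 are shared by both drugs.
drug_x = {"cell_1": 6.1, "cell_2": 5.2, "cell_3": 7.0}
drug_y = {"cell_1": 5.9, "cell_2": 4.8, "cell_3": 6.5, "cell_4": 5.0}
print(kendall_tau_b(drug_x, drug_y))  # 1.0: identical ranking on common lines
```

Pairs tied in both lists are skipped because they contribute to neither the numerator nor the denominator of Eq. (1).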

Similarity of drug molecular mechanism of action

Similarity of the mechanism of action of two drugs is quantified with the MCC⁶⁴ computed for 19,968 proteins in the IHP-PING dataset⁶⁵ according to chemical-protein associations obtained from the STITCH database⁶⁶:

$$MCC= \frac{\left(T\times N\right)-\left(A\times B\right)}{\sqrt{\left(T+A\right)\left(T+B\right)\left(N+A\right)\left(N+B\right)}}\tag{2}$$

where $T$ is the number of proteins targeted by both drugs, $N$ is the number of proteins not targeted by any drug, $A$ is the number of proteins only targeted by the first drug, and $B$ is the number of proteins only targeted by the second drug. MCC ranges from $-1$ to $+1$ with high positive values indicating a significant overlap between the molecular targets of two drugs, thus a similar mechanism of action. The MCC for a pair of drugs having different mechanisms of action is going to be around $0$.
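As an illustration of Eq. (2), the sketch below computes the MCC from two sets of target proteins. The function name and the protein identifiers are hypothetical. It assumes that every listed target belongs to the 19,968-protein universe, and it returns 0 when the denominator is zero (for example, a drug with no annotated targets), a case the text does not cover.

```python
from math import sqrt

N_PROTEINS = 19968  # size of the protein set stated in the text


def target_mcc(targets_a, targets_b, n_proteins=N_PROTEINS):
    """MCC between the target sets of two drugs, following Eq. (2)."""
    t = len(targets_a & targets_b)   # T: proteins targeted by both drugs
    a = len(targets_a - targets_b)   # A: targeted only by the first drug
    b = len(targets_b - targets_a)   # B: targeted only by the second drug
    n = n_proteins - t - a - b       # N: proteins targeted by neither drug
    denom = sqrt((t + a) * (t + b) * (n + a) * (n + b))
    # Assumption: an undefined MCC (zero denominator) is reported as 0.
    return (t * n - a * b) / denom if denom > 0 else 0.0


# Hypothetical target sets using placeholder protein identifiers.
same = target_mcc({"P1", "P2", "P3"}, {"P1", "P2", "P3"})
partial = target_mcc({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
disjoint = target_mcc({"P1", "P2"}, {"P3", "P4"})
print(same, partial, disjoint)
```

Because $N$ is so much larger than the number of targets of a typical drug, two drugs with no shared targets score only slightly below zero, which matches the statement that different mechanisms of action give an MCC around 0.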

Drug action/chemical similarity score

The DACS measure provides a convenient and informative way to combine the drug structure similarity with the similarity of the molecular mechanisms of action. It is calculated as:

$$DACS= \sqrt{{TC}^{2}+{MCC}^{2}}\tag{3}$$

where $TC$ is the Tanimoto coefficient between drug FP2 fingerprints⁶³ and $MCC$ is the similarity of drug mechanism of action defined in Eq. (2). When one of the component metrics, either TC or MCC, is sufficiently high, the other metric does not need to be as high for the DACS score to exceed a predefined threshold. In rare cases of negative MCC values, the MCC component of the DACS score is set to 0.
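A minimal sketch of Eq. (3), including the clipping of negative MCC values, is given below. The function name and the input values are illustrative only.

```python
from math import sqrt


def dacs(tc, mcc):
    """DACS score of Eq. (3); negative MCC values are set to 0 first."""
    mcc = max(mcc, 0.0)
    return sqrt(tc ** 2 + mcc ** 2)


# Illustrative (made-up) inputs showing how the two components trade off.
print(dacs(0.9, 0.0))    # high structural similarity alone
print(dacs(0.2, 0.9))    # high target similarity alone
print(dacs(0.8, -0.1))   # a negative MCC contributes nothing
```

Since TC lies in $[0, 1]$ and the clipped MCC also lies in $[0, 1]$, the score ranges from 0 to $\sqrt{2}$.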

Classification datasets

Following the original paper on the AZ-DREAM Challenges dataset¹⁶, we compiled the primary dataset by excluding instances with ambiguous synergy scores between −20 and 20, creating a classification dataset of 3210 drug combinations comprising 2461 synergistic (a synergy score $\ge 20$) and 749 antagonistic (a synergy score $\le -20$) cases. The corresponding augmented dataset contains 1,850,037 synergistic and 465,288 antagonistic combinations, totaling 2,315,325 labeled instances. Further, the following four datasets were constructed at varying degrees of drug synergy to evaluate the performance of ML against instances having ambiguous labels: 8817 combinations comprising 5839 synergistic (a synergy score $>0$) and 2978 antagonistic (a synergy score $<0$) cases; 6974 combinations comprising 4882 synergistic (a synergy score $\ge 5$) and 2092 antagonistic (a synergy score $\le -5$) cases; 5408 combinations comprising 3913 synergistic (a synergy score $\ge 10$) and 1495 antagonistic (a synergy score $\le -10$) cases; and 4180 combinations comprising 3119 synergistic (a synergy score $\ge 15$) and 1061 antagonistic (a synergy score $\le -15$) cases.
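The labeling rule described above can be written as a small helper. This is a sketch with a hypothetical function name and made-up scores; the only behavior taken from the text is the set of cutoffs (20, 15, 10, 5, and 0) and the use of strict inequalities at a cutoff of 0.

```python
def label_synergy(score, cutoff=20.0):
    """Return 1 (synergistic), 0 (antagonistic), or None (excluded as ambiguous)."""
    if cutoff > 0:
        if score >= cutoff:
            return 1
        if score <= -cutoff:
            return 0
    else:
        # At a cutoff of 0 the text uses strict inequalities (> 0 and < 0).
        if score > 0:
            return 1
        if score < 0:
            return 0
    return None


# Made-up synergy scores, labeled at each cutoff used in the text.
scores = [31.5, 12.0, 0.0, -7.3, -20.0]
for cutoff in (20, 15, 10, 5, 0):
    print(cutoff, [label_synergy(s, cutoff) for s in scores])
```

Lowering the cutoff keeps more instances but admits combinations whose synergy scores are closer to zero and therefore less certain.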

In addition to the primary dataset, an independent validation set was created based on DrugCombDB⁷⁵. Applying the same synergy score criteria and excluding molecules with a TC of ≥ 0.4 to any compound in the AZ-DREAM Challenges dataset resulted in 250 drug combinations with 14 synergistic and 236 antagonistic effects, referred to as “unseen” data.

Feature vectors

Input data for machine learning consist of drug and cell features. The former are computed with Mol2vec⁸¹ by encoding a drug chemical structure to a 300-dimensional vector. The latter features are calculated by embedding 17,419 gene expression values for a cell line obtained from the AZ-DREAM Challenges dataset with an adversarial deconfounding autoencoder⁸². Similar to drug embeddings, the gene expression profile is encoded to a 300-dimensional vector. The final, 900-dimensional feature vector is generated by concatenating two drug feature vectors and a cell feature vector.
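The assembly of the 900-dimensional input can be sketched as follows. The embeddings here are zero-filled placeholders standing in for Mol2vec and autoencoder outputs, the function name is hypothetical, and the ordering (first drug, second drug, cell line) is an assumption, since the text does not state the concatenation order.

```python
EMBED_DIM = 300  # dimensionality of each drug and cell line embedding


def build_feature_vector(drug_a_vec, drug_b_vec, cell_vec):
    """Concatenate two drug embeddings and one cell line embedding."""
    parts = (("drug A", drug_a_vec), ("drug B", drug_b_vec), ("cell line", cell_vec))
    for name, vec in parts:
        if len(vec) != EMBED_DIM:
            raise ValueError(f"{name} embedding must have {EMBED_DIM} values")
    # Assumed order: first drug, second drug, cell line.
    return list(drug_a_vec) + list(drug_b_vec) + list(cell_vec)


# Placeholder embeddings; real ones would come from the two encoders.
features = build_feature_vector([0.0] * 300, [0.0] * 300, [0.0] * 300)
print(len(features))  # 900
```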

Cross-validation protocols

Two cross-validation procedures are employed, utilizing a random and a tissue-based data split. In the random-split cross-validation, the classification dataset is randomly partitioned into five equal-size folds. In the tissue-based cross-validation, the dataset is assigned to five groups according to the tissue type of cell lines: breast tissue, the digestive system, the excretory system, the respiratory system, and other tissues. Note that tissue types in the augmented dataset are the same as in the original dataset because the augmentation process does not affect cell lines. A fivefold cross-validation is conducted the usual way, i.e., in each round, the machine learning model is trained on the augmented data for four subsets and then validated against the original AZ-DREAM Challenges instances in the remaining subset. This protocol ensures that the augmented data is used only to train classifiers and that the validation is performed on the original data and labels. Since the original dataset is imbalanced, comprising 76.7% synergistic and 23.3% antagonistic instances, a stratified split is used to preserve the percentage of samples for each class in each fold. When augmenting the training set, the ratio is preserved by proportionally adding instances of each class. In the tissue-based split, although the proportions of synergistic and antagonistic instances are different in each tissue, the training set is augmented in a way that preserves the ratio of synergistic/antagonistic instances in individual folds.
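The random-split protocol can be sketched with the standard library alone. The sketch below is not the authors' implementation. It assumes that each augmented instance records the index of the original instance it was derived from (`augmented_parent`, a hypothetical bookkeeping list) and that augmented instances follow their parent into the training folds. It does not reproduce the class-ratio-preserving selection of augmented instances.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each original instance to a fold while preserving class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            fold_of[idx] = pos % n_folds   # deal class members round-robin
    return fold_of


def cv_rounds(labels, augmented_parent, n_folds=5, seed=0):
    """Yield (train_original, train_augmented, validation) index lists."""
    fold_of = stratified_folds(labels, n_folds, seed)
    for k in range(n_folds):
        validation = [i for i, f in enumerate(fold_of) if f == k]
        train_original = [i for i, f in enumerate(fold_of) if f != k]
        # Augmented instances are used for training only, and only when their
        # parent instance is outside the validation fold.
        train_augmented = [j for j, p in enumerate(augmented_parent)
                           if fold_of[p] != k]
        yield train_original, train_augmented, validation


# Toy data: 100 original instances (77 synergistic), 500 augmented instances.
labels = [1] * 77 + [0] * 23
augmented_parent = [j % 100 for j in range(500)]
for train_o, train_a, val in cv_rounds(labels, augmented_parent):
    print(len(train_o), len(train_a), len(val))
```

Validation folds contain original instances only, so reported performance is always measured on the original data and labels.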

Machine learning

Four machine learning models are used to evaluate the performance of supervised learning algorithms on the original and the augmented datasets of drug combinations: Logistic Regression (LR), Support Vector Machines (SVM), Random Forest (RF), and Gradient Boosting Trees (GBT). LR is a supervised machine learning algorithm designed for binary classification tasks to predict the likelihood of an instance belonging to one of two classes (synergistic or antagonistic in our case). It employs the logistic function to transform a linear combination of input features into a probability score, allowing for intuitive interpretation^68,69. Model training involves minimizing the logistic loss function through optimization techniques such as gradient descent. The coefficients of the linear equation are estimated during the training process to create a predictive model. The following parameters were used in the LR model: L2 penalty, a tolerance for stopping criteria of 0.0001, an inverse of regularization strength of 0.45, a maximum number of iterations of 500, and class weights set to “balanced” to deal with the imbalanced dataset.

SVM is a powerful supervised machine learning algorithm used for classification and regression tasks. In the classification context, it aims to find the optimal hyperplane in the feature space to maximize the margin between data points belonging to different classes^70,71. SVM is effective in dealing with high-dimensional features and can handle non-linear relationships through the use of kernel functions implicitly mapping the input features into a higher-dimensional space. The following parameters were used in the SVM model: the regularization parameter of 0.42, a linear kernel type, the tolerance for stopping criterion of 0.001, a probability set to true to enable probability estimation, and class weights set to “balanced” to deal with the imbalanced dataset.

The RF classifier utilizes a collection of individual trees built independently to determine the final output by the majority vote⁷². In contrast, the GBT classifier builds trees additively, with each tree reducing the bias of the previous ones, and then combines the output of all trees scaled by the learning rate to calculate the final output⁷³. Parameters of both classifiers were manually tuned to optimize their classification performance. The following parameters were used in RF: the number of trees in the forest of 300, the minimum number of samples per leaf node of 85, the number of features to consider for the best split equal to the square root of the total number of features, and class weights set to “balanced” in order to deal with the imbalanced dataset. The following parameters were used in GBT: the number of boosting stages of 650, the minimum number of samples per leaf node of 120, the number of features to consider for the best split equal to the square root of the total number of features, the learning rate of 0.28, and the maximum depth of the individual regression estimators of 5. In validation calculations against “unseen” data, a GBT model is first trained on the AZ-DREAM Challenges dataset, utilizing either the original instances or the original and augmented data. The trained model is then employed to classify instances in the DrugCombDB dataset⁷⁵.
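The hyperparameters listed in this section can be collected into one configuration. The sketch below assumes scikit-learn, which the text does not name; the keyword names are scikit-learn's, chosen because they match the parameter descriptions, and all settings not mentioned in the text are left at library defaults. The dictionary and function names are hypothetical.

```python
# Hyperparameters as stated in the text, keyed by scikit-learn argument names
# (the choice of scikit-learn is an assumption, not stated in the source).
LR_PARAMS = dict(penalty="l2", tol=1e-4, C=0.45, max_iter=500,
                 class_weight="balanced")
SVM_PARAMS = dict(C=0.42, kernel="linear", tol=1e-3, probability=True,
                  class_weight="balanced")
RF_PARAMS = dict(n_estimators=300, min_samples_leaf=85, max_features="sqrt",
                 class_weight="balanced")
GBT_PARAMS = dict(n_estimators=650, min_samples_leaf=120, max_features="sqrt",
                  learning_rate=0.28, max_depth=5)


def build_models():
    """Instantiate the four classifiers; requires scikit-learn to be installed."""
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    return {
        "LR": LogisticRegression(**LR_PARAMS),
        "SVM": SVC(**SVM_PARAMS),
        "RF": RandomForestClassifier(**RF_PARAMS),
        "GBT": GradientBoostingClassifier(**GBT_PARAMS),
    }
```

Each returned estimator exposes `fit(X, y)` and `predict_proba(X)`, so a 900-dimensional feature matrix and binary synergy labels can be passed to all four models in the same way. Note that the text lists no class-weight setting for GBT, so none is set here.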

Data availability

All data are freely available at https://github.com/MengLiu90/Synergy-Data-Augmentation.

References

Liu, Y. & Zhao, H. Predicting synergistic effects between compounds through their structural similarity and effects on transcriptomes. Bioinformatics 32(24), 3782–3789 (2016).
Vogel, C. L. et al. Efficacy and safety of trastuzumab as a single agent in first-line treatment of HER2-overexpressing metastatic breast cancer. J. Clin. Oncol. 20(3), 719–726 (2002).
Bayat Mokhtari, R. et al. Combination therapy in combating cancer. Oncotarget 8(23), 38022–38043 (2017).
Rafique, R., Islam, S. M. R. & Kazi, J. U. Machine learning in the prediction of cancer therapy. Comput. Struct. Biotechnol. J. 19, 4003–4017 (2021).
Holbeck, S. L. et al. The National cancer institute ALMANAC: A comprehensive screening resource for the detection of anticancer drug pairs with enhanced therapeutic activity. Cancer Res. 77(13), 3564–3576 (2017).
O’Neil, J. et al. An unbiased oncology compound screen to identify novel combination strategies. Mol. Cancer Ther. 15(6), 1155–1162 (2016).
Forcina, G. C. et al. Systematic quantification of population cell death kinetics in mammalian cells. Cell Syst. 4(6), 600–610 (2017).
Markt, P. et al. CLOUD – CeMM library of unique drugs. J. Cheminform. 4, P23 (2012).
Licciardello, M. P. et al. A combinatorial screen of the CLOUD uncovers a synergy targeting the androgen receptor. Nat. Chem. Biol. 13(7), 771–778 (2017).
Zheng, S. et al. DrugComb update: A more comprehensive drug sensitivity data repository and analysis portal. Nucleic Acids Res. 49(W1), W174–W184 (2021).
Zagidullin, B. et al. DrugComb: An integrative cancer drug combination data portal. Nucleic Acids Res. 47(W1), W43–W51 (2019).
Berenbaum, M. C. What is synergy?. Pharmacol. Rev. 41(2), 93–141 (1989).
Loewe, S. The problem of synergism and antagonism of combined drugs. Arzneimittelforschung 3(6), 285–290 (1953).
Yadav, B. et al. Searching for drug synergy in complex dose-response landscapes using an interaction potency model. Comput. Struct. Biotechnol. J. 13, 504–513 (2015).
Seo, H. et al. SYNERGxDB: An integrative pharmacogenomic portal to identify synergistic drug combinations for precision oncology. Nucleic Acids Res. 48(W1), W494–W501 (2020).
Menden, M. P. et al. Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen. Nat. Commun. 10(1), 2674 (2019).
Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data 6(1), 1–48 (2019).
Taylor, L. & Nitschke, G. Improving deep learning with generic data augmentation. in 2018 IEEE Symposium Series on Computational Intelligence (SSCI) (IEEE, 2018).
Moreno-Barea, F. J. et al. Forward noise adjustment scheme for data augmentation. in 2018 IEEE Symposium Series on Computational Intelligence (SSCI) (IEEE, 2018).
Zhong, Z. et al. Random erasing data augmentation. in Proceedings of the AAAI conference on artificial intelligence (2020).
Inoue, H. Data Augmentation by Pairing Samples for Images Classification. arXiv preprint arXiv:1801.02929 (2018).
Summers, C. & Dinneen, M. J. Improved mixed-example data augmentation. in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). (IEEE, 2019).
Wen, Q. et al. Time Series Data Augmentation for Deep Learning: A Survey. arXiv preprint arXiv:2002.12478 (2020).
Le Guennec, A., Malinowski, S. & Tavenard, R. Data augmentation for time series classification using convolutional neural networks. in ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data (2016).
Steven Eyobu, O. & Han, D. S. Feature representation and data augmentation for human activity classification based on wearable IMU sensor data using a deep LSTM neural network. Sensors 18(9), 2892 (2018).
Gao, J. et al. Robusttad: Robust Time Series Anomaly Detection Via Decomposition and Convolutional Neural Networks. arXiv preprint arXiv:2002.09545 (2020).
Wen, Q. et al. RobustSTL: A robust seasonal-trend decomposition algorithm for long time series. in Proceedings of the AAAI Conference on Artificial Intelligence (2019).
Cao, H., Tan, V. Y. & Pang, J. Z. A parsimonious mixture of Gaussian trees model for oversampling in imbalanced and multimodal time-series classification. IEEE Transact. Neural Netw. Learn. Syst. 25(12), 2226–2239 (2014).
Kang, Y., Hyndman, R. J. & Li, F. GRATIS: GeneRAting time series with diverse and controllable characteristics. Stat. Anal. Data Min. ASA Data Sci. J. 13(4), 354–376 (2020).
Esteban, C., Hyland, S. L. & Rätsch, G. Real-Valued (medical) Time Series Generation with Recurrent Conditional Gans. arXiv preprint arXiv:1706.02633 (2017).
Ratner, A. J. et al. Learning to compose domain-specific transformations for data augmentation. Adv. Neural Inf. Process. Syst. 30 (2017).
Zhang, X. et al. Adversarial Autoaugment. arXiv preprint arXiv:1912.11188 (2019).
Dash, S. et al. Medical time-series data generation using generative adversarial networks. in International Conference on Artificial Intelligence in Medicine (Springer, 2020).
DeVries, T. & Taylor, G.W. Dataset Augmentation in Feature Space. arXiv preprint arXiv:1702.05538 (2017).
Wong, S. C. et al. Understanding data augmentation for classification: When to warp?. in 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA) (IEEE, 2016).
Frid-Adar, M. et al. Gan-Based Data Augmentation for Improved Liver Lesion Classification. (2018).
Calimeri, F. et al. Biomedical data augmentation using generative adversarial neural networks. in International Conference on Artificial Neural Networks (Springer, 2017).
Frid-Adar, M. et al. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 321, 321–331 (2018).
Han, C. et al. GAN-based synthetic brain MR image generation. in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018) (IEEE, 2018).
Madani, A. et al. Chest x-ray generation and data augmentation for cardiovascular abnormality classification. in Medical Imaging 2018: Image Processing (SPIE, 2018).
Gatys, L. A., Ecker, A. S. & Bethge, M. A Neural Algorithm of Artistic Style. arXiv preprint arXiv:1508.06576 (2015).
Jackson, P. T. et al. Style augmentation: Data augmentation via style randomization. in CVPR Workshops. (2019).
Wang, J. & Perez, L. The effectiveness of data augmentation in image classification using deep learning. Convol. Neural Netw. Vis. Recogn. 11, 1–8 (2017).
Lemley, J., Bazrafkan, S. & Corcoran, P. Smart augmentation learning an optimal data augmentation strategy. IEEE Access 5, 5858–5869 (2017).
Cubuk, E. D. et al. Autoaugment: Learning Augmentation Policies from Data. arXiv preprint arXiv:1805.09501 (2018).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017).
Yun, S. et al. Cutmix: Regularization strategy to train strong classifiers with localizable features. in Proceedings of the IEEE/CVF International Conference on Computer Vision (2019).
Jones, A. et al. Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation. arXiv preprint arXiv:2303.15265 (2023).
Sutherland, J. J., O’brien, L. A. & Weaver, D. F. Spline-fitting with a genetic algorithm: A method for developing classification structure− activity relationships. J. Chem. Inf. Comput. Sci. 43(6), 1906–1915 (2003).
Bjerrum, E. J. SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. arXiv preprint arXiv:1703.07076 (2017).
Kimber, T. B., Gagnebin, M. & Volkamer, A. Maxsmi: Maximizing molecular property prediction performance with confidence estimation using smiles augmentation and deep learning. Artif. Intell. Life Sci. 1, 100014 (2021).
Sidorov, P. et al. Predicting synergism of cancer drug combinations using NCI-ALMANAC data. Front. Chem. 7, 509 (2019).
Ye, Z. et al. ScaffComb: A phenotype-based framework for drug combination virtual screening in large-scale chemical datasets. Adv. Sci. 8(24), 2102092 (2021).
Liu, Q. et al. DeepCDR: A hybrid graph convolutional network for predicting cancer drug response. Bioinformatics 36, i911–i918 (2020).
Kim, S. et al. PubChem in 2021: New data content and improved web interfaces. Nucleic Acids Res. 49(D1), D1388–D1395 (2021).
Chuang, J. C. & Neal, J. W. Crizotinib as first line therapy for advanced ALK-positive non-small cell lung cancers. Transl. Lung Cancer Res. 4(5), 639–641 (2015).
Royce, M. E. & Osman, D. Everolimus in the treatment of metastatic breast cancer. Breast Cancer (Auckl) 9, 73–79 (2015).
Ruiz, R., Raez, L. E. & Rolfo, C. Entinostat (SNDX-275) for the treatment of non-small cell lung cancer. Expert Opin. Investig. Drugs 24(8), 1101–1109 (2015).
Le Grand, M. et al. Akt targeting as a strategy to boost chemotherapy efficacy in non-small cell lung cancer through metabolism suppression. Sci. Rep. 7, 45136 (2017).
Keenan, T. E. et al. Clinical efficacy and molecular response correlates of the WEE1 inhibitor adavosertib combined with cisplatin in patients with metastatic triple-negative breast cancer. Clin. Cancer Res. 27(4), 983–991 (2021).
Cazzaniga, M. E. et al. Metronomic oral vinorelbine in advanced breast cancer and non-small-cell lung cancer: Current status and future development. Fut. Oncol. 12(3), 373–387 (2016).
Smyth, L. M. et al. Capivasertib, an AKT Kinase Inhibitor, as monotherapy or in combination with fulvestrant in patients with. Clin. Cancer Res. 26(15), 3947–3957 (2020).
O’Boyle, N. M. et al. Open babel: An open chemical toolbox. J. Cheminform. 3, 33 (2011).
Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405(2), 442–451 (1975).
Mazandu, G. K. et al. IHP-PING—generating integrated human protein–protein interaction networks on-the-fly. Brief. Bioinformat. 22(4), 277 (2021).
Szklarczyk, D. et al. STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Res. 44(D1), D380–D384 (2016).
Keller, T. H., Pichota, A. & Yin, Z. A practical view of ‘druggability’. Curr. Opin. Chem. Biol. 10(4), 357–361 (2006).
Hosmer, D. & Lemeshow, S. Applied Logistic Regression 2nd edn. (Wiley, New York, 2000).
Tolles, J. & Meurer, W. J. Logistic regression: Relating patient characteristics to outcomes. JAMA 316(5), 533–534 (2016).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Ben-Hur, A. et al. Support vector clustering. J. Mach. Learn. Res. 2, 125–137 (2001).
Breiman, L. Random forests. Mach. Learn. 45(1), 5–32 (2001).
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 1189–1232 (2001).
Pu, L. et al. CancerOmicsNet: A multi-omics network-based approach to anti-cancer drug profiling. Oncotarget 13, 695–706 (2022).
Liu, H. et al. DrugCombDB: A comprehensive database of drug combinations toward the discovery of combinatorial therapy. Nucleic Acids Res. 48(D1), D871–D881 (2020).
Singha, M. et al. GraphGR: A graph neural network to predict the effect of pharmacotherapy on the cancer cell growth. bioRxiv (2020).
Nag, S. et al. Deep learning tools for advancing drug discovery and development. 3 Biotech 12(5), 110 (2022).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18(6), 463–477 (2019).
Kendall, M. G. Rank Correlation Methods. (1962).
Agresti, A. Analysis of ordinal categorical data (Wiley, 2010).
Jaeger, S., Fulle, S. & Turk, S. Mol2vec: Unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model 58(1), 27–35 (2018).
Dincer, A. B., Janizek, J. D. & Lee, S.-I. Adversarial deconfounding autoencoder for learning robust gene expression embeddings. Bioinformatics 36, i573–i582 (2020).


Acknowledgements

Portions of this research were conducted with computing resources provided by Louisiana State University.

Funding

This work has been supported in part by the National Institute of General Medical Sciences of the National Institutes of Health award R35GM119524, the US National Science Foundation award CCF-1619303, the Louisiana Board of Regents contract LEQSF (2016–19)-RD-B-03, and the Center for Computation and Technology at Louisiana State University.

Author information

These authors contributed equally: Mengmeng Liu and Gopal Srivastava.

Authors and Affiliations

Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA
Mengmeng Liu & J. Ramanujam
Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, 70803, USA
Gopal Srivastava & Michal Brylinski
Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, 70803, USA
J. Ramanujam & Michal Brylinski

Authors

Mengmeng Liu
Gopal Srivastava
J. Ramanujam
Michal Brylinski

Contributions

M.L. processed the data, performed statistical analyses, conducted machine learning experiments, and analyzed the results. G.S. curated the data, performed preliminary statistical analyses, and augmented the data. J.R. contributed to the augmentation design and the interpretation of the results. J.R. and M.B. secured funding for the project. M.L. and G.S. drafted the paper. M.B. supervised the project and wrote the final version of the manuscript.

Corresponding author

Correspondence to Michal Brylinski.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Liu, M., Srivastava, G., Ramanujam, J. et al. Augmented drug combination dataset to improve the performance of machine learning models predicting synergistic anticancer effects. Sci Rep 14, 1668 (2024). https://doi.org/10.1038/s41598-024-51940-9


Received: 23 October 2023
Accepted: 11 January 2024
Published: 18 January 2024
DOI: https://doi.org/10.1038/s41598-024-51940-9
