ADRML: anticancer drug response prediction using manifold learning

One of the prominent challenges in precision medicine is to select the most appropriate treatment strategy for each patient based on the personalized information. The availability of massive data about drugs and cell lines facilitates the possibility of proposing efficient computational models for predicting anticancer drug response. In this study, we propose ADRML, a model for Anticancer Drug Response Prediction using Manifold Learning to systematically integrate the cell line information with the drug information to make accurate predictions about drug therapeutic. The proposed model maps the drug response matrix into the lower-rank spaces that lead to obtaining new perspectives about cell lines and drugs. The drug response for a new cell line-drug pair is computed using the low-rank features. The evaluation of ADRML performance on various types of cell lines and drug information, in addition to the comparisons with previously proposed methods, shows that ADRML provides accurate and robust predictions. Further investigations about the association between drug response and pathway activity scores reveal that the predicted drug responses can shed light on the underlying drug mechanism. Also, the case studies suggest that the predictions of ADRML about novel cell line-drug pairs are validated by reliable pieces of evidence from the literature. Consequently, the evaluations verify that ADRML can be used in accurately predicting and imputing the anticancer drug response.

Precision medicine aims to finely select treatments for cancer based on the genetic information of individual patients 1 . One of the highly critical problems in precision medicine is predicting anticancer drug response for each patient [2][3][4] . Due to the heterogeneity of tumors, the patients with the same type of cancer may show various therapeutic responses toward similar drugs 5 . Therefore, providing computational methods to discover the relationship between genomic information and drug sensitivity is of high importance and can be beneficial in precision medicine 3,6 .
Genomics of Drug Sensitivity in Cancer (GDSC) 7 and Cancer Cell Line Encyclopedia (CCLE) 8 are two projects that have provided molecular profiles and drug response values for hundreds of cancer cell lines against several anticancer drugs. These large datasets facilitate the development of computational methods for anticancer drug sensitivity prediction. Numerous computational methods have been proposed to predict drug response using gene expression profiles or other molecular information of cell lines. Some of these methods have also considered drug information, such as the chemical substructure of drugs, in addition to cell line information. Various machine learning techniques have been utilized in these methods, such as sparse linear regression 4,[9][10][11] , random forest 2,12,13 , kernel-based methods 4,14-17 , matrix factorization 1,18-20 , and neural networks and deep learning [21][22][23][24] .
Wang et al. have proposed a Similarity Regularized Matrix Factorization (SRMF) method, which utilizes the similarity of cell lines based on gene expression profiles and the chemical substructure similarity of drugs to predict anticancer drug sensitivity 1 . They also conducted drug repurposing and suggested new potential treatments for cell lines with Non-small Cell Lung Cancer (NSCLC). It is suggested that patients who have similar genomic properties reveal similar responses to similar drugs 1 . Based on the SRMF study, Suphavilai et al. have proposed a recommender system called "CaDRReS" that can predict drug response for unseen cell lines 19 . Furthermore, they showed that latent space features are correlated with the associated pathways of drugs. They did not consider any features of drugs for predicting the drug response values. Afterwards, Chang et al. have devised "CDRscan", an ensemble model containing five Convolutional Neural Networks (CNNs) 21 . They made use of mutational profiles of cell lines and the chemical substructure of drugs as the input features to these CNNs. The drug response values were computed by averaging the outputs of the five CNNs. Moreover, they have repurposed multiple non-oncology drugs as potential therapeutic agents for cancer cell lines. Recently, Wei et al. have suggested a simple cell
Hyper-parameter tuning. The ADRML model, which is fully described in "Methods", has three hyper-parameters: "k" is the dimension of the latent space, "µ" is the regularization coefficient, and "λ" is the similarity conservation coefficient. In order to map the response matrix into a lower dimensional space, the value of "k" was taken to be less than the number of cell lines and the number of drugs. For simplicity, we considered k = k′% of min(number of cell lines, number of drugs). We tuned the hyper-parameter values using grid search.
We executed ADRML with fivefold cross-validation on all pairs of cell line and drug for all combinations of $k \in \{10\%, 20\%, \ldots, 90\%\}$ and $\mu \in \{2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}, 2^{3}\}$. The hyper-parameters were tuned on the CCLE dataset, using gene expression similarity of cell lines and chemical similarity of drugs, by maximizing a fitness score (referred to as fitness in the following). The fitness score is built from the evaluation criteria, namely the Coefficient of Determination ($R^2$), the Pearson Correlation Coefficient (PCC), and the Root Mean Square Error (RMSE), which are explained in "Evaluation criteria". The definition of the fitness score is logical since the best model is the one with the highest values of $R^2$ and PCC and the lowest value of RMSE. ADRML achieved the best results when $k = 70\%$, $\mu = 2^3$, and $\lambda = 2^2$. We used the same hyper-parameter values for all types of similarities in CCLE and GDSC. In order to illustrate the impact of $\mu$ and $\lambda$ on the fitness score, we fixed the latent dimension to $k = 70\%$ and depicted the fitness function in the 3D histogram of Fig. 1a. It is evident that when $\lambda$ is small, the fitness function increases with $\mu$. Conversely, when $\mu$ is small, larger values of $\lambda$ lead to a higher fitness score. Moreover, the values $\mu = 2^3$ and $\lambda = 2^2$ were fixed and the influence of the latent space dimension was examined. Figure 1b demonstrates that greater dimensions of the latent space lead to a higher fitness score. PCC and $R^2$ improve with increasing k, while RMSE declines as k grows larger. However, the criteria values change little or not at all beyond $k = 70\%$.
www.nature.com/scientificreports/
Performance of ADRML prediction. We investigated the effects of using different similarity constraints on ADRML performance. Several cell line similarities based on gene expression, mutation, and copy number variation, and multiple drug similarities based on chemical substructure, target proteins, and KEGG pathways were considered as the constraints of manifold learning. Table 2 summarizes the performance of ADRML for every combination of cell line and drug similarity. Each pair of cell line and drug similarity is shown in one row, and the columns show the computed criteria. ADRML yields both accurate and robust performance in each scenario, because the results under all conditions are high and close to each other. However, it achieves the best results using the similarity of cell lines based on gene expression and the similarity of drugs based on target proteins, which yields RMSE = 0.487, $R^2$ = 0.682, and PCC = 0.846. We used these two similarities for further evaluations.
In order to investigate ADRML performance on each drug, we depicted the drug-wise correlation plots. Figures 2 and 3 illustrate the Pearson correlation between the observed and the predicted log IC50 for four drugs in the CCLE and GDSC datasets, respectively. The figures show high drug-wise PCC and validate that ADRML can
Table 2. Performance of ADRML on various types of similarities. The performance of each model is evaluated using fivefold cross-validation on cell line-drug pairs and using $k = 70\%$, $\mu = 2^3$, and $\lambda = 2^2$. Each row shows the performance of ADRML on a pair of cell line and drug similarity. The best result of each criterion is shown in boldface.
and using the same datasets. The comparison was made on the average performance of the models over 30 repetitions of fivefold cross-validation with tuned hyper-parameters. It should be noted that the hyper-parameters of CaDRReS cannot be fully tuned, due to its high time complexity. The hyper-parameters of CaDRReS were set according to its paper and the authors' suggestion.
The features used for cell lines and drugs are different in these methods. For each method, the required features, as mentioned in their paper, are provided from the benchmark datasets described in "Benchmark datasets and collected features".
In addition to the mentioned methods, K-nearest neighbor (KNN) with K = 1 was considered as a baseline method and compared to the results of the other methods. KNN was implemented using the Scikit-learn module in Python 36 . For executing KNN, the input feature vector for each pair of cell line $c_i$ and drug $d_j$ was taken to be the concatenation of the ith row of simC and the jth column of simD. All types of cell line similarities and drug similarities were considered as simC and simD, respectively. The complete report of KNN performance on the various types of similarities is provided in Supplementary Table S1. KNN obtained the best performance on gene expression similarity of cell lines and chemical substructure similarity of drugs. Tables 3 and 4 present the performance of the mentioned methods on CCLE and GDSC, respectively. Additionally, the scatter plots with fitted lines for the predictions of the mentioned methods on CCLE are presented in Supplementary Figs. S123-S128.
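The construction of the KNN input described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the paper used Scikit-learn, whereas this sketch implements the 1-nearest-neighbor lookup directly with NumPy and assumes Euclidean distance between feature vectors. The function names `pair_features` and `one_nn_predict` and the toy matrices are hypothetical.

```python
import numpy as np

def pair_features(sim_c, sim_d, i, j):
    """Feature vector of pair (c_i, d_j): row i of simC followed by column j of simD."""
    return np.concatenate([sim_c[i, :], sim_d[:, j]])

def one_nn_predict(train_pairs, train_y, test_pairs, sim_c, sim_d):
    """Predict each test pair from its single nearest training pair.

    Euclidean distance is an assumption made for this sketch.
    """
    x_train = np.array([pair_features(sim_c, sim_d, i, j) for i, j in train_pairs])
    preds = []
    for i, j in test_pairs:
        x = pair_features(sim_c, sim_d, i, j)
        dists = np.linalg.norm(x_train - x, axis=1)
        preds.append(train_y[int(np.argmin(dists))])
    return np.array(preds)

# Toy example: 3 cell lines, 2 drugs (values are made up).
sim_c = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
sim_d = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
train_pairs = [(0, 0), (2, 0), (0, 1)]
train_y = np.array([-1.0, 2.0, 0.5])   # log IC50 of the training pairs
pred = one_nn_predict(train_pairs, train_y, [(1, 0)], sim_c, sim_d)
```

In the toy data, cell line 1 is most similar to cell line 0, so the pair (1, 0) inherits the response of the training pair (0, 0).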
The results of the baseline method (KNN) on both datasets were not far from those of the state-of-the-art methods, which indicates that improving the results is challenging. On the CCLE dataset, SRMF achieved the best RMSE and a favorable PCC; however, its $R^2$ was lower than that of the baseline, i.e., the variance of the predicted log IC50 did not fully explain the variance of the real drug responses. CaDRReS yielded reasonable results, but its $R^2$ and PCC were lower than those of the baseline. CDRscan obtained favorable $R^2$ and PCC but had the highest RMSE; therefore, its predictions are highly correlated with the real responses while remaining far from them in absolute terms. CDCN showed a satisfying performance, but with lower $R^2$ and PCC and a higher RMSE than ADRML. Therefore, ADRML outperformed the other methods in terms of $R^2$ and PCC.
In the case of the GDSC dataset, SRMF obtained the best RMSE and moderate $R^2$ and PCC. The performance of CaDRReS was satisfying, but its $R^2$ and PCC were worse than those of the baseline. CDRscan showed good performance but with a high RMSE, similar to its performance on the CCLE dataset. Moreover, CDCN's performance was satisfying; however, its $R^2$ and PCC were lower than those of ADRML, and its RMSE was higher. Consequently, ADRML outperformed the other methods with regard to $R^2$ and PCC.
In addition to the mentioned analysis, we investigated whether using other types of cell line similarities and drug similarities would improve the results of the other methods. To this aim, we executed CDCN, SRMF, CaDRReS, and KNN on all types of similarities. It is worth mentioning that CDRscan receives binary feature matrices as the input, and the dimensions of the binary feature vectors of drugs in the CCLE and GDSC datasets were not appropriate for the CNNs designed in CDRscan; therefore, CDRscan could not be run on other types of similarities. The other methods (CDCN, SRMF, CaDRReS, and KNN) receive the similarity matrices as the input. Moreover, CaDRReS takes only the cell line similarity and does not use any drug similarity matrix.
The complete report of the performance criteria measured for the mentioned methods is presented in Supplementary Table S1. It can be seen that the performance of the other methods hardly improves when other similarities are used instead of the similarities proposed in their papers. For a given pair of cell line similarity and drug similarity, SRMF often obtains the best RMSE, while ADRML achieves the best $R^2$ and the best PCC.
In sum, ADRML performed better than the other state-of-the-art methods on both CCLE and GDSC in terms of $R^2$ and PCC. These results further substantiate ADRML performance.

Removing redundant cell lines from CCLE and GDSC. The CCLE dataset contains 363 cell lines from 22 different tissue types. The number of cell lines in each tissue type is shown in Fig. 4. The least frequent tissue types (biliary tract and prostate) each contain one cell line, and the most frequent tissue type (lung) comprises 76 cell lines. Since cell lines from the same tissue may be highly similar, this may lead to redundancy. Thus, it is better to eliminate the redundancy within each tissue type, taking into account the number of cell lines from that tissue. In order to remove the redundancy in each tissue type, we excluded the cell lines that are highly similar to other cell lines in the same tissue type. The detailed procedure of removing redundant cell lines is described in "Finding the most redundant cell lines". This procedure eliminated 64 cell lines and retained 299 cell lines from CCLE. The remaining cell lines comprise the purified CCLE dataset without redundancy. The lists of remaining and excluded cell lines are reported in Supplementary Table S2. To analyze the performance of ADRML and the other state-of-the-art methods on the new dataset, we executed these methods using 30 repetitions of fivefold cross-validation. Table 5 shows the performance of the methods on the new dataset. It can be seen that ADRML outperforms the other methods with respect to $R^2$ and PCC.
Table 3. Comparison of methods' performance on CCLE dataset. The methods were evaluated by averaging over 30 repetitions of fivefold cross-validation on cell line-drug pairs. The best results of each criterion are shown in boldface.
Moreover, the GDSC dataset comprises 555 cell lines from 19 tissue types. The tissue types have different numbers of cell lines, which are shown in Fig. 5. To remove the redundant cell lines from GDSC, the procedure described in "Finding the most redundant cell lines" was applied to GDSC, which eliminated 103 cell lines and preserved 452 cell lines. The remaining cell lines form the purified GDSC dataset with lower redundancy. The lists of remaining and excluded cell lines are reported in Supplementary Table S2. The performance of the methods on the new GDSC dataset using 30 repetitions of fivefold cross-validation is presented in Table 6. It can be seen that SRMF obtained the best RMSE, CDCN achieved the best $R^2$, and ADRML yielded the best PCC. Comparing the results in Tables 3 and 4 with the results in Tables 5 and 6 shows that the performance of the models declines slightly when the redundant cell lines are removed. This may be due to the reduction in sample size or to the existence of bias before the redundancy of cell lines was removed.
Moreover, we applied the redundancy removal procedure with different thresholds (θ) to investigate the performance of ADRML on different levels of redundancy removal. Furthermore, this procedure is repeated based on gene expression similarities of cell lines. Table 7 represents the number of remaining cell lines according to the various values of threshold.
ADRML performance was evaluated on each of the resulting datasets after redundancy removal based on various levels of strictness. Figure 6a,b illustrate the PCC values of ADRML assessed using 5-fold cross-validation on the purified datasets. These figures verify that the trend of ADRML performance is almost the same on purified datasets based on copy number variation and gene expression. ADRML achieves the best PCC on the strictest threshold which removes a lot of cell lines and adding other cell lines declines its PCC. Moreover, the ADRML's

Analysis of association between drugs and signaling pathways.
To demonstrate that the predictions of ADRML are meaningful and rational, we investigated the correlation between the predicted drug responses and pathway activity scores for several Biocarta pathways from MSigDB 37 . The detailed procedure is described in "Computing association of drugs and signalling pathways". Figure 7 visualizes the association between drugs and signaling pathways for 24 drugs in the CCLE dataset and 25 Biocarta pathways. All association values are provided in Supplementary Table S3. There are numerous pieces of evidence in the literature for these correlations, some of which are provided here. Paclitaxel and the TGFβ signaling pathway exhibited a highly positive correlation. Paclitaxel is one of the agents that have been frequently reported to activate the TGFβ pathway [38][39][40][41] . Thus, higher consumption of Paclitaxel leads to more activation of TGFβ, which is consistent with the high positive correlation between Paclitaxel and TGFβ. Moreover, Paclitaxel was positively associated with the P53 pathway. It has been verified that Paclitaxel activates the P53 signaling pathway 42 .
Irinotecan response has a very significant positive correlation with the activity score of P53. Irinotecan is a topoisomerase I inhibitor, which is frequently used for anticancer therapy. A previous study of the apoptotic mechanisms of Irinotecan in human hepatocellular carcinoma (HCC) cell lines has revealed that it significantly activates P53 45 . Additionally, the positive correlation between Irinotecan response and the EGFR pathway is supported by several studies, which have shown that resistance to Irinotecan is connected with increased expression of EGFR 46 and have reported that Irinotecan upregulates the EGFR pathway 47 . Also, Panobinostat, which is a potent inhibitor of deacetylases and HSP90 48 , revealed a highly significant positive correlation with the TGFβ pathway. A previous study has shown that using Panobinostat increased the level of TGFβ 48 .
Case studies. We conducted case studies on GDSC cell line-drug pairs with unknown IC50 values. To do this, we did not impute the missing values in the IC50 matrix and trained ADRML with all known drug responses. For each drug, the predictions of ADRML on unknown pairs were partitioned into four quantiles, and the cell lines in the first and last quantiles were considered as the sensitive and resistant cell lines for that drug, respectively. The complete lists of predicted sensitive and resistant associations are provided in Supplementary Tables S4 and S5, respectively. The sensitive associations were checked against both the literature and the latest release of GDSC (released Feb. 2020). Table 8 presents the supporting evidence for ADRML predictions in the literature. Table 9 lists some of the cell line-drug pairs that had unknown IC50 values in the previous data extracted from GDSC and whose drug response values are now available in the latest release of GDSC.
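The partition of predictions into sensitive and resistant cell lines can be illustrated with a short sketch. It assumes that the quantile boundaries are ordinary quartile cut points, here computed with `statistics.quantiles` and its default interpolation method, since the text does not state how the boundaries were obtained. The function name `split_by_quartiles` and the example values are hypothetical.

```python
import statistics

def split_by_quartiles(predicted):
    """Split one drug's predictions into sensitive and resistant cell lines.

    predicted: dict mapping cell line name -> predicted log IC50.
    Lower log IC50 means higher sensitivity, so the first quartile is
    treated as sensitive and the last quartile as resistant.
    """
    q1, _, q3 = statistics.quantiles(predicted.values(), n=4)
    sensitive = sorted(c for c, v in predicted.items() if v <= q1)
    resistant = sorted(c for c, v in predicted.items() if v >= q3)
    return sensitive, resistant

# Made-up predictions for eight cell lines with unknown IC50 for one drug.
predicted_demo = {"A": -3.0, "B": -2.0, "C": -1.0, "D": 0.0,
                  "E": 1.0, "F": 2.0, "G": 3.0, "H": 4.0}
sensitive_demo, resistant_demo = split_by_quartiles(predicted_demo)
```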

Discussion
In this study, we proposed a computational model for predicting anticancer drug response, using manifold learning, called ADRML. The model combines three sets of information, including known drug responses, cell line similarity, and drug similarity, to infer the novel predictions. The main contribution of this paper is evaluating the influence of various types of cell line similarities and drug similarities on the prediction performance. We collected various features for cell lines and drugs from CCLE, GDSC, STiTCH, PubChem, and Drugbank. Here, we investigated nine different scenarios using three cell line similarities based on gene expression, mutation, and copy number variation, and three drug similarities based on the chemical substructure, target proteins, and KEGG pathways. The performance of ADRML was investigated using fivefold cross-validation on cell line-drug pairs. The best performance was obtained using gene expression data about cell lines and target protein data about drugs, which was more accurate than the previously proposed methods. We also investigated the performance of other state-of-the-art methods and KNN (with k = 1) as the baseline method on various types of similarities and showed that their best performance was achieved using the similarities that were suggested in their papers. Another contribution of this paper was the purification of CCLE and GDSC benchmarks via removing redundant cell lines. The purified benchmarks were also used for assessing the methods' performance. The results showed that excluding redundant cell lines declines the methods' performance, which may be due to the reduction of sample size or removing bias from the database.
It was interesting that KNN with k = 1, as a simple baseline method, shows favorable results and outperforms some more complicated methods, especially on the purified datasets. However, it should be noted that the performance of sophisticated methods declines when the data size is not sufficient. A complicated method needs a massive amount of data to train well and to learn to predict outputs from inputs. For example, Chang et al. 21 provided CDRscan with more cell lines and drugs than were used in this paper and trained CDRscan with 95% of their data (as opposed to 80% of the data in this paper). Therefore, the $R^2$ reported in 21 is better than the results reported in this paper. One can conclude that providing more informative data may enrich the training data and lead to better training of complex models. It is noteworthy that, due to the inherently challenging nature of the problem, even small improvements in the results are welcome and useful.
The method proposed in this study outperformed the other methods in terms of the two criteria $R^2$ and PCC in most comparison scenarios. The predicted drug response values revealed high correlations with the observed drug responses and suggested meaningful clues about drug mechanisms in the activation or inhibition of pathways. Moreover, reliable evidence from the literature supports the predictions of ADRML about novel cell line-drug pairs. As a consequence, the promising results of ADRML verified its efficiency in anticancer drug response prediction and imputation.

Method
The proposed method includes five steps:
• Pre-processing to impute missing data
• Calculating various types of similarity matrices for cell lines and drugs
• Normalizing the similarity matrices
• Similarity-constrained manifold learning to factorize the IC50 matrix into low-rank latent matrices
• Estimating unknown IC50 values using the latent matrices
The overall workflow of ADRML is illustrated in Fig. 8.
For convenience, define $EXPR_{c_i}$, $CNV_{c_i}$, and $MUT_{c_i}$ as the expression, copy number variation, and mutation status of all genes in cell line $c_i$, respectively. More precisely, $CNV_{c,g}$ and $MUT_{c,g}$ denote the copy number variation and mutation status of gene g in cell line c. Furthermore, $CHEM_{d_i}$, $TRGT_{d_i}$, and $KEGG_{d_i}$ stand for the chemical features, the target status (1 for the proteins that are targets of the drug, 0 otherwise) for all proteins, and the pathway status (1 for the pathways that are associated with the drug, 0 otherwise) for drug $d_i$, respectively. Finally, $IC50_{c_i,d_j}$ is defined as the log IC50 value for cell line $c_i$ treated with drug $d_j$.
Table 9. ADRML sensitive predictions for novel cell line-drug pairs verified by the latest release of GDSC. These pairs had unknown IC50 in the training dataset and were predicted as sensitive pairs by ADRML. The latest release of GDSC reported these pairs as sensitive pairs.
Pre-processing to impute the missing data. Several steps were taken to impute the missing data. First, the features that were missing in the majority of cell lines were removed. Second, the cell lines that contain missing values for more than half of the features were excluded. The other missing values were imputed using a k-nearest neighbor approach. To this aim, the distance between cell lines was defined as the squared Euclidean distance of their expression profiles, because there are no missing values in the expression features; thus, the distance can be calculated for each pair of cell lines. The distance between $c_1$ and $c_2$ is $D(c_1, c_2) = \|EXPR_{c_1} - EXPR_{c_2}\|_2^2$. Then, the mean feature value among the 10 nearest cell lines was used to impute the missing IC50 value of drug d or CNV value of gene g in cell line c, i.e.,
$$IC50_{c,d} = \frac{1}{10} \sum_{c_i \in NN_c} IC50_{c_i,d}, \qquad CNV_{c,g} = \frac{1}{10} \sum_{c_i \in NN_c} CNV_{c_i,g},$$
where $NN_c$ denotes the set of the 10 nearest cell lines of c. Moreover, to impute the mutation status ("1" for mutated and "0" for wild type) of gene g in cell line c, the majority vote of the 10 nearest cell lines is used, i.e., $MUT_{c,g}$ is 1 if and only if
$$\sum_{c_i \in NN_c} MUT_{c_i,g} > \sum_{c_i \in NN_c} (1 - MUT_{c_i,g}).$$
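The imputation step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, missing values are assumed to be encoded as NaN, and neighbors that are themselves missing the value are simply skipped, a case the text does not address.

```python
import numpy as np

def nearest_cell_lines(expr, c, k=10):
    """Indices of the k cell lines closest to c (squared Euclidean distance of expression)."""
    d = np.sum((expr - expr[c]) ** 2, axis=1)
    d[c] = np.inf                      # never pick the cell line itself
    return np.argsort(d)[:k]

def impute_continuous(values, expr, k=10):
    """Fill NaNs in a (cell lines x features) IC50 or CNV matrix with the neighbor mean."""
    out = values.copy()
    for c, f in zip(*np.where(np.isnan(values))):
        nn = nearest_cell_lines(expr, c, k)
        out[c, f] = np.nanmean(values[nn, f])   # assumption: ignore neighbors with NaN
    return out

def impute_mutation(mut, expr, k=10):
    """Fill NaNs in a binary mutation matrix by majority vote of the neighbors."""
    out = mut.copy()
    for c, g in zip(*np.where(np.isnan(mut))):
        nn = nearest_cell_lines(expr, c, k)
        known = mut[nn, g]
        known = known[~np.isnan(known)]
        out[c, g] = 1.0 if np.sum(known) > np.sum(1 - known) else 0.0
    return out

# Toy data: 4 cell lines with a single expression feature.
expr_demo = np.array([[0.0], [1.0], [2.0], [10.0]])
ic50_demo = np.array([[1.0], [np.nan], [3.0], [100.0]])
mut_demo = np.array([[1.0], [np.nan], [1.0], [0.0]])
ic50_filled = impute_continuous(ic50_demo, expr_demo, k=2)
mut_filled = impute_mutation(mut_demo, expr_demo, k=3)
```

With k = 2 the missing IC50 of the second cell line is the mean of its two closest neighbors, and with k = 3 two of its three neighbors are mutated, so the vote yields 1.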

Similarity matrices construction and normalization.
For computing the similarity score of two cell lines (or drugs), the PCC and the Jaccard index (JI) were used as the similarity functions:
$$PCC(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$$
$$JI(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
where x, y are two feature vectors, $x_i$ and $y_i$ denote the ith elements of these vectors, and $\bar{x}$, $\bar{y}$ are their mean values. For binary vectors, $x \cap y$ and $x \cup y$ denote the sets of positions at which both vectors, or at least one of them, equal 1. The PCC is used to calculate the similarity of two continuous vectors, while the JI is appropriate for measuring the similarity of two discrete vectors, and we followed this rationale in the calculation of the similarity matrices. The dimensions of the cell line and drug similarity matrices are n × n and m × m, respectively, where n denotes the number of cell lines and m denotes the number of drugs. Consequently, we constructed three types of similarity matrices for cell lines, based on EXPR, CNV, and MUT. Since the EXPR and CNV features are real-valued, PCC was used to measure their similarity, while MUT is binary-valued and JI was used to measure mutation similarity.
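The two similarity functions and the construction of a similarity matrix can be sketched as follows. The function names are hypothetical, and returning 0 for two all-zero binary vectors is a convention chosen for this sketch, since the Jaccard index is undefined in that case.

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation coefficient of two real-valued vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def jaccard(x, y):
    """Jaccard index of two binary vectors (0.0 if both are all-zero, by convention)."""
    x, y = x.astype(bool), y.astype(bool)
    union = np.sum(x | y)
    return float(np.sum(x & y) / union) if union else 0.0

def similarity_matrix(features, func):
    """Pairwise similarity of the rows of `features` under `func`."""
    n = features.shape[0]
    return np.array([[func(features[a], features[b]) for b in range(n)]
                     for a in range(n)])

# Toy binary profiles of three drugs (or mutation profiles of three cell lines).
binary_demo = np.array([[1, 1, 0],
                        [1, 1, 0],
                        [0, 0, 1]])
sim_demo = similarity_matrix(binary_demo, jaccard)
```

For real-valued features such as EXPR or CNV, the same helper would be called with `pcc` in place of `jaccard`.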
• SimC EXPR is the similarity matrix of cell lines based on their gene expression profiles.
• SimC CNV is the similarity matrix of cell lines based on their copy number variations.
• SimC MUT is the similarity matrix of cell lines based on their mutation profiles.
Furthermore, three types of similarity matrices for drugs, based on PubChem SMILES (CHEM), target proteins (TRGT), and KEGG pathways (KEGG), were calculated as follows. Notably, all drug features are binary-valued; thus, JI was used for measuring the similarity of drugs based on each type of information.
• SimD CHEM is the similarity matrix of drugs according to their chemical substructure fingerprints.
• SimD TRGT is the similarity matrix of drugs according to their target proteins.
• SimD KEGG is the similarity matrix of drugs according to their KEGG pathways.
Then all of the computed similarity matrices were normalized by computing the symmetric normalized Laplacian 51 . Let S be a similarity matrix; the normalized similarity matrix $S_{norm}$ was obtained as follows:
$$S_{norm} = D^{-1/2} \, S \, D^{-1/2}$$
where D is a diagonal matrix whose diagonal elements are the row sums of S, i.e., $D_{i,i} = \sum_j S_{i,j}$. It is noteworthy that $D_{i,i} \neq 0$.
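A symmetric normalization of this kind can be sketched with NumPy as below. The sketch assumes the form $D^{-1/2} S D^{-1/2}$ and additionally requires strictly positive row sums so that the square root is defined; that requirement is an assumption of this example, and the function name `normalize_similarity` is hypothetical.

```python
import numpy as np

def normalize_similarity(s):
    """Return D^(-1/2) S D^(-1/2), where D holds the row sums of S."""
    d = s.sum(axis=1)
    if np.any(d <= 0):
        # Assumption of this sketch: row sums must be positive.
        raise ValueError("row sums must be positive for this normalization")
    inv_sqrt = 1.0 / np.sqrt(d)
    # Scale row i by d_i^(-1/2) and column j by d_j^(-1/2).
    return s * inv_sqrt[:, None] * inv_sqrt[None, :]

s_demo = np.array([[1.0, 0.5],
                   [0.5, 1.0]])
s_norm_demo = normalize_similarity(s_demo)
```

A symmetric input stays symmetric after this scaling, which matters because the similarity terms of the loss treat each pair of cell lines (or drugs) symmetrically.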
Manifold learning with similarity constraints. We constructed a bipartite graph with two parts: drugs and cell lines. The weight of the edge between cell line $c_i$ and drug $d_j$ is the log IC50 value of drug $d_j$ on cell line $c_i$. Thus, the IC50 drug response matrix $R = [r_{i,j}]_{n \times m}$ is the adjacency matrix of this graph, where n and m are the numbers of cell lines and drugs, respectively. We used manifold learning to factorize the drug response matrix R into two latent matrices $P_{n \times k}$ and $Q_{m \times k}$ with lower rank. With this factorization we could map the cell line and drug features into a latent space of dimension k, i.e., P and Q are the cell line latent matrix and the drug latent matrix, respectively. The ith row of P (denoted by $p_i$) is the latent vector of cell line $c_i$, and the jth row of Q (denoted by $q_j$) is the latent vector of drug $d_j$.
The initial goal is to find matrices P and Q such that each drug response value is obtained by the inner product of the corresponding latent vectors, i.e., $r_{i,j} = p_i \cdot q_j^T$; thus, the loss function measures the error of this approximation and is augmented with regularization terms. The two terms $\sum_i \|p_i\|^2$ and $\sum_j \|q_j\|^2$ are the regularization constraints of P and Q, and $\mu$ is the regularization coefficient. The regularization terms prevent these matrices from growing dramatically, which guards against over-fitting. They help to reduce the variance and increase the stability and generalization capabilities of the model 52 .
Manifold learning studies 53,54 have shown that the mapping of data to a lower dimensional space can conserve the topological structure of data. Since p i is the feature vector of cell line c i , the distance of two cell lines c i and c j can be measured by ||p i − p j || 2 . Similarly, ||q i − q j || 2 denotes the distance of drugs d i and d j . We should consider some constraints to maintain the distance of cell lines and the distance of drugs while mapping them from the original features space to the lower dimensional latent space. Thus, the loss function is supplemented by two more terms.
In these two terms, $\lambda$ is the coefficient of similarity consistency, $SimC \in \{SimC_{EXPR}, SimC_{CNV}, SimC_{MUT}\}$, and $SimD \in \{SimD_{CHEM}, SimD_{TRGT}, SimD_{KEGG}\}$. The last two terms are minimized when the feature vectors of cell line (or drug) pairs with high similarity are mapped to latent vectors that are not far apart. Therefore, the topological distance of cell lines (or drugs) is maintained while mapping to the lower dimensional space.
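To make the structure of the objective concrete, the sketch below evaluates one plausible form of it and performs a single second-order (Newton-type) update of one cell line latent vector, the kind of update used in the optimization that follows. The exact formula is not recoverable from the text above, so this form is an assumption: a squared reconstruction error, an L2 penalty weighted by µ, and similarity-weighted squared distances weighted by λ, each with a factor of 1/2 chosen for convenience. The function names and the random toy data are hypothetical, and SimC is assumed symmetric.

```python
import numpy as np

def loss(R, P, Q, sim_c, sim_d, mu, lam):
    """Assumed objective: reconstruction error + L2 penalty + similarity terms."""
    err = 0.5 * np.sum((R - P @ Q.T) ** 2)
    reg = 0.5 * mu * (np.sum(P ** 2) + np.sum(Q ** 2))

    def smooth(S, Z):
        # sum over a, b of S[a, b] * ||z_a - z_b||^2
        sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
        return np.sum(S * sq)

    return err + reg + 0.5 * lam * (smooth(sim_c, P) + smooth(sim_d, Q))

def newton_row_update(i, R, P, Q, sim_c, mu, lam):
    """One Newton step on p_i with every other row held fixed (sim_c symmetric)."""
    k = P.shape[1]
    w = sim_c[i].copy()
    w[i] = 0.0                                   # the self-term contributes nothing
    grad = (P[i] @ Q.T - R[i]) @ Q + mu * P[i] + 2 * lam * (w.sum() * P[i] - w @ P)
    hess = Q.T @ Q + (mu + 2 * lam * w.sum()) * np.eye(k)
    return P[i] - np.linalg.solve(hess, grad)

# Random toy problem: 5 cell lines, 4 drugs, latent dimension 2.
rng = np.random.default_rng(0)
n, m, k = 5, 4, 2
R = rng.normal(size=(n, m))
P, Q = rng.normal(size=(n, k)), rng.normal(size=(m, k))
sim_c = np.abs(rng.normal(size=(n, n))); sim_c = (sim_c + sim_c.T) / 2
sim_d = np.abs(rng.normal(size=(m, m))); sim_d = (sim_d + sim_d.T) / 2

before = loss(R, P, Q, sim_c, sim_d, mu=8.0, lam=4.0)
P_new = P.copy()
P_new[0] = newton_row_update(0, R, P, Q, sim_c, mu=8.0, lam=4.0)
after = loss(R, P_new, Q, sim_c, sim_d, mu=8.0, lam=4.0)
```

Under these assumptions the objective is a convex quadratic in each single row, so one Newton step reaches the minimizer for that row and the loss cannot increase. The values µ = 8 and λ = 4 mirror the tuned settings $2^3$ and $2^2$ reported earlier.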
Iterative optimization rules. The latent matrices P and Q must be obtained by minimizing the loss function in Eq. (15). We used the iterative Newton's method 55 to update the matrices P and Q, where $p_i^t$ (or $q_j^t$) denotes the updated $p_i$ (or $q_j$) after t steps, for all t > 0, and $p_i^0$, $q_j^0$ were initialized randomly. The update rules use the first and second derivatives (gradient and Hessian) of the loss function with respect to $p_i$ and $q_j$. The latent matrices P and Q are updated alternately according to Eqs. (22, 23) until convergence. The convergence criterion is met when $\| {p^{t+1}}^{T} Q^{t+1} - {p^{t}}^{T} Q^{t} \| < \epsilon$. In this study, we considered $\epsilon = 0.01$. The value of the loss function declined in every iteration, due to the positive definite second derivatives. Therefore, the convergence criterion is met after a number of steps 55 (usually after 10-20 steps). After convergence, an estimated matrix is obtained by $R_{pred} = Q P^{T}$. Moreover, the manifold learning was applied to the transpose of the response matrix, i.e., the whole procedure above was repeated for factorizing $R^T$ into $P'$ and $Q'$. In this second use of manifold learning, we initialized $P'$ and $Q'$ with the final Q and P computed in the first run, respectively. After convergence, the second predicted matrix was constructed by $R'_{pred} = P' Q'^{T}$. Consequently, the predicted log IC50 was computed by $\hat{R} = 0.5 (R_{pred} + R'_{pred})$.
Evaluation criteria. We measured the performance of ADRML using 5-fold cross-validation on cell line-drug pairs. To do this, each pair $(c_i, d_j)$ was considered as a sample. Then, the set of all samples was partitioned randomly into five almost equally sized subsets (folds). One fold was considered as the test data and the other folds were regarded as the training data. The evaluation was computed for the test data. This procedure was iterated until each fold had been used once as the test data. Finally, the average of the evaluation criteria over these five iterations denoted the model performance. The evaluation of ADRML is summarized as pseudo-code in Fig. 9.
To avoid randomness and reduce variance, the model performance was averaged over 30 random repetitions of 5-fold cross-validation.
Fig. 9. Pseudo-code of the evaluation of ADRML:
    IDX = …((c_i, d_j) ; ∀ i ≤ n, j ≤ m))
    (IDX_1, IDX_2, IDX_3, IDX_4, IDX_5) = Split(IDX, 5)
    R̂ = ∅
    for 1 ≤ f ≤ 5 do:
        …
        Randomly initialize P_{n×k}, Q_{m×k}
        P_ML, Q_ML ← ManifoldLearning(R_train, P, Q, SimC, SimD, λ, µ, k)
        P′_ML, Q′_ML ← ManifoldLearning(R_train^T, Q_ML, P_ML, SimD, SimC, λ, µ, k)
        …
        RMSE_f, R²_f, PCC_f = Evaluation(IC_real, IC_pred)
    RMSE = average(RMSE_1, ..., RMSE_5)
    R² = average(R²_1, ..., R²_5)
    PCC = average(PCC_1, ..., PCC_5)
The evaluation criteria are RMSE, $R^2$, and PCC, computed from $IC_{real}$ and $IC_{pred}$, which are the vectors of real and predicted drug response values for all samples in the test set, respectively; $\overline{IC}_{real}$ and $\overline{IC}_{pred}$ are their mean values, and |Test| is the number of samples in the test set. Each criterion evaluates the model performance from a different point of view. Therefore, it is possible to obtain results with promising values for one criterion and unfavorable values for the others.
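The three criteria can be computed as in the sketch below, which uses their standard textbook definitions: RMSE as the root of the mean squared difference, $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares of the real values, and PCC as the Pearson correlation. These definitions are assumptions of this sketch, since the formulas themselves are not reproduced in the text above; the toy vectors are made up.

```python
import math

def rmse(real, pred):
    """Root mean square error over the test samples."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(real, pred)) / len(real))

def r_squared(real, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_real = sum(real) / len(real)
    ss_res = sum((r - p) ** 2 for r, p in zip(real, pred))
    ss_tot = sum((r - mean_real) ** 2 for r in real)
    return 1.0 - ss_res / ss_tot

def pcc(real, pred):
    """Pearson correlation coefficient between real and predicted values."""
    mr, mp = sum(real) / len(real), sum(pred) / len(pred)
    cov = sum((r - mr) * (p - mp) for r, p in zip(real, pred))
    var_r = sum((r - mr) ** 2 for r in real)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(var_r * var_p)

# Predictions that are perfectly correlated with the truth but shifted by 0.5.
ic_real = [1.0, 2.0, 3.0, 4.0]
ic_pred = [1.5, 2.5, 3.5, 4.5]
scores_demo = (rmse(ic_real, ic_pred), r_squared(ic_real, ic_pred), pcc(ic_real, ic_pred))
```

The toy vectors show why the criteria can disagree: a constant offset leaves PCC at its maximum while RMSE is non-zero and $R^2$ drops below 1.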
Finding the most redundant cell lines. In order to eliminate the redundancy from the dataset, the cell lines in each tissue type that are highly similar to many of the cell lines in that tissue type were considered the most redundant cell lines and were excluded from the dataset. To do this, the minimum (Q0), first quartile (Q1), second quartile (Q2), third quartile (Q3), and maximum (Q4) values of each type of cell line similarity in all tissue types were calculated, which are shown in Supplementary Tables S6 and S7. The diversity of cell lines was reflected better by the copy number variation similarities, since the quartile values differed widely for this similarity. Therefore, the third quartile of the copy number variation similarities between the cell lines was computed in each tissue type t (denoted by Q3(CNV, t)). A cell line c in tissue type t was excluded if its similarity was higher than Q3(CNV, t) with more than θ = 20% of the cell lines in tissue type t.
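The filtering rule can be sketched as below. Several details are assumptions of this sketch rather than statements from the text: the third quartile is taken over the off-diagonal similarities within a tissue, the fraction θ is measured against the other cell lines of the same tissue, all flagged cell lines are removed in a single pass, and tissues with a single cell line are left untouched. The function name and the toy similarity matrix are hypothetical.

```python
import numpy as np

def redundant_cell_lines(sim, tissue_of, theta=0.20):
    """Indices of cell lines flagged as redundant within their tissue type.

    sim: (n x n) symmetric similarity matrix (e.g. CNV-based).
    tissue_of: sequence of n tissue labels.
    """
    tissue_of = np.asarray(tissue_of)
    removed = []
    for t in np.unique(tissue_of):
        idx = np.where(tissue_of == t)[0]
        if len(idx) < 2:
            continue                                  # nothing to compare against
        sub = sim[np.ix_(idx, idx)]
        off_diag = sub[~np.eye(len(idx), dtype=bool)]
        q3 = np.quantile(off_diag, 0.75)              # Q3 of within-tissue similarities
        for pos, c in enumerate(idx):
            others = np.delete(sub[pos], pos)
            if np.mean(others > q3) > theta:          # similar to more than theta of the tissue
                removed.append(int(c))
    return removed

# Toy data: five cell lines in tissue "A" and one in tissue "B".
sim_cnv_demo = np.full((6, 6), 0.1)
np.fill_diagonal(sim_cnv_demo, 1.0)
sim_cnv_demo[0, 1] = sim_cnv_demo[1, 0] = 0.9
sim_cnv_demo[0, 2] = sim_cnv_demo[2, 0] = 0.9
tissues_demo = ["A"] * 5 + ["B"]
flagged_20 = redundant_cell_lines(sim_cnv_demo, tissues_demo, theta=0.20)
flagged_30 = redundant_cell_lines(sim_cnv_demo, tissues_demo, theta=0.30)
```

Raising θ makes the filter less strict: in the toy data, only the cell line that is similar to half of its tissue survives as flagged when θ is raised from 20% to 30%.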
Computing association of drugs and signalling pathways. The association between a drug and a pathway was computed as the PCC of the drug response values and the pathway activity scores. To do this, we considered all Biocarta signaling pathways and eliminated the pathways for which gene expression data were not available for more than 10% of their genes. Therefore, we considered 107 Biocarta pathways for the CCLE dataset. The pathway activity score for cell line $c_i$ and pathway $p_j$ was computed according to Emdadi et al. 20 , by summing up the fold change of gene expression over all genes $g_l$ in pathway $p_j$.
Here, the fold change of gene $g_l$ is measured with respect to $median_c(EXPR(c, g_l))$, the median expression of gene $g_l$ over all cell lines. Thus, the score of a cell line in activating a pathway denotes the total amount of change in gene expression with respect to the median expression.
The correlation of drug d i and pathway p j was obtained by PCC(IC pred (:, i), AS(:, j)) , where IC pred (:, i) denotes the predicted drug response vector of drug d i for all cell lines and AS( : , j) stands for the activity score vector of pathway p j for all cell lines.
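The association computation can be sketched as follows. The exact fold-change formula is not recoverable from the text above, so this sketch assumes that the fold change of a gene is its expression divided by its median expression over all cell lines, with non-zero medians; the function names and the toy matrices are hypothetical.

```python
import numpy as np

def activity_scores(expr, pathways):
    """Activity score matrix AS (cell lines x pathways).

    expr: (cell lines x genes) expression matrix.
    pathways: list of gene-index lists, one per pathway.
    Assumption: fold change = expression / median expression of the gene.
    """
    fold = expr / np.median(expr, axis=0)
    return np.column_stack([fold[:, genes].sum(axis=1) for genes in pathways])

def drug_pathway_association(ic_pred, scores):
    """PCC between each drug column of ic_pred and each pathway column of scores."""
    n_drugs = ic_pred.shape[1]
    corr = np.corrcoef(ic_pred.T, scores.T)      # rows are treated as variables
    return corr[:n_drugs, n_drugs:]              # drugs x pathways block

# Toy data: 3 cell lines, 2 genes, 2 pathways, 2 drugs.
expr_toy = np.array([[1.0, 2.0],
                     [2.0, 4.0],
                     [3.0, 6.0]])
pathways_toy = [[0, 1], [0]]
ic_pred_toy = np.array([[1.0, 3.0],
                        [2.0, 2.0],
                        [3.0, 1.0]])
as_toy = activity_scores(expr_toy, pathways_toy)
assoc_toy = drug_pathway_association(ic_pred_toy, as_toy)
```

Entry (i, j) of the result corresponds to PCC(IC_pred(:, i), AS(:, j)) in the notation of the text.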