Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks

Deep learning with Convolutional Neural Networks (CNNs) has shown great promise in image-based classification and enhancement but is often unsuitable for predictive modeling using features without spatial correlations. We present a feature representation approach termed REFINED (REpresentation of Features as Images with NEighborhood Dependencies) to arrange high-dimensional vectors in a compact image form conducive for CNN-based deep learning. We consider the similarities between features to generate a concise feature map in the form of a two-dimensional image by minimizing the pairwise distance values following a Bayesian metric multidimensional scaling approach. We hypothesize that this approach enables embedded feature extraction and, integrated with CNN-based deep learning, can boost predictive accuracy. We illustrate the superior predictive capabilities of the proposed framework as compared to state-of-the-art methodologies in drug sensitivity prediction scenarios using synthetic datasets, drug chemical descriptors as predictors from NCI60, and both transcriptomic information and drug descriptors as predictors from GDSC.


REFINED Model
The presentation of the model is very hard to follow. Some points that should be addressed are:
- The authors change notation multiple times. For example, in section 2.1.1: x is defined as Z(s), but 2 paragraphs later (Case 2) x is used again to denote the "true" position, then s is used again ("i.e., estimate sj") and in the very next sentence "estimate x(.)". In addition, p is used to denote both the number of features and the rate parameter of the HPP.
- With respect to the prior of s: it's mentioned initially that it's a Homogeneous Poisson, but then a Uniform is specified in eq. (2). It's unclear which of the 2 the authors used. In case it's the former, why doesn't it appear in the posterior? The authors claim that the HPP essentially forces a constraint of 1 voxel per picture, but this is not obvious. The area A(s) is not defined clearly.
- The prior of sigma is specified as Inverse Gamma on page 6 and Inverse Gaussian on page 7 (probably by mistake). Then its posterior is approximated using the conjugate form without clear justification.
- There is a typo ((q = p 2)). It should be q = (p 2), i.e., p choose 2. Also, in the posterior the exponent of sigma squared should probably be -(q/2 + a + 1).
- The hill-climbing algorithm is not well described either. It's mentioned that its objective function is related to the "true" distance delta, but s is already used to define delta, so it's not clear how changing them can improve the objective.
- Finally, the sentence "Assuming that the intensity…developed in [26]." is particularly hard to follow for this reviewer.

CNN hyper-parameter tuning
The authors mention that tuning parameters for their CNN have been determined via cross-validation. However, there is a lot of missing information and several questions arise:
- Was the cross-validation run on the training set alone?
- Was preprocessing also included in the cross-validation? More specifically, this refers to data imputation, gene selection (page 9) as well as the mapping of features using REFINED. To prevent data leakage, all preprocessing steps should be part of the training phase.
- What was the grid of parameters tested during cross-validation?
- How was the actual network architecture decided in the first place? Was it fixed from the very beginning, or were layers added during the study? This can (and maybe should) also be part of the optimization.
- The input size (64x64 = 4096) is not particularly large for biological datasets that tend to be wide (e.g., 20,000 genes like GDSC). Did the authors experiment with bigger sizes?
- Finally, I propose the use of full nested cross-validation, instead of the split into training-validation-testing.
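The leakage-safe setup the reviewer asks for can be sketched with a scikit-learn pipeline, where imputation and feature (gene) selection are re-fitted inside each training fold rather than on the full dataset. This is an illustrative sketch on toy data, not the paper's actual preprocessing:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in for a wide omics matrix (sizes here are illustrative).
X, y = make_regression(n_samples=200, n_features=100, noise=5.0, random_state=0)
X[::7, 0] = np.nan  # inject some missingness so the imputer has work to do

# Leakage-safe evaluation: imputation and feature selection are fitted
# inside each training fold, never on the held-out fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("select", SelectKBest(f_regression, k=20)),
    ("model", Ridge()),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
```

Fitting the imputer or the selector on the full matrix before splitting would let test-fold information leak into training, which is exactly the concern raised above.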

RF, SV and ANN tuning
In line with comment #2, the same questions apply to the training and tuning of these three learning methods.

Other comparisons
The comparison with RF, SV and ANN is a good idea, but elastic nets have also been tried on the GDSC dataset with relative success. The authors need to compare to this method as well. In addition, the ultimate goal of learning drug sensitivity models from cell line transcriptomic data is to apply these models to patient data and predict their response to drug treatments. One such gold standard dataset is the bortezomib clinical trial on multiple myeloma. Elastic net models, trained on GDSC, have been able to make statistically significant predictions of responders versus non-responders to this drug: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-3-r47 The authors should test whether their bortezomib model is also equally successful.
Residual plots, especially those in figure 5, seem very weird. The observed data do not span the whole [0, 1] interval (in figure 6 this is more pronounced). All competing models have very pathological behavior: they seem to underestimate low values and overestimate high ones. ANN also appears to be constant for low values. It looks like just a simple linear term would suffice to fix this behavior and it's surprising that powerful models like these failed to do so.

Metrics and statistical significance
The authors should report confidence intervals for all their evaluation metrics. Also, the statistical significance of the difference in performance across methods should be assessed. Finally, for the classification problem, the authors should report AUCs and/or similar metrics. Moreover, their NRMSE metric is deviance based (it looks like the square root of 1 − R²) and thus it's not well suited to compare models (same or different) trained on different datasets (see table 3). For figure 4, the authors don't specify which quantity is represented by the color-scale. Finally, bias is not a standard metric because residual plots typically do not have clear constant slopes.

2D representation of feature vectors
The main idea of this study is that the organization of unordered feature vectors into 2D image representations can confer an advantage in learning tasks. However, this claim has not been tested in detail. The authors should test the following and possibly come up with more ideas of their own:
- Input randomly-ordered 2D image representations into their CNN architecture and retrain (using nested CV)
- Study the performance as a function of the number of input features and again compare to randomly ordered 2D images
- Compare to other approaches that create 2D representations from unordered vectors

Other comments
- It's not clear how the random dataset was created. Which distributions were used? Were the target variables normalized as well?
- Page 8: Normalized IC50s to the [0,1] interval. It is not clear if this was done globally, per drug, or per cell line. Also, this causes figures 5, 6, and 7 to span the whole range. Please clarify. Why is this necessary? I am not sure this is justified.
- Page 8: Figure 3c. The title says "IC50" but the legend says "NLOGIC50".
- Page 16: "we used the NCI60 coordinates to generate the drug images for the 222 drugs"; it appears that this provides an unfair advantage to the REFINED algorithm over the rest of the methods.
- Page 10: The hybrid CNN receives inputs that are pre-organized into the two distinct data types (transcriptomics and chemical descriptors). Again, this seems to be unfair to the other algorithms. To avoid this, please also compare the performance of all algorithms based on transcriptomic data alone.
- I apologize if I missed this in the methods, but it is not clear if different models are trained per drug, or whether there is a single model (e.g. a multi-class multi-label classification). To avoid unfair comparisons, all tested algorithms should be trained on the same learning task.
- Finally, I suggest that the authors come up with more creative ways to present their results in the main figures. Most of the presented results figures and tables are good and informative but are better suited for the Supplemental section (with the exception of figures 2 and 4). The main figures should visualize the information better. Also, figure 1 is very important, but needs to be improved.
Reviewer #2 (Remarks to the Author): This paper proposes the method called REFINED, which represents multidimensional feature vectors as 2D image forms to exploit Convolutional Neural Networks (CNNs), as CNNs have achieved human-level performance in image-based tasks. It applies Bayesian Multidimensional Scaling (BMDS) and a hill-climbing algorithm to map each feature into an optimal location in a 2D image space. The paper demonstrates its advantages: automated feature selection without explicit feature selection techniques, an increase in predictive accuracy, and bias correction.
The approach of REFINED method is interesting as it converts feature vectors to 2-dimensional locations reflecting feature similarity. Despite that, this paper has critical points that compromise contributions the paper asserts.
1. The presentation is fairly poor. Above all, 'automated feature selection' is a misleading term. The authors emphasize automated feature selection as a core contribution of the paper; however, 'feature selection' seems incorrect. Instead, 'automated feature extraction' is an appropriate term since it is derived from CNNs that allow feature extraction. Moreover, the authors did not elaborate specific details of how each feature value is visualized in the generated 2D image. In addition, some fundamental notations have an incorrect definition in section 2.1.1. For example, in equation (1), q must be defined as a combination of p taken 2 because q denotes all possible pairs. Likewise, IG denotes Inverse Gaussian distribution in this paper, but it should refer to the Inverse Gamma distribution, which is theoretically correct. Besides, some notations are not specified. These could be considered as an incomplete understanding of the theoretical backgrounds.
We have completely revised the section and made the notation coherent. Our intention in the above specification was to bring out the fact that we are randomly distributing "p" points (locations associated with "p" features) in the unit square with a constant intensity function "p/area of the unit square". We have revised the text to make the issue clear.

"-With respect to the prior of s. It's mentioned initially that it's a Homogeneous Poisson but then a Uniform is specified in eq (2). It's unclear which of the 2 the authors used. In case it's the former, why it doesn't appear on the posterior. The authors claim that HPP essentially forces a constraint of 1 voxel per picture, but this is not obvious. The area A(s) is not defined clearly."

As mentioned above, an HPP is equivalent to complete spatial randomness. Since we are only distributing "p" locations, an equivalent specification is to generate "p" 2-dimensional coordinates from a Uniform distribution on the unit square (Chandler & Royle, 2013).
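The equivalence invoked here can be illustrated numerically: a homogeneous Poisson process on the unit square has a Poisson-distributed count with locations i.i.d. uniform given that count, so conditioning on exactly p points reduces to p independent uniform draws. A small sketch (p is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 100  # number of features (locations) to place

# Specification 1: homogeneous Poisson process with intensity p on the unit
# square -- a Poisson-distributed count, then i.i.d. uniform locations.
n = rng.poisson(lam=p)
hpp_points = rng.uniform(0.0, 1.0, size=(n, 2))

# Specification 2 (the one written in eq. (2)): condition on exactly p points;
# complete spatial randomness then reduces to p i.i.d. Uniform draws.
locations = rng.uniform(0.0, 1.0, size=(p, 2))
```

Both specifications produce complete spatial randomness; the second simply fixes the point count at p.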
"-The prior of sigma is specified as Inverse Gamma in page 6 and Inverse Gaussian in page 7 (probably by mistake). Then its posterior is approximated using the conjugate form without clear justification.
-There is a typo ((q=p 2)). It should be q=(p 2). Also, in the posterior the exponent of sigma squared should probably be -(q/2+a+1)."

We thank the reviewer for pointing that out. We have rectified the statements.
-"The hill climbing algorithm is not well described either. It's mentioned that its objective function is related to the "true" distance delta, but s is already used to define delta so it's not clear how changing them can improve the objective."

It is to be noted that the posterior realizations of the si are not identifiable, and the convergence of the MCMC cannot be monitored for the si's. Rather, convergence is monitored for the δ and σ chains (Oh & Raftery, 2001). The hill climbing is done as a post-processing step to see if local adjustments of the locations can improve the stress function. This step is done to mimic the swap step associated with the Partition-Around-Medoids algorithm. More formally, observe that the output of the BMDS algorithm produces a set of locations {si} in the unit square. This is not a veritable image. To convert this collection of points and their associated values (the values of the predictors) into an image, we need to impose a grid on the unit square such that at most one si is contained in each grid cell. The hill-climbing operation is undertaken to identify the coarsest resolution of this grid. More specifically, suppose si is contained in grid cell i and sj is contained in grid cell j. Let δ̂ij be the BMDS posterior estimate of the distance between si and sj; the hill-climbing operation is performed to minimize the distortion associated with placing si and sj at their respective grid centers, i.e., dist(center of cell i, center of cell j) ≈ δ̂ij. Clearly, given the global optimizer producing δ̂ij, the hill-climbing operation identifies a set of local adjustments that closely follows the global optimum estimates.
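This post-processing step can be sketched as follows. This is our illustrative implementation, not the authors' code: the grid size, iteration count, and greedy nearest-free-cell initialization are assumptions. It snaps the BMDS locations onto a grid with at most one feature per cell, then accepts pairwise swaps only when they lower the stress against the estimated distances:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def grid_stress(assign, centers, delta_hat):
    """Sum of squared distortions between grid-centre distances and the
    BMDS posterior distance estimates delta_hat (a p x p matrix)."""
    d = squareform(pdist(centers[assign]))
    return float(((d - delta_hat) ** 2).sum())

def hill_climb(coords, delta_hat, grid=8, iters=500, seed=0):
    """Snap p locations in the unit square onto a grid x grid image
    (at most one feature per cell), then greedily swap pairs of features
    whenever the swap lowers the stress."""
    rng = np.random.default_rng(seed)
    p = len(coords)
    assert p <= grid * grid, "grid too coarse to hold one feature per cell"
    ticks = (np.arange(grid) + 0.5) / grid
    centers = np.array([(x, y) for y in ticks for x in ticks])
    # Initial assignment: each feature takes the nearest still-empty cell.
    assign = np.full(p, -1)
    free = set(range(grid * grid))
    for k in range(p):
        order = np.argsort(((centers - coords[k]) ** 2).sum(axis=1))
        cell = next(c for c in order if c in free)
        assign[k] = cell
        free.remove(cell)
    best = grid_stress(assign, centers, delta_hat)
    for _ in range(iters):
        i, j = rng.choice(p, size=2, replace=False)
        assign[[i, j]] = assign[[j, i]]      # propose swapping two features
        s = grid_stress(assign, centers, delta_hat)
        if s < best:
            best = s                         # keep the improving swap
        else:
            assign[[i, j]] = assign[[j, i]]  # revert
    return assign, best
```

The swap move mirrors the Partition-Around-Medoids swap step mentioned above; the returned stress is monotonically non-increasing in the number of iterations.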
-"Finally, the sentence "Assuming that the intensity…developed in [26]." is particularly hard to follow for this reviewer".
We have revised that statement to make it clearer. "We also incorporated a larger number of genes to examine the impact of creating larger images using REFINED. We selected larger common subsets of genes (2147 ∼ 2000, 2985 ∼ 3000, 4271 ∼ 4000, 8048 ∼ 8000 genes) and created the REFINED images to train the same CNN architecture that we used to train the GDSC data with 1211 selected genes. We also trained the other competing models with the same hyper-parameters. In our experiment, the SVR model crashed after several hours due to the larger memory requirement while estimating the kernel by computing the distance function between each pair of points in the dataset. Thus, we used a bag of SVRs, where we selected subsets of 400 features (appended gene expression and drug descriptors) to train 100 SVRs in parallel. The complete results are provided in figure 33 and table 18, along with the corresponding confidence intervals in figures 34, 35 and 36 of the appendix. We observed that the trend of REFINED CNN outperforming the other approaches is maintained irrespective of the number of genes used for training the models."

We have also modified the synthetic data section, where the number of features that we generate now ranges from 20 to 4000, and the number of samples ranges from 50 to 10000 for each set of generated features. The following paragraph has been included in the synthetic dataset section:
"We simulated a synthetic dataset with P correlated features for N samples, where for each simulation 20%, 50% and 80% of the features were spurious. The features were simulated from a zero-mean Gaussian process with a stationary isotropic covariance matrix whose (i, j)th element is given by ρ^|i−j|. P ranged over 20, 50, 100, 400, 800, 1000, 2000 and 4000, and for each P the number of simulated samples ranged over 50, 200, 600, 1000, 2000, 5000 and 10000."
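A simulation of this kind can be sketched as follows. Since the covariance element is garbled in the quoted text, the exponential-decay form ρ^|i−j|, and the way the response is built from the non-spurious features, are our assumptions:

```python
import numpy as np

def simulate(P=100, N=200, rho=0.7, spurious_frac=0.5, seed=0):
    """Draw N samples of P correlated features from a zero-mean Gaussian
    process with stationary covariance Sigma[i, j] = rho ** |i - j| (assumed
    form), with a response built from the non-spurious features only."""
    rng = np.random.default_rng(seed)
    idx = np.arange(P)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # feature covariance
    X = rng.multivariate_normal(np.zeros(P), sigma, size=N)
    n_inform = max(1, int(round(P * (1.0 - spurious_frac))))
    informative = rng.choice(P, size=n_inform, replace=False)
    beta = rng.normal(size=n_inform)                     # random effect sizes
    y = X[:, informative] @ beta + rng.normal(scale=0.1, size=N)
    return X, y, informative
```

With `spurious_frac=0.8`, 80% of the features carry no signal, matching the hardest setting described in the quoted paragraph.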

"-Finally, I propose the use of full nested cross validation, instead of the split to training-validation-testing.
3. RF, SV and ANN tuning
In line with comment #2, the same questions apply to the training and tuning of these three learning methods."

We appreciate the Reviewer's recommendation on using nested cross-validation, as it is often the optimal method for hyper-parameter search and unbiased evaluation of a machine learning model. However, nested cross-validation can be extremely time consuming when the dataset is large and numerous comparisons across different approaches have to be performed, as is applicable to our scenario. Thus, we compared the hyper-parameter search performance of nested cross-validation against the training-validation-testing (hold-out) approach using a random subset of three different cell lines from the NCI60 dataset. For the hold-out approach, we trained the models on the training set, fine-tuned the hyper-parameters using the validation set, and evaluated them on the test set. We observed that the trends in the performance of the different models (as shown in Figure 18 of the revised manuscript) remain similar for both nested CV and hold-out approaches, and thus the results for the other cell lines and datasets are reported using the less computationally intensive training-validation-testing (hold-out) approach. Our conclusion of similar results for nested CV and hold-out on large datasets has been observed earlier: as concluded in http://www.jmlr.org/papers/volume11/cawley10a/cawley10a.pdf, the difference in evaluation bias between nested CV and other approaches decreases as the dataset gets larger.
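The two evaluation protocols being compared can be sketched with scikit-learn on toy data (an illustrative comparison; the model and hyper-parameter grid are arbitrary, not the paper's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
grid = {"max_depth": [3, None], "n_estimators": [25, 50]}  # illustrative grid

# Nested CV: the inner loop tunes hyper-parameters, the outer loop
# estimates generalization error on data never seen during tuning.
inner = KFold(n_splits=3, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=inner)
nested_scores = cross_val_score(search, X, y, cv=outer, scoring="r2")

# Hold-out: tune on the training portion, score once on the test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_score = search.fit(X_tr, y_tr).score(X_te, y_te)
```

Nested CV costs roughly (outer folds) times more model fits than hold-out, which is the computational trade-off discussed above.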
"4. Other comparisons The comparison with RF, SV and ANN is a good idea, but elastic nets have also been tried on the GDSC dataset with relative success. The authors need to compare to this method as well. "

As suggested by the reviewer, we included elastic net results not only for the GDSC dataset but also for the NCI60 dataset. Our proposed REFINED-CNN approach outperformed Elastic Net in all considered cases.
"In addition, the ultimate goal of learning drug sensitivity models from cell line transcriptomic data is to apply these models to patient data and predict their response to drug treatments. One such gold standard dataset is the bortezomib clinical trial on multiple myeloma. Elastic net models, trained on GDSC, have been able to make statistically significant predictions of responders versus non-responders to this drug: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-3-r47 The authors should test whether their bortezomib model is also equally successful."

As recommended by the reviewer, we did test our approach on the bortezomib model. However, a detailed reading of the supplementary information and code of that paper revealed some issues, such as the use of test data to select the optimal cutoff, and thus we did not include the comparison results in the revised paper. We have created a separate 7-page report (attached at the end of this Response to Referees letter) that provides the detailed results of the bortezomib model comparison.
"Residuals plots, especially those on figure 5, seem very weird. The observed data do not span the whole [0, 1] interval (in figure 6 this is more pronounced). All competing models have very pathological behavior, they seem to underestimate low values and overestimate high ones. ANN also appears to be constant for low values. It looks like just a simple linear term would suffice to fix this behavior and it's surprising that powerful models like these failed do so."

As recommended by the reviewer, we corrected the residual plots to represent the non-normalized range. The residuals are plotted in the same manner as in previous studies such as https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2060-2. As shown in figure 3, all the cell lines of the NCI drug response data have a massive peak, which represents non-responsive drugs, and there is limited response below that peak; thus the observed data do not span the whole range of [0, 1] (initial submission) or [0, 8.5] (revised submission). The behavior of ANN is potentially caused by the prevalence of this point mass. In fact, any mean-field model will struggle to accommodate point masses. CNNs (or other local smoothers, for example LOWESS) are essentially defined on highly resolved bins and hence are adept at handling such point masses. We have observed that RF can be adapted to such scenarios too when its classifier properties are exploited (https://www.nature.com/articles/s41598-017-11665-4). Note that models such as RF are known to have biases that overestimate low values and underestimate high values (PMID: 29589559).
"5. Metrics and statistical significance The authors should report confidence intervals for all their evaluation metrics. Also, the statistical significance of the difference in performance across methods should be assessed. Finally, for the classification problem, the authors should report AUCs and/or similar metrics. Moreover, their NRMSE metric is deviance based (it looks like the square root of 1 -R2) and thus it's not well suited to compare models (same or different) trained on different datasets (see table 3)."

We report precision, recall, F1-score and AUROC for the classification task, and NRMSE, correlation coefficient and bias for regression. For the data augmentation comparison in Table 3, we have added the additional metric of normalized mean absolute error. For the model comparisons, we have kept the same training and testing datasets. We also report statistical significance for both the regression and classification tasks. The classification results, along with McNemar's test for statistical significance for the NCI dataset, are reported in tables 6-8 of the appendices.

To report the statistical significance of the difference in performance of the regression task across the considered methods (REFINED CNN, Random CNN, PCA CNN, RF, SVR, ANN, EN), we designed a null distribution and compared our results against it, similar to the approach considered in https://www.nature.com/articles/nbt.2877. Random predictions were generated from the training CDF of Y. Under the assumption of the simplest intercept-only model, the conditional distribution Y|X is proportional to the marginal distribution of Y. Evaluating the comparison metrics for each realization of random predictions produces a null distribution of the comparison metric. A similar distribution of the foregoing metrics can be obtained by fitting the candidate models on bootstrapped samples. Therefore, we have two distributions of each comparison metric. A significant lack of predictive power is indicated if the distribution of a metric obtained under a candidate model overlaps substantially with the distribution of the corresponding metric obtained under the null model. However, there is no observed "test statistic" with which to perform a formal test. Consequently, we "bag" the comparison metric scores (say, NRMSE) coming from the candidate model and the null model and ask whether there exists sufficient statistical evidence for the existence of two clusters in a completely unsupervised situation.
A formal answer to assess significance in such a situation is given by the Gap statistic (Tibshirani et al., 2001). This statistic is most powerful in determining H0: m = 1 cluster vs. H1: m > 1 cluster. Non-rejection of H0 indicates a lack of predictive power for the corresponding candidate model. Furthermore, Gap performs a greedy search to identify a suitable number of clusters. In an ideal situation, the Gap statistic will identify m = 2 clusters and the cluster memberships will show that one cluster contains all the scores coming from the null model while the other cluster contains all the scores from the candidate model. This would indicate a sharp delineation of the candidate model from the null and a high level of predictive precision. Cases with m > 2 are ambiguous and can happen under severe heteroskedasticity. Subsequently, the null model results were paired with each model and the Gap statistics were utilized to report the significance of the difference in performance across the models. We observed that our proposed REFINED CNN performs better than the other 6 competing methods in terms of both average error and statistical performance. The Gap statistics results are provided in tables 12 and 17 and figures 19, 30, 31 and 32 of the appendix. The following paragraph is included in the revised manuscript to provide the details of the utilized Gap statistic approach:
"We used the Gap statistic [41] to report the significance of the difference in performance across methods for the regression tasks. We paired each model with a null model [1] using bootstrap sampling, where the bootstrap sampling was done on the dose-response values of the test set along with their corresponding predicted values for each model, and the null model randomly predicted dose-response values for the sampled test set using the distribution of the training set dose-responses. The process was repeated 10,000 times, and a distribution of NRMSE, PCC and bias was formed for each model along with the null model. The distribution of each metric for each model was paired with the null model's distribution of that metric. Then cluster centroids were calculated using k-means (k = 2) clustering, after the Gap statistic showed the appropriateness of using 2 clusters. The difference between each model's and the null model's cluster centroids per metric represents the difference between them."
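The null-model bootstrap and Gap-statistic check described above can be sketched as follows. This uses toy data and a simplified 1-D Gap implementation; the number of reference draws and bootstrap repetitions are scaled down from the 10,000 used in the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def nrmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)) / np.std(y))

def gap_k(scores, k_max=3, n_ref=20, seed=0):
    """Gap statistic (Tibshirani et al., 2001) on 1-D 'bagged' metric scores;
    returns the chosen number of clusters. k = 1 means the candidate model's
    scores are indistinguishable from the null model's."""
    rng = np.random.default_rng(seed)
    x = np.asarray(scores, dtype=float).reshape(-1, 1)
    lo, hi = x.min(), x.max()

    def log_wk(data, k):  # log within-cluster dispersion for k clusters
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        return np.log(km.inertia_ + 1e-12)

    gaps, sk = [], []
    for k in range(1, k_max + 1):
        ref = [log_wk(rng.uniform(lo, hi, size=x.shape), k) for _ in range(n_ref)]
        gaps.append(np.mean(ref) - log_wk(x, k))
        sk.append(np.std(ref) * np.sqrt(1.0 + 1.0 / n_ref))
    for k in range(1, k_max):  # smallest k with Gap(k) >= Gap(k+1) - s(k+1)
        if gaps[k - 1] >= gaps[k] - sk[k]:
            return k
    return k_max

# Null model: predictions resampled from the training-response distribution.
rng = np.random.default_rng(1)
y_train = rng.normal(size=500)
y_test = rng.normal(size=200)
y_pred = y_test + rng.normal(scale=0.3, size=200)  # a reasonably good "model"

model_scores, null_scores = [], []
for _ in range(200):                               # bootstrap the test set
    idx = rng.integers(0, len(y_test), len(y_test))
    model_scores.append(nrmse(y_test[idx], y_pred[idx]))
    null_scores.append(nrmse(y_test[idx], rng.choice(y_train, size=len(y_test))))

bagged = np.concatenate([model_scores, null_scores])
n_clusters = gap_k(bagged)
```

When the candidate model genuinely beats the null, the bagged scores form two well-separated groups and the Gap statistic selects more than one cluster.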

To report the confidence interval of the regression models, we have provided bar plots in figures 20, 21, 22, 34, 35, 36 of the appendix, using a pseudo Jackknife-After-Bootstrap approach as below:
"We calculated a 95% confidence interval for each metric used to measure the performance of the methodologies in modeling the NCI60 and GDSC datasets, using a pseudo Jackknife-After-Bootstrap confidence interval generation approach [4,31]. Multiple bootstrap sets were selected from the test samples and error metrics were generated, resulting in a distribution for each error metric, which was used to calculate the confidence interval for a given cell line for the NCI60 dataset or a given pair of cell line and drug for the GDSC dataset."

We have also conducted a robustness analysis (similar to the approach followed in https://www.nature.com/articles/nbt.2877) on the regression results for both the NCI60 and GDSC datasets, as shown in tables 11 and 16 of the revised manuscript. We observed that REFINED CNN outperformed the other models more than 90% of the time on the NCI60 dataset. REFINED CNN also outperformed the shallow models as well as DRF and HGNN more than 90% of the time, and the deep models more than 60% of the time, on the GDSC dataset.
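The bootstrap-based interval described in the quoted paragraph can be sketched generically as a percentile bootstrap on the test set (the exact pseudo Jackknife-After-Bootstrap variant in [4,31] differs in detail; this is an illustrative sketch on synthetic data):

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, B=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a test-set metric: resample test cases
    with replacement, recompute the metric, take the empirical quantiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)            # resample test cases
        stats[b] = metric(y_true[idx], y_pred[idx])
    return np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])

def nrmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2)) / np.std(y)

rng = np.random.default_rng(0)
y_true = rng.normal(size=300)
y_pred = y_true + rng.normal(scale=0.5, size=300)  # synthetic predictions
lo, hi = bootstrap_ci(y_true, y_pred, nrmse)
```

The same resampled metric distributions also feed the robustness analysis: counting how often one model's score beats another's across bootstrap replicates.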

The description of the robustness analysis is included in the revised manuscript as:
"In addition to the Gap statistic, all models were subjected to a robustness analysis [1], where we calculate how many times the REFINED CNN outperforms the other competing models over 10,000 repetitions of the bootstrap sampling process."

The 95% confidence interval for each of the metric for measuring the classifier performance was calculated using the Binomial proportion confidence interval [https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780120902]. To report the statistical significance of the difference in performance across classification tasks, we used the McNemar test (results in table 8 of the Appendix) with the following details included in the results section of the revised manuscript:
"To compare the statistical significance of the difference in performance between the competing classifiers and REFINED CNN, we used McNemar's test, which is a paired nonparametric statistical hypothesis test. McNemar's test evaluates whether two models disagree in the same way or not. To compare each classifier with the REFINED CNN classifier pairwise, a contingency table is formed and McNemar's test is applied [40]. The null hypothesis is that the two classifiers disagree by the same amount. Therefore, if the p-value is smaller than a threshold (0.05), the null hypothesis is rejected, and the conclusion is that there is a significant disagreement between the two classifiers. The results of comparing REFINED CNN with the other models using McNemar's test are provided in table 8 of the appendix. We observe that the REFINED CNN classifier performance is significantly different from that of the other classifiers (LR, RF, SVM, ANN, Random CNN, PCA CNN)."
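McNemar's test as described operates directly on the disagreement counts of the contingency table. A sketch using the continuity-corrected chi-square version (the paper may use a different variant):

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar chi-square test comparing two
    classifiers on the same test set via their disagreement counts."""
    a_ok = np.asarray(pred_a) == np.asarray(y_true)
    b_ok = np.asarray(pred_b) == np.asarray(y_true)
    n01 = int(np.sum(a_ok & ~b_ok))   # A correct, B wrong
    n10 = int(np.sum(~a_ok & b_ok))   # A wrong, B correct
    if n01 + n10 == 0:                # no disagreements at all
        return 0.0, 1.0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return stat, float(chi2.sf(stat, df=1))

# Toy example: classifier B is wrong on 30 cases where A is right.
y = np.zeros(100, dtype=int)
pred_a = y.copy()
pred_b = y.copy()
pred_b[:30] = 1
stat, p = mcnemar_test(y, pred_a, pred_b)
```

Only the discordant cells of the contingency table enter the statistic, which is why the test targets *disagreement* between the two classifiers rather than raw accuracy.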

"For figure 4, the authors don't specify which quantity is represented by the color-scale."
We thank the reviewer for pointing that out. We have included the quantity "Difference in NRMSE" for the color-bar.
"Finally, bias is not a standard metric because residual plots typically do not have clear constant slopes."

As shown in figure 9 of the revised manuscript and figure 18, REFINED CNN outperforms Random CNN and PCA CNN. The following information about the alternative image generation methods is included in the section "Comparison with other image generation methods".
"In this section, we consider two alternative image generation methods for comparison purposes: random projection-based and PCA-based. In the random projection method, we assume each image is a matrix and the location of each entry of the feature vector is randomly mapped to a location in the matrix. Thus, we placed each element of the drug descriptors or the gene expression at the image (matrix) coordinates one after another. Principal component analysis (PCA) [26] is mainly used for dimensionality reduction and visualization purposes, where each sample can be represented on a 2D plane aligned with its first two principal eigenvectors. The first two principal components of the transposed covariance matrix, with rows being features and columns representing samples, were selected as the feature coordinates. Some of the generated images using the random and PCA methods are shown in figure 2."
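The two baseline image generation methods described in the quoted paragraph can be sketched as follows (our illustrative reading of the description; function names and sizes are arbitrary):

```python
import numpy as np

def random_image(x, size, seed=0):
    """Random-projection baseline: each feature gets a fixed random pixel
    in a size x size image; unused pixels stay zero."""
    rng = np.random.default_rng(seed)
    img = np.zeros(size * size)
    img[rng.permutation(size * size)[: len(x)]] = x
    return img.reshape(size, size)

def pca_feature_coordinates(X):
    """PCA baseline: 2-D coordinates per *feature* from the first two
    principal directions of the transposed (feature-by-sample) matrix."""
    F = X.T - X.T.mean(axis=0)          # rows are features
    _, _, vt = np.linalg.svd(F, full_matrices=False)
    return F @ vt[:2].T                 # shape (n_features, 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30))           # 50 samples, 30 features
img = random_image(X[0], size=8)
coords = pca_feature_coordinates(X)
```

With a fixed seed, the random mapping is identical across samples, so the CNN sees a consistent (if arbitrary) layout; the PCA coordinates, like REFINED's, would still need to be snapped to distinct pixels to form an image.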

Minor Points:
"-It's not clear how the random dataset was created. Which distributions were used? Were the target variables normalized as well?"

As recommended by the reviewer, we no longer normalize IC50s to the range [0,1], to avoid misunderstanding, and all the results in the revised version of the manuscript are provided based on the actual range. As we revised the manuscript, the figure numbers are not the same as before, and the table below shows which figure in the revised manuscript is equivalent to which figure in the initial manuscript.
Initial submission → Revised submission
Figure 5 → Figure 17 (in the appendix)
Figure 6 → Figure 9
Figure 7 → Figure 10

In the revised manuscript, figure 17 of the appendix shows the scatter plot results of the synthetic data. In figure 17,

the observation values span the entire range, while the prediction values of the models do not span the same range, depending on their bias. For example, since RF overpredicts lower values and underpredicts higher values, its corresponding prediction values do not span the entire range. On the other hand, the REFINED CNN does not suffer from such a problem, and therefore its values span the entire range. The same happens with the GDSC data, whose results are shown in figure 10 of the revised manuscript. As shown in figure 3, the cell lines with more than 10k drug responses that were selected for the study from the NCI60 data have a distinctive distribution with a massive peak, and thus figure 9 of the revised manuscript shows a scarcity of responses in the range below the peak.
"-Page 8: Figure 3c. The title says "IC50" but the legend says "NLOGIC50".
-Page 16: "we used the NCI60 coordinates to generate the drug images for the 222 drugs"; it appears that this provides an unfair advantage to the REFINED algorithm over the rest of the methods."

As suggested by the reviewer, we no longer incorporate any prior knowledge, and we generate completely independent coordinates using REFINED for the drug data of the GDSC dataset. The results in the revised version of the manuscript are based on not incorporating any prior knowledge. Currently, we use 992 chemical descriptors for the GDSC drugs, which is larger than the 672 descriptors that we used for NCI60.
"-Page 10: The hybrid CNN receives inputs that are pre-organized into the two distinct data types (transcriptomics and chemical descriptors). Again, this seems to be unfair to the other algorithms. To avoid this, please also compare the performance of all algorithms based on transcriptomic data alone"

As pointed out by the reviewer, we clarified how models were trained on the same learning tasks by adding the following explanations in the predictive modeling sections. NCI:
"We randomly picked 17 cell lines where each cell line contained more than 10k drug responses, so that we have a sufficient number of samples to train a deep learning model. Subsequently, 17 sets of models were trained to predict the NLOGGI50 of drugs tested on the selected cell line. In each set, all the above-mentioned models were trained on the same set of samples and tested on a separate, common set of samples." GDSC: "We trained REFINED CNN and the 6 other competing models on the same set of samples, where each sample is a combination of one drug tested on one cell line. All the models were tested on a separate, common set of samples."

The authors would like to thank Reviewer No. 2 for his/her constructive review of the manuscript. With regard to the concerns raised by this Reviewer, the following changes have been incorporated in the revised version:
Major points: "1. The presentation is fairly poor. Above all, 'automated feature selection' is a misleading term. The authors emphasize automated feature selection as a core contribution of the paper, however, 'feature selection' seems incorrect. Instead, 'automated feature extraction' is an appropriate term since it is derived from CNNs that allow feature extraction. Moreover, the authors did not elaborate specific details of how each feature value is visualized in the generated 2D image. In addition, some fundamental notations have an incorrect definition in section 2.1.1. For example, in equation (1), q must be defined as a combination of p taken 2 because q denotes all possible pairs. Likewise, IG denotes Inverse Gaussian distribution in this paper, but it should refer to Inverse Gamma distribution which is theoretically correct. Besides, some notations are not specified. These could be considered as an incomplete understanding of the theoretical backgrounds."

As recommended by the reviewer, we have restructured the paper and proofread it carefully, considering all the above-mentioned points. We have corrected terminology, grammatical, and spelling errors. We added the following clarification in the REFINED section of the revised manuscript to explain the visualization in the generated 2D images.
"Once the cost function is minimized, a set of unique coordinates in a 2D space is produced for each feature. Using those coordinates, we map the features into a 2D space and create an image per sample. The created images are then used to train the REFINED CNN."

"2. In drug response tasks, three baseline models are conventional which do not offer state-of-the-art performance. As the authors imply REFINED-CNN provides better performance than others, the authors should assess the performance of REFINED-CNN comparing with other advanced models giving a state-of-the-art performance (i.e., Deep-Resp-forest, Transfer learning-based, Heterogeneous graph)"

Deep-Resp-forest (DRF): DRF is a deep cascaded forest model designed to classify anti-cancer drug response as "sensitive" or "resistant" based on the integration of heterogeneous input data (gene expression and copy number alteration (CNA)). It was applied to the GDSC and CCLE data in a drug-wise manner, where a DRF model was trained for each drug separately. It can therefore be applied to the GDSC regression task of our study, with two differences: 1) we train a regression model, not a classification model, and 2) the inputs of our model are gene expression and drug descriptors. To address the first difference, we converted the random forest classifier to a random forest regressor for prediction. The second difference cannot be addressed, as the second arm of DRF is designed to handle CNA data rather than drug descriptor data. Consequently, the performance of DRF is poor compared not only to the REFINED CNN but also to the other models.
Heterogeneous graph neural network (HGNN): see Table 3.

As advised by the reviewer, we added an ablation study section to the revised manuscript (Section 3.5), where we used other dimensionality reduction techniques for REFINED initialization. For comparison, we considered the local dimensionality reduction approaches Locally Linear Embedding (LLE) and Laplacian Eigenmaps (LE) and the global dimensionality reduction approach Isomap. We also assigned feature coordinates randomly, without any dimensionality reduction technique, to evaluate the hill-climbing component. Detailed results and descriptions are provided in the ablation study section of the revised manuscript (Section 3.5), figures 11, 12, and 23-27, and tables 4, 14, and 15.
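The random-initialization baseline mentioned above can be sketched as follows: each feature is assigned a distinct cell of a square grid uniformly at random, with no dimensionality reduction at all. This is a minimal illustrative sketch, not the paper's implementation; the feature names are hypothetical.

```python
import math
import random

def random_coordinates(features, seed=0):
    """Assign each feature a distinct cell in the smallest n x n grid
    that can hold all features, in random order."""
    n = int(math.ceil(math.sqrt(len(features))))
    cells = [(i, j) for i in range(n) for j in range(n)]
    rng = random.Random(seed)
    rng.shuffle(cells)
    # zip truncates the shuffled cell list to the number of features,
    # so every feature gets a unique cell
    return dict(zip(features, cells))

# Ten hypothetical features land in distinct cells of a 4 x 4 grid.
coords = random_coordinates(["f%d" % k for k in range(10)])
```

Such a random layout serves as a control: any gain of the full REFINED pipeline over it can be attributed to the distance-preserving initialization rather than to the hill-climbing step alone.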
Minor Points: "1. In section 2.1.1, a constraint is omitted in the notation. For instance, there should be a constraint of j≠k, j, k = 1, 2, ..., p in estimating x(.) based on observed distance. Also, notation S_j,l (in 4th paragraph, page 6) is undefined."

As suggested by the reviewer, we have restructured the Theoretical Basis for REFINED section of the manuscript and corrected all notations.
"2. Analyses are inadequate. There is no qualitative analysis that examines the effectiveness of the REFINED-CNN method. On top of that, quantitative analyses only describe results, neither interpretation nor intuition. In-depth analyses, such as case study and error analysis would increase the overall quality of the paper."

As suggested by the reviewer, we have included the following additional analyses in the revised manuscript:
(a) confidence intervals for the error metrics;
(b) a robustness analysis based on performance on different bootstrap test samples, illustrating that the REFINED CNN predictions are robust across test sample sets;
(c) a comparison of each model's predictions against random predictions to show the statistical significance of the predictions. We created a null distribution from random predictions following the distribution of the training Y and used it to evaluate how likely our predictions are to arise from a random prediction. We also added GAP statistics to compare the various models;
(d) an analysis of the performance of the various models as the sample size changes, showing how the CNN starts outperforming the other models as the sample size increases;
(e) a comparison of our REFINED CNN approach with additional models, including other approaches for generating 2D images from 1D vectors, such as PCA-based mappings and random projections;
(f) an ablation study to evaluate the significance of the various steps in the methodology. We considered local embedding approaches such as LE and LLE and global approaches using geodesic distances, such as Isomap, for comparison with our Bayesian MDS approach (further detailed analyses of the ablation study are included in Section 3.5 of the revised manuscript).
"3. Using accuracy as the only metric in classification tasks is inappropriate. It could mislead an evaluation when an output distribution is imbalanced. F1 score would be better than the single accuracy."

As advised by the reviewer, we have added the F1-score, precision, recall, and AUROC along with accuracy to report the performance of each classifier.
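The added metrics can be computed directly from the confusion-matrix counts. The sketch below is illustrative (the toy labels are not from the study); it shows why accuracy alone is misleading on an imbalanced test set, exactly as the reviewer notes.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels/predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Imbalanced toy example: the classifier misses 2 of 3 positives,
# yet accuracy still looks high because negatives dominate.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
# acc is 0.8 while f1 is only 0.5
```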
As pointed out by the reviewer, we have also corrected the caption of figure 1.

"4. Are there any reasons to utilize a hill-climbing algorithm despite its local optima problem?"

Hill climbing is used precisely because of its local optima properties. Observe that the output of the BMDS algorithm is a set of locations s_j in the unit square. This is not yet a veritable image. To convert this collection of points and their associated values (the values of the predictors) into an image, we need to impose a grid on the unit square such that at most one s_j is contained in each grid cell. The hill-climbing operation is undertaken to identify the coarsest resolution of this grid. More specifically, suppose s_j is contained in grid cell i and
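The grid-resolution idea described above can be sketched as follows. This is a simplified direct search over resolutions rather than the paper's actual hill-climbing procedure, and the locations and feature values are illustrative.

```python
import math

def coarsest_grid(locations):
    """Smallest n such that an n x n grid over the unit square contains
    at most one location per cell."""
    n = int(math.ceil(math.sqrt(len(locations))))  # need at least this many cells
    while True:
        cells = set()
        ok = True
        for x, y in locations:
            cell = (min(int(x * n), n - 1), min(int(y * n), n - 1))
            if cell in cells:
                ok = False  # two points share a cell; grid is too coarse
                break
            cells.add(cell)
        if ok:
            return n
        n += 1

def to_image(locations, values, n):
    """Place each feature value into its grid cell to form a 2D image."""
    image = [[0.0] * n for _ in range(n)]
    for (x, y), v in zip(locations, values):
        row = min(int(y * n), n - 1)
        col = min(int(x * n), n - 1)
        image[row][col] = v
    return image

# Four hypothetical BMDS locations in the unit square and their values.
locations = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.52, 0.48)]
n = coarsest_grid(locations)  # a 2x2 grid collides, so n becomes 3
image = to_image(locations, [1.0, 2.0, 3.0, 4.0], n)
```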
"7. Figure 1 has a low resolution."

As recommended by the reviewer, we have created a new figure to elaborate the REFINED diagram with high resolution.
"8. 'Voxel representation' appears to be inaccurately used with 2D image space, since voxel indicates 3-dimensional space."

Report on Modeling using Bortezomib Clinical Trial Data
As suggested by the reviewer, we tried to apply our model to patient data (the bortezomib clinical trial in multiple myeloma) and predict their response. We quote the following paragraph from [1]: "The authors of the original clinical study reported that a 100-gene signature model [2], built on two arms of the trial (025 and 040), could predict bortezomib response in the third (039) arm of the trial with 63% accuracy. To compare our predictions with those originally reported, we assessed the performance of our model on only this third arm of the trial. The [1] models generate a continuous variable and to compare the results previously reported directly, we must dichotomize this variable (i.e. split the data into 'sensitive' and 'resistant' at an arbitrary cutpoint). At the optimal cut-point (-5.29), 51 of 71 patients were correctly classified, meaning that our method achieved a classification accuracy of 72%." The study is interesting and informative, but we found some inconsistencies when going over the code in its supplementary material. The authors fit a ridge regression on in vitro data and then predicted the log(IC50) of the patients. They mapped the predicted log(IC50) to "Responsive" and "Non-responsive" using the patient information included in the test set. As shown in figure 1, they iteratively tried all possible cutoff points on the test set to find the one that maximizes accuracy on the test set, which is not methodologically sound. Therefore, we decided not to include this comparison in the manuscript and to provide a separate report instead. Extract reprinted by permission from Springer Nature: Geeleher, P., Cox, N.J. & Huang, R.S. Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol 15, R47 (2014).
We follow the same preprocessing steps as [1], including the power transform and homogenization of gene expression between the two domains. The preprocessed data is then used for modeling. As the patient data is divided into arms A and B, we summarize our results following the same division. For comparison, figures 4-a and 4-d of [1] are provided in the appendix of this report.

A) In vivo responders and non-responders to bortezomib (compared with figure 4-a of [1]):
All analysis in [1] was done in R, while we use Python (scikit-learn and tensorflow); therefore, our results differ somewhat from those of [1]. Also, when we ran the provided R code, we noticed some differences between its results and those reported in the paper, which might be due to updates in the downloaded datasets. According to the R code, genes with variance lower than 0.2 are discarded. Using the same variance threshold left our gene list empty, so we used a variance threshold of 0.1. As suggested by the reviewer, we compared our REFINED CNN method with Elastic Net (EN), support vector regression (SVR), and Random Forest (RF). The results are summarized in table 1. Figure 2 shows the accuracy of the REFINED CNN versus different cutoff thresholds, where the optimum accuracy is 0.627 at a log(IC50) of -4.78. The corresponding box plot is shown in figure 3. We used the model trained on GDSC, which takes the drug and cell line data as two separate inputs. We then used the REFINED images for bortezomib, and the same gene coordinates as the cell line images, to generate the corresponding REFINED images of the patient data. If we use the network trained on GDSC, as reported in our paper, directly to classify the patients, the accuracy is 0.568; the details are provided in table 1 as REFINED CNN 2.
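The gene-filtering step described above (dropping genes whose expression variance falls below 0.1, since the original 0.2 threshold left no genes) can be sketched in plain Python. The gene names and values are illustrative; a real pipeline would operate on the full expression matrix with numpy or pandas.

```python
def variance(xs):
    """Population variance of a list of expression values."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def filter_genes(expression, threshold=0.1):
    """Keep genes whose variance across samples meets the threshold.
    expression: dict mapping gene name -> list of values across samples."""
    return {g: v for g, v in expression.items() if variance(v) >= threshold}

# Hypothetical expression values for three genes across four samples.
expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],  # variance 0.0    -> dropped
    "GENE_B": [0.0, 1.0, 0.0, 1.0],  # variance 0.25   -> kept
    "GENE_C": [0.5, 0.6, 0.5, 0.6],  # variance 0.0025 -> dropped
}
kept = filter_genes(expression, threshold=0.1)
```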
Another scenario is using transfer learning to predict responder and non-responder patients. We replaced the last fully connected layer of our network with a fully connected layer with two nodes followed by a softmax layer, with a binary cross-entropy loss function. Half of the patient data was used for fine-tuning and the other half for testing. The fine-tuned transferred network improved the accuracy, precision, recall, F1-score, and AUC to 0.651, 0.672, 0.651, 0.629, and 0.6347, respectively.

B) Arm B (compared with figure 4-d of [1]):
The same preprocessing as for arm A was applied to the training data, and the models were tested on arm B of the dataset. The results are summarized in table 2. The same transfer learning approach as in part A was used to predict responder and non-responder patients of arm (039) of the bortezomib trial. The fine-tuned transferred network improved the accuracy, precision, recall, F1-score, and AUC to 0.764, 0.779, 0.764, 0.791, and 0.735, respectively.

Conclusion
Our experiments show that finding the optimum cutoff point for maximizing the accuracy of predicting responder versus non-responder patients to bortezomib is completely independent of minimizing the error of a regression model trained on in vitro data. In other words, since the test data information is used to choose the cutoff that optimally predicts the test data, the quality of the regression model fit to the in vitro data is not the deciding factor. According to tables 1 and 2, the models with the minimum NRMSE do not necessarily provide better classification accuracy. Also, as shown in figure 6, the optimum classification results are obtained by picking an unstable cutoff threshold: if the cutoff point changes by a small amount, the classification results degrade significantly.
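The practice criticized above, scanning every candidate cutoff on the test predictions to maximize test accuracy, can be made concrete with a short sketch. The predicted log(IC50) values and labels below are toy data, not from the trial; the point is that the reported accuracy is tied to a cutoff chosen using the test labels themselves.

```python
def best_cutoff(pred_log_ic50, labels):
    """Scan each predicted value as a candidate cutoff; patients with a
    prediction at or below the cutoff are called responders (label 1).
    Returns the cutoff maximizing accuracy on these same labels."""
    best = (None, -1.0)
    for c in sorted(pred_log_ic50):
        correct = sum(
            1 for p, y in zip(pred_log_ic50, labels)
            if (1 if p <= c else 0) == y
        )
        acc = correct / len(labels)
        if acc > best[1]:
            best = (c, acc)
    return best

# Toy predictions and true responder labels (1 = responder).
preds = [-6.1, -5.8, -5.3, -5.0, -4.6, -4.2]
labels = [1, 1, 0, 1, 0, 0]
cutoff, acc = best_cutoff(preds, labels)
```

Because the cutoff is tuned on the test labels, the resulting accuracy is optimistically biased and, as noted above, can be highly sensitive to small shifts in the chosen threshold.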
In these scenarios, the authors recommend using deep or statistical transfer learning approaches to predict responsive and non-responsive subjects.