ProTargetMiner as a proteome signature library of anticancer molecules for functional discovery

Deconvolution of targets and action mechanisms of anticancer compounds is fundamental in drug development. Here, we report on ProTargetMiner as a publicly available expandable proteome signature library of anticancer molecules in cancer cell lines. Based on 287 A549 adenocarcinoma proteomes affected by 56 compounds, the main dataset contains 7,328 proteins and 1,307,859 refined protein-drug pairs. These proteomic signatures cluster by compound targets and action mechanisms. The targets and mechanistic proteins are deconvoluted by partial least square modeling, provided through the website http://protargetminer.genexplain.com. For 9 molecules representing the most diverse mechanisms and the common cancer cell lines MCF-7, RKO and A549, deep proteome datasets are obtained. Combining data from the three cell lines highlights common drug targets and cell-specific differences. The database can be easily extended and merged with new compound signatures. ProTargetMiner serves as a chemical proteomics resource for the cancer research community, and can become a valuable tool in drug discovery.

Reviewer #1 (Remarks to the Author):
In the manuscript 'ProTargetMiner: a proteome signature library of anticancer molecules for functional discovery' Zubarev and co-workers present an initial proteome signature library for anticancer drugs and provide an R-based package for data analysis that provides clues about potential drug targets, mechanism of action and cellular response to drug treatment in general. In essence, this tool mimics on the proteome level mRNA expression-based tools such as Connectivity Map. The authors provide proteome maps for 56 anticancer drugs in A549 cells, which is a valuable resource in its own right, and suggest a framework of analysis methods and tools to infer potential targets and mechanisms of action from proteome signatures. This is an interesting resource for cancer drugs that complements existing target databases such as DrugBank and that can alert to potential additional targets and target- or off-target-related secondary effects on the proteome. Examples are currently limited to anticancer drugs, but the approach might be applicable to other indications provided that cell perturbations lead to a common biological endpoint (such as cell death).
There are a couple of points that the authors might want to consider when revising their manuscript.
Points:
• The resource could be more impactful if the authors would implement this into an easily accessible database system that integrates future data by their own lab or from other researchers following the procedures outlined in the manuscript. Implementation into existing repositories such as DrugBank or ProteomicsDB might be an alternative to generating their own system.
• The discussion is a very nice and balanced reflection of what their approach can deliver. This balance is not consistently achieved in the results section, which sometimes fails to report the absence of expected findings. For example, sorafenib is first and foremost a kinase inhibitor, but none of its known kinase targets is detected in the OPLS-DA plots. Likewise, dasatinib was primarily designed as a (BCR-)Abl inhibitor and has been shown to very potently inhibit a large number of tyrosine kinases, hardly any of which seem to show up in their plots.
• The authors should state more strongly that specific anticancer drugs may be only very imperfectly represented in proteomics profiles despite a certain level of cytotoxicity. In cases where the primary target is not expressed or the cell system is not dependent on a certain pathway, proteomics signatures might instead be dominated by off-target effects. For example, the efficacy of the targeted covalent BTK inhibitor ibrutinib in certain forms of leukaemia will not be reflected in A549 lung cancer cells, despite the fact that the compound will be cytotoxic at high concentrations.
• There is hardly any novelty on inhibitors reported in the study. An exception might be the finding of dysregulation of cholesterol metabolism. Unfortunately, there is little follow-up that plausibly explains the mechanism underlying this dysregulation.
• I have not found a detailed description of the OPLS-DA modeling presented in this paper, nor an explanation of the acronym for that matter. All in all, the description of statistical procedures is rather light touch.
• The tools presented do not seem to give an estimate of the significance of individual observations, nor an estimate of the false discovery rate. Instead, the authors pick some proteins of interest at the extremes of the x-axis in their OPLS-DA plots but do not comment on many other proteins with similar values. There seem to be no stringent thresholds or other heuristic criteria for picking the proteins of interest either.
• The section on the minimal size of the target miner system is a bit unclear. In figure 4, would it not be more useful to plot the median of all proteins with an indicator of spread rather than two randomly selected proteins that are not affected? Can the authors suggest an optimal initial set of compounds covering a wide variety of mechanisms to base the system on? There should be an influence of mechanistic diversity on the minimal size of the system.
Minor:
• In some cases the authors could have taken more care to provide the reader with the information needed to understand their rationale. For example, in the paragraph discussing the size of the target miner vs. the depth of the proteomics data set, the authors state that they acquired deeper proteomics data sets without detailing how much deeper or what was done differently. Judging from the details in materials and methods, they essentially doubled the number of fractions that were analysed with the same gradient. This information would be useful in the results, along with an average number of proteins or unique peptides per sample, to better understand the difference from the above. Similarly, the authors could at least provide some information about the cholesterol measurements in the results section.
• After presenting the main features of their system, the authors elaborate on opportunities to base the target miner system on less data and on the influence of the level of proteome coverage. It would be useful if the authors could include a concluding sentence for each of these sections stating the main messages.
Reviewer #2 (Remarks to the Author):
The authors have generated a proteome signature library for anticancer molecules at LC50 concentrations. The main dataset is based on 287 A549 adenocarcinoma proteomes affected by 56 compounds. In addition, the authors provide an R Shiny package to perform OPLS-DA modelling for a selected compound using the provided data or data by the user. While the proposed tool in principle might be a valuable tool for many researchers, it has some drawbacks that need to be addressed. Another limitation of the manuscript is a lack of some details, which makes the manuscript difficult to follow. Finally, the approach as such is not novel. The idea of extending the concept of connectivity maps to proteomics has already been presented (PMID 29655704; 90 drugs in 6 cell lines) and the same data analysis methodology has been used by the authors in their previous publications (e.g. PMID 29572246).

Major comments:
The OPLS-DA method used by the authors as a key component in the analysis to interpret protein regulation and drug specificity comes with its caveats. It is known that OPLS-DA can easily produce statistically unreliable group separation and is even sometimes used as an alternative method if, for example, PCA fails to separate the groups (PMID 27547730). If OPLS-DA is the chosen modelling technique, the authors should thoroughly validate their findings using e.g. permutation testing and cross-validation. This is an important step regardless of the evidence that the authors found from literature for their "counter-intuitive results" (row 266).
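As an illustration of the kind of validation requested here, the following is a minimal sketch of a label-permutation test in Python. It is not the authors' pipeline and it does not implement OPLS-DA: the separation statistic (absolute difference of group means for a single protein), the function names and the numbers are all assumptions made for illustration. The same shuffling scheme could in principle be wrapped around any model-derived separation score.

```python
import random
import statistics


def separation(values, labels):
    """Absolute difference between the mean of group 1 and the mean of group 0."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(statistics.mean(group1) - statistics.mean(group0))


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Share of label shuffles whose separation is at least the observed one."""
    rng = random.Random(seed)
    observed = separation(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if separation(values, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical log2 ratios for one protein:
# 3 treated replicates (label 1) vs 9 other samples (label 0).
values = [2.1, 1.9, 2.3, 0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
p_value = permutation_p_value(values, labels)
```

With three treated replicates among twelve samples there are only 220 distinct labelings, so even a perfect separation yields an expected permutation p value of about 1/220. This bounds what a permutation test can demonstrate at this replicate count.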
The authors are performing multiple normalization methods sequentially for their data (rows 536-542) and should provide a rationale for this. For instance, when the log2 protein ratios between the sample and all control samples are scaled to have zero median fold changes, there is a great possibility that much of the produced signal is biased. This is especially true if the samples are behaving very differently, which is often the case when cancer cell lines are treated with intense compounds.
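To make the concern about median scaling concrete, here is a minimal Python sketch with invented numbers (not data from the manuscript). The function name and the two toy samples are assumptions for illustration only. It shows that median centering leaves unchanged proteins at zero when a minority of proteins respond, but shifts them to an apparent down-regulation when a majority respond.

```python
import statistics


def median_center(log2_ratios):
    """Subtract the sample median so that the median log2 ratio becomes zero."""
    m = statistics.median(log2_ratios)
    return [x - m for x in log2_ratios]


# Hypothetical true log2 fold changes for 10 proteins in one sample.
few_changed = [0.0] * 8 + [2.0] * 2    # 20% of proteins up-regulated
many_changed = [0.0] * 4 + [2.0] * 6   # 60% of proteins up-regulated

centered_few = median_center(few_changed)    # unchanged proteins stay at 0
centered_many = median_center(many_changed)  # unchanged proteins appear at -2
```

In the second toy sample the four proteins that did not change end up with a log2 ratio of -2 after centering, which is the type of artefact the comment warns about. Whether this matters for the real data depends on what fraction of the proteome responds to each compound.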
More details should be provided on various aspects of the methodologies as well as some of the results:
1) Please include the PCA analysis results described in rows 143-145.
2) Please clarify what exactly are the 287 signatures mentioned in the abstract (how many samples of each type).
3) Please provide details about the technical quality of the datasets, including reproducibility between the replicates and the role of missing values.
4) Please provide details on how you determined tomatine as an outlier and show its behavior.
5) Please provide details on how you defined significant regulation (row 173) and significant outlier (row 236), including statistical test applied and p-value.
6) Please specify which statistical test was used to determine Gene Ontology and pathway enrichment analysis and justify why "30 most specifically up-or down-regulated proteins were selected" (rows 254-255).
7) Please clarify which results you mean by "All reported p values are from two-sided Student's t-test." and justify the assumption about normally distributed data (rows 543-544).
8) Please include in the text a short description of the OPLS-DA modelling and its implementation.

Minor comments:
The idea of applying connectivity map for proteomics, while not completely novel, is nevertheless interesting. Is it possible to cross-confirm the analysis results using data from the previous proteomics and mRNA studies?
Overall, the text is somewhat hard to follow. It seems to be best suited for experts knowing exactly this narrow topic.
The references appear to be broken in the Materials and Methods section. Fig. 5f is missing, corresponding to figure legend "f, the enrichment of "poly(A) RNA binding" (13 proteins, p=7.47E-06) and "ribonucleoprotein complex biogenesis" (6 proteins, p=0.04) as downregulated proteins in the merged dataset for vincristine. Data are represented as mean±SD."

Reviewer #1 (Remarks to the Author):
In the manuscript 'ProTargetMiner: a proteome signature library of anticancer molecules for functional discovery' Zubarev and co-workers present an initial proteome signature library for anticancer drugs and provide an R-based package for data analysis that provides clues about potential drug targets, mechanism of action and cellular response to drug treatment in general. In essence, this tool mimics on the proteome level mRNA expression-based tools such as Connectivity Map. The authors provide proteome maps for 56 anticancer drugs in A549 cells, which is a valuable resource in its own right, and suggest a framework of analysis methods and tools to infer potential targets and mechanisms of action from proteome signatures. This is an interesting resource for cancer drugs that complements existing target databases such as DrugBank and that can alert to potential additional targets and target- or off-target-related secondary effects on the proteome. Examples are currently limited to anticancer drugs, but the approach might be applicable to other indications provided that cell perturbations lead to a common biological endpoint (such as cell death). There are a couple of points that the authors might want to consider when revising their manuscript.
Response: Thanks for the fine analysis and appraisal of our manuscript.

Comment 1. The resource could be more impactful if the authors would implement this into an easily accessible database system that integrates future data by their own lab or from other researchers following the procedures outlined in the manuscript. Implementation into existing repositories such as DrugBank or ProteomicsDB might be an alternative to generating their own system.

Response:
The provided R Shiny package embeds the data described in the manuscript and allows users to expand the database or upload their own datasets. Our data can also be easily merged with user data using the provided accessions for all the proteins. Furthermore, to address the reviewer's comment, ProTargetMiner will now be directly available online through this web page: http://protargetminer.genexplain.com/

Comment 2. The discussion is a very nice and balanced reflection of what their approach can deliver. This balance is not consistently achieved in the results section, which sometimes fails to report the absence of expected findings. For example, sorafenib is first and foremost a kinase inhibitor, but none of its known kinase targets is detected in the OPLS-DA plots. Likewise, dasatinib was primarily designed as a (BCR-)Abl inhibitor and has been shown to very potently inhibit a large number of tyrosine kinases, hardly any of which seem to show up in their plots.

Response:
We tried to provide a similar balance in the results section. Unfortunately, missing values are an inherent drawback of shotgun proteomics, resulting in false negatives (missing some of the targets). This issue is more prominent when data from multiple experiments are combined (Supplementary Figure 4 shows how merging the experiments comes at the expense of more missing values). Furthermore, as is now discussed in further detail in the paper, kinase targets are not the most suitable for ProTargetMiner analysis, because their inhibition often does not lead to significant protein abundance changes, likely due to the redundancy of kinase activity. This shortcoming is, however, ameliorated in the deep data sets, where we find multiple targets for dasatinib. In the merged deep data set, we uncover 4 known dasatinib targets (among the 10 that were available in the data set) out of 23 targets in DrugBank. Furthermore, we identify novel target candidates that have not been reported before. Across the OPLS-DA plots for the different cell lines, four known tyrosine kinases were identified overall: SRC, YES1, CSK and LYN. BCR-ABL was not among the top proteins.
Comment 3. The authors should state more strongly that specific anticancer drugs may be only very imperfectly represented in proteomics profiles despite a certain level of cytotoxicity. In cases where the primary target is not expressed or the cell system is not dependent on a certain pathway, proteomics signatures might instead be dominated by off-target effects. For example, the efficacy of the targeted covalent BTK inhibitor ibrutinib in certain forms of leukaemia will not be reflected in A549 lung cancer cells, despite the fact that the compound will be cytotoxic at high concentrations.

Response:
We expanded the discussion section with this consideration.

Comment 4.
There is hardly any novelty on inhibitors reported in the study. An exception might be the finding of dysregulation of cholesterol metabolism. Unfortunately, there is little follow-up that plausibly explains the mechanism underlying this dysregulation.

Response:
As a proof-of-principle study, the current work aimed mostly at "discovering" already known targets, which validated the approach; discovering truly novel targets was a secondary objective due to the difficulty and expense associated with validating such targets. Apart from the cholesterol-related findings, we have also provided data supporting that AXL up-regulation can be a resistance factor against sorafenib and regorafenib toxicity. Furthermore, we provide new insight into the action of some pyrimidine analogues, which are shown for the first time to induce ribosomal stress.
Comment 5. I have not found a detailed description of the OPLS-DA modeling presented in this paper, nor an explanation of the acronym for that matter. All in all, the description of statistical procedures is rather light touch.
Response: OPLS-DA is now described in greater detail and with more clarity in the paper. A figure (number 3) has been added to elaborate on OPLS-DA. The statistical procedures are also explained better, and the VIP parameter is discussed for prioritizing the targets.

Comment 6. The tools presented do not seem to give an estimate of the significance of individual observations, nor an estimate of the false discovery rate. Instead, the authors pick some proteins of interest at the extremes of the x-axis in their OPLS-DA plots but do not comment on many other proteins with similar values. There seem to be no stringent thresholds or other heuristic criteria for picking the proteins of interest either.

Response:
In OPLS-DA, the statistical variation between the replicates is automatically accounted for, with larger variations leading to a smaller x coordinate. Thus each protein at the extreme of the x axis is of high potential interest. As we now explain, the statistical significance of each of these outliers can be estimated from the plots (using VIP values), an example of which is now shown in Figure 4. Furthermore, p values against control for each data point are now provided in the R Shiny package. As for the false discovery rate, this is not an easy issue because the term "target" is only loosely defined, without any quantitative measure or threshold associated with it. The proteins of interest chosen to be shown are already known for the selected drugs and therefore serve as proof of principle. It is practically impossible to validate or comment on all the outlying proteins.
Comment 7. The section on the minimal size of the target miner system is a bit unclear. In figure 4, would it not be more useful to plot the median of all proteins with an indicator of spread rather than two randomly selected proteins that are not affected? Can the authors suggest an optimal initial set of compounds covering a wide variety of mechanisms to base the system on? There should be an influence of mechanistic diversity on the minimal size of the system.

Response:
The median of all proteins would be a horizontal line running at N/2, where N is the number of proteins. Regarding the optimal initial set of compounds covering a wide variety of mechanisms, the 9 drugs used for deep proteome analysis in three other cell lines serve as such a set, as these compounds were found to represent the most orthogonal proteome responses (most diverse mechanisms).
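The statement about the median can be checked with a small Python sketch. It assumes that the quantity plotted in the figure is each protein's rank among all N proteins, which is an interpretation rather than something stated explicitly here; the function name, N = 1000 and the random scores are hypothetical. Because the ranks are always a permutation of 1 to N, their median does not depend on the data.

```python
import random
import statistics


def median_rank(scores):
    """Rank proteins by score (1 = highest) and return the median of the ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return statistics.median(ranks)


rng = random.Random(1)
n_proteins = 1000  # hypothetical N
# Five unrelated random score vectors, standing in for different model sizes.
medians = [
    median_rank([rng.gauss(0.0, 1.0) for _ in range(n_proteins)])
    for _ in range(5)
]
```

Every entry of `medians` equals (N + 1)/2, which is approximately N/2 for large N, so a median-rank curve would indeed be flat. A spread indicator for a defined subset of proteins, rather than for all of them, would not have this property.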

Minor comments
Comment 8. In some cases the authors could have taken more care to provide the reader with the information needed to understand their rationale. For example, in the paragraph discussing the size of the target miner vs. the depth of the proteomics data set, the authors state that they acquired deeper proteomics data sets without detailing how much deeper or what was done differently. Judging from the details in materials and methods, they essentially doubled the number of fractions that were analysed with the same gradient. This information would be useful in the results, along with an average number of proteins or unique peptides per sample, to better understand the difference from the above. Similarly, the authors could at least provide some information about the cholesterol measurements in the results section.
Response: The required information was added regarding the deep proteome experiments in Supplementary Figures 4 and 9. The cholesterol results were further explained in the results section.
Comment 9. After presenting the main features of their system, the authors elaborate on opportunities to base the target miner system on less data and on the influence of the level of proteome coverage. It would be useful if the authors could include a concluding sentence for each of these sections stating the main messages.

Response:
We thank the reviewer for this comment and added concluding sentences for most of these sections.

Reviewer #2 (Remarks to the Author):
Comment 1. The authors have generated a proteome signature library for anticancer molecules at LC50 concentrations. The main dataset is based on 287 A549 adenocarcinoma proteomes affected by 56 compounds. In addition, the authors provide an R Shiny package to perform OPLS-DA modelling for a selected compound using the provided data or data by the user. While the proposed tool in principle might be a valuable tool for many researchers, it has some drawbacks that need to be addressed. Another limitation of the manuscript is a lack of some details, which makes the manuscript difficult to follow. Finally, the approach as such is not novel. The idea of extending the concept of connectivity maps to proteomics has already been presented (PMID 29655704; 90 drugs in 6 cell lines) and the same data analysis methodology has been used by the authors in their previous publications (e.g. PMID 29572246).

Response:
We thank the reviewer for the meticulous analysis of our manuscript. We are aware of the proteomics connectivity map paper (PMID 29655704) and have duly cited and discussed this paper. ProTargetMiner, however, does not pursue a connectivity map approach. We believe that the novelty of ProTargetMiner is in using for every drug the concentration that causes an equivalent biological effect (LC50 at 48 h), which allows for more adequate mapping of the cell state, and more accurate determination of the targets and action mechanism.
Furthermore, unlike connectivity maps, ProTargetMiner provides specifically regulated proteins. Indeed, OPLS-DA modeling has been used in our previous publication (PMID 29572246), but here we have greatly increased the specificity by including many more compounds. Besides, unlike our previous approach, the ProTargetMiner database is easily expandable. To include a new compound, one needs to analyze only 6 proteomes from a single cell line.

Major comments:
Comment 2. The OPLS-DA method used by the authors as a key component in the analysis to interpret protein regulation and drug specificity comes with its caveats. It is known that OPLS-DA can easily produce statistically unreliable group separation and is even sometimes used as an alternative method if, for example, PCA fails to separate the groups (PMID 27547730). If OPLS-DA is the chosen modelling technique, the authors should thoroughly validate their findings using e.g. permutation testing and cross-validation. This is an important step regardless of the evidence that the authors found from literature for their "counter-intuitive results" (row 266).

Response:
We agree that the statistical reliability of the OPLS-DA models is a very important issue, and have now provided a detailed explanation of how the statistical uncertainty in protein OPLS-DA coordinates is evaluated (see our reply to Comment 6 of Reviewer 1). This uncertainty derives from the variability between the three replicate analyses, and it is more reliable than permutation. The latter ignores such an important parameter as the number of unique peptides used to quantify a given protein, while we used this parameter to pick the most reliable proteins among those with similar x-coordinates. Cross-validation is not applicable here, as it would require orders of magnitude more replicates than three, which would be impossible to obtain. It should also be noted that here OPLS-DA has been used to separate only two groups, and that the OPLS-DA loadings are shown and used for target deconvolution. Forced and unreliable separation with OPLS-DA usually refers to studies where multiple classes are separated. PCA would not be able to strictly separate two classes. Furthermore, we now provide p values against controls for every protein on the plot in the R Shiny package. The validity of the findings can also be verified by checking the expression of the respective proteins in the data set.
Comment 3. The authors are performing multiple normalization methods sequentially for their data (rows 536-542) and should provide a rationale for this. For instance, when the log2 protein ratios between the sample and all control samples are scaled to have zero median fold changes, there is a great possibility that much of the produced signal is biased. This is especially true if the samples are behaving very differently, which is often the case when cancer cell lines are treated with intense compounds.

Response:
The biological effect of all compounds in our study was equivalent (50% viability after 48 h), thus in that respect all compounds were equally intense. It was therefore logical to assume that the total protein concentration per living cell was similar for all compounds, although some variations could of course occur. We did not explain this rationale because of