Characterizing Cancer Drug Response and Biological Correlates: A Geometric Network Approach

In the present work, we apply a geometric network approach to study common biological features of anticancer drug response. We use for this purpose the panel of 60 human cell lines (NCI-60) provided by the National Cancer Institute. Our study suggests that mathematical tools for network-based analysis can provide novel insights into drug response and cancer biology. We adopted a discrete notion of Ricci curvature to measure, via a link between Ricci curvature and network robustness established by the theory of optimal mass transport, the robustness of biological networks constructed with a pre-treatment gene expression dataset and coupled the results with the GI50 response of the cell lines to the drugs. Based on the resulting drug response ranking, we assessed the impact of genes that are likely associated with individual drug response. For genes identified as important, we performed a gene ontology enrichment analysis using a curated bioinformatics database which resulted in biological processes associated with drug response across cell lines and tissue types which are plausible from the point of view of the biological literature. These results demonstrate the potential of using the mathematical network analysis in assessing drug response and in identifying relevant genomic biomarkers and biological processes for precision medicine.

In this paper, we propose the use of certain tools from discrete geometry to gain new insights into cancer drug response. For this purpose, we tested our methodology on a panel of 60 human cancer cell lines (NCI-60). It has been more than 30 years since the U.S. National Cancer Institute (NCI) established a human cell line panel for the purpose of discovering novel cancer drugs. The NCI-60 panel was designed to recast the previous murine-based drugs from leukemia treatment to the treatment of more diverse human solid tumors. This departure was due to the difference and diversity of the biology of human tumors from murine leukemia 1 . This panel was developed as part of the NCI's Developmental Therapeutics Program (DTP, http://dtp.nci.nih.gov) to screen in vitro response to over 100,000 chemical compounds and natural products including FDA-approved anti-cancer drugs and those currently undergoing clinical trial. This ongoing service is accepting global submissions and continues screening up to 3,000 small molecules per year as potential anti-cancer therapies. The NCI-60 panel represents nine tumor types: leukemia, breast, central nervous system, colon, skin, lung, ovarian, prostate, and renal cancers. The NCI-60 panel is thus an established tool for in vitro drug screening and has significantly improved the philosophy and research of human cancer drugs 2 . This panel has led to many important discoveries, including a general advance in the understanding of the mechanism of cancer and the action of drugs 3,4 . Moreover, comprehensive genomic data including transcript expression data, protein expression data, sequencing (mutation) data, DNA copy number, and methylation as well as drug screening data on the 60 cell lines make it a unique resource for system pharmacogenomics and systems biology 5 . Most importantly for our work, this data resource enables us to explore both pre-treatment genomic data and drug responses of a notable number of FDA approved anticancer agents (~130) which is unmatched by any other cancer databases 1 .
We are interested in analyzing the gene interaction network for these cell lines via mathematical tools. In recent years, there have been tremendous efforts to elucidate the complex mechanisms of biological networks by investigating the interactions of different genetic and epigenetic factors. Given that gene/protein interactions inherently form a mathematical network, it is reasonable to expect that mathematical tools can facilitate a better Methods Background on curvature. In the present study, we employed an analogue of Ricci curvature to analyze cancer protein expression networks. In the classical continuous setting, the Ricci curvature tensor provides a way of measuring the degree to which the geometry determined by a given Riemannian metric differs from that of ordinary Euclidean space; see 7 for all the details. We briefly sketch the main ideas to motivate the discrete definition applicable to the networks of interest.
Assume that M is a complete connected Riemannian manifold equipped with metric g. On such spaces, one has a notion of geodesic, namely curves which locally travel the shortest distance between points. Geodesics generalize the notion of straight lines in Euclidean space. Let x, y ∈ M be two very close points defining tangent vector (xy). Moreover, let ω be a tangent vector at x and ω′ be the tangent vector at y obtained by parallel transport of w along (xy).
Positivity for the Ricci curvature near x or y is characterized by the fact that the trajectories of the geodesics corresponding to ω and ω′ will approach each other. This can be compared to the situation of the traditional flat geometry of Euclidean space, where such geodesics are always equidistant from each other with their "direction" being unchanged by parallel transport. Equivalently, this positivity may be formulated by the fact that the average distance between two small geodesic balls is less than the distance of their centers. Ricci curvature along the direction (xy) quantifies this, averaged on all directions w at x. On the other hand, when the curvature is negative, the geodesics diverge. Lower bounds on the Ricci curvature prevent geodesics from diverging too fast and geodesic balls from growing too quickly in volume 7 . In other words, lower Ricci curvature bounds estimate the tendency of geodesics to converge. Interestingly, optimal transport offers a lower bound on the Ricci curvature in terms of entropy 17,18 . This fact will be exploited below to show that on a weighted graph one may use curvature as a proxy for functional robustness. Optimal mass transport theory is concerned with the problem of finding an optimal transport plan (relative to some cost function) for moving a given initial mass distribution (or gene expression levels in our case) μ into a final configuration ν in a mass preserving manner 11,12,16,23 . We will assume that μ and ν are normalized to be probability measures. Let X d ( , ) denote a metric measure space. Then the p-Wasserstein distance between μ and ν is defined as where the latter infimum is taken over all joint probability measures π on X X × whose marginals are μ and ν, i.e.: Consider the case X is a Riemannian manifold. The Wasserstein distance defines a Riemannian distance function on the space of the probability measures on X 13,14 . We denote this space by Using the theory of optimal transport, Lott, Sturm and Villani 17,18 derived an elegant connection between Ricci curvature, Ric, and the Boltzmann entropy, Ent. Namely, Ric ≥ k if and only if the entropy functional is displacement k-concave along the 2-Wasserstein geodesics, i.e. for all μ μ ∈ X P , ( and t ∈ [0, 1] we have: Robustness defined on networks. The relation (1) indicates a positive correlation between changes in entropy and changes in Ricci curvature that we express as We will describe notions of Ricci curvature and entropy on graphs below. Here, we show that changes in robustness, i.e., the ability of a system to functionally adapt to changes in the environment (denoted as ΔR) is also positively correlated with entropy 24 , and thus with network curvature (2): More precisely, the measure of robustness employed in 19 is the rate function, R, from the theory of large deviations 25 . One considers random perturbations of a given network that result in deviations of some observable. We let p ε (t) denote the probability that the mean of the observable deviates by more than ε from the original (unperturbed) value at time t. Since p ε (t) → 0, we want to measure its relative rate, that is, we set ε →∞ Therefore, large R means not much deviation and small R corresponds to a large deviation. In thermodynamics, it is well-known that entropy and rate functions from large deviations are very closely related 19 . The Fluctuation Theorem is an expression of this fact for networks, and may be written as In classical thermodynamics, the entropy controls the asymptotics of thermodynamical limits, and states of highest probability have maximum entropy.
The Fluctuation Theorem has consequences for just about any type of network: biological, communication, social, or neural 19 . In rough terms, it means that the ability of a network to maintain its functionality in the face of perturbations (internal or external) can be quantified by the correlation of activities of various elements that comprise the network. In the standard statement, this correlation is given via entropy. This has been reformulated geometrically in terms of curvature 21,22 . In fact, the correlations (2) and (4) yield (3). Therefore, by calculating the Ricci curvature, we can measure the robustness of the networks. We provide some examples of known graph models in the Supplementary Information. In all cases, the graphs that are expected to be more robust possess higher average Ricci curvatures (Supplementary Figs S2, S3). We will now give the precise definition for networks modeled as weighted graphs.
Curvature on weighted graphs. In discrete settings, we assume that our network is represented by an undirected and positively weighted graph, G = (V, E), where V is the set of n vertices (nodes) in the network and E is the set of edges. We set where the sum is taken over all neighbors z of x, and w xy denotes the weight of an edge connecting x and y (it is taken as zero if there is no connecting edge between x and y).
One of the key notions of Ricci curvature on a discrete metric measure space is the Ollivier-Ricci curvature. As discussed by Ollivier 10 and indicated in Fig. 1, the Ricci curvature of a Riemannian manifold can be characterized by comparing the average distance between small geodesic spheres and the distance between their centers. Ollivier then extended this idea from the geodesic sphere to an associated probability measure near a point on a metric space X.
Consider any metric  d V V : × → + on the set of vertices V. For example, d(x, y) may denote the number of edges in the shortest path connecting x and y. For any two distinct points x, y ∈ V, the Ollivier-Ricci (OR) curvature is defined as follows: where μ x , μ y are defined in (5). In the present study, we used the Hungarian algorithm 26 to compute the Earth Mover Distance on our reference networks. This discrete notion of Ricci curvature has been already used to investigate the robustness of cancer networks 21,22 .
Using this edge based notion of curvature, we can also define the scalar curvature of a given node in the graph as follows: where the sum is taken over all neighbors of x.
Gene expression and drug activity data. The summary of our methodology is shown in Fig. 2. The mRNA expression data for the NCI-60 human tumor cell lines were retrieved from the CellMiner web application (http://discover.nci.nih.gov/cellminer). Cellminer, written by the Genomics & Bioinformatics Group, (LMP, CCR,NCI) 5 , provides freely accessible analysis tools and downloadable data sets for exploring NCI-60 data. The database contains transcript expression values for several assays of the NCI-60 cell lines. This study utilizes Affymetrix HG-U133 (A-B) with GeneChip RMA (GC-RMA) normalization from this website. There was no expression information of Affymetrix HG-U133 (A-B) for LC:NCI-H23 in non-small cell lung cancer (NSCLC), and there were many missing GI50 values for the drug responses of ME:MDA-N of the melanoma cell line. Excluding these two cell lines resulted in 58 complete data sets (see Fig. 3(a)).
Using the gene expressions arrays, we found the gene-to-gene correlations to build our weighted networks. The underlying topology has been derived from Human Protein Reference Database (HPRD, http://www.hprd.org) 27 . Specifically, we took the intersection of the genes that appear in both HPRD data and the gene expression data, and then retained the largest connected component. The weights of the edges, however, come from NCI-60 gene expressions. Even though our method is robust to network topologies, we chose the HPRD due to its reliability (a manually curated database) compared to other databases 28 . Affymetrix HG-U133 (A-B) data contains 34,899 probes. For the repeated gene IDs, we used the average RNA expression of the corresponding probes. We used only probes with known gene names, resulting in 16,821 genes with RNA expression. We chose the intersection of these genes with the HPRD database as the nodes of the pre-treatment network. Overall, the network consists of 8240 genes. The weights of the edges are defined by the Pearson correlation between the gene expressions along the 58 cell lines. We further used the transformation of + corr i j (1 ( , )) 2 for the genes i and j to make a positively weighted network. The main advantage of using this affine transformation as compared to taking the absolute values of the correlations is that the transformation is invertible, and thus does not result in any loss of information of the weighted network. This is not true of the absolute value. In practice, however, using either absolute value or the affine transformation seems to give similar results 21,24 . We then found the Ollivier-Ricci curvature for all the edges, and the scalar curvature for all the nodes (genes) within the pre-treatment network. We further identified significant genes (nodes) for each drug in this network. These important genes were selected based on significant Spearman's correlations (Rs) values (p-value <0.05) between drugs' activity, 50% growth inhibition (GI50), and gene expressions across the cell lines. The correlation values can be either positive or negative depending on whether the given gene acts as an oncogene or tumor suppressor. The observed range for absolute value of the significant correlations was 0.26 < |Rs| < 0.68. Finally, we computed the average of Ricci curvature values for the significant genes associated with each drug. The programming was done primarily in Matlab.
The GI50 is the drug concentration resulting in a 50% reduction in the net protein increase during drug incubation as compared with the same increase in control cells 29,30 . Normalized ( log 10 − ) GI50 was retrieved using the R package rcellminer 31 . This package complements the functionality of CellMiner by providing programmatic data access as well as analysis and visualization tool. By computing the Spearman's correlation between the expression of all 8240 genes and the GI50 response of 129 FDA approved drugs along the 58 cell lines we constructed the correlation matrix (D) shown in Fig. 2D. This is a matrix of 8240 × 129 elements which includes the correlations for all the pairs of genes and drugs. As an example, Fig. 3(b) derived from the rcellminer package illustrates the Spearman's correlation between the gene expression of MYC (z-score) and methotrexate activity (z-score). In this case, the correlation is very significant (Rs = 0.41) with a very low p-value of 0.001, therefore, MYC would be one of the selected genes for methotrexate. We provide another example, the gene MAPK8 and drug Salinomycin with a significant negative correlation (Rs = −0.38, p-value = 0.004) in Supplementary Fig. S4.
Ranking drugs and gene ontology enrichment analysis. The FDA approved drugs correspond to 161 NSC drug numbers (numeric identifiers for substances submitted to the National Cancer Institute). Among them we analyzed those with less than 2 missing GI50 values across all the cell lines, resulting in 129 drugs. For each drug, we identified genes whose expressions were significantly correlated (p-value< 0.05) to GI50. We compared the Ricci curvature distribution of these significant genes across the top and bottom ranked drugs and also compared them to the calculated Spearman's correlation in Supplementary Figs S5, S6. The median number of the genes selected for each drug was ~500 with the minimum number of ~300. The average Ollivier-Ricci curvature of these genes was calculated for each drug.
We then sorted the 129 FDA approved drugs in ascending order according to the average Ricci curvature (Fig. 2E) arguing that since the effective drugs should be able to perturb the subnetwork of its significant genes, this subnetwork should possess a lower average Ricci curvature. We used this ranking of 129 drugs to assess the importance of 8240 genes in our network across cell lines. More precisely, we gave a linear weight of (129−r) + 1 to all the selected genes associated with the r-th ranked drug. Then, for each gene, we computed the sum of all these weights and ranked the genes in descending order of the total weights. Thus, the final gene score increases if a gene (a) is important for many drugs, and (b) strongly contributes to robustness across multiple drugs. For a biological analysis, the top 200 genes were selected where the histogram has an apparent sharp decline; see Fig. 4. We performed a gene ontology (GO) enrichment analysis on the top 200 ranked genes using the MetaCore software (Thomson Reuters). MetaCore is an integrated software system based on a manually-curated database of molecular interactions, molecular pathways, gene-disease associations, chemical metabolism and toxicity information. The list of 129 drugs includes 23 repeated drugs. For these repeated drugs, the GI50 data were different for the different NSC numbers, and therefore, they have different rankings. Although the rankings were close for these drugs, we chose the highest ranking as a representative for the repeated drug. Consequently, our final ranking consists of 106 drugs.

Results
We found the scalar Ollivier-Ricci curvature of all the 8240 genes in the pre-treatment interaction network discussed previously. The scalar curvatures range between −210.4 and 3.6 with an average of −5.2. Supplementary Table S1 presents the top 40 genes with the highest absolute value of Ricci curvature. The top two genes, TP53 and YWHAG, stand out with regards to their Ricci curvatures. A visualization of the pre-treatment network is provided in Supplementary Fig. S7. We then ranked the drugs based on the average Ricci curvature of significant genes for each drug. The top 30 ranked drugs are presented in Table 1. There are a number of very effective drugs that are ranked highly in this table; we elaborate on these drugs further in the discussion section below. The ranking of all 106 drugs are presented in Supplementary Table S2.
We validated this ranking by running the entire algorithmic pipeline nine times; see Fig. 2. Each time we excluded all the cell lines of one of the nine cancer tissues. We then computed the Spearman's correlation between drug rankings resulting from all the cell lines and those based on leave-one-out rankings. Interestingly, correlations are very high with low p-values, showing relative consistency of the results across cell lines. The color map in Fig. 5 illustrates these correlations. The rows (columns) of this symmetric map are numbered by excluding all the cell lines of one cancer type (Fig. 3(a)). For example, 'w/o Breast' corresponds to exclusion of all the five cell lines of breast cancer. The 10th row (column) corresponds to the case of considering all the cell lines. The  correlation values are also included in the color map. As we see in Fig. 5, the correlation is very significant among the rankings (0.19 < Rs < 0.82, p-value ≪ 0.05), yet it is less significant after excluding the cell lines of leukemia (4th column or row). In other words, ranking of the drugs after exclusion of cell lines of leukemia is different from excluding any of the other eight cancer cell lines. This highlights that the effect of leukemia in the drug ranking is different than other cancer tissues. Of note, this is not a computational effect alone, given that leukemia is not even the cancer type with the greatest number of cell lines (colon, melanoma, lung and renal have more cell lines; see Fig. 3(a)). In order to determine which subset of leukemia accounts for this observation, we repeated the pipeline for the 6 cell lines of leukemia. We found two clusters of cell lines for leukemia according to their drug rankings ( Supplementary Fig. S8). We then compared the drug rankings after adding these two subsets to the other cell lines. We discovered that a subset consisting of two cell lines most likely caused this deviation of leukemia from the other cell lines. Interestingly, these two cell lines belong to the same type of leukemia, namely, ALL. Of note, even though leukemia is different from other cancer types, there still exists a strong correlation with the others. The highest p-value (Rs = 0.19) that is between 'w/o Leukemia' and 'w/o Breast' is 0.03, which is still less than 0.05. The correlation between all cell lines and w4 (leukemia) is 0.33 with a p-value of 0.00012. The top 200 ranked genes are presented in Supplementary Table S4. The results of the gene ontology enrichment analysis of this 200 gene set are shown in Fig. 6(a). The top ten biological processes presented with very small p-values (<10 −8 ). These p-values correspond to the hypergeometric test performed by MetaCore using the number of input genes, ontology related genes, and the total number of genes in the database. The top three biological processes are all involved with cellular localization. Also, the protein-protein interaction network of this analysis is presented in Fig. 6(b). The network contains two hubs associated with the gene product of CUX1 and PRKACA, which will be elaborated upon in the discussion section below. We further investigated the connectivity of these 200 genes in our pre-treatment network to ensure that these genes are not simply spread across distant parts of the network. We found that a subset of 146 genes of these 200 genes are mutually connected by paths of length two (i.e., connected with two edges), which is shown in Supplementary Fig. S9. The gene ontology enrichment analysis of these 146 genes resulted in very similar top biological processes ( Supplementary Fig. S10), and included a branch of cellular localization, namely intracellular transport, with a p-value of 7.8 × 10 −6 . The table of the top 50 biological processes with their corresponding p-values is provided in Supplementary Table S3. Also, Supplementary Table S4 includes these 146 genes.
In addition to the analysis of all cancer cell lines, we performed our algorithmic pipeline separately for four specific cancer tissues with the greatest number of cell lines: melanoma (9 cell lines), lung and renal (both 8 cell lines) and colon (7 cell lines) cancers. We performed the gene ontology enrichment analysis using the top 200 genes (Supplementary Table S4) of these specific cancer tissues, and compared the results to the biological processes of all the cell lines. We present the results of the gene ontology enrichment analysis of the top 200 genes of renal, lung, melanoma and colon cancer tumors in Supplementary Figs S11, S12, S13 and S14. Interestingly, these cancer types share some similar biological processes to those resulting from all 58 cell lines, which we will discuss further in the next section. However, the cancer specific analysis of this database is limited due to the small number of cell lines in any given specific cancer.

Discussion
In the present study, we considered the NCI-60 panel comprised of 58 individual cancer cell lines derived from nine different tissues (breast, brain, colon, blood, skin, lung, ovarian, prostate and kidney). The gene expressions of the cell lines were used to construct the network consisting of 8240 genes. This is a weighted graph where the underlying interactions are derived from Human Protein Reference Database. In this graph, the weights of the edges are the gene-to-gene (Pearson) correlations. The network was analyzed by calculating the discrete Ricci curvature. Based on our arguments in the Methods section, we claim that this is a guide to the robustness of the network, i.e., the ability to withstand perturbations in the system. In our case, these perturbations are induced by the drugs.
The scalar Ricci curvature helps us to identify the important targets (genes) for the drugs within the network. We present the top 40 genes with the least negative value (highest absolute value) of Ricci curvature in the pre-treatment network in Supplementary Table S1. The top two genes, TP53 and YWHAG have very low scalar Ricci curvatures compared to other genes (See Supplementary Table S1). Of note, mutations of TP53 are present in more than 50% of human cancers, making it the most common genetic event in human cancer 32,33 . This gene has many connections to other genes within the network which also contributes to its extreme value of scalar Ricci curvature (See Supplementary Fig. S7). Even though our primary focus in this study is to identify the important genes based on drug response (Fig. 4) and the biological processes they are involved in (Fig. 6), we briefly discuss the roles of TP53 and YWHAG (top 2 ranked genes in the pre-treatment network) in cancer pathogenesis in the Supplementary Information. Furthermore, we measure the effectiveness of the drugs by the average Ricci curvature of the nodes (genes) it affects. For each drug, we found the significantly (positive/ negative) correlated genes by computing the correlation between GI50 and gene expression along the 58 cell lines. The significant genes for each drug were chosen based on p-values less than 0.05. On average ~500 genes were selected for each drug. These genes identify a subnetwork within our pre-treatment network which is affected by the drug. We provide the average discrete Ricci curvature of these genes for each drug. Since robustness and Ricci curvature are positively correlated, we can study the efficacy of the drug on the network by considering the average Ricci curvature of the subnetwork affected by that drug. If these subnetworks possess a higher average Ricci curvature, we expect them to show more resistance to the drug. In other words, drugs cannot effectively perturb the subnetworks with high average curvature. Therefore, the ranking of the drugs in ascending order (of average Ricci curvature) is a guide to the efficacy of the drugs for the cell lines. This network based view of the drug's effectiveness considers the gene interactions of the cell lines as well as the drug response. The table of top 30 drugs is presented in Table 1. We also provide the table of all 106 drug ranking in Supplementary Table S2.
Salinomycin, the first ranked drug, has recently been considered as a promising novel anti-cancer agent for targeting human cancer stem cells despite its not well-known mechanism of action [34][35][36] . The chemotherapeutic property of Salinomycin can overcome the resistance of tumor cells toward multiple drugs while selectively targeting the cancer stem cells. This antibiotic drug has been shown to kill breast cancer stem cells in mice at least 100 times more effectively than the known anti-cancer drug Paclitaxel. The study screened 16,000 different chemical compounds and found that only a few drugs, including Salinomycin, targeted cancer stem cells responsible for metastasis 37 .
Gefitinib, our second ranked drug, is a molecular targeted drug in the treatment of non-small cell lung cancer. Approximately 85-90% of lung cancer cases, the most deadly cancer in the US, are NSCLC tumors. Mutations in the EGFR (epidermal growth factor receptor) gene are present in about 10 percent of NSCLC tumors (https:// www.iressa-usa.com). EGFR overexpression leads to inappropriate activation of the anti-apoptotic Ras signalling cascade, thereby leading to uncontrolled cell proliferation 38 . Gefitinib competes with adenosine triphosphate at the ATP binding site in epithelial cells, blocking its tyrosine kinase activity, and consequently inhibiting EGFR signaling pathway, which can induce tumor cell apoptosis 39 . In 2015, gefitinib was FDA approved as a first line treatment in patients with metastatic NSCLC who harbor the most common types of EGFR mutations in NSCLC (exon 19 deletions or exon 21 L858R substitution gene mutations) 40 .
Omacetaxine mepesuccinate, also known as homoharringtonine, the 3rd ranked drug, was originally identified over 35 years ago as a novel plant alkaloid with antitumor properties. Its mechanism of action is thought to be inhibition of protein translation by preventing the initial elongation step of protein synthesis via an interaction with the ribosomal A-site 41 . It was approved by the FDA in October 2012 for the treatment of adult patients with chronic myeloid leukemia (CML) with resistance and/or intolerance to two or more tyroskine kinase inhibitors as the current first-line treatment 42 . Furthermore, clinical studies have shown activity of omacetaxine in other malignancies as a single agent or in combination with other therapies in acute myeloid leukemia (AML) and myelodysplastic syndrome (MDS), and studies are ongoing in this regard [43][44][45] . Also, a number common antitumor agents were ranked highly: doxorubicin, paclitaxel, fluorouracil (5-FU) and tamoxifen are commonly used in breast cancer treatment. Paclitaxel, vinblastine and irinotecan are often used in NSCLC. Homoharringtonine, azacitidine, and arsenic trioxide are common anticancer agents against leukemia.
The leave-one-out validation of drug rankings suggests that the rankings are not highly dependent on specific cancer tissue. As in clinical practice, this supports the use of anticancer drugs for the treatment of different cancer types. Overall, the Spearman's correlations are higher among solid tumors as compared to the liquid tumor, Leukemia. Network-based analysis can also help to understand important biological processes in which the most correlated genes of the top ranked drugs are involved. To this end, we ranked the genes by using our drug ranking. The genes that are correlated to the top ranked drugs are given higher weights. The genes were then ranked in the descending order of the sum of the weighted values. Therefore, the top ranked genes are more correlated with the drug sensitivity in the cell lines and are better targets in the network. The gene ontology analysis, performed via MetaCore, was employed to find the biological processes with which the top 200 genes are involved.
The top three biological processes are concerned with cellular localization. Cells consist of many different compartments that are specialized to carry out various tasks. Based on the gene ontology directory, cellular localization of a protein is involved with the process whereby a protein complex is transported to, and/or maintained in, a specific location within a cell, including the localization of substances or cellular entities to the cell membrane. The number of proteins that have reliable subcellular location annotations is approximately 20% of all known proteins to date 46 . It is of particular interest to determine if a potential target is a cell surface or secreted molecule which would be more easily accessible for the targeted drug approach 47 . Therefore, knowledge of the subcellular localization of a protein can significantly improve target identification during the drug development process 48 . We further repeated the gene ontology enrichment analysis of the subset of 146 among the top 200 genes, which are connected with a path of length two. A more specific ontology term which is a branch of cellular localization, namely intracellular transport, is included in the top biological processes of these genes (Supplementary Table S3). In this regard, disruption of normal intracellular protein transport is critically involved in the pathophysiology of a broad range of human malignancies. Aberrant localization of oncoproteins and tumor suppressors may result in their over-activation or inactivation, respectively, and this can allow evasion of anti-neoplastic therapy as well 49,50 . Therefore, the therapeutic targeting of the nucleocytoplasmic shuttling of macromolecules has emerged SCientifiC REPORTs | (2018) 8:6402 | DOI:10.1038/s41598-018-24679-3 as a promising novel therapeutic approach to treat human diseases including cancer. More specifically, targets involving the nuclear envelope and nuclear intracellular transport machinery including cargo proteins, transport receptors, Ran regulators, and the nuclear pore complex have been proposed for therapeutic intervention. Several agents have been developed against these targets, some of them with promising therapeutic windows 51 .
We also present in Fig. 6(b) the protein-protein interaction network identified by MetaCore corresponding to the top 200 genes. There are two notable hubs in this network: CUX1 and cAMP-dependent protein kinase which is a gene product of PRKACA. CUX1 is specifically important since it is a transcription factor to a number of our top ranked genes associated with the drug sensitivity. CUX1 (also known as CUTL1) is a homeobox transcription factor highly evolutionarily conserved and plays a known role in embryonic development, cell growth and differentiation in mammals 52 . Moreover, the role of CUX1 in drug resistance/sensitivity, has been explored. Specifically, gain-of-function as well as loss-of-function studies have shown that increased CUX1 activity significantly enhanced cell sensitivity and cancer tissue response to chemotherapy drugs and resulted in increased apoptosis and growth inhibition. In contrast, decreased CUX1 expression reduced cell sensitivity to chemotherapy drugs with fewer apoptoses and resultant drug resistance 53 . These studies, which have been in the context of gastric cancer, suggest an inverse association between CUX1 and drug resistance, implying that CUX1 is an attractive therapeutic target. Whether this phenomenon applies to cancers other than gastric cancer remains to be elucidated. The human PRKACA gene encodes the PKA catalytic subunit alpha (C α ) isoform. With regards to drug sensitivity/resistance, PRKACA is over expressed in invasive and anti-HER2 therapy (trastuzumab/ lapatinib)-resistant breast cancers. In addition to PRKACA conferring resistance to anti-HER2 therapy, it also impairs apoptosis 54 . Consequently, inhibition of PRKACA and/or its downstream anti-apoptotic effectors in combination with anti-HER2 therapy may increase the drug sensitivity. Its role in drug sensitivity/resistance for other cancers needs to be studied further.
Finally, we were interested in comparing the results of the gene ontology enrichment analysis of all 58 cancer type cell lines to some specific cancer tissues with the largest number of cell lines. The repeated algorithmic process for lung, renal, melanoma and colon cancers yields some consistency. Similar to all cancer type genes, the gene ontology enrichment analysis for the top genes involved with renal and lung cancers results in cellular localization and cellular component organization. However, most of the biological processes from melanoma and colon cancer gene ontology enrichment analysis are quite broad such as negative regulation of cellular process, which is also a top biological process of all cancer types as well as renal cancer. Of note, colon cancer has fewer cell lines (7 cell lines) than melanoma (9 cell lines) and lung and renal cancer (8 cell lines). However, as mentioned earlier, the cancer specific analysis of this database is limited due to the small number of cell lines in each specific cancer.
The need for network based techniques is becoming more and more prevalent as a result of the exponential growth of data in the genomic era. In this work, we utilized mathematical techniques to drive a network based analysis in order to explore the genomic and pharmacogenomics information of NCI-60. The framework of this study can be extended to find possible optimal combinations of drugs. Combining anti-cancer agents, whether cytotoxic or molecularly targeted, with different mechanisms of action is the most practical approach to overcome single drug resistance and can produce sustained clinical remissions. Also, the study described in the present work may be extended to tissue-specific drugs employing a more appropriate tissue-specific cell line database. This analysis on the NCI-60 is promising and supports efforts to analyze larger datasets with advanced network mathematics 55,56