Synthetic Lethality-based Identification of Targets for Anticancer Drugs in the Human Signaling Network

Chemotherapy agents can cause serious adverse effects by attacking both cancer tissues and normal tissues. Therefore, we proposed a synthetic lethality (SL) concept-based computational method to identify specific anticancer drug targets. First, a 3-step screening strategy (network-based, frequency-based and function-based screening) was proposed to identify the SL gene pairs by mining 697 cancer genes and the human signaling network, which had 6306 proteins and 62937 protein-protein interactions. The network-based screening was composed of a stability score constructed using a network information centrality measure (the average shortest path length) and the distance-based screening between the cancer gene and the non-cancer gene. Then, the non-cancer genes were extracted and annotated using drug-target interaction and drug description information to obtain potential anticancer drug targets. Finally, the human SL data in SynLethDB, the existing drug sensitivity data and text-mining were utilized for target validation. We successfully identified 2555 SL gene pairs and 57 potential anticancer drug targets. Among them, CDK1, CDK2, PLK1 and WEE1 were verified by all three aspects and could be preferentially used in specific targeted therapy in the future.

To identify SL interactions that could be efficacious in treating cancer, many approaches have been proposed. Current screening methods for potential SL gene pairs can be summarized in three categories. The first is based on model organisms (such as yeast or fruit flies). Their genomes are small and can be easily mutated and matched; therefore, gene silencing techniques are easier to conduct in model organisms. However, as with all homology-based inference from model organisms, most genes in the SL gene pairs of model organisms do not have homologous genes in the human genome. Even when homologous genes can be found in the human genome, their functions have undergone great changes and the pairs cannot be directly converted into human SL gene pairs 12 . The second screening method is gene silencing in mammals, for which two types of methods have been developed. One is based on speculation from a priori knowledge 13 . The potential SL gene pairs contain two kinds of genes, namely, mutant cancer genes and SL partner genes. Therefore, the SL partner genes have to be directly knocked down and tested one by one. The other is based on high-throughput experimental techniques for unbiased screening of the whole genome 14 . Ultimately, siRNA and CRISPR screenings proved to be the most reliable methods for detecting SL gene pairs 15 . However, compared to model genetic systems, human cell systems face greater challenges for genome-wide siRNA or CRISPR screening. Moreover, these approaches are considerably more expensive, labor-intensive and time-consuming, and many of the essential genes so identified turn out to be either restricted to these cell-line models or infrequently overexpressed in cancers 16 . For these reasons, the third screening method, based on computational methods, has attracted more and more attention.
Computational approaches, which can help to identify and prioritize potential SL gene pairs for further experimental validation, represent an attractive alternative compared to genome-wide siRNA or CRISPR-based human cell line screening approaches. These methods include human orthologous gene pairs inference from yeast SL genes 7,17 ; the use of robustness features in the cancer PPI network to evaluate the importance of gene pairs 18 ; a mutual exclusivity calculation using statistical models from gene mutation/transcriptional expression data 19,20 ; data-driven detection of SL (DAISY) that combined somatic copy number alteration, siRNA screening and cell survival and gene co-expression information and achieved a promising performance in data-driven SL gene pair identification 21 ; and a learning-based pipeline for training and prediction, which combined the three features of mutation coverage, driver mutation probability and network information centrality into a manifolds ranking model to generate a ranking list of potential SL pairs 16 .
However, the methods mentioned above either are not based on the human biological system or do not simulate the complex, interwoven human cellular environment well. Cells employ signaling pathways and networks to drive biological processes, in which genomic alterations might result in malignant signaling that then leads to cancer phenotypes 22 . In this article, the human system was abstracted into a human signaling network. The specific mutant gene was defined as a cancer gene and its SL partner gene was defined as the non-cancer gene. Then, we proposed a computational method using a 3-step screening strategy to identify SL gene pairs from the perspective of a network system. Next, from the SL gene pairs we identified, we extracted the non-cancer genes to obtain anticancer drug targets. Finally, we used 3 different aspects of data to validate parts of our results. Overall, the SL strategy contributes to the identification of anticancer drug targets and to drug repositioning.

Results
Human cancer signaling network. This study focused on high-frequency non-cancer genes, which have a greater impact on biological systems. Thus, the frequency of each non-cancer gene was counted as the number of times it lay on the shortest path between cancer gene and non-cancer gene pairs in the human signaling network (Fig. 1(a)). All of the nodes in the human signaling network were sorted by frequency in descending order. Then, the top 30% (740) of non-cancer genes were obtained to construct a network named the human cancer signaling network (HCSN) for further research. As shown in Fig. 1(b), HCSN includes 6153 proteins and 56976 protein-protein interactions, and 697 cancer genes were successfully mapped. Thus, non-cancer genes were paired with cancer genes to form 515780 (740 × 697) gene pairs, which were used as input data for the following 3-step screening strategy for identifying SL gene pairs.
Figure 1(b). The human cancer signaling network (HCSN). Blue nodes denote non-cancer genes; yellow nodes denote cancer genes; and edges represent protein-protein interactions. A larger node indicates a greater degree.
Scientific Reports | (2018) 8:8440 | DOI:10.1038/s41598-018-26783-w
SL gene pairs. We designed a 3-step screening strategy to predict the SL gene pairs in the HCSN, and the results are described herein.
First, we chose the network-based screening method to obtain the SL gene pairs. According to the stability score and 1000 randomized networks (P < 0.05), we obtained the significant SL gene pairs. Then, we screened the gene pairs based on the distance between non-cancer genes and cancer genes. The average distance between non-cancer gene and cancer gene was 2.90; therefore, we kept the gene pairs with distances no more than 2. After the first screening step, 9241 gene pairs were obtained.
Second, we chose the frequency-based screening method. We plotted the cumulative frequency percentage to obtain a reasonable frequency threshold (Fig. 2). As seen from the figure, the curve grew fastest over the top 50% of the cumulative frequency. Therefore, we focused on the 122 high-frequency non-cancer genes in this range. As a result, 4788 gene pairs were obtained.
Third, the function-based screening method was performed. The 4788 gene pairs from the second screening contained 749 genes and these genes were significantly enriched in 47 pathways (Fig. 3). These pathways could be divided into seven biological process categories, namely, cell growth and death, cell motility, signal transduction, endocrine system, immune system, cell community and growth. Many biological pathways in our results were found to be closely related to SL. For example, the HIF-1 signaling pathway, which activated the transcription of  genes involved in angiogenesis, cell survival, glucose metabolism and invasion, was used as a screening pathway for the discovery of SL gene pairs 23 . The PI3K-AKT signaling pathway 24 , the RAS signaling pathway 25 , the P53 signaling pathway 26 , and the mTOR signaling pathway 27 were also widely considered promising pathways for SL recognition and have attracted the interest of many researchers.
In addition, after function-based screening, we obtained 395 significantly enriched genes that formed 2555 SL gene pairs, which included 81 non-cancer genes and 314 cancer genes (Fig. 4). The average degrees of the light blue nodes and red nodes were 8.14 and 31.54, respectively. According to the concept of SL, we consider these 81 non-cancer genes to be potential and specific anticancer drug targets. Designing drugs against these non-cancer genes in cancers with specific cancer gene mutations could improve therapeutic efficiency and reduce side effects. However, many properties need to be considered before a protein can be used as a drug target, such as molecular weight, polarity, and tissue distribution in the body. Therefore, we combined the existing drug target information with the non-cancer genes in our SL gene pairs to explore suitable anticancer drug targets.
Potential anticancer drug targets. We used the existing drug-target interaction data and the 81 non-cancer genes in the SL gene pairs to extract specific anticancer targets and drugs that might be used in cancer treatment. After we annotated the 81 non-cancer genes with the drug-target information, 57 known drug targets (Table 1) were identified, of which 27 had been used as anticancer drug targets in clinical treatment. Using these 27 targets, we expect that specific and low-risk cancer therapies could be achieved. In our opinion, the remaining 30 targets, which are closely related to the occurrence and progression of cancer (for example, immune-related and anti-inflammatory targets), have the potential to become anticancer drug targets and could be used in anticancer drug repositioning.
In addition, the average degree of the 57 drug targets was 33.81, which indicates that those nodes interacted with more red nodes in the network (Fig. 4). Meanwhile, some light blue nodes, such as PAK1 and IL4, showed a large degree but were not known drug targets. The frequencies of PAK1 and IL4 were 269 and 60, respectively. PAK1 encodes a family member of the serine/threonine p21-activating kinases, also known as the PAK proteins. This specific family member regulates cell motility and morphology. In addition, PAK1 could be mapped into many pathways considered promising for SL recognition, such as the MAPK signaling pathway, focal adhesion, and the ErbB signaling pathway. The protein encoded by the IL4 gene is a pleiotropic cytokine produced by activated T cells. This cytokine is a ligand for the interleukin 4 receptor. In addition, it could be mapped into the T cell receptor signaling pathway and the Fc epsilon RI signaling pathway. Therefore, those light blue nodes with a large degree may also have great effects in specific anticancer therapy in combination with the SL gene pairs we identified.
Validation of the anticancer drug target. To verify the results, three aspects of the data were used. The first was SynLethDB 28 , which contained SL pair information collected from biochemical assays, computational predictions, text mining results and other related databases. We used the genes shared between SynLethDB and our predicted anticancer drug targets to validate the results. Because of the limitation of the SL gene pair data, 20 of the 57 known drug targets that we found were not included in SynLethDB. As a result, 18 of the remaining 37 anticancer drug targets were validated as SL partner genes in this database. These targets, with their corresponding cancer genes, constitute 35 SL gene pairs in our predicted results (Supplementary Table S1). The second was the known drug sensitivity data. Among the drug targets in these data, 13 overlapped with our results. In different cancer cell lines, a smaller IC50 value indicates higher drug sensitivity, and the corresponding drug target tends to have better effects in cancer therapy. More information is shown in Supplementary Table S2 (only IC50 values less than 0 are shown). Finally, we conducted text-mining to determine the relationship between the anticancer drug targets that we found and the genes related to cancer (or SL). The results showed that 52 of the 81 non-cancer genes were significantly associated with cancer (p < 0.05) and 16 of the 81 non-cancer genes were significantly correlated with SL. Furthermore, 12 anticancer drug targets were closely associated with both SL and cancer (see Supplementary Table S3). In total, 27 of the 57 anticancer drug targets were verified through the three different aspects, and 4 targets were verified in all three aspects of the data, as shown in Fig. 5(a).
In particular, the four overlapping non-cancer genes (CDK1, CDK2, PLK1 and WEE1), which were validated by the three data resources, were all known anticancer drug targets and clinical trial targets. Furthermore, CDK1, CDK2, PLK1 and WEE1 were also predicted to be promising anticancer targets in BRCA2-ovarian cancers by Bueno's research 29 . Therefore, we focused on the analysis of these four overlapping genes. First of all, CDK1 and CDK2, which are promising specific anticancer targets, are both members of the serine/threonine protein kinase family that participate in cell cycle regulation. Firstly, CDK1 can be the SL partner gene of the cancer genes KRAS and MYC. As reported, KRAS mutations have been found in approximately 20% of human cancers, but there is currently no therapy targeting them 30 . Thus, targeting the SL partner gene CDK1 in ovarian cancer patients carrying a KRAS mutation could be a good choice in anticancer drug research and development. Although the cancer gene MYC is a very attractive therapeutic target in the treatment of breast cancer, the direct inhibition of the MYC gene is still a great challenge and has not yet provided a clinically effective drug 31 . In MYC-dependent breast cancer, an alternative is to target MYC's SL partner gene CDK1, as reported in some small interfering RNA (siRNA) experiments 31 . Secondly, CDK2 was predicted to be an SL partner gene of p53 and MYCN by RNA interference techniques 32,33 . In p53-defective cells, CDK2 can separate mitogenic from anti-apoptotic signaling for SL 33 . The SL relationship between CDK2 and MYCN suggests CDK2 inhibitors as potential MYCN-selective cancer therapeutics 32 . Furthermore, CDK1 and CDK2 are both drug targets of the investigational drug Alvocidib, which is a synthetic flavonoid based on an extract from an Indian plant for the potential treatment of cancer. It works by inhibiting CDK, arresting cell division and causing apoptosis in non-small cell lung cancer cells 34 . According to the concept of SL, using Alvocidib to target CDK1 may selectively kill tumor cells with specific gene mutations. Then, PLK1, which was a drug target studied in acute myeloid leukemia, non-small cell lung cancer, and pancreatic cancer 34 , could be an SL partner gene of many cancer genes in our results. In the drug sensitivity validation, some cells were sensitive to drugs targeting PLK1, which indicates that PLK1 can participate in various cancers by forming SL gene pairs with many cancer genes. Furthermore, some research has identified PLK1 as a gene whose depletion was particularly detrimental to the viability of PIM1-overexpressing prostate cancer, which was particularly sensitive to PLK1 inhibition; this suggests that PIM1 might be used as a marker for identifying patients who will benefit from PLK1 inhibitor treatment 35 . Finally, WEE1 kinase could regulate CDK1 and CDK2 activity to facilitate DNA replication during S-phase and prevent unscheduled entry into mitosis, and cancers with defects in the FA and HR pathways may be targeted by WEE1 inhibition, which provides a basis for a novel SL strategy for cancers harboring FA/HR defects 36 . In addition to the four intersection genes, many of the other non-cancer genes that we identified have already been predicted as anticancer drug targets. For example, in the drug sensitivity experiment, IL6, which could be the SL partner gene of CDKN2A, RB1, STK11 and TP53, was a specific anticancer drug target in the prostate cancer DU-145 cell line when targeted by VX-702 37 .

Discussion
With the development of molecular biology, biological research has entered the post-genome era, which has made it possible to understand the function of an organism as a whole. Synthetic biological systems (human protein interaction networks) are complex, and each protein is a node in the complex network, which accomplishes each biological process through the coordinated interactions of its nodes. Thus, the biological network can be seen as an abstraction of the human biological system and provides a pre-screening step for follow-up in vitro and in vivo screening of anticancer drug targets. It can also save financial and material resources and time. The existing approach, which also used networks to identify SL gene pairs, was proven to be effective 18 . However, it only took into account the efficiency change caused by knocking out two nodes in the network. Since this change may sometimes be caused by knocking out a single gene node rather than the pair, we improved the method by considering the knockout of both a single node and two nodes, which was more reliable in our opinion. Furthermore, we adopted a multi-step screening strategy from several perspectives to obtain the SL gene pairs, which may yield better results.
Although this study has many advantages, there are some shortcomings. The most significant one concerns the data resources we used. On the one hand, there is the original data we used for this study. Although we integrated the cancer gene data and drug-target interaction data from different databases, more data should be included in the future to obtain more useful results. In this way, we will improve the accuracy of our results and reduce data limitations. On the other hand, there is the limitation of the validation data. The numbers of genes and drugs in the drug sensitivity experiment are relatively small, so we could only validate the overlapping genes between the existing data and our study. The SynLethDB database, which we used for validation, included 16976 SL gene pairs composed of 5157 genes. Only 7088 SL gene pairs (7088/16976 = 41.75%) composed of 2174 genes (2174/5157 = 42.16%) were found in our network data. At the same time, we compared all 5157 genes in SynLethDB with our 697 input cancer genes; only 369 genes overlapped (369/697 = 52.94%), which constituted 8582 SL gene pairs (8582/16976 = 50.55%) in SynLethDB. As can be seen from the above, the data contained in SynLethDB were very different from our input data. As a result, we could only validate the overlapping part between SynLethDB and our data. We also tried to make a comparison with other state-of-the-art computational SL finding methods. However, various computational methods provided potential SL gene pairs from different data resources and perspectives, such as the correlation of gene expression with mutations, gene co-expression in related biological processes, robustness in the cancer network or human conserved SL gene interactions, which may be the reason for the low coincidence rate of the SL gene pairs obtained from different computational methods. At the same time, we compared the SL gene pairs obtained by several such methods 7,16,18,20,21,38,39 . As shown in Fig. 5(b), the overlap among the SL gene pairs of these methods was very small (the details are shown in Supplementary Table S4). This was the case not only with our results but also with the others. The results from the different methods were complementary to each other in predicting the SL gene pairs 16 .
The 57 known drug targets that we found might be targets for anticancer drugs and could be used in drug repositioning. Focusing on these targets can accelerate the development of anticancer drugs. The other non-cancer genes, which have not previously been drug targets, may also have potential in cancer therapy. Moreover, in different cancer cells, mutations in the same cancer gene can lead to various functions; therefore, our follow-up study will focus on the different mutant types of the same genes, with the aim of finding more specific anticancer drug targets and corresponding sensitive drugs by combining them with the SL strategy.

Materials and Methods
Data sources. In this paper, the human signaling network, including 6306 proteins and 62937 protein-protein interactions, was collected and curated manually by Zaman 22 from previous studies 40-42 . The cancer genes were downloaded from the F-Census 43 and Cancer Gene Census 44 . We obtained 697 cancer genes after removing the redundant ones. Drug-target interaction data were collected from the DrugBank 45 , Therapeutic Targets Database (TTD) 34 and PROMISCUOUS 46 databases. In addition, we obtained 16976 human SL gene pairs from the SynLethDB database 28 . The drug sensitivity data and the gene mutation backgrounds of 639 cancer cell lines were gathered from a previous study 37 , which covered 88 cancer genes and 130 drugs under clinical and preclinical investigation.
SL screening. The overall workflow of our method is shown in Fig. 6. First, we constructed the human cancer signaling network (HCSN). Next, a 3-step screening strategy was used to obtain the SL gene pairs. Then we extracted the non-cancer genes from the SL gene pairs and analyzed them with the drug-target interactions to find the targets that were suited for anticancer drugs. Finally, we conducted the validation with prior data.

Construction of HCSN.
To get the HCSN, we removed the orphan nodes, peripheral interactions, self-loops and redundant interactions of the human signaling network and mapped the cancer genes into it. The human
Figure 6. The workflow of anticancer drug target identification. The human cancer signaling network (HCSN) was constructed to obtain SL gene pairs using a 3-step screening strategy. Non-cancer gene data and drug-target interaction data were combined to identify the anticancer drug targets. Several validations were performed on our results.
Obtainment of SL gene pairs. In this study, we designed a computational approach to predict SL gene pairs in the HCSN. It consisted mainly of a 3-step screening strategy: network-based screening, frequency-based screening and function-based screening.

1) Calculation of the stability score
Herein, the stability score was defined from the stability changes of HCSN when knocking out a pair of nodes and when knocking out just one node. According to the concept of SL, gene pairs with higher stability scores are more likely to be SL gene pairs. However, a stability change may be caused by just one node rather than by the combined effect of the gene pair. Therefore, we proposed a network information centrality-based approach that knocks out both the pair of nodes and each single node. The network information centrality-based stability score $S$ was then defined by formula (1), where $D_0$ was the average shortest path length of HCSN; $D_m$ and $D_n$ represent the average shortest path length of HCSN after removing the cancer gene node $m$ and the non-cancer gene node $n$, respectively; and $D_{m,n}$ was the average shortest path length of HCSN after removing both the cancer gene node $m$ and the non-cancer gene node $n$. $D$ was the average shortest path length of the network (calculated by the closeness in R package igraph 48 ) and it was defined as follows in formula (2):

$$D = \frac{1}{\frac{1}{2}N(N-1)} \sum_{i \geq j} d_{ij} \qquad (2)$$

here, $d_{ij}$ refers to the shortest path between the nodes $i$ and $j$; $N$ represents the total number of nodes in the network.
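The four quantities entering the stability score can be illustrated with a minimal sketch on a toy network. The node names, the toy adjacency map and the helper names (`bfs_distances`, `average_shortest_path_length`, `knockout_profile`) are hypothetical, and the sketch uses plain breadth-first search instead of the igraph routine used in the study. It also assumes that unreachable node pairs are skipped, which the text does not specify. Formula (1) is not reproduced here, so the sketch stops at the four path-length values and does not combine them into $S$.

```python
from collections import deque


def bfs_distances(adj, source, removed=frozenset()):
    """Hop distances from source to every reachable node, ignoring removed nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in removed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_path_length(adj, removed=frozenset()):
    """Mean d_ij over reachable node pairs of the network without the removed nodes.

    Assumption: unreachable pairs are skipped (not stated in the paper).
    """
    nodes = [n for n in adj if n not in removed]
    total, pairs = 0, 0
    for s in nodes:
        for t, d in bfs_distances(adj, s, removed).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else float("inf")


def knockout_profile(adj, m, n):
    """Return D_0, D_m, D_n and D_{m,n} for cancer gene m and non-cancer gene n."""
    return {
        "D0": average_shortest_path_length(adj),
        "Dm": average_shortest_path_length(adj, frozenset({m})),
        "Dn": average_shortest_path_length(adj, frozenset({n})),
        "Dmn": average_shortest_path_length(adj, frozenset({m, n})),
    }


# Hypothetical toy network: a 5-node ring m - x - n - y - z - m.
toy_adj = {
    "m": {"x", "z"},
    "x": {"m", "n"},
    "n": {"x", "y"},
    "y": {"n", "z"},
    "z": {"y", "m"},
}

if __name__ == "__main__":
    print(knockout_profile(toy_adj, "m", "n"))
```

In this toy ring, removing either node alone lengthens the average path, while removing both disconnects node x; how such disconnected cases were scored in the original analysis is not described, so the handling above is only one possible choice.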

2) Network randomization
To evaluate the significance, we calculated the P value of each gene pair using 1000 degree-preserving randomized networks (constructed by R package tnet 49 ). The formula to calculate the P values was as follows:

$$P = \frac{N_{S_{obs} < S_{random}}}{1000}$$

where $S_{obs}$ refers to the $S$ score obtained from HCSN and $S_{random}$ refers to the $S$ score obtained from a randomized network. $N_{S_{obs} < S_{random}}$ represents the number of randomized networks in which the $S$ score was larger than that in HCSN.

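The randomization test can be sketched as follows. This is an illustration under stated assumptions, not the tnet procedure used in the study: the degree-preserving networks are produced here by simple double-edge swaps, and the function names (`degree_preserving_rewire`, `empirical_p_value`), the toy edge list and the toy scores are all hypothetical.

```python
import random


def degree_preserving_rewire(edges, n_swaps, rng):
    """Randomize an undirected simple graph with double-edge swaps.

    A successful swap replaces (a, b) and (c, d) with (a, d) and (c, b),
    so every node keeps its original degree.
    """
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a shared node would create a self-loop
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # the swap would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def empirical_p_value(s_obs, s_random):
    """Fraction of randomized networks whose S score exceeds the observed one."""
    exceed = sum(1 for s in s_random if s > s_obs)
    return exceed / len(s_random)


# Hypothetical toy edge list and toy scores, for illustration only.
toy_edges = [("a", "b"), ("c", "d"), ("e", "f"), ("a", "c"), ("b", "e")]

if __name__ == "__main__":
    rewired = degree_preserving_rewire(toy_edges, 10, random.Random(0))
    print(rewired)
    print(empirical_p_value(0.8, [0.1, 0.9, 0.5, 0.95]))
```

In the study, `s_random` would hold the 1000 scores of one gene pair across the randomized networks, and pairs with P < 0.05 would be kept.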
3) Distance-based screening of the gene pairs
Distance-based SL screening played a vital role in the network analysis in our study. We considered the human signaling network to be very important in tumorigenesis and cancer progression. In the network, proteins next to each other may have similar functions and participate in similar biological processes. In other words, two proteins are more likely to be SL partners if they are closer to each other in the network. Therefore, we calculated the distance between every non-cancer gene and cancer gene, and then computed the average distance over those pairs. Then, we discarded the pairs for which the distance was larger than the average.
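A minimal sketch of this distance filter is given below, assuming hop-count distances on an unweighted network. The function names (`hop_distances`, `distance_screen`), the gene labels and the toy network are hypothetical; pairs with no connecting path are simply ignored here, which is an assumption.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first hop distances from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_screen(adj, cancer_genes, non_cancer_genes):
    """Keep (cancer, non-cancer) pairs whose distance does not exceed the mean."""
    distances = {}
    for c in cancer_genes:
        dist = hop_distances(adj, c)
        for n in non_cancer_genes:
            if n in dist:  # assumption: unreachable pairs are ignored
                distances[(c, n)] = dist[n]
    mean = sum(distances.values()) / len(distances)
    kept = {pair: d for pair, d in distances.items() if d <= mean}
    return kept, mean


# Hypothetical toy network: c1 - n1 - n2 - n3 - c2.
toy_adj = {
    "c1": {"n1"},
    "n1": {"c1", "n2"},
    "n2": {"n1", "n3"},
    "n3": {"n2", "c2"},
    "c2": {"n3"},
}

if __name__ == "__main__":
    kept, mean = distance_screen(toy_adj, ["c1", "c2"], ["n1", "n2", "n3"])
    print(mean, sorted(kept))
```

With the reported average distance of 2.90, the same rule corresponds to keeping pairs at a distance of no more than 2.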
Frequency-based screening of gene pairs. The development of cancer is often quite complex and usually involves multiple genes and pathways. We defined the nodes in HCSN with high degree as high-frequency genes. We assumed that the higher-frequency non-cancer genes in HCSN are more important in biological processes. Therefore, we used the frequency of non-cancer genes as a filter for further screening. According to the cumulative frequency percentage, we filtered out the low-frequency non-cancer genes and kept the high-frequency ones for further analysis.
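One way to read the cumulative frequency filter is sketched below: genes are ranked by frequency and kept until their running share of the total frequency reaches a cutoff. The function name `high_frequency_genes`, the 0.5 default and the toy frequencies are assumptions for illustration; the study chose its threshold by inspecting the cumulative curve.

```python
def high_frequency_genes(freq, cumulative_cutoff=0.5):
    """Return the top-ranked genes that account for the first share of total frequency.

    Genes are sorted by descending frequency and added until the cumulative
    fraction already covered reaches cumulative_cutoff.
    """
    total = sum(freq.values())
    if total == 0:
        return []
    ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    kept, running = [], 0
    for gene, count in ranked:
        if running / total >= cumulative_cutoff:
            break
        kept.append(gene)
        running += count
    return kept


# Hypothetical frequencies for five non-cancer genes.
toy_freq = {"g1": 40, "g2": 30, "g3": 15, "g4": 10, "g5": 5}

if __name__ == "__main__":
    print(high_frequency_genes(toy_freq))
```

Gene pairs whose non-cancer member is not in the returned list would then be dropped before the function-based step.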
Function-based screening of gene pairs. The occurrence and progress of cancer are closely related to cell survival, signal transduction, cell growth and death, etc. The SL genes were closely associated with cancer, and so, we thought that they played important roles in these cancer-related functions. To further identify SL gene pairs, we applied the genes from the above step for pathway enrichment analysis with DAVID Bioinformatics Resources 6.8 50 . Afterwards, we got the final SL gene pairs and some significant pathways which helped to exploit the identification of SL gene pairs.
Identification and validation of anticancer drug targets.
The identification of anticancer drug targets. We assumed that an anticancer drug target was a protein that could be targeted by at least one anticancer drug. To identify potential anticancer drug targets, we used the drug-target interactions and drug description information to annotate the non-cancer genes in the SL gene pairs identified above.
The validation of the anticancer drug targets. We validated our identified anticancer drug targets with three data sources. Firstly, the human SL gene pairs in the SynLethDB database were used. Secondly, the SL gene pairs were validated with the drug sensitivity experiment results of Garnett et al. An SL gene pair could be seen as a specific mutated cancer gene and a drug-targeted non-cancer gene. A cell line with the specific mutated cancer gene should survive poorly when a drug is added to target the SL partner of that gene. That is, the cell line should be highly sensitive to the drug. Thus, the drug sensitivity experiment of Garnett et al. was used to validate the anticancer drug targets we obtained. Thirdly, text-mining validation was applied. For gene G (the non-cancer gene in the SL pair), the number of studies in PubMed that mentioned gene G was K. The number of cancer-related (or SL-related) studies was M. The total number of studies in PubMed was N. Using a hypergeometric test, we calculated the probability that at least x of the K articles containing gene G demonstrated that gene G is associated with cancer (or SL). The significance threshold was set to 0.05, and all of the genes with a P value of less than 0.05 were considered to be cancer-related (or SL-related) genes.
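The text-mining test can be illustrated with a short sketch of the hypergeometric upper tail, using the symbols N, M, K and x from the paragraph above. The function name `hypergeom_upper_tail` and the toy counts are hypothetical. Exact integer binomials are used for clarity; at PubMed scale a log-space or library implementation (for example `scipy.stats.hypergeom`) would likely be more practical.

```python
from math import comb


def hypergeom_upper_tail(x, K, M, N):
    """P(X >= x) when K articles are drawn from N, of which M are cancer-related.

    X counts how many of the K articles mentioning gene G are also
    cancer-related (or SL-related).
    """
    upper = min(K, M)
    numerator = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, upper + 1))
    return numerator / comb(N, K)


if __name__ == "__main__":
    # Toy numbers: 10 articles in total, 4 cancer-related, 3 mention gene G,
    # and at least 2 of those 3 are cancer-related.
    p = hypergeom_upper_tail(x=2, K=3, M=4, N=10)
    print(p, p < 0.05)
```

A gene would be counted as cancer-related (or SL-related) when the returned probability falls below the 0.05 threshold.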