Identification of potential inhibitors based on compound proposal contest: Tyrosine-protein kinase Yes as a target

Searching a broader range of chemical space is important for drug discovery. Different methods of computer-aided drug discovery (CADD) are known to propose compounds in different chemical spaces as hit molecules for the same target protein. This study aimed to use multiple CADD methods through open innovation to achieve a level of hit molecule diversity that is not achievable with any single method. We held a compound proposal contest in which multiple research groups participated and predicted inhibitors of tyrosine-protein kinase Yes. The contest allowed us to examine whether collective knowledge based on individual approaches helped to obtain hit compounds from a broad range of chemical space and whether the contest-based approach was effective.

Scientific Reports | 5:17209 | DOI: 10.1038/srep17209
The compound proposal contest was organized by the Initiative for Parallel Bioinformatics (IPAB). It started on January 7, 2014 and ended on March 20, 2014. Ten groups participated in the contest. Any group could participate if its members agreed that all proposed compounds and methods would be made public. The participants were asked to propose a prioritized set of 120 compounds. We selected the top 50 compounds from each group and 118 additional compounds via clustering analysis of the submitted compounds, for a total of 600 unique compounds. In an inhibitory activity assay, 24 of the 600 compounds showed inhibition to varying degrees, and seven were identified as potential hit compounds. Among the potential hits, the IC50 values of three compounds with novel structures were estimated. The salient features of the methods, experimental validations, and potential inhibitors are discussed below.

Details of the contest
Compound library. Enamine Ltd provided a collection of approximately 2.2 million small compounds that are commonly used in high-throughput screening (HTS) studies to identify potential hit compounds. The compounds were readily available in the Enamine library; therefore, we used them for screening. Enamine libraries are available at http://www.enamine.net/.
Computational methods. Different methods have been adopted to identify potential inhibitors of Yes, and they can be roughly classified into the following two categories: the protein structure-based method (SB) and the ligand-based method (LB). Here, we define SB as a docking simulation or a geometric hashing technique that utilizes protein structure. LB involves screening techniques based on structural similarity comparison to known inhibitors or SAR derived from known inhibitors. The comparative analysis of the methods used by the 10 different groups is presented in Table 1. Some groups employed a multi-step approach, where LB was employed to screen the Enamine library and SB was applied to the resultant compounds (denoted by LB → SB). Others used LB and SB simultaneously to screen the Enamine library (denoted by LB&SB).
Protein structure-based method (Groups 1, 2, 3, 5, 7, 9, and 10). In this approach, the structure of the target protein is the main focus for identifying potential inhibitors. Initially, the 3D structure of the target protein is obtained via homology modeling using Modeller [33][34][35], FAMS [36][37][38], etc. Potential inhibitors are then identified using docking, SAR, and molecular dynamics (MD) simulations. The SB method was used by seven of the competing groups: 1, 2, 3, 5, 7, 9, and 10.
Group 1 (G1): A BLAST search of the PDB for homologs of Yes that contain ligands identified 25 structures. The Tanimoto indices between the 25 ligands and the Enamine library compounds were calculated, and 1241 compounds with indices > 0.55 were chosen for screening. The protein structure used in the docking simulation was created using the template structure with the smallest P-value. Finally, the 1241 compounds were docked using ChooseLD based on the FPAScore function.
Group 2 (G2): A series of sequence alignments and binding site investigations were performed to create a protein structure used in the subsequent docking simulation and SAR studies. The following three approaches were employed separately: docking, SAR, and similarity comparison with known inhibitors. The docking method used two machine learning systems 39 based on three docking simulation packages. The SAR model was created using PubChem BioAssay data 40 (AID 686947) and was applied to the Enamine library. For similarity comparison, a small number of known inhibitors of Yes were selected from the PubChem BioAssay data, and the Enamine library was searched for similar compounds. Finally, the two-dimensional (2D)/3D structures of those compounds as well as their binding poses were compared with those of the native inhibitor found in the target structure.
Group 3 (G3): Known inhibitors were selected from the literature [41][42][43][44][45][46] and PubChem BioAssay data 40 (AID 686946). The Enamine library compounds were screened by similarity search using the known inhibitors. Then, docking simulation of the screened compounds was conducted. The model protein structure was selected based on the docking poses that reproduced those of the known ligands PP2 and dasatinib. Finally, the compounds selected via LB and SB were screened using the pseudo-consensus method 47. In total, 53 compounds from LB, 53 from SB, and 14 from the pseudo-consensus method were included in the final list.
Group 5 (G5): A multiple-template ligand was modeled using 70 protein-ligand complexes whose protein sequence identity with Yes was > 70%. Proteins and their bound ligands were superimposed by the protein structure alignment program MICAN 48 against a modeled Yes structure. The Enamine library compounds were compared with the multiple-template model using the geometric hashing technique 49, where the score was defined as the number of coincident atoms minus the protein-compound crash penalty, to identify potential hit compounds 50.
Group 7 (G7): MD simulation and fragment molecular orbital calculation of the Src-dasatinib complex structure were performed to identify residues that interact with dasatinib with high retention or high interaction energy, respectively. This information was used to define constraints for the docking simulation, i.e., to specify the residues that the docked compounds should interact with. The protein structure for the docking simulation was created using homology modeling.
Group 9 (G9): Data on active ligands were collected from BindingDB 51, and their physicochemical characteristics were computed and compared with those of the 2.2 million Enamine library compounds for primary screening. Homology modeling and MD studies were performed to select the best structural orientations, and the resultant eight structures (one homology and seven MD structures) were independently subjected to docking simulation of the screened compounds, active inhibitors, and decoys. The screened compounds with high docking scores were considered only when the protein structures used could supersede those of the active inhibitors and decoys in terms of scores.
Group 10 (G10): The modeled 3D structure of Yes was validated by analyzing its binding poses with known ligands (dasatinib, saracatinib, and bosutinib). These predicted binding poses were captured as a consensus SB pharmacophore model, which was used to screen the Enamine library. To further prioritize the compounds, an enriched substructure filter, which was derived using Src family kinase inhibitors retrieved from BindingDB 51, was applied to the screened compounds. The resulting list of 2000 potential hit compounds was clustered. Clusters were prioritized after visual inspection, and a representative or the best hit of each cluster was chosen for the final list.
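The similarity prefilter described for Group 1 (keeping library compounds whose Tanimoto index against a reference ligand exceeds 0.55) can be illustrated with a minimal sketch. The source does not state which fingerprint was used or how the 25 indices per compound were combined, so the set-of-on-bits fingerprints, the "best match over all references" rule, and the names `tanimoto` and `prefilter` are assumptions made for illustration only.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto index of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefilter(library: dict, references: list, threshold: float = 0.55) -> list:
    """Return IDs of library compounds whose best Tanimoto index against
    any reference ligand exceeds the threshold (assumed combination rule)."""
    selected = []
    for compound_id, fingerprint in library.items():
        best = max((tanimoto(fingerprint, ref) for ref in references), default=0.0)
        if best > threshold:
            selected.append(compound_id)
    return selected


# Hypothetical toy fingerprints (on-bit positions), not real compounds.
references = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
library = {
    "cpd_A": frozenset({1, 2, 3, 5}),  # shares 3 of 5 bits with the first reference
    "cpd_B": frozenset({1, 7, 8, 9}),  # shares 1 of 7 bits with the first reference
    "cpd_C": frozenset({10, 11, 12}),  # identical to the second reference
}
print(prefilter(library, references))
```

In practice the fingerprints would come from a cheminformatics toolkit; the thresholding logic stays the same.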

Ligand-based method (Groups 4, 6, and 8).
In the LB approach, potential hit compounds were primarily identified using the activity data for available kinases. Three of the groups (4, 6, and 8) used this approach.
Group 4 (G4): The IC50 values of tyrosine kinase inhibitors were downloaded from Kinase SARfari 52, and relevant indices (pIC50, ligand lipophilic efficiency 53, binding efficiency index 54, and surface efficiency index (SEI) 54) were calculated. The indices were related to physicochemical properties. Support vector regression models were trained on the experimental indices; physicochemical properties such as hydrophobicity, volume, and pI of the 36 amino acid residues surrounding the ATP binding sites (ABS36) 55; and compound descriptors. Three models (SEI OETree, SEI MACCS, and SEI OBFP2) were found to predict experimental values better than the other models. These models were applied to the Enamine library to predict the active compounds. It should be noted that Group 4 focused on identifying compounds with good SEI rather than good inhibition activity.
Group 6 (G6): PubChem BioAssay data of 858 compounds 56 of Yes (AID 686947) were downloaded, and their inhibition rates were normalized to a concentration of 15 μM using linear interpolation between the nearest neighbors of the actual measured activities. A random forest model 57 was trained on the activities and a set of molecular descriptors, and the model was used to predict the potential hit compounds.
Group 8 (G8): PubChem BioAssay data of 858 compounds 56 (AID 686947) of Yes were downloaded. Compounds with an activity < 1 μM were defined as active inhibitors, and the rest were classified as inactive. In addition to these inactive compounds, some compounds in the Enamine library were defined as inactive to exploit large-scale inactive compounds for training the SAR model. The SAR model was developed by relating the activity data to 772 descriptors using the balanced random forest method 57,58 and was applied to the Enamine library to predict active compounds. An imbalance in the numbers of active and inactive compounds was addressed during training 59.
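The normalization step described for Group 6 (estimating each compound's inhibition rate at 15 μM from the nearest measured concentrations) might look like the following sketch. It assumes interpolation on a linear concentration axis and refuses to extrapolate; the source specifies neither choice, and the function name `rate_at` and the sample values are hypothetical.

```python
def rate_at(target_um: float, measurements: list) -> float:
    """Estimate an inhibition rate (%) at target_um by linear interpolation
    between the two measured concentrations that bracket it.

    measurements: list of (concentration_uM, inhibition_percent) pairs.
    """
    points = sorted(measurements)
    # An exact measurement at the target concentration needs no interpolation.
    for concentration, rate in points:
        if concentration == target_um:
            return rate
    # Otherwise interpolate between the nearest lower and upper neighbors.
    for (c_lo, r_lo), (c_hi, r_hi) in zip(points, points[1:]):
        if c_lo < target_um < c_hi:
            fraction = (target_um - c_lo) / (c_hi - c_lo)
            return r_lo + fraction * (r_hi - r_lo)
    raise ValueError("target concentration lies outside the measured range")


# Hypothetical measurements for one compound: (concentration in uM, % inhibition).
measured = [(10.0, 40.0), (20.0, 60.0), (5.0, 20.0)]
print(rate_at(15.0, measured))
```

Interpolating on a logarithmic concentration axis would be an equally plausible reading and would give slightly different values.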
Selection of compounds for experimental inhibitory assay. Initially, we selected the top 50 compounds from each of the 10 groups to obtain a total of 482 unique compounds. In addition, following cluster analysis, 118 additional compounds were selected using a scoring procedure and manual inspection. We used k-means clustering to classify the compounds into 10 clusters and subsequently computed the similarity score relative to structures of known inhibitors of Src family kinases deposited in ChEMBL using the Tanimoto coefficient based on the MACCS fingerprint 60. We selected compounds with a maximum similarity score of < 0.72 to identify novel inhibitors. We defined a consensus number for each cluster as the number of different groups that proposed any compound in that cluster. Based on the consensus numbers, we chose 118 compounds, for a total of 600 compounds to be tested in the inhibitory activity assay.
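The two quantities used in this selection step, the per-cluster consensus number and the novelty cutoff on the maximum similarity score, can be sketched as follows. This is only an illustration: how the consensus numbers were turned into the final 118 picks is not described in enough detail to reproduce, and the names `consensus_numbers` and `is_novel` and the toy cluster assignments are assumptions.

```python
from collections import defaultdict


def consensus_numbers(proposals):
    """Count, for each cluster, how many distinct groups proposed a compound in it.

    proposals: iterable of (cluster_id, group_name) pairs, one per proposed compound.
    """
    groups_by_cluster = defaultdict(set)
    for cluster_id, group_name in proposals:
        groups_by_cluster[cluster_id].add(group_name)
    return {cluster_id: len(groups) for cluster_id, groups in groups_by_cluster.items()}


def is_novel(max_similarity, cutoff=0.72):
    """True when the best similarity to any known inhibitor is below the cutoff."""
    return max_similarity < cutoff


# Hypothetical cluster assignments, for illustration only.
proposals = [(0, "G1"), (0, "G2"), (0, "G1"), (1, "G3")]
print(consensus_numbers(proposals))
print(is_novel(0.65), is_novel(0.80))
```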
Experimental procedures. We outsourced the inhibitory activity assay of the proposed compounds to Bienta (http://bienta.net/). Bienta used HTS to estimate the percentage inhibition rate of each compound at 10 μM. All HTS procedures were performed in accordance with the Promega protocol. Compounds that met any of the following criteria were treated as primary hits:
A. The compound's inhibition rate was higher than the average inhibition rate of all compounds in the same plate plus three times the standard deviation of the inhibition rates in the same plate. In this calculation, the inhibition rates for positive and negative controls were not considered.
B. The compound's inhibition rate was > 25% (except for the compounds classified in "A"). This condition was implemented to eliminate false negatives.
C. The compound's inhibition rate was the highest in its group (except for the compounds classified in "B").
The compounds that met any of these criteria proceeded to the secondary assay, which was conducted on one 384-well plate. Each compound was tested in six wells, and the average of the six values was used to determine the percentage inhibition rate. We used an individual inhibition rate of > 30% to identify the compounds that could potentially serve as inhibitors of Yes.
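Criteria A and B and the six-well averaging of the secondary assay can be expressed as a short sketch. Several details are assumptions: the sample standard deviation is used (the text does not say which estimator was applied), the 30% cutoff is applied to the six-well average, criterion C is omitted because it needs the group membership of each compound, and the function names and plate values are hypothetical.

```python
from statistics import mean, stdev


def primary_hits(plate_rates: dict, fixed_cutoff: float = 25.0) -> dict:
    """Label compounds on one plate with criterion 'A' or 'B'.

    plate_rates maps compound ID -> inhibition rate (%), controls already excluded.
    """
    rates = list(plate_rates.values())
    # Criterion A threshold: plate mean plus three standard deviations.
    threshold_a = mean(rates) + 3 * stdev(rates)
    labels = {}
    for compound_id, rate in plate_rates.items():
        if rate > threshold_a:
            labels[compound_id] = "A"
        elif rate > fixed_cutoff:
            labels[compound_id] = "B"
    return labels


def secondary_hit(well_rates, cutoff: float = 30.0) -> bool:
    """Average the replicate wells and compare the mean with the cutoff."""
    return mean(well_rates) > cutoff


# Hypothetical plate: 18 inactive compounds, one strong and one moderate inhibitor.
plate = {f"cpd_{i}": 0.0 for i in range(18)}
plate["cpd_high"] = 90.0
plate["cpd_mid"] = 26.0
print(primary_hits(plate))
print(secondary_hit([31, 29, 35, 28, 33, 30]))
```

A mean-plus-3-SD rule can only flag a compound when the plate holds enough compounds, since a single outlier also inflates the standard deviation it is compared against.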

Results and Discussion
Common compounds identified by different methods. Ten groups each submitted 120 compounds for a total of 1200 compounds. The analysis of the submitted compounds showed that 17 compounds overlapped between two groups, and one compound was proposed by three groups. In total, 75% of the overlapping compounds were proposed mainly by groups that utilized known ligand information directly or indirectly (G2, G3, G6, and G8). The higher overlap rate can be attributed to the shared information that these LB methods employed, i.e., PubChem BioAssay AID 686947. All of the overlapping compounds were selected for the inhibitory activity assay. The behaviors of these compounds were distinct: two were identified as potential hit compounds and the others did not show any inhibition in the assay experiments.
Inhibition rates of selected compounds. We conducted inhibition assay experiments at 10 μM for all of the selected 600 compounds to measure the percentage of inhibition. Among them, 24 compounds satisfied our primary hit conditions and were tested in the secondary assay. The secondary assay was conducted on a single plate, and the results are shown in Table 2. Critical evaluation of these 24 compounds identified seven compounds as probable inhibitors of Yes, and their structures along with their inhibition rates are presented in Table 3. These seven compounds were tested again from fresh powders to confirm the inhibition rates. Compounds with inhibition rates > 50% in the fresh powder assay were Z1546610485 (56.3%, G2, G6, and G8), Z820655914 (89.0%, G5), Z1546616191 (95.4%, G6), Z1157725083 (65.0%, G8), and Z653349554 (66.7%, G10).
Compound Z1546610485 was identified by three groups that employed the LB method (LB or LB&SB in Table 1). The compound is known as gefitinib and is listed in the PubChem BioAssay database 56 (AID 686946 and AID 686947) as a tyrosine kinase inhibitor, which may explain why it was independently proposed by three groups. This shows that the LB methods used by these groups could correctly identify a hit compound. On the other hand, Z1546616191, a known tyrosine kinase inhibitor named sunitinib that is also listed in the database, was proposed only by G6. It is unclear why the other LB groups did not propose it in their lists of 120 compounds. The number of compounds tested in this contest (a minimum of 50 per group) is insufficient to derive conclusive insight.
According to the PubChem BioAssay data, the inhibition rates of gefitinib and sunitinib at 8.6 μM with 4 nM Yes, 0.1 mM ATP, and 0.3 mg/mL substrate (poly Glu:Tyr = 4:1) were 72.1% and 93.9%, respectively, indicating consistency of our data with the literature. The inhibition rates of a few compounds exceeded that of gefitinib in this study.
Among the 118 compounds that were added by the clustering analysis, there were no potential hit molecules. This might be because the compounds were selected so that their similarity score to known inhibitors was < 0.72.
Principal component analysis was performed on the MACCS fingerprints 60 of a randomly sampled 10% of the compound library, followed by the projection of the sampled compounds (Random in Fig. 1A), the assayed compounds of each group (G1-G10), and the seven potential hits (seven Hits) onto PC1 and PC2. The cumulative variances of PC1 and PC2 were 26% and 50%, respectively, indicating that PC1 and PC2 could well account for the chemical space of the compound library. Figure 1A shows that compounds submitted by the same group tend to gather in the chemical space, i.e., the chemical space covered by compounds submitted by only one group tends to be small. To quantify the coverage, we divided the chemical space into 13 bins along both PC1 and PC2, as shown in Fig. 1A, and counted the number of grid cells that contain at least one compound of the group concerned. The coverage numbers were as follows: Random: 124, G1: 13, G2: 12, G3: 26, G4: 3, G5: 14, G6: 19, G7: 27, G8: 18, G9: 18, G10: 18, G1-10: 54 (G1-10 contains all the compounds that all the groups submitted), and Known: 89. These values show that the chemical space covered by the compounds of any single group tends to be small. On the other hand, the coverage of the merged compounds, G1-10, was comparable to the chemical space of known Src inhibitors. Because the seven potential hits were distributed over a region of chemical space that no single group covered and were structurally different from each other (see structures in Table 3), the contest-based approach can enhance diverse sampling. The coverages of G3 and G7 were relatively high. G3 employed three different approaches and compiled its compound list from all three methods. G7 employed the more SB-oriented method that utilized information of lesser known inhibitors.
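The coverage count described above (a 13 × 13 grid over PC1 and PC2, counting cells that contain at least one compound) can be reproduced with a few lines. This sketch assumes equal-width bins over fixed axis bounds and clamps points on or beyond the upper edge into the outermost cell; the axis bounds actually used are not given in the text, and the function name `coverage` and the sample coordinates are hypothetical.

```python
import math


def coverage(points, bounds, bins: int = 13) -> int:
    """Count grid cells that contain at least one point.

    points: iterable of (pc1, pc2) coordinates.
    bounds: ((pc1_min, pc1_max), (pc2_min, pc2_max)) spanned by the grid.
    """
    (x_min, x_max), (y_min, y_max) = bounds

    def bin_index(value, low, high):
        index = math.floor((value - low) / (high - low) * bins)
        # Clamp values on or beyond the edges into the outermost cells.
        return min(max(index, 0), bins - 1)

    occupied = {
        (bin_index(x, x_min, x_max), bin_index(y, y_min, y_max)) for x, y in points
    }
    return len(occupied)


# Hypothetical projected coordinates for one group's compounds.
group_points = [(0.5, 0.5), (0.7, 0.2), (12.9, 12.9), (13.0, 13.0), (5.5, 7.5)]
print(coverage(group_points, bounds=((0.0, 13.0), (0.0, 13.0))))
```

With 13 bins per axis the maximum possible coverage is 169 cells, which puts the reported values (for example, 124 for the random sample and 54 for the merged submissions) in context.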
Figure 1B-D show the number density of the compound library, known Src inhibitors, and assayed compounds, respectively, in the chemical space. The number density map of the assayed compounds is not similar to that of the compound library but is similar to that of the known Src inhibitors, indicating that the assayed compounds were not just randomly chosen but enriched toward Src inhibitors.
Characteristic features of submitted and assayed compounds. Figure 2 shows the distribution of the six chemical properties for these three sets of compounds. We observed that for four of the six considered properties, there was a marked difference between the average values of the submitted compounds (AlogP: 3. . Notably, the average ROTB value of the submitted compounds (ROTB: 4.7) was smaller than that of the Enamine library (ROTB: 5.5), which is closer to the average ROTB value of the Src family kinase inhibitors (ROTB: 6.4), although the average ROTB value of the potential hit compounds was 5.7. In addition, the average MW of the submitted compounds (MW: 361) was similar to that of the Enamine library (MW: 365), although the average MW of the potential hit compounds was 391, which is closer to the MW of the Src family kinase inhibitors (MW: 455). This analysis suggests that the prediction methods could be improved by considering ROTB and MW. In particular, for docking simulation, special consideration would be necessary when known inhibitors have a large ROTB value, because it is more difficult to cover the conformational space of such compounds. We also surveyed the novelty of the submitted compounds. Figure 3 shows the distribution of maximum Tanimoto similarity coefficients for the submitted compounds compared with the known Src family inhibitors. We also measured the difference between the average similarity scores of different approaches to understand the effect of methodologies on compound selection. The 10 methods used in the contest are classified into two filter types (see the "Computational methods" section): SB: G1, G2, G3, G5, G7, G9 and G10; and LB: G4, G6 and G8. The average similarity scores for SB and LB are 0.765 and 0.767, respectively, indicating no apparent difference. Because G4's method was designed to identify compounds with good SEI, it may affect the average similarity score. The average similarity score for LB without G4's compounds is 0.824, indicating that compounds proposed by SB were more novel than those proposed by LB.
Table 3. Potential hit compounds in validation assay (from fresh powder). a The compound is known as gefitinib. b The compound is known as sunitinib.
Comparison of different approaches for identifying the potential hit compounds. A systematic comparison of the methods used to identify potential hit compounds can provide insight into drug design. Among the seven potential hit compounds, six were proposed by groups that either mainly or partly adopted an LB screening process or a ligand template (pharmacophore) derived from known inhibitors. The other compound was proposed by a group that also utilized such information, in this case to select, from several model structures, a protein structure for docking that could discriminate between known active and inactive compounds with respect to docking scores. Therefore, this study indicates that the use of experimental binding affinity or binding poses is necessary to identify potential inhibitors. This finding highlights the importance of analyzing specific interactions to select potential hit compounds. The application of machine learning techniques helped to map the input features to binding affinity. Further, SB methods combined with pharmacophore modeling and docking could be useful in identifying potential hit compounds. Overall, the comparison of methods indicates the importance of balancing LB and SB methods to identify inhibitors. Furthermore, we observed that the inclusion of visualization and detailed analysis is important for identifying potential hit compounds. With respect to speed, LB methods are faster than SB methods, and machine learning techniques could aid successful prediction. As for the novelty of the potential hit compounds, the LB methods identified compounds similar to the known Src family kinase inhibitors (e.g., the similarity scores of Z1157725083, Z1546616191, and Z1546610485 relative to known inhibitors were 0.80, 1.0, and 1.0, respectively). On the other hand, the SB methods predicted compounds with relatively lower similarity to the known Src family kinase inhibitors (e.g., the similarity scores of Z820655914, Z126204226, and Z653349554 were 0.75, 0.77, and 0.79, respectively). When novelty is of interest, an SB method with the aid of known inhibitor information and/or docking poses is a good choice.
In addition, we calculated two different ligand efficiency indices: inhibition rate (%) divided by MW or by topological polar surface area (TPSA), as shown in the Supporting Information (Supplementary Fig. S1). Two compounds (Z1095352660 and Z993990690) proposed by an LB approach (G4) are plotted in the upper-left corner of Supplementary Fig. S1. These compounds are small (MW of 151 and 255 for Z1095352660 and Z993990690, respectively) but have relatively high ligand efficiencies compared to their sizes (inhibition rate/MW: 0.26 and 0.10, respectively; inhibition rate/TPSA: 2.5 and 2.0, respectively). G4's method focused on SEI and successfully identified compounds with high ligand efficiency comparable to those of known inhibitors.
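The two efficiency indices defined above are simple ratios, sketched below for clarity. The function name `efficiency_indices` and the input values are hypothetical and do not correspond to any compound in this study.

```python
def efficiency_indices(inhibition_pct: float, mw: float, tpsa: float) -> dict:
    """Return inhibition rate (%) divided by MW and by TPSA."""
    return {
        "per_mw": inhibition_pct / mw,      # % inhibition per unit molecular weight
        "per_tpsa": inhibition_pct / tpsa,  # % inhibition per unit polar surface area
    }


# Hypothetical compound: 40% inhibition, MW 160, TPSA 20.
print(efficiency_indices(40.0, mw=160.0, tpsa=20.0))
```

Because both indices divide by a size-related quantity, a small compound with modest inhibition can rank above a larger, more potent one, which is the behavior described for the two G4 compounds.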
Comparison between the potential hit and negative compounds. The list of compounds that did not show any inhibition of Yes is presented in Supplementary Table S1. These compounds can be used as decoys for docking and other studies. We have analyzed the characteristic features of the seven potential hit compounds and performed a comparison with the negative compounds. The physicochemical features of all 1180 proposed compounds, 24 active compounds, and seven selected compounds, along with the 574 decoys, are shown in Fig. 4. Because there are no significant differences between the selected compounds and the rest of the submitted compounds, the negative compounds could be good decoys and may be helpful to further refine active inhibitors.
The strategy of using different approaches together. The contest-based approach is the outcome of ten individual methods (Table 1), which are independent of each other in several respects: (i) different templates to obtain the target in the SB approach, (ii) databases of actives and decoys in the LB method, (iii) variations in software packages for identifying the hits and docking, and (iv) scoring procedures for ranking the hit compounds. Although the main objective of each method is to identify lead compounds by covering a large chemical space and utilizing standard procedures, none of them is able to identify all the hit compounds that have been observed experimentally. We anticipated that each method would identify a few hit compounds and that most of these would not overlap with each other. Hence, we adopted the strategy of collecting the top-ranked compounds from each method and verifying the hit compounds using experimental techniques.
The advantages of using different approaches together. Each prediction method utilized advanced techniques and reliable procedures reported in the literature for identifying potential hit compounds. The overlap among the compounds proposed by different methods was minimal, and all methods provided a diverse list of compounds with a strong basis for understanding the activity. Further, no single method is sufficient to identify the hit compounds, and it is not possible for a single group to perform all computational methods. The outcomes of the methods complement each other, and hence the combination of methods could help to identify the hit compounds realistically. Interestingly, the hit compounds identified by experiments were proposed by different groups that participated in the contest. The contest-based approach made it possible to narrow down the experiments from 2.2 million to 600 compounds, and 24 of them were identified as hits.
Suggestions for the future based on the experience gained in this contest. The outcome of the contest-based approach provides several insights for future directions: (i) comparative performance of structure-based and ligand-based approaches for identifying the hits, (ii) a list of actives and decoys for the target cYes kinase, which could be used to refine existing methods and validate new ones, (iii) probable interactions and binding modes for target-based drug design, (iv) utilizing an efficient, reliable, and wide range of information for identifying lead compounds, and (v) a combination of methods to identify and rank potential compounds. By looking back at known experimental data on several ligands, it is possible to predict additional compounds with better affinity and to understand the mechanism.

Conclusions
We conducted a contest-based approach to identify various inhibitors of the tyrosine-protein kinase Yes. In total, 10 groups participated in the contest and tackled the challenge using their own methods. The proposed compounds from all the groups collectively covered a more diverse chemical space than the compounds proposed by any single group, indicating that a contest-based approach can supply the early stage of drug discovery with various initial inhibitors. The contest was also successful in identifying 24 compounds with inhibition activity and seven potential hit compounds. The IC50 evaluation of Z820655914, Z653349554, and Z1157725083 by the 8-point curve showed that the values of Z820655914 and Z1157725083 were > 100 μM. The values for Z653349554 suggested that it had reacted with a reagent. The potential hit compounds can be further considered for the next phase of drug design. Our study revealed that using information about known inhibitors or their docking poses was necessary for both the LB and the SB approaches.