A novel graph convolutional neural network for predicting interaction sites on protein kinase inhibitors in phosphorylation

Protein kinase-inhibitor interactions are key to the phosphorylation of proteins involved in cell proliferation, differentiation, and apoptosis, which shows the importance of binding mechanism research and kinase inhibitor design. In this study, a novel machine learning module (i.e., the WL Box) was designed and assembled into the Prediction of Interaction Sites of Protein Kinase Inhibitors (PISPKI) model, which is a graph convolutional neural network (GCN) to predict the interaction sites of protein kinase inhibitors. The WL Box is a novel module based on the well-known Weisfeiler-Lehman algorithm, which assembles multiple switch weights to effectively compute graph features. The PISPKI model was evaluated by testing with shuffled datasets and ablation analysis using 11 kinase classes. The accuracy of the PISPKI model varied from 83 to 86% across the kinase classes, demonstrating superior performance compared to two baseline models. The effectiveness of the model was confirmed by testing with shuffled datasets. Furthermore, the performance of each component of the model was analyzed via the ablation study, which demonstrated that the WL Box module was critical. The code is available at https://github.com/feiqiwang/PISPKI.


Results
Experiment. Database and datasets. Protein-ligand complexes and interaction sites were collected from the sc-PDB three-dimensional database of ligandable binding sites 36 and grouped by protein UniProt identifiers from the Protein Data Bank 37 . In total, 1,064 protein-ligand complex datasets covering 22 protein kinases were extracted and categorized into 11 corresponding kinase classes, as shown in Supplementary Information A. A program was developed to convert each mol2 file into a model input file consisting of the feature matrix F ∈ {0, 1}^(35×N) and the structure adjacency matrix S ∈ {0, 1, 2, 3, 4}^(N×N) for each inhibitor molecule consisting of N atoms from the protein-ligand complex. According to the mol2 format, there are 35 atom types and eight bond types. The categorical features of every atom were one-hot encoded as a color label and aligned with the feature matrix F. Bonds defined as single, triple, dummy, unknown, or not connected were classified as TYPE 1; a double bond as TYPE 2; an amide bond as TYPE 3; and an aromatic bond as TYPE 4. The TYPE 1 category thus comprises five bond types (i.e., single, triple, dummy, unknown, and not connected); among these, the single bond is the most common in the dataset, and the remaining four are rare. The structure adjacency matrix S records the connection relationship between every two atoms of the inhibitor molecule together with the corresponding BOND TYPE. TYPE 0 is used for any two atoms whose bond type is not mentioned in the mol2 file.
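The encoding described above can be sketched as follows. This is a minimal illustration, not the authors' conversion program: the function name and input representation (a list of SYBYL atom-type indices and a list of typed bonds) are assumptions for the example.

```python
import numpy as np

N_ATOM_TYPES = 35  # SYBYL atom types defined by the mol2 format

def encode_molecule(atom_type_ids, bonds, n_atoms):
    """Build the feature matrix F (35 x N, one-hot colors) and the
    structure adjacency matrix S (N x N, BOND TYPE values) for one
    inhibitor molecule.

    atom_type_ids : list of int, SYBYL atom-type index (0..34) per atom
    bonds         : list of (i, j, bond_type) with bond_type in {1, 2, 3, 4}
                    (TYPE 1 = single/triple/dummy/unknown/not connected,
                     TYPE 2 = double, TYPE 3 = amide, TYPE 4 = aromatic)
    """
    # One-hot "color" label for every atom, aligned with F.
    F = np.zeros((N_ATOM_TYPES, n_atoms), dtype=np.int8)
    for i, t in enumerate(atom_type_ids):
        F[t, i] = 1

    # TYPE 0 (the default) marks atom pairs whose bond type is not
    # mentioned in the mol2 file; listed bonds are stored symmetrically.
    S = np.zeros((n_atoms, n_atoms), dtype=np.int8)
    for i, j, bt in bonds:
        S[i, j] = S[j, i] = bt
    return F, S
```

For example, `encode_molecule([5, 2, 2], [(0, 1, 1), (1, 2, 4)], 3)` yields a 35×3 feature matrix and a 3×3 adjacency matrix with a TYPE 1 bond between atoms 0 and 1 and an aromatic TYPE 4 bond between atoms 1 and 2.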
Setup. An individual prediction model was established for each kinase class. An inhibitor molecule with N atoms provides N data pairs (F, S) by assigning a specific mark to the label of each atom in the feature matrix F. If a marked atom binds with a residue of the kinase, the corresponding output is assigned a value of 1, and otherwise 0. The binding types between atoms and residues were ignored, as the binding state was the focus of this study. Each original dataset was expanded to one with at least 2,560 positive and 2,560 negative samples using the method described in the "Dataset expander program" section, and the resulting expansion rates are shown in the last column of Table 2. Each dataset was randomly split into three mutually non-overlapping parts: one tenth of the positive and one tenth of the negative samples for the test dataset, another tenth of each for the validation dataset, and the rest for the training dataset. The training datasets were used to train the PISPKI model of each kinase class for several epochs, and the models were evaluated with the validation datasets after each epoch. Furthermore, bootstrapping was applied to the training and validation datasets to uniformly assign samples at each epoch. The program was developed with PyTorch 38 . As shown by the model setup in Table 1, early stopping 39 was set to 5 epochs to avoid overfitting, and the accuracy of the sixth-to-last validation was recorded. After training was completed, each model was evaluated with the test dataset.
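The 1/10-1/10-rest split can be sketched as below. This is an illustrative helper, not the authors' code; the function name and dictionary keys are assumptions, and the split is applied to the positive and negative samples separately so that the three parts stay class-balanced and non-overlapping.

```python
import random

def split_dataset(positives, negatives, seed=0):
    """Split expanded samples into test/validation/training parts:
    one tenth of the positives and one tenth of the negatives each for
    the test and validation datasets, the rest for training, with the
    three parts mutually non-overlapping."""
    rng = random.Random(seed)
    splits = {"test": [], "validation": [], "training": []}
    for samples in (positives, negatives):
        samples = samples[:]          # copy so the caller's list survives
        rng.shuffle(samples)
        tenth = len(samples) // 10
        splits["test"] += samples[:tenth]
        splits["validation"] += samples[tenth:2 * tenth]
        splits["training"] += samples[2 * tenth:]
    return splits
```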
Noise elimination. Multiple protein-ligand complexes consisted of the same protein (kinase) and ligand (inhibitor) but with different interaction sites. This can produce confusing events, such as a kinase inhibitor with two crystal structures ( α and β ) in which an atom of the inhibitor binds with a residue of the kinase in crystal structure α but does not bind with any residue in crystal structure β.
To eliminate this type of noisy data, the notation B_I^(K) was used to denote the set of all atoms of inhibitor molecule I that bind with kinase K in any of its M crystal structures. With B_m^(K,I) designating the binding atom set for the mth crystal structure composed of inhibitor I and kinase K, the following definition is obtained:

B_I^(K) = B_1^(K,I) ∪ B_2^(K,I) ∪ ... ∪ B_M^(K,I).

Then, for atom i of the inhibitor molecule I, the interaction state Y_i^(K,I) for kinase K is determined by

Y_i^(K,I) = 1 if i ∈ B_I^(K), and 0 otherwise.

Dataset expander program. Because several of the original datasets were small (Supplementary Table S2), a dataset expander algorithm was developed, inspired by the expansion methods widely applied to image recognition datasets. A randomly assigned seed determines how the rows or columns of the matrices are reindexed; the reindex operation does not change the structure of the inhibitor molecule but creates a different input data pair. An example of the structure adjacency matrix expansion is shown in Fig. 1. Each input pair (F, S) from the original dataset is modified to a new pair utilizing the same seed for F and S while maintaining the same output Y, as follows:

F' = reindex_r(F),  S' = reindex_{r,c}(S),  Y' = Y,

where F', S', and Y' are the created feature matrix, structure adjacency matrix, and output, respectively. The subscript r,c indicates that the reindex operation is applied to both rows and columns, whereas r indicates application to rows only. By utilizing different seeds with an original pair, up to P(N, N) = N! different sample pairs can be obtained, where N is the number of atoms of the inhibitor molecule. The batch process of the expander program is shown in Algorithm 1. As mentioned in the "Experiment" section, the original datasets transformed from the mol2 format were collected with the atoms arrayed in a particular order; the dataset expander program can therefore also help the model improve compatibility with datasets collected from formats other than mol2.
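A single expander step can be sketched as follows. This is an illustrative reimplementation, not Algorithm 1 itself; it assumes the N × N_c orientation of F used in the "Method" section (atoms index the rows), and uses a seeded NumPy permutation in place of whatever seeding scheme the authors' program uses.

```python
import numpy as np

def expand_pair(F, S, seed):
    """Create a new training pair (F', S') by reindexing the atoms with a
    seeded random permutation.  The reindex does not change the molecular
    structure, so the output label Y stays the same.  Atoms index the
    rows of F (N x N_c) and both the rows and columns of S (N x N)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(F.shape[0])
    F_new = F[perm, :]                 # reindex rows of F only
    S_new = S[np.ix_(perm, perm)]      # reindex rows AND columns of S
    return F_new, S_new
```

Because the same permutation is applied consistently to F and to both axes of S, the pair (F', S') describes exactly the same molecule under a different atom ordering, which is what makes the expansion label-preserving.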
Performance comparison. The performance of the PISPKI model was assessed by comparison with Support Vector Machine (SVM) and Convolutional Neural Network (Conv-Net) models as baselines, where the SVM model applies the radial basis function kernel and the Conv-Net has a traditional architecture consisting of two convolutional layers and a fully connected layer. Feature matrices with zero-padding were used as input for the baseline models. The highest accuracy over 10 repeated experiments was recorded. A comparison of the proposed PISPKI model and the two baseline models is shown in Table 2.
The accuracy of the PISPKI model in predicting whether an atom from an inhibitor molecule is an interaction site mostly ranged from 83 to 86% across the kinase classes, notably better than that of the two baseline models. In addition, both the Conv-Net and SVM models were unstable across the datasets of different kinase classes, whereas the proposed model was not. Although the measured accuracy for the Circadian clock protein kinase was high, it is not necessarily reliable because the corresponding dataset contained only 16 protein-ligand complexes. Otherwise, the expansion rate has no effect on interaction site prediction, with the exception of extreme situations such as the Circadian clock protein kinase mentioned above. Nonetheless, the performance of the model can be improved by applying a small number of expansion operations: the accuracies for datasets with expansion rates of less than 3, such as the mitogen-activated protein kinase, tyrosine-protein kinase, serine/threonine-protein kinase, and cyclin-dependent kinase, are stable at about 86%.
Performance evaluation with shuffled datasets. The effectiveness of the PISPKI model was further assessed with shuffled datasets. Due to the limited number of complexes, portions of the validation datasets were randomly extracted and their interaction sites were shuffled to create the shuffled datasets. Two cases were considered: (1) the PISPKI model can still predict the interaction sites of the shuffled datasets with an accuracy equal to or greater than that of the test datasets; or (2) the model is not compatible with the shuffled datasets, or its accuracy is clearly degraded. The baseline accuracy was set to 50% to denote the "cannot work" state, since this is a binary classification problem. Case (1) would suggest that the model fits both correct and incorrect data, indicating a problematic state, whereas case (2) confirms the effectiveness of the model. The performances on the shuffled and test datasets for each kinase class are compared in Fig. 2. The PISPKI model is incompatible with the shuffled datasets, thereby validating its effectiveness.
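The label-shuffling sanity check above can be sketched generically as below. This is an illustrative helper, not the authors' evaluation code; `model_predict` and the argument names are assumptions. A classifier that has learned real structure should fall toward the 50% baseline on the shuffled labels.

```python
import random

def shuffled_accuracy(model_predict, samples, labels, seed=0):
    """Sanity check: randomly permute the interaction-site labels and
    measure accuracy against them.  An effective binary classifier is
    expected to degrade toward the 50% 'cannot work' baseline, whereas
    retaining high accuracy on shuffled labels would be a warning sign."""
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    correct = sum(model_predict(x) == y for x, y in zip(samples, shuffled))
    return correct / len(samples)
```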
Ablation study. To assess the necessity of each part of the PISPKI model, an ablation experiment was designed in which the performance of the model was measured after removing components. In the experiment, only four typical kinase class datasets with expansion rates of less than 3 were used. Four incomplete models, illustrated in Supplementary Information B, were successively constructed by (1) removing both WL Boxes; (2) removing one WL Box; (3) removing the convolutional layers; and (4) merging the 35 atom subtypes into 16 colors by combining the same chemical elements. The performance of the incomplete models was compared to that of the full PISPKI model (Fig. 3).
The performance of the PISPKI model was significantly compromised by removing both WL Boxes, which clearly decreased the accuracy on most datasets. In addition, the resulting model was incompatible with the mitogen-activated protein kinase dataset, confirming that the WL Box is the core of the PISPKI model. As shown in the third column of each dataset in the figure, reducing the number of WL Boxes to one had very limited influence on the model. Nevertheless, the comparison with the full model shows that adding extra WL Boxes improves performance over the truncated model. The convolutional layers appear to be an insignificant component on most datasets, although they still suggest potential advantages. Notably, the convolutional layers only process the original structure and feature label information, as mentioned in the "Method" section. Although the WL Boxes process the feature and structure information more elaborately, the original information processed by the convolutional layers facilitates more accurate inference of the interaction sites in complicated cases. In the last ablation experiment, the necessity of feature richness, which represents the quality of each feature, was tested. For this evaluation, datasets were collected from mol2 files, which encode atoms in the SYBYL format with 35 subtypes (colors); these were combined into 16 types (colors) based on chemical elements, as illustrated in Supplementary Information B Fig. S1(e). With this combination operation, the performance of the model on the different datasets decreased by various degrees. The results not only highlight the importance of the SYBYL atom types but also indicate that the performance of the PISPKI model can be improved by enhancing feature richness.
In the ablation experiments, the incomplete models were applied to examine differences in the performance loss observed across the datasets. The effects on the cyclin-dependent kinase and serine/threonine-protein kinase datasets were very limited. However, the tyrosine-protein kinase and mitogen-activated protein kinase datasets showed extremely low and high impacts, respectively. As mentioned above, the PISPKI model aims to solve the interaction site prediction problem; however, this problem comprises multiple different sub-problems depending on the kinase class, which is another reason why the accuracy of the baseline models varied so strongly across the kinase class datasets (Table 2).

Discussion
In this study, a novel machine learning model (i.e., a WL algorithm-based GCN) was designed and developed to predict the interaction sites of protein kinase inhibitors in phosphorylation. The accuracy of the model was consistently 83-86%, markedly better than that of the two baseline models, and it improved further on datasets with low expansion rates. The model performance can also be improved by increasing feature richness. At present, features are transformed from inhibitor molecules based on SYBYL atom types, which carry more information than bare chemical elements. More information about the atoms, such as radius, atomic mass, formal charge, and aromaticity, could be collected to further enrich the features. In addition, the protein kinase residue information should also be used as input to the model. Furthermore, the limited datasets affect the model performance: this research was restricted not only by input feature richness but also by the small number of datasets, because there have been relatively few investigations identifying the interaction sites of inhibitor molecules. The spatial pyramid pooling (SPP) module makes the model compatible with inhibitor molecules having different numbers of atoms. The importance of the WL Box was confirmed by the ablation study ("Ablation study" section), which showed that adding multiple WL Boxes can enhance performance. Although applying a complicated model to a simple problem is not recommended, owing to potential performance degradation caused by excess trainable parameters, the PISPKI model can predict most interaction sites and has the potential to address other complicated biological problems; for such applications, stable model performance is critical.

Method
Model. The architecture of the proposed PISPKI model is shown in Fig. 4. The model framework consists of four main parts: data preprocessing, WL Boxes, convolutional layers, and dense layers. First, each inhibitor molecule with N atoms is transformed into N pairs of feature and structure adjacency matrices, where the ith atom is marked in the ith feature matrix to predict whether the corresponding atom is an interaction site. Note that the output of the model is assigned a value of 1 if the marked atom is predicted to be an interaction site. The feature matrices and structure adjacency matrices contain the atom and bond information of the molecules, respectively. Because the number of atoms differs between molecules, the sizes of the two matrices vary across input data; notably, zero-padding is not applied to force all input data to the same size. After preprocessing, each pair of matrices is passed to two submodules: the WL Boxes and the convolutional layers. The WL Boxes mainly process the feature matrices, using the structure adjacency matrices as auxiliary information. Matrices with more significant features are obtained from the output of the WL Boxes; the pooling layer then processes the new feature matrices into fixed-length vectors by applying spatial pyramid pooling (SPP). In parallel, the convolutional layers process the structure adjacency matrices, in which some atom information from the feature matrices is embedded in the diagonal elements, and this output is also processed by the SPP into a fixed-length vector. The two resulting vectors are concatenated as input to the dense layers for binary classification.
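The first preprocessing step, turning one molecule into N marked input pairs, can be sketched as below. The marking scheme shown here (an extra indicator column appended to F) is an assumption for illustration; the paper states only that the ith atom is marked in the ith feature matrix, not the exact encoding of the mark.

```python
import numpy as np

def make_input_pairs(F, S):
    """Transform one molecule (F: N x N_c one-hot colors, S: N x N bond
    types) into N input pairs, where the ith pair marks the ith atom as
    the one whose interaction state is to be predicted.

    The mark is encoded here as a hypothetical extra indicator column
    appended to F; S is shared unchanged across all N pairs."""
    N = F.shape[0]
    pairs = []
    for i in range(N):
        mark = np.zeros((N, 1), dtype=F.dtype)
        mark[i, 0] = 1                       # mark atom i
        pairs.append((np.hstack([F, mark]), S))
    return pairs
```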
Remarkably, the output vector from the pooling layer is obtained by combining the pooled results of the updated feature matrices and the structure adjacency matrices, and the vectors derived from the updated feature matrices are longer than those from the structure adjacency matrices. The output of the WL Boxes therefore provides more information for the prediction of interaction sites by the classifier in the dense layers, which is why the WL Box module is the core of the PISPKI model.
Preliminaries. Let G(V, E) be an undirected graph representing an inhibitor molecule, where V = {v_1, . . . , v_N} is a set of atoms and E is a set of edges. Information on the atoms (i.e., the vertices of G) is represented by a binary feature matrix F of size N × N_c, in which each row corresponds to an atom and the row vector is a color vector representing the atom type. Information on the edges of G is represented by an adjacency matrix S of size N × N. S_ij (i.e., the element of the ith row and jth column) is assigned a value of 0 if {v_i, v_j} ∉ E; otherwise, S_ij denotes the bond type (i.e., S_ij ∈ {1, 2, . . . , N_e}).
Since the graphs represent chemical structures, there are no self-loops; thus all diagonal elements of S are assigned a value of 0. To effectively utilize the adjacency matrix, each vertex v_i is assigned a label index l_i according to the color c_m of its row of the feature matrix by

l_i = m such that F[i, m] = 1.

Then, the convolutional layer input matrix S_conv is obtained from the structure adjacency matrix S and the diagonal matrix of label indices as

S_conv = S + diag(l_1, . . . , l_N).

Weisfeiler-Lehman algorithm. The Weisfeiler-Lehman (WL) algorithm, which was first proposed in 1968 to solve the graph isomorphism problem 35 , has recently been widely applied in neural network models. For every vertex v_i, features from neighboring vertices are aggregated to update its own feature, which is computed as follows:

x'(v_i) = AGG( x(v_i), emb({x(v_j) : v_j ∈ N(v_i)}) ),

where x(·) and x'(·) are the original and updated features of vertices, respectively; N(v_i) denotes the set of neighboring vertices of v_i; emb is an embedding function based on neighborhood aggregation that concatenates the features of the neighboring vertices of v_i; and AGG is a custom function computing a new feature from the target vertex and its neighboring vertices. By implementing different functions, features can be updated in different ways. Even with large graphs, a few repetitions are sufficient to give the vertices distinctive features. Subsequently, the isomorphism of two graphs can be analyzed by comparing the feature sets of the updated vertices.
As mentioned in the "Preliminaries" section, every vertex has a solid color and every edge a label, and the colors of the vertices are updated individually with the colors of the neighboring vertices and the labels of the connecting edges, as illustrated in Fig. 5. The aggregation function is called mix, as it can blend multiple colors together. Here, the mixing ratio depends on the labels of the edges between the updated vertex and its neighboring vertices. After several repetitions of the algorithm, every vertex in the inhibitor molecule graph has a unique blended color as its feature.
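One color-mixing update of this kind can be sketched as follows. The edge-label-dependent mixing weights here are purely illustrative (the paper learns its switch weights inside the WL Box; fixed weights are an assumption for the example).

```python
import numpy as np

def wl_mix_update(X, S, edge_weight):
    """One WL-style 'mix' update: each vertex feature (color) x(v_i) is
    replaced by a blend of itself and its neighbours' colors, weighted
    by the bond-type label stored in the structure adjacency matrix S.

    X           : (N, d) array, one color vector per vertex
    S           : (N, N) int array, S[i, j] > 0 is the bond-type label
    edge_weight : dict mapping bond-type label -> mixing weight
                  (illustrative values, not the paper's learned weights)
    """
    N = X.shape[0]
    X_new = np.empty_like(X, dtype=float)
    for i in range(N):
        acc, total = X[i].astype(float), 1.0
        for j in range(N):
            if S[i, j] > 0:                    # v_j is a neighbour of v_i
                w = edge_weight[S[i, j]]
                acc += w * X[j]
                total += w
        X_new[i] = acc / total                 # normalized blended color
    return X_new
```

Repeating this update propagates color information outward one hop per iteration, so after a few repetitions each vertex's blended color encodes its local neighbourhood structure.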
Finally, the hidden states of the last neuron of each layer l at time step T are combined into a tensor F[i, j, l] as the output of a WL Box after processing with the activation function by

F[-, -, l] = σ( h_l^(T) ),

where F represents the output tensor of the WL Box, and F[-, -, l] denotes the lth block, defined as the two-dimensional matrix consisting of all elements of tensor F whose index in rank 3 equals l; σ is an activation function; and h_l^(T) is the last hidden state of the lth layer at time step T. Besides, the structure adjacency matrix S is invariant during processing in the WL Box and can be delivered unchanged to the next module if needed.
Multiple WL Boxes. Multiple WL Boxes can be implemented sequentially to further improve the feature matrix of the model; two WL Boxes are assembled in the model, as shown in Fig. 4. Notably, the second WL Box differs from the first in that it receives the tensor F as input rather than the matrix F, and so on for any further boxes.
Spatial pyramid pooling. The spatial pyramid pooling (SPP) layer, a novel and effective machine learning module proposed by He in 2015 40 , is applied in this study to normalize the outputs of the WL Boxes and convolutional layers. Different from classical pooling modules, SPP extends the region-of-interest operation: it pools a matrix with kernels of different sizes and then concatenates the pooling results into a single output vector. It applies hand-crafted pooling regions 41 whose kernel scales depend on the size of the input matrix, and adopts the spatial pyramid operation to obtain more comprehensive pooled feature maps, which are then converted to a fixed-length vector. A spatial pyramid consists of multiple stages, and each stage runs a pooling operation using the corresponding pooling coefficient. The notation k represents the pooling coefficient with k = 1, . . . , K, where K is the number of stages of the spatial pyramid. In each stage, a hand-crafted kernel is applied that yields a precise k × k output from the input two-dimensional matrix. To accommodate the differences in kernel size, each input block is first extended by

p_0^k = Ψ_k(F),

where F is a block of a tensor or a matrix of size N_c × N and Ψ_k is the matrix extension function for pooling coefficient k. By applying the function, F is extended to a matrix p_0^k ∈ R^(k·⌈N_c/k⌉ × k·⌈N/k⌉), and all extended elements are equal to 0. Here, the SPP layer receives the two tensors F and S from the last WL Box and the convolutional layer.
Two spatial pyramids are constructed (with stage numbers K_F and K_S) to process the two tensors individually; the pooling operation is applied to every block K_F and K_S times for the input tensors F and S, respectively. Maximum SPP is applied to the output tensor F of the WL Box. For the kth stage of the spatial pyramid, the element of the xth column and yth row of matrix p_f^k is computed by

p_f^k[x, y] = max( p_0^k[(x-1)·I + 1 : I, (y-1)·J + 1 : J] ),

where k denotes the pooling coefficient, k ∈ {1, . . . , K_F}; p_0^k is the extended matrix computed by Eq. (5); p_0^k[X : I, Y : J] denotes the submatrix collecting elements from the Xth to the (X + I)th columns and the Yth to the (Y + J)th rows; and (I, J) is the hand-crafted region of the kth stage, with I = ⌈N_c/k⌉ and J = ⌈N/k⌉. Similarly, average SPP is applied to the output tensor S of the convolutional layer. For the kth stage of the spatial pyramid, the element of the xth column and yth row of matrix p_s^k is computed by

p_s^k[x, y] = avg( p_0^k[(x-1)·I + 1 : I, (y-1)·J + 1 : J] ),

with k ∈ {1, . . . , K_S} and I, J as above. Finally, all elements of the pooling matrices are collated into an output vector as

v = Φ(p_f^1) ⊕ . . . ⊕ Φ(p_f^{K_F}) ⊕ Φ(p_s^1) ⊕ . . . ⊕ Φ(p_s^{K_S}),

where Φ is a function that converts a matrix to a vector, such as Φ(p_f^2) = (p_f^2[1,1], p_f^2[1,2], p_f^2[2,1], p_f^2[2,2]), and ⊕ denotes concatenation. Then, the output of the SPP layer is sent to the dense layers, which act as a binary classifier for interaction site prediction.
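The maximum-SPP branch can be sketched as below. This is a simplified standalone implementation of the extend-then-pool scheme described above (zero extension plus a k × k output grid per stage), not the authors' module; it assumes non-negative inputs, since the zero extension would otherwise distort a max pool.

```python
import numpy as np

def spp_max(F, K):
    """Spatial pyramid pooling (max variant) of a 2-D matrix F.
    Stage k (k = 1..K) zero-extends F to k*ceil(rows/k) x k*ceil(cols/k)
    (the matrix p_0^k), max-pools it over a hand-crafted k x k grid of
    I x J regions, then all stages are flattened and concatenated into
    one fixed-length vector of 1 + 4 + ... + K^2 elements."""
    rows, cols = F.shape
    out = []
    for k in range(1, K + 1):
        I = -(-rows // k)                    # ceil(rows / k)
        J = -(-cols // k)                    # ceil(cols / k)
        P = np.zeros((k * I, k * J), dtype=F.dtype)
        P[:rows, :cols] = F                  # extended matrix p_0^k
        for x in range(k):
            for y in range(k):
                out.append(P[x * I:(x + 1) * I, y * J:(y + 1) * J].max())
    return np.array(out)
```

Because the output length depends only on K, molecules with different numbers of atoms all produce vectors of the same fixed length, which is what lets the dense layers operate without zero-padding the inputs.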