Amalgamation of 3D structure and sequence information for protein–protein interaction prediction

Proteins are the primary building blocks of living organisms. A protein interacts with other proteins to take part in various biological processes. Predicting protein–protein interactions (PPIs) therefore helps in understanding the functionality of proteins, the causes and progression of diseases, and the design of new drugs. However, there is a vast gap between the number of available protein sequences and the number of identified protein–protein interactions. To bridge this gap, researchers have proposed several computational methods to reveal the interactions between proteins. These methods depend solely on sequence-based information of proteins. With the advancement of technology, other types of information related to proteins have become available, such as 3D structure information. Nowadays, deep learning techniques have been adopted successfully in various domains, including bioinformatics. So, the current work focuses on the utilization of different modalities, namely the 3D structure and sequence-based information of proteins, together with deep learning algorithms, to predict PPIs. The proposed approach is divided into several phases. We first obtain several representations of proteins using their 3D coordinate information and three attributes of amino acids (the building blocks of proteins): hydropathy index, isoelectric point, and charge. A pre-trained ResNet50 model, a subclass of convolutional neural networks, is utilized to extract features from these representations. Autocovariance and conjoint triad, two widely used sequence-based methods to encode proteins, are used here as another modality of protein sequences. A stacked autoencoder is utilized to obtain a compact form of the sequence-based information. Finally, the features obtained from the different modalities are concatenated in pairs and fed into the classifier to predict labels for protein pairs.
We have experimented on the human PPIs dataset and the Saccharomyces cerevisiae PPIs dataset and compared our results with those of state-of-the-art deep-learning-based classifiers. The results achieved by the proposed method are superior to those obtained by the existing methods. Extensive experimentation on different datasets indicates that our approach of learning and combining features from two different modalities is useful in PPI prediction.

Evaluation criteria. In this experiment, we have used a repeated 3-fold cross-validation (CV) method and a train-test split method to estimate the performance of the model. The 3-fold CV randomly divides the whole dataset into three independent subsets of equal size. Each time, one subset is used as the test set and the remaining two subsets are used to train the model. This process is repeated three times so that each subset gets a chance to be the test set once. The 3-fold CV may suffer from a noisy estimate of the model's performance, as the results from different splits of the data might be very different. To avoid this, we repeat the 3-fold CV method three times, which is known as repeated 3-fold cross-validation. To get the final results, we take the average and standard deviation over all runs. The train-test split divides the dataset into a training set to train the model and a test set to measure the model's performance. Since the PPI problem is a binary classification problem, each system output must fall into one of four types. These are: 
• True Positive (TP): the system correctly classifies an interacting pair as interacting. 
• True Negative (TN): the system correctly classifies a non-interacting pair as non-interacting. 
• False Positive (FP): the system incorrectly classifies a non-interacting pair as interacting. 
• False Negative (FN): the system incorrectly classifies an interacting pair as non-interacting. 
Accuracy, sensitivity, specificity, precision, F-score, Matthews correlation coefficient (MCC), area under the Receiver Operating Characteristic curve (AUROC), and area under the Precision-Recall curve (AUPRC) are some widely used evaluation criteria that we have used to measure the performance of the proposed approach. These are defined below. Accuracy represents the proportion of correctly classified samples out of the total number of samples; it works well when the datasets are balanced. Sensitivity is the true positive rate; a higher value of sensitivity shows the potential of a classifier to distinguish positive data points. Specificity is the true negative rate; a higher value of specificity represents the ability of a classifier to identify negative data points.
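These confusion-matrix metrics can be computed directly from the four counts; a minimal sketch (the example counts are illustrative only, not results from this study):

```python
import math

def ppi_metrics(tp, tn, fp, fn):
    """Confusion-matrix metrics used to evaluate a binary PPI classifier."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, sensitivity, specificity, precision, f_score, mcc

# Example: 90 TP, 85 TN, 15 FP, 10 FN (hypothetical counts)
acc, sen, spe, pre, f1, mcc = ppi_metrics(90, 85, 15, 10)
```

AUROC and AUPRC are threshold-free and are usually computed from the classifier's raw scores rather than from a single confusion matrix.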
F-score quantifies the robustness of the model: the higher the value, the more robust the model. MCC calculates the correlation coefficient between the actual class and the predicted class. It gives a value between −1 and 1, where 1 represents a perfect prediction, 0 a random prediction, and −1 total disagreement between prediction and observation. Accuracy is computed as

Accuracy = (TP + TN) / (TP + FP + TN + FN)   (1)

Voxel-based protein structure. The protein's tertiary structure information is stored in a text file that contains atoms and their (x, y, z) coordinates in space. Each protein is represented as a binary volumetric shape whose volume elements (voxels) are fitted in a cube V of a fixed grid size l in the three dimensions. Nearest-neighbor interpolation is used to obtain continuity between voxels, such that for (i, j, k) ∈ [0, l − 1]³, a voxel takes the value 1 if the backbone of the protein passes through it, and 0 otherwise. In this experiment, we have ignored the side chains of the protein and considered only the backbone atoms, i.e., carbon, nitrogen, and the alpha carbon (Cα), to obtain the representation of the protein. The binary representation of a protein tells only about its shape. The hydropathy index, isoelectric point, and charge are biological indicators that describe the local properties of the protein's building blocks, i.e., amino acids. These attributes are incorporated into the representation model, which gives us additional useful representations of the protein. So, we have one binary and three attribute volumetric representations for each protein, as depicted in Fig. 1. The various steps involved in getting these volumetric representations of proteins are as follows: 
• Extract the 3D coordinates of only the backbone atoms of the protein from a text file (.pdb) that contains information about the protein's tertiary structure. Also extract the attribute values for each amino acid of the protein. 
• Add the coordinates and attribute values obtained from interpolation between consecutive atoms (A_i, A_{i+1}) of the backbone.
These interpolated points are computed as

A_i + (k / (p + 1)) · (A_{i+1} − A_i),

where the value of k varies from 1 to p.
• Then, the coordinates are scaled by multiplying them by the factor

(l/2 − 1) / R_max,

where l is the grid size and R_max is the radius of the sphere that should be fitted into volume V.
• Coordinates are converted into binary voxels and voxels with attribute values.
• Finally, the voxels having no direct neighbor are removed.
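The steps above can be sketched as follows; the grid size, the interpolation count p, and the toy backbone coordinates are illustrative assumptions, not the paper's actual settings, and the final neighbor-cleaning step is omitted for brevity:

```python
import numpy as np

def voxelize_backbone(coords, grid_size=32, p=4):
    """Toy sketch of the voxelization: interpolate between consecutive
    backbone atoms, scale coordinates into the grid, set binary voxels."""
    coords = np.asarray(coords, dtype=float)
    # Interpolation: p extra points between each consecutive atom pair
    points = [coords[0]]
    for a, b in zip(coords[:-1], coords[1:]):
        for k in range(1, p + 1):
            points.append(a + (b - a) * k / (p + 1))
        points.append(b)
    points = np.array(points)
    # Center, then scale so a bounding sphere of radius R_max fits the cube
    points -= points.mean(axis=0)
    r_max = np.linalg.norm(points, axis=1).max() or 1.0
    points = points * (grid_size - 1) / (2 * r_max) + (grid_size - 1) / 2
    # Convert coordinates into binary voxels
    vol = np.zeros((grid_size,) * 3, dtype=np.uint8)
    idx = np.clip(np.rint(points).astype(int), 0, grid_size - 1)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return vol

# Three collinear "backbone atoms" spaced ~3.8 Å apart (hypothetical)
vol = voxelize_backbone([(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0)])
```

The attribute volumes would be built the same way, storing the hydropathy, isoelectric-point, or charge value of the corresponding residue instead of 1.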
Here, the direct neighbors of a voxel (i, j, k) are the voxels {(i + δx, j + δy, k + δz) | (δx, δy, δz) ∈ {0, 1}³}.

Autocovariance. Autocovariance 22 is a sequence-based method to encode proteins and is one of the most widely used sequence-based coding schemes. It describes the correlation between the values of a descriptor at different positions in a sequence. The following equation is used to convert a protein sequence into a vector:

AC(lag, n) = (1 / (l − lag)) Σ_{m=1}^{l−lag} (P_{m,n} − P̄_n)(P_{m+lag,n} − P̄_n),   with P̄_n = (1/l) Σ_{m=1}^{l} P_{m,n},

where P represents the protein sequence, l is the length of the sequence P, m refers to the location of an amino acid in the sequence P, n is the n-th descriptor, P_{m,n} is the normalized n-th descriptor value for the m-th amino acid, and lag refers to the value of the lag. This equation transforms protein sequences of variable length into vectors of equal size, i.e., (n × lag). In this study, the value of n is taken as 7, as it refers to the seven physicochemical properties of the twenty amino acids, and the value chosen for lag is 30 22 . These values provide a vector with 210 (7 × 30) elements for each protein sequence.
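A sketch of the autocovariance encoding under the definitions above; the random descriptor matrix stands in for the seven normalized physicochemical property values per residue:

```python
import numpy as np

def autocovariance(descriptors, max_lag):
    """AC encoding: `descriptors` is an (l, n) array of normalized
    descriptor values for a protein of length l; returns n * max_lag values."""
    d = np.asarray(descriptors, dtype=float)
    l, n = d.shape
    centered = d - d.mean(axis=0)          # subtract the per-descriptor mean
    features = []
    for lag in range(1, max_lag + 1):
        # correlation of each descriptor with itself, `lag` positions apart
        ac = (centered[: l - lag] * centered[lag:]).sum(axis=0) / (l - lag)
        features.extend(ac)
    return np.array(features)

# Toy example: a length-50 "protein" with 7 descriptor values per residue
rng = np.random.default_rng(0)
vec = autocovariance(rng.normal(size=(50, 7)), max_lag=30)  # 7 * 30 = 210 values
```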
Conjoint triad. The conjoint triad 23 is another popular sequence-based method to convert protein sequences into vectors of numbers. This process of transforming sequences of symbols into vectors of numbers is divided into several steps. First, based on the dipoles and side-chain volumes of the twenty amino acids, they are clustered into seven groups. Then, each amino acid of a sequence is replaced by its cluster number. After that, a window of size 3 amino acids is slid from the N-terminus to the C-terminus across the whole sequence, one step at a time. The total number of possible combinations with a window size of 3 and 7 clusters is 343 (7 × 7 × 7). So, each protein sequence is represented as a vector with 343 elements, whose entries are the counts of all combinations across the protein sequence.
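A sketch of the conjoint triad encoding; the seven-group clustering below follows the commonly used dipole/side-chain-volume grouping, but should be checked against reference 23:

```python
# Clustering of the 20 amino acids into 7 groups (assumed grouping)
CLUSTERS = {aa: i for i, group in enumerate(
    ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]) for aa in group}

def conjoint_triad(sequence):
    """Count every 3-residue window (by cluster number) -> 343-element vector."""
    counts = [0] * 343                      # 7 * 7 * 7 possible triads
    clusters = [CLUSTERS[aa] for aa in sequence]
    for i in range(len(clusters) - 2):      # slide window one step at a time
        a, b, c = clusters[i], clusters[i + 1], clusters[i + 2]
        counts[a * 49 + b * 7 + c] += 1     # base-7 index of the triad
    return counts

vec = conjoint_triad("MKVLAAGG")  # toy 8-residue sequence -> 6 triads counted
```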

Residual network.
A convolutional neural network (CNN), an example of a deep learning model, is used to extract features from images. In recent years, various CNN architectures have become available to obtain low-, mid-, and high-level features and are widely used in image classification tasks. These architectures come under the category of deep convolutional networks. The residual network 41 , also known as ResNet, is a subclass of the deep CNN. In theory, a deeper network means better accuracy. But in practice, a deep network may suffer from the vanishing/exploding gradient problem and from degradation of training accuracy during convergence. Several methods, like batch normalization, are used to address the vanishing/exploding gradient problem. To address the problem of accuracy degradation, ResNet introduced the concept of the skip connection. In a deep convolutional neural network, several stacked layers make up the process of learning features during training. In a residual network, however, the objective is to learn a residual. Let H(x) be the mapping of input x obtained by stacking a few layers. Then the residual function F(x) is defined as

F(x) = H(x) − x,

so H(x) can be written as F(x) + x. Here it is assumed that both H(x) and x have the same dimension. In a feed-forward neural network, F(x) + x is expressed by using a skip connection. Skip connections, as the name suggests, skip one or more layers. In the case of ResNet, these connections are used to execute identity mapping. The output of this connection is added to the output of the stacked layers, as depicted in Fig. 2. Implementation of a skip connection does not involve extra parameters, and the computational complexity also remains the same as before. The building block of the residual network is defined as

y = F(x, {W_i}) + x,   (11)

where x is the input to the layers considered, y represents the output vector, and F(x, {W_i}) is the residual mapping function that needs to be learned.
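The relation H(x) = F(x) + x can be illustrated with a small numpy sketch (a toy two-layer residual function with dense weight matrices; not the paper's ResNet50 implementation):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Two-layer residual function F(x) = W2 @ relu(W1 @ x); the skip
    connection adds the input back before the final non-linearity:
    y = relu(F(x) + x), assuming F(x) and x have the same dimension."""
    f = W2 @ relu(W1 @ x)       # stacked layers learn the residual F(x)
    return relu(f + x)          # skip connection: element-wise addition

# If both weight matrices are zero, F(x) = 0 and the block reduces to the
# identity mapping (for non-negative inputs) -- the intuition behind why
# residual learning eases the optimization of very deep networks.
x = np.array([1.0, 2.0, 3.0])
W = np.zeros((3, 3))
y = residual_block(x, W, W)
```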
The residual function F is flexible in terms of the number of layers. For two layers, it is described as F = W₂ σ(W₁ x), where W₁ and W₂ are weight matrices and σ is the ReLU activation. The operation F + x is achieved by a skip connection and element-wise addition. After performing the F + x operation, non-linearity is added by using the ReLU activation, σ(F + x). For the cases where both F and x have the same dimension, Eq. (11) works fine. In cases where the dimensions differ, we use a modified form of this equation, as shown below:

y = F(x, {W_i}) + W_s x,

where W_s is a linear projection used to match the dimensions.

Long short-term memory network. A recurrent neural network (RNN) suffers from the problem of long-term dependencies. The LSTM network is designed to solve this long-term dependency problem of the RNN. An RNN struggles to remember information over long periods, whereas for an LSTM this is the default behavior. Both RNN and LSTM have a chain of repeating neural network modules, but they differ in their internal structure. The critical components of the LSTM network are the cell state and its several gates, as depicted in Fig. 3. These gates include an input gate, an output gate, and a forget gate. The cell state is considered the memory of the network. The job of the forget gate is to decide what information to keep and what to throw away. For that purpose, it takes into consideration the information from the previous hidden state, h_{t−1}, and the current input, x_t. These are passed through a sigmoid function, which gives a number between 0 and 1. A value closer to 0 leads to the removal of the corresponding information, while a value closer to 1 means keeping it. The forget gate is described as

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),

where W_f is the weight matrix and b_f is the bias of the forget gate network. The input gate of the LSTM network is used to update the cell state. Like the forget gate, it takes the previous hidden state, h_{t−1}, and the current input, x_t, and passes them through a sigmoid function.
The input gate is defined as

i_t = σ(W_i · [h_{t−1}, x_t] + b_i),

where W_i and b_i are the weight matrix and bias vector of the input gate, respectively. The candidate cell state, c′_t, with W_c as the weight matrix and b_c as the bias term, is defined as

c′_t = tanh(W_c · [h_{t−1}, x_t] + b_c).

So, the actual cell state, C_t, at timestamp t is defined as

C_t = f_t × C_{t−1} + i_t × c′_t.

The output gate is responsible for producing the next hidden state. With W_o as the weight matrix and b_o as the bias term, it is described as

o_t = σ(W_o · [h_{t−1}, x_t] + b_o),

which gives us the next hidden state, defined below:

h_t = o_t × tanh(C_t).

Here, × and + represent point-wise multiplication and addition, respectively.

In this experiment, we have used four visual representations for each protein. These are passed through the pre-trained ResNet50 model individually, and a set of feature vectors is generated, each of length 2048. The structural feature vectors of the proteins in a pair are concatenated, which gives feature vectors with 4096 elements. These four feature vectors are then fed into the LSTM layer, which gives a hidden representation of the set of feature vectors at the last timestamp. After that, the encoded sequence-based information is concatenated with the hidden representation of the structural characteristics. For the encoding, we use a stacked autoencoder with one hidden layer. Finally, these concatenated features are input to a sigmoid layer that predicts the output labels for PPI. A value higher than 0.5 means the positive class, while a value less than 0.5 indicates the negative class. Here, the positive class means that the proteins in a pair interact with each other. The overall working of the proposed framework to predict PPIs is depicted in Fig. 4.
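A single LSTM timestep following the gate equations above can be sketched in numpy as follows (toy dimensions and random weights, illustrative only; each weight matrix acts on the concatenation [h_{t−1}, x_t]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep. W and b hold the forget ('f'), input ('i'),
    candidate ('c'), and output ('o') parameters."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])    # candidate cell state c'_t
    c_t = f_t * c_prev + i_t * c_hat        # updated cell state (memory)
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # next hidden state
    return h_t, c_t

# Toy dimensions: hidden size 4, input size 3 (illustrative values only)
rng = np.random.default_rng(1)
W = {k: rng.normal(scale=0.1, size=(4, 7)) for k in "fico"}
b = {k: np.zeros(4) for k in "fico"}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, b)
```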

Results and discussion
This section summarizes the experimental results obtained by the proposed method on the PPIs datasets. We compare the results obtained with those of state-of-the-art deep-learning-based classifiers to illustrate the efficacy of the proposed approach. The models used in this work are implemented in Keras (Python-based framework).
Prediction performance of the proposed model. We first trained a multi-layer perceptron neural network on each feature set of the human PPIs dataset separately. This neural network consists of an input layer, two hidden layers, and an output layer. Table 1 summarizes the results, averaged over the repeated 3-fold cross-validation runs.

Experimental results on the Saccharomyces cerevisiae PPIs dataset. The Saccharomyces cerevisiae dataset has 4314 interacting protein pairs and 6265 non-interacting protein pairs. Since there are fewer positive samples, we randomly select 1951 positive pairs from the dataset and add them back to it, so that the final dataset has a 1:1 ratio of positive to negative samples. Then, we randomly split the final dataset of 12,530 samples into two parts (80% and 20%). The first part, i.e., 80% of the final dataset, is used to train the model. The remaining 20% is used as a test set to analyze the performance of the trained model. Table 6 presents these results. Tables 7 and 8 summarize the results for the unimodal and bimodal feature sets on the test data of the human PPIs dataset and the Saccharomyces cerevisiae PPIs dataset, respectively. The term unimodal means that we have used only one mode of information, either sequence-based or structural features, to train the proposed model. The term bimodal means that we have utilized two types of information representing proteins to get the final feature vectors used as input to the model. For the encoded features obtained from the sequence-based methods (AC and CT), we have used the approach of Sun et al. 30 to get the values of the performance metrics on the test sets of these two datasets. For unimodal structural features, the hidden state representation at the last timestamp, as shown in Fig. 4, is fed directly into the sigmoid layer (output layer). The results show an improvement in the classifiers' predictive potential when both structural and sequence-based features are combined.
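The balancing and splitting procedure for the Saccharomyces cerevisiae dataset can be sketched as follows; the protein pairs are placeholder integers here:

```python
import random

def balance_and_split(positives, negatives, seed=0):
    """Oversample the minority class to a 1:1 ratio, then do an 80/20
    random train-test split (a sketch of the procedure described above)."""
    rng = random.Random(seed)
    # Re-add randomly chosen positive pairs to balance the two classes
    extra = rng.sample(positives, len(negatives) - len(positives))
    data = [(p, 1) for p in positives + extra] + [(n, 0) for n in negatives]
    rng.shuffle(data)
    cut = int(0.8 * len(data))
    return data[:cut], data[cut:]

# With 4314 positive and 6265 negative pairs, 1951 positives are re-added,
# giving 12,530 samples: 10,024 for training and 2,506 for testing.
train, test = balance_and_split(list(range(4314)), list(range(6265)))
```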
The values of accuracy, F-score, and MCC of the bimodal features {Structural + AC + CT} are 2.43%, 1.71%, and 6.01% higher than those of the unimodal features {CT}, respectively, on the test set of the human PPIs dataset. The values of accuracy, F-score, and MCC of the bimodal features {Structural + AC + CT} are 2.95%, 2.80%, and 6.28% higher than those of the unimodal features {CT}, respectively, on the test set of the Saccharomyces cerevisiae PPIs dataset. Figure 5 depicts the results for the different feature combinations in the form of a histogram.
Comparison with existing methods. To further evaluate the proposed method's performance, we have compared our results with those of state-of-the-art deep-learning-based methods 30,31,37 . Table 9 presents the comparison of results between the proposed approach and the stacked autoencoder (SAE) based classifiers 30 . SAE_AC is a stacked autoencoder whose inputs are extracted using the autocovariance method. SAE_CT is the stacked autoencoder model whose input is obtained by the conjoint triad method from protein sequences. It can be seen from Table 9 that the proposed approach outperforms the existing methods. The accuracy values obtained by the state-of-the-art methods and the proposed method are 0.9682, 0.9447, and 0.9720, respectively. To make the comparison between models fair, we have also trained the state-of-the-art models (SAE_AC and SAE_CT) on the training set of our dataset. The results obtained on our test set are reported in Table 7 (row 1 and row 2). The values of accuracy, sensitivity, specificity, precision, F-score, MCC, and area under the PR curve obtained by the proposed approach are 3.83%, 2.78%, 6.28%, 2.67%, 2.72%, 9.40%, and 0.88% higher than those obtained by SAE_AC, respectively. Table 10 presents the comparison of results between the proposed approach and existing deep-learning-based methods 31 . The experiments are repeated three times using 3-fold cross-validation. Welch's t-test 42 with a 5% (0.05) significance level is conducted to illustrate that the accuracy values obtained by the proposed approach did not happen by chance. The t-test gives a p-value, which is the probability that the improvements in results occurred just by chance. If the p-value is less than 0.05, the results are statistically significant (rejection of the null hypothesis). The null hypothesis states that there is no significant difference between the results achieved by two different algorithms.
Table 11 presents the p-values for the different bimodal combinations of the input feature set. All the values mentioned in Table 11 are less than 0.05, indicating that the results achieved by the proposed method to predict PPIs are statistically significant.
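Welch's t-test can be run with scipy; the accuracy values below are illustrative placeholders for two models' repeated-CV runs, not the paper's results:

```python
from scipy import stats

# Hypothetical accuracy values from nine repeated 3-fold CV runs of two models
proposed = [0.972, 0.970, 0.974, 0.971, 0.973, 0.972, 0.970, 0.973, 0.971]
baseline = [0.962, 0.960, 0.963, 0.961, 0.959, 0.962, 0.960, 0.961, 0.963]

# equal_var=False selects Welch's t-test (unequal variances assumed)
t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)
significant = p_value < 0.05   # reject the null hypothesis of no difference
```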

Conclusion
The study of protein–protein interactions is essential, as the various activities and functions of a protein depend on the protein(s) that interact with it. Various methods are available to detect PPIs, but there is still scope to improve their prediction capability and robustness by using multimodal biomedical data and the latest deep learning techniques. In this work, we have combined different modalities of proteins to improve the prediction capability of the classifier. These modalities include the sequence-based information and the structural view of proteins. Deep learning algorithms (ResNet50 and a stacked autoencoder) are used to extract features from these modalities. These features are then used as input to the classifier. The improvements in results attained by our proposed method are statistically significant. The proposed method achieves an average accuracy of 0.9726 under repeated 3-fold cross-validation on the human PPIs dataset with 25,493 samples. Our proposed approach is also compared with some widely used deep-learning-based classifiers that utilize sequence-based information to train the model. The obtained results demonstrate that the proposed approach generally outperforms the existing methods. The significant observation from this study is that the proposed approach can learn useful features from multimodal information of proteins and performs well despite being trained on a smaller number of samples. In the future, we will try using other types of information about proteins and other deep learning techniques, with the hope of getting better results.