## Introduction

Protein–protein interactions (PPIs) have an essential role in all the major cellular processes which assist in elucidating protein function, but also for interpreting most of the biology of the cells. In particular, key proteins in these protein interactions may provide the basis for the development of targeted therapeutic drugs in the related diseases, also informing on the underlying molecular basis of diseases1. While there are numerous databases such as BioLip2 and PDB3 available for querying protein–protein interaction sites, they appear overwhelming due to the increasing number of proteins now known to humans4. Similarly, biological experiments for the detection of binding sites, such as two-hybrid analysis and affinity systems, are very time-consuming and expensive5. To bridge this gap, many computational methods have been developed to address protein interactions and associated sites. In recent years, many deep learning-based protein interaction site identification models have been proposed by incorporating the powerful feature extraction capabilities of deep learning, resulting in a qualitative leap in prediction performance compared to traditional machine learning. For example, Zeng et al.6 used TextCNN as a feature extractor to learn features using convolutional kernels of different sizes, which can improve the prediction performance. Xie et al.7 adopted a simple CNN to learn local features between residues. Yang et al.8 presented a deep neural network with local weight sharing to predict amino acid interaction sites. Sun et al.9 developed a deep learning architecture based on residual neural networks for predicting interacting amino acids in transmembrane proteins. Zhang et al.10 used a simplified LSTM to predict PPI, aiming to learn the contextual information of the features using LSTM’s ability to grasp the global context. Li et al.11 integrated local contextual information and long-range dependencies by incorporating CNN and RNN, which improves the model’s performance. Unfortunately, most of these computational methods are very unstable and poorly generalized, especially for these highly unbalanced benchmark datasets, implying some room for improvement.

On the other hand, a plethora of protein sequence encoding methods has been proposed for modeling protein sequences into a feature matrix. One-hot encoding of protein interaction sites is a very efficient method that has been used in many computational approaches10,12. However, they cannot accurately express functional differences between amino acids. Position-specific scoring matrix (PSSM) is frequently employed for sequence-level and residue-level prediction tasks to characterize the relationship between sequences and functions4,6,10,11,13, which is relatively time-consuming due to the fact that PSSM requires sequence alignment of large databases. Recently, the development of word embedding models in natural language processing has provided the possibility of addressing protein-coding. Some word embedding models such as Word2Vec14, Doc2Vec15, fastText16, and GloVe17 have been widely adopted in the field of bioinformatics; for instance, Zeng et al.6 encoded amino acids using a static word embedding model based on ProtVec18, which improves the accuracy of PPIs prediction. The iCircRBP-DHN proposed by Yang et al.19 advances the identification accuracy of circRNA-RBP interaction sites by Doc2Vec15. Min et al.20 carried out chromatin accessibility prediction by using GloVe17 as an embedding method for gene sequences. Hamid21 used Word2Vec22 to represent protein sequences for differentiating bacteriocins. Unfortunately, such static word vector embeddings do not capture well the association between sequences and structures and neglect the potential connections between sequence contexts. To address these limitations, dynamic word embeddings, as represented by the Bidirectional Encoder Representations from Transformers (BERT) model have demonstrated very good performance in semantic analysis, able to learn sequence context of protein sequences by pre-training large-scale unlabeled corpora in a bidirectional manner23,24,25.

In our study, we propose an ensemble deep learning model (EDLMPPI)-based protein–protein interaction site identification method, as depicted in Fig. 1. We suggest adopting ProtT5 based on transformer architecture as the amino acid feature extractor, to fully exploit the global contextual association of each amino acid, and then, we incorporate eleven additional feature descriptors to further enrich the feature representation. In EDLMPPI, the deep learning architecture is composed of BiLSTM26 and capsule network27, where BiLSTM can learn features in both forward and backward directions of protein sequences in a comprehensive manner, and the capsule network can further discover correlations between features. To cope with the impact of the unbalanced datasets, we train multiple deep learning models to form ensemble deep learning and then perform predictions. To investigate the effectiveness of our proposed EDLMPPI, we conducted experiments on the network mechanism and feature extraction parts. All experiments were based on the training and test sets described in the section “Methods”. The validation set was randomly token as 20% of the training set, and we also used stratified random sampling to divide the validation set to ensure consistency of the distribution of the training and validation sets. To validate the effectiveness of EDLMPPI, we compare it with ten different machine learning models and deep learning models on the benchmark datasets. Further, we also compare EDLMPPI with other PPI site prediction models and demonstrated that EDLMPPI is in front by a large margin, which validates the efficiency of EDLMPPI’s feature extraction and network architecture. To explore the biological significance of EDLMPPI, we extract the structural domains of protein sequences. Compared with other methods, the interaction sites predicted by EDLMPPI showed a higher correlation with the native sites in the structural domain. In addition, we conducted an interpretable analysis to demonstrate the internal process of EDLMPPI’s feature representation. We built a web server for EDLMPPI prediction at http://www.edlmppi.top:5002/.

## Results and discussion

### EDLMPPI can provide a more efficient scheme for characterizing protein sequences

In our study, we adopted a multi-channel strategy to form combined features with MBF (Multi-source Biological Features, including the evolutionary information, physical properties, and physicochemical properties of protein residues) and ProtT5 as inputs to the model, respectively. Then, the two sets of vectors were concatenated and normalized before the softmax classification layer. In MBF, the sliding window mechanism was employed to encode the local contextual information for each residue, which can effectively prevent overfitting and improve the generalization of the model. Moreover, for a window size of n (n is an odd number), the middle-most amino acid is the target amino acid to be predicted, and the sliding step is 1. Therefore, we first conducted an experiment to find the optimal window size in MBF by evaluating the performance of the MBF model with different window sizes from the set {5, 11, 15, 21, 25, 33}. The experimental results of different window sizes are summarized in Fig. 2a with Dset_448 as an example. It is clear that the model achieved the best performance measured by several key metrics including AP, AUROC, and MCC for a window size of 25. However, the overall performance of the algorithm decreased with a window size of 31, which indicates that larger windows are not always better. Therefore, in our study, we choose a window size of 25 as the final size.

In addition, to further investigate the superiority of our proposed feature descriptor, we compared the combined features in EDLMPPI with a single feature descriptor including MBF and ProtT5, respectively. The experimental results are tabulated in Table 1 and Fig. 2b. It can be observed that combining the features of MBF and ProtT5 greatly outperformed the individual feature descriptors on all three datasets. Indeed, for the evaluation metric AP, frequently used to evaluate unbalanced data, the combined features surpassed MBF on the three datasets, respectively, and outperformed ProtT5 by 1.8%, 3%, and 2.9%, respectively, revealing that the combined features enriched the protein expression and enhanced the performance of the model. Moreover, when comparing Prot5 and MBF, it can also be revealed that the AP values of Prot5 perform better than on those three datasets and outperforms MBF by 10.7%, 11.2%, and 8.6%, respectively for AUROC, unveiling the effectiveness of dynamic word embedding in protein–protein binding site prediction. The reason may be that ProtT5 captured better the difference between amino acids (binding sites and non-binding sites) from our labeled training data while MBF had difficulty distinguishing amino acid specificity based on evolutionary information and other biological functions.

### Comparing ProtT5 with other protein language models

In recent years, language models based on Transformer architecture have been widely used in protein prediction problems. The self-attention-based Transformer can directly calculate the two-by-two association between residues and capture the interdependence between amino acids at different positions. In addition to ProtT5, several alternative protein pre-training models including ESM-1b28 and ProGen229 have been proposed to characterize protein sequences. ESM-1b uses a RoBERTa-based architecture with the Uniref50 2018_03 database as the unsupervised training corpus while using pre-activation layer normalization to optimize hyperparameters in the translator. ProGen2 was scaled to 6.4 billion parameters and trained on different sequence datasets with more than 1 billion proteins from genomic, metagenomic, and immune repertoire databases. For a fair comparison, we replaced the embedding representation learned by ProtT5 with the embedding representation learned by ESM-1b and ProGen2. The experimental results are summarized in Table 2. As depicted in this table, we observe that ProtT5 is superior to ESM-1b and ProGen2 in AP and AUROC, demonstrating that the ProtT5 is more suited for characterizing the amino acid sequences for protein–protein binding events.

### EDLMPPI can effectively deal with the overfitting problem caused by data imbalance

As the number of residues in the binding sites is only one-tenth of the total number, this unbalanced data pushes the model training to focus on the major class and ignores the minor class, leading to overfitting of the model30,31,32. To address this issue, we proposed employing ensemble deep learning to tackle the skewed distribution of categories of unbalanced datasets. To investigate the performance of the ensemble model, we compared it to three other different unbalanced data processing algorithms, including cost-sensitive model33, random over-sampling34, and random under-sampling34 under these three datasets. In detail, the cost-sensitive model33 focuses on the samples of categories by optimizing the lowest total cost of classification errors. Over-sampling34 generates new samples for the underrepresented classes by random sampling, while under-sampling34 randomly removes redundant samples from the major class sample.

The experimental results are summarized in Table 3 and Fig. 2d. Generally, the ensemble model performed the best, obtaining higher MCC, AUROC, and AP scores. In terms of AP scores on the three datasets, the ensemble learning algorithm comparatively outperformed the competing algorithms with 46.0%, 33.0%, and 41.3%, respectively, indicating an improved generalization performance with the asymmetric bagging method. In addition, the average precision of the over-sampling method on the three data sets was 43.9%, 31.5%, and 40.4%, respectively, which was lower than the ensemble learning method since the over-sampling method destroys the dependencies between features and limits the ability of the model to find correlations between features. It is worth noting that the under-sampling method can be considered as a sub-model of the ensemble deep learning model, which lags for AUROC and AP scores by 1.1% to 3.9% on all three datasets compared to the ensemble learning method.

In summary, we can conclude that the ensemble deep learning method based on asymmetric bagging assures the efficiency of algorithm execution and enhances its identification performance, by comparatively reducing the impact of the unbalanced data sets.

### Comparing EDLMPPI with different machine learning algorithms

To study the effectiveness of EDLMPPI, we compared it with five machine learning methods, including three ensemble learning methods (XGBoost35, LightGBM36, and CatBoost37) and two other machine learning methods, SGDClassifier (Stochastic Gradient Descent), and MLPClassifier (Multi-Layer Perception). Figure 2c and Table 4 depict the experimental results of the different algorithms on all three datasets. From the results, we see that our proposed model had better performance than the five other machine learning algorithms on all three datasets. In particular, on Dset_448, EDLMPPI outperformed the machine learning methods by 2.1–3.4% in the average AUROC and by 3.0–6.2% for the average AP on the three datasets, indicating the large improvement in the predictive ability of EDLMPPI. Moreover, since the same feature descriptor is adopted by EDLMPPI and these machine learning algorithms, we observe from the results that the comprehensive performance of the deep learning method was stronger than that of traditional machine learning, indicating that the deep learning method can explore the potential connection between protein sequence and structure better, thereby improving the prediction of protein binding sites performance, which further proves the effectiveness of EDLMPPI.

### Comparing EDLMPPI with different deep learning architectures

To validate the effectiveness and sophistication of the revised architecture of EDLMPPI, we compared it with other five deep learning models including TextCNN38, Single-Capsule27, BiLSTM39, BiGRU40, and Multi-Head Attention41 using the same feature descriptors. The experimental results of the different deep learning models are depicted in Fig. 2c and Table 4, where we see that EDLMPPI performed comparatively better than the other deep learning models, measured by the evaluation metric AP, outperforming the second-ranked Multi-Head-Attention by 1.2%, 2.2%, and 1.2% on the three datasets, respectively. Moreover, the intuitive view of TextCNN’s performance was weaker than several other deep learning models, which is consistent with our expectation that the CNN structure only extracted the local features, undermining the integrity of Prot5’s context-based embedding. In addition, LSTM and GRU perform comparably on Dset_448 and Dset_72, but LSTM performs better than GRU on Dset 164, which is the reason for choosing LSTM to learn long-term dependencies in the final model EDLMPPI.

### Comparing EDLMPPI with other PPIs prediction methods

To further test the advancement brought by EDLMPPI, we compared it with ten current PPI prediction methods including SPPIDER42, SPRINT43, PSIVER44, SPRINGS45, LORIS46, CRFPPI47, SSWRF48, DLPred49, SCRIBER13, and DELPHI11. We obtained the prediction scores for each protein sequence in the test dataset through the web server or the available source codes of these algorithms. We adopted TPR, TNR, Pre, ACC, F1, MCC, AUROC, and AP as the evaluation criteria and MCC, AUROC, and AP as the important determinants for evaluating the merits of the models that are frequently used to evaluate unbalanced data13. The prediction results are summarized in Table 5 and Fig. 3a.

We see that EDLMPPI is better than the other PPIs prediction methods for most evaluation metrics, with AUROCs of 82.0%, 78.8%, and 75.5% on the three datasets, respectively, substantially higher than the DELPHI method with 73.7%, 71.1%, and 68.5%, respectively. Furthermore, for the average precision (AP), EDLMPPI beat DELPHI by 12.3%, 9.3%, and 8.1% on the three datasets, respectively, which brings considerable improvement. The reason seems to be that EDLMPPI can address the amino acid long-range dependency problem based on the transformer of self-attentive mechanism, which fully explores the global contextual features and semantic information, indicating that our proposed deep learning architecture provides an important contribution to accurate classification. In addition, we incorporated traditional biological features, such as the evolutionary information and several physicochemical properties, to bridge possible shortcomings of ProtT5, thereby further improving identification performance. Notably, EDLMPPI showed a higher advantage on the Dset_448 dataset, comparing full-length sequences, suggesting that our feature extraction method may be better and more accurate in the functional expression of complete protein sequences. Overall, EDLMPPI has been substantially ahead of existing methods and can be used as a complementary tool for protein–protein interaction site annotation.

### Protein binding domains analysis

Protein domains are closely related to the completion of physiological functions of the proteins and serve as the structural basis for their cellular functions50. To gain insight into the potential relationship between protein structural domains and protein–protein interaction sites, we performed an experiment to verify whether EDLMPPI accurately predicts PPIs in the protein domain. We annotated 448 protein sequences in the Dset_448 dataset by Pfam51 to remove any overlapping structural domains and finally obtained 501 structural domains. Figure 3b shows the correspondence between structural domains of each size and the number of PPIs in them, while we compare the prediction results of EDLMPPI, DELPHI, and SCRIBER13. In addition, we added a control group to enhance the rationality of the experiment: a fragment of the same size as the protein domain was randomly selected from the sequence. From the results, the prediction results of EDLMPPI were more optimistic than the other two methods, with the number of PPIs predicted by EDLMPPI increasing with the growth of the structural domain. According to a previous study52, the length-deviant domain superfamilies are highly interacting, more mixed in function, and regulated by multiple proteins, which supports the plausibility of EDLMPPI in predicting protein function. In addition, we counted the proportion of predicted PPIs estimated by EDLMPPI, DELPHI, and SCRIBER for each structural domain and calculated the Pearson correlation coefficient with the true proportion vector. EDLMPPI presented the highest correlation with the native annotations with a score of 0.70, while DELPHI, SCRIBER, and the control group scored 0.63, 0.57, and 0.21, respectively.

To further indicate that EDLMPPI can accurately predict the performance of binding sites in protein domains, we selected three enzyme proteins with high catalytic activity, P19821 - DPO1_THEAQ, P9WHH9 - DLDH_MYCTU, and P17109 - MEND_ECOLI to demonstrate the difference in performance predicted by different methods. Since SCRIBER and DELPHI provided better performance in the prediction of PPIs than other PPI site prediction models, we employed the prediction results of SCRIBER and DELPHI in these three sequence species as comparisons, and the results are displayed in Table 6. With a protein structural domain size of 337 in P19821 - DPO1_THEAQ, the true number of experimentally detected PPIs is 31, and the prediction of EDLMPPI was 36, closer to the true number compared to SCRIBER and DELPHI. This performance is more evident in P9WHH9 - DLDH_MYCTU and P17109 - MEND_ECOLI, where the number of PPIs predicted by EDLMPPI differs from the true value by only 1–2, indicating the effectiveness of EDLMPPI in predicting the binding sites of protein structural domains and also validating our previous conclusion that EDLMPPI can provide more binding sites in the structural domains of proteins.

### Interpretability analysis

To investigate the effectiveness of the EDLMPPI architecture, we extracted the intermediate layer outputs of the model at various stages and mapped them onto a two-dimensional space for clustering, as shown in Fig. 4a. We see that the original embedding was distributed haphazardly, while after the BiLSTM layer, a more obvious clustering effect can be seen. The capsule layer further preserved the key classification features, and the binding and non-binding sites appeared as separate clusters. Finally, after the softmax function, accurate identification was achieved.

In addition, we explored the contributions of different features to the protein–protein binding site recognition and the interaction relationship. Figure 4b shows the 20 features that have the greatest impact on PPIs identification, and reveals how they act in predicting the non-binding sites and bindings sites, respectively. The red color represents higher feature values while the blue represents lower feature values. Taking features 1027 and 33 as examples, the higher feature 1027 tends to classify samples as binding sites while the higher feature 33 is more likely to classify samples as non-binding sites. Compared with the impact of a single feature on the model, the interaction of features was more important. Figure 4c shows how Feature 1027 and Feature 569 interact with the other features. We note that Feature 1027 had no significant interaction with the other features, which is consistent with our judgment that Feature 1027 represents solvent accessibility and is encoded as a vector of length 1, without too much dependency on the other features. On the other hand, a strong correlation was shown between Features 569 and 72, and the effect of Feature 72 on classification was weakened at lower values of Feature 569. This comes from the fact that ProtT5 contains global context dependency, and the expression of features is based on joint action with other features, which further validates the effectiveness of ProtT5. Figure 4d is a stacked diagram showing the effect of each feature on each sample, which allows us to observe which features affect the identification of a sample.

To gain a deeper understanding of the working of EDLMPPI, we investigated the internal process of ProtT5 embedding for reliability. First, we selected a complete protein sequence and encoded it using ProtT5. For each amino acid embedding vector, we applied the Pearson correlation coefficient to describe the correlation between residues. The results are displayed in Fig. 5a, where we see that each amino acid always had a strong correlation with the amino acid closer to it, but as the distance becomes farther, ProtT5 could still capture an association between amino acids, implying that ProtT5 balanced the local influences and long-term dependence. To further into the process, we applied Bertviz53 to visualize each attention head and each layer in ProtT5, and the results are shown in Fig. 5b,c, where the different colors represent the different attention heads and the saturation of the lines represents the attention scores. Figure 5b(a) shows the first layer of attention in all attention heads, which roughly resembles a full connection, implying that for each residue, all attention heads tried to find the association with the target of the other residues. The </s> acts as a sequence splitter that carries the attention of all residues, which indicates that for ProtT5, the overall identity of a sequence is determined by all amino acids together. Furthermore, Fig. 5b(d) shows clearly the flow of target amino acids in the different attention heads, verifying our previous statement that higher attention is seen with closer proximity. Moreover, Fig. 5c visualizes the evolution of each attention head in the different layers, as the layers deepened, the attention pattern shifted from focusing on the association between different amino acids to transmitting the expression of the amino acid sequences. In summary, ProtT5 can explore the connection between the protein-level structure and its function from local to global, providing a reasonable interpretation that EDLMPPI effectively predicts protein–protein interaction binding sites.

## The EDLMPPI user interface facilitates exploration of PPIs identification

To facilitate researcher use and improve our model and accelerate progress in protein binding site prediction, we have developed a fully functional EDLMPPI online prediction webserver for PPI, which is available at http://www.edlmppi.top:5002/. Given the limited computational resources and the large computational capacity of ProtT5, we patiently guide users on how to set up the ProtT5 environment in their local environment, download the model, and extract features in various scenarios. Furthermore, the online prediction server also encourages users to upload the extracted ProtT5 features directly on the EDLMPPI server, so that we can return the prediction results via email with an interpretation of the results. In addition, we have synchronized open source data and code on GitHub, which can be accessed at https://github.com/houzl3416/EDLMPPI.git.

Supplementary Figure 1 summarizes the main modules and Supplementary Figure 1a illustrates the main interface, which includes three ways of obtaining ProtT5: extracting it onto your device, extracting it onto Colab, or downloading the file we provide. Supplementary Figure 1b shows the prediction module: once the ProtT5 features are uploaded, the server can send the prediction results to the user’s email automatically. Finally, there is the downloadable module depicted in Supplementary Figure 1c, where users can directly click on the links to download datasets and models in a rapid manner.

## Conclusions

In this study, we propose a protein–protein interaction site prediction method based on ensemble deep learning models, called EDLMPPI, EDLMPPI adapts the dynamic word embedding model based on transformer architecture to the study of protein–protein interaction sites and uses ProtT5 to capture the contextual and positional information between residues, while integrating eleven multi-source biological features to further enrich the feature representation. Meanwhile, we developed a multi-channel integrated deep learning model that captures both local context dependence and global context dependence of protein sequences and effectively solves the data imbalance problem.

To demonstrate the effectiveness of the EDLMPPI, we compared it to ten different traditional machine learning and deep learning models on three widely-used benchmark datasets. Moreover, we compared EDLMPPI with other PPI website prediction models and the predictive performance of EDLMPPI improves prediction over these models. Besides, in the prediction of PPIs in protein structural domains, EDLMPPI shows more biologically consistent results, which indicates that EDLMPPI has the ability for certain biological analysis and can be used to guide biologists to make specific experiments on proteins. Meanwhile, the interpretability analysis fully demonstrates the internal vision of the EDLMPPI model, which further enhances the rationality of the model.

In addition, the release of the EDLMPPI online prediction web server provides detailed guidance on model training and prediction, ensuring that the results of our experiments are repeatable and operational. The code and data are also open-sourced at https://github.com/houzl3416/EDLMPPI.git.

In summary, EDLMPPI is a very competitive protein–protein interaction site prediction tool with the advantages of high efficiency and accuracy, proving a new alternative for protein interaction site identification. It provides new ideas and insights into the task of protein–protein interaction site prediction and can also serve as an important assistant for biologists to effectively implement PPI prediction and downstream analysis work. The release of the webserver also greatly facilitates the work of other researchers to improve our model and achieve more effective prediction results. In the future, we will incorporate other dynamic word embedding models into our proposed model and adapt them to other relevant protein identification problems.

## Methods

### Datasets

For datasets, we collected three widely-used benchmark datasets, Dset_18654, Dset_7254, and Dset_16455. Dset_186 was constructed from the PDB database3 and contains 186 protein sequences with a resolution of <3.0 Å and sequence homology <25%. This dataset was refined in multiple steps, including the removal of chains with identical UniprotKB/Swiss-Prot accessions, the removal of transmembrane proteins, the removal of dimeric structures, the removal of proteins with surface accessibility and interfacial polarity buried within a certain range, and the removal of similarities. Dset_72 and Dset_164 were constructed in the same manner as Dset_186, and consist of 72 and 186 protein sequences, respectively.

Further Dset_1291 is a dataset from the BioLip database, where a binding site is defined if the distance between an atom of a residue and an atom of a given protein partner is 0.5 Å plus the sum of the van der Waals radii of the two atoms13. Zhang et al.13 eliminated the fragmented proteins and then transferred the annotation of the bound residues to the same UniProt sequence. Therefore, the similarity between the sequences was reduced to less than 25% under the Blast-Clust method. Finally, Dset_843 (843 sequences of Dset_1291) was used to train our model, while the remaining 448 sequences (Dset_448) were employed as the independent test set.

Using these datasets, we constructed the training and test sets. As Dset_843 and Dset_448 consist entirely of full-length protein sequences, while Dset_71, Dset_186, and Dset_164 are composed of fragmented sequences; to enhance the generalizability of the model, we selected Dset_843 and Dset_186 representing two different types of datasets as our training datasets, respectively. Then Dset_448, Dset_72, and Dset_164 were used as independent test sets to test the performance of the different PPI site prediction models. In addition, to reduce the similarity between the training and test sets, we performed consistency redundancy removal between them using the PSI-BlAST56 procedure to ensure the similarity was below 25%. Supplementary Table 1 summarizes the number of protein residues and the proportion of binding sites in each dataset, where it is easy to see that the distribution of the datasets is relatively unbalanced, with positive samples accounting for only 10–18% of the total sample size, which poses a challenge for the generalizability of the model.

### Feature descriptors

To fully explore the structural characteristics of protein–protein interaction sites, several features, including dynamic global contextual information and multi-source biological features, are extracted from protein sequences as follows.

#### Dynamic global contextual information

Due to the expensive cost of traditional biological experiments and the low capability of some deep learning-based techniques, we introduce the dynamic word embedding-based ProtT524 to represent the feature expression information of proteins to obtain the global context-sensitive information between the different sequences and amino acids, which has already been proven to be an effective method experimentally. Specifically, ProtT5 is employed for generating global contextual embeddings. Indeed, ProtT5 learns a positional encoding for each attention head in the transformer architecture and shares it on all levels. In ProtT5, the training corpus is Uniref50, which contains 45 million protein sequences composed of 15 billion amino acids. Such a huge training set guarantees that ProtT5 will capture the structural and functional connections between different types or races of proteins.

ProtT5 first maps each amino acid into a fixed-length vector by means of an embedding layer, besides, the position embedding in ProtT5 is employed to encode the relative positional information of each amino acid in the corresponding protein sequence, and the segment embedding was introduced to distinguish the different protein sequences. The sum of the token embedding, segmentation embedding, and position embedding provides not only a non-contextual mapping of amino acids to the underlying space but also extends the amino acid dependencies in each protein sequence and the contextual associations between different protein sequences, which can be defined as follows:

$${E}_{word} =\, {E}_{tok}+{E}_{seg}+{E}_{pos}\\ =\, {O}_{tok}{W}_{tok}+{O}_{seg}{W}_{seg}+{O}_{pos}{W}_{pos}$$
(1)

where Wtok, Wseg, and Wpos are the corresponding parameter matrices to be trained. After that, dynamic word embedding, learned from the multi-head self-attention mechanism in the transformer architecture, is used to correlate the relevant amino acids in the protein sequence, which can be calculated through the following formula:

$$X{W}_{i}^{Q}={Q}_{i},\quad X{W}_{i}^{K}={K}_{i},\quad X{W}_{i}^{V}={V}_{i},\quad i=1,\ldots ,m$$
(2)
$${Z}_{i} =\, {{{{{{{\rm{Attention}}}}}}}}{\left({Q}_{i},{K}_{i},{V}_{i}\right)}_{i}\\ =\, {{{{{{{\rm{SoftMax}}}}}}}}\left(\frac{{Q}_{i}{K}_{i}^{T}}{\sqrt{{d}_{k}}}\right){V}_{i},\quad i=1,\ldots ,m$$
(3)
$${{{{{{{\rm{MultiHead}}}}}}}}(Q,K,V)={{{{{{{\rm{Concat}}}}}}}}\left({Z}_{1},\ldots ,{Z}_{m}\right){W}^{O}$$
(4)

where Q(Query), K(Key), V(Value) are obtained through m linear transformations, which are used to store all word embeddings. Zi represents the attention of each attention head, which is calculated by the linear transformation of a set of Q, K, V.

Indeed, the attention stack of ProtT5 consists of 24 layers, each layer contains 32 attention heads, and the size of the hidden layer is 1024. This stacked mode is what allows each layer to operate on the output of the previous layer. Through such a repeated combination of word embedding, ProtT5 can form a very rich representation as it reaches the deepest layer of the model23. Therefore, in our study, we extract the embedding of the last layer of the attention stack into our feature representation.

#### Multi-source biological features

Further, to improve the prediction performance, we accessed the evolutionary information, physical properties, and physicochemical properties of protein residues to enrich the feature expression.

(1) Position-Specific Scoring Matrix (PSSM): PSSM provides a flexible way to represent the specificity of residue interactions, which describes the evolutionary conservation of the residue positions. It can be described as follows:

$${{{{{{{\rm{score}}}}}}}}(a,b)={\log }_{10}\left(M(a,b)/{p}_{a}{p}_{b}\right)$$
(5)

where pa and pb represent the probability of observing amino acids a and b, respectively, and M(a, b) is the probability score of a mutation. We chose Uniref90 as the comparison database, set the number of iterations to three, and set the threshold value to 0.001 by PSI-BLAST.

(2) Physical characteristics: Physical characteristics are the graph index, polarization rate, normalized van der Waals volume, hydrophobicity, isoelectric point, spiral probability, and sheet probability. The same calculations are performed using the values reported in ref. 57 to obtain a 7-dimensional vector for each amino acid.

(3) Physicochemical properties: To accurately express the differences and connections between different residues, we introduce the physicochemical properties of amino acids. The physicochemical characteristics of a residue are described by three values: the number of atoms, the number of electrostatic charges, and the number of potential hydrogen bonds. These values are only related to the type of amino acid and do not contain any structural information from the amino acid residue.

### Ensemble deep memory capsule network

To capture the crucial information in the hybrid feature schemes more efficiently, we developed the ensemble deep memory capsule network (EDMCN) to maximize the feature learning performance of protein–protein interaction site identification, as depicted in Fig. 1. Deep memory capsule networks expand the parallelism of traditional memory networks by linking them with different output sizes to capture the correlation between amino acids at different depth scales. Besides, the capsule structure can further explore the intrinsic connections between features and retain location information between samples. In addition, to promote the generalization and stability of the model, we introduced an asymmetric bagging algorithm to solve the high imbalance between samples.

#### Deep memory network

Traditional memory networks such as LSTM39, GRU40, etc. have achieved good results in organizing the context of features for prediction. However, these models are parameter-sensitive, which greatly affects the stability of the prediction. To address this, we developed a deep memory network to enhance the generalization performance of the model. The central idea of deep memory networks is to connect multiple memory networks with different output scales to capture the correlation between residues in a multi-scale manner. Formally, it mainly controls the protein information flow through three gates (input gate(i), forget gate(f), and output gate(o)), including when to remember, update, and utilize the information. The forget gate works by accepting a long-term memory Mt−1 and deciding on which parts to retain or discard. In a time step t, the forget gate first calculates the forgetting factor ft from the previous hidden state ht−1 and the current input information mt:

$${f}_{t}=\sigma \left({W}_{f}\cdot \left[{h}_{t-1},{m}_{t}\right]+{b}_{f}\right)$$
(6)

where σ is the logistic sigmoid function. The input gate mainly controls which input currents mt can pass through the memory cell, first by generating a control signal to control the rate rt of inflow:

$${r}_{t}=\sigma \left({W}_{r}\cdot \left[{h}_{t-1},{m}_{t}\right]+{b}_{r}\right)$$
(7)

Next, the input gate generates candidate memory cells $$\widetilde{{M}_{t}}$$ and calculates the memory information that eventually passes through the input gate based on the previously solved rt:

$${\widetilde{M}}_{t}=\tanh \left({W}_{M}\cdot \left[{h}_{t-1},{m}_{t}\right]+{b}_{M}\right)$$
(8)
$${M}_{t}={f}_{t}* {M}_{t-1}+{r}_{t}* {\widetilde{M}}_{t}$$
(9)

Finally, the output gate filters mt by generating the control signal gt to obtain the output Ot:

$${g}_{t}=\sigma \left({W}_{g}\cdot \left[{h}_{t-1},{m}_{t}\right]+{b}_{g}\right)$$
(10)
$${O}_{t}={g}_{t}* \tanh \left({M}_{t}\right)$$
(11)

#### Capsule network

Deep memory network effectively captures global contextual dependencies among features, however, it tends to weaken the strong correlations among local features and lose topological information about feature types. To solve this problem, we introduce the capsule network27. Intuitively, the capsule network contains a convolutional network part along with neurons called capsules, which decide its perception of features, not only reflected in the importance of the features but also the various states of the features, including their location information. In this way, the capsule network can effectively capture the potential associations between features for our highly context-dependent feature description methods.

The structure of capsule neurons in a capsule network is shown in Fig. 1. In a capsule network, the capsule neurons are connected in a similar way as a full connection, for the current layer of capsules c1, c2, …, ci, the position relationship between the local and global features is learned through the pose transformation (translation, rotation, deflation):

$${\hat{c}}_{j| i}={W}_{ij}{c}_{i}$$
(12)

where Wij is the weight matrix. Then, we multiply each transformed vector by a coupling coefficient oij and pass it to the next layer of capsules, and sum all the neuron signals received by the j-th capsule of the next layer:

$${s}_{j}=\mathop{\sum}\limits_{i}{o}_{ij}{\hat{c}}_{j| i}$$
(13)

and the oij can be calculated as follows:

$${o}_{ij}=\frac{{{{{{{{{\rm{e}}}}}}}}}^{{b}_{ij}}}{{\sum }_{n}{{{{{{{{\rm{e}}}}}}}}}^{{b}_{in}}}$$
(14)

where bij is the logarithmic prior probability of whether two capsules are connected. Similar to sigmoid, a nonlinear activation function called squash27 is employed for mapping vectors to [0, 1], and the capsule output vj of this layer can be calculated as follows:

$${v}_{j}=\frac{{\left\Vert {s}_{j}\right\Vert }^{2}}{1+{\left\Vert {s}_{j}\right\Vert }^{2}}\frac{{s}_{j}}{\left\Vert {s}_{j}\right\Vert }$$
(15)

#### Ensemble deep learning algorithm

To further improve the stability and generalization performance of our proposed model, an ensemble learning method based on the asymmetric bagging algorithm58 is applied to deal with the skewed distribution of categories in unbalanced datasets. Bagging is one of the prevailing ensemble learning methods59, which can integrate the prediction results of multiple different classifiers and then use the voting principle to determine the class of the samples in the decision phase, aiming to reduce variance and promote the generalization performance of the model. Indeed, the principle of variance reduction by bagging is represented by the following equation:

$${{{{{{{\rm{Var}}}}}}}}(cX) =\, E\left[{(cX-E[cX])}^{2}\right]\\ =\, {c}^{2}E\left[{(X-E[X])}^{2}\right]\\ =\, {c}^{2}{{{{{{{\rm{Var}}}}}}}}(X)$$
(16)
$${{{{{{{\rm{Var}}}}}}}}\left({X}_{1}+\cdots +{X}_{n}\right)={{{{{{{\rm{Var}}}}}}}}\left({X}_{1}\right)+\cdots +{{{{{{{\rm{Var}}}}}}}}\left({X}_{n}\right)$$
(17)
$${{{{{{{\rm{Var}}}}}}}}\left(\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{X}_{i}\right)=\frac{1}{{n}^{2}}{{{{{{{\rm{Var}}}}}}}}\left(\mathop{\sum }\limits_{i=1}^{n}{X}_{i}\right)=\frac{{\sigma }^{2}}{n}$$
(18)

where X represents an independent sample, Var(X) is the variance, and E(X) represents the mean of sample X. Then, it can be seen that assuming there are n independent models with an identical distribution and the variance of each model is σ2, the variance of the ensemble model can be deduced from Eqs. (16) and (17) as σ2/n. Bagging is sampled with put-back sampling so that there are duplicate samples between data sets, thus violating the independence assumption in Eq. (18). In this case, the variance of the ensemble model based on the correlation coefficient rho between the individual models can be expressed as follows:

$${{{{{{{\rm{Var}}}}}}}}\left(\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{X}_{i}\right)=\frac{{\sigma }^{2}}{n}+\frac{n-1}{n}\rho {\sigma }^{2}$$
(19)

Under that, as the number of classifiers increases or the correlation between single models decreases, the variance of the ensemble model further decreases. Motivated by the above observations, we proposed to employ the asymmetric bagging algorithm to achieve this goal. For the dataset S, in each iteration, we keep all the samples of protein binding sites as Sp, and separate a subset $${S}_{n}^{{\prime} }$$ with the same scale as Sp from the samples Sn of non-binding sites. This step is repeated for sampling without replacement until the training process covers all samples, and eventually, multiple classifiers can be obtained. After that, we sum the softmax values obtained by these multiple classifiers for each sample to make the final identification decision. On this basis, asymmetric bagging can adequately ensure a balanced class distribution of the input data for each model and keep the correlation between individual models as low as possible. It is worth mentioning that although the ensemble models may increase the computational complexity, the feasibility of parallelism in asymmetric bagging can effectively reduce the running time with sufficient computational resources.

### Parameter settings

To demonstrate the effectiveness of our proposed EDLMPPI, we compare it to several traditional machine learning methods and deep learning methods. In the following section, we present the details of the parameter settings of these algorithms.

#### Deep learning algorithms

For EDLMPPI, we use the tanh function as the activation function and adopt the Glorot initializer with a uniform distribution to initialize the weights for the BiLSTM part. Then, for the number of neurons in the hidden layer, we fix a set of candidate values [32, 64, 128, 256]. For the capsule network, the main hyperparameters are the number of neural capsules and the dimensionality of each neuronal vector, for which we set a group of candidate values [32, 64, 128, 256] and [3, 5, 7, 10], respectively. To obtain the best hyperparameters, we optimize the three sets of candidate values above by the grid search method under Tensorflow 2.5.0 and Keras 2.4.3. The epochs are set to 100 and the early stop mechanism is applied to prevent overfitting of the proposed algorithm.

To conduct a fair comparison to the other deep learning algorithms including TextCNN38, Single-Capsule27, BiLSTM39, BiGRU40, and MultiHead Attention41, to conduct a fair comparison, the hyperparameter optimization methods used the same principles as EDLMPPI; we also adopted the same rules of the hyperparameter optimization method as for EDLMPPI, using a grid search procedure to select reasonable hyperparameters. For TextCNN, the test settings for different combinations of convolutional kernels of different sizes were {{1, 3, 5, 7}, {7, 9, 11, 13}, {4, 5, 6, 7}, {7, 8, 9, 10}}, where the number of filters for each combination is chosen from 16, 32, 64, 128, respectively. The number of hidden layer cells of BiLSTM and BiGRU is chosen from {32, 64, 128}. In the capsule network, the candidate values for the number of neural capsules and the dimensionality of each neuronal vector are {32, 64, 128, 256} and {3, 5, 7, 10}, respectively. Finally, the Multi-Head attention network selects the number of attention heads from {4, 8, 16, 32}.

#### Machine learning algorithms

The machine learning methods contain three ensemble learning methods (XGBoost35, LightGBM36, and CatBoost37), SGDClassifier (Stochastic Gradient Descent), and MLPClassifier (Multi-Layer Perceptron), which are under the scikit-learn60 dependency package in Python environment. XGBoost adopts a level-wise decision tree construction strategy, LightGBM uses a leaf-wise construction strategy and CatBoost applies a symmetric tree structure with full binary decision trees. The SGDClassifier is a stochastic gradient descent learning model with a regularized linear method. The loss gradient is estimated for each sample at a time, and the model is updated in the process using an intensity-decreasing schedule. MLP is a forward-structured artificial neural network, which can solve complex problems quickly. The grid search procedure is also performed to find the optimal hyperparameters for these five classifiers. The candidate parameters and the optimal parameter combinations are summarized in Supplementary Table 2.

### Evaluation performance

To evaluate the performance of different computational methods, we used sensitivity (TPR), specificity (TNR), precision (Pre), accuracy (ACC), F1-score (F1), the Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (AUROC), and average precision (AP) as measurement criteria, which can be formulated as below:

$${{{{{{{\rm{TPR}}}}}}}}=\frac{{{{{{{{\rm{TP}}}}}}}}}{{{{{{{{\rm{TP}}}}}}}}+{{{{{{{\rm{FN}}}}}}}}}$$
(20)
$${{{{{{{\rm{TNR}}}}}}}}=\frac{{{{{{{{\rm{TN}}}}}}}}}{{{{{{{{\rm{TN}}}}}}}}+{{{{{{{\rm{FP}}}}}}}}}$$
(21)
$${{{{{{{\rm{Pre}}}}}}}}=\frac{{{{{{{{\rm{TP}}}}}}}}}{{{{{{{{\rm{TP}}}}}}}}+{{{{{{{\rm{FP}}}}}}}}}$$
(22)
$${{{{{{{\rm{ACC}}}}}}}}=\frac{{{{{{{{\rm{TP}}}}}}}}+{{{{{{{\rm{TN}}}}}}}}}{{{{{{{{\rm{TP}}}}}}}}+{{{{{{{\rm{FN}}}}}}}}+{{{{{{{\rm{TN}}}}}}}}+{{{{{{{\rm{FP}}}}}}}}}$$
(23)
$${{{{{{{\rm{F}}}}}}}}1=2\times \frac{{{{{{{{\rm{TPR}}}}}}}}\times {{{{{{{\rm{Pre}}}}}}}}}{{{{{{{{\rm{TPR}}}}}}}}+{{{{{{{\rm{Pre}}}}}}}}}$$
(24)
$${{{{{{{\rm{MCC}}}}}}}}=\frac{{{{{{{{\rm{TP}}}}}}}}\times {{{{{{{\rm{TN}}}}}}}}-{{{{{{{\rm{FN}}}}}}}}\times {{{{{{{\rm{FP}}}}}}}}}{\sqrt{({{{{{{{\rm{TP}}}}}}}}+{{{{{{{\rm{FP}}}}}}}})\times ({{{{{{{\rm{TP}}}}}}}}+{{{{{{{\rm{FN}}}}}}}})\times ({{{{{{{\rm{TN}}}}}}}}+{{{{{{{\rm{FP}}}}}}}})\times ({{{{{{{\rm{TN}}}}}}}}+{{{{{{{\rm{FN}}}}}}}})}}$$
(25)

where true positives (TP) and false positives (FP) represent the number of correctly-predicted binding sites and incorrectly predicted binding sites, respectively. True negatives (TN) and false negatives (FN) represent the number of correctly predicted non-binding sites and incorrectly-predicted non-binding sites, respectively. TPR describes the proportion of correctly predicted binding sites in all positive samples, TNR indicates the proportion of correctly predicted non-binding sites in the total negative samples, and Pre represents the probability of correct prediction in all samples with predicted binding sites.

In unbalanced data, since ACC cannot accurately capture the strengths of the model, we adopted ACC as an additional metric for evaluation. In addition, another two metrics, AUROC and AP are calculated related to the predicted probability of each amino acid to measure the unbalanced data. AUROC is not influenced by sample imbalance and can accurately measure model performance in unbalanced data61. AP is a weighted average of the accuracy of each threshold in the dataset, with the change in recall as the weight, which can be defined as follows:

$${{{{{{{\rm{AP}}}}}}}}=\mathop{\sum}\limits_{n}\left({R}_{n}-{R}_{n-1}\right){P}_{n}$$
(26)

where Rn and Pn are the recall and precision at the n-th threshold.

### Statistics and reproducibility

The statistical analyses of the data were conducted using the Python software package. We used the asymmetric bagging algorithm to focus on the imbalance of the data to reduce its impact on the experimental results. The reproducibility of experiments was ensured by performing a minimum of three independent replicates for each condition. Replicates were performed by different researchers, and the data were combined and analyzed using appropriate statistical tests. Overall, our experiments were designed to be highly reproducible. All materials and procedures were clearly described in the methods section, and the data were carefully collected and analyzed using standard statistical methods. We believe that these measures have increased the reliability and reproducibility of our results.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.