SOFB is a comprehensive ensemble deep learning approach for elucidating and characterizing protein-nucleic-acid-binding residues

Proteins and nucleic acids are essential components of living organisms that interact in critical cellular processes. Accurate prediction of nucleic-acid-binding residues in proteins can contribute to a better understanding of protein function. However, the discrepancy between the available protein sequence information and the structural and functional data obtained so far renders most current computational models ineffective. Therefore, it is vital to design computational models that identify nucleic-acid-binding sites in proteins from protein sequence information. Here, we present SOFB, an ensemble deep learning method for identifying nucleic-acid-binding residues on proteins. SOFB characterizes protein sequences by learning the semantics of biological dynamic contexts, and then uses an ensemble deep learning-based sequence network to learn feature representations and perform classification by explicitly modeling dynamic semantic information. The language learning model, which is transferred from natural language to biological language, captures the underlying relationships of protein sequences, while the sequence network, which combines different convolutional layers with Bi-LSTM, refines various features for optimal performance. Meanwhile, to address the class imbalance, we adopt ensemble learning to train multiple models and then combine them. Our experimental results on several DNA- and RNA-binding residue datasets demonstrate that the proposed model outperforms other state-of-the-art methods. In addition, we conduct an interpretability analysis of the identified nucleic-acid-binding residue sequences based on the attention weights of the language learning model, revealing novel insights into the dynamic semantic information that supports the identified nucleic-acid-binding residues.
SOFB is available at https://github.com/Encryptional/SOFB and https://figshare.com/articles/online_resource/SOFB_figshare_rar/25499452.

1 Supplementary Note 1: The details of the datasets used.
Supplementary Table 1: The details of the datasets used. From top to bottom, the name of the datasets, the number of protein sequences in the datasets, the number of nucleic-acid-binding residues in the datasets and the number of non-nucleic-acid-binding residues in the datasets. Source data are provided with this paper.

2 Supplementary Note 2: The details of comparison with other language model
Supplementary Figure 1 illustrates the overall prediction performance of SOFB across different feature characterizations. Supplementary Figure 1 indicates that, despite fine-tuning, the performance of ESM (esm1_t34_670M_UR100) [1] shows negligible improvement. In terms of recognizing DNA and RNA binding residues, the AUCs improved marginally by 0.001 and 0.003, reaching 0.898 and 0.802, respectively. However, these enhancements remain insufficient compared to the bio-language learning model initially employed. Notably, the performance of the latest ESM2 (esm2_t12_35M_UR50D) [2] surpasses both ESM and fine-tuned ESM, exhibiting superior metrics across the board, particularly in F1 and MCC. Specifically, ESM2 achieved AUCs of 0.902 and 0.822 for the DNA and RNA tasks, respectively. However, after fine-tuning, the performance of ESM2 declined, with AUCs dropping to 0.869 and 0.780, representing decrements of 0.032 and 0.042, respectively. This decline might stem from the ESM2 model being less suitable for fine-tuning than ESM, possibly because changes in certain layer parameters caused a departure from its original performance level.
Supplementary Figure 1

4 Supplementary Note 4: The experiment of chain interactions
We conducted another experiment to investigate the effect of chain interactions on the prediction performance of SOFB. In particular, we varied the number of protein chains in the training set, forming five experimental groups with 100, 200, 300, 400, and all protein chains. The experimental results are tabulated in Supplementary Table 2. The results show that, in the DNA-binding residue recognition task, as the number of protein chains increases, the interactions between protein chains are enhanced, which improves the results.
Moreover, in the RNA-binding residue identification task, the prediction performance also improves with the increasing number of protein chains. In addition, the ROC curves of the experimental results are illustrated in Supplementary Figure 3, from which we can observe that, although the number of protein chains in the training set had only a slight impact on the performance of SOFB, an increase in the number of chains enhanced the interaction between protein chains, thereby further improving the predictive capability of SOFB.
Supplementary Table 2: From left to right, each column shows the number of protein chains, the number of amino acids, the number of binding residues, the number of non-binding residues, Rec, Pre, F1, MCC and AUC, respectively. Source data are provided with this paper.

6 Supplementary Note 6: The result of SOFB on different protein family
We conducted experiments to evaluate the performance of SOFB in predicting nucleic acid binding residues of proteins across different protein families. We primarily utilized the protein binding residue dataset obtained from BioLip [4] in our study. Although data classified by protein family cannot be obtained directly from BioLip, we categorized the proteins in the BioLip test set by looking up the provided protein IDs in InterPro [5], thereby classifying the proteins according to their protein families. We then evaluated the performance of SOFB across different protein families using MCC as the evaluation metric.
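The per-family evaluation described above can be sketched as follows. This is a minimal illustration rather than the authors' evaluation script: the function names, the record layout, and the toy labels are hypothetical, and the sketch assumes that residue-level counts are pooled over all chains of a family before MCC is computed (the text does not say whether pooling or per-chain averaging was used).

```python
import math
from collections import defaultdict


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mcc_by_family(records):
    """records: iterable of (family_id, true_labels, predicted_labels),
    one entry per protein chain, with 0/1 labels per residue.
    Assumption: counts are pooled over all chains of a family."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for family, y_true, y_pred in records:
        for t, p in zip(y_true, y_pred):
            # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
            key = ("t" if t == p else "f") + ("p" if p else "n")
            counts[family][key] += 1
    return {family: mcc(**c) for family, c in counts.items()}


# Toy labels, invented for illustration only (not data from the paper).
toy_records = [
    ("PF00440", [1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0]),
    ("PF00013", [1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0]),
]
family_mcc = mcc_by_family(toy_records)
for family, score in sorted(family_mcc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{family}: MCC = {score:.3f}")
```

Sorting the resulting dictionary by score, as in the last loop, gives the kind of ranking from which the best-characterized families can be read off.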
The experimental results are summarized at the top of Supplementary Figure 5, which shows the seven protein families that were best characterized. From Supplementary Figure 5, we can observe that SOFB performed best on the Bacterial regulatory proteins, tetR family (PF00440), which represents a DNA-binding domain with a helix-turn-helix (HTH) structure.
The MerR HTH family regulatory protein (PF13411) has a winged helix-turn-helix (wHTH) structural domain. The MarR family (PF01047) and the Myb-like DNA-binding domain (PF00249) also belong to the HTH clan. Therefore, we infer that SOFB concentrates on DNA-binding residues in proteins of the HTH clan.
Supplementary Figure 5: Prediction results of SOFB for different protein families within the DNA- and RNA-binding test datasets (using MCC as the metric), showing the protein families with the best results for each task. Source data are provided with this paper.
In the RNA-binding residue prediction task, as illustrated at the bottom of Supplementary Figure 5, SOFB performs best on the KH domain (PF00013), which is present in a wide variety of nucleic acid-binding proteins.
The binding residues of the Pumilio-family RNA binding repeat (PF00806), whose Puf domains usually occur as a tandem repeat of 8 domains, were also accurately inferred by SOFB. Unfortunately, SOFB exhibited limited performance in the RNA binding prediction task within other protein families. We speculate that this is due to the presence of multiple proteins or more repeats within protein families PF00013 and PF00806, which allows SOFB to learn more informative features of such proteins. Overall, in both DNA binding and RNA binding prediction tasks, SOFB has demonstrated excellent performance in certain protein families.

8 Supplementary Note 8: The results of SOFB on other datasets
We conducted additional experiments to explore and compare the predictive capabilities of SOFB. Firstly, we collected the YFK16, YK17 and MW15 test datasets from [6]; each dataset consists of 2 subsets, for protein-DNA and protein-RNA binding. We then evaluated the nucleic-acid-binding residue prediction performance on these datasets using AUC as the metric, comparing SOFB with other baseline models including DRNApred, COACH-D, SVMnuc, Nucbind and iDRNA-ITF.

Besides, due to the limited size of the test datasets, we ultimately selected the YK17 training dataset as our large-scale benchmark dataset. In particular, we employed the CD-HIT method to eliminate protein sequences with identity exceeding 30%, resulting in a final dataset of 464 DNA-binding proteins totaling 106,081 amino acids and 416 RNA-binding proteins totaling 95,020 amino acids. Subsequently, we compared the predictive performance of SOFB with iDRNA-ITF [3] on this dataset to validate the performance of SOFB on large-scale datasets.
Supplementary Table 5: The number of protein entries within the large dataset used, the number of binding residues versus non-binding residues, and the performance of SOFB and iDRNA-ITF on the dataset. The results of SOFB are superior to those of iDRNA-ITF, which is currently the best performer. Source data are provided with this paper.

The experimental results are summarised in Supplementary Table 5. In terms of DNA binding residue prediction, we can observe from the table that SOFB outperformed iDRNA-ITF by a clear margin on the large-scale dataset.
For instance, SOFB achieved an AUC improvement of over 5% and a precision improvement of over 10%. In the RNA binding prediction task, we observed a narrower gap between SOFB and iDRNA-ITF. Both models exhibited similar AUC values, but SOFB continued to outperform iDRNA-ITF in precision, reaching a precision score of 0.86. Overall, our SOFB model remains effective and highly competitive even on larger-scale datasets.
9 Supplementary Note 9: The details of Case Study
We conducted additional analyses and employed a more consistent criterion for protein selection: the three protein chains with the highest MCC scores obtained by the best two models (SOFB and iDRNA-ITF) were selected.
For the DNA task, the top three proteins are 5h3r A, 6c31 A and 6enb A, and for the RNA task, they are 6htu A, 5www A and 5wzg A. We compared the predictive results of SOFB and iDRNA-ITF and visualized the nucleic-acid-binding residues of these proteins for DNA binding and RNA binding, respectively.
The visualizations of the protein chains are shown in Supplementary Figure 7 and Supplementary Figure 8.
The DNA-binding protein 5h3r A consists of 141 amino acids and 20 DNA-binding residues, as depicted in Supplementary Figure 7. Both SOFB and iDRNA-ITF accurately predicted all 20 binding residues of the protein.
However, iDRNA-ITF predicted more false positives than SOFB, so the precision (Pre) of SOFB was 0.188 higher. SOFB achieved an F1 of 0.909 and a Matthews correlation coefficient (MCC) of 0.898. In contrast, iDRNA-ITF resulted in F1 and MCC values of 0.784 and 0.766.
The three protein chains with the best results for SOFB demonstrate its superior ability to detect true nucleic-acid-binding sites in sequences that prove challenging for alternative methods. Although SOFB predicts false positives for amino acids at various positions, they are predominantly spatially proximate to the nucleic acid. This finding suggests that SOFB can glean spatial structure information from one-dimensional sequence data, such as residue positions in three-dimensional space after protein folding, and utilize it for nucleic-acid-binding residue identification.
The RNA-binding protein 6htu A comprises 76 amino acids and 16 RNA-binding residues, as shown in Supplementary Figure 8. SOFB missed three binding sites on this protein chain, while iDRNA-ITF missed only one. However, the success of iDRNA-ITF in predicting more binding sites came at the cost of eighteen false-positive amino acids (the number of false positives for SOFB was one). As a result, its F1 and MCC were 0.254 and 0.313 lower than those of SOFB.
The RNA-binding protein 5www A comprises 94 amino acids and 24 RNA-binding residues. SOFB and iDRNA-ITF identified 19 and 17 binding residues, with 3 and 9 false positives, for this protein, respectively. From these results, it is evident that SOFB exhibits more pronounced advantages in the recognition of RNA binding residues. Its superiority lies in its ability to limit false positives while identifying a comparable number of true binding residues. This factor contributes to its enhanced performance in this particular task.
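The residue counts quoted for 6htu A and 5www A are enough to recompute the metrics by hand. The sketch below does so with the standard definitions of Rec, Pre, F1 and MCC. The helper name `binary_metrics` is hypothetical, and the true-negative count is derived under the assumption that every residue not annotated as binding counts as non-binding.

```python
import math


def binary_metrics(n_residues, n_binding, tp, fp):
    """Return (Rec, Pre, F1, MCC) from residue-level counts."""
    fn = n_binding - tp                 # binding residues that were missed
    tn = n_residues - n_binding - fp    # non-binding residues correctly rejected
    rec = tp / (tp + fn)
    pre = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return rec, pre, f1, mcc


# 6htu A: 76 residues, 16 binding residues (counts taken from the text).
sofb_6htu = binary_metrics(76, 16, tp=13, fp=1)    # SOFB missed 3 sites
itf_6htu = binary_metrics(76, 16, tp=15, fp=18)    # iDRNA-ITF missed 1 site

# 5www A: 94 residues, 24 binding residues.
sofb_5www = binary_metrics(94, 24, tp=19, fp=3)
itf_5www = binary_metrics(94, 24, tp=17, fp=9)

f1_gap = sofb_6htu[2] - itf_6htu[2]
mcc_gap = sofb_6htu[3] - itf_6htu[3]
print(f"6htu A: F1 gap = {f1_gap:.3f}, MCC gap = {mcc_gap:.3f}")
print(f"5www A: Pre SOFB = {sofb_5www[1]:.3f}, Pre iDRNA-ITF = {itf_5www[1]:.3f}")
```

For 6htu A, the recomputed gaps agree with the differences of 0.254 (F1) and 0.313 (MCC) reported above.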
test sets) were selected, and we then performed random mutations on the amino acids incorporating the positions of the binding residue. Subsequently, the mutant sequences were fed into the NABert model for calculating the attention score. Specifically, the scores of an amino acid in the 16 heads of the last layer were averaged to get the attention score for that amino acid.
We subsequently conducted a statistical analysis to evaluate the differences in attention scores before and after these positional mutations.Specifically, we computed the attention scores for each mutation and performed a t-test to evaluate the significance of the differences in attention scores.
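The two steps above (head-averaged attention scores, then a t-test on scores before and after mutation) can be sketched as follows. Several details are assumptions, because the text does not specify them: the score of a residue is taken here as the attention it receives, averaged over query positions and then over the 16 heads; the test is assumed to be paired; and all arrays and score values are synthetic. Only the t statistic is computed, to keep the sketch dependency-light; a p-value would follow from Student's t distribution with n-1 degrees of freedom, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics

import numpy as np


def residue_attention(last_layer_attn):
    """last_layer_attn: array of shape (heads, L, L), rows = query, cols = key.
    Assumption: a residue's score is the attention it receives,
    averaged over queries and then over heads."""
    per_head = last_layer_attn.mean(axis=1)   # shape (heads, L)
    return per_head.mean(axis=0)              # shape (L,)


def paired_t_statistic(before, after):
    """t statistic of a paired t-test on before/after scores."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Synthetic attention for one sequence: 16 heads, 30 residues.
rng = np.random.default_rng(0)
heads, length, site = 16, 30, 12
attn = rng.random((heads, length, length))
attn[:, :, site] += 1.0                       # toy: residue 12 is highly attended
attn /= attn.sum(axis=-1, keepdims=True)      # rows sum to 1, like softmax output
scores = residue_attention(attn)

# Invented scores at four binding positions, before and after mutation.
before = [0.90, 0.80, 0.85, 0.95]
after = [0.60, 0.70, 0.65, 0.70]
t_stat = paired_t_statistic(before, after)
print(scores.shape, int(scores.argmax()), round(t_stat, 3))
```

A positive t statistic in this setup corresponds to lower attention scores after mutation.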
The experimental results were summarised in Supplementary Figure 10. In terms of DNA binding residue prediction, we can observe from Supplementary Figure 10 (a) that differences were observed in the attention scores before and after the mutations, with the p-value equal to or less than 0.05. For the RNA binding residue prediction, it can also be demonstrated from Supplementary Figure 10 (b) that the changes can be observed in the attention scores after the mutations. Furthermore, we observed that in both DNA binding prediction tasks and RNA binding prediction tasks, the attention scores of the mutated positions were lower compared to those before the mutations. This finding suggests that SOFB has the ability to concentrate more attention on biologically relevant positions, indicating its potential for discovering functional sites.
These statistical results and hypothesis tests provide evidence for the effectiveness and potential interpretability of SOFB, offering different insights into the identification of functional sites.
12 Supplementary Note 12: The results of SOFB that incorporate with the RoseTTAFold method
RoseTTAFoldNA [7] extends the end-to-end deep learning approach of RoseTTAFold to model nucleic acids and protein-nucleic acid complexes, and can rapidly produce three-dimensional structure models with confidence estimates for protein-DNA and protein-RNA complexes and for RNA tertiary structures. RoseTTAFoldNA is broadly useful for modeling the structure of naturally occurring protein-nucleic acid complexes, and for designing sequence-specific RNA and DNA binding proteins.
Unfortunately, the RoseTTAFoldNA model requires both protein sequences and RNA or DNA sequences to predict protein structures or DNA-protein binding, whereas our study only involves the recognition of nucleic acid-binding residues within protein sequences, without specific nucleic acid sequences. Therefore, it is difficult to apply the RoseTTAFoldNA model to our study. Nonetheless, we conducted additional experiments and employed RoseTTAFold [8], the prototype of RoseTTAFoldNA, to generate extensive protein structure information and integrate it as part of the bio-information into SOFB to predict the nucleic acid-binding residues. Specifically, after obtaining the structural information for all training and test sets, we combined it with the 75-dimensional bio-feature to obtain an 89-dimensional bio-feature for nucleic acid-binding residue prediction. Subsequently, we employed the newly integrated features to train the SOFB model, and evaluated the performance of the model on the DNA-binding and RNA-binding test sets.
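The feature integration step can be pictured as a per-residue concatenation. The sketch below is an assumption-laden illustration, not the authors' pipeline: the function name `merge_features` is hypothetical, the arrays are placeholders, and the width of 14 for the structural block is only inferred from 89 - 75 (the text does not say which structural descriptors are used).

```python
import numpy as np


def merge_features(bio_feat, struct_feat):
    """Concatenate per-residue feature matrices along the feature axis.
    Both inputs must have one row per residue of the same protein chain."""
    if bio_feat.shape[0] != struct_feat.shape[0]:
        raise ValueError("feature matrices must cover the same residues")
    return np.concatenate([bio_feat, struct_feat], axis=1)


n_residues = 120                                         # toy chain length
bio = np.zeros((n_residues, 75), dtype=np.float32)       # 75-dimensional bio-feature
struct = np.ones((n_residues, 14), dtype=np.float32)     # assumed 14 structural dims
merged = merge_features(bio, struct)
print(merged.shape)  # (120, 89)
```

Keeping the original 75 columns first means that a model trained on the merged matrix sees the earlier bio-feature unchanged, with the structural descriptors appended as extra columns.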

Supplementary Figure 1: (a) shows the average Recall (Rec), Precision (Pre), F1, MCC of ten runs of six dynamic contextual embeddings (ProtVec, ESM, Finetune ESM, ESM2, Finetune ESM2, ProGen) on DNA-binding test set and RNA-binding test set; (b) provides ROC curves with AUC values, PR curves with AP values for DNA-binding residue and RNA-binding residue predictions, respectively, where SOFB performs best by both metrics. Source data are provided with this paper.

3 Supplementary Note 3: The details of the heat maps in the correlation analysis
Supplementary Figure 2 exhibits the feature correlation of SOFB using different feature characterizations. Upon analysis of the correlation heatmaps, we observe that SOFB consistently demonstrates superior performance compared to all other dynamic methods in the classification of amino acids and the segregation of amino acids within a sequence into two distinct groups, which is crucial for the subsequent identification of nucleic acid binding residues. It is noteworthy that, in the context of recognizing DNA binding residues, the heatmap derived from fine-tuned ESM2 lacks discriminatory patterns entirely, which accounts for the stark decline in performance observed. Overall, the feature construction strategy of SOFB surpasses other methods, showcasing the effectiveness and robustness of SOFB.

Supplementary Figure 2: Heat maps of correlation analysis of six dynamic contextual embeddings (ProtVec, ESM, Finetune ESM, ESM2, Finetune ESM2, ProGen) and SOFB on DNA-binding test set and RNA-binding test set are given. Source data are provided with this paper.

Supplementary Figure 4: A comparison of SOFB with other state-of-the-art algorithms on the DNA (RNA)-binding test sets, where the results of the other algorithms are as reported in [3]. The violin plots depict multiple performance metrics of the different baseline methods along with SOFB, where SOFB outperforms all other methods. The triangle markers denote Recall (Rec), Precision (Pre), F1 score, Matthews correlation coefficient (MCC) and Area Under the Curve (AUC) (n=5). Source data are provided with this paper.

7 Supplementary Note 7: The results of Ablation Study

Supplementary Figure 6: (a) provides ablation experiments of SOFB on the nucleic-acid-binding test sets in different settings: from top to bottom, SOFB, setting (a), setting (b), setting (c) and setting (d), followed by setting (e), setting (f) and setting (g), where settings (a, b, c and d) and (e, f, g) are ablations of the structure and feature matching modules, respectively. (b) shows the ROC curves of the ablation experiments on the DNA- and RNA-binding residue prediction tasks: from top to bottom, a (no Bi-LSTM), b (no Diff-k sizes), c (no Stack module) and d (no State), followed by e (Both ProtT5), f (Both NABert), g (Exchange module) and SOFB, where settings (a, b, c and d) and (e, f, g) are ablations of the structure and feature matching modules, respectively (n=5). Source data are provided with this paper.

Supplementary Table 3: The values of Pre, Rec, F1, MCC and AUROC for the different baseline methods and SOFB, where SOFB outperforms all other methods. Source data are provided with this paper.
To comprehensively demonstrate the effectiveness of our proposed SOFB, we combined multiple performance metrics and depicted them in a violin plot (Supplementary Figure 4), where each data point (represented by a triangle) signifies a predictive metric for the respective model. It is evident that SOFB exhibits superior overall performance compared to other methods. Supplementary Table 3 is given for a detailed comparison.

As illustrated in Supplementary Table 4, we can observe that SOFB performs best compared with the other methods on all datasets. For instance, in DNA binding prediction, none of the baseline methods achieved an AUC exceeding 0.9, and both DRNApred and COACH-D fell short of 0.8. In contrast, SOFB exhibited the highest prediction AUC, reaching 0.949. This underscores the effectiveness of SOFB in recognizing amino acid binding patterns. Furthermore, in RNA binding prediction, a decrease in performance was observed for all methods except SOFB. For example, DRNApred yielded a prediction AUC of only 0.467 on the MW15 dataset, while COACH-D achieved merely 0.579 on the same dataset.