Supporting online material for: Protein embeddings and deep learning predict binding residues for various ligand classes

1 TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
2 TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
3 Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany & TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
4 Department of Biochemistry and Molecular Biophysics, Columbia University, 701 West 168th Street, New York, NY 10032, USA


Short description of Supporting Online Material
In this Supporting Online Material (SOM), we present a more thorough performance assessment and provide more details about the data set used and the underlying redundancy reduction (Section 2.1), the Machine Learning (ML) method (Fig. S13), the MMseqs2 commands (Section 2.3), the calculation of error estimates (Section 2.4), and related work (Section 3). Section 1.1 provides more details about the performance of bindEmbed21DL, showing an assessment on various data sets (Table S1, Table S3), a comparison to random (Table S2) and to a binarized version of bindEmbed21DL, namely bindEmbed21DL-binary (Table S6), and a more thorough analysis of the effect of over-prediction (Table S4) and cross-predictions (Table S5) on the overall performance of bindEmbed21DL.
Over-predictions could be reduced, and performance thereby increased, by only considering residues as binding if at least x residues were predicted as binding in a protein (Fig. S6). Additionally, the output probability of the method could be used to tune CovOneBind, CovNoBind, and precision (Section 1.5). A more thorough comparison of different annotations used to define binding showed that the choice of annotations can strongly affect the measured performance of a prediction method (Sections 1.2 & 1.4, Fig. S3). However, the performance improvement of bindEmbed21DL over its predecessor bindPredictML17 was mainly due to replacing MSA-based input features with embeddings (Section 1.3, Fig. S2).
Combining bindEmbed21DL with homology-based inference (HBI) increased precision and F1 even for high E-value thresholds, while recall dropped below the level of bindEmbed21DL for E-values > 10^-3 (Fig. S7). Small changes in performance were due to only few new residues being inferred as binding at higher E-values (Fig. S8). Combining both approaches at an E-value cutoff of 10^-3 led to an increase in CovNoBind but a drop in CovOneBind (Table S9). bindEmbed21DL could be applied to obtain binding residue predictions for 92% of the human proteome (Section 1.7, Table S10, Table S11). A comparison of the distributions of prediction scores for experimentally verified binding residues, residues inferred through HBI, and previously unknown binding residues showed that previously unknown binding residues were predicted with, on average, slightly lower probability (Fig. S9). Neither an enrichment of disordered proteins or transmembrane proteins nor a different length distribution could explain this difference (Fig. S10).

* We show precision, recall, F1, and MCC for the test set (TestSet300) using a random prediction. Random was generated by randomly shuffling the prediction probabilities of bindEmbed21DL. Error estimates indicate 95% confidence intervals.
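The random baseline described above (shuffling the per-residue prediction probabilities) can be sketched as follows; the probability values and cutoff are illustrative, not taken from the actual model output:

```python
import random

def random_baseline(probabilities, cutoff=0.5, seed=42):
    # Shuffle the per-residue output probabilities of a predictor. This
    # keeps the overall score distribution (and the number of residues
    # predicted as binding at a fixed cutoff) identical, but destroys
    # any positional signal - exactly what a random baseline should do.
    shuffled = list(probabilities)
    random.Random(seed).shuffle(shuffled)
    return [p >= cutoff for p in shuffled]

# Hypothetical per-residue probabilities for a short protein:
probs = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05]
baseline = random_baseline(probs)
# Same number of positives as the real prediction, at random positions.
assert sum(baseline) == sum(p >= 0.5 for p in probs)
```

Because only positions are shuffled, precision and recall of this baseline reflect chance agreement under the method's own score distribution.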

Details on performance assessment of bindEmbed21DL.
Performance differed between ligand classes (Table S1). This could be due to differences in biophysical properties (i.e., small molecule binding being more clearly encoded in the embeddings) or due to differences in the data distribution (i.e., small molecule binding being more abundant in the development set, Table S12). To investigate, we re-trained bindEmbed21DL using a smaller development set of 515 proteins (DevSet515, Table S3) with only 108 proteins binding to small molecules. For this new set, performance for small molecule binding dropped by 22 percentage points (Table S3). This suggested that data abundance rather than biophysical properties explained the difference in performance. If anything, nucleic acid binding seemed easier to predict because its biophysical properties were more clearly encoded in the embeddings: this ligand class was predicted with acceptable performance (Table S1) even though the number of proteins in DevSet1014 was fairly small compared to the other two classes (Table S12). While for all three ligand classes at least one residue was predicted as binding for over 86% of the proteins (CovOneBind, Eqn. 8 in main text; metal 86%, nucleic 93%, small 96%, Table S4), this high coverage of experimentally known ligands came with what appeared to be over-prediction, as measured by the fraction of proteins not (yet) experimentally known to bind a particular ligand l for which binding was nevertheless predicted (1-CovNoBind(l), Eqn. 9 in main text): While binding to nucleic acids was predicted for only 19% of proteins without experimental data for nucleic acid binding (1-CovNoBind(nucleic acid)=100%-81%), this number rose to three fourths of the proteins for small molecules (Table S4). Metal ions and small molecules were also most often cross-predicted, i.e., residues in fact binding to small molecules were often predicted as binding to metal ions and vice versa (Table S5).
This also explained the higher performance of the binary prediction (binding vs. non-binding) compared to the performance for the individual ligand classes: Some residues were incorrectly predicted as binding to a certain ligand class and were therefore considered false positives for this ligand class, but they could in general be involved in binding.

* In each row, CovOneBind (Eqn. 8 in main text) indicates the number of proteins for which at least one residue was (correctly or incorrectly) predicted to bind to this ligand class (or any ligand class for the last row). CovNoBind(l) (Eqn. 9 in main text) is the percentage of proteins not annotated to bind to a certain ligand class for which also no residue was predicted as binding. Since the data set did not contain proteins without any binding annotations, the negative coverage is not defined for the general prediction of binding residues (last cell in the table). While bindEmbed21DL achieved a reasonable coverage, the negative coverage was low for metal ions and small molecules, indicating that too many residues were predicted to bind to one of these two ligand classes. Data set: DevSet1014.

* Rows indicate residues predicted by bindEmbed21DL as binding to a specific ligand; columns show the experimental (true) annotations. Values on the diagonal in bold font mark correct predictions. Most incorrect binding predictions were in fact non-binding residues. In addition, many residues predicted to bind metal ions in fact bind small molecules and vice versa. Data: DevSet1014.
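The two coverage measures (Eqn. 8 and Eqn. 9 in the main text) can be computed as sketched below; the per-protein data layout (dictionaries mapping ligand classes to sets of residue indices) is an illustrative assumption, not the paper's actual format:

```python
def cov_one_bind(proteins, ligand):
    # Eqn. 8: fraction of proteins annotated to bind `ligand` for which
    # at least one residue was predicted to bind that ligand.
    annotated = [p for p in proteins if p["annotated"].get(ligand)]
    covered = [p for p in annotated if p["predicted"].get(ligand)]
    return len(covered) / len(annotated)

def cov_no_bind(proteins, ligand):
    # Eqn. 9: fraction of proteins NOT annotated to bind `ligand` for
    # which also no residue was predicted to bind that ligand.
    not_annotated = [p for p in proteins if not p["annotated"].get(ligand)]
    clean = [p for p in not_annotated if not p["predicted"].get(ligand)]
    return len(clean) / len(not_annotated)

# Toy data: residue indices per ligand class (layout is illustrative).
proteins = [
    {"annotated": {"metal": {3}}, "predicted": {"metal": {3, 4}}},
    {"annotated": {"metal": {7}}, "predicted": {}},
    {"annotated": {"small": {1}}, "predicted": {"metal": {2}}},
    {"annotated": {"small": {2}}, "predicted": {"small": {2}}},
]
assert cov_one_bind(proteins, "metal") == 0.5  # 1 of 2 metal binders hit
assert cov_no_bind(proteins, "metal") == 0.5   # 1 of 2 non-binders cross-predicted
```

The third toy protein illustrates the cross-prediction problem discussed above: a small-molecule binder with a (false) metal prediction lowers CovNoBind(metal).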
Separately predicting whether a residue binds to a metal ion, a nucleic acid, or a small molecule is a more complicated prediction task than the binary distinction between binding and non-binding residues. To investigate whether performance could improve by only training on the binary task, we developed bindEmbed21DL-binary, trained to distinguish binding from non-binding residues. On the same validation set as bindEmbed21DL, bindEmbed21DL-binary achieved F1=40±2%, i.e., one percentage point higher than bindEmbed21DL trained on three different ligand classes (Table S6). The two results could not be distinguished statistically, implying that the higher complexity of training on three ligand classes did not clearly affect performance. On the one hand, ML models tend to do better when applied to the same problem used for training, i.e., the class-agnostic method, bindEmbed21DL-binary, should have performed better. On the other hand, a more precisely defined task is easier to learn, i.e., the method trained on three classes, bindEmbed21DL, should have performed better. The observation of "no significant improvement" might have been the result of these two opposing trends.

* While being trained on the more complex task of distinguishing between three different ligand classes, bindEmbed21DL achieved F1=39±2%, only one percentage point worse than bindEmbed21DL-binary (F1=40±2%), which was only trained on predicting binding vs. non-binding residues. All performance values are reported on the validation set. Error estimates indicate 95% confidence intervals.
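Collapsing the three per-class label tracks into a single binding/non-binding track, as one would to train a class-agnostic model such as bindEmbed21DL-binary, is a per-residue OR; the list-of-ints label encoding here is an assumption for illustration:

```python
def to_binary_labels(metal, nucleic, small):
    # Collapse three per-residue ligand-class tracks (1 = binding) into
    # one binding/non-binding track: a residue is binding if it binds
    # any of the three ligand classes.
    return [int(m or n or s) for m, n, s in zip(metal, nucleic, small)]

# Residue 0 binds a metal ion, residue 2 a nucleic acid, residue 1 nothing.
assert to_binary_labels([1, 0, 0], [0, 0, 1], [0, 0, 0]) == [1, 0, 1]
```

This also mirrors the evaluation point made above: a residue predicted for the wrong ligand class is a false positive per class, but can still be a true positive under the binary labels.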

AI identified annotation errors.
Unlike bindEmbed21DL, bindPredictML17 1 was trained using annotations available through PDB 2 for enzymes and through PDIdb 3 for DNA-binding proteins. However, some binding annotations in the PDB might reflect crystal-induced rather than biologically relevant binding 4 . Therefore, we used annotations from BioLiP 4 for the training of bindEmbed21DL. Considering the predictions of bindPredictML17 for the 225 test proteins, we observed a better performance when using annotations from BioLiP for evaluation than when using annotations from PDB or PDIdb, although bindPredictML17 was trained on those annotations (Fig. 2A in the main text, lighter shaded bars higher than lightest shaded bars). First, despite training on noisy data, the seemingly false negative predictions of bindPredictML17 (Fig. 2B in the main text, rightmost bar labeled 'FN') were in fact often due to wrong annotations in the PDB.
Without any re-training, the number of FN dropped by almost 40% when evaluating on annotations from BioLiP ( Fig. 2B in the main text). Hence, bindPredictML17 had correctly captured incorrect binding annotations as non-binding. Secondly, these differences highlighted the importance of using high-quality binding annotations. Training on less noisy data might have been one reason for the improvement of bindEmbed21DL over bindPredictML17.

Fig. S1: Seemingly false negative predictions in fact incorrect annotations.
Investigating the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) revealed that bindPredictML17 predicted many more FN when measured by PDB annotations than by BioLiP annotations. Hence, bindPredictML17 had captured the incorrect binding annotations from the PDB, correctly predicting those residues as non-binding; this worsened its performance when assessed against those annotations but actually captured the true binding residues better. More details on the comparison of bindPredictML17 using BioLiP or PDB annotations can be found in SOM, Section 1.2.

Performance gain mainly attributed to the replacement of MSAs with embeddings.
To investigate whether the performance gain of bindEmbed21DL over bindPredictML17 was mainly due to training on less noisy data or due to the replacement of MSA-based input features with embeddings, we re-trained bindEmbed21DL using the original training set of 412 proteins and the corresponding binding annotations of bindPredictML17. This model, bindEmbed21DL-PDB, already outperformed bindPredictML17 by, e.g., 13 percentage points in terms of F1 score (42±3% vs. 29±2%; Fig. S2). The replacement of PDB annotations with BioLiP annotations, which also increased the data set size from 412 to 1,014 proteins, resulted in a performance improvement of another five percentage points. Hence, training on BioLiP annotations instead of PDB annotations clearly improved performance, but the major gain in performance was achieved by replacing MSA-based features with data-driven inputs, namely embeddings.

bindPredictML17, trained on a set of 412 proteins and PDB annotations, achieved F1=29±2% (rightmost, lightest shaded bars). Training bindEmbed21DL on the same set but using embeddings as input improved performance by 13 percentage points, leading to F1=42±3% (middle, darker shaded bars). Replacing PDB annotations with less noisy annotations from BioLiP improved performance by another five percentage points to F1=47±2% (leftmost, darkest shaded bars). This clearly showed that while using high-quality data was important, the major improvement was achieved by replacing MSA-based features with embeddings.

Definition of binding highly influences performance.
In general, bindEmbed21DL achieved a higher F1 score and precision than ProNA2020 5 , while ProNA2020 achieved a higher recall, indicating that ProNA2020 predicted larger binding sites (see main text). ProNA2020 was trained on a different set of annotations obtained from PDIdb 3 and the Protein-RNA Interface Database (PRIDB) 6 . In this set, on average 21% of residues are annotated to bind to DNA or RNA, compared to 12% for nucleic acid binding proteins in the test set of bindEmbed21DL. Therefore, ProNA2020 was trained on data where binding sites for DNA and RNA are more broadly defined and therefore consist of more binding residues, leading to an over-prediction of binding residues by ProNA2020 for the test set of bindEmbed21DL. Since ProNA2020 was trained on different annotations, evaluating it using annotations from BioLiP is an unfair comparison. Therefore, we also assessed performance using the test set and annotations from ProNA2020. Using the 106 proteins binding to DNA or RNA from the test set of ProNA2020, ProNA2020 achieved F1=44±4% (Precision=45±5%, Recall=58±6%), while bindEmbed21DL-XNA achieved F1=38±5% (Precision=66±7%, Recall=32±5%) (Fig. S3). Therefore, bindEmbed21DL-XNA performed worse in terms of F1 score than ProNA2020 on its test set. However, the precision of bindEmbed21DL was significantly higher than that of ProNA2020. Hence, the major difference between ProNA2020 and bindEmbed21DL seems to lie in the definition of what is involved in binding: While predictions from ProNA2020 focus on larger patches of binding residues, thereby covering more of the actual binding site, bindEmbed21DL rather focuses on the prediction of key binding residues, losing recall by making fewer predictions but making more precise ones.

ProNA2020 (lightest shaded bars) was trained on a different set of annotations where, on average, 21% of residues were annotated to bind to DNA or RNA, compared to 12% in the test set of bindEmbed21DL.
To assess the effect of this different definition of binding, we evaluated performance using the test set and annotations from ProNA2020. Using the 106 proteins binding to DNA or RNA from the test set of ProNA2020, ProNA2020 achieved F1=44±4%, while bindEmbed21DL-XNA achieved F1=38±5%. Therefore, bindEmbed21DL-XNA performed worse than ProNA2020 in terms of F1, recall, and MCC on its test set.
However, the precision for bindEmbed21DL was significantly higher. Error bars indicate 95% confidence intervals.

Refinement of predictions through focus on probability cutoff or number of predictions.
We analyzed the trade-off between precision, recall, and CovOneBind as a function of the output probability of bindEmbed21DL for the different ligand classes. For higher cutoffs, precision increased while CovOneBind dropped; the opposite trends were observed for lower cutoffs (Fig. S4). Based on the results for binding in general (Fig. 3 in the main text), we expected recall to increase for lower and decrease for higher cutoffs. However, the trend was not that consistent: While recall decreased as expected for higher cutoffs for small molecules (Fig. S4C), it first decreased and then increased for metal ions (Fig. S4A), and first increased and then decreased for nucleic acids (Fig. S4B). For proteins not binding to a certain ligand class x for which any residue was predicted to bind to x, precision and recall were set to 0. Increasing the cutoff used to define a residue as binding decreased the number of residues incorrectly predicted to bind to x. Therefore, for more proteins not binding x, no residues were predicted to bind to x either, and those proteins were then ignored in the performance assessment (i.e., recall and precision were not set to 0). Hence, recall could increase for higher cutoffs because CovNoBind increased (Fig. S4).

Residues were considered as binding to a certain ligand class if the output probability of bindEmbed21DL for this class was greater than or equal to a specific cutoff. Choosing larger cutoffs led to an increase in precision and a decrease in coverage for A. metal ions, B. nucleic acids, and C. small molecules. The trend was not as clear for recall. While we would expect recall to decrease for higher cutoffs, it could also increase in this scenario due to an increase in negative coverage, i.e., if a residue is predicted to bind to a certain ligand class in a protein not binding to this class at all, recall is set to 0. If the number of such false positive predictions decreases (as it does for higher cutoffs), and therefore fewer proteins are evaluated with a recall of 0, the recall could increase overall while actually decreasing for individual proteins. Black line at 0.5 marks performance for the default cutoff.
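The seemingly paradoxical effect that mean recall can rise with the cutoff follows directly from the evaluation convention described above; the minimal sketch below (toy data, hypothetical layout of annotated residue sets and per-residue probabilities) reproduces it:

```python
def mean_recall(proteins, cutoff):
    # Mean per-protein recall under the convention of the text: a protein
    # without annotated residues for the class but with >= 1 residue
    # predicted above `cutoff` contributes recall 0; a protein with
    # neither annotations nor predictions is ignored (CovNoBind case).
    recalls = []
    for annotated, probs in proteins:
        predicted = {i for i, p in enumerate(probs) if p >= cutoff}
        if not annotated:
            if predicted:
                recalls.append(0.0)  # false-positive protein counts as 0
            continue
        recalls.append(len(annotated & predicted) / len(annotated))
    return sum(recalls) / len(recalls)

# Toy data: raising the cutoff removes the false-positive-only protein
# from the evaluation, so the mean recall *increases*.
proteins = [({0}, [0.9, 0.1]), (set(), [0.6])]
assert mean_recall(proteins, 0.5) == 0.5
assert mean_recall(proteins, 0.7) == 1.0
```

The second protein binds nothing; at cutoff 0.5 it is scored with recall 0, at cutoff 0.7 it leaves the evaluation entirely, raising the mean.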
Seemingly incorrect binding predictions could in fact point towards new binding sites not yet experimentally verified. This is especially true for binding residues predicted with high probability (p≥0.95). To investigate whether this assumption holds, we compared 1024-dimensional ProtT5 embeddings and the internal representations from the first CNN layer (128 dimensions) of bindEmbed21DL for annotated binding residues, residues correctly predicted as binding (TP), and residues incorrectly predicted as binding (FP). The dimensionality of the input embeddings and representations from bindEmbed21DL was first reduced to 32 dimensions by applying a Principal Component Analysis (PCA) 7 and then further reduced to two dimensions using t-SNE 8 . For the original ProtT5 embeddings, falsely predicted binding residues formed more widely spread clusters than correct predictions, with the highly reliable predictions spread across those clusters for both false and correct binding predictions (Fig. S5B&C). Using the internal representations from bindEmbed21DL, clusters for false predictions were still more widely spread. However, highly reliable predictions were concentrated at the borders of the clusters (Fig. S5F). A similar pattern was observed for correct predictions with p≥0.95 (Fig. S5E). This indicated that highly reliable but false predictions were similar to correct predictions and could therefore point towards new potential binding residues.
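The two-step dimensionality reduction (PCA to 32 dimensions, then t-SNE to 2-D) can be sketched with scikit-learn; the random vectors below merely stand in for per-residue ProtT5 embeddings, and hyperparameters such as the perplexity are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_embeddings(X, pca_dim=32, seed=42):
    # Two-step reduction as described above: PCA to `pca_dim` dimensions,
    # then t-SNE down to 2-D for visualization.
    pca_dim = min(pca_dim, X.shape[0], X.shape[1])
    X_pca = PCA(n_components=pca_dim, random_state=seed).fit_transform(X)
    tsne = TSNE(n_components=2, random_state=seed,
                perplexity=min(30.0, X.shape[0] - 1.0))
    return tsne.fit_transform(X_pca)

# Random vectors stand in for 1024-d per-residue ProtT5 embeddings.
rng = np.random.default_rng(0)
coords = project_embeddings(rng.normal(size=(100, 1024)).astype(np.float32))
assert coords.shape == (100, 2)
```

Running PCA first is the usual way to make t-SNE tractable and less noisy on high-dimensional embeddings.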

Fig. S5: t-SNE visualizations for ProtT5 embeddings and internal representations of the first CNN layer for binding annotations, true positive and false positive predictions.
ProtT5 embeddings (1024 dimensions) and internal representations from the first CNN layer of bindEmbed21DL (128 dimensions) were first reduced to 32 dimensions using a PCA and then further mapped to 2-dimensional representations using t-SNE. Those 2-dimensional representations were visualized for ProtT5 embeddings (Panels A-C) and representations from the first CNN layer (Panels D-F). While all residues (including non-binding ones) were used to generate the 2-dimensional representations, we only visualize known binding residues (Panels A and D), correctly predicted binding residues (TP; Panels B and E), and falsely predicted binding residues (FP; Panels C and F). While highly reliable predictions were spread among all clusters for ProtT5 embeddings, they were more concentrated at the borders of the clusters for the internal representations of bindEmbed21DL. The similar patterns for highly reliable correct and false predictions indicated that highly reliable but incorrectly predicted binding residues could point towards new potential binding residues.
To provide binding predictions for as many proteins as possible, we considered a protein to bind to a specific ligand class if at least one residue was predicted to bind to this class. However, binding usually involves more than one residue, i.e., predicting only one residue as binding could indicate a wrong prediction. Predictions could be refined by only accepting binding predictions if at least x residues were predicted to bind to this ligand class in a protein. Applying this filter led to an increase in CovNoBind(l) (Eqn. 9 in main text) for larger x, while decreasing CovOneBind (Eqn. 8; Fig. S6). While precision and recall were set to 0 for proteins annotated but not predicted to bind to a certain ligand class, those performance values still increased up to a certain threshold (Fig. S6; optimal thresholds of 3, 10, and 8 residues for metal ions, nucleic acids, and small molecules, respectively). For those thresholds, more proteins falsely predicted to bind to a ligand class were removed than proteins actually binding to it. Therefore, a low number of binding predictions in a protein indicated that those predictions were incorrect, and taking the number of predicted residues into consideration could help refine predictions (too few residues predicted: prediction less likely correct).
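The minimum-count filter described above can be sketched in a few lines; the dictionary layout (ligand class to set of predicted residue indices) is an illustrative assumption:

```python
def filter_by_min_count(predicted, min_count):
    # Keep a protein-level prediction for a ligand class only if at least
    # `min_count[class]` residues were predicted for that class. Per
    # Fig. S6, the optimal thresholds were 3 (metal ions), 10 (nucleic
    # acids), and 8 (small molecules).
    return {ligand: residues for ligand, residues in predicted.items()
            if len(residues) >= min_count.get(ligand, 1)}

min_count = {"metal": 3, "nucleic": 10, "small": 8}
predicted = {"metal": {17}, "small": {1, 2, 3, 4, 5, 6, 7, 8}}
# The lone metal prediction is discarded as likely incorrect;
# the eight small-molecule residues pass the threshold.
assert filter_by_min_count(predicted, min_count) == {
    "small": {1, 2, 3, 4, 5, 6, 7, 8}}
```

Dropping sub-threshold classes trades CovOneBind for CovNoBind, exactly the behavior shown in Fig. S6.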

Fig. S6: Performance of bindEmbed21DL in dependence of the minimum number of predictions considered.
We show precision, recall, CovOneBind (Eqn. 8 in main text), and CovNoBind (Eqn. 9 in main text) if proteins were only considered to bind to a certain ligand class if at least x residues were predicted for this class, i.e., for proteins with <x binding predictions, we assumed that no binding residue was predicted. While no binding prediction was generated for more proteins (CovOneBind decreased) for larger x, CovNoBind increased because erroneous predictions were removed. Precision and recall also increased up to a certain point (optimal x indicated by black, vertical line), indicating that proteins incorrectly predicted to bind to a ligand class had on average fewer binding predictions than proteins correctly predicted to bind.


Combination of bindEmbed21DL with homology-based inference.

Fig. S7: Performance of homology-based inference for different E-value thresholds.
Performance of homology-based inference (HBI) as measured by A. the F1 score, B. the precision, and C. the recall varied with the E-value threshold (red bars). The highest F1 of 56±4% was reached at E-value ≤ 10^-50. However, if forcing predictions for all proteins by assigning binding residues at random when no homolog was available, F1 dropped to 21±2% (leftmost light red bar). The combination of HBI with bindEmbed21DL (blue bars) performed numerically best for E-value ≤ 10^-8, achieving F1=45±2%. However, performance values behaved similarly for all three measures (F1, precision, recall). To allow annotation transfer for the largest number of proteins possible without having the performance drop below that of bindEmbed21DL, we chose a final E-value threshold of 10^-3, where F1 and precision are higher than for bindEmbed21DL (dashed line) and the recall is the same. Error bars indicate 95% confidence intervals.
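The combination of HBI with the ML method amounts to a simple fallback rule: transfer annotations from the best homolog meeting the E-value cutoff, otherwise use the de novo prediction. A hedged sketch, where the `hits` layout (E-value, annotated residue set) and the `dl_predict` stand-in for bindEmbed21DL are illustrative assumptions:

```python
def combined_prediction(query, hits, dl_predict, evalue_cutoff=1e-3):
    # Transfer binding annotations from the best (lowest E-value) homolog
    # whose local alignment carries annotations and meets the cutoff;
    # otherwise fall back to the de novo ML prediction.
    usable = [(e, res) for e, res in hits if e <= evalue_cutoff and res]
    if usable:
        return min(usable, key=lambda h: h[0])[1]
    return dl_predict(query)

dl_predict = lambda seq: {1}          # hypothetical stand-in for bindEmbed21DL
hits = [(1e-5, {2, 3}), (1e-1, {9})]  # (E-value, annotated residues)
assert combined_prediction("SEQ", hits, dl_predict) == {2, 3}        # transfer
assert combined_prediction("SEQ", [(1e-1, {9})], dl_predict) == {1}  # fallback
```

Hits whose alignment contains no binding annotation are skipped, matching the discarded hits shown in Fig. S8A.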

Fig. S8: Number of proteins and number of binding residues inferred through homology-based inference.
A. For lower E-value thresholds, binding residues could be inferred through homology-based inference (HBI) for fewer proteins. With increasing E-values, the number of hits increased. However, for some proteins, the local alignment did not contain any binding annotations, and those hits were discarded (difference between light and darker red). B. For many higher E-values, the increase in the number of inferred binding residues was small. This also explained why we did not observe a difference in performance for these different E-values (Fig. S7).

* In each row, CovOneBind (Eqn. 8 in main text) indicates the number of proteins for which at least one residue was (correctly or incorrectly) predicted to bind to this ligand class (or any ligand class for the last row). CovNoBind(l) (Eqn. 9 in main text) is the percentage of proteins not annotated to bind to a certain ligand class for which also no residue was predicted as binding. Combining bindEmbed21DL with HBI led to an increase in CovNoBind(l) but a drop in CovOneBind. Since HBI only used binding annotations from one local alignment, binding to multiple ligand classes is hard to predict because we could not identify different binding sites not close in the sequence. Data set: TestSet300.

* We show the number of proteins from the 20,386 sequences in the human proteome with binding information and the percentage of binding residues of (i) proteins with binding information and (ii) all proteins (in brackets). Using all available information from BioLiP (2nd row), 15% could be annotated with binding information. Homology-based inference (HBI) (3rd row) adds another 36%. bindEmbed21DL provides predictions for another 42%, corresponding to 8,510 proteins (5th row). Of those 8,510 proteins, 5,962 contain highly reliable binding predictions (residues predicted with a probability ≥ 0.95), i.e., for 29% of the human proteome, highly reliable binding predictions could be provided by bindEmbed21DL while no annotations were available from experiments or homologs.

* We show the percentage of predicted residues in each ligand class (metal ions, nucleic acids, small molecules) and the percentage of all three classes those residues account for (predicted residues/total). The composition for the human proteome (20:30:50 for metal:nucleic:small) was most similar to TestSet300. Data sets: Human: Predictions for 92% of the human proteome (Table S11); DevSet1014: Development set with 1,014 proteins; TestSet300: Test set with 300 proteins; TestSetNew46: New independent set with 46 proteins.


Full proteome prediction allows identification of previously unknown binding residues.
For almost half of the human proteins, no binding annotation is known, and for previously annotated proteins, many residues were newly predicted as binding. This could indicate a high prediction error of our method. On the other hand, those predictions could indicate previously unknown binding sites. The distributions of prediction scores for residues predicted as binding and (i) annotated as binding or (ii) inferred as binding through homology-based inference were similar (Fig. S9), while residues not annotated or inferred as binding were on average predicted with lower scores (Fig. S9). We expected a certain shift to the left (i.e., towards 0.5) because the residues predicted as binding without any annotation will contain some false positive predictions, which are predicted less reliably (Fig. 4 in the main text). Also, proteins with known or inferred binding annotations could have been similar to proteins in our training set. However, other aspects could also lead to this shift in the observed prediction probabilities. We investigated whether predictions were more difficult for (i) disordered proteins, (ii) transmembrane proteins, or (iii) proteins with lengths different from those in the development set. Disorder was calculated using MetaDisorder 9 , and we removed all proteins with at least 30 consecutive disordered residues to obtain a set of ordered proteins. Transmembrane helices (TMHs) were predicted using TMSEG 10 , and we excluded every protein with at least two TMHs to obtain a set of non-membrane proteins. In addition, for each of the three sets (proteins with experimentally verified annotations, proteins with inferred annotations, proteins with no annotations), we drew a subset of proteins mirroring the length distribution of the development set. None of these three aspects could explain the observed shift in distributions (Fig. S10).
While this analysis did not reveal whether new binding predictions could indicate previously unknown binding residues, it clearly showed that our method was not biased towards ordered proteins, non-membrane proteins, or proteins of a specific length. Since no bias in the data set explained the shift in distribution, some of the shift is most likely explained by prediction mistakes, i.e., the large fraction of residues predicted with a probability close to 0.5 probably points to wrong predictions. On the other hand, the distributions overlap to a certain extent, and especially residues predicted with a large probability could still point towards previously unknown binding residues.
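Drawing a subset of proteins that mirrors the length distribution of a reference set can be sketched by per-bin sampling; the bin width and sampling scheme below are illustrative choices, not necessarily the paper's exact procedure:

```python
import random
from collections import defaultdict

def length_matched_subset(candidates, reference_lengths, bin_width=50, seed=42):
    # Draw a subset of `candidates` (name -> sequence length) whose length
    # distribution mirrors `reference_lengths`, by sampling from each
    # length bin as many proteins as the reference set contains there.
    rng = random.Random(seed)
    bins = defaultdict(list)
    for name, length in candidates.items():
        bins[length // bin_width].append(name)
    wanted = defaultdict(int)
    for length in reference_lengths:
        wanted[length // bin_width] += 1
    subset = []
    for b, count in wanted.items():
        pool = bins.get(b, [])
        subset.extend(rng.sample(pool, min(count, len(pool))))
    return subset

candidates = {"A": 120, "B": 130, "C": 400, "D": 410, "E": 420}
subset = length_matched_subset(candidates, reference_lengths=[110, 415])
assert len(subset) == 2
# One protein from each reference length bin was drawn.
assert sorted(candidates[s] // 50 for s in subset) == [2, 8]
```

Bins under-represented among the candidates are simply capped at the available pool size rather than over-sampled.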

Fig. S9: Distribution of prediction scores for predicted binding residues annotated and not annotated as binding.
Residues which were neither experimentally verified as binding nor inferred as binding through homology-based inference (HBI) were on average predicted as binding with lower probability (dark blue box lower than the other two boxes). Proteins with any binding information (either annotated or inferred) were similar to proteins in our training set. Therefore, bindEmbed21DL had seen such data points before and could make more reliable predictions. However, some residues without any binding annotation could also be predicted reliably, and the distributions for all three sets overlapped to a large extent, indicating that residues not annotated as binding but predicted as such did not only originate from prediction mistakes but could indicate previously unknown binding annotations.

Fig. S10: Distribution of prediction scores for ordered proteins, nonmembrane proteins, and proteins of specific length.
Distributions of prediction probabilities were similar for A. all proteins compared to ordered proteins (<30 consecutive disordered residues; predicted with MetaDisorder 9 ), B. all proteins compared to non-membrane proteins (<2 transmembrane helices (TMHs); predicted with TMSEG 10 ), and C. all proteins compared to proteins of the same length as in the development set DevSet1014. We distinguished three subsets of human proteins: proteins with experimentally verified binding annotations, proteins with binding annotations inferred through homology-based inference (HBI), and proteins without any known binding residues.

Data sets.
For the construction of our non-redundant data sets, we applied UniqueProt 11 with an HVAL < 0. This rather strict cutoff reduced our dataset by more than 90%, from 14,894 to 1,314 proteins. To assess whether a less strict cutoff would still lead to a data set of proteins where no pair shares a common binding annotation, we tested how well homology-based inference (HBI) for our training set DevSet1014 (Table S12) would perform using the non-redundant set as lookup set and the HVAL as criterion to decide whether a protein is a homolog. Comparing the performance of HBI with our method bindEmbed21DL showed that HBI outperformed bindEmbed21DL for HVAL > 0. Only for HVAL = 0 did performance drop to the level of bindEmbed21DL. Therefore, the choice of our strict cutoff was necessary to ensure a non-redundant data set, although it led to a huge reduction in the number of protein sequences available for training.

Fig. S11: Homology-based inference using HVAL.
F1 score for homology-based inference based on HVAL, using DevSet1014 as query set. For HVAL > 0, homology-based inference outperformed our Machine Learning method bindEmbed21DL (dashed line). Only when using HVAL = 0 did performance drop to the level of bindEmbed21DL. For even lower H-values, no additional hits could be found, probably because no meaningful alignment could be generated. Therefore, performing redundancy reduction at an HVAL threshold higher than zero would lead to a dataset where proteins could share a common binding site.
On the other hand, using even stricter HVAL cutoffs would have led to a tremendous drop in data set size. When reducing our test set TestSet300 at HVAL=-1 against DevSet1014, only 44 proteins remained in the test set. While the number of test proteins dropped further to 11 for an HVAL cutoff of -5, we did not observe any difference in performance (Fig. S12A), indicating that no information leakage occurred for HVAL=0 compared to even stricter HVAL cutoffs. However, due to the small data set size, confidence intervals were very large for lower HVAL cutoffs. We observed similar results for a redundancy reduction of TestSet300 against DevSet1014 using the E-value: We removed every protein in TestSet300 for which we could find a local alignment with a smaller E-value than a certain threshold. For this reduction, the number of proteins was not reduced as strongly as for the HVAL cutoff. At E-value=1, our test set still consisted of 202 proteins; for E-value=10, this number dropped to 89 proteins. While the F1 score dropped to 38±5% for E-value=10, it remained within the confidence interval of the performance for the entire set, and performance at lower cutoffs was similar to that for the overall set (Fig. S12B), again indicating that the redundancy reduction at HVAL=0 yielded a non-redundant data set without information leakage between training and test set.

To ensure that our data set split into training (DevSet1014) and test (TestSet300) represented an unbiased split without information leakage between the two independent sets, we assessed performance of further reduced versions of TestSet300 using A. stricter HVAL cutoffs and B. E-value cutoffs. For both approaches, the F1 score did not change tremendously, except for an E-value cutoff of 10, where F1 dropped by five percentage points while remaining within the confidence interval of the performance of the full set.
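The E-value-based reduction of the test set against the development set is a simple filter once each test protein's best hit is known; the `best_evalue` mapping below is a hypothetical layout (the paper used MMseqs2 alignments), and proteins without any hit are kept:

```python
def reduce_test_set(test_set, best_evalue, cutoff):
    # Remove every test protein whose best local alignment against the
    # development set has an E-value below `cutoff`; proteins without a
    # hit (no entry in `best_evalue`) are always retained.
    return [p for p in test_set
            if best_evalue.get(p, float("inf")) >= cutoff]

best_evalue = {"P1": 1e-8, "P2": 0.5, "P3": 20.0}
test_set = ["P1", "P2", "P3", "P4"]  # P4 has no hit at all
# At cutoff 1.0, P1 and P2 have hits below the threshold and are removed.
assert reduce_test_set(test_set, best_evalue, 1.0) == ["P3", "P4"]
```

Sweeping `cutoff` over a range of E-values reproduces the shrinking test sets (202 proteins at E-value=1, 89 at E-value=10) used for the leakage check.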
* We compared performance calculated using CIs assuming a normal distribution of the per-protein performance values and bootstrapped CIs for the development set (DevSet1014) and the test set (TestSet300). For both sets, we did not observe a difference in the estimated performance.
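The bootstrapped confidence intervals can be computed as sketched below; the per-protein F1 scores and the number of resamples are illustrative (the normal-approximation alternative would be mean ± 1.96·SEM):

```python
import random
import statistics

def bootstrap_ci(values, n_boot=1000, seed=42):
    # 95% bootstrap confidence interval of the mean per-protein score:
    # resample with replacement, recompute the mean, and take the 2.5th
    # and 97.5th percentiles of the resampled means.
    rng = random.Random(seed)
    means = sorted(statistics.fmean(rng.choices(values, k=len(values)))
                   for _ in range(n_boot))
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]

# Hypothetical per-protein F1 scores:
scores = [0.3, 0.5, 0.4, 0.6, 0.45, 0.55, 0.35, 0.5, 0.4, 0.6]
lo, hi = bootstrap_ci(scores)
assert lo <= statistics.fmean(scores) <= hi
```

For roughly symmetric per-protein distributions, both approaches give similar intervals, consistent with the comparison reported above.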

Related Work
Many methods focusing on the prediction of binding have been published 13 . However, we did not compare bindEmbed21DL to most of them, for various reasons. First, we excluded template-based methods from our comparisons because we see the strength of our method in the area where template-based methods cannot be applied (because no template is available). Also, most template-based methods use structural templates and annotations from PDB or BioLiP, i.e., they include the annotations of our test set in their template databases. This makes them incomparable to our method because their predictions would simply be based on a self-hit of the query protein against its respective template. Secondly, many methods, while focusing on the prediction of binding, do not predict binding residues but rather binding pockets or cavities, without providing the exact residues involved in binding. Therefore, such methods were not comparable to our approach. Other methods could not be used for comparison because they were simply not available (anymore) or because instructions were insufficient for a local installation. Finally, other DNA- or RNA-binding prediction methods were excluded because it has been shown that ProNA2020 outperformed its competitors 5 . Table S14 gives a general overview of the reasons why methods could not be used for comparison and lists known binding prediction methods excluded for those reasons.

* While many methods focusing on the prediction of binding exist, many could not be compared to our method bindEmbed21DL. Here, we show some example methods and the reasons for not using them for comparison.