Chemical features and machine learning assisted predictions of protein-ligand short hydrogen bonds

There are continuous efforts to elucidate the structure and biological functions of short hydrogen bonds (SHBs), whose donor and acceptor heteroatoms reside more than 0.3 Å closer than the sum of their van der Waals radii. In this work, we evaluate 1070 atomic-resolution protein structures and characterize the common chemical features of SHBs formed between the side chains of amino acids and small molecule ligands. We then develop a machine learning assisted prediction of protein-ligand SHBs (MAPSHB-Ligand) model and reveal that the types of amino acids and ligand functional groups as well as the sequence of neighboring residues are essential factors that determine the class of protein-ligand hydrogen bonds. The MAPSHB-Ligand model and its implementation on our web server enable the effective identification of protein-ligand SHBs in proteins, which will facilitate the design of biomolecules and ligands that exploit these close contacts for enhanced functions.


Protein structure preparation and hydrogen bond analysis
From the PDB, we obtained 1070 structures of protein-ligand complexes that were refined from X-ray or neutron scattering experiments and had a resolution of 1.1 Å or better. These structures were deposited in the database between 1994 and 2019, with 35.7% of them deposited in 2015 or later. The following table provides the breakdown of their year distribution.

Year  Number of structures    Year  Number of structures
1994  2                       2008  43
1995  1                       2009  32
1997  1                       2010  57
1998  6                       2011  62
1999  4                       2012  59
2000  8                       2013  68
2001  18                      2014  87
2002  29                      2015  69
2003  36                      2016  78
2004  43                      2017  76
2005  51                      2018  111
2006  42                      2019  48
2007  39

Of the 1070 protein-ligand complexes, 99.5% had R-factor ≤ 0.20 and R_free − R ≤ 6%, demonstrating that the dataset contained high-quality crystal structures. 1 Analysis conducted using the CD-HIT program 2,3 revealed sequence redundancy among the proteins in the dataset: applying a sequence identity cutoff of 0.9 reduced the 1113 chains to 502 independent chains. However, we retained all the structures for the hydrogen bond analysis because they contained distinct small molecule ligands despite sharing the same protein structures. We expect the MAPSHB-Ligand model to effectively capture the structural features of protein-ligand SHBs regardless of potential redundancy in the protein sequences in the dataset. The PDB IDs of the protein-ligand complexes are listed in Table S9.
From the protein structures, we removed the crystallographic water molecules and analyzed the hydrogen bonds formed between amino acids or between amino acids and small molecule ligands using the Amber 2016 package. 4 A pair A-H···B was treated as a hydrogen bond if the heteroatoms A and B were O or N atoms, the donor-acceptor distance satisfied 2.3 Å ≤ R ≤ 3.2 Å, and the A-H-B angle was ≥ 135°. It was then further categorized as a SHB if R ≤ 2.7 Å and as a NHB if R ≥ 2.8 Å; the 0.1 Å interval between the two ranges was left out to better distinguish the two classes of hydrogen bonds. We found that 6.5% of the amino acids involved in the SHBs were near the C- or N-terminus of a protein and lacked the 3 neighboring residues used as the sequence information in the model development. In addition, 0.4% of the amino acids were in structurally uncertain regions of proteins that showed multiple possible amino acids in one position. We removed all of these cases from the hydrogen bond analysis since they occurred only rarely in the dataset. Furthermore, we excluded inorganic anions and small inorganic molecules, such as SO₄²⁻ and PO₄³⁻, as well as polyols such as ethylene glycol and glycerol, from the analysis of protein-ligand SHBs. These ligand molecules and ions are mainly used to solvate biomolecules in preparation for the X-ray and neutron scattering experiments and are not likely to possess biological functions. The chemical IDs of the excluded ligands are listed in Table S4 and the resulting ligands involved in protein-ligand SHBs are provided in Tables S4 and S10. The numbers of SHBs and NHBs observed in the high-resolution protein structures are summarized in Tables S1 and S2. When the protein backbone is involved, the probabilities of observing protein-protein and protein-ligand SHBs are below 3% and 6%, respectively. As such, we only consider amino acid side chains in the hydrogen bond analysis in this work.
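The geometric criteria above can be summarized in a small helper function. This is a minimal sketch, not the actual analysis pipeline; the function name and return convention are our own, while the distance and angle cutoffs follow the text (including the excluded 2.7-2.8 Å interval).

```python
def classify_hbond(r_ab, angle_ahb):
    """Classify a donor-acceptor pair by the geometric criteria in the text.

    r_ab: O/N heteroatom donor-acceptor distance R in Angstroms.
    angle_ahb: A-H-B angle in degrees.
    Returns "SHB", "NHB", or None (not a hydrogen bond, or in the
    excluded 2.7 < R < 2.8 A buffer interval).
    """
    # Geometric definition of a hydrogen bond: 2.3 <= R <= 3.2 A, angle >= 135 deg
    if not (2.3 <= r_ab <= 3.2 and angle_ahb >= 135.0):
        return None
    if r_ab <= 2.7:      # short hydrogen bond
        return "SHB"
    if r_ab >= 2.8:      # normal hydrogen bond
        return "NHB"
    return None          # 2.7 < R < 2.8: buffer between the two classes
```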
We observe 7070 SHBs and 22353 NHBs formed between the side chains of amino acids, distributed over 1061 structures of protein-ligand complexes. When small molecule ligands are involved, we find a total of 1272 SHBs and 2733 NHBs, distributed over 917 crystal structures.

Development and analysis of the machine learning model
To develop the MAPSHB-Ligand model, we randomly partitioned the overall dataset of protein-ligand hydrogen bonds with an 80:20 ratio. This yielded 1019 SHBs (31.7%) and 2200 NHBs (68.3%) in the training dataset, and 253 SHBs (32.2%) and 533 NHBs (67.8%) in the test dataset. We note that recent studies have highlighted the potential limitations of using a random split to form the training and test datasets. 5 Specifically, new data may differ from the observations in the current dataset, so evaluating the model solely on a random split may overestimate its performance. 5 In the field of structural biology, continuous advancements improve the effectiveness of protein structure determination and refinement, and new techniques may introduce variations in future protein-ligand complexes. However, since we have focused on atomic-resolution protein structures in the PDB, which represent the highest quality structures available to date, it is unlikely that protein-ligand complexes determined in future studies will exhibit significantly different patterns compared to the existing data. One potential way to improve the treatment of the data would be to use historical data for training and more recent data for evaluation. However, given the relatively small size of the dataset and the fact that only 35.7% of the structures were deposited in 2015 or later, relying solely on historical data for training the machine learning model may result in the loss of crucial patterns in the data. As more high-resolution protein structures become available, we will continue to update and refine the model to ensure its effectiveness in capturing the evolving patterns and trends in protein-ligand SHBs.
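A random 80:20 split of this kind can be sketched as follows. The seed, feature values, and exact splitting routine are assumptions; the class totals (1272 SHBs, 2733 NHBs) come from the text, and the resulting counts differ slightly from the reported 1019/2200 and 253/533, so the snippet is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)            # seed assumed; not reported
X = rng.random((4005, 14))                # 14 input features per hydrogen bond
y = np.array([1] * 1272 + [-1] * 2733)    # +1 = SHB, -1 = NHB (totals from the text)

# Random (non-stratified) 80:20 partition of the hydrogen bond dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
```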
Similar to the MAPSHB model, 6 the MAPSHB-Ligand model must tackle two challenges. First, SHBs account for only 31.7% of the hydrogen bonds in the training dataset. Trained on this imbalanced dataset, a standard machine learning model is more likely to classify a given hydrogen bond as a NHB than as a SHB. Second, the 14 input features display strong interaction effects when predicting the formation of SHBs. For example, the effect of a ligand functional group depends strongly on the type of amino acid residue present. Thr shows SHB probabilities of 39% and 14% when paired with the hydroxyl and amine groups of ligands, respectively, giving an odds ratio of (0.39/(1 − 0.39)) / (0.14/(1 − 0.14)) = 3.9. In comparison, Glu has probabilities of 85% and 11% for forming protein-ligand SHBs with the hydroxyl and amine groups of ligands, respectively. The odds ratio becomes (0.85/(1 − 0.85)) / (0.11/(1 − 0.11)) = 45.8, which is more than 11 times larger than in the previous case. Therefore, it is crucial for the model to explicitly account for the interaction effects among the input features when it predicts the class of a protein-ligand hydrogen bond.
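The odds ratios quoted above follow from a one-line calculation; the helper below reproduces them from the SHB probabilities given in the text (the function itself is ours, for illustration only).

```python
def odds_ratio(p1, p2):
    """Odds ratio between two probabilities: (p1/(1-p1)) / (p2/(1-p2))."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Thr paired with ligand hydroxyl (39%) vs amine (14%) groups
print(round(odds_ratio(0.39, 0.14), 1))  # 3.9
# Glu paired with ligand hydroxyl (85%) vs amine (11%) groups
print(round(odds_ratio(0.85, 0.11), 1))  # 45.8
```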
As described in the main text, we chose the undersampling technique to overcome the class imbalance problem, and gradient boosting models as building blocks to handle the interaction effects of the input features in the construction of the MAPSHB-Ligand model. In the following, we provide a detailed description of the gradient boosting model and discuss the evaluation metrics used to assess the model predictions. We then assess the necessity of the undersampling strategy in the development of the machine learning model.
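One plausible realization of the undersampling step is sketched below: each of the 10 gradient boosting blocks is trained on a balanced random subset of the NHBs. The subsampling scheme and seed are assumptions, not taken from the paper; only the training-set class counts come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)     # seed assumed for reproducibility
n_shb, n_nhb = 1019, 2200          # training-set counts from the text
nhb_indices = np.arange(n_nhb)

# For each block, draw as many NHBs as there are SHBs (without replacement),
# so every block sees a balanced 1019:1019 training set.
balanced_nhb_subsets = [
    rng.choice(nhb_indices, size=n_shb, replace=False) for _ in range(10)
]
```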

Gradient boosting model
Gradient boosting is a commonly used machine learning algorithm for classification tasks. It is an ensemble method that improves the prediction accuracy by combining the outputs of multiple weaker models, such as decision trees. 7 The main concept behind gradient boosting is to iteratively add new base learners to the model, with each one focusing on the errors made by the previous model and attempting to correct them. In other words, each subsequent model in the sequence primarily concentrates on the erroneous predictions made by the preceding models, which enables the algorithm to gradually correct and refine the predictions.
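A minimal sketch of such a classifier, using scikit-learn's GradientBoostingClassifier with the exponential loss, is shown below. The synthetic dataset and hyperparameter values are illustrative placeholders, not the tuned MAPSHB-Ligand settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder binary classification data with 14 features per sample
X, y = make_classification(n_samples=500, n_features=14, random_state=0)

model = GradientBoostingClassifier(
    loss="exponential",   # exponential loss, as in the algorithm below
    n_estimators=100,     # M decision trees added sequentially
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    max_depth=3,          # depth of each weak learner
    random_state=0,
)
model.fit(X, y)
proba = model.predict_proba(X[:1])  # per-class probabilities for one sample
```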
As shown in Algorithm 1, 8 the input dataset is D = {(xᵢ, yᵢ)}, i = 1, …, N, where xᵢ contains the 14 input features of the i-th hydrogen bond and yᵢ is its class: yᵢ takes the value −1 for a NHB and +1 for a SHB. The gradient boosting model is initialized, as F₀(x), with the constant value that minimizes the exponential loss function. After initialization, the gradient boosting model takes M steps to sequentially train M decision trees and sums their contributions to obtain the final model. In each step, a decision tree is fitted to predict the negative gradient of the loss function with respect to the predictions of the model from the previous step. Figure S1 illustrates a fitted decision tree. The algorithm then updates the model by adding the newly fitted decision tree to the previous one. The output of the gradient boosting model is the predicted probability that a hydrogen bond belongs to the SHB class.
As discussed in the main text, the MAPSHB-Ligand model predicts the SHB probability by averaging over the outputs of 10 gradient boosting models. The predicted probability is then compared with a pre-determined classification threshold to obtain the class of a hydrogen bond. If the probability is greater than or equal to the threshold, the hydrogen bond is classified as a SHB. If the probability is lower than the threshold, it is classified as a NHB.
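The averaging and thresholding step can be sketched as follows. The function name and the default threshold of 0.5 are our own placeholders; in practice the threshold is chosen from the precision-recall trade-off discussed below.

```python
import numpy as np

def predict_shb(models, x, threshold=0.5):
    """Average the SHB probabilities of an ensemble and apply a threshold.

    models: fitted classifiers whose predict_proba column 1 is the SHB class.
    Returns the predicted labels and the averaged SHB probabilities.
    """
    # Mean SHB probability over the ensemble (e.g. 10 gradient boosting models)
    p_shb = np.mean([m.predict_proba(x)[:, 1] for m in models], axis=0)
    # Probability >= threshold -> SHB, otherwise NHB
    labels = np.where(p_shb >= threshold, "SHB", "NHB")
    return labels, p_shb
```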

Algorithm 1
The gradient boosting method (adapted from Algorithm 1 of reference 8)
Step 1: Initialize the model with the constant that minimizes the exponential loss L(y, F) = exp(−yF):
F₀(x) = argmin_ρ Σᵢ L(yᵢ, ρ)
Step 2: for m = 1, …, M do
1. Calculate the steepest-descent direction (the negative gradient of the loss with respect to the current model predictions):
ỹᵢ = −[∂L(yᵢ, F(xᵢ))/∂F(xᵢ)] evaluated at F(x) = F_{m−1}(x), i = 1, …, N
2. Fit a decision tree model to the pairs {(xᵢ, ỹᵢ)}:
(β_m, a_m) = argmin_{β,a} Σᵢ [ỹᵢ − β h(xᵢ; a)]²
Here h(x; a) is a decision tree with parameters a.
3. Determine the step size by line search:
ρ_m = argmin_ρ Σᵢ L(yᵢ, F_{m−1}(xᵢ) + ρ h(xᵢ; a_m))
4. Update the model:
F_m(x) = F_{m−1}(x) + ρ_m h(x; a_m)
Step 3: Calculate the predicted probability that x belongs to the SHB class:
p(x) = 1 / (1 + exp(−2F_M(x)))

Evaluation metrics
We consider the following four metrics to evaluate the MAPSHB-Ligand model: precision, recall, the receiver operating characteristic (ROC) curve and the area under the curve (AUC). These metrics are defined based on a confusion matrix. As shown below, the confusion matrix has four entries: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN).

                 Actual SHB    Actual NHB
Predicted SHB    TP            FP
Predicted NHB    FN            TN

TP is the number of SHBs correctly classified as SHBs; FP is the number of NHBs incorrectly classified as SHBs; FN is the number of SHBs incorrectly classified as NHBs; TN is the number of NHBs correctly classified as NHBs. From these entries, precision = TP/(TP + FP), recall = TP/(TP + FN), and the false positive rate (FPR) = FP/(FP + TN). We apply the MAPSHB-Ligand model to the test dataset and calculate the precision and recall of its predictions for different classification thresholds. As shown in Figure 4a and Table S7, both metrics depend on the choice of the classification threshold and there is a trade-off between them; as a result, it is not feasible to find a threshold that maximizes precision and recall at the same time. To further evaluate the model, we plot an ROC curve and compute the AUC score as a metric that is independent of the choice of the classification threshold. As demonstrated in Figure 4b, the ROC curve is obtained by varying the classification threshold from 0 to 1, and each point corresponds to a (FPR, recall) pair at a given classification threshold. If the threshold is set to 0, all hydrogen bonds are predicted as SHBs, which gives the point (FPR=100%, recall=100%) on the ROC curve. Conversely, all the hydrogen bonds are classified as NHBs if a threshold of 1 is used, leading to the point (FPR=0, recall=0) on the curve. A perfect classification model assigns higher probabilities to real SHBs than to real NHBs, and its ROC curve is a combination of two straight lines.
It starts at the point of (FPR=0, recall=0) when the classification threshold is 1, and follows a vertical line to reach (FPR=0, recall=100%) as the threshold decreases. After all the SHBs have been identified, further reducing the threshold does not alter the recall rate but increases the FPR. The ROC curve hence follows a horizontal line at recall=100% until it reaches the point of (FPR=100%, recall=100%). The AUC score for such a perfect model is 1. In contrast, a classification model that gives completely incorrect predictions assigns higher probabilities to real NHBs than real SHBs and its ROC curve also consists of two lines. It moves horizontally from the point of (FPR=0, recall=0) to (FPR=100%, recall=0) when the classification threshold decreases from 1 to the value where all the NHBs are predicted as SHBs. As the threshold further decreases, the recall rate starts increasing while the FPR remains at 100%, so the ROC curve moves vertically to the point of (FPR=100%, recall=100%). The resulting AUC score is 0. Most classification models fall between the completely incorrect model and the perfect model and their AUC scores are between 0 and 1. For example, a model that classifies hydrogen bonds by randomly tossing a fair coin would predict a 50% probability for any hydrogen bond to be a SHB. If the classification threshold is set to a value below 0.5, all the hydrogen bonds would be predicted as SHBs, resulting in a point of (FPR=100%, recall=100%) in the ROC curve. If the threshold is instead set to a value greater than or equal to 0.5, all the hydrogen bonds would be predicted as NHBs, leading to a point of (FPR=0, recall=0). The ROC curve is then a diagonal line connecting these two points on the plot and the corresponding AUC is 0.5. Therefore, the ROC curve and the AUC score are useful tools to identify how well the MAPSHB-Ligand model distinguishes between SHBs and NHBs.
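These metrics can be computed directly from a set of predicted SHB probabilities, as in the short sketch below. The labels and scores here are made up for demonstration and are not model outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])                  # 1 = SHB, 0 = NHB
p_shb = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])  # predicted P(SHB)

threshold = 0.5
y_pred = (p_shb >= threshold).astype(int)

# Confusion matrix entries at this threshold
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
fpr = fp / (fp + tn)
auc = roc_auc_score(y_true, p_shb)   # threshold-independent summary
```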

Gradient boosting-only model
Given that the number of NHBs is only about twice that of SHBs in the training dataset, the class imbalance for the protein-ligand SHBs exists but is not severe. This suggests that applying the gradient boosting model directly, without the undersampling strategy, would likely yield similar results. We hence repeated the development of the machine learning model without the undersampling method and obtained a gradient boosting-only model.
As shown in Figure S2, the ROC curve of the gradient boosting-only model closely resembles that of the MAPSHB-Ligand model and their AUC scores are both 0.96. These metrics suggest that the two models have similar overall performance.