Improving the performance of the PLB index for ligand-binding site prediction using dihedral angles and the solvent-accessible surface area

Protein ligand-binding site prediction is highly important for protein function determination and structure-based drug design. Over the past twenty years, dozens of computational methods have been developed to address this problem. Soga et al. identified ligand cavities based on the preferences of amino acids for the ligand-binding site (RA) and proposed the propensity for ligand binding (PLB) index to rank the cavities on the protein surface. However, we found that residues exhibit different RAs in response to changes in solvent exposure. Furthermore, previous studies have suggested that some dihedral angles of amino acids in specific regions of the Ramachandran plot are preferred at the functional sites of proteins. Based on these discoveries, the amino acid solvent-accessible surface area and dihedral angles were combined with the RA and PLB to obtain two new indexes, multi-factor RA (MF-RA) and multi-factor PLB (MF-PLB). MF-PLB, PLB and other methods were tested using two benchmark databases and two particular ligand-binding sites. The results show that MF-PLB can improve the success rate of PLB for both ligand-bound and ligand-unbound structures, particularly for top choice prediction.

To identify a potential ligand-binding site, Soga et al. developed an index known as the propensity for ligand binding (PLB), which is calculated by simply summing up the RA (preference factor for an amino acid) of all residues involved in the site 22 . The PLB index has also proven to be useful for identifying druggable protein cavities 23 . The RA is derived from a database of high-quality protein-ligand complex structures and has a constant value for each type of amino acid; thus, only 20 RA values are provided 22 . However, we have found that amino acids exhibit different propensities under different solvent exposure conditions. In addition, our previous study revealed that the dihedral angles of the residues in specific regions of the Ramachandran plot reveal preferences for ligand-binding sites 24 .
In this study, a method that considers the amino acid solvent-accessible surface area (SASA) and dihedral angles based on the RA and multi-factor RA (MF-RA) was developed. In addition, similar to the PLB, the proposed MF-PLB sums up all of the amino acid MF-RAs at the ligand site. Two frequently used test datasets were applied to compare the prediction abilities of the MF-PLB, PLB, and several popular ligand-binding site prediction methods. Because large ligand-binding sites can be easily identified, we constructed two new databases to evaluate the performances of different methods for the identification of small-volume ligand-binding sites and protein-protein interface ligand-binding sites.

Methods
Calculating hydrogen bonds and van der Waals (vdW) contacts. The hydrogen bonds between a ligand and residues in a protein were calculated using the HBPLUS programme with default values 25 . Several geometrical criteria of specific atoms, including hydrogen bond donors (D) and acceptors (A), were applied for the identification of hydrogen bonds using HBPLUS. The detection of vdW contacts between a ligand and protein residues is simple: For a non-hydrogen ligand atom A and a non-hydrogen residue atom B, a vdW contact exists between the two atoms if the distance between A and B satisfies dist(AB)-0.5 Å < vdW radius(A) + vdW radius(B). SASA calculation. We applied the NACCESS programme to calculate the SASA and relative SASA for all residues 26 . NACCESS scans a probe over the vdW surface of the protein and calculates the SASA for each residue according to the β-centre trace of the probe. The default probe radius (1.4 Å) and behaviours (water molecules, hydrogen atoms, and HET groups are excluded from the protein structure) are used. The relative SASA describes the relative accessibility of residue X and was calculated by expressing the actual SASA of X as a percentage of the residue in an extended tripeptide: ALA-X-ALA.

Databases. Set N:
We constructed a non-redundant database, set N, consisting of 6,635 ligand-bound structures obtained from the Binding MOAD released in 2014 27 to derive the MF-RA. Binding MOAD groups structures at the 90% sequence identity level, and all of the structures in Binding MOAD having more than 70% sequence identity with any protein in set T or set S were excluded from set N.
Set T and set S: Set T and set S are two benchmark protein structure databases that were used to evaluate the performances of the different ligand-binding site prediction methods. The two test databases have been widely used in previous studies of ligand-binding site prediction methods 15,19,20,28 . Set T consists of 210 ligand-bound protein structures, whereas set S includes 96 structures that can be grouped into two classes, specifically 48 ligand-unbound protein structures and their corresponding ligand-bound forms. To obtain the actual ligand-binding sites of the 48 ligand-unbound structures, the ligand-bound structures were aligned with their ligand-unbound forms using the PyMOL align function, and the ligands' coordinates and connectivity information were obtained from the ligand-bound structures 29 .
Set L: The average molecular weight of the ligands in set S is as high as 269 dalton, whereas that of some ligands, such as "NAD" and "HEM", exceeds 500 dalton. Additionally, large ligand-binding sites can be easily identified by geometry-based methods or even by eye based on the three-dimensional (3D) protein structures. Set L was constructed to evaluate the performances of different methods for small-volume ligand-binding site prediction and consists of 169 ligand-bound structure chains downloaded from the Protein Data Bank (PDB) 30 . In the PBD, each structure chain has only one ligand, and the molecular weight of the ligand should be less than 150 dalton. Inorganic molecules and metals in protein structures are ignored, the ligand should not be completely exposed to the solvent, and the ligand-binding site is formed by the only chain in the structure. In set L, no two structures have more than 70% sequence identity. Detailed information for set L is provided in Table S6.
Set P: Set P was derived from a database of dimeric protein complexes that consists of 1,611 structures obtained in previous studies 31,32 . No two chains from different protein complexes share a sequence identity of 35% or higher. In addition, all ligands in proteins should be located at protein-protein interfaces: the distance between a ligand and both protein chains, defined as the shortest distance between any ligand atom and any residue atom that belongs to the protein chains, is less than 5 Å. After refinement, set P includes 149 protein structures and was constructed to evaluate the accuracy of the prediction of ligand-binding sites on protein-protein interaction region. Detailed information for set P is provided in Table S7.
The residues show different preferences for ligand-binding sites, as determined by dividing the residues into two groups, namely high-accessibility residues (relative SASA >13.1%) and low-accessibility residues (1% < relative SASA ≤ 13.1%; please note that 13.1% is the median of the relative SASA values of all ligand-binding site residues in set N). As shown in Fig. 1, with the exception of Gly, Ser and Thr, the amino acids have significantly different RA values (preference factor, defined by Soga et al.). For instance, Cys has a constant RA of 1.65 according to Soga et al. 22 , whereas the RA values for high-accessibility and low-accessibility Cys residues are 3.01 and 0.69, respectively. As a result, dividing amino acids into two groups according to their solvent exposure can better reflect their ligand-binding site preferences.
The amino acid SASA and the dihedral angles of the amino acid backbone influence the ligand-binding site preferences. Thus, these two factors were considered, and the MF-PLB was defined as x n 1 For each cavity, the MF-PLB can be obtained by simply summing the MF-RA(x, r, SASA) of all of the residues involved in the cavity. Here, x is the amino acid type, r is the Ramachandran plot region where the dihedral angle of the residue' backbone lies (the classification of the dihedral angles is shown in Table S4), and SASA indicates the accessibility of the residue, i.e., high or low. − MF RA x r SASA ( , , ) was calculated by taking the ratio of the frequency of the amino acid x in the protein ligand-binding site to its frequency on the protein surface. The detailed formula for the calculation of MF-RA(x, r, SASA) is shown below: In this equation, CA x r ( , ) is the frequency of amino acid x observed at the ligand-binding site and is calculated as where N(x, r) and N(y, r) denote the numbers of amino acids x and y detected in the ligand-binding site, respectively; additionally, the dihedral angles of both amino acids are in the same Ramachandran region r. Similarly, where N(x, r, SASA) and N(y, r, SASA) are the number of amino acids x and y, respectively, and the dihedral angles of both amino acids x and y should be located in the same region r and have the same accessibility state.
In conclusion, CA is the rate of occurrence of amino acid x in the ligand-binding site, whereas SA is the rate of occurrence of amino acid x on the protein surface. Each type of amino acid can have 40 MF-RAs according to its dihedral angles and SASA, and detailed information for the MF-RAs of the 20 amino acids is shown in Table 1.

Protein cavity calculation and protein-ligand binding site prediction. Soga et al. used the Alpha
Site Finder to identify and catalogue protein cavities, and all cavities were re-ranked according to the sums of the RA values of the residues involved in the cavities (PLB). The top three clusters were retained, and the top cluster was selected as the predicted ligand-binding site. Our protein cavity calculation consists of two steps. First,  Ligiste-cs was employed to detect solvent grids. Ligiste-cs divides all grids projected from the protein into three groups: protein grids, surface grids and solvent grids 15 . A surface-solvent-surface event occurs when a sequence of grids starts and ends with surface grids and contains solvent grids (exceeding the minimal threshold of six grids) in the middle. Solvent grids in the surface-solvent-surface event are labelled pocket grids. Second, the depth-first search algorithm clusters all pocket grids because the web server does not provide clustering information. In this study, the clustering result should satisfy the following condition: The distance between any two pocket grids from different clusters must exceed 3.0 Å. Before predicting the protein-ligand binding site, we defined the residues involved in the cavities: For each pocket grid cluster, the involved residue needs to have at least one non-hydrogen atom located within 4.5 Å of any grid in the cluster. All cavities were then re-ranked by their MF-PLB, which was calculated by summing the MF-RA values of all of the residues in the cavity. The cavities with the top three MF-PLB values were retained.
Comparison measures. All cavities calculated by Ligsite-cs were re-ranked based on the MF-PLB, PLB and Ligsite-csc. Ligsite-csc re-ranks cavities according to the degree of conservation of the residues in the cavity 15,37 . Based on this approach, the protein-ligand binding site cavity has at least one pocket grid within 4.0 Å of any atom of the actual ligand. This method successfully predicted whether the top-scoring cavity was the actual ligand-binding site (top 1) or whether any one of the top three cavities is the actual ligand-binding site (top 3).

Results
We evaluated the ligand-binding site prediction performances of the PLB, MF-PLB, Ligsite-csc and other methods. Two benchmark databases (set T and set S) and two evaluation criteria (top 1 and top 3) were employed. Table 2 shows a detailed comparison of the methods based on an analysis of 210 ligand-bound structures, and Table 3 Table 1. MF-RA for 20 amino acids. 1 Region in the Ramachandran plot. 2 Capital letters indicate highaccessibility amino acids (relative SASA > 13.1%), whereas lowercase letters indicate low-accessibility amino acids (relativeSASA ≤ 13.1%). 3 The backbone conformation is considered rare if fewer than 20 residues are located in the region, and the MF-RA value of rare-conformation amino acids is 1.5 based on Petock et al., who reported that these residues are likely to be located at the functional site 35  prediction success rates. This finding suggests that the MF-PLB can re-rank the cavities more accurately and reflect the amino acid ligand-binding site preferences better than the PLB. MF-PLB is superior to MPK1, Q-SiteFinder, Ligsite-cs, Pass, Surface and SiteHound for both set T and set S, particularly with respect to its top 1 prediction. In general, combined ligand-binding site prediction methods show better success rates than those that utilize only geometrical information. MPK2 and LISE have higher success rates than MF-PLB for the set of 210 ligand-bound structures and the set of 48 ligand-bound structures because these two methods are meta-predictors, whereas the MF-PLB exhibited the best top-1-prediction performance in the analysis of the set of 48 ligand-unbound structures. The use of ligand-unbound structures for the accurate prediction of ligand-binding sites is more important than the use of the ligand-bound forms because the ligand coordinates in the protein structure are unknown in practical applications. A detailed comparison of the performances of Ligsite-csc, PLB and MF-PLB in the analysis of the set of 48 ligand-unbound structures is shown in Table 4.

Discussion
In this study, we show that MF-PLB exhibits good performance in ligand-binding site identification. We divided the residues into two types: low-accessibility and high-accessibility residues. Thus, it was interesting to assess the interactions between a ligand and these types of residues. Figure 2a details the average number of hydrogen bonds formed between a ligands and low-accessibility and high-accessibility residues. Approximately half of the residues establish similar numbers of hydrogen bonds in both their low-and high-accessibility states. Cys, Asp, Gly, Ser, and Thr form more hydrogen bonds in their high-accessibility states than in their low-accessibility states. The only residues that show significantly more hydrogen bonds with a ligand in their low-accessibility state compared with their high-accessibility state are the two positively charged amino acids: Lys and Arg.
We then examined the number of vdW contacts between a ligand and each residue involved in the ligand-binding site in both their low-accessibility and high-accessibility states (Fig. 2b). The results show that on average, high-accessibility residues form more vdW contacts with a ligand than low-accessibility residues with the exception of Glu and Lys. The analysis of the three amino acids with electrically charged side chains (Asp, Arg and His) showed that Asp and Arg have similar vdW contacts with ligands in both their high-accessibility and low-accessibility states, whereas the imidazole ring allows the high-accessibility His to establish many more vdW contacts with a ligand.  The average numbers of vdW contacts for every atom in each type of amino acid are listed in Table S5. Most atoms in the high-accessibility residues can provide more vdW contacts with a ligand than those in the low-accessibility residues. This finding is expected because the high-accessibility residues are more exposed on the protein surface, whereas the low-accessibility residues are relatively buried in the protein core. However, the  Table 4. Comparison of the performances of Ligsite-csc, PLB and MF-PLB in the analysis of ligandunbound structures. 1 Best ranking of ligand-binding site. 2 Distance between the ligand and the central pocket grid of the ligand-binding site.
atoms in an electrically charged side chain of a low-accessibility residue form many more vdW contacts with a ligand: for instance, "NZ" in low-accessibility Lys forms an average of 3.09 vdW contacts with a ligand, whereas "NZ" in high-accessibility Lys forms only 1.72 vdW contacts with a ligand. The hydrogen bonds between a ligand and high-accessibility and low-accessibility residues are shown in Fig. 2. Set L consists of 169 structures and was used to evaluate the performances of different methods for small-volume ligand-binding site prediction because large-volume ligand sites can be easily detected. As shown in Table 5, almost all methods showed significantly worse performance for small-volume ligand-binding site prediction; in fact, the prediction success rates achieved with the small-volume ligand-binding site database were more than 20% lower than those achieved with the 210 ligand-bound structure database. Detailed information of the comparison can be found in Table S6. The results suggest that small-volume ligand-binding sites are relatively difficult to detect using the currently available methods. Although ConCavity shows improved performance in drug-binding site prediction 18 , it provides the fewest candidate cavities (average of 2.4 cavities) and achieves success rates of 49.1% and 67.4% for the top 1 and top 3 hits. In contrast, Surfnet supplies as many as 47.4 candidate cavities but shows the worst performance for both the top 1 and top 3 hits. LISE exhibits the highest success rate, followed by our MF-PLB index, for both benchmarks. MF-PLB performed better in small-volume ligand-binding site identification than both PLB and Ligsite-csc, with success rates that were approximately 5% higher than those of PLB for both the top 1 and top 3 hits and approximately 10% higher than that of Ligsite-csc for the top 1 hits. In conclusion, although the MF-PLB, PLB and Ligsite-csc methods show worse performance in small-volume ligand-binding site prediction, MF-PLB achieves the highest success rate among the three methods.
Another particular ligand-binding site, namely a ligand site that binds to a protein-protein interface, was taken for comparing methods prediction ability 38 . Most protein-protein interfaces are planar in shape, whereas the majority of ligand-binding sites are concave. As a result, residues in the ligand-binding sites can create more contacts with ligands. Although both protein-protein interaction and protein-ligand interaction play key roles in biological processes 39 , few studies have focused on the ligand-binding sites located at the protein-protein interface. The analysis of set P revealed that LISE exhibited the best performance, followed by MF-PLB and ConCavity (shown in Table 6). Two energy-based prediction methods, Q-SiteFinder and SiteHound, achieved lower success rates than the other combined methods. Detailed information of the comparison can be found in Table S7. Although various methods, such as MF-PLB, ConCavity, MPK2, exhibit good performance in the prediction of regular ligand-binding sites, the accurate identification of some particular sites, such as small-volume ligand-binding sites and ligand-binding sites on protein-protein interfaces, remains a challenging problem in the field. In the future, more physicochemical properties should be incorporated to address this problem.
Three typical examples, two from set S and one from set L, are provided in Fig. 3 (id: 2tga), Fig. 4 (id: 2sil) and Fig. 5 (id: 1chg) to illustrate the procedure underlying the prediction of a ligand-binding site by the MF-PLB.
The MF-PLB re-ranks all of the cavities calculated by Ligsite-cs; thus, a prerequisite for an accurate prediction using the MF-PLB is that Ligsite-cs successfully finds the ligand-binding sites. The ability of Ligsite-cs to detect ligand-binding sites directly influences the prediction results of the MF-PLB approach. The MF-PLB shows lower prediction success rates with sets L and P compared with those obtained using sets T and S for the following three   reasons. First, Ligsite-cs shows a lower ability to detecting ligand-binding sites in sets L and P. The success rates for the detection of ligand sites in sets T and S using Ligsite-cs are 98.1% and 97.9%, respectively, whereas the corresponding values for sets L and P are 94.6% and 85.8%. Furthermore, the MF-PLB of one cavity is calculated by summing up the MF-RA values of all residues surrounding the cavity; thus, larger cavities with more residues tend to have a higher MF-PLB. The average molecular weight of the ligands in sets T and S are 327 and 269 dalton, whereas that of the ligands in set L is only 126 dalton. In addition, the residues located at two particular ligand-binding sites, namely small-volume sites and protein-protein interface sites, are buried more than they are in sets L and P: the average relative SASA of the ligand-binding site residues in sets L and P is 16.1% and 17.6%, respectively, whereas that of the ligand site residues in sets T and S is 25.7% (shown in Fig. 6). The MF-PLB might not accurately reflect the residue preference for ligand-binding sites in sets L or P because the exposure status of the ligand-binding site residues in sets T and S are different from the typical status. The ligand sites in set L are buried and have a small volume, whereas the ligand-binding sites on protein-protein interfaces are planar in shape and are also buried in the protein. These reasons could explain why MF-PLB do not exhibit good performance in the analysis of sets L and P.

Conclusion
The accurate identification of ligand-binding sites in proteins is a very important step in protein function determination and structure-based drug design. Soga et al. suggested that the amino acid compositions at the ligand-binding site are markedly different from those on the protein surface. These researchers created the PLB index based on the amino acid preferences for the ligand-binding site. We found that amino acids show different preferences for ligand-binding sites depending on their SASA and dihedral angles. Based on these findings, we successfully developed the MF-PLB index and applied it to the identification of ligand-binding sites in   . The actual ligand-binding site (yellow ball) was the second cavity after re-ranking by the MF-PLB, whereas the first cavity (red ball) assigned by the MF-PLB is 16.63 Å away from the ligand (OAC). After examining the protein pdb file, we found that some residues were missing (red part in b) and that the presence of these missing residues would result in an abnormal cavity (red ball in a). We added these missing residues to protein 1CHG by aligning 1CHG to its ligand-bound form (3GCH) using PyMOL. The red-ball binding site was not obtained after the pocket grids were calculated by Ligsite-cs, and the first cavity assigned by the MF-PLB after the addition of the missing residues was the actual ligand-binding site. The MF-PLB values for the red-ball cavity and the yellow-ball cavity are 23.93 and 20.55, respectively. All pocket grids calculated by Ligsite-cs are shown as cyan grids.  proteins. The obtained results showed that the MF-PLB can significantly improve the performance of the PLB for ligand-binding site prediction in both ligand-bound and ligand-unbound structures; therefore, the MF-PLB can better reflect the amino acid preferences for the ligand-binding site. The currently available methods, including the MF-PLB, show lower success rates in the prediction of small-volume ligand-binding sites and ligand-binding sites on protein-protein interfaces, but the MF-PLB still exhibits the best performance among the MF-PLB, PLB, and Ligsite-csc methods. Additionally, the performance of the MF-PLB was better than those of the other methods, with the exception of LISE. The calculation of the MF-PLB is simple, and clearly, this index could be a useful tool for ligand-binding site prediction and other aspects of drug discovery.