CRHunter: integrating multifaceted information to predict catalytic residues in enzymes

A variety of algorithms have been developed for catalytic residue prediction based on either feature- or template-based methodology. However, no studies have systematically compared these two strategies and further considered whether their combination could improve the prediction performance. Herein, we developed an integrative algorithm named CRHunter by simultaneously using the complementarity between feature- and template-based methodologies and that between structural and sequence information. Several novel structural features were generated by the Delaunay triangulation and Laplacian transformation of enzyme structures. Combining these features with traditional descriptors, we invented two support vector machine feature predictors based on both structural and sequence information. Furthermore, we established two template predictors using structure and profile alignments. Evaluated on datasets with different levels of homology, our feature predictors achieve relatively stable performance, whereas our template predictors yield poor results when the homological relationships become weak. Nevertheless, the hybrid algorithm CRHunter consistently achieves optimal performance among all our predictors. We also illustrate that our methodology can be applied to the predicted structures of enzymes. Compared with state-of-the-art methods, CRHunter yields comparable or better performance on various datasets. Finally, the application of this algorithm to structural genomics targets sheds light on solved protein structures with unknown functions.


Secondary structure
Secondary structure can be utilized to reflect the local conformation of protein structures. The secondary structure of each residue was assigned by the DSSP program 4 . Residues in H (αhelix), G (3/10-helix), and I (π-helix) are considered to be in helical conformation, and those in E (β-strand) and B (β-bridge) are in sheet conformation. All remaining residues are in coil conformation.

Hydrogen bonds
Hydrogen bonds play important roles in maintaining the conformation of catalytic residues serving as donors or acceptors. We calculated hydrogen bonds using HBPLUS 5 and achieved three features including the number of hydrogen bonds between side-chain atoms in query residue and other atoms in a protein, the number of hydrogen bonds between main-chain atoms in query residue and other atoms, and the number of hydrogen bonds including any atom in query residue.

B-factor
B-factor, also called temperature factor, is a measure of atomic thermal motion and disorder. In this work, we used the B-factor of alpha carbon to represent the flexibility of each residue.

Sequence-based feature generation
As the complements of structure-based features, we extracted a variety of sequence-based descriptors to characterize each residue, such as residue type, position-specific scoring matrix, residue conservation scores, predicted structural features, physicochemical properties, catalytic residue propensity, sequential position, and global sequence features.

Residue type
The identity of each residue can be used to check its catalytic signature. Each sequence position was thus encoded by a vector composed of 20 elements where the target residue type was set to 1 and the remainder was set to 0.

Position-specific scoring matrix (PSSM)
PSSM is a sequence profile comprising the evolutionary information of a protein sequence. For each query, we performed a search against NR database from NCBI using the PSI-BLAST program 6 with the parameters j = 3 and e = 0.001 to generate this profile.

Residue conservation scores
In addition to PSSM, the conservation scores of each residue can also be used to measure its evolutionary signatures. Here we implemented five entropy-based scores proposed by Capra and Singh 7 , such as Shannon entropy 8 , property entropy 9,10 , von Neumann entropy 11 , relative entropy 12 , and Jensen-Shannon divergence 7 . These features were generated based on the weighted observed percentages matrix output by PSI-BLAST.

Predicted structural features
Predicted structural features can offer useful information in the absence of native structures. We generated the predicted secondary structure, solvent accessible surface area, and backbone dihedral angles by SPINEX 13 and extracted the predicted disorder score by SPINED 14 .

Physicochemical properties
The physicochemical attributes of each residue can affect its catalytic performance. In this work, we retrieved nine amino acid indices from the AAindex database 15 , including number of atoms, number of electrostatic charge, number of potential hydrogen bonds, hydrophobicity, hydrophilicity, isoelectric point, mass, expected number of contacts within 14Å sphere, and electron-ion interaction potential, to reflect the specific nature of each residue type.

Catalytic residue propensity
It is well known that residues have different tendency to be involved in catalytic activity. We thus calculated the catalytic residue propensity of each residue type which is defined as the ratio between the amino acid frequency in catalytic residues and that in the whole sequence.

Sequential position
Sequential position can reflect the relative location of each residue in a given sequence. To this end, we extracted two features proposed by Chen et al. 16 , including terminus indicator and secondary structure segment indicators. The former is used to check whether a residue locates in the N-or C-terminus, whereas the latter checks whether a predicted helix or sheet segment exists in the neighborhood of a given residue. More details can be found in the original reference.

Global sequence features
The global information of a given sequence was captured by two features, including the sequence length and the amino acid composition of each sequence.

Dataset preparation Primary dataset
We used the dataset collected by Han et al. 17 , composed of 223 enzymes (CSA223), as the primary dataset to construct and evaluate our method. All entries in CSA223 were extracted from the SCOP database 18 and their structures cover the four major structural classes (i.e., allα, all-β, α+β, α/β). The sequence identity of any chain pair is less than 30%. Catalytic residues were collected from scientific literatures in the Catalytic Site Atlas. We finally attained 630 catalytic residues in this non-redundant dataset and the remaining 60658 residues were considered as non-catalytic residues.

Alternative datasets
Besides the CSA223 dataset, we further checked the robustness of our method using other well-established datasets, including EF_family, EF_superfamily, and EF_fold from Youn et al. 19 , HA_superfamily from Chea et al. 20 , NN from Gutteridge et al. 21 , and PC from Petrova and Wu 22 . These datatsets were non-redundant at different structural levels. For instance, the entries of EF series were from the representative families, superfamilies, and folds in terms of the SCOP classification.

Independent test datasets
The T124 dataset curated by Zhang et al. 23 was applied to independent testing in our work. All these 124 entries came from the HA_superfamily dataset and shared less than 30% sequence identity with any chain from the EF_fold dataset serving as the training set. In addition to the native structures of these enzymes, we also attempted to estimate whether our algorithm can be applicable to the predicted structures. To this end, we prepared another two datasets T124_M90 and T124_M30, in which the structures were modelled by the I-TASSER software 24 with relaxed and stringent sequence identity cutoffs (90% and 30%), respectively. Structural model having the best C-score was chosen as a representative for each entry. The mean RMSD between the native structures and high quality models (T124_M90) is 1.86Å, while the value between the native structures and low quality models (T124_M30) is 2.48Å, suggesting that the quality of predicted structures is generally accepted.

Structural genomics dataset
We also built a structural genomics targets dataset called SG2332 which was selected from the Oct 2015 PDB release. We first searched the classification keyword 'structural genomics' in the PDB database and attained 2577 PDB entries having structural information but without functional annotations. The retrieved entries were then split into 5505 individual chains, which were further clustered at a 90% sequence identity cutoff using the CD-HIT program 25 . We randomly selected one representative from each cluster and finally achieved 2332 chains.

Performance evaluation
To make a direct comparison with existing methods, we evaluated our predictors using 5fold cross-validation on the primary dataset (CSA223) and 10-fold cross-validation on the alternative datasets (e.g., EF_family, EF_superfamily, etc.), respectively. For conducting n-fold cross-validation, the target dataset was first separated into n subsets with an equal number of chains. Then one subset was used as the test set, while the rest was used as the training set. This procedure was repeated n times, in which each subset was tested in turn, to estimate the average performance. All parameters in our algorithm were determined using 5-fold cross-validation on CSA223. Moreover, the native and predicted structures in T124 were considered as the external data to assess our predictors trained on the EF_fold and CSA223 datasets. As a primary measure of prediction performance, the area under the receiver operating characteristic curve (AUC) was calculated. We also computed other well-established metrics including recall, precision, F1-score, accuracy (ACC), and Matthews correlation coefficient (MCC) as below. TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Additionally, we provided the standard error (SE) for all measures used in this work. For the 5-fold cross-validation (CSA223), the SE was estimated based on the performance of the five subsets. For the independent testing (T124), the bootstrapping algorithm was applied to the estimation of SE. We randomly selected (with replacement) a set of 124 chains from the T124 dataset and repeated this procedure 100 times. The SE was then estimated using the results of these bootstrap datasets. Figure S1. Selection of the optimal parameter for residue microenvironment. We utilized the number of shared facets in residue pairs as a constraint to generate different microenvironments for each residue. When the number is no less than 9, both DT-based MEscore and closeness yield optimal performance. Note that the catalytic function of a query residue was identified if its attribute value is larger than a given cutoff.  26 looks very similar. We thus selected five representative scalar factors with an increased distribution discrepancy, which might contribute to more multifaceted LN features. Figure S3. Selection of the optimal parameters for our hybrid algorithm. In the heat maps, the color is changed according to the AUC value of our structural or sequence prediction module with different parameter combinations (orange/yellow/cyan = high/medium/low), and the black dot denotes the highest AUC value. Figure S4. Performance of DT-and distance-based microenvironment score and topological features. Distance-based method was implemented based on Han et al. 17 . CL: closeness, DG: degree, BW: betweenness, CC: clustering coefficient, and ME: microenvironment score.   Figure S7. A snapshot of prediction results from our server. As shown in this figure, the results will be displayed from four perspectives. The first section provides summary information about the query protein and its optimal template. The second section shows graphical representation of prediction results, in which STRscore, SEQscore, and COMscore denote the prediction scores output by our structure-based module, sequence-based module, and integrative algorithm, respectively. The third section provides three-dimensional visualization of prediction results, in which the putative catalytic residues are highlighted in red sticks. The last section includes the details about prediction results, such as the outputs from our different predictors.