Estimation of model accuracy by a unique set of features and tree-based regressor

Computationally generated models of protein structures bridge the gap between the practically negligible price tag of sequencing and the high cost of experimental structure determination. By providing a low-cost (and often free) partial alternative to experimentally determined structures, these models help biologists design and interpret their experiments. Naturally, the more accurate the models, the more useful they are. However, methods for protein structure prediction generate many structural models of varying quality, necessitating means for estimating their accuracy. In this work we present MESHI_consensus, a new method for the estimation of model accuracy. The method uses a tree-based regressor and a set of structural, target-based, and consensus-based features. The new method achieved high performance in the EMA (Estimation of Model Accuracy) track of the recent CASP14 community-wide experiment (https://predictioncenter.org/casp14/index.cgi). The tertiary structure prediction track of that experiment revealed an unprecedented leap in prediction performance by a single prediction group/method, namely AlphaFold2. This achievement will inevitably have a profound impact on the field of protein structure prediction, including the accuracy estimation sub-task. We conclude this manuscript with some speculations regarding the future role of accuracy estimation in a new era of accurate protein structure prediction.


Methods
The following sections introduce the basic components of MESHI_consensus. We first present our benchmark, a dataset of targets and structural models thereof from previous CASP experiments, and the features derived from them. Then we describe the performance measures that guided the development of the ML model, as well as the model design procedure, which includes regressor and hyper-parameter selection. Finally, we present a data filtering process that reduces training-set noise.
Structural models dataset. We trained and evaluated our method using a dataset of 73,053 single-chain models, generated as blind server predictions for 345 CASP9-CASP13 targets (2010-2018) (Table S1), an extension of the dataset used in 22. These targets are a non-redundant subset of the ≈430 targets of these CASP experiments. To this end, targets were considered redundant, and discarded, if a newer target was strictly similar by either sequence (E-value < 10−3) or structure (more than half of the residues could be structurally aligned by the iterative magic fit method of Swiss-PDB-Viewer 48,49). Duplicate models, whose conformations are identical to those of other models (typically from different servers of the same group), were identified and removed.
Many server-generated models include clashes (unrealistically short interatomic distances) and other structural distortions, such as deviations from correct bond lengths and angles. Thus, before feature extraction, each structural model was subjected to energy minimization using the MESHI molecular modeling package 50. The energy function includes strong spatial constraints, so most distortions are removed with negligible structural changes.
The test set of this study includes 67 CASP14 targets (10,889 structural models), which were predicted by modeling servers during the CASP14 experiment (May-August, 2020). Both model generation and the estimation of their accuracy by the MESHI_consensus web server were done in a blind fashion, before the native structures became available. After the models were downloaded from the CASP14 website, they were energy minimized by the MESHI package and their features were fed to the MESHI_consensus model.
Features dataset. In this study, each structural model is represented by a vector of 982 features. These features may be divided into two broad classes: structural model-features, and target-features that modulate the former.
Structural model-features. The following features are calculated using the MESHI package 50.
• Basic features: 142 features derived solely from single model structures 51. They include a mix of commonly used and novel knowledge-based energy terms (some of them not yet published). These terms include pair-wise atomic potentials 40,52,53, torsion angles 54, hydrogen bonds and hydrogen-bond patterns 55, solvation terms, "meta" energy terms that consider the distribution of other terms within the protein atoms, an extended radius of gyration that takes into account different classes of amino acids (polar vs. non-polar, secondary structure elements vs. coil regions, etc.), and the compatibility of the models with solvent exposure prediction 56 and with 3-, 8- and 13-class secondary structure predictions 57-59.
• Consensus features: seven features (Eqs. 1-7) that represent the similarities between a model of a specific target and the other models of the same target. These terms are calculated for every model d ∈ T, where T is the set of structural models of some target.
Target-features. EMA datasets are organized in two levels: the objects that we study, and whose accuracies we predict, are structural models, yet each model belongs to a specific target (with no overlap between targets). Each target is characterized by a unique sequence, which is shared by all its models, and a unique native structure. Thus, the mapping of features to model qualities may be biased by target characteristics such as length and chemical composition (amino acid sequence), which differ between targets but are typically identical in most models of a given target. Consequently, feature distributions and their relation to model quality differ between targets (Fig. 1). Further, some training-set targets may be less informative than others, with respect to certain features, due to specific characteristics such as ligand binding.
(1) where i ∈ {1, 2, 4, 8}, and gdt_j(d, s), j ∈ {0.5, 1, 2, 4, 8}, is the maximal fraction of the residues of s that are less than j Å from the corresponding residues of d after superposition.

Figure 1. GOAP energy 60 and accuracy of a single structural model. Models of the two CASP targets T0601 and T0530 are depicted by blue and red points, respectively. Accuracy is measured as the GDT_TS between the models and the native structures, that is, gdt-ts(d, n) (Eq. 6), where n is the native structure of that target. The "GOAP" energy term is a strong feature and it (anti-)correlates well with the accuracies of both targets. Yet, for a given feature-value range (e.g., around −10,000), the qualities of the two proteins are very different.

We use this domain knowledge to generate target-specific features that allow the learning process to modulate the outcome of the model features:
• One-hot encoding of target name: binary features (one per target). That is, for each target T of the training set, there is a feature OH_T such that OH_T = 1 for all the models of T and 0 otherwise. When positioned in a node of a decision tree, OH_T splits the leaves of the sub-tree into T and non-T, rendering the features in the nodes of the T sub-tree practically meaningless. Interestingly, the training process does make use of this ability to eliminate the effect of specific features in specific targets.
• Z-score: a normalized (zero mean and unit standard deviation) version of each basic feature, based on the target's mean and standard deviation.
• Amino acid composition: 20 features for the frequency of each amino acid in the sequence of the target, and 6 features for the frequencies of amino acids with certain properties in the sequence: positively charged, negatively charged, aromatic, polar, and non-polar.
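As a simplified illustration of the consensus idea, assuming the pairwise gdt scores among all models of a target have already been computed, each model can be summarized by the median of its similarities to the other models of the same target (the actual method uses seven such features, Eqs. 1-7; the function name below is ours):

```python
import numpy as np

def consensus_feature(gdt: np.ndarray) -> np.ndarray:
    """For each model d of a target, summarize its similarity to the
    other models of the same target by the median pairwise gdt score.

    gdt : (n_models, n_models) symmetric matrix of pairwise gdt scores.
    Returns one consensus value per model.
    """
    n = gdt.shape[0]
    out = np.empty(n)
    for d in range(n):
        others = np.delete(gdt[d], d)  # exclude self-similarity
        out[d] = np.median(others)
    return out

# Toy target with three models: models 0 and 1 resemble each other,
# model 2 is an outlier, so its consensus score is lower.
gdt = np.array([[1.0, 0.8, 0.3],
                [0.8, 1.0, 0.4],
                [0.3, 0.4, 1.0]])
consensus = consensus_feature(gdt)  # medians: 0.55, 0.6, 0.35
```

Good (near-native) models tend to resemble one another, so their consensus medians are high, while poor models are dissimilar to everything, which is exactly the signal the consensus features exploit.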
Combining all the feature vectors of the structural models dataset creates a features dataset.
Performance measures. We aim to predict the accuracy of structural models in terms of GDT_TS between the models and the native structures, that is, gdt-ts(d, n) (Eq. 6), where n is the native structure of that target. Specifically, we use three performance criteria:
1. Root mean square of prediction errors (RMSE): the per-target distance between the predicted values and the observed (true) values.
2. LOSS: for each target, the difference between the quality of the best model (highest observed GDT_TS) and the quality of the top-ranking model.
3. 5-LOSS: for each target, the minimum difference between the quality of the best model (highest observed GDT_TS) and the qualities of the five top-ranking models.
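The three criteria can be computed per target as follows (a minimal sketch; the function and variable names are ours, not from the MESHI codebase):

```python
import numpy as np

def per_target_scores(pred, true):
    """Per-target EMA performance: RMSE, LOSS, and 5-LOSS.

    pred, true : predicted and observed GDT_TS values for all models
    of a single target.
    """
    pred, true = np.asarray(pred), np.asarray(true)
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    best = true.max()                      # quality of the best model
    order = np.argsort(-pred)              # models ranked by predicted quality
    loss = best - true[order[0]]           # gap for the top-ranked model
    loss5 = best - true[order[:5]].max()   # best among the five top-ranked
    return rmse, loss, loss5

# Six models of one hypothetical target.
pred = [0.70, 0.60, 0.55, 0.40, 0.35, 0.30]
true = [0.50, 0.65, 0.45, 0.62, 0.30, 0.20]
rmse, loss, loss5 = per_target_scores(pred, true)
# The top-ranked model (true 0.50) misses the best one (0.65), so LOSS = 0.15;
# the best model is among the top five, so 5-LOSS = 0.
```

Note that LOSS and 5-LOSS measure ranking ability rather than prediction error: a method can have a large RMSE yet still place the best model first.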
For dataset models, method performance is estimated by the median of 345 Leave-One-Target-Out cross-validation experiments (one per dataset target). In each experiment, the statistical model is trained using all the targets except one, which serves as the test set. This strategy is computationally expensive but reduces biases, and, in a sense, simulates the real-world scenario, in which we learn from all the models of targets whose native structures are known and assess the model qualities of a target whose structure is not yet known.
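The Leave-One-Target-Out protocol maps directly onto grouped cross-validation, with targets as groups. A minimal sketch on synthetic data (a linear regressor stands in for the actual LightGBM model; all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                                  # model feature vectors
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=60)   # quality labels
targets = np.repeat(np.arange(6), 10)                         # 6 targets, 10 models each

errors = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=targets):
    reg = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = reg.predict(X[test_idx])
    errors.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))  # per-target RMSE

median_rmse = np.median(errors)  # the summary statistic used in this study
```

Grouping by target ensures that no model of the held-out target leaks into the training set, which is essential here because models of the same target share a sequence and correlated features.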
Method design. The design of the EMA method aimed to optimize two performance criteria: the median per-target RMSE and the median LOSS. We used Leave-One-Target-Out cross-validation experiments to choose the regressor, its hyper-parameters, and the data filtering strategy.
Regressor. In this work, we formulate EMA as a regression problem that maps measurable features of the structural models to a continuous quality score (GDT_TS) between the models and the native structures. To this end, we tested three regressors: linear regression, the Light Gradient Boosted Machine (LightGBM) regressor 61, and a fully connected neural network. The superior performance (Fig. 2) of LightGBM motivated us to examine five other tree-based regressors: BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor, and ExtraTreesRegressor from the scikit-learn library 62, as well as the Extreme Gradient Boosting (XGB) regressor 63. We retained LightGBM, however, as it outperforms all five by a small margin and is faster to train.
Hyperparameters. We used grid search to find the optimal values of two hyperparameters of the LightGBM regressor: the learning rate and the number of estimators. A learning rate of 0.1 and 100 estimators achieved good results in a reasonable computation time. For all the other parameters, we used the default values supplied by the LightGBM framework. Specifically, the loss function minimized during regressor training is the RMSE.
Data filtering. Some of the dataset targets are isolated chains of multi-subunit complexes (e.g., single helices of helix-bundles). Estimating their quality is a challenge because hydrophobic interface residues appear artificially exposed when seen out of context. For such targets, MESHI may produce feature values that are inconsistent with the label (GDT_TS), increasing noise and impairing the learning process. The qualities of such targets are hard to estimate, even in an over-fitting scenario. The removal of 18 such targets (Table S1) from the training set significantly reduces the median error of the quality estimates and does not affect the identification of the best models (Fig. 3).

Results
MESHI_consensus was developed using a dataset of CASP server models that were generated as blind predictions in five consecutive CASP experiments (CASP9-CASP13). CASP14 models serve as the ultimate test set, as their true qualities were unknown at the time of prediction. Here we present the method's performance in predicting the accuracies of models in the dataset, consider the contributions of different feature types, and conclude by presenting and discussing CASP14 performance.

Feature definitions (see Fig. 5):
• deepCNF8Compatibility: a measure of the agreement between the secondary structure (8 states) of the model's residues (as measured by DSSP 65) and their predicted secondary structure 66.
• contacts8 and contacts14: the average numbers of contacts between carbon atoms, with thresholds of 8 Å and 14 Å, respectively.
• meshinr_dssp8Compatibility and meshirw_dssp8Compatibility_Weighted: two slightly different measures of the agreement between the secondary structure (8 states) of the model's residues (as measured by DSSP 65) and their predicted secondary structure 59.
• scCarbonN: the number of carbon atoms in the model's side-chains.
• coverage: the fraction of the target sequence that is actually modeled.
• SheetFraction: the fraction of beta-sheet residues among the residues with any secondary structure.
• consensus features: see Eqs. (1)-(4).
• gdt1_consensus_median: the median value of gdt1_consensus among all the models of a specific target.
• hydrogenBondsPairs_median: the median value of a cooperative hydrogen-bond energy term 67 among all the models of a specific target.
• cooperativeZstdRamachandranSidechain_median: the median value of a cooperative torsion-angle energy term among all the models of a specific target.

Additional subgroups of features were tested (data not shown), but the best result was obtained when we used the whole set of features.
The measure we used for the importance of the features is "GAIN", the sum of the information gains of all splitting points that use a given feature, where information gain is the Kullback-Leibler 64 divergence of the data before and after the split. A higher value of this metric, compared to another feature, implies that the feature is more important for the predictive model. The ten highest-importance features in LightGBM models that use either all the features or only the basic ones are depicted in Fig. 5. These models were trained on all the benchmark targets. Qualitatively similar estimates of feature GAINs, derived from benchmark subsets, are reported in the Feature importance file (supplementary material). All but one of the LightGBM models make use of the possibility to distinguish between structural models from different targets. When target-specific features are available to the models, they choose from a wide variety, without a clear preference. The basic features were not intended to include such features explicitly, yet the models assign relatively high importance to the total number of side-chain atoms in the model. This feature seems almost arbitrary and was added to the basic features by mistake (being a component of other features). We speculate that the models consistently chose it because it allows target distinction. When all features are considered, consensus features are ranked highest by a large margin, apparently because good (i.e., close to native) models are similar to one another (they are all similar to the native structure), while low-quality models can be very different from each other. This observation is consistent with the dominance of consensus-based methods in the EMA field. When only the basic features are considered, two classes of features become dominant. The first class (e.g., sasaCompatibility) quantifies the agreement between the observed and predicted solvent-exposed area and the secondary structure of model residues.
Apparently, an inability to reproduce them is a strong indicator of low model quality. This is consistent with the disruptive effect of out-of-context targets (e.g., isolated subunits) on learning.
The second important class of basic features (e.g., goap_ag) rewards structural traits that are common in native structures. One common trait is compactness which manifests itself by a large number of atom contacts. The other common trait is the abundance of specific atomic interactions (e.g., contacting side-chain atoms of hydrophobic residues). Having many favorable interactions, and a compact structure are strong indicators of high model quality. Finally, as our quality measure, GDT_TS between the models and the native structures represents the fraction of accurately modeled residues, it is bound by the fraction of the target sequence which is actually modeled. Coverage features depict this fraction and are consistently ranked among the ten most important.
Considering the dominance of consensus features, we tested whether the other features are needed at all. To this end, we performed Leave-One-Target-Out experiments with three different sets of features (Fig. 6). The first experiment served as a baseline and included only the basic features; the second used only the consensus features; and the third used all the features. In both RMSE and LOSS, the best results are obtained by using the entire set of features. For RMSE (A), which is the loss function of the regressor, most of the improvement is due to the consensus features, consistent with the "GAIN" results (Fig. 5). Yet, by using all the features we obtain statistically better performance. For LOSS (B), the basic features outperform the consensus ones, reflecting the difficulty of consensus features in identifying exceptionally good models. For example, the best structural model of target T0581 (BAKER-ROSETTASERVER_TS4, 0.64 GDT_TS) was picked by the statistical model that was trained with the basic features only; training with all the features resulted in picking the second-best model (GDT_TS 0.33), which is similar to some other inaccurate ones.
The CASP14 experiment serves as the independent test set of this study. MESHI_consensus took part in that experiment as an EMA server and submitted 11,135 global quality predictions of server models. One target, T1093, was missed due to a technical failure. Figure 7 depicts the best (left) and worst (right) results, with targets T1046s2 and T1031, respectively. The overall performance on this test set (Fig. 8) is comparable to that of the benchmark (Fig. 4); the slight performance reduction is discussed below. Comparison with the other 71 research groups that competed in the EMA category reveals state-of-the-art performance, with more than 65% of the predictions ranked within the top 10 in either LOSS or RMSE. Specifically, MESHI_consensus reached:
• … (Table 1A) (https://predictioncenter.org/casp14/qa_aucmcc.cgi)
• Third lowest average prediction error (Table 1B) (https://predictioncenter.org/casp14/qa_diff_mqas.cgi)
• Sixth lowest average LOSS (Table 1C) (https://predictioncenter.org/casp14/qa_diff2best.cgi)
Notably, among the other top-performing methods, one (MESHI) is a curiosity-driven variant of MESHI_consensus that simply adds server names as a feature, and it ranked a bit higher. Notwithstanding these achievements, the overall performance of MESHI_consensus in CASP14 (Fig. 8) is worse than in the dataset's Leave-One-Target-Out experiments (Fig. 4). Notably, nine of the EMA targets are domains of a single large protein (Fig. 9). MESHI_consensus failed on six of them, with LOSS values ranging from 0.16 to 0.4 (Fig. 7, right). We speculate that the lack of the protein context contributed to the poor performance, as no other targets showed such high LOSS values. As demonstrated in Fig. 9, these structures have numerous inter-domain stabilizing interactions that are missing in the isolated EMA targets. A similar phenomenon is also observed in the dataset (see the Data filtering section). Target T1073 also showed exceptionally bad performance, with an RMSE of 0.41; unfortunately, we cannot study this case, as its structure has not been published yet. Another, more speculative, explanation for the lower performance is the methodological turning point of CASP14 (discussed below). It raises a major challenge to MESHI_consensus, as well as to any supervised learning method that trains on sets of CASP server models: the test (CASP14 models) and training sets were not sampled from the same distribution, as the models of CASP14 are on average more accurate than those of previous CASP experiments.

Table 1. …, and LOSS (C). Results are reproduced from the CASP website (https://predictioncenter.org/casp14). Note that on the CASP site RMSE and LOSS are referred to as "differences (predicted vs observed)" and "difference from the best", respectively, and their performances are depicted as percentages. The servers "Seder2020" and "Seder2020hard", which submitted an EMA prediction for a single target, were omitted.

Discussion
This manuscript introduces MESHI_consensus, a new method for quality assessment of protein models. MESHI_consensus uses a large and diverse set of features, representing both physical concepts (e.g., energy terms) and domain knowledge (target-specific and consensus-based features). The features are integrated into a single score by a powerful and computationally efficient tree-based LightGBM regressor 61. One type of state-of-the-art feature, which the method lacks, is compatibility with predicted distances derived from multiple sequence alignments (MSA) 11,76. MSA-driven distances are the keystone of the current breakthrough in PSP, and compatibility with them seems to be a powerful feature 20. The development of MESHI_consensus was guided by Leave-One-Target-Out experiments using a non-redundant dataset of structural models from previous CASP experiments. The recent CASP14 provided an objective performance test, and MESHI_consensus scored among the top methods (see Results). One may speculate that had we used contact compatibility features we could have performed better. During the study, we invested much effort in analyzing failures, that is, targets for which we considerably missed the actual model qualities and/or their rankings. Many of these failures could be rationalized in retrospect as related to the inability of our features to account for stabilizing inter-molecular interactions. We tried to implement insights from this analysis through data filtering, with limited success. One may hope, though, that a more systematic approach to this problem may lead to better performance in future studies.
A profound limitation of MESHI_consensus is its reliance on a single native structure as the gold standard for labeling the data. This is in line with the common practice in the field, which is applied in CASP and in all the studies that we are aware of. Yet, this practice ignores the structural flexibility of proteins, as manifested by diverse structures of the same protein in different PDB entries 77 and in the results of NMR studies.

Figure 9. …, and nine domain targets T1031, T1033, T1035, T1037, T1039, T1040, T1041, T1042, and T1043. For six of them (T1031, T1033, T1035, T1039, T1040, and T1043), MESHI_consensus failed to provide reliable EMA predictions (LOSS above 0.16). We speculate that these failures may be attributed, at least partially, to the absence of the protein context in the isolated domains.

Figure 10 demonstrates this observation by assessing the server models of target T1025 using the ligand-bound and apo crystal structures of that protein. Target T1064 shows a similar, yet less pronounced, trend (data not shown). In the CASP context alternative structures are rarely available, and thus ignoring them is practically unavoidable and can be seen as part of the "noise" characteristic of any experiment. In the training phase, however, ignoring available knowledge of structural multiplicity adds superficial, arbitrary constraints to the learning process and probably harms the resulting statistical model. Structural multiplicity can be introduced into the training phase of EMA methods if structural models are evaluated by their similarity to the closest of the known structures, rather than to a single arbitrary one. We have already demonstrated the usefulness of a similar approach in the related fields of secondary structure prediction 78 and knowledge-based energy functions 79. Finally, CASP14 has witnessed a remarkable breakthrough in PSP, with many models of hard targets reaching experimental quality.
This achievement had a limited effect on the EMA section of CASP14 as the cutting-edge method, AlphaFold 16 , did not provide a server. Yet, it is evident that a new standard of model qualities is set 80 . Will EMA be needed at all when the models are "almost perfect"? An obvious answer is that we are in the middle of the event and its consequences cannot be predicted. More fundamentally, the new achievements seem to open new horizons for PSP, considerably improving our ability to cope with essential problems like structures of molecular complexes and protein dynamics. These challenges require their own EMA tools.
We believe that the insights of this study, most importantly the central role of structural multiplicity and molecular context, will gain much importance in the era of high-accuracy modeling. On a more speculative note, we suggest that features like the ones used in this and related studies will also remain relevant, as design principles for new, probably neural-network-based, architectures. Current applications of neural-network techniques to EMA use standard architectures and avoid "feature engineering" such as energy terms. One may speculate that introducing more domain knowledge into the network architecture and its input features will result in more accurate and stable performance.

Data availability
The data sets generated and analysed during the current study are available at http://meshi1.cs.bgu.ac.il/BittonAndKeasar2021/.
Received: 24 February 2022; Accepted: 20 July 2022

Figure 10. Model accuracy with respect to two alternative native structures. Target T1025 was evaluated by the CASP14 assessors with the ligand-bound D-glucose O-methyltransferase as its native structure (PDB entry 6UV6 69). The structure of the unbound protein is also available (PDB entry 6UVQ 70). While the choice of either structure is arbitrary, the resulting performance measures are quantitatively different. A theoretical EMA oracle that provides the 6UVQ-based accuracy as its prediction would have an RMSE of 4% and a LOSS of 6%. The dots represent server models, the black line is the diagonal (x = y), and the red and magenta circles depict the top model by 6UV6 and 6UVQ, respectively.