Introduction

Knowledge based potentials (KBP)1,2,3,4 are energy functions derived from databases of known protein conformations that empirically aim to capture the most relevant aspects of the physical chemistry of protein structure and function. They are derived by measuring the probability of an observable in an ensemble of experimental structures relative to a reference state2,6,16,17,18. The conversion of the probability into an energy function is normally done employing Boltzmann’s law3. Theory of conditional probabilities5,6, linear and quadratic programming7 and information theory8 have been invoked to justify the approach. The simplest observable1,2 that one can use to characterize a structure is the presence of a contact between two specific residues. This procedure has been generalized to include more and more complex observables3,6,7,9,10,11,12,13,14,15, making the KBPs more and more accurate. Because of their excellent balance between accuracy and computational efficiency, KBPs are widely used in protein design19, in simulation of protein folding20, binding21 and aggregation22, in protein structure prediction23 and in fold recognition24.

The quality of knowledge-based potentials is normally benchmarked on their capability of recognizing the folded state and/or the best model in a set of decoys5,25,26,27. It is now believed that decoy sets composed by models submitted during the biennal CASP (Critical Assessment in Structure Prediction) competition28 are the most challenging29. Although it is not trivial to definitely assess their absolute efficiency, many KBP7,15,17,30,31,32,33,34 perform quite well. Rosetta10,27, a scoring function derived using an elegant Bayesian analysis, the composite scoring function QMEAN635,36 and the potential RF_CB_SRS_OD introduced by Rykunov and Fiser18 are particularly successfull, even when tested on CASP targets18,28. However, even the best performing KBP is not capable of distinguishing the folded state in all the decoy sets, indicating that improvements are still possible. Moreover, the state-of-the-art KBPs exploit many complex observables, such as the distance between pair of residues or atoms, which significantly boosts the number of parameters. The weights of the various terms (which may include correlated observables, such as value of the torsions and the presence of secondary structure elements) are optimized on the decoy sets in order to obtain the best performance10,35. This may affect the robustness of the potentials and the possibility to use them for different purposes other than fold recognition.

We here introduce a statistical potential that we call BACH (Bayesian Analysis Conformation Hunt). Its definition relies on a few binary structural observables, such as the presence of short range contacts between pair of residues, in the spirit of the original works1,2 on KBPs. As a consequence, the energy function depends on only 1091 parameters, much less than other state-of-the-art potentials. The parameters are derived from a relatively small set of experimental folded structures (TOP500) but do not vary significantly when computed over a larger dataset, revealing a remarkable robustness. Moreover (and possibly more importantly) the parameters are not optimized on any decoy set. This potential, despite of its relative simplicity, has a remarkable capability in recognizing the folded state in the CASP sets. By assigning to it the lowest energy value, BACH discriminates the native conformation in 58 % of the sets we considered. In all the other cases the native structure is among the best. In the worst case, it ranks #35 in a decoy set with 248 structures. This performance is significantly better than that of other potentials and remains remarkably unaffected for different sets of alternative BACH parameters, derived either from a subset of TOP500 consisting of only 50 randomly chosen structures, or from TOP500 with a reduced 9 class representation of aminoacids37. We also found that BACH performs marginally better than other competing potentials in finding a model as close as possible to the native state within the pool of the few lowest energy decoys.

We finally investigated the reason of the poorer performance of BACH on some decoys sets. By analyzing ensembles generated by molecular dynamics from a single model we find that the BACH energy of thermodynamically equivalent structures is affected by large fluctuations. In fact, fluctuations of similar amplitude are also present among different models of the same protein obtained by NMR. Based on these observations, we propose that computing a scoring function on single structures may not be always adequate for discriminating the best. A more reliable quality measure is the probability distribution of a scoring function computed in a finite temperature molecular dynamics run, rather than a single value. Applying this idea we are able to discriminate the native state also in cases where evaluating the scoring function on single structures leads to ambiguous results.

Results

The main goal of this work was to develop an efficient and physically inspired knowledge based potential which is able to discriminate the native conformation from a set of different structures with the same primary sequence. This potential is constructed by analyzing a set of 500 experimentally resolved protein structures (see Methods), monitoring the probability of the event that single residues or residue pairs are observed in one of the structural classes defined as follows. Each residue is classified as solvent exposed or not by computing its solvent accessible surface38. Every pair of residues is classified in one out of five mutually exclusive states: (i) as forming an α-bridge; (ii) as forming an anti-parallel β-bridge; (iii) as forming a parallel β-bridge; (iv) as forming a contact through their side chain atoms; (v) as not realizing any of the previous four conditions. All classifications are verified considering the full atomic configuration of the system (see Methods). Following a Bayesian analysis procedure, the 1090 energy parameters associated to event occurrences are estimated from their observed probabilities. An extra parameter, p, fixes the relative weight of the two contributions, the solvation and the pairwise. This parameter is determined (p = 0.6) by monitoring the fluctuations of the two components on the TOP500 dataset as described in Methods. These 1091 parameters define the BACH statistical potential. Moreover, we have validated the robustness of our approach by checking that the performance remains remarkably unaffected for different sets of alternative BACH parameters (see Methods).

Discriminating the native conformation

We have calculated the BACH energy over various decoy sets and checked its ability of assigning the lowest energy value to the native conformation. In figure 1 we show the BACH energy versus the distance in GDT39 to the native structure for four decoy sets: A) T0388 from CASP8 ( pdb code 3CYN), B) T0488 from CASP8 (pdb code 2VWR), C) T0575 from CASP 9 (pdb code 3NRG), D) 15561 decoy structures of protein GB3 generated by molecular dynamics and bias exchange metadynamics (details of the simulations are presented in Methods). One can see that in cases A) and C) BACH is able to discriminate the native structure from the other structures by assigning to it the lowest energy value. Case B) is the CASP8-9 decoy set in which BACH performance is the worst. The native conformation is ranked #35 out of 248 structures (14%). In the case of the structures generated by molecular dynamics (figure 1-D) the native conformation is in the best 0.5%, ranked #86 out of 15561 structures. In all cases, the closer a decoy structure is to the native state (as quantified by a large GDT), the lower is, on average, its BACH energy, thus showing a funnel-like behaviour. This is a first qualitative indication that BACH might be a rather powerful tool for protein structure discrimination. In the following, we will provide quantitative evidence supporting this claim and compare the performance of BACH with other knowledge based potentials.

Figure 1
figure 1

The BACH energy as a function of the GDT with respect to the native structure, for three decoy sets of CASP8-9: A) T0388 (native pdb code 3CYN), B) T0488 (pdb code 2VWR), C) T0575 (pdb code 3NRG) and D) for a decoy set of 15561 structures of protein GB3 (pdb code 2OED) generated by molecular dynamics simulations and bias exchange metadynamics. Red points: native conformation. Blue circles: structures with low GDT (distant from the folded state) and with a BACH energy lower than the native state.

Comparison with other knowledge-based potentials

We compare the performance of BACH with QMEAN635,36, ROSETTA5,10,27 and RF_CB_SRS_OD18. These potentials have been shown to perform extremely well in decoy discrimination18,28. The comparison is performed on a subset of CASP8-9 targets (see Methods), as it has been previously shown that these decoys are the most challenging29. We used only sets in which the decoy structures have the same sequence as the native structure and that contain the coordinates of all the heavy atoms (see Table S1). This greatly reduces the number of targets, but allows comparing structures that are strictly chemically equivalent, removing a source of systematic errors.

The first measure of performance we use is the normalized rank, defined as the rank of the native structure divided by the total number of structures in the decoy set. For example, if the native structure has the lowest (resp. highest) possible energy among a set of 100 structures, its rank will be 0.01 (resp. 1). In figure 2-A we show the normalized rank for the decoy sets in CASP 8/9 (see Methods) for BACH, QMEAN6, RF_CB_SRS_OD and ROSETTA. Strikingly, BACH has the lowest sorted rank, namely the best performance in discriminating the native structure from the decoys. BACH ranks the native within the best 5% for 28 decoy sets, whilst QMEAN6 does it for 23, RF_CB_SRS_OD for 25 and ROSETTA for 19 out of a total of 33 sets. Moreover, for 19 sets BACH ranks the native structure as the first, whilst QMEAN6 does this for 14 cases, RF_CB_SRS_OD for 13 and ROSETTA for 3. These results show that BACH is able to discriminate accurately the native structure, assigning to it, in many cases, the lowest energy value.

Figure 2
figure 2

A) Normalized rank and B) Z-score sorted for the decoy sets in CASP 8–9 and calculated for the BACH, QMEAN6, RF_CB_SRS_OD and ROSETTA scoring functions.

Another standard measure to characterize the performance of a scoring function is the Z-score, defined as the distance, measured in standard deviations, of the energy of the native state from the mean energy of the set. The larger the Z-score, the better the potential is in discriminating the native structure. In figure 2-B, the Z-scores for the decoy sets in CASP 8/9 are shown for each potential. As one can see, BACH has, in almost all cases, the largest Z-score values. Also for the Z-score, QMEAN6 has the second best performance, while RF_CB_SRS_OD and ROSETTA perform more poorly.

BACH performance in native state discrimination is largely unaffected by varying the relative weight p of the solvation term with respect to the pairwise term by a factor of 2 (see Eq. (1)). For both p = 0.3 and p = 1.2 we still rank the native structure first in 17 cases out of 33. We performed even more stringent tests on the robustness of the BACH potential. First, we rederived the 1090 BACH energy parameters on a much smaller subset of the TOP500 dataset, consisting of 50 randomly chosen structures. Secondly, we derived 243 reduced BACH energy parameters, from the TOP500 dataset, implementing a 9 classes residue alphabet37 (see Methods). In both cases we kept p = 0.6. As is shown in Fig. 3, the performance in properly ranking native structures is strikingly comparable, even though slightly worse, to the one obtained with the BACH energy parameters constructed with 20 aminoacid classes from the full TOP500 dataset.

Figure 3
figure 3

Normalized rank sorted for the decoy sets in CASP 8–9, calculated for the BACH, QMEAN6 and for alternative implementations of the BACH scoring functions: A) BACH parameters derived from 50 randomly chosen structures of TOP500 B) reduced BACH parameters for 9 aminoacid classes. The black curve is the same as in Fig. 2-A.

In order to quantify the “funnelness” we also considered the Pearson correlation coefficient between the score and the GDT with respect to the native structure. This quantity measures how large are the fluctuations of a set of points with respect to a linear fit of the data. Its absolute value for the 33 decoy sets and for the four scoring functions we considered is shown in Fig. 4-A. For this specific quantity, QMEAN6 performs better than BACH, namely it produces on average a more linear dependence of the scoring function on GDT. However, the highest Pearson coefficient is observed in the decoy sets in which the native state is poorly discriminated. This is shown in figure 4-B, where we plot the absolute value of the Pearson correlation coefficient versus the Z-score for BACH and QMEAN6 for the 33 decoy sets of CASP 8/9. Strikingly, in almost all the cases in which the Pearson coefficient is higher than 0.8 the Z-score is below 1.5, indicating that a scoring function producing a good linear correlation is normally less capable of distinguishing the native state. This could indicate that the profile of the energy score as a function of the GDT to the native is not necessarily best fitted to a linear curve for a funnel-shaped landscape. This idea is consistent with previous theroretical arguments40.

Figure 4
figure 4

A) Absolute value of the Pearson correlation coefficient sorted for the decoy sets in CASP8-9, for the BACH, QMEAN6, RF_CB_SRS_OD and ROSETTA scoring functions. B) Absolute value of the Pearson correlation coefficient versus the Z-score for BACH and QMEAN6 for the decoys in CASP 8-9. C) Difference in GDT between the structure in each set which is closest to the folded state and the low energy representative structure (the decoy that is ranked within the first N, according to each potential and that has the highest GDT).

We also benchmarked the capability of BACH in recognizing models that are close to the native structure. This is relevant, as KBPs are normally used in computational procedures aimed at iteratively finding conformations that are closer and closer to an unknown folded state. As we will show quantitatively later, estimates of a scoring function on a single structure are unavoidably affected by fluctuations, that hinder their statistical significance. If in an algorithm one always selects only the lowest energy structure, there would be high chances of missing a relevant “branch” of native-like conformations. In order to quantify this effect, we consider the N lowest energy structures and we select among them the one with the largest GDT. We compute the difference ΔNGDT between this value and the GDT of the best model in all the decoy set, namely the decoy with the largest GDT. If the best model is one of the N lowest energy structures ΔNGDT = 0. In figure 4-C, we report ΔNGDT as a function of N averaged over the 33 CASP8-9 decoy sets. For N = 1, only the lowest energy structure is selected, providing a quality measure similar to the one used in ref.18. For BACH and QMEAN6 the average Δ1GDT is 16 and 11 GDT units, respectively. This indicates that the top scoring QMEAN6 model is on average 5 GDT units closer to the best model in the set. For increasing values of N, the average ΔNGDT decreases rapidly down to ~5 for all scoring functions, indicating a significant gain in information achieved by considering 5–10 alternative structures. In this range of N, the performance of BACH becomes better than that of the other potentials. Adding more and more structures reduces only marginally the average ΔNGDT, but with the drawback that in a hypothetical algorithm one would have to consider a very large number of branches which would increase the computational cost.

The performance of BACH on traditional decoy sets

We also benchmarked the performance of BACH over decoy sets that are considered the standard ones to test the scoring functions: semfold26, 4state25, fisa5 and RosettAll27. Since these decoy sets include a large number of targets satisfying the aforementioned condition of chemical equivalence, we use these sets to assess how BACH's performance is affected by the presence of bound partners (e.g. ligands or other protein chains) in the native conformation. As supplementary information we present the normalized rank of the folded state and compare BACH's performance with the other potentials. In figure S1-B we have selected, from these sets, only the decoys that have a monomeric native structure with no ligands. The performance of BACH remains marginally better than the other KBPs also in this case. In particular, the normalized rank of BACH is smaller than 0.2 in all the cases. For 12 sets BACH ranks the native structure as first, similarly to QMEAN6, whilst RF_CB_SRS_OD does it for 2 cases and ROSETTA for 8 out of a total of 19 sets. In figure S1-A, we plot the normalized ranking over all the sets, also those including target structures that are multimeric or have ligands. Even if these decoy sets include cases in which the native state is not monomeric, BACH still performs at least as good as other competitors. We also see that the other scoring functions, in particular ROSETTA, perform much better with these decoys than with the CASP decoy sets, consistently with what is found in Ref.29.

The role of fluctuations

The results presented in Fig. 1 and 2 demonstrate the excellent capability of BACH in distinguishing the folded state. However, in the decoy sets we considered there are still several cases in which the native structure does not have the lowest BACH energy. Two examples are shown in Fig. 1-B and 1-D. For instance, for the decoys of GB3 generated by molecular dynamics and bias exchange metadynamics there are several structures that have high GDT and a BACH energy lower than the native structure. Even some structures of low GDT have a remarkably low BACH energy (blue circle in Fig. 1-D). These decoys have a different tertiary arrangement with respect to the native structure. One wonders whether these structures are a signal of a flaw in the BACH potential, that it is not able to recognize them as misfolded states. In order to investigate this point, we selected two structures enclosed in the blue circle in Figure 1-D and we performed for each of them a finite temperature molecular dynamics (MD) simulation in explicit solvent (see Methods). We compare the results with an ensemble of structures obtained with a similar run performed starting from the native state. As found in Fig. 5 all the MD simulations show high thermal fluctuations in BACH energy, of the order of 7 or more. Remarkably, the structures generated from the folded state show a BACH energy distribution (Fig. 5-A) clearly different from those obtained for the other ensembles. This demonstrates that the low BACH energy structures enclosed by the blue circle in Figure 1-D do not correspond to meaningful low energy conformations.

Figure 5
figure 5

A) BACH energy distributions for sets of structures generated with molecular dynamics simulations for GB3 protein (see Methods): the folded state in explicit solvent (black line), two structures with initial low BACH energy (green and blue lines, respectively). Points: initial conformations. B) 2VWR protein (set T0488 of CASP8) 1ns of MD simulations for: the native conformation (black line), decoy structure T0488TS256_1 (blue line) and decoy structure T0488TS299_3 (green line). Points: initial conformations. C) BACH energy distribution for the 20 NMR models of protein 2K49 (set T0472 of CASP8) and for the structures obtained by 1ns of MD simulations for model #1.

To confirm and exploit further this idea we considered the CASP8-9 decoy set in which BACH has the worst performance (T0488, where the native was ranked #35 out of 248 structures, see Fig. 1-B). Following a procedure similar to the one described above, we generated by MD simulations three sets of structures (see Methods for details) starting from the native state and from two decoy structures with lower energy than the native state (T0488TS256_1 and T0488TS299_3, ranked according to BACH #1 and #4, respectively). These structures have a high GDT score (GDT > 80.0) and therefore a high topological similarity with the folded conformation. In Fig 5-B we compare the BACH energies distributions for the three sets of structures. The ability of BACH in distinguishing the native state from both decoys is confirmed; the discrimination is less sharp than in the case of GB3, because of the higher similarity of decoys to the native state. In conclusion, the failure of BACH in pinpointing the native state in the original decoy set can be removed by looking at the appropriate energy distribution over a finite temperature run.

The fluctuations we observe in Fig. 5-A/B are not artifacts of the simulation, but are also present in experimental data. This is shown in Fig 5-C, where we compare the probability distribution of BACH energy observed for the 20 native NMR models of the T0472 CASP8 decoy set (pdb code 2K49) with that computed over a finite temperature run starting from one of such models. These results confirm what we discussed previously, namely that the energy value of a single structure is not always significant, as it is affected by thermal fluctuations. A more reliable quality measure of a model is the probability distribution of the BACH energy computed on an ensemble of structures that can be obtained by means of a finite temperature molecular dynamics run.

Discussion

We have developed a knowledge based potential, named BACH, that employs in its definition only a few binary structural observables, monitoring the presence of contacts, like originally proposed more than thirty years ago1,2. A novel ingredient in BACH is splitting residue-residue contacts, within the same Bayesian framework, in those present within α-helices or β-sheets or instead in those that are not involved in secondary structure motifs. The former are monitored by verifying the presence of hydrogen bonds between backbone atoms, the latter by assessing the proximity between side chain heavy atoms. This last choice of defining non-secondary structure contacts by ignoring backbone atoms has been already recognized to improve crucially the performance of a scoring function41. A second innovative aspect of BACH is the evaluation, again within a Bayesian framework, of the propensities of single residues to be buried within the interior of the protein globule or to stay exposed on its surface.

BACH's performance has been benchmarked against other state-of-the-art potentials on the most challenging decoy sets. We used indicators that assess either how the native structure is recognized or how different models are sorted based on their similarity to the native state. BACH performs better than other scoring functions in discriminating the native structure and in assigning to it the largest energy gap with respect to the mean energy of the decoy set. This is a striking result since the parameters employed by BACH are fewer in comparison with other KBPs and no optimization on decoy sets was used to boost the performance. The latter is moreover maintained if the number of parameters is further reduced by employng a reduced 9 classes aminoacid alphabet37 or if the parameters are derived from just 50 structures, demonstrating the remarkable robustness of our potential.

We then benchmarked the capability of BACH in producing “funnel-shaped” distributions. QMEAN6 performs better than BACH in generating a linear correlation of the score with the GDT from the native state, as quantified by the Pearson coefficient. However, the highest Pearson coefficient is observed in the decoy sets in which the native state is poorly discriminated.

We have also shown that the 5-10 lowest energy decoys selected by BACH are likely to include better models of the native structure than other competing potentials. The procedure of searching for the best native-like model in a pool of low energy conformations is justified by the observation that the rankings may be affected by the presence of thermal fluctuations. Therefore, considering only the lowest energy structure might lead to inaccurate modeling. In fact, we have shown that thermal fluctuations are present with the same magnitude both among the different NMR models of the same protein and within a structural ensemble obtained by finite temperature molecular dynamics simulations. As a consequence, we propose that an appropriate strategy to rank two structures requires to run a short MD simulation starting from each structure, in order to produce ensembles of thermodynamically equivalent conformations. We have shown that the probability distribution of the BACH energy estimated on the two ensembles allows discriminating correctly the two models. This procedure can be applied successfully not only for models that are very dissimilar from the native state but also for those which are close to it. This suggests that BACH might be a valuable tool for the refinement of protein structures roughly determined by experimental or numerical methods.

Methods

BACH parameters

The BACH energy function is based on two terms

Both are statistical potentials derived from a dataset of native PDB structures. They take respectively into account effective pairwise residue-residue interactions and single residue solvation properties. p is a parameter that fixes the relative units of the two energy terms.

The pairwise statistical potential EPAIR is based on classifying all residue pairs within a protein structure in five different structural classes, labeled by x: two residues may form a α-helical bridge (x = 1), or an anti-parallel β-bridge (x = 2), or a parallel β-bridge (x = 3), or may be in contact with each other through side chain atoms (x = 4), or may not realize any of the previous four conditions (x = 5). Pairs of consecutive residues along the protein chain are included in the above classification. α- and β-bridges are detected by using a modified version of the DSSP algorithm42 that employs a more stringent energy threshold (-1 Kcal/mol in place of the original −0.5 Kcal/mol, as already done in43) to assess hydrogen bond formation. A residue pair is assigned to the side chain - side chain contact class (x = 4) if it is not assigned to any of the previous classes (x < 4) and if any inter-residue pair of side chain heavy atoms is found at a distance lower than 4.5Å. If none of the above conditions are verified, the residue pair is assigned to the no-interaction class (x = 5). The pairwise statistical potential EPAIR requires five distinct symmetric matrices , where a and b vary among the 20 aminoacid types, for overall 1050 parameters:

ai is the aminoacid type of residue at position i and xij is the structural class to which residue pair i - j is assigned.

The five interaction matrices are determined from a dataset of native protein structures employing the ensemble of all residue pairs from the dataset as the reference state6:

where is the total number of residue pairs of type a and b found in the structural class x within the dataset. In order to treat coherently as a unique physical system the full sequence which determines a given protein structure, residues that are not seen in the PDB structure, even if they are part of the sequence expressed by the crystallographer, are considered as disordered ones and included in the countings, so that any pair involving a disordered residue is classified in the no-interaction class (x = 5).

The solvation statistical potential ESOLV is based on classifying all residues in two different environmental classes, either buried (b) or solvent exposed (s). The environmental class is defined based on the evaluation of the solvent accessible surface area (SASA) performed by the SURF tool of VMD graphic software44,45. The SASA is computed by SURF for all heavy atoms of the protein chain by rolling a probe sphere (representing a water molecule) on the surface of the set of spheres centered at heavy atom coordinates. We input to SURF the same value (1.8Å) for the radii of all atom types and the radius of the probe sphere. The latter is higher than what is employed in VMD (1.4Å) because we want to avoid considering internal cavities as areas exposed to the solvent. The output of SURF is the number of triangle vertices associated to each atom of the protein. These vertices are used in the triangulated representation of the protein surface employed by VMD and the area associated with each vertex is ~0.15Å2. By summing over all atoms of a given residue, we obtain the number of vertices t associated to that residue, which is proportional to its SASA. The distribution of the values of t observed in the dataset for alanine, valine and arginine residues is shown as supplementary information in Fig. S2; the observed behaviour is typical of all residue types. The presence of a sharp peak at t = 0, well separated by a broader peak at larger values of t, allows a clearcut definition of residue environment as either buried (tt*) or solvent exposed (t > t*), using the same threshold t* = 10 for all residues. The single residue statistical potential ESOL requires two separate parameter sets , for overall 40 parameters:

ei = b or s is the environmental class of residue at position i. The two parameter sets are determined employing the ensemble of all residues from the dataset as the reference state6:

where is the total number of residues of type a found in the environment class e within the dataset. Residues that are not seen in the PDB structure, even if they are part of the sequence expressed by the crystallographer, are included in the counting as solvent exposed (e = s).

An alternative implementation of BACH was derived using a reduced aminoacid alphabet consisting of 9 classes37: small hydrophobic (ALA,VAL,ILE,LEU,MET), large hydrophobic (TYR,TRP,PHE), small polar (SER,THR), large polar (ASN,GLN,HIS), positively charged (ARG,LYS), negatively charged (ASP,GLU) and finally GLY, PRO, CYS separately on their own. In this way we obtain just 225 energy parameters for the pairwise term and 18 energy parameters for the solvation term.

The parameter p entering in Eq. 1 and fixing the relative weight of the two contributions EPAIR and ESOLV is chosen in such a way that the energy per residue of the two terms has approximately the same standard deviation over the dataset. This criterion gives p = 0.6. Therefore, at variance with other composite methods10,35 no optimization criteria is introduced in BACH to fix the relative weigths of different physical terms. We have also checked that the value of p does not affect significantly the performance of BACH in native state discrimination.

PDB dataset

The BACH parameters have been derived using the TOP500 database46. This set includes 500 non redundant protein domain conformations, extracted from both monomeric and multimeric PDB protein structures, between 30–840 amino acids long, which have been solved with resolution better than 1.8 by X-ray crystallography (no NMR). The structures in the set include disordered and not resolved regions. We count the contributions of these amino acids as no-interaction ones. We have checked that the choice of the fold library does not significantly affect parameters. In figures S3 and S4, we present the correlation between the BACH parameters calculated using either TOP500 or an ensemble of 8000 structures found in the CATH database47. Correlation is excellent especially for parameters corresponding to favourable interactions, that are by definition highly represented in the datasets. We have checked that the BACH energy calculated for a decoy set using the parameters obtained with the two different databases, have an excellent correlation as well (see figure S5). We have also derived a different alternative set of the 1090 energy parameters from a subset of TOP500 consisting of only 50 randomly chosen structures (these structures are listed in Table S2).

Scoring function evaluation

The scoring functions used in this work were evaluated on native and decoy structures as follows: online submission at the server http://swissmodel.expasy.org/qmean/cgi/index.cgi for the composite function QMEAN6; a linux executable downloaded from http://www.fiserlab.org/potentials was used for the RF_CB_SRS_OD potential and the 3.2 ROSETTA scoring function was downloaded from http://boinc.bakerlab.org/.

CASP decoy sets

The performance of BACH has been assessed on selected decoy sets from CASP 8-9. A decoy set was used if the sequence of the experimentally solved structure is the same as the sequence given initially in the CASP competition, so that the majority of decoy structures have the same sequence as the native structure. The list of the decoy sets used is presented in Table S1 (33 decoys sets). The structures in each decoy set were used if they had the same length and sequence as the native structure and had all the side-chain and backbone atoms. If for any reason a structure could not be analyzed by one of the scoring functions with whom we compare the performance of BACH, the structure was taken out of the set. The native conformations given by NMR were cut to have the same number of residues as the decoys of the set and since their folded structure is given in the form of 20 different models, we defined as native, the pdb model that had the lowest energy (for each potential). So, for two scoring functions the native structure for a decoy set could be a different model out of the 20 presented in the NMR structures.

Decoy set generated by Molecular Dynamics

We also generated decoy sets, using molecular dynamics simulations, either in explicit solvent (MD) or in combination with bias-exchange metadynamics48 (BE) to produce realistic structures, for three proteins: GB3 (pdb code 2OED), protein T0488 from CASP8 (pdb code 2VWR) and protein T0472 from CASP8 (pdbcode 2K49). For protein GB3 we ran three different simulations: i) MD starting from the native state at T = 330 K for 10 ns, ii) MD starting from two selected structures (blue circle in Fig. 1-D) at 330 K for 1ns, iii) BE of 20 ns at 400 K. For protein 2VWR we ran MD of 1ns at T = 300 K starting from i) the native conformation and two decoys: ii) T0488TS256_1 and iii) T0488TS299_3. For protein 2K49 we ran MD of 1ns at T = 300K starting from NMR model #1. MD simulations were performed using the GROMACS 4.5.3 package49, employing the AMBER99-ILDN50 force field, using TIP3P water model51 for the explicit solvent. BE together with the OBC52 implicit solvent model and T = 400K were used to enhance the conformational searching53. See Supplementary Text for details on MD and BE simulations.

Similarity measure between structures: GDT

As a similarity measure between structures, we use the Global Distance Test (GDT)39, defined as:

where FX denotes the percentage of residues under a distance cutoff of XÅ after the two molecules have been optimally superimposed. To detect the optimal superposition, we used the MaxSub algorithm54, in the implementation given within the MaxCluster tool (www.sbg.bio.ic.ac.uk/maxcluster/index.html#MaxSub).