Sequence and structural patterns detected in entangled proteins reveal the importance of co-translational folding

Proteins must fold quickly to acquire their biologically functional three-dimensional native structures. Hence, these are mainly stabilized by local contacts, while intricate topologies such as knots are rare. Here, we reveal the existence of specific patterns adopted by protein sequences and structures to deal with backbone self-entanglement. A large scale analysis of the Protein Data Bank shows that loops significantly intertwined with another chain portion are typically closed by weakly bound amino acids. Why is this energetic frustration maintained? A possible picture is that entangled loops are formed only toward the end of the folding process to avoid kinetic traps. Consistently, these loops are more frequently found to be wrapped around a portion of the chain on their N-terminal side, the one translated earlier at the ribosome. Finally, these motifs are less abundant in natural native states than in simulated protein-like structures, yet they appear in 32% of proteins, which in some cases display an amazingly complex intertwining.

Theoretical and experimental efforts of the last two decades have established that kinetic and thermodynamic properties of the protein folding process can be inferred from some spatial features of the native structure itself [1][2][3]. For instance, the contact map of the native state [4][5][6], the matrix indicating which pairs of residues are close in space, determines the folding nucleus, i.e. the group of residues whose interaction network is essential for driving the folding. Similarly, the loops formed between residues in contact have an average chemical length, the contact order, which is strongly correlated with the folding time [7][8][9]. However, some proteins, being extremely self-entangled in space, are characterized by a folding process that cannot be simply rationalized by local (contact) properties. Examples are proteins hosting knots [10][11][12][13][14][15][16][17], slipknots [18,19], lassos [20,21] and links [22,23]. These complex motifs were found in about 6% of the structures deposited in the Protein Data Bank (PDB) and, although their presence is expected to severely restrict the available folding pathways [17,22], it is not clear how proteins avoid the ensuing kinetic traps and fold into the topologically correct state.
A crucial question is whether and how these topologically entangled motifs affect the protein energy landscape. According to the well-established paradigm of minimal frustration [24,25], energetic interactions in proteins are optimized in order to avoid as much as possible the presence of unfavorable interactions in the native state. Although non-optimized interactions may result in kinetic traps along the folding pathway, some amount of residual frustration has been detected and related to functionality and allosteric transitions [26].
A further issue is whether the effect of topology-induced traps depends on the folding direction along the chain: if proteins fold cotranslationally while they are being produced at the ribosome, one then expects sequential folding pathways proceeding from the N-terminus to be less hindered by such traps.
To understand the relevance of topological motifs in proteins, here we quantify the amount of self-entanglement in native structures by computing the Gaussian entanglement (GE) [9,23], a generalization of the Gauss integrals used to compute the linking number [27] (see Materials and Methods for details). Indeed, if the integrals were computed for two closed curves, e.g. loops in proteins closed by disulphide bridges or by any other form of covalent bond [20][21][22] (Fig. 1b), the result would be the (integer) linking number [27]. When the method is applied to open chains [9,23,28-30], the GE provides a real number that quantifies the mutual winding of any pair of subchains along the structure [9] or between two proteins in a dimer [23].
Our analysis, when applied to a large-scale database of protein domains (see Materials and Methods), identifies entangled motifs that are more elusive than knots (see Fig. 1a). For instance, portions of proteins characterized by high values of GE correspond to links between non-covalent loops (Fig. 1c) as well as to interlacings between a loop and another part of the polypeptide chain (Fig. 1d). A preliminary search for some of these motifs was carried out in the 1980s but, due to the shortage of protein structures available in the PDB and the specificity of the chosen forms of entanglement (threading, pokes or co-pokes [31]), the conclusion was that these forms of entanglement were rare [32]. This finding was used in practice to discriminate between natural proteins and artificial decoys [33,34]. By performing a detailed analysis of protein structures with the GE tool, we discover that mutually entangled motifs such as those sketched in Fig. 1c and Fig. 1d are, at first glance, not uncommon, given that about one third of the 16968 analyzed proteins include at least one entangled loop. Nonetheless, we find that naturally existing folds are much less topologically intertwined than same-length protein-like structures generated by all-atom molecular dynamics [35].
More importantly, by focusing on the pairs of amino acids forming contacts at the end of entangled loops, we discover that they are enriched in hydrophilic classes with respect to the mainly hydrophobic generic contacts. Therefore, the corresponding interaction strengths are on average significantly weaker. The presence of non-optimized interactions and the consequent energetic frustration can be interpreted as the result of natural selection toward sequences that keep the intertwined structures more flexible.
Another footprint of evolutionary mechanisms is the observation that entangled loops are found more frequently toward the C-terminus (Fig. 1e) than toward the N-terminus (Fig. 1f). Indeed, in the case of cotranslational folding, it is reasonable to assume that it is easier to first fold the open blue threaded part (Fig. 1e) and then bundle the red loop around it, than to first fold the loop and then thread the open portion through it (Fig. 1f), as already pointed out [31].

RESULTS AND ANALYSIS
Proteins with entangled loops are not rare
We use the contact GE parameter G'_c to find protein domains with at least one loop γ_i intertwined with a "thread" γ_j, which is another portion of the protein (Fig. 1d-f). More precisely, we associate G'_c(i, j) to a given loop-thread pair by using the Gauss double integral described in Materials and Methods. By maximizing |G'_c(i, j)| over all possible threads γ_j we assign an entanglement score G'_c(i) to the loop and, by further maximizing |G'_c(i)| over all loops γ_i, we find the entanglement G'_c of the protein. At variance with similar quantities defined for closed curves, G'_c is a real number. Yet, we define a loop γ_i in a configuration as in Fig. 1d to be entangled if |G'_c(i)| ≥ 1. Such a threshold is natural because a linking number |L| = 1 is the minimum value that guarantees that two closed curves are linked [27]. In a data set of 16968 protein domains, 5375 (31.7%) host at least one entangled loop. We also monitor the value L of the linking entanglement (LE) for a single protein, defined as G'_c for two subchains that are both loops, as in Fig. 1c. In Fig. 2 we show five examples of "entangled" protein domains, along with their respective values of G'_c and L.
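The two-level maximization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`extreme`, `entanglement_scores`) and the toy loop-thread values are hypothetical, and the pair values G'_c(i, j) are taken as already computed.

```python
def extreme(values):
    """Return the value with the largest modulus, keeping its sign."""
    return max(values, key=abs)


def entanglement_scores(pair_values, threshold=1.0):
    """Reduce loop-thread values to loop scores and a protein score.

    pair_values maps (loop, thread) -> G'_c(i, j), where loop and thread
    are (start, end) residue-index tuples. Hypothetical data layout.
    """
    by_loop = {}
    for (loop, _thread), g in pair_values.items():
        by_loop.setdefault(loop, []).append(g)
    # G'_c(i): extreme over all threads of a given loop
    loop_scores = {loop: extreme(gs) for loop, gs in by_loop.items()}
    # G'_c: extreme over all loops of the protein
    protein_score = extreme(loop_scores.values())
    # a loop is called entangled if |G'_c(i)| >= 1
    entangled = sorted(l for l, g in loop_scores.items() if abs(g) >= threshold)
    return loop_scores, protein_score, entangled


# Toy values, invented for illustration only
pairs = {
    ((10, 40), (45, 70)): 0.3,
    ((10, 40), (50, 80)): -1.2,
    ((20, 35), (1, 15)): 0.8,
    ((20, 35), (40, 60)): -0.1,
}
loop_scores, protein_score, entangled = entanglement_scores(pairs)
```

Keeping the sign of the extreme value, rather than its modulus, preserves the chirality information used later when positive and negative scores are compared.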
The non-trivial entanglement features of protein structures, when analyzed with GE and LE, are apparent in Fig. 3a, where protein domains are represented in the L vs G'_c space. All the points lie in the region |L| ≤ |G'_c| because the latter quantity is defined as an extremum over a wider subset.
A typical example with G'_c ≃ L ≃ 1 is shown in Fig. 2a. As expected, however, there are cases of proteins with at least one entangled loop (|G'_c| ≥ 1) and all pairs of loops with negligible |L|. These proteins correspond to the conformation sketched in Fig. 1d and to the natural protein represented in Fig. 2b. In other cases, the difference between |G'_c| and |L| is large, even in the presence of linked loops. This is due to the behavior of the protein portion which, after threading the first loop, forms a second loop linked with it, and then continues to wind around it without further looping, see Fig. 2c.
It is interesting to observe that in several other cases the GE has a different sign with respect to the LE. This may take place if the chain winds around itself with opposite chiralities in different portions of the same protein.
An example is shown in Fig. 2d-e. One of the most entangled structures found in the database, with G'_c ≃ L ≃ −3, is shown in Fig. 2f. Fig. 3a shows that the GE is distributed over a broad spectrum of values and that the threshold |G'_c| ≥ 1 for entangled loops is conservative enough. Clusters emerge in the density plot of Fig. 3b, where the majority of low-LE points are removed by excluding data with |L| < 1/2 (see also Fig. S1 [36], which is an enlargement of Fig. 3a). The clusters are found around L ≃ ±1 vs G'_c ≃ ±1; in particular, the most populated region has G'_c ≃ L ≃ 1. For further analysis, we consider only the GE indicator, which captures more varieties of entangled motifs than the LE (e.g. winding without linking, see Fig. 2b).
We consider separately the following two cases: when the threading arm is between the N-terminus and the loop (N-terminal thread, see Fig. 1e) or when it is between the loop and the C-terminus (C-terminal thread, Fig. 1f). It is possible to classify a generic (non-entangled) loop as belonging to the N- or C-terminal thread class, even if there is no real threading, because G'_c(i) is well defined for every loop γ_i. In the ensemble of all loops, the fractions of N- and C-terminal threads are perfectly balanced, as expected. However, if we restrict the analysis only to entangled loops (3.75% of the total), the fraction of N-terminal threads becomes 0.60, highlighting an asymmetry in favor of the N-terminal threads (Fig. 1e) against the C-terminal ones (Fig. 1f). A somewhat similar result was found by studying topological barriers in protein folding [37]. The formation of an entangled structure is not simple, as it requires a non-local concerted organization of the amino acids in space, where a crucial role is played by the order of formation of different native structural elements along the folding pathway [18]. A misplaced nucleation event in the early stages of the folding pathway might prevent the protein from folding correctly.
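As an illustration of the N-terminal versus C-terminal thread classification, the following Python sketch labels each loop by the position of its selected thread and computes the N-terminal fraction, optionally restricted to entangled loops. The record layout, the function names and the toy numbers are assumptions made for this example; the sequence separation is taken here as the index distance between the loop end and the nearest thread end.

```python
def thread_side(loop, thread):
    """Return 'N' if the thread precedes the loop, 'C' if it follows it."""
    (i1, i2), (j1, j2) = loop, thread
    if j2 < i1:
        return "N"
    if j1 > i2:
        return "C"
    raise ValueError("thread overlaps the loop")


def separation(loop, thread):
    """Sequence separation s between the loop and its thread (assumed definition)."""
    (i1, i2), (j1, j2) = loop, thread
    return i1 - j2 if thread_side(loop, thread) == "N" else j1 - i2


def n_terminal_fraction(records, threshold=None):
    """Fraction of N-terminal threads among loops with |G'_c(i)| >= threshold.

    records: iterable of (loop, thread, g), where thread is the one selected
    by the maximization and g is the loop score G'_c(i).
    """
    sides = [thread_side(loop, thread) for loop, thread, g in records
             if threshold is None or abs(g) >= threshold]
    return sides.count("N") / len(sides)


# Toy records, invented for illustration only
records = [
    ((30, 60), (5, 25), 1.3),
    ((30, 60), (65, 90), -1.1),
    ((10, 25), (30, 50), 0.2),
    ((40, 55), (1, 30), 1.6),
    ((50, 70), (10, 40), 0.1),
    ((12, 30), (35, 60), -0.4),
    ((45, 80), (5, 40), 1.05),
]
fraction_all = n_terminal_fraction(records)
fraction_entangled = n_terminal_fraction(records, threshold=1.0)
```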
For spontaneous "in vitro" refolding, there is no reason to expect the folding order of different elements to be related to a preferential direction along the chain.
Nevertheless, an asymmetry can be envisaged if a protein folds cotranslationally, according to the following argument. For the C-terminal thread, the loop might be formed in the early folding stages, making it difficult for the rest of the protein to entangle with it and thus to reach the native conformation. Conversely, for the N-terminal thread, the loop could wrap more easily around the open threading arm, already folded in its native conformation, after ejection from the ribosome. If confirmed, this picture would explain the asymmetry we observe between N- and C-terminal threads. The latter can in any case be interpreted as a footprint of an evolutionary process, intimately related to entanglement regulation driven by cotranslational folding.
Such a conclusion is corroborated by looking, separately for C- and N-terminal threads, at the normalized distributions of loop-thread sequence separations s, plotted in Fig. 4. The distributions for all loops (full symbols) are very similar for N-terminal (blue circles) and C-terminal (red squares) threads, showing again that in the absence of entanglement no asymmetry is present. For the vast majority of loops, the |G'_c(i, j)| maximization leading to G'_c(i) selects arms which start just after (or before) the loop, at a distance of one or a few amino acids. This is similar to what was already observed for pokes [33], yet here it is found also for typically non-entangled loops with |G'_c(i)| ≪ 1. This reflects the fact that a rapid turning of the protein chain is the simplest way of maximizing the mutual winding between two subchains. The distributions for entangled loops (|G'_c(i)| ≥ 1, empty symbols) are also strongly peaked at unitary loop-thread separation, although to a lesser extent than for all loops. This shows that larger separations (10 ≲ s ≲ 20) are promoted to achieve proper entanglement (|G'_c(i)| ≥ 1). However, such large separations are much more frequent in the N-terminal case than in the C-terminal one (notice the logarithmic scale), showing again an asymmetry between the two cases. Consistently with cotranslational folding, N-terminal threads could allow for more complex topological structures with, on average, larger separations when compared to C-terminal threads. Accordingly, the distribution of G'_c(i) values for both the N- and C-terminal threads, shown in Fig. 5, highlights that values around G'_c(i) ≈ 1 are more probable in the former case. Strikingly, this happens only for positive G'_c(i) values, whereas for negative ones there is no significant difference between N- and C-terminal threads.
As a matter of fact, we find C-thread entangled loops to be perfectly balanced between positive and negative chiralities, whereas N-thread entangled loops are highly biased toward positive chirality.
In the ensemble of the PDB structures there are 3617208 loops, of which 135530 (3.75%) are entangled. To assess whether this fraction is small or large we compare it with an analogous quantity computed in an unbiased reference state formed by a set of putative alternative compact conformations (i.e. rich in secondary structures) that a protein could in principle adopt. This ensemble is found in a poly-valine "VAL60" database [35], obtained with an accurate all-atom simulation of the configurational space of a homopolypeptide formed by 60 valine amino acids (see Materials and Methods for details).
For a proper comparison with VAL60, we restrict our CATH database only to the proteins of comparable length, retaining the 772 proteins with length n from n = 55 to n = 64 amino acids. In this reduced "CATH60" ensemble of natural proteins there are 47954 loops, of which 138 (0.3%) are entangled. There are 19 proteins (2.46%) hosting at least one entangled loop. These values are of course lower than those for the full CATH ensemble, in which longer proteins can host more entanglement. In VAL60 there are 2284693 loops, of which 57577 are entangled (2.52%), a fraction almost nine times larger than for natural proteins of CATH60. Similarly, 3560 out of the 30064 VAL60 structures host at least one entangled loop (11.8%), a fraction about five times larger than for natural proteins.
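The percentages and ratios quoted for CATH60 and VAL60 can be rechecked directly from the raw counts given in the text. The short Python check below does only this arithmetic; the dictionary keys and variable names are illustrative.

```python
# Raw counts quoted in the text: (entangled, total)
counts = {
    "CATH60 loops": (138, 47954),
    "CATH60 proteins": (19, 772),
    "VAL60 loops": (57577, 2284693),
    "VAL60 structures": (3560, 30064),
}

# Percentage of entangled items in each ensemble
percent = {name: 100.0 * hit / total for name, (hit, total) in counts.items()}

# How much more frequent entanglement is in VAL60 than in CATH60
loop_ratio = percent["VAL60 loops"] / percent["CATH60 loops"]
structure_ratio = percent["VAL60 structures"] / percent["CATH60 proteins"]

for name, value in percent.items():
    print(f"{name}: {value:.2f}%")
print(f"loop ratio: {loop_ratio:.1f}, structure ratio: {structure_ratio:.1f}")
```

With these counts the loop-level ratio comes out a little under nine and the structure-level ratio a little under five.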
However, it is known that, presumably for kinetic reasons [35], VAL60 is characterized by loops that are on average longer than those of natural proteins. Consequently, to avoid any possible bias in the comparison, we divide loops into classes of homogeneous length m. For some classes, the normalized histograms of the GE for the CATH60 and VAL60 datasets are plotted in Fig. 6a-c. It is apparent that the range of G'_c(i) is wider for the VAL60 homopolypeptides than for the natural proteins. The deep difference between the two distributions can be appreciated in Fig. 6d, where the root mean square G'_c(i) is plotted as a function of the loop length: the values for VAL60 are always significantly higher than those for natural proteins. Note that the root mean square G'_c(i) increases with m only up to half of the protein length. From there on, the remaining subchain starts getting too short to entangle.
In conclusion, we have clear statistical evidence that entangled loops occur less frequently in natural proteins than in random compact protein-like structures.
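Grouping loops by length and computing the root mean square of G'_c(i) in each class, as plotted in Fig. 6d, amounts to a simple aggregation. A minimal Python sketch follows, with a hypothetical input layout of (m, G'_c(i)) pairs and invented toy values.

```python
import math
from collections import defaultdict


def rms_by_length(loops):
    """Map each loop length m to sqrt(mean(G'_c(i)^2)) over loops of that length.

    loops: iterable of (m, g) pairs, with m the loop length and g = G'_c(i).
    """
    acc = defaultdict(lambda: [0.0, 0])  # m -> [sum of squares, count]
    for m, g in loops:
        acc[m][0] += g * g
        acc[m][1] += 1
    return {m: math.sqrt(total / n) for m, (total, n) in sorted(acc.items())}


# Toy values, invented for illustration only
rms = rms_by_length([(10, 0.3), (10, -0.4), (20, 0.6), (20, -0.8)])
```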

Amino acids at the ends of entangled loops are frustrated
In the preceding sections we provided two independent pieces of evidence that, although entangled loops are not rare in natural protein structures, their occurrence and position along the backbone chain are kept under evolutionary control. A possible reason is the need to limit potential kinetic traps in the folding process brought about by entangled loops, for example by deferring their formation to the later stages of the folding pathway. Thus, we expect to detect a related evolutionary fingerprint in the specific amino acids found in contact with each other at the end of entangled loops ("entangled contacts"). We check whether such amino acids share the same statistical properties as the amino acids forming any other contact ("normal contacts").
The frequency with which two amino acids are in contact is typically employed to estimate knowledge-based potentials [38,39]. In a nutshell, if two amino acids a and b are found in contact more frequently than on average, they are expected to manifest a mutual attraction and are therefore characterized by a negative effective interaction energy E_norm(a, b) (see Materials and Methods).
If effective interaction energies are computed by restricting the analysis only to the entangled contacts, a new set of entangled contact potentials E_GE(a, b) can be derived. The discrepancies between such potentials and the normal ones can be conveniently captured by an enrichment score ΔE_enr(a, b). A negative enrichment score ΔE_enr(a, b) < 0 implies that (a, b) are more frequently in contact when they are at the ends of entangled loops, and vice versa for positive scores. Fig. 7 shows that ΔE_enr(a, b) anticorrelates with E_norm(a, b). This correlation is statistically significant. The Pearson correlation coefficient is r = −0.27, with a P-value of 7 × 10^−5. The Spearman rank correlation is ρ = −0.22 with a P-value of 1.1 × 10^−5.
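For readers who wish to reproduce this kind of analysis, the Pearson and Spearman coefficients can be computed with the standard library alone. The sketch below is not the authors' code: the function names are illustrative and the five toy pairs are invented, whereas the real analysis would run over the 210 unordered pairs of the 20 amino acid types. P-values are not computed here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy scores, invented for illustration only
e_norm = [-0.8, -0.3, 0.1, 0.4, 0.9]
de_enr = [0.5, 0.1, 0.2, -0.1, -0.6]
r = pearson(e_norm, de_enr)
rho = spearman(e_norm, de_enr)
```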
The anticorrelation of Fig. 7 has an important consequence: pairs of amino acids that in a globular protein interact strongly (E_norm(a, b) < 0, mainly hydrophobic amino acids) are present less often (ΔE_enr(a, b) > 0) in entangled contacts, while amino acids that typically interact weakly (E_norm(a, b) > 0, mainly polar and hydrophilic amino acids) are instead more abundant (ΔE_enr(a, b) < 0) at the ends of entangled loops. We checked that this result is not trivially due to entangled contacts being preferentially located on the protein surface, finding that residues involved in entangled contacts are even slightly more buried in the protein interior than those involved in normal contacts (see Fig. S2 [36]). The deep difference between the two sets of scores E_norm(a, b) and E_GE(a, b) emerges clearly from the graphical representations in Fig. 8 of E_norm(a, b) and ΔE_enr(a, b), in which positive and negative values are marked red and blue, respectively. The blue spots in Fig. 8a represent interactions between amino acids which share hydrophobic properties (mainly hydrophobic pairs), whereas the red area is populated by amino acids that are rarely in contact (mainly polar pairs).
In Fig. 8b, the blue spots highlight amino acids that have decreased their energy score and which are therefore more present at the ends of the entangled loops than in normal contacts. These include mainly polar amino acids. Note that proline is particularly enriched at the end of entangled loops. The red spots in Fig. 8b indicate amino acids which are less present at the ends of the entangled loops than in normal contacts. These include mainly hydrophobic ones. The case of cysteine self-interaction is instructive: the strongest attractive interaction between amino acids turns out to be the most diminished one at the end of entangled loops (see also Fig. 7), consistently with the very low number of linked loops closed by disulphide bonds (Fig. 1c) that was found in the PDB [40].
Interestingly, the four aromatic amino acids (HIS, PHE, TRP, TYR) violate the general trend. Interactions between aromatic pairs are found in the bottom-left quadrant of Fig. 7. Despite being very frequent in normal contacts (all their mutual entries are dark blue in Fig. 8a), they become even more abundant when at the ends of entangled loops (still blue in Fig. 8b), highlighting a special role likely played by aromatic rings in such complex structures.
Fig. 7 and Fig. 8b provide clear evidence for the existence of an evolutionary pressure shaping the amino acid sequences. This natural bias energetically weakens the contacts which close entangled loops, consistently with the argument that too early a stable formation of the loop could prevent the correct folding of the full protein.
These results are very robust to changes in the G ′

DISCUSSION AND CONCLUSIONS
With the notion of Gaussian entanglement we extend the measure of mutual entanglement between two loops to any pair of open subchains of a protein structure. This allows us to perform an unprecedented large-scale investigation of the self-entanglement properties of protein native structures, through which we identify and locate a large variety of entangled motifs (Fig. 2), by focusing on the notion of "entangled loop", a loop intertwining with another subchain (Fig. 1d). Different entangled motifs can coexist in the same protein domain, even with opposite chiralities, and a few domains exhibit a pair of loops intertwining even thrice around each other (see the examples in Fig. 2c and Fig. 2f, and points in Fig. 3). Gaussian entanglement could be used to improve the classification of existing protein folds [41], as previously done with Gauss integrals computed over the whole protein chain [42].
Our analysis shows unequivocally (Fig. 6) that, although entangled motifs are present in a remarkably high fraction, 32%, of protein domains, these host fewer entangled loops than protein-like decoys produced with molecular dynamics simulations [35]. The question is then why natural folds avoid overly entangled conformations with otherwise plausible secondary structure elements. Are entangled loops obstacles for the folding process? If yes, how does Nature cope with them when they are present?
To answer these questions, we recall that an efficient folding of proteins is fundamental for sustaining the biological machinery of cell functioning. The rate and the energetics of the protein folding process, which are defined by its energy landscape, are encoded in the amino acid sequence. Over the course of evolution, this landscape was shaped to allow and stabilize protein folding, avoiding possible slowdowns.
We find indeed two clear hallmarks suggesting that the entangled loops in proteins are kept under control at the evolutionary level: (i) an asymmetry in their positioning with respect to the other intertwining chain portion and to the C- and N-termini, which is consistent with cotranslational folding promoting the presence of entangled loops with positive chiralities toward the C-terminus (see Fig. 4 and Fig. 5); (ii) weak, non-optimized interactions between the amino acids in contact at the end of entangled loops, an example of energetic frustration (see Fig. 7 and Fig. 8). Both these findings suggest that the late formation of entangled loops along the folding pathway could be a plausible control mechanism to avoid kinetic traps.
Interestingly, interactions between aromatic amino acid pairs are promoted at the end of entangled loops (see Fig. 8b), suggesting that their presence could be related to the protein biological function. Whether entangled loops may have specific biological functions is an intriguing open question, as in the case of knots in protein domains [17,43]. The observation that a bias favoring entangled loops with positive chiralities is intimately related to their position asymmetry, and thus to cotranslational folding, suggests that loop winding at the ribosome may have a preferred orientation. As a matter of fact, the ribosome can discriminate the chirality of amino acids during protein synthesis [44].
Stemming from works on glassy transitions [45,46], the concept of minimal frustration between the conflicting forces driving the folding process is a well-established paradigm [24][25][26] in protein physics. It has been further argued [26] that frustration is an essential feature for the folding dynamics and that it can give surprising insights into how proteins fold or misfold.
Is it possible to reconcile the frustration detected at the ends of entangled loops with the minimal frustration principle? Let us assume that a non-optimal ordering of the events along the folding pathways (for example, the formation of a loop which has then to be threaded by another portion of the protein to form an entangled structure) is highly deleterious. In order to prevent this, it could indeed be preferable to select suitable sub-optimal interactions. In fact, this would be a remarkable example of minimal frustration in action, having to compromise between topological and energetic frustration.
Further data, from both simulations and experiments, will be needed to confirm this proposed mechanism for the folding process. In either case, a simple protocol could consist in mutating into cysteines both residues at the ends of an entangled loop, provided no other cysteines are present in the sequence, and in assessing whether the folding is then hindered by the formation of a disulfide bridge in oxidizing conditions. In the context of knotted proteins, single-molecule force spectroscopy techniques were shown to be particularly useful in controlling the topology of the unfolded state [47]. Similarly, both "in vivo" folding experiments [48] and appropriate simulation protocols [49][50][51] could be employed to test the possible role of cotranslational folding in shaping the evolutionary control over entangled motifs: double cysteine mutants would then be predicted to be more deleterious for the folding of C-terminal threads than for N-terminal threads. In all cases, it is essential to gather statistics over several different proteins before validating or rejecting our hypothesis. The signals that we reveal in this contribution are statistical in nature; therefore we do not expect all entangled loops to form late in the folding process, nor all C-terminal threads to be cotranslationally disfavored.

MATERIALS AND METHODS
CATH database
We use the v4.1 release of the CATH database for protein domains, with a non-redundancy filter of 35% homology [52]. To avoid introducing entanglement artificially for proteins with big gaps in their experimental native structures, we do not consider any protein in the CATH database that presents a distance > 10 Å between consecutive Cα atoms in the coordinate file. We find that this selection keeps N_prot = 16968 out of the available 21155 proteins.
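The gap filter described here is easy to express in code. The following Python sketch assumes that the Cα coordinates of each domain are already available as a list of (x, y, z) tuples in Å; the function names and the toy coordinates are illustrative and not taken from the original pipeline.

```python
import math


def has_chain_gap(ca_coords, max_dist=10.0):
    """True if any two consecutive Cα atoms are farther apart than max_dist (Å)."""
    return any(math.dist(a, b) > max_dist
               for a, b in zip(ca_coords, ca_coords[1:]))


def select_domains(domains, max_dist=10.0):
    """Keep only the domains whose Cα trace has no gap.

    domains: dict mapping a domain name to its list of Cα coordinates.
    """
    return {name: ca for name, ca in domains.items()
            if not has_chain_gap(ca, max_dist)}


# Toy coordinates, invented for illustration only
toy = {
    "intact": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    "gapped": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (15.8, 0.0, 0.0)],
}
kept = select_domains(toy)
```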

Poly-valine database
The VAL60 database is an ensemble of 30064 structures obtained by an exhaustive exploration of the conformational space of a 60 amino acid poly-valine chain described with an accurate all-atom interaction potential [35]. The exploration was performed with molecular dynamics simulations using the AMBER03 force field [53] and the molecular dynamics package GROMACS [54], and by exploiting a bias exchange metadynamics approach [55] with 6 replicas. The simulation was performed in vacuum at a temperature of 400 K. The conformations have been selected as local minima of the potential energy with a secondary structure content of at least 30% and a small gyration radius. It has been observed that the VAL60 database contains almost all the naturally existing folds of similar length [35]. However, these known folds form a rather small subset of the full ensemble, which can be thought of as an accurate representation of the universe of all possible conformations physically attainable by polypeptide chains of length around 60.

Mathematical definition of the linking number and its computational implementation
The linking number between two closed oriented curves γ_i = {r^(i)} and γ_j = {r^(j)} in R^3 may be computed with the Gauss double integral

$$ G \equiv \frac{1}{4\pi} \oint_{\gamma_i} \oint_{\gamma_j} \frac{\mathbf{r}^{(i)} - \mathbf{r}^{(j)}}{\left| \mathbf{r}^{(i)} - \mathbf{r}^{(j)} \right|^{3}} \cdot \left( d\mathbf{r}^{(i)} \times d\mathbf{r}^{(j)} \right) \tag{1} $$

It is an integer number and a topological invariant [27].
If computed for open curves, it becomes a real number G' (the GE) that quantifies the mutual entanglement between the curves [9,23,28-30]. In proteins, piecewise linear curves join the coordinates of consecutive Cα atoms. In particular, γ_i is an open subchain joining Cα atoms from index i_1 to i_2 and similarly γ_j is another non-overlapping subchain from j_1 to j_2. We specialize to the configurations studied in Ref. [9], in which the amino acids i_1 and i_2 are required to be in contact. In this study, the contact is present if any of the heavy (non-hydrogen) atoms of residue i_1 is near any of the heavy atoms of residue i_2, namely they are at a distance of at most d = 4.5 Å. The "contact" Gaussian entanglement of these configurations (sketched in Fig. 1d-f) is named G'_c(i, j). Since proteins are thick polymers and bonds joining Cα atoms are quite far from each other (compared to their length), we may approximate the integral of Eq. (1) with a discrete sum. Given the coordinates r_i of the Cα atoms, the average bond positions R_i ≡ (r_i + r_{i+1})/2 and the bond vectors ΔR_i = r_{i+1} − r_i enter in the estimate of G'_c(i, j) for γ_i and γ_j,

$$ G'_c(i, j) \equiv \frac{1}{4\pi} \sum_{i=i_1}^{i_2-1} \sum_{j=j_1}^{j_2-1} \frac{\mathbf{R}_i - \mathbf{R}_j}{\left| \mathbf{R}_i - \mathbf{R}_j \right|^{3}} \cdot \left( \Delta\mathbf{R}_i \times \Delta\mathbf{R}_j \right) $$

We then associate a contact entanglement G'_c(i) to a "loop" γ_i as the extreme (i.e. with largest modulus) G'_c(i, j), for all "threads" γ_j with j_2 − j_1 ≥ m_0 (m_0 = 10). Finally, the contact entanglement G'_c of a protein is the extreme of G'_c(i) for all loops of length m = i_2 − i_1 ≥ m_0. The linking entanglement L is equal to G'_c for configurations with two loops as in Fig. 1c. It is not exactly the integer linking number because the two closures between contacts are not performed.
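A discrete Gauss sum over bond midpoints and bond vectors, of the kind described in this section, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: indices are 0-based, the function and helper names are hypothetical, and the test geometry (two unit circles forming a Hopf link, stored in one coordinate list) is chosen only because its linking number is known to be ±1.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def gaussian_entanglement(ca, i1, i2, j1, j2):
    """Discrete Gauss sum between subchains i1..i2 and j1..j2 (0-based indices)."""
    # Bond k joins ca[k] and ca[k+1]: midpoint R_k and bond vector ΔR_k
    mids = [tuple((p + q) / 2 for p, q in zip(a, b)) for a, b in zip(ca, ca[1:])]
    bonds = [_sub(b, a) for a, b in zip(ca, ca[1:])]
    total = 0.0
    for i in range(i1, i2):          # bonds i1 .. i2-1
        for j in range(j1, j2):      # bonds j1 .. j2-1
            d = _sub(mids[i], mids[j])
            total += _dot(d, _cross(bonds[i], bonds[j])) / math.dist(mids[i], mids[j]) ** 3
    return total / (4 * math.pi)


def circle(center, u, v, n):
    """n+1 points of a closed unit circle in the plane spanned by u and v."""
    return [tuple(c + math.cos(2 * math.pi * k / n) * a + math.sin(2 * math.pi * k / n) * b
                  for c, a, b in zip(center, u, v)) for k in range(n + 1)]


n = 100
ring_a = circle((0, 0, 0), (1, 0, 0), (0, 1, 0), n)   # in the xy-plane
ring_b = circle((1, 0, 0), (1, 0, 0), (0, 0, 1), n)   # in the xz-plane, linked with ring_a
ring_c = circle((5, 0, 0), (1, 0, 0), (0, 0, 1), n)   # far away, unlinked
g_linked = gaussian_entanglement(ring_a + ring_b, 0, n, n + 1, 2 * n + 1)
g_unlinked = gaussian_entanglement(ring_a + ring_c, 0, n, n + 1, 2 * n + 1)
```

For the two closed rings the sum is expected to approach an integer (modulus close to 1 when linked, close to 0 when unlinked); for open protein subchains it is a real number. The double loop costs a number of operations proportional to the product of the two subchain lengths, so a vectorized version would be preferable when scanning all loop-thread pairs of a large data set.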

Inference of statistical potentials
In order to estimate effective interactions between amino acids in protein structures, we use an established knowledge-based approach [39]. Pairwise potentials can be obtained by analyzing databases of known protein conformations [56]. These potentials are derived by measuring the probability of an observable, such as the formation of a contact, relative to a reference unbiased state [38]. The conversion of the probability into an energy is done by employing Boltzmann's law [57].
The first step includes characterizing the reference null space of possible pairs of amino acids. All amino acid pairs within each protein sum up to a grand total of N generic pairs (i.e. just combinatorial pairings not necessarily related to a spatial contact) in our ensemble of protein structures. In the same way, given two amino acid types a and b, one sums up the occurrences of a-b pairs within each protein to a grand total of N(a, b) pairs in the ensemble.
To quantify energies of "normal" contacts E_norm(a, b) between amino acids of type a and b, we consider two amino acids to be in contact if any inter-residue pair of their side chain heavy atoms is found at a distance lower than 4.5 Å. By considering only the ensemble of amino acids which are in contact within each protein, their total count results in N_c generic contacts. Similarly, the specific contacts between amino acids of type a and b are summed up to a total N_c(a, b).
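The contact criterion and the counting of contacts per amino acid pair can be sketched as follows. The input layout (a list of residues, each with a type label and its side chain heavy-atom coordinates in Å) and the function names are assumptions made for this example. No minimum sequence separation between residues is imposed here, which is also an assumption.

```python
import math
from collections import Counter


def in_contact(atoms_a, atoms_b, cutoff=4.5):
    """True if any pair of atoms, one from each residue, is closer than cutoff (Å)."""
    return any(math.dist(p, q) < cutoff for p in atoms_a for q in atoms_b)


def count_contacts(residues, cutoff=4.5):
    """Count contacts per unordered pair of amino acid types in one protein.

    residues: list of (type, [side chain heavy-atom coordinates]).
    """
    counts = Counter()
    for x in range(len(residues)):
        for y in range(x + 1, len(residues)):
            (kind_a, atoms_a), (kind_b, atoms_b) = residues[x], residues[y]
            if in_contact(atoms_a, atoms_b, cutoff):
                counts[tuple(sorted((kind_a, kind_b)))] += 1
    return counts


# Toy residues, invented for illustration only
residues = [
    ("LEU", [(0.0, 0.0, 0.0)]),
    ("VAL", [(3.0, 0.0, 0.0)]),
    ("LEU", [(10.0, 0.0, 0.0)]),
    ("GLY", []),  # no side chain heavy atoms, so never in contact under this criterion
]
contacts = count_contacts(residues)
n_c = sum(contacts.values())
```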
The statistical potentials for normal contacts are defined by comparing the frequencies [38,58]
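As an illustration of how such a comparison of frequencies can be turned into an energy, the sketch below assumes the standard Boltzmann-inversion form E(a, b) = −ln[(N_c(a, b)/N_c) / (N(a, b)/N)], in units of k_B T. This specific expression is an assumption made for the example, built only from the counts defined in this section; the function name and the toy counts are hypothetical.

```python
import math


def contact_energy(n_c_ab, n_c, n_ab, n):
    """Assumed Boltzmann inversion, in units of k_B T.

    n_c_ab: contacts between types a and b, N_c(a, b)
    n_c:    all contacts, N_c
    n_ab:   all a-b pairs, N(a, b)
    n:      all pairs, N
    """
    contact_frequency = n_c_ab / n_c
    reference_frequency = n_ab / n
    return -math.log(contact_frequency / reference_frequency)


# Toy counts, invented for illustration only
e_neutral = contact_energy(50, 1000, 5000, 100000)      # contacts as frequent as expected
e_attractive = contact_energy(100, 1000, 5000, 100000)  # contacts twice as frequent as expected
```

With this sign convention a pair found in contact more often than expected from the reference state gets a negative energy, in line with the description of E_norm(a, b) given earlier in the text.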