Introduction

Recently, we have witnessed a remarkable leap in the prediction of three-dimensional protein structure from amino acid sequences by deep neural networks1,2, and these breakthroughs have made significant progress toward solving the structure prediction component of the ‘protein folding problem’3. Nevertheless, state-of-the-art protein structure prediction methods do not provide an understanding of how proteins fold into specific structures, i.e., the folding process component of the ‘protein folding problem’3,4. Therefore, computational prediction of protein folding mechanisms based on protein structures readily available from experiments or machine learning remains a challenge. Many experimental studies have characterized the folding mechanisms of both single-domain proteins, which typically show simple two-state folding behavior5, and multidomain proteins, which have many nonlocal interactions and exhibit more complicated folding behavior6,7,8,9. The experimentally observed folding pathways of small single-domain proteins have been successfully predicted using an Ising-like simple statistical mechanical model, called the Wako–Saitô–Muñoz–Eaton (WSME) model, which is a coarse-grained structure-based model for proteins10,11,12,13,14,15,16,17,18,19,20,21,22,23,24. This model assumes that folding is initiated by local interactions between adjacent residues and propagates to distal regions via the growth and docking of native segments. Remarkably, the WSME model can be used to calculate the free energy landscapes of proteins, providing a comprehensive understanding of folding mechanisms, including folding pathways, kinetics, and the structures of intermediates and transition states. However, this model cannot be used for the prediction of the folding mechanisms of multidomain proteins. Furthermore, although long-time molecular dynamics calculations allow simulating the folding reactions of small single-domain proteins in up to 1 ms25, they cannot simulate multidomain protein folding, which typically takes more than 100 ms6,7,8,9. As multidomain proteins are naturally abundant and constitute most of the proteomes26, predicting the folding mechanisms of multidomain proteins from native structures is a key problem to be solved.

The folding reactions of multidomain proteins often involve multiple folding pathways and molten globule-like compact intermediates6,7,8,9. The intermediates accumulate via a hydrophobic collapse mechanism driven by nonlocal interactions between distant residues27. However, the WSME model does not allow for nonlocal interactions between distant residues in an amino acid sequence unless all intervening residues are folded (Fig. 1a, b, d). This precludes the folding of a discontinuous domain, consisting of residues separated in a sequence, prior to the folding of an intervening continuous domain, resulting in a failure to predict the folding mechanisms of multidomain proteins. In addition, the WSME model cannot explicitly consider the folding reactions of disulfide-intact proteins or those involving the oxidative formation of disulfide bonds28, which are typical nonlocal interactions that stabilize many extracellular proteins29.

Fig. 1: Schematic representation of WSME and WSME-L models.
figure 1

Residues in native or unfolded conformation (mk = 1 or 0) are shown as filled or open circles, respectively. ac Assumption on native contact formation. In the original WSME model (a), native contact between residues i and j (red line) is formed only when these residues are connected by a native stretch along main chain; that is, when all intervening residues between them are in native conformation. A nonlocal native contact cannot be formed if two native stretches are not continuously connected along main chain (b). In the WSME-L model (c), nonlocal native contact between residues i and j separated in a sequence can be formed if a native stretch between them is continuously connected through a linker shortcut at residues u and v (blue line), bypassing a long stretch of the main chain. d Original WSME model. Contact energy between residues i and j, εi,j, can be uniform for all residues (Original model 1) or residue-dependent (Original model 2). e WSME-L model for the folding of small and large proteins without disulfide bonds. f WSME-L(SS) model for the folding involving oxidative disulfide bond formation. g WSME-L(SSintact) model for the folding of disulfide-intact proteins. In (eg), N- and C-terminal regions are shown by magenta circles and the intervening region is shown by blue circles. In (f, g), S denotes a cysteine residue.

To overcome these limitations, we developed a simple structure-based statistical mechanical model, named the WSME-L model (L denotes linker), which can introduce virtual linkers corresponding to nonlocal interactions anywhere in a protein molecule (Fig. 1c, e). This model improves the prediction of the folding mechanisms of small single-domain proteins compared with the original models. Surprisingly, our model successfully predicts the free energy landscapes that reproduce the experimentally observed folding behaviors of multidomain proteins. In addition, the models with slight modifications, named the WSME-L(SS) and WSME-L(SSintact) models, allow the prediction of the folding processes involving oxidative disulfide bond formation and those of disulfide-intact proteins, respectively (Fig. 1f, g). These results suggest that the WSME-L models may pave the way for predicting protein folding mechanisms from native structures without the limitations of size and shape, and will be a useful tool for protein folding prediction in the post-AlphaFold era.

Results

WSME-L model

In the original WSME model, an Ising-like two-state variable, mk, was assigned to each residue of the protein (mk = 1 for native and 0 for other conformations). The Hamiltonian of the model is defined as:

$$H(\{m\})=\mathop{\sum }\limits_{i=1}^{N-1}\mathop{\sum }\limits_{j=i+1}^{N}{\varepsilon }_{i,j}{m}_{i,j}$$
(1)

where N is the number of residues, and εi,j is the contact energy between residues i and j in the native state10,11,12,13. The protein state \(\{m\}\) represents a set of all residue states, \(\{{m}_{1},{m}_{2},\cdot \cdot \cdot,{m}_{N}\}\), with 2N possible conformations, and:

$${m}_{i,j}={m}_{i}{m}_{i+1}\cdots {m}_{j}=\mathop{\prod }\limits_{k=i}^{j}{m}_{k}$$
(2)

The free energy landscape can be calculated based on the exact analytical solution of the partition function18. An order parameter,

$$n=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{m}_{i}$$
(3)

which indicates the degree of native structure formation, was used as the reaction coordinate in the free energy landscape. A partition function restricted by the order parameter n is denoted:

$$Z(n)={{{{{{\rm{Tr}}}}}}}_{n} \exp \left[-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}\left(H(\{m\})-T\mathop{\sum }\limits_{i=1}^{N}{S}_{i}{m}_{i}\right)\right]$$
(4)

where kB is the Boltzmann constant, T is the temperature, and Si (<0) is the entropic reduction of residue i attributed to the formation of the native conformation. \({{{{{{\rm{Tr}}}}}}}_{n}\) is the sum of all possible states constrained by the value of the order parameter n. Two important assumptions were made in the WSME model. One assumption is that a protein is stabilized only by contacts present in the native structure. Another assumption is that native interactions between residues i and j are established only when all intervening residues fold cooperatively into their native conformations, that is, mi,j = 1 (Fig. 1a, b).

To consider the nonlocal interactions between the N- and C-termini of a protein, Inanami et al. introduced a virtual linker at both termini23. Inspired by this, we developed a method to introduce a virtual linker between arbitrary residues u and v (u < v) and established a new model (WSME-L) that can consider all nonlocal interactions in a protein molecule. In this model, to represent nonlocal interactions through a virtual linker, we define \({m}_{i,j}^{(u,v)}\) as follows:

$${m}_{i,j}^{(u,v)}=\left(\mathop{\prod}\limits_{\min (i,u)\le k\le \,\max (u,i)}{m}_{k}\right)\left(\mathop{\prod}\limits_{\min (v,j)\le k\le \,\max (j,v)}{m}_{k}\right)$$
(5)

where residues i and j interact via the linker, if the following two consecutive regions are in their native conformations: (1) from residue i to u (or from residue u to i) and (2) from residue v to j (or from residue j to v) (Fig. 1c). Thus, residues i and j are consecutively connected as a “native stretch” through a linker shortcut, bypassing a long stretch of the main chain. The Hamiltonian of the WSME-L model with a single linker is expressed as follows:

$${H}^{(u,v)}(\{m\})=\mathop{\sum }\limits_{i=1}^{N-1}\mathop{\sum }\limits_{j=i+1}^{N}{\varepsilon }_{i,j}\left\lceil \frac{{m}_{i,j}+{m}_{i,j}^{(u,v)}}{2}\right\rceil$$
(6)

where \(\lceil \rceil\) is the ceiling function that prevents double counting. Thus, native contacts can be formed between the residues i and j if they are connected in a native stretch, either through the main chain (mi,j) or through a linker (\({m}_{i,j}^{(u,v)}\)).

To incorporate all interactions present in the native state of a protein, we defined the partition function of the WSME-L model as an ensemble of partition functions with a virtual linker introduced at each inter-residue contact, as follows:

$${Z}_{{{{{{\rm{L}}}}}}}(n)=Z(n)+\mathop{\sum}\limits_{(u,v):{{{{{\rm{All}}}}}}\,{{{{{\rm{contacts}}}}}}}\left({Z}^{(u,v)}(n)-Z(n)\right)\exp \left(\frac{{S^{\prime} }^{(u,v)}(n)}{{k}_{{{{{{\rm{B}}}}}}}}\right)$$
(7)

where \({Z}^{(u,v)}(n)\) is a partition function with a virtual linker between residues u and v, and \({S^{\prime} }^{(u,v)}(n)\) is the entropy penalty corresponding to the reduced number of states owing to the virtual linker formation between residues u and v (see Methods for details). This WSME-L model corresponds to a generalization of the virtual linker model by Inanami et al.23 and can consider all contacts present in the protein. We obtained an exact analytical solution for the partition function of the WSME-L model using the transfer matrix method18 (see Methods for details). This solution allows us to calculate rigorous free energy landscapes with greatly reduced computational complexity compared to the direct calculation using Eq. (7).

Folding of small single-domain proteins

To test whether the WSME-L model (Fig. 1e) can predict the folding mechanisms of small single-domain proteins with different topologies, we calculated the free energy landscapes of six small proteins for which the folding mechanisms have been extensively studied experimentally: engrailed homeodomain (En-HD; SCOP2 classification30: all α), src SH3 domain (all β; Fig. 2a), α-spectrin SH3 domain (all β), cold shock protein B (CspB; all β), chymotrypsin inhibitor 2 (CI2; α + β), and activation domain of human procarboxypeptidase A2 (ADA2h; α + β) (Supplementary Table 1; see Methods for detailed calculations). The predictions of the WSME-L model were compared with those of two types of original WSME models: Original model 1 used a uniform contact energy, while Original model 2 used a residue-dependent weighted contact energy similar to that used in the WSME-L model.

Fig. 2: Folding of small single-domain proteins predicted by WSME-L model.
figure 2

a Native structure of src SH3 domain. N-terminal (residues 5–36) and C-terminal (residues 37–64) halves are shown in magenta and cyan, respectively. b Contact map (top left) and AMBER-derived contact energy (bottom right) of src SH3 domain. c One-dimensional free energy landscape of src SH3 domain. U, TS, and N denote the unfolded, transition, and native states, respectively. d Two-dimensional free energy landscape of src SH3 domain. n1 and n2 are order parameters for N-terminal (residues 5–36) and C-terminal (residues 37–64) halves, respectively. Dominant folding pathway is indicated by white dashed line. Width and height of the panel are set proportional to the number of residues involved in regions for n1 and n2, respectively. e Residue-specific structure formation of src SH3 domain by theoretical Φ-value analysis. White dashed line indicates position of transition state. f Theoretical Φ-values predicted by Original model 2 (blue line) and WSME-L model (red line), and experimentally observed Φ-values (black filled circles) at transition state of src SH3 domain folding33. g, h Correlation between predicted and experimentally observed values of folding rate (g) and stability (h). In (b, e, f), green boxes on frame indicate locations of strands. In (fh), correlation coefficients, r, and p-values of two-sided t-test without adjustments are shown. Source data are provided as a Source data file.

The one-dimensional (1D) free energy landscapes predicted by these models indicated two-state folding reactions from the unfolded to the native state for these proteins, in agreement with experimental results31,32,33,34,35,36 (Supplementary Figs. 16 for all proteins; the results for the src SH3 domain are also shown in Fig. 2 as a representative). However, the rate constant for the folding reaction kf and the stability of the folded protein ΔGNU predicted by the WSME-L model showed higher correlations with the experimental results than those obtained by the original models (Fig. 2g, h and Supplementary Fig. 7). We also calculated the two-dimensional (2D) free energy landscapes by defining the order parameters n1 and n2, corresponding to the degree of structure formation in the N- and C-terminal halves of each protein, respectively, as follows:

$$\left\{\begin{array}{c}{n}_{1}=\frac{1}{{N}_{1}}\mathop{\sum}\limits_{{m}_{i}\in \{{m}_{1}\}}{m}_{i}\\ {n}_{2}=\frac{1}{{N}_{2}}\mathop{\sum}\limits_{{m}_{i}\in \{{m}_{2}\}}{m}_{i}\end{array}\right.$$
(8)

where \(\{{m}_{1}\}\) and \(\{{m}_{2}\}\) are sets of residue states, and N1 and N2 are the number of residues in each region. The 2D free energy landscapes indicated that the structured region of each protein in the transition state was mainly localized in the N- or C-terminal half, as suggested by the experiments31,32,33,34,35,36 (Fig. 2d and Supplementary Figs. 16).

To predict the detailed folding process at the amino acid residue resolution, we performed a theoretical Φ-value analysis (see Methods for details). Φ-value analysis has been widely used to experimentally study residue-specific structure formation during folding31,32,33,34,35,36. Φ = 0 or 1 when the residue is unfolded or fully folded, respectively. The theoretical Φ-values in the folding transition state calculated by the WSME-L model reproduced the experimentally observed Φ-values for all proteins well, and their correlation coefficients were higher than those of the original models (Fig. 2e, f and Supplementary Figs. 16). Interestingly, the WSME-L model can explain the differences in the folding pathways of two proteins with similar structures, src and α-spectrin SH3 domains (Supplementary Figs. 2 and 3). Therefore, the WSME-L model quantitatively improves the prediction accuracy of the folding mechanisms of small proteins compared with the original WSME models.

Folding of large multidomain proteins

Next, to test whether the WSME-L model (Fig. 1e) can predict the folding mechanisms of large multidomain proteins with more than 100 residues, we applied the model to six well-studied proteins: apomyoglobin (apoMb; all α; Fig. 3a), barnase (α + β; Fig. 3d), ribonuclease HI (RNase HI; α/β; Fig. 3g), dihydrofolate reductase (DHFR; α/β; Fig. 3j), α-subunit of tryptophan synthase (αTS; α/β; Fig. 3m), and indole-3-glycerol phosphate synthase (IGPS; α/β; Fig. 3p) (Supplementary Table 1; see Methods for detailed calculations). Kinetic folding experiments showed that apoMb accumulates an early folding intermediate in which the A, B, G, and H helices are formed among its eight helices (A–H). This then folds to the native state after forming the C, D, and E helices (Supplementary Fig. 8a)37. We calculated the 2D free energy landscape of apoMb using the original and WSME-L models with order parameters n1 and n2, which correspond to the degree of structure formation in the A, B, G, and H helices and the C, D, and E helices, respectively. The original models could not predict a free energy landscape consistent with the experimental results (Supplementary Figs. 9 and 10), because they assumed that the continuous region (C, D, and E helices) must fold before the discontinuous region (consisting of A and B helices on the N-terminal side and G and H helices on the C-terminal side). In contrast, the WSME-L model predicted the accumulation of a folding intermediate, in which the A, B, G, and H helices were formed, in the dominant folding pathway of apoMb (Fig. 3b). These results agree with the experimental results and are further supported by theoretical Φ-value analysis along the folding pathway (Fig. 3c).

Fig. 3: Folding of large multidomain proteins predicted by WSME-L model.
figure 3

a Native structure of myoglobin consisting of A, B, G, and H helices (magenta) and C, D, E, and F helices (cyan). b Two-dimensional (2D) free energy landscape of apomyoglobin (apoMb). n1 and n2 are order parameters for magenta and cyan regions in (a), respectively. Dominant folding pathway is indicated by white dashed line. Residues predicted to be folded by theoretical Ф-value analysis are shown in red for structures of intermediate and native state. U, I, and N denote unfolded, intermediate, and native states, respectively. c Residue-specific structure formation of apoMb by theoretical Φ-value analysis. The Φth-values along dominant folding pathway (order parameter npathway) are plotted against residue number. Cross-section of 2D free energy landscape along the pathway is shown on left. Red and green boxes in upper frame indicate locations of helices and strands, respectively, and their names are shown in the corresponding colors. df Results for barnase. In (d), N-terminal (residues 3–51) and C-terminal (residues 52–110) halves are shown in magenta and cyan, respectively. gi Results for ribonuclease HI (RNase HI). In (g), residues 1–43 and 123–155 are shown in magenta, and residues 44–122 are shown in cyan. jl Results for dihydrofolate reductase (DHFR). In (j), discontinuous loop domain (DLD; residues 1–37 and 107–159) and adenosine-binding domain (ABD, residues 38–106) are shown in magenta and cyan, respectively. m–o Results for α-subunit of tryptophan synthase (αTS). In (m), N-terminal (residues 1–123) and C-terminal (residues 124–268) halves are shown in magenta and cyan, respectively. p–r Results for indole-3-glycerol phosphate synthase (IGPS). In (p), N-terminal (residues 1–102) and C-terminal (residues 103–223) halves are shown in magenta and cyan, respectively. Details are as described in (ac). Source data are provided as a Source data file.

For the folding reactions of barnase and RNase HI, the predictions of Original model 1 did not agree with the experimental results38,39 (Supplementary Figs. 8b, c and 1114). In contrast, the predictions of Original model 2 and the WSME-L model, both of which use residue-dependent weighted contact energies, agreed with the experimental results (Fig. 3d–i and Supplementary Figs. 8b, c and 1114). However, the WSME-L model better predicted the Φ-values compared with Original model 2 (Supplementary Figs. 11 and 13).

DHFR consists of two domains, with one domain (the adenosine-binding domain, ABD) inserted into the other (the discontinuous loop domain, DLD) (Fig. 3j). Experiments have shown that the ABD of DHFR folds first, followed by DLD, and finally, both dock to form the native state via multiple folding pathways (Supplementary Fig. 8d)40. Although the original models did not explain this reaction (Supplementary Figs. 1516), the extended WSME model with a single virtual linker between the N- and C-termini partially explained the experimental results23. However, the predicted intermediates were unstable, and multiple folding pathways were not predicted. In contrast, our WSME-L model reproduced the accumulation of stable intermediates, in which only ABD was formed, and the presence of multiple folding pathways (Fig. 3k, l and Supplementary Figs. 15 and 16).

αTS (268 residues) and IGPS (223 residues) are among the largest proteins whose folding mechanisms have been studied in detail by experiment41,42. They have similar native structures and show parallel folding pathways involving on-pathway intermediates, but their detailed structures differ between αTS and IGPS (Supplementary Fig. 8e, f). For αTS, all three models predicted the presence of two parallel pathways (Fig. 3n and Supplementary Figs. 17 and 18). However, only the WSME-L model was able to predict the formation of both N- and C-terminal regions in the on-pathway intermediate (Fig. 3o and Supplementary Figs. 17 and 18). Remarkably, although the original models failed to reproduce the experimentally observed folding processes of IGPS (Supplementary Figs. 19 and 20), the WSME-L model successfully reproduced the parallel pathways in IGPS folding (Fig. 3q) and accurately predicted the structure of the on-pathway intermediate in which strands 2–6 and helices 2–5 are formed (Fig. 3r). Again, the WSME-L model can distinguish subtle differences in folding mechanisms even for large proteins with similar structures.

In summary, the WSME-L model, which considers all nonlocal interactions, can predict the folding free energy landscapes of small to large proteins of different shapes in a unified manner.

Folding with oxidative disulfide bond formation

Disulfide bonds are typical nonlocal interactions that stabilize a protein covalently and are present in many extracellular proteins or domains secreted outside cells29. Folding of such proteins involves oxidative disulfide bond formation. Oxidative folding reactions of RNase A and other proteins have long been studied experimentally, leading to Anfinsen’s dogma that the amino acid sequence of a protein determines its native structure43. The WSME-L model can be modified to predict oxidative folding by considering the stabilization contact energy corresponding to disulfide bond formation when a virtual linker is created between two native Cys pairs (Fig. 1f; see Methods for details). We applied this model, named the WSME-L(SS) model, to three representative proteins for oxidative folding studies: hen egg-white lysozyme (α + β; four disulfide bonds; Fig. 4a), RNase A (α + β; four disulfide bonds; Fig. 4d), and bovine pancreatic trypsin inhibitor (BPTI; small; three disulfide bonds; Fig. 4g). We also calculated the degree of disulfide bond formation between the residues i and j during folding, \({\Phi }_{i,j}^{{{{{{\rm{SS}}}}}}}\) (see Methods for details). The WSME-L(SS) model predictions described below are consistent with the experimental results and reveal complex behaviors in forming multiple disulfide bonds during folding that were not predicted by the original models (Supplementary Figs. 2127). Note that although the WSME-L(SS) model cannot simulate non-native disulfide bond formation because it considers only native contacts, experimental studies demonstrate that well-populated intermediates in oxidative folding contain only native disulfide bonds44, which can be explained by our model.

Fig. 4: Folding with oxidative disulfide bond formation predicted by WSME-L(SS) model.
figure 4

a Native structure of lysozyme. α-domain (residues 1–39 and 86–129) and β-domain (residues 40–85) are shown in magenta and cyan, respectively. Disulfide bonds are shown in yellow with Cys residue numbers. b Two-dimensional free energy landscape of lysozyme predicted by WSME-L model. n1 and n2 are order parameters for magenta and cyan regions in (a), respectively. Dominant folding pathways are indicated by white dashed lines. c Degree of disulfide bond formation \({\Phi }_{i,j}^{{{{{{\rm{SS}}}}}}}\) along Pathways 1 (top) and 2 (bottom). df Results for RNase A. In (d), N-terminal (residues 1–60) and C-terminal (residues 61–124) halves are shown in magenta and cyan, respectively. gi Results for bovine pancreatic trypsin inhibitor (BPTI). In (g), N-terminal (residues 1–36) and C-terminal (residues 37–58) regions are shown in magenta and cyan, respectively. Details are as described in (ac). Source data are provided as a Source data file.

Lysozyme is one of the best studied multidomain proteins in terms of protein folding. It consists of two domains: a discontinuous α-domain (residues 1–39 and 86–129) and a continuous β-domain (residues 40–85) (Fig. 4a). It has four disulfide bonds: two (6–127 and 30–115) in the α-domain, one (64–80) in the β-domain, and one (76–94) in the interdomain region. Oxidative folding experiments with lysozyme revealed two parallel folding pathways: the major pathway accumulates two intermediates, named des[76–94] and des[64–80], in which three disulfide bonds other than 76–94 and 64–80 are formed, respectively, whereas the minor pathway involves the des[6–127] intermediate (Supplementary Fig. 21a)45. The WSME-L(SS) model predicted that the oxidative folding of lysozyme starts with forming the 64–80 disulfide bond in the β-domain and proceeds through two contrasting pathways (Fig. 4b, c). In the major pathway (Pathway 1), as the preformed 64–80 disulfide bond is reduced, the 30–115 and then the 6–127 disulfide bonds are formed in the α-domain. Subsequently, during the repeated formation and reduction of the 76–94 disulfide bond, the 64–80 and finally the 76–94 disulfide bonds were formed in the β-domain, corresponding to the des[76–94] and des[64–80] intermediates observed in the experiments. In contrast, the minor pathway (Pathway 2) favored the formation of the 64–80 and then the 76–94 disulfide bonds in the β-domain, followed by the formation of the 30–115 and then the 6–127 disulfide bonds in the α-domain. This corresponds to the des[6–127] intermediate observed experimentally.

The prediction of the oxidative folding of RNase A by the WSME-L(SS) model agrees with the experimental results (Supplementary Fig. 21b)46 as follows: in the early stages of folding, the 65–72 disulfide bond was strongly favored among the four native disulfide bonds (Fig. 4f). RNase A folded along two parallel pathways (Fig. 4e). In Pathway 1, the initially formed 65–72 disulfide bond was broken in the middle stage and regenerated in the late stage of folding (Fig. 4f). In Pathway 2, a 40–95 disulfide bond was formed in the final step (Fig. 4f).

Surprisingly, the WSME-L(SS) model successfully reproduced the oxidative folding experiments of BPTI under conditions where the glutathione concentrations are close to those in the endoplasmic reticulum (Fig. 4h, i and Supplementary Fig. 21c)47,48. The predictions showed that BPTI has a high propensity to form a single disulfide intermediate, named [14–38], with a disulfide bond between residues 14 and 38 early in the folding reaction, but in the middle stage of folding (n > 0.4) it forms the [30–51] intermediate, followed by a two-disulfide [30–51; 14–38] intermediate (Fig. 4i). BPTI then folds into its native state via two pathways, both involving the additional formation of a 5–55 disulfide bond (Fig. 4h, i). Interestingly, Pathway 1 predicted the partial breakage of the 14–38 disulfide bond before the formation of all the disulfide bonds. This is consistent with the experimental results observed at low concentrations of oxidized glutathione, where the 14–38 disulfide bond is transiently reduced before forming the native state44.

Folding of disulfide-intact proteins

Because oxidative folding measurements have limited temporal resolution, many experimental studies of protein folding have been performed on disulfide-intact proteins, in which all disulfide bonds are preformed prior to folding experiments. This enables a detailed investigation of folding kinetics with high temporal resolution using various experimental methods. To predict the folding reactions of disulfide-intact proteins, we constructed the WSME-L(SSintact) model, which allows WSME-L predictions in the presence of covalent linker(s) (Fig. 1g; see Methods for details). We applied this model to predict the folding of disulfide-intact lysozyme, RNase A, and BPTI.

Kinetic folding experiments on disulfide-intact lysozyme revealed a complex folding behavior involving a collapsed intermediate and two subsequent parallel pathways (Fig. 5a and Supplementary Fig. 28a)49,50,51,52. The major slow folding pathway accumulates the α-domain intermediate in which the discontinuous α-domain is formed but not the continuous β-domain, whereas in the minor fast folding pathway the collapsed intermediate folds directly to the native state with simultaneous formation of α- and β-domains. This behavior was not explained by the original WSME models, which predicted that the α-domain was formed only after the formation of the β-domain (Supplementary Figs. 29 and 30). Strikingly, the WSME-L(SSintact) model predicted the presence of two intermediates (I1 and I2) and two dominant folding pathways, in agreement with the experimental results (Fig. 5b–d). In the I1 intermediate, the B- and D-helices in the α-domain are partially formed, while in the I2 intermediate in Pathway 1, the α-domain is mostly folded but the β-domain is not (Fig. 5b, c). In contrast, in Pathway 2, after the formation of I1, the β-domain and then the rest of the α-domain are formed through a downhill landscape without accumulation of stable intermediates (Fig. 5b, d). These predictions are consistent with the experimental results and provide detailed insights into the residue-specific structure formation during lysozyme folding.

Fig. 5: Folding of disulfide-intact lysozyme predicted by WSME-L(SSintact) model.
figure 5

a Folding mechanism of experimentally elucidated hen egg-white lysozyme49,50,51,52. b Two-dimensional (2D) free energy landscape of lysozyme predicted by WSME-L model under folding conditions at 293 K. n1 and n2 are order parameters for α- and β-domains, respectively. Folding pathways 1 and 2 are indicated by white dashed lines. Lysozyme residues predicted to be folded by theoretical Ф-value analysis (Фth > 0.5) are shown in red for structures of intermediates and native state. U and N denote unfolded and native states, respectively; I1 and I2 denote intermediates; and TS0, TS1, TS2, and TS3 denote transition states. c, d Residue-specific structure formation along folding pathways of lysozyme predicted by theoretical Φ-value analysis. Φth-values along Pathways 1 (c) and 2 (d) are plotted against residue number. Cross-section of 2D free energy landscape along folding pathway is shown on left. Red and green boxes in upper frame indicate locations of helices and strands, respectively, and their names are shown in the corresponding colors. e, f Time evolution of concentrations of kinetic species [U (gray), I1 (orange), I2 (red), and N (blue)] at 293 K (e) and 338 K (f). Np1 (magenta) and Np2 (cyan) indicate native state formed through Pathways 1 and 2, respectively. In (e) is expanded view of time evolution during first 0.05 s. Total concentration was normalized to 1. g Time evolution of each domain. Predictions by WSME-L model (α-domain: magenta curve, and β-domain: cyan curve), and average proton occupancy of each domain obtained from pulsed-hydrogen exchange NMR experiments49 (α-domain: red filled circles, and β-domain: blue filled circles). h Temperature dependence of heat capacity. Blue curve: prediction by WSME-L(SSintact) model. Red filled circles: experimental data54. Source data are provided as a Source data file.

Kinetic analysis based on the 2D free energy landscape predicted that 91% and 9% of lysozyme molecules fold through Pathways 1 and 2, respectively (Fig. 5e; see Methods for details). Pathway 1 is favored due to a lower free energy barrier (TS1) than Pathway 2 (TS3) (Fig. 5b). However, folding via Pathway 1 is slower because it has to overcome a rate-limiting free energy barrier (TS2) to reach the native state (Fig. 5b, c, e). This provides a rationale for the experimental observations of a major but slow folding pathway and a minor but fast folding pathway. In addition, the time evolution of α- and β-domain structure formation, calculated by combining the kinetic and theoretical Φ-value analyses, showed that α-domain folding precedes β-domain folding, reproducing the experimental results (Fig. 5g).

Thermodynamic analysis predicted that the I2 intermediate is destabilized at high temperatures (Fig. 5f and Supplementary Figs. 3135), consistent with experimental results showing that the population of the α-domain intermediate is reduced at high temperatures53. Furthermore, the temperature-dependent changes in the heat capacity of lysozyme obtained from equilibrium unfolding experiments54 were well reproduced by our model (Fig. 5h).

To further investigate the role of each disulfide bond in determining the folding pathways, we calculated the free energy landscapes of lysozyme variants containing only a single disulfide bond using the WSME-L(SSintact) model (Supplementary Fig. 36). We found that the disulfide bonds in the α-domain, especially the 30–115 disulfide bond, are critical for making Pathway 1 the major folding pathway (Supplementary Fig. 36). This finding is consistent with experimental studies showing that the two disulfide bonds in the α-domain are essential for the accumulation of the α-domain intermediate (corresponding to I2)55.

We also applied the WSME-L(SSintact) model to predict the folding of disulfide-intact RNase A and BPTI (Supplementary Figs. 3740). RNase A has two cis Pro residues (Pro93 and Pro114) in its native state and shows very fast folding reactions when starting from the Uvf unfolded state in which both Pro residues are in cis conformations (Supplementary Fig. 28b)56. The model predicted the formation of helix 2 and the N-terminal side of the β-sheet in the intermediate, but not residues 50–75 and the C-terminal side of the β-sheet (Supplementary Figs. 37 and 38). In addition, stabilization of the entire helix 1 occurs only in the final step. These predictions well reproduce the structure of the IΦ intermediate that accumulates in the folding reaction from Uvf56. For disulfide-intact BPTI, the model predicted a folding pathway similar to oxidative folding; however, the presence of the intact disulfide bonds stabilized the protein, resulting in a downhill folding with a minimal transition state barrier (Supplementary Figs. 39 and 40). Since kinetic folding experiments for disulfide-intact BPTI are not available, this prediction requires experimental verification.

Discussion

In this study, we developed a simple structure-based theoretical model (WSME-L) to predict protein folding mechanisms by introducing virtual linkers at arbitrary positions to enhance nonlocal interactions. Applying the original WSME models to predict folding processes has been limited to small proteins and multidomain proteins with close N- and C-termini, such as DHFR12,13,23. In contrast, our model can be applied to a wide variety of proteins, regardless of size and shape. For small single-domain proteins, the WSME-L model outperformed the original models in terms of prediction accuracy. Furthermore, the WSME-L, WSME-L(SS), and WSME-L(SSintact) models successfully predicted the folding pathways of large multidomain proteins, folding with oxidative disulfide bond formation, and folding of disulfide-intact proteins, respectively. Notably, the results are consistent with the experimental results, a feat that was not achieved with the original models. The success of these predictions substantiates the importance of nonlocal interactions in predicting folding mechanisms, particularly for multidomain proteins17,23,27,57. In addition, the models yielded detailed predictions of folding processes beyond reproducing the experimental results and provided the key determinants and rationale for the folding mechanisms.

An important assumption of the WSME model is that a protein is stabilized only by the contacts present in the native structure. The same assumption is employed in an ideal protein based on the consistency principle proposed by Gō58 and in the perfect funnel-like energy landscape based on the principle of minimal frustration proposed by Wolynes, Onuchic, and colleagues59. Despite their simplicity, WSME models successfully predict the folding mechanisms of single-domain and multidomain proteins. These results suggest that both the consistency and minimal frustration principles hold for many proteins, regardless of their size. This suggests that real proteins behave similarly to ideal proteins regarding folding kinetics and thermodynamics.

Detailed comparisons of predicted and experimental data reveal slight differences between ideal and real proteins due to non-native interactions and Pro isomerization during folding. In disulfide-intact lysozyme folding, non-native interactions stabilize the α-domain intermediate (I2) and retard the subsequent folding process51. The WSME-L prediction underestimated the maximum population of I2 (~45%) compared to the experimental results (~60%)53 (Fig. 5e), but this discrepancy can be resolved by accounting for the stabilization effect of non-native interactions by 0.44 kcal/mol (Supplementary Fig. 41). In αTS and IGPS folding, off-pathway intermediates formed early in folding by non-native interactions must be disrupted before folding to the native state41,42. In addition, cis Pro residues in the native state of αTS, IGPS, and RNase A must isomerize from trans to cis conformations during folding if they are in trans conformations in the unfolded state, resulting in the appearance of additional parallel folding pathways41,42,56. However, the WSME-L models predicted only the folding pathways involving on-pathway intermediates with native-like Pro conformations because they only consider native contacts. Similarly, the WSME-L(SS) model predicted that the 40–95 disulfide bond is formed together with the 26–84 disulfide bond in the final step of oxidative RNase A folding, whereas experiments have shown that the final step of the folding involves only the formation of the 40–95 disulfide bond46. This may be because the slow trans-to-cis isomerization of Pro93 near the 40–95 disulfide bond60, which is not considered in the WSME-L(SS) model, delays the formation of this disulfide. Thus, the WSME-L models are useful for capturing the general features of the folding mechanisms of ideal proteins, and slight discrepancies between the predicted and experimental data can provide clues to understanding the role of non-native interactions/conformations in protein folding. Future efforts to incorporate non-native interactions and Pro isomerization into the models would allow for even greater accuracy in predicting protein folding mechanisms.

Theoretical approaches to predicting three-dimensional protein structures have been greatly advanced by deep learning1,2, making considerable progress toward solving the structure prediction component of the ‘protein folding problem’3. Although these algorithms do not predict how proteins fold and function, WSME-L models can predict the folding mechanisms of various proteins, without the limitations of size and shape, with low computational complexity, using only native structures solved experimentally or with deep learning algorithms. Moreover, because the WSME-L model can describe protein dynamics in terms of free energy landscapes, it has a wide range of potential applications beyond protein folding, including the dynamic motions associated with protein function19,20,22, protein–protein interactions, and coupled folding and binding of intrinsically disordered proteins61. Furthermore, it would be applicable to the development of novel protein design methods based on dynamics prediction. Therefore, the WSME-L models may pave the way for solving the folding process component of the ‘protein folding problem’3 and will be increasingly useful in the forthcoming era of computational biology.

Methods

Contact energies

The native contact energy between residues i and j, \({\varepsilon }_{i,j}\), was obtained as \({\varepsilon }_{i,j}=\varepsilon {e}_{i,j}\), where \(\varepsilon\) is the energy size of the inter-residue contacts in a protein and \({e}_{i,j}\)\((-1\le {e}_{i,j}\le 1)\) is the weight of the contact energy between residues i and j. The \({e}_{i,j}\) values were determined as follows: first, the three-dimensional structure from the Protein Data Bank (PDB) was energy minimized using AMBER18/ff14SB with restraints for backbone atoms62. In Original model 1, \({e}_{i,j}=-1\) if an atom in residue i is within 4 Å of an atom in residue j; otherwise \({e}_{i,j}=0\). In Original model 2, using the energy-minimized structure, the AMBER-derived contact energy \({\varepsilon }_{i,j}^{{{{{{\rm{AMBER}}}}}}}\) was calculated using the MMPBSA.py module for implicit solvents62. \({e}_{i,j}\) was defined as \({\varepsilon }_{i,j}^{{{{{{\rm{AMBER}}}}}}}\) divided by the maximum absolute value of \({\varepsilon }_{i,j}^{{{{{{\rm{AMBER}}}}}}}\) among all inter-residue contacts excluding those with neighboring residues \((j\le i+2)\). The WSME-L model has two types of partition functions: \(Z(n)\) and \({Z}^{(u,v)}(n)\) (see Eq. (7)). The \({e}_{i,j}\) values in \(Z(n)\) were the same as those used in Original model 2. \({Z}^{(u,v)}(n)\) is described as:

$${Z}^{(u,v)}(n)={{{{{{\rm{Tr}}}}}}}_{n}\exp \left[-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}\left({H}^{(u,v)}(\{m\})+{\varepsilon ^{\prime} }^{(u,v)}-T\mathop{\sum }\limits_{i=1}^{N}{S}_{i}{m}_{i}\right)\right]$$
(9)

In \({Z}^{(u,v)}(n)\), the contact energy acquired by the virtual linker formation between residues u and v, \({\varepsilon ^{\prime} }^{(u,v)}\), was defined as \({\varepsilon ^{\prime} }^{(u,v)}=\varepsilon ({e}_{u,v}+{e}_{u+1,v}+{e}_{u-1,v}+{e}_{u,v+1}+{e}_{u,v-1})\). The \({e}_{i,j}\) values for \({H}^{(u,v)}(\{m\})\) in \({Z}^{(u,v)}(n)\) were the same as in \(Z(n)\), except that \({e}_{u,v}={e}_{u\pm 1,v}={e}_{u,v\pm 1}=0\) to avoid double counting. For all models, \({e}_{i,j}=0\) for neighboring residues (\(j\le i+2\)).

In the predictions of folding with oxidative disulfide bond formation, –40 kcal/mol was added to \({\varepsilon }_{i,j}^{{{{{{\rm{AMBER}}}}}}}\) for the residue pair forming a disulfide bond in Original model 2 and the WSME-L(SS) model. In Original model 1, the change in contact energy corresponding to the addition of –40 kcal/mol in Original model 2 was added to the contact energy for the disulfide pair. To predict the folding of disulfide-intact proteins using the WSME-L(SSintact) model, \({e}_{i,j}={e}_{i\pm 1,j}={e}_{i,j\pm 1}=0\) for the residue pairs i and j forming a disulfide bond.

It is important to note that the use of native contact energies calculated by the AMBER force field was necessary to obtain the free energy landscape consistent with the experimental data. For disulfide-intact lysozyme, the landscape calculated using a uniform contact energy predicted only a single folding pathway corresponding to Pathway 1 (Supplementary Fig. 42). This inconsistency with the experiments is a marked contrast with previous studies in which residue-specific weighing on contact energies in an Ising-like coarse-grained model had a small contribution to the folding prediction of small proteins13,57. These results suggest the importance of side-chain packing interactions in determining the folding mechanisms of multidomain proteins.

The three-dimensional structure used to calculate the contact energies of apoMb was determined as follows: the all-atom molecular dynamics simulations of apoMb with heme removed from the heme-bound structure were performed using GROMACS 2021.2 with ff14SB for 1 μs with explicit solvents, as only heme-bound myoglobin structures were available from PDB63,64. The initial structure was built using the LEaP module of the AmberTools19 package with ff14SB62 and the crystal structure of myoglobin (PDB ID: 1bzp) as input. All ionizable side chains were set to their pH 7 protonation state. Charge-neutralizing chloride ions were placed around the protein molecule. The protein molecule was immersed in a 70.4 × 65.9 × 70.6 Å3 periodic box of TIP3P water molecules (7680 water molecules and 25,501 atoms in total). The long-range electrostatic interactions were treated using the particle-mesh Ewald method. The systems were energy minimized by the steepest descent algorithm for 200 steps with positional restraints and for additional 200 steps without restraints64. The system was then heated from 0 K to 310 K during a 200-ps constant-NVT MD simulation with harmonic position restraints on the heavy atoms of the solutes (with a force constant of 10 kcal mol−1 A−2). During the subsequent 700-ps constant-NPT MD simulation at 310 K and 1.0 × 105 Pa, the force constants of the position restraints were gradually reduced. The system was further equilibrated for 100 ps without position restraints. The bonds between hydrogen atoms and heavy atoms were constrained using the P-LINCS algorithm, allowing the use of 2-fs time steps. Temperature was controlled using the stochastic velocity-rescaling (V-rescale) algorithm. Pressure in NPT simulations was controlled using a Berendsen barostat with a coupling constant of 1 ps-1 64. The time-course analysis showed that the energy of the system was well converged during equilibration. Subsequently, an unrestrained constant-NPT MD simulation was performed for 1 μs at 310 K and 1.0 × 105 Pa using a Parrinello-Rahman barostat with a coupling constant of 2 ps−1 64, and the native ensembles of apoMb were obtained. The MD trajectories were analyzed using the CPPTRAJ module in AmberTools1962. The 1-μs MD simulations were performed in triplicate under the same conditions. The three independent MD trajectories represented by a root mean square deviation (RMSD) of the main-chain Cα atoms from the initial structure were similar (Supplementary Fig. 43a), and the average RMSD in each trajectory was almost the same for all trajectories, indicating the reproducibility of the MD simulations (Supplementary Fig. 43b). We then performed cluster analysis on 3000 structures taken every 1 ns from a total of 3-μs simulations with an RMSD cutoff of 1.6 Å65. The central structure of the first cluster was used as a computational model of apoMb to calculate contact energies; the PDB file of the apoMb structure is provided as Supplementary Data 1. As the F helix of apoMb does not form a stable structure in the native state66, the contact energies involving the residues in the F helix were set to zero. The resulting free energy landscape was consistent with the experimentally observed folding mechanisms of apoMb (Fig. 3b, c), indicating that both the accuracy and timescale of the MD simulation are sufficient for the present study.

Entropic costs

The entropic cost of the main chain of the i-th residue, Si, was set to –2.0 cal/(mol·K) for Original model 1, –2.5 cal/(mol·K) for Original model 2, and –3.5 cal/(mol·K) for the WSME-L models.

The entropic cost of ring closure via virtual linker formation, \({S^{\prime} }^{(u,v)}(n)\), was calculated as follows: let us consider a Gaussian chain with a chain length of La, where L is a natural number, and a (3.8 Å) is the distance between two adjacent Cα atoms. The entropic cost of ring closure between the two termini of this chain is:

$$s^{\prime} (L)=-\frac{3}{2}{k}_{{{{{{\rm{B}}}}}}}\left(\ln{L}+\frac{{r}^{2}-{a}^{2}}{2AaL}\right)$$
(10)

where A (20 Å) is the persistence length of a peptide chain, and r is the distance between the Cα atoms of the two termini of the chain23. Using this, we defined \({S^{\prime} }^{(u,v)}(n)\) as the arithmetic mean of the entropic cost \(s^{\prime}\) for all possible states as follows:

$${S^{\prime} }^{(u,v)}(n)={h}_{S^{\prime} }\mathop{\sum }\limits_{i=0}^{nN}\frac{\left(\begin{array}{c}N^{\prime} \\ i\end{array}\right)\left(\begin{array}{c}N-N^{\prime} \\ nN-i\end{array}\right)}{\left(\begin{array}{c}N\\ nN\end{array}\right)}s^{\prime} (N^{\prime} -i)$$
(11)

where \(N^{\prime}=v-u+1\), \({h}_{S^{\prime} }\) is a scaling factor of the ring entropy, and \(()\) denotes a combination.

Folding rate and stability of small proteins

The folding rate and stability of small proteins were calculated from the 1D free energy landscape according to previous studies67. The microscopic rate constant for the transition from a structure with nN native contacts (order parameter n) to one with nN ± 1 native contacts (order parameter n ± 1/N) can be described by:

$${k}_{nN,nN\pm 1}=A\exp \left(-\frac{F(n\pm 1/N)-F(n)}{2{k}_{{{{{{\rm{B}}}}}}}T}\right)$$
(12)

where A = 107 and F(n) is the 1D free energy at order parameter n. The macroscopic rate constant was obtained as the eigenvalue with the smallest nonzero absolute value of the following rate matrix:

$$\left(\begin{array}{cccccc}-{k}_{0,1} & {k}_{1,0} & 0 & \ldots & 0 & 0\\ {k}_{0,1} & -{k}_{1,0}-{k}_{1,2} & {k}_{2,1} & \ldots & 0 & 0\\ 0 & {k}_{1,2} & -{k}_{2,1}-{k}_{2,3} & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -{k}_{N-1,N-2}-{k}_{N-1,N} & {k}_{N,N-1}\\ 0 & 0 & 0 & \cdots & {k}_{N-1,N} & -{k}_{N,N-1}\end{array}\right)$$
(13)

Under conditions in which the native state is sufficiently stable relative to the unfolded state, the macroscopic rate constant is equivalent to the folding rate constant kf.

The stability of the small proteins, \(\Delta {G}_{{{{{{\rm{NU}}}}}}}\), was estimated as the difference between the minimum free energy values of the native and unfolded state basins in a 1D free energy landscape at 293 K or 298 K.

Determination of parameters

In the present WSME-L models, the parameters to be determined for each protein are \(\varepsilon\) and \({h}_{S^{\prime} }\). To determine these values for small proteins, the \({h}_{S^{\prime} }\) was set to 0.5–2.0 in increments of 0.1, and for each of them, the \(\varepsilon\) was determined that minimized the following loss function for the stability and folding rate between the experimentally determined and predicted values:

$$loss={\left(\frac{\Delta {G}_{{{{{{\rm{NU}}}}}}}}{{k}_{{{{{{\rm{B}}}}}}}T}-\frac{\Delta {G}_{{{{{{\rm{NU}}}}}}}^{\exp }}{{k}_{{{{{{\rm{B}}}}}}}T}\right)}^{2}+{\left[\ln{\left(\frac{{k}_{{{{{{\rm{f}}}}}}}}{{k}_{{{{{{\rm{f}}}}}}}^{\exp }}\right)}\right]}^{2}$$
(14)

Among the pairs of \({h}_{S^{\prime} }\) and \(\varepsilon\) thus obtained, we selected a pair for which the theoretical Φ-values correlate best with the experimental values. Similarly, for other proteins, we searched for the pairs of \({h}_{S^{\prime} }\) and \(\varepsilon\) that minimized the loss function for the stability between the experimentally determined and predicted values:

$$loss={\left(\frac{\Delta {G}_{{{{{{\rm{NU}}}}}}}}{{k}_{{{{{{\rm{B}}}}}}}T}-\frac{\Delta {G}_{{{{{{\rm{NU}}}}}}}^{\exp }}{{k}_{{{{{{\rm{B}}}}}}}T}\right)}^{2}$$
(15)

Among them, we selected a pair for which the theoretical Φ-values and stability of folding intermediates correlate best with the experimental values. See Supplementary Table 2 for the \(\varepsilon\) and \({h}_{S^{\prime} }\) values used for each protein.

Exact solution for WSME-L model

The exact solution of \({Z}^{(u,v)}(n)\) (see Eq. (9)) was obtained by rearranging the exact solution of the original WSME model18. First, we considered the Boltzmann weight of a native stretch, as follows:

$${w}_{i,j}=\exp \left[-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}\left(\mathop{\sum }\limits_{k=i}^{j-1}\mathop{\sum }\limits_{l=k+1}^{j}{\varepsilon }_{k,l}-T\mathop{\sum }\limits_{k=i}^{j}{S}_{k}\right)\right]{\lambda }^{j-i+1}$$
(16)

where λ is an indeterminant whose exponent is related to the order parameter n. For convenience in later calculations, we define \({w}_{i+1,i}=1\). To reduce computational complexity, virtual linkers were introduced only for residues i and j with interaction energies \({\varepsilon }_{i,j}^{{{{{{\rm{AMBER}}}}}}}\) less than –0.6.

According to the exact solution of the original WSME model, the generating function of the partition function Z, restricted by an order parameter n, is written as:

$${G}_{Z}(Z(n);\lambda )={}^{0}\langle 0|\mathop{\prod }\limits_{i=0}^{N}{Q}_{i+1}^{i}{|0\rangle }{}^{N+1}$$
(17)

Here, \({Q}_{\mu+1}^{\mu }\) was a transfer matrix defined as:

$${Q}_{\mu+1}^{\mu }=\mathop{\sum }\limits_{k=1}^{\mu+1}{|k-1\rangle }{}^{\mu }\, {}^{\mu+1}\langle k|+\mathop{\sum }\limits_{k=0}^{\mu }{w}_{\mu -k+1,\mu }{|k\rangle }{}^{\mu }\,{}^{\mu+1}\langle 0|$$
(18)

where \({|k\rangle }{}^{\mu }\) is a (μ + 1)-dimensional vector whose set satisfies the orthonormal system:

$$\left\{\begin{array}{c}{}^{\mu }\langle k|k^{\prime} \rangle {}^{\mu }={\delta }_{k,k^{\prime} }\hfill\\ \mathop{\sum }\limits_{k=0}^{\mu }{|k\rangle }{}^{\mu }\,{}^{\mu }\langle k|={I}_{\mu+1}\end{array}\right.$$
(19)

where k,  = 0, 1, ···, μ. From the generating function, the partition function restricted by the order parameter is formally given as follows:

$$Z(n)=\left.{\frac{1}{(nN)!}{\left(\frac{\partial }{\partial \lambda }\right)}^{(nN)}{G}_{Z}}\right|_{\lambda=0}$$
(20)

When calculating high-dimensional free energy landscapes, more indeterminants with proper exponents should be prepared similarly.

Next, when a linker is provided, the additional weights corresponding to the interactions between the two native stretches connected by the linker should be considered. The introduction of a linker connecting residues u and v implied that an additional weight, \({w}_{{{{{{\rm{L}}}}}}}^{\alpha,\beta,\gamma,\delta }\), should be multiplied when calculating the products of \({w}_{\alpha,\beta }\) (\(\alpha \le u\le \beta\)) and \({w}_{\gamma,\delta }\)(\(\gamma \le v\le \delta\)). The additional weight is given as:

$${w}_{{{{{{\rm{L}}}}}}}^{\alpha,\beta,\gamma,\delta }=\exp \left(-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}\mathop{\sum }\limits_{k=\alpha }^{\beta }\mathop{\sum }\limits_{l=\gamma }^{\delta }{\varepsilon }_{k,l}\right)$$
(21)

For convenience in later calculations, we defined that \({w}_{{{{{{\rm{L}}}}}}}^{\alpha,\beta,\gamma,\delta }=1\) when \(u \, < \, \alpha\) or \(\beta \, < \, u\).

The products of \({w}_{\alpha,\beta }\) and \({w}_{\gamma,\delta }\) must be extracted to calculate the partition function using additional weights. To achieve this, we insert a unit matrix into the generating function as follows:

$${G}_{Z}(Z(n);\lambda )={}^{0}\langle 0|\left(\mathop{\prod }\limits_{i=0}^{v-1}{Q}_{i+1}^{i}\right)\left(\mathop{\sum }\limits_{k=0}^{v}{|k\rangle }{}^{v}\, {}^{v}\langle k|\right)\left(\mathop{\prod }\limits_{i=v}^{N}{Q}_{i+1}^{i}\right){|0\rangle }{}^{N+1}$$
(22)

By rearranging this expression according to the definition of the transfer matrix, the generating function can be expanded as:

$${G}_{Z}(Z(n);\lambda )=\mathop{\sum }\limits_{\gamma=1}^{v}\mathop{\sum }\limits_{\delta=v}^{N}{R}^{0,\gamma -1}{w}_{\gamma,\delta }{R}^{\delta+1,N+1}+{R}^{0,v}{R}^{v,N+1}$$
(23)

with:

$${R}^{i,j}={}^{i}\langle 0|\mathop{\prod }\limits_{k=i}^{j-1}{Q}_{k+1}^{k}{|0\rangle }{}^{j}$$
(24)

where \(\gamma=v-k+1\). Index γ runs from 1 to v, and index δ runs from v to N. This means we can extract all the weights, including for the v-th residue (i.e., \({w}_{\gamma,v}\) and \({w}_{v,\delta }\)).

We then obtained \({R}_{\gamma,\delta }^{0,\gamma -1}\) by converting all the weights, \({w}_{i,j}\), included in \({R}^{0,\gamma -1}\) into \({w}_{i,j}{w}_{{{{{{\rm{L}}}}}}}^{\alpha,\beta,\gamma,\delta }\). This allowed us to calculate the generating function with a linker, \({G}_{{Z}^{(u,v)}}\), which is given by:

$${G}_{{Z}^{(u,v)}}({Z}^{(u,v)}(n);\lambda )=\left(\mathop{\sum }\limits_{\gamma=1}^{v}\mathop{\sum }\limits_{\delta=v}^{N}{R}_{\gamma,\delta }^{0,\gamma -1}{w}_{\gamma,\delta }{R}^{\delta+1,N+1}+{R}^{0,v}{R}^{v,N+1}\right)\exp \left(-\frac{{\varepsilon ^{\prime} }^{(u,v)}}{{k}_{{{{{{\rm{B}}}}}}}T}\right)$$
(25)

The contact maps and free energy landscapes were drawn using Mathematica 12.2 (Wolfram, Champaign, IL, USA).

Theoretical Φ-value analysis for WSME-L model

In protein folding experiments, residue-specific structure formation has been extensively studied using Φ-value analysis, which requires the measurement of free energy changes in intermediates and transition states due to an amino acid substitution at the residue of interest68. Here, we performed a theoretical Φ-value analysis by introducing a small perturbation in the contact energies of the l-th residue, equivalent to an amino acid substitution, and calculated the free energy landscape, which corresponded to the kinetic folding measurements of the mutant. The difference in free energy between the perturbed and unperturbed landscapes along a folding pathway was normalized by the total free energy change due to the perturbation, resulting in theoretical Φ-values (Φth,l): Φth,l = 0 or 1 when the l-th residue was unfolded or fully folded, respectively. The Φth,l values in the intermediate or transition states corresponded to the experimentally observed Φ-values.

A theoretical Φ-value analysis of the original WSME model has been performed elsewhere13. To apply this to the WSME-L model, we defined the perturbations at the l-th amino acid residue as follows:

$$\left\{\begin{array}{c}\Delta {H}_{l}=\mathop{\sum }\limits_{i=1}^{l}{\eta }_{i,l}{m}_{i,l}+\mathop{\sum }\limits_{j=l}^{N}{\eta }_{l,j}{m}_{l,j}\hfill\\ \Delta {H}_{l}^{(u,v)}=\mathop{\sum }\limits_{i=1}^{l}{\eta }_{i,l}\lceil ({m}_{i,l}+{m}_{i,l}^{(u,v)})/2\rceil+\mathop{\sum }\limits_{j=l}^{N}{\eta }_{l,j}\lceil ({m}_{l,j}+{m}_{l,j}^{(u,v)})/2\rceil \end{array}\right.$$
(26)

where \({\eta }_{i,l}\) and \({\eta }_{l,j}\) are the contact energy changes due to small perturbations in the native contacts involving the l-th residue. All interactions formed by the l-th residue were simultaneously modulated; \({\eta }_{i,j}=0.1|{\varepsilon }_{i,\, j}|\) for \({\varepsilon }_{i,\, j} \, < \, 0\), while \({\eta }_{i,\, j}=0\) for \({\varepsilon }_{i,\, j} \, > \, 0\).

The partition function with perturbations owing to such a pseudo-mutation was described as follows:

$${Z}_{{{{{{\rm{L}}}}}},l}^{{{{{{\rm{Mu}}}}}}}(n) \,=\, {Z}_{l}^{{{{{{\rm{Mu}}}}}}}(n)+\mathop{\sum}\limits_{(u,v):{{{{{\rm{All}}}}}}\,{{{{{\rm{contacts}}}}}}}\left({Z}_{l}^{{{{{{\rm{Mu\,}}}}}}(u,v)}(n)-{Z}_{l}^{{{{{{\rm{Mu}}}}}}}(n)\right)\exp \left(\frac{{S^{\prime} }^{(u,v)}(n)}{{k}_{{{{{{\rm{B}}}}}}}}\right)$$
(27)

where Mu denotes mutation at the l-th residue,

$${Z}_{l}^{{{{{{\rm{Mu}}}}}}}(n)={{{{{{\rm{Tr}}}}}}}_{n}\exp \left[-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}\left(H+\Delta {H}_{l}-T\mathop{\sum }\limits_{i=1}^{N}{S}_{i}{m}_{i}\right)\right]$$
(28)
$${Z}_{l}^{{{{{{\rm{Mu\,}}}}}}(u,v)}(n)={{{{{{\rm{Tr}}}}}}}_{n}\exp \left[-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}\left({H}^{(u,v)}+\Delta {H}_{l}^{(u,v)}+{\varepsilon ^{\prime} }^{(u,v)}+\Delta {\varepsilon ^{\prime} }_{l}^{(u,v)}-T\mathop{\sum }\limits_{i=1}^{N}{S}_{i}{m}_{i}\right)\right]$$
(29)

and:

$$\Delta {\varepsilon ^{\prime} }_{l}^{(u,v)}=\left\{\begin{array}{c}{\eta }_{u\pm 1,v}\,\hfill\\ {\eta }_{u,v-1}+{\eta }_{u,v}+{\eta }_{u,v+1}\\ {\eta }_{u,v\pm 1}\hfill\\ {\eta }_{u-1,v}+{\eta }_{u,v}+{\eta }_{u+1,v}\\ 0\end{array}\,\begin{array}{c}(l=u\pm 1)\hfill\\ (l=u)\hfill\\ (l=v\pm 1)\hfill\\ (l=v)\hfill\\ ({{{{{\rm{otherwise}}}}}})\end{array}\right.$$
(30)

Φth,l is obtained as follows:

$${\Phi }_{{{{{{\rm{th}}}}}},l}(n)=\frac{-{k}_{{{{{{\rm{B}}}}}}}T\,\ln[{Z}_{{{{{{\rm{L}}}}}},l}^{{{{{{\rm{Mu}}}}}}}(n)/{Z}_{{{{{{\rm{L}}}}}},l}^{{{{{{\rm{Mu}}}}}}}(0)]-(-{k}_{{{{{{\rm{B}}}}}}}T\,\ln[{Z}_{{{{{{\rm{L}}}}}}}(n)/{Z}_{{{{{{\rm{L}}}}}}}(0)])}{-{k}_{{{{{{\rm{B}}}}}}}T\,\ln[{Z}_{{{{{{\rm{L}}}}}},l}^{{{{{{\rm{Mu}}}}}}}(1)/{Z}_{{{{{{\rm{L}}}}}},l}^{{{{{{\rm{Mu}}}}}}}(0)]-(-{k}_{{{{{{\rm{B}}}}}}}T\,\ln[{Z}_{{{{{{\rm{L}}}}}}}(1)/{Z}_{{{{{{\rm{L}}}}}}}(0)])}$$
(31)

When the value is less than zero, Φth,l is set to zero.

\({Z}_{l}^{{{{{{\rm{Mu\,}}}}}}(u,v)}(n)\) was obtained using the exact solution of the WSME-L model. First, to consider the change in the free energy of a native stretch including the l-th residue connected along the main chain, all weights \({w}_{i,j}\) in the transfer matrices were replaced by \({w}_{i,j}\exp [-({\sum }_{i\le s\le l}{\eta }_{s,l}+{\sum }_{l\le t\le j}{\eta }_{l,t})/{k}_{{{{{{\rm{B}}}}}}}T]\). Second, to consider the change in free energy of a native stretch connected via a linker to another native stretch including the l-th residue, all weights \({w}_{{{{{{\rm{L}}}}}}}^{\alpha,\beta,\gamma,\delta }\) in the transfer matrices were replaced by \({w}_{{{{{{\rm{L}}}}}}}^{\alpha,\beta,\gamma,\delta }\exp [-({\sum }_{\gamma \le t\le \delta }{\eta }_{l,t})/{k}_{{{{{{\rm{B}}}}}}}T]\) when \(\alpha \le l\le \beta\) or by \({w}_{{{{{{\rm{L}}}}}}}^{\alpha,\beta,\gamma,\delta }\exp [-({\sum }_{\alpha \le s\le \beta }{\eta }_{s,l})/{k}_{{{{{{\rm{B}}}}}}}T]\) when \(\gamma \le l\le \delta\). This modification of the weights allowed us to calculate \({Z}_{l}^{{{{{{\rm{Mu\,}}}}}}(u,v)}(n)\) efficiently.

Among the experimentally determined Φ-values, we used the data of the mutants with a destabilization free energy \(\Delta \Delta {G}_{{{{{{\rm{NU}}}}}}}^{\exp }\) of more than 0.7 kcal/mol by amino acid substitution to compare the predictions with those of the experiments68.

Degree of disulfide bond formation in WSME-L(SS) model

To calculate the degree of disulfide bond formation between residues i and j during folding, \({\Phi }_{i,j}^{{{{{{\rm{SS}}}}}}}(n)\), we defined the perturbation at the contact between residues i and j that form a disulfide bond as follows:

$$\left\{\begin{array}{c}\Delta {H}_{i,j}={\eta }_{i,j}{m}_{i,j}\hfill\\ \Delta {H}_{i,j}^{(u,v)}={\eta }_{i,j}\lceil ({m}_{i,j}+{m}_{i,j}^{(u,v)})/2\rceil \end{array}\right.$$
(32)

Similar to the calculation of Φth-values, the partition function with the perturbation is described by:

$${Z}_{{{{{{\rm{L}}}}}},i,j}^{{{{{{\rm{Mu}}}}}}}(n)={Z}_{i,j}^{{{{{{\rm{Mu}}}}}}}(n)+\mathop{\sum}\limits_{(u,v):{{{{{\rm{All}}}}}}\,{{{{{\rm{contacts}}}}}}}\left({Z}_{i,j}^{{{{{{\rm{Mu\,}}}}}}(u,v)}(n)-{Z}_{i,j}^{{{{{{\rm{Mu}}}}}}}(n)\right)\exp \left(\frac{{S^{\prime} }^{(u,v)}(n)}{{k}_{{{{{{\rm{B}}}}}}}}\right)$$
(33)

where:

$${Z}_{i,j}^{{{{{{\rm{Mu}}}}}}}(n)={{{{{{\rm{Tr}}}}}}}_{n}\exp \left[-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}\left(H+\Delta {H}_{i,j}-T\mathop{\sum }\limits_{i=1}^{N}{S}_{i}{m}_{i}\right)\right]$$
(34)
$${Z}_{i.j}^{{{{{{\rm{Mu\,}}}}}}(u,v)}(n)={{{{{{\rm{Tr}}}}}}}_{n}\exp \left[-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}\left({H}^{(u,v)}+\Delta {H}_{i,j}^{(u,v)}+{\varepsilon ^{\prime} }^{(u,v)}+\Delta {\varepsilon ^{\prime} }_{i,j}^{(u,v)}-T\mathop{\sum }\limits_{i=1}^{N}{S}_{i}{m}_{i}\right)\right]$$
(35)

and:

$$\Delta {\varepsilon ^{\prime} }_{i,j}^{(u,v)}=\left\{\begin{array}{c}{\eta }_{u,v}\,\\ {\eta }_{u\pm 1,v}\\ {\eta }_{u,v\pm 1}\\ 0\end{array}\,\begin{array}{c}(i=u,\,j=v)\\ (i=u\pm 1,\,j=v)\\ (i=u,\,j=v\pm 1)\\ ({{{{{\rm{otherwise}}}}}})\end{array}\right.$$
(36)

\({Z}_{i,j}^{{{{{{\rm{Mu\,}}}}}}(u,v)}(n)\) was obtained using the exact solution of the WSME-L model. \({\Phi }_{i,j}^{{{{{{\rm{SS}}}}}}}(n)\) was calculated as follows:

$${\Phi }_{i,j}^{{{{{{\rm{SS}}}}}}}(n)=\frac{-{k}_{{{{{{\rm{B}}}}}}}T\,\ln{{Z}_{{{{{{\rm{L}}}}}},i,j}^{{{{{{\rm{Mu}}}}}}}(n)}-(-{k}_{{{{{{\rm{B}}}}}}}T\,\ln{{Z}_{{{{{{\rm{L}}}}}}}(n))}}{{\eta }_{i,j}}$$
(37)

Modification of WSME-L model for WSME-L(SSintact) model

The partition function for disulfide-intact folding, \({Z}_{{{{{{\rm{L}}}}}}}^{{{{{{\rm{SSintact}}}}}}}(n)\), was constructed by slightly modifying \({Z}_{{{{{{\rm{L}}}}}}}(n)\) for the WSME-L model. First, the Hamiltonian with two linkers at the two residue pairs (u1, v1) and (u2, v2) was given by:

$${H}^{({u}_{1},{v}_{1}),({u}_{2},{v}_{2})}(\{m\})=\mathop{\sum }\limits_{i=1}^{N-1}\mathop{\sum }\limits_{j=i+1}^{N}{\varepsilon }_{i,j}\left\lceil \frac{{m}_{i,j}+{m}_{i,j}^{({u}_{1},{v}_{1})}+{m}_{i,j}^{({u}_{2},{v}_{2})}}{3}\right\rceil$$
(38)

The partition function with double linkers,

$${Z}^{({u}_{1},{v}_{1}),({u}_{2},{v}_{2})}(n)={{{{{{\rm{Tr}}}}}}}_{n}\exp \left[-\frac{1}{{k}_{{{{{{\rm{B}}}}}}}T}({H}^{({u}_{1},{v}_{1}),({u}_{2},{v}_{2})}(\{m\})+{\varepsilon ^{\prime} }^{({u}_{1},{v}_{1})}-T\mathop{\sum }\limits_{i=1}^{N}{S}_{i}{m}_{i})\right]$$
(39)

can be computed using the same solution as for the partition function with a single linker, \({Z}^{(u,v)}(n)\), by expanding the generating function with the insertion of two unit matrices, \(\mathop{\sum }\limits_{k=0}^{{v}_{1}}{|k\rangle }{}^{{v}_{1}}\,{}^{{v}_{1}}\langle k|\) and \(\mathop{\sum }\limits_{k=0}^{{v}_{2}}{|k\rangle }{}^{{v}_{2}}\,{}^{{v}_{2}}\langle k|\), as in Eq. (22), and by introducing an additional weight to the weight stabilized by the linker. Using this equation, the partition function for the disulfide-intact folding was defined as follows:

$${Z}_{{{{{{\rm{L}}}}}}}^{{{{{{\rm{SSintact}}}}}}}(n)= \mathop{\sum}\limits_{({u}_{2},{v}_{2}):\,{{{{{\rm{SS}}}}}}\,{{{{{\rm{bonds}}}}}}}\left[{Z}^{({u}_{2},{v}_{2})}(n)+\mathop{\sum}\limits_{({u}_{1},{v}_{1}):{{{{{\rm{All}}}}}}\,{{{{{\rm{contacts}}}}}}}\left({Z}^{({u}_{1},{v}_{1}),({u}_{2},{v}_{2})}(n)\right.\right. \\ \left.\left.-{Z}^{({u}_{2},{v}_{2})}(n)\right)\exp \left(\frac{{S^{\prime} }^{({u}_{1},{v}_{1})}(n)}{{k}_{{{{{{\rm{B}}}}}}}}\right)\right]$$
(40)

where the summation of (u2, v2) was calculated for all residue pairs that formed a disulfide bond in a protein. For example, (u2, v2) = (6, 127), (30, 115), (76, 94), and (64, 80) for lysozyme, which has disulfide bonds at 6–127, 30–115, 76–94, and 64–80. In contrast, the summation of (u1, v1) was calculated for all nonlocal interactions present in the native state. In addition, the contributions of multiple disulfide bonds were considered by summing the partition functions for all the disulfide bonds.

Computation time

The typical time required to calculate a 2D free energy landscape using the WSME-L model was 13 s for src SH3 domain (60 residues) on a standard desktop computer and 4,680 s for αTS (268 residues) on a 128 CPU parallel supercomputer at the Supercomputer Center, the Institute for Solid State Physics, the University of Tokyo (ISSP). The typical time required to calculate a 2D free energy landscape with the WSME-L(SSintact) model was 74 s for BPTI (58 residues, three disulfide bonds) on a standard desktop computer and 3,180 s for RNase A (124 residues, four disulfide bonds) on the 128 CPU parallel supercomputer at ISSP. The calculation of the degree of residue-specific structure formation for theoretical Φ-value analysis requires the above computation time multiplied by the number of residues.

Kinetic analysis for hen egg-white lysozyme

The microscopic rate constant of the reaction from state X to Y was given by \({k}_{{{{{{\rm{XY}}}}}}}=A({n}_{{{\ddagger}} })\exp (-\Delta {F}_{{{{{{\rm{XY}}}}}}}^{{{\ddagger}} }/{k}_{{{{{{\rm{B}}}}}}}T)\), where \(\Delta {F}_{{{{{{\rm{XY}}}}}}}^{{{\ddagger}} }\) is the free energy difference between state X and the transition state and is obtained from the 2D free energy landscape. \(A({n}_{{{\ddagger}} })\) is a pre-exponential factor for crossing the transition state at order parameter \({n}_{{{\ddagger}} }\). This value was calculated as \(A(n)={A}_{0}/(1+5Q(n))\) considering internal friction23,69. \(Q\) is the degree of contact formation during the folding process. Since this number is approximately proportional to the square of the number of residues, we used \(Q={n}^{2}\,(0\le n\le 1,\,0\le Q\le 1)\) for simplicity. \({A}_{0}\) was determined to simulate the experimentally observed folding rates at 293 K52. The time-dependent changes in the concentrations of U, I1, I2, and N were calculated as described below.

The kinetic analysis was performed according to the following scheme:

(41)

which gives the following matrix equation:

$$\frac{d}{dt}{{{{{\bf{C}}}}}}(t)=-{{{{{\bf{M}}}}}}{{{{{\bf{C}}}}}}(t)$$
(42)

where \({{{{{\bf{C}}}}}}=\left(\begin{array}{l}{{{{{\rm{U}}}}}}\\ {{{{{{\rm{I}}}}}}}_{1}\\ {{{{{{\rm{I}}}}}}}_{2}\\ {{{{{\rm{N}}}}}}\end{array}\right)\) and \({{{{{\bf{M}}}}}}=-\left(\begin{array}{cccc}-{k}_{{{{{{{\rm{UI}}}}}}}_{1}} & {k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{\rm{U}}}}}}} & 0 & 0\\ {k}_{{{{{{{\rm{UI}}}}}}}_{1}} & -{k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{\rm{U}}}}}}}-{k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{{\rm{I}}}}}}}_{2}}-{k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{\rm{N}}}}}}} & {k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{{\rm{I}}}}}}}_{1}} & {k}_{{{{{{{\rm{NI}}}}}}}_{1}}\\ 0 & {k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{{\rm{I}}}}}}}_{2}} & -{k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{{\rm{I}}}}}}}_{1}}-{k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{\rm{N}}}}}}} & {k}_{{{{{{{\rm{NI}}}}}}}_{2}}\\ 0 & {k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{\rm{N}}}}}}} & {k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{\rm{N}}}}}}} & -{k}_{{{{{{{\rm{NI}}}}}}}_{1}}-{k}_{{{{{{{\rm{NI}}}}}}}_{2}}\end{array}\right)\).

Under native conditions, where the free energy barriers between N and TS3 and between N and TS2 are large, both \({k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{\rm{N}}}}}}}\gg {k}_{{{{{{{\rm{NI}}}}}}}_{1}}\) and \({k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{\rm{N}}}}}}}\gg {k}_{{{{{{{\rm{NI}}}}}}}_{2}}\) are satisfied. Then, we can neglect the unfolding of the native state, as follows:

(43)

In this scheme, the concentrations of the native states formed through Pathways 1 and 2 are denoted as Np1 and Np2, respectively. Here, vector C and the matrix M in Eq. (42) are written as follows:

$${{{{{\bf{C}}}}}}=\left(\begin{array}{l}{{{{{\rm{U}}}}}}\\ {{{{{{\rm{I}}}}}}}_{1}\\ {{{{{{\rm{I}}}}}}}_{2}\\ {{{{{{\rm{N}}}}}}}_{{{{{{\rm{p1}}}}}}}\\ {{{{{{\rm{N}}}}}}}_{{{{{{\rm{p2}}}}}}}\end{array}\right)\,{{{{{\rm{and}}}}}}\,{{{{{\bf{M}}}}}}=-\left(\begin{array}{ccccc}-{k}_{{{{{{{\rm{UI}}}}}}}_{1}} & {k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{\rm{U}}}}}}} & 0 & 0 & 0\\ {k}_{{{{{{{\rm{UI}}}}}}}_{1}} & -{k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{\rm{U}}}}}}}-{k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{{\rm{I}}}}}}}_{2}}-{k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{{\rm{N}}}}}}}_{{{{{{\rm{p2}}}}}}}} & {k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{{\rm{I}}}}}}}_{1}} & 0 & 0\\ 0 & {k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{{\rm{I}}}}}}}_{2}} & -{k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{{\rm{I}}}}}}}_{1}}-{k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{{\rm{N}}}}}}}_{{{{{{\rm{p1}}}}}}}} & 0 & 0\\ 0 & 0 & {k}_{{{{{{{\rm{I}}}}}}}_{2}{{{{{{\rm{N}}}}}}}_{{{{{{\rm{p1}}}}}}}} & 0 & 0\\ 0 & {k}_{{{{{{{\rm{I}}}}}}}_{1}{{{{{{\rm{N}}}}}}}_{{{{{{\rm{p2}}}}}}}} & 0 & 0 & 0\end{array}\right)$$
(44)

The matrix equation was solved analytically using Mathematica 12.2 (Wolfram), and the time-dependent changes in U, I1, I2, Np1, Np2, and N (= Np1 + Np2) were obtained under the following initial conditions: U(0) = 1 and I1(0) = I2(0) = Np1(0) = Np2(0) = 0.

Time evolution of domain-specific structure formation in lysozyme

The theoretical Φ-values provided the degree of residue-specific structure formation in the U, I1, I2, and N states of the lysozyme. Using these values, the degrees of α- and β-domain structure formation, Φth,α and Φth,β, were obtained by calculating the average of the Φth-values for the residues involved in the α- and β-domains, respectively. For comparison with experiments49, the following residues were used for the α-domain: residues 8, 10, 11, 12, 13, 17, 23, 27, 28, 29, 31, 34, 36, 37, 38, 39, 92, 93, 94, 95, 96, 97, 99, 108, 111, 112, 115, 123, 124, and 125, while the following residues were used for the β-domain: residues 40, 42, 44, 50, 52, 53, 56, 58, 61, 63, 64, 65, 75, 76, 78, 82, 83, and 84. Combined with the kinetic analysis, the time evolution of the domain-specific structure formation for the α- and β-domains, \({p}_{{{{{{\rm{\alpha }}}}}}}(t)\) and \({p}_{{{{{{\rm{\beta }}}}}}}(t)\), respectively, during folding was obtained as follows:

$$\left\{\begin{array}{c}{p}_{{{{{{\rm{\alpha }}}}}}}(t)={\Phi }_{{{{{{\rm{th}}}}}},{{{{{\rm{\alpha }}}}}}}({n}_{{{{{{\rm{U}}}}}}}){{{{{\rm{U}}}}}}(t)+{\Phi }_{{{{{{\rm{th}}}}}},{{{{{\rm{\alpha }}}}}}}({n}_{{{{{{{\rm{I}}}}}}}_{1}}){{{{{{\rm{I}}}}}}}_{1}(t)+{\Phi }_{{{{{{\rm{th}}}}}},{{{{{\rm{\alpha }}}}}}}({n}_{{{{{{{\rm{I}}}}}}}_{2}}){{{{{{\rm{I}}}}}}}_{2}(t)+{\Phi }_{{{{{{\rm{th}}}}}},{{{{{\rm{\alpha }}}}}}}({n}_{{{{{{\rm{N}}}}}}}){{{{{\rm{N}}}}}}(t)\\ {p}_{{{{{{\rm{\beta }}}}}}}(t)={\Phi }_{{{{{{\rm{th}}}}}},{{{{{\rm{\beta }}}}}}}({n}_{{{{{{\rm{U}}}}}}}){{{{{\rm{U}}}}}}(t)+{\Phi }_{{{{{{\rm{th}}}}}},{{{{{\rm{\beta }}}}}}}({n}_{{{{{{{\rm{I}}}}}}}_{1}}){{{{{{\rm{I}}}}}}}_{1}(t)+{\Phi }_{{{{{{\rm{th}}}}}},{{{{{\rm{\beta }}}}}}}({n}_{{{{{{{\rm{I}}}}}}}_{2}}){{{{{{\rm{I}}}}}}}_{2}(t)+{\Phi }_{{{{{{\rm{th}}}}}},{{{{{\rm{\beta }}}}}}}({n}_{{{{{{\rm{N}}}}}}}){{{{{\rm{N}}}}}}(t)\end{array}\right.$$
(45)

where nU, \({n}_{{{{\rm{I}}}}_{1}}\), \({n}_{{{{\rm{I}}}}_{2}}\), and nN are the order parameters of the U, I1, I2, and N states, respectively.

Thermodynamics of lysozyme folding

The temperature dependence of the heat capacity C(T) was obtained from the partition function as follows:

$$C(T)=\frac{d}{dT}\left({k}_{{{{{{\rm{B}}}}}}}{T}^{2}\frac{d\,{\ln{Z}}}{dT}\right)$$
(46)

For comparison with the experimental data from differential scanning calorimetry54, 8 kcal/(mol·K) was added as a baseline.

According to previous studies on solvation free energy, the temperature dependence of the contact energy size, \(\varepsilon\), of lysozyme was defined as follows:

$$\varepsilon=\varepsilon (T)={\varepsilon }_{{{{{{\rm{f}}}}}}}+p\left(1-\frac{T}{{T}_{{{{{{\rm{f}}}}}}}}\right)-q\left[\left(1-\frac{T}{{T}_{{{{{{\rm{f}}}}}}}}\right)+\frac{T}{{T}_{{{{{{\rm{f}}}}}}}}\,\ln{\frac{T}{{T}_{{{{{{\rm{f}}}}}}}}}\right]$$
(47)

where Tf was 293 K70. \({\varepsilon }_{{{{{{\rm{f}}}}}}}[=\varepsilon ({T}_{{{{{{\rm{f}}}}}}})]\) was set to 1.783 to satisfy the stability of lysozyme (\(\Delta {G}_{{{{{{\rm{NU}}}}}}}=16{k}_{{{{{{\rm{B}}}}}}}T\)) at Tf. p and q were set to –0.307 and 11.3 to satisfy \(\Delta {G}_{{{{{{\rm{NU}}}}}}}({T}_{{{{{{\rm{m}}}}}}})=0\) at a midpoint temperature of thermal unfolding (Tm = 350 K) and match the heat capacity observed in experiments54.

Simultaneous introduction of multiple disulfide bonds in lysozyme

Generalization of the Hamiltonian with two linkers in the WSME-L(SSintact) model yields a Hamiltonian for proteins with multiple linkers. The presence of L linkers at the residue pairs of (u1, v1), (u2, v2), ···, and (uL, vL), in addition to the main chain, provides (L + 1) possible ways to connect the residues i and j. Accordingly, the Hamiltonian for multiple linkers is described as follows:

$${H}_{{{{{{\rm{ML}}}}}}}(\{m\})=\mathop{\sum }\limits_{i=1}^{N-1}\mathop{\sum }\limits_{j=i+1}^{N}{\varepsilon }_{i,j}\left \lceil \left({m}_{i,j}+\mathop{\sum }\limits_{k=1}^{L}{m}_{i,j}^{({u}_{k},{v}_{k})}\right)/(L+1)\right \rceil$$
(48)

The partition function for this Hamiltonian can be solved by repeatedly applying the exact solution for the partition function with a single linker \({Z}^{(u,v)}(n)\), as many times as the number of linkers. We extracted the appropriate weights from the generating function and introduced additional weights corresponding to the new linkers into the original weights similarly extracted. However, as the number of linkers increased, the formula became more complex, and the computation time increased.

The folding processes of disulfide-intact lysozyme obtained using this method (Supplementary Fig. 44) were almost the same as those obtained using the WSME-L(SSintact) model (Fig. 5). As the simultaneous introduction of more than three disulfide bonds complicates the calculations, the WSME-L(SSintact) model is suitable for predicting the folding of disulfide-intact proteins.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.