PFDB: A standardized protein folding database with temperature correction

We constructed a standardized protein folding kinetics database (PFDB) in which the logarithmic rate constants of all listed proteins are calculated at the standard temperature (25 °C). A temperature correction based on the Eyring–Kramers equation was introduced for proteins whose folding kinetics were originally measured at temperatures other than 25 °C. We verified the temperature correction by comparing the logarithmic rate constants predicted and experimentally observed at 25 °C for 14 different proteins, and the results demonstrated that the correction improves the quality of the database. PFDB consists of 141 (89 two-state and 52 non-two-state) single-domain globular proteins, making it the largest of the currently available databases of protein folding kinetics. PFDB is thus intended to be used as a standard for developing and testing future predictive and theoretical studies of protein folding. PFDB can be accessed from the following link: http://lee.kias.re.kr/~bala/PFDB.

Scientific Reports | (2019) 9:1588 | https://doi.org/10.1038/s41598-018-36992-y
We carefully examined each entry in the AG dataset. If no updated protein folding kinetics data were available for a protein, we included its entry as such in PFDB; otherwise, we replaced it with the updated data. Furthermore, we added the data of 33 new proteins to PFDB from our own collection based on an extensive literature search, resulting in 141 globular proteins (89 two-state (2S) and 52 non-two-state (N2S) proteins) in our dataset (see Methods for details of the database construction). Our dataset lists the following items: (i) the protein short name with a reference to the original experimental paper(s) on the folding kinetics, (ii) the PDB code, (iii) the structural class (α, β, α/β, and α + β), (iv) the fold in the SCOP classification 21 (http://scop.mrc-lmb.cam.ac.uk/scop/), (v) the number of residues in the PDB structure (L PDB ), (vi) the actual number of residues of the protein used in the folding experiment (L), (vii) the experimental conditions (pH and temperature), (viii) the folding type (2S or N2S), (ix) the ln(k f ) value reported, (x) the ln(k f ) value after the temperature correction for the proteins whose folding experiments were carried out at a temperature other than 25 °C, (xi) the logarithmic rate constant of formation of a folding intermediate, ln(k I ), when the value is available in the literature (only for N2S proteins), (xii) the ln(k u ) value reported, (xiii) the ln(k u ) value after the temperature correction, and (xiv) the Tanford β (β T ) value, which is defined as β T = 1 − (m u ‡ /m NU ), where m u ‡ (kJ/mol/M) and m NU (kJ/mol/M) are the denaturant concentration dependence of the activation free energy of unfolding and the denaturant concentration dependence of the unfolding free energy from the native (N) to the fully unfolded (U) state, respectively 22 .
The ln(k f ), ln(k I ) and ln(k u ) values listed in PFDB are those in the absence of denaturant, usually obtained by linear extrapolation of the logarithmic rate constant along denaturant concentration.
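As a small illustration of item (xiv), the Tanford β definition quoted above can be evaluated directly from the two m values. The sketch below is only an illustration: the function name `tanford_beta` and the numerical m values are hypothetical and are not taken from PFDB.

```python
def tanford_beta(m_u_act: float, m_nu: float) -> float:
    """Tanford beta as defined in the text: beta_T = 1 - (m_u_act / m_nu).

    m_u_act: denaturant dependence of the activation free energy of
             unfolding (kJ/mol/M).
    m_nu:    denaturant dependence of the unfolding free energy from
             N to U (kJ/mol/M).
    """
    if m_nu == 0:
        raise ValueError("m_nu must be non-zero")
    return 1.0 - m_u_act / m_nu


# Hypothetical m values, chosen only to show the arithmetic.
beta_t = tanford_beta(m_u_act=1.5, m_nu=5.0)
print(f"beta_T = {beta_t:.2f}")
```

A β T close to 1 places the transition state near the native state in terms of denaturant m values, which is the sense in which β T is used for the unfolding correction later in the text.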
In PFDB, the folding type is thus clearly specified. The proteins that exhibited a stable folding intermediate during the kinetic folding process were classified as N2S proteins, while the proteins that exhibited single-exponential folding kinetics without stable intermediates were classified as 2S proteins, even if the existence of an unstable high-energy intermediate was expected from curvature in the unfolding limb or the folding limb of the chevron plot 23 . To discriminate the 2S proteins with a high-energy intermediate from the other 2S proteins, the former proteins were denoted by 2S*. Each entry of the AG dataset is also included in PFDB for comparison. A comment section is provided in the final column of the dataset and explains discrepancies between the present and the AG datasets where necessary. Figure 1 depicts a snapshot of our dataset shown in the PFDB homepage.
The protein composition in PFDB in terms of the folding type and the structural class is given in Table 1. It shows that both the 2S and N2S proteins cover all four structural classes of globular proteins. However, the 2S proteins contain only one α/β protein.
Figure 1. A snapshot of our dataset in the PFDB homepage. For each protein, our dataset lists (i) protein short name, (ii) PDB code, (iii) structural class (α, β, α/β, and α + β), (iv) folds in the SCOP classification, (v) the number of residues in the PDB structure (L PDB ), (vi) the actual number of residues of the protein used in the folding experiment (L), (vii) experimental conditions (pH and temperature), (viii) folding type (2S or N2S), (ix) ln(k f ) reported, (x) ln(k f ) after temperature correction, (xi) ln(k I ) (only for N2S proteins), (xii) ln(k u ) reported, (xiii) ln(k u ) after temperature correction, and (xiv) Tanford β (β T ). The AG dataset is also included in our database for comparison. A comment section is provided in the final column.
Temperature correction. Figure 2A shows the distribution of the temperatures at which ln(k f ) was determined experimentally for the proteins in our dataset. Among the 141 proteins in PFDB, 99 were measured at the standard temperature T 0 (25 °C = 298.15 K), but the other 42 (24 2S and 18 N2S proteins) were measured at different temperatures (T x ). The T x values ranged from 5 °C to 75 °C. To keep the folding temperature consistent across PFDB, we developed a method for temperature correction. The predicted shape of the Eyring plot of a particular protein is determined by two parameters of the folding or unfolding reaction, the activation heat capacity (ΔC p ‡ ) and the temperature (T H ) at which the activation enthalpy (ΔH ‡ ) is zero. Given these two parameters, the logarithmic rate constant at T 0 , ln[k(T 0 )], is obtained by Eq. (1),
where R is the gas constant, T 0 and T x are absolute temperatures, and ln[k(T x )] is the logarithmic rate constant measured at T x ; the detailed derivation of Eq. (1) is given in Methods. We assumed that ΔC p ‡ is proportional to the heat capacity change (ΔC p ) of the equilibrium protein unfolding. The ΔC p is approximately proportional to the protein chain length in the PDB structure (L PDB ) and is given empirically by a linear regression on L PDB (Eq. 2) 24 . With ΔC p ‡ = β·ΔC p (Eq. 3), where β is the proportionality constant, ln[k(T 0 )] can be predicted by Eqs (1) and (3). It is worth mentioning that Eq. 2 is an empirical one, and theoretically, the ΔC p diminishes to zero when L PDB tends to zero. A regression equation between ΔC p and L PDB with zero intercept has thus also been reported in the original literature, given by ΔC p = 0.058 · L PDB 24 . Whether we used this equation or Eq. 2, the results of the temperature correction were essentially identical for the proteins in our dataset, where L PDB ≥ 34.
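The correction described above can be sketched numerically. The code below is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the Eyring form k ∝ T·exp(−ΔG‡/RT) with a temperature-independent ΔC p ‡ and ΔH‡(T H ) = 0, which gives ln k(T 0 ) = ln k(T x ) + ln(T 0 /T x ) + (ΔC p ‡ /R)·[T H (1/T 0 − 1/T x ) + ln(T 0 /T x )]. It also assumes ΔC p ‡ = β·ΔC p with the zero-intercept regression ΔC p = 0.058·L PDB quoted in the text, with units taken to be kJ/(mol·K). The function names and the example protein are hypothetical.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)
T0 = 298.15         # standard temperature in K


def delta_cp_from_length(l_pdb: int) -> float:
    """Zero-intercept regression quoted in the text: dCp = 0.058 * L_PDB.

    Units are assumed to be kJ/(mol K); the text does not state them.
    """
    return 0.058 * l_pdb


def correct_ln_k(ln_k_tx: float, t_x: float, t_h: float,
                 beta: float, l_pdb: int, t_0: float = T0) -> float:
    """Shift ln k measured at t_x (K) to t_0 (K).

    Assumed form (Eyring prefactor proportional to T, constant
    activation heat capacity, zero activation enthalpy at t_h):
      ln k(t_0) = ln k(t_x) + ln(t_0/t_x)
                  + (dCp_act/R) * [t_h*(1/t_0 - 1/t_x) + ln(t_0/t_x)]
    """
    d_cp_act = beta * delta_cp_from_length(l_pdb)  # activation heat capacity
    log_ratio = math.log(t_0 / t_x)
    shape = t_h * (1.0 / t_0 - 1.0 / t_x) + log_ratio
    return ln_k_tx + log_ratio + (d_cp_act / R) * shape


# Hypothetical 2S protein: 100 residues, ln(k_f) = 5.0 measured at 37 °C,
# corrected with the averaged 2S folding parameters reported in the text
# (T_Hf = 315 K, beta_f = -0.62).
ln_kf_25 = correct_ln_k(ln_k_tx=5.0, t_x=310.15, t_h=315.0,
                        beta=-0.62, l_pdb=100)
print(f"ln k_f at 25 °C: {ln_kf_25:.2f}")
```

Under these assumptions the correction vanishes when T x = T 0 and reduces to ln(T 0 /T x ) when ΔC p ‡ = 0, which are two quick sanity checks on any implementation.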
Temperature correction for folding. We applied the temperature correction to the proteins whose k f values were measured at a temperature other than the standard temperature (298.15 K). First, we found that the Eyring plot or an equivalent plot of folding was well described for 14 2S proteins and 3 N2S proteins; the k f values were measured every few kelvin from ~280 K to ~320 K for most of these proteins 25–41 . Both the T H and β values for folding kinetics, T Hf and β f , respectively, were more or less common among the different 2S proteins (Table 2) and also among the different N2S proteins (Table 3), except for two 2S proteins (1K9Q 40 and 1PIN 41 ), for which −ΔC p ‡ for folding was larger than ΔC p . Therefore, we employed the 12 2S proteins other than these two and the 3 N2S proteins, and from their Eyring plots, we calculated the T Hf and ΔC pf ‡ . Examples of the Eyring plot for three proteins (1APS 34 , 1D6O 35 , and 1AVZ 37 ) are shown in Figure S1. For folding kinetics, the Eyring plot is convex upward, and hence T Hf corresponds to the temperature of the maximum point in the Eyring plot. The ΔC pf ‡ is given by the curvature of the Eyring plot, and β f was thus evaluated by β f = ΔC pf ‡ /ΔC p , where ΔC p was obtained by Eq. (2); ΔC pf ‡ and β f are negative because the Eyring plot is convex upward. The T Hf and β f values were averaged for the 12 2S proteins and for the 3 N2S proteins (Tables 2 and 3). The averaged T Hf and β f values are 315 ± 1 K (standard error estimate) and −0.62 ± 0.03 for the 2S proteins, and 305 ± 4 K and −0.75 ± 0.07 for the N2S proteins.
For the proteins whose T Hf and ΔC pf ‡ were not available directly, we employed Eqs (1) and (3) to predict ln[k f (T 0 )] by assigning the T Hf and β f values to T H and β in the equations. However, for the proteins whose T Hf and ΔC pf ‡ were available (1E0G 28 , 1HDN 30 , 2VH7 29 , 1EHB 27 , 1HCD 31 , and 2CRO 26 ), we directly calculated the ln[k f (T 0 )] values by Eq. (1). To distinguish the ln[k f (T 0 )] predicted by using the averaged T Hf and β f from that directly calculated by Eq. (1) with the known T Hf and ΔC pf ‡ , the latter values are indicated in boldface type in our dataset. It should also be noted that the above T Hf and β f estimates were based on the folding data of proteins from mesophilic organisms, and hence some care may be required when they are applied to thermophilic proteins.
Next, we compared predicted ln[k f (T 0 )] after the temperature correction with the experimentally observed ln[k f (T 0 )]. For 9 2S and 5 N2S proteins (Table 4), which were not included in those used for estimating T Hf and β f , the experimental ln(k f ) was available at both T 0 and T x . We thus applied the temperature correction to the ln[k f (T x )] values using the above T Hf and β f , and compared predicted ln[k f (T 0 )] with the experimentally observed ln[k f (T 0 )]. From Fig. 2B, the predicted ln[k f (T 0 )] values show good agreement with the experimentally observed ones, showing the validity of our temperature correction. Although the number of data points used for this analysis is not very large (only 14 proteins), it may be enough to suggest that the temperature corrections have improved the quality of the database of protein folding.
Denaturant m values, the dependence of the free energy of unfolding on denaturant concentration, are well correlated with the ΔC p of unfolding 42 . We can therefore reasonably assume that β f is equivalent to −β T for 2S proteins. Accordingly, for the 2S proteins for which β T is available, we also calculated the ln[k f (T 0 )] values by assigning the T Hf and −β T values to T H and β in Eqs (1) and (3). The ln[k f (T 0 )] values thus obtained are also listed in PFDB and indicated in italic type to distinguish them from those (in roman type) predicted on the basis of T Hf and β f . As seen from the PFDB dataset, these two types of predicted ln[k f (T 0 )] agree reasonably well with each other.
Temperature correction for unfolding. We applied the temperature correction to the proteins whose k u values were measured at a temperature other than the standard temperature (298.15 K); this required the T H and β values for unfolding kinetics, T Hu and β u , respectively. For unfolding kinetics, the Eyring plot is usually concave upward with a positive β u . For 2S proteins, there is only a single transition state between U and N with a β f of −0.62 ± 0.03, and we can reasonably assume that β u = 1 + β f . Therefore, we find that β u = 0.38 ± 0.03. For N2S proteins, this simple relationship may not hold, because of a contribution from an intermediate (I) state. For the N2S proteins, however, (1 − β T ) is expected to be equivalent to β u , because β T represents the relative position of the transition state between U and N in terms of the denaturant m values. The β T was reported for 38 N2S proteins in PFDB, and their average was estimated at 0.79 ± 0.02, and hence β u = 0.21 ± 0.02 for N2S proteins; 1FTG was excluded from this calculation because the I state was mostly off-pathway in this protein.
The T Hu corresponds to the temperature of the minimum point of the Eyring plot, but this is usually located far below the observable temperature range of unfolding kinetics, leading to a large error in the estimation of T Hu due to a long extrapolation along temperature. Furthermore, the Eyring plot of unfolding is not available for many of the proteins used above for the estimation of T Hf and β f . Therefore, we had to use a different way to estimate T Hu . We thus chose 6 2S proteins (1IMQ 13,43 , 1K9Q 40,44 , 1RFA 45 , 1SS1 46 , 1U4Q 47,48 , and 2WXC 49,50 ) and 3 N2S proteins (1BNI 51 , 1EKG 52 , and 1ENH 53 ), for which the experimental ln(k u ) is available at both T 0 and T x (Table 5). First, we assumed initial T Hu values (e.g., 200 K and 150 K) for 2S and N2S proteins, and assigned these T Hu values and the above β u values to T H and β in Eqs (1) and (3) to calculate tentative predictions of ln[k u (T 0 )] for 2S and N2S proteins. Then, the T Hu values were gradually increased or decreased until the root-mean-square deviation between the experimentally observed ln[k u (T 0 )] and the predicted ln[k u (T 0 )] values was minimized. The optimized T Hu values thus obtained were 224 K and 119 K for the 2S and N2S proteins, respectively. Figure 3 shows a comparison between the experimental ln[k u (T 0 )] values and those predicted by using the above T Hu and β u values, which indicates reasonable agreement between the experimental and predicted values.
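The optimization of T Hu described above amounts to a one-dimensional minimization of the root-mean-square deviation. The sketch below is an illustration under stated assumptions and not the authors' procedure: it replaces the stepwise increase or decrease with a plain grid scan, uses the same assumed Eyring-type correction with ΔC p ‡ = β u ·0.058·L PDB (units assumed to be kJ/(mol·K)), and runs on synthetic records generated from a known T Hu rather than on the Table 5 data. All function names and numerical records are hypothetical.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)
T0 = 298.15         # standard temperature in K


def correction(t_x, t_h, d_cp_act, t_0=T0):
    """Additive shift ln k(t_0) - ln k(t_x) under the assumed Eyring form."""
    log_ratio = math.log(t_0 / t_x)
    return log_ratio + (d_cp_act / R) * (t_h * (1.0 / t_0 - 1.0 / t_x) + log_ratio)


def rmsd(t_h, records, beta_u):
    """RMSD between predicted and observed ln k_u(T0).

    records: iterable of (ln_k_tx, t_x, l_pdb, ln_k_t0_observed) tuples.
    """
    total = 0.0
    for ln_k_tx, t_x, l_pdb, ln_k_t0_obs in records:
        d_cp_act = beta_u * 0.058 * l_pdb  # assumed zero-intercept regression
        predicted = ln_k_tx + correction(t_x, t_h, d_cp_act)
        total += (predicted - ln_k_t0_obs) ** 2
    return math.sqrt(total / len(records))


def optimize_t_h(records, beta_u, lo=50.0, hi=290.0, step=1.0):
    """Grid scan over candidate T_H values; returns the one with minimal RMSD."""
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda t_h: rmsd(t_h, records, beta_u))


def synthetic_records(true_t_h, beta_u):
    """Build noise-free toy records consistent with a known T_H (not real data)."""
    conditions = [(283.15, 60), (310.15, 90), (318.15, 120)]  # (T_x in K, L_PDB)
    records = []
    for i, (t_x, l_pdb) in enumerate(conditions):
        ln_k_t0 = -5.0 - i  # arbitrary "observed" value at T0
        shift = correction(t_x, true_t_h, beta_u * 0.058 * l_pdb)
        records.append((ln_k_t0 - shift, t_x, l_pdb, ln_k_t0))
    return records


records = synthetic_records(true_t_h=224.0, beta_u=0.38)
best_t_h = optimize_t_h(records, beta_u=0.38)
print(f"T_Hu with minimal RMSD: {best_t_h} K")
```

Because the assumed correction is linear in T H , the RMSD has a single minimum, so a coarse scan followed by a finer one around the best candidate would be sufficient in practice.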
For the proteins whose T Hu and ΔC pu ‡ were not available directly, we thus employed Eqs (1) and (3) to predict the ln[k u (T 0 )] by assigning the T Hu and β u values to T H and β in the equations. However, for the proteins whose T Hu and ΔC pu ‡ were available (1EHB 27 and 1HCD 31 ), we directly calculated the ln[k u (T 0 )] values by Eq. (1). To distinguish the ln[k u (T 0 )] predicted by using the optimized T Hu and β u from that directly calculated by Eq. (1) with the known T Hu and ΔC pu ‡ , the latter values are indicated in boldface type in our dataset. For the 2S proteins for which the β T is available, we also calculated the ln[k u (T 0 )] values by assigning the T Hu and (1 − β T ) values to T H and β in Eqs (1) and (3). The ln[k u (T 0 )] values thus obtained are also listed in PFDB and indicated in italic type to distinguish them from those (in roman type) predicted on the basis of T Hu and β u . As seen from the PFDB dataset, these two types of predicted ln[k u (T 0 )] agree reasonably well with each other.
Table 5. List of proteins used for predicting ln(k u ) at 25 °C. Normal font and bold, respectively, represent the 2S and N2S proteins.

Conclusions
In this study, we have constructed PFDB, a systematically compiled, standardized database of protein folding kinetics. It is currently the most up-to-date such database and has the largest number of unique entries. The temperature correction method has improved the quality of the dataset. Therefore, our dataset can be used as a standard for developing and testing future predictive and theoretical studies of protein folding kinetics.

Methods
Construction of the AG dataset. The most recent datasets of protein folding kinetics are ACPro 19 and the Garbuzynskiy dataset 17 . Prior to the filtering processes shown below, the ACPro dataset contained 126 proteins. Among these, we weeded out proteins with fewer than 34 residues (1PGB (41-56), 1L2Y and 3M48), proteins with disulfide bonds (2HQI, 1HEL, 1E65 and 1HMK), proteins with a covalently-bound prosthetic group (1YCC, 1YEA, 256B and 1HRC), proteins with irrelevant rate constants (i.e., the rate constant for formation of an intermediate instead of the actual folding rate constant (k f ) for a few proteins (1AON, 1BD8 and 1JON)), and proteins whose k f was reported in the presence of denaturant (1QOP chain B). In the case of ileal lipid binding protein, the actual folding experiment was performed on the rat protein, but its PDB coordinates were not available at the time of our database creation. Instead, the reported PDB ID of 1EAL is the pig protein, which has 71.1% sequence identity with the rat protein. Since the exact PDB coordinates were not available, we excluded this protein as well as another protein without experimental references (1PSF). Furthermore, 6 proteins had duplicate entries (1NTI-2FDQ, 1SRL-1FMK, 1BF4-1BNZ, 1POH-2HPR, 1O6X-1PBA and 1EAL-2EAL), which we corrected. These filtering processes reduced the size of the ACPro dataset from 126 to 102 proteins. We then applied the same filtering scheme to the Garbuzynskiy dataset (107 proteins), where we weeded out proteins with fewer than 34 residues (1L2Y, 1T8J, 1PGB (41-56), and the 3rd entry in the Garbuzynskiy dataset), proteins with irrelevant rate constants (1AON and 1BD8), the protein 1EAL (the reason is given above), and a protein with a covalently-bound prosthetic group (256B). This change reduced the size of the Garbuzynskiy dataset from 107 to 99 proteins.
When we compared the updated Garbuzynskiy (99 proteins) and ACPro (102 proteins) datasets, 6 unique proteins (1IFC, 1CBI, 1IGS, 1OPA, 2MYO and 3H08) were identified in the Garbuzynskiy dataset. Therefore, we added these 6 proteins to the ACPro dataset, and collectively named it the AG dataset (108 proteins).
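The filtering and merging steps above can be pictured as a simple filter followed by a union on PDB codes. The sketch below is only an illustration with placeholder entries: the `Entry` fields, the function names, and the IDs are hypothetical, and only three of the exclusion criteria from the text are modelled (chain length below 34 residues, disulfide bonds, covalently bound prosthetic group).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical minimal record; real PFDB entries carry many more fields."""
    pdb_id: str
    n_residues: int
    has_disulfide: bool = False
    has_prosthetic_group: bool = False


def passes_filters(entry: Entry, min_residues: int = 34) -> bool:
    """Apply three of the exclusion criteria named in the text."""
    return (entry.n_residues >= min_residues
            and not entry.has_disulfide
            and not entry.has_prosthetic_group)


def merge_datasets(primary, secondary):
    """Filter both datasets, then add secondary entries not already in primary."""
    merged = [e for e in primary if passes_filters(e)]
    seen = {e.pdb_id for e in merged}
    for e in secondary:
        if passes_filters(e) and e.pdb_id not in seen:
            merged.append(e)
            seen.add(e.pdb_id)
    return merged


# Placeholder entries (IDs and values are made up for illustration).
acpro_like = [
    Entry("AAA1", 64),
    Entry("AAA2", 20),                        # too short, removed
    Entry("AAA3", 129, has_disulfide=True),   # disulfide bonds, removed
]
garbuzynskiy_like = [
    Entry("AAA1", 64),                        # already present
    Entry("BBB1", 131),                       # unique, added
    Entry("BBB2", 106, has_prosthetic_group=True),  # removed
]

merged = merge_datasets(acpro_like, garbuzynskiy_like)
print([e.pdb_id for e in merged])
```

Keying the union on the PDB code mirrors how the text identifies the 6 proteins unique to the Garbuzynskiy dataset; duplicate entries listed under two different PDB codes would need a separate mapping step that is not shown here.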

Data collection and construction of PFDB.
We manually collected the data of protein folding and unfolding kinetics by an extensive literature search. Then we compared our collected data with those of the AG dataset. We carefully examined the data of each entry of the AG dataset; when no updated data existed, the data of that entry were included as such in PFDB, and otherwise they were replaced by the updated data. Finally, we added the data of 33 new proteins to PFDB from our own collection. Of these 33 proteins, 19 are 2S proteins (1DKT, 1FGA, 1IO2, 1KDX, 1NFI, 1QAU, 1RG8, 2BKF, 2GA5, 2J5A, 2JMC, 2LLH, 2L6R, 2WQG, 3O48, 3O49, 3O4D, 3ZRT (N-terminal), and 3ZRT (C-terminal)), with the remaining 14 being N2S proteins (1DWR, 1EKG, 1FA3, 1HRH, 1OKS, 1THF, 1UCH, 2BJD, 2FS6, 2KDI, 2KLL, 2X7Z, 3BLM, and 5L8I). For 4 proteins (1RA9, 1B9C, 1FA3, and 2PQE), the presence of multiple parallel pathways of folding has been reported 54–56 , and the k f value was obtained by averaging the rate constant values along the individual pathways, where f i and k i are the fractional amplitude and the observed rate constant, respectively, for the i-th pathway of folding; the ln(k f ) values thus obtained are listed in our dataset.
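For the proteins with parallel folding pathways, one plausible reading of the averaging described above is an amplitude-weighted arithmetic mean, k f = Σ f i ·k i . This specific form is an assumption made here for illustration; a different weighting (for example, averaging ln k i ) would give different numbers. The function name and the example values below are hypothetical.

```python
import math


def averaged_ln_kf(amplitudes, rate_constants):
    """ln of the amplitude-weighted mean rate constant.

    Assumes k_f = sum_i f_i * k_i, with the fractional amplitudes f_i
    renormalized to sum to 1. This weighting is an assumption.
    """
    if len(amplitudes) != len(rate_constants) or not amplitudes:
        raise ValueError("need one amplitude per rate constant")
    total = sum(amplitudes)
    if total <= 0:
        raise ValueError("amplitudes must sum to a positive value")
    k_f = sum((f / total) * k for f, k in zip(amplitudes, rate_constants))
    return math.log(k_f)


# Hypothetical two-pathway example: equal amplitudes, k = 10 and 30 s^-1.
print(f"{averaged_ln_kf([0.5, 0.5], [10.0, 30.0]):.3f}")
```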
The ln(k f ), ln(k I ) and ln(k u ) values listed in PFDB are those in the absence of denaturant, usually obtained by linear extrapolation of the logarithmic rate constants along molar denaturant concentration. However, for 5 N2S proteins (1PHP (1-175) 57 , 1PHP (186-394) 58 , 1L63 59 , 1HNG 60 , and 1TTG 61 ), the equilibria and kinetics of folding and unfolding were analyzed in terms of denaturant activity rather than molar concentration. The choice between activity and concentration strongly affects the ln(k u ) estimate, because a long extrapolation from high concentrations of denaturant back to the native condition is required. To keep our dataset consistent, we used the linear extrapolation along the molar concentration, as far as such data were available, to estimate the ln(k u ).
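The linear extrapolation to zero denaturant mentioned above can be sketched as an ordinary least-squares fit of ln k against molar denaturant concentration, with the intercept taken as the value in the absence of denaturant. The sketch below is a minimal illustration: the function name and the data points are hypothetical, and real chevron analyses often fit both limbs together rather than one limb in isolation.

```python
def extrapolate_ln_k(concentrations, ln_k_values):
    """Least-squares line ln k = intercept + slope * [denaturant].

    Returns (intercept, slope); the intercept is the extrapolated
    ln k at zero denaturant.
    """
    n = len(concentrations)
    if n != len(ln_k_values) or n < 2:
        raise ValueError("need at least two matching data points")
    mean_x = sum(concentrations) / n
    mean_y = sum(ln_k_values) / n
    s_xx = sum((x - mean_x) ** 2 for x in concentrations)
    if s_xx == 0:
        raise ValueError("concentrations must not all be equal")
    s_xy = sum((x - mean_x) * (y - mean_y)
               for x, y in zip(concentrations, ln_k_values))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical unfolding-limb data measured at 3 to 6 M denaturant.
conc = [3.0, 4.0, 5.0, 6.0]
ln_ku = [-6.5, -5.0, -3.5, -2.0]
ln_ku_water, slope = extrapolate_ln_k(conc, ln_ku)
print(f"ln k_u in water: {ln_ku_water:.2f}, slope: {slope:.2f} per M")
```

Because the unfolding limb is measured only at high denaturant concentrations, the intercept lies far outside the data range, which is why the choice of abscissa (molar concentration versus activity) matters so much for ln(k u ).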
Derivation of Eq (1) for the temperature correction. In this study, we introduced a method for temperature correction, which gives the folding and unfolding rate constants at 25 °C (k(T 0 ) where T 0 = 298.15 K) for a protein whose rate constant at any temperature (T x ) is known. The following section will describe the derivation of Eq. (1).
According to the Eyring–Kramers equation 20 , we find that: where ΔH ‡ (T H ) and ΔS ‡ (T H ) are the activation enthalpy and the activation entropy, respectively, at a reference temperature T H , and ΔC p ‡ is the activation heat capacity; we assume that ΔC p ‡ is a constant independent of temperature (T). When we set T H to the temperature where ΔH ‡ is zero, i.e., the maximum or minimum point of the Eyring plot, Eq.