Configurational space discretization and free energy calculation in complex molecular systems

We sought to design a free energy calculation scheme with the hope of saving cost for generating dynamical information that is inherent in trajectories. We demonstrated that snapshots in a converged trajectory set are associated with implicit conformers that have invariant statistical weight distribution (ISWD). Since infinite number of sets of implicit conformers with ISWD may be created through independent converged trajectory sets, we hypothesized that explicit conformers with ISWD may be constructed for complex molecular systems through systematic increase of conformer fineness, and tested the hypothesis in lipid molecule palmitoyloleoylphosphatidylcholine (POPC). Furthermore, when explicit conformers with ISWD were utilized as basic states to define conformational entropy, change of which between two given macrostates was found to be equivalent to change of free energy except a mere difference of a negative temperature factor, and change of enthalpy essentially cancels corresponding change of average intra-conformer entropy. By implicitly taking advantage of entropy enthalpy compensation and forgoing all dynamical information, constructing explicit conformers with ISWD and counting thermally accessible number of which for interested end macrostates is likely to be an efficient and reliable alternative end point free energy calculation strategy.


Introduction
For two arbitrary macrostates A and B visited in a set of converged molecular dynamics (MD) simulation trajectories, the free energy difference may be expressed as: (1) with N A(B) snap being observed number of snapshots in macrostate A(B), k B being Boltzmann constant and T being the temperature. However, if a converged MD trajectory set was generated for the sole purpose of calculating free energy differences between interested macrostate pairs, all dynamical information contained would have been discarded. One question we sought to answer is that if there is a way to save computational cost used for generating dynamical information by designing a free energy calculation method without explicit utilization of trajectories. A rarely discussed fact is that each snapshot represents an implicit microscopic volume (termed conformer hereafter) in configurational space. More importantly, equation (Eq. (1)) implies that, in a set of converged trajectories, implicit conformers associated with snapshots have invariant statistical weight distribution (ISWD) across the whole configurational space (see Fig. Figure 1). Therefore, one way to answer our original question is to accomplish the following two tasks: i) to construct a set of configurational-space-filling 1 explicit conformers, with thermally accessible ones among which have the property of ISWD ( or a sufficiently good approximation of it ), and ii) to design an efficient method to count such conformers that are thermally accessible in given macrostates. To be concise, we use "explicit conformers with ISWD (ECISWD)" to represent "configurationalspace-filling explicit conformers, with thermally accessible ones among which have the property of ISWD ( or a sufficiently good approximation of it )" hereafter. For two arbitrary macrostates A and B that have N A con f and N B con f (Note that both are functions of potential energy) thermally accessible conformers, denoting corresponding average statistical weight of conformers as w A and w B , the change of free energy between these two macrostates may be written as: For ECISWD, w A ≈ w B , therefore: It was demonstrated that sequential Monte Carlo (SMC) in combination with importance sampling 1,2 may rapidly count the number of explicit conformers that are thermally accessible. Therefore, the hinging issue is to construct a set of ECISWD. We set to address this issue and accompanying implications in this study.

Hypothesis on ECISWD
Conformers associated with MD snapshots are implicit with no information available for their shapes or sizes, we consequently may not directly learn from MD trajectories. One principal consideration for defining ECISWD is sufficient fineness since statistical weight of complex molecular systems are in general exponentially different for different macrostates, 3 very coarse conformers are associated with the possibility that the heavist conformer in the statistically most dominant macrostate weighs more than the total of all other macrotates, hence rendering ISWD impossible.
Better uniformity is another factor to consider for the same reason. It is noted that ISWD holds for each set of implicit conformers associated with snapshots of corresponding independent and converged MD trajectory set. Therefore, infinite number of ways exist for constructing sets of implicit conformers with ISWD for a given complex molecular systems. Based on this thought, we hypothesized that any set of sufficiently fine and uniform conformers should approximately have the property of ISWD, and we may consequently define ECISWD through systematically increasing their fineness according to our convenience.
This hypothesis is immediately disproved by a simple double well system shown in Fig. Figure 2. With increasingly different ∆U between two wells A and B, regardless of the fineness for any uniformly defined conformers, the statistical weight distribution of which in two macrostates will be increasingly different. The only way to achieve sufficiently good approximate ISWD is to construct conformers that were properly weighted by U , the potential energy surface that we do not know a priori in a real complex molecular system. Nonetheless, complex molecular systems are very different from a double well system. As shown in Fig. Figure 2, if we divide macrostates A and B into N A and N B (e.g. N A = N B = 1000, 000 ) conformers, U is consistently higher in A than in B in terms of conformer average, and within each conformer U is essentially a constant. Such situation is unlikely, if ever possible, to occur in a complex molecular system. With large number of degrees of freedom (DOFs), tight packing and steep van der waals repulsive core of constituting atoms, potential energy may vary significantly within a microscopic volume of configurational space. Therefore, we think that competitions among large number of DOFs may render construction of ECISWD an achievable task, and the above mentioned hypothesis may well be valid for complex molecular systems.
Sufficiently well-converged MD trajectory sets of specific molecular systems provide ideal test grounds for ISWD property of given explicit conformers based on the following two arguments.
Firstly, trajectory sets are generated by known force fields, and therefore no convolution of force fields inaccuracy and experimental error exists as in the case of comparing computational results with experimental ones; Secondly, we may arbitrarily partition configurational space visited in a trajectory set, and a hypothesis tested for arbitrarily given partitions should remain true for the whole configurational space. This is an important logic since traversing configurational space for complex molecular systems is practically impossible. The symbolic equivalence between equation (Eq. (1)) and equation (Eq. (3)) suggests that for a set of ECISWD, if we assign each snapshot in a trajectory set to a corresponding conformer and utilizing equations (Eq. (1)) and (Eq. (3)) respec-tively to calculate free energy changes for arbitrarily selected pairs of macrostates, differences in results caused by different conformer definitions ( between a given explicit conformer set and the implicit one associated with snapshots ) should decrease with increasing size of trajectory set and essentially disappear for a fully converged trajectory set, the reason is that free energy difference between two arbitrarily given macrostates does not depend on the way it is calculated. Conversely, if statistical weight distribution of a set of explicit conformers is widely different in different part of the configurational space, the corresponding differences in results would increase with increasing size of trajectory set and saturate for a fully converged trajectory set since the largest possible error is limited by the number of available snapshots in any trajectory sets that are not fully converged.
Both complete disappearance of differences resulted from equations (Eq. (1)) and (Eq. (3)) for the case of ECISWD, and full saturation of differences resulted from these two equations for the case of explicit conformers without ISWD will be extremely difficult to observe for complex molecular systems duo excessive amount of data needed. Nonetheless, the trend should be equivalently informative as long as the largest trajectory set is sufficiently well-converged.
We chose lipid POPC to carry out such tests based on the fact that large MD trajectory sets are available for this molecule. Specifically, we firstly extracted MD trajectories of POPC from trajectories of M2 muscarinic acetylcholine receptor study. 4 Three increasingly larger trajectory sets, TSA1, TSA2 and TSA3 were constructed with smaller trajectory sets being subsets of larger ones. Secondly, we defined four different sets of conformers, which were denoted as CONF1 through CONF4 (see Fig. Figure 3) respectively, with CONF1 being the finest and CONF4 being the coarsest. Thirdly, we used backbone dihedrals as order parameters to construct macrostates through projection operations. Finally, number of conformers (N con f ) were calculated for each macrostate of the given combination of trajectory set and definition of conformers (see Methods for details).
With the above given definitions of conformers, macrostates and trajectory sets, we calculated ∆F for all pairs of macrostates on each combination of conformer definition and trajectory set according to equation (Eq. (1)) (denoted as ∆F snapshot ) and equation (Eq. (3)) (denoted as ∆F con f ormer ) respectively, and their differences were denoted as δ ∆F = ∆F snapshots − ∆F con f ormer , which essentially measures differences between our constructed set of explicit conformers and implicit conformers associated with snapshots. Distributions of δ ∆F and cumulative probability density (CPD) of its absolute values for the four sets of explicit conformers (CONF1 through CONF4) are shown in Fig. Figure 4. Firstly, for CONF2 through CONF4 ( Fig. Figure 4b-d), distribution of δ ∆F is significantly broader for larger trajectory set. Secondly, it is noted that the range of horizontal axis is widely different for these three sets of conformers (ranging from less than 0.1k B T to a few a-d and Fig. S2 a-d), this is inevitable since selection of start and end macrostate is arbitrary and consistent in calculating both ∆F snapshot and ∆F con f ormer .
For coarser explicit conformers without ISWD, deviations from ISWD are expected to occur in the heaviest macrostates, where larger probability for occurrence of excessively heavy conformers would cause uneven distribution of statistical weight. Again, such deviations are expected to be larger for larger trajectory sets (and eventually saturate for a fully converged trajectory set). To this end, we plotted −lnN snap vs −lnN con f for all constructed macrostates in Fig. Figure 7 for CONF1 and CONF4. Indeed, deviations occur for the heaviest macrostates and are larger for larger trajectory set for CONF4 ( Fig. Figure 7b,d,f). Perfect scaling was observed for CONF1 ( Fig. Figure 7a,c,e) as expected.

Conformational entropy based on ECISWD
Typical molecular systems in chemical, materials and biological studies, when treated quantum mechanically, present intractable complexity. Classical (continuous) representation of atomic DOFs, however, presents an awkward situation for the definition of microstates and entropy. 5 Correspondingly, density of states of classical systems may be determined only up to a multiplicative factor. 6 The term "conformational entropy", despite its widespread usage, has no well established definition available for major complex biomolecular systems. Explicit conformers with ISWD, despite its system dependence and the fact that infinite number of specific definitions exist for each given complex molecular systems, may be utilized as basic states for defining conformational entropy in an abstract and general sense for any complex molecular systems, and we explore this idea and its implications in this section.
It is well established in the informational theory field 7 that for a given static distribution with well-defined basic states, entropy may be constructed by arbitrary division of the whole system into M subparts.
with P i , P j and P k being properly normalized: S is the global informational entropy and S j s ( j = 1, 2, · · ·, M) are local informational entropies, it is noted that such division may be carried out recursively. We may similarly construct both local entropies of macrostates (say A and B) and global entropy for the given molecular system based on a set of explicit conformers: j(k) is the intra-conformer entropy for the j(k)th conformer in macrostate A(B). Again, P i , p j and q k are properly normalized: The first terms on the right hand side of equations (Eq. (8), Eq. (9) and Eq. (10)) describe distributions of conformer statistical weights within a macrostate or within the whole configurational space, and is referred to as "conformational entropy" (S con f ), the second terms are averages of the intra-conformer entropies of corresponding conformers and are denoted S int . We may rewrite S A and S B in the following form: With a simple algebraic manipulation shown below: Conformational entropy of macrostate A (S A con f ) is divided into two terms. The first term is the Boltzmann entropy (or ideal gas entropy, denoted as S A Boltzmann ) based on the number of conformers. The second term represents deviation from the Boltzmann entropy (denoted as δ S A con f ). It is the product of the Boltzmann constant and the Kullback-Leibler divergence 8 between the actual probability distribution of conformers in macrostate A (p = (p 1 , p 2 , · · · , p N A con f )) and the uniform distribution (uni f {1, N A con f }). S A con f may be rewritten as: Similarly, denote probability distribution of conformers in macrostate B as q = (q 1 , q 2 , · · · , q N B con f ) and the corresponding uniform distribution as uni f {1, N B con f }, we have: For ECISWD, if we denote the corresponding ISWD with a continuous probability density R, then p ≈ R and q ≈ R. Denote the continuous uniform distribution as unif, we have: δ ∆S AB con f = δ S B con f − δ S A con f ≈ 0 (25) Note that ∆S AB con f (equation Eq. (26)) is equivalent to ∆F AB (equation Eq. (3)) except a mere difference of a negative temperature factor. δ ∆S AB con f reflect the difference between two KL divergences, which correspond to distances between the statistical weight distribution of conformers in macrostate A(B) and the uniform distribution. The advantage of utilizing ECISWD for defining conformational entropy is the generality by concealing system specific molecular structural information in specific definition of conformers. Additionally, when difference of conformational entropy is taken between two arbitrary macrostates, deviation of the unknown ISWD from the uni-form distribution is cancelled and we need only to deal with the number of conformers. Based on the same logic as in the case of free energy analysis, with increasingly larger subsets of a sufficiently well-converged MD trajectory set, we expect to observe systematic decrease of δ ∆S con f calculated for arbitrarily defined macrostate pairs as long as ECISWD are basic states of conformational entropy. Conversely, we expect to observe systematic increase of δ ∆S con f when explicit conformers with widely variant statistical weight distributions are basic states of conformational entropy. To this end, we took the same trajectory sets, definition of conformers and macrostates as in the analysis of δ ∆F, and calculated corresponding δ ∆S con f = δ S B con f − δ S A con f based on equations (Eq. (20)) and (Eq. (22)) for each macrostate pair. Both distributions of δ ∆S con f and corresponding CPD of its absolute value were shown in Fig. Figure 8. As expected, and consistent with free energy analysis as shown in Fig. Figure 4, trend of δ ∆S con f based on conformers in set CONF1 (Fig. Figure 8a,e) matches our expectation for that of ECISWD, while trends of δ ∆S con f based on conformers in sets CONF2 through CONF4 (Fig. Figure 8bcd, fgh) match our expectation for that of conformers with variant statistical weight distribution, with coarser conformers and larger trajectory sets correspond to wider distributions of δ ∆S con f .

Entropy enthalpy compensation
In canonical ensemble, we have: While the derivation is carried out in canonical ensemble, it should be applicable for many isobaricisothermal processes (e.g. many biomolecular systems under physiological conditions or routine experimental conditions) where change of the PV term is negligible. Note that equation Eq. (28) is the intriguing entropy-enthalpy compensation (EEC) phenomena (when the PV term is negligible), which had long been an enigma, [9][10][11][12][13] and has attracted a revival of interest due to its critical relevance in protein-ligand interactions. [14][15][16][17][18][19][20][21][22][23][24][25][26] Careful statistical analysis confirm that EEC does exist to various extent in many protein-ligand interaction systems after experimental errors are effectively removed. 20 For a given molecular system, once we have constructed a set of ECISWD, equations (Eq. (3)) and (Eq. (28)) state that change of molecular interactions does not necessarily cause change of free energy, which depends on relative number of thermally accessible ECISWD in end macrostates, and local effects from change of molecular interactions will be cancelled almost completely by corresponding change of average intra-conformer entropy. Note that correlation of neither signs nor magnitudes between ∆S AB con f and ∆S AB int is implied. Therefore, depending upon signs and magnitudes of ∆S AB con f and ∆S AB int (we neglect the PV term here), this theory is compatible with molecular processes driven by enthalpy, entropy or both and various extent of observed EEC. When ∆S AB con f ≈ 0, perfect EEC would be observed; when ∆S AB con f > 0 and ∆U > 0 (or ∆S AB int > 0), a seemingly entropy driven (and a reverse entropy limited) process would be observed; when ∆S AB con f > 0 and ∆U < 0 (or ∆S AB int < 0), depending upon the sign of ∆S AB = ∆S AB con f + ∆S AB int , a seemingly enthalpy or entropy-enthalpy jointly driven (and a reverse enthalpy or entropy-enthalpy  28)). Without sufficient number of complex and heterogeneous microstates within each conformer, it is hard to imagine how such EEC occur. Along the same lines, a simple Morse potential type of protein-ligand interaction model was not found to allow significant EEC. 23 Based on the widespread observation of strong EEC effect in many molecular systems, it was suggested 23 that any attempt to calculate the change of free energy as a sum of its enthalpic and entropic contributions is likely to be unreliable. The proposed conformer counting strategy (equation Eq. (3)) implicitly utilizes EEC by completely avoiding direct calculation of ∆U and ∆S int , which is expensive and error prone.

Conclusions
In summary, we presented the idea that snapshots in a converged MD trajectory set map directly to implicit thermally accessible conformers with ISWD. Based on the thought that infinite number of ways exist for defining implicit conformers with ISWD for a given molecular system, we hypothesized that any sufficiently fine set of conformers should have sufficiently good approximate ISWD. This hypothesis, while being disproved by a double well potential, tested successfully on extensive MD trajectories of lipid POPC. We think that competition of many DOFs, each allowed to vary significantly in both potential energy and spatial position within a conformer, constitutes the foundation for the observed validity of the hypothesis. Considering the moderate complexity of lipid POPC, it is likely that the hypothesis holds for complex molecular systems in general.
This is a useful demonstration of the idea that "More is different". 27 Active research is undergoing in our group toward defining ECISWD for more biomolecular systems (e.g. protein-ligand, protein-protein interaction and protein-nucleic acid interactions systems with explicit or implicit solvation). Furthermore, when ECISWD are utilized as basic states for definition of conformational entropy, change of which between two macrostates was found to be equivalent with corresponding change of free energy except a mere difference of a negative temperature factor. Meanwhile, change of potential energy between two macrostates was found to cancel corresponding change of average intra-conformer entropy. This finding suggests that EEC is inherently a local phenomenon in configurational space, and is likely universal in complex molecular systems. While providing an alternative perspective to the long-standing enigmatic EEC, this result is consistent with different extent of EEC observed for both enthalpy driven and entropy driven molecular processes in conventional sense where change of enthalpy is compared with change of total entropy. Counting thermally accessible ECISWD (equation Eq. (3)) is a natural extension of the population based free energy formula (equation Eq. (1)), which is only useful posterior to a converged simulation.
However, equation Eq. (3) effectively utilizes EEC implicitly through separation of entropy into conformational entropy based on ECISWD and intra-conformer entropy, and renders direct utilization of SMC and importance sampling possible for rapid free energy difference estimation. 1,2 In accordance with "no free lunch theorem", 28 this expected gain in efficiency pays the price of all dynamical and pathway information associated with converged trajectories.

Methods
To define conformers, we first take a given set of torsional DOFs (Fig. Figure 3 To prepare macrostates, all snapshots in a given trajectory set were projected onto a selected backbone dihedral that was partitioned into 20 18 • -windows, snapshots fall within each of which constitute an observed macrostate. Such projections were performed for each of 43 dihedrals ( Fig. Figure 3) and we have collectively 860 macrostates for each given combination of trajectory set and conformer definition. Apparently, macrostates based on the same dihedral angle do not overlap, while those based on different dihedral angles may overlap to different extent. To assign each snapshots to its belonging conformer and calculate N con f for each constructed macrostates, torsional states for the selected torsional DOFs were encoded into bit vectors and the radix sort algorithm 29 was utilized.
Trajectory sets TSA1, TSA2 and TSA3 are constructed from snapshots of POPC collected in simulation condition A in the supplementary table 2 of the GPCR simulation study. 4 There were totally 34143653 snapshots, which collectively amount to ∼ 6.15ms (6.14585754ms). Five subsets, with collective length (CL) being ∼ 1.58ms, ∼ 1.32ms, ∼ 1.32ms, ∼ 1.32ms and ∼ 0.66ms respectively, were available for this simulation condition. We take the first six trajectories out of the total 66 trajectories of the first subset as TSA1, which has a CL of 142.56µs. The first subset (∼ 1.58ms) was taken as TSA2, and the union of all subsets was taken as TSA3 (∼ 6.15ms).    Table Table 1 for detailed lists of comprising atoms) utilized to define conformers are labeled with numbers on their central bonds. Set CONF1 is defined with all 43 torsions; set CONF2 is defined by 28 torsions, which are {2, 3,5,6,8,9,11,12,14,15,17,18,20,21,23,24,26,27,29, Figure 6: Distributions of δ ∆F (a -d) and CPD of its absolute values (e -h) for POPC with conformer sets CONF1 through CONF4 on trajectory sets TSB1 through TSB3. These trajectory sets are constructed from snapshots of POPC collected in simulation condition B in the supplementary table 2. 4 There were 36724760 snapshots, which collectively amount to a CTS of ∼ 6.61ms (6.6104568ms). Five subsets, each including 56 trajectories with CTS being ∼ 1.32ms, were available for this simulation condition. After trajectories of the first subset were sorted according to file name, the first six trajectories were taken as TSB1 (∼ 200µs). The first subset is taken as TSB2 (∼ 1.32ms), and the union of all subsets was taken as TSB3 (∼ 6.61ms). Different trajectory sets are represented by different line colors. The unit of the horizontal axis is in k B T .   3)) holds sufficiently well. Each red dot represents a macrostate; green lines are the best linear fits for the observed data with R 2 being the squared linear correlation coefficients.  Figure 8: Distributions of δ ∆S con f (a -d) and CPD of its absolute values (e -h) for POPC with four sets of explicit conformers (CONF1 through CONF4, which are indicated in the horizontal label as subscripts, e.g. δ ∆S con f 1 in (a) and |δ ∆S con f 1 | in (e)). Different trajectory sets are represented by different line colors. The unit of the horizontal axis is in k B .