## Introduction

Water, an essential constituent of life1, remains an elusive target for modeling and simulation. Effective coarse-grained (CG) models of liquid water must balance computational savings, by handling fewer degrees of freedom, while at the same time capturing its essential physical properties2,3,4,5,6. CG water models have enabled simulations exceeding micro-meters/seconds that are relevant for processes in biophysical systems that are beyond the reach of conventional atomistic molecular dynamics (MD) simulations. CG modeling entails recasting the complex and detailed atomistic model into a simpler yet accurate representation. A CG model has the ability to model key quantities of interest (QoI) when it captures the effects of the eliminated degrees of freedom (DOFs)7,8,9. The CG process requires: (i) The identification of the system’s optimal resolution. Commonly, groups of atoms are described with pseudo-atoms/interaction sites and a “mapping” function is used to determine the relation between these sites and the atomistic coordinates. For a given system various coarse-graining levels can be employed. For example, existing CG lipid membrane models range from representations with a single anisotropic site10 to three sites per lipid thus differentiating between the head and the tail11,12, or by grouping three or four heavy atoms into beads thus capturing varying degrees of chemical properties13,14,15,16; (ii) The specification of the associated Hamiltonian. Here, DOFs can be reduced by simplifying the form or by neglecting specific terms in the Hamiltonian. For instance, one can neglect the bond/angle vibrations and resort to rigid models17.

In effective CG models, the removed DOFs are insignificant for the QoI. However, to what degree a specific DOF is negligible for a given observable is hardly ever known beforehand. Thus, the majority of the CG models are designed based on intuition or extrapolations from existing models. Additionally, the number of the removed DOFs must be large so that the diminished accuracy compared to the AT models is justified with the substantial computational gains. It is usually assumed that increasing the level of coarse-graining will decrease the model’s accuracy. However, and this is perhaps a key issue, the relation between the number of employed DOF in a model and its accuracy may not be a monotonic function7,18,19. Thus, one can end up (without realizing) in a worst case scenario, where the constructed model is redundant, i.e., better accuracy can be achieved with fewer DOFs (computationally less demanding model). For example, the two-site and four-site models of n-hexane molecule perform reasonably well, while a very similar three-site model fails19. A number of works have addressed the systematic selection of CG models in bio-molecular systems20,21,22,23,24,25.

The challenge of striking the optimal balance between accuracy and computational cost is crucial for CG models of water. At the same time, obtaining water-water interactions consumes the majority of the computational effort. Thus, many CG models of water were developed. These models differ in the coarse-graining resolution level, i.e., the mapping, which ranges from 1 to 11 water molecules per CG bead2,26. Models also differ in the employed Hamiltonian. For CG models where one bead represents one water molecule (1-to-1 mapping), the Hamiltonian is either derived from the atomistic simulations9,27 or parametrized based on analytic potentials ranging from a simple Lennard-Jones (LJ) to potentials incorporating tetrahedral ordering, dipole moment, and orientation-dependent hydrogen bonding interactions28,29,30,31. On a higher coarse-graining level, it was soon realized that chargeless models, such as the standard MARTINI model32,33, introduce unphysical features when applied to interfaces, such as an interface between water and a lipid membrane34,35,36. Thus, new CG models were developed which treat the electrostatics explicitly. In the PCGS model (3-to-1)37, the CG beads carry induced dipoles, in the polarizable MARTINI model (4-to-3)34 the electrostatic is modeled analog to the Drude oscillator, in the BMW model (4-to-3)35 the CG representation resembles a rigid water molecule with a fixed dipole and quadrupole moment, while the GROMOS CG model (5-to-2)38 introduces explicit charges with a fluctuating dipole. Note that in these models the extra interaction sites have no relation to the physical system making the intuitive construction of the model even more difficult.

Thus far, studies reporting the effects of the choices made in the coarse-graining level and model structure are relatively few. For water, the mapping was investigated by Hadley et al.39, where the investigated CG models were single-site models and the Hamiltonian was parameterized to reproduce the structural properties of water. The mapping 4-to-1 was found to give the optimal balance between efficiency and accuracy. However, by comparing the properties of the available water models it is hard to extract any physics as the models were developed to reproduce different properties. Furthermore, one should avoid artificially constructed scoring functions that could be biased but rather perform model selection based on rigorous mathematical foundation. In this respect, the Bayesian statistical framework can serve as a powerful tool which has become a popular technique to refine, guide and critically assess the MD models40,41,42,43,44,45.

In this work, we employ the Bayesian statistical framework to critically assess many CG water models (see Fig. 1). We investigate the biologically relevant CG resolution levels, i.e., mappings, where the number of grouped water molecules ranges from 1 to 6. At each resolution, multiple model structures are examined ranging from 1-site to 3-site models where we additionally investigate the rigid and flexible versions of the 2 and 3-site models for mapping M = 4. Our main objective is to determine the model evidence for all models and thus elucidate the impact of the mapping on the model’s performance and the relevant DOFs in CG modeling of water. Furthermore, we evaluate the speed-up for each developed model which allows us to assess efficiency-accuracy trade-off. Lastly, we investigate the transferability of the water models to different thermodynamics states, i.e., to different temperatures. To this end, we employ the hierarchical Bayesian framework46 that can accurately quantify the uncertainty in the parameter space for multiple QoI, i.e., different properties or the same property at different conditions.

## Methods

We investigate a set of CG water models (partially shown in Fig. 1). For all models, we employ the interactions that are implemented in the standard MD packages. In the 1S model, a water cluster is modeled with a single chargeless spherical particle employing the LJ potential ULJ (rij) = 4ε[(σ/rij)12 − (σ/rij)6] between particles i and j. The model parameters are $${\phi }_{1S}=(\sigma ,\epsilon )$$. The 2S model is a two-site model, where the sites are oppositely charged (±q) and constrained to a distance r0. The negatively charged (blue in Fig. 2) site interacts additionally with the LJ potential. The model parameters are $${\phi }_{2S}=(\sigma ,\epsilon ,q,{r}_{0})$$. In order to satisfy the net neutrality of the water cluster, the three-site model can be constructed in two ways, which we denote as 3S and 3S* models. The 3S model resembles a big water molecule where all three particles are charged. The central (blue) site has a charge of −q, and the other two sites have a charge of +q/2. In the 3S* model, the central site is chargeless and the other two carry a ±q charge. Both three-site models have the parameters $${\phi }_{3S\mathrm{,3}S\ast }=(\sigma ,\epsilon ,q,{r}_{0},{\vartheta }_{0})$$. For all rigid model structures, we consider four levels of resolution with the number of grouped water molecules equal to 3, 4, 5, and 6. For the 1S model, we additionally consider the M = 1 mapping, while for the models with partial charges we investigate also M = 12. The level of resolution fixes the total mass of the CG representation. The mass ratio between the interaction sites in the two and three-sited models is fixed to 2 with the central particle carrying the larger mass. The electrostatic is in all cases modeled with the Coulomb’s interaction Ue (rij) = qiqj/(4πεε0rij), where we set the global dielectric screening to ε = 2.5. For M = 4, we consider also the flexible analogs of the models with charges. In the 2SF model, the two sites are interacting with a harmonic potential Ub (rij) = kb (rij − r0)2 with force constant kb. Therefore, the model parameters are $${\phi }_{2SF}=(\sigma ,\epsilon ,q,{r}_{0},{k}_{b})$$. For the flexible three-site models 3SF and 3SF*, the angle is unconstrained and modeled with the harmonic angle potential Ua (ϑij) = ka (ϑij − ϑ0)2 thus adding the force constant ka parameter to the parameter set, i.e., the model parameters are $${\phi }_{3SF\mathrm{,3}SF\ast }=(\sigma ,\epsilon ,q,{r}_{0},{\vartheta }_{0},{k}_{a})$$.

We remark that the data used as target QoI is part of the modeling choice. In this work, we use experimental data of density, dielectric constant, surface tension, isothermal compressibility, and shear viscosity, i.e., mostly thermodynamic properties. These are deemed of key importance for biophysical systems. The data used and the properties of the reference coarse-grained water models are reported in Table 1. The structural properties, e.g. radial distribution function or the dynamical properties, e.g. diffusion constant were not considered in this work because these properties cannot be measured experimentally for M > 1.

### Uncertainty Quantification

#### Bayesian Framework

We consider a computational model $${\mathscr{C}}$$ that depends on a set of parameters $${\phi }_{c}\in {{\mathbb{R}}}^{{N}_{\phi }}$$ and a set of input variables or conditions $${\boldsymbol{x}}\in {{\mathbb{R}}}^{{N}_{x}}$$. In the context of the current work, the computational model is the molecular dynamics solver, the model parameters correspond to the parameters of the potential and the input variables to the temperature of the system. Moreover, we consider an observable function $$F({\boldsymbol{x}};\,{\phi }_{c})\in {{\mathbb{R}}}^{N}$$ that represents the output of the computational model. Here, the observable function is an equilibrium property of the system, e.g., the density. We are interested in inferring the parameters $${\phi }_{c}$$ based on the a set of experimental data d = {di| i = 1, …, N} that correspond to the fixed input parameters of the model x.

In the frequentist statistics framework, the parameters of the model are obtained by optimizing a distance of the model from the data, usually the likelihood function. In the Bayesian framework, the parameters follow a conditional distribution which is given by Bayes’ theorem,

$$p(\phi |{\boldsymbol{d}}, {\mathcal M} )=\frac{p({\boldsymbol{d}}|\phi , {\mathcal M} )\,p(\phi | {\mathcal M} )}{p({\boldsymbol{d}}| {\mathcal M} )},$$
(1)

where $$p({\boldsymbol{d}}|\phi , {\mathcal M} )$$ is the likelihood function, $$p(\phi | {\mathcal M} )$$ is the prior probability distribution and $$p({\boldsymbol{d}}| {\mathcal M} )$$ is the model evidence. Here, $$\phi$$ is the vector containing the computational model parameters $${\phi }_{c}$$ and any other parameters needed for the definition of the likelihood or the prior density. $${\mathcal M}$$ stands for the model under consideration and contains all the information that describes the computational and the statistical model.

The likelihood function is a measure of how likely is that the data d are produced by the computational model $${\mathscr{C}}$$. Here, we make the assumption that the datum di is a sample from the generative model

$${y}_{i}={F}_{i}({\boldsymbol{x}};{\phi }_{c})+{\sigma }_{n}{d}_{i}\varepsilon ,\,\varepsilon \sim {\mathscr{N}}\mathrm{(0,}\,\mathrm{1).}$$
(2)

Namely, yi are random variables independent and normally distributed with mean equal to the observable of the model and standard deviation proportional to the data. The reason we choose this error model is because the set of experimental data d contains elements of different orders of magnitude, e.g., density is of order of 1 and surface tension of order 100. With this model the error allowed by the statistical model becomes proportional to the value of the data we want to fit47. The likelihood of the data $$p({\boldsymbol{d}}|\phi )$$ has the form,

$$p({\boldsymbol{d}}|\phi )={\mathscr{N}}({\boldsymbol{d}}|F({\boldsymbol{x}},{\phi }_{c}),{\rm{\Sigma }}),\,{\rm{\Sigma }}={\sigma }_{n}^{2}\,{\rm{diag}}\,(\,{{\boldsymbol{d}}}^{2}),$$
(3)

where $$\phi ={({\phi }_{c}^{{\rm{T}}},{\sigma }_{n})}^{{\rm{{\rm T}}}}$$ is the parameter vector that contains the model and the error parameters.

The denominator of Eq. (1) is defined as the integral of the numerator and is called the model evidence. This quantity can be used for model selection48 as it is discussed in the next section. Finally, the prior probability encodes all the available information on the parameters prior to observing any data. If no prior information is known for the parameters, a non informative distribution can be used, e.g. a uniform distribution. In this work we use uniform priors, see SI for detailed information.

#### Model Selection

Assuming we have $${N}_{ {\mathcal M} }$$ models $${ {\mathcal M} }_{i},\,\,i=1,\ldots ,{N}_{ {\mathcal M} }$$ that describe different computational and statistical models, we wish to choose the model that best fits the data. In Bayesian statistics, this is translated into choosing the model with the highest posterior probability,

$$p({ {\mathcal M} }_{i}|{\boldsymbol{d}})=\frac{p({\boldsymbol{d}}|{ {\mathcal M} }_{i})p({ {\mathcal M} }_{i})}{p({\boldsymbol{d}})},$$
(4)

where $$p({ {\mathcal M} }_{i})$$ encodes any prior preference to the model $${ {\mathcal M} }_{i}$$. Assuming all models have equal prior probabilities, the posterior probability of the model depends only on the likelihood of the data. Taking the logarithm of the likelihood and using Eq. (1) we can write

$$\begin{array}{rcl}\mathrm{ln}\,p({\boldsymbol{d}}|{ {\mathcal M} }_{i}) & = & \int \,\mathrm{ln}\,p({\boldsymbol{d}}|{ {\mathcal M} }_{i})\,p(\phi |{\boldsymbol{d}},{ {\mathcal M} }_{i}){\rm{d}}\phi \\ & = & \int \,\mathrm{ln}\,\frac{p({\boldsymbol{d}}|\phi ,{ {\mathcal M} }_{i})\,p(\phi |{ {\mathcal M} }_{i})}{p(\phi |{\boldsymbol{d}},{ {\mathcal M} }_{i})}p(\phi |{\boldsymbol{d}},{ {\mathcal M} }_{i}){\rm{d}}\phi \\ & = & {\mathbb{E}}[\mathrm{ln}\,p({\boldsymbol{d}}|\phi ,{ {\mathcal M} }_{i})]-{\mathbb{E}}[\mathrm{ln}\,\frac{p(\phi |{ {\mathcal M} }_{i})}{p(\phi |{\boldsymbol{d}},{ {\mathcal M} }_{i})}],\end{array}$$
(5)

where the expectation is taken with respect to posterior probability $$p(\phi |{\boldsymbol{d}},{ {\mathcal M} }_{i})$$. The first term is the expected fit of the data under the posterior probability of the parameters and is a measure of how well the model fits the data. The second term is the Kullback-Leibler (KL) divergence or relative entropy of the posterior from the prior distribution and is a measure of the information gain from data d under the model $${ {\mathcal M} }_{j}$$. The KL divergence can be seen as a measure of the distance between two probability distributions.

If one would only consider the first term of Eq. (5) for the model selection, then the model that fits the data best would be selected. However, such an approach is prone to overfitting, i.e., choosing a too complex model, which reduces the predictive capabilities of the model. The second term serves as a penalization term. Models with posterior distributions that differ a lot from the prior, i.e., models that extract a lot of information from the data, are penalized more. Thus, model evidence can be seen as an implementation of the Ockham’s razor that states that simple models (in terms of the number of parameters) that reasonably fit the data should be preferred over more complex models that provide only slight improvements to the fit. For a detailed discussion on model selection and estimators of the model evidence, we refer to refs49,50.

#### Hierarchical Bayesian Framework

We consider data structured as: $$\overrightarrow{{\boldsymbol{d}}}=\{{{\boldsymbol{d}}}_{1},\ldots ,{{\boldsymbol{d}}}_{{N}_{d}}\}$$ where $${{\boldsymbol{d}}}_{i}\in {{\mathbb{R}}}^{{N}_{i}}$$ corresponds to the conditions xi. For example, xi may correspond to different thermodynamic conditions under which the experimental data di are produced.

The classical Bayesian method for inferring the parameters of the computational model is to group all the data and estimate the probability $$p(\phi |\overrightarrow{{\boldsymbol{d}}})$$ (see Fig. 3 left). However, this approach may not be suitable when the uncertainty on $$\phi$$ is large due to the fact that different parameters may be suitable for different data sets. On the opposite side, individual parameters $${\phi }_{i}$$ can be inferred using only the data set di (see Fig. 3 middle). This approach preserves the individual information but any information that may be contained in other data sets is lost.

Finally, a balance between retaining individual information and sharing information between different data sets can be achieved with the hierarchical Bayesian framework. In this approach, the independent models corresponding to different conditions are connected using a hyper-parameter vector ψ (see Fig. 3 right). The benefits of this approach is twofold: (i) better informed individual probabilities $$p({\phi }_{i}|\overrightarrow{{\boldsymbol{d}}})$$; and (ii) a data informed prior p(ψ|d) is available in case new parameters $${\phi }^{new}$$ that correspond to unobserved data need to be inferred. A detailed description of the sampling algorithm of this approach is given in Supporting information (SI).

## Results

### Impact of mapping

First, we examine the impact of the level of resolution on the model accuracy using density, dielectric constant, surface tension, isothermal compressibility, and shear viscosity experimental data (see SI). In Fig. 4 the model accuracy, as measured by the model evidence, is shown as a function of mapping M, which denotes the number of water molecules represented by a given CG model. It is usually assumed the model’s performance is decreasing with the decreased resolution of the model. Indeed, for the 1S model, we observe precisely this trend. For the charged models, the evidence is still overall monotonically decreasing with M, however, compared to the 1S model the dependency of the evidence on M is much less drastic. To investigate this dependency further, we perform the UQ inference also for the charged models at M = 12. The observed evidences are comparable to the evidence of the 1S model at M = 4. Thus, with the models that incorporate partial charges, one can resort to models with higher mappings. According to the UQ, the best model for M = 1, 3 is the 1S model whereas for M > 3 the charged models are superior. However, one should keep in mind that the chargeless and charged models are not comparable as the chargeless models cannot provide the same amount of information, e.g., the dielectric constant is not defined. Comparing the evidences of the 2S, 3S, and 3S* models, we see that the three models rank very closely with the 3S* model being somewhat better than the other two.

We emphasize that the model evidence encompasses much more than a mere evaluation of the model’s properties at the best parameters. Nonetheless, it is insightful to examine the target QoI and their dependency on the mapping. Figure 5 shows the density ρ, dielectric constant ε, surface tension γ, isothermal compressibility κ, and shear viscosity η for rigid models and mappings 1 to 6. The target QoI are obtained using the maximum a posteriori (MAP) parameters and evaluated as a mean of 5 independent simulation runs with different initial conditions. Note that ε is not defined for the 1S model and it is excluded from the target QoI in the second UQ inference of the charged models. We observe that the ρ and ε are within 10% of the experimental data for all mappings. On the contrary, the γ, κ, and η depend very strongly on the mapping. The general trend is similar for all models, i.e., as we increase the mapping the γ is decreasing, κ is increasing, and η is decreasing. This observation agrees with the general picture of coarse-graining. The more we increase the level of coarse-graining the softer are the interactions between the CG beads which correlates with increased κ and decreased γ and η. We observe that for some models there are no parameters σ and ε of the LJ potential that would fit well a certain target QoI (within the liquid state), in particular, the γ and η. A possible solution would be to replace the LJ non-bonded interaction with another interaction, e.g, the Born-Mayer-Huggins interaction that is used in the BMW model35. The 1S model with M = 4 can be directly compared with the MARTINI model as the models are equal but were developed with different target QoI. With our model, we observe very similar properties as reported for the MARTINI model. Additionally, the inferred parameters with the MAP estimates are also very close to those of the MARTINI (see SI).

### Rigid vs. flexible models

For the mapping M = 4, we examine also the three flexible models 2SF, 3SF, and 3S* F. The resulting evidences for these models are listed in Table 2 along with the model evidences for the rigid counterparts. The physical motivation behind the flexible models is that they encompass the fluctuations in the dipole moment of the water cluster. In the two-site model, we incorporate them via bond vibrations, whereas in the three-site models with the angle fluctuations. Thus, the flexible models have 1 extra DOF compared to the rigid counterparts. However, as can be seen in Table 2, for the three-site models the flexible versions perform worse than the rigid ones. For the two-site model, the flexible model is only slightly better than the rigid model. Nonetheless, as flexible models usually demand smaller integration timesteps and consequently have a higher computational cost the model’s performance should be more substantial to justify the extra computational resources.

### Accuracy vs. efficiency

We examine the accuracy vs. efficiency trade-off in Fig. 6 where we plot the evidence as a function of the speedup compared to the all-atom simulation. As a test simulation, we choose the NVE ensemble simulation at ambient conditions, a cubic domain with an edge of 5 nm and a simulation length of 10 ns. We also employ the maximal integration timestep still permitted by the model (see SI). The variation of the runtime varies extensively between the considered models. The computational cost depends on two factors: (i) on the number of particles, which in turn depends on the employed mapping and the number of interaction sites of the model; (ii) on the integration timestep that is increasing with increased coarse-graining since the interactions are softening. For a given mapping, we observe the smallest computational cost for the 1S model, followed by the 2S, and 3S* model while the 3S model has the highest computational cost. The difference in the 3S and 3S* models is due to the smaller timesteps required by the 3S model.

The trade-off between the accuracy and efficiency can be formally addressed as a decision problem, where the expected utility51 $${\mathscr{U}}({ {\mathcal M} }_{i};{\boldsymbol{d}})$$ of an individual model $${ {\mathcal M} }_{i}$$ given data d is given by:

$${\mathscr{U}}({ {\mathcal M} }_{i};{\boldsymbol{d}})=p({ {\mathcal M} }_{i}|{\boldsymbol{d}})u({ {\mathcal M} }_{i}\mathrm{).}$$
(6)

We define the utility function $$u({ {\mathcal M} }_{i})$$ as the decimal logarithm of the computational speedup over the atomistic model. As shown in Fig. 6c the model with the maximal expected utility is found for the 1S model with M = 1. When we consider the models incorporating the partial charge: the 3S model is the most unfavorable, while the 2S and 3S* models are comparable in terms of their expected utility. In turn, the appropriate choice for the 2S and 3S* models are mappings M = 5 and 3, respectively.

### Transferability to non-ambient TD conditions

One of the challenges of coarse-graining is the transferability of CG models. Typically, CG models are more sensitive to variations in the thermodynamic conditions than the atomistic models. Furthermore, the more we increase the level of coarse-graining, the more restricted the model is to the thermodynamics state at which it is parametrized. One way of making the model more robust to transferability is to parametrize it for different conditions. Within the Bayesian formalism, the hierarchical UQ allows us to merge multiple QoI. We test the transferability of three models 2SF, 3S, and 3S* for mapping M = 4. In Fig. 7, we plot the model evidences for the hierarchical UQ, where the temperatures T = 283, 298, 323 K are merged and the evidences for the classical UQ at each temperature. We observe that the 3S* model is the most transferable, having the highest hierarchical evidence. For the three-site models, we also observe that it is easier for the CG model to fit higher temperatures.

## Summary and Discussion

We propose a data driven, Bayesian framework for the selection of CG water models. We re-examine the CG modeling approach where the mapping and model structure are based on rather ad-hoc assumptions and the system Hamiltonian is derived either by fitting its parameters to relevant experimental data or by deriving the effective interactions from the more detailed, e.g. atomistic simulations. Such a-priori assumptions predefine the accuracy of the model no matter what approach one employs to obtain the Hamiltonian. In this work, we propose a methodology that broadens the investigated space of all possible CG models of liquid water. The Bayesian framework is not constrained to a specific model design but considers many different mappings and model structures. Our model search space encompasses the 1, 2, and 3-site models with either rigid or flexible geometry. We find that for the 1S model one should consider mappings M < 3, while for the multiple-site models higher M are more appropriate due to the higher computational cost compared to the single sited models. When choosing between single and multiple-site models one should mainly consider whether the local electrostatics screening is essential for the problem at hand. We observed no significant improvement of models when going from rigid to flexible models, thus implying that one should use rigid geometries for efficiency reasons. The distribution of charge in the three-site models, however, plays an important role as the 3S* model outperforms the 3S model and is additionally much cheaper computationally due to the higher maximal integration time step. Additionally, the 3S* is also the best model in regard to the transferability to non-ambient temperatures.

The methodology presented in this work can be extended to investigate the CG model design of other important chemical and biological systems such as bio-molecules. We emphasize that the data used for the calibration of the models is considered an inherent aspect of the modeling process in a Bayesian framework. The adoption of Bayesian framework in studies of CG models could quantify the appropriateness of the model designs employed in established CG force fields52,53 according to target QoIs. The computational limitations associated with Bayesian inference are today largely overcome thanks to the availability of massively parallel computer architectures and a wealth of data is produced by advanced experimental procedures and detailed simulations. This combination enables this 400 year old method54,55 to become a potent alternative to challenging modeling and simulation problems of our times.