Bayesian selection for coarse-grained models of liquid water

The need for accurate yet computationally efficient representations of water in atomistic simulations that span biologically relevant timescales has driven the development of coarse-grained (CG) modeling. Despite numerous advances, CG water models rely mostly on a priori specified assumptions. How these assumptions affect the model accuracy, efficiency, and in particular transferability has not been systematically investigated. Here we propose a data-driven comparison and selection of CG water models through a hierarchical Bayesian framework. We examine CG water models that differ in their level of coarse-graining, structure, and number of interaction sites. We find that the importance of electrostatic interactions for the physical system under consideration is a dominant criterion for model selection. Multi-site models are favored, unless the electrostatic screening effects of water are irrelevant, in which case the single-site model is preferred due to its computational savings. The charge distribution is found to play an important role in the accuracy of the multi-site models, while bond and angle flexibility provides only slight improvements. Furthermore, we find significant variations in the computational cost of these models. We present a data-informed rationale for the selection of CG water models and provide guidance for future water model designs.

Density $\rho$ is evaluated in the NpT ensemble. For the surface tension $\gamma$ of the water/vapor interface, the water slab is surrounded by two vacuum regions in the $z$ direction. It is computed in the NVT ensemble as
$$\gamma = \frac{L_z}{2}\,\bigl\langle p_{zz} - p_{\parallel} \bigr\rangle,$$
where $L_z$ is the box length in the $z$ direction and $p_{zz}$ and $p_{\parallel}$ denote the perpendicular and lateral components of the pressure tensor. The dielectric constant $\varepsilon$ is computed from the fluctuations of the total dipole moment, while the shear viscosity $\eta$ is calculated in the NVT ensemble according to the Green-Kubo formula,
$$\eta = \frac{V}{3 k_B T} \sum_{a<b} \int_0^{\infty} \bigl\langle p_{ab}(t)\, p_{ab}(0) \bigr\rangle \,\mathrm{d}t,$$
where $V$ is the system volume, $k_B$ is the Boltzmann constant, $T$ is the temperature, and $p_{ab}$ are the off-diagonal components of the pressure tensor ($ab = xy, xz, yz$). The isothermal compressibility $\kappa$ is evaluated under NpT conditions.

For the Bayesian inference, the simulation protocol is the following: a 100 ps equilibration in the NVT ensemble, followed by a 100 ps equilibration in the NpT ensemble, followed by a 1 ns production run. For the evaluation of $\gamma$, the system is additionally equilibrated for 100 ps under NVT conditions. Additional evaluations of each model's properties are performed for the parameters with the maximum likelihood. We first evaluate the maximum integration time step $\Delta t$ by performing a series of NVE simulations, varying $\Delta t$ and monitoring the conservation of the total energy. Using the maximum $\Delta t$, an approximately doubled system size, a 1 ns equilibration, and 10 ns production runs, we evaluate the properties $\rho$, $\varepsilon$, $\gamma$, and $\kappa$ as stated above.
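For illustration, a minimal numpy sketch of the Green-Kubo estimate, assuming a pressure-tensor time series sampled from an NVT run (the array layout, unit handling, and the cutoff time `t_cut` are placeholders, not the settings used in this work):

```python
import numpy as np

def green_kubo_viscosity(p_offdiag, dt, volume, kB_T, t_cut):
    """Estimate the shear viscosity via the Green-Kubo relation.

    p_offdiag : array of shape (n_steps, 3) holding the p_xy, p_xz, p_yz
                time series from an NVT production run.
    dt        : sampling interval of the time series.
    volume    : system volume V.
    kB_T      : k_B * T in matching units.
    t_cut     : upper integration limit; the integral must be truncated
                before the noise in the correlation tail dominates.
    """
    n_cut = int(t_cut / dt)
    acfs = []
    for a in range(p_offdiag.shape[1]):
        p = p_offdiag[:, a]
        # autocorrelation <p(t) p(0)>, averaged over time origins
        acf = np.array([np.mean(p[: len(p) - k] * p[k:]) for k in range(n_cut)])
        acfs.append(acf)
    acf_mean = np.mean(acfs, axis=0)   # average over the xy, xz, yz terms
    integral = np.sum(acf_mean) * dt   # rectangle-rule time integration
    return volume / kB_T * integral
```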

Uncertainty Quantification
In most practical applications, the posterior distribution cannot be expressed analytically. Moreover, the normalizing constant p(d|M) is not known, since it is given by integrating the numerator of Eq. (S1) (the product of likelihood and prior) over the potentially high-dimensional parameter domain. Hence, we rely on efficient sampling algorithms to identify the posterior distribution. In the present work, we use the Transitional Markov Chain Monte Carlo (TMCMC) algorithm. 2 The algorithm starts by sampling a population from the prior distribution, which is usually trivial to sample, and through annealing steps provides samples from the posterior distribution. This algorithm has three major advantages: it can sample multimodal distributions in cases where single-chain MCMC methods may fail, it provides an asymptotically unbiased estimator of the evidence 3 (see SI), and it can run in parallel. In all simulations, the TMCMC parameter β² was set to 0.04.
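To make the annealing mechanics concrete, below is a minimal sketch of one TMCMC generation (a simplified outline under stated assumptions, not the reference implementation; the bisection scheme and coefficient-of-variation target are common choices). The parameter β² scales the covariance of the MCMC proposal:

```python
import numpy as np

def tmcmc_stage(samples, log_L, p_old, beta2=0.04, cov_target=1.0, rng=None):
    """One annealing stage of TMCMC (sketch).

    samples : (N, d) current population; log_L : (N,) their log-likelihoods;
    p_old   : current annealing exponent in [0, 1].
    Returns the resampled population, the new exponent, this stage's
    log-evidence contribution, and the scaled proposal covariance.
    """
    rng = rng or np.random.default_rng()

    # 1) pick the next exponent so the coefficient of variation of the
    #    importance weights w_i = L_i^(p_new - p_old) hits the target
    def cov_of_weights(p_new):
        w = np.exp((p_new - p_old) * (log_L - log_L.max()))
        return np.std(w) / np.mean(w)

    lo, hi = p_old, 1.0
    for _ in range(60):                      # bisection on the exponent
        mid = 0.5 * (lo + hi)
        if cov_of_weights(mid) < cov_target:
            lo = mid
        else:
            hi = mid
    p_new = 0.5 * (lo + hi)

    # 2) importance weights and this stage's evidence factor
    log_w = (p_new - p_old) * log_L
    shift = log_w.max()
    w = np.exp(log_w - shift)
    log_evidence_term = shift + np.log(np.mean(w))

    # 3) resample according to the normalized weights
    idx = rng.choice(len(samples), size=len(samples), p=w / w.sum())
    resampled = samples[idx]

    # 4) proposal covariance for the subsequent Metropolis moves,
    #    scaled by beta^2 = 0.04 as in this work (the MCMC moves
    #    themselves are omitted here)
    cov = beta2 * np.cov(samples.T, aweights=w)
    return resampled, p_new, log_evidence_term, cov
```

Multiplying the stage factors over all generations yields the asymptotically unbiased evidence estimate mentioned above.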
The parallel property of the algorithm is very important in this study, since a single model evaluation can take on the order of an hour to execute. An efficient implementation of TMCMC can be found in Π4U, 1 a platform-agnostic high-performance framework for uncertainty quantification. The sampling algorithm is further accelerated by using surrogate models based on Gaussian processes (see Section 1.3), leading to a 10-20% speedup.
The data on which the Bayesian uncertainty quantification is performed are experimental values of the density, the dielectric constant (not used for the LJ model, as it is chargeless), the surface tension of the water/vapor interface, the isothermal compressibility, and the shear viscosity (see SI).
We want to allow a different level of error for each datum; thus, we choose the covariance matrix Σ of the main text to be diagonal, with a separate variance for each quantity of interest (QoI). In order to select the best model, we work as follows. First, we perform the inference on all the parameters of each model using all five QoI. Next, we fix all parameters except σ and ε of the LJ interaction to the values of the maximum a posteriori estimate, i.e., the parameter set with the highest probability. Finally, we infer the remaining parameters of the models, excluding the dielectric constant from the QoI.
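As a hedged illustration of what such a choice implies (the precise parametrization of $\Sigma$ and of the error terms is given in the main text), a diagonal covariance factorizes the Gaussian likelihood over the QoI:

```latex
% Sketch: Gaussian likelihood with a diagonal covariance, one error
% term per quantity of interest; the exact form of \Sigma used in
% this work is defined in the main text.
p(d \mid \varphi, M)
  = \prod_{i=1}^{5} \frac{1}{\sqrt{2\pi}\,\sigma_i}
    \exp\!\left(-\frac{\bigl(d_i - y_i(\varphi)\bigr)^2}{2\sigma_i^2}\right),
\qquad
\Sigma = \operatorname{diag}\!\bigl(\sigma_1^2,\dots,\sigma_5^2\bigr),
```

where $d_i$ are the experimental values of the five QoI and $y_i(\varphi)$ the corresponding model predictions.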
Our motivation for this approach is threefold. First, we remove the dependency of the evidence on the number of model parameters; the complexity of the model is then measured through its computational cost. This measure is more appropriate, as two models can have the same number of parameters but differ in computational cost. Second, the evidence of each model can be directly compared to the evidence of the LJ model. Finally, having only three parameters reduces the required population size, thus reducing the computational cost of the sampling algorithm.

Surrogate models
The computational cost of the Bayesian UQ lies in the computation of the likelihood term in Eq. (S1). Each likelihood evaluation implies a model evaluation, i.e., a computationally demanding simulation using LAMMPS. The idea of surrogate models 4 is to use interpolation techniques to substitute for these expensive model evaluations.
In this work we choose to use Gaussian Processes 5 (GPs) as a surrogate for the likelihood.
Following the notation in Bishop, 5 given the observed data $\mathbf{t} = (t_1, \dots, t_N)$ that correspond to inputs $x_1, \dots, x_N$, the predictive distribution at a new input $x_{N+1}$ is Gaussian with mean and variance
$$m(x_{N+1}) = \mathbf{k}^{\top} C_N^{-1} \mathbf{t}, \qquad \sigma^2(x_{N+1}) = c - \mathbf{k}^{\top} C_N^{-1} \mathbf{k},$$
where $C_N$ is the $N \times N$ covariance matrix with elements $C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm}$, $\mathbf{k}$ is the vector with elements $k(x_n, x_{N+1})$, and $c = k(x_{N+1}, x_{N+1}) + \beta^{-1}$. The scalar $\beta$ is the precision of the error assumed on the observed data. The kernel function $k(x, y)$ models the fact that points which are close in the input space are expected to have strongly correlated outputs. In general, the kernel function depends on hyper-parameters, which are learned through an optimization process. In this work, we use the squared exponential kernel.
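A minimal numpy sketch of these predictive equations (the kernel hyper-parameters and the noise precision `beta` below are placeholders, not the values learned in this work):

```python
import numpy as np

def sq_exp_kernel(x, y, theta0=1.0, length=1.0):
    """Squared exponential kernel k(x, y); theta0 and length are
    hyper-parameters (placeholder values, not the fitted ones)."""
    diff = np.atleast_2d(x)[:, None, :] - np.atleast_2d(y)[None, :, :]
    return theta0 * np.exp(-0.5 * np.sum(diff**2, axis=-1) / length**2)

def gp_predict(X, t, x_new, beta=1e4, **kern):
    """Predictive mean m(x_new) and variance sigma^2(x_new), following
    the equations above (a sketch, not the production code)."""
    C_N = sq_exp_kernel(X, X, **kern) + np.eye(len(X)) / beta
    k = sq_exp_kernel(X, np.atleast_2d(x_new), **kern)[:, 0]
    c = sq_exp_kernel(np.atleast_2d(x_new), np.atleast_2d(x_new),
                      **kern)[0, 0] + 1.0 / beta
    L = np.linalg.cholesky(C_N)          # stable evaluation of C_N^{-1}
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))
    v = np.linalg.solve(L, k)
    mean = k @ alpha                     # k^T C_N^{-1} t
    var = c - v @ v                      # c - k^T C_N^{-1} k
    return mean, var
```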
In the context of TMCMC, the training set consists of likelihood values $t_i$ computed at parameters $x_i$ using only exact model evaluations. Based on these values, a GP model is trained and evaluated at a new point $(t_{N+1}, x_{N+1})$. If the relative error $\sigma(x_{N+1})/|m(x_{N+1})|$ is less than a threshold, then the likelihood value at $x_{N+1}$ is set to the predictive mean $m(x_{N+1})$; otherwise, the model is computed and the likelihood is evaluated exactly. Here, we choose the threshold to be 0.05.
In order to make the surrogate models more accurate, we train each GP model on a subdomain of the finite initial domain centered on $x_{N+1}$. Here, we choose to train on 25% of the initial domain.
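Putting the pieces together, the acceptance rule might look as follows (a sketch continuing from the previous block: `gp_predict` is the hypothetical helper defined there, `evaluate_model` stands for a full LAMMPS-based likelihood evaluation, and the box-shaped subdomain selection is an assumption about how the 25% window is realized):

```python
import numpy as np

def likelihood_or_surrogate(x_new, X_train, t_train, evaluate_model,
                            domain_lo, domain_hi, rel_tol=0.05, frac=0.25):
    """Return the likelihood at x_new: use the GP surrogate when its
    relative predictive error is below rel_tol (0.05 in this work),
    otherwise fall back to the full simulation."""
    # train only on points inside a subdomain centered on x_new that
    # covers `frac` (25%) of the initial domain in each dimension
    half = 0.5 * frac * (domain_hi - domain_lo)
    mask = np.all(np.abs(X_train - x_new) <= half, axis=1)
    if mask.sum() >= 3:                      # enough local training points
        mean, var = gp_predict(X_train[mask], t_train[mask], x_new)
        if abs(mean) > 0 and np.sqrt(max(var, 0.0)) / abs(mean) < rel_tol:
            return mean                      # trust the surrogate
    return evaluate_model(x_new)             # exact model evaluation
```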

Hierarchical UQ
We follow the methodology developed in Ref. 6. We assume that the data come split into $N$ different datasets, $\vec{d} = (d_1, \dots, d_N)$. The likelihood in the probabilistic model of the $i$-th dataset is $p(d_i \mid \varphi_i, M)$, with parameters $\varphi_i$. We assume that the probability of $\varphi_i$ depends on a hyper-parameter $\psi \in \mathbb{R}^{N_\psi}$ and is given by a PDF $p(\varphi_i \mid \psi, M)$, where $M$ corresponds to the graph describing the relations between $\psi$, $\varphi_i$, and $d_i$; see Figure S1. Our goal is to obtain samples from the posterior distribution
$$p(\varphi_i \mid \vec{d}, M). \tag{S6}$$
The dependency assumptions from Fig. S1 allow us to simplify $p(\varphi_i \mid \psi, \vec{d}, M) = p(\varphi_i \mid \psi, d_i, M)$, and equation (S6) can be rewritten using Bayes' theorem as
$$p(\varphi_i \mid \vec{d}, M) = \int p(\varphi_i \mid \psi, d_i, M)\, p(\psi \mid \vec{d}, M)\, \mathrm{d}\psi.$$

Figure S1: Bayesian networks: (left) network for the $i$-th dataset; (right) hierarchical Bayesian network.

Finally, the posterior distribution (S6) can be approximated as
$$p(\varphi_i \mid \vec{d}, M) \approx \frac{1}{N_s} \sum_{k=1}^{N_s} p(\varphi_i \mid \psi^{(k)}, d_i, M),$$
where $\psi^{(k)} \sim p(\psi \mid \vec{d}, M)$ and $N_s$ is sufficiently large. Thus, in order to obtain $\varphi_i$ samples, we first have to sample the probability distribution $p(\psi \mid \vec{d}, M)$, which, according to Bayes' theorem, is equal to
$$p(\psi \mid \vec{d}, M) = \frac{p(\vec{d} \mid \psi, M)\, p(\psi \mid M)}{p(\vec{d} \mid M)},$$
where $p(\psi \mid M)$ is the prior PDF on $\psi$ and $p(\vec{d} \mid M)$ is the normalizing constant. Exploiting the dependency assumptions of Fig. S1, we see that
$$p(\vec{d} \mid \psi, M) = \prod_{i=1}^{N} p(d_i \mid \psi, M),$$
and the likelihood of the $i$-th dataset can be expressed according to the total probability theorem as
$$p(d_i \mid \psi, M) = \int p(d_i \mid \varphi_i, M)\, p(\varphi_i \mid \psi, M)\, \mathrm{d}\varphi_i. \tag{S12}$$
Here we introduce the model $M_i$ described in Fig. S1, whose posterior distribution will be used as the instrumental density for importance sampling. Under the modeling assumptions of Ref. 6 and the use of Bayes' theorem, equation (S12) is written as
$$p(d_i \mid \psi, M) = \int \frac{p(d_i \mid \varphi_i, M)\, p(\varphi_i \mid \psi, M)}{p(\varphi_i \mid d_i, M_i)}\, p(\varphi_i \mid d_i, M_i)\, \mathrm{d}\varphi_i,$$
or, equivalently, as
$$p(d_i \mid \psi, M) = p(d_i \mid M_i) \int \frac{p(\varphi_i \mid \psi, M)}{p(\varphi_i \mid M_i)}\, p(\varphi_i \mid d_i, M_i)\, \mathrm{d}\varphi_i.$$
Finally, equation (S12) can be approximated as
$$p(d_i \mid \psi, M) \approx p(d_i \mid M_i)\, \frac{1}{N_s} \sum_{k=1}^{N_s} \frac{p(\varphi_i^{(k)} \mid \psi, M)}{p(\varphi_i^{(k)} \mid M_i)},$$
where $\varphi_i^{(k)} \sim p(\varphi_i \mid d_i, M_i)$.

Table S5: Model parameters ($\sigma$ in Å, $\varepsilon$ in kcal/mol, $e$ in $e_0$, $l$ in Å, $k_B$ in kcal/mol/nm², $\theta$ in °, and $k_A$ in kcal/mol) with maximum likelihood for the considered rigid models and mappings $M$. The dipole moment $\mu$ is given in D.

Table S6: Model parameters (same units as Table S5) with maximum likelihood for the considered flexible models and mapping $M = 4$. The dipole moment $\mu$ is given in D.

Table S8: Model parameter ranges (same units as Table S5) for the considered flexible models and mapping $M = 4$. The dipole moment $\mu$ is given in D.

For the conditional PDF $p(\varphi_j \mid \psi, M)$ we consider two models: 1) uniform: $p(\varphi_j \mid \psi, M) = \mathcal{U}(\varphi_j \mid \psi_{2j-1}, \psi_{2j})$, where $M$ is set to $U$; 2) gamma: $p(\varphi_j \mid \psi, M) = \Gamma(\varphi_j \mid \psi_{2j-1}, \psi_{2j})$, where $\Gamma(\cdot \mid \mu, \sigma)$ is a re-parametrized gamma distribution with mean $\mu$ and standard deviation $\sigma$, and $M$ is set to $\Gamma$. Given $\mu$ and $\sigma$, the shape and the scale of the gamma distribution are given by $k = \mu^2 / \sigma^2$ and $\theta = \sigma^2 / \mu$.
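As a sketch of how the gamma re-parametrization and the importance-sampling estimate of Eq. (S12) fit together (all function names are hypothetical; `phi_samples` and `log_evidence_Mi` would come from a TMCMC run on the $i$-th dataset, and `psi` is assumed to stack one $(\mu_j, \sigma_j)$ pair per parameter):

```python
import numpy as np
from scipy import stats

def gamma_from_mean_std(mu, sigma):
    """Gamma distribution re-parametrized by its mean mu and standard
    deviation sigma: shape k = mu^2/sigma^2, scale theta = sigma^2/mu."""
    k = mu**2 / sigma**2
    theta = sigma**2 / mu
    return stats.gamma(a=k, scale=theta)

def log_dataset_likelihood(psi, phi_samples, log_prior_Mi, log_evidence_Mi):
    """Importance-sampling estimate of log p(d_i | psi, M), cf. Eq. (S12).

    phi_samples     : (N_s, d) samples phi^(k) ~ p(phi | d_i, M_i)
    log_prior_Mi    : function returning log p(phi | M_i) per sample
    log_evidence_Mi : log p(d_i | M_i), a by-product of TMCMC
    """
    log_w = np.zeros(len(phi_samples))
    for j in range(phi_samples.shape[1]):
        mu, sigma = psi[2 * j], psi[2 * j + 1]
        # log p(phi_j | psi, M) under the gamma hyper-prior model
        log_w += gamma_from_mean_std(mu, sigma).logpdf(phi_samples[:, j])
    log_w -= log_prior_Mi(phi_samples)       # divide by p(phi | M_i)
    shift = log_w.max()                      # numerically stable log-mean-exp
    return log_evidence_Mi + shift + np.log(np.mean(np.exp(log_w - shift)))
```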

The prior distribution on the hyper-parameters is modeled as a product of independent uniform distributions,
$$p(\psi \mid M) = \prod_{j} \mathcal{U}\bigl(\psi_j \mid a_j^M, b_j^M\bigr),$$
where $M \in \{U, \Gamma\}$ and the constants $a_j^M$, $b_j^M$ are given in Table S11. According to the Bayesian model selection criterion, the model $U$ is an order of magnitude more plausible than $\Gamma$ and is therefore used for the further inference.