Learning neural network potentials from experimental data via Differentiable Trajectory Reweighting

In molecular dynamics (MD), neural network (NN) potentials trained bottom-up on quantum mechanical data have seen tremendous success recently. Top-down approaches that learn NN potentials directly from experimental data have received less attention, typically facing numerical and computational challenges when backpropagating through MD simulations. We present the Differentiable Trajectory Reweighting (DiffTRe) method, which bypasses differentiation through the MD simulation for time-independent observables. Leveraging thermodynamic perturbation theory, we avoid exploding gradients and achieve around 2 orders of magnitude speed-up in gradient computation for top-down learning. We show the effectiveness of DiffTRe in learning NN potentials for an atomistic model of diamond and a coarse-grained model of water, based on diverse experimental observables including thermodynamic, structural and mechanical properties. Importantly, DiffTRe also generalizes bottom-up structural coarse-graining methods such as iterative Boltzmann inversion to arbitrary potentials. The presented method constitutes an important milestone towards enriching NN potentials with experimental data, particularly when accurate bottom-up data is unavailable.


Supplementary Methods

DiffTRe and simulation parameters
First, we summarize DiffTRe parameters relevant to all examples before we list problem-specific parameters below. We set N_eff = 0.9 N as the threshold above which re-using a trajectory is allowed. We employ an Adam optimizer [1] with an exponentially decaying learning rate. The Adam hyperparameters β_1 = 0.1 and β_2 = 0.4 are chosen to account for training with rather large step sizes and only a few parameter updates. All examples are initialized with a global random seed 0, which controls the random initialization of θ and the initial simulation state. We observed that despite setting random seeds, results are not matched exactly across different re-runs, even when running JAX in its reproducibility configuration. We tackle this issue by reporting results for varying random seeds, which also captures the variability from non-deterministic operations. All computations are run on a single Nvidia RTX 3090 GPU, with the exception of computations with the cubic spline potential in the double-well toy example. As the numerically inexpensive spline cannot saturate the GPU, these computations were faster on an AMD Ryzen Threadripper 3970X CPU.
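To make the learning rate schedule used throughout concrete, the following sketch reproduces an exponential decay by a given factor over a given number of updates; the function name and the smooth power-law form are our assumptions (matching optax-style exponential decay), not part of the original implementation:

```python
def exponential_decay(step, init_lr, decay_factor, decay_steps):
    """Learning rate decayed smoothly so that lr(decay_steps) = init_lr * decay_factor."""
    return init_lr * decay_factor ** (step / decay_steps)

# Toy-example settings: eta = 0.5 decayed by a factor of 0.01 over 200 steps
print(exponential_decay(0, 0.5, 0.01, 200))    # 0.5
print(exponential_decay(200, 0.5, 0.01, 200))  # 0.005
```

The same schedule with the problem-specific initial learning rates and step counts applies to the diamond and water examples below.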

Double-well toy example
Simulations consist of N_p = 2000 ideal gas particles of mass m = 1 within a box of size X = 1 with time step δt = 0.001. The constant temperature of k_B T = 1 in the canonical ensemble is enforced by a Nose-Hoover chain thermostat [2] with 5 chains and time scale τ = 0.02. The initial state S_init is constructed by drawing particle positions uniformly from x ∈ [0, 1]. S_init for the final production run consists of particles drawn uniformly from x ∈ [0.5, 0.51] to test convergence to the target density distribution even from a state far from equilibrium. Density distributions are computed via the differentiable density function in Supplementary Eq. (4) with bin width Δx = 0.01. During optimization, the initial learning rate η = 0.5 of Adam [1] is decayed exponentially by a factor of 0.01 over 200 update steps. The target and final predicted densities ρ̂(x)/ρ_0 and ρ(x)/ρ_0 are computed from a production run of 100000 states, following 10000 skipped states for equilibration.

Atomistic model of diamond
Simulations are run with a time step size of δt = 0.5 fs. The temperature is controlled by a Langevin thermostat with friction coefficient γ = 4/ps, which corresponds to a coupling time scale of 250 fs. These values are common in simulations of diamond in the literature [3]. Carbon atoms have a mass m = 12.011 u. The loss weights γ_σ = 5 · 10⁻⁸ (kJ mol⁻¹ nm⁻³)⁻² and γ_C = 10⁻¹⁰ (kJ mol⁻¹ nm⁻³)⁻² balance the impact of both observables, i.e. the stress σ and the stiffness components C_ij. Optimization starts with an initial Adam learning rate η = 0.002 that is exponentially decayed by a factor of 0.01 over 500 steps.
For the computation of the phonon density of states (PDOS), we first minimize the potential energy via 500 steps of the Fast Inertial Relaxation Engine (FIRE) [4]. The PDOS is computed afterwards via the finite displacement method as implemented in Phonopy [5] with a displacement length of 0.001 nm.

Coarse-grained water model
Coarse-grained water is simulated with a time step size of δt = 2 fs. Water molecules (and correspondingly CG water particles) have a mass m = 18.0154 u. A Nose-Hoover chain thermostat [2] with chain length 5 and time scale τ = 200 fs enforces the target temperature. We approximate radial (RDF) and angular distribution functions (ADF) with the differentiable versions presented in Supplementary Eqs. (5) and (6). The RDF is discretized by 300 bins of width Δx = 1/300 nm. The ADF is discretized by 200 bins of width Δα = π/200 rad, and triplets are cut off at r_c = 0.318 nm, analogous to the experimental evaluation [6]. The loss weight γ_p = 10⁻⁷ (kJ mol⁻¹ nm⁻³)⁻² accounts for the larger magnitude of the pressure compared to the RDF and ADF. The initial Adam learning rate η = 0.003 is decayed exponentially by a factor of 0.01 over 200 steps.
The tetrahedral order parameter q [7] is computed via the triplet angles α_ijk spanned by neighboring particles i and k of a central particle j,

q = 1 - \frac{3}{8} \sum_{i=1}^{3} \sum_{k=i+1}^{4} \left( \cos \alpha_{ijk} + \frac{1}{3} \right)^2 ,

where i and k are indices running over the 4 nearest neighbors of the central particle j. We compute the self-diffusion coefficient D via the Green-Kubo relation from the velocity auto-correlation function (VACF),

D = \frac{1}{3 N_p} \int_0^{t_{cut}} \sum_{i=1}^{N_p} \langle \mathbf{v}_i(t) \cdot \mathbf{v}_i(0) \rangle \, dt ,

where we cut the VACF at t_cut = 1 ps to reduce the effect of spurious long-term non-zero correlations and N_p is the number of particles in the box.
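Both observables reduce to short numerical routines. The sketch below uses the standard Errington-Debenedetti form of q (consistent with reference [7]) and a plain trapezoidal integration of the particle-averaged VACF; the function names are illustrative:

```python
import numpy as np

def tetrahedral_order(cos_angles):
    """q from the 6 cosines of the angles between the 4 nearest neighbors.

    For an ideal tetrahedral arrangement all cosines equal -1/3 and q = 1."""
    cos_angles = np.asarray(cos_angles)
    assert cos_angles.shape == (6,)  # 4 neighbors -> 6 unique pairs
    return 1.0 - 3.0 / 8.0 * np.sum((cos_angles + 1.0 / 3.0) ** 2)

def green_kubo_diffusion(vacf, dt):
    """D = 1/3 * trapezoidal integral of the particle-averaged VACF up to t_cut."""
    vacf = np.asarray(vacf)
    integral = dt * (0.5 * vacf[0] + vacf[1:-1].sum() + 0.5 * vacf[-1])
    return integral / 3.0

# Ideal tetrahedron: all 6 neighbor-pair cosines equal -1/3 -> q = 1
print(tetrahedral_order(np.full(6, -1.0 / 3.0)))  # 1.0
```

In practice the VACF is averaged over particles and time origins before integration.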

Speed-up considerations
Assuming a numerically expensive (NN) potential dominates the computational effort, s_g is determined by the ratio of the cost of the force evaluations required during trajectory generation to the cost of the energy computations for the retained states: as forces for NN potentials are computed by backpropagating potential energy values, they are approximately G times as expensive as energy computations. The rule-of-thumb formula provided in the main text overestimates s_g for expensive observables, but systematically underestimates s_g by ignoring the cost of backpropagating through the time integrator operations. Recognizing that gradient computation costs with DiffTRe are negligible compared to reference trajectory generation costs (under the same assumption of numerically cheap observables), s ∼ G + 1 reflects the cost of trajectory generation plus backward pass versus only the trajectory generation in the case of DiffTRe. We presumed a value of G ≈ 3 for the given estimates in the toy example, which mirrors that gradient computation in the adjoint method requires integrating 3 ordinary differential equations backwards in time [8].
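The overall estimate s ∼ G + 1 is simple arithmetic; a minimal sketch (function name ours) makes the accounting explicit:

```python
def overall_speedup(G=3):
    """Estimated overall speed-up s ~ G + 1 of DiffTRe over differentiating
    through the simulation: reverse-mode differentiation costs roughly G
    forward-equivalents on top of trajectory generation, while DiffTRe's
    reweighting gradient is negligible next to generating the trajectory."""
    return G + 1

print(overall_speedup(3))  # 4
```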

Continuously differentiable binning
The (discrete) Dirac function used in binning can be substituted by a Gaussian probability density function (PDF) centered at the position x_k of binned entity k. The value of bin b_k(x) centered at x can be computed as

b_k(x) = \frac{\Delta x}{\sqrt{2 \pi \delta^2}} \exp \left( -\frac{(x - x_k)^2}{2 \delta^2} \right) ,   (3)

where Δx is the bin width. The implied discrete integral over a PDF guarantees an overall contribution of unity for each binned entity. We set the Gaussian standard deviation δ = Δx. For a fine grid δ → 0, the Dirac function is recovered. Eq. (3) allows defining a normalized differentiable density function

\rho(x) = \frac{1}{N_p \Delta x} \sum_{k=1}^{N_p} b_k(x) ,   (4)

where x_k is the position of each particle in the simulation and N_p is the number of particles in the box. Analogously, we can define a differentiable RDF

g(d) = \frac{2 \Omega}{N_p^2 V(d)} \sum_{k} b_k(d) ,   (5)

where the sum runs over all particle pairs k with pair distance r_k, V(d) is the volume of the spherical shell of b_k(d) and Ω is the simulation box volume. The ADF is a probability density function (PDF) over the triplet angles α_ijk of all particle triplets ijk within a cut-off radius r_c of the central particle j. We smooth the radial cut-off via a Gaussian cumulative distribution function (CDF) Φ(r; r_c, σ²) centered at r_c with variance σ²,

\mathrm{ADF}(\alpha) = \frac{\sum_{ijk} b_{ijk}(\alpha) \left[ 1 - \Phi(r_{k,max}; r_c, \sigma^2) \right]}{\Delta \alpha \sum_{ijk} \left[ 1 - \Phi(r_{k,max}; r_c, \sigma^2) \right]} ,   (6)

where r_{k,max} = max(r_ij, r_kj).
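A minimal sketch of the Gaussian-smoothed bin contribution of Supplementary Eq. (3): we assume the bin value is the Gaussian PDF evaluated at the bin center times the bin width, so that each particle's contributions summed over all bins are approximately one:

```python
import numpy as np

def gaussian_bin_contributions(bin_centers, x_k, dx, delta=None):
    """Smoothed contribution of a particle at x_k to every bin.

    Replaces the discrete Dirac assignment by a Gaussian PDF of std delta,
    with delta = dx (the bin width) as chosen in the text."""
    if delta is None:
        delta = dx
    pdf = np.exp(-0.5 * ((bin_centers - x_k) / delta) ** 2) / (delta * np.sqrt(2.0 * np.pi))
    return pdf * dx  # discrete integral over bins ~ 1

dx = 0.01
centers = np.arange(0.0, 1.0, dx) + dx / 2  # toy-example grid, bin width 0.01
b = gaussian_bin_contributions(centers, x_k=0.47, dx=dx)
print(b.sum())  # ~1.0
```

Unlike a hard histogram assignment, every bin value varies smoothly with x_k, so gradients with respect to particle positions are well-defined.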

Stress-strain relations
Voigt notation provides a convenient way to describe the stress-strain relation by reducing pairs of indices to single digits: 11 → 1, 22 → 2, 33 → 3, 23 → 4, 13 → 5 and 12 → 6. Generalized Hooke's law can then be written as

\sigma = C \varepsilon ,   (7)

assuming σ = 0 for ε = 0. Due to the symmetry of the diamond cubic crystal system, Eq. (7) simplifies to only 3 distinct values in C,

C = \begin{pmatrix}
C_{11} & C_{12} & C_{12} & 0 & 0 & 0 \\
C_{12} & C_{11} & C_{12} & 0 & 0 & 0 \\
C_{12} & C_{12} & C_{11} & 0 & 0 & 0 \\
0 & 0 & 0 & C_{44} & 0 & 0 \\
0 & 0 & 0 & 0 & C_{44} & 0 \\
0 & 0 & 0 & 0 & 0 & C_{44}
\end{pmatrix} .   (8)

The inverse relation is defined by the compliance tensor S = C⁻¹, which is usually given in terms of Young's modulus E, shear modulus G and Poisson's ratio ν,

S = \begin{pmatrix}
1/E & -\nu/E & -\nu/E & 0 & 0 & 0 \\
-\nu/E & 1/E & -\nu/E & 0 & 0 & 0 \\
-\nu/E & -\nu/E & 1/E & 0 & 0 & 0 \\
0 & 0 & 0 & 1/G & 0 & 0 \\
0 & 0 & 0 & 0 & 1/G & 0 \\
0 & 0 & 0 & 0 & 0 & 1/G
\end{pmatrix} .

The stress-strain curves are computed by deforming the box in two separate modes that yield states of pure normal and shear strain, respectively [9]. In the normal mode, we transform the box according to

\begin{pmatrix} 1 + \xi & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} ,

which yields the non-zero strain ε_1 = ε_11 = ξ in the strain vector ε = (ε_1, 0, 0, 0, 0, 0). A pure shear mode is given by

\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & \xi/2 \\ 0 & \xi/2 & 1 \end{pmatrix} ,

which yields ε_4 = 2 ε_23 = ξ in the strain vector ε = (0, 0, 0, ε_4, 0, 0). These elementary deformations [9] allow probing C such that a single component of C describes the relation between ε_i and the measured σ_j (Eq. (8)).
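The mapping from the three distinct cubic stiffness constants to E, G and ν via S = C⁻¹ can be sketched by inverting the 6×6 Voigt matrix numerically; the stiffness values below are illustrative, and the function names are ours:

```python
import numpy as np

def cubic_stiffness(C11, C12, C44):
    """Assemble the 6x6 Voigt stiffness matrix of a cubic crystal."""
    C = np.zeros((6, 6))
    C[:3, :3] = C12
    np.fill_diagonal(C[:3, :3], C11)
    for i in range(3, 6):
        C[i, i] = C44
    return C

def elastic_moduli(C):
    """Young's modulus E, shear modulus G and Poisson's ratio nu from S = C^-1."""
    S = np.linalg.inv(C)
    return 1.0 / S[0, 0], 1.0 / S[3, 3], -S[0, 1] / S[0, 0]

# Illustrative cubic stiffness constants (GPa)
C = cubic_stiffness(1079.0, 124.0, 578.0)
E, G, nu = elastic_moduli(C)
```

For the cubic case the inversion reproduces the closed-form relations E = (C11 - C12)(C11 + 2 C12)/(C11 + C12), G = C44 and ν = C12/(C11 + C12).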

DimeNet++ hyperparameters
We refer the reader to the original DimeNet / DimeNet++ publications [10,11] for a detailed description of the neural network architecture. We reduced embedding sizes by a factor of 4: the standard embedding size then becomes 32, the output embedding size 64, the triplet and atom-type embedding sizes become 16 and the Bessel-basis embedding remains at a size of 8. All other hyperparameters are unchanged: a cut-off length of 0.5 nm (0.2 nm for diamond), 4 interaction layers, 3 fully-connected output layers, 1 residual block before and 2 residual blocks after the skip connection, 6 radial and 7 angular Bessel embedding functions with a continuously differentiable envelope function of order 6 and a swish [12] activation function. Weights are initialized via an orthogonal Glorot [13,10] scheme.

[Figure caption] Random initialization study for the double-well toy example. A mean matching the target and small standard deviations (shaded area) when re-starting the optimization with random seeds from 0 to 99 demonstrate that the learned normalized density profile is robust with respect to the initialization of the spline and the initial simulation state (a). The corresponding learned potential exhibits larger standard deviations at the left well boundary due to difficult training in this region (b). Potentials are shifted vertically for visualization purposes such that all potentials coincide at x/X = 0.5.

[Figure caption, fragment] ... [14] and a Stillinger-Weber potential optimized for diamond [15]. The evolution of the predicted PDOS over the course of the training is shown in panel d.

Figure 6: Random initialization study for diamond. For random seeds from 0 to 4 (controlling the random initialization of neural network weights as well as initial particle velocities), the predicted observables are distributed closely around their respective targets (a). Corresponding predicted phonon densities of states (PDOSs) vary largely across different random seeds (b), confirming that different PDOSs are consistent with the target stress and stiffness values. The boxplots in a, with median (orange line), interquartile range as box limits and whiskers representing 1.5 times the interquartile range, are added for visualization purposes.

[Figure caption, fragment] ... (a-b). These results are obtained using the same hyperparameters as in the reference case σ_R = 0.3165 nm, except for longer training (1000 steps) with an increased learning rate decay factor (0.25) in the case of σ_R = 0.4 nm. Additionally, predicted target observables are robust to the random initialization of NN weights and initial particle velocities (c-d, p = 68 ± 32 bar).
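The reduced DimeNet++ hyperparameters listed above can be collected in a single configuration mapping; the key names are ours for illustration and do not correspond to the DimeNet++ API:

```python
dimenet_pp_config = {
    # embedding sizes after reduction by a factor of 4
    "embedding_size": 32,
    "out_embedding_size": 64,
    "triplet_embedding_size": 16,
    "atom_type_embedding_size": 16,
    "bessel_embedding_size": 8,
    # unchanged hyperparameters
    "cutoff_nm": 0.5,  # 0.2 nm for diamond
    "num_interaction_layers": 4,
    "num_output_layers": 3,
    "residual_blocks_before_skip": 1,
    "residual_blocks_after_skip": 2,
    "num_radial_bessel": 6,
    "num_angular_bessel": 7,
    "envelope_order": 6,
    "activation": "swish",
    "weight_init": "orthogonal_glorot",
}
```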