Introduction

With their ability to accelerate the discovery of new materials and decipher the properties of existing ones, Molecular Dynamics (MD) simulations have become a cornerstone of materials science1. Nevertheless, their true capability is often hindered by the accuracy vs. efficiency trade-off of traditional approaches. Ab initio MD provides high-accuracy predictions at low computational efficiency, while the converse holds for MD simulations based on classical force fields. In theory, Machine Learning (ML) approaches2,3 and, in particular, ML potentials4,5,6,7,8 can overcome this compromise due to their multi-body construction of the potential energy with an unspecified functional form. In practice, the success of ML potentials hinges primarily on the training data, which can come from simulations, experiments, or both.

Typically, the former source is used, with ab initio calculations providing energy, forces, and potentially virial stress (target labels) for different atomic configurations (inputs)9,10,11,12,13,14,15,16. Such a setup, also known as bottom-up learning, has the benefit of straightforward training and should result in ML potentials that reproduce all properties of the underlying model. However, generating ab initio training data that is sufficiently accurate, large, and broad (without distribution shift) is challenging.

The CCSD(T) method (coupled cluster with single, double, and perturbative triple excitations), regarded as the gold standard of electronic structure theory, is generally computationally infeasible for large dataset generation. Thus, most ML potentials are trained on the more affordable but less accurate Density Functional Theory (DFT) calculations. These are not always in quantitative agreement with experimental measurements, and consequently, neither are ML potentials trained on DFT data. For example, a recent ML-based model of titanium17 does not quantitatively reproduce the experimental temperature-dependent lattice parameters and elastic constants. For these properties, it achieved a similar level of agreement with experiments as the classical MEAM (modified embedded atom method) potential18. Deviations in phase diagram predictions are also frequent19,20,21,22. In all cases, these deviations were attributed to DFT inaccuracies. To approach CCSD(T)-level accuracy, transfer learning23 or Δ-learning24 techniques, exploiting a large DFT and a small CCSD(T) dataset, can be used.

Nevertheless, DFT training data, albeit cheaper, is still computationally expensive to generate, and an optimal selection of atomic configurations is needed to keep the training data diverse and non-redundant. Typically, training datasets are carefully prepared and contain specialized sub-datasets based on the target application, such as surfaces, defects, lattice distortions, thermal displacements, configurations along phase transformation pathways, etc.25,26,27,28 Alternatively, an active learning approach29,30,31,32,33 is used, where the dataset is extended on the fly during training. These methods require a robust uncertainty quantification scheme, which remains problematic for Neural Network (NN)-based potentials34,35,36,37,38,39.

Apart from the dataset size, the system size (number of atoms per configuration) can also play a significant role in the choice of model components and hyperparameters and, consequently, in the resulting trained model40. Due to the cubic scaling of DFT implementations, the average number of atoms is typically below one hundred for dense systems under periodic boundary conditions. It is questionable whether long-range interactions41 can be learned from such databases, considering the recent finding that features related to interatomic distances as large as 15 Å can play an essential role in describing non-local interactions42.

The difficulties of ab initio data generation can be circumvented if ML potentials are instead trained top-down, i.e., on experimental data43,44,45,46. While experimental data is also scarce, potentially laborious to obtain, and contains measurement errors, the information obtained per data sample is much larger compared to bottom-up learning. In simulations, experimentally observable properties of a system are computed as ensemble averages, i.e., averaged over a very large number of atomic configurations. This fact also complicates training, since it requires running forward simulations to calculate the properties and, in principle, subsequent gradient backpropagation through the simulation. Automatic differentiation47 and recent end-to-end differentiable software48,49,50 have made such endeavors technically possible. In practice, backpropagation through the simulation is infeasible for properties that require long simulations due to issues such as memory overflow, exploding gradients, and high computational costs43,51,52. However, for time-independent properties, these issues can be avoided with the Differentiable Trajectory Reweighting (DiffTRe) method43 that, rather than backpropagating through the trajectory, employs a reweighting technique. For a test case diamond system, the method yielded an ML potential that reproduced the target experimental mechanical properties at ambient conditions. Yet, for the out-of-target phonon density of states, substantially different results were obtained for different random initializations, showcasing that high-capacity ML potentials are under-constrained when trained on a handful of experimental observations43. Combining both simulation and experimental data sources, an idea used for decades to construct classical force fields53, should, therefore, yield the best approach also for ML potentials. This idea was recently used in ref. 54, where a two-body correction trained on structural experimental data was added to a fixed ML potential trained on DFT data. However, such a Δ-learning approach is limited, as two-body potentials cannot reproduce many experimental observables simultaneously. On the other hand, replacing the two-body potential with another deep ML potential would double the computational cost.

In this work, we demonstrate the benefits of training a single deep ML potential to simultaneously reproduce simulation and experimental data. In particular, we train a Graph Neural Network (GNN) potential for titanium on DFT-calculated energies, forces, and virial stress for various atomic configurations, together with experimental mechanical properties and lattice parameters of hcp titanium in the temperature range of 4 to 973 K. We then test the resulting model, which faithfully reproduces all target properties, on several out-of-target properties, i.e., phonon spectra, bcc titanium mechanical properties, and liquid-phase structural and dynamical properties. We find that the out-of-target properties are only mildly and mostly positively affected by the combined training approach, revealing a remarkably large capacity of state-of-the-art ML potentials.

Results

Fused data training approach

Concurrent training on DFT and experimental data can be achieved by iteratively employing both a DFT trainer and an EXP trainer (Fig. 1). The former involves a standard regression problem: the ML potential takes an atomic configuration S as input and predicts the potential energy U, from which the forces on all atoms F and the virial stress tensor V are computed by differentiating with respect to the atoms' positions. The parameters θ are modified using batch optimization for one epoch to match the ML potential's predictions to the target values in the DFT database. We reuse previously published DFT calculations for titanium17,55. The DFT database consists of 5704 samples. It includes equilibrated, strained, and randomly perturbed hcp, bcc, and fcc titanium structures, as well as configurations obtained via high-temperature MD simulations and an active learning approach. Further details are in the Supplementary Information.
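To illustrate the prediction step of the DFT trainer, the following minimal JAX sketch derives forces and the virial from a potential energy function by automatic differentiation. The toy pairwise `energy_fn` is a hypothetical stand-in for the actual GNN potential (in the paper, DimeNet++), and the strain-derivative form of the virial is simplified for a non-periodic cluster:

```python
import jax
import jax.numpy as jnp

# Toy stand-in for the GNN potential: a sum of pairwise Gaussian repulsions.
# In the paper, this would be the DimeNet++ model with parameters `params`.
def energy_fn(params, positions):
    diff = positions[:, None, :] - positions[None, :, :]
    r2 = jnp.sum(diff ** 2, axis=-1)
    return params * jnp.sum(jnp.triu(jnp.exp(-r2), k=1))

def forces(params, positions):
    # F = -dU/dR for every atom, via automatic differentiation.
    return -jax.grad(energy_fn, argnums=1)(params, positions)

def virial(params, positions):
    # V_kl = -dU/d(eps_kl) at eps = 0, straining positions R -> (I + eps) R.
    # (For periodic systems, the box vectors must be strained as well;
    # omitted here for brevity.)
    def strained_energy(eps):
        return energy_fn(params, positions @ (jnp.eye(3) + eps).T)
    return -jax.grad(strained_energy)(jnp.zeros((3, 3)))
```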

Fig. 1: Investigated models.
figure 1

The DFT pre-trained model (a) is trained only with the DFT trainer (d), which optimizes the parameters of the ML potential to match the reference DFT potential energy \(\tilde{U}\), forces \(\tilde{F}\), and virial \(\tilde{V}\) for different atomic environments S. For the DFT, EXP sequential model (b), the ML potential is initialized with the parameters of the DFT pre-trained model and trained with the EXP trainer (e), which optimizes the ML potential to reproduce experimental observables \(\tilde{O}\). The EXP trainer requires simulations, since the observables are not a direct output of the ML model but are computed as time averages over the simulated trajectory. The DFT & EXP fused model (c) is obtained by alternating between the DFT and EXP trainers, starting from the DFT pre-trained model. In all cases, the DFT and/or EXP trainers are repeatedly applied for one epoch until the total number of epochs is reached.

The EXP trainer, on the other hand, optimizes the parameters θ for one epoch such that the properties of titanium (observables) computed from the ML-driven simulation's trajectory match the experimental values, with the gradients computed via the DiffTRe method43. We consider temperature-dependent, solid-state elastic constants of hcp titanium as the target experimental properties. The elastic constants of titanium were measured experimentally at 22 different temperatures in the range of 4−973 K56. Nevertheless, we select only the following four temperatures for the experimental training database: 23, 323, 623, and 923 K. With this choice, we reduce the computational cost per epoch and encode our expectation that the models will be, to some degree, temperature transferable. The elastic constants are evaluated in the NVT ensemble, where the box size is set according to the experimentally determined lattice constants57 (see Supplementary Information). Thus, by adding the additional target of zero pressure, we indirectly match also the experimental lattice constants.

To investigate the impact of the DFT and EXP trainers, we compare three different approaches: (i) the DFT pre-trained model, employing only the DFT trainer; (ii) the DFT, EXP sequential model, employing only the EXP trainer; and (iii) the DFT & EXP fused model, obtained by alternating between the DFT and EXP trainers. The switching between the trainers is performed after processing all respective training data, i.e., after one epoch. Alternatively, batch-wise switching could be employed. For the last two approaches, the parameters of the ML potential are not initialized randomly but with the values of the DFT pre-trained model. This allows us to circumvent the use of prior potentials, typical for top-down learning43,52. Prior potentials are simple classical potentials added to the ML potential to avoid unphysical trajectories and, therefore, slow learning in the initial learning stage. The models are trained for a fixed number of epochs, and the final model is selected with early stopping, as sketched below. For further information, see Supplementary Information.
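Schematically, the fused procedure reduces to a simple alternating loop with early stopping. The helpers below are trivial stubs standing in for one epoch of the DFT trainer (Eq. (1)), one epoch of the EXP trainer (Eq. (2)), and the validation metric; their names are illustrative, not the authors' actual API:

```python
import random

def dft_epoch(params):
    return params           # stub: would run one epoch on the DFT dataset

def exp_epoch(params):
    return params           # stub: would run one DiffTRe epoch on the EXP data

def validation_loss(params):
    return random.random()  # stub: would combine DFT and EXP validation errors

params = "dft_pretrained"   # start from the DFT pre-trained parameters
best = (float("inf"), params)

for epoch in range(100):
    params = exp_epoch(dft_epoch(params))  # alternate trainers, epoch-wise
    loss = validation_loss(params)
    best = min(best, (loss, params))       # early stopping keeps the best model

best_loss, best_params = best
```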

Simultaneously learning DFT and experimental target properties

We compute the energy, force, and virial errors on the DFT test dataset (Table 1) for all three investigated models. For the DFT pre-trained model, the obtained energy error is below 43 meV (1 kcal mol−1), the threshold generally accepted within the chemistry community as chemical accuracy58. In Supplementary Table 3, we additionally show the errors for the portion of the test dataset containing only strained and perturbed hcp or bcc samples. The force and virial errors are an order of magnitude lower when high-temperature configurations are excluded. This difference is due to the larger force magnitudes in high-temperature configurations. Indeed, the relative force errors are similar for all test datasets (Supplementary Table 4). We compare favorably with the previously published ML-based potential model17 for the force errors, while our energy errors are somewhat higher. However, precedence can be given to the energy, force, or virial error by changing the weights of the loss function (Eq. (1)). We place higher emphasis on the forces, as these are the relevant quantities for carrying out MD simulations.

Table 1 Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of energy, force, and virial predictions computed on the test DFT dataset

When training on both DFT and experimental data (DFT & EXP fused model), the errors are only slightly increased compared to training only on DFT data (DFT pre-trained model). An increase is expected, as the model has to satisfy both DFT and experimental objectives, which are partially conflicting due to DFT inaccuracies as well as experimental errors. The fact that the errors do not change drastically indicates that the DFT errors in energy, force, and virial predictions are minor. Nevertheless, a small difference in force prediction can lead to large differences in MD simulations and the subsequent evaluation of properties, as we later show for the mechanical properties.

For the DFT, EXP sequential model, the force and virial errors are still comparable to the DFT pre-trained model, but the energy error is drastically increased (Supplementary Fig. 2 shows the energy RMSE during training). This is not surprising, considering that MD simulations and our target experimental properties do not depend on the energy itself but only on its derivatives. The EXP trainer, therefore, leaves the energy undetermined up to a constant, as confirmed by the predicted vs. DFT energy plot (Supplementary Fig. 1). Consequently, any energy-related quantity will also be predicted incorrectly. For example, the energy versus volume equation of state curves for the hcp, fcc, and bcc structures are all shifted by a constant and equal value (Fig. 2). Nevertheless, this shift can be evaluated in post-analysis. In particular, we compute the mean energy shift on the training dataset and apply it to the test dataset. With this correction, the energy RMSE and MAE are 14.0 and 9.5 meV atom−1, respectively. These errors are slightly higher than, but comparable to, the errors of the DFT & EXP fused model. The DFT, EXP sequential model demonstrates the importance of including DFT data in training, especially when the experimental dataset does not include properties directly related to energies.
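The shift correction itself is straightforward. A minimal sketch, with placeholder arrays standing in for the per-atom energies (values are illustrative, not the actual data):

```python
import numpy as np

# Post-hoc constant-shift correction for the DFT, EXP sequential model. The
# EXP trainer constrains only energy derivatives, so predicted energies can
# be offset by a constant recovered from the training set.
E_pred_train, E_dft_train = np.zeros(100), np.full(100, 250.0)  # placeholders
E_pred_test, E_dft_test = np.zeros(50), np.full(50, 250.0)      # placeholders

shift = np.mean(E_dft_train - E_pred_train)    # mean training-set offset
E_corrected = E_pred_test + shift              # apply to test predictions

rmse = np.sqrt(np.mean((E_corrected - E_dft_test) ** 2))
mae = np.mean(np.abs(E_corrected - E_dft_test))
```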

Fig. 2: Agreement with DFT data.
figure 2

Energy vs. volume for the hcp (a), fcc (b), and bcc (c) titanium crystal structures computed for samples in the test DFT dataset. The predictions of the DFT pre-trained, DFT, EXP sequential, and DFT & EXP fused models are denoted with red, green, and blue points, respectively. DFT calculations are denoted with a black dashed line.

Next, we evaluate the elastic constants of hcp titanium (Supplementary Fig. 3), which are the target properties of the EXP trainer. Additionally, we report in Fig. 3a–c the bulk modulus, shear modulus, and Poisson's ratio, which are all directly related to the elastic constants. These properties are computed for all 22 temperatures in the range of 4−973 K where experimental data is available. Training only on DFT data (DFT pre-trained model) fails to reproduce the mechanical properties. On average, the model deviates from the experimental data by 6, 24, and 9% in bulk modulus, shear modulus, and Poisson's ratio, respectively (Supplementary Table 5). In terms of elastic constants, the predictions are off by more than 20 GPa for some components. Similar deviations in mechanical properties were reported for other ML potentials17,19,22. In contrast, for the two models that include the EXP trainer, the elastic constants are within a few GPa of the experimental values, while the relative errors for the bulk modulus, shear modulus, and Poisson's ratio are below 3%. We obtain good agreement with experimental observations over the entire investigated temperature range, even though we fit the elastic constants only at four temperatures. Naturally, the agreement is better for the DFT, EXP sequential model, because the DFT and experimental datasets both contain errors and are, thus, somewhat incompatible.

Fig. 3: Agreement with EXP data.
figure 3

Bulk modulus (a), shear modulus (b), Poisson’s ratio (c), and lattice constants a (d) and c (e) as a function of temperature for hcp titanium. The DFT pre-trained, DFT, EXP sequential, and DFT & EXP fused models are denoted with red, green, and blue line points, respectively. The experimental results are denoted with a black dashed line. Error bars denote the standard deviation computed via block-averaging with ten blocks.

An additional target property of the EXP trainer is zero pressure (Supplementary Fig. 4) at fixed, experimentally determined simulation box sizes. In Fig. 3d, e, we show an equivalent result, i.e., the temperature-dependent lattice constants evaluated in the isothermal-isobaric ensemble. Similarly as for the mechanical properties, the addition of the EXP trainer improves the results for both target and non-target temperatures, with the DFT, EXP sequential model being closest to the experimental reference values. Note that the DFT & EXP fused model's relative deviations from the experimental values are below 0.1%, i.e., smaller than the deviations in the mechanical properties (Supplementary Table 5).

Generalization to off-target properties and thermodynamic states

As a first test of the generalization capabilities to off-target properties, we compute the phonon spectra of hcp titanium (Fig. 4). All models agree well with the experimental measurements, with the DFT pre-trained model in closest agreement based on the phonon density of states (Supplementary Fig. 5). Good agreement is expected for the DFT pre-trained and DFT & EXP fused models, since ML potentials trained on DFT data typically reproduce phonon dispersion curves well, and much better than classical potentials17,20,27,59. Interestingly, we obtain good agreement also for the DFT, EXP sequential model. Our previous study43 showed that training randomly initialized ML potentials on mechanical properties leads to models with drastically different phonon densities of states, i.e., high-capacity models are under-constrained when trained on a small set of target properties. Additional properties could be included to converge toward a unique potential energy solution. However, the required experimental database size is unknown a priori. Our results in Fig. 4b indicate an alternative route. Pre-training on DFT data seems to constrain the solution to a particular region in parameter space, which is only locally modified by the subsequent training on the experimental data. This hypothesis is in accordance with the observed similar force errors for the DFT pre-trained and DFT, EXP sequential models (Table 1).

Fig. 4: Off-target solid state property.
figure 4

Phonon dispersion curves of hcp titanium for the DFT pre-trained (a), DFT, EXP sequential (b), and DFT & EXP fused (c) models. The ML potential models' predictions match well the black dashed lines denoting the experimental dispersion measured at 295 K79.

To further validate our rationale, we examine the structural and dynamical properties of liquid titanium. Two-body and three-body local structural order is measured with the radial distribution function (RDF) and the angular distribution function (ADF), respectively. For all investigated models, the results are indistinguishable within the line thickness (Fig. 5a, b). Moreover, the obtained RDFs are very close to the experimental measurement. For the ADFs, the positions of the minima and maxima agree very well with the experiments, while the absolute values differ slightly. The largest deviation between the three models is observed in Fig. 5c, which presents the self-diffusion coefficients calculated via the velocity autocorrelation function (Supplementary Fig. 6). Both models trained on experimental data yield better results on average than the DFT pre-trained model, with the DFT, EXP sequential model performing best.

Fig. 5: Off-target liquid state properties.
figure 5

Radial distribution function (RDF, a), angular distribution function (ADF, b), and self-diffusion coefficients (D, c) for the DFT pre-trained (red), DFT, EXP sequential (green), and DFT & EXP fused (blue) models. The RDFs and ADFs are computed at 1965 K and compared with experimentally determined RDF80 at 1965 K and ADF81 at 1973 K (black, dashed). The self-diffusion is evaluated at 1953, 2000, 2060, and 2110 K for comparison to experiments82,83,84. The experimental error bars are estimated based on the experimental error bar at 2000 K84.

Next, we consider generalization to different pressures. To this end, we compute the lattice constants of hcp titanium at a temperature of 300 K and elevated pressures (Fig. 6). Similarly as in the case of diffusion, we find the closest agreement with the experimental values for the DFT, EXP sequential model. However, such an outcome is not always guaranteed, as we show next.

Fig. 6: Off-target thermodynamic states.
figure 6

Lattice constants a (a) and c (b) of hcp titanium for varying pressures at temperature 300 K. The DFT pre-trained (red), DFT, EXP sequential (green), and DFT & EXP fused (blue) models are compared to experimental values EXP-1 (SPring-8 data) and EXP-2 (NSLS data)85.

We evaluate all three models on the bcc elastic constants at 1273 K (Table 2). Concrete conclusions are difficult given the significant deviations between the three experimental references at equal or similar temperatures. Nevertheless, assuming that the latest experimental results by Ledbetter et al.60 are the most accurate, the DFT & EXP fused model is the best overall. In particular, it performs best on C12 and second best on C11 and C44. Note that the bcc lattice is never seen during training on the EXP database, while the DFT dataset does contain equilibrated, strained, and perturbed bcc structures.

Table 2 Elastic constants in GPa of bcc titanium at 1273 K

Experimental data ablation

Lastly, we perform a data ablation study. An additional model, labeled DFT & EXP (323 K) fused, is trained with the same approach as the DFT & EXP fused model, but with experimental training data containing the elastic constants and pressure only at a single temperature of 323 K. The aim is to reveal the effect of the experimental data size as well as the model's temperature transferability. As shown in Fig. 7, the DFT & EXP (323 K) fused model yields improved mechanical properties and lattice parameters over the entire temperature range compared to training only on DFT data, i.e., the DFT pre-trained model. The predicted elastic constants are shown in Supplementary Fig. 7. However, as expected, the mechanical properties are not as accurate as those obtained by training on experimental data at four different temperatures (the DFT & EXP fused model trained at 23, 323, 623, and 923 K). In general, due to the temperature transferability of the models, it seems more beneficial to enlarge the experimental dataset with diverse properties rather than with a single property at densely sampled temperatures.

Fig. 7: Data ablation study.
figure 7

Bulk modulus (a), shear modulus (b), Poisson’s ratio (c), and lattice constants a (d) and c (e) as a function of temperature for hcp titanium. The DFT pre-trained, DFT & EXP fused, and DFT & EXP (323 K) fused models are denoted with red, blue, and dark blue line points, respectively. The last two models differ only in experimental training data, i.e., the DFT & EXP (323 K) fused model is trained on data at a single temperature of 323 K. The experimental reference values are marked with a black dashed line. Error bars denote the standard deviation computed via block-averaging with ten blocks.

Discussion

Using titanium as a test case system, we have demonstrated the advantages of using both experimental and simulation data to train ML potentials. We tested two strategies of employing the DFT and experimental data, i.e., sequential and fused, and referenced them against using only DFT data. Note that training only on experimental data is difficult without a prior potential and was therefore not attempted.

The addition of experimental data resulted in ML potentials that reproduced target experimental properties, thus correcting for the inaccuracies of the DFT calculations and limited DFT training dataset. Moreover, some of the off-target properties (e.g., diffusion) improved even though the relevant (e.g., liquid) configurations were never seen by the EXP trainer.

On the other hand, pre-training on the DFT data has the effect of regularizing the solution, as evidenced by the very similar or only mildly different out-of-target properties. This is especially important when the experimental dataset is scarce. As we have shown previously43, ML potentials fitted only on a handful of observations can differ substantially on out-of-target properties due to the large capacity of these models. In general, top-down training lacks the theoretical guarantees of bottom-up approaches and can result in deteriorated out-of-target properties. For this reason, we advocate for the DFT & EXP fused approach rather than the DFT, EXP sequential approach, even though the latter performed better on some out-of-target properties. With minimal computational overhead, the fused training ensures that the solution remains close to the DFT solution, which might deviate from experiments somewhat but not drastically. Furthermore, experimental measurements also contain errors, and conflicting results might be reported in the literature, e.g., for the mechanical properties of bcc titanium67. The DFT & EXP fused approach can, therefore, to some extent overcome the deficiencies of pure bottom-up or top-down training.

In this paper, the experimental target properties were elastic and lattice constants. However, the DiffTRe approach is general, and, in principle, any other static structural or thermodynamic property could be used43. In practice, training on such properties requires running simulations, the spatiotemporal scales of which should be sufficiently large to reasonably estimate the target properties and, consequently, obtain informative gradients. Thus, observables involving rare events might be out of reach for conventional computational resources.

The number of required simulation runs can be reduced with reweighting techniques. The DiffTRe method employs the simplest, Zwanzig approach61, which reweights observables from a single reference state. In this work, simulations were reinitialized at every parameter update to avoid an additional layer of complexity. Nevertheless, the reweighting ansatz is still used to provide a relation between the observables and the parameters of the ML potential, enabling a direct route to the gradient computation. Other reweighting approaches could also be employed, for example, the multistate Bennett acceptance ratio (MBAR)62,63,64, where information from multiple states is used to probe the configuration space of the unsampled state65. Note that the computational overhead of evaluating the potential energy for multiple states is minor compared to forward simulations. Multistate reweighting techniques are typically more accurate in estimating ensemble averages and could provide more accurate gradients. On the other hand, deep ML methods sometimes benefit from noisy gradients66. Additionally, all reweighting methods require sufficient configuration overlap, and choosing appropriate reference states is a non-trivial task. Therefore, the best reweighting technique is an open question that we leave for future work.

Methods

ML potential architecture

We employ the message passing GNN DimeNet++11 using our implementation in JaxMD43, which takes advantage of neighbor lists for the efficient computation of the sparse atomic graph. We select the same neural network hyperparameters (Supplementary Table 1) as in the original publication11, except for the embedding sizes, which we reduce by a factor of 4 for computational speed-up. The cut-off is set to 0.5 nm.
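For illustration, the neighbor-list machinery in JaxMD can be set up as in the following sketch, assuming the current jax_md API; the box size and random positions are placeholders, with lengths in nm to match the 0.5 nm cutoff:

```python
from jax import random
from jax_md import space, partition

box = 2.0                                          # cubic box edge, nm (placeholder)
positions = random.uniform(random.PRNGKey(0), (256, 3)) * box

displacement, shift = space.periodic(box)          # periodic boundary metric
neighbor_fn = partition.neighbor_list(
    displacement, box, r_cutoff=0.5, dr_threshold=0.05)

nbrs = neighbor_fn.allocate(positions)             # build the list once...
nbrs = nbrs.update(positions)                      # ...then update cheaply during MD
edges = nbrs.idx                                   # (N, max_neighbors) index array
```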

DFT trainer

We use a weighted mean squared error loss function

$$L_{{\rm{DFT}}}=\frac{1}{N_{data}}\sum_{i=1}^{N_{data}}\left[\omega_{U}\,(U_{i}-\tilde{U}_{i})^{2}+\frac{\omega_{F}}{3N_{atoms}}\sum_{j=1}^{N_{atoms}}\sum_{k=1}^{3}(F_{ijk}-\tilde{F}_{ijk})^{2}+\frac{\omega_{V}}{9}\sum_{k=1}^{3}\sum_{l=1}^{3}(V_{ikl}-\tilde{V}_{ikl})^{2}\right]$$
(1)

where Ui is the energy of the i-th atomic environment in a batch, Fijk is the force in the k-direction on the j-th atom, and Vikl is the k,l-component of the virial. The reference DFT values are denoted with a tilde. The weights for the energy and force are set to ωU = 10−6 and ωF = 10−2, while for the virial contribution, only the uniformly deformed supercells contribute, with ωV = 4 × 10−6. The numerical optimization hyperparameters are reported in Supplementary Table 2.
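In code, Eq. (1) for a single batch is a direct transcription; the prediction and reference arrays are placeholders with shapes U (batch,), F (batch, atoms, 3), and V (batch, 3, 3):

```python
import jax.numpy as jnp

def dft_loss(U, F, V, U_ref, F_ref, V_ref,
             w_U=1e-6, w_F=1e-2, w_V=4e-6):
    # Weighted MSE of Eq. (1); in training, the virial term is active only
    # for the uniformly deformed supercells.
    n_atoms = F.shape[1]
    e_term = w_U * (U - U_ref) ** 2
    f_term = w_F / (3 * n_atoms) * jnp.sum((F - F_ref) ** 2, axis=(1, 2))
    v_term = w_V / 9 * jnp.sum((V - V_ref) ** 2, axis=(1, 2))
    return jnp.mean(e_term + f_term + v_term)
```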

EXP trainer

We define the loss function as

$$L_{{\rm{EXP}}}=\frac{1}{N_{temp}}\sum_{n=1}^{N_{temp}}\sum_{m=1}^{N_{obser}}w_{m}\,(O_{m,n}-\tilde{O}_{m,n})^{2}=\frac{1}{N_{temp}}\sum_{n=1}^{N_{temp}}\left[\omega_{P}\,P_{n}^{2}+\frac{\omega_{C}}{5}\left\{(C_{11,n}-\tilde{C}_{11,n})^{2}+(C_{12,n}-\tilde{C}_{12,n})^{2}+(C_{13,n}-\tilde{C}_{13,n})^{2}+(C_{33,n}-\tilde{C}_{33,n})^{2}+(C_{44,n}-\tilde{C}_{44,n})^{2}\right\}\right],$$
(2)

where Om,n is the m-th observable at the n-th temperature in a batch and the tilde denotes the experimental value. The observables are the scalar pressure Pn and the elastic constants C11, C12, C13, C33, and C44 in Voigt notation. The weights are ωP = 10−9 and ωC = 10−10. The gradient of the loss with respect to the parameters of the ML potential is obtained with the DiffTRe method43, where the ensemble average of an observable Om,n is computed with the reweighting ansatz for the canonical ensemble61,67,68

$$\langle O_{m,n}(U_{\theta })\rangle \simeq \sum_{i=1}^{N_{traj}}w_{i}\,O_{m,n}(S_{i},U_{\theta })\quad {{\rm{with}}}\quad w_{i}=\frac{e^{-\beta (U_{\theta }(S_{i})-U_{\hat{\theta }}(S_{i}))}}{\sum_{j=1}^{N_{traj}}e^{-\beta (U_{\theta }(S_{j})-U_{\hat{\theta }}(S_{j}))}}.$$
(3)

The summation runs over the trajectory states/atomic environments S, β = 1/(kBT), kB is the Boltzmann constant, and T is the temperature. \({U}_{\hat{\theta }}\) and Uθ denote the reference and perturbed ML potentials, respectively. We reinitialize the forward trajectory generation for every parameter update; thus, \({U}_{\hat{\theta }}={U}_{\theta }\) and w = 1 for every sample. Nevertheless, \({\nabla }_{\theta }{L}_{{{{\rm{EXP}}}}}\) is generally non-zero. Further details can be found in ref. 43. The numerical optimization hyperparameters are reported in Supplementary Table 2.
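In code, the weights of Eq. (3) are a softmax over the scaled energy differences. A minimal JAX sketch, with per-snapshot energy and observable arrays as placeholders:

```python
import jax.numpy as jnp
from jax.nn import softmax

def reweighted_average(O, U_theta, U_ref, kT):
    # U_theta, U_ref: per-snapshot energies of the perturbed and reference
    # potentials along the sampled trajectory; O: instantaneous observables.
    w = softmax(-(U_theta - U_ref) / kT)   # w_i of Eq. (3), sums to one
    return jnp.sum(w * O)                  # <O> under the perturbed potential
```

Because `U_theta` carries the dependence on θ, differentiating this average (e.g., with `jax.grad`) yields a generally non-zero gradient even when the perturbed and reference potentials coincide and all weights equal 1/Ntraj, which is exactly the situation after every trajectory reinitialization.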

ML potential-driven MD simulations

All MD simulations are performed in JaxMD48 using a velocity Verlet integrator with a time step of 0.5 fs. The simulated system contains 256 atoms unless otherwise stated. The mass of the titanium atoms is set to 47.867 u. For NVT simulations during training, and to compute the elastic constants and pressure in postprocessing, we use the Langevin thermostat with a friction constant of 4 ps−1. For the remaining postprocessing, we run NVT simulations using a Nosé-Hoover thermostat and NPT simulations with a Nosé-Hoover thermostat and barostat. For the Nosé-Hoover chains, we use a chain length of 5, 2 chain steps, and 3 Suzuki-Yoshida steps, and set the thermostat damping parameter to τ = 50 fs and the barostat damping parameter to τ = 500 fs. The pressure is set to 0.
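A minimal sketch of the Langevin NVT setup in JaxMD follows, assuming the current jax_md API. A built-in soft-sphere potential stands in for the trained GNN, and units are ps/nm so that dt = 5 × 10−4 ps = 0.5 fs and γ = 4 ps−1 match the protocol above; the kT value is a placeholder consistent with the toy potential:

```python
from jax import random
from jax_md import energy, simulate, space

box = 2.0
positions = random.uniform(random.PRNGKey(0), (256, 3)) * box
displacement, shift = space.periodic(box)
energy_fn = energy.soft_sphere_pair(displacement)   # stand-in for the GNN

init_fn, apply_fn = simulate.nvt_langevin(
    energy_fn, shift, dt=5e-4, kT=0.0257, gamma=4.0)
state = init_fn(random.PRNGKey(1), positions, mass=47.867)

for _ in range(1000):          # production runs in the paper are far longer
    state = apply_fn(state)
```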

In the EXP trainer, the elastic constants and pressure are computed from an 80 ps NVT simulation, where the first 10 ps are discarded as equilibration, and the state is saved every 0.1 ps. The isothermal elasticity tensor is computed with the stress-fluctuation method43,69.

To analyze the properties of the trained models, we perform the following simulations. For the hcp elastic constants and pressure, we perform a 100 ps NVT equilibration run followed by a 1 ns NVT production run. As in training, the box size is set according to the experimental lattice parameters at a given temperature. The elastic constants are saved every 0.1 ps. The bulk modulus and shear modulus are computed from the elastic constants (in Voigt notation)56,70,71 as K = 2/9(C11 + C12 + 2C13 + 1/2C33) and G = 1/30(12C44 + 7C11 − 5C12 + 2C33 − 4C13). Poisson's ratio is computed as σ = (3K − 2G)/(2G + 6K); a code transcription of these formulas is given below. For the hcp lattice constants, we perform a 100 ps NPT equilibration followed by a 100 ps NPT production run, where a state is saved every 0.25 ps. For the phonon frequency analysis, we generate a 5 × 5 × 3 hcp supercell in Avogadro72 and employ Phonopy73,74 to compute the phonon densities via finite displacements of 0.01 Å. To compute the RDF and the ADF, we perform a 100 ps NPT equilibration at 2400 K, a 100 ps NPT equilibration at 1965 K, and an 80 ps NVT production run at 1965 K, which we sample every 0.1 ps. For these simulations, we double the box size in each dimension, yielding a total of 2048 atoms. The ADF is computed for all triplets within 0.4 nm, corresponding to the first minimum of the experimental RDF. For the VACF, we perform a 100 ps NPT equilibration at 2400 K, a 100 ps NPT equilibration at 2000 K, a 100 ps NVT equilibration at 2000 K, and an 80 ps NVT production run, from which we sample every 0.01 ps. The VACF is computed by averaging over 160 different starting points that are 0.5 ps apart. We use the Green-Kubo relation to compute the self-diffusion coefficient75,76. The errors are estimated with block-averaging using 10 blocks. The bcc elastic constants were obtained by creating a bcc titanium structure with 128 atoms as input for a 100 ps NPT equilibration followed by a 100 ps NVT equilibration at 1273 K, and a 1 ns NVT production run at 1273 K. We confirmed the adequacy of the equilibration protocols by repeating the analysis for the RDF, ADF, VACF, and high-temperature hcp lattice constants with doubled NPT equilibration lengths (i.e., 200 ps) for the DFT & EXP fused model (Supplementary Fig. 8).
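The mechanical property formulas above translate directly into code. The example elastic constant values at the end are approximate room-temperature literature values for hcp titanium, included for illustration only:

```python
def bulk_modulus(C11, C12, C13, C33):
    # Voigt-average bulk modulus for a hexagonal crystal, GPa in / GPa out.
    return 2.0 / 9.0 * (C11 + C12 + 2.0 * C13 + 0.5 * C33)

def shear_modulus(C11, C12, C13, C33, C44):
    # Voigt-average shear modulus for a hexagonal crystal.
    return (12.0 * C44 + 7.0 * C11 - 5.0 * C12 + 2.0 * C33 - 4.0 * C13) / 30.0

def poisson_ratio(K, G):
    # Isotropic Poisson's ratio from the bulk and shear moduli.
    return (3.0 * K - 2.0 * G) / (2.0 * G + 6.0 * K)

# Approximate room-temperature elastic constants of hcp Ti (illustrative).
K = bulk_modulus(C11=162.4, C12=92.0, C13=69.0, C33=180.7)
G = shear_modulus(C11=162.4, C12=92.0, C13=69.0, C33=180.7, C44=46.7)
nu = poisson_ratio(K, G)
```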