Parsimonious neural networks learn interpretable physical laws

Machine learning is playing an increasing role in the physical sciences and significant progress has been made towards embedding domain knowledge into models. Less explored is its use to discover interpretable physical laws from data. We propose parsimonious neural networks (PNNs) that combine neural networks with evolutionary optimization to find models that balance accuracy with parsimony. The power and versatility of the approach is demonstrated by developing models for classical mechanics and to predict the melting temperature of materials from fundamental properties. In the first example, the resulting PNNs are easily interpretable as Newton’s second law, expressed as a non-trivial time integrator that exhibits time-reversibility and conserves energy, where the parsimony is critical to extract underlying symmetries from the data. In the second case, the PNNs not only find the celebrated Lindemann melting law, but also new relationships that outperform it in the pareto sense of parsimony vs. accuracy.


Parsimonious neural networks learn interpretable physical laws Saaketh Desai & Alejandro Strachan *
Machine learning is playing an increasing role in the physical sciences and significant progress has been made towards embedding domain knowledge into models. Less explored is its use to discover interpretable physical laws from data. We propose parsimonious neural networks (PNNs) that combine neural networks with evolutionary optimization to find models that balance accuracy with parsimony. The power and versatility of the approach is demonstrated by developing models for classical mechanics and to predict the melting temperature of materials from fundamental properties. In the first example, the resulting PNNs are easily interpretable as Newton's second law, expressed as a nontrivial time integrator that exhibits time-reversibility and conserves energy, where the parsimony is critical to extract underlying symmetries from the data. In the second case, the PNNs not only find the celebrated Lindemann melting law, but also new relationships that outperform it in the pareto sense of parsimony vs. accuracy.
Machine learning (ML) can provide predictive models in applications where data is plentiful and the underlying governing laws are unknown [1][2][3] . These approaches are also playing an increasing role in the physical sciences where data is generally limited but underlying laws (sometimes approximate) exist [4][5][6][7][8][9] . For example, MLbased constitutive models are being used in electronic structure calculations 10 and molecular dynamics (MD) simulations [11][12][13] . One of the major drawbacks of the use of ML in the physical sciences is that models often do not learn the underlying physics of the system at hand, such as constraints or symmetries, limiting their ability to generalize. In addition, most ML models lack interpretability. That is, ML approaches generally neither learn physics nor can they explain their predictions. In many fields, these limitations are compensated by copious amounts of data, but this is often not possible in areas such as materials science where acquiring data is expensive and time consuming. To tackle this challenge, progress has been made towards using knowledge (even partial) of underlying physics to improve the accuracy of models and/or reduce the amount of data required during training [14][15][16] . Less explored is the use of ML for scientific discovery, i.e. extracting physical laws from observational data, see Refs. [17][18][19] for recent progress. In this letter, we combine neural networks (NNs) with stochastic optimization to find models that balance accuracy and parsimony and apply them to learn, solely from observational data, the dynamics of a particle under a highly non-linear potential, and expressions to predict the melting temperature of materials in terms of fundamental properties. Our hypothesis is that the requirement of parsimony will result in the discovery of the physical laws underlying the problem and result in interpretability and improved generalizability. We find that the resulting descriptions are indeed interpretable and provide insight into the system of interest. In the case of particle dynamics, the learned models satisfy non-trivial underlying symmetries embedded in the data which increases the applicability the parsimonious neural networks (PNNs) over generic NN models. Stochastic optimization has been previously used in conjunction with backpropagation to improve robustness or minimize overfitting in models [20][21][22][23][24][25][26] , this work extends these ideas to finding parsimonious models from data to learn physics.
The power of physics-based ML is well documented and remains an active area of research. Neural networks have been used to both parametrize and solve differential equations such as Navier Stokes 14,15 and Hamilton's equations of motion 27 . Recurrent architectures have also shown promise in predicting the time evolution of systems 28,29 . These examples focus on using prior knowledge of the underlying physics to guide the model, often as numerical constraints, or by using the underlying physics to numerically solve equations with variables predicted by the ML algorithms. In contrast, we are interested in learning physics, including the associated numerical solutions, directly from data, without prior knowledge. Pioneering work along these lines used symbolic regression methods, enhanced by matching partial derivatives to identify invariants 17 , or using dimensionality reduction and other symmetry-identifying methods to aid equation discovery 30 . These approaches also consider the tradeoff between parsimony and accuracy to develop simple models that describe the data well. On the other hand, neural networks such as time-lagged autoencoders have also proven useful at extracting laws that govern the time evolution of systems from data 31 , where the encoder networks attempt to learn features relevant to the problem. Advances here have considered networks with custom activation functions whose weights decide the functional form of the equation 32,33 . Lastly, other approaches to learning physics from data have focused on discovering partial differential equations directly from data, either using a library of candidate derivatives coupled with linear regression 18,19 , or using neural networks coupled with genetic algorithms to identify differential equations from an incomplete library 34 . We build on and extend these ideas to propose PNNs, models designed to balance parsimony with accuracy in describing the training data. The PNN approach allows complex compositions of functions via the use of neural networks, while balancing for parsimony using genetic algorithms. As will be shown with two examples, our approach is quite versatile and applicable to situations where an underlying differential equation may not exist. We first apply PNNs to learn the equations of motion that govern the Hamiltonian dynamics of a particle under a highly non-linear external potential with and without friction. Our hypothesis is that by requiring parsimony (e.g. minimizing adjustable parameters and favoring linear relationships between variables) the resulting model will not only be easily interpretable but also will be forced to tease out the symmetries of the problem. We find that the resulting PNN not only lends itself to interpretation (as Newton's laws) but also provides a significantly more accurate description of the dynamics of the particle when applied iteratively as compared to a flexible feed forward neural network. The resulting PNNs conserve energy and are time reversible, i.e., they learn non-trivial symmetries embedded in the data but not explicitly provided. This versatility and the generalizability of PNNs is demonstrated with a second, radically different, example: discovering models to predict the melting temperature of materials from atomic and crystal properties. By varying the relative importance of parsimony and accuracy in the genetic optimization, we discover a family of melting laws that include the celebrated Lindemann law 35 . Quite remarkably the Lindemann law, proposed in 1910, is near (but not on) the resulting pareto front.

Results
Discovering integration schemes from data. As a first example, we consider the dynamics of a particle under an external Lennard-Jones (LJ) potential with and without friction. In both cases the training data is obtained from accurate numerical trajectories with various totals energies. The input and output data are positions and velocities at a given time and one timestep ahead, respectively (this timestep is ten times what was used to generate the underlying trajectories). The numerical data was divided into training and validation sets and an independent testing set was generated at a different energy, see Methods and section S1 of the Supplementary Material (SM). The input data in this example has been generated numerically for convenience but could have been obtained experimentally, as will be shown in the second example. Before describing the PNN model, we establish a baseline by training a standard feed forward neural network (FFNN) on our data for the case without friction. The details of this architecture can be found in section S2 of the SM and can be accessed for online interactive computing on nanoHUB 36 . We find the FFNN to be capable of matching the training/validation/test data well, with root mean squared errors (RMSEs) across all sets on the order of 10 -5 in LJ units for both positions and velocities (see Figure S1 in SM). However, the network has poor predictive power. Using it iteratively to find the temporal evolution of the particle results in significant drifts in total energy over time, and a lack of time reversibility. Reversibility is judged by applying the network sequentially 1,000 times, followed by time reversal (changing the sign of the particle's velocity) and applying the NN for a second set of 1,000 steps. We find that deeper architectures do not improve the RMSE, reversibility or energy conservation. Needless to say, these FFNNs are not interpretable. These results highlight the difficulty of the problem at hand. Hamilton's equations for classical mechanics represent a stiff set of differential equations and small errors in each timestep accumulate rapidly resulting in diverging trajectories. Prior attempts to address this challenge in the context of discovering differential equations explicitly trained models for multiple steps using a recurrent architecture 37 . The resulting models are interpretable and improve accuracy over the number of steps used in training the recurrent network but accumulate relatively high errors over multiple steps. In contrast, we are interested in solutions stable over timescales far greater than those typically accessed by current recurrent architectures, while also favoring the discovery of constants relevant to the physical problem. Finding such models is non-trivial and the development of algorithms to integrate equations of motion with good energy conservation and time reversibility has a rich history [38][39][40][41][42] . An example of such algorithms is the popular Verlet family of integrators 38,39 that are both reversible and symplectic 43 ; their theoretical justification lies in Trotter's theorem 44 .

Parsimonious neural networks.
Having established the shortcomings of the state-of-the-art neural networks, we introduce parsimonious neural networks (PNN). We begin with a generic neural network shown in Fig. 1 and use genetic algorithms to find models with controllable parsimony. In this first example, the neural network consists of three hidden layers and an output layer with two values, the position and velocity of the particle one timestep ahead of the inputs. Each hidden layer has two neurons, and the central layer includes an additional force sub-net, a network pre-trained to predict the force on the atom given its position. The use of a pre-trained force sub-net is motivated by the prior success of neural networks in predicting interatomic forces in a wide variety of materials significantly more complex than our example [45][46][47] . The architecture of the force sub-net was designed to be the simplest network that can predict the force with sufficient accuracy to result in accurate dynamics. Our architecture is similar to those used to predict interatomic forces, where atomic neighborhoods are encoded in descriptors, which are then mapped to the atomic forces via a one or two-layer shallow neural network 12 . In our one-dimensional case, the input is simply the atom coordinate. In addition, our focus is on learning classical dynamics and the use of a force sub-net only incorporates the physical insight that the force is an important quantity. As a second baseline, we trained a feed forward NN including a pre-trained force subnet. This second network's performance is as poor as the previous baseline feed-forward network, see section S3 www.nature.com/scientificreports/ in the SM for details. This shows that adding the information about the force is not the key to the development of accurate models for classical mechanics, parsimony is. The starting neural network provides a highly flexible mapping from input positions and velocities to output positions and velocities and PNNs seeks to balance simplicity and accuracy in reproducing the training data. This is an optimization problem in the space of functions spanned by the possible activations and weights of the network. We consider four possible activation functions in this example: linear, rectified linear unit (relu), hyperbolic tangent (tanh), and exponential linear unit (elu). The weights connecting the artificial neurons can be either fixed or trainable, with the fixed category allowing the following values: 0, ½, 1, 2, t 2 , t , and 2 t , with t the timestep separating the inputs and outputs. This is motivated by the fact that physical laws often involve integer or simple fractional coefficients and that the timestep represents important information. Our network has twenty weights (each with eight possible settings) and six activation functions to optimize, see Fig. 1

(top panel).
A brute force approach to finding a PNN model would require training ~ 10 21 neural networks, an impossible computational task even for the relatively small networks here. We thus resort to evolutionary optimization, using a genetic algorithm to discover models that balance accuracy and parsimony.
PNNs use an objective function that includes measures of the accuracy in the test set and parsimony. The latter term favors: i) linear activation functions over non-linear ones, and ii) non-trainable weights with simple values over optimizable weights. The objective function for the genetic optimization is defined as: where E test represents the mean squared error of the trained PNN on the testing set and f 1 is a logarithmic function that converts the wide range errors into a scale comparable to the parsimony terms, see section S4 of the SM. The second term runs over the N N neurons in the network and is designed to favor simple activation functions. The linear, relu, tanh and elu activation functions are assigned scores of w i = 0, 1, 2 and 3, respectively. The third term runs over the network weights and biases ( N w ) and favors fixed, simple weights over trainable ones. A fixed weight value of 0 is assigned a score of 0, while other fixed weights are assigned the score 1, and a trainable weight is assigned a score of 2. The parsimony parameter p determines the relative importance of parsimony and accuracy. As will be shown with two examples, PNNs of interest will correspond to parameters p where both accuracy and parsimony affect model selection. We use the DEAP package for the evolutionary optimization 48 and Keras 49 to train individual networks, see Methods. We note that our approach is similar in spirit to recent work combining genetic algorithms with neural networks to discover partial differential equations 34 , but PNNs are more versatile in terms of parsimony, composition of functions, and are applicable to situations where an underlying differential equation may not exist, as we will see in the second example discussed in this paper. We also note that evolutionary optimization is not the only way to achieve parsimony. For example, one could include hidden layers containing a library of possible activation functions and use sparsity to prune unnecessary activations. This has recently been used to discover simple kinematics equations 33 . An advantage of this approach over ours is simplicity and computational expedience since such networks can be trained using backpropagation The PNNs resulting from a genetic optimization with p = 1 reproduce the training, validation and testing data more accurately than the architecturally complex FFNNs. Figure 2(a) compares the RMSE for positions and velocities from the optimal PNN (denoted PNN1) to the FFNN. Remarkably, PNN1 also results in excellent long-term energy conservation and time reversibility, evaluated using the same procedure as before. Figures 2(b) and 2(c) compare the total energy and trajectories generated by PNN1, the FFNN, and the state-of-the-art velocity Verlet integrator. We see that PNN1 learns both time-reversibility and that total energy is a constant of motion. This is in stark contrast to the physics-agnostic FFNN and even naïve physics-based models like a first order Euler integration. A few of the top ranked PNNs perform similarly to PNN1 and they will be discussed in section S7 of the SM.
Having established that the PNNs learn the physics of the system and result in stable and accurate integrators, we now explore their interpretability in the hope of finding out how time-reversibility and energy conservation are achieved. In short: can the PNNs teach us what they learned? We find that the PNNs discover simple models, with many weights taking fixed values (including zero) and all activations functions taking the simplest possible alternative (linear functions). As an example, the parameters corresponding to PNN1 are shown in Fig. 2(d), and other PNNs with comparable (but higher) objective functions are shown in section S6 of the SM. This simplicity allows us to trivially obtain position and velocity update equations. The equations of motion learned by PNN1, rewritten in terms of relevant quantities such as timestep and mass, are: Inspecting Fig. 2(d) and Eqs. (2,3) we find that PNN1 achieves time-reversibility by evaluating the force at the midpoint between inputs and outputs, this central force evaluation is key to many advanced numerical methods. In fact, PNN1 represents the position Verlet algorithm 39 except that the NN training makes an error in the mass of approximately 3 in 10,000. This algorithm is both reversible and symplectic, i.e. it conserves volume in phase space. The small error in mass seems to originate from the small inaccuracies of the force sub-net to describe the Lennard -Jones potential, see section S6 of the SM.
Inspecting Fig. 2(d) and Eqs. (2-3) we find that PNN1 achieves time-reversibility by evaluating the force at the midpoint between inputs and outputs, this central force evaluation is key to many advanced numerical methods. In fact, PNN1 represents the position Verlet algorithm 39 except that the NN training makes a small error in the mass, of approximately 3 in 10,000. This algorithm is both reversible and symplectic, i.e. it conserves volume in phase space. The small error in mass actually seems to originate from the small inaccuracies of the force sub-net to describe the Lennard -Jones potential, see section S6 of the SM.
The genetic optimization provides an ensemble of models and inspecting slightly sub-optimal ones provides interesting insights. The SM provides the equations of motion predicted by PNN2 and 3. These are similarly (2) x(t + �t) = x(t) + 1.0001 v(t)�t + 0.9997 www.nature.com/scientificreports/ interpretable and, quite remarkably, they also learn to evaluate the force at the half-step. They represent a slightly inaccurate version of the position Verlet algorithm with minor energy drifts due to a slight asymmetry in effective mass in the position and velocity update equations. Finally, changing the parsimony parameter p in the objective function, allows us to generate a family of models with different tradeoffs between accuracy and parsimony; see Figure S8 in the SM. Interestingly, we find models that reproduce the training and testing data more accurately than PNN 1 and Verlet. However, these models are not time reversible and their energy conservation is worse than PNN1, see section S7 in the SM, stressing the importance of parsimony.
Along similar lines, we tested the ability of the PNNs to discover the physics governing a damped dynamical system, see Methods. The equations learned by the top PNN, with γ the damping constant, are: In this second example, PNNs learn classical mechanics, the fact that the frictional force is proportional to negative the velocity, and discover the same stable integrators based on the position Verlet method, all from the observational data.
We consider the emergence of Verlet style integrators from data remarkable. This family of integrators is the preferred choice for molecular dynamics simulations due to its stability. Unlike other algorithms such as the Runge-Kutta family or the first order Euler method, Verlet integrators are symplectic and time reversible 50 . This class of integrators has been long known, and proposed independently by several researchers over decades (see Ref. 50 for a review), but a detailed understanding of their properties and their justification from Trotter's theorem are relatively modern 39 . Importantly, we find more complex models that reproduce the data more accurately than PNN1 but do not exhibit time reversibility nor conserve energy. This shows that parsimony is critical to learn models that can provide insight into the physical system at hand and for generalizability. We stress that the equations of motion and an advanced integrator were obtained from observational data of the motion of a particle and the force-displacement relationship alone. We believe that, at the expense of computational cost, the force sub-net could be learned together with the integrators (effectively learning the acceleration) from large-enough dynamical datasets. This assertion is based on the observation that on fixing some of the network parameters that result in a Verlet integrator, the remaining parameters and the force subnet can be learned from the observational data used above, see section S7 of the SM.
Melting temperature laws. To demonstrate the versatility and generalizability of PNNs, we now apply them to discover melting laws from experimental data. Our goal is to predict the melting temperature of materials from fundamental atomic and crystal properties. To do this, we collected experimental melting temperatures for 218 materials (including oxides, metals, and other single elements crystals) as well as fundamental physical quantities including: bulk modulus K , shear modulus G , density ρ , a characteristic atomic distance a (the cube root of the volume per atom), and mean atomic mass m.
Before feeding this data to PNNs, we perform a standard dimensionality analysis to use dimensionless inputs and output. For convenience we first define an effective sound speed, v m , from density and elasticity moduli, see section S8 of the SM. From these fundamental quantities, we define four independent quantities with the dimensions of temperature: where is Planck's constant and k b is the Boltzmann's constant. All variables have physical meanings, for example, θ 0 is proportional to the Debye temperature. The inputs to the PNNs are the last three quantities normalized by θ 0 and the output melting temperature is also normalized by θ 0 . Additional details on the preprocessing steps as well as network architecture, including custom activations, can be found in the section S8 of the SM.
Armed with dimensionless inputs and outputs, we use PNNs to discover melting laws. Varying the parsimony parameter in the objective function, Eq. 1, results in a family of melting laws. These models are presented in Fig. 3 in terms of their accuracy with respect to the testing set and their complexity. The latter is defined as the sum of the second and third terms of the objective function Eq. 1, i.e., the sum of the activation function and weight (4) x(t + �t) = x(t) + 1.0008 v(t) * �t − γ �t 2 2m + 0.9991 This makes physical sense as the Debye temperature is related to the characteristic atomic vibrational frequencies and stiffer and stronger bonds tend to lead to higher Debye and melting temperatures. Next in complexity, PNN B adds a correction proportional to the shear modulus: This is also physically sensical as shear stiffness is closely related to melting. This fact is captured by the classic Born instability criteria 51 that associates melting to loss of shear stiffness. Just above PNN B in complexity, PNN finds the celebrated Lindemann melting law 35 : Written in its classical form in the third term; here T D is the debye temperature of the material and f and C are an empirical constant. Remarkably, this law, derived using physical intuition in 1910, is very close to, but not on, the optimal pareto front in accuracy-complexity space. For completeness, we describe the model with the lowest RMS error, PNN C predicts the melting temperature as: Quite interestingly, this model combines the Lindemann expression with Debye temperature and bulk (not shear) modulus. This combination is not surprising given the expressions above, but the selection of bulk over shear modulus is not clear at this point and should be explored further.
In summary, we proposed parsimonious neural networks that are capable of learning interpretable physics models from data; importantly, they can extract underlying symmetries in the problem at hand and provide physical insight. This is achieved by balancing accuracy with parsimony, an adjustable parameter is used to control the relative importance of these two terms and generate a family of models that are pareto optimal. We quantify parsimony by ranking individual activation functions and favoring fixed weights over adjustable ones. The combination of genetic optimization with neural networks enables PNNs to explore a large function space and obviate the need for estimating numerical derivatives or matching a library of candidate functions, as was done in prior efforts [17][18][19] . Additionally, PNNs perform complex composition of functions in contrast to sparse regression, which combine functions linearly. The libraries of activation functions in our first examples of PNNs are relatively small and based on physics intuition, the application of PNNs in areas where less is known about the problem at hand requires more extensive sets of activation functions, at increased computational cost. As opposed to most efforts attempting to discover differential equations, such as DLGA-PDE, PNN can discover laws in situations where an underlying differential equation may not exist. The state-of-art solutions PNNs provide two quite different problems attest to the power and versality of the approach. From data describing the classical evolution of a particle in an external potential, the PNN produces integration schemes that are accurate, conserve energy and satisfy time reversibility. Furthermore, they can be easily interpretable as discrete versions of Newton's equations of motion. Quite interestingly, the PNNs learn the nontrivial need to evaluate the force at the half step for time reversibility. The optimization could have learned the first order Runge-Kutta algorithm, which is not reversible, but it favored central-difference based integrators. Furthermore, parsimony favors Verlet-type integrators over more complex expressions that describe the training data more accurately but do not exhibit good stability. We note that other high-order integrators are not compatible with our initial network, but these can easily be incorporated by starting with a more complex network. As discussed above, the resulting algorithms would not come as a surprise to experts in molecular dynamics simulations as this community has developed, over decades, accurate algorithms to integrate Newton's equations of motion. The fact that such knowledge and algorithms can be extracted automatically from observational data has, however, deep implications in other problems and fields. This is confirmed with a second example that shows the ability of PNNs to extract interpretable melting laws from experimental data. We discover a family of expressions with varying tradeoffs between accuracy vs. parsimony and our results show that the widely used Lindeman law, proposed over a century ago, is remarkably close to the pareto front but we find PNNs that outperform it. The PNN models highlight the relationships between melting and a materials Debye's temperature as well as shear moduli, providing insight into the processes that determine melting.

Methods
Training data. To discover integration schemes, training data was generated using molecular dynamics simulations under an NVE ensemble, using the velocity Verlet scheme with a short timestep (about a tenth of what is required for accurate integration), see section S1 of the SM for additional details. Training and validation data are obtained from trajectories with four different total energies, 20% of which is used as a validation set. Our test set is a separate trajectory with a different total energy. For the damped dynamics cases, a frictional force proportional to negative the velocity is added, with frictional coefficient γ = 0.004 eVps/Å 2 . To discover novel melting laws, we queried the Pymatgen and Wolfram alpha databases for experimental melting temperatures. We obtained fundamental material properties such as volume and shear modulus by querying the Materials Project. Additional details are provided in section S8 of the SM. Evolutionary optimization. We used populations of 200 and 500 individuals and a two-point crossover scheme and random mutations to evolve the population (weights and activation functions) 54 . For each generation, individual networks in the population are trained using backpropagation using the same protocols as for the feed-forward networks; only adjustable weights are optimized in this operation. The populations were evolved over 50 generations, additional details of the genetic algorithm are included in section S5 of the SM.