Data-driven discovery of the governing equations of dynamical systems via moving horizon optimization

Discovering the governing laws underpinning physical and chemical phenomena entirely from data is a key step towards understanding and ultimately controlling systems in science and engineering. Noisy measurements and complex, highly nonlinear underlying dynamics hinder the identification of such governing laws. In this work, we introduce a machine learning framework rooted in moving horizon nonlinear optimization for identifying governing equations in the form of ordinary differential equations from noisy experimental data sets. Our approach evaluates sequential subsets of measurement data, and exploits statistical arguments to learn truly parsimonious governing equations from a large dictionary of basis functions. The proposed framework reduces gradient approximation errors by implicitly embedding an advanced numerical discretization scheme, which improves robustness to noise as well as to model stiffness. Canonical nonlinear dynamical system examples are used to demonstrate that our approach can accurately recover parsimonious governing laws under increasing levels of measurement noise, and outperform state of the art frameworks in the literature. Further, we consider a non-isothermal chemical reactor example to demonstrate that the proposed framework can cope with basis functions that have nonlinear (unknown) parameterizations.


Introduction
Differential and partial differential equation models play a critical role in describing the governing behavior of a variety of systems arising in science and engineering [1].As minimal order expressions describing the system behavior, governing models are generalizable and readily interpretable, and have good extrapolation capabilities.Historically, the discovery and formulation of fundamental governing equations is a relatively lengthy process, supported by careful experimentation and data collection using prototype systems.
The data sets that have, through the history of science, supported the discovery of fundamental natural laws may seem "small" by today's standards.With decreasing costs of sensors, data storage systems, and computing hardware, immense quantities of data can be easily collected and efficiently stored.As a consequence, the applications of machine learning (ML) and artificial intelligence (AI) have witnessed meteoric growth in science and engineering.ML techniques perform well in regression and classification tasks, but the resulting models are "black" or "grey-box" in nature, offering little physical insight [2], and extrapolate poorly to regimes beyond the scope of the training data.Moreover, prediction accuracy typically comes at the cost of model complexity, which is at odds with the parsimonious nature of a system's governing dynamics derived via first principles analysis.
Leveraging ML/AI frameworks to discover (as opposed to merely fit) the governing equations of physical systems from large amounts of data offers intriguing possibilities and remains an open field of research.Recent efforts in this direction include physics-informed discovery strategies [3], combining first principles arguments with ML models such as Gaussian processes [4] and deep neural networks [5,6].While the predictive capabilities of such physics-informed strategies are good, even when trained on coarse and noisy measurements, their reliance on significant (if not complete) structural knowledge of the equations governing the system dynamics constitutes a significant disadvantage when it comes to discovering they governing equations of new and unknown systems.Further, as is inherent to most (deep) machine learning architectures, such models can suffer from lack of interpretability, hence failing to provide insights on the selection process of the functional terms that dictate a system's dynamic behavior.
A different approach that is generally deemed more transparent towards automating the data-driven discovery of governing equations is based on nonlinear regression strategies [7].
Initial efforts exploited symbolic regression [8,9,10] and genetic programming algorithms [11].However, the combinatorial nature of these approaches can render them computationally prohibitive, restricting their applicability to low-dimensional systems and to considering relatively small initial sets of candidate symbolic expressions (which inherently diminishes the probability of identifying the true underlying system dynamics).Furthermore, symbolic regression strategies are prone to overfitting, i.e., generating overly complex expressions in an attempt to decrease prediction error [10].
More recently, sparse regression techniques [12,13] have been proposed.Brunton et al. [14] employed a modified ordinary least-squares (OLS) and LASSO regression (i.e., 1 penalized regression) to discover parsimonious representations of the dynamics of nonlinear systems from high-dimensional data sets by selecting the elements of the governing equations form a a priori specified large set of candidate basis functions.Numerous extensions to [14] have since been proposed addressing a variety of classes of systems and problem settings (e.g., [15,16,17,18]).A similar approach based on elastic net regression (i.e., a combination of both 1 and 2 penalized regression) was introduced in [19], but resulted in less parsimonious equations relative to e.g.LASSO regression.In a related effort [20], the selection of basis functions was performed via mixed-integer optimization, with a view towards identifying low-order surrogate representations of nonlinear algebraic models.
In spite of these advances, several fundamental challenges remain.From a numerical perspective, existing approaches rely on either directly measuring the system state derivatives (which are likely not observable in practical settings) or approximating them accurately (which can be difficult particularly in high noise environments and when the dynamics evolve over multiple time scales).Additionally, when considering large libraries of nonlinear basis functions, feature collinearity may result in ill-conditioned regression problems.As the size of the available data set increases, the ensuing numerical instability renders these approaches unrealiable for discovering fundamental equations that are optimal (in the Pareto sense) with respect to both model parsimony and predictive power.Motivated by the above, in this work, we propose DySMHO, a radically different perspective to discovering governing equations from data.The present method rooted in control theory, namely, moving horizon estimation and control, and offers (i) excellent scalability with respect to the system dimensions and the size of the data set, (ii) rigorous statistical arguments for selecting the model structure from a large dictionary of basis functions, (iii) robustness to noisy training data and (iv) the ability to incorporate first principles knowledge (when available) in the form of additional constraints in the problem formulation.

Representation of system dynamics
We consider dynamical systems governed by ordinary differential equations of the form: where x(t) ∈ R nx is the vector of states at time t, and the map f( To discover the underlying governing equations, we consider a dictionary of n θ candidate symbolic nonlinear basis functions denoted as Θ(x T ), where Θ(•) : R 1×nx → R 1×n θ .The dictionary is defined a priori, potentially leveraging some domain insights (e.g.[19]).Importantly, we assume that the governing equations can be expressed as an (as of now unknown) linear combination of the basis functions in this dictionary or, equivalently, that the true model is contained within the dictionary.That is, we can express the model in (1) as: where Ξ ∈ R n θ ×nx is a matrix of coefficients whose columns are given by the sparse (i.e., most entries are zero) vectors ξ ξ ξ 1 , . . ., ξ ξ ξ nx .This sparse model structure is illustrated in Fig. 1 (A) using the well-known two-dimensional Lotka-Volterra predator-prey model as an example.
Further, we distinguish between basic coefficients (i.e., ξ i,j = 0 in the true dynamics, and the corresponding basis functions are active), and non-basic coefficients (i.e., ξ i,j = 0 in the true dynamics).Basic coefficients within Ξ T for the predator-prey model are highlighted in color in Fig. 1 (A).Following the sparsity argument, the fundamental challenge of discovering the governing equations translates to identifying the (few) basic coefficients associated with the active nonlinearities that compose f(x(t)).Fig. 1 (B) shows first an example dictionary of basis functions evaluated over time for the Lotka-Volterra model, where the active basis functions relevant to describing the dynamics of each of the two system states are shown in color.A linear combination of the selected basis functions is then used to construct the dynamics for each state.Prior works (e.g., [14,19]) proposed sparse regression techniques, whereby for each state variable i ∈ {1, . . ., n} the sparse vector ξ ξ ξ i is estimated as the solution of the following optimization problem: The goal is to minimize, in a norm sense, the difference between the value of the state derivatives predicted by the model, Θ(x(t j ) T )ξ ξ ξ i , and the corresponding derivative values ẋ(t j ) that are either measured directly or approximated (via e.g.finite difference equations) from the m data samples at each sample time t j .The second part of the expression above is a regularization term that places a penalty on the 1 and/or 2 -norms of the magnitudes of the coefficients ξ ξ ξ i , thereby ensuring that the solution is sparse (i.e., that as many of the elements of this vector are zero).λ ≥ 0 controls the extent of this spasification, while 0 ≤ ρ < 1 allows for balancing between

Nonlinear optimization-based discovery formulation
Differently, to obtain the sparse coefficient matrix Ξ in (2), we formulate a constrained multiperiod dynamic nonlinear program (DNLP) minimizing a given error metric.Discretization strategies are used to convert the candidate model from the differential form in (2) to a system of nonlinear algebraic equations, which is embedded in the constraint system of the DNLP [21] (See Materials and Methods for further details).Pre-processing techniques and/or domain expertise can be used to introduce additional constraints to the DNLP in order to improve convergence to the governing equations (See Materials and Methods for further details).The DNLP is thus formulated in discrete time to minimize the mean 2 -error as follows: where x(k) are the states predicted by the candidate governing law at time index k, g(•) : R nx × R n θ ×nx → R nx represents the discretized version of the nonlinear dynamics given in (2), Ξ L and Ξ U are respectively the lower and upper bounds of the estimated coefficients, and M is the number of discrete time points of the transformed dynamics.Note that in (4) the basis functions in the dictionary are not directly evaluated on the measurement data x as is the case in (3).Rather, expression (4) represents a symbolic nonlinear function of the (predicted) states, which are decision variables in the DNLP.Further, (4) does not (directly) require an approximation of x to derive the coefficients Ξ (as is done in the sparse regression problem (3)).
The derivative approximation ẋ is used indirectly in pre-processing (as discussed in Materials and Methods) as a potential first step for pruning the basis function dictionary, as well as for estimating Ξ L and Ξ U .
An additional advantage of the proposed formulation is that the dynamics of all states are simultaneously recovered, as opposed to prior works [14,19] in which regression problems are solved separately for each measured state.It is thus expected that the solution of (4) results in governing equation models that better explain the overall system behavior, avoiding possible additional variance that might be introduced when the dynamics are independently identified for each observed state variable.Note that interpolation is typically required to approximate the state measurements x(t k ) ∀k = 1, . . ., m at each of the points in the new time grid defined by the discretization method employed to construct g(•).The objective function in (4) includes a regularization term, where the parameter λ penalizes the norm of the coefficient vector Ξ and imposes sparsity in the same sense as described earlier.
The dimension of the DNLP (4) increases with the dimension M of the data set available and with the dimension n θ of the dictionary of basis functions.Intuitively, M may be very large (i.e., many data points are available from experiments), while a large n θ (i.e., a large dictionary of basis functions) is highly desirable in order to increase the probability of discovering the true governing equations (1).Thus, solving (4) while taking into consideration the entire available data set, as is generally done in existing discovery frameworks based on sparse regression [14,19], is likely computationally expensive.Noting that such problems are NPhard, the actual solution time cannot be anticipated from the above problem dimensions.An additional fundamental challenge is related to imposing parsimony in the learned model.This entails eliminating the basis functions that are not part of the true model ( 1) by setting the corresponding coefficients Ξ to zero.A particular difficulty arises when the true value of a coefficient is "small:" while the corresponding estimate may also be small, it is difficult to discern whether this outcome is correct or the non-zero estimated value is the result of ovefitting (i.e., a spurious attempt to further decrease the value of the objective function in (4) by increasing model complexity).A thresholding approach consisting of eliminating terms whose estimated coefficients are below a specific value (determined via cross-validation) can in principle be employed, but its performance is expected to degrade with increasing model stiffness and to our knowledge there is no rigorous way of defining such thresholds other than through cross-validation.
The framework proposed here addresses both fundamental problems described above.To deal with problem dimensionality, we propose decomposing (4) into a sequence of lowerdimensional problems defined on shorter time horizons (i.e., using smaller subsets of the available data), for which optimal solutions to (4) can be attained with significantly lower computational effort.An illustration of the data subsets is shown in Fig. S.2 in the Supporting Materials.After a solution to ( 4) is computed, a new data subset is selected (intuitively -but not necessarily -by shifting the time window forward by a smaller step than the window size) which is then used to resolve (4) again.The repetition of this procedure allows for efficiently learning and refining a sequence of governing equation models each with different coefficient estimates (further details regarding the moving horizon optimization and thresholding strategies proposed can be found in Materials and Methods and in the Supporting Materials).In conjunction with this moving-horizon strategy, the following thresholding claim is made: Claim 1.The non-basic coefficients in Ξ typically contribute to overfitting.Small, non-zero values reflect the use of the corresponding functions to fit the noise in the training data.Hence, the mean of the value of the estimates of a non-basic coefficient derived from the sequence of problems described above is likely to have a relatively high standard deviation.
The converse argument can be made for basic coefficients: the variance of a sequence of estimates is expected to be relatively low.These claims then support the use of dispersion metrics in statistics (e.g., the coefficient of variation) for the parameters obtained in a sequence of estimates based on subsets of the data to infer whether a parameter belongs to the true dynamics or not (i.e., if it is basic or non-basic).Note the similarity of the proposed training mechanism with ensemble methods (e.g., [22]) where a pool of models is trained on different (typically randomly sampled) subsets of the data to reduce variance and thus improve the performance of the final model prediction.Our moving horizon and thresholding mechanisms are discussed in depth in the Materials and Methods sections, and detailed algorithm steps are provided in the Supporting Materials.A comprehensive illustration of the entire workflow of the DySMHO framework is shown in Fig. 2.
To evaluate the performance of DySMHO, we consider a series of canonical nonlinear dynamical systems: the Lotka-Volterra predator-prey model [23,24], the van der Pol oscillator [25], the Brusselator [26], and the chaotic Lorenz oscillator [27].In each case, the true model is used to generate data via simulation, and data are artificially contaminated with Gaussian noise of increasing standard deviation σ.The results from our numerical experiments are shown in Fig. 3 and Fig. 4, where we explore the convergence of DySMHO in relation to the discovered model's complexity (i.e., number of terms), accuracy (i.e., error in the coefficient estimates), and predictive capabilities (i.e., mean squared error with respect to the true system dynamics).
Complete details regarding the computational experiments performed as well as the numerical parameters and configurations for DySMHO are reported in the Supporting Materials Tables S.1-S.4.

Discussion
DySMHO accurately recovers the true governing equations for all systems, even in cases where data are noisy (for noise levels up to the standard deviations shown in Fig. 3 and Fig. 4).Discovery for higher noise settings than the ones considered can be accomplished by tuning pre-processing and moving horizon parameters on a system-by-system basis (e.g. the thresh- olding tolerance for the coefficient of variation of a sequence of model coefficient estimates).
Evidently, as shown in Fig. 4 (B) as the amount of measurement noise increases, so does the mean squared error (MSE) between the simulated state trajectories from the recovered equations and that of the simulation based on the true dynamics.The higher observed MSE values correspond to higher coefficient error (relative to the true coefficients) as shown in Fig. 3 (D).
While the dynamics are seen to converge to the correct functional form of the true governing dynamics, noisier data inherently introduce additional uncertainty in the values of the coefficient estimates.Under significant coefficient estimate variability, Monte Carlo sampling can be used to generate candidate models from which several plausible state trajectories can be evaluated to quantify uncertainty in the discovered dynamics.Further, the deleterious effect of noise can also be appreciated from Fig. 3 (B), where it is noticeable that high-noise measurements require a larger number of iterations for the proposed algorithm to converge.Nonetheless, it is important to note that this apparent slower convergence rate is caused mainly by a larger number of candidate basis functions (often with low magnitude coefficients) remaining in the library after the pre-processing step (Fig. 3 (B-C)).That is, the proposed pre-processing approach is less effective when the noise standard deviation increases.These results emphasize that using static regression approaches (such as OLS which is employed in pre-processing) does not necessarily lead to discovery of the true dynamics, even if the recovered models exhibit high goodness-of-fit.Despite the increasing size of the candidate basis function library after pre-processing, Fig. 3 (C) shows that the correct number of terms was identified via DySMHO for each dynamical system considered for increasing noise values.Furthermore, convergence to the governing equations was achieved in just a few number of thresholding steps (6-8 iterations) as seen in Fig. 3 (D).
Of particular interest are stiff systems (arising commonly in applications relating to learning chemical reaction kinetics and mechanisms), whose dynamics have slow and fast components resulting on the states evolving on different time scales.Learning stiff differential equations is challenging due to the high computational cost involved in solving said systems (due mainly to the smaller integration step size requirements), as well as due to ill-conditioning of the relevant gradient matrices [28].For the systems under consideration shown in Fig. 3 (A), the van der Pol oscillator and the Brusselator exhibit stiff dynamic behavior for the selected model parameters, which can be handled by employing advances discretization strategies to prevent the aforementioned numerical stability issues.DySMHO uses orthogonal collocation on finite elements [29], which is one of the highest order methods (i.e., having the lowest approximation error) and for which relatively large time step sizes are allowed even for stiff equations, also improving on the computational effort required for this class of systems.As can be seen from the results in Fig. 3 and Fig. 4, DySMHO shows good performance for discovering the system dynamics regardless of the underlying models stiffness and under increasing measurement noise for van der Pol and Brusselator dynamics.
Validation results for the discovered fundamental equations are shown in Fig. 4 (A-B).The first quantitative step for validating the identified equations is computing relevant performance metrics such as the ones commonly used for regression tasks (e.g.mean squared error, mean absolute error, coefficient of determination, etc. [30]).In the case of Fig. 4 (A), the true system dynamics were used to generate the reference trajectories to be used for validation.In practice, nonetheless, the original/smoothed training data or a withheld testing set can be used to assess the predictive power of the discovered governing equations.While the mean squared error between the trajectories from the discovered equations and that of the true dynamics increases with data measurement noise (Fig. 4 (A)), the MSE is generally low relative to the magnitude of the measured states for each dynamical system.Qualitatively, the discovered trajectories accurately reproduce the dynamic behavior of the true system as seen in the phase plane plots in Fig. 4 (B).The exception to these results is the Lorenz oscillator, which due to its inherent chaotic nature is very sensitive to small perturbations to the model coefficients and initial conditions [14].However, Fig. 4 (B) shows that DySMHO successfully captures the attractor dynamics even if the trajectories simulated from the discovered equations do not overlap perfectly with the measurement data.Note that for all instances of the Lorenz oscillator DySMHO revealed the correct terms in the dynamics and resulted in coefficients estimates with small average percent error of 2.23% and standard deviation of 0.14 for the highest noise setting considered.
Differently from less transparent and interpretable ML frameworks, DySMHO produces dynamic equations in which each functional term can be directly attributed to some underlying physical phenomena.In the case of the Lotka-Volterra equations, for example, the recovered terms in the dynamics can be given the following interpretations: • For the discovered dynamics of prey population ẋ1 = x 1 − 0.01x 1 x 2 : The first term (x 1 ) reflects the fact that members of the prey species reproduce at an exponential rate in the absence of predators.The second term (−0.01x 1 x 2 ) can be explained by the fact that the probability that members of the prey and predator species meet is proportional to the product of their populations, and so is the predation rate which drives a reduction in the number of prey individuals.
• For the discovered dynamics of predator population ẋ2 = −x 2 + 0.02x 1 x 2 : The first term (−x 2 ) corresponds to the fact that predators die or leave the ecosystem (emigrate) in the absence of prey, which leads to an exponential decay in their numbers.The second term (0.02x 1 x 2 ), similar to the predation rate, can be explained by the fact that the predator population growth depends on the availability of prey, but this growth rate need not be the same rate at which the predators consume prey.
This type of analysis is a critical step in validating the discovered models, thus any available domain expertise should be leveraged when available to ensure that DySMHO equations are explainable based on the observed behavior of the physical system.
Data-driven discovery of governing laws is a promising avenue for advancing our understanding of and elucidating new phenomena across a wide range of disciplines.This new fundamental knowledge can in turn be used to drive the development of new technologies to solve pressing human-centered challenges.In this paper, we introduced and validated DySMHO, a novel moving horizon, nonlinear dynamic optimization framework for learning governing equations from noise-contaminated state measurements over time.DySMHO leverages a discretized model of the system dynamics to estimate the basis function coefficients, as opposed to prior works that use sparse regression techniques to predict the approximate values of the state derivatives as a linear combination of the basis functions evaluated at the the measured data.
We demonstrated DySMHO's main advantages using a variety of dynamical systems including highly stiff nonlinear differential equations, and showed that DySMHO is highly robust to noise-contaminated data.

Materials and Methods
DySMHO pre-processing

Data smoothing
Data are assumed to be contaminated with noise (which we assume to be Gaussian and with zero mean but with unknown standard deviation).Data smoothing techniques are employed to reveal patterns otherwise hidden by noise, and to aid model training.To this end, the Savitzky-Golay filter (SVGF) [31] is employed, which is based on local least-squares regression polynomial approximations applied to the data on moving windows of a given size.Since data sets with greater amounts of noise require more smoothing (accomplished by e.g.longer filter windows and lower order polynomials), an iterative smoothing strategy is proposed to determine the appropriate window size for each of the measured state variables.The proposed smoothing scheme is detailed in SM Algorithm 1.The algorithm consists of increasing the Savitzky-Golay filter window size iteratively until there is no significant reduction in the standard deviation of the differenced initial measurement time series (in this sense, e.g. the first difference of a time series y denoted y has entries given by y (t) := y(t) − y(t − 1)).It is expected that data sets with noise of higher magnitude will take a greater number of iterations, and thus larger window

Statistical analysis
Statistical tests are applied to the smoothed data to infer which features (i.e., basis functions) are most important for predicting the systems dynamics.Similar to [14], we arrange the smoothed state measurements in the form of a data matrix X: where t 1 , t 2 , . . ., t m are the time intervals at which the measurements were collected.The dictionary of candidate basis functions is then evaluated at every point in the data matrix.This can result in a structure such as the one below: where XP 2 denotes second order polynomials which may include interaction terms, e.g.: Evidently, the form of this structure will depend on the choice of basis functions.The derivative of each state variable is approximated from the data by using central differences as follows: and similar to (6) a data derivative matrix Ẋ is formed.Other derivative approximation strategies (e.g., [32]) can be used when there is significant noise in the data.
The first statistical test implemented is the Granger causality test [33], which establishes that: if a signal y "Granger-causes" a signal z, then the past values of y should contain information that helps predict z above and beyond the information contained in the past values of z alone.The statsmodels [34] implementation of the Granger Causality test was used in Python.The test was designed to assess whether each of the basis functions evaluated on X provides meaningful information in predicting future values of X (i.e., predicting the evolution of future system states over time).In the typical formulation of the Granger causality test, the null hypothesis is Θ i ( Xj ) does not Granger-cause Xj for a given basis function i ∈ {1, . . ., n θ } and a given state j ∈ {1, . . ., n x }, which is evaluated by fitting autoregressive models of the form: where in this case we are only interested in a single lagged value of the evaluated basis functions (i.e.,Θ i ( Xj (t − 1))).The significance of using Θ i ( Xj ) to predict Xj is determined by examining the variance of the residuals e(t) and ê(t) by performing statistical tests based on F and chisquared distributions.If the average p-value across all tests is less than the given significance level, then basis function i is kept in the dictionary (i.e., the null hypothesis is rejected) for the state j.Otherwise, the basis function is removed from the dictionary for subsequent steps.
The Granger causality test is repeated for every state variable and each basis function in our library.It should be noted that stationarity of the time series might need to be enforced (e.g., by differencing the signals and checking for stationarity via the Dickey-Fuller test [35]).
The second, and likely the most important, step in the proposed pre-processing statistical analysis involves an ordinary least-squares (OLS) regression problem to obtain a preliminary estimate of the coefficients Ξ.This step not only provides a solution to initialize DNLP embedded in the moving horizon discovery process, but also can also be used to derive lower and upper confidence bounds on Ξ, that are important in improving the convergence of the corresponding nonlinear programming problem.The linear model to be determined by performing OLS is given by: where the coefficients Ξ OLS ∈ R n θ ×nx are to be estimated.Note that a separate regression problem is solved for each of the states 1, . . ., n x to compute each of the columns ξ ξ ξ OLS i of Ξ OLS .The OLS problems were implemented in the statsmodels package [34] in Python, which leverages linear algebra tools to efficiently solve the normal equations to estimate the coefficient vector for state i: where Ẋi represents the approximation of the derivative of state i estimated from the data.
As a preliminary approach to selecting the most informative basis functions, the results obtained by OLS can be used to intuit the most important predictor variables within Θ( X).In brief, classical approaches for performing this variable selection procedure involve computing the Fstatistic and examining the corresponding p-values [7], which are automatically computed by solving (10) using the statsmodels OLS implementation [34].In this sense, under the null hypothesis that a coefficient for a basis function is zero, predictors having p-values greater than a specified significance level are eliminated.For the coefficients with p-values small enough that the null hypothesis cannot be confidently rejected, we also use the OLS results to derive confidence intervals (for a pre-specified confidence level) in order to obtain lower (Ξ L ) and upper (Ξ U ) bounds on the coefficient values.For a more extensive discussion on the statistical properties of OLS, the reader is referred to established texts on machine learning [36,7], as well as the statsmodels documentation [34].
While these steps can be a useful preliminary approach to eliminating basis functions from the dictionary in a specific application and for a given data set, they inherently rely on estimating the derivative from noisy state measurements (which introduces additional error to the already noise-contaminated data).To this end, we suggest that a high significance level should be used (in both Granger causality tests and OLS regression) for eliminating basis functions from the dictionary; this is a conservative approach that prevents eliminating the basis functions that do belong in the true system dynamics.Further, since high-noise measurements may significantly affect the accuracy of coefficient estimates in OLS, it is recommended that large confidence intervals (e.g., 99.99%) be used to estimate Ξ L and Ξ U .

DySMHO dynamic nonlinear optimization Discretization of the dynamic equations
In this work, discretization is performed with respect to the time domain (i.e., by defining a finite set of discrete points in time where the dynamic equations are evaluated), to convert the model in continuous time in (2) to a discrete time expression of the form of x(k + 1) = g(Θ(x(k)), Ξ) as used in optimization problem (4).Simultaneous strategies [29] are employed, whereby the discretized equations are incorporated as nonlinear algebraic constraints in problem (4).One of the simplest and most widely used class of methods are finite difference transformations, such as explicit and implicit Euler schemes.While generally more challenging to implement, collocation methods provide substantially more accurate approximations of the dynamical system, have good numerical stability allowing relatively large time steps to be considered, and are thus advantageous for stiff dynamical systems [29].Broadly speaking, collocation on finite elements entails partitioning the time domain into M − 1 finite elements, over which polynomials of order K + 1 are used to approximate the differential variable x(t) (each polynomial for each finite element is defined using K collocation points, which act as an additional discretization within each finite element).Additional constraints are introduced to enforce continuity across the finite element boundaries for each differential variable: [29].
where the state variable x(t) is interpolated using Lagrange polynomials as follows: A comprehensive discussion on discretization strategies can be found elsewhere [29].In this work, we leverage the pyomo.DAE [21] modeling extension that enables automatic simultaneous discretization of ODEs, and leverages Gauss-Legendre and Gauss-Radau collocation schemes to determine the interpolating points in (13). .We note that the collocation equations in (12) and ( 13) used to discretize the dynamics in the form of (2) are represented in compact form as x(k + 1) = g(Θ(x(k)), Ξ) in the DNLP in (4).
It should be noted that the optimal choice of collocation points may not align with the sample times t 1 , . . ., t m at which the data were originally collected, thus spline interpolation is required to approximate the data at the relevant time instants.We employ the SciPy package [37] in Python using a cubic spline to estimate the state values at the collocation points.

DySMHO moving horizon optimization
While the formulation introduced in (4) has several advantageous properties relative to prior regression-based framework, a key challenge is addressing the computational burden of solving this problem as the size of the data set and the dimension of the dictionary of basis functions increase.Longer time horizons for the data set require a larger number of finite elements and collocation points.This in turn increases the number of variables and constraints in (4).To alleviate the computational burden of solving the optimization problem in (4), we employ ideas stemming from control and estimation theory.In particular, we draw inspiration from moving horizon estimation (MHE) [38], which involves solving a sequence of state estimation problems (typically in the form of a DNLP) online and discarding old measurements for which state estimates have already been computed (instead of estimating states, DySMHO estimates the coefficients Ξ).For the purposes of DySMHO, these operations need not be performed online; the idea is that the original data set can be segmented into several sequential subsets of smaller size over which different instances of (4) are solved.
The key elements of the moving horizon strategy are illustrated in Fig. S.2 in the SM for data corresponding to the Lotka-Volterra system.Note that the number of optimization problems to be solved depends directly on the choice of optimization horizon H. Nevertheless, the problems in the sequence are likely significantly more computationally tractable than solving (4) for the entire (large-scale) data set.It is worth mentioning that, to date, no analytical frameworks exist for determining the optimal choice of horizon H.The empirical consensus is that longer horizons yield better results (i.e., convergence of the estimates to the true values of the parameters), which intuitively comes at a computational cost [39].For periodic systems, such as the Lotka-Volterra system shown in SM Fig. S.2, an intuitive choice for H can be an integer multiple of the period corresponding to the fundamental oscillation frequency, which can be estimated from the data.
A detailed outline of the moving horizon algorithm is presented next in SM Algorithm S.2.In brief, the algorithm consists of analyzing the set of smoothed state measurements in a moving horizon fashion as illustrated in SM Fig S .2., solving (4) and performing parameter thresholding every ω iterations (the thresholding process is described in Algorithm S.3).The moving horizon algorithm terminates when the training data are exhausted (recall that the original data set consists of a total of m measurements collected at times t 1 , . . ., t m ) or convergence is established, that is when the number of basis functions remaining in the library, denoted as |Θ|, does not change after a number Ω of thresholding steps.Upon convergence, the outputs of the algorithm are set of discovered basis function and the corresponding mean values of the coefficients obtained for the last ω × Ω iterations, for which |Θ| did not change.If the algorithm fails to converge and the data set is exhausted, the following additional steps can be attempted: collecting a larger data set, extending the dictionary of basis functions -possibly leveraging domain knowledge, and/or deriving tighter coefficient bounds.

Thresholding algorithm
A key component of DySMHO is systematically eliminating non-basic functions from the initial dictionary.The proposed thresholding approach is outlined in Algorithm S.3, and is embedded within the moving horizon scheme introduced previously in Algorithm S.2.The proposed framework leverages the fact that the values of non-basic coefficients are expected to be low in magnitude but to have significant variability when estimated using different subsets of the data.Such effects can be quantified statistically by computing the coefficient of variation (CV ), defined as the ratio between the standard deviation and the mean, evaluated for a series of coefficient estimates obtained from successive portions of the data.In our framework, the coefficient of variation is computed for the coefficients of all basis functions Θ θ ∀θ ∈ {1, . . ., |Θ|}, and for all state variables j ∈ {1, . . ., n x }.If the coefficient of variation CV θ,j is greater than the specified variability threshold γ, then the basis function is pruned and not considered in future iterations of the moving horizon scheme.Otherwise, the basis function Θ θ remains in the dictionary.
It is important to note that some basis functions are significantly more susceptible to thresholding that others.For example, basis functions such as {1 1 1, x} ∈ Θ are particularly prone to contributing to overfitting the initial measurement noise in the data, as well as the differentiation error introduced when the dynamics are discretized (particularly in high noise environments and when the initial function library is larger).To prevent spurious thresholding for this type of basis functions whose associated coefficients are likely so see greater variability across different data subsets, an alternative is to keep them in the basis for the first few thresholding steps regardless of their associated coefficient's observed CV .After these initial iterations and when (potentially) some of the other non-basic basis functions have been pruned, if basis functions like e.g.{1 1 1, x x x} are in fact basic they are expected to experience less variability in their respective coefficients and remain in the basis when the algorithm converges.

Figure 1 :
Figure 1: DySMHO model structure for discovering the the Lotka-Volterra predator-prey model, with ẋ1 = x 1 − 0.01x 1 x 2 , ẋ2 = −x 2 + 0.02x 1 x 2 .(A) Graphical representation of the dictionary of basis functions Θ(x T ) and sparse coefficient matrix Ξ in two dimensions.(B) Illustration of the values of basis functions for each of the two states in predator-prey model (colored lines indicate the true basis functions against the backdrop of all basis functions which are shown in grey).

Figure 2 :
Figure 2: Illustration of the DySMHO workflow: Noisy data x are initially collected.The data are smoothed and pre-processed and bounds on basis function coefficients are established.The smoothed data x and the bounds (Ξ L,(i) , Ξ U,(i) ) are used to formulate the moving horizon DNLP.Every ω iterations, thresholding is performed to eliminate non-basic parameters and coefficient bounds are tightened.Convergence is achieved once the set of basic functions does not change for a given number of iterations, or once the entire training data set has been exhausted.The discovered model is validated via simulation and comparison with the data.Qualitative comparisons can be performed using, e.g.,phase plane plots, while quantitative assessments rely on common regression performance metrics (e.g., mean squared error).

Figure 3 :
Figure 3: Numerical experiments of DySMHO for the dynamical systems considered.(A) True governing differential equations for dynamical systems.(B) Average number of basis functions remaining in the discovered model as a function of DySMHO iterations; a dashed black line is used to indicate the true number of terms (two-dimensional systems were initially modeled with 28 basis functions, and 3-dimensional systems with 66 (See Supporting Materials for further details)).(C) Average number of terms in the discovered governing equation after pre-processing and the final model after DySMHO converged.(D) Percent cumulative coefficient error relative to the coefficients in the true governing equations after each thresholding step for data simulated with increasing noise contamination.(Results show mean values obtained from 10 random samples of the simulated measurement data for each noise level considered, and error bars and shaded areas represent one standard deviation from the mean)

Figure 4 :
Figure 4: Validation of the discovered equations obtained through DySMHO for the dynamical systems considered.(A) Average mean squared error (MSE) between simulation obtained from the true model and simulation obtained for the discovered model for 10 realizations of training data for each noise level.(B) Comparison of discovered dynamics against true model and measured data (data measured with σ = 10, 0.1, 0.1, 0.5 respectively for Lotka-Volterra model, van der Pol oscillator, Brusselator, Lorenz system.(C) Discovered differential equations showing mean coefficient estimates (with standard deviation shown in parentheses) for the highest noise level considered for each dynamical system.
sizes.An illustration of the results produced by the smoothing algorithm for data sets with different amounts of measurement noise collected from the Lotka-Volterra system are shown in the SM Fig. S.1.

Figure S. 5 :Figure S. 6 :
Figure S.5: Illustration of smoothing algorithm for Lotka-Volterra predator-prey model.(A) results obtained for state x 1 .(B) results for state x 2 .Figures on the left show the smoothed system trajectories as a function of initial measurement noise at every iteration of the smoothing algorithm, figures on the right show the standard deviation of the estimated noise at every iteration.(Results obtained using α = 0.1, W S i = 10, ∆ = 10, and φ = 2) •) : R nx → R nx represents the (nonlinear) dynamics of the system.The function f is unknown and is precisely what we attempt to infer from a given set of time-resolved measurement data.To that end, we collect a sequence of measurements x(t k ) of the state variables observed at sampling times t 1 , . . ., t m , and assume that the derivative ẋ(t k ) cannot be directly observed.The data are assumed to be contaminated with (Gaussian, zero-mean) measurement noise, and smoothing techniques and statistical tests are used to perform pre-processing of the training data set (See Materials and Methods).The resulting pre-processed data are denoted by x(t k ) ∀k = 1, . . ., m.