Automated adaptive inference of phenomenological dynamical models

Dynamics of complex systems is often driven by large and intricate networks of microscopic interactions, whose sheer size obfuscates understanding. With limited experimental data, many parameters of such dynamics are unknown, and thus detailed, mechanistic models risk overfitting and making faulty predictions. At the other extreme, simple ad hoc models often miss defining features of the underlying systems. Here we develop an approach that instead constructs phenomenological, coarse-grained models of network dynamics that automatically adapt their complexity to the available data. Such adaptive models produce accurate predictions even when microscopic details are unknown. The approach is computationally tractable, even for a relatively large number of dynamical variables. Using simulated data, it correctly infers the phase space structure for planetary motion, avoids overfitting in a biological signalling system and produces accurate predictions for yeast glycolysis with tens of data points and over half of the interacting species unobserved.

parameters to produce reasonable predictions. Here we compare mean correlations produced for out-of-sample initial conditions chosen from ranges twice as large as in-sample ranges ("wide ranges," plotted in red, listed in the "out-of-sample" column of Supplementary Table 4) to when out-of-sample conditions are chosen from the same ranges as in-sample ranges ("narrow ranges," plotted in purple, listed in the "in-sample" column of Supplementary Table 4). For comparison, the simple sinusoidal model defined in (15) is shown in shades of blue. The mean and standard deviation over 5 realizations of in-sample data are shown by filled symbols and shaded regions. Also plotted are the Pareto fronts for each model (solid lines on right plot), indicating the maximal correlation for a given mean N_p.

Supplementary Table 5. Adaptive inference algorithm parameters. In the full phosphorylation model, we fit parameters in log-space since they are known to be positive. This makes the model more sensitive to large changes in parameters, meaning that we must be more conservative in taking large steps in parameter space to achieve reasonable acceptance ratios.

Supplementary Note 1. HIERARCHICAL BAYESIAN MODEL SELECTION
For consistent inference, we need a hierarchy of models that satisfies criteria laid out in Ref. [3].
First, we desire a model hierarchy that will produce a single maximum in L, up to statistical fluctuations, as we add complexity. For this, the hierarchy should be nested (but not necessarily regular or self-similar), meaning that once a part of the model is added, it is never taken away.
Second, the hierarchy should be complete, meaning it is able to fit any data arbitrarily well with a sufficiently complex model. Intuitively, instead of searching a large multidimensional space of models, hierarchical model selection follows a single predefined path through model space (Supplementary Figure 1). While the predefined path may be suboptimal for a particular instance (that is, the true model may not fall on it), even then completeness guarantees that we will still eventually learn any dynamical system F given enough data, and nestedness assures that this will be done without overfitting along the way.

Ordering of hierarchies: An advantage of the S-systems and sigmoidal representations is the existence of a natural scheme for creating a one-dimensional model hierarchy: simply adding dynamical variables x_i. The most general network is fully connected, such that every variable x_i has an interaction term in every other dx_j/dt. Our hierarchy starts with a fully connected network consisting of the necessary number of input and output variables, and adds "hidden" dynamical variables to add complexity. With each additional x_i, we add parameters in a predetermined order.
In the S-systems class, without connections, variable x_i's behavior is specified by 5 parameters: x_i^init, α_i, β_i, g_ii, and h_ii. Each connection to and from x_j is specified by 4 parameters: g_ij, g_ji, h_ij, and h_ji. When adding a new dynamical variable, we first fix its parameters (to zero for the exponential parameters and one for the multiplicative parameters), and then allow them to vary one at a time in the following order: g_ii, g_ji, h_ji, g_ij, h_ij, β_i, h_ii, α_i (adding connections to every other x_j one at a time). An example is shown in Supplementary Table 1.
The sigmoidal class is similar: without connections, variable x_i's behavior is specified by 4 parameters: x_i^init, W_ii, τ_i, and θ_i. Each connection to and from x_j is specified by 2 parameters: W_ij and W_ji. When adding a new dynamical variable, we first fix its parameters (to zero for W and θ and one for τ), and then allow them to vary one at a time in the following order: W_ij, W_ji, W_ii, τ_i, θ_i (adding connections to every other x_j one at a time). An example is shown in Supplementary Table 2.
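The predetermined ordering for the sigmoidal hierarchy can be made concrete with a short sketch. The enumeration below is one plausible reading of the order described above (Supplementary Table 2 gives the authoritative example); the function name and string labels are illustrative only, and initial conditions of hidden variables are omitted since they are held fixed.

```python
def sigmoidal_parameter_order(num_visible, num_hidden):
    """Enumerate sigmoidal-class parameters in the order they are
    allowed to vary as variables are added to the hierarchy.

    For each new variable i, connection weights W_ij, W_ji to each
    existing variable j are released one pair at a time, followed by
    the self-parameters W_ii, tau_i, theta_i.
    """
    order = []
    for i in range(1, num_visible + num_hidden + 1):
        for j in range(1, i):          # connections to every existing variable
            order.append(f"W_{i}{j}")
            order.append(f"W_{j}{i}")
        order += [f"W_{i}{i}", f"tau_{i}", f"theta_{i}"]
    return order
```

For example, `sigmoidal_parameter_order(1, 1)` first releases the lone visible variable's self-parameters, then the connections to the first hidden variable.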
For every adaptively fit model and the full multi-site phosphorylation model, we use the same prior for every parameter α_k, which we choose as a normal distribution N(0, 10^2) with mean 0 and standard deviation ς = 10. (For the simple model fit to the phosphorylation data, parameters are always well-constrained and priors are unimportant, and we therefore do not use explicit priors.)

Representation of sharp nonlinearities: Both the sigmoidal and S-systems classes can represent arbitrary dynamics. However, it is important that they can efficiently represent the sharp nonlinearities that are often present in biological systems, such as those typically represented by large Hill coefficients. While this is straightforward for the S-systems class [4], it is less obvious for sigmoidal models.
The sigmoidal model class relies on ξ(y), whose derivative is largest in magnitude at the origin, ξ′(0) = −1. Thus it may seem that sharp nonlinearities would be hard to produce. In fact, the introduction of hidden variables that perform multiple transformations can produce arbitrarily sharp production rate laws.
As an example, we show here that the nonlinearity captured by the Hill equation,

f = S^n / (S^n + K^n),     (1)

(where S is the substrate concentration, K is the dissociation constant, and f is the fraction of bound receptors) can be represented exactly in the sigmoidal class using two dynamical variables.
Treating I = log S as the input to the system, the sigmoidal system in which we set τ_1 = n and θ_1 = log K has a steady state solution that reproduces (1).

Robustness of the algorithm: First, the derivation in Supplementary Note 5 assumes that the distribution of noise on measured data is Gaussian with known variance. In Supplementary Figure 2A, we compare fitting to the same data but using an incorrect standard deviation for the noise on the data when calculating the Bayesian log-likelihood. When the data is thought to be noisier than it actually is (purple and red points), performance remains unchanged until large N, when, as expected, simpler than optimal models are chosen, and comparatively more data is required to select complex models that produce better performance. When the data is thought to be less noisy than it actually is (yellow points), more complex models are selected, which in this case yields performance that can be better or worse, depending on N. In Supplementary Figure 2B, we compare fitting to data with log-normally distributed noise, keeping the mean and variance fixed. The closely overlapping performance suggests that, in the absence of knowledge about the true noise distribution, a good estimate of σ may be enough to attain consistent inference.
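The Hill-equation example above works because, in terms of I = log S, the Hill function is itself a logistic sigmoid. A minimal numerical check of this identity, assuming the common form ξ(y) = 1/(1 + e^y) for the sigmoid:

```python
import numpy as np

def hill(S, K, n):
    """Fraction of bound receptors: f = S^n / (S^n + K^n)."""
    return S**n / (S**n + K**n)

def xi(y):
    """Sigmoid assumed here for the sigmoidal class: xi(y) = 1/(1 + e^y)."""
    return 1.0 / (1.0 + np.exp(y))

S = np.logspace(-3, 3, 201)
K, n = 0.5, 4.0
I = np.log(S)
# Hill function as a logistic in log-substrate: f = xi(-n * (I - log K)),
# matching the identification tau_1 = n, theta_1 = log K in the text.
assert np.allclose(hill(S, K, n), xi(-n * (I - np.log(K))))
```

The identity follows from S^n/(S^n + K^n) = 1/(1 + (K/S)^n) = 1/(1 + e^{−n(log S − log K)}).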
Finally, a somewhat arbitrary choice must be made to define an ordering for adding parameters in the model hierarchy; we chose to use the "node order" that is described in Supplementary Table 2.
In Supplementary Figure 2C, we instead add parameters for each dynamical variable in random order. This includes orderings that first add parameters controlling only hidden nodes, which may be decoupled from the visible variables and hence cannot improve the fit. To compensate for this and avoid erroneously stopping fitting due to adding these unproductive parameters, we increase the number of models checked by increasing i_overshoot from 3 to 4 (see Supplementary Note 6). One could additionally avoid unproductive orderings by checking that each additional parameter has some causal influence on visible variables. But even including these orderings, mean performance is largely unaffected.

Supplementary Note 2. THE LAW OF GRAVITY MODEL
For a mass m in motion under the influence of the gravitational field of a mass M ≫ m, the distance r between the two evolves as [5]

d²r/dt² = h²/r³ − GM/r²,

where h = (v_0 · θ̂) r_0 is the specific angular momentum, v_0 is the initial velocity, r_0 is the initial distance, θ̂ is the unit vector perpendicular to the line connecting the two masses, and G is the gravitational constant. Setting the initial velocity parallel to θ̂ and measuring distance in units of GM/v_0², the only remaining parameter is the initial distance r_0. When written as two first-order differential equations, this system can be represented exactly in the S-systems class if the particle does not fall onto the Sun, where we use the variable χ = dr/dt + 1 so that the resulting system's variables are never negative, a requirement of the S-systems class.
To illustrate constructing an adaptive model for planetary motion, we consider as input the initial distance from the sun r_0. We sample r_0 uniformly between 1 and 3 (in units of GM/v_0²), which covers the possible types of dynamics: at r_0 = 1, the orbit is circular; when 1 < r_0 < 2 the orbit is elliptical; when r_0 = 2 the orbit is parabolic; and when r_0 > 2 the orbit is hyperbolic.

Note that certain transformations of the hidden variable and parameters can leave the output behavior unchanged while remaining in the S-systems class. First, the initial condition of the hidden variable can be rescaled to 1 without loss of generality, so we remove this degree of freedom and set X_2(0) = 1. Second, we have the freedom to let the hidden variable X_2 → X_2^γ for any γ ≠ 0 with appropriate shifts in parameters. To more easily compare the fit model with the perfect model, in the rightmost column of Figure 1 we plot X_2² on the vertical axes instead of X_2 when comparing it to the dynamics of the true hidden variable χ.
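The classification of orbit types by r_0 follows from the orbital eccentricity: in the units above (GM = v_0 = 1, initial velocity perpendicular to the radius), e² = 1 + 2Eh² with E = 1/2 − 1/r_0 and h = r_0, which reduces to e = |r_0 − 1|. A small sketch encoding this (exact float comparisons at the boundaries are for illustration):

```python
def orbit_type(r0):
    """Classify the orbit for initial distance r0 (in units of GM/v0^2),
    with initial velocity perpendicular to the radius, as in the text.

    In these units the eccentricity is e = |r0 - 1|: r0 = 1 gives a
    circle, 1 < r0 < 2 an ellipse, r0 = 2 a parabola, r0 > 2 a hyperbola.
    """
    e = abs(r0 - 1.0)
    if e == 0.0:
        return "circular"
    if e < 1.0:
        return "elliptical"
    if e == 1.0:
        return "parabolic"
    return "hyperbolic"

assert [orbit_type(r) for r in (1.0, 1.5, 2.0, 3.0)] == \
    ["circular", "elliptical", "parabolic", "hyperbolic"]
```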
Finally, we may compare performance when we fit the gravitation data using sigmoidal models, a model class that we know is not representative of the underlying mechanics. The results are shown in Supplementary Figure 4; the selected sigmoidal network, which contains three hidden variables, still provides a good fit to the data, as expected, but it does not generalize as well when r 0 is near the edge of the range contained in the data and timepoints are outside of the range of data to which they were fit. This is expected since forces can diverge in the true law of gravity, and they are necessarily limited in the sigmoidal model.

Supplementary Note 3. MULTI-SITE PHOSPHORYLATION MODEL
To explore a complicated biological system with relatively simple output behavior, we imagine a situation in which an immune receptor can be phosphorylated at each of five sites arranged in a linear chain. The rates of phosphorylation and dephosphorylation at each site are affected by the phosphorylation states of its nearest neighboring sites. A site can be unphosphorylated (U) or phosphorylated (P), and its state can change via one of two processes. The first process does not depend on the states of neighboring sites, with on-rate k_i^on([U_i]) and off-rate k_i^off([P_i]) that depend on the concentration of the corresponding substrate. The second, cooperative process happens only when a neighboring site j is phosphorylated, with on- and off-rates k_ij^on([U_i P_j]) and k_ij^off([P_i P_j]). All rates k are modeled with Michaelis–Menten kinetics, k([S]) = V [S]/(K_m + [S]). With each reaction specified by two parameters (V and K_m) and 26 possible reactions, the phosphorylation model has a total of 52 parameters. To more easily generate the differential equations that govern the multi-site phosphorylation model, we use the BioNetGen package [6,7].
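The parameter count quoted above can be checked by a short enumeration; the function below is purely illustrative and not part of the BioNetGen model definition:

```python
def count_parameters(n_sites):
    """Count Michaelis-Menten parameters in the multi-site model sketched
    above: each site has a basal (de)phosphorylation reaction pair, each
    ordered (site, phosphorylated-neighbor) pair has a cooperative on/off
    reaction pair, and every reaction carries two parameters (V, K_m)."""
    basal = 2 * n_sites                 # k_i^on and k_i^off for each site
    neighbor_pairs = 2 * (n_sites - 1)  # ordered (i, j) with j adjacent to i
    cooperative = 2 * neighbor_pairs    # k_ij^on and k_ij^off per pair
    reactions = basal + cooperative
    return reactions, 2 * reactions     # (reactions, parameters)

assert count_parameters(5) == (26, 52)  # matches the text
```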
When fitting this phosphorylation model, we use as input the parameter V_23^on, which is chosen from a uniform distribution in log-space between 10^−3 and 10^3 min^−1. The remaining 51 V and K_m parameters we sample randomly from our priors on these parameters. As output, we measure the total phosphorylation of the 5 sites, P_tot, at a single random time uniformly chosen between 0 and 10 minutes. To each measurement we add Gaussian noise with standard deviation equal to 10% of the P_tot value at t = 10 min.
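This data-generation protocol can be sketched as follows; `simulate_P_tot` is a hypothetical placeholder for an integration of the full phosphorylation model, not an actual BioNetGen or SloppyCell call:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_measurements(simulate_P_tot, n_samples=100):
    """Generate noisy training measurements as described above.

    simulate_P_tot(V, t) stands in for integrating the full model with
    input parameter V = V_23^on and returning total phosphorylation at t.
    """
    data = []
    for _ in range(n_samples):
        V = 10.0 ** rng.uniform(-3, 3)         # log-uniform in [1e-3, 1e3] / min
        t = rng.uniform(0.0, 10.0)             # one random measurement time (min)
        sigma = 0.1 * simulate_P_tot(V, 10.0)  # noise scale: 10% of P_tot(t=10)
        y = simulate_P_tot(V, t) + rng.normal(0.0, sigma)
        data.append((V, t, y, sigma))
    return data
```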
Typical training data for the model is shown in Supplementary Figure 3. The out-of-sample mean squared error, as plotted in Figure 2, is measured over 100 new input values selected from the same distribution as the in-sample values, each of which is compared to the true model at 100 timepoints evenly spaced from 0 to 10 minutes.
As a simple guess at the functional form of the total phosphorylation timecourse as a function of our control parameter V = V_23^on (the "simple model" in Figure 2), we use an exponential saturation starting at 0 and ending at a value P_∞ that depends sigmoidally on V, where a, b, c, d, and t_0 are the parameters fit to the data. Figure 2 shows that this simple ad hoc model can fit the data quite well.
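Since the exact functional form is given by equations not reproduced here, the sketch below shows one plausible parameterization consistent with the description: exponential saturation from 0 toward an asymptote P_∞ that depends sigmoidally on log V.

```python
import numpy as np

def simple_model(t, V, a, b, c, d, t0):
    """One plausible parameterization of the simple model described above
    (the exact form is in the original supplement): exponential saturation
    from 0 toward P_inf(V), with P_inf sigmoidal in log V."""
    P_inf = a + b / (1.0 + np.exp(-c * (np.log(V) - d)))
    return P_inf * (1.0 - np.exp(-t / t0))
```

With five parameters (a, b, c, d, t_0), this matches the parameter list quoted in the text.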
For the example shown in Figure 3, the selected sigmoidal model consists of a small set of coupled ODEs in the sigmoidal class.

In this multi-site phosphorylation example, the sigmoidal model class is a better performer than the S-systems class. A typical example of performance is depicted in Supplementary Figure 6.
Though the S-systems class makes predictions that are still qualitatively correct, and its predictions steadily improve as N increases, the sigmoidal class comes closer to the true underlying model with an equal amount of data.
The confidence intervals on the dynamics in Figure 3 correspond to samples from the posterior over parameters given N = 300 data points. In the notation of Supplementary Note 5, this posterior is P(α | data) ∝ exp[−χ̃²(α)/2]. To generate samples from this distribution, we use Metropolis Monte Carlo as implemented in SloppyCell [8,9]. As a starting point, we use the best-fit parameters from the model selection procedure, and we sample candidate steps in parameter space from a multidimensional Gaussian corresponding to the Hessian at the best-fit parameters. From 10^4 Monte Carlo steps, the first half are removed to avoid bias from the initial condition, and every 50th of the remaining steps is used, producing 100 approximately independent samples from the parameter posterior.
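The burn-in and thinning scheme can be sketched with a generic Metropolis sampler. This is an illustration of the scheme, not the SloppyCell implementation: `log_post` and `step_cov` stand in for the model's log-posterior and the Hessian-derived proposal covariance.

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_samples(log_post, x0, step_cov, n_steps=10_000, thin=50):
    """Metropolis sampling with the scheme described above: discard the
    first half of the chain as burn-in, then keep every `thin`-th step."""
    chain, x = [], np.asarray(x0, float)
    lp = log_post(x)
    L = np.linalg.cholesky(step_cov)   # Gaussian proposal from a covariance
    for _ in range(n_steps):
        prop = x + L @ rng.standard_normal(x.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject
            x, lp = prop, lp_prop
        chain.append(x.copy())
    return np.array(chain[n_steps // 2 :][::thin])

samples = posterior_samples(lambda a: -0.5 * np.sum(a**2),
                            np.zeros(2), 0.5 * np.eye(2))
assert samples.shape == (100, 2)  # 10^4 steps -> 100 thinned samples
```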
Supplementary Figure 7 replots the performance data from Figure 2 as a function of the number of parameters in each model, showing a Pareto front indicating the minimum number of parameters needed to reach a given level of performance using these models.

Supplementary Note 4. YEAST GLYCOLYSIS MODEL
As an example of inference of more complicated dynamics, we use a model of oscillations in yeast glycolysis, originally studied in terms of temperature compensation [2] and since used as a test system for automated inference [1]. The model's behavior is defined by ODEs describing the dynamics of the concentrations of seven molecular species (the biological meaning of the species is not important here). Parameter values, listed in Supplementary Table 3, are set to match those used in Ref. [1] and Table 1 of Ref. [2], where our S_5 = N_2, our S_6 = A_3, and our S_7 = S_4^ex.
For the yeast glycolysis model, we use as input the initial conditions for the visible species S_1, S_2, and S_3. These are each chosen uniformly from ranges listed in the "In-sample IC" column of Supplementary Table 4. Out-of-sample performance is averaged over random choices of initial conditions and in-sample data, and the average is plotted as the "mean out-of-sample correlation" in Figure 4. The topology of the selected sigmoidal model in an example with N = 40 is illustrated in Supplementary Figure 8.

Note that our model fitting approach assumes that the model timecourse is fully determined (aside from measurement error) by the concentrations of measured species. To be consistent with this assumption, we do not vary the initial conditions of the three hidden variables. In future work it may be possible to relax this assumption, allowing the current state of intrinsic variations in hidden variables to be learned as well.
Simple sinusoidal model: As with the multi-site phosphorylation example, we can use a simple ad hoc model of yeast glycolysis for comparison to our adaptive models. The long-term behavior of the yeast network consists of stable oscillations with a roughly fixed period; a minimally complicated model of the measured concentrations S_1, S_2, and S_3 then consists of three sinusoidal oscillators with equal frequency ω and phase relationships fixed by two parameters, φ_2 and φ_3:

S_i(t) = A_i + B_i sin(ωt + φ + φ_i),  i = 1, 2, 3,     (15)

with φ_1 ≡ 0, so that the amplitudes A_i and B_i, the frequency ω, and the phase shifts φ_2 and φ_3 constitute the model's 9 parameters. The phase φ depends on the initial conditions S_1(0), S_2(0), S_3(0). Specifically, when the initial condition is a valid point on the one-dimensional elliptical curve specified by Eqs. (15), φ can be determined from any two initial values, for instance S_1(0) and S_2(0); we refer to the resulting expression as Eq. (16). Because the model is not exact, however, we cannot assume that initial conditions will lie on this curve. Instead, we assume that transient dynamics infinitely quickly bring the state of the system into the plane defined by the curve. This plane has normal vector n = (sin(φ_2 − φ_3), sin φ_3, −sin φ_2), so that any initial condition x can be projected onto it as x̃ = x − (n · x / n · n) n. This x̃ is a modified initial condition that is inserted into (16) to obtain φ. Unlike the adaptive model, this simple sinusoidal model does not capture the jagged shape of the yeast glycolysis oscillations, but when its 9 parameters are fit to data, its rough approximation is moderately predictive. Its performance is compared to sigmoidal adaptive models in Supplementary Figure 9.
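The quoted normal vector can be verified numerically. The check below assumes unit-amplitude, zero-offset oscillators (i.e., coordinates normalized by amplitude and offset): every point of the limit cycle is orthogonal to n, and an arbitrary initial condition can be projected onto the plane as described.

```python
import numpy as np

# Normal vector quoted in the text: n = (sin(phi2 - phi3), sin phi3, -sin phi2).
phi2, phi3 = 0.7, -1.2
n = np.array([np.sin(phi2 - phi3), np.sin(phi3), -np.sin(phi2)])

# The (normalized) limit cycle u(psi) = (sin psi, sin(psi+phi2), sin(psi+phi3))
# lies entirely in the plane with normal n.
psi = np.linspace(0.0, 2 * np.pi, 100)
u = np.stack([np.sin(psi), np.sin(psi + phi2), np.sin(psi + phi3)])
assert np.allclose(n @ u, 0.0)

# Projection of an arbitrary initial condition onto the plane.
x = np.array([0.3, -0.8, 0.5])
x_proj = x - (n @ x) / (n @ n) * n
assert abs(n @ x_proj) < 1e-12
```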
Comparing to EUREQa: In Ref. [1], the EUREQa engine is used to infer the same yeast glycolysis model that we use here. We can roughly compare performance as a function of computational and experimental effort by measuring the number of required model evaluations and measurements (Figure 4). Here we compare the two approaches in more detail. However, we emphasize that they have different goals: EUREQa aims at finding the exact microscopic model of the process, while Sir Isaac strives for accurate prediction with the simplest phenomenological model.
The former is a harder task, and thus one expects it to require more data and computation.
Reference [1] attempts to match time derivatives of species concentrations as a function of species concentrations, instead of species concentrations as a function of time as we do. This means that each model evaluation is more computationally costly for us, since it requires an integration of the ODEs over time. It also means, however, that we are able to match well the phases of oscillations, which remain unconstrained in Ref. [1]. The fitting of time courses instead of derivatives also makes our method focus on the fitting of dynamics near the attractor, rather than attempting to constrain dynamics through the entire phase space.
To consistently infer exact equations for the full 7-dimensional model, Ref. [1] used 20,000 datapoints and roughly 10^11 model evaluations. We contrast this with our method, which produces reasonable inferred models using 40 datapoints and fewer than 5 × 10^8 model evaluations (Figure 4).
Finally, in the main text we test the performance of our yeast glycolysis models for out-of-sample ranges of initial conditions that are twice as large as the in-sample ranges from which data is taken, as in Ref. [1], in order to more directly test their ability to extrapolate to regimes that were not tested in training. In Supplementary Figure 9, we compare this to performance when out-of-sample initial conditions are chosen from the same ranges as in-sample data (note that, nonetheless, none of the test examples has appeared in the training set). Here we see that the mean correlation can reach 0.9 using N = 40 measurements.
Comparing to dynamical inference with alternating regression: In Ref. [10], we used a somewhat similar dynamical inference algorithm for analysis of the same glycolysis model. It is worthwhile contrasting the two methods.
First, as we have argued here, the crucial difficulty of adaptive dynamical inference is to account for both arbitrary nonlinearities and an arbitrary number of hidden variables underlying the data. Sir Isaac does this by traversing hierarchies of models that are complete and can deal with any number of missing variables (e.g., four in the glycolysis example). In contrast, Ref. [10] required knowing every variable in the system, similarly to the EUREQa implementation discussed above [1]; while Ref. [10] controlled the number of (nonlinear) interactions in the system adaptively using Bayesian model selection, it could not introduce hidden variables. Second, the goal of Sir Isaac is to predict the temporal dynamics of the studied system by inferring the dynamical system from time series and by integrating the inferred dynamics with new initial conditions. In contrast, in Ref. [10], we followed the same route as Ref. [1] and required knowing derivatives in addition to values of variables. Further, the aim of the approach was to predict the derivatives, but not the variables themselves. In particular, there were no guarantees that the integrated trajectories would be close to the true ones, or even that they would stay within a physically reasonable range (such as positive concentrations).
These strong limitations of Ref. [10] are overcome by Sir Isaac.

Supplementary Note 5. BAYESIAN MODEL SELECTION DERIVATION

Multiple previous approaches have used approximate sampling methods to perform Bayesian model selection on a small number of alternative models in the context of systems biology; e.g., [11][12][13]. For our approach, which relies on a search over an infinite set of models, even such approximate sampling is slow. Yet with sufficiently large N, an expansion resembling that used to derive the Bayesian Information Criterion produces good performance without sampling. The derivation here largely follows Refs. [14,15], but can be traced to the 1970s [16].
The posterior probability of a model M with parameters α given data {y_i} involves the prior over parameters P(α), with normalization constant Z_α = ∫ d^{N_p}α P(α), where N_p is the number of parameters. In terms of the probability of the output given the model, Bayes rule states

P(M | {y_i}) = P({y_i} | M) P(M) / P({y_i}).

Assuming that the model output has normally distributed measurement errors,

P({y_i} | α, M) = Z_σ^{−1} exp[−χ²(α)/2],

where χ² is the usual goodness-of-fit measure consisting of the sum of squared residuals, and Z_σ is the corresponding normalization. The posterior over models then becomes an integral over parameters,

P(M | {y_i}) = C ∫ d^{N_p}α exp[−χ̃²(α)/2],

where C ≡ 2P(M)/[Z_σ P({y_i})] and χ̃²(α) = χ²(α) − 2 log P(α). Since we will be comparing models fitting the same data, and we assume all models have the same prior probability P(M), C will be assumed constant in all further comparisons (but see Ref. [17] for the discussion of this assumption).
If there are enough data to sufficiently constrain the parameters (as is the case for ideal data in the limit N → ∞), then the integral will be dominated by the parameters near the single set of best-fit parameters α_best. To lowest order in 1/N, we can approximate the integral using a saddle-point approximation [15]:

∫ d^{N_p}α exp[−χ̃²(α)/2] ≈ exp[−χ̃²(α_best)/2] (2π)^{N_p/2} (det H)^{−1/2},

where H is the Hessian,

H_jk = ∂²[χ̃²(α)/2] / ∂α_j ∂α_k |_{α_best}.     (23)

If we assume normally distributed priors on parameters with variances ς_k², the log posterior probability becomes, up to a constant common to all models,

log P(M | {y_i}) ≈ −χ̃²(α_best)/2 − (1/2) Σ_µ log λ_µ − Σ_k log ς_k,     (25)

where λ_µ are the eigenvalues of H, and the last term comes from Z_α. We thus use Eq. (25) as our measure of model quality. Eq. (25) is a generalization of the Bayesian Information Criterion (BIC) [16] in which parameter sensitivities and priors are explicitly included. The first term is the familiar χ² "goodness of fit," and the last two terms constitute the fluctuation "penalty" for overfitting or complexity. Note that here the goodness of fit and the complexity penalty are both functions of the entire dynamics, rather than individual samples, which is not a common application of Bayesian model selection techniques.
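Given the best-fit goodness of fit, the Hessian, and the prior widths, the model-quality measure can be sketched as below. This assumes the three-term form −χ̃²(α_best)/2 − (1/2)Σ_µ log λ_µ − Σ_k log ς_k described above, with constant offsets common to all models dropped.

```python
import numpy as np

def model_quality(chi2_tilde_best, hessian, prior_stds):
    """Sketch of the BIC-like model-quality measure: best-fit goodness
    of fit minus a fluctuation penalty built from the Hessian eigenvalues
    (lambda_mu) and the prior widths (sigma-final, varsigma_k). Assumes a
    positive-definite Hessian; model-independent constants are dropped."""
    lam = np.linalg.eigvalsh(hessian)          # eigenvalues lambda_mu of H
    penalty = 0.5 * np.sum(np.log(lam)) + np.sum(np.log(prior_stds))
    return -0.5 * chi2_tilde_best - penalty

# At equal goodness of fit, an extra well-constrained parameter is penalized.
q1 = model_quality(10.0, np.diag([4.0]), [10.0])
q2 = model_quality(10.0, np.diag([4.0, 4.0]), [10.0, 10.0])
assert q2 < q1
```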

Supplementary Note 6. FITTING ALGORITHM
We are given N data points x_i at known times t_i and known exogenous parameters I_i, with known or estimated variances σ_i². We are approximating the functions F_x and F_y in Eq. (1), where y are hidden dynamic model variables, and x = x(t, I) and y = y(t, I) in general depend on time t and inputs I. As described in Supplementary Note 5, we fit to the data x_i using a combination of squared residuals from the data and priors P(α) on parameters α, which we assume to be Gaussian and centered at zero:

χ̃²(α) = Σ_i [x(t_i, I_i; α) − x_i]² / σ_i² + Σ_k α_k² / ς_k² + const.

To fit parameters, we use a two-step process akin to simulated annealing, which uses samples from a "high temperature" Monte Carlo ensemble as the starting points for local optimization performed using a Levenberg-Marquardt routine. The phenomenological models are implemented using SloppyCell [8,9] in order to make use of its parameter estimation and sampling routines. For the Monte Carlo ensemble:

i. Use as a starting point the best-fit parameters from a smaller N_p if a smaller model has been previously fit, or else default parameters.
ii. As a proposal distribution for candidate steps in parameter space, use an isotropic Gaussian with standard deviation √(T N_D)/λ_max, where N_D is the total number of data residuals and λ_max is the largest singular value of the Hessian [Eq. (23)] at the starting parameters.
iii. If this model has previously been fit to less data, use those parameters as an additional member of the ensemble.

When the model size saturates, the number of model evaluations scales roughly linearly with N (Supplementary Figure 11). The number of integrations per gradient calculation is proportional to the number of parameters. This means that the computational effort used to fit large models is dominated by gradient calculations.
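The two-step procedure above can be sketched in self-contained form. This is an illustration, not the SloppyCell routine: the fixed proposal width is a simplified stand-in for the √(T N_D)/λ_max rule of step ii, and the local optimizer is a minimal Levenberg-Marquardt loop with a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(2)

def lm_refine(residuals, p0, n_iter=100, lam=1e-3, h=1e-7):
    """Minimal Levenberg-Marquardt local optimizer (stand-in for the
    SloppyCell routine), using a forward-difference Jacobian."""
    p = np.asarray(p0, float)
    cost = lambda q: 0.5 * np.sum(residuals(q) ** 2)
    for _ in range(n_iter):
        r = residuals(p)
        J = np.stack([(residuals(p + h * e) - r) / h
                      for e in np.eye(p.size)], axis=1)
        step = np.linalg.solve(J.T @ J + lam * np.eye(p.size), -J.T @ r)
        if cost(p + step) < cost(p):
            p, lam = p + step, lam * 0.5   # accept step, trust model more
        else:
            lam *= 10.0                    # reject step, damp harder
    return p

def fit_parameters(residuals, starts, T=10.0, n_mc=200, prop_std=0.1):
    """Two-step fit sketched above: a 'high temperature' Metropolis walk
    over cost = 0.5*sum(residuals^2) generates diverse starting points,
    each refined by Levenberg-Marquardt; the best refined fit wins."""
    best, best_cost = None, np.inf
    for p0 in starts:
        p = np.asarray(p0, float)
        c = 0.5 * np.sum(residuals(p) ** 2)
        for _ in range(n_mc):              # high-temperature walk
            q = p + prop_std * rng.standard_normal(p.size)
            cq = 0.5 * np.sum(residuals(q) ** 2)
            if np.log(rng.uniform()) < (c - cq) / T:
                p, c = q, cq
        p = lm_refine(residuals, p)        # local refinement
        cp = 0.5 * np.sum(residuals(p) ** 2)
        if cp < best_cost:
            best, best_cost = p, cp
    return best

# toy check: recover the minimum of a quadratic cost
res = lambda p: np.array([p[0] - 1.0, p[1] + 2.0])
p_best = fit_parameters(res, [np.zeros(2)])
assert np.allclose(p_best, [1.0, -2.0], atol=1e-5)
```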

Supplementary Note 8. COMPARISON TO BAYESIAN NETWORK APPROACHES
A related set of methods for inferring causal structure from time series data comes from the field of Bayesian Networks (BN), and specifically Dynamic Bayesian Networks (dBN). Implementations typically make the following assumptions:

1. Variables are updated at a discrete set of times, rather than continuously.
2. Latent variables are not allowed, or their number is known a priori.
3. The state space of dynamical variables is itself discrete (and often of low cardinality, such as binary or ternary).
Many generalizations of (d)BNs have been presented that lift each of these assumptions. Below we include a brief literature review of current implementations of (d)BNs that address each issue.
However, our method is distinct from (d)BNs in that it is designed to perform inference simultaneously for continuous variables, in continuous time, with potentially a very large number of unknown hidden nodes, and we are not aware of an approach that is able to lift all three assumptions in order to analyze the type of data handled by Sir Isaac.
Continuous time: It is known that exact inference is intractable in continuous time versions of dBNs because calculating a node's distribution at a given time step does not easily factor into conditionally independent subsets, as it does in cases with discrete time. Instead, each node's distribution will in general depend on the entire history of all other variables [18]. Approximate methods have been developed to deal with such Continuous Time Bayesian Networks (CTBNs) [18,19]. Conversely, converting a set of ODEs, such as those explored by Sir Isaac, into the dBN framework generally leads to an exponentially large model [20] that cannot be readily inferred from data.
Continuous states: Most implementations deal with discrete state variables, to avoid the need to infer multidimensional distributions over continuous variables, which can require very large data sets. It is also relatively common to use continuous variables by specifying the state of nodes as finite-parameter continuous distributions, such as Gaussians. However, these differ from