Introduction

A biochemical reaction network is a key concept for understanding how higher-order functions in the cell emerge from relatively simple individual elements, such as proteins and metabolites. Reaction network systems are often nonlinear and complex and can display various dynamic behaviors, such as ultrasensitivity, bistability, and oscillation1,2,3,4,5,6, that form the basis of diverse cellular phenotypes. Because of this complexity, in silico analysis based on mathematical modeling and numerical simulation is an essential strategy for quantitatively understanding a system of interest. Mathematical analysis can help abstract away the nonessential particulars of individual biological systems and identify the core principles that govern the behavior and function of the system in the cell. Using these approaches, various studies have revealed relationships between the behavior of a system and its underlying mechanisms, including feedback/feedforward loops, cross-talk, compartmentalization, and noise7,8,9,10,11,12.

There are at least two distinct stages of in silico network analysis. The first is construction of a mathematical model that describes the system, and the second is analysis of the model. Although the second stage depends strongly on the aim of the study, a mathematical model is required regardless of its details. Typically, modeling of a target system is performed in a patchwork manner, meaning that fragments of studies on specific reactions are integrated to construct a map of the reaction network13,14,15,16. Although this procedure is straightforward, selecting the sources for each reaction that constitutes the network is a non-trivial task that might raise concerns regarding the validity of the modeling. Alternatively, a data-driven approach that incorporates as few assumptions as possible when inferring a network model can compensate for the shortcomings of patchwork modeling.

Data-driven inference of biochemical network models has been studied extensively17,18,19,20, and both genome-wide networks and cell-specific gene regulatory and posttranslational modification networks have been systematically reconstructed21,22. Additionally, although the regulatory relationships in inferred networks often represent only linear or binary correlations among nodes, efforts are underway to identify nonlinear ordinary differential equation (ODE) systems23,24,25,26. However, the intersection between systematic model inference and network modeling with nonlinear ODEs has received less attention27,28. Therefore, a framework that enables data-driven modeling of a network while accounting for the nonlinearity of the system is needed. Furthermore, recent advances in experimental methods have made highly quantitative, time-resolved data available at the single-cell level29, making it desirable for such a framework to handle single-cell datasets.

To address these problems, we developed a method combining an expectation-maximization (EM) algorithm with a particle smoother and sparse regularization. Using this method, we showed that an oscillatory network model can be systematically inferred based only on single-cell time-course data. Briefly, our strategy is as follows (Fig. 1): (1) quantitatively measure components of the network and obtain a single-cell dataset, (2) prepare a “redundant” model where an excessive number of reaction paths and nodes are defined using nonlinear ODEs, and (3) perform model learning using the dataset while eliminating unnecessary paths in the redundant model to identify the most probable model. We evaluated the performance of the method using artificial time-course data and showed that the algorithm accurately inferred the true network model in a data-driven manner.

Figure 1

Schematic representation of the proposed method for data-driven inference of biochemical network models. Details of each step are described in the text.

Results

Maximum likelihood parameter estimation in a biochemical network model

We introduced the following nonlinear state space model:

$$\begin{array}{rcl}{{\boldsymbol{x}}}_{t} & = & {\boldsymbol{f}}({{\boldsymbol{x}}}_{t-1})+{{\boldsymbol{v}}}_{t}\\ {{\boldsymbol{y}}}_{t} & = & {\boldsymbol{h}}({{\boldsymbol{x}}}_{t})+{{\boldsymbol{w}}}_{t}\end{array}$$
(1)

where x and y denote state variables (e.g., amounts of mRNA and protein) and measurements (e.g., fluorescence intensity), respectively. Function f describes the evolution of the system and can be calculated as \({\boldsymbol{f}}({{\boldsymbol{x}}}_{t-1})={{\boldsymbol{x}}}_{t-1}+{\int }_{t-1}^{t}{\boldsymbol{g}}({{\boldsymbol{x}}}_{\tau },{{\boldsymbol{\theta }}}_{sys})d\tau \), where, in general, g represents the ODEs that describe the biochemical reaction network of interest, and \({{\boldsymbol{\theta }}}_{sys}\) denotes the model parameters. Function h represents the process of measuring x. Vectors \({{\boldsymbol{v}}}_{t}\) and \({{\boldsymbol{w}}}_{t}\) denote system noise and measurement noise, respectively, which we assumed to follow Gaussian distributions. Given dataset \(Y=\{{Y}^{(a)}\}(a=1,\ldots ,A)\), where a is an index of each cell and \({Y}^{(a)}\) represents the single-cell time-course data, estimation of θ, the set of parameters that characterize the state space model, can be accomplished by maximizing the log-likelihood of the model. Here, we employed an EM algorithm to find the maximum likelihood estimates of θ. Note that the expectation step of the algorithm is analytically intractable, because it requires the probability distribution of the state time course at all time points. Therefore, we numerically approximated this probability distribution using a particle smoother algorithm30 (Materials and Methods). We refer to this algorithm31,32 as the EM-PS (particle smoother) algorithm.
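To make the notation concrete, Eq. (1) can be simulated for a single cell as in the minimal sketch below, assuming the linear observation h(x) = x described in Materials and Methods: the transition function f integrates a user-supplied ODE right-hand side g over one sampling interval, and Gaussian system and measurement noise are added at each step. All names are illustrative; this is not the implementation used in this study.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(x_prev, g, theta_sys, dt=1.0):
    """One-step state transition: integrate dx/dt = g(x, theta_sys) over one sampling interval."""
    sol = solve_ivp(lambda t, x: g(x, theta_sys), (0.0, dt), x_prev, rtol=1e-6)
    return sol.y[:, -1]

def simulate_cell(x0, g, theta_sys, sigma, eta, T, rng):
    """Draw one trajectory (x_t, y_t) from the state space model of Eq. (1),
    with h(x) = x and independent Gaussian system/measurement noise."""
    k = len(x0)
    x = np.empty((T + 1, k))
    y = np.empty((T + 1, k))
    x[0] = np.asarray(x0, dtype=float)
    y[0] = x[0] + eta * rng.standard_normal(k)
    for t in range(1, T + 1):
        x[t] = f(x[t - 1], g, theta_sys) + sigma * rng.standard_normal(k)  # system noise v_t
        y[t] = x[t] + eta * rng.standard_normal(k)                          # measurement noise w_t
    return x, y
```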

Next, we tested whether the EM-PS algorithm could provide correct estimates in a given model using artificial time-course data. To generate artificial data, we constructed a gene regulatory network in silico that consisted of three genes (X, Y, and Z) and a negative feedback loop (Fig. 2a). The network produced an oscillatory expression pattern with appropriate parameters. We used the Hill function to express reactions involving either activator or repressor molecules, because the activity of such regulators is often nonlinear (Supplementary Information). For simplicity, we used first-order kinetics for the degradation process. To mimic a realistic biological experiment in which cell-to-cell variability and observation noise exist, we numerically solved the model as nonlinear stochastic Langevin equations and added Gaussian noise as observation error to each value to generate artificial single-cell time-course data (Fig. 2b).
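As an illustration of the kind of ODE right-hand side used here (the exact equations and parameter values are given in the Supplementary Information), a three-gene loop of the type shown in Fig. 2a could be written as follows. The parameterization and values below are placeholders, not those of this study.

```python
import numpy as np

def g_three_gene(x, theta):
    """dx/dt for a three-gene negative feedback loop (X activates Y, Y activates Z,
    Z represses X), with Hill-type production and first-order degradation.
    Illustrative form only; see the Supplementary Information for the actual model."""
    X, Y, Z = x
    a, K_xy, K_yz, K_zx, n, d = theta
    dX = a / (1.0 + (K_zx * Z) ** n) - d * X                      # production of X repressed by Z
    dY = a * (K_xy * X) ** n / (1.0 + (K_xy * X) ** n) - d * Y    # production of Y activated by X
    dZ = a * (K_yz * Y) ** n / (1.0 + (K_yz * Y) ** n) - d * Z    # production of Z activated by Y
    return np.array([dX, dY, dZ])

# Placeholder parameters: production scale, three association constants, Hill coefficient, degradation rate
theta_example = (2.0, 1.0, 1.0, 1.0, 3.0, 0.2)
```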

Figure 2

Maximum likelihood estimation of model parameters using the EM-PS algorithm. (a) Schematic of a three-component negative feedback oscillator model. (b) Artificial measurement data were generated by numerically solving the model as nonlinear stochastic Langevin equations, followed by addition of Gaussian noise to each value to simulate the measurement process. The dataset consists of 10 independent time-course data, with two examples (#1 and #2) shown. (c) Iterative estimation of the states using the EM-PS algorithm. Each dot represents the (artificial) measurement data, and lines denote the trajectories sampled by the particle smoother. (d) Log-likelihood values plotted as a function of the iteration number. Note that small fluctuations were observed even after convergence because of the stochastic nature of the EM-PS algorithm. (e) The difference between the estimated and correct parameter values, shown as a ratio. We tested two different sets of initial parameter values: one was 1/100 of the correct values (Initial #1), and the other was randomly generated in the range of 1/30× to 30× the correct values (Initial #2).

Using the artificial data and EM-PS algorithm, we conducted maximum likelihood estimation of the model parameters. We observed a monotonic increase in the log-likelihood during iterations of the algorithm, and eventually the estimated states were consistent with the data (Figs 2c and d and S1). Note that small fluctuations were observed, even after convergence due to the stochastic nature of the particle smoother algorithm implemented in the EM-PS algorithm. Differences between the correct and estimated values were <11%, even though the dataset contained significant cell-to-cell variability. Additionally, the algorithm was robust over a wide range of initial values (Fig. 2e; see Initial #1 and #2). Interestingly, the dynamics of parameter convergence varied among parameters and were not always monotonic (Supplementary Fig. S2), indicating that the likelihood function had a complicated landscape in the parameter space. These results revealed that the EM-PS algorithm represented a powerful approach for parameter estimation in nonlinear biochemical network models.

Inferring network topology using a sparse regularized EM-PS algorithm

Next, we extended the algorithm to infer not only parameter values but also network topology. We focused on the fact that biochemical networks are sparse33,34,35, which means that the number of regulatory paths is much smaller than the number of possible links between nodes. To utilize the sparsity of biochemical networks for inference35,36,37, we introduced a regularization term referred to as the least absolute shrinkage and selection operator (Lasso), which is a simple yet powerful technique that provides a sparse solution38. In our strategy, we first prepared a “redundant” model consisting of an excessive number of regulatory paths among genes, followed by elimination of less important paths by Lasso.

To construct the redundant model, we used the Hill function, because it can express both a linear and nonlinear reaction depending on parameters K and n, which denote the (apparent) association constant [reciprocal of the (apparent) dissociation constant] and Hill coefficient, respectively. For example, the activity of transcription activator, A, or repressor, R, was expressed as \({c}_{A}={(K[A])}^{n}/(1+{(K[A])}^{n}),\,{c}_{R}=1/(1+{(K[R])}^{n})\). Assuming a common situation in which regulators function independently39, overall gene expression that is regulated by virtually any gene in the system (redundant model) can be written as follows:

$${\rm{production}}\,{\rm{rate}}=(\sum _{i}\,{a}_{i}\frac{{({K}_{i}[{X}_{i}])}^{{n}_{i}}}{1+{({K}_{i}[{X}_{i}])}^{{n}_{i}}})\cdot \prod _{j}\frac{1}{1+{({K}_{-j}[{X}_{j}])}^{{n}_{j}}}$$
(2)

where i and j represent the indices of the activators and repressors, respectively. We exploited the fact that a path effectively does not exist [\({c}_{A}=0\) (no activator activity) or \({c}_{R}=1\) (no repressor activity)] when parameter K = 0; that is, a zero or nonzero association constant characterizes the absence or presence of the corresponding regulatory path in the model. With this notation, the condition that biochemical networks are sparse is equivalent to most association constants (K) in the redundant model being equal to zero. Therefore, the association constants were subjected to regularization, which virtually removes the less important paths from the network. Accordingly, we rewrote the equations for the EM steps as follows:

$$\begin{array}{rcl}Q^{\prime} ({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}^{({\rm{old}})}) & = & Q({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}^{({\rm{old}})})-\,\lambda \sum _{s}|{K}_{s}|\\ {{\boldsymbol{\theta }}}^{({\rm{new}})} & = & {\rm{\arg }}\,\mathop{{\rm{\max }}}\limits_{{\boldsymbol{\theta }}}Q^{\prime} ({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}^{({\rm{old}})})\end{array}$$
(3)

where s represents the index of the association constant, and λ denotes the strength of the regularization term (details are provided in Materials and Methods). We referred to this algorithm as the EM-PS-Lasso algorithm.
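The two ingredients above can be sketched as follows: a production-rate function in the redundant form of Eq. (2), where setting an association constant to zero silences the corresponding path, and a Lasso-penalized M step corresponding to Eq. (3). This is a sketch under assumptions (a generic, user-supplied Q function; illustrative names), not the implementation used in this study. Note that with the non-negativity constraint on parameters used here (Materials and Methods), |K_s| reduces to K_s, so the penalized objective remains amenable to a smooth bound-constrained optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def production_rate(x, a, K_act, n_act, K_rep, n_rep):
    """Redundant production rate of one gene, Eq. (2): a sum of Hill activation terms
    multiplied by a product of Hill repression terms. A zero association constant
    removes the corresponding regulatory path."""
    act = sum(a_i * (K * xi) ** n / (1.0 + (K * xi) ** n)
              for a_i, K, xi, n in zip(a, K_act, x, n_act))
    rep = np.prod([1.0 / (1.0 + (K * xi) ** n) for K, xi, n in zip(K_rep, x, n_rep)])
    return act * rep

def m_step_lasso(Q, theta_old, theta_init, K_indices, lam, bounds):
    """Lasso-penalized M step, Eq. (3): maximize Q(theta, theta_old) - lam * sum_s |K_s|.
    Q is a user-supplied function of theta; K_indices selects the association constants."""
    def neg_objective(theta):
        penalty = lam * np.sum(np.abs(theta[K_indices]))
        return -(Q(theta, theta_old) - penalty)
    res = minimize(neg_objective, theta_init, method="L-BFGS-B", bounds=bounds)
    return res.x
```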

We then tested the performance of the EM-PS-Lasso algorithm using the artificial data (Fig. 2b). We assumed a situation in which we had time-course data for genes X, Y, and Z but no prior knowledge of their regulatory relationships. Therefore, we constructed a model incorporating all possible regulatory paths among the three genes (18 paths) (Fig. 3 and Supplementary Information). Using the redundant model and the artificial single-cell time-course data, we conducted network inference and parameter estimation with the EM-PS-Lasso algorithm. Because the algorithm requires the parameter λ, which controls the strength of the penalty term, we evaluated the log-likelihood of the model as a function of λ (Figs 4a and S3). We also examined the log-likelihood on unseen test data and confirmed that the estimation did not suffer from overfitting. As expected, too large a value of λ resulted in a failure to fit the data, because all parameters were estimated to be zero (Supplementary Fig. S4). Values of λ from 0.1 to 10 yielded high log-likelihood values, implying potentially good inference; however, too small a value of λ (λ = 0.1, 1) resulted in inference of overly redundant and biologically inconsistent models (e.g., gene Z simultaneously autoactivating and autorepressing itself) (Supplementary Fig. S4). We therefore rejected these models (Supplementary Fig. S4). Consequently, the results at λ = 3 and 10 were systematically selected as candidates for the inferred model.

Figure 3

Schematic of the redundant model. We assumed no prior knowledge regarding the regulatory relationships in the network. Therefore, the model consists of an excessive number of regulatory paths among genes. The numbers shown in the network scheme represent the index of each reaction path.

Figure 4

Inferring the network model via the EM-PS-Lasso algorithm. (a) The models were inferred using the EM-PS-Lasso algorithm, with different values for the regularization parameter, λ, and using the artificial data and redundant model. Log-likelihood values at iteration number 100 were plotted as a function of λ. (b) Relationship between the number of effective paths in the inferred models and λ.

At λ = 3, the estimated states based on the inferred model were consistent with the data (Fig. 5a). In the inferred model, three of the 18 association constants had nonzero values, indicating that only these three regulatory paths were crucial for reproducing the data (Fig. 5b). The paths consisted of activation of gene Y by gene X, activation of gene Z by gene Y, and repression of gene X by gene Z, which is equivalent to the true network (Fig. 2a). Removal of paths from the redundant model during iterations of the algorithm occurred over several steps rather than at a single step (Fig. 5c and d). We also confirmed that model parameters other than the association constants, such as degradation rate constants and Hill coefficients, were successfully estimated (Fig. 5e). The same network model was inferred at λ = 10 (Supplementary Fig. S5), although the dynamics of path removal from the redundant model differed markedly from those at λ = 3. Overall, we demonstrated that the EM-PS-Lasso algorithm enabled both estimation of model parameters and inference of network topology. Furthermore, our results indicated that rich information regarding network topology is embedded in single-cell time-course data, even when the data are highly dynamic, nonlinear, and heterogeneous.

Figure 5

Data-driven inference of a three-component oscillator model. (a) Estimated states after 100 iterations of the algorithm with λ = 3. Each dot represents artificial data (Fig. 2b), and lines indicate the estimated trajectories. (b) Values of the association constants after 100 iterations of the algorithm with λ = 3. Each parameter index corresponds to the reaction number in Fig. 3. (c) Values of the association constants in the model plotted as a function of the iteration number. (d) Schematic representation of the inferred network. Red arrows represent effective paths whose association constants have nonzero values, whereas light-gray arrows are paths with no regulatory activity, because their association constants are zero. (e) The difference between the estimated and correct parameter values, shown as a ratio.

In general, the number of effective paths with nonzero association constants decreased as λ increased (Fig. 4b). Note that there was an apparent increase in the number of estimated paths at λ = 30; however, most of these paths had nonzero but extremely small association constants and had practically no effect on system behavior. This issue could be overcome by defining a threshold for the parameter value and/or for the degree of response to parameter changes (i.e., sensitivity analysis), as sketched below.
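A minimal form of such post-processing is shown below; the threshold value is an arbitrary illustration rather than one used in this study.

```python
import numpy as np

def effective_paths(K, threshold=1e-3):
    """Return the indices of regulatory paths whose (non-negative) association
    constants exceed a user-chosen threshold; the default cutoff is illustrative."""
    return np.flatnonzero(np.asarray(K) > threshold)
```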

Inferring the number of components in the network

In the preceding analysis, we assumed that the number of genes constituting the network was known, whereas their regulatory relationships were unknown. However, it is more common that neither is known. Therefore, we examined whether the algorithm could infer both the number of components and the network topology. We again generated an artificial dataset, this time using a network consisting of two genes (Fig. 6a and Supplementary Information) that shows oscillatory dynamics with appropriate parameters. We also prepared a redundant model (equivalent to that in Fig. 3) consisting of three components rather than two, because we assumed no prior knowledge regarding the number of components in the network. Using the artificial data and the redundant model, we performed model inference of the gene regulatory network via the EM-PS-Lasso algorithm and evaluated the log-likelihood of the inferred models, finding that the model with λ = 15 showed the highest log-likelihood value (Fig. 6b) and was consistent with the data (Fig. 6c). Next, we evaluated the values of the association constants for all regulatory paths in the redundant model. The paths in the redundant model were removed in several steps during iterations of the algorithm (Fig. 6d and e), with three of the 18 association constants eventually found to have nonzero values (Fig. 6f). The paths remaining in the model described autoactivation of gene X, activation of gene Z by gene X, and repression of gene X by gene Z. All regulatory paths related to gene Y had no activity, indicating the absence of gene Y from the network model. Therefore, the inferred model practically consisted of two genes and three regulatory paths (Fig. 6e, right) and was completely equivalent to the true network (Fig. 6a). Overall, these results revealed that the EM-PS-Lasso algorithm was able to infer not only the regulatory paths but also the number of components in the model.

Figure 6

Inferring the number of components in the network. (a) Schematic of a two-component oscillator model. (b) Models were inferred using the EM-PS-Lasso algorithm with different values of λ. Log-likelihood values of the models at iteration number 100 are shown as a function of λ. (c) Consistency of the estimated states with the data. Each dot represents artificial data generated from the two-component oscillatory model. Lines denote trajectories sampled from the inferred model. (d) Values of the association constant after 100 iterations of the algorithm. Each parameter index corresponds to the reaction number (Fig. 3). (e) Values of association constants in the model plotted as a function of the iteration number of the EM-PS-Lasso algorithm. (f) Schematic representation of the inferred network. The red and light-gray arrows represent the effective paths and eliminated paths, respectively. Note that gene Y is not involved in system behavior, because all paths related to gene Y had no regulatory activity after iteration number 27.

Discussion

The concept of data-driven inference and analysis of biochemical networks has gained attention in computational systems biology and biophysics. However, this remains a difficult task because of the highly nonlinear nature of biological systems. Here, we proposed an EM algorithm-based method combining a particle smoother and sparse regularization to enable data-driven, systematic inference of nonlinear biochemical network models. Our method was successfully applied to construct mathematical models showing oscillations, one of the stereotypical nonlinear behaviors. Furthermore, because the elemental reaction in our modeling is described by a Hill function, which is commonly used to express various types of biochemical reactions, our method can be directly applied to a wide range of networks, including transcriptional control, signal transduction, and metabolic regulation.

In this study, we exploited the fact that a regulatory path becomes negligible when the association constant in the Hill function describing the path is equal to zero. Penalizing the association constants using Lasso eliminated unnecessary paths in the redundant model and enabled inference of the network topology. The proposed algorithm might also be useful when the model is described by other schemes, such as mass action kinetics, because the association rate constant plays an analogous role: a reaction is negligible when the association rate constant (k_on) in the mass action kinetics is estimated to be zero. Therefore, the algorithm should be applicable to mass action-based models with only a slight modification, in which the association rate constant, rather than the association constant, is subjected to regularization.
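For a generic bimolecular binding step (shown here only to illustrate the point, not as a model from this study), the mass action description reads

$$\frac{d[AB]}{dt}={k}_{{\rm{on}}}[A][B]-{k}_{{\rm{off}}}[AB]$$

so that an estimate of k_on = 0 removes the complex-formation path, in direct analogy to K = 0 removing a Hill-type path in Eq. (2).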

Although Lasso is a simple yet powerful technique for obtaining a sparse solution, other methods of sparse regularization exist. For example, automatic relevance determination and Bayesian masking are superior to Lasso in terms of the sparsity-shrinkage tradeoff40,41, although we did not use these techniques in the present study because of their slow convergence. Another promising approach to regularization is Group Lasso24,42,43, which provides a sparse solution at the level of grouped variables. Recently, the problem of insulating network activity has attracted interest, and conditions under which the activity of a sub-network is insulated from the overall network have been studied44. Because Group Lasso yields sparsity at the group level, it may be well suited to this problem.
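For reference, the standard Group Lasso penalty replaces the sum of absolute values in Eq. (3) with a sum of group-wise Euclidean norms over predefined groups g of association constants (group-size weights omitted for simplicity); this is the textbook form of the penalty, not a formulation used in this study:

$$Q^{\prime} ({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}^{({\rm{old}})})=Q({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}^{({\rm{old}})})-\lambda \sum _{g}\sqrt{\sum _{s\in g}{K}_{s}^{2}}$$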

Different system configurations often produce qualitatively similar behaviors45. For example, ~10 different types of synthetic circuits reportedly generate “oscillatory” dynamics46. Therefore, our finding that the network can be reconstructed based solely on time-course data might be surprising. Although it seems difficult to strictly define the conditions under which inference is most effective, our results suggest that time-course data, and possibly their associated noise47, contain rich information and can be sufficient to reconstruct the regulatory network. Methods for data-driven analysis will become increasingly important as experimental technologies, including super-multiplexed color live-cell imaging48, continue to progress rapidly. The present study provides a general framework at the intersection of nonlinear biochemical systems, model inference, and single-cell time-course data analysis, enabling such analyses in a data-driven manner.

Materials and Methods

Nonlinear state space model

We introduce a nonlinear state space model, which is given by

$$\begin{array}{rcl}{{\boldsymbol{x}}}_{t} & = & {\boldsymbol{f}}({{\boldsymbol{x}}}_{t-1})+{{\boldsymbol{v}}}_{t}\\ {{\boldsymbol{y}}}_{t} & = & {\boldsymbol{h}}({{\boldsymbol{x}}}_{t})+{{\boldsymbol{w}}}_{t}\end{array}$$
(4)

where x is a k-dimensional vector consisting of state variables, and y denotes an l-dimensional vector representing measurements of x. Functions f and h are nonlinear functions describing the evolution of the system and the measurement process, respectively. Vectors \({{\boldsymbol{v}}}_{t}=\{{v}_{t,i}\}\,(i=1,\ldots ,k)\) and \({{\boldsymbol{w}}}_{t}=\{{w}_{t,j}\}\,(j=1,\ldots ,l)\) denote system noise and measurement noise, respectively, where we assumed that they follow Gaussian distributions: \({v}_{t,i}\sim N(0,{({\sigma }_{i})}^{2})\), \({w}_{t,j}\sim N(0,{({\eta }_{j})}^{2})\). Initial values of the state are given by \({x}_{0,i}\sim N({\mu }_{i},{({\gamma }_{i})}^{2})\). We used standard ODEs to model the biochemical reaction network of interest as \(d{\boldsymbol{x}}/dt={\boldsymbol{g}}({\boldsymbol{x}},{{\boldsymbol{\theta }}}_{sys})\), where, in general, g is a nonlinear function consisting of arbitrary equations, such as the Hill equation, and \({{\boldsymbol{\theta }}}_{sys}\) indicates model parameters. Function f can be calculated by numerically integrating the equations as \({\boldsymbol{f}}({{\boldsymbol{x}}}_{t-1})={{\boldsymbol{x}}}_{t-1}+\,{\int }_{t-1}^{t}\,{\boldsymbol{g}}({{\boldsymbol{x}}}_{\tau },{{\boldsymbol{\theta }}}_{sys})d\tau \). Numerical integration of ODEs was performed using routines implemented in the scipy.integrate package (https://docs.scipy.org/doc/scipy/reference/integrate.html) as described previously49. In the present study, we used a linear function for h (h(x) = αx) for simplicity, where α = 1 unless otherwise explicitly indicated. Given dataset \(Y=\{{Y}^{(a)}\}=\{{Y}^{(1)},{Y}^{(2)},\ldots ,{Y}^{(A)}\}=\{{{\boldsymbol{y}}}_{1:T}^{(1)},{{\boldsymbol{y}}}_{1:T}^{(2)},\ldots {{\boldsymbol{y}}}_{1:T}^{(A)}\}\), where a is an index of each cell and \({Y}^{(a)}\) represents the single-cell time-course data, estimation of \({\boldsymbol{\theta }}=\{{{\boldsymbol{\theta }}}_{sys},{\boldsymbol{\sigma }},{\boldsymbol{\eta }},{\boldsymbol{\mu }},{\boldsymbol{\gamma }}\}\) \(({\boldsymbol{\sigma }}=\{{\sigma }_{i}\},{\boldsymbol{\eta }}=\{{\eta }_{j}\},{\boldsymbol{\mu }}=\{{\mu }_{i}^{(a)}\},{\boldsymbol{\gamma }}=\{{\gamma }_{i}\}\,(i=1,\ldots ,k,\,j=1,\ldots ,l,\,a=1,\ldots ,A))\) can be accomplished by maximizing the log-likelihood of the model. Note that only μ depends on the cell index a, to describe cell-to-cell variability of initial states.

EM-PS algorithm for parameter estimation

Maximum likelihood estimation of θ can be accomplished by maximizing log-likelihood \(\mathrm{ln}\,p(Y|{\boldsymbol{\theta }})=\,\mathrm{ln}\,\sum _{X}\,p(Y|X,{\boldsymbol{\theta }})p(X|{\boldsymbol{\theta }})\), which requires intractable integration with respect to state variables \(X=\{{X}^{(a)}\}=\{{X}^{(1)},{X}^{(2)},\ldots ,{X}^{(A)}\}=\{{{\boldsymbol{x}}}_{1:T}^{(1)},{{\boldsymbol{x}}}_{1:T}^{(2)},\ldots ,{{\boldsymbol{x}}}_{1:T}^{(A)}\}\). Therefore, we used an EM algorithm to find maximum likelihood estimates of θ. The EM algorithm was run by iterating steps E (expectation) and M (maximization), which are defined as

$$\begin{array}{rcl}Q({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}^{({\rm{old}})}) & = & {\langle \mathrm{ln}p(X,Y|{\boldsymbol{\theta }})\rangle }_{p(X|Y,{{\boldsymbol{\theta }}}^{({\rm{old}})})}\\ {{\boldsymbol{\theta }}}^{({\rm{new}})} & = & {\rm{\arg }}\,\mathop{{\rm{\max }}}\limits_{{\boldsymbol{\theta }}}Q\,({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}^{({\rm{old}})})\end{array}$$
(5)

respectively. The E step is analytically intractable because it requires the probability distribution of the state time series at all time points; we therefore numerically approximated p(X|Y, θ) using a particle smoother, as previously reported31,32. Briefly, the particle smoother algorithm approximates the distribution as an ensemble of particles:

$$p({X}^{(a)}|{Y}^{(a)},{\boldsymbol{\theta }})=\sum _{p=1}^{P}\,{\beta }^{(a,p)}\delta ({X}^{(a)}-{X}^{(a,p)}),\sum _{p=1}^{P}{\beta }^{(a,p)}=1,{\beta }^{(a,p)}\ge 0$$
(6)

where P is the number of particles, \({X}^{(a,p)}\) indicates the trajectory of the pth particle sampled by the algorithm for data \({Y}^{(a)}\), \({\beta }^{(a,p)}\) represents the weight of the particle, and δ is Dirac's delta. The weight is given by \({\beta }^{(a,p)}={l}^{(a,p)}/\sum _{p}{l}^{(a,p)}\), where \({l}^{(a,p)}=p({Y}^{(a)}|{X}^{(a,p)})\) denotes the likelihood of the particle. The calculation was performed using the pyParticleEst package50. Finally, the log-likelihood estimate was obtained by averaging over the particles: \({\rm{l}}{\rm{n}}\,{L}^{(a)}({\boldsymbol{\theta }})={\rm{l}}{\rm{n}}\,(\frac{1}{P}\sum _{p}\,{l}^{(a,p)})\). Note that \(\mathrm{ln}\,p(X|Y,{\boldsymbol{\theta }})\) can be written as \(\mathrm{ln}\,p(X|Y,{\boldsymbol{\theta }})=\sum _{a}\,\mathrm{ln}\,p({X}^{(a)}|{Y}^{(a)},{\boldsymbol{\theta }})\), because different time-course data are independent. Thus, the E step can be completed using the following approximation:

$$\begin{array}{ccc}Q({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}^{({\rm{o}}{\rm{l}}{\rm{d}})}) & = & \sum _{a=1}^{A}\,{\langle {\rm{l}}{\rm{n}}p({X}^{(a)},{Y}^{(a)}|{\boldsymbol{\theta }})\rangle }_{p({X}^{(a)}|{Y}^{(a)},{{\boldsymbol{\theta }}}^{({\rm{o}}{\rm{l}}{\rm{d}})})}\\ & = & \sum _{a=1}^{A}\sum _{p=1}^{P}{\beta }^{(a,p)}\,{\rm{l}}{\rm{n}}\,p({X}^{(a,p)},{Y}^{(a)}|{\boldsymbol{\theta }})\,\\ & = & \,\sum _{a=1}^{A}\sum _{p=1}^{P}\sum _{i=1}^{k}{\beta }^{(a,p)}(-\frac{1}{2}\,{\rm{l}}{\rm{n}}\,2\pi {({\gamma }_{i})}^{2}-\frac{{({x}_{0,i}^{(a,p)}-{\mu }_{i}^{(a)})}^{2}\,}{2{({\gamma }_{i})}^{2}})\\ & & +\,\sum _{a=1}^{A}\sum _{p=1}^{P}\sum _{t\in T}\sum _{i=1}^{k}{\beta }^{(a,p)}(-\frac{1}{2}\,{\rm{l}}{\rm{n}}\,2\pi {({\sigma }_{i})}^{2}-\frac{{({x}_{t,i}^{(a,p)}-{f}_{i}({{\boldsymbol{x}}}_{t-1}^{(a,p)},{{\boldsymbol{\theta }}}_{sys}))}^{2}}{2{({\sigma }_{i})}^{2}})\\ & & +\,\,\sum _{a=1}^{A}\sum _{p=1}^{P}\sum _{t\in T}\sum _{j=1}^{l}{\beta }^{(a,p)}(-\frac{1}{2}\,{\rm{l}}{\rm{n}}\,2\pi {({\eta }_{j})}^{2}-\frac{{({y}_{t,j}^{(a)}-{h}_{j}({{\boldsymbol{x}}}_{t}^{(a,p)}))}^{2}}{2{({\eta }_{j})}^{2}}).\end{array}$$
(7)
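In code, this E step approximation amounts to normalizing the per-particle likelihoods into the weights β^(a,p), averaging them for the per-cell log-likelihood estimate, and forming a weighted sum of the complete-data log-density over particles and cells. The sketch below assumes a user-supplied complete-data log-density and illustrative names; in practice one would work with log-likelihoods and a log-sum-exp for numerical stability.

```python
import numpy as np

def normalized_weights(particle_liks):
    """beta^(a,p) = l^(a,p) / sum_p l^(a,p), from the per-particle likelihoods of one cell."""
    particle_liks = np.asarray(particle_liks, dtype=float)
    return particle_liks / particle_liks.sum()

def cell_log_likelihood(particle_liks):
    """ln L^(a)(theta) = ln( (1/P) * sum_p l^(a,p) )."""
    return np.log(np.mean(np.asarray(particle_liks, dtype=float)))

def approx_Q(theta, smoothed, complete_data_logpdf):
    """Monte Carlo approximation of Q (Eq. 7): a weighted sum of the complete-data
    log-density over smoothed particle trajectories, summed over cells.
    `smoothed` is a list of (trajectories, weights, data) triples, one entry per cell."""
    total = 0.0
    for trajectories, weights, Y_a in smoothed:
        for X_p, beta in zip(trajectories, weights):
            total += beta * complete_data_logpdf(X_p, Y_a, theta)
    return total
```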

For the M step, we numerically maximized the Q function with respect to \({{\boldsymbol{\theta }}}_{sys}\) using a quasi-Newton method, because, in general, \(dQ/d{{\boldsymbol{\theta }}}_{sys}=0\) cannot be solved analytically. This optimization was performed using the L-BFGS-B routine implemented in the scipy.optimize package (http://docs.scipy.org/doc/scipy/reference/optimize.html), with a non-negativity constraint on the parameter values. The derivatives of Q with respect to σ, η, μ, and γ yield linear equations, so updated values for these parameters were obtained directly. Note that we defined minimum values for γ, because if γ is too small, sample impoverishment can occur51. We also defined maximum values for σ and η to avoid overestimation of the noise, which could lead to meaningless inference.
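Putting the two steps together, the overall iteration has the following shape. Both steps are passed in as callables (in this study, the E step relies on pyParticleEst and the M step on the constrained L-BFGS-B optimization described above); the skeleton below only illustrates the control flow and is not the actual implementation.

```python
def em_ps(Y, theta_init, e_step, m_step, n_iter=100):
    """Skeleton of the EM-PS (or EM-PS-Lasso) iteration.
    e_step(Y, theta) should return smoothed particle trajectories, their weights,
    and the current log-likelihood estimate; m_step maximizes the (penalized) Q."""
    theta = theta_init
    loglik_trace = []
    for _ in range(n_iter):
        smoothed, weights, loglik = e_step(Y, theta)   # particle-smoother E step
        theta = m_step(smoothed, weights, Y, theta)    # maximization of (penalized) Q
        loglik_trace.append(loglik)
    return theta, loglik_trace
```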

Artificial data generation

Artificial data were generated by numerically solving the model as nonlinear stochastic Langevin equations, \(d{\boldsymbol{x}}/dt={\boldsymbol{g}}({\boldsymbol{x}})+{\boldsymbol{\xi }}(t)\), where ξ(t) is Gaussian noise with \(\langle {\xi }_{i}(t)\rangle =0\) and \(\langle {\xi }_{i}(t){\xi }_{j}(t^{\prime} )\rangle =2D{\delta }_{i,j}\delta (t-t^{\prime} )\), with Kronecker's \({\delta }_{i,j}\) and Dirac's δ(t), where parameter D characterizes the amplitude of the noise. Computation was conducted using a stochastic Runge-Kutta algorithm52. The measurement process was simulated by adding Gaussian noise to each variable: \({y}_{i}={x}_{i}+\eta \varphi \), where ϕ is a random number sampled from a standard normal distribution, and η characterizes the amplitude of the noise. Stochastic simulation was performed over the simulation period T = 400, and data points from T = 351 to T = 400 were collected at a time resolution of 1. The simulation was repeated 10 times to generate 10 independent time-course data that served as training data. Similarly, an additional 10 independent time-course data were generated and used as test data for validation. Details of the model equations and parameter values are described in the Supplementary Information.
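To illustrate this data-generation scheme, the sketch below uses a simple Euler-Maruyama update in place of the stochastic Runge-Kutta scheme cited above, together with illustrative noise amplitudes; it reproduces the structure of the procedure (Langevin dynamics followed by Gaussian measurement noise at the sampled time points) rather than the exact simulation used in this study.

```python
import numpy as np

def simulate_langevin(g, x0, D, eta, t_end=400.0, dt=0.01, sample_times=range(351, 401), rng=None):
    """Generate one artificial single-cell time course: integrate dx/dt = g(x) + xi(t),
    with <xi_i(t) xi_j(t')> = 2 D delta_ij delta(t - t'), by Euler-Maruyama, then add
    Gaussian measurement noise of amplitude eta at the sampled time points."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.array(x0, dtype=float)
    samples = []
    targets = iter(sample_times)
    next_t = next(targets, None)
    t = 0.0
    for _ in range(int(round(t_end / dt))):
        x = x + g(x) * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal(x.size)
        x = np.maximum(x, 0.0)   # optional guard (not specified in the text) to keep concentrations non-negative
        t += dt
        if next_t is not None and t >= next_t:
            samples.append(x + eta * rng.standard_normal(x.size))  # y_i = x_i + eta * phi
            next_t = next(targets, None)
    return np.array(samples)
```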