Abstract
Non-Markovian models of stochastic biochemical kinetics often incorporate explicit time delays to effectively model large numbers of intermediate biochemical processes. Analysis and simulation of these models, as well as the inference of their parameters from data, are fraught with difficulties because the dynamics depend on the system’s history. Here we use an artificial neural network to approximate the time-dependent distributions of non-Markovian models by the solutions of much simpler time-inhomogeneous Markovian models; the approximation does not increase the dimensionality of the model and simultaneously leads to inference of the kinetic parameters. The training of the neural network uses a relatively small set of noisy measurements obtained from experimental data or generated by stochastic simulations of the non-Markovian model. We show using a variety of models, where the delays stem from transcriptional processes and feedback control, that the Markovian models learnt by the neural network accurately reflect the stochastic dynamics across parameter space.
Introduction
Over the past two decades, stochastic modelling has provided insight into how cellular dynamics is influenced by noise in gene expression^{1,2,3}. The complexity of cellular biochemistry prevents a full stochastic description of all reaction events; rather, these models are effective, in the sense that each reaction provides an effective description of a group of processes. A major assumption behind the majority of stochastic models of biochemical kinetics is the memoryless hypothesis, i.e., the stochastic dynamics of the reactants is only influenced by the current state of the system, which implies that the waiting times for reaction events obey exponential distributions. In the context of stochastic gene expression, the telegraph model (or the two-state model)^{4,5,6} is a fundamental Markovian model describing promoter switching, transcription and degradation of mature RNA. While this Markovian assumption considerably simplifies model analysis^{7}, it is dubious for modelling certain non-elementary reaction events that encapsulate multiple intermediate reaction steps. For example, consider a model of transcription that predicts the distribution of RNA polymerase (RNAP) numbers attached to the gene but which does not explicitly model the microscopic processes behind elongation^{8}. In this case, assuming that RNA synthesis proceeds with approximately constant elongation speed, the reaction modelling termination should occur a fixed time after the reaction modelling initiation, which implies that the system has memory and is not Markovian. Of course in this instance, the model could be made Markovian by extending it so that it includes the explicit microscopic description of the movement of the RNAP along the gene^{9}, but this implies a large increase in the effective number of species, which makes explicit solution of the Markovian model impossible.
Hence in many cases, a low-dimensional stochastic model can only be achieved by a suitable non-Markovian description. Given their practical importance, these systems have been the subject of increased research interest, leading to an exact analytical solution for a few simple cases and the development of exact Monte Carlo algorithms for the study of those with complex dynamics^{8,10,11,12,13,14,15,16}.
Nevertheless, the understanding of non-Markovian models presently lags far behind that of Markovian models, for which a wide range of approximation methods are available. Hence there is ample scope for the development of methods to tackle the difficulties inherent in stochastic systems possessing memory. Given the success of artificial neural networks (ANNs) in solving scientific problems where traditional methods have made little progress, it is of interest to consider whether such an approach could be useful for solving the aforementioned stochastic problems. Neural networks, being universal function approximators, have recently been used to solve partial differential equations commonly used in physics, biology and chemistry. In particular these techniques have been used to approximately solve Burgers' equation^{17,18,19}, the Schrödinger equation^{18,20} and partial differential equations describing collective cell migration in scratch assays^{21}; the ANN-based methods behind the solution of these problems are a subclass of the universal differential equation framework that has recently been proposed^{22}.
Inspired by the success of ANNs in other fields, in this article we develop a novel ANN-based methodology to study non-Markovian models of gene expression and transcriptional feedback. We propose to use an ANN to approximate non-Markovian models by much simpler stochastic models. Specifically the key idea is to approximate the chemical master equation (CME) of non-Markovian models (which we refer to as the delay CME), which is written in terms of the two-time probability distribution, by a CME whose terms are only a function of the current time, i.e. by a time-inhomogeneous Markov process (see Fig. 1a for an illustration). Notably, this mapping is achieved without increasing the number of fluctuating species. We refer to the learnt CME describing the time-inhomogeneous Markov process as the neural-network chemical master equation (NN-CME). The latter, because of its simplified form, can then either be studied analytically using standard methods or else straightforwardly simulated using the finite state projection (FSP) method. In what follows, we introduce the ANN-based approximation method by means of a simple example and then verify its accuracy in predicting time-dependent molecule number distributions of various realistic models of gene expression and its superior computational efficiency when compared to direct stochastic simulation. We finish by showing the orthogonal use of the method to infer the parameters of bursty gene expression from synthetic data.
Results
Illustration of ANN-aided stochastic model approximation using a simple model of transcription
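We consider a simple non-Markovian system where molecules are produced at a rate ρ and are removed from the system (degraded) after a fixed time delay τ: \(\varnothing \mathop{\to }\limits^{\rho }N,\ N\mathop{\Rightarrow }\limits^{\tau }\varnothing \) (1).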
In other words, each molecule has an internal clock that starts ticking when it is ‘born’, and when this clock registers a time τ, the molecule ‘dies’. Note that as a convention in this paper, we use an arrow to denote a reaction in which the products are formed after an exponentially distributed time and an arrow with two horizontal lines to denote a reaction that occurs after a fixed elapsed time. The above model, which we denote Model I, describes the fluctuations of nascent RNA (N) numbers due to constitutive expression. Specifically, the production reaction models the process of initiation whereby an RNAP molecule binds the promoter; the delayed-degradation reaction models, in a combined manner, the processes of elongation and termination whereby an RNAP molecule travels at a constant velocity along the gene and finally detaches from the gene, respectively. Note that the number of RNAPs bound to the gene is equal to the number of nascent RNA molecules present, irrespective of their lengths^{23} (for an illustration see Fig. 2a Model I). We note that the signal from single-molecule fluorescence in situ hybridization (smFISH) probes corresponds to measuring the total length of nascent RNA, summed over multiple molecules present at the gene; thus the number of nascent RNA estimated from such experiments is a continuous rather than a discrete number^{8,24,25}. Our present formulation ignores the complexities introduced by smFISH and is rather compatible with experiments that can directly quantify the number of RNAPs bound to a gene^{26}.
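As an illustration of Model I, the following is a minimal Python sketch of how snapshot data could be generated. It is not the delay SSA of ref. ^{10}; instead it assumes zero molecules at t = 0 and uses the fact that a molecule is present at time t exactly when it was born in the window (t − τ, t]. The function names `simulate_model_one` and `empirical_distributions` are hypothetical.

```python
import numpy as np

def simulate_model_one(rho, tau, snapshot_times, n_cells, seed=0):
    """Sample nascent RNA counts of Model I at the given snapshot times.

    Assumption: the system starts empty at t = 0. Initiation is a Poisson
    process with rate rho; each molecule is removed exactly tau after birth.
    """
    rng = np.random.default_rng(seed)
    snapshot_times = np.asarray(snapshot_times, dtype=float)
    t_end = snapshot_times.max()
    counts = np.zeros((n_cells, snapshot_times.size), dtype=int)
    for c in range(n_cells):
        # Given their number, the event times of a Poisson process on
        # [0, t_end] are independent and uniformly distributed.
        n_births = rng.poisson(rho * t_end)
        births = rng.uniform(0.0, t_end, size=n_births)
        for j, t in enumerate(snapshot_times):
            # A molecule born at s is present at t if s <= t < s + tau.
            counts[c, j] = np.count_nonzero((births <= t) & (births > t - tau))
    return counts

def empirical_distributions(counts, n_max):
    """Histogram the counts into one distribution per snapshot (rows)."""
    n_cells, n_snap = counts.shape
    H = np.zeros((n_snap, n_max + 1))
    for j in range(n_snap):
        # Counts above n_max are lumped into the last bin.
        clipped = np.minimum(counts[:, j], n_max)
        H[j] = np.bincount(clipped, minlength=n_max + 1) / n_cells
    return H
```

Under these assumptions the count at time t is Poisson distributed with mean ρ·min(t, τ), which gives a quick sanity check on the sampled means.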
It can be shown (see SI Note 1) from first principles that the delay CME describing the stochastic dynamics of this process is given by:
where P(n, t) is the probability that at time t there are n nascent RNA molecules bound to the gene. Similarly \(P(n,t| n^{\prime} ,t^{\prime} )\) is the conditional probability distribution that at time t there are n molecules given that at a previous time \(t^{\prime} <t\), there were \(n^{\prime} \) molecules. The right-hand side of the master equation is a function of the present time t as well as of the previous time t − τ. Master equations of this type are typically much more difficult to solve, analytically or numerically, than conventional master equations where the right-hand side is only a function of the present time t (because of its simplicity an exact time-dependent solution of Model I is possible and shown in SI Note 1; see also^{12}). Hence the key idea of our method is to map the master Eq. (2) to the new master Eq. (3):
where the function NN_{θ}(n, t) is an effective time-dependent propensity describing the removal of nascent RNA molecules, which is to be learnt by the ANN. This is the NN-CME for reaction scheme (1). Note that this master equation with NN_{θ}(n, t) = kn is the conventional CME describing the birth–death process \(\varnothing \mathop{\to }\limits^{\rho }N,N\mathop{\to }\limits^{k}\varnothing \), where k is the degradation rate. By considering the cases n = 0, ..., N of Eq. (3) (where N is some positive integer much larger than 1), one obtains a system of N + 1 differential equations. These equations need to be closed before they can be solved. First, we can set P(−1, t) = 0 since the number of nascent RNA cannot become negative. Next, since we have truncated the state space at n = N, it follows that any terms in the equations corresponding to jumps from n = N to n = N + 1, or vice versa, need to be neglected. This implies that the terms −ρP(N, t) and NN_{θ}(N + 1, t)P(N + 1, t) are neglected. This is indeed the main idea behind FSP^{27}, which leads to a finite closed system of differential equations for the probabilities. Of course, to faithfully approximate the system’s dynamics, N should be chosen large enough such that P(N, t) ≈ 0; in practice N is chosen such that any further increase in its value leads to no significant change in the solution of the master equation. The closed system of equations can be compactly written in the form
with P(t) = [P(0, t), ..., P(N, t)]^{⊤}. The transition matrix is defined as A_{θ}(t) = D + N_{θ}(t), where the two components are given by
and
The output NN_{θ}(0, t) is set to 0 to reflect the fact that nascent RNA cannot be further removed when there is none attached to the gene.
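For reference, the description above fixes the form of the NN-CME and of its truncated matrices. The following block is a reconstruction from the text alone (production at rate ρ, removal with propensity NN_{θ}(n, t), truncation at n = N, the two boundary terms neglected); the layout and notation are assumptions and may differ from the displayed equations of the published article.

```latex
\begin{align}
\frac{\mathrm{d}}{\mathrm{d}t}P(n,t) &= \rho\,[P(n-1,t)-P(n,t)]
  + \mathrm{NN}_{\theta}(n+1,t)\,P(n+1,t) - \mathrm{NN}_{\theta}(n,t)\,P(n,t), \\
\frac{\mathrm{d}}{\mathrm{d}t}\mathbf{P}(t) &= \mathbf{A}_{\theta}(t)\,\mathbf{P}(t),
  \qquad \mathbf{A}_{\theta}(t) = \mathbf{D} + \mathbf{N}_{\theta}(t),
\end{align}
\[
\mathbf{D} =
\begin{pmatrix}
-\rho &        &        &   \\
 \rho & -\rho  &        &   \\
      & \ddots & \ddots &   \\
      &        & \rho   & 0
\end{pmatrix},
\qquad
\mathbf{N}_{\theta}(t) =
\begin{pmatrix}
0 & \mathrm{NN}_{\theta}(1,t)  &                            &                            \\
  & -\mathrm{NN}_{\theta}(1,t) & \mathrm{NN}_{\theta}(2,t)  &                            \\
  &                            & \ddots                     & \mathrm{NN}_{\theta}(N,t)  \\
  &                            &                            & -\mathrm{NN}_{\theta}(N,t)
\end{pmatrix}.
\]
```

Both matrices are of size (N + 1) × (N + 1) and every column sums to zero, so the truncated system conserves total probability.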
Next we describe how we train the ANN to infer the effective transition matrix A_{θ}(t). We use a multi-layer perceptron with a single hidden layer; this is a simple feed-forward ANN consisting of three layers: an input layer with N + 1 inputs, a hidden layer with an arbitrary number of neurons and an output layer with N outputs. The simplicity of the ANN used here is motivated by the universal approximation theorem, which states that a feed-forward ANN with a single hidden layer is able to approximate a wide class of functions on compact subsets^{28,29}. The activation functions used in the hidden layer and output layer are tanh and ReLU, respectively, for all examples. In the output layer, we impose an increasing set of fixed biases, which we specify later. For more details of the ANN, including the choice of hyperparameters, please see SI Table 1. The training procedure consists of three main steps:

(1) We use stochastic simulations of the birth and delayed-degradation reaction scheme (1) to generate approximate probability distributions at the time points \(t\in [{t}_{1},{t}_{2},...,{t}_{{N}_{{\rm{shots}}}}]\), where N_{shots} is the total number of snapshots. Note that by stochastic simulations in this paper, we always mean an exact stochastic simulation algorithm (SSA) modified to incorporate delays (specifically Algorithm 2 in ref. ^{10}; for proof of its exactness see ref. ^{11}). Let these distributions be denoted as H(t).

(2) The initial condition P(0) is set to be the same as H(0). The N + 1 elements of this probability vector constitute the inputs to the ANN. Given some set of weights and biases θ, the ANN’s N outputs are then taken as the elements of the matrix N_{θ}(0), i.e. the nth output of the ANN is NN_{θ}(n, 0). By a numerical discretization of Eq. (4), given the inputs and the outputs of the ANN, we obtain P(Δt), where Δt is the finite time step. This procedure can be iterated to obtain P(2Δt), P(3Δt), etc. Hence we obtain the probability vector P(t) at the time points \(t\in [{t}_{1},{t}_{2},...,{t}_{{N}_{{\rm{shots}}}}]\).

(3) We calculate an objective function that is a measure of the distance between the distributions H(t) and P(t) summed over all snapshots. If the objective function is larger than a threshold, then we update the set of weights and biases by means of backpropagation and gradient descent, and repeat step 2. If the distance is smaller than the threshold, then the procedure ends and the transition matrix A_{θ}(t) has been learnt.
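The forward pass of steps (2) and (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the training code used here: the trained ANN is replaced by a hypothetical callable `propensity_fn(P, t)` returning the N removal propensities for n = 1, ..., N, the discretization is a plain forward Euler step, and the gradient-descent update of θ is omitted.

```python
import numpy as np

def transition_matrix(rho, removal):
    """Build the truncated (N+1) x (N+1) matrix A = D + N_theta.

    removal[n-1] is the effective removal propensity for n = 1, ..., N;
    the propensity at n = 0 is zero by construction.
    """
    N = removal.size
    A = np.zeros((N + 1, N + 1))
    idx = np.arange(N)
    # Production n -> n + 1 for n = 0, ..., N - 1 (jump out of N is dropped).
    A[idx, idx] -= rho
    A[idx + 1, idx] += rho
    # Removal n -> n - 1 for n = 1, ..., N.
    A[idx + 1, idx + 1] -= removal
    A[idx, idx + 1] += removal
    return A

def solve_nncme(p0, rho, propensity_fn, dt, n_steps):
    """Iterate P(t + dt) = P(t) + dt * A(t) P(t); returns all time points."""
    P = np.array(p0, dtype=float)
    trajectory = [P.copy()]
    for k in range(n_steps):
        removal = propensity_fn(P, k * dt)  # stand-in for the ANN output
        P = P + dt * transition_matrix(rho, removal) @ P
        trajectory.append(P.copy())
    return np.array(trajectory)

def objective(H, P):
    """Sum over snapshots of the squared Euclidean distance between H and P."""
    return float(np.sum((np.asarray(H) - np.asarray(P)) ** 2))
```

For example, passing `lambda P, t: k * np.arange(1, N + 1)` as `propensity_fn` recovers the conventional birth–death process, whose long-time mean is ρ/k; in an actual training run the callable would be the network and `objective` would be differentiated with respect to its weights.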
Note that since the outputs of the ANN are the propensities NN_{θ}(n, t), these must be positive. We choose the set of biases in the output layer (r_{n} in Fig. 1a for n = 1, ..., N) to be given by \({r}_{n}=\frac{n}{\tau }\). This form is inspired by the fact that for the conventional CME with first-order degradation, NN_{θ}(n, t) is proportional to n, where the proportionality constant is the effective rate of decay, which has units of inverse time. Hence intuitively, the effective removal propensity of the NN-CME is equal to the propensity assuming first-order degradation plus a correction, which is what the ANN effectively learns. This choice of biases is also similar to that of the well-known residual network (ResNet)^{30,31}, which helps to accelerate the convergence of training and reduce computational cost.
Note also that the objective function is chosen as \(J(\theta )=\mathop{\sum }\nolimits_{j = 1}^{{N}_{\text{shots}}}\parallel {\bf{H}}({t}_{j})-{\bf{P}}({t}_{j}){\parallel }_{2}^{2}\). While there are more accurate distance measures (such as the Wasserstein distance), we use the mean-squared-error form for two reasons: (i) it is commonly used in neural-network training, and (ii) its simple form allows efficient calculation of derivatives through the chain rule (the backpropagation method). The steps of the training procedure are illustrated in Fig. 1b, c. Note that while the gradient descent in Fig. 1c is illustrated using an Euler method, for our training we used the standard adaptive moment estimation algorithm (ADAM).
Once the matrix A_{θ}(t) is learnt, Eq. (4) can be integrated numerically to obtain the time-dependent probability vector at all times in the future. In Fig. 2b (first row), we show that the solution of the learnt NN-CME is practically indistinguishable from distributions estimated from stochastic simulations of Model I (1). This implies that the ANN training protocol is effective as a means to map a master equation with terms having a non-local temporal dependence to a master equation with terms having a purely local temporal dependence.
Testing the accuracy and computational efficiency of ANN-aided stochastic model approximation on more complex models of transcription
To verify that the accurate mapping characteristics are not specific to Model I, we next consider the application of the procedure to learn the NNCME of two more complex transcription models incorporating delay (see Fig. 2a). We consider Model II, which is the same as Model I, except that the binding of RNAPs to the promoter occurs in bursts whose size i is distributed according to the geometric distribution b^{i}/(1+ b)^{i+1}; this can be described by the reaction scheme:
where α stands for the burst frequency and b is the mean burst size. This is a minimal delay model to describe the phenomenon of transcriptional bursting^{32}. The delay CME describing the nascent RNA dynamics is given by (see SI Note 2):
This equation can be solved analytically for the time-dependent probability distribution (see SI Note 2).
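To make the burst-size convention of Model II concrete, the sketch below samples nascent RNA counts at a single observation time. It is an illustrative shortcut rather than the delay SSA of ref. ^{10}: it assumes zero molecules at t = 0, so that the molecules present at time t are those produced by bursts arriving in the preceding window of length min(t, τ). The function name `simulate_model_two` is hypothetical.

```python
import numpy as np

def simulate_model_two(alpha, b, tau, t_obs, n_cells, seed=0):
    """Sample Model II counts at time t_obs, starting from an empty system.

    Bursts arrive as a Poisson process with rate alpha; the burst size i
    follows b^i / (1 + b)^(i + 1) for i = 0, 1, 2, ... (mean b).
    """
    rng = np.random.default_rng(seed)
    window = min(t_obs, tau)      # only bursts in this window still count
    p = 1.0 / (1.0 + b)           # success probability of the geometric law
    counts = np.zeros(n_cells, dtype=int)
    for c in range(n_cells):
        n_bursts = rng.poisson(alpha * window)
        # NumPy's geometric sampler has support {1, 2, ...}; subtracting 1
        # shifts it to {0, 1, ...}, giving P(i) = b^i / (1 + b)^(i + 1).
        sizes = rng.geometric(p, size=n_bursts) - 1
        counts[c] = sizes.sum()
    return counts
```

With this construction the mean count for t ≥ τ is αbτ, and the probability of observing zero molecules is exp(−ατ b/(1 + b)), which can be sizeable when ατ is small.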
We also consider Model III wherein the promoter switches between an active and inactive state, RNAP binding occurs only in the active state, which is followed by delayed degradation modelling the RNAP movement along the gene and its detachment; this can be described by the reaction scheme:
where G and G^{⋆} stand for the active and inactive gene state, respectively, and σ_{on} and σ_{off} are the activation and inactivation rates, respectively. It can be shown that in the limit of large σ_{off} (compared to σ_{on}), Model III reduces to Model II, whereas in the opposite limit of small σ_{off}, it reduces to Model I. Hence Model III can describe both constitutive and bursty transcription, as well as regimes in between. The delay CME describing nascent RNA dynamics is given by (see SI Note 3):
where P_{i}(n, t) is the probability that the gene is in state i at time t and the number of nascent RNA is n; note that i = 0, 1 where 0 is the inactive state and 1 is the active state. Similarly \({P}_{ij}(n,t| n^{\prime} ,t^{\prime} )\) denotes the conditional probability distribution that at time t the gene is in state i and the number of molecules is n, given that at a previous time \(t^{\prime} \), the gene was in state j and the number of molecules was \(n^{\prime} \). We note that an exact closed-form solution for the steady-state distribution of this model was reported in ref. ^{8}. The method involves writing the time-evolution equation for the characteristic function and solving it explicitly by means of the Dyson series. Solutions are in fact also possible if the model is modified to predict the signal from smFISH, which necessitates the use of continuous rather than discrete nascent RNA numbers.
Note that, as for the delay CME describing Model I, the delay CMEs describing Models II and III also have terms on the right-hand side that are a function of the two-time probability distribution. These terms, which stem from delayed degradation, make analytical and numerical solution of the delay master equations non-trivial. However, the ANN-aided procedure to solve Models II and III is as easy to implement as for Model I. By replacing the two-time probability distribution terms on the right-hand sides of Eqs. (6) and (8) by terms of the type NN_{θ}(n, t) (see SI Note 4 for details), one can map the delay master equations into NN-CMEs of the form \(\frac{\,\text{d}}{\text{d}\,t}{\bf{P}}(t)={{\bf{A}}}_{\theta }(t){\bf{P}}(t)\), where the transition matrix A_{θ}(t) is learnt by the same training procedure as before. Note that the NN-CME for Model III is none other than the telegraph model of gene expression^{4} but modified to allow degradation propensities to be some general function of nascent copy number and time, and specific to each promoter state.
In Fig. 2b rows 2, 3 and 4, we show the comparison between the time-dependent distribution of nascent RNA predicted by the NN-CME and stochastic simulations of the reaction schemes corresponding to Models II and III. The agreement is excellent at all times and for all models, independent of the modality and skewness of the distribution. This reinforces the result that the ANN-aided procedure enables an accurate mapping of master equations with terms having a non-local temporal dependence (via the two-time probability distribution) to master equations with terms having a purely local temporal dependence.
Next, we test the computational efficiency of the ANN-aided procedure compared to stochastic simulations. Figure 3a shows the Hellinger distance between the probability distribution of nascent RNA numbers according to the NN-CME and the exact analytical solution of Model I (see SI Note 1) as a function of the number of snapshots N_{shots} and of the number of stochastic simulations used to train the ANN. As expected, increasing the number of snapshots and the number of stochastic simulations in the ANN’s training enhances the accuracy of the NN-CME’s distribution (manifested as a reduction in the Hellinger distance). More interestingly, we found that the NN-CME obtained from training the ANN with just a thousand stochastic simulations outputs a distribution that has the same Hellinger distance from the exact distribution as the one obtained from a histogram generated using 30,000 stochastic simulations (direct simulation). Moreover in this case, the time to acquire samples plus the time for ANN training is 1/6 of the computation time required when only simulations are used. Another way of distinguishing our method and stochastic simulations is to compare the distributions predicted by both methods, given the same number of stochastic simulations; as shown in Fig. 3b, while at short times the two are comparable, at long times the NN-CME’s prediction is far more accurate than that of the SSA. Note that training can also be done in steady state, i.e. solving the algebraic equations A_{θ}(t)P(t) = 0; the precision and efficiency of this alternative mode of training the ANN is illustrated and discussed in Fig. S2.
Note that the mapping enabled by the ANN-aided procedure, from delay master equations to NN-CMEs, is also supported by theory for those models which can be solved exactly (see SI Note 5). For Model I, it can be shown that the effective propensity NN_{θ}(n, t) is zero for t < τ and otherwise linear in the nascent copy number n (and independent of time); hence in steady-state conditions, the effective propensity is the same as expected from a first-order degradation process. For Model II, the effective propensity NN_{θ}(n, t) is zero for t < τ and otherwise non-linear in the nascent copy number n (and independent of time). In Fig. 4a we show that for Model II, the effective propensity obtained by the ANN-aided approximation method is in good agreement with the theoretically predicted effective propensity evaluated in steady-state conditions (t ≫ τ). The non-linearity of the propensity is an emergent feature of the mapping procedure when transcriptional bursting is present (linear behaviour is observed for Model I). In Fig. 4b, c, we show how the degree of non-linearity varies with the non-dimensional parameter ατ, which is the ratio of the bursting frequency to the frequency at which nascent RNA gets removed (the elongation rate). The deviations from the conventional linear scaling of the propensity with nascent RNA numbers are manifest when the bursts are produced much slower than the elongation rate. In the inset of Fig. 4b, we show that for hundreds of genes in mouse embryonic stem cells, the value of ατ is considerably <1, thus showing that the effective degradation propensities for nascent RNA are generally non-linear; often the propensities can be well-approximated by a Hill function (with Hill coefficient <1) over the relevant molecule number range. Since Model II is a good approximation to Model III when gene expression occurs in bursts, it follows that the results shown for Model II also apply to Model III.
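For readers who wish to reproduce this type of comparison, a minimal Python sketch of the Hellinger distance between two discrete distributions is given below. It uses the standard definition normalized to lie between 0 and 1; whether this matches the exact normalization used for Fig. 3 is an assumption, and the function name `hellinger_distance` is hypothetical.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions on 0, 1, 2, ...

    The shorter vector is zero-padded so that both share the same support.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    size = max(p.size, q.size)
    p = np.pad(p, (0, size - p.size))
    q = np.pad(q, (0, size - q.size))
    # H(p, q) = sqrt( 0.5 * sum_n (sqrt(p_n) - sqrt(q_n))^2 )
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))
```

The distance is zero for identical distributions and one for distributions with disjoint support, which makes it convenient for comparing histograms built from different numbers of simulations.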
Note that this also implies that the standard telegraph model of gene expression^{4} (equivalent to the NNCME of Model III with a linear degradation propensity) is not a suitable effective Markovian description for the nascent RNA statistics of most eukaryotic genes.
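The linear steady-state propensity of Model I can be checked with a short flux-balance argument. The sketch below is an independent check rather than the derivation of SI Note 5. It assumes that for t ≥ τ the copy number equals the number of initiation events in a window of length τ (hence Poisson with mean ρτ), and, for the bursty case, that the NN-CME of Model II retains the same geometric burst production term as the delay model.

```latex
\begin{align}
P(n) &= \frac{(\rho\tau)^{n}}{n!}\,e^{-\rho\tau}
  && \text{(Model I, } t \ge \tau\text{)}, \\
\rho\,P(n) &= \mathrm{NN}_{\theta}(n+1)\,P(n+1)
  && \text{(zero net probability flux between } n \text{ and } n+1\text{)}, \\
\mathrm{NN}_{\theta}(n+1) &= \rho\,\frac{P(n)}{P(n+1)} = \frac{n+1}{\tau}
  && \Rightarrow\quad \mathrm{NN}_{\theta}(n) = \frac{n}{\tau}, \\
\mathrm{NN}_{\theta}(n+1)\,P(n+1) &= \alpha \sum_{m=0}^{n} P(m)
  \left(\frac{b}{1+b}\right)^{n-m+1}
  && \text{(Model II, bursts crossing from } m \le n \text{ to above } n\text{)}.
\end{align}
```

The Model I result coincides with the output-layer biases r_{n} = n/τ, so in steady state the correction the ANN has to learn vanishes for constitutive expression. For Model II the balance involves the whole lower tail of the distribution, and the resulting propensity is in general not proportional to n.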
Rapid exploration of parameter space and the prediction of a novel type of zero-inflation phenomenon
Given the computational efficiency of the ANN-aided model approximation, one would expect it to be useful as a means to rapidly explore the phases of a system’s behaviour across large swathes of parameter space. This endeavour is only possible if the NN-CME’s predictions are accurate across parameter space, which is yet to be seen since we have only shown its accuracy for a few parameter sets in Figs. 2 and 3. In what follows, we explicitly verify that the NN-CME can correctly capture all the phases of Model III’s behaviour.
We consider the case when the gene spends most of its time in the OFF state (the bursty regime of gene expression). In this case, Model III is well-approximated by Model II (see SI Note 3), and by means of the exact analytical solution of the latter, we identify four regions (I–IV; see Fig. 5a) according to the type of steady-state distribution (see Fig. 5b and its caption for their description). Specifically Region IV is the only region of parameter space where bimodal distributions (peak at zero and at a non-zero value) are found. Theory shows that the conditions (see SI Note 2) that need to be satisfied for this bimodality to manifest are
By using the NN-CME to randomly sample points in parameter space, we find agreement with the theoretical prediction: the distributions are unimodal (dots) except in Region IV where they are bimodal (crosses). Hence this verifies the accuracy of our method across parameter space.
We also note that bimodality in the bursty regime is unexpected because the standard model of gene expression (Model II/III with delayed degradation replaced by first-order degradation^{2,5,33}) predicts a unimodal steady-state distribution, which is well-approximated by a negative binomial distribution. Note that Model III is more appropriate to model nascent RNA dynamics than the standard model of gene expression because, unlike mature RNA, nascent RNA typically does not get degraded while the RNAP is traversing the gene; rather nascent ‘degradation’ occurs after a finite elapsed time when it detaches from the gene and becomes mature RNA. Hence Region IV can be understood as delay-induced bimodality or a delay-induced zero-inflation phenomenon. Since there is evidence that the delay time is stochastic rather than fixed^{34,35}, we also used the NN-CME to investigate how the nascent RNA distributions change with variance in the delay time when the mean is kept constant: as shown in Fig. 5c, we find that a large increase in the variability of the delay time tends to destroy the peak at zero. Note that for systems with stochastic delay, the training of the ANN-aided approximation remains the same as for those with fixed delay; the advantage of our method is that it can just as easily solve non-Markovian models with stochastic delay as those with deterministic delay, whereas analytically only the latter are amenable to exact solutions when the reaction system is simple enough.
In summary, we find that delayed degradation induces an extra mode peaked at n = 0, a type of zero-inflation phenomenon. This phenomenon is commonly seen in single-cell RNA-seq data, and it is usually attributed to the expression drop-off caused by technical noise or sequencing sensitivity^{36}. It has also been shown^{37,38} that it may arise from an extra number of gene states. However, our results suggest that delay due to elongation (when the variability in elongation times is small) is another important source contributing to the zero-inflated distributions evident in RNA-seq data.
Learning the effective master equation of genetic feedback loops from partial abundance data
Feedback inhibition of gene expression is a common phenomenon in which gene expression is down-regulated by its protein product. Given that there is sufficient time delay in the regulation process as well as sufficient non-linearity in the mass-action law describing the kinetics of certain reaction steps^{39}, feedback inhibition can lead to oscillatory gene expression such as that observed in circadian clocks^{40}.
Here we consider a simple genetic negative feedback loop (see Fig. 6a) whereby (i) a protein X is transcribed by a promoter, (ii) subsequently after a fixed time delay τ, X turns (via some set of unspecified biochemical processes) into a protein Y and (iii) finally Y binds the promoter and reduces the rate of transcription of X. Unlike Models I–III considered earlier, the delay master equation corresponding to this model has no known analytical solution. Simulation trajectories verify oscillatory behaviour of this circuit; see Fig. 6b. We use the simulated trajectories of mature protein Y to train the ANN (the objective function only measures the distance between the ANN-predicted distribution of Y and the distribution from stochastic simulations), in a similar way as in previous examples (see SI Note 7). In Fig. 6c, d, we show that the time-dependent distributions of both proteins and their means output using the NN-CME are in excellent agreement with the SSA, even clearly capturing the damped oscillatory behaviour; while for Y this is perhaps not so surprising, for X it is remarkable because simulated trajectories of X were not used in the training of the ANN. Hence this shows that the ANN-aided model approximation can learn the effective form of master equations from partial trajectory information, a very useful property if the training data are sparse and available only for some molecular species, as is commonly the case with experimental data.
ANN-aided inference of the parameters of bursty transcription
With a small modification, the ANN-aided model approximation technique, besides constructing an approximate NN-CME model, can also infer the values of the kinetic parameters underlying the data used for training. This is brought about by optimizing not only the weights and biases of the ANN but also, simultaneously, the kinetic parameter values. In Fig. 7, we show the results of this method using training data generated by the SSA of Model II with parameters (burst frequency α and size b) that have been measured for five mammalian genes^{41}. Comparing the latter true parameter values with those obtained from the ANN-aided inference, we conclude that the inference procedure leads to accurate results. Note that the 95% confidence intervals of the estimates are obtained using the profile likelihood method (see SI Note 8 and Fig. 7a, b for an illustration).
We also show that the distribution solution of the NN-CME (which is obtained in one go, together with the inferred parameters) is practically indistinguishable from distributions constructed using the SSA of Model II (the quantile–quantile plots in Fig. 7c are linear with unit slope and zero intercept). In Fig. S3, we show the application of the ANN-aided inference to Model III.
Discussion
In this paper, we have shown how the training of a threelayer perceptron with a single hidden layer is enough to approximate the delay CME of a nonMarkovian model by the NNCME, which is a master equation with timedependent propensities (timeinhomogeneous Markov process). Notably, this mapping has been achieved without increasing the effective number of species in the model. Since the NNCME has no delay terms, it simplifies analysis and simulation; for e.g. the NNCME can be accurately approximated by a wide range of standard methods^{7} and its solution is straightforward using FSP^{27}. The method hence enables the efficient study of much more complex nonMarkovian models of gene regulation than has been possible to exactly solve analytically or using numerical/simulation methods applied directly to the delay master equation. For example, we showed that our ANNbased method easily solves an extension of Model III where we incorporate noise in the delay time associated with elongation and termination (Fig. 5c), as well as a multispecies model of transcriptional feedback involving delayed maturation of proteins followed by binding to the promoter (Fig. 6). In contrast, the dynamics of these systems cannot be obtained by applying FSP directly to the nonMarkovian delay master equation or using analytical methods reported for nonMarkovian stochastic gene expression models^{8}. We note that while neural networks have been recently used to approximate partial differential equations in physics, chemistry and biology, to our knowledge, our work represents their first use in approximating equations describing the timeevolution of stochastic processes in continuous time and with a discrete state space, e.g. systems describing cellular biochemistry where the discreteness is an important feature of the system due to the low copy number of DNA and mRNA molecules involved^{42}.
We find that to obtain an accurate NNCME, training needs only a small sample size (of the order of a thousand SSA trajectories, which is computationally cheap), it can be done with partial data (data for some species can be missing) and it simultaneously yields estimates of the kinetic parameters. The latter is particularly relevant if the training data are collected experimentally, e.g. by measuring nascent RNA numbers using live-cell imaging techniques (such as the MS2 system) at several time points for many cells^{43,44}. Our ANN-based inference method rests upon the matching of distributions and hence, similarly to the non-ANN-based methods developed in refs. ^{45,46}, it avoids the pitfalls of moment-based inference^{47,48}. We note that the vast majority of existing inference methods are for stochastic systems with no delayed reactions; a notable exception is ref. ^{49}. We also note that the ability to approximate solutions of delay master equations from simulated data while simultaneously optimizing for the parameters has not been demonstrated before; deep learning frameworks have previously achieved similar feats for deterministic models^{18,21} and more recently for stochastic models described by multi-dimensional Fokker–Planck equations^{50,51}.
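As a rough illustration of what matching distributions (rather than moments) means in practice, the sketch below builds an empirical distribution from molecule-count samples and compares it with a model distribution. The squared-difference distance is only one possible choice and is an assumption here, not necessarily the loss used in this work; the function names are hypothetical.

```python
from collections import Counter

def empirical_distribution(samples, n_states):
    """Histogram of integer molecule counts, normalised by the sample size.
    Counts at or above n_states are left out of the returned list."""
    counts = Counter(samples)
    total = len(samples)
    return [counts.get(n, 0) / total for n in range(n_states)]

def squared_distance(p, q):
    """Sum of squared differences between two distributions on the same states."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(p):
    return sum(n * pn for n, pn in enumerate(p))

# Toy data: a bimodal sample and a unimodal model with the same mean.
data = empirical_distribution([0, 0, 4, 4], n_states=5)
model = [0.0, 0.0, 1.0, 0.0, 0.0]

same_mean = mean(data) == mean(model)        # True: both means equal 2
mismatch = squared_distance(data, model)     # 1.5: the shapes differ
```

In this toy case a fit to the mean alone would accept the model, whereas the distribution-level distance exposes the disagreement.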
The ANN-based procedure described in this article is most useful for learning effective propensities of those biomolecular processes that we do not know how to model well using a Markovian approach. The input data for the ANN's training can be experimental data or data generated by a complex model. The complex model could be non-Markovian, as in this paper, or else a Markovian model with many more species and reactions than the effective Markovian model that the ANN is trying to learn. In some cases the procedure will show that a mapping is not possible. For example, here we have shown that the standard telegraph model of gene expression (equivalent to the NNCME of Model III with a linear degradation propensity) is not a suitable Markovian description for the nascent RNA dynamics of most eukaryotic genes (it is typically a good description for mature RNA dynamics, as has been analytically shown in ref. ^{9}).
Recent work has shown that differential equation models describing the time-evolution of average agent density can be learnt (using sparse regression) from agent-based model simulations of spatial reaction-diffusion processes^{52}. Such models can describe, e.g., intracellular biochemical processes in crowded conditions^{53} or multiscale tissue dynamics including cell movement, growth, adhesion, death, mitosis and chemotaxis^{54,55,56}. Some of these models have been shown to display non-Markovian behaviour^{57}. While here we showcased the ANN-based method using non-Markovian delay CMEs, one could also train on data generated by spatially resolved particle-based simulations, as in the examples above. The application of our method would provide a master equation that effectively captures stochastic dynamics at the population level of description and avoids the pitfalls of commonly used analytical approximation methods, e.g. mean-field approximations, for obtaining reduced stochastic descriptions.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The experimental data shown as an inset in Fig. 4b can be found at https://doi.org/10.5281/zenodo.4643094.
Code availability
The code, readme file and data for ANN-aided model approximation can be found at https://doi.org/10.5281/zenodo.4643094. The code is implemented in Julia 1.4.2 using the packages Flux v0.10.4, DifferentialEquations v6.15.0 and DiffEqSensitivity v6.26.0.
References
 1.
Shahrezaei, V. & Swain, P. S. Analytical distributions for stochastic gene expression. Proc. Natl Acad. Sci. U.S.A. 105, 17256–17261 (2008).
 2.
Cao, Z. & Grima, R. Analytical distributions for detailed models of stochastic gene expression in eukaryotic cells. Proc. Natl Acad. Sci. U.S.A. 117, 4682–4692 (2020).
 3.
Cao, Z. & Grima, R. Linear mapping approximation of gene regulatory networks with stochastic dynamics. Nat. Commun. 9, 1–15 (2018).
 4.
Peccoud, J. & Ycart, B. Markovian modeling of gene-product synthesis. Theor. Popul. Biol. 48, 222–234 (1995).
 5.
Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. & Tyagi, S. Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 4, e309 (2006).
 6.
So, L.-H. et al. General properties of transcriptional time series in Escherichia coli. Nat. Genet. 43, 554–560 (2011).
 7.
Schnoerr, D., Sanguinetti, G. & Grima, R. Approximation and inference methods for stochastic biochemical kinetics—a tutorial review. J. Phys. A 50, 093001 (2017).
 8.
Xu, H., Skinner, S. O., Sokac, A. M. & Golding, I. Stochastic kinetics of nascent RNA. Phys. Rev. Lett. 117, 128101 (2016).
 9.
Filatova, T., Popovic, N. & Grima, R. Statistics of nascent and mature RNA fluctuations in a stochastic model of transcriptional initiation, elongation, pausing, and termination. Bull. Math. Biol. 83, 1–62 (2021).
 10.
Barrio, M., Burrage, K., Leier, A. & Tian, T. Oscillatory regulation of Hes1: discrete stochastic delay modelling and simulation. PLoS Comput. Biol. 2, e117 (2006).
 11.
Cai, X. Exact stochastic simulation of coupled chemical reactions with delays. J. Chem. Phys. 126, 124108 (2007).
 12.
Lafuerza, L. F. & Toral, R. Exact solution of a stochastic protein dynamics model with delayed degradation. Phys. Rev. E 84, 051121 (2011).
 13.
Leier, A. & Marquez-Lago, T. T. Delay chemical master equation: direct and closed-form solutions. Proc. R. Soc. A 471, 20150049 (2015).
 14.
Park, S. J. et al. The chemical fluctuation theorem governing gene expression. Nat. Commun. 9, 1–12 (2018).
 15.
Zhang, J. & Zhou, T. Markovian approaches to modeling intracellular reaction processes with molecular memory. Proc. Natl Acad. Sci. U.S.A. 116, 23542–23550 (2019).
 16.
Wang, Z., Zhang, Z. & Zhou, T. Analytical results for non-Markovian models of bursty gene expression. Phys. Rev. E 101, 052406 (2020).
 17.
Sirignano, J. & Spiliopoulos, K. DGM: a deep learning algorithm for solving partial differential equations. J. Comput. Phys. 375, 1339–1364 (2018).
 18.
Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019).
 19.
Bar-Sinai, Y., Hoyer, S., Hickey, J. & Brenner, M. P. Learning data-driven discretizations for partial differential equations. Proc. Natl Acad. Sci. U.S.A. 116, 15344–15349 (2019).
 20.
Hermann, J., Schätzle, Z. & Noé, F. Deep-neural-network solution of the electronic Schrödinger equation. Nat. Chem. 12, 891–897 (2020).
 21.
Lagergren, J. H., Nardini, J. T., Baker, R. E., Simpson, M. J. & Flores, K. B. Biologically-informed neural networks guide mechanistic modeling from sparse experimental data. PLoS Comput. Biol. 16, e1008462 (2020).
 22.
Rackauckas, C. et al. Universal differential equations for scientific machine learning. Preprint at http://arxiv.org/abs/2001.04385 (2020).
 23.
Zenklusen, D., Larson, D. R. & Singer, R. H. Single-RNA counting reveals alternative modes of gene expression in yeast. Nat. Struct. Mol. Biol. 15, 1263–1271 (2008).
 24.
Chen, H., Shiroguchi, K., Ge, H. & Xie, X. S. Genome-wide study of mRNA degradation and transcript elongation in Escherichia coli. Mol. Syst. Biol. 11, 781 (2015).
 25.
Wang, M., Zhang, J., Xu, H. & Golding, I. Measuring transcription at a single gene copy reveals hidden drivers of bacterial individuality. Nat. Microbiol. 4, 2118–2127 (2019).
 26.
Choubey, S., Kondev, J. & Sanchez, A. Deciphering transcriptional dynamics in vivo by counting nascent RNA molecules. PLoS Comput. Biol. 11, e1004345 (2015).
 27.
Munsky, B. & Khammash, M. The finite state projection algorithm for the solution of the chemical master equation. J. Chem. Phys. 124, 044104 (2006).
 28.
Hornik, K. et al. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989).
 29.
Pinkus, A. Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–195 (1999).
 30.
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
 31.
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
 32.
Suter, D. M. et al. Mammalian genes are transcribed with widely different bursting kinetics. Science 332, 472–474 (2011).
 33.
Paulsson, J. & Ehrenberg, M. Random signal fluctuations can reduce random fluctuations in regulated components of chemical regulatory networks. Phys. Rev. Lett. 84, 5447–5450 (2000).
 34.
Hocine, S., Raymond, P., Zenklusen, D., Chao, J. A. & Singer, R. H. Single-molecule analysis of gene expression using two-color RNA labeling in live yeast. Nat. Methods 10, 119–121 (2013).
 35.
Liu, J. et al. Quantitative characterization of the eukaryotic transcription cycle using live imaging and statistical inference. Preprint at https://doi.org/10.1101/2020.08.29.273474 (2020).
 36.
Vu, T. N. et al. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics 32, 2128–2135 (2016).
 37.
Engl, C., Jovanovic, G., Brackston, R. D., Kotta-Loizou, I. & Buck, M. The route to transcription initiation determines the mode of transcriptional bursting in E. coli. Nat. Commun. 11, 1–11 (2020).
 38.
Jia, C. Kinetic foundation of the zero-inflated negative binomial model for single-cell RNA sequencing data. SIAM J. Appl. Math. 80, 1336–1355 (2020).
 39.
Novák, B. & Tyson, J. J. Design principles of biochemical oscillators. Nat. Rev. Mol. Cell Biol. 9, 981–991 (2008).
 40.
Wenden, B., Toner, D. L., Hodge, S. K., Grima, R. & Millar, A. J. Spontaneous spatiotemporal waves of gene expression from biological clocks in the leaf. Proc. Natl Acad. Sci. U.S.A. 109, 6757–6762 (2012).
 41.
Larsson, A. J. et al. Genomic encoding of transcriptional burst kinetics. Nature 565, 251–254 (2019).
 42.
Cai, L., Friedman, N. & Xie, X. S. Stochastic protein expression in individual cells at the single molecule level. Nature 440, 358–362 (2006).
 43.
Larson, D. R., Zenklusen, D., Wu, B., Chao, J. A. & Singer, R. H. Real-time observation of transcription initiation and elongation on an endogenous yeast gene. Science 332, 475–478 (2011).
 44.
Lenstra, T. L., Coulon, A., Chow, C. C. & Larson, D. R. Single-molecule imaging reveals a switch between spurious and functional ncRNA transcription. Mol. Cell 60, 597–610 (2015).
 45.
Öcal, K., Grima, R. & Sanguinetti, G. Parameter estimation for biochemical reaction networks using Wasserstein distances. J. Phys. A 53, 034002 (2019).
 46.
Munsky, B., Li, G., Fox, Z. R., Shepherd, D. P. & Neuert, G. Distribution shapes govern the discovery of predictive models for gene regulation. Proc. Natl Acad. Sci. U.S.A. 115, 7533–7538 (2018).
 47.
Cao, Z. & Grima, R. Accuracy of parameter estimation for autoregulatory transcriptional feedback loops from noisy data. J. R. Soc. Interface 16, 20180967 (2019).
 48.
Zechner, C. et al. Moment-based inference predicts bimodality in transient gene expression. Proc. Natl Acad. Sci. U.S.A. 109, 8340–8345 (2012).
 49.
Choi, B. et al. Bayesian inference of distributed time delay in transcriptional and translational regulation. Bioinformatics 36, 586–593 (2020).
 50.
Chen, X., Yang, L., Duan, J. & Karniadakis, G. E. Solving inverse stochastic problems from discrete particle observations using the Fokker-Planck equation and physics-informed neural networks. Preprint at http://arxiv.org/abs/2008.10653 (2020).
 51.
Yang, L., Daskalakis, C. & Karniadakis, G. E. Generative ensemble-regression: learning stochastic dynamics from discrete particle ensemble observations. Preprint at http://arxiv.org/abs/2008.01915 (2020).
 52.
Nardini, J. T., Baker, R. E., Simpson, M. J. & Flores, K. B. Learning differential equation models from stochastic agent-based model simulations. J. R. Soc. Interface 18, 20200987 (2021).
 53.
Schöneberg, J. & Noé, F. ReaDDy - a software for particle-based reaction-diffusion dynamics in crowded cellular environments. PLoS ONE 8, e74261 (2013).
 54.
Swat, M. H. et al. Multiscale modeling of tissues using CompuCell3D. Methods Cell Biol. 110, 325–366 (2012).
 55.
Matsiaka, O. M., Penington, C. J., Baker, R. E. & Simpson, M. J. Continuum approximations for lattice-free multi-species models of collective cell migration. J. Theor. Biol. 422, 1–11 (2017).
 56.
Middleton, A. M., Fleck, C. & Grima, R. A continuum approximation to an off-lattice individual-cell based model of cell migration and adhesion. J. Theor. Biol. 359, 220–232 (2014).
 57.
Newman, T. & Grima, R. Many-body theory of chemotactic cell-cell interactions. Phys. Rev. E 70, 051916 (2004).
Acknowledgements
Z.C., W.D. and F.Q. acknowledge support from the Natural Science Foundation of China (NSFC No. 61988101); Z.C. acknowledges support from NSFC No. 62073137; W.D. acknowledges support from NSFC No. 61725301; Q.J. and S.Y. acknowledge support from NSFC No. 61973119, the National Key Research and Development Program of China (2020YFA0908303) and the Shanghai Rising-Star Program (20QA1402600); R.G. acknowledges support from the Leverhulme Trust Grant (RPG-2018-423). We thank James Holehouse, Kaan Öcal and Guido Sanguinetti for useful discussions and insightful feedback.
Author information
Contributions
Z.C. and R.G. designed research, supervised research, acquired funding and wrote the manuscript with input from the others. Q.J., X.F. and S.Y. performed research, analysed the data and wrote the manuscript. W.D. and F.Q. analysed the data and acquired funding. R.L. was involved in data analysis and graphical illustration.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Ido Golding and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Jiang, Q., Fu, X., Yan, S. et al. Neural network aided approximation and parameter inference of non-Markovian models of gene expression. Nat Commun 12, 2618 (2021). https://doi.org/10.1038/s41467-021-22919-1