Neural network aided approximation and parameter inference of non-Markovian models of gene expression

Non-Markovian models of stochastic biochemical kinetics often incorporate explicit time delays to effectively model large numbers of intermediate biochemical processes. Analysis and simulation of these models, as well as the inference of their parameters from data, are fraught with difficulties because the dynamics depends on the system’s history. Here we use an artificial neural network to approximate the time-dependent distributions of non-Markovian models by the solutions of much simpler time-inhomogeneous Markovian models; the approximation does not increase the dimensionality of the model and simultaneously leads to inference of the kinetic parameters. The training of the neural network uses a relatively small set of noisy measurements generated by experimental data or stochastic simulations of the non-Markovian model. We show using a variety of models, where the delays stem from transcriptional processes and feedback control, that the Markovian models learnt by the neural network accurately reflect the stochastic dynamics across parameter space.


Supplementary Note 1 Exact solution for Model I: constitutive transcription with delayed degradation
Here we present the exact time-dependent solution of Model I which in the main text is used to illustrate the key ideas behind the artificial neural network (ANN) aided model approximation and also to evaluate its precision. Note that a more general derivation was reported in [1]. The model has two reactions: where the first reaction describes the production of nascent RNA as a Poisson process with its mean equal to ρ, while the second reaction describes the "decay" of the nascent RNA a time τ later after it is produced (actually it models its detachment from the gene locus). The objective is to find the probability P(n, t) that there are n nascent RNAs at time t.

Derivation of delay master equation
We next derive the delay CME describing the stochastic dynamics. We want to obtain the probability P(n, t + ∆t). There are three sets of events that contribute to this: (A) there are n − 1 molecules at time t and a nascent RNA molecule is produced in the time interval (t, t + ∆t); (B) there are n + 1 molecules at time t and a nascent RNA molecule is removed in the time (t, t + ∆t); (C) there are n molecules at time t and no reaction occurs in the time interval (t, t + ∆t). Now in the infinitesimal limit ∆t → 0, the probability that event (A) occurs is simply the conventional propensity associated with a firing of the production reaction: P(n, t + ∆t; n − 1, t) = ρ∆tP(n − 1, t). (S2) The probability that event (B) occurs requires a careful consideration of the history of the process: (i) (n ′ , t − τ) leads to (n ′ + 1, t − τ + ∆t); (ii) (n ′ + 1, t − τ + ∆t) leads to (n + 1, t); (iii) (n + 1, t) leads to (n, t + ∆t), where n ′ is the number of nascent RNAs at time t − τ. Intuitively since degradation is a delayed reaction, a removal reaction in (iii) is due to a production reaction in (i) a time interval τ earlier. The probability of (i) occurring is ρ∆tP(n ′ , t − τ). The probability of (iii) occurring is 1 since every production event is followed by a removal event a time τ later. The probability of (ii) occurring is: P(n + 1, t|n ′ + 1, t − τ + ∆t) = P(n, t|n ′ , t − τ + ∆t) = P(n, t|0, t − τ + ∆t).
The two "+1"s in the left-hand-side of Eq. (S3) represent that a nascent RNA was born in time (t − τ, t − τ + ∆t) and stays in the system till time t, and notably does not participate in any reaction in time (t − τ + ∆t, t); hence it follows that we can write the first line on the right hand side of Eq. (S3). Now since all nascent RNAs born before time t − τ (whose number is n ′ ) are removed from the system, the probability P(n, t|n ′ , t − τ + ∆t) must be independent of n ′ and the n nascent RNAs at time t are produced between time (t − τ + ∆t, t); hence it follows that we can write the second line on the right hand side of Eq. (S3). The probability that event (B) occurs is given by the product of the probability of events (i), (ii) and (iii) and summing over all values of n ′ : P(n, t + ∆t; n + 1, t) = ρ∆t ∑ n ′ P(n ′ , t − τ)P(n, t|0, t − τ + ∆t) = ρ∆tP(n, t|0, t − τ + ∆t), where P(n, t|0, t − τ + ∆t) is the probability of having n nascent RNAs produced in the time interval (t − τ + ∆t, t). See Fig. S1 for an illustration of the aforementioned considerations for event (B). The probability that event (C) occurs is obtained by subtracting all the probabilities of one reaction occurring from the probability P(n, t).

Solution of the delay master equation
To solve Eq. (S7), one has to find an equation governing the conditional probability P(n, t|0, t − τ). Now we know that the n nascent RNAs produced during time (t − τ, t) and any nascent RNA produced prior to t − τ cannot survive up to time t. In other words, the dynamics of n nascent RNAs are specifically determined by the instant reaction ∅ ρ − → N, whose stochastic evolution is characterized by with initial condition beingP(0, t) = 1 andP(n, t) = 0 for any positive integer n (see Fig. S1 for a helpful illustration of the preceding arguments). Importantly, the conditional probability P(n, t|0, t − τ) coincides withP(n, τ). By defining generating functions G(z, t) = ∑ n z n P(n, t) andḠ(z, t) = ∑ n z nP (n, t), Eqs. (S7) and (S8) are transformed into and where t * = min{t, τ} and [τ,∞) (t) = 1 if and only if t ∈ [τ, ∞). Hence finally we obtain using , that the solution to the delay CME of Model I is a Poisson distribution parametrized by ρt * . Interestingly, it is noted from Eq. (S9) that the distribution does not evolve any longer after time τ, while it is continuously changing prior to τ.

Intractability of the analytical solution when delay is stochastic
whereby the delay τ ω is a random variable defined on a probability space and sampled from some distribution with random sample ω.
Repeating the steps of the derivation as above, one finds difficulty in computing the contributions due to nascent RNA removal, i.e. event (B). Given the history leading to removal, as we previously argued, there are three contributions to this event and one finds that those associated with (ii) cannot be analytically resolved as we did earlier. Such a breakdown stems from the fact that the n ′ nascent RNAs born prior to time t − τ ω do not necessarily die in time (t − τ ω + ∆t, t), thereby impacting the distribution at time t and compromising the argument we used to arrive to Eqs. (S4) and (S8). Therefore, it is not possible to exactly solve the delay CME analytically in the presence of stochastic delay. These issues are also generally true for more complex models such as Models II and III in the main text.

Supplementary Note 2 Exact solution for Model II: bursty transcription with delayed degradation
The model consists of the reactions where i = 0, 1, 2, ... Nascent RNA N is produced at a frequency α in bursts which are distributed according to a geometric distribution with mean b; a nascent molecule is removed from the system a time τ after its production, where τ is the time between initiation and termination of transcription.

Derivation of delay master equation
Following the same logical steps as the derivation in Note 1, there are three sets of events that contribute to the probability P(n, t + ∆t): (A) there are n − m molecules at time t and m nascent RNA molecules are produced in the time interval (t, t + ∆t); (B) there are n + m molecules at time t and m nascent RNA molecules are removed in the time (t, t + ∆t); (C) there are n molecules at time t and no reaction occurs in the time interval (t, t + ∆t).
The probability of event (A) due to active burst writes by summing over m The probability that event (B) occurs is a consequence of the history of the process: where n ′ is the number of nascent RNAs at time t − τ. Since degradation is a delayed reaction, a removal reaction in (iii) is due to a production reaction in (i) a time interval τ earlier. The probability of (i) occurring is αb m (1+b) m ∆tP(n ′ , t − τ). The probability of (iii) occurring is 1 since the event of the production of m molecules in (t − τ, t − τ + ∆t) is followed by a removal of these m molecules τ time later. The probability of (ii) occurring is described by P(n + m, t|n ′ + m, t − τ + ∆t).
The probability that event (B) occurs is given by the product of the probability of events (i), (ii) and (iii) and summing over all values of n ′ and m: The probability that event (C) occurs is obtained by subtracting all the probabilities of one reaction oc-curring from the probability P(n, t) where we have: [P(n, t + ∆t; n − m, t) + P(n, t + ∆t; n + m, t)] We further rearrange the equation above and get that Hence, one can obtain a delay CME describing the stochastic dynamics by letting ∆t → 0: (S11)

Solution of the delay master equation
This master equation can be solved using standard methods but it turns out that one can directly obtain P(n, t) without even needing to write this equation. This simpler and more straightforward solution is presented next.
We are interested in the distribution of nascent RNA number Y(t) to the system in Eq. (S10). Clearly, Y(t) is a random variable defined on the integer support. Intuitively, Y(t) is determined by two factorsthe number of "packages" I(t) (where a package stands for an event occurring before time t such that the nascent RNA produced in these events has still not been subject to delayed degradation) and the number of nascent RNAs X i ∼ Geom( b 1+b ) in each package i. Therefore, Y(t) can be written in the form: thereby constituting a compound process. The event number I(t) is determined by the system in Eq. (S1). Now we can compute the distribution of Y(t) from that of I(t) by using the generating-function property of a compound process. The generating function of X i is 1) , and that of package number I(t) is given by Eq. (S9). Thus, the generating function of Y(t) is which is simplifies to (S12)

Conditions for the existence of delay induced zero-inflation
Here we derive conditions for the existence of the zero-inflation phenomenon described in the main text which is when the steady-state distribution is bimodal with a peak at zero and another peak at a non-zero value greater than 1. In steady-state conditions (t → ∞), the generating function Eq. (S12) reduces to from which we can compute the probabilities of having 0, 1 and 2 nascent RNAs at the gene locus: We can define two ratios as follows When r 1 < 1 and r 2 > 1, the probability of having a single nascent RNA becomes a valley between P(0) and P (2), which establishes the sufficient and necessary condition for generating bimodality. This condition can be further simplified to read

Supplementary Note 3 Steady state of Model III converges to that of Model II when transcription is bursty
Model III is comprised of four reactions The gene can switch between active G and inactive G ⋆ states, transcribes nascent RNA N while in the active state which subsequently is removed a time τ later. This is similar to the telegraph model for mature RNAs first studied in [2] except that since we are modelling nascent RNA, degradation is not first-order but described by a delay reaction modelling the time between initiation and termination of transcription.

Derivation of the delay CME of Model III
For Model III, we would like to obtain the probabilities P 0 (n, t) and P 1 (n, t) for gene state being inactive and active, respectively. Similar to what we did in Supplementary Note 1, the two probabilities are contributed by three sets of events: (A) either a nascent RNA molecule is produced or gene state is changed in the time interval (t, t + ∆t); (B) a nascent RNA molecule is removed in the time interval (t, t + ∆t); (C) no reaction occurs in the time interval (t, t + ∆t).
In the infinitesimal limit ∆t → 0, the probabilities of event (A) simply follow the conventional propensity argument, and hence are where P ij (n, t + ∆t; n ′ , t) is the joint distribution of finding a cell having n ′ and n nascent RNA molecules while the gene state is in state j and i at time t and t + ∆t, respectively.
By using similar arguments in Supplementary Note 1, the probability of (i) occurring is ρ∆tδ 1 (S16) As such, the probability of occurring event (B) is given by the product of the probability of events (i), (ii) and (iii) and summing over all values of n ′ : As per the law of total probability, we have Summarizing Eqs. (S15)-(S18), all are simplified to the following set of delay CMEs Summing over all possible n for the two equations in Eq. (S19), we obtain initiated with the inactive state, in which P 0 (t) and P 1 (t) are the probabilities of finding a cell at the inactivated and activated state at time t, and their solutions are Besides, it is noted that The n nascent RNAs of the two conditional probabilities P i1 (n, t|0, t − τ) for i = 0, 1 in Eq. (S19) are produced during time (t − τ, t), and the pertinent dynamics are that of a reaction system only composed of the three non-delayed reactions in Eq. (S14). Specifically, we have initiated atP 0 (n, 0) = 0 for any n,P 1 (n, 0) = 1 for n = 0 and equal to 0 otherwise, as well as P i1 (n, t|0, t − τ) =P i (n, τ) for any i = 0, 1.
We particularly define the generating function in such a form so as to simplify notations. Then, using Eq. (S20), Eqs. (S19) and (S21) lead to the generating function equations: and where the arguments u and t in the generating functions are suppressed for clarity and the superscript τ is used to emphasize the generating functionḠ i at the particular time τ. The initial condition of Eq.
Under the condition of steady state, Eqs. (S22) reduces to

Model reduction under bursty conditions
We now find an approximate solution to Model III when σ off ≫ σ on . To this end, if we define δ = σ off /σ on and divide both sides of Eqs. (S23) by σ on then we obtain withρ = ρ/σ on and s = σ on t. Then, we further divide all equations in Eqs. (S25) by δ and denote ϵ = 1/δ to obtain with b = ρ/σ off (the mean nascent RNA burst size). From Table 1 of Ref. [3] for many genes ϵ is reported to be a small positive real number, and hence as per perturbation theory, we postulate that all the generating functionsḠ 0 andḠ 1 have a series expansion in ϵ of the form Matching the coefficients of different orders in ϵ in Eq. (S26), the following set of equations ensues Thus, we haveḠ where the integration constant is C = 1/(1 − bu). As the process represented by Eq. (S21) is initiated at the activated gene state and quickly converges to the slow manifold (the inactivated state), the number of nascent RNAs produced during such a short ON window is subject to the geometric distribution with mean burst size b, whose generating function is 1/(1 − bu). Directly solving Eq. (S24), we have Furthermore, it is obtained that where the approximation step is performed under the condition σ off ≫ σ on . By comparing the generating functions in Eqs. (S12) and (S27), we find that they are the same if α = σ on , thereby concluding that Model III converges to Model II at steady state in bursty conditions.

Supplementary Note 4 Neural Network Chemical Master equations for Models II and III
The Neural Network Chemical Master equation (NN-CME) for Model I was presented in the main text.
Here we show the same for Models II and III.

Model II
Similar to the reasoning for Model I, since only the removal of nascent RNAs is delayed, the NN-CME of the system Eq. (S10) is readily written as a term for bursty production (which is the same as in the conventional CME) plus effective degradation terms:

Model III
By similar arguments as above, the NN-CME of Model III is Note that these master equations are the same as the telegraph model of gene expression [2] but modified to allow degradation propensities to be nonlinear in n and specific to each promoter state. These equations can be rewritten into the form of Eq. (S29) where the probability vector is and Note that the same set of matrices are used to solve Model III when the delay is stochastic (as in Fig. 5c of the main text), wherein the values of prefix biases are set to the mean of the stochastic delay (⟨τ⟩).
Supplementary Note 5 Theoretical justification of the mapping from delay CME to NN-CME

Model I
Comparing Eq. (2) and Eq. (3) in the main text, for any integer n we have the following relation Besides, it is known from Eq. (S9) that which constitutes a Poisson distribution. Note that P(n, t|0, t − τ) =P(n, τ) for any t ≥ τ andP(n, t) is governed by Eq. (S8), which admits the solution Substituting Eq. (S31) and Eq. (S32) into Eq. (S30), for any t ≥ τ we deduce We also know that for any t < τ, P(n, t|0, t − τ) = 0 which yields to the analytical form of NN θ (n, t) Hence we have proved that the solution of the delay master equation given by Eq. (2) in the main text is precisely the same as the NN-CME given by Eq. (3) in the main text with effective propensity Eq. (S33).

Model II
Next we use the generating function method to obtain the effective propensity in the NN-CME for Model II, i.e. the NN θ term in Eq. (S28). To do so, we first define the exact solution for Model II in generating function form as and also define an auxiliary generating function As such, we can rewrite Eq. (S28) into Since by Eq. (S12) we have G(z, t) = e αb(z−1)t * 1−b(z−1) and t * = min{t, τ}, one can easily show that NN θ (n, t) = 0 when t < τ. For t > τ, we have or equivalently Moreover, it is known from Eq. (S34) that from which it follows that .
This gives the exact solution for the mapped propensity where f (z) =  Fig. 4b Elongation time:

Supplementary Note 6 Data processing in inset of
• The elongation rates (bp/min) for genes in mouse embryonic stem cells are obtained from Source Data 1 of Figure 3 of Ref. [4].
• The elongation time is the ratio of gene length and elongation rates (for those genes appearing in both [4] and [5]).
Transcription kinetic parameters (σ on , σ off , ρ) and the mature RNA degradation rates (d mature ): • The transcription rate ρ, gene activation σ on and inactivation σ off rates normalized by mature RNA degradation rates (d mature ) for CAST allele in mouse embryonic stem cells are reported in Table S3 of Ref. [5], and their absolute values are computed by using the mature RNA degradation rates (d mature ) reported in the file named slam_seq.csv of Ref. [5].
Final data: • The final data are summarized in the data file named processed.mat, which includes a matrix named Data of size 368 × 8. Each row of this file represents σ on (min −1 ), σ off (min −1 ), ρ (min −1 ), d mature (min −1 ), gene length (bp), elongation rate (bp/min), elongation time (min) and (σ on + σ off )τ. The gene symbols of 368 genes are included as cell vector Name in the same file.

Supplementary Note 7 Oscillatory gene regulatory network
Here we describe in more detail the stochastic auto-regulatory model studied in the main text and illustrated in Fig. 6a therein. The reactions comprising the model are: with The two reaction rates J 1 (Y) and J 2 (Y) in Eq. (S37) are in the form of a Hill function. The synthesis rate J 1 (Y) of premature protein X is proportional to the number of upstream regulators S, and is downregulated by the number of protein Y. The dissociation constant for protein Y and the gene promoter is K d , and p is the Hill coefficient. In the degradation rate J 2 (Y), E T stands for the total number of protease, k 2 is the turnover rate, and K m is the Michaelis constant. Let P(x, y, t) be the probability of finding x molecules of premature protein X and y molecules of protein Y in a cell at time t. Then by similar arguments used for previous models, the NN-CME governing the time-evolution of the probability P(x, y, t) is given by with E being the step operator whose function is E i,j f (x, y) = f (x + i, y + j) and initial condition P(x, y, 0) = 1 if and only if x = y = 0. Again we are able to rewrite Eq. (S38) in the compact form Eq. (S29) by selecting a sufficiently large N such that P(x, y, t) = 0 for any x ≥ N and y ≥ N and stacking P(x, y, t) to form a vector of length (N + 1) 2 : P(t) = [P(0, 0, t), · · · , P(N, 0, t), P(0, 1, t), · · · , P(N, 1, t), · · · , P(0, N, t), · · · , P(N, N, t)] ⊤ .
Besides, the following matrices are needed to assemble matrix A θ (t) = D 1 + D 2 + N θ (t), and and for k = 0, · · · , N − 1. The preset biases r n of NN θ (n, k, t) are increasing functions of n and proportional to τ −1 .

Supplementary Note 8 Calculation of confidence intervals using profile likelihood
The confidence intervals are derived by the concept of profile likelihood and the nuisance parameters (see p. 264-265 in Ref. [6]). With reference to Fig. 7 of the main text, suppose that the burst frequency α is the parameter that we are interested in estimating its confidence interval. The augmented parameter vector is ψ = [α, b, θ ⊤ ] ⊤ , and the nuisance parameters accordingly become λ = [b, θ ⊤ ] ⊤ . Let (α,λ) be the solution minimizing the mean squared error cost function to the NN-CME, and potentiallyψ minimize the negative loglikelihood: (α,λ) = arg min (α,λ) L(α, λ), and P ψ is the probability of observing x i,t j molecules at time t j , which is predicted by NN-CME. As the burst frequency α is the one we would like to quantify uncertainty, we define a new type of optimization problem:λ α = arg min λ L(α, λ), in which it means we attempt to minimize the likelihood by only tuning λ and fixing the value α.
Hence, we are able to define a function of α as Asymptotic theory [6] shows that Thus, by varying α, we are able to obtain a curve for ∆L (see the right panel of Fig. 7a main text), whose intersections with the horizontal line 1.92 give the 95% confidence interval. The values ofλ α can be quickly computed by retraining NN-CME initialized withψ. The 95% confidence intervals for burst size b or Model III can be obtained similarly.

Neural network specifications:
The following table summarizes the technical specifications about the ANN used for all the examples presented in the main text. Note that there are two hyperparameters involved in the NN-CME, namely the number of hidden neurons and the learning rate. We found that the results are not very sensitive to the number of hidden neurons, and below we just present a set of hyperparameters that works and achieved good results. Second, the training can be started with a smaller learning rate to achieve an acceptable fitting performance, and then the learning rate can be increased to achieve faster convergence. This is the reason why a step-wise setting is used in the experiments.

Supplementary Figures
Nascent RNA production   Figure S1: Illustration of the chain of events culminating in the removal (detachment) of a nascent RNA in Model I (refer to event (B) in Note 1). Let's assume there are n ′ nascent RNAs (green balls) produced before time t − τ. In the next time step t − τ + ∆t, there is a new born RNA (red ball). During (t − τ + ∆t, t), the first n ′ green RNAs are progressively removed and in the meanwhile n new RNAs (blue balls) are produced. Since each nascent RNA stays for a fixed time τ, the red RNA definitely leaves in the time interval (t, t + ∆t). Note that during the time interval (t − τ + ∆t, t), the red RNA does not participate in any dynamics of the system which explains Eq. (S3) and the dynamics of the blue RNAs follows Eq. (S8) which is a simple birth process from (0, τ).  Figure S2: Testing the precision and computation efficiency of the ANN-aided model approximation for Model II using a steady-state training procedure, i.e. we solve the algebraic equations A θ (t)P(t) = 0 during training. Note that the steady-state training only uses one snapshot of histogram. It shows the Hellinger distance (HD) between the distribution of the NN-CME and histograms computed from stochastic simulations (SSA) as a function of sample size. Each SSA data point is averaged over 10 independent runs, while each NN-CME point is averaged over 3 independent trainings (error bar: standard deviation). It shows that the steady-state ANN-aided approximation is able to achieve comparable precision to that of the SSA while only using about 1/3 of the samples necessary for the latter. The kinetic parameters are: α = 3.84 × 10 −3 , b = 1.38. x 5 i a p F I 6 i 6 L E U L C 1 / W P l Y / r S 6 9 v n L + k Z l c + v C m c x y b H I j j b 1 K w K E U G p s k S O J V a h F U I v E y G R 4 V + u U I r R N G / 6 J J i h 0 F A y 3 6 g g N 5 q l u p t J 0 Y K O i 2 C c e U G z 3 t V q p R P Z p F + B r E c 1 B l 8 z j v b p Y e 2 z 3 D M 4 W a u A T n W n G U U i c H S 4 J L n K 6 2 M 4 c p 8 C E M s O W h B o W u k 8 9 W n 4 Y 7 n u m F f W P 9 0 x T O 2 O c V O S j n J i r x m Q r o 2 i 1 q B f m W 1 s q o f 9 j J h U 4 z Q s 2 f B v U z G Z I J C x / C n r D I S U 4 8 A G 6 F 3 z X k 1 2 C B k 3 f r n W 0 w 6 5 0 i f + F l P s 6 0 4 K a H C 6 y k M V n w p E N S I H T h a 3 6 K c o R + w n / e N y y E 3 W M x E O R q Z / 4 c d O 3 E I g 6 / P 0 v 2 x x A v f v 1 r c L F X j / f r e z 8 P q o 3 a / C z K 7 B v b Z r s s Z j 9 Y g 5 2 y c 9 Z k n I 3 Y H 3 b D b o P f w V 1 w H / x 9 S g 1 K 8 5 q v 7 E U E D / 8 A I 9 H 6 9 w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " L 2 6 d G G  Figure S3: Inference of kinetic parameters (ρ, σ on and σ off ) for Model III using ANN-aided model approximation.
(a) Shows the inferred values and the corresponding 95% confidence interval computed using the profile likelihood method (SI Note 8). (b) Shows by means of quintile-quintile plots, that simultaneously with inference, the NN-CME given by the method has a steady-state distribution that is in excellent agreement with that of SSA. See SI Table 1 for parameters.