Efficient and flexible implementation of Langevin simulation for gene burst production

Gene expression involves bursts of production of both mRNA and protein, and the fluctuations in their number are increased due to such bursts. The Langevin equation is an efficient and versatile means to simulate such number fluctuation. However, how to include these mRNA and protein bursts in the Langevin equation is not intuitively clear. In this work, we estimated the variance in burst production from a general gene expression model and introduced such variation in the Langevin equation. Our approach offers different Langevin expressions for either or both transcriptional and translational bursts considered and saves computer time by including many production events at once in a short burst time. The errors can be controlled to be rather precise (<2%) for the mean and <10% for the standard deviation of the steady-state distribution. Our scheme allows for high-quality stochastic simulations with the Langevin equation for gene expression, which is useful in analysis of biological networks.

allows us to define the burst frequency and the burst size in transcription and translation with fundamental rate constants 4,[20][21][22] . Furthermore, the distributions of burst events and sizes derived from this model have the same features as those observed in experiments. The gene expression model shown in Fig. 1a can be written as: p p where g is the fraction of active gene for transcription and (m, p) are the amount of mRNA and protein, respectively; k g and γ g are the gene's activation and deactivation rates; k m and k p are the production rates for mRNA and protein; and γ m and γ p are the corresponding degradation rates. Following previous works 21,22 , when γ γ  k ( , ) g m g , mRNA production can be considered as occurring in bursts. Because the gene activation time (1/γ g ) is rather short, the average amount of mRNAs produced in such short time interval is the mean burst size 22 : The low gene activation rate (k g ) leads to well-separated burst events. The k g is considered the mRNA burst frequency. Similar limiting set γ γ  ( ) m p 7,20 applies to protein production, leading to an average burst size of protein as p p m and burst frequency as the rate of mRNA production (g(t)k m ). In Fig. 1b, we include a stochastic trajectory under the limit of burst-like production. In this work, we aimed to derive a Langevin equation that includes burst production effects and offers good number fluctuation for gene expression.
In the burst regime, when the upstream component is rarely-produced and fast-degraded, the slowly-degraded downstream component would be produced in bursts. Such difference in rates poses a difficulty for simulations with both the Gillespie algorithm and the standard Langevin equation. For the Gillespie algorithm, the slow reactions are sampled rarely, which leads to poor statistics. The Langevin simulation efficiency is also reduced, because the time step size has to be adjusted for the fast changes of the gene switching or mRNA number fluctuation. Therefore, we need a Langevin equation for the protein fluctuation that does not have to track the fast changes of a gene's state or mRNA's number 23,24 .
Starting from the general model, we develop analytical expressions for the mean and variance in the production with the burst effect, and such expression is included in the Langevin equation. Our approach allows for the flexibility to include either or both of the mRNA's and protein's burst effects. We also found that our burst Langevin expression has a large applicable region, which is not limited by the case of burst production. Our algorithm can produce an accurate steady-state mean and similar distribution as that with Gillespie simulation. When a gene switches dynamically, our simulation also can produce accurate dynamics of average protein number. The burst Langevin equation we derived is effective in minimizing the computational time and memory in stochastic simulations. Our simulation scheme with the burst Langevin equation is useful in stochastic simulation for biological networks.

Theory
Langevin equation for burst production. To simplify the derivation of burst Langevin equation, we first consider a two-component model for the burst of either mRNA or protein. In this model, a short-lived x results in a burst event of y: For the mRNA's burst production in equations (1) and (2), we assign x as the state of the gene and y as the mRNA.
In this case, we can combine the terms k g + γ g and set it to γ x . Similarly, for the protein's burst production, x is mRNA and y is protein. We treat g as a constant in equation (2) for a constant mRNA production rate and set k m g as k x . Thus, both the mRNA's and protein's production can be described by equations (6) and (7).
To develop an efficient stochastic simulation, we select a time interval τ that is longer than x's lifetime (1/γ x ). When there are e y burst events and each burst size is denoted as b yl , the change in y is: l e yl y y 1 The burst production of y is the consequence of short-lived x. The number of burst events (e y ) is determined by the number of x produced in τ and each burst size (b yl ) is determined by the survival time of each x. Simulation for the production in equation (8) can be performed with a random number for e y , followed by several random numbers for various burst sizes b yl . For the degradation in τ, a Poisson distribution can be used, with both mean and variance being γ y yτ 18 . A Gaussian random number with zero mean and unit variance  (0, 1) 2 is scaled by the standard deviation (γ y yτ) 1/2 for the noise part of degradation. An alternative approach is to reformulate the production of y in τ as: where Δ y (τ) and σ Δy (τ) are the mean and standard deviation of y's production within time τ. In this way, the simulation steps are simplified, and the computation is more efficient. To estimate Δ y (τ), the average production of y in time τ, we assumed that burst events and burst sizes are independent random processes. Therefore, we can take their average separately: which is the product of average burst event e ( ) y and average burst size b ( ) y . The variance of y's production distribution in time τ, σ τ Δ ( ) 2 y , was derived from the characteristic function of P(y), the probability distribution of y's number, in the supplementary material of ref. 25 : y by e y y 2 2 2 2 y where σ by 2 is the variance of burst size and σ ey 2 is that of burst event number in time τ. We found that it can also be derived directly, With the definition of variance, we also replaced b yl 2 with σ + b by y 2 2 and − e e y y 2 2 with σ ey 2 . Therefore, we obtain the same variance expression for y's burst production as in ref. 25 by direct estimation.
To simulate the downstream y's fluctuation with burst production, we can follow the Langevin equation as in equation (9) including the mean propagation Δ y (τ) as given in equation (10) and variance σ τ Δ ( ) 2 y as in equation (11). The expressions derived in this section can be applied to either or both the mRNA's and protein's burst production.
Langevin equations for either or both mRNA and protein bursts. Generally 28 . We further explored the criteria of γ g and γ m comparing to γ p with and without burst production, as shown in Fig. 2a. For all these different burst cases, we show that the burst production variance in equation (11) has the flexibility to describe all of them.
only mRNA is generated in bursts. We rearranged the expression in equation (1) as: g g g which becomes identical to equation (6). For a single-copy gene, the activity fraction g in equation (13) is considered 0 or 1 for off or on state, respectively. Without loss of generality, we considered a single-copy gene in the present work. If the gene has n-copies, − g (1 ) in equation (1) can be replaced by − n g ( ) ; thus, the first k g in equation (13) needs to be replaced by nk g . mRNA burst frequency is equal to k g (or nk g for n-copy gene case). Assuming that each mRNA burst event is independent, we can approximate burst event distribution by a Poisson distribution, as observed in several experiments 5,22 , where the mean of burst events On the other hand, possible mRNA burst size (b m ) can be described by a geometric distribution 29 , where q is the probability of no mRNA produced from this activation period, thus, q is proportional to k g + γ g . One mRNA is produced with the probability of (1 − q), which is proportional to k m , the transcription rate constant in equation (2). The mean and variance of mRNA burst size are We note that the mRNA burst size definition is modified as in equation (16), instead of γ = b k / m m g in the literature which is obtained with very small k g 21,22 . This new definition for b m yields accurate kinetic expression for the average amount of mRNA, as production (burst frequency (k g ) multiplied by size b ( ) m ) divided by degradation rate constant (γ m ): The mean of mRNA production with bursts (with large γ g ) can be expressed as:  by following equation (11). With equations (19) and (20), the Langevin equation for mRNA is then where the gene state is skipped. From the mRNA's burst Langevin equation in equation (21), we derived the mRNA's steady-state variance as: m ss m , 2 by following the supplementary material of ref. 8 . The steps are transforming m(t) in equation (21) to the Fourier space first and then squaring, averaging, and finally inverse Fourier transforming. The detailed derivation is in the supplementary information of this work. By comparing the production variance in equation (20) and steady-state variance in equation (23), we can see that the τ in equation (20) is replaced by 1/γ m , leading to (23). Also, the b 2 m in equation (20) becomes b m in equation (23). However, the mRNA's exact variance expression in the steady state from linear noise approximation (LNA) [30][31][32]  with detailed derivation given in our supplementary information. By comparing equations (23) and (24), we can see that the burst Langevin approximation can be achieved by assuming γ g /(k g + γ g + γ m ) ≈ 1 in the LNA's result, which is true that a large γ g leads to mRNA bursts. Therefore, with equations (21) and (22), there is no need to track the fast-changing gene state g(t) in the simulation, and a modest error is introduced in the mRNA's variance as in equation (23).
To further calculate the protein's steady-state variance, because of γ γ ∼ m p , we can propagate σ m ss , 2 from equation (23) by the variance propagation equation 33 to obtain σ p ss , 2 : This expression is also slightly different from the exact expression derived from LNA (details in the supporting information), which is given below: In general, σ p ss , 2 obtained by LNA as in equation (26) includes the overall intrinsic noise of a gene following equations (1) to (3). So it is desirable to compare the σ 2 p,ss in equation (25) from the burst Langevin equation to the exact variance in equation (26). The difference between equations (25) and (26) gives us an indication of the burst Langevin equation's accuracy. Such difference is shown in the lower right region in Fig. 2b, where γ g ≥ 10 γm. The largest error of the bursting Langevin equation with mRNA burst alone is .
12 5% that occurs at the lower left boundary of the region, which is still acceptable.
Both mRNA and protein bursts. In the condition γ γ γ   g m p , both the mRNA's and protein's production are produced in bursts. We can combine both bursts and derive one Langevin equation for the protein's fluctuation, thereby greatly simplifying the simulation. Because each mRNA corresponds to a protein burst event, the number of protein burst events in τ is leading to the protein production as Since mRNA is also produced in bursts, the variance of the protein's burst event is which is identical to that in equation (20). The variance of the protein's production following equation (11) is , by following the same assumption of geometric distribution as b m in equation (15). We also note that similar results with both bursts were derived using the generation function of P(p), the probability distribution of p's number in the supplementary material of ref. 34 .
With equations (28) and (30), the Langevin equation for protein fluctuation with both bursts is   τ τ τ γ τ γ τ It allows us to efficiently simulate protein's fluctuation, because we can skip tracking the gene state and the mRNA in the simulation. Following the same process as we obtained the mRNA's steady-state variance σ m ss , 2 as in equation (23), here we obtained the protein's steady-state variance from equation (31) as p ss mp p , 2 In the upper right region of Fig. 2b, we show the normalized error in σ p ss , for equation (32) to that from LNA as equation (26) with the condition of both bursts as following: The largest possible error in the standard deviation is <6.5%, which is quite acceptable.
Protein burst. When the gene's active state is long-lived γ γ ∼ ( ) g p , the mRNA is not produced in bursts. For some genes γ γ γ ∼  ( ) m g p reported in yeast 4 , short-lived mRNA leads to the protein's burst production. Partial simplification of the Langevin equations for the gene expression is still possible if the protein is produced in bursts. For such case, we keep track the gene's activity, skip the short-lived mRNA, and develop the protein's Langevin equation with bursts: The gene-switching probability is k g τ with g = 0 and γ g τ with g = 1. The mean production of the protein number in time τ is τ gk b m p , which is gk m τ, the number of protein burst events (same as the number of mRNA molecules) produced in τ, multiplied by b p , the mean protein burst size. For the noise strength (σ Δp ) in equation (35), g(t)k m can be considered a constant in τ because the state of the gene does not switch frequently in τ. Therefore, the mRNA produced or the protein burst event from equation (35) is a Poisson distribution, with the variance being the same as the mean: Following equation (11), the variance of protein production in equation (35)can be written as In a simulation trial, the noise strength of protein's production σ ∆ ( ) p follows the state of g(t). Also, because g(t) = 1 in some τ steps and g(t) = 0 in others, the average gene state is γ With only protein bursts, based on the condition of γ γ ≥ 10 m p , we can simplify σ p ss   Fig. 2b, comparison of σ p,ss from equation (38) to that from the exact variance in equation (26) shows that the largest possible error is <5%.
In a special condition, where k g and γ g are significantly smaller than other four kinetic parameters in the gene expression model (equations (1) to (3) Overall, our comparison shows that the burst Langevin equation can provide reliable estimations of σ p,ss for all three cases, where bursts are observed in mRNA, protein or both mRNA and protein. For the three cases, we organized the statistical expressions including burst events, burst sizes, variance of production and steady-state variance in Table 1. For three cases of bursts, the variance of the burst event as in equation (11) needs to be modified accordingly.
Neither mRNA nor protein in bursts. For the case that γ g and γ m are close to γ p , simulations with the Gillespie algorithm or the τ-leaping algorithm 35 would work well. The problems of inefficient simulation and poor statistics of rare events due to greatly different reaction rates do not exist in this case. Because mRNA and protein production are not in bursts, all three species in equations (1) to (3) need to be tracked in the simulation to fully account for intrinsic noise of gene expression. Simulation with the Gillespie algorithm has no imposed approximation, and thus the steady-state variance it produces is close to LNA in equation (26). Therefore, the lower left region of Fig. 2b indicates zero error.

Results
Single gene expression. The gene expression model as described in equations (1) to (3) is tested to see how the one-component burst Langevin equation in equation (31) can be used to replace a three-component model. We use the Gillespie algorithm to simulate the model as in equations (1) to (3) to obtain the exact numerical simulation results. We compared the normalized error of a protein's mean p ( ) in the steady state from the burst Langevin simulation to that from the Gillespie simulation.
There are six parameters in the model as in equations (1) to (3). We first chose the unit for time as 1/γ p . In other words, the protein degradation rate was set to 1. We further set γ m = 10 and γ g = 100, for a fast degradation rate in mRNA and an even faster DNA deactivation rate, respectively. This is at the margin of treating both mRNA and protein production with bursts (equation (32)), where the largest error (<6.5%) could be produced, as shown in the upper right region of Fig. 2b. To test the applicable range of the burst Langevin simulation, we scanned the gene activation constant (k g = 1 − 100), which covers the mRNA burst frequency value of 5 to 45 as observed in the experiment 22 . The other parameter we scanned is the protein burst size p , or equivalently protein production rate (k p = 10 − 1000), which also covers the values observed in the experiment 28 . We first fixed the mRNA burst (y = m) both bursts (y = p) protein burst (y = p) condition γ g ≥ 10 γ m γ g ≥ 10 γ m ≥ 100 γ p γ m ≥ 10 γ p

simulated subjects m(t), p(t) p(t) g(t), p(t)
burst event distribution: burst production in Langevin equation: . § With the conditions of γ γ  5000, the range of observed protein copy number in an E. coli cell 28 . We compared the average protein number p ( ) of the steady-state distribution from the burst Langevin simulation to that from the Gillespie simulation and shown in Fig. 3a. When k g ≥ 3 and ≥ b 1 p , corresponding to ≥ p 3, our algorithm's error is <5%. The largest error of p is found at the lower left corner, which is caused by the Gaussian function in the Langevin simulation deviating from the Poisson distribution. Such deviation affects all kinds of Langevin simulations, including our burst Langevin scheme. Figure 3b compares the protein's steady-state distribution with the burst Langevin simulation and Gillespie simulation. The burst Langevin simulation can reproduce the distributions with different combinations of burst frequency (k g ) and burst size (b p ).The normalized error in standard deviation for these cases ranges from −13% to 14% (details included in the supplementary information). Although all the steady-state distributions have some error in σ p,ss , they are sufficiently good for further applications. We further analyzed the sources of such error and discussed them in the supplementary information for interested readers. Figure 4 compares the computational time percentage with the burst Langevin simulation to that with the Gillespie simulation. The burst Langevin simulation always uses less time than the corresponding Gillespie simulation. When particle number is ≤100, the Gillespie simulation is already efficient; thus, the time usage with the burst Langevin simulation is 40% to 80% of that with the Gillespie simulation. However, when the particle number is large, the burst Langevin uses only <10% simulation time as compared with the Gillespie simulation. Therefore, the burst Langevin simulation is efficient.
We also checked the accuracy of the burst Langevin simulation comparing to the Gillespie simulation by varying mRNA burst size. In this test, with a fixed k g = 5, we scanned the other parameter pair: mRNA mean burst size, corresponds to the mRNA burst size observed in mammalian cells 22 . The parameter region tested corresponds to = − p 4 15,000. As shown in Fig. 5a, the errors in p are within ±2%. The steady-state distributions between two methods shows a good agreement (Fig. 5b). Comparison of the standard deviation in the steady state (from −8% to 2.5%) is included in the supplementary information. These results indicate that our burst Langevin algorithm is applicable for a wide range of biological systems.
as seen in the upper-right corner of Fig. 6a. The errors in p 2 in Fig. 6b are from −4% to 8%. The red line in Fig. 6b , the threshold value of the repression. Within the region close to the threshold, the production of p 2 is sensitive to fluctuations in p 1 . However, even in this region nearby, the error at most is only −4%. Therefore, even with a non-linear regulation in this system, the burst Langevin simulation can produce accurate results.
In Fig. 6b, the largest error in p 2 is about 8%, found with = b 30 . In this region, p 1 's copy number is high, and its number fluctuation is also high with such large burst-size pairs. p 2 is fully repressed by high p 1 number and kept at its basal expression level, = p 25 2 .
Here the 8% error comes from a two-to three-particle difference in p 2 , and such error is quite acceptable in stochastic simulations. In the region that we scanned, p 1 's σ p1,ss error is from −8.5% to 2.5%, which mainly follows the value of b m1 (as shown in supplementary information). Such error may propagate through the regulation and cause error in p 2 . However, as seen in Fig. 6b, the error in p 2 has only a mild correlation with increasing b m1 alone. The overall trend of increasing error roughly follows inversely with increasing p 2 from the down-left corner to up-right corner in Fig. 6b, and thus, the errors in the standard deviation of the upstream do not affect the quality of the downstream.

Dynamics of average protein number.
Besides steady-state behaviors, we demonstrate the accuracy of the burst Langevin simulation in dynamics. In Fig. 7, we include the mean protein number dynamics with the burst Langevin simulation and Gillespie simulation. In this model, the gene is activated at time = t 7 by setting k g = 30 and deactivated at t = 14 by setting k g = 3. From Fig. 7, we can see that our simulation algorithm can produce reasonably accurate dynamics in the mean and standard deviation as compared with the exact Gillespie simulation. Only a small deviation can be found in the standard deviation. In this test, we selected

Discussion
In this work, we have developed a Langevin equation that can account for the noise arising from gene expression bursts. We found a large range of parameters with which our burst Langevin simulation can well reproduce the statistics comparing to the Gillespie algorithm, and it covers the protein expression level for more than 4 orders of magnitude. For the case of mRNA (protein) burst production, the deactivation (degradation) rates of the gene (mRNA) should be 10 times faster than that of mRNA (protein). The burst Langevin equation has the flexibility to include only mRNA or protein burst or both bursts. In addition, the gene activation rate constant (k g ) has multiple effects in the mean and variance of the production distribution; thus, it is a critical parameter in the accuracy  (5) and (16) with given b p and b m , respectively, with the other parameters k g = 5, γ g = 100, γ m = 10 and γ p = 1.
of the burst Langevin simulation. When k g ≥ 3, which leads to ≥ p 3, the burst Langevin simulation can produce an accurate steady-state mean and standard deviation as compared with the Gillespie simulation. Furthermore, the burst Langevin simulation can produce accurate dynamics of genetic switching and genes under non-linear regulation. Therefore, the burst Langevin equation is applicable for a wide range of genetic regulation network.
To fully consider all intrinsic noises of a gene, with the Gillespie simulation, all of the three components including the gene state, mRNA and protein are simulated with a total of six reaction channels. However, with the burst Langevin simulation, with both mRNA and protein bursts, the model can be reduced to only protein with two reaction channels. Thus, the burst Langevin simulation uses less computational time and memory than the Gillespie simulation. Therefore, our algorithm is an efficient stochastic simulation method.
Besides efficiency, the Langevin equation allows for easily dissecting the contribution of noise from different sources, because the Gaussian random number for each reaction channel can be easily set as zero. Therefore, the burst Langevin simulation can be used to analyze the dynamics of gene expression noises propagating through the regulation network 8,11 . Moreover, one can introduce a desirable scaling parameter to the noise strength in the Langevin equation for mimicking other possible sources. Therefore, one can reproduce the noises close to that from the Gillespie simulation or that observed in various biological systems.
With the variance expression in equation (11), our burst Langevin equation can be flexible to include various cellular factors. With the assumption that burst events and burst sizes are determined by independent processes, we use two different distributions to estimate the variance of burst production in the burst Langevin equation. For the cases of mRNA or protein burst alone, we use the Poisson distribution for burst events and geometric distribution for burst sizes, whereas for the case of both bursts, the protein burst event's variance is enhanced by the mRNA burst, and the overall variance is obtained by the same expression in equation (11). In general, gene expression in the model (equations (1) to (3)) may be influenced by other factors in the cell 37 , such as chromatin template and promoter structure 38,39 , which leads to different mRNA and protein production distributions other than Poisson or geometric distributions 40 . Also, post-translational modification introduces an additional step after protein production, which can modify the overall protein production rate, k p . Thus, protein burst size b p and variance σ bp 2 may also be modified from the geometric distribution. For a more detailed model 40 , if burst event and size are determined independently, their distributions can be introduced in equation (11) to estimate consequent burst production variance. Therefore, the burst Langevin equation in equation (31) can be modified accordingly to include other factors or more detailed steps in the gene expression model. Different steps in gene expression are implemented by different molecular machineries. A gene is activated by chromatin remodeling 4,21 , whereas mRNA is produced by RNA polymerase and protein is produced by ribosome. There is no machinery competition between different steps. Therefore, we assumed that different steps in gene expression are independent processes and derived the variance of burst production. However, under different physiological conditions in bacteria 41,42 , negative correlations are reported between transcription and translation. And thus, regulation or competition for resources may exist between transcription and translation. Yet, once a gene is expressing, protein requires more biomass than transcripts. Protein synthesis also consumes most of the energy, but other processes in gene expression consume a non-relevant amount (<10%) of energy 43,44 . Therefore, energy competition may only be possible in some extreme conditions and assuming different steps in gene expression as independence processes is valid.
The first fraction is still the same as that from equation (43) for only burst production. When εp ≤ 1, we multiply the original selection of τ = k b b 1/ g m p by b p to include a whole burst event. Further detailed models included effects of chromatin template and promoter structure 38,39 , which change the waiting time distribution for next burst event. And thus, the term τ = k b 1/ g m , which is derived from an exponential distribution, needs to be modified accordingly if the chromatin structural change is considered. Other detailed considerations are included in the supplementary information accompanying this work.
With a selected τ, we can determine the protein's burst production number and degradation number according to equation (31). When p(t) is small, a randomly selected degradation number may be so large that negative p(t + τ) is obtained. If p(t + τ) becomes negative, we take half of the originally selected τ, such that particle changes become small, and repeat this procedure if necessary, until p(t + τ) ≥ 0. Such half-τ scheme is suggested in the work of Cao et al. 35 .
Removing negative production. The protein's production number in each τ is calculated by the second term in the right-hand side of equation (31) and then rounded to the nearest integers. Shown in Fig. 8a is the Gaussian distribution we used to approximate the burst production. With τ = 0.03, the mean production number is 3 and the standard deviation is about 13.5. From such Gaussian distribution, there is a nearly 40% chance to obtain a negative random number, which is then rounded to a negative production number when it is ≤−0.5. Even with a high expression level, with k g = 100 and = b 100 p , the negative part still can be >15% of burst production. A more complete profile for the negative burst percentages is included in the supplementary information. Such negative production also reduces the protein number as the degradation process and leads to many sudden drops in the blue trajectory in Fig. 8b. Negative production is an artifact due to the Gaussian distribution used in the Langevin simulation.
We used a rather simple approach to remove such negative productions and keep the mean of the distribution at the same time. When a negative production number is selected, the negative number is temporarily stored for accumulation with the next production. The production in the current time step is zero, as the silent moment between two bursts. Only when the accumulated protein number becomes positive is there a burst with such a positive number, and the protein number is increased by production. As seen in Fig. 8b, the unrealistic sudden drop is removed in the red trajectory. We note that the red trajectory has a similar shape due to the rapid production and slow degradation as in Fig. 1b. Data Availability. All data generated or analysed during this study are included in this published article (and its supplementary information file).