## Introduction

Gene expression in living cells is a complex stochastic process characterized by various probabilistic chemical reactions, giving rise to spontaneous fluctuations in the abundances of proteins and mRNAs1,2,3,4. Recent advances in experimental techniques, such as flow cytometry, fluorescence microscopy, and scRNA-Seq, have generated large amounts of single-cell gene expression data. This raises the challenging question of whether and how one can infer the topological structure of a gene regulatory network from such massive but often noisy data. Considering the complexity of gene regulatory networks, this may seem to be a daunting task. However, the situation becomes much simpler if we focus on a particular gene of interest and the feedback loop regulating it5. In general, there are only three types of gross topological structures: no feedback, positive feedback, and negative feedback (see Fig. 1a), and different types of networks can give rise to similarly shaped, usually unimodal, steady-state distributions of gene expression. Therefore, it is highly nontrivial to ask whether the information of feedback topology can be extracted from single-cell measurements of this gene.

## Results

### Model and steady-state protein distribution

Recently, significant progress has been made in the field of single-cell stochastic gene expression6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21. Based on the central dogma of molecular biology, the kinetics of stochastic gene expression in a single cell can be described by a model with three stages consisting of transcription, translation, and switching of the promoter between an active and an inactive epigenetic form (see Fig. 1b). This model is similar to the three-stage model introduced in12 but with a critical addition of nonlinear feedback regulation. The biochemical state of the gene of interest can be described by three variables: the activity $$i$$ of its promoter with $$i=1$$ and $$i=0$$ corresponding to the active and inactive forms, respectively, the copy number $$m$$ of the mRNA transcript, and the copy number $$n$$ of the protein product. The evolution of the three-stage model can be mathematically described by the Markov dynamics illustrated in Fig. 1c. Here $$s$$ and $$r$$ are the transcription rates when the promoter is active and inactive, respectively (the basal transcription rate $$r$$ is usually not zero), $$u$$ is the translation rate, and $$v$$ and $$d$$ are the degradation rates of the mRNA and protein, respectively. Since the network has feedback regulation, the protein copy number $$n$$ will directly or indirectly affect the switching rates $${a}_{n}$$ and $${b}_{n}$$ of the promoter between the active and inactive forms. Since many genes have complex epigenetic controls including dissociation of repressors, association of activators, or chromatin remodeling, we do not impose any restrictions on the specific functional forms of $${a}_{n}$$ and $${b}_{n}$$. In15, the authors considered the case of linear feedback regulation with $${a}_{n}=a+un$$ and $${b}_{n}=b$$, where $$a$$ is the spontaneous contribution and $$un$$ is the feedback contribution with $$u$$ measuring the feedback strength (in that expression $$u$$ is not the translation rate defined above). However, recent single-cell experiments on transcription of mammalian cells22 have suggested that $${a}_{n}$$ and $${b}_{n}$$ are often saturated when $$n\gg 1$$ and thus are highly nonlinear. In the present work, we consider a more general case by allowing arbitrary nonlinearity.

In most applications, the switching rates of the promoter are fast10,17 and the effective transcription rate of the gene is given by $${c}_{n}=({a}_{n}s+{b}_{n}r)/({a}_{n}+{b}_{n})$$. It is critical to note that the information of network topology is implicitly characterized by $${c}_{n}$$. If the network has a positive-feedback (negative-feedback) loop, then $${c}_{n}$$ is an increasing (decreasing) function of $$n$$. If the network has no feedback, $${c}_{n}$$ is independent of $$n$$. Let $${p}_{n}$$ denote the steady-state probability of having $$n$$ protein molecules. Experimentally, the lifetime of the mRNA is usually much shorter compared to that of its protein counterpart12. Once an mRNA is synthesized, it can either produce a protein with probability $$p=u/(u+v)$$ or be degraded with probability $$q=v/(u+v)$$. Let $$\lambda =v/d$$ denote the ratio of the protein and mRNA lifetimes. When $$\lambda \gg 1$$, the original Markov model can be simplified to a reduced model with geometrically distributed translation bursts23 and the steady-state distribution of the protein copy number can be calculated analytically (Supplementary Information):

$${p}_{n}=A\frac{{p}^{n}}{n!}\frac{{c}_{0}}{d}(\frac{{c}_{1}}{d}+1)\cdots (\frac{{c}_{n-1}}{d}+n-1),$$
(1)

where $$A$$ is a normalization constant. If the network has no feedback, then $${c}_{n}=c$$ is a constant and the above distribution reduces to the well-known negative-binomial distribution

$${p}_{n}=\frac{{p}^{n}}{n!}\frac{{\rm{\Gamma }}(c/d+n)}{{\rm{\Gamma }}(c/d)}{q}^{c/d},$$

where $${\rm{\Gamma }}(x)$$ is the gamma function. This is consistent with the results obtained in7,21.
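The reduction from Eq. (1) to the negative-binomial form can be checked numerically. The following is a minimal sketch, not the authors' code: the helper names `steady_state` and `negative_binomial`, the truncation at `n_max`, and all parameter values are assumptions made for illustration. It builds the distribution from the ratio $${p}_{n+1}/{p}_{n}=p({c}_{n}/d+n)/(n+1)$$ implied by Eq. (1) and compares it with the closed form for constant $${c}_{n}=c$$.

```python
import math

def steady_state(c, p, d, n_max):
    """Return [p_0, ..., p_n_max] from Eq. (1), normalized over the truncated range.

    c is a callable n -> c_n (effective transcription rate), p = u/(u+v),
    and d is the protein degradation rate.
    """
    # Eq. (1) implies p_{n+1}/p_n = p * (c_n/d + n) / (n + 1); accumulate in log space
    log_w = [0.0]
    for n in range(n_max):
        log_w.append(log_w[-1] + math.log(p * (c(n) / d + n) / (n + 1)))
    shift = max(log_w)
    w = [math.exp(x - shift) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

def negative_binomial(c_over_d, p, n):
    """Closed-form no-feedback distribution quoted after Eq. (1)."""
    q = 1.0 - p
    log_pn = (n * math.log(p) - math.lgamma(n + 1)
              + math.lgamma(c_over_d + n) - math.lgamma(c_over_d)
              + c_over_d * math.log(q))
    return math.exp(log_pn)

# Hypothetical parameter values, chosen only for illustration
p, d, c0 = 0.9, 1.0, 2.0
dist = steady_state(lambda n: c0, p, d, n_max=600)
max_err = max(abs(dist[n] - negative_binomial(c0 / d, p, n)) for n in range(200))
print(f"largest deviation from the negative binomial: {max_err:.2e}")
```

Working with logarithms of the weights avoids overflow in the product of Eq. (1) when $$n$$ is large; the normalization constant $$A$$ is then fixed by summing over the truncated range.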

In fact, the parameter $$q$$ has important statistical implications. Since $${c}_{n}\le s$$, it follows from Eq. (1) that $${p}_{n+1}/{p}_{n}=p(n+{c}_{n}/d)/(n+1)\approx p$$ when $$n\gg 1$$. This further suggests that $${p}_{n+k}\approx {p}^{k}{p}_{n}={e}^{k\log (1-q)}{p}_{n}\approx {e}^{-qk}{p}_{n}$$ when $$q\ll p$$. This shows that the steady-state probability $${p}_{n}$$ decays exponentially with respect to the protein copy number $$n$$ when $$n\gg 1$$, with $$q$$ being the decaying rate of the exponential tail of the steady-state protein distribution. Here $$q\ll p$$ is justified because $$p/q=u/v$$ is the average number of proteins synthesized per mRNA lifetime, which is relatively large in living cells and typically on the order of 100 for an E. coli gene24. Identifying $$q$$ as an experimentally accessible quantity is of basic importance, as will be shown later.
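The tail behavior can be illustrated with a few lines of code. This is only a sketch under assumed values: the saturating rate `c`, the helper `tail_ratio`, and every number below are hypothetical. It evaluates the ratio $${p}_{n+1}/{p}_{n}$$ from Eq. (1) at increasing $$n$$ and compares $${p}^{k}$$ with $${e}^{-qk}$$.

```python
import math

def tail_ratio(c, p, d, n):
    """Ratio p_{n+1}/p_n implied by Eq. (1)."""
    return p * (n + c(n) / d) / (n + 1)

# Hypothetical saturating negative-feedback rate, bounded above by s
s, r, a, h = 20.0, 2.0, 50.0, 2
c = lambda n: (a * s + n**h * r) / (a + n**h)
p, d = 0.95, 1.0
q = 1.0 - p

for n in (10, 100, 1000, 10_000):
    # 1 - ratio approaches q as n grows
    print(n, 1.0 - tail_ratio(c, p, d, n))

q_est = 1.0 - tail_ratio(c, p, d, 10_000)

# Exponential-tail approximation: p^k = exp(k log(1 - q)) versus exp(-q k)
k = 20
exact_factor = p**k
approx_factor = math.exp(-q * k)
print(exact_factor, approx_factor)
```

In this example the correction to the limiting ratio shrinks only like $$1/n$$, so a tail-based estimate of $$q$$ presumably needs data that reach well into the tail.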

### Decomposition of the protein fluctuations

Experimentally, spontaneous stochastic fluctuations, often referred to as noise, in the protein abundance are usually measured by the squared relative standard deviation $$\eta ={\sigma }^{2}/{\langle n\rangle }^{2}$$, where $$\langle n\rangle$$ is the mean and $${\sigma }^{2}$$ is the variance25. With the analytical steady-state protein distribution, it can be shown that the noise $$\eta$$ can be decomposed into three different terms or two different terms as (Supplementary Information)

$$\eta =\frac{1}{\langle n\rangle }+\frac{d}{v\langle m\rangle }+{\eta }_{f},\qquad \left\{\begin{array}{ll}{\eta }_{f}=0 & {\rm{no}}\,{\rm{feedback}}\\ {\eta }_{f} > 0 & {\rm{positive}}\,{\rm{feedback}}\\ {\eta }_{f} < 0 & {\rm{negative}}\,{\rm{feedback}}\end{array}\right.$$
(2)
$$\eta =\frac{1}{q\langle n\rangle }+{\eta }_{f},$$
(3)

where $$1/\langle n\rangle$$ is the Poisson noise from individual births and deaths of the protein, $$d/(v\langle m\rangle )$$ is the noise due to fluctuations in the mRNA abundance, and $${\eta }_{f}={\rm{Cov}}(n,{c}_{n})/(\langle n\rangle \langle {c}_{n}\rangle )$$ is the relative covariance between $$n$$ and $${c}_{n}$$, which characterizes the strength of feedback regulation. We stress here that when the promoter switching rates are fast, the above decomposition formula and the expression of $${\eta }_{f}$$ hold exactly without any approximation, even when the nonlinearity of feedback regulation is very high.
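One way to see why the two-term form holds exactly is the following moment calculation. It is a sketch based only on the ratio implied by Eq. (1), and is not necessarily the route taken in the Supplementary Information. Eq. (1) gives

$$(n+1){p}_{n+1}=p\left(\frac{{c}_{n}}{d}+n\right){p}_{n}.$$

Summing over $$n\ge 0$$, and then summing again after multiplying both sides by $$n+1$$, yields

$$q\langle n\rangle =\frac{p}{d}\langle {c}_{n}\rangle ,\qquad q\langle {n}^{2}\rangle =\frac{p}{d}\langle n{c}_{n}\rangle +\langle n\rangle .$$

Eliminating $$p/(dq)=\langle n\rangle /\langle {c}_{n}\rangle$$ from the second relation and dividing by $${\langle n\rangle }^{2}$$ gives

$$\frac{{\sigma }^{2}}{{\langle n\rangle }^{2}}=\frac{1}{q\langle n\rangle }+\frac{\langle n{c}_{n}\rangle -\langle n\rangle \langle {c}_{n}\rangle }{\langle n\rangle \langle {c}_{n}\rangle },$$

in which the last term is $${\rm{Cov}}(n,{c}_{n})/(\langle n\rangle \langle {c}_{n}\rangle )$$. No assumption on the functional form of $${c}_{n}$$ enters this calculation.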

If the network has no feedback, then $${c}_{n}$$ is a constant and $${\eta }_{f}=0$$. It is well known that the covariance between a random variable and an increasing (decreasing) function of this random variable must be positive (negative). Therefore, if the network has a positive-feedback loop, then $${c}_{n}$$ is an increasing function of $$n$$ and $${\eta }_{f} > 0$$. Conversely, if the network has a negative-feedback loop, then $${c}_{n}$$ is a decreasing function of $$n$$ and $${\eta }_{f} < 0$$. As a result, the sign of $${\eta }_{f}$$ is completely determined by the network topology, and we shall call $${\eta }_{f}$$ the feedback coefficient. The above analysis clearly explains previous experimental observations that positive feedback generally amplifies noise26 and negative feedback generally reduces noise27.
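The sign rule can be checked numerically on the distribution of Eq. (1). The sketch below is illustrative only: the Hill-type rates, the helper names `steady_state` and `noise_terms`, the truncation `n_max`, and all parameter values are assumptions. It computes the total noise, the term $$1/(q\langle n\rangle )$$, and the relative covariance between $$n$$ and $${c}_{n}$$ for the three topologies.

```python
import math

def steady_state(c, p, d, n_max):
    """Truncated, normalized distribution built from the ratio implied by Eq. (1)."""
    log_w = [0.0]
    for n in range(n_max):
        log_w.append(log_w[-1] + math.log(p * (c(n) / d + n) / (n + 1)))
    shift = max(log_w)
    w = [math.exp(x - shift) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

def noise_terms(c, p, d, n_max=2000):
    """Return (eta, 1/(q<n>), eta_f) computed directly from the distribution."""
    dist = steady_state(c, p, d, n_max)
    q = 1.0 - p
    mean_n = sum(n * w for n, w in enumerate(dist))
    var_n = sum((n - mean_n) ** 2 * w for n, w in enumerate(dist))
    mean_c = sum(c(n) * w for n, w in enumerate(dist))
    cov_nc = sum((n - mean_n) * (c(n) - mean_c) * w for n, w in enumerate(dist))
    return var_n / mean_n**2, 1.0 / (q * mean_n), cov_nc / (mean_n * mean_c)

# Hypothetical parameters; s > r, so the first Hill form increases with n
p, d = 0.9, 1.0
s, r, a, h = 10.0, 1.0, 400.0, 2
rates = {
    "no feedback": lambda n: 5.0,
    "positive feedback": lambda n: (a * r + n**h * s) / (a + n**h),
    "negative feedback": lambda n: (a * s + n**h * r) / (a + n**h),
}
results = {name: noise_terms(c, p, d) for name, c in rates.items()}
for name, (eta, eta_free, eta_f) in results.items():
    print(f"{name}: eta={eta:.4f}  1/(q<n>)={eta_free:.4f}  eta_f={eta_f:+.4f}")
```

In each case the first number should equal the sum of the other two up to rounding, which is the content of Eq. (3).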

In the previous literature, there are confusing or even contradictory statements about the feedback-noise relationship. Some studies claimed that positive feedback reduces noise28, while negative feedback amplifies noise29. The reason for these seemingly contradictory results has been analyzed in15,20 and here we shall use our noise decomposition formula to provide a clearer explanation. For a positive-feedback (negative-feedback) network, $$\eta$$ is the total noise and $$\pm {\eta }_{f}$$ is the noise amplified (reduced). Therefore, $$\eta -{\eta }_{f}=1/q\langle n\rangle$$ can be thought of as the feedback-free noise. In general, if all other rate constants remain unchanged, then positive (negative) feedback will lead to an increase (decrease) in the protein mean $$\langle n\rangle$$ (see ref. 10) and thus lead to a decrease (increase) in the feedback-free noise $$1/q\langle n\rangle$$. This decrease (increase) in the feedback-free noise may counteract the positive (negative) contribution of the feedback coefficient $${\eta }_{f}$$ and give rise to an anomalous decrease (increase) in the total noise $$\eta$$. This explains why some experiments have observed anomalous noise suppression (amplification) in networks with positive (negative) feedback.

However, from the physical perspective, the feedback-free noise and feedback coefficient have completely different origins: the former characterizes fluctuations from individual births and deaths of the protein and mRNA, while the latter reflects the contribution of feedback regulation. Therefore, it seems logically insufficient to study the effect of feedback regulation on the feedback-free noise by fixing the underlying biochemical rate constants. In fact, what positive (negative) feedback actually amplifies (reduces) is the very part of fluctuations that cannot be explained by the feedback-free noise.

### Bounds for the protein noise

Negative feedback proves to be most interesting because it is responsible for the stability of a cell27. Since negative feedback reduces noise, it is natural to ask to what extent the noise is inevitable and whether the feedback coefficient $${\eta }_{f}$$ could be strong enough such that the noise $$\eta$$ is approaching zero5,30. In fact, for the three-stage model, the upper and lower bounds of the noise $$\eta$$ are given by

$$\frac{1}{q\langle n\rangle }\frac{1}{1+\alpha p/dq}\le \eta < \frac{1}{q\langle n\rangle },$$
(4)

where $$\alpha =\sup \{|c^{\prime} (x)|:x > 0\}$$ is the steepness of the regulatory function $$c(x)$$, obtained from $${c}_{n}$$ by replacing $$n$$ with a positive real number $$x$$, and the term $$\alpha p/dq$$ is of the order of one for a wide range of biologically relevant parameters (Supplementary Information). These bounds provide the limits on the ability of a negative-feedback loop to suppress protein fluctuations. We stress here that this lower bound is new and is different from the one derived in30. Our lower bound performs better in the regime of strong noise suppression (Supplementary Information). In the literature, the effective transcription rate $$c(x)$$ is often chosen as the generalized Hill function $$c(x)=(as+{x}^{h}r)/(a+{x}^{h})$$ with $$h\ge 1$$ being the Hill coefficient5,10, in which case the steepness is

$$\alpha =\frac{{(h-1)}^{1-1/h}{(h+1)}^{1+1/h}}{4h}\times \frac{s-r}{{a}^{1/h}}.$$
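The closed-form steepness can be compared with a direct numerical estimate of $$\sup |c^{\prime} (x)|$$. The sketch below assumes $$s > r$$ and uses hypothetical parameter values; the function names, the grid range `x_max`, and the step count are illustrative choices, not part of the original analysis.

```python
def hill_rate(x, s, r, a, h):
    """Generalized Hill function c(x) = (a s + x^h r) / (a + x^h)."""
    return (a * s + x**h * r) / (a + x**h)

def hill_steepness(s, r, a, h):
    """Closed-form steepness alpha for the generalized Hill function (s > r assumed)."""
    return ((h - 1) ** (1 - 1 / h) * (h + 1) ** (1 + 1 / h) / (4 * h)
            * (s - r) / a ** (1 / h))

def numeric_steepness(s, r, a, h, x_max=200.0, steps=20_000):
    """Grid estimate of sup |c'(x)| on (0, x_max) using central differences."""
    dx = x_max / steps
    best = 0.0
    for i in range(1, steps):
        x = i * dx
        slope = (hill_rate(x + dx, s, r, a, h) - hill_rate(x - dx, s, r, a, h)) / (2 * dx)
        best = max(best, abs(slope))
    return best

# Hypothetical parameter values
s, r, a, h = 10.0, 1.0, 400.0, 2
alpha_closed = hill_steepness(s, r, a, h)
alpha_grid = numeric_steepness(s, r, a, h)
print(alpha_closed, alpha_grid)
```

For these values the maximum slope is attained where $${x}^{h}=a(h-1)/(h+1)$$, which is the stationary point of $$|c^{\prime} (x)|$$ for this functional form.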

For a negative-feedback network, $$\eta -{\eta }_{f}$$ is the feedback-free noise, $$-{\eta }_{f}$$ is the noise reduced, and $$\eta$$ is the total noise. Then the efficiency of the negative-feedback network, as a noise filter, can be defined as $$\gamma =-{\eta }_{f}/(\eta -{\eta }_{f})$$. The lower bound in Eq. (4) reveals a general biophysical principle: The efficiency of a negative-feedback network must satisfy $$0 < \gamma \le 1/(1+dq/\alpha p)$$. This fact is similar to Carnot’s theorem in classical thermodynamics, which claims that the theoretical maximum efficiency of any heat engine must be smaller than 1.
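The bounds of Eq. (4) and the efficiency limit can be checked on a concrete negative-feedback example. This is a minimal sketch under assumed values: the Hill-type rate, the helper `steady_state`, the truncation `n_max`, and all parameters are hypothetical.

```python
import math

def steady_state(c, p, d, n_max=2000):
    """Truncated, normalized distribution built from the ratio implied by Eq. (1)."""
    log_w = [0.0]
    for n in range(n_max):
        log_w.append(log_w[-1] + math.log(p * (c(n) / d + n) / (n + 1)))
    shift = max(log_w)
    w = [math.exp(x - shift) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical negative-feedback parameters (s > r, Hill coefficient h)
p, d = 0.9, 1.0
s, r, a, h = 10.0, 1.0, 400.0, 2
q = 1.0 - p
c = lambda n: (a * s + n**h * r) / (a + n**h)

# Steepness of the generalized Hill function, as given in the text
alpha = ((h - 1) ** (1 - 1 / h) * (h + 1) ** (1 + 1 / h) / (4 * h)
         * (s - r) / a ** (1 / h))

dist = steady_state(c, p, d)
mean_n = sum(n * w for n, w in enumerate(dist))
var_n = sum((n - mean_n) ** 2 * w for n, w in enumerate(dist))

eta = var_n / mean_n**2
eta_free = 1.0 / (q * mean_n)                         # upper bound in Eq. (4)
eta_lower = eta_free / (1.0 + alpha * p / (d * q))    # lower bound in Eq. (4)
gamma = 1.0 - eta / eta_free                          # equals -eta_f / (eta - eta_f)
gamma_max = 1.0 / (1.0 + d * q / (alpha * p))

print(f"{eta_lower:.4f} <= {eta:.4f} < {eta_free:.4f}")
print(f"efficiency {gamma:.3f}, upper limit {gamma_max:.3f}")
```

The expression used for `gamma` follows from Eq. (3), since $$\eta -{\eta }_{f}=1/q\langle n\rangle$$.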

If all other cellular factors are constant, the protein will display a small-number Poisson noise24. When $$\alpha > d$$, the lower bound in Eq. (4) is smaller than $$1/\langle n\rangle$$, which shows that $$\eta$$ may be even smaller than the Poisson noise in the negative-feedback case (see Fig. 2b). Recent experiments have shown that although the variance of expression levels is larger than the mean for most genes, there are still some genes whose variance is less than the mean31. This fact is well explained by our theory. From Eq. (3), if the network has no feedback or a positive-feedback loop, $$\eta$$ is always larger than the Poisson noise (see Fig. 2a). In the positive-feedback case, similar upper and lower bounds for the noise $$\eta$$ can also be obtained (Supplementary Information), which provide the limits on the ability of a positive-feedback loop to enhance protein fluctuations.

### Inference of feedback topology using single-cell data

When a network has nonlinear feedback regulation, the mean and variance are not enough to determine the steady-state protein distribution and the information of higher-order moments will play a crucial role. In fact, Eq. (3) can be rewritten in a more illuminating form as

$${\eta }_{f}=\frac{{\sigma }^{2}}{{\langle n\rangle }^{2}}-\frac{1}{q\langle n\rangle }.$$
(5)

This equation is of crucial importance because it bridges the feedback topology of a gene circuit and experimentally accessible measurements. In particular, it reveals a quantitative relation between the feedback coefficient $${\eta }_{f}$$, whose sign is fully determined by the network topology, and the digital features of the steady-state protein distribution, characterized by the mean $$\langle n\rangle$$, variance $${\sigma }^{2}$$, and decaying rate $$q$$, which reflects the overall effect of higher-order moments. This provides an effective method to extract the topological information of a gene regulatory network from single-cell gene expression data. From single-cell data, the three digital features, and thus the feedback coefficient $${\eta }_{f}$$, can be estimated robustly (Supplementary Information). If $${\eta }_{f}$$ is significantly larger (smaller) than zero, one has good reasons to believe that there is a positive-feedback (negative-feedback) loop regulating this gene.

In single-cell experiments such as flow cytometry and fluorescence microscopy, one usually obtains data of protein concentrations, instead of protein copy numbers. Let $$x=n/V$$ be a continuous variable representing the protein concentration, where $$V$$ is a constant compatible with the macroscopic scale. It is easy to see that the noise $$\eta ={\sigma }^{2}/{\langle n\rangle }^{2}$$ will not be affected by the scaling constant $$V$$ and thus is dimensionless. In terms of the protein concentration, the mean will become $$\langle n\rangle /V$$ and the decaying rate will become $$qV$$ (Supplementary Information). Therefore, the product of these two terms is also dimensionless. This indicates that the above method not only applies to single-molecule data of protein copy numbers, but also applies to single-cell data of protein concentrations. The above analysis also suggests a crucial difference between the two decomposition formulas (2) and (3): The former only applies to data of protein copy numbers, while the latter also applies to data of protein concentrations.
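The scale invariance described above is easy to confirm on Eq. (5). In the sketch below the function name `feedback_coefficient` and all numerical values, including the scaling constant `V`, are hypothetical and serve only to illustrate the argument.

```python
def feedback_coefficient(mean, variance, decay_rate):
    """Eq. (5): eta_f = sigma^2 / <n>^2 - 1 / (q <n>)."""
    return variance / mean**2 - 1.0 / (decay_rate * mean)

# Hypothetical copy-number statistics
mean_n, var_n, q = 40.0, 496.0, 0.1

# Hypothetical scaling constant relating copy number to concentration, x = n / V
V = 250.0

eta_f_counts = feedback_coefficient(mean_n, var_n, q)
# Under x = n / V: mean -> mean / V, variance -> variance / V^2, decaying rate -> q V
eta_f_conc = feedback_coefficient(mean_n / V, var_n / V**2, q * V)

print(eta_f_counts, eta_f_conc)
```

Both calls return the same value because $$V$$ cancels in $${\sigma }^{2}/{\langle n\rangle }^{2}$$ and in the product of the mean and the decaying rate.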

## Experimental validation

To validate our theory, we apply it to a synthetic gene circuit stably integrated in human kidney cells32, as illustrated in Fig. 3 (the orthogonality of a synthetic network can minimize “extrinsic” noise). In this circuit, a bidirectional promoter is designed to control the expression of two fluorescent proteins: zsGreen and dsRed. The promoter is activated in the presence of Doxycycline (Dox). The green fluorescent protein, zsGreen, is fused upstream from the transcriptional repressor LacI. The LacI protein binds to its own gene and inhibits the transcription of its own mRNA, forming a negative-feedback loop. The negative-feedback strength can be tuned by induction of Isopropyl $$\beta$$-D-1-thiogalactopyranoside (IPTG). As the control architecture, the red fluorescent protein, dsRed, is not regulated by IPTG induction, forming a network with no feedback. The steady-state levels of the zsGreen and dsRed fluorescence are measured under a wide range of IPTG concentrations and two Dox concentrations (low and high) by using flow cytometry.

For each fixed pair of IPTG and Dox concentrations, we can estimate the mean $$\langle n\rangle$$, variance $${\sigma }^{2}$$, and decaying rate $$q$$ for the steady-state distribution of the zsGreen or dsRed fluorescence. Then the feedback coefficient $${\eta }_{f}$$ can be estimated from Eq. (5). In the high Dox case, Fig. 4a,b illustrate the noise $$\eta$$, feedback-free noise $$\eta -{\eta }_{f}$$, and feedback coefficient $${\eta }_{f}$$ of the zsGreen and dsRed proteins under different IPTG concentrations, respectively. For the zsGreen protein, the feedback coefficient $${\eta }_{f}$$ is negative under all IPTG concentrations. As the IPTG concentration increases, the negative feedback becomes weaker and the feedback coefficient $${\eta }_{f}$$ tends to zero. In contrast, for the dsRed protein, the feedback coefficient $${\eta }_{f}$$ fluctuates around zero in a narrow range under different IPTG concentrations. These results agree closely with our theory. As a result, our method correctly extracts the topological information of the synthetic gene circuit both qualitatively and quantitatively. In the low Dox case, the noise $$\eta$$, feedback-free noise $$\eta -{\eta }_{f}$$, and feedback coefficient $${\eta }_{f}$$ of the zsGreen and dsRed proteins are illustrated in Fig. 4c,d, respectively, and similar conclusions can be drawn.

Although it has been observed that negative feedback suppresses molecular fluctuations32, it remains difficult to quantify the corresponding effect5. Our theory provides a quantitative characterization of this effect. In the high Dox case, the negative-feedback effect is the strongest when the IPTG concentration is zero. In this situation, the feedback-free noise is $$\eta -{\eta }_{f}=0.49$$ and the feedback coefficient is $${\eta }_{f}=-0.18$$, which indicates that negative feedback reduces noise by 36.7%. The efficiency $$\gamma$$ of the negative-feedback network drops significantly with the increase of the IPTG concentration and is close to zero when the concentration reaches 6.2 μM.
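As a quick check of the arithmetic behind this figure, using only the two values quoted above and the definition of the efficiency $$\gamma$$ given earlier:

$$\gamma =\frac{-{\eta }_{f}}{\eta -{\eta }_{f}}=\frac{0.18}{0.49}\approx 0.367,\qquad \eta =0.49-0.18=0.31.$$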

One of the potential applications of our theory is to provide a mechanism-driven method to identify the differentially expressed genes (DEGs) of two different cell populations such as tumor and non-tumor tissues. Most existing methods search for DEGs by identifying differences in the mean levels of the two cell populations under some a priori assumptions on the protein or mRNA distribution, such as the negative binomial distribution31. However, the effect of noise amplification or suppression caused by feedback loops is not addressed by these methods, which may result in incorrect predictions (Supplementary Information). Our theory indicates that even if the means and variances of the two cell populations are both very close, one is still able to find the DEGs by detecting the difference in feedback topology. If the signs of the estimated feedback coefficients $${\eta }_{f}$$ of the two cell populations are different, one has good reasons to believe that there is a change in the topological structure of the underlying gene regulatory network when a non-tumor tissue becomes a tumor one.

## Discussion and Conclusions

Here we present a comprehensive analysis of the three-stage model of stochastic gene expression with nonlinear feedback regulation. By taking the limit of a large ratio of protein to mRNA lifetimes, we derive the analytical steady-state distribution of the protein copy number. Furthermore, we decompose the protein noise according to different biophysical origins. The resulting decomposition formula reveals a quantitative relation between stochastic fluctuations and feedback topology at the single-molecule level. In particular, we show that the protein noise $$\eta$$ can be decomposed into the sum of two parts: the feedback-free noise $$1/q\langle n\rangle$$ and the feedback coefficient $${\eta }_{f}$$, whose sign is totally determined by the network topology. Both parts can be estimated robustly from single-cell gene expression data via three experimentally accessible quantities: the mean $$\langle n\rangle$$, variance $${\sigma }^{2}$$, and decaying rate $$q$$. This relation not only enables us to quantify the effects of noise amplification or suppression caused by feedback loops, but also allows us to extract the topological information of the underlying gene regulatory network from single-cell gene expression data. The feasibility of this approach is validated quantitatively by single-cell data analysis of a synthetic gene circuit integrated in human kidney cells.

We stress that our results do not depend on the specific functional form of the effective transcription rate $${c}_{n}$$ except for its monotonicity, which makes our theory highly general. One of the most powerful parts of our theory is that it can be applied to gene regulatory networks with highly nonlinear feedback. In the present paper, all the derivations are based on the assumption of rapid promoter switching, under which the fluctuations due to promoter switching are averaged out. Intuitively, in the regime of slow promoter switching, our noise decomposition formula (3) should be amended as

$$\eta =\frac{1}{q\langle n\rangle }+{\eta }_{f}+{\eta }_{s},$$
(6)

where $${\eta }_{s} > 0$$ is the noise due to promoter switching. Because of the contribution of $${\eta }_{s}$$, the difference between the total noise $$\eta$$ and feedback-free noise $$1/q\langle n\rangle$$ must be positive in positive-feedback networks and may be either positive or negative in negative-feedback networks due to the competition of $${\eta }_{f} < 0$$ and $${\eta }_{s} > 0$$. The above analysis is in full agreement with our numerical simulations in Fig. 5. Neglecting $${\eta }_{s}$$ in the present paper is the price paid for deriving an analytical protein distribution in networks with nonlinear feedback regulation.

In fact, the idea of noise decomposition in terms of different biophysical origins was first proposed by Paulsson in his pioneering work25. However, this work was focused on the decomposition of the local noise around the fixed point of the underlying biochemical reaction system, instead of the global noise of the entire probability distribution, by using the fluctuation-dissipation theorem, also called first-order van Kampen’s expansion24. In networks with no feedback, a decomposition of noise into the feedback-free noise $$1/q\langle n\rangle$$ and promoter switching noise $${\eta }_{s}$$ can be found in12,33. In the present work, we obtain a noise decomposition in networks with feedback regulation, albeit in the regime of fast promoter switching. There are two major advantages of our decomposition formula (3). First, it can be applied to the situation when the nonlinearity of feedback regulation is very high. Second, all the contributing terms in the decomposition formula can be estimated robustly from single-cell gene expression data.

In the regime of slow promoter switching, it is difficult to give an intrinsic definition of the promoter switching noise $${\eta }_{s}$$ since the promoter switching rates $${a}_{n}$$ and $${b}_{n}$$ could be both nonlinear functions of the protein copy number $$n$$. In fact, an alternative definition of $${\eta }_{s}$$ has been proposed in20 with the aid of the macroscopic limit of a piecewise-deterministic Markov process. By assuming linear feedback regulation and ignoring the mRNA kinetics, the authors decomposed the protein noise into the superposition of the protein birth-death noise, promoter switching noise, and correlation noise. Although their correlation noise is similar to our feedback coefficient (see the green curves in Figs. 2 and 3 of Ref.20), their protein birth-death noise is a constant independent of feedback regulation and thus is very different from our feedback-free noise.

Finally, we would like to point out that the lower bound of the protein noise in negative-feedback networks was first derived in5 by using concepts in information theory. However, this work is based on the diffusion approximation, with approximated Gaussian fluctuations, of the underlying discrete Markov model. A lower bound of the protein noise without diffusion approximation was derived recently in30. Our lower bound (4) is more explicit than the one obtained in30 and is tighter in the regime of strong noise suppression.

Although we have shown how single-cell measurements may be used to reveal the feedback sign of a gene regulatory network, it is conceivable that in the near future, further advances in live-cell imaging with single-molecule resolution could allow the theory to be tested at the single-molecule level.

## Methods

The numerical simulations in Figs. 2 and 5 are based on the Gillespie algorithm. The single-cell gene expression data of the synthetic gene circuit analyzed during this study are included in the published article32.
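For readers who wish to reproduce simulations of this kind, a direct-method Gillespie simulation of the full three-stage model can be sketched as follows. This is a generic illustration and not the authors' code: the function name `gillespie_three_stage`, the time-averaging scheme, the seed, and all parameter values are assumptions, and the example uses constant switching rates (no feedback) so that the mean can be compared with $$\langle n\rangle =cu/(vd)$$.

```python
import random

def gillespie_three_stage(a, b, s, r, u, v, d, t_end, seed=0):
    """Simulate promoter state i, mRNA count m and protein count n up to t_end.

    a and b are callables n -> a_n, b_n (switching rates into and out of the
    active state); s, r, u, v, d are the rate constants of the three-stage model.
    Returns the time-averaged protein mean and variance.
    """
    rng = random.Random(seed)
    t, i, m, n = 0.0, 0, 0, 0
    n_sum = n2_sum = 0.0
    while True:
        rates = [
            a(n) if i == 0 else b(n),  # promoter switching
            s if i == 1 else r,        # transcription
            v * m,                     # mRNA degradation
            u * m,                     # translation
            d * n,                     # protein degradation
        ]
        wait = rng.expovariate(sum(rates))
        # weight the current state by the time spent in it (clipped at t_end)
        dt = min(wait, t_end - t)
        n_sum += n * dt
        n2_sum += n * n * dt
        if t + wait >= t_end:
            break
        t += wait
        k = rng.choices(range(5), weights=rates)[0]
        if k == 0:
            i = 1 - i
        elif k == 1:
            m += 1
        elif k == 2:
            m -= 1
        elif k == 3:
            n += 1
        else:
            n -= 1
    mean_n = n_sum / t_end
    return mean_n, n2_sum / t_end - mean_n**2

# Hypothetical no-feedback example: c = (a s + b r) / (a + b) = 11, so <n> = c u / (v d) = 5.5
mean_n, var_n = gillespie_three_stage(
    a=lambda n: 20.0, b=lambda n: 20.0,
    s=20.0, r=2.0, u=5.0, v=10.0, d=1.0, t_end=1000.0,
)
print(mean_n, var_n, var_n / mean_n**2)
```

Feedback can be introduced by letting `a` or `b` depend on `n`; slow promoter switching corresponds to small values of these rates relative to the other rate constants.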