Intrinsic limits to gene regulation by global crosstalk

Gene regulation relies on the specificity of transcription factor (TF)–DNA interactions. Limited specificity may lead to crosstalk: a regulatory state in which a gene is either incorrectly activated due to noncognate TF–DNA interactions or remains erroneously inactive. As each TF can have numerous interactions with noncognate cis-regulatory elements, crosstalk is inherently a global problem, yet has previously not been studied as such. We construct a theoretical framework to analyse the effects of global crosstalk on gene regulation. We find that crosstalk presents a significant challenge for organisms with low-specificity TFs, such as metazoans. Crosstalk is not easily mitigated by known regulatory schemes acting at equilibrium, including variants of cooperativity and combinatorial regulation. Our results suggest that crosstalk imposes a previously unexplored global constraint on the functioning and evolution of regulatory networks, which is qualitatively distinct from the known constraints that act at the level of individual gene regulatory elements.

Let regulation be determined by the (mis)match between the binding site sequence and the recognition sequence of any transcription factor. Each binding site is associated with a single TF type with which it forms a perfect match -this is the cognate TF for the given binding site. However, each site could also occasionally be bound by other (noncognate) TFs, at an energetic cost of a certain number of mismatches. Following earlier works [1,2], we assume that the contribution of mismatches at individual positions in a binding site to the binding energy is equal, additive, and independent. We define the energy scale such that binding with cognate TF has zero energy and all other binding configurations have positive energies, proportional to the number of mismatches d, E = ǫd, where ǫ is the per-nucleotide binding energy. The unbound state has energy E a with respect to the cognate bound state. The different states and their energies are illustrated in Fig. 3A in the main text. We employ a thermodynamic model to calculate the equilibrium binding probabilities of cognate and noncognate factors to each binding sequence.
TFs can also be non-specifically bound to the DNA. These configurations only sequester TFs from free solution, but do not directly interfere with gene expression. As explained later, we will lump together the TFs freely diffusing in the solution, as well as nonspecifically bound TFs and any other TF "reservoirs" into one effective concentration of available TFs (equivalently, we work with the chemical potential of the available TFs using the grand-canonical ensemble).
Previous studies calculated the probability of a given transcription factor to be bound or unbound to certain DNA sequences [2]. These probabilities were calculated assuming that the site is vacant or bound by the TF under study, but not bound by TFs of other types. This approach is cumbersome when a large number of TF types are considered simultaneously, because the probability that the site is bound by other factors is non-negligible, and due to steric hinderance, a site cannot be bound by more than one molecule at any given time. Previous studies also proceeded by using the canonical ensemble. These two modeling choices together make the problem of many TFs binding to multiple binding sites coupled and not easily tractable, because one would need to enumerate all possible combinations of TF-BS states. However, an alternative and much simpler approach is to employ the grand-canonical ensemble, and calculate the binding probabilities for the binding sites, rather than for the TFs. The necessary assumption is that binding sites behave independently (e.g., they are sufficiently separated on the DNA so that binding at one site does not overlap the binding at another, or if it does, this is treated explicitly). Underlying the grand-canonical ensemble is the assumption that TFs are present at sufficient copy numbers, so that the binding of a single site under consideration does not appreciably affect the chemical potential of the remaining TFs. Experimental support for such decoupling and the applicability of the grand-canonical approach has been demonstrated recently [3]. In the following we assume equal concentrations of all TF types.
We distinguish two contributions to crosstalk: 1. For a gene i that should be active and whose cognate TF is therefore present, error occurs if its binding site is bound by a noncognate regulator (activation out of context due to crosstalk), or if the binding site is unbound (gene is inactive). This happens with probability where C j is the concentration of the jth TF, d ij is the number of mismatches between the jth TF consensus sequence and the binding site of gene i, ǫ the energy per mismatch and E a the energy difference between unbound and cognate bound states; all energies are measured in units of k B T .

2.
For a gene i that should be inactive and whose cognate TF is therefore absent, crosstalk error only happens if its binding site is bound by a noncognate regulator (erroneous activation) rather than remaining unbound. This happens with probability In general x 1,2 depend on the specific set of pair-wise distances d ij between the consensus sequence of each TF present and the site of gene i. Hence they could vary between genes, and even for each gene different sets of TFs can yield different values of crosstalk. In the following we assume a fully symmetric setup, such that all genes are equivalent in their sensitivity to crosstalk (x 1,2 is independent of i). We assume that for each gene the mismatches d ij of all the noncognate TFs are distributed according to a probability density p(d) (independent of the gene). For a particular gene i, clearly different sets of TFs provide different pairwise distances d ij . However, for Q ≫ 1 the fraction of sets of same size Q that yield distances which are distributed very differently from p(d) is small. In the following we neglect this fraction and assume that all choices of Q TFs yield exactly the same crosstalk contribution x 1,2 (Q, M ); this mean-field assumption is explicitly validated by numerical simulations in Supplementary Note 3. We will also consider that all possible sets of Q TFs (sets of genes that need to be active) are equally likely to occur. See Supplementary Note 9 for the alternative definitions of x 1 and x 2 . Our next step is to calculate total crosstalk as a function of the above parameters (the total number of binding sites M and the number of TF types available at any given time Q). We define total crosstalk as the fraction of genes found in any of the possible erroneous states. We assume that the particular choice of Q TFs that are present is random (hence we average over all possible ways to choose Q out of M TFs). In reality only certain sets of TFs need to be active together in which case the genes that are co-activated could have mutually similar binding sites, especially if they were regulated by the same TF, compared to genes that are activated separately, possibly by different TFs. In Supplementary Note 1 we treat a simple extension of our model where each TF can co-regulate several target genes. We also assume equivalence between the two types of error (we relax this assumption below).
Clearly, if each of the Q genes that should be active has probability x 1 to be in any of the crosstalk states, then the expected number of genes in that state is Qx 1 . Similarly, of the genes that should be inactive the expected number that are in crosstalk state is (M − Q)x 2 . To obtain the fraction of genes in any of the crosstalk states we simply divide by the total number of genes M : Using the definition of S introduced in the main text where we approximated Q − 1 ≈ Q which is valid for Q ≫ 1 (an assumption we make here and throughout the paper). S(ǫ, L) is an average similarity measure between all pairs of binding sites. If binding site sequences are drawn randomly from a uniform distribution, S = ( 1 4 + 3 4 e −ǫ ) L . This is easy to derive: since individual base pairs are assumed to be statistically independent, at each position the probability of a random sequence to be identical to a given TF consensus sequence is 1/4, whereas with probability 3/4 it is different, implying a decrease of e −ǫ in binding energy. Since the complete binding site consists of L independent base pairs, this expression for a single base pair is now raised to the power of L.
The expressions for x 1,2 read: The two extreme cases occur when TF concentrations are either zero or very large (Table 1). If C = 0, x 1 = 1 and x 2 = 0, i.e., x 1 is maximal due to binding sites that should be bound, while zero error for x 2 occurs due to binding sites that should be unbound. The total error then amounts to the fraction of genes that need to be activated X(C = 0) = Q/M . At the other extreme, if C → ∞, x 1 = SQ/(1 + SQ)) and x 2 ≈ 1, i.e., no site is left unbound. The magnitude of x 1 error due to noncognate binding is determined by the binding site similarity S. If QS ≪ 1, The total crosstalk then amounts to X(C → ∞) . Next, we analyze the dependence of crosstalk on various parameters. One unknown in these expressions is the TF concentration C. Because we are searching for a lower bound on crosstalk, we can find the concentration that minimizes X. Taking the derivative of X and solving for its zeros, we find two potential extrema optimal C; activators and global repressor Table 1: Crosstalk errors in the basic model. Per-gene errors of the two types: x 1 is the error of a site whose cognate TF exists and the site should therefore be bound, but is either unbound or bound by a noncognate factor. x 2 is the error of a site whose cognate factor does not exist, and the site should therefore be unbound, but is bound by a noncognate factor. The last column shows the total crosstalk, averaged over all M sites. but only one of them can yield non-negative concentration values (and is consistently a minimum): For small S the leading terms in the optimal concentration are Substituting Eq. (S6) back into Eq. (S3) yields the minimal achievable crosstalk: For a constant number of co-activated genes Q, X * increases to leading order like the square root of S, Substituting C * into the single gene crosstalk expressions Eqs. (S1)-(S2), we obtain the minimal per-gene crosstalk Since crosstalk must be in the range [0,1] and M ≥ Q, this solution is only valid under the condition that S(M − Q) < 1. Thus, minimal crosstalk has 3 regimes: 1. For S > 1/(M − Q), crosstalk is minimized by taking C = 0. This is the "no regulation" regime. In this case, crosstalk amounts to Q/M , which is simply the fraction of genes that were supposed to be activated (but are not due to lack of their TFs).
2. For Q > Q max (S, M ), crosstalk is minimized by taking C → ∞; this is the "constitutive regime." Q max (S, M ) is given by two of the roots of the 4th order equation, S(M + SM Q − 2Q − SQ 2 ) − S(M − Q) = 0, solved for Q. We find the boundaries between the 3 different regulatory regimes by solving for C * (S, M, Q) = 0.
3. Otherwise, there is an optimal concentration 0 < C * < ∞, given by Eq. (S6), that minimizes crosstalk; this is the "regulation regime." The boundary between the first and third region is at S * = 1 M−Q and the boundary between the second and the third is at . Hence, the second region (where C * = ∞) only applies for Q > 4M 5 . Fig. 2(b) illustrates the dependence of the TF concentration C * , which minimizes crosstalk, on the number of co-activated genes Q. It demonstrates how the range in which 0 < C * < ∞ gets narrower when S increases. Fig. 1 demonstrates crosstalk and C * values for M = 20, 000 (compare to Fig. 3 in the main text with M = 5000). shows the optimal TF concentration, C * . These results are analogous to Fig. 3 of the main paper, which is computed for M = 5000. The results for two different M are qualitatively similar and show 3 different regimes of regulation. We make the following observations: (i) for larger M , the C * = 0 regime expands to include lower S values, as expected from the analytical solution for the regime boundaries; (ii) if the fraction of co-activated genes, Q/M , remains constant, the crosstalk increases with M , as it also depends on the absolute number of inactive genes M − Q (see Eq. (S8)). The discrepancies at small Q between the black solid curve separating the "no regulation" and "regulation" regimes, and the numerically computed C * values are due to the approximation Q − 1 ≈ Q.

Dependence on TF concentration
The optimal TF concentration C * in our model arises as a trade-off between the Q genes that need to be active (for which a higher C is favored) and the M − Q genes that need to be inactive (for which Supplementary Figure 2: How is optimal TF concentration C * determined? (a) x 1 crosstalk component (genes that should be active) decreases with TF concentration C, whereas x 2 crosstalk component (genes that should remain inactive) shows the opposite trend. Curves of x 1 and x 2 (crosstalk of a single gene) vs. C are illustrated for various values of S. While x 2 can be fully eliminated if C = 0, x 1 has a residual component which depends on S even for infinite C. Both crosstalk types increase with the similarity between the binding sites S (compare curves with various S values). (b) The optimal concentration C * is a decreasing function of the similarity S for all Q values. At fixed M , the optimal TF concentration, C * , diverges with the number of co-activated genes, Q. This leads to the "constitutive regime," where crosstalk is mathematically minimized by taking C = ∞. Shown is the optimal concentration C * as a function of the number of co-activated genes Q, for various S values; M is fixed at 5000. The value of Q at which C diverges depends on S. For small Q, we require M − 1/S < Q, otherwise the optimal concentration is in the C * = 0 regime. For the lower S values crosstalk can be minimized for 0 < Q < Q max < M , whereas for higher S values there exists also a value for Q min , such that 0 < Q min < Q < Q max < M . In other words, higher S leads to a narrower range of Q where the crosstalk can be effectively minimized. Supplementary Figure 3: Minimal crosstalk X * is an increasing function of the similarity S and has a non-monotonous dependence on the number of active genes Q. The balance between genes that need to be active (x 1 crosstalk type) and genes that need to remain inactive (x 2 crosstalk type) causes a non-monotonous dependence of the total crosstalk on the number of active genes Q, which has a maximum at an intermediate Q value. Curves are shown only in the regulation regime, where crosstalk is minimized by a finite TF concentration. The curves are truncated at the point of transition to regime II where TF concentration formally diverges to infinity.
a lower C is favored). Note, however, the asymmetry between the two crosstalk types: while the x 2 component (genes that should remain inactive) can be completely suppressed by having no TF (C = 0), the opposite does not hold. The x 1 component (genes that should be active) cannot be fully eliminated even for infinitely high C, because of the cross-activation between the distinct genes that should be active; see Fig. 2(a). This trade-off varies with the relative weights of x 1 and x 2 , which depend on both Q and S. We find that a concentration C * that minimizes crosstalk exists only in the third regime ("regulation regime"). In the first regime where S < 1/(M − Q), binding sites are so similar that crosstalk due to the inactive M − Q genes dominates the total crosstalk. Hence the choice of C * = 0 completely eliminates x 2 crosstalk, and minimizes the total crosstalk. In the second regime, where a large number of genes Q need to be active, crosstalk due to the Q active genes dominates (x 1 type), hence C * diverges to infinity. Fig. 2(b) illustrates curves of the optimal concentration C * as a function of the number of active genes Q for constant values of S. As Q increases, the relative weight of the genes that need to be active increases, hence C * is always a monotonously increasing function of Q.

Dependence on the similarity S
Both crosstalk types x 1 and x 2 increase with the similarity S (see Fig. 2(a)). For a fixed Q, C * decreases as a function of S. Again, this is because for larger S the weight of the genes that should remain inactive is more significant, hence the trade-off shifts towards lower TF concentrations (but the minimal crosstalk X * still increases!). This behavior applies only in the regulation regime, hence for M − 1 S < Q < Q max . For larger values of Q (Q > Q max ), a more complex behavior is found because by changing S we pass through all three regimes: C * then first decreases, then diverges (because it enters the second regime), but then decreases back again.

Dependence on the number of active genes Q
The two crosstalk types show opposite dependence on the number of active genes Q: crosstalk per gene that needs to be active (x 1 ) decreases with Q, whereas crosstalk per gene that needs to remain inactive increases with Q. The total crosstalk is a weighted sum of both with varying weights, hence it is not surprising that the total crosstalk has a non-monotonous dependence on the number of active genes Q with a maximum at an intermediate value; see

Basic model with regulation by repressors only
Our basic model assumed that all gene regulation is achieved by using specific activators to drive the expression of genes that would otherwise remain inactive. An alternative formulation of the problem postulates that genes are strongly expressed without TFs bound to their regulatory sites, but need to be repressed by the binding of specific regulators to stop their expression. Indeed, many bacterial genes seem to be regulated in this way. We thus studied this complementary model, in which all regulators are repressors instead of activators. We assume, as before, that Q out of M genes should be active, but now this implies that M − Q types of cognate repressors are present for all the genes that should remain inactive.
The expressions for crosstalk per gene that should be active (x 1 ) or inactive (x 2 ) read: The total crosstalk is still Eqs. (S11) are mathematically identical to Eqs. (S5), where the roles of Q and M − Q are simply swapped. Not surprisingly, the minimal crosstalk in this case is: which is valid for S < 1/Q.
The optimal TF concentration that minimizes crosstalk is now The minimal crosstalk and optimal concentration are illustrated in Fig. 4. It retains the 3 regulatory regimes observed with activators only: 1. For S > 1/Q we obtain the "no regulation" regime where crosstalk is minimized by taking C = 0.
2. For Q < Q min (S, M ) we obtain the "constitutive regime" where crosstalk is minimized by taking C → ∞. Q min is obtained when C * of Eq. (S14) diverges (the denominator equals to zero).
3. Otherwise, there is an optimal concentration 0 < C * < ∞, given by Eq. (S14), that minimizes crosstalk; this is the "regulation regime." The three regions are marked with Roman numerals, in accordance with Fig. 3 of the main text. The boundaries between the three regimes are now: S * = 1/Q (between regimes I and III) and (between regime II to both I and III).
The results are clearly a mirror image of the results shown in Fig. 3 of the main text for the activator-only basic model. They can be obtained simply by mapping Q → M − Q. Since we keep the convention that Q is the number of genes that are active, the difference in regulation strategies amounts to having either Q activator types and keeping M − Q binding sites unbound (activatoronly) or having M − Q repressor types and keeping Q binding sites unbound. Comparing the expressions for minimal crosstalk, Eq. (S13c) to Eq. (S8), we conclude that crosstalk depends on the fraction of TFs that are expressed and on the absolute number of binding sites that need to remain unbound.

Breaking the symmetry between the two crosstalk types
In our basic model we made a simplifying assumption that the two crosstalk types, x 1 and x 2 , have equal weights: not activating a gene that should be active or erroneously activating a gene that should be silenced are assumed to be equally disadvantageous. We now relax this symmetry by allowing different weights, a and b, for the two crosstalk types, to model possible differences in their biological significance. Eq. (S3) for the total crosstalk now takes the form: The expression for the optimal TF concentration then reads: where again only one of the two solutions yields non-negative concentration values. The resulting minimal crosstalk is: Setting a = b = 1 reduces the above formula to the previous solution, Eqs. (S6)-(S8). Note the asymmetry between the two crosstalk types: if b = 0, i.e., when crosstalk in genes that should remain inactive is insignificant, the minimal achievable crosstalk equals zero. This is not true in the other extreme case, when a = 0. In Fig. 5 we show that the three different regulatory regimes still exist under this generalized definition of crosstalk, but their boundaries may shift.

Breaking the symmetry between the co-activated genes
In our basic model we imposed full symmetry between the Q co-activated genes: they contributed equally to crosstalk and all Q types of TFs were assumed to exist in equal concentrations. We now To break the symmetry between the two error types we consider a redefined crosstalk, . For different values of b (the cost of mis-activating genes that should remain inactive), all three regulatory regimes are preserved, although their boundaries shift. The weight of the first crosstalk type (mis-regulating genes that should be active) is equal in all cases. Red shows the "regulation regime," (0 < C * < ∞). As erroneous activation is penalized less (decreasing b), the "no regulation" (C * = 0, white) regime shrinks, whereas the constitutive expression regime (C * = ∞, black) expands, as expected. relax these assumptions. We examine the situation in which a fraction h of these Q genes is more important to the functioning of the cell. Mathematically, we postulate that the per-gene crosstalk error for the important genes contributes with a γ-times higher weight to the total crosstalk relative to the non-important genes. We introduce an additional degree of freedom to the model, by allowing the concentration of the TFs to split unevenly between important and other genes: each important gene has TFs present at concentration C 0 , while a TF of a non-important gene is present at concentration C 0 = ηC 1 .
As hQC 0 + (1 − h)QC 1 = C we obtain: If either h = 0 or η = 1 this reduces back to the basic model with C 0 = C 1 = C/Q. The total crosstalk now takes the form: where x 0 is the per-gene error of the important genes, x 1 is the error of other genes that need to be activated, and x 2 , as before, denotes crosstalk at genes that need to be kept inactive.
We can optimize numerically for both the total TF concentration C and the factor η by which the TF concentration of the important genes is amplified. Alternatively, we can assume that C remains fixed at the optimal value for the case where all genes are equally important, and only optimize for η. We display the latter option in Fig. 6, to explore crosstalk at varying h under equal resource constraints.  Figure 6: Crosstalk can be reduced for a subset of important genes at the cost of increasing the total crosstalk. To break the symmetry between genes, we define a fraction h (out of Q) genes as important, having γ-times higher contribution to the total crosstalk. TF concentration for these genes is optimized separately, subject to the total TF concentration C remaining fixed to its optimal value in the symmetric, γ = 1, case. We show the crosstalk per important gene, x 0 (red), and per a normal gene, x 1 (black), as a function of γ (for h = 0.1). The inset shows the same as a function of h (for γ = 10). Per-gene crosstalk increases approximately linearly with h and important genes achieve ∼ √ γ smaller crosstalk relative to normal genes.
The special case when only a single gene is important is analytically solvable assuming Q ≫ 1, yielding: In particular the per-gene errors read: The error of the single important gene can be reduced at most by a factor of √ γ relative to the other co-activated genes. The x * 1 error for the other Q − 1 genes remains the same, because we assumed that Q ≫ 1. Interestingly, the M − Q genes that need to be kept inactive suffer an increase in crosstalk as a consequence of protecting the important gene.

Every transcription factor regulates Θ genes
In the basic model we considered a regulatory scheme in which every gene has its own unique TF type. This allows for maximal flexibility in regulating each gene individually. Real gene regulatory networks typically have fewer TFs than the number of target genes, so that at least some transcription factors regulate several genes. Here we consider a simple extension of the basic model, in which each TF regulates Θ genes (with identical binding sites) rather than one. We assume no overlap between the sets of genes regulated by various TFs, so that the total number of TFs species is now Θ times smaller than before. If Q genes should be active, then Q/Θ TF species should be present in a given condition. Assuming that Q/Θ ≫ 1, we can approximate Q/Θ − 1 ≈ Q/Θ as before. The only change from the basic crosstalk formulation is in x 1 , because the concentration of cognate factors is now Θ times larger than before: This formulation is analytically solvable, yielding The equations for minimal crosstalk are equivalent to the basic model if we map S → S/Θ. Since crosstalk depends on √ S to first order, this amounts to crosstalk reduction by a factor of √ Θ.
For small S the leading term in the optimal concentration is These gains in crosstalk have, however, been achieved by sacrificing the ability to regulate each gene individually: now, the smallest set of genes that can be co-activated is of size Θ. Typically, TFs might constitute 10% of the genes [4]; with Θ ∼ 10, the crosstalk could be reduced by a factor of ∼ 3 at best.

Non-constant Θ
Until now, we assumed that each TF regulates exactly Θ genes. This assumption can be relaxed using numerical simulations; in particular, we considered the case where the number of genes that each TF regulates is a random variable drawn from a specified distribution. We started by defining which TF controls which sets of genes through explicit enumeration of binding site sequences. We assumed that the number of genes that a given TF regulates is approximately Poisson distributed (with mean Θ) and that all these regulated genes use the same sequence for their binding site, equal to the consensus sequence of the cognate TF. We then sample the environments in which Q out of the total of M genes are active; given the regulatory network structure, not all Q picks out of M can be realized, as is also the case with constant Θ model. The crosstalk is evaluated in each environment exactly, by computing all thermodynamic states of all binding sites, and is subsequently averaged by Monte Carlo sampling through the possible environments. This extension to the model introduces no new parameters, so its crosstalk and regime boundaries can be straightforwardly compared to the model where Θ is constant. We find that Poisson-distributed Θ changes crosstalk at a below-percent level, and produces no notable shifts in regime boundaries, showing that our results are robust with respect to this particular distributional assumption.

Optimal packing
In real organisms, binding site sequences for different genes could depart from a random distribution (even after taking into account the statistical structure of the genomic background). For example, to achieve high specificity of regulation, we could hypothesize that binding site sequences evolved to minimize the overlap between any pair of consensus sequences. To explore the crosstalk limit under such optimal use of sequence space and contrast it with the random choice of binding sites, we synthetically constructed binding site sequences that are as distinct as possible. Specifically, our optimal codes are described by a parameter d min , which is the minimum required number of basepair differences between any pair of binding site sequences. This is the Hamming distance, HD, between sequences. The problem of choosing M sequences of length L such that each pair differs by at least d min is not tractably solvable in general. We construct numerical approximations to these optimal codes using the following algorithm: 1. Generate all possible sequences of length L and store them in a list called words. Create an empty list, called codewords, which will store the binding site sequences.
2. Pick the first entry, s, from the list words, to be a binding site sequence, and append it to the list codewords.
3. Erase s and all of its Hamming neighbours at distance strictly less than d min from the list words.
4. If the list words is not empty, repeat from step 2. If the list words is empty, stop.
When the procedure terminates, the list codewords will contain binding site sequences that are separated by at least d min mismatches. The outcome of this procedure depends on the initial ordering of the list of all possible sequences. The procedure is not guaranteed to generate the maximal set of sequences satisfying the Hamming distance criteria. From the list of generated binding site sequences, we obtain P (d), the distribution of mismatch distances between all pairs of binding sites, and hence obtain the value of S as (S25)

Reverse complemented sequences
We have also considered a different definition of distance between sequences that takes the doublestranded nature of DNA into account. This brings into picture the reverse complement of both sequences in question. If s i and s j are two sequences with reverse complements r i and r j respectively, this new definition of Hamming distance is where HD(s i , s j ) is the usual Hamming distance as considered previously. This restricts the sequence space much more than with the usual definition and as such, as seen in Fig. 8, we can pack fewer binding sites in the sequence space at a specific d min . Given that there are enough sequences under HD rc measure in the sequence space, we can also ask how S changes in relative to the random code. Intuitively, S should increase since each binding site sequence also contributes its reverse complement into the pool of sequences to which TFs can bind non-cognately. Indeed, Fig. 9, which maps S from the reverse complement code to S from a random code, shows that S increases by about a factor of 2 due to the addition of reverse-complemented sites. From the HD to HD rc , the Singleton bound doesn't change from the usual situation but the Gilbert-Varshamov (GV) bound, which takes into account the "volume of restricted ball" around each sequence, goes down. Because of stronger constraints, the number of sequences that can be packed goes down from the usual situation but only by a factor of ≈ 2.

Saturating model of TF-DNA binding energy
It has been experimentally observed that the binding energy between TF and DNA saturates to some nonspecific value after a certain number of mismatches between the TF's cognate sequence and the DNA sequence in question [6]. We consider such a saturating energy model, characterized by a parameter d 0 , the number of mismatches after which binding energy saturates. The binding energy is given by E(d) = ǫ min(d, d 0 ). We obtain S as where P (d) is the distribution of mismatch distances between all pairs of binding sites picked at random from the sequence space. d 0 = L corresponds to a mismatch model with non-saturating energy. Decreasing d 0 limits the specificity of the TF towards binding site sequences far away from the consensus and thereby increasesS(d 0 ).

Empirical values
We obtain organism-specific estimates of S from known databases [7,8,9] of the binding site sequences of different TFs. In the main text, for a particular genome, we defined S for a collection of TFs with the same mismatch penalty ǫ and binding sites of a specific constant length L. In real organisms, different TFs have different ǫ and L, making it difficult to directly calculate S for a genome. Instead we obtain a value of S for each TF by defining it as the value of S of a hypothetical genome in which all TFs have the same binding site properties (ǫ, L) as our TF. Hence, for each organism, we obtain a set of S values.
Many databases document the binding site sequences of TFs in Position Count Matrices (PCMs). The PCM of a TF with a binding site of length L is a 4 × L matrix B with b ij denoting the number of known TF binding site sequences that have nucleotide i in position j. One can obtain estimates of ǫ and L from B, and use them to calculate S. There are two broad ways to estimate ǫ and L (and hence, S) of a TF: (a) Information method, (b) Pseudo-count method. In (a), we calculate the information contained in the whole binding site motif and obtain an ǫ that distributes this information uniformly among all sites in an equivalent "effective" motif that has the same length as the original, but only has 0 or ǫ mismatch energy values. In (b), we obtain ǫ for all entries of the PCM and calculate an average ǫ from these entries. To handle zeros in the PCM which lead to undefined ǫ, (b) uses an arbitrary pseudo-count. Method (a) can, in contrast, avoid the use of pseudo-counts and, additionally, reproduces by construction the information content of each known motif, which is the key statistical property of TF specificity [10,11]. Hence, we used (a) to infer S values. In both the methods, we used PCMs that have that have been constructed from at least 10 distinct binding site sequences.

Information method
In this method, we first obtain the binding site length L and also the total information I, contained in the binding site sequences of the TF.
where I j is the information contained in position j, p ij is the frequency of nucleotide i in position j, obtained in a straightforward way from B, and q ij is the expected background frequency. To get rid of non-specific positions, we neglect all positions that contain information less than a certain threshold (I j > 0.2 bits for position j to be considered part of the binding site). For a random genome, q ij = 0.25 ∀ i, j, resulting in The maximum information in the motif is 2L bits (when ǫ → ∞) with each position contributing a maximum of 2 bits, which for finite ǫ, is reduced by an entropy term. Obtaining information per position I pos = I/L, we infer an ǫ that uniformly distributes the information in the motif among individual positions. At a specific position j * , without loss of generality, assume that i = 4 has the best binding energy (= 0). The probability of observing i = 4 at j * is given by p 4 = 1/Z while the probability of observing any of the three other possible nucleotides is given by p 1,2,3 = e −ǫ /Z, with Z = 1 + 3e −ǫ [12]. Hence, The mismatch energy ǫ can be obtained from the above expression, and from ǫ and L, we obtain S(ǫ, L) = ( 1 4 + 3 4 e −ǫ ) L .

Pseudo-count method
In this method, we infer ǫ for all three non-cognate nucleotides in each position, and obtain ǫ for the TF as an average of these 3L values. For an arbitrary position j, as before, assume that i = 4 has the maximum counts (b 4j > b ij , i = 1, 2, 3). We obtain ǫ ij = log b4j bij and mismatch penalty for position j as ǫ j = 1 3 (ǫ 1j + ǫ 2j + ǫ 3j ). If some entry b kj = 0, ǫ kj is undefined. To take care of this, we first add a pseudocount δ to all entries of B and obtain a modified PCM B δ to infer ǫ. The value of δ chosen is arbitrary and it is common practice to use δ = 0.5 or δ = 1. As before, to get rid of non-specific positions, we consider positions that have ǫ j ≥ 1. From the remaining, we take a mean to obtain ǫ = 1 coli TFs were obtained from RegulonDB [7] and yeast (S. cerevisiae) from two different databases -scerTF [9] and JASPAR [8]. All the other organism specific TFs were obtained from JASPAR. Notice that in the pseudo-count method, δ has the biggest influence on the estimates in E. coli. Importantly, for all other organisms, the estimates are invariant to δ and agree well with the information estimate.

Supplementary Note 3 Validity of the mean-field assumption
In computing crosstalk at given M and Q, we have made a mean-field assumption on the similarity measure S. For a given set of binding site sequences in the sequence space (total M in number), this amounts to assuming that the distribution of neighbours for each binding site comes from the same underlying distribution. For a particular selection of Q genes, for each binding site i from the M binding sites, similarity S i can be defined using d ij where j = i indexes over the binding sites of the Q selected genes.
From this, we have for crosstalk for a particular selection of Q genes, where x 1 (S i ) and x 2 (S i ) depend on S i as shown. We are interested in the mean crosstalk X = X({S i }) over all selections of Q out of M genes, which requires us to know the full distribution of S i . The crosstalk is then In the mean-field assumption, we have From this, one can obtain the optimal crosstalk X * . To check the validity of such a mean-field assumption, we performed numerical simulations by drawing lists of M binding sites from the sequence space, computing optimal crosstalk X * sim by explicit enumeration of all thermodynamic states, and comparing this with the mean-field crosstalk X * . In detail, we first picked M binding sites (to regulate M genes) randomly from the sequence space and held this choice fixed. Now, for each Q, we performed n sel different selections of Q out of M genes. For each such selection, after computing the binding site mismatches and occupancies, we compute the crosstalk. To get the mean crosstalk for Q, we perform a Monte Carlo estimate of the mean crosstalk over these n sel different selections of Q out of M genes. Figures 12 and 13 show that the mean-field crosstalk systematically over-estimates the actual crosstalk, but nevertheless remains a very good approximation to the true crosstalk.

Supplementary Note 4 Mixed models
In the baseline model we consider M genes, all of which are regulated either solely by activators or solely by repressors. Here, we consider mixed models, i.e., models that utilize repression to control one subset of genes and activation to control the other genes. Let's assume that M A genes are regulated by activators and M R genes are regulated by repressors, where M = M A + M R . In a particular environment, let's assume that Q genes need to be ON. Out of these, let's assume that Q A genes are activator-regulated and Q R genes are repressor-regulated, where Q = Q A + Q R . For activating Q genes, the number of TFs present now amounts to T = Q A + M R − Q R : Q A activators and M R − Q R repressors. As before, S is the similarity of the binding sites and C the total concentration of TFs (activators+repressors). The concentration of a particular TF type, when present, will now be C/T . We assume that any non-cognate interaction ("activation outof-context" or "repression out-of context") counts as a crosstalk error. We distinguish 4 types of per-gene crosstalk errors: An activator-regulated gene that needs to be ON, should be bound by the cognate activator. The unbound state and any non-cognate binding (non-cognate activator or repressor) are crosstalk states: An activator-regulated gene that needs to be OFF, should be unbound. Any non-cognate binding is a crosstalk state: Supplementary Figure 12: Comparison of mean-field results and numerical simulations. On the left, we plot the difference in optimal crosstalk between simulations and the mean-field approach, X * sim − X * , for different Q and S. On the right, we plot X * sim − X * against Q for three different S. Here, M = 5000, L = 10, and S has been varied by tuning ǫ. X * sim is a Monte Carlo estimate of the mean crosstalk, obtained over n sel different selections of Q out of M genes. n sel = 1 in the top row, and n sel = 30 in the bottom row. The mean-field approach is in general a very good approximation of the simulations. The maximal crosstalk difference is less than 0.02, and decreases with increasing S.
A repressor-regulated gene that needs to be ON, should be unbound. Any non-cognate binding is a crosstalk state: Lastly, a repressor-regulated gene that needs to be OFF, should be bound by the cognate repressor. The unbound state and any non-cognate binding (non-cognate repressor or activator) are crosstalk states: Supplementary Figure 13: Comparison of mean-field results and numerical simulations. On the left, we plot the difference in optimal crosstalk between simulations and the mean-field approach, X * sim − X * , for different Q and S. On the right, we plot X * sim − X * against Q for three different S. Here, M = 500, L = 8, and S has been varied by tuning ǫ. X * sim is a Monte Carlo estimate of the mean crosstalk, obtained over n sel = 100 different selections of Q out of M genes. Again, as with M = 5000, the mean-field approach is a very good approximation of the simulations. The maximal crosstalk difference is only slightly larger than 0.02.
where P QA is the fraction of Q gene selections that have Q A activated genes. We have where M * A is the M A value which minimizes crosstalk for a given Q. In Fig. 14, we see that for Q < M/2, the best strategy is to use all activators (M A = M ), and for Q ≥ M/2, the best strategy is to use all repressors; optimization of crosstalk in mixed models therefore always picks out one of the two "pure" regulatory strategies and does not yield an optimal mixed model. To see if the pure strategies get chosen because the activation of all genes is symmetric in all environments, we studied a simple system in which different subsets of genes are required to be activated with different probabilities. So far, when Q genes are required to be ON, each gene had the same probability, Q/M , to be among the Q out of M required genes, i.e. Q/M is the probability of each gene to be activated.
Here, we introduce two classes (1 and 2) of genes, with M 1 genes in the first class and M 2 = M − M 1 genes in the second class. Genes in each of the two classes have different probabilities of requiring activation across environments: P 1 for the first class and P 2 for the second class. If P i > 0.5, then genes in class i are called "hot" genes, and if P i < 0.5, genes in class i are called "cold" genes. Given certain M 1 , M 2 , P 1 , and P 2 , different environments correspond to different choices of the Q genes that should be active, where Q is no longer constant as before, but a random variable with mean In a similar fashion as before, we compute the crosstalk (at optimal C * ) for different choices of mixed models (how many class i genes are under activators or repressors). Then, we obtain the optimal (M A ,M R ) strategy among these mixed models that minimizes crosstalk. In Fig. 15, we show how this optimal strategy varies, along with Q , as a function of P 1 and P 2 for a fixed choice of M 1 = M 2 = 2500. First, we note that Q increases in any direction that increases P 1 or P 2 . In the symmetric mixed model setup, we essentially studied the system along the diagonal from (0, 0) to (1, 1) on the (P 1 , P 2 ) plane (dashed white line), increasing Q from 0 to M . The previously studied results yielded two "pure" strategies-all activators or all repressors, depending on whether Q is bigger or smaller than M/2-which is consistent with the following observations in the asymmetric mixed models. When P 1 < 0.5 and P 2 < 0.5 (all genes are cold), the optimal strategy is a pure one, namely, to put all genes under activators; when P 1 > 0.5 and P 2 > 0.5 (all genes are hot), the optimal strategy is to put all genes under repressors, which is also a pure strategy. But when P 1 > 0.5, P 2 < 0.5 or P 1 < 0.5, P 2 > 0.5 (one class is hot, while the other is cold), the optimal strategy is "mixed": put hot genes under repressors and cold genes under activators. Note that not all Q are possible with these optimal mixed strategies. From here onwards, we study mixed models in the bottom right square of Fig. 15, where P 1 > 0.5 and P 2 < 0.5, i.e., class 1 is hot and class 2 is cold. Here we show how the optimal strategy and Q vary as a function of P 1 and P 2 for a fixed choice of M 1 = M 2 = 2500. Q increases in any direction that increases P 1 or P 2 . When P 1 < 0.5 and P 2 < 0.5 (all genes are cold), the optimal strategy is a pure one (all genes under activator control), while when P 1 > 0.5 and P 2 > 0.5 (all genes are hot), the optimal strategy is to put all genes under repressors, which is also a pure strategy. But when P 1 > 0.5, P 2 < 0.5 or P 1 < 0.5, P 2 > 0.5 (one class is hot, while the other is cold), the optimal strategy is "mixed": hot genes are under repressor control and cold genes under activator control.
At fixed P 1 and P 2 , crosstalk gains from using the optimal mixed strategy (instead of using all activators) increase with both S and the number of hot genes M 1 , as shown in Fig. 16.  Figure 16: Crosstalk gains from using the optimal mixed strategy instead of all activators. Plotted is the difference in optimal crosstalk (crosstalk gain), X * all−act − X * mb , between the pure strategy of using all activators and the optimal mixed strategy of putting hot genes under repressors and cold genes under activators, as a function of S, with fixed P 1 = 0.75 and P 2 = 0.25. As S increases, we cross from the regulatory regime III to regime I in which C * = 0 . The optimal mixed strategy becomes increasingly better (than the all activators pure strategy at reducing crosstalk) as S and M 1 increase.
In Fig. 17, we show in detail the crosstalk gains from using the optimal mixed strategy instead of the optimal pure strategy (either all activators or all repressors), for different Q and S, for four different M 1 = 500, 2000, 3000 and 4500.

Supplementary Note 5 Cooperative regulation
So far, we assumed a single binding site for every gene. Yet, some genes employ combinatorial regulation, with several binding sites regulated by a number of transcription factors. As a next step in extending our model we consider cooperative regulation, where every gene has two binding sites that are bound by two copies of the same type of transcription factor.
We assume 2 binding sites per gene, with energy gap E a between cognate-bound and unbound states. An additional energy contribution ∆ is obtained if both sites are bound by cognate factors, which then interact with each other. We consider also the configuration that two noncognate factors of the same type bind to the double binding sites and interact with each other as well. In the limit that ∆ ≫ E a once one of the sites is bound, the binding of the other becomes energetically favorable. This cooperative binding energy only applies for two molecules of the same type. Thus, if one site is bound by the cognate and the other by a noncognate molecule, cooperative interaction doesn't apply. We assume that only binding of one of the two sites induces transcription. The reasoning for this assumption is that for many bacterial and yeast genes activators are thought to work by recruiting the transcriptional machinery to the DNA [13]. Following this rationale, only one of the two sites is in the correct physical location (in bacteria, the proximal one) to do so successfully. Technically, if we assume that only one of the two sites determines transcription, for ∆ = 0, the cooperativity case reduces back to the basic model (Supplementary Note 1). We list the possible binding configurations of the two sites, their energies and statistical weight in Table 2.
The general case of this model, incorporating all possible binding configurations yields a 6th order equation in the TF concentration C, which we only handle numerically. The following limiting cases are however analytically solvable: 1. Limit of strong cooperativity: Assume that the cooperative interaction is strong compared to the individual protein-DNA binding energies ∆ ≫ E a . We can then neglect binding configurations in which only one of the sites is bound and the other is vacant, and the ones in which both are bound, but by molecules that do not interact cooperatively. That leaves us with only 3 possible binding configurations: both sites unbound, both bound by cognate TF or both bound by noncognate TF molecules of the same type with cooperative interaction (configurations 1,4 and 10 in Table 2). By proper change of variables this case can be reduced back to the basic single-binding-site model. The minimal crosstalk then reads: whereS = S(2ǫ, L). This error is achievable with TF concentration .

(S47)
Since the cooperative binding model allows for a binding site which is twice as long and higher total binding energy the parameters need to be correctly transformed to compare to the 1-site model. If we transform:S → S we obtain exactly the same minimal error as in the single-site model. By proper transformation of the energy of the unbound stateẼ a = ∆ + 2E a the TF concentration that minimizes the error is a square root of the one we had in the single-site model Eq. (S6). In similarity with the basic single-site model, here too we obtain different parameter regimes, whereas ForS = S(2ǫ, L) > 1 M−Q the minimal error is obtained by taking C = 0, namely regulation is not advantageous. While seemingly the cooperative binding is equivalent to a 1-site model which has twice as long binding site, this  Table 2: All possible binding configurations and the corresponding energies for a two-binding site model with cooperative interaction. 'C' denotes binding by cognate factor, 'N'binding by noncognate and 'U' -means that the site is unbound. We distinguish between binding of noncognate molecules of the same type (N x N x ) and different types (N x N y ), where in the former there is also cooperative interaction between the molecules. We define the reference energetic level E = 0 as the state 'CC' when both sites are bound by cognate factors with cooperative interaction, such that all other energies are positive. We assume that the left binding site is the auxiliary and only the right one determines the state of activity. Note that the statistical weight of the last binding configuration N x N x uses S(2ǫ, L) instead of the otherwise S(ǫ, L). The column 'activity' denotes whether in the given configuration the gene is either ON, OFF or * -could be either active or inactive (possibly active in response to noncognate signal). Blank space denotes a non-existing configuration (or one which is not accounted for): these are the configurations including a cognate factor bound in the situation that it is absent because the gene should be silent. The next two columns denote whether this configuration was counted as crosstalk (+) or not (-) if the cognate transcription factor is present and the gene should be activated or if it is absent (and the gene should be silenced). The 'Strong Cooperativity' column denotes the configuration included under strong cooperativity approximation.
is not accurate. The reason is that cooperative interaction occurs only between two specific molecules, which limits the possible sequence space.
2. Limit of weak cooperativity: If ∆ = 0, the problem reduces to the basic single-site model.

Cooperativity with interactions between noncognate pairs
In Fig. 4 of the main text we neglected the possibility of cooperative interaction between pairs of noncognate molecules at the binding site of interest. This situation is plausible if the interaction between the molecules is facilitated by the specific binding sites. However, the molecules can also cooperatively interact in solution before binding and then bind a noncognate site as a complex. This possibility was not taken into account in Fig. 4 (main text). In the following we repeat the calculation including this interaction too (state no. 10 in Table 2). The results are illustrated in Fig. 18. Evidently, the improvement in crosstalk owing to cooperativity is now significantly smaller.

Supplementary Note 6 Weak global repressor
So far we only considered gene regulation by activators. Cells however also have repression mechanisms as an additional means of regulation. As a first step to account for that we incorporate in the model one type of an abundant weak global repressor that interacts with all binding sites with sequence-independent low affinity. Non-specific repression mechanisms such as the nuclear envelope, histones and DNA methylation are thought to mitigate spurious transcription [14]. It was hypothesized that their emergence enabled the genome expansion in the transitions between prokaryotes to eukaryotes and from invertebrates to vertebrates [14]. We include an additional molecule in the model, which is found in concentration C r and can bind all binding sites equally well with energy 0 < E r < E a , namely it is more favorable than the unbound state, but not as favorable as the specific cognate activator of each site. Hence, our intuition was that such a global repressor cannot compete equally with specific binding, but it can reduce non-specific binding. The crosstalk expressions now read: As before, we minimize the crosstalk with respect to the TF concentration. The optimal concentration is now: This is the same optimal concentration C * as in Eq. (S6) only scaled by a factor C r e −Er + e −Ea , instead of e −Ea there. We conclude that the mere effect of a global repressor is to scale down the concentration of the specific activator. This is simply compensated for by a larger concentration of the activator. Hence, regardless of the global repressor affinity E r and concentration C r this additional regulatory mechanism cannot lower the crosstalk beyond what is possible with specific activators only. As before, the minimal crosstalk is:

Supplementary Note 7 Regulation by a combination of specific activators and specific repressors
As the global repressor examined in Supplementary Note 6 did not show any additional improvement in crosstalk, we elaborate the model further to account for specific repressors, in similarity to the specific activators. We extended the basic model (Supplementary Note 1) in which a gene had a single regulatory site and was regulated by an activator alone, to a more general model in which each gene has two regulatory sites: one compatible with a specific activator binding and the other with a specific repressor. We assume that each gene has a unique activator and unique repressor. In the basic model (Supplementary Note 1), for a gene to be silent its binding site should be vacant. The only way to achieve this was to lower the activator concentration. On the other hand, to improve activation reliability, the activator concentration, should be increased! Thus, in the simple model there seemed to be a trade-off between reliable activation and elimination of undesirable activation. The existence of a specific molecule that blocks the site from binding of other (potentially activating) molecules is thought to be a more reliable way to prevent undesired gene activation, not at the expense of the activation of other genes [15].
To be consistent with the basic model, we assume that the total concentration of all TFs (activators and repressors together) is constant C. As before, Q genes need to be activated for which Q specific activators are present. The other M − Q genes need to be silent for which we now add their M − Q specific repressors. All activators are found in equal concentrations C A /Q = α * C/Q each. All repressors are in equal concentrations C R /(M − Q) = (1 − α) * C/(M − Q) each. We allow for different binding energies for the two binding sites E a and E r . We assume that activation can only occur by binding of an activator molecule to the 'A' site. Repression is asymmetric in the sense that binding of any molecule to the repressor site prevents binding regardless of what is bound to the activator site. Thus a gene can only be active if the repressor site is empty and the activator site is bound by an activator. See the list of all possible states of the two binding sites in Tables 3 and 4 below.

Overlapping activator and repressor binding sites
For some genes, the regulatory sites of the activator and repressor partially overlap. Another possibility is "negative cooperativity" -when one molecule repels the other. The outcome of either option is that either an activator or a repressor could be bound at any given time, but not both of them simultaneously. In Tables 3-4 all the states above the double horizontal line are such that only one site can be bound at any given time ('overlapping sites'). The additional states below the line are only possible if both sites can be bound simultaneously ('non-overlapping sites'). Fig. 19 illustrates the dependence of crosstalk on the energy E r (energy gap between unbound and repressor-bound states) for different values of co-activated genes Q. Crosstalk is minimized for E r = E a exactly when Q = M − Q, meaning equal number of activated and repressed genes. However, for other values of Q = M − Q, E r is also not significantly different from E a .

Supplementary Note 8 Combinatorial regulation (AND gate)
So far, we have been dealing with models in which each gene is regulated by a single type of TF, be it by a single activator, a single repressor, or multiple TFs of the same type using cooperative interactions. Here, we will consider a simple model of combinatorial regulation by a combination of two activators of different types, and compute optimal crosstalk for this setup as a function of parameters of interest.
As before, we have M genes in total, with each gene having two binding sites, corresponding to two different (cognate) TF types. For a particular gene to be ON, we need the presence of both cognate TF types, which need to occupy both binding sites. This regulatory architecture corresponds to an AND gate. We don't specify how this AND gate is implemented on the molecular level. Unlike in cooperative regulation, no additional energy gain is assumed here due to the interaction between the two TFs when bound to the DNA.
Each TF can pair with various other TFs in regulating a particular gene. In the basic activation setup, the total number of TFs, M , was equal to the total number of genes. In the combinatorial regulation setup, which is an extension of the basic activation setup, the total number of genes M will be equal to the total number of different TF-TF combinations that can exist. This will depend on the extent of combinatorial regulation, which we quantify using f , the fraction of TF-TF combinations each TF type realizes out of the theoretically maximal number of pairwise combinations it could have.  If there are T TFs in total, each TF can potentially pair with N int = f (T −1) other TF types, where f is the fraction of pairs each TF type realizes. This gives us M = T N int /2, and thus T ≈ 2M/f and N int ≈ √ 2M f . But each TF should pair with at least one other TF, so we require N int ≥1. Taking both of these limits into account, we have, for N int , the number of TFs each TF pairs with, and the number of total TFs T , Table 4: All possible binding configurations, corresponding energies and statistical weights for a two-binding site (A,R)-model: a gene that needs to be silent (hence its cognate repressor is present and its cognate activator is absent). All notation is the same as in Table 3. The column 'crosstalk if OFF' lists binding configurations that were accounted for as crosstalk in x 2 calculation -in this case only no. 5.
If each TF pairs with all other TFs, we have f = 1 and N int = T − 1, which gives us T ≈ √ 2M . We call this "perfect combinatorial regulation" because it minimizes the number of TFs needed to regulate a certain number of genes.
If each TF realizes only a fraction 1/2M < f < 1 of its combinations, we have N int > 1 pairs for each TF, which gives us T ≈ 2M/f. We call this "imperfect combinatorial regulation".
If f ≤ 1/2M , we have N int = 1, which gives us T = 2M . We call this "worst combinatorial regulation".
As before, we will compute the optimal crosstalk when Q genes are required to be ON. Here, we compute the "typical" number of TFs present at any one time, t, by following a similar recipe as before. We have Q = tn int /2, where n int is the number of pairs per TF present at any one time. This will be smaller as there are fewer TFs present at any given time relative to the total number of TF types, i.e., t ≤ T . As before, we have When f > 1/2Q, we have t = 2Q/f and when f ≤ 1/2Q, we have t = 2Q.
crosstalk if gene needs to be configuration activity ON OFF, C can be Energy Weight Table 5: All possible binding configurations and the corresponding energies for a combinatorial regulation setup implementing an AND gate. Each gene has two binding sites which bind two different cognate TF types. The "configuration" column lists all the configurations of the two binding sites of a gene. 'C' denotes binding by cognate factor, 'N' -binding by noncognate and 'U' -means that the site is unbound. We distinguish between binding of noncognate molecules of the same type (N x N x ) and different types (N x N y ). The "activity" column denotes whether in the given configuration the gene is either ON or OFF. To implement the AND gate, we assume that transcription occurs (ON) only when both the binding sites are bound. The next four columns denote whether this configuration is counted as crosstalk (+) or not (-). In the leftmost column "ON", both the cognate transcription factors are present (and the gene should be ON). In the next three "OFF" columns, at least one of the cognate TFs is absent (and the gene should be OFF). In "C can be X" column, the cognate TF of only the left binding site (X) is present, in "C can be Y", the cognate TF of only the right binding site is present, and in "C can be none" column, both the cognate TFs are absent. Blank space denotes a non-existing configuration: these are the configurations including a cognate factor bound in the situation that it is absent. The column "Energy" specifies the energy of these configurations. We define the reference energetic level E = 0 as the state 'CC' when both sites are bound by their cognate factors, such that all other energies are positive. The column "Weight" denotes the statistical weight of the configurations, taking into account the concentrations of the relevant TFs and the energy of the configurations. Note that the statistical weight of the last binding configuration N x N x uses S(2ǫ, L) instead of the usual S(ǫ, L). Supplementary Figure 19: Activator-repressor overlapping binding sites, different Q values. E * r -the energy gap between unbound and repressor-bound states -that minimizes crosstalk depends on the number of co-activated genes Q. Here we show numerical results for the minimal crosstalk X * as a function of the repressor binding affinity E r (with constant activator affinity E a = 15) for different numbers of co-activated genes Q, in the model where activator and repressor binding sites overlap. We find that when the number of co-activated genes decreases (so that more genes need to be repressed) the optimal repressor affinity E * r increases, so that repressors more effectively bind their cognate binding sites and eliminate spurious transcription. When the number of genes that need to be activated equals the numbers of genes that need to be repressed Q = M − Q, we obtain that full symmetry between activator and repressor E * r = E a provides minimal crosstalk -this case is shown in the main text, Fig. 5. Parameters: M = 5000, S = 10 −4.5 .
Unlike in the basic activation setup, Q genes that are required to be ON have two cognate TFs present, but genes that are required to be OFF have either none of the cognate types present, or one (but not both) of TF types present. As calculated above, we have t TFs and each TF has n int combinations, while the total number of combinations it can have are N int ; each TF that is present therefore has N int − n int missing combinations. The number of genes (that should be OFF) which have only one TF present can be obtained as The number of genes with no cognate TFs present is Q 0 = M − Q − Q 1 . In Table 5, we have listed all possible configurations for the two binding sites of a gene, along with details of crosstalk states and statistical weights. From this, we get the per-gene crosstalk for different types of genes. For genes that have both cognate TFs present (Q out of M ), the per-gene crosstalk error is (C/t) 2 (C/t) 2 + 2e −Ea (C/t) + 2(C/t)CS + 2e −Ea CS + (CS) 2 + (C/t)CS(2ǫ, L) + e −2Ea . (S57) For genes that have only one of the two cognate TFs present (Q 1 out of M genes), the per-gene crosstalk error is x one = (C/t)CS + (CS) 2 + (C/t)CS(2ǫ, L) e −Ea (C/t) + (C/t)CS + 2e −Ea CS + (CS) 2 + (C/t)CS(2ǫ, L) + e −2Ea . (S58) For genes that don't have any of their two cognate TFs present (M − Q − Q 1 out of M genes), the per-gene crosstalk error is The total crosstalk is: For a given M and f and for each (Q, S) pair, we compute the optimal concentration C * numerically, and obtain the minimal crosstalk X * comb . As plotted in Fig. 20, the boundaries between different regimes shift in the combinatorial setup. In particular, while at small f the "regulation regime" shrinks in the (Q, S) plane, as f increases, it expands. As f increases towards 1, the boundary between the "regulation regime" and "C = 0" regime moves towards larger S. In Fig. 21, we have plotted the difference in optimal crosstalk between combinatorial regulation and the basic activation setup. For f = 0.001, combinatorial regulation doesn't improve from the basic activation setup in terms of optimal crosstalk. But for f = 0.01, 0.1, and 1, combinatorial regulation gives a lower optimal crosstalk than the basic activation setup. So, there exists a threshold in f such that for combinatorial regulation below that threshold, the "regulation regime" shrinks in comparison to the basic activation setup and performs worse. Above the threshold, the "regulation regime" expands towards larger S and gives a lower optimal crosstalk than the basic activation setup. At the baseline parameters of Q = 2500, M = 5000 and log (S) = −10.5, optimal crosstalk for the combinatorial setups reads as X * comb = 0.28, 0.18, 0.11 and 0.07 for f = 0.001, 0.01, 0.1 and 1 respectively, compared to X * = 0.23 for the basic activation setup.
This decrease in crosstalk is consistent with the reduction in the number of regulatory components (T and t, the number of TFs, see Fig. 22), as discussed in Supplementary Note 1. In the case of perfect combinatorial regulation (f = 1), we have roughly √ 2M instead of M TF species in the basic activation setup, which is a significant reduction in the number of regulatory components. Hence, each TF now effectively controls Θ = M/ √ 2M = M/2 genes, and so the decrease in crosstalk is expected to be roughly √ Θ compared to the basic activation setup. For M = 5000 genes, this would suggest that perfect combinatorial regulation could decrease the crosstalk by ∼ 7fold over the basic model. The actual reduction in crosstalk (from 0.23 to 0.07) isn't as large because of certain differences between the combinatorial setup and Θ-genes setup of Supplementary Note 1. One major difference is that in the Θ-genes setup, the cell can only activate sets of genes of size Θ, while in the combinatorial setup, the cell has the power to activate single genes at will, albeit at the cost of partially activating genes that aren't needed (since a considerable fraction of genes that should be OFF must have one of the two activators present) and allowing new non-cognate configurations. Fundamentally, therefore, crosstalk reduction comes from the decrease in the number of regulatory components (TF species) needed in the system, which again points to the explosion in the number of possible noncognate interactions as the crucial origin of the crosstalk. In other words, what qualitatively seems to matter is Θ, the number of regulated genes per TF, while the detailed manner in which these TFs regulate is less important for the actual numerical value of crosstalk (but is important for the functioning of the cell; e.g., in combinatorial regulation genes can be addressed individually, while in the model of Supplementary Note 1 they cannot be).
We also note that while near-ideal combinatorial regulation appears to be a useful strategy to reduce the crosstalk, studies of scaling laws in gene regulatory networks do not appear to be consistent with the use of such a pure combinatorial strategy. In particular, the number of TFs scales at least linearly (quadratically, in prokaryotes) with the total number of genes [4] across different organisms, while an efficient combinatorial strategy would suggest sub-linear (e.g., square-root) scaling. This clearly does not preclude the use of combinatorial regulation in some regulatory elements, but does show that even with the possible utilization of the combinatorial strategy the observed growth in the number of distinct TF species (which seems to be an important crosstalk parameter) is extensive. In the other panels, we show the regimes for the combinatorial setup for f = 0.001, 0.1, and 1, respectively, from left to right. For f = 0.001, the "regulation regime" is slightly smaller than in the basic activation setup. As f increases, the "regulation regime" increases in size (and is bigger than in the basic activation setup) and the boundary with C = 0 is pushed higher towards larger S.

Supplementary Note 9 Alternative crosstalk definition
In the basic setup presented in the main text, we considered "activation out-of-context"-i.e., activation by the binding of a noncognate TF when the cognate TF is present (but not bound)-to be a crosstalk state. Our reasoning was motivated by viewing transcriptional regulation as a signal transmission apparatus. In this interpretation, gene activation by a noncognate TF amounts to generating a response (transcriptional activity) to a wrong input signal. Consequently, this should count as crosstalk, despite the fact that (by chance) the correct signal was simultaneously present in the cell. This is perhaps easiest to appreciate if one considers more realistic setups in which genes are not simply "ON" and "OFF", but can be quantitatively regulated by the level of their cognate TF. In such a model, there might be two TFs present and varying in concentration as a function of time: one cognate for the gene of interest and one not. In this case it is clear that the correct response of the gene is to track the changes in the cognate TF, and not to simply be expressed in a constant "ON" state; consequently, tracking the noncognate TF due to crosstalk is obviously an error, even if the cognate TF is present at the same time.
One could, however, argue that "activation-out-of-context" shouldn't be considered as an error state. If the presence or absence of TF signals is a binary variable and if the binary response is defined solely by the state of transcriptional activity (activation/inactivation of gene), then when the presence of the signal matches the response state, the regulation outcome is correct, irrespective of the molecular details on the promoter. For example, for a gene whose cognate TF is present, activation by any means (either by cognate or noncognate binding) is the correct response. In this scenario, the "out-of-context activation" is actually what one might call beneficial crosstalk: here, noncognate TF can be seen as helping to activate the gene when the cognate TF is also present. For a gene whose cognate TF is absent, activation is still an incorrect response, like before.
Hence, x 2 (i) retains the same expression, but x 1 (i) changes to As shown in Fig. 23, optimizing C results in three distinct regulatory regimes, like in the default basic setup. For small S in the regulation regime, the optimal C is given to the leading order by: The minimal crosstalk error at the optimal concentration C * is given by Optimal TF concentration C * , that minimizes the crosstalk, relative to C 0 , the optimal concentration at the baseline parameters (see main text).