Evolutionary constraints on the complexity of genetic regulatory networks allow predictions of the total number of genetic interactions

Genetic regulatory networks (GRNs) have been widely studied, yet there is a lack of understanding with regards to the final size and properties of these networks, mainly due to no network currently being complete. In this study, we analyzed the distribution of GRN structural properties across a large set of distinct prokaryotic organisms and found a set of constrained characteristics such as network density and number of regulators. Our results allowed us to estimate the number of interactions that complete networks would have, a valuable insight that could aid in the daunting task of network curation, prediction, and validation. Using state-of-the-art statistical approaches, we also provided new evidence to settle a previously stated controversy that raised the possibility of complete biological networks being random and therefore attributing the observed scale-free properties to an artifact emerging from the sampling process during network discovery. Furthermore, we identified a set of properties that enabled us to assess the consistency of the connectivity distribution for various GRNs against different alternative statistical distributions. Our results favor the hypothesis that highly connected nodes (hubs) are not a consequence of network incompleteness. Finally, an interaction coverage computed for the GRNs as a proxy for completeness revealed that high-throughput based reconstructions of GRNs could yield biased networks with a low average clustering coefficient, showing that classical targeted discovery of interactions is still needed.


Introduction
Regulation is a critical biological process common to every living organism.Environmental cues such as nutrient availability and stimuli like temperature need to be sensed and integrated across a multi-layered decision-making system for an organism to mount an ad hoc response.A reductionist approach to biology has yielded extensive amounts of information about individual molecules and their interactions.However, it is now clear that most biological phenomena are complex and arise from the interaction of different components 1 .Transcriptional regulation is the process by which a set of regulator genes promote or inhibit the expression of other genes 2 support to previous observations such as the presence of a long-tailed node degree distribution.The observed constraint in network density allowed us to estimate the total number of interactions that complete GRNs would have.Our framework represents the first approach based on biological information used to make this prediction and could be particularly valuable for integrative methods used to predict GRNs.The Dialogue for Reverse Engineering Assessments and Methods (DREAM) organized a challenge aimed at predicting GRN interactions based on expression data [17][18][19] .The participants were required to upload a list of possible regulatory interactions between Escherichia coli genes sorted by their confidence scores.This approach would benefit from the inclusion of prior information contained in the network structure.The ability to predict the number of interactions a GRN has, along with other topologically constrained properties, could be incorporated into these methods to allow for more accurate predictions.Using our predicted number of interactions, we computed an interaction coverage for each network, which we found serves not only as a completeness measure but also as a network quality indicator.The methodology developed in this study can be used to systematically assess whether other GRN topological properties have been subjected to evolutionary constraints, further enhancing our understanding of prokaryotic genetic circuitry.

Constrains in the properties of the Abasy Atlas GRNs
GNRs possess a series of topological properties that allow for the correct integration of and response to environmental signals.The density of a network is a property quantifying the fraction of existing interactions relative to the total number of possible interactions given the number of genes (see Methods).If GRNs share some common structural patterns, we could expect to see a convergence or trend of their densities into a defined range.Interestingly, we found that Abasy GRNs (N = 71) densities followed a trend towards a relatively small value as the number of genes in the GRN increased.Networks with less than 500 genes showed a significant amount of variation but followed a nonlinear trend towards a value of low density (Fig. 1a).We found that this nonlinear trend follows a power law (d ∼ n -γ ) with γ = 0.8 ≈ 1 (R 2 = 0.96), strongly suggesting that an hyperbolic (inverse) behavior governs the relationship between density and number of genes.The same behavior was observed using a set of nonredundant Abasy GRNs (γ = 0.74, R 2 = 0.95) (Supplementary Fig. 1).
Differences in the curation methods or reconstruction strategies could account for this result, but this trend was conserved when analyzing a set of historical reconstructions (see methods) of E. coli 20,21 and Corynebacterium glutamicum 15 GRNs.Furthermore, four different reconstructions of the Bacillus subtilis GRN each constructed independently with different curation strategies 22,23 and even a computational prediction 24 continued to show a trend towards a low density (Supplementary Fig. 1).This trend was observed when GRN genomic coverage (a traditional measure for completeness; the fraction of the organism genome that is included in the GRN reconstruction), or a set of only non-redundant networks (see methods) were used (Supplementary Fig. 1).Interestingly, changes in GRN completeness seemed to negatively correlate with density (Fig. 1b note the density axis scale), suggesting that density is likely constrained in the complete GRNs.
The variability observed in highly incomplete GRNs (genomic coverage < 0.25) can be explained by the fact that a network with such a low number of genes could not possibly have a density as low as observed for more complete networks without having disconnected genes, which are not included in GRN reconstructions (e.g., a network with 200 nodes and a density of 0.001 would have ∼40 interactions which, in the best case scenario, would leave 120 disconnected nodes).To illustrate this, we generated random subnetworks (using a snowball sampling approach, see methods) of the most recent E. coli GRN (511145_v2017_sRDB16_dsRNA) and assessed the sampled networks densities excluding disconnected nodes.As expected, the relationship between subnetwork mean density and number of nodes followed the same trend as observed for the networks in Abasy.Accordingly, the variation increased as subnetwork size decreased (Supplementary Fig. 1).In fact, normalizing network density by its expected value derived from E. coli random subnetworks unveils the putative invariance in density (Supplementary Fig. 1).
GRNs are usually modeled as directed graphs because only a set of regulator genes may activate or inhibit other genes.Therefore the patterns observed in density could be understood if the number of regulators, and hence the number of interactions, in the networks are being constrained.Notably, the number of regulators in the studied GRNs showed a high correlation with the number of genes or genomic-coverage (Fig. 1c and Supplementary Fig. 2).This trend was also observed in historical (E.coli, but not C. glutamicum) and independent GRN reconstructions (B.subtilis) (Supplementary Fig. 2).These observations suggest that the evolutionary constraints that have shaped the structure of prokaryotic GRNs constrains the percentage of genes that can act as regulators (∼7% in average, Fig. 1c).Although the discovered patterns in network density and number of regulators are very likely to be related, it is still elusive whether one and which, is a causal agent for the other or whether more elusive topological constraints confound both.

The complexity of GRNs could be bound by their stability
We found that the relationship between density (d) and number of genes (n) of Abasy Atlas GRNs follows a power law (d ∼ n -γ ) with γ ≈ 1 (Fig. 1a).This is a really interesting result, as it evidences that GRN complexity could be bound by the number of genes as predicted by the May-Wigner stability theorem 25 .To further elaborate on this, we need to go back several decades.
In the early 1970s, Gardner and Ashby empirically found that the stability of randomly connected large systems depends on their connectance 26 .They explored the system stability by modeling a set of nonlinear first-order differential equations whose coefficients, representing the interaction strengths among variables, were randomly obtained from a Gaussian distribution having zero mean and variance α 2 .The change of the system state x(t) (where x(t) = (x 1 (t), x 2 (t), …, x n (t))) in time can be represented by the equation dx/dt = Ax, where A is the matrix of interaction strengths.The percentage of non-zero entries in this matrix was defined as the percentage of connectedness (connectance).Connectance is then analogous to density in graph theory as both quantify the fraction of existing interactions relative to the total possible (hereafter we use connectance and density as synonyms, but we prefer connectance when refer to early works or ecological communities where this term is a standard).
Connectance (and consequently also density) has an important role in complexity theory as it quantifies the complexity of a system 25,26 .Robert M. May extended Gardner and Ashby's work to conclude that the stability of randomly connected systems depends on the number of variables (n), connectance (C), and interaction strength dispersion (α 2 ) 25 .The May-Wigner stability theorem says that randomly connected large systems are stable if nC < 1/α 2 .Unstructured and structured systems have been shown to be bound by this theorem.An early theoretical work in structured model networks has suggested that such structures promote stability 27 , and this was also observed for hierarchical networks under certain conditions 28 .However, a later theoretical study concluded that hierarchical and modular networks are less stable than random networks 29 and other showed that increasing modularity or the number of hierarchical layers tends to increase the probability of instability 30 .It has been theoretically suggested that optimizing multiple structural and dynamical constrains such as minimizing complexity and path length while increase robustness to dynamical perturbations will evolve modular scale-free networks 31 .
Bacterial GRNs exhibit a hierarchical modular organization as predicted by the natural decomposition approach 5,6,8,15 and other studies 7,8,32 .As system stability is a requirement for organism survival, there must be constrains shaping the modularity and number of hierarchical layers of GRNs.Besides the inverse relationship between density and number of genes (nd = k = 0.40), we also observed that the interaction strength dispersion of Abasy Atlas GRNs is a constant (α 2 = 1/k = 2.50) (Fig. 1a).The same result was also observed when we used a set of non-redundant Abasy GRNs (nd = k = 0.34, α 2 = 1/k = 2.90) (Supplementary Fig. 1).This shows that GRNs stability is ensured if the interaction strength dispersion is less than 2 (nd < 0.50).Conversely, empirical data from food webs also have shown a hyperbolic behavior governing connectance and number of species but the interaction strength dispersion falls into a range having lower values (1/6 < α 2 < 1/2) [33][34][35] .
Further research is needed to understand how evolution is acting upon the GRNs organization constraining modularity and hierarchy to ensure stability, and to evaluate how the probability of stability constrains the landscape of possible GRNs structures (the GRNs organizational landscape 15 ).An analysis of GRN properties shows that the organization of complete prokaryotic GRNs is not random but scale-free The trend towards a relatively small density value implies that prokaryotic GRNs are sparse.This observation is likely a result of a set of common organization principles.The existence of highly connected global regulators in a GRN with such a low density should cause that nonglobal regulatory genes to have a low number of interactions (relative to global regulators).Thus, the average node connectivity should be low.In fact, a previous study found that in sparse complex networks (a social network in this case) the node degree distribution changes as link density increases; these sparse complex networks initially showed a power-law node degree distribution, but as link density increased a divergence from power law was noted 36 .
The power-law behavior governing the node degree distribution has been proposed as a common organizational principle of GRNs 6 .For the sake of simplicity, we focus here on a general P(k) distribution combining both P(k in ) and P(k out ) distributions.In the literature of network biology, there has been a debate when referring to 'universal' topological properties of biological networks 1,37 , especially on whether a specific probability distribution governs GRN P(k).It has been particularly stated 38 that sampled Erdos-Renyi (ER) networks (with an initial Poisson governed node degree distribution) could present power-law P(k) distributions.Thus creating the misleading idea of sampled random networks being hierarchical structures with global regulators (hubs).Furthermore, these results raised the possibility of biological network P(k) long-tailed distributions to be an artifact of sampling (i.e., incompleteness).
The main problem with the aforementioned (and other) studies is the assessment of the goodness of fit to a long-tailed distribution by only using the coefficient of determination for a linear regresion 37 .To robustly find whether the analyzed GRNs P(k) follow a long-tailed distribution, or if previous observations 1,9,15 could be in fact an artifact of sampling 37,38 , the goodness of fit of GRN connectivity to different probability distributions was assessed through computing the Kolmogorov-Smirnov (KS) D statistic (distance) using a maximum-likelihoodestimate (MLE) fitted theoretical distribution, and by calculating the log-likelihood ratio test of the MLE fitted distributions against a power-law distribution 39 .
The log-likelihood ratio tests showed a preference for a power-law distribution over exponential and Poisson distributions for the GRNs studied, and rendered no significant distinction between lognormal, stretched exponential and power law.A slight preference for truncated power law was also evident (Fig. 2a, b).Furthermore, the KS D statistic favored the power law and other long-tailed distributions as they had the smallest difference between the data and the model (Fig. 2a).No GRN P(k) showed a good fit to a Poisson distribution, while a set of sampled 38 (see methods) ER graphs parametrized to have an average probability and size equal to the GRNs, had the best P(k) fit to a Poisson distribution as demonstrated by the calculated KS D statistic (Fig. 2a) 39,40 .
Previous reports 38 claimed that the smaller the subsample of the ER networks analyzed, the better the fit to long-tailed distributions.It is important to highlight that we initially observed a preference for a power law (over a Poisson) for all ER-graphs P(k) distributions (Supplementary Fig. 3).While unexpected, this observation can be explained by a) the broken assumption when comparing Poisson and power-law distributions due to the fact that the likelihood ratio test assumes nested distribution (Poisson is not nested within power law unlike the other distributions tested), or b) the effect on the data of a fitted parameter (xmin) needed for the MLE of a power-law distribution 39,40 .Because we observed similar results for the KS D statistic (Supplementary Fig. 3), we assumed the latter to be the cause of the observation, as the KS_D statistic does not assume nested distributions.The xmin parameter finds the minimal accepted value of connectivity for which the power law distribution is valid and causes ER graphs derived P(k) to appear power-law like (see Supplementary note).When this analysis was performed fixing the value of xmin to a value of one (therefore forcing the fit to consider all the data) the ER P(k) preference for power-law distributions (specially for the most complete ER networks) disappeared (Fig. 2a, b).
While we were able to replicate previous results showing a better fit to a power-law distribution with more incomplete ER graphs 38 (Fig. 2c), when comparing the fit of these ER networks to Poisson and power law, an evident and significantly better fit for a Poisson P(k) distribution was observed even for the most incomplete networks (Fig. 2d-f and Supplementary Fig. 3).Furthermore, the smallest available GRNs showed a strong deviation from a Poisson distribution, and, as previously reported 36 , we detected a negative correlation between power-law fit and network density (Supplementary Fig. 3).These observations suggest that a Poisson distribution is not a bona fide model for deriving GRNs P(k) (Fig. 2d and Supplementary Fig. 3 and 4).Long-tail distributions explain better the prokaryotic GRNs P(k) here studied, thus supporting the existence of hubs and a non-random organization as properties, rather than as artifacts of sampling [5][6][7][8][9] .
The previously discussed 38 effect of sampling on ER networks (Fig. 2d) motivated us to assess the possibility of further rejecting ER networks as a model for GRNs.If GRNs are not derived from ER graphs, then some of the properties observed in biological networks should be incompatible with the ER model.Mainly, biological networks have been proposed, and seem to have, long-tailed P(k) distributions 1,8 .While some instances of a Poisson derived network may show a high goodness of fit to a long-tailed P(k) distribution (Fig. 2), we hypothesize a measure of P(k) tail length to show higher values in biological networks as opposed to comparable ER graphs.Therefore, we defined tail length as the common logarithm of the difference between the maximal and minimal degree of a network.
A GRN comparable (see methods) set of ER graphs was constructed (Fig. 3 a, b), and we studied the tail length and average clustering coefficient distributions of this null subset compared to Abasy GRNs.We also applied this methodology to Barabasi-Albert (BA) growing random networks model.Both average clustering coefficient and tail length distributions from GRNs were different from their parametrized ER and BA counterparts (Fig. 3 c, d).The construction algorithm for the BA graphs caused a significant difference between the null models density and the GRNs density (Fig. 3b).Thus, although the average clustering coefficient differences are evident, no conclusion can be drawn regarding them (see discussion).Nonetheless, the significant difference between average clustering coefficient and P(k) tail-length between GRNs and their equivalent ER graphs (Fig. 3 c,d) suggest that Abasy GRNs cannot be derived from an ER model, given their densities, average clustering coefficients and tail lengths (Fig. 3).
Furthermore, when analyzing the properties of ER networks sampled to appear power law, their tail-lengths and clustering coefficients were further incompatible with those of the biological networks (Fig. 3; compare e,f with c,d).These combined results provide further evidence suggesting that a Poisson distribution is not likely to model the distribution of complete prokaryotic GRN P(k) appropriately, and that the observed properties (e.g., the existence of hubs causing long tailed P(k) distributions) are not likely an artifice of GRN incompleteness.Importantly, this methodology can be easily extended and modified to assess the plausibility of other network null models for the existent GRNs possibly aiding in the design or parametrization of algorithms to infer GRNs.

Estimating the number of interactions of complete GRNs
A discovered completeness-density trend (Fig. 1a) should imply a relationship between the number of genes and the number of interactions a network has.In fact, this was observed amongst the networks in Abasy (Fig. 4a) and when analyzing GRNs normalized by genomic coverage, historical reconstructions and a subset of only strongly experimentally-supported 41 (when possible) non-redundant networks (Supplementary Fig. 5).Both the completenessdensity and completeness-number of interactions correlations (Fig. 1a and 4a) enabled us to generate predictive models for inferring the number of interactions a GRN would have (see methods).Briefly, by incorporating either the relationship between number of genes and edges (edge regress model (EdR), Fig. 4a), by assuming an invariant density (density invariance model (DI), Fig. 4b and Supplementary Fig. 1) or incorporating the trend in density (density proportionality model (DP), Fig. 4c and Supplementary Fig. 5), we could estimate a set of proportionality factors that explained the number of edges (interactions) in terms of the number of nodes (genes).
Assuming the total number of nodes of a GRN to be the number of annotated genes in the genome of that organism, we could generate three models (Fig. 4a-c and Supplementary Fig. 5) to estimate the total number of interactions.A comparison between some organisms GRN number of genes, genomic coverage, actual number of interactions and total estimated number of interactions predicted by each model is given in Of the three models, both the EdR and DP approaches had the highest goodness of fit, as they incorporated variance from the smallest GRNs.Both have only a free parameter to estimate, and an error parameter, but EdR assumes a simpler underlying model.Interestingly both model predictions are in agreement with each other, while the DI model predicts more interactions, and could be biased by the reduced number of non-redundant more complete (genomic coverage above 0.2) GRNs.
To further compare our models, we assessed their capability to predict the number of genes of the historical reconstructions of E. coli (see methods) available in Abasy.All the models had a good fit to the data (0.87 ≤ R 2 ≤ 0.91), with small differences between them.Interestingly, the DI model had the best fit, despite having the poorest fit to the most incomplete networks in Abasy (Fig. 4b), the DP and EdR models were, as expected, very similar (Supplementary Fig. 5).Until today, little efforts have been made to estimate GRN size considering missing interactions 42 , and our models are the first ones to integrate meaningful biological data consistent across distinct organisms.We leveraged our capability of finding putative trends or constraints on the topology of GRNs to make further predictions about their final topology.The trends in the number of regulators, network density and the total number of interactions could be valuable for the still ongoing development of methodologies aimed at predicting complete GRNs de novo by integrating high-throughput data.

Assessing GRN completeness based on number of interactions
We next assessed GRN completeness using an interaction coverage, rather than genomic coverage score.Notably, the interaction coverage computed from our models showed a high correlation with genomic coverage (R 2 = ∼0.88)and with average clustering coefficient (R 2 = ∼0.60,Supplementary Fig. 6), thus suggesting that more complete networks tend to have a higher average clustering coefficient.
To assess differences between genomic and interaction coverages, we computed a comparison score defined as the natural logarithm of the ratio between interaction coverage (calculated independently with each prediction model) and genomic coverage.Negative values of this score indicate that the interaction coverage is penalizing (i.e., predicting the GRN to be less complete) when compared to genomic coverage.We expected networks with a higher average clustering coefficient to be less penalized by the interaction coverage, as they contained more interactions (R 2 = 0.40, Supplementary Fig. 6).A dependency between the computed comparison score and average clustering coefficient was evident in two predictive models (EdR and DP Fig. 4d, f) as the less penalized networks were the ones with a higher average clustering coefficient.Notably, the model based on a density invariance (DI) fails to recapitulate this result, something explained by its rather simplistic nature which could also explain the lower fit (Fig. 4b,e).In all cases, networks were mostly penalized as evidenced by negative penalization values (Fig. 4 d-f).Thus far, the analyses were carried out using all networks available in Abasy including different organism historical, independent, and meta-curated GRNs.The density invariant assuming model developed herein uses only a set of non-redundant networks with a number of genes higher than 1000 to identify the putative invariant density (a smaller training set).
We also evaluated whether network redundancy could be biasing our estimates for the other two estimators, we re-estimated model parameters using only a set of non-redundant, most complete, strongly supported (when possible) GRNs.The parameters and results were similar regardless of the analyzed set (Supplementary Fig. 7 and Supplementary Table S2).The overall negative penalization scores suggest that most networks still lack a significant amount of curation for discovering intra-modular interactions.Furthermore, our computed interaction coverage can be used to assess network completeness and quality, as deviations from this pattern may reveal reconstruction biases (see below).

Curation of high throughput experiments could bias GRN discovery as revealed by subsetting analysis of a GRN gold standard
We found that the interaction coverage and penalization score estimates for three networks of Mycobacterium tuberculosis (83332_v2015_s15, 83332_v2016_s11-12-15 and 83332_v2018_s11-12-15-16) were strong outliers in our previous analyses, particularly in the dependency between average clustering coefficient and completeness or comparison scores (Fig. 4d,f and Supplementary Fig. 6,7).These networks are based mainly on the 2015 reported network reconstructed by high-throughput experiments, which presents a very low average clustering coefficient (∼0.12) indicating a low network modularity (Supplementary Fig. 7) 43,44 .Furthermore, one of these networks showed an interaction coverage above one, but a low average clustering coefficient (Supplementary Fig. 6).To address whether a biased reconstruction based mainly on high-throughput methods could be the cause of our observations, we created three subsets of the E. coli gold-standard GRNs 21 consisting of interactions supported only by high-throughput curated experiments (HT), interactions with evidence from non-high-throughput experiments (non-HT), and a last one containing both.Although all of them contained the same set of nodes (and hence the same genomic coverage), the non-HT subnetwork contained a higher number of edges than the HT.This difference in the number of regulatory interactions could be due to the great amount of curation this regulatory network has.However, no differences among the intrinsic properties of these subnetworks should be observed.We hypothesized that the HT reconstruction contained fewer modular interactions if experiments were performed on a set of regulators yielding a poorly interconnected tree-like structure.Effectively, the average clustering coefficient of the HT subgraph was ∼5-fold less than the non-HT subgraph and ∼7.7-fold less than the combined (i.e., containing both HT and non-HT interactions) subgraph (Fig. 5).
To analyze if this reduction in the average clustering coefficient is due to the structure of the graph rather than a diminished number of edges, we performed a random removal of edges of the combined subgraph until it contained the same quantity as the HT; this was repeated 1000 times to have a random distribution of clustering coefficient averages.Interestingly, the distribution of the average clustering coefficient of the randomized networks was significantly higher from that of the HT subgraph (Z = -3.9,p < 0.001)(Fig.5), and significantly lower than the experimental curation and combined subgraphs (Z = 9.3, p < 0.001, and Z = 18.4,p < 0.001, respectively).Thus, implying that the high-throughput GRN structure has arisen from a specific type of sampling causing a particular organization, specifically yielding a low average clustering coefficient while maintaining the genomic and interaction coverage in the HT reconstruction.This phenomenon holds true for curated regulatory networks, as revealed by analyzing the average clustering coefficients of the GRNs available in Abasy Atlas 14 .
23]45 These networks showed a higher average clustering to coverage ratio than the ones based mostly in high-throughput 46 , computational predictions 24 or metacurations 14 (Supplementary Fig. 7).When analyzing a non-redundant subset of Abasy GRNs, those with a higher genomic coverage were usually the ones with higher average clustering coefficient except for 83332_v2015_s15 (the previously discussed M. tuberculosis HT GRN, Supplementary Fig. 7).Overall, these observations suggest that the genomic coverage is not the best proxy for network completeness provided an interaction coverage to be available, and that average clustering coefficient could serve as a network quality indicator, as very low values of average clustering coefficient could indicate a biased GRN reconstruction.Furthermore, with the upcoming of high-throughput and computational discovery of GRNs, an increase in the genomic and interaction coverages of networks is expected, potentially creating a misleading belief of network completeness.Reaching high levels of genomic coverage does not necessarily represent a highly complete network, and an integrative estimate of the number of missing interactions and clustering coefficients is needed not only for assessing completeness but also for guiding the experimental or inferential strategies to complete GRNs.

Discussion
The present study was powered by the availability of many meta-curated GRNs present in a single database 14 .By simultaneously comparing properties between these GRNs, an implicit evolutionary study was conducted revealing a set of constrained properties.The inherent assumption underlying our conclusions is that a trait (or in this case a network property) present in phylogenetically distinct organisms would have most likely appeared in a common ancestor and prevailed through natural selection 6 or other evolutionary forces.Recent studies support this assumption, showing that the functional architectures of disparate bacteria are conserved by convergent evolution thus suggesting that bacterial GRNs evolved in a constrained organizational landscape 6,14,15 .
Because of the lack of available complete GRNs, our results compared organisms with a heterogeneous amount of information.This represents a potential source of noise or bias which would complicate the elucidation of constrained properties.To account for the variation in existing regulatory information, the properties or traits were always studied in relation to network completeness using the number of genes or network genomic coverage as a quantitative proxy for it.Using the approach described here, the systematic analysis of constrained properties in prokaryotic GRNs is feasible.
Abasy Atlas v2.0, the database used in this study to obtain regulatory information and properties, contains 71 distinct regulatory networks covering 42 strains of nine different species.The implicit redundancy of the GRNs present in this database could, through pseudo replication, bias our results.This was accounted for by repeating all analyses based on comparing between GRNs using a non-redundant subset of the networks present in Abasy.Furthermore, the usage of historical reconstructions 20,21 and independent reconstructions [22][23][24] of regulatory networks to validate our results is a strategy that allowed us to further assess the existence of observed properties, and to exclude curation strategies and network completeness as sources of artificial observations.It is not yet clear to what extent and why, the observed properties are being constrained by evolutionary forces.We hypothesize that the observed constraints, including network complexity and number of regulators, could be explained by evolution selecting for system stability and the existence of regulatory motifs which enable cells to perform computations to integrate differential signals 4 , preventing GRNs from randomly growing or losing a defined structure 6,8,15 .Future studies could use similar approaches as the ones described herein to test this hypothesis.
Network biology has revolutionized life-science oriented research.An increased understanding of any organism could be gained through the analysis of the different systems that compose it.Nonetheless, topological analyses of biological networks have faced strong controversies arising from hyperbolic claims stating universal properties of all biological networks 1,37 .In the present study, we have shed light on the polemic subject of GRN degree distribution 38 .Through rigorous statistical approaches we have shown that long-tailed distributions (which favor the existence of hubs and a non-random organization), and not Poisson, better explain the degree distributions in bacterial GRNs.Furthermore, we demonstrated that ER networks that appear power-law due to an artifice of sampling, as they still have a better fit for a P(k) Poisson distribution, and their average clustering coefficients and tail-lengths are highly incompatible with currently known GRNs.Notably, remarks referring to the extremely low average clustering coefficient of these sampled ER networks have been published 47 .Here we extend these observations by actually comparing power-law like ER network properties with meaningful GRNs.
Although a limitation of one of our approaches was the invalidity of comparing power-law and Poisson distribution fits using likelihood ratio tests, our observations are further supported by observations of the KS D statistic, which is free of any nested model assumption.Moreover, an approach based on constructing ER and BA networks parametrized to follow our GRNs densities allowed us to reject the possibility of our biological networks as being derived from these models based on their paired densities, clustering coefficients and tail length distributions.We consider the incapacity to construct BA networks parametrized to fit our biological network properties evidence of how unlikely it would be for our GRNs to be derived from them.This same approach can be used and extended to assess the consistency of GRN inference methods by systematically attempting to find incompatible properties between the predicted GRNs and the distribution of their curated counterparts.
The discovered constrain on GRN complexity was exploited to find a relationship between number of genes in a GRN, and number of edges, and as a consequence predict total number of genetic interactions.These predictions use the trends discovered herein to produce an expected number of interactions, along with 95% confidence intervals.The current challenges existing in curating or predicting GRNs from high-throughput data make the estimate of the total number of expected interactions for the complete networks a key factor allowing algorithms to set boundaries when assessing the possibility of connections between all possible gene pairs 17,19 .A prediction based on observed properties of the amount of interactions a complete GRN has not been, to the best of our knowledge, described before.It represents an important contribution to the field as it allows databases and curators to gain an idea of how truly incomplete current GRNs are 42 .For example, estimates on the amount of missing information suggest that half of the E. coli genetic regulation is still unknown 48 .We extend this predictions and present quantitative approaches for estimating the amount of total (and thus also of unknown) interactions.We computed an interaction coverage score based on the predictions made by our different models and discovered that most GRNs are more incomplete than previously thought based on genomic coverage.Interestingly, networks with a higher average clustering coefficient were penalized the least (when compared to genomic coverage) by two of our three models.The outliers to this observation were networks based on a high-throughput reconstruction of M. tuberculosis GRN 46 .To provide a proof of principle of how high-throughput curation could bias GRNs reconstruction making them appear complete in terms of both genomic and interaction coverage, but presenting a low average clustering coefficient, we devised a subsampling of a gold-standard network 21,49 based on the experimental evidence for each interaction.Our results suggest that high-throughput curation of GRNs could yield networks with modularity lower than expected.This result aimed at raising the possibility of biases existing when curating GRNs mostly from high-throughput experiments, we acknowledge that a more detailed revision of GRNs and high-throughput technologies, and their biases, should be performed.Overall, our observations suggest and enable the finding of global structural properties constrained in GRNs, which can be used to understand how evolution has shaped their topology, aid in predicting other properties (i.e., number of interactions a complete GRN would have) and will be particularly valuable for guiding GRN inference and prediction algorithms.

Conclusion
In this study, the availability of a large collection of meta-curated GRNs in a single database enabled the analysis of different properties unveiling topological constraints, which we hypothesize to have underlying evolutionary causes.We have found GRN density, number of regulators and number of interactions to have a constrained space of possible values (organizational landscape); in the latter two cases with a strong relationship with the number of genes in the network.Our results suggest that bacterial GRNs node degree distributions are governed by long-tailed distributions, supporting the existence of global regulators and a nonrandom organization.We could discard ER and potentially BA as faithful representations of GRNs given their densities, average clustering coefficients, and tail-length values, settling the current debate of whether degree-distribution claimed properties exist or are an artifact of network sampling.
Three different estimations of GRN total number of interactions were computed, which represent a valuable tool that can be used to aid in network inference.Most GRNs were penalized when comparing their interaction coverage with their genomic coverage.Interestingly, the least penalized networks were those with a high average clustering coefficient, except for a set of networks all based on a high-throughput reconstruction of the M. tuberculosis GRN.Finally, our results suggested that high-throughput based curation could bias GRN discovery yielding tree-like networks with a low average clustering coefficient.Nonetheless, a more thorough analysis of high-throughput methodologies and their resulting reconstructed GRNs is needed to detect whether this pattern holds true for all high-throughput based GRNs.The methodology presented here can be used to systematically find constrained topological properties throughout the GRNs available in public databases, and aid in the developing area of GRN de novo inference or prediction.

Data retrieval processing and availability
Prokaryote GRNs were obtained from Abasy Atlas v2.0, a database that contains 71 reconstructed and meta-curated GRNS covering 42 bacteria.All the analyses were performed on all 71 GRNs unless stated otherwise.All analyzed data is available as downloadable files from Abasy Atlas at http://abasy.ccg.unam.mx.Transcriptional regulation was modelled as a graph, where nodes represented genes, and the edges between them represented the existence of a regulatory interaction.For the analyses of number of regulators, directed graphs were used and regulators were defined as the set of nodes with an out-connectivity > 0 (i.e., they are a known regulator of at least one gene).

Density analysis
For a given network, density is defined as the number of actual edges of the network over the number of potential total edges.In GRNs which are represented as directed graphs it is calculated using the formula: Where ݀ ீ stands for the density of graph G, ݁ ீ and ݊ ீ are the sets of interactions and nodes in graph G, respectively, and |‫|ݔ‬is the cardinality of set ‫.ݔ‬This formula was used to estimate the densities for all GRNs present in Abasy Atlas.

Computing average clustering coefficient
The clustering coefficient is a node-level measurement of modularity.An easy analogy to understand clustering coefficient is a social network, where nodes represent people and edges represent the existence of a friendship between two nodes (people).The clustering coefficient quantifies for a given person how many of its friends know each other.A clustering coefficient of one would indicate a high centrality, within a module (group of friends), of that node (person) meaning that all of this person's friends are also friends with each other.Formally, the clustering coefficient is calculated as (assuming an undirected graph): Where the numerator (ܰ ௩ ) represents the number of actual edges (interactions) between the current node neighbors, and the denominator is the maximum number of interactions the neighbors could have.In this study, the clustering coefficients of all nodes in a given graph were averaged (arithmetic mean).

Historical, independent and non-redundant GRN reconstructions
In this study we leveraged the availability (within Abasy) of different networks curated or constructed for the same organisms.In this study, we defined the set different public versions of a network (e.g., the different versions of RegulonDB for E. coli) as the historical reconstructions.In contrast, networks from the same organism reported from different databases or sources (e.g., Subtiwiki, DBTBS and the in silico reconstruction by Arrieta et al. (2015) 24 ) were considered independent GRN reconstructions.Finally, given the existence of several non-independent GRNs for a given organism we defined a set of non-redundant GRNs to ensure the results reported herein are not an artifice of pseudoreplication.The nonredundant GRN set was constructed by selecting the most-complete strong evidence GRN available in Abasy.

MLE and KS D estimation
To estimate the parameters for the different P(k) distributions compared, a vector of degrees was obtained for each GRN.This vector was used as the input data to fit the different probability distributions using maximum likelihood estimates found with in-house scripts and the library powerlaw for python 39 based in the methods previously described 40,50 .A loglikelihood ratio test was used to compare the goodness of fit of the different distributions versus a power-law distribution.The scores were plotted on a heat map using matplotlib and seaborn.As another measure of goodness of fit, the Kolmogorov-Smirnov D statistic was also calculated and depicted as a heat map.The MLE methods used require an 'xmin' parameter for the power-law distribution.This parameter selects the minimal degree value from which a power-law would have its best fit.Because this parameter induces trimming of the data causing non-long-tailed data to have a good fit to a power law distribution, these analyses were repeated with a fixed xmin value of one, thus including all of the data (see supplementary note).

Incompatibility between null models and biological GRNs
Randomized ER or BA graphs were generated to follow the observed GRNs density and size distributions.This set of 1000 random networks were considered analogous to biological GRNs, and their average clustering coefficients and 51 tail lengths (defined in main text) were calculated and compared with the biological networks.If the models were a faithful representation of the GRNs, then no significant difference between any properties should be observed, if the contrary occurred then the model was considered incompatible given the networks densities and clustering coefficients, or given the networks densities and tail lengths.Notably, this methodology can be modified to accommodate other null models with construction parameters different than density.

GRN comparable Erdos-Renyi graphs
An ER graph is a network that can be described by a characteristic node whose connectivity is the expected value of the whole graphs P(k) following a Poisson distribution.The parameters needed to construct them are the number of genes and the average connectivity.GRN density was used as a proxy of network average connectivity, a random sampling from the GRNs densities and sizes was used to create 1000 ER graphs.Average clustering coefficient and taillength distribution means were compared with a Mann-Whitney U test, and statistical significance was considered at a significance level α < 0.05.
Graph sampling methods: Information retrieval sampling While originally designed to model the discovery of protein-protein interaction networks, we decided to use this sampling framework as it was previously reported to show Erdos-Renyi graphs P(k) to present a good fit to a power-law distribution.The approach is extensively described in Han et al. (2005).Briefly, two parameters, bait and edge coverage are used for sub-setting a network.In our scenario, bait coverage represents the fraction of genes whose regulatory interactions will be included and edge coverage represents the fraction of existing interactions per bait (simulating technical and experimental limitations) whose paired genes will be also included in the GRN.This approach was used to sample the graphs on the results unless otherwise stated.
Graph sampling methods: Snowball sampling This sampling algorithm is very likely to accurately model curation of GRNs.It is based on the idea of reconstructing a network through layers of connected components by following a breadth-first search (i.e., by discovering all of the neighbors and the neighbors of the neighbors … of a seed node).Briefly, a random gene is selected to serve as a seed.The gene neighbors (interactors) are added to the network until the final size is achieved.If all the seed neighbors have been included, then the next layer of neighbors, following a breadth-first search, is systematically added until the desired percentage of completeness is achieved.If the network has several disconnected components and the desired sample size has not been achieved, another gene is chosen randomly as seed to continue with the sampling procedure.This approach could mimic the classical curation of GRNs as information is retrieved from experiments that tend to be performed based on other known interactions.

Number of interactions and Interaction coverage estimation
Two different, yet related network properties (density and number of interactions) seemed to show a trend or dependency with the number of genes in the network.Although there is a direct mathematical formula linking density, number of nodes and number of edges, including both relationships (density-number of nodes, and number of nodes-number of edges) under the same model is not trivial, primarily because of the quadratic dependency between the number of nodes, number of edges and graph density.We devised therefore to implement three independent approaches, each incorporating our observations to predict GRN total number of edges.

Density proportionality approach
We assume a conservation of the trend observed in Fig. 1a between network completeness and density and modeled it using an exponential decay fit (see Supplementary Fig. 4e).Thus, the tendency GRNs densities follow should be governing the number of interconnections these networks have.Below follows a derivation of how assuming a specific value for network density can allow for a total number of edges prediction.First, let us define graph density as the number of interactions a network has divided by the total number of interactions the network could have: Where D G represents the density for graph G, I G the set of interactions in graph G and N G the set of nodes (genes) present in graph G. ‫ܫ‬ ெ represents the maximum number of edges a genetic regulatory network would have given its number of nodes.If we assume D G to be known, or constrained to a set of values, it is possible to rearrange the equation above to obtain a relationship between the number of genes in a network and the number of edges present: This quadratic equation represents a relationship between the number of interactions and the number of nodes given a specific graph density.Using ordinary least squares regression (OLS), it is possible to estimate the proportionality density factor ‫ܦ‬ ீ given the networks in Abasy (Fig. 4b) by modelling the change in density using a linearized exponential decay function (supplementary Figure 4e formula).The proportionality factor is therefore defined as ‫ܦ‬ ீ ൌ ݁ ି୪୭ሺே ಸ ሻାఌ (6) Where ܰ ீ corresponds to the number of genes in the genome and α and ߝ represent the coefficient estimates for the linearized exponential decay model.This proportionality factor can be combined with a predicted number of genes in a GRN to predict the number of interactions it would have.We decided to use genome annotation information based on ORFs as a proxy for the number of genes each GRN in Abasy would have if complete.Knowing the total number of genes, a complete GRN would have and the determined density proportionality factor, we effectively computed a total number of interactions prediction for each GRN ‫ܫ|‬ ீ | ൌ ‫ܫ‬ ெ ‫כ‬ ‫ܦ‬ ீ ( 7) Density invariance approach For the density invariance (DI) approach eq 1.4 is used, but ‫ܦ‬ ீ is the mean density of the set of non-redundant networks in Abasy with at least 1000 genes.

Edge regression approach
The linear and robust correlation between the number of genes and the number of interactions in our analyzed prokaryotic GRNs motivated this approach.The correlation from a nonredundant set of Abasy GRNs was modeled as a linear dependency between the two variables, and the coefficients were estimated by OLS (Fig. 4a).Genome annotation information based on open reading frames (ORFs) was obtained from Abasy for each of the analyzed networks and used as a proxy for the number of genes in the genome.Finally, the extrapolation of the observed dependency was used to generate the number of interactions prediction for each GRN.
‫ܫ|‬ ீ | ൌ ߙܰ ீ ߝ In all three approaches interaction coverage was defined as the number of edges present in the GRN over the predicted total number of interactions in the GRN

Comparison between interaction and genomic coverage
A score that enabled direct comparison between our interaction-based completeness estimates and the classical genomic coverage estimates was implemented as: ‫ܥ‬ ൌ ln ‫ܫ‬ ௩ ሺ݃ሻ ‫ܩ‬ ௩ ሺ݃ሻ (10)   where ‫ܥ‬ is the comparison score for GRN g, ‫ܫ‬ ௩ ሺ݃ሻ and ‫ܩ‬ ௩ ሺ݃ሻ represent the interaction and genomic coverages (respectively) of g.If the same completeness is estimated from the interaction and genomic coverages, then this score will have values close to zero.Negative values indicate that the interaction coverage is predicting a GRN to be less complete in comparison to the estimate of genomic coverage and vice-versa.This score was calculated for all GRNs present in Abasy for the three different interaction coverage scores (one per predictive model, see above).
High throughput E. coli analysis E. coli GRN metadata was processed to obtain the different experimental evidences for each of the interactions present.Three subsets of this GRN were created: 1) Containing both high throughput and classical experimental supported interactions (All), 2) Only classical experimental supported interactions (Non-HT) and 3) Only high throughput supported interactions (HT).Both Non-HT and HT were further subsampled to contain the same nodes, thus only differ in the number of edges.Non-HT edges were randomly removed until having the same number of edges as HT and average clustering coefficient was stored; this was repeated 1000 times to create a null distribution.A Mann-Whitney U test was used to obtain a p-value of the high throughput average clustering coefficient subsample with respect to the Non-HT sampled same sized distribution, and statistical significance was called at α < 0.05.

Statistical analyses
All other not previously mentioned analyses such as hypothesis tests, correlation coefficients estimation and so forth were performed using Python 2.7, using in-house written scripts and the numpy, SciPy and statmodels modules.
List of abbreviations and Brittany Mitchell for their help with manuscript proofreading.Finally, we also thank two anonymous reviewers for helpful suggestions.3. Property incompatibilities between GRN and theoretical network null models.a, b) BA and ER graphs were generated to span the range of densities observed in Abasy Atlas GRNs (No significant difference between the parametrized ER networks and the GRN distribution).c) Tail length distribution for the networks depicted in (a) (p < 0.001 Mann-Whitney U test).d) Distribution of average clustering coefficient for the networks depicted in (b) note the substantial differences between biological GRNs and both null models (p < 0.001 Mann-Whitney U test).e-f) Sampled ER network (previously reported as power law) properties (same networks as above) were calculated.Note that the better the fit to a power law (Figure 2 c-e) the higher the deviation of actual properties such as tail-length (e) and average clustering coefficient (f).The comparison score enables a direct comparison of the GRN completeness as predicted by our interaction coverage (derived from the models) or the classical genomic coverage approach; it ranges from minus to positive infinity, with negative values indicating that the interaction coverage predicts the GRN to be less complete than the genomic coverage.If we were to fit a power-law distribution to the ER P(k) allowing for xmin to be a free parameter, most of the data of the ER network gets trimmed out.Importantly the trimmed data no longer seems to have a good fit to a poisson distribution (see below right panel).
Allowing for a free xmin impedes us from understanding the true fit of the data to long-tailed distributions.Furthermore, a measure of the amount of information being ignored is not directly available from the estimates.Although in this example the xmin for ER and Biological networks are fairly close (5 and 3 respectively), their effects on trimming the data are far from being equal: This result motivated us to repeat the analyses using a fixed xmin parameter of one, ensuring the use of all data available both for biological and theoretical networks.

Figures and TablesFigure 1 .
Figures and Tables

Figure 2 .
Figure 2. Goodness of fit of Abasy Atlas GRN P(k) to alternative probability distributions.a) Kolmogorov-Smirnov D (KS D) statistic of the GRN P(k) data against the MLE probability distributions.Higher values indicate a higher deviation (worse fit) from the fitted distribution.b) Log-likelihood ratio test score of power-law vs. other distributions.Higher values (red) indicate a preference for power law while smaller values (blue) indicate a preference for an alternative distribution (y axis labels).Blank spaces denote non-significant comparisons.All ER graphs initial parameters were generated by randomly sampling from the distribution of biologically equivalent measures (see methodology).The scores depicted are the mean of 1000 random sampling experiments using a previously published information retrieval sampling scheme 38 .c,d) KS_D statistic assessing the goodness of fit of Erdos-Renyi graphs (sampled with the information retrieval scheme) to a Poisson (c) and Powerlaw P(k) distribution.As before higher values of KS D indicate a worse fit.Results represent the mean of 100 iterations of the sampling scheme for each combination of bait and coverage values.e) Heatmap depicting the goodness of fit differences for the same ER sampled networks, negative values would indicate a preference for power law whereas positive values indicate the expected preference for Poisson, all of this differences are statistically significant (see Supplementary figure3).Detailed annotated subgraphs a and b are available in supplementary Fig. 4.

Figure
Figure 3. Property incompatibilities between GRN and theoretical network null models.a, b) BA and ER graphs were generated to span the range of densities observed in Abasy Atlas GRNs (No significant difference between the parametrized ER networks and the GRN distribution).c) Tail length distribution for the networks depicted in (a) (p < 0.001 Mann-Whitney U test).d) Distribution of average clustering coefficient for the networks depicted in (b) note the substantial differences between biological GRNs and both null models (p < 0.001 Mann-Whitney U test).e-f) Sampled ER network (previously reported as power law) properties (same networks as above) were calculated.Note that the better the fit to a power law (Figure2 c-e) the higher the deviation of actual properties such as tail-length (e) and average clustering coefficient (f).

Figure 4 .
Figure 4. GRN total number of interactions prediction.a, b and c) Models to estimate the total number of interactions in a GRN.a) Edge regression model (EdR).b) Density invariance model (DI) where Dg was obtained from average density of most complete graphs.c) Density proportionality model (DP), where density is modeled as an exponential decay.d, e, and f) Dependency between completeness comparison score and average clustering coefficient for the different models: Edge linear dependency (d) Density invariant (e) and the density proportionality factor (f). E. coli and M. tuberculosis GRNs are represented with different colors and markers.The comparison score enables a direct comparison of the GRN completeness as predicted by our interaction coverage (derived from the models) or the classical genomic coverage approach; it ranges from minus to positive infinity, with negative values indicating that the interaction coverage predicts the GRN to be less complete than the genomic coverage.

Figure 5 . 21 .
Figure 5. E. coli purely high throughput subset of GRN contains an unexpectedly nonrandom low clustering coefficient.Average clustering coefficient distribution of the subsampled networks of E. coli 21 .All networks have the same number nodes.The number of edges in random-edge-removal networks and pure high throughput is the same.M. tuberculosis v2015 (83332_v2015_s15) complete GRN clustering coefficient is depicted in red for comparison.

Table 1 (
see Supplementary TableS1for a full listing of all organisms in Abasy).