Abstract
Biochemical reactions underlie the functioning of all life. Like many examples of biology or technology, the complex set of interactions among molecules within cells and ecosystems poses a challenge for quantification within simple mathematical objects. A large body of research has indicated many realworld biological and technological systems, including biochemistry, can be described by powerlaw relationships between the numbers of nodes and edges, often described as “scalefree”. Recently, new statistical analyses have revealed true scalefree networks are rare. We provide a first application of these methods to data sampled from across two distinct levels of biological organization: individuals and ecosystems. We analyze a large ensemble of biochemical networks including networks generated from data of 785 metagenomes and 1082 genomes (sampled from the three domains of life). The results confirm no more than a few biochemical networks are any more than superweakly scalefree. Additionally, we test the distinguishability of individual and ecosystemlevel biochemical networks and show there is no sharp transition in the structure of biochemical networks across these levels of organization moving from individuals to ecosystems. This result holds across different network projections. Our results indicate that while biochemical networks are not scalefree, they nonetheless exhibit common structure across different levels of organization, independent of the projection chosen, suggestive of shared organizing principles across all biochemical networks.
Introduction
Statistical mechanics was developed in the nineteenth century for studying and predicting the behavior of systems with many components. It has been hugely successful in its application to those physical systems wellapproximated by idealized models of noninteracting particles. However, realworld systems are often much more complex, leading to a realization over the last several decades that new statistical approaches are necessary to describe biological and technological systems. Among the most natural mathematical frameworks for developing the necessary formalism is network theory, which projects the complex set of interactions composing real systems onto an abstract graph representation^{1,2,3,4,5,6,7}. Such representations are powerful in their capacity to quantitatively describe the relationship between components of complex systems and because they permit inferring function and dynamics from structure^{8,9,10,11,12}.
Network theory has been especially useful for studying metabolism. Metabolism consists of catalyzed reactions that transform matter along specific pathways, creating a complex web of interactions among the set of molecular species that collectively compose living things^{13,14,15,16,17}. It is the collective behavior of this system of reactions that must be understood in order to fully characterize living chemical processes—counting only individual components (molecules) is inadequate. The structure of how those components interact with one another (via reactions) really matters: in fact it is precisely what separates organized biological systems from messy chemical ones^{18,19,20}.
Within the formalism of network theory, one of the simplest ways to capture insights into the global structure of a network is to analyze the shape of its degree distribution. A huge volume of research into various complex biological, technological and social networks has therefore focused on identifying the scaling behavior of the corresponding degree distributions for network projections describing those systems. One of the most significant results emerging from these analyses is that many networks describing real-world systems exhibit ostensibly “scale-free” topology^{21,22,23,24,25}, characterized by a power-law degree distribution. The allure of scale-free networks is in part driven by the simplicity of their underlying generative mechanisms: for example, a power-law degree distribution can be produced by relatively simple preferential attachment algorithms^{21}, or to a lesser extent through optimization principles^{26}. For truly scale-free networks the probability to find a node with degree x should scale as:
\[p(x) \propto x^{-\alpha}\]
For numerous biological and technological systems, including metabolic networks, the scaling exponent, \(\alpha\), is reported with values in the range \(2< \alpha <3\). The apparent ubiquity of scalefree networks across biological, technological and social networks has fueled some to conjecture scalefree topology as a unifying framework for understanding all such systems, with the enticing possibility these seemingly diverse examples could in reality arise from relatively simple, universal generating mechanisms^{21,25,26,27,28}.
However, this story is far from complete. Recently, Broido and Clauset developed statistical tests to rigorously examine whether observed distributions share characteristics with a powerlaw, or are instead more similar to other heavy tailed distributions, and have revealed that true scalefree networks may not be as ubiquitous as previously supposed^{29,30}. These tests reveal that while it is superficially possible for a network to appear scalefree, more rigorous analysis can reveal a structure more similar to other heavytailed distributions such as the lognormal distribution, or even non heavytailed distributions like the exponential distribution^{28,29,30,31}.
The problem of characterizing the global structure of real-world systems is further compounded by the fact there are often many ways to coarse-grain a real system to generate a network representation, each corresponding to a different way for the set of interactions to be projected onto a graph. For example, metabolic networks may be represented as unipartite or bipartite graphs, depending on whether one chooses to focus solely on the statistics over molecules (or reactions) and their interactions (requiring a unipartite representation) or instead to include both molecules and reactions as explicit nodes in the graph (where molecules and reactions represent two classes of nodes in a bipartite representation)^{32,33,34}. These graphs can have different large-scale topological properties, even when projected from the same underlying system. This raises the question of determining which projection to analyze, and whether or not a real-world system should be considered “scale-free” if only some of its network projections exhibit power-law degree distributions. In Broido and Clauset’s classification, “scale freeness” is assessed on a scale ranging from “Not scale-free” to “Strongest”^{30}. Their approach provides methods for statistically analyzing the different network projections of real-world systems to determine how well scale-free structure can describe the properties of the high-dimensional underlying system, when it is projected into lower-dimensional, coarse-grained network representations.
The goal of assessing different network projections in order to classify “scale freeness” is to be as thorough as possible in identifying relevant features of a system’s complex network structure. However, there is debate about whether this criterion is too strict. In particular, some researchers have argued that depending on the system being analyzed, it may not make sense to represent and equally weigh many different network projections (see e.g. debate by Barabasi and others^{35}). Herein, we aim to be as agnostic as possible about which projections are best suited for capturing how the many components of biochemical systems interact, as this is an open question in its own right. Given our aim to broadly assess the scaling of biochemical systems, we therefore follow Broido and Clauset and consider all possible projections available from the underlying data. A second potential criticism of this approach concerns whether or not it matters if long-tail networks are not precisely scale-free. Our goal here is to report the statistical properties of biochemical networks across scales, and we agree that in some contexts it might not matter if they are precisely scale-free, but in others it might. For example, many different heavy-tailed distributions share the property of having a high degree-squared mean, \(\langle k^2 \rangle\), and in many applications this indicates a high robustness to failure^{24,36,37,38}. Although these networks may share some properties, they can also have very different underlying generative mechanisms. There has even been recent work proposing that the phrase “scale-free” should be used only for networks generated using the preferential-attachment mechanism^{39}. Since we do not yet know the generative mechanisms that best explain the structure of biochemical networks, our goal herein is to provide the most rigorous classification of their structure that might enable future research in this area.
The novelty in our approach is recognizing that in order to understand the structure of real-world biological (and technological) systems, the relevant organizational level for performing analyses must also be considered. In particular, biological and technological systems are often hierarchical in their organization, with interactions across multiple levels. For example, while it is possible to study the biochemistry of individual species, ultimately the function of species in natural systems depends on the complex interplay of interactions among the many species within an ecosystem. Indeed, universal properties of life are now recognized to be characterized at the scale of ecosystems as much as they are at the scale of individual organisms^{40,41}.
In what follows, we analyze a large set of biochemical systems including data from 785 metagenomes (ecosystemlevel) and 1082 genomes (individuallevel, sampled from each of the three domains of life). Our results include the first analysis of scalefree network structure for the different projections of ecosystemlevel biochemistry, significantly expanding on earlier work focusing on the largescale structure of individual metabolic networks only^{13,29,30,32,33,34}. Like Broido and Clauset, we consider all possible projections of biochemical systems to graphs simultaneously, whereas most prior work on the organization of biochemistry has only considered one or at most a few projections^{17,42,43,44,45}. We find a majority of biochemical networks are not scalefree, independent of projection or level of organization. We also demonstrate how the network properties analyzed herein can be used to distinguish individual and ecosystem level networks, and find that independent of projection, individuals and ecosystems share very similar structure. These results have potentially deep implications for identifying underlying rules of biochemical organization at both the individual and ecosystemlevel by providing constraints on whether the same or different generative mechanisms could operate to organize biochemistry across multiple scales.
Results
We use the statistical methods developed by Broido and Clauset^{30} in what follows. All identified biochemical reactions encoded in each genome and metagenome were used to construct eight distinct network representations for each. This resulted in 8656 total network projections across the 1082 genomelevel biochemical datasets, and 6280 total network projections across the 785 metagenomelevel biochemical datasets. Each representation can be viewed as a different coarsegraining of the underlying system of reactions (i.e. the underlying dataset) (Fig. 1). We determine whether or not these datasets are scalefree, and analyze the aspects of them, and their diverse projections, that tend to lend themselves to be more or less scalefree. The alternative distributions that we compare to the powerlaw are: The exponential distribution, the lognormal distribution, the stretched exponential distribution, and the powerlaw distribution with a cutoff (see^{29,30} for more details on these distributions).
We first classified each dataset in terms of how scalefree it is. Data are classified as: SuperWeak when for \(\ge 50\%\) of network projections, no alternative distributions are favored over powerlaw; Weakest if for \(\ge 50\%\) of network projections, a powerlaw cannot be rejected (\(p \,\ge \, 0.1\)); Weak if it meets the requirements for Weakest, and there are \(\ge\) 50 nodes in the distribution’s tail (\(n_\text {tail} \ge \, 50\)); Strong if it meets the requirements of both SuperWeak and Weak, and the median scaling exponent satisfies (\(2< {\hat{\alpha }} < 3\)); and Strongest if it meets the requirements for Strong for \(\ge 90\%\) of graphs, rather than \(\ge 50\%\), and if for at least 95% of graphs none of the alternative distributions are favored over the powerlaw.
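The classification rules above can be sketched as a small function. This is only an illustration of one reading of the criteria, not the code used for the analyses: the dictionary keys (`alt_favored`, `p`, `n_tail`, `alpha`) and the function name are hypothetical, and treating the Weak criterion as a per-projection conjunction of \(p \ge 0.1\) and \(n_\text{tail} \ge 50\) is an assumption.

```python
from statistics import median

def classify_dataset(projections):
    """Return which scale-free categories one dataset satisfies.

    `projections` holds one dict per network projection (hypothetical keys):
      alt_favored : True if any alternative distribution beats the power law
      p           : goodness-of-fit p-value of the power-law fit
      n_tail      : number of nodes with degree >= x_min
      alpha       : estimated scaling exponent
    """
    n = len(projections)

    def frac(pred):
        # Fraction of projections for which the predicate holds.
        return sum(1 for g in projections if pred(g)) / n

    no_alt = frac(lambda g: not g["alt_favored"])
    not_rejected = frac(lambda g: g["p"] >= 0.1)
    # Assumption: tail size is checked on the same projections as the p-value.
    not_rejected_big_tail = frac(lambda g: g["p"] >= 0.1 and g["n_tail"] >= 50)
    alpha_ok = 2 < median(g["alpha"] for g in projections) < 3

    super_weak = no_alt >= 0.5
    weakest = not_rejected >= 0.5
    weak = not_rejected_big_tail >= 0.5
    strong = super_weak and weak and alpha_ok
    strongest = not_rejected_big_tail >= 0.9 and alpha_ok and no_alt >= 0.95
    return {"super_weak": super_weak, "weakest": weakest, "weak": weak,
            "strong": strong, "strongest": strongest}

# A dataset where 6 of 8 projections favor no alternative, but every
# power-law fit is rejected (p < 0.1).
example = [dict(alt_favored=(i >= 6), p=0.01, n_tail=100, alpha=2.5)
           for i in range(8)]
result = classify_dataset(example)
```

In this made-up example only the Super-Weak criterion is met, because no projection passes the goodness-of-fit threshold.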
Our results are consistent with nearly all biochemical networks, at either the individual or ecosystem-level, being “super-weakly” scale-free (Fig. 2). While the power law is better than other models, it is not itself a good model. When doing a goodness-of-fit test, we find the majority of network representations across individual and ecosystem-level networks have \(p < 0.1\). This indicates that, if the data were truly power-law distributed, there would be a less than 10% chance of observing a fit this poor. Additionally, when compared to other distributions through log-likelihood ratios, 99% of all datasets do not favor alternative heavy-tailed distributions to the power law for the majority of their network projections (Fig. 3, top row).
Where biochemical systems succeed and fail scalefree classifications
Goodnessoffit pvalue
The “weakest” requirement for a scalefree network introduced by Broido and Clauset stipulates at least 50% of a dataset’s networkprojections must have a powerlaw goodnessoffit \(p \ge 0.1\). For both individuals and ecosystems, only 6% of networkprojections meet this requirement (Fig. 4, left column). This goodnessoffit pvalue requirement is the most restrictive of all scalefree requirements.
Tail size
Setting aside the fact each subsequent scalefree requirement builds on the requirement(s) of the preceding one, we find 98% of individual networks and 99% of ecosystem networks do meet the requirement of \(n_{tail} \ge 50\) for a scalefree degree distribution (Fig. 4, center column).
The powerlaw exponent, \({\alpha }\)
Only 50% of individuallevel networks and 51% of ecosystemlevel networks meet the requirement that \(2< \alpha < 3\) for their degree distribution. The goodnessoffit p value requirement, followed by the requirement constraining values of \(\alpha\), are the most restrictive when determining whether a biochemical network’s degree distribution should be considered scale free (Fig. 4, right column).
Meeting the threshold for scalefree classification is dependent on the network representation
We find the results of each requirement listed above for classifying topology as scalefree differ across the eight network projection types for each dataset. Unsurprisingly, for most requirements, there exists a minute difference between the values observed for the largest connected component and entire graph of a given network projection type (Fig. 3, right column). Depending on the measure, there is a noticeably larger difference between the major network projection types, e.g., between bipartite, unipartitereactions, unipartitecompounds (where all substrates participating in the same reaction are connected), and unipartitecompounds (where substrates on the same side of a reaction are not connected) (Fig. 3, right column).
Comparing to alternative distributions
Over 99% of individual and ecosystemlevel datasets have 6 projections which do not favor any other distribution over the powerlaw (Fig. 3, top row, left column). No datasets have more than 6. The other two projections nearly always favor at least one other distribution over the powerlaw distribution—either the lognormal, exponential, stretched exponential, or powerlaw with exponential cutoff (Fig. 3, top row, right column). There are only 3 of the 6280 ecosystemlevel network projections (across the 785 ecosystemlevel datasets) that do not favor at least one of the alternative distributions. Oftentimes all four are favored over the powerlaw distribution (Fig. S1, rows 3–4). These results are identical, within 95% confidence, for both individuals and ecosystems.
Goodnessoffit pvalue
Out of all datasets, 80% of individuals and 84% of ecosystems have only a single projection type with \(p \ge 0.1\) for a power-law fit to their degree distribution. This indicates the majority of datasets would still not meet the “weakest” requirement for scale-free even with a threshold that lowered the percent of a dataset’s projections needed to 25% (2 networks) instead of 50% (4 networks) (Fig. 3, 2nd row, left column). The unipartite projection where substrates on the same side of a reaction are not connected (unipartite-subs_not_connected) was the most likely to satisfy \(p \ge 0.1\). For the two unipartite-compound projections, the difference between individuals and ecosystems is within the error. The unipartite-reaction projections were the least likely to satisfy \(p \ge 0.1\), which is consistent with the observation that these networks always favor an alternative distribution as a better fit to the data than the power law (Fig. 3, 2nd row, right column). As we initially reported, the majority of datasets do not meet the p-value threshold for being considered scale-free, although ecosystem-level datasets are more likely to meet the threshold.
Tail size
Out of all datasets, 98% of individuals and 96% of ecosystems meet \(n_{tail} \ge 50\) for all projection types (Fig. 3, 3rd row, left column). For 7 of the projection types, there is no difference between individuals and ecosystems, within 95% confidence (Fig. 3, 3rd row, right column).
The powerlaw exponent \({\alpha }\)
Out of all datasets, 95% of individuals and 97% of ecosystems meet \(2< \alpha < 3\) for 4 of 8 projection types (Fig. 3, bottom row, left column). The two types of unipartitecompound networks contribute to the datasets which meet the alpharange requirement the majority of the time. That is, chances are if a dataset has at least 4 projection types meeting \(2< \alpha < 3\), two of them are going to be unipartitecompound network projections (Fig. 3, bottom row, right column). The results are similar for both individuals and ecosystems.
Correlation of results between projections
Because 8 different network projections are derived from a single biochemical dataset, there is reason to expect the proportions of each projection type meeting any given scalefree criteria are correlated. We therefore constructed a Pearson correlation matrix to test whether there are correlations between projections (Fig. S2). Unsurprisingly, we find that values from projections of a network’s LCC and entire graph are highly correlated. All types of unipartite compound networks tend to be correlated. Values across many other projection types are barely correlated for the pvalue and \(n_{tail}\) criteria. Ecosystems tend to show more correlation, across all projection types, than individuals.
Distinguishing individuals and ecosystems based on their degree distributions
Multinomial regression
We used multinomial regression on network and degree distribution data from the above analyses to attempt to distinguish individuals from ecosystems. Most measures cannot reliably distinguish between these two levels of organization, with only network size and network tail size data distinguishing the two levels better than chance. Using only network size, ecosystems could be correctly identified in test data 72.23% of the time, whereas individuals could be correctly identified 85.33% of the time (Fig. 5, left columns). When normalizing other measures to network size, the only one that improved in distinguishing individuals and ecosystems to be better than chance was dexp (Fig. S3). This is a measure of which type of distribution is favored (or neither) when doing a loglikelihood ratio test between the powerlaw and exponential distribution.
Random forest
Random forest classification is a supervised machine learning technique that uses an ensemble of decision trees to make classifications. When using random forests to try to distinguish individuals and ecosystems based on network and distribution data, we find ecosystems can be correctly predicted 87.01% of the time, and individuals can be correctly predicted 95.82% of the time (out-of-bag, OOB, error rate of 7.91%). However, the size of the network and size of the degree distribution tail once again are the best relative predictors. Without network size and tail size, the prediction accuracy drops to 79.27% for ecosystems and 94.81% for individuals (OOB error rate of 11.80%). When doing random forest classification by projection type, the prediction accuracies are still above 75% for ecosystems and 91% for individuals across all projections, which is better than multinomial regression models even when information about network size is included (Fig. 5, left columns; Table S1). Mean degree was the best predictor across all network projection types.
Discussion
Our results indicate biochemical systems across individuals and ecosystems are, at best, only weakly scalefree. This is revealed by studying all possible projections of biochemical systems in tandem: only six of the eight network projection types analyzed favor powerlaw distributions over alternatives and in all cases the powerlaw is not itself a good fit to the data. Nonetheless, we can conclude individuals and ecosystems both share qualitatively similar degree distribution characteristics, and while this is a very coarsegrained measure of network structure, it suggests the possibility of shared principles operating across levels of organization to architect biochemical systems. The random forest distinguishability analyses demonstrate how using a combination of all the results of scalefree analyses completed in this paper can predict, better than chance, whether the data comes from individuals or ecosystems. Individuals are perhaps more tightly constrained in their network structure, based on being able to more accurately predict them based on simple network characteristics. Whether or not this structure is truly a universal property of life’s chemical systems is more difficult to conclude. Based on the sample sizes, we are confident our results hold over the population of genomes and metagenomes in the JGI and PATRIC databases. However, the observed scaling is only reflective of biology universally if the databases are unbiased in sampling from all of biology on Earth, and this is impossible to know with certainty (see e.g. proposals of ‘shadow life’ and reports of missing biota^{49,50}). Nonetheless, the fact that multiple levels and multiple projections of biochemistry reveal common structure suggests universal principles may be within reach if cast within an ensemble theory of biochemical network organization (see e.g. also^{41}).
Achieving an ensemble theory for biochemistry will require approaches different from those applied to simpler physical systems, where statistics over individual components are sufficient to describe and predict behavior. Complex systems are complex precisely because they require additional information about the structure of interactions among their many components. This challenge is well known. However, identifying the most effective methods for projecting these high-dimensional structures onto simple mathematical objects to enable their analysis and comparison remains among the most central problems of complexity science. By contrast, in physics coarse-graining procedures are well known, but we are not so advanced in understanding complex systems that we have similarly useful tools at hand. A first challenge is to identify coarse-grained network representations, which is subject to debate. Current literature cautions against the use of unipartite graphs, as they can lead to “wrong” interpretations of some system properties, including degree^{34,51}. We find instead that this conclusion is not so easy to arrive at. Whether the interpretation of a given representation is correct depends strongly on the characteristics of the degree distribution under consideration. As an example, all network projection types in our analysis, aside from unipartite reaction networks, favor power-law degree distributions over other heavy-tailed alternatives (Fig. 3, top row). For power-law \(\alpha\), there is a similar proportion of networks with bipartite projections and ecosystem unipartite reaction projections with \(2< \alpha < 3\) (within 2SD). However, the proportion of networks within this alpha range differs when compared using ecosystems, or any unipartite compound projections (Fig. 3, fourth row). Nearly all projections show different results for the scale-free p-value cutoff (Fig. 3, second row).
While previous work^{32,33} has advocated for unipartite networks (where all compounds that participate in a reaction are connected, called unicompounds here), we find these overestimate the power-law goodness-of-fit p-value estimates and the values for \(\alpha\) when compared to reaction networks or bipartite networks (Fig. 3). The nuances of both similarity and difference in the structure of the same system across different projections can provide insights into the underlying system of interest, providing details that are inaccessible when looking at just one projection. That is, regardless of whether or not a given projection is scale-free, all projections provide insight into the underlying system. In physics, we are accustomed to a unique coarse-grained descriptor describing all relevant features. To understand complex interacting systems, such as the systems of reactions underlying all life on Earth, it may be the case that we should forgo the allure of simple, singular models with only one coarse-grained description. Instead, to characterize living processes, it may be time to adopt and develop theory for statistical analyses over many projections in tandem.
Materials and methods
Obtaining biological data
Bacteria and Archaea data were obtained through PATRIC^{52}. Starting with the 21,637 bacterial genomes available from the 2014 version of PATRIC, we created a parsed dataset by selecting one representative genome containing the largest number of annotated ECs from each genus. Unique genera (genera only represented by a single genome) were also included in our parsed data. Uncultured/candidate organisms without genera level nomenclature are left in the parsed dataset. This left us with 1152 parsed bacteria, from which we chose 361 randomly to use in this analysis. Starting with 845 archaeal genomes available from the 2014 version of PATRIC, we randomly chose 358 to use in this analysis. Enzyme Commission (EC) numbers associated with each genome were extracted from the ec_number column of each genome’s .pathway.tab file.
Eukarya and Metagenome data were obtained through JGI IMG/m^{53}. All 363 eukaryotic genomes available from JGI IMG/m as of Dec. 01, 2017 were used. Starting with the 5586 metagenomes available from JGI IMG/m as of June 20, 2017, 785 metagenomes were randomly chosen for this paper’s analyses. Enzyme Commission (EC) numbers associated with each genome/metagenome were extracted from the list of Protein coding genes with enzymes, and metagenome EC numbers were obtained from the total category. All JGI IMG/m data used in this study were sequenced at JGI.
Because each EC number corresponds to a unique set of reactions that an enzyme catalyzes, the list of EC numbers associated with each genome and metagenome can be used to identify the reactions that are catalyzed by enzymes coded for in each genome/metagenome. We use the Kyoto Encyclopedia of Genes and Genomes (KEGG) ENZYME database to match EC numbers to reactions, and the KEGG REACTION database to identify the substrates and products of each reaction^{46,47,48}. This provides us with a list of all chemical reactions that a genome/metagenome’s enzymes can catalyze.
Generating networks
Each genomic/metagenomic dataset is used to construct eight representations of biochemical reaction networks. We refer to each type of representation as a “network projection type” throughout the text:

1. Bipartite graph with reaction and compound nodes. A compound node \(C_i\) is connected to a reaction node \(R_j\) if it is involved in the reaction as a reactant or a product. Abbreviated in figures as bifull.

2. Unipartite graph with compound nodes only. Two compound nodes \(C_i\) and \(C_j\) are connected if they are both present in the same reaction. A reaction’s reactant compounds are connected to each other; a reaction’s product compounds are connected to each other; and a reaction’s reactant and product compounds are connected. Abbreviated in figures as unicompounds.

3. Unipartite graph with reaction nodes only. Two reaction nodes \(R_i\) and \(R_j\) are connected if they involve a common compound. Abbreviated in figures as unireactions.

4. Unipartite graph with compound nodes only (alternate). Two compound nodes \(C_i\) and \(C_j\) are connected only if they are both present on opposite sides of the same reaction. A reaction’s reactant compounds are not connected to each other; a reaction’s product compounds are not connected to each other; but a reaction’s reactant and product compounds are connected. Abbreviated in figures as unisubs_not_connected.
There exists a version of each of these four network construction methods for the largest connected component (LCC), and for the entire graph, yielding a total of eight network projections for each dataset (Fig. 1). These network projection types are signified in the figures by appending largest and entire to the network projection abbreviations. Some datasets may yield identical networks for their LCC and entire graph, if there exists only a single connected component.
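The four construction rules and the LCC restriction can be illustrated with a short sketch. It assumes a hypothetical input format (a dictionary mapping each reaction identifier to its reactant and product sets) and stores undirected edges as frozensets. The function names are illustrative and are not taken from the analysis code.

```python
from itertools import combinations

def build_projections(reactions):
    """Build the edge sets of the four projection types.

    `reactions` maps a reaction id to a (reactants, products) pair of sets
    of compound ids (hypothetical input format).
    """
    bifull, unicompounds = set(), set()
    unireactions, unisubs_not_connected = set(), set()
    reactions_of = {}  # compound id -> reactions it takes part in
    for rxn, (reactants, products) in reactions.items():
        compounds = set(reactants) | set(products)
        for c in compounds:
            # Bipartite: compound node linked to the reaction node.
            bifull.add(frozenset([("C", c), ("R", rxn)]))
            reactions_of.setdefault(c, set()).add(rxn)
        # All compounds sharing a reaction are linked.
        for a, b in combinations(sorted(compounds), 2):
            unicompounds.add(frozenset([a, b]))
        # Only compounds on opposite sides of the reaction are linked.
        for a in reactants:
            for b in products:
                if a != b:
                    unisubs_not_connected.add(frozenset([a, b]))
    # Reactions are linked when they share at least one compound.
    for rxns in reactions_of.values():
        for r1, r2 in combinations(sorted(rxns), 2):
            unireactions.add(frozenset([r1, r2]))
    return {"bifull": bifull, "unicompounds": unicompounds,
            "unireactions": unireactions,
            "unisubs_not_connected": unisubs_not_connected}

def largest_component(edges):
    """Return the edges that lie inside the largest connected component."""
    adjacency = {}
    for edge in edges:
        u, v = tuple(edge)
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first search from `start`
            for neighbor in adjacency[stack.pop()]:
                if neighbor not in component:
                    component.add(neighbor)
                    stack.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return {edge for edge in edges if edge <= best}

# Toy system: A + B -> C, C -> D, and a disconnected X -> Y.
toy = {"R1": ({"A", "B"}, {"C"}), "R2": ({"C"}, {"D"}), "R3": ({"X"}, {"Y"})}
projections = build_projections(toy)
lcc_unicompounds = largest_component(projections["unicompounds"])
```

For the toy system, the unicompounds projection contains the A–B edge while unisubs_not_connected does not, and the LCC drops the isolated X–Y edge.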
Assessing the powerlaw fit on degree distributions
As defined in Clauset et al.^{29}, a quantity x obeys a power law if it is drawn from a probability distribution
\[p(x) \propto x^{-\alpha}\]
where \(\alpha\), the exponent/scaling parameter of the distribution, is a constant. In order to estimate \(\alpha\), we follow the methods described in Clauset et al.^{29}, and use an approximation of the discrete maximum likelihood estimator (MLE)
\[\hat{\alpha} \simeq 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{min} - \frac{1}{2}} \right]^{-1}\]
where \(x_{min}\) is the lower bound of power-law behavior in our data, and \(x_i\), i = 1, 2, ..., n, are the observed values x such that \(x_i \ge x_{min}\). The standard error of our calculated \(\alpha\) is given by
\[\sigma = \frac{\hat{\alpha} - 1}{\sqrt{n}} + O\left(\frac{1}{n}\right)\]
where the higher-order correction is positive^{29}. Because many quantities only obey a power law for values greater than some \(x_{min}\), the optimal \(x_{min}\) value must be calculated. The importance of choosing the correct value for \(x_{min}\) is discussed in detail in Clauset et al.^{29}. If it is chosen too low, data points which deviate from a power-law distribution are incorporated. If it is chosen too high, the sample size decreases. Both can change the accuracy of the MLE, but it is better to err too high than too low.
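As a minimal sketch of the estimator described above, the function below evaluates the approximate discrete MLE of Clauset et al., \(\hat{\alpha} \simeq 1 + n\left[\sum_i \ln\left(x_i/(x_{min} - 1/2)\right)\right]^{-1}\), and its leading-order standard error \((\hat{\alpha} - 1)/\sqrt{n}\) for a given \(x_{min}\). The function name and the toy degree list are illustrative assumptions. The analyses in this paper used the code released with Broido and Clauset rather than this sketch.

```python
import math

def fit_alpha(degrees, x_min):
    """Approximate discrete MLE of the power-law exponent above x_min.

    Returns (alpha_hat, sigma, n_tail), where sigma is the leading-order
    standard error and n_tail is the number of observations >= x_min.
    """
    tail = [x for x in degrees if x >= x_min]
    n_tail = len(tail)
    # Every term is positive because x >= x_min > x_min - 1/2.
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    alpha_hat = 1.0 + n_tail / log_sum
    sigma = (alpha_hat - 1.0) / math.sqrt(n_tail)
    return alpha_hat, sigma, n_tail

# Toy degree sequence (made up); degrees below x_min are ignored by the fit.
toy_degrees = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 12]
alpha_hat, sigma, n_tail = fit_alpha(toy_degrees, x_min=2)
```

Raising `x_min` shrinks `n_tail`, and `sigma` grows as \(1/\sqrt{n}\). This is the sample-size cost of choosing \(x_{min}\) too high.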
In order to determine \(x_{min}\), we use the method first proposed by Clauset et al.^{54}, and elaborated on in Clauset et al.^{29}: we choose the value of \(x_{min}\) that makes the probability distributions of the measured data and the best-fit power-law model as similar as possible above \(x_{min}\). The similarity between the distributions is quantified using the Kolmogorov–Smirnov or KS statistic, given by
\[D = \max_{x \ge x_{min}} \left| S(x) - P(x) \right|\]
where S(x) is the cumulative distribution function (CDF) of the data for the observations with value at least \(x_{min}\), and P(x) is the CDF for the power-law model that best fits the data in the region \(x \ge x_{min}\). Our estimate of \(x_{min}\) is the one that minimizes D.
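The \(x_{min}\) selection can be sketched as a scan over candidate values: fit \(\alpha\) for each candidate, compute the KS distance between the empirical tail and the fitted discrete power law, and keep the candidate with the smallest distance. The sketch below makes several assumptions that are not part of the paper's pipeline. It uses a truncated-sum approximation to the Hurwitz zeta normalization, a hypothetical `min_tail` cutoff to skip very small tails, and the approximate MLE for \(\alpha\).

```python
import math
from collections import Counter

def hurwitz_zeta(s, q, terms=1000):
    """Approximate sum_{k>=0} (q + k)^(-s) for s > 1.

    Direct sum of the first `terms` terms plus an Euler-Maclaurin tail.
    """
    head = sum((q + k) ** -s for k in range(terms))
    a = q + terms
    return head + a ** (1.0 - s) / (s - 1.0) + 0.5 * a ** -s

def ks_distance(tail, x_min, alpha):
    """Max |S(x) - P(x)| over integers x from x_min to max(tail)."""
    n = len(tail)
    counts = Counter(tail)
    norm = hurwitz_zeta(alpha, x_min)
    s_cdf = p_cdf = distance = 0.0
    for x in range(x_min, max(tail) + 1):
        s_cdf += counts[x] / n        # empirical CDF of the tail
        p_cdf += x ** -alpha / norm   # discrete power-law CDF
        distance = max(distance, abs(s_cdf - p_cdf))
    return distance

def choose_x_min(degrees, min_tail=10):
    """Return (D, x_min, alpha_hat) minimizing D, or None if no candidate."""
    best = None
    for x_min in sorted(set(degrees)):
        if x_min < 1:
            continue  # degree-0 nodes cannot be part of a power-law tail
        tail = [x for x in degrees if x >= x_min]
        if len(tail) < min_tail:
            break  # larger candidates leave even fewer observations
        log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
        alpha = 1.0 + len(tail) / log_sum
        d = ks_distance(tail, x_min, alpha)
        if best is None or d < best[0]:
            best = (d, x_min, alpha)
    return best

# Made-up degree sequence with a long tail.
toy_degrees = [1] * 50 + [2] * 20 + [3] * 10 + [4] * 6 + [5] * 4 + [8] * 2 + [20]
best_fit = choose_x_min(toy_degrees)
```

Because every integer between \(x_{min}\) and the largest degree is visited, the maximum is also checked at degrees that do not occur in the data.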
We used the GitHub repository made available in Broido and Clauset^{30} to determine the optimal \(x_{min}\) of all our degree distributions, and to subsequently calculate the MLE in order to determine the scaling exponent \(\alpha\) and the standard error on \(\alpha\), \(\sigma\)^{55}.
A power law can always be fit to data, regardless of the true distribution from which the data are drawn, so we need to determine whether the power-law fit is a good match to the data. We do this by sampling many synthetic data sets from a true power-law distribution, recording their fluctuation from power-law form, and comparing this to similar measurements on the empirical data in question. If the empirical data has similar form to the synthetic data drawn from a true power-law distribution, then the power-law fit is plausible. We use the KS statistic to measure the distance between distributions.
We use a goodness-of-fit test to generate a p-value, which indicates the plausibility of the power-law hypothesis. The p-value is defined as the fraction of the synthetic distances that are larger than the empirical distance. If p is large (close to 1), then the difference between the empirical data and the model can be attributed to statistical fluctuations alone; if it is small, the model is not a plausible fit to the data^{29}. We follow the methods in Clauset et al.^{29}, implemented with the GitHub package used in Broido and Clauset^{30}, to generate synthetic datasets and measure the distance between distributions. Following these methods, we chose to generate 1000 synthetic datasets as a tradeoff between an accurate estimate of the p-value and computational efficiency. If p is small enough (\(p < 0.1\)), the power law is ruled out. Put another way, it is ruled out if there is a probability of 1 in 10 or less that we would by chance get data that agree as poorly with the model as the data we have^{29}. However, measuring \(p \ge 0.1\) does not guarantee that the power law is the most likely distribution for the data. Other distributions may match equally well or better. Additionally, it is harder to rule out distributions when working with small sample sizes.
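The p-value definition and the rejection rule reduce to a few lines once the KS distances are available. In the sketch below, the function names and the toy distances are hypothetical; generating the synthetic data sets and refitting each one is not shown.

```python
def goodness_of_fit_p(empirical_d, synthetic_ds):
    """Fraction of synthetic KS distances that are larger than the empirical one."""
    if not synthetic_ds:
        raise ValueError("need at least one synthetic distance")
    larger = sum(1 for d in synthetic_ds if d > empirical_d)
    return larger / len(synthetic_ds)


def power_law_plausible(p, threshold=0.1):
    """The power law is ruled out when p < threshold (0.1 in the text)."""
    return p >= threshold


# Toy distances (illustrative only; the text uses 1000 synthetic data sets).
synthetic = [0.05, 0.08, 0.12, 0.20]
p = goodness_of_fit_p(0.10, synthetic)
print(p, power_law_plausible(p))
```

With 1000 synthetic data sets the p-value is resolved in steps of 0.001, which is fine enough to compare against the 0.1 threshold.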
A better way to determine whether or not data is drawn from a power-law distribution is to compare its likelihood of being drawn from a power-law distribution directly to that under a competing distribution^{29,56}. We use the exponential, stretched-exponential, log-normal, and power-law-with-cutoff distributions as four competing distributions to the power law. While we cannot compare the fit of the data to every possible distribution, comparing the power-law distribution to these four similarly shaped competing distributions helps us ensure that our results are valid.
We use the log-likelihood ratio test \({\mathcal {R}}\)^{29,56} to compare the power-law distribution to other candidate distributions,
\[{\mathcal {R}} = {\mathcal {L}}_\text {PL} - {\mathcal {L}}_\text {Alt},\]
where \({\mathcal {L}}_\text {PL}\) and \({\mathcal {L}}_\text {Alt}\) are the log-likelihoods of the best fits for the power-law and alternative distributions, respectively. This can be rewritten as a summation over individual observations,
\[{\mathcal {R}} = \sum _{i=1}^{n_\text {tail}} \left[ \ell _i^{\text {(PL)}} - \ell _i^{\text {(Alt)}} \right] ,\]
where the log-likelihoods of single observed degree values under the power-law distribution, \(\ell _i^{\text {(PL)}}\), and the alternative distribution, \(\ell _i^{\text {(Alt)}}\), are summed over the number of model observations, \(n_\text {tail}\).
If \({\mathcal {R}}>{0}\), the power-law distribution is more likely; if \({\mathcal {R}}<{0}\), the competing candidate distribution is more likely; if \({\mathcal {R}}=0\), they are equally likely. Just as with the goodness-of-fit test, we need to make sure our result is statistically significant (\(p < 0.01\)). The methodology described here summarizes that introduced by Clauset et al.^{29} and described again in Broido and Clauset^{30}; further details, such as the exact formulas for the alternative distributions and the derivation of the p-value for \({\mathcal {R}}\), can be found therein.
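The ratio and its significance can be sketched from the pointwise log-likelihoods alone. The p-value below uses the normal approximation \(p = \text {erfc}\left( |{\mathcal {R}}| / \sqrt{2 n \sigma ^2}\right)\), with \(\sigma ^2\) the variance of the pointwise differences; this is our reading of the test in Clauset et al.^{29} and should be checked against that reference. The function name and the toy log-likelihood values are hypothetical.

```python
import math


def loglikelihood_ratio(loglik_pl, loglik_alt):
    """Return (R, p) from pointwise log-likelihoods of two fitted models.

    R > 0 favors the power law, R < 0 favors the alternative.
    p is a two-sided significance value for the sign of R (assumed form).
    """
    n = len(loglik_pl)
    if n != len(loglik_alt) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    diffs = [a - b for a, b in zip(loglik_pl, loglik_alt)]
    r = sum(diffs)
    mean = r / n
    var = sum((d - mean) ** 2 for d in diffs) / n
    if var == 0.0:
        # Degenerate case: every observation gives the same difference.
        return r, (1.0 if r == 0.0 else 0.0)
    p = math.erfc(abs(r) / math.sqrt(2.0 * n * var))
    return r, p


# Toy pointwise log-likelihoods (illustrative only).
ll_pl = [-1.0, -1.2, -0.9, -1.1]
ll_alt = [-1.5, -1.3, -1.6, -1.2]
r, p = loglikelihood_ratio(ll_pl, ll_alt)
print(f"R = {r:.3f}, p = {p:.4f}")
```

A small p means the sign of \({\mathcal {R}}\) is unlikely to be a chance fluctuation; a large p means neither distribution can be preferred.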
Classifying network scaling
We classify each genomic/metagenomic dataset, as represented by the set of eight network projection types, as having some categorical degree of “scale-freeness” from “super-weak” to “strongest”. This classification scheme was introduced by Broido and Clauset^{30} in order to compare many networks with different degrees of complexity, and the definitions below are taken from that work:

Super-Weak: For at least 50% of graphs, none of the alternative distributions are favored over the power law.
The four remaining definitions are nested, and represent increasing levels of direct evidence that the degree structure of the network data set is scale free:

Weakest: For at least 50% of graphs, the power-law hypothesis cannot be rejected (\(p\ge 0.1\)).

Weak: The requirements of the Weakest set, and there are at least 50 nodes in the distribution’s tail (\(n_\text {tail} \ge 50\)).

Strong: The requirements of the Weak and Super-Weak sets, and that \(2< {\hat{\alpha }} < 3\).

Strongest: The requirements of the Strong set for at least 90% of graphs, rather than 50%, and for at least 95% of graphs none of the alternative distributions are favored over the power law.
Categorizing a network as “Super-Weak” is in effect saying that the network’s degree distribution data is better modeled by a power-law fit than by the alternative distributions. This is independent of whether or not the power-law model is a good fit to the data, which is what the “Weakest” and “Weak” definitions emphasize. A network may be classified as “Super-Weak” without meeting any of the nested definitions’ criteria. Similarly, a network may be classified as “Weak” without meeting the criteria in the “Super-Weak” definition. We believe this framework is a proper way to classify the degree distributions of biochemical networks, given that there are many different accepted ways to represent biochemical reactions as networks, and each has its pros and cons^{32,33,34}.
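The definitions above can be sketched as a function of per-graph fit results for one dataset (here, its eight network projections). This is an illustrative reading, not the published implementation: the `GraphFit` record and `classify` function are hypothetical names, and we assume the Weak and Strong criteria must hold jointly on the same graph while the Super-Weak fraction is counted separately.

```python
from dataclasses import dataclass


@dataclass
class GraphFit:
    p: float           # goodness-of-fit p-value for the power-law model
    n_tail: int        # number of nodes with degree >= x_min
    alpha: float       # fitted scaling exponent
    alt_favored: bool  # True if any alternative distribution is favored


def classify(fits):
    """Return the list of scale-free categories met by one dataset."""
    n = len(fits)
    if n == 0:
        raise ValueError("need at least one graph")

    def frac(pred):
        return sum(1 for f in fits if pred(f)) / n

    no_alt = frac(lambda f: not f.alt_favored)
    weakest = frac(lambda f: f.p >= 0.1)
    weak = frac(lambda f: f.p >= 0.1 and f.n_tail >= 50)
    strong = frac(lambda f: f.p >= 0.1 and f.n_tail >= 50 and 2 < f.alpha < 3)

    labels = []
    if no_alt >= 0.5:
        labels.append("Super-Weak")
    if weakest >= 0.5:
        labels.append("Weakest")
    if weak >= 0.5:
        labels.append("Weak")
    if strong >= 0.5 and no_alt >= 0.5:
        labels.append("Strong")
    if strong >= 0.9 and no_alt >= 0.95:
        labels.append("Strongest")
    return labels


# Toy dataset of eight projections (illustrative values only).
only_super_weak = [GraphFit(p=0.01, n_tail=120, alpha=2.4, alt_favored=False)] * 8
print(classify(only_super_weak))
```

The toy dataset is labeled Super-Weak only: no alternative is favored, yet the power law itself is rejected by the goodness-of-fit test in every projection, which is exactly the situation the nested definitions are designed to expose.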
Standard error and correlation
The black error bars on each plot represent 2 standard deviations (2SD) around the sample proportion \({\hat{p}}\) (the height of the bar, which we also refer to as the mean). This is equivalent to 2 standard errors around the mean (2SEM), or approximately a 95% confidence interval for the true population proportion p (true population mean). Standard deviation was calculated by treating each category as a binomial distribution, meaning the standard deviation is given by:
\[\sigma _{{\hat{p}}} = \sqrt{\frac{{\hat{p}}(1-{\hat{p}})}{n}},\]
where \(n\) is the number of samples in the category.
Although the errors for each plot’s categories are calculated independently, there is covariance between many of them. This is especially true for the right column of Fig. 3, where all bars of a color total to a fixed number of datasets, with each dataset falling into one of the 8 network projection type bins. Because of this, we also calculated the correlations between each network projection type, across both individuals and ecosystems (Fig. S2). The correlation matrices were calculated by using the pandas function DataFrame.corr(method='pearson') on a matrix of binomially distributed True/False values representing whether each dataset passed or failed specific scale-free criteria for p-value, tail size, or power-law exponent value (\(\alpha\)), for each network projection.
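Both calculations can be sketched without external dependencies. The snippet below computes the 2SD error bar for a pass/fail column and the Pearson correlation between two such columns, which is what a pandas Pearson correlation returns for 0/1 data. The function names, projection labels and pass/fail values are hypothetical.

```python
import math


def two_sd_error(passed):
    """Return (p_hat, 2 * sqrt(p_hat * (1 - p_hat) / n)) for True/False outcomes."""
    n = len(passed)
    p_hat = sum(passed) / n
    return p_hat, 2.0 * math.sqrt(p_hat * (1.0 - p_hat) / n)


def pearson(a, b):
    """Pearson correlation of two equal-length 0/1 (or numeric) sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_a * var_b)


# Toy pass/fail outcomes per dataset for two hypothetical projections.
passes = {
    "projection_A": [True, True, False, True],
    "projection_B": [True, False, False, True],
}
p_hat, err = two_sd_error(passes["projection_A"])
corr = {a: {b: pearson(passes[a], passes[b]) for b in passes} for a in passes}
print(p_hat, err, corr["projection_A"]["projection_B"])
```

Because the error bars treat each category independently, the correlation matrix is the piece that shows how strongly pass/fail outcomes move together across projections.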
Classifying levels of biology using degree distribution data
We used two different statistical methods, multinomial regression and random forest classifiers, in conjunction with the scale-free classification scheme above, to test whether individuals and ecosystems were distinguishable based on their degree distribution characteristics.
Multinomial regression
For our multinomial regression, the response class is the biological level (individual or ecosystem), and a single network or statistical measure is the independent variable (predictor). To control for overfitting, the training data was composed of an equal number of samples from each level. The number of networks used for training data was chosen to be equal in size to 80% of all ecosystem projections, because there were fewer ecosystem datasets than individual datasets. This corresponded to 80% of 6280 networks (of all projection types), or 5024 networks. The model was tested on the 20% of the data that it was not trained on. This process was repeated 100 times, and the average model error is reported in the results and Fig. 5, left columns. The multinom and predict functions from the R package nnet were used to do the multinomial regression.
Random forest classifiers
We used a random forest to attempt to classify networks as falling into the category of individuals or ecosystems. In the first scenario, we used 11 predictors: power-law alpha value (\(\alpha\)); log-likelihood result from power law vs. exponential (dexp); log-likelihood result from power law vs. log-normal (dln); log-likelihood result from power law vs. power law with exponential cutoff (dplwc); log-likelihood result from power law vs. stretched exponential (dstrexp); the network mean degree (\(\langle k \rangle\)); network node size (n); degree distribution tail size (\(n_{tail}\)); network edge size (\(n_{edges}\)); the p-value of the goodness-of-fit test for the power-law model (p); and cutoff degree value for network tail (\(x_{min}\)). In the second scenario, we repeated the random forest without the three predictors which can be directly used to quantify the size of a network (n, \(n_{tail}\), and \(n_{edges}\)). In the third scenario, we repeated the random forest without the three size predictors on each network projection type independently. For each scenario, we randomly split our data into two halves: one for training, and one for testing (for the third scenario, each training and testing set is 1/8 as large as for the first two scenarios, since we run the classifier on each network projection type independently). In all scenarios, we use the randomForest function from the R package randomForest for classification. Three features were used to construct each tree (mtry=3), which is \(\approx \sqrt{n_{features}}\), with 100 trees generated each time (enough for the out-of-bag, or OOB, estimate of the error rate to level off).
References
Strogatz, S. H. Exploring complex networks. Nature 410(6825), 268 (2001).
Albert, R. & Barabási, A. L. Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47 (2002).
Dorogovtsev, S. N. & Mendes, J. F. Evolution of networks. Adv. Phys. 51(4), 1079–1187 (2002).
Barabási, A. L. & Oltvai, Z. N. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. 5(2), 101 (2004).
Newman, M., Barabási, A. L. & Watts, D. J. The Structure and Dynamics of Networks Vol. 19 (Princeton University Press, 2011).
Barabási, A. L. et al. Network Science (Cambridge University Press, 2016).
Newman, M. Networks (Oxford University Press, 2018).
Girvan, M. & Newman, M. E. Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002).
Milo, R. et al. Network motifs: Simple building blocks of complex networks. Science 298(5594), 824–827 (2002).
Newman, M. E. The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003).
Albert, R. Scale-free networks in cell biology. J. Cell Sci. 118(21), 4947–4957 (2005).
Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. & Hwang, D. U. Complex networks: Structure and dynamics. Phys. Rep. 424(4–5), 175–308 (2006).
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabási, A. L. The large-scale organization of metabolic networks. Nature 407(6804), 651–654 (2000).
Koonin, E. V., Wolf, Y. I. & Karev, G. P. Power Laws, Scale-Free Networks and Genome Biology (Springer, 2006).
Guimera, R. & Amaral, L. A. N. Functional cartography of complex metabolic networks. Nature 433(7028), 895 (2005).
Tanaka, R. Scale-rich metabolic networks. Phys. Rev. Lett. 94(16), 168101 (2005).
Gosak, M. et al. Network science of biological systems at different scales: A review. Phys. Life Rev. 24, 118 (2018).
Whitesides, G. M. Is the focus on “molecules” obsolete? Annu. Rev. Anal. Chem. 6, 1–29 (2013).
Cronin, L. & Walker, S. I. Beyond prebiotic chemistry. Science 352(6290), 1174–1175 (2016).
Walker, S. I. & Mathis, C. Network theory in prebiotic evolution. In Prebiotic Chemistry and Chemical Evolution of Nucleic Acid (ed. Menor-Salvan, C.) 263–291 (Springer, 2018).
Barabási, A. L. & Albert, R. Emergence of scaling in random networks. Science 286(5439), 509–512 (1999).
Albert, R., Jeong, H. & Barabási, A. L. Internet: Diameter of the world-wide web. Nature 401(6749), 130 (1999).
Carlson, J. M. & Doyle, J. Highly optimized tolerance: A mechanism for power laws in designed systems. Phys. Rev. E 60(2), 1412 (1999).
Albert, R., Jeong, H. & Barabási, A. L. Error and attack tolerance of complex networks. Nature 406(6794), 378–382 (2000).
Barabási, A. L. Scale-free networks: A decade and beyond. Science 325(5939), 412–413 (2009).
Mitzenmacher, M. A brief history of generative models for power law and lognormal distributions. Internet Math. 1(2), 226–251 (2004).
Barabási, A. L. & Bonabeau, E. Scale-free networks. Sci. Am. 288(5), 60–69 (2003).
Li, L., Alderson, D., Doyle, J. C. & Willinger, W. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Math. 2(4), 431–523 (2005).
Clauset, A., Shalizi, C. R. & Newman, M. E. Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009).
Broido, A. D. & Clauset, A. Scale-free networks are rare. Nat. Commun. 10(1), 1–10 (2019).
Khanin, R. & Wit, E. How scale-free are biological networks. J. Comput. Biol. 13(3), 810–818 (2006).
Holme, P. Model validation of simple-graph representations of metabolism. J. R. Soc. Interface 6(40), 1027–1034 (2009).
Holme, P. & Huss, M. Substance graphs are optimal simple-graph representations of metabolism. Chin. Sci. Bull. 55(27–28), 3161–3168 (2010).
Montanez, R., Medina, M. A., Sole, R. V. & Rodríguez-Caso, C. When metabolism meets topology: Reconciling metabolite and reaction networks. Bioessays 32(3), 246–256 (2010).
Klarreich, E. Scant evidence of power laws found in real-world networks. In Quanta Magazine, 15 (2018).
Cohen, R., Erez, K., Ben-Avraham, D. & Havlin, S. Resilience of the internet to random breakdowns. Phys. Rev. Lett. 85(21), 4626 (2000).
Bollobás, B. & Riordan, O. Robustness and vulnerability of scale-free random graphs. Internet Math. 1(1), 1–35 (2004).
Pastor-Satorras, R. & Vespignani, A. Epidemic spreading in scale-free networks. Phys. Rev. Lett. 86(14), 3200 (2001).
Zhou, B., Meng, X. & Stanley, H. E. Power-law distribution of degree–degree distance: A better representation of the scale-free property of complex networks. Proc. Natl. Acad. Sci. 117(26), 14812–14818 (2020).
Smith, E. & Morowitz, H. J. The Origin and Nature of Life on Earth: The Emergence of the Fourth Geosphere (Cambridge University Press, 2016).
Kim, H., Smith, H. B., Mathis, C., Raymond, J. & Walker, S. I. Universal scaling across biochemical networks on Earth. Sci. Adv. 5(1), eaau0149 (2019).
Featherstone, D. E. & Broadie, K. Wrestling with pleiotropy: Genomic and topological analysis of the yeast gene expression network. Bioessays 24(3), 267–274 (2002).
Guelzim, N., Bottani, S., Bourgine, P. & Képès, F. Topological and causal structure of the yeast transcriptional regulatory network. Nat. Genet. 31(1), 60 (2002).
Li, S. et al. A map of the interactome network of the metazoan C. elegans. Science 303, 540 (2004).
Kaiser, M. A tutorial in connectome analysis: Topological and spatial features of brain networks. Neuroimage 57(3), 892–907 (2011).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1), 27–30 (2000).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44(D1), D457–D462 (2015).
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45(D1), D353–D361 (2016).
Davies, P. C. et al. Signatures of a shadow biosphere. Astrobiology 9(2), 241–249 (2009).
Locey, K. J. & Lennon, J. T. Scaling laws predict global microbial diversity. Proc. Natl. Acad. Sci. 113(21), 5970–5975 (2016).
Klamt, S., Haus, U. U. & Theis, F. Hypergraphs and cellular networks. PLoS Comput. Biol. 5(5), e1000385 (2009).
Wattam, A. R. et al. Improvements to PATRIC, the all-bacterial bioinformatics database and analysis resource center. Nucleic Acids Res. 45(D1), D535–D542 (2016).
Markowitz, V. M. et al. IMG: The integrated microbial genomes database and comparative analysis system. Nucleic Acids Res. 40(D1), D115–D122 (2011).
Clauset, A., Young, M. & Gleditsch, K. S. On the frequency of severe terrorist events. J. Conflict Resolut. 51(1), 58–87 (2007).
Broido, A.D. SFAnalysis (2017). https://github.com/adbroido/SFAnalysis.
Alstott, J., Bullmore, E. & Plenz, D. Powerlaw: A Python package for analysis of heavy-tailed distributions. PLoS ONE 9(1), e85777 (2014).
Acknowledgements
We thank the Emergence@ASU team (especially Doug Moore, Cole Mathis, and Jake Hanson) for feedback through various stages of this work.
Funding
The funding was provided by National Aeronautics and Space Administration (NNX15AL24G S02).
Author information
Contributions
H.B.S., H.K. and S.W. conceived of the idea. H.B.S. performed the analysis. H.B.S., H.K. and S.W. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Smith, H.B., Kim, H. & Walker, S.I. Scarcity of scale-free topology is universal across biochemical networks. Sci Rep 11, 6542 (2021). https://doi.org/10.1038/s41598-021-85903-1