Abstract
In a cell or microorganism, the processes that generate mass, energy, information transfer and cell-fate specification are seamlessly integrated through a complex network of cellular constituents and reactions1. However, despite the key role of these networks in sustaining cellular functions, their large-scale structure is essentially unknown. Here we present a systematic comparative mathematical analysis of the metabolic networks of 43 organisms representing all three domains of life. We show that, despite significant variation in their individual constituents and pathways, these metabolic networks have the same topological scaling properties and show striking similarities to the inherent organization of complex non-biological systems2. This may indicate that metabolic organization is not only identical for all living organisms, but also complies with the design principles of robust and error-tolerant scale-free networks2, 3, 4, 5, and may represent a common blueprint for the large-scale organization of interactions among all cellular constituents.
An important goal in biology is to uncover the fundamental design principles that provide the common underlying structure and function in all cells and microorganisms6, 7, 8, 9, 10, 11, 12, 13. For example, it is increasingly appreciated that the robustness of various cellular processes is rooted in the dynamic interactions among its many constituents14, 15, 16, such as proteins, DNA, RNA and small molecules. Scientific developments have improved our ability to identify the design principles that integrate these interactions into a complex system. Large-scale sequencing projects have not only provided complete sequence information for a number of genomes, but also allowed the development of integrated pathway–genome databases17, 18, 19 that provide organism-specific connectivity maps of metabolic and, to a lesser extent, other cellular networks. However, owing to the large number and diversity of the constituents and reactions that form such networks, these maps are extremely complex, offering only limited insight into the organizational principles of these systems. Our ability to address in quantitative terms the structure of these cellular networks has benefited from advances in understanding the generic properties of complex networks2.
Until recently, complex networks have been modelled using the classical
random network theory introduced by Erdös and Rényi20, 21.
The Erdös–Rényi model assumes that each pair of nodes (that
is, constituents) in the network is connected randomly with probability
p, leading to a statistically homogeneous network in which, despite the
fundamental randomness of the model, most nodes have the same number of links,
$\langle k \rangle$ (Fig. 1a). In particular, the connectivity
follows a Poisson distribution that peaks strongly at $\langle k \rangle$
(Fig. 1b), implying that the probability of finding
a highly connected node decays exponentially ($P(k) \sim e^{-k}$ for
$k \gg \langle k \rangle$). On the other hand,
empirical studies on the structure of the World-Wide Web22,
Internet23 and social networks2 have reported
serious deviations from this random structure, showing that these systems
are described by scale-free networks2 (Fig. 1c), for which P(k) follows
a power-law, $P(k) \sim k^{-\gamma}$ (Fig. 1d).
Unlike exponential networks, scale-free networks are extremely heterogeneous,
their topology being dominated by a few highly connected nodes (hubs) which
link the rest of the less connected nodes to the system (Fig.
1c). As the distinction between scale-free and exponential networks
emerges as a result of simple dynamical principles24, 25, understanding
the large-scale structure of cellular networks can not only provide valuable
and perhaps universal structural information, but could also lead to a better
understanding of the dynamical processes that generated these networks. In
this respect the emergence of power-law distribution is intimately linked
to the growth of the network in which new nodes are preferentially attached
to already established nodes2, a property that is also thought
to characterize the evolution of biological systems1.
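The contrast between the two network classes can be illustrated numerically. The following is a minimal sketch, not the analysis used in this work: it grows a network by a simple preferential-attachment rule (each new node links to m existing nodes chosen with probability proportional to their current degree) and compares it with an Erdös–Rényi network of matching mean degree. The function names, the network size n, the value of m and the random seed are all illustrative assumptions.

```python
import random

def erdos_renyi_degrees(n, p, rng):
    """Connect every pair of nodes independently with probability p."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree

def preferential_attachment_degrees(n, m, rng):
    """Grow a network in which each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    # Start from a fully connected core of m + 1 nodes (an assumption).
    degree = [m] * (m + 1)
    # Each node appears in `stubs` once per link end, so a uniform draw
    # from `stubs` is a degree-proportional draw over nodes.
    stubs = [i for i in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        degree.append(m)
        for t in targets:
            degree[t] += 1
            stubs.extend([t, new])
    return degree

rng = random.Random(0)  # fixed seed so the run is repeatable
n, m = 1000, 2
scale_free = preferential_attachment_degrees(n, m, rng)
mean_k = sum(scale_free) / n
random_net = erdos_renyi_degrees(n, mean_k / (n - 1), rng)

print("mean degree:", mean_k, sum(random_net) / n)
print("largest degree:", max(scale_free), max(random_net))
```

Both networks have nearly the same mean degree, yet the largest degree in the grown network is expected to be several times that of the random network, which is the signature of hubs.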
Figure 1: Attributes of generic network structures.

a, Representative structure of the network generated by the Erdös–Rényi
network model20,21. b, The network connectivity can
be characterized by the probability, P(k), that a node has
k links. For a random network P(k) peaks strongly at
$k = \langle k \rangle$ and decays exponentially for large k (that
is, $P(k) \sim e^{-k}$ for $k \gg \langle k \rangle$ and
$k \ll \langle k \rangle$).
c, In the scale-free network most nodes have only a few links, but a few
nodes, called hubs (red), have a very large number of links. d,
P(k) for a scale-free network has no well-defined peak, and for
large k it decays as a power-law, $P(k) \sim k^{-\gamma}$,
appearing as a straight line with slope $-\gamma$
on a log–log plot. e, A portion of the WIT database for E.
coli. Each substrate can be represented as a node of the graph, linked
through temporary educt–educt complexes (black boxes) from which the
products emerge as new nodes (substrates). The enzymes, which provide the
catalytic scaffolds for the reactions, are shown by their EC numbers.
To begin to address the large-scale structural organization of cellular networks, we have examined the topological properties of the core metabolic network of 43 different organisms based on data deposited in the WIT database19. This integrated pathway–genome database predicts the existence of a given metabolic pathway on the basis of the annotated genome of an organism combined with firmly established data from the biochemical literature. As 18 of the 43 genomes deposited in the database are not yet fully sequenced, and a substantial portion of the identified open reading frames are functionally unassigned, the list of enzymes, and consequently the list of substrates and reactions (see Table 1 in Supplementary Information), will certainly be expanded in the future. Nevertheless, this publicly available database represents our best approximation for the metabolic pathways in 43 organisms and provides sufficient data for their unambiguous statistical analysis (see Methods and Supplementary Information).
As we show in Fig. 1e, we first established a graph theoretic representation of the biochemical reactions taking place in a given metabolic network. In this representation, a metabolic network is built up of nodes, the substrates, that are connected to one another through links, which are the actual metabolic reactions. The physical entity of the link is the temporary educt–educt complex itself, in which enzymes provide the catalytic scaffolds for the reactions yielding products, which in turn can become educts for subsequent reactions. This representation allows us systematically to investigate and quantify the topologic properties of various metabolic networks using the tools of graph theory and statistical mechanics21.
Our first goal was to identify the structure of the metabolic networks:
that is, to establish whether their topology is best described by the inherently
random and uniform exponential model21 (Fig.
1a, b), or the highly heterogeneous scale-free
model2 (Fig. 1c, d).
As illustrated in Fig. 2, our results convincingly indicate
that the probability that a given substrate participates in k reactions
follows a power-law distribution; in other words, metabolic networks belong
to the class of scale-free networks. As under physiological conditions a large
number of biochemical reactions (links) in a metabolic network are preferentially
catalysed in one direction (the links are directed), for each node we distinguish
between incoming and outgoing links (Fig. 1e). For instance,
in Escherichia coli the probability that a substrate participates as
an educt in k metabolic reactions follows $P(k) \sim k^{-\gamma_{\mathrm{in}}}$,
with $\gamma_{\mathrm{in}} = 2.2$, and the probability
that a given substrate is produced by k different metabolic reactions
follows a similar distribution, with $\gamma_{\mathrm{out}} = 2.2$
(Fig. 2b). We find that scale-free networks describe the metabolic networks
in all organisms in all three domains of life (Fig. 2a–c;
see Supplementary Information, also available
at www.nd.edu/~networks/cell), indicating the generic nature of this structural
organization (Fig. 2d).
Figure 2: Connectivity distributions P(k) for substrates.

a, Archaeoglobus fulgidus (archaeon); b, E. coli
(bacterium); c, Caenorhabditis elegans (eukaryote), shown
on a log–log plot, counting separately the incoming (In) and outgoing
links (Out) for each substrate. $k_{\mathrm{in}}$ ($k_{\mathrm{out}}$)
corresponds to the number of reactions in which a substrate participates as
a product (educt). The characteristics of the three organisms shown in
a–c and the exponents $\gamma_{\mathrm{in}}$ and $\gamma_{\mathrm{out}}$
for all organisms are given in Table 1 of the Supplementary Information.
d, The connectivity distribution averaged over all 43 organisms.
A general feature of many complex networks is their small-world character26, meaning that any two nodes in the system can be connected by relatively short paths along existing links. In metabolic networks these paths correspond to the biochemical pathway connecting two substrates (Fig. 3a). The degree of interconnectivity of a metabolic network can be characterized by the network diameter, defined as the shortest biochemical pathway averaged over all pairs of substrates. For all non-biological networks examined, the average connectivity of a node is fixed, which implies that the diameter of a network increases logarithmically with the addition of new nodes2, 26, 27. For metabolic networks this implies that a more complex bacterium with more enzymes and substrates, such as E. coli, would have a larger diameter than a simple bacterium, such as Mycoplasma genitalium. We find, however, that the diameter of the metabolic network is the same for all 43 organisms, irrespective of the number of substrates found in the given species (Fig. 3b). This is unexpected, and is possible only if with increasing organism complexity individual substrates are increasingly connected to maintain a relatively constant metabolic network diameter. We find that the average number of reactions in which a certain substrate participates increases with the number of substrates found within a given organism (Fig. 3c, d).
Figure 3: Properties of metabolic networks.

a, The histogram of the biochemical pathway lengths, l, in
E. coli. b, The average path length (diameter) for each of the
43 organisms. Error bars represent the standard deviation
$\sqrt{\langle l^2 \rangle - \langle l \rangle^2}$,
as determined from $\Pi(l)$ (shown in a for E. coli).
c, d, Average number of incoming links (c) or outgoing links
(d) per node for each organism. e, The effect of substrate removal
on the metabolic network diameter of E. coli. In the top curve (red)
the most connected substrates are removed first. In the bottom curve (green)
nodes are removed randomly. M = 60 corresponds to approximately 8% of
the total number of substrates found in E. coli. f, Standard
deviation of the substrate ranking ($\sigma_r$) as a function
of the average ranking $\langle r \rangle_o$ for substrates present
in all 43 organisms investigated. The horizontal axis in b–d
denotes the number of nodes in each organism. b–d,
Archaea (magenta), bacteria (green) and eukaryotes (blue) are shown.
An important consequence of the power-law connectivity distribution is that a few hubs dominate the overall connectivity of the network (Fig. 1c), and upon the sequential removal of the most connected nodes the diameter of the network rises sharply, the network eventually disintegrating into isolated clusters that are no longer functional. But scale-free networks also demonstrate unexpected robustness against random errors5. To investigate whether metabolic networks display a similar error tolerance we performed computer simulations on the metabolic network of E. coli. Upon removal of the most connected substrates the diameter increases rapidly, illustrating the special role of these metabolites in maintaining a constant metabolic network diameter (Fig. 3e). However, when M randomly chosen substrates are removed (mimicking the consequence of random mutations of catalysing enzymes), the average distance between the remaining nodes is not affected, indicating a striking insensitivity to random errors. Indeed, in silico and in vivo mutagenesis studies indicate remarkable fault tolerance upon removal of a substantial number of metabolic enzymes from the E. coli metabolic network28. Data similar to those shown in Fig. 3e have been obtained for all organisms investigated, without detectable correlations with their evolutionary position.
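The logic of this removal experiment can be sketched on a toy network. The example below is an illustration under stated assumptions, not the simulation performed on the E. coli data: it uses a small undirected network consisting of one hub linked to a ring of 20 nodes, removes either the most connected node or one randomly chosen node, and recomputes the average shortest-path length by breadth-first search. The network, the helper names, the number of trials and the seed are all hypothetical.

```python
import random
from collections import deque

def average_path_length(adjacency):
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else float("inf")

def remove_nodes(adjacency, doomed):
    """Return a copy of the network without the nodes in `doomed`."""
    doomed = set(doomed)
    return {n: [m for m in nbrs if m not in doomed]
            for n, nbrs in adjacency.items() if n not in doomed}

def most_connected(adjacency, count):
    """The `count` nodes with the most links in the intact network."""
    return sorted(adjacency, key=lambda n: len(adjacency[n]), reverse=True)[:count]

# Toy network (assumption): hub 0 linked to every node of a 20-node ring.
ring = list(range(1, 21))
adjacency = {0: list(ring)}
for i in ring:
    previous = 20 if i == 1 else i - 1
    following = 1 if i == 20 else i + 1
    adjacency[i] = [0, previous, following]

baseline = average_path_length(adjacency)
attacked = average_path_length(remove_nodes(adjacency, most_connected(adjacency, 1)))

rng = random.Random(1)
trials = [average_path_length(remove_nodes(adjacency, rng.sample(list(adjacency), 1)))
          for _ in range(50)]
random_mean = sum(trials) / len(trials)

print("intact:", baseline)
print("hub removed:", attacked)
print("random node removed (mean of 50 trials):", random_mean)
```

In this toy case removing the hub forces all traffic onto the ring and the average path length rises sharply, whereas a random removal usually hits a ring node and leaves it almost unchanged. For simplicity the most connected nodes are identified once in the intact network; a sequential removal would recompute the ranking after each deletion.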
As the large-scale architecture of the metabolic network rests on the most
highly connected substrates, we need to investigate whether the same substrates
act as hubs in all organisms, or whether there are organism-specific differences
in the identity of the most connected substrates. When we rank all the substrates
in a given organism on the basis of the number of links they have (Table 1;
see Supplementary Information), we find that the ranking
of the most connected substrates is practically identical for all 43 organisms.
Also, only around 4% of all substrates found across the 43 organisms
are present in every species. These substrates represent the most highly connected
substrates found in any individual organism, indicating the generic utilization
of the same substrates by each species. In contrast, species-specific differences
among organisms emerge for less connected substrates. To quantify this observation,
we examined the standard deviation ($\sigma_r$) of the rank
for substrates that are present in all 43 organisms. As shown in
Fig. 3f, $\sigma_r$ increases with the average rank
order $\langle r \rangle_o$, implying that the most connected substrates have
a relatively fixed position in the rank order, but the ranking of less connected
substrates is increasingly species-specific. Thus, the large-scale structure
of the metabolic network is identical for all 43 species, being dominated
by the same highly connected substrates, while less connected substrates preferentially
serve as the educts or products of species-specific enzymatic activities.
The contemporary topology of a metabolic network reflects a long evolutionary process moulded in general for a robust response towards internal defects and environmental fluctuations and in particular to the ecological niche occupied by a specific organism. As a result, we would expect that these networks are far from random, and our data show that the large-scale structural organization of metabolic networks is indeed very similar to that of robust and error-tolerant networks2, 5. The uniform network topology observed in all 43 organisms indicates that, irrespective of their individual building blocks or species-specific reaction pathways, the large-scale structure of metabolic networks may be identical in all living organisms, in which the same highly connected substrates may provide the connections between modules responsible for distinct metabolic functions1.
A unique feature of metabolic networks, as opposed to non-biological scale-free networks, is the apparent conservation of the network diameter in all living organisms. Within the special characteristics of living systems this attribute may represent an additional survival and growth advantage, as a larger diameter would attenuate the organism's ability to respond efficiently to external changes or internal errors. For example, if the concentration of a substrate were to suddenly diminish owing to a mutation in its main catalysing enzyme, offsetting the changes would involve the activation of longer alternative biochemical pathways, and consequently the synthesis of more new enzymes, than within a metabolic network with a smaller diameter.
How generic are these principles for other cellular networks (for example, apoptosis or cell cycle)? Although the current mathematical tools do not allow unambiguous statistical analysis of the topology of other networks owing to their relatively small size, our preliminary analysis indicates that the connectivity distribution of non-metabolic pathways may also follow a power-law distribution, indicating that cellular networks as a whole are scale-free networks. Therefore, the evolutionary selection of a robust and error-tolerant architecture may characterize all cellular networks, for which scale-free topology with a conserved network diameter appears to provide an optimal structural organization.
Methods
Database preparation
For our analyses of core cellular metabolisms we used the 'Intermediate metabolism and bioenergetics' portions of the WIT database19 (http://igweb.integratedgenomics.com/IGwit/), which predicts the existence of a metabolic pathway in an organism on the basis of its annotated genome (on the presence of the presumed open reading frame of an enzyme that catalyses a given metabolic reaction), in combination with firmly established data from the biochemical literature. As of December 1999, this database provides descriptions for 6 archaea, 32 bacteria and 5 eukaryotes. The downloaded data were manually rechecked, removing synonyms and substrates without defined chemical identity.
Construction of metabolic network matrices
Biochemical
reactions described within the WIT database are composed of substrates and enzymes
connected by directed links. For each reaction, educts and products were considered
as nodes connected to the temporary educt–educt complexes and associated
enzymes. Bidirectional reactions were considered separately. For a given organism
with N substrates, E enzymes and R intermediate complexes
the full stoichiometric interactions were compiled into an
$(N + E + R) \times (N + E + R)$ matrix, generated
separately for each of the 43 organisms.
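One way such a matrix could be assembled is sketched below. This is a simplified illustration, not the procedure applied to the WIT data: entries are 0 or 1 rather than stoichiometric coefficients, enzymes are assumed to point to the complex they catalyse, and a bidirectional reaction would be entered as two separate records. The record format, the substrate and enzyme names, and the function name are hypothetical.

```python
def build_network_matrix(reactions):
    """Return (labels, matrix); matrix[i][j] = 1 means a directed link i -> j."""
    substrates, enzymes = [], []
    for rxn in reactions:
        for s in rxn["educts"] + rxn["products"]:
            if s not in substrates:
                substrates.append(s)
        if rxn["enzyme"] not in enzymes:
            enzymes.append(rxn["enzyme"])
    # One intermediate complex per reaction record.
    complexes = [f"complex_{i}" for i in range(len(reactions))]

    labels = substrates + enzymes + complexes      # N + E + R nodes
    index = {name: i for i, name in enumerate(labels)}
    size = len(labels)
    matrix = [[0] * size for _ in range(size)]

    for i, rxn in enumerate(reactions):
        c = index[f"complex_{i}"]
        for s in rxn["educts"]:
            matrix[index[s]][c] = 1                # educt -> complex
        for s in rxn["products"]:
            matrix[c][index[s]] = 1                # complex -> product
        matrix[index[rxn["enzyme"]]][c] = 1        # enzyme -> complex (assumed)
    return labels, matrix

# Two made-up reactions: A + B -> C and C -> D + B.
reactions = [
    {"educts": ["A", "B"], "products": ["C"], "enzyme": "enzyme_1"},
    {"educts": ["C"], "products": ["D", "B"], "enzyme": "enzyme_2"},
]
labels, matrix = build_network_matrix(reactions)
print(len(labels), "nodes;", sum(map(sum, matrix)), "directed links")
```

Here N = 4, E = 2 and R = 2, so the matrix is 8 × 8.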
Connectivity distribution P(k)
Substrates generated by a biochemical reaction are products, and are characterized by incoming links pointing to them. For each substrate we determined $k_{\mathrm{in}}$, and prepared a histogram for each organism, showing how many substrates have exactly $k_{\mathrm{in}} = 0, 1, \ldots$. Dividing each point of the histogram by the total number of substrates in the organism provided $P(k_{\mathrm{in}})$, the probability that a substrate has $k_{\mathrm{in}}$ incoming links. Substrates that participate as educts in a reaction have outgoing links. We repeated this analysis for outgoing links, determining the number of outgoing links ($k_{\mathrm{out}}$) for each substrate and the corresponding $P(k_{\mathrm{out}})$. To reduce noise, logarithmic binning was applied.
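The counting and binning steps can be sketched as follows. This is a minimal illustration on made-up links, assuming a list of directed substrate-to-substrate links as input and bins that double in width (1, 2–3, 4–7, ...); the binning actually used is not specified here, and all names are hypothetical.

```python
from collections import Counter

def link_counts(links, substrates):
    """Count incoming and outgoing links per substrate."""
    k_in = {s: 0 for s in substrates}
    k_out = {s: 0 for s in substrates}
    for educt, product in links:
        k_out[educt] += 1
        k_in[product] += 1
    return k_in, k_out

def connectivity_distribution(degrees):
    """P(k): fraction of substrates with exactly k links."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: c / total for k, c in sorted(counts.items())}

def log_binned(distribution, base=2):
    """Average P(k) over bins [1, 2), [2, 4), [4, 8), ...; k = 0 is skipped."""
    binned = []
    low = 1
    k_max = max(distribution)
    while low <= k_max:
        high = low * base
        mass = sum(p for k, p in distribution.items() if low <= k < high)
        if mass > 0:
            # Divide by the number of integer k values in the bin.
            binned.append((low, high, mass / (high - low)))
        low = high
    return binned

# Made-up directed links (educt, product) between four substrates.
substrates = ["A", "B", "C", "D"]
links = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "B"), ("A", "D")]

k_in, k_out = link_counts(links, substrates)
p_in = connectivity_distribution(list(k_in.values()))
p_out = connectivity_distribution(list(k_out.values()))
print(p_in, log_binned(p_in))
```

On a log–log plot the binned values would be plotted against a representative k for each bin, for example the geometric mean of the bin edges.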
Biochemical pathway lengths [$\Pi(l)$]
For all
pairs of substrates, the shortest biochemical pathway, $\Pi(l)$ (that
is, the smallest number of reactions by which one can reach substrate B from
substrate A) was determined using a burning algorithm. From $\Pi(l)$
we determined the diameter,
$D = \sum_l l\,\Pi(l) / \sum_l \Pi(l)$, which represents the average
path length between any two substrates.
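The calculation of $\Pi(l)$ and D can be sketched as below. The example uses a plain breadth-first search, which we assume gives the same shortest path lengths as the burning algorithm; it is not the original implementation. Pairs with no connecting path are left out of the histogram, which is also an assumption, and the four-substrate network is made up.

```python
from collections import Counter, deque

def path_length_histogram(adjacency):
    """Pi(l): number of ordered pairs (A, B) whose shortest directed
    path from A to B has length l. Unreachable pairs are ignored."""
    histogram = Counter()
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        for target, length in distance.items():
            if target != source:
                histogram[length] += 1
    return histogram

def diameter(histogram):
    """D = sum_l l * Pi(l) / sum_l Pi(l)."""
    total = sum(histogram.values())
    return sum(l * count for l, count in histogram.items()) / total

# Made-up directed network: A -> B -> C -> A, and C -> D.
adjacency = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}

histogram = path_length_histogram(adjacency)
D = diameter(histogram)
print(dict(histogram), D)
```

For this toy network there are four pairs at distance 1, four at distance 2 and one at distance 3, giving D = 15/9.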
Substrate ranking $\langle r \rangle_o$, $\sigma(r)$
Substrates present in all 43 organisms (a total of 51
substrates) were ranked on the basis of the number of links each had in each
organism, having considered incoming and outgoing links separately (r
= 1 was assigned for the substrate with the largest number of connections,
r = 2 for the second most connected one, and so on). This gave a well
defined r value in each organism for each substrate. The average rank
$\langle r \rangle_o$ for each substrate was determined by averaging
r over the 43 organisms. We also determined the standard deviation,
$\sigma(r) = \sqrt{\langle r^2 \rangle_o - \langle r \rangle_o^2}$,
for all 51 substrates present in all
organisms.
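The ranking statistics can be sketched in a few lines. The example is an illustration on invented link counts for three hypothetical organisms; it assumes that only the shared substrates are ranked against each other and that ties are broken alphabetically, neither of which is specified above.

```python
from statistics import mean, pstdev

def rank_substrates(link_counts, shared):
    """Rank the shared substrates within one organism; rank 1 = most links.
    Ties are broken by name (an assumption)."""
    ordered = sorted(shared, key=lambda s: (-link_counts[s], s))
    return {s: r for r, s in enumerate(ordered, start=1)}

def rank_statistics(organisms):
    """Return {substrate: (average rank, standard deviation of rank)}."""
    shared = set.intersection(*(set(counts) for counts in organisms.values()))
    ranks = [rank_substrates(counts, shared) for counts in organisms.values()]
    # pstdev is the population form, sqrt(<r^2> - <r>^2).
    return {s: (mean(r[s] for r in ranks), pstdev([r[s] for r in ranks]))
            for s in sorted(shared)}

# Invented link counts; S5 is absent from two organisms and is dropped.
organisms = {
    "organism_1": {"S1": 12, "S2": 10, "S3": 3, "S4": 1},
    "organism_2": {"S1": 11, "S2": 9, "S3": 1, "S4": 4, "S5": 2},
    "organism_3": {"S1": 15, "S2": 8, "S3": 2, "S4": 5},
}

stats = rank_statistics(organisms)
for substrate, (avg, sd) in stats.items():
    print(substrate, round(avg, 3), round(sd, 3))
```

In this toy data the two most connected substrates keep the same rank in every organism (zero standard deviation), while the less connected ones swap places, mirroring the trend described for Fig. 3f.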
Analysis of the effect of database errors
Of the 43 organisms whose metabolic network we have analysed, the genomes of 25 have been completely sequenced (5 archaea, 18 bacteria and 2 eukaryotes), whereas the remaining 18 are only partially sequenced. Therefore two main sources of possible errors in the database could affect our analysis: the erroneous annotation of enzymes and, consequently, biochemical reactions (the likely source of error for the organisms with completely sequenced genomes); and reactions and pathways missing from the database (for organisms with incompletely sequenced genomes, both sources of error are possible). We investigated the effect of database errors on the validity of our findings. The data, presented in Supplementary Information, indicate that our results are robust to these errors.


