Letters to Nature

Nature 407, 651-654 (5 October 2000) | doi:10.1038/35036627; Received 3 April 2000; Accepted 18 July 2000

The large-scale organization of metabolic networks

H. Jeong1, B. Tombor2, R. Albert1, Z. N. Oltvai2 & A.-L. Barabási1

  1. Department of Physics, University of Notre Dame, Notre Dame, Indiana 46556, USA
  2. Department of Pathology, Northwestern University Medical School, Chicago, Illinois 60611, USA

Correspondence and requests for materials should be addressed to A.-L.B. (e-mail: alb@nd.edu) or Z.N.O. (e-mail: zno008@northwestern.edu).

In a cell or microorganism, the processes that generate mass, energy, information transfer and cell-fate specification are seamlessly integrated through a complex network of cellular constituents and reactions1. However, despite the key role of these networks in sustaining cellular functions, their large-scale structure is essentially unknown. Here we present a systematic comparative mathematical analysis of the metabolic networks of 43 organisms representing all three domains of life. We show that, despite significant variation in their individual constituents and pathways, these metabolic networks have the same topological scaling properties and show striking similarities to the inherent organization of complex non-biological systems2. This may indicate that metabolic organization is not only identical for all living organisms, but also complies with the design principles of robust and error-tolerant scale-free networks2, 3, 4, 5, and may represent a common blueprint for the large-scale organization of interactions among all cellular constituents.

An important goal in biology is to uncover the fundamental design principles that provide the common underlying structure and function in all cells and microorganisms6, 7, 8, 9, 10, 11, 12, 13. For example, it is increasingly appreciated that the robustness of various cellular processes is rooted in the dynamic interactions among its many constituents14, 15, 16, such as proteins, DNA, RNA and small molecules. Scientific developments have improved our ability to identify the design principles that integrate these interactions into a complex system. Large-scale sequencing projects have not only provided complete sequence information for a number of genomes, but also allowed the development of integrated pathway–genome databases17, 18, 19 that provide organism-specific connectivity maps of metabolic and, to a lesser extent, other cellular networks. However, owing to the large number and diversity of the constituents and reactions that form such networks, these maps are extremely complex, offering only limited insight into the organizational principles of these systems. Our ability to address in quantitative terms the structure of these cellular networks has benefited from advances in understanding the generic properties of complex networks2.

Until recently, complex networks have been modelled using the classical random network theory introduced by Erdös and Rényi20, 21. The Erdös–Rényi model assumes that each pair of nodes (that is, constituents) in the network is connected randomly with probability p, leading to a statistically homogeneous network in which, despite the fundamental randomness of the model, most nodes have approximately the same number of links, $\langle k \rangle$ (Fig. 1a). In particular, the connectivity follows a Poisson distribution that peaks strongly at $\langle k \rangle$ (Fig. 1b), implying that the probability of finding a highly connected node decays exponentially ($P(k) \sim e^{-k}$ for $k \gg \langle k \rangle$). On the other hand, empirical studies on the structure of the World-Wide Web22, Internet23 and social networks2 have reported serious deviations from this random structure, showing that these systems are described by scale-free networks2 (Fig. 1c), for which P(k) follows a power-law, $P(k) \sim k^{-\gamma}$ (Fig. 1d). Unlike exponential networks, scale-free networks are extremely heterogeneous, their topology being dominated by a few highly connected nodes (hubs) which link the rest of the less connected nodes to the system (Fig. 1c). As the distinction between scale-free and exponential networks emerges as a result of simple dynamical principles24, 25, understanding the large-scale structure of cellular networks can not only provide valuable and perhaps universal structural information, but could also lead to a better understanding of the dynamical processes that generated these networks. In this respect the emergence of power-law distribution is intimately linked to the growth of the network in which new nodes are preferentially attached to already established nodes2, a property that is also thought to characterize the evolution of biological systems1.
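The contrast between the two network classes can be illustrated with a minimal Python sketch. It is not the procedure used in this paper: the function names, the network size, the seed and the choice of m = 2 links per new node are all assumptions made for illustration. One function links every pair of nodes with probability p; the other grows a network by preferential attachment, in which new nodes link to existing nodes with probability proportional to their degree.

```python
import random


def erdos_renyi_degrees(n, p, rng):
    """Degrees of a random graph in which each pair is linked with probability p."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree


def preferential_attachment_degrees(n, m, rng):
    """Degrees of a grown graph: each new node attaches m links to existing
    nodes chosen with probability proportional to their current degree."""
    # Start from a fully connected core of m + 1 nodes (each has degree m).
    degree = [m] * (m + 1)
    # Every node appears in `stubs` once per link end, so a uniform draw
    # from `stubs` is a degree-proportional draw over nodes.
    stubs = [i for i in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        degree.append(m)
        for t in targets:
            degree[t] += 1
            stubs.extend((t, new))
    return degree


rng = random.Random(0)      # fixed seed, chosen arbitrarily
n = 1000                    # illustrative size, not taken from the paper
er = erdos_renyi_degrees(n, 4 / (n - 1), rng)       # mean degree close to 4
pa = preferential_attachment_degrees(n, 2, rng)     # mean degree close to 4
print("random graph     mean, max degree:", sum(er) / n, max(er))
print("pref. attachment mean, max degree:", sum(pa) / n, max(pa))
```

Both networks have nearly the same mean degree, but the grown network should contain a few nodes whose degree is far above the mean, which is the hub structure described above.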

Figure 1: Attributes of generic network structures.

a, Representative structure of the network generated by the Erdös–Rényi network model20,21. b, The network connectivity can be characterized by the probability, P(k), that a node has k links. For a random network P(k) peaks strongly at $k = \langle k \rangle$ and decays exponentially for large k (that is, $P(k) \sim e^{-k}$ for $k \gg \langle k \rangle$ and $k \ll \langle k \rangle$). c, In the scale-free network most nodes have only a few links, but a few nodes, called hubs (red), have a very large number of links. d, P(k) for a scale-free network has no well-defined peak, and for large k it decays as a power-law, $P(k) \sim k^{-\gamma}$, appearing as a straight line with slope $-\gamma$ on a log–log plot. e, A portion of the WIT database for E. coli. Each substrate can be represented as a node of the graph, linked through temporary educt–educt complexes (black boxes) from which the products emerge as new nodes (substrates). The enzymes, which provide the catalytic scaffolds for the reactions, are shown by their EC numbers.

To begin to address the large-scale structural organization of cellular networks, we have examined the topological properties of the core metabolic network of 43 different organisms based on data deposited in the WIT database19. This integrated pathway–genome database predicts the existence of a given metabolic pathway on the basis of the annotated genome of an organism combined with firmly established data from the biochemical literature. As 18 of the 43 genomes deposited in the database are not yet fully sequenced, and a substantial portion of the identified open reading frames are functionally unassigned, the list of enzymes, and consequently the list of substrates and reactions (see Table 1 in Supplementary Information), will certainly be expanded in the future. Nevertheless, this publicly available database represents our best approximation for the metabolic pathways in 43 organisms and provides sufficient data for their unambiguous statistical analysis (see Methods and Supplementary Information).

As we show in Fig. 1e, we first established a graph theoretic representation of the biochemical reactions taking place in a given metabolic network. In this representation, a metabolic network is built up of nodes, the substrates, that are connected to one another through links, which are the actual metabolic reactions. The physical entity of the link is the temporary educt–educt complex itself, in which enzymes provide the catalytic scaffolds for the reactions yielding products, which in turn can become educts for subsequent reactions. This representation allows us systematically to investigate and quantify the topologic properties of various metabolic networks using the tools of graph theory and statistical mechanics21.

Our first goal was to identify the structure of the metabolic networks: that is, to establish whether their topology is best described by the inherently random and uniform exponential model21 (Fig. 1a, b), or the highly heterogeneous scale-free model2 (Fig. 1c, d). As illustrated in Fig. 2, our results convincingly indicate that the probability that a given substrate participates in k reactions follows a power-law distribution; in other words, metabolic networks belong to the class of scale-free networks. As under physiological conditions a large number of biochemical reactions (links) in a metabolic network are preferentially catalysed in one direction (the links are directed), for each node we distinguish between incoming and outgoing links (Fig. 1e). For instance, in Escherichia coli the probability that a substrate participates as an educt in k metabolic reactions follows $P(k) \sim k^{-\gamma_{in}}$, with $\gamma_{in} = 2.2$, and the probability that a given substrate is produced by k different metabolic reactions follows a similar distribution, with $\gamma_{out} = 2.2$ (Fig. 2b). We find that scale-free networks describe the metabolic networks in all organisms in all three domains of life (Fig. 2a–c; see Supplementary Information, also available at www.nd.edu/~networks/cell), indicating the generic nature of this structural organization (Fig. 2d).
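Because a power law appears as a straight line on a log–log plot, its exponent can be read off as the negative slope of log P(k) against log k. The sketch below uses an ordinary least-squares fit, which is an assumption: the text does not state how the exponents were obtained. The function name and the synthetic, noise-free data are hypothetical, and the value 2.2 is used only to echo the E. coli exponent quoted above.

```python
import math


def fit_power_law_exponent(ks, pk):
    """Least-squares slope of log P(k) versus log k; returns gamma = -slope."""
    pts = [(math.log(k), math.log(p)) for k, p in zip(ks, pk) if k > 0 and p > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx


# Synthetic data lying exactly on P(k) = k^(-2.2); real data would be noisy.
ks = list(range(1, 101))
pk = [k ** -2.2 for k in ks]
gamma = fit_power_law_exponent(ks, pk)
print("fitted exponent:", gamma)
```

On measured distributions a fit of this kind is sensitive to noise in the sparsely populated large-k tail, which is one reason to bin the data logarithmically before fitting.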

Figure 2: Connectivity distributions P(k) for substrates.

a, Archaeoglobus fulgidus (archaeon); b, E. coli (bacterium); c, Caenorhabditis elegans (eukaryote), shown on a log–log plot, counting separately the incoming (In) and outgoing links (Out) for each substrate. $k_{in}$ ($k_{out}$) corresponds to the number of reactions in which a substrate participates as a product (educt). The characteristics of the three organisms shown in a–c and the exponents $\gamma_{in}$ and $\gamma_{out}$ for all organisms are given in Table 1 of the Supplementary Information. d, The connectivity distribution averaged over all 43 organisms.

A general feature of many complex networks is their small-world character26, meaning that any two nodes in the system can be connected by relatively short paths along existing links. In metabolic networks these paths correspond to the biochemical pathway connecting two substrates (Fig. 3a). The degree of interconnectivity of a metabolic network can be characterized by the network diameter, defined as the shortest biochemical pathway averaged over all pairs of substrates. For all non-biological networks examined, the average connectivity of a node is fixed, which implies that the diameter of a network increases logarithmically with the addition of new nodes2, 26, 27. For metabolic networks this implies that a more complex bacterium with more enzymes and substrates, such as E. coli, would have a larger diameter than a simple bacterium, such as Mycoplasma genitalium. We find, however, that the diameter of the metabolic network is the same for all 43 organisms, irrespective of the number of substrates found in the given species (Fig. 3b). This is unexpected, and is possible only if with increasing organism complexity individual substrates are increasingly connected to maintain a relatively constant metabolic network diameter. We find that the average number of reactions in which a certain substrate participates increases with the number of substrates found within a given organism (Fig. 3c, d).

Figure 3: Properties of metabolic networks.

a, The histogram of the biochemical pathway lengths, l, in E. coli. b, The average path length (diameter) for each of the 43 organisms. Error bars represent standard deviation $\sigma \sim \langle l^2 \rangle - \langle l \rangle^2$ as determined from $\Pi(l)$ (shown in a for E. coli). c, d, Average number of incoming links (c) or outgoing links (d) per node for each organism. e, The effect of substrate removal on the metabolic network diameter of E. coli. In the top curve (red) the most connected substrates are removed first. In the bottom curve (green) nodes are removed randomly. M = 60 corresponds to ~8% of the total number of substrates found in E. coli. f, Standard deviation of the substrate ranking ($\sigma_r$) as a function of the average ranking $\langle r \rangle_o$ for substrates present in all 43 organisms investigated. The horizontal axis in b–d denotes the number of nodes in each organism. b–d, Archaea (magenta), bacteria (green) and eukaryotes (blue) are shown.

An important consequence of the power-law connectivity distribution is that a few hubs dominate the overall connectivity of the network (Fig. 1c), and upon the sequential removal of the most connected nodes the diameter of the network rises sharply, the network eventually disintegrating into isolated clusters that are no longer functional. But scale-free networks also demonstrate unexpected robustness against random errors5. To investigate whether metabolic networks display a similar error tolerance we performed computer simulations on the metabolic network of E. coli. Upon removal of the most connected substrates the diameter increases rapidly, illustrating the special role of these metabolites in maintaining a constant metabolic network diameter (Fig. 3e). However, when M randomly chosen substrates are removed (mimicking the consequence of random mutations of catalysing enzymes), the average distance between the remaining nodes is not affected, indicating a striking insensitivity to random errors. Indeed, in silico and in vivo mutagenesis studies indicate remarkable fault tolerance upon removal of a substantial number of metabolic enzymes from the E. coli metabolic network28. Data similar to those shown in Fig. 3e have been obtained for all organisms investigated, without detectable correlations with their evolutionary position.
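The removal experiment can be sketched on a toy network. Everything below is an illustrative assumption rather than the simulation code used for Fig. 3e: the helper names, the eleven-node network (a ring of ten substrates plus one hub), the seed and the number of random trials are all hypothetical. The sketch compares the average path length after removing the most connected node with that after removing a randomly chosen node.

```python
import random
from collections import deque


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs that are connected."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                       # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("inf")


def remove_nodes(adj, removed):
    """Copy of the network without the given nodes and their links."""
    removed = set(removed)
    return {u: {v for v in nbrs if v not in removed}
            for u, nbrs in adj.items() if u not in removed}


def most_connected(adj, m):
    """The m nodes with the largest total (incoming + outgoing) degree."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    for nbrs in adj.values():
        for v in nbrs:
            degree[v] += 1
    return sorted(degree, key=degree.get, reverse=True)[:m]


# Toy network: substrates s0..s9 on a ring, plus a hub linked to all of them.
ring = [f"s{i}" for i in range(10)]
adj = {s: {ring[(i - 1) % 10], ring[(i + 1) % 10], "hub"}
       for i, s in enumerate(ring)}
adj["hub"] = set(ring)

baseline = average_path_length(adj)
attacked = average_path_length(remove_nodes(adj, most_connected(adj, 1)))

rng = random.Random(0)
trials = [average_path_length(remove_nodes(adj, rng.sample(sorted(adj), 1)))
          for _ in range(200)]
random_mean = sum(trials) / len(trials)
print("intact:", baseline, "hub removed:", attacked, "random removal:", random_mean)
```

In this toy case, removing the hub forces all traffic onto the ring and lengthens paths, whereas removing a typical node leaves the hub shortcuts in place.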

As the large-scale architecture of the metabolic network rests on the most highly connected substrates, we need to investigate whether the same substrates act as hubs in all organisms, or whether there are organism-specific differences in the identity of the most connected substrates. When we rank all the substrates in a given organism on the basis of the number of links they have (Table 1; see Supplementary Information), we find that the ranking of the most connected substrates is practically identical for all 43 organisms. Also, only around 4% of all substrates found across the 43 organisms are present in every species. These substrates represent the most highly connected substrates found in any individual organism, indicating the generic utilization of the same substrates by each species. In contrast, species-specific differences among organisms emerge for less connected substrates. To quantify this observation, we examined the standard deviation ($\sigma_r$) of the rank for substrates that are present in all 43 organisms. As shown in Fig. 3f, $\sigma_r$ increases with the average rank order $\langle r \rangle$, implying that the most connected substrates have a relatively fixed position in the rank order, but the ranking of less connected substrates is increasingly species-specific. Thus, the large-scale structure of the metabolic network is identical for all 43 species, being dominated by the same highly connected substrates, while less connected substrates preferentially serve as the educts or products of species-specific enzymatic activities.

The contemporary topology of a metabolic network reflects a long evolutionary process moulded in general for a robust response towards internal defects and environmental fluctuations and in particular to the ecological niche occupied by a specific organism. As a result, we would expect that these networks are far from random, and our data show that the large-scale structural organization of metabolic networks is indeed very similar to that of robust and error-tolerant networks2, 5. The uniform network topology observed in all 43 organisms indicates that, irrespective of their individual building blocks or species-specific reaction pathways, the large-scale structure of metabolic networks may be identical in all living organisms, in which the same highly connected substrates may provide the connections between modules responsible for distinct metabolic functions1.

A unique feature of metabolic networks, as opposed to non-biological scale-free networks, is the apparent conservation of the network diameter in all living organisms. Within the special characteristics of living systems this attribute may represent an additional survival and growth advantage, as a larger diameter would attenuate the organism's ability to respond efficiently to external changes or internal errors. For example, if the concentration of a substrate were to suddenly diminish owing to a mutation in its main catalysing enzyme, offsetting the changes would involve the activation of longer alternative biochemical pathways, and consequently the synthesis of more new enzymes, than within a metabolic network with a smaller diameter.

How generic are these principles for other cellular networks (for example, apoptosis or cell cycle)? Although the current mathematical tools do not allow unambiguous statistical analysis of the topology of other networks owing to their relatively small size, our preliminary analysis indicates that connectivity distribution of non-metabolic pathways may also follow a power-law distribution, indicating that cellular networks as a whole are scale-free networks. Therefore, the evolutionary selection of a robust and error-tolerant architecture may characterize all cellular networks, for which scale-free topology with a conserved network diameter appears to provide an optimal structural organization.

Methods

Database preparation

For our analyses of core cellular metabolisms we used the 'Intermediate metabolism and bioenergetics' portions of the WIT database19 (http://igweb.integratedgenomics.com/IGwit/), which predicts the existence of a metabolic pathway in an organism on the basis of its annotated genome (on the presence of the presumed open reading frame of an enzyme that catalyses a given metabolic reaction), in combination with firmly established data from the biochemical literature. As of December 1999, this database provides descriptions for 6 archaea, 32 bacteria and 5 eukaryotes. The downloaded data were manually rechecked, removing synonyms and substrates without defined chemical identity.

Construction of metabolic network matrices

Biochemical reactions described within a WIT database are composed of substrates and enzymes connected by directed links. For each reaction, educts and products were considered as nodes connected to the temporary educt–educt complexes and associated enzymes. Bidirectional reactions were considered separately. For a given organism with N substrates, E enzymes and R intermediate complexes the full stoichiometric interactions were compiled into an $(N + E + R) \times (N + E + R)$ matrix, generated separately for each of the 43 organisms.
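A minimal sketch of such a matrix is given below. The input format (a list of enzyme, educts, products tuples), the label prefixes and the function name are hypothetical, and the direction assigned to the enzyme link (enzyme to complex) is an assumption, since the text only says that complexes are connected to their associated enzymes. A bidirectional reaction would be entered as two separate tuples.

```python
def build_network_matrix(reactions):
    """Build a 0/1 matrix over substrates, enzymes and reaction complexes.

    reactions: list of (enzyme, educts, products) tuples (hypothetical format).
    Returns (labels, matrix); matrix[i][j] == 1 means a directed link i -> j.
    """
    substrates, enzymes = [], []
    for enzyme, educts, products in reactions:
        for s in list(educts) + list(products):
            if s not in substrates:
                substrates.append(s)
        if enzyme not in enzymes:
            enzymes.append(enzyme)

    # Prefixes keep the three kinds of node apart even if names coincide.
    labels = ([f"S:{s}" for s in substrates]
              + [f"E:{e}" for e in enzymes]
              + [f"R:{i}" for i in range(len(reactions))])
    index = {name: i for i, name in enumerate(labels)}
    size = len(labels)                       # N + E + R
    matrix = [[0] * size for _ in range(size)]

    for i, (enzyme, educts, products) in enumerate(reactions):
        c = index[f"R:{i}"]
        for s in educts:
            matrix[index[f"S:{s}"]][c] = 1   # educt -> complex
        for s in products:
            matrix[c][index[f"S:{s}"]] = 1   # complex -> product
        matrix[index[f"E:{enzyme}"]][c] = 1  # enzyme -> complex (assumed direction)
    return labels, matrix


# Two illustrative reactions, identified by their EC numbers.
reactions = [
    ("2.7.1.1", ["glucose", "ATP"], ["glucose-6-phosphate", "ADP"]),
    ("5.3.1.9", ["glucose-6-phosphate"], ["fructose-6-phosphate"]),
]
labels, matrix = build_network_matrix(reactions)
print(len(labels), "nodes;", sum(map(sum, matrix)), "directed links")
```

Here N = 5, E = 2 and R = 2, so the matrix is 9 × 9. A dense list of lists is used only for clarity; for networks with hundreds of substrates a sparse representation would be the more natural choice.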

Connectivity distribution P(k)

Substrates generated by a biochemical reaction are products, and are characterized by incoming links pointing to them. For each substrate we have determined $k_{in}$, and prepared a histogram for each organism, showing how many substrates have exactly $k_{in} = 0, 1, \ldots$. Dividing each point of the histogram by the total number of substrates in the organism provided $P(k_{in})$, or the probability that a substrate has $k_{in}$ incoming links. Substrates that participate as educts in a reaction have outgoing links. We have performed the analysis described above for $k_{in}$, determining the number of outgoing links ($k_{out}$) for each substrate. To reduce noise, logarithmic binning was applied.
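The counting and normalization steps above can be sketched as follows. The reaction format, the function names and the toy reactions are hypothetical. The binning scheme (bins that double in width, with P(k) averaged over the integers in each bin and plotted at the geometric centre of the bin) is one common choice and is an assumption, as the text does not specify the scheme used.

```python
from collections import Counter


def in_out_degrees(reactions):
    """k_in and k_out per substrate from a list of (educts, products) pairs."""
    k_in, k_out = Counter(), Counter()
    for educts, products in reactions:
        for s in educts:
            k_out[s] += 1
            k_in[s] += 0      # ensure the substrate is present in both counters
        for s in products:
            k_in[s] += 1
            k_out[s] += 0
    return dict(k_in), dict(k_out)


def degree_distribution(degrees):
    """P(k): fraction of substrates that have exactly k links."""
    n = len(degrees)
    return {k: c / n for k, c in sorted(Counter(degrees).items())}


def log_binned(pk, base=2):
    """Average P(k) over integer bins [base**i, base**(i+1)); k = 0 is skipped."""
    binned = []
    lo, k_max = 1, max(pk)
    while lo <= k_max:
        hi = lo * base
        mass = sum(pk.get(k, 0.0) for k in range(lo, hi))
        if mass > 0:
            # (geometric centre of the bin, mean probability per integer k)
            binned.append(((lo * hi) ** 0.5, mass / (hi - lo)))
        lo = hi
    return binned


# Toy reactions: A + B -> C, C -> D, A -> D.
reactions = [(["A", "B"], ["C"]), (["C"], ["D"]), (["A"], ["D"])]
k_in, k_out = in_out_degrees(reactions)
p_in = degree_distribution(list(k_in.values()))
print("k_in:", k_in)
print("P(k_in):", p_in)
print("log-binned:", log_binned(p_in))
```

The same two functions applied to `k_out` give the outgoing distribution.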

Biochemical pathway lengths [$\Pi(l)$]

For all pairs of substrates, the shortest biochemical pathway, $\Pi(l)$ (that is, the smallest number of reactions by which one can reach substrate B from substrate A) was determined using a burning algorithm. From $\Pi(l)$ we determined the diameter, $D = \sum_l l \cdot \Pi(l) / \sum_l \Pi(l)$, which represents the average path length between any two substrates.
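The path-length histogram and the diameter can be sketched with a breadth-first search, assuming that the burning algorithm mentioned above behaves like one (a front that spreads outward one reaction step at a time). The function names and the four-substrate chain are hypothetical; each link here stands for one reaction step between substrates.

```python
from collections import Counter, deque


def path_length_histogram(adj):
    """Pi(l): number of ordered substrate pairs whose shortest path has length l."""
    hist = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:            # first visit gives the shortest path
                    dist[v] = dist[u] + 1
                    hist[dist[v]] += 1
                    queue.append(v)
    return hist


def diameter(hist):
    """D = sum_l l * Pi(l) / sum_l Pi(l), the average shortest-path length."""
    return sum(l * count for l, count in hist.items()) / sum(hist.values())


# Directed toy chain A -> B -> C -> D.
adj = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
hist = path_length_histogram(adj)
D = diameter(hist)
print(dict(hist), D)
```

Pairs with no connecting pathway (for example D to A in the chain) do not contribute to the histogram, so D averages over reachable pairs only.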

Substrate ranking $\langle r \rangle_o$, $\sigma(r)$

Substrates present in all 43 organisms (a total of 51 substrates) were ranked on the basis of the number of links each had in each organism, having considered incoming and outgoing links separately (r = 1 was assigned for the substrate with the largest number of connections, r = 2 for the second most connected one, and so on). This gave a well defined r value in each organism for each substrate. The average rank $\langle r \rangle_o$ for each substrate was determined by averaging r over the 43 organisms. We also determined the standard deviation, $\sigma(r) = \langle r^2 \rangle_o - \langle r \rangle_o^2$, for all 51 substrates present in all organisms.
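The ranking statistics can be sketched as follows. The function names, the three toy organisms and their link counts are invented for illustration, ties are broken alphabetically (an assumption), and substrates are ranked among the shared set only, which is one reading of the description above. The second returned value is the expression $\langle r^2 \rangle_o - \langle r \rangle_o^2$ exactly as written in the text.

```python
def rank_substrates(link_counts):
    """Map substrate -> rank r within one organism (r = 1 has the most links)."""
    ordered = sorted(link_counts, key=lambda s: (-link_counts[s], s))
    return {s: r for r, s in enumerate(ordered, start=1)}


def rank_statistics(organisms):
    """organisms: list of {substrate: number of links}, one dict per organism.

    Returns {substrate: (<r>_o, <r^2>_o - <r>_o^2)} for substrates in all organisms.
    """
    shared = set.intersection(*(set(o) for o in organisms))
    ranks = [rank_substrates({s: o[s] for s in shared}) for o in organisms]
    stats = {}
    for s in sorted(shared):
        rs = [r[s] for r in ranks]
        mean = sum(rs) / len(rs)
        mean_sq = sum(x * x for x in rs) / len(rs)
        stats[s] = (mean, mean_sq - mean ** 2)
    return stats


# Invented link counts for three toy organisms.
organisms = [
    {"ATP": 50, "H2O": 40, "pyruvate": 10, "x": 3},
    {"ATP": 45, "H2O": 60, "pyruvate": 12},
    {"ATP": 70, "H2O": 30, "pyruvate": 5, "y": 1},
]
stats = rank_statistics(organisms)
for substrate, (mean_rank, spread) in stats.items():
    print(substrate, mean_rank, spread)
```

A substrate whose rank never changes between organisms has zero spread, while one whose rank varies has a positive spread, which is the quantity plotted against the average rank in Fig. 3f.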

Analysis of the effect of database errors

Of the 43 organisms whose metabolic network we have analysed, the genomes of 25 have been completely sequenced (5 archaea, 18 bacteria and 2 eukaryotes), whereas the remaining 18 are only partially sequenced. Therefore two main sources of possible errors in the database could affect our analysis: the erroneous annotation of enzymes and, consequently, biochemical reactions (the likely source of error for the organisms with completely sequenced genomes); and reactions and pathways missing from the database (for organisms with incompletely sequenced genomes, both sources of error are possible). We investigated the effect of database errors on the validity of our findings. The data, presented in Supplementary Information, indicate that our results are robust to these errors.

References

  1. Hartwell, L. H., Hopfield, J. J., Leibler, S. & Murray, A. W. From molecular to modular cell biology. Nature 402, C47–52 (1999).
  2. Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512 (1999).
  3. West, G. B., Brown, J. H. & Enquist, B. J. The fourth dimension of life: fractal geometry and allometric scaling of organisms. Science 284, 1677–1679 (1999).
  4. Banavar, J. R., Maritan, A. & Rinaldo, A. Size and form in efficient transportation networks. Nature 399, 130–132 (1999).
  5. Albert, R., Jeong, H. & Barabási, A.-L. Error and attack tolerance of complex networks. Nature 406, 378–382 (2000).
  6. Ingber, D. E. Cellular tensegrity: defining new rules of biological design that govern the cytoskeleton. J. Cell Sci. 104, 613–627 (1993).
  7. Bray, D. Protein molecules as computational elements in living cells. Nature 376, 307–312 (1995).
  8. McAdams, H. H. & Arkin, A. It's a noisy business! Genetic regulation at the nanomolar scale. Trends Genet. 15, 65–69 (1999).
  9. Gardner, T. S., Cantor, C. R. & Collins, J. J. Construction of a genetic toggle switch in Escherichia coli. Nature 403, 339–342 (2000).
  10. Elowitz, M. B. & Leibler, S. A synthetic oscillatory network of transcriptional regulators. Nature 403, 335–338 (2000).
  11. Hasty, J., Pradines, J., Dolnik, M. & Collins, J. J. Noise-based switches and amplifiers for gene expression. Proc. Natl Acad. Sci. USA 97, 2075–2080 (2000).
  12. Becskei, A. & Serrano, L. Engineering stability in gene networks by autoregulation. Nature 405, 590–593 (2000).
  13. Kirschner, M., Gerhart, J. & Mitchison, T. Molecular 'vitalism'. Cell 100, 79–88 (2000).
  14. Barkai, N. & Leibler, S. Robustness in simple biochemical networks. Nature 387, 913–917 (1997).
  15. Yi, T. M., Huang, Y., Simon, M. I. & Doyle, J. Robust perfect adaptation in bacterial chemotaxis through integral feedback control. Proc. Natl Acad. Sci. USA 97, 4649–4653 (2000).
  16. Bhalla, U. S. & Iyengar, R. Emergent properties of networks of biological signaling pathways. Science 283, 381–387 (1999).
  17. Karp, P. D., Krummenacker, M., Paley, S. & Wagg, J. Integrated pathway–genome databases and their role in drug discovery. Trends Biotechnol. 17, 275–281 (1999).
  18. Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
  19. Overbeek, R. et al. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res. 28, 123–125 (2000).
  20. Erdös, P. & Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5, 17–61 (1960).
  21. Bollobás, B. Random Graphs (Academic, London, 1985).
  22. Albert, R., Jeong, H. & Barabási, A.-L. Diameter of the World-Wide Web. Nature 400, 130–131 (1999).
  23. Faloutsos, M., Faloutsos, P. & Faloutsos, C. On power-law relationships of the internet topology. Comp. Comm. Rev. 29, 251 (1999).
  24. Amaral, L. A. N., Scala, A., Barthelemy, M. & Stanley, H. E. Classes of behavior of small-world networks (cited 31 January 2000) ⟨http://xxx.lanl.gov/abs/cond-mat/0001458⟩ (2000).
  25. Dorogovtsev, S. N. & Mendes, J. F. F. Evolution of reference networks with aging (cited 28 January 2000) ⟨http://xxx.lanl.gov/abs/cond-mat/0001419⟩ (2000).
  26. Watts, D. J. & Strogatz, S. H. Collective dynamics of 'small-world' networks. Nature 393, 440–442 (1998).
  27. Barthelemy, M. & Amaral, L. A. N. Small-world networks: Evidence for a crossover picture. Phys. Rev. Lett. 82, 3180–3183 (1999).
  28. Edwards, J. S. & Palsson, B. O. The Escherichia coli MG1655 in silico metabolic genotype: its definition, characteristics, and capabilities. Proc. Natl Acad. Sci. USA 97, 5528–5533 (2000).

Supplementary Information

Supplementary information accompanies this paper.

Acknowledgements

We thank all members of the WIT project for making this invaluable database publicly available. We also thank C. Waltenbaugh and H. S. Seifert for comments on the manuscript. Research at the University of Notre Dame was supported by the National Science Foundation, and at Northwestern University by grants from the National Cancer Institute.