The difficulty in annotating the vast amounts of biological information poses one of the greatest current challenges in biological research. The number of genomic, proteomic, and metabolomic datasets has increased dramatically over the last two decades, far outstripping the pace of curation efforts. Here, we tackle the challenge of curating metabolic network reconstructions. We predict organismal metabolic networks using sequence homology and a global metabolic network constructed from all available organismal networks. Although sequence homology has been a standard tool for annotating metabolic networks, it has been faulted for its lack of predictive power. We show, however, that when homology is used with a global metabolic network one is able to predict organismal metabolic networks that have enhanced network connectivity. Additionally, we compare the annotation behavior of current database curation efforts with our predictions and find that curation efforts are biased towards adding reactions to (rather than removing them from) organismal networks.
Advances in high-throughput experimental biology have greatly furthered our knowledge and made it possible to interrogate cellular processes in a systematic manner. However, this data deluge is only as useful as our ability to interpret it1. The current uncertainty of data reliability hampers our efforts to understand which network topologies fulfill physical, chemical, and biological constraints2,3,4,5,6. Understanding the possible network topologies is a critical condition in on-going attempts to understand the function and evolution of cellular networks.
Improving annotations of organismal metabolic networks has been an area of intense interest, especially owing to their usefulness in assessing organismal fitness in silico. There have been numerous methods proposed that attempt to solve the incompleteness of metabolic network reconstructions, ranging from gap-filling the organismal network based on what other known organismal networks possess7,8,9 to methods that rely on multiple sources of annotation information to provide an assessment of enzyme presence10.
However, even with these methods we are still far from achieving consensus on the correct metabolic network for a given organism, even one as well-studied as Escherichia coli (Fig. 1a). Indeed, there are dramatic differences in both the size and degree of overlap of the metabolic network for E. coli recorded in (i) different databases or (ii) the same database at different time snapshots (Fig. 1b). This problem is magnified in organisms that have their genomes sequenced for the first time (Fig. 1c). This last example is a perfect demonstration of both our lack of knowledge and the problem of developing computational analyses that perfectly recapitulate the known network at only the present instance.
Because data reliability is such a pressing problem for experimental and computational researchers alike, there has been a push in research to consider the analysis of metabolic networks from novel perspectives. A promising new framework is to consider metabolism in the context of a global network. This framework has been successfully applied in assessing the emergence of biological carbon fixation in phylometabolism11 and, more generally, to understand the regulation of metabolism12. A global network has also been recently used in conjunction with probabilistic methods to predict metabolic networks on a small scale with experimental verification13. While the motivation for the global network approach has been mostly pragmatic, it is reminiscent of the “Res Potentia” framework proposed by Whitehead14. In that framework, that which does exist (termed the Res Extenta or, in the case of metabolism, the set of organismal metabolic networks) is a specific realization of a “universal” framework (the Res Potentia, or the global network in our analysis) that defines what is possible.
We contend that using a global network approach to the study of metabolism is comparable to what epidemiologists do when studying worldwide propagation of infection. In building the worldwide air transportation network15,16,17 all carrier flights are aggregated into a single network and analyzed. This is an important feature of its construction because it aids in the identification of and distinction between international and regional hubs, establishing their relative importance in the network. As an example, if we were to consider US Airways (a North American carrier) alone we would not have fully grasped the importance of London, since this carrier will have more flights to Los Angeles or even San Diego than to London. Even if we were to pick a group of carriers based on similarity (such as operating primarily in North America) and assess the ensemble of their individual networks, it would be difficult to assess the relative importance of the individual hub airports. Reframing the analysis of metabolic networks to a global network is an appropriate method to both assess our current network annotations and to gain an understanding of what evolutionary dynamics shape organismal metabolism into the structures that we currently know.
In the following we predict entire organismal metabolic networks using only sequence homology and the global network as a reference. We show that using an appropriate reference set, the global network, allows for more insight to be obtained from sequence homology. We compare our predicted networks to known metabolic database reconstructions and also evaluate the connectedness of the resultant graphs for predicted networks to assess the performance of our methodology. We also use our predictions to help understand how curation behavior in a metabolic database affects known organismal metabolic networks.
Organismal network prediction
We constructed a global network by performing the graph union of all organismal networks (Fig. 2b). Our analyses focused on the giant component of the global network because it contains the most reliable data, as its metabolites are more conserved and have more pathway annotations. To predict individual organismal metabolic networks we assumed that a given reaction can be catalyzed within an organism if, and only if, the organism synthesizes a protein that is sufficiently similar to the known enzymes for the reaction. We evaluated each reaction in the global network for its possibility of existence in any individual organism.
We aligned the enzyme sequences associated with each reaction to each organism's protein database (Fig. 3a) and determined the expectation value (E-value) of the alignment using blastp18,19. The E-value is the number of matches with an equal or better score that would be expected to occur by chance in a database of the same size; an E-value close to 0 indicates a highly significant match between the queried enzyme sequence and a protein in the database, while an E-value > 1.0 is interpreted as a sequence match that is not indicative of biological homology.
For clarity, we define several additional terms. A reaction predicted to be catalyzed in a certain organism by a certain database curation team is “annotated” in that database. Otherwise, the reaction is “unannotated” — it exists in the global network but not in the organismal one. To make our predictions of annotation status we separated the alignments associated with a reaction r into two categories, hits and poor matches, based on the magnitude of the E-values obtained. If an alignment has E-value ≤ 0.01 then we classify the alignment as a hit; otherwise, if 0.01 < E-value ≤ 10, we classify it as a poor match (Fig. 3b).
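The hit and poor-match categories defined above can be illustrated with a minimal Python sketch. The function and constant names are hypothetical, and returning `None` for E-values above 10 is an assumption, since the text does not say how such alignments are handled.

```python
# Thresholds taken from the text; the names are illustrative only.
HIT_MAX_EVALUE = 0.01    # E-value <= 0.01 is classified as a hit
POOR_MAX_EVALUE = 10.0   # 0.01 < E-value <= 10 is classified as a poor match


def classify_alignment(evalue):
    """Return 'hit', 'poor_match', or None for a single alignment E-value."""
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue <= HIT_MAX_EVALUE:
        return "hit"
    if evalue <= POOR_MAX_EVALUE:
        return "poor_match"
    # Assumption: alignments with E-value > 10 fall outside both categories.
    return None


example_evalues = [0.0, 3e-40, 0.01, 0.5, 10.0, 25.0]
labels = [classify_alignment(e) for e in example_evalues]
```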
We use the fraction of alignments that are classified as hits (the hit fraction) as a predictor of whether reaction r can be catalyzed within organism i. When the distribution of the hit fraction is examined, we find a peak at greater values for KEGG-annotated reactions when all of the reactions are considered. Furthermore, we see that this behavior holds no matter which domain of organisms is considered (Fig. 3c), indicating that this is a robust behavior preserved across all organismal networks. The imperfect separation between annotated and unannotated distributions is also expected given the number of known annotation errors in the KEGG networks (Fig. 1).
To determine whether a given hit fraction is large enough for the reaction to be considered one that can be catalyzed, we must set a threshold value f×; if the hit fraction of reaction r in organism i reaches f×, then we predict reaction r to exist in organism i. To identify an appropriate value for f×, we calculated the receiver operating characteristic (ROC) curve, accuracy, and false discovery rate statistics20. The ROC curve analysis demonstrates that the hit fraction can discriminate between annotated and unannotated reactions (Fig. 4a). We thus use the accuracy and false discovery rates to determine a good threshold value for the hit fraction and set f× = 0.14 (Fig. 4b). In summary, we predict organismal metabolic networks by checking whether the hit fraction of reaction r for organism i reaches f× (Fig. 4c). This approach allows us to predict entire organismal networks using only the global metabolic network and the associated organismal BLAST alignments.
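As a rough illustration of the prediction rule, the following Python sketch computes a hit fraction per reaction, applies the threshold of 0.14, and scores the result against a set of annotated reactions. Several details are assumptions rather than statements from the text: the denominator is taken to be hits plus poor matches (alignments with E-value ≤ 10), the comparison with the threshold is taken to be "greater than or equal", and all names and toy data are hypothetical.

```python
def hit_fraction(evalues, hit_max=0.01, poor_max=10.0):
    """Fraction of considered alignments (E-value <= poor_max) that are hits."""
    considered = [e for e in evalues if e <= poor_max]
    if not considered:
        return 0.0
    hits = sum(1 for e in considered if e <= hit_max)
    return hits / len(considered)


def predict_network(evalues_by_reaction, threshold=0.14):
    """Predict the reactions whose hit fraction reaches the threshold.

    Assumption: '>=' is used; the text does not state whether the
    comparison is strict.
    """
    return {
        reaction
        for reaction, evalues in evalues_by_reaction.items()
        if hit_fraction(evalues) >= threshold
    }


def accuracy_and_fdr(predicted, annotated, all_reactions):
    """Accuracy and false discovery rate of a prediction against annotations."""
    true_pos = len(predicted & annotated)
    false_pos = len(predicted - annotated)
    true_neg = len(all_reactions - predicted - annotated)
    accuracy = (true_pos + true_neg) / len(all_reactions)
    fdr = false_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    return accuracy, fdr


# Toy data: reaction identifiers and E-values are made up for illustration.
alignments = {
    "R1": [1e-50, 1e-10, 0.5],          # hit fraction 2/3
    "R2": [2.0, 5.0, 8.0],              # hit fraction 0
    "R3": [1e-5] + [3.0] * 7,           # hit fraction 1/8 = 0.125
}
predicted = predict_network(alignments)
acc, fdr = accuracy_and_fdr(predicted, {"R1", "R3"}, set(alignments))
```

Sweeping `threshold` over a grid and recording the accuracy and false discovery rate at each value gives curves of the kind used to choose a threshold.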
Comparison with consensus networks
In order to calculate the accuracy of our predictions it is necessary to compare them to a “ground truth”. However, given the significant variations in size and content that exist across different databases and in time, such a ‘true’ answer does not exist (Fig. 1). In an effort to estimate the true accuracy of our predictions we consider the metabolic network reconstructions of E. coli, a well-studied organism, from three different sources, with two of the sources having network data at two separate time points.
We constructed ten separate consensus networks as detailed in Methods leaving out three reconstructions at a time, and evaluated the accuracy of the networks that were left out of the consensus, with the results shown in Table 1. While no network is 100% accurate with respect to the consensus network, we find that our predictions range in accuracy between 70 and 71% while the database networks range in accuracy between 66 and 93%.
In an effort to understand what types of reactions our method incorrectly predicts we characterize the reactions that are identified as false positives in comparison to the consensus network. First, we examine the pathway annotations associated with the metabolites in this set of false positive edges and reactions (Fig. 5a). We find that the majority of the metabolites are either unclassified or classified in pathways that are not central to metabolism (outside of the carbohydrate, amino acid, nucleotide, and lipid metabolism pathways). However, even being associated with a central pathway does not mean that all of the metabolites are specifically involved in central or essential processes and these characterizations could be due only to their presence as a byproduct in a reaction.
Second, we examined the conservation of the false positive reactions (Fig. 5b). Conservation is calculated as the number of organismal networks in which the edge appears divided by the total number of organismal networks. We calculate the conservation for all edges in all organismal networks in KEGG and define three bins in the distribution (lower, middle, and upper thirds of the distribution). When we sort the false positive edges into these bins we find that the overwhelming majority are in the lowest third of conservation values. Taken together, the lower and middle thirds account for more than 90% of all edges in the false positive set.
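The conservation calculation and binning can be sketched in Python as follows. This is a toy illustration under stated assumptions: edges are represented as unordered metabolite pairs, the three bins are taken to be equal-width thirds of the conservation range [0, 1] (the text could also mean rank-based tertiles), and all names and data are hypothetical.

```python
from collections import Counter


def edge(a, b):
    """Undirected edge between two metabolites."""
    return frozenset((a, b))


def edge_conservation(networks):
    """Fraction of organismal networks in which each edge appears."""
    n_networks = len(networks)
    counts = Counter(e for edges in networks.values() for e in edges)
    return {e: count / n_networks for e, count in counts.items()}


def conservation_bin(value):
    """Assign a conservation value to an equal-width third of [0, 1]."""
    if value <= 1 / 3:
        return "lower"
    if value <= 2 / 3:
        return "middle"
    return "upper"


# Toy organismal networks (hypothetical metabolites A-E).
networks = {
    "org1": {edge("A", "B"), edge("B", "C"), edge("C", "D")},
    "org2": {edge("A", "B"), edge("B", "C")},
    "org3": {edge("A", "B"), edge("D", "E")},
}
conservation = edge_conservation(networks)

# Hypothetical false positive edges, binned by conservation.
false_positives = [edge("C", "D"), edge("D", "E"), edge("B", "C")]
bins = Counter(conservation_bin(conservation[e]) for e in false_positives)
```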
The abundance of low-conservation reactions in the “false positive” set of our method could plausibly be interpreted as suggesting that these reactions may not actually be false positives. It is likely that a majority of the edges in this set do actually exist but have not been incorporated into a majority of the databases because the reactions themselves are poorly characterized and understood.
Given the challenge presented by traditional validation due to the lack of a ground truth and our aim to predict an organism's true metabolic network instead of simply recapitulating the annotations in KEGG, we also use a validation scheme focused on the expected properties of metabolic networks. Specifically, we surmise that organismal metabolic networks must have a bias toward connectedness. Indeed, flux balance analysis (FBA) metabolic reconstructions assume that metabolic networks act as “transportation” networks that carry mass from external nutrients to biomass7,8. The possession of fewer network components implies a greater ability of the organism to exploit a broad range of incoming nutrients for disparate cellular roles, and thus offers a fitness advantage over topologies where each network component must be individually fed.
We find that both our predictions and the changes made in KEGG in the period 2009-2011 close more gaps between network components than would be expected if new reactions were added at random (Fig. 6a). When we consider gaps of size one, our predictions fill almost twice as many gaps as the KEGG changes. Remarkably, we also find that our predictions introduce fewer new network components than random removals. The changes in KEGG actually cause the creation of more additional network components than would be expected if reactions were randomly removed from organisms (Fig. 6b). The fact that so many gaps are closed by both our method and by KEGG curation in the period 2009-2011 lends credence to our original hypothesis that metabolic networks should be evolutionarily biased towards minimizing the number of network components and supports the validity of our methodology.
It is important to note that our method takes in no information from the global network concerning reactions other than the possibility of their existence. Therefore, our method is no more biased towards closing gaps or preserving network structure than the actual changes to the database could or should be and yet it still accomplishes this goal of increased connectivity.
Biases in database curation
When we examine how our predictions compare to the corrections made to the KEGG database over time we find that there is a distinct bias towards adding reactions to, rather than removing them from, organismal networks (Fig. 6c). It would be simple to assume that our method under-predicts in comparison to the reference dataset; however, we do not observe this trend when we examine the set of well-studied organisms used in Fig. 1b and c. This suggests that the curation teams are more aggressive in adding reactions than removing them, despite the fact that errors of omission and errors of addition are equally detrimental. Large-scale comparison and tracking of database changes could influence curation teams' actions and help attenuate this problem.
There are several distinct advantages to reframing the study of metabolic networks and, more broadly, metabolism to the organismal usages of the global network. As demonstrated in this study we are able to extract substantially more predictive power from sequence homology when it is used in conjunction with the global network. While most studies have moved beyond homology due to a lack of predictive power to more complicated and time consuming methods (such as Bayesian or multiple information methods), we are able to predict metabolic networks that compare favorably to the known database data and exceed them in producing connected networks. We could also easily increase the efficacy of our method by including additional network information such as whether a reaction completes a gap or not, which would be trivial to calculate and consider.
The global network also enables community detection and other graphical analyses that are unchanging in the face of organismal usage, facilitating an understanding of the true importance of a metabolite. Comparing the differences in organismal usage of metabolites and reactions can then be used to more robustly characterize the evolutionary forces that have optimized an organismal network. Specifically, when studying an organismal network we cannot fully comprehend the importance of a given metabolite because we do not have access to all the manners in which that metabolite could potentially connect to other metabolites in the network. Thus, we cannot accurately determine, for example, the centrality of the metabolite within metabolism or ascertain its true importance from an evolutionary standpoint. In contrast, the global network makes apparent these possibilities because it includes all available organismal knowledge. An increased understanding of why an organism develops certain “solutions” for its metabolic needs will aid in predicting unique features of the organism's metabolite and reaction usage that can be specifically targeted by drugs or other therapeutics and metabolic engineering.
Additionally, the global network allows us to view the metabolite and reaction usage of organisms in a general framework, providing a means to identify metabolic “devices” (small groups of metabolites and reactions that have a functional purpose) and other features that become apparent only when considering intermediate scales within the network21,22,23. This gives greater insight into both metabolic evolution and ways to design synthetic metabolic “circuits” from these devices24,25.
We downloaded multiple instances of the Kyoto Encyclopedia of Genes and Genomes (KEGG) LIGAND database26,27,28; the first instance on June 24, 2009 and the last on February 22, 2011. We also downloaded enzyme protein sequences from KEGG on five occasions, all between July 2010 and February 2011. All possible, unique sequences for each enzyme were used, based on the associations to reactions from KEGG. We downloaded the Ma 2003 and Zeng 2011 bioreaction databases29,30 from http://www.tuharburg.de, and the iAF1260 Escherichia coli reconstruction31 from the BiGG database32.
We considered 998 organisms listed in the KEGG database to construct the global network. We constructed protein databases and predicted metabolic networks for 874 of these 998 organisms. We did not predict the networks for 125 organisms due to a lack of sequence availability or because the time necessary to run a complete analysis for larger organisms was prohibitively long.
The bacterial domain dominates in representation due to the breakdown of organisms in KEGG itself. However, the network is not influenced by this over-representation because each reaction is only counted once in the construction of the global network. We include the domain and clade breakdown of the organisms that we tested and predicted metabolic networks for in Table 2.
Organismal and global network construction
We constructed individual metabolic networks for 998 organisms using a 2009 snapshot of the KEGG database. In these networks, each node represents a metabolite, and two metabolites i and j are connected by an edge if there is a chemical reaction in which i is a substrate and j is its product, or vice versa. We established these relationships using only the main reaction pair designations on KEGG and, as in prior studies33,34, excluded transfer ions, co-factors, and energy carrier molecules to maintain focus on the biomass transfer through the networks (Fig. 2a).
We constructed a global network by performing the graph union of all organismal networks (Fig. 2b). The 3,467 distinct reactions listed for the 998 organisms in the KEGG database yielded a global metabolic network comprising 6,656 metabolites and 3,328 unique edges. These metabolites are organized into a giant component comprising 2,023 metabolites and 2,729 edges, and 333 smaller components typically comprising only a few metabolites each. We focused our analyses on the giant component of the global network because it contains the most reliable data, as its metabolites are more conserved and have more pathway annotations.
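The graph union and the extraction of the giant component can be sketched in plain Python as below. This is a minimal illustration, not the pipeline used for the analysis; the edge representation (sorted metabolite pairs), the function names, and the toy networks are all assumptions.

```python
from collections import defaultdict, deque


def edge(a, b):
    """Undirected edge stored as a sorted pair of metabolite names."""
    return tuple(sorted((a, b)))


def graph_union(networks):
    """Union of edge sets: each edge is counted once, however many organisms use it."""
    union = set()
    for edges in networks:
        union |= edges
    return union


def connected_components(edges):
    """Return the connected components as a list of node sets (breadth-first search)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Two toy organismal networks (hypothetical metabolites).
org1 = {edge("A", "B"), edge("B", "C")}
org2 = {edge("B", "C"), edge("C", "D"), edge("X", "Y")}

global_edges = graph_union([org1, org2])
giant = max(connected_components(global_edges), key=len)
giant_edges = {e for e in global_edges if e[0] in giant and e[1] in giant}
```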
Metabolic networks for E. coli based on other databases were constructed in the same manner as the organismal metabolic networks constructed using the KEGG database. For the Ma 2003 and Zeng 2011 datasets the main pairs designation was included in the original dataset and it is used instead of the KEGG main pairs designation, while we used the main pairs designation from KEGG for the iAF1260 reconstruction.
Organismal network prediction
We collected 5.94 × 10^6 known enzyme amino acid sequences from the KEGG database that are associated with the 3,467 reactions in the global network and prepared databases of all known proteins for 874 organisms from the nr database (downloaded February 23, 2011) in accordance with the BLAST user manual35 in order to test sequence homology. We used blastp18,19, version 2.2.24, to align the known enzyme amino acid sequences to the organismal protein databases. We obtained a total of 2.6 × 10^10 BLAST alignments that were subsequently used in our analysis.
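To make the downstream bookkeeping concrete, the sketch below groups alignment E-values by reaction. It assumes the alignments were written in BLAST+ tabular format (`-outfmt 6`), whose default columns place the query identifier first and the E-value eleventh; the text does not state which output format was used. The enzyme-to-reaction mapping, identifiers, and sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def evalues_by_reaction(tabular_lines, reactions_for_enzyme):
    """Group E-values by reaction from BLAST tabular (outfmt 6) rows.

    tabular_lines: iterable of tab-separated lines with the 12 default columns.
    reactions_for_enzyme: dict mapping a query enzyme id to its reaction ids.
    """
    grouped = defaultdict(list)
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, evalue = row[0], float(row[10])
        for reaction in reactions_for_enzyme.get(query_id, ()):
            grouped[reaction].append(evalue)
    return dict(grouped)


# Made-up alignment rows for one organismal protein database.
sample = io.StringIO(
    "enzA\tprot1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t400\n"
    "enzA\tprot7\t30.1\t90\t60\t3\t5\t94\t10\t99\t2.5\t28.1\n"
    "enzB\tprot3\t45.0\t150\t80\t2\t1\t150\t4\t153\t3e-20\t95.0\n"
)
mapping = {"enzA": ["R00001"], "enzB": ["R00001", "R00002"]}
grouped = evalues_by_reaction(sample, mapping)
```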
Consensus network construction
We created consensus networks using a majority rule, similar to other work36. A set of the networks was selected and every edge in all of the networks was evaluated. If the edge appeared in the majority of the networks in the set, it was added to the consensus network; otherwise it was not added. We then calculated the accuracy statistic for each network not used in the consensus network against the consensus metabolic network.
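A majority-rule consensus and the leave-out evaluation can be sketched as follows. This is an illustration under assumptions: "majority" is read as strictly more than half of the selected networks, accuracy is computed over the union of edges in all networks considered, and the names and toy networks are hypothetical.

```python
from collections import Counter
from itertools import combinations


def consensus_network(networks):
    """Edges present in strictly more than half of the given networks."""
    counts = Counter(e for edges in networks for e in edges)
    return {e for e, count in counts.items() if count > len(networks) / 2}


def accuracy(network, consensus, universe):
    """Fraction of edges in the universe on which network and consensus agree."""
    agree = sum(1 for e in universe if (e in network) == (e in consensus))
    return agree / len(universe)


def leave_out_scores(named_networks, n_left_out):
    """Build a consensus for every way of leaving out n networks and score those left out."""
    universe = set().union(*named_networks.values())
    results = []
    for left_out in combinations(sorted(named_networks), n_left_out):
        kept = [named_networks[k] for k in named_networks if k not in left_out]
        consensus = consensus_network(kept)
        scores = {
            name: accuracy(named_networks[name], consensus, universe)
            for name in left_out
        }
        results.append((left_out, scores))
    return results


# Five toy reconstructions over four hypothetical edges.
e1, e2, e3, e4 = ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")
toy = {
    "n1": {e1, e2, e3},
    "n2": {e1, e2},
    "n3": {e1, e3},
    "n4": {e1, e2, e4},
    "n5": {e1},
}
results = leave_out_scores(toy, 3)
```

With five reconstructions and three left out at a time, `combinations` yields ten consensus networks.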
To assess network connectivity we examined two quantities: the probability of a reaction addition closing a gap between two network components and the probability of a reaction removal creating an additional network component. We then compared the observed number of gaps with the random-chance expectation of completing a gap of a given size with the available number of additional reactions. For the random filling of gaps, we used the intersection of additional reactions between our predicted network and the KEGG 2011 network for an organism as the number of reactions that should be added. For the creation of additional network components we removed every edge individually in all organismal networks and determined whether an additional network component was created. We then averaged the number of additional components added across all edges tested.
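The two connectivity quantities can be illustrated with a small Python sketch. It assumes that a single added edge "closes a gap" when it joins two existing components, which corresponds to a gap of size one, and that removals are scored by whether the component count increases; the function names and toy network are hypothetical.

```python
def count_components(nodes, edges):
    """Number of connected components among the given nodes (depth-first search)."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return count


def closes_gap(nodes, edges, new_edge):
    """True if adding new_edge merges two existing components."""
    a, b = new_edge
    if a not in nodes or b not in nodes:
        return False
    return count_components(nodes, edges | {new_edge}) < count_components(nodes, edges)


def mean_components_created(nodes, edges):
    """Average number of extra components created by removing each edge in turn."""
    base = count_components(nodes, edges)
    extra = [count_components(nodes, edges - {e}) - base for e in edges]
    return sum(extra) / len(extra)


# Toy network: a triangle A-B-C with a pendant node D, plus a separate pair E-F.
nodes = {"A", "B", "C", "D", "E", "F"}
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F")}

joins_components = closes_gap(nodes, edges, ("D", "E"))
stays_internal = closes_gap(nodes, edges, ("A", "D"))
mean_created = mean_components_created(nodes, edges)
```

In this toy network only the edges C-D and E-F are bridges, so two of the five single-edge removals create an additional component.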
We thank I. Sirer, P.D. McMullen, E.N. Sawardecker, S.M.D. Seaver, and P.B. Winter for comments and suggestions. ARP acknowledges support from the Northwestern Predoctoral Biotechnology Training Grant and the Chicago Biomedical Consortium with support from The Searle Funds at the Chicago Community Trust. RG acknowledges the support of the James S. McDonnell Foundation, of the Spanish Ministerio de Ciencia e Innovación grant FIS2010-18639, and of the European Union Grant PIRG-GA-2010-277166. LANA acknowledges the support of NSF award SBE 0624318 and the W.M. Keck Foundation.