Enumerating all possible biosynthetic pathways in metabolic networks

Exhaustive identification of all possible alternate pathways that exist in metabolic networks can provide valuable insights into cellular metabolism. With the growing number of metabolic reconstructions, there is a need for an efficient method to enumerate pathways, which can also scale well to large metabolic networks, such as those corresponding to microbial communities. We developed MetQuest, an efficient graph-theoretic algorithm to enumerate all possible pathways of a particular size between a given set of source and target molecules. Our algorithm employs a guided breadth-first search to identify all feasible reactions based on the availability of the precursor molecules, followed by a novel dynamic-programming based enumeration, which assembles these reactions into pathways of a specified size producing the target from the source. We demonstrate several interesting applications of our algorithm, ranging from identifying amino acid biosynthesis pathways to identifying the most diverse pathways involved in degradation of complex molecules. We also illustrate the scalability of our algorithm, by studying large graphs such as those corresponding to microbial communities, and identify several metabolic interactions happening therein. MetQuest is available as a Python package, and the source codes can be found at https://github.com/RamanLab/metquest.


Method
Current status † Type Ref.

PathFinder
Link appears broken Visualisation of biochemical transformation [1] PathMiner Link appears broken Represent the compounds based on chemical descriptors and carry out an uninformed search [2] MetaPath Link exists but no page to submit query Calculates the scope of metabolic networks given a set of starting seed [3] NeAT Link to find paths appears broken Toolbox for analysis of biological networks [4] Rahnuma Link appears broken Hypergraph-based method that performs DFS on hypergraph to find routes [5] ReTrace (Python code) -Identifies branching metabolic pathways in atom-level representation of metabolic networks [6] FMM Active Constructs metabolic pathways between metabolites using substrate graph representation [7] PathPred Active Generates the pathways based on the structure transformation patterns and its comparison with reference pathway [8] BPAT-S/BPAT-M Active Identifies branched pathways by first determining linear pathways and combining them [9] MRSD Link appears broken Searches and designs routes based on the weighted compound transform digraph [10] Metabolic Tinker Link appears broken, C# code available Searches for thermodynamically feasible paths in metabolic universe using a tailored heuristic search strategy [11] RouteSearch Active Branch-and-Bound search and involves constructing a network of atom mappings to facilitate efficient searching [12] FindPath (MATLAB code) -Using chemical reaction database generates possible metabolic pathways and uses Genome-Scale Metabolic Model to find most efficient synthetic pathway [13] ATLAS Active Finds possible transformations between two metabolites using reactions from KEGG and other reactions specific to ATLAS [14] MRE Active Provides organism specific data from KEGG online tool for heterologous biosynthesis pathway design [15] †Accessed on October 26, 2017     Table S5: This set represents the common pool of seed metabolites used in all the simulations. As stated earlier, the metabolites were renamed based on the models where the pathways were identified. Note that following metabolites were found in the graph, and hence were removed Ecoli core model pi p, Ecoli core model ppi c, Ecoli core model h2o p, Lactococcus lactis MG1363 pi p, Lactococcus lactis MG1363 h2o p, P1 pi p and P1 h2o p 3 SUPPLEMENTARY ALGORITHM Algorithm S1 Guided BFS -An algorithm to to identify the scope of metabolites 1: Input: G(M, R, E), where M is the set of metabolites in the metabolic network, R is the set of reactions and E is the set of edges, Seed metabolites S (containing the source metabolite(s)) and a set of target nodes T. 2: Output: Scope M s of the seed metabolite set S. 3: R s ← succ(m s ), ∀m s ∈ S R s is the set of source reaction nodes 4: R t ← pred(t), ∀m t ∈ T R t is the set of target reaction nodes 5: Q = Ø Queue containing reactions whose predecessors are present in S 6: R start = Ø 7: for each r ∈ R s do 8: if pred(r) ⊆ S then 9: visited(r) = true R start = R start ∪ r 13: for each r ∈ R start do 14: for each m s ∈ succ(r) do 15: for every r ∈ succ(m s ) do 16: if metabolites to trigger reaction r are present in S & visited (r ) = true then 17: Enqueue r to Q Enqueue r to Q

Note:
Here succ refers to the successors, i.e., the metabolites produced by a reaction and pred refers to the predecessors, i.e., the precursor metabolites required by every reaction.

Optimisations carried out
In this section, we discuss the different optimisations carried out to improve the running times of the algorithm. Broadly, these optimizations pertain to the reduction in the number of partitions generated and pruning the number of reactions assessed at every stage of iteration.
where l is the sum that we intend to generate. Depending on the value of t, we assign the value of k − 1 to t inputs (in multiple combinations) and calculate the values that n − t variables can take up to generate the given sum. Based on these values, we carry out Step 5 of Algorithm 1 and fill up the corresponding entry in the table.
In this case, = 10, (k − 1) = 8, thus t = 1. Now, we assign the value of 8 to 1 input and calculate the value of other input such that both sum to 10. We do this ( n t ) times to assess all combinations. 3. In the second round, we generate partition for all the values which have not been generated so far. For instance, when k = 8 and k = 9, we would generate a partition of numbers that sum ( ) between 7 to 14 and 8 to 16 respectively. Thus, once we intend to evaluate k = 9, we generate a full set of partitions only for the sum ( ) 15 and 16. In the second round of calculations, we repeat Step 4 and Step 5, only for the sum that has not been generated in the previous step. For the case shown above, we would be performing this step when k = 9, = 15, = 16, since partitions for these sums have not been evaluated so far.
4. We prune the number of reactions by taking into consideration only those reaction nodes can be visited (based on the availability of precursors) while carrying out Guided BFS over original and reversed graph. To obtain a reversed graph, we merely reverse the directions of the reactions in the original graph. Scope M s , obtained at the end of traversal on the original graph, is fed as an input to the reversed graph. The traversal starts from the target and stops when the source is reached. This step eliminates reactions that are not a part of the pathway from source to target.

5.
To generate the sum , we use values between m and k − 1, and not 0 to k − 1, since m already has the information about when the metabolite can be first visited. Thus, computation of partitions using numbers between 0 and m , would not be useful.

Computation of partitions till
In this section, we explain why we should generate partitions till n × (k − 1) by illustrating it on a small toy-network. Consider a directed bipartite graph as shown in Figure S3. If we were to traverse from source M 1 to target M 5 , with A 1 as the seed metabolite set S, and cut-off j 3 and k 100, following would be the steps followed by MetQuest:  2. We generate the table of size (6,4), and fill the values against seed metabolites and the source as ∅. In every iteration, for every reaction r ∈ R v , we check if the input reactants can be generated in k − 1 reactions 3. In the first iteration k = 1, for every reaction r ∈ R v , we check if the input reactants can be generated in k − 1 to n × (k − 1) reactions, i.e., in this case, it is 0 reactions. Based on the number of inputs r to the reaction, we partition k − 1 using r integers, such that together they add up to k − 1. While performing this over all the reactions, we find only reaction R 1 , whose inputs satisfy the partitions. Thus, we fill the values 4. Similarly, in the second iteration, we consider all the reactions r ∈ R v , and check if the input reactants can be generated in k − 1 reactions to n × (k − 1), i.e., 1 reaction. We find that only for the reactions R 2 and R 3 , the inputs M 2 satisfy the partition. Thus, we fill in Table( 5. Again in the third step, we consider all the reactions r ∈ R v , and check if the input reactants can be generated in k − 1 to n × (k − 1) reactions, i.e., with 2 (3 − 1) to 6 (2 × (3 − 1)) reactions. We generate partitions of numbers for every reaction r ∈ R v and find only the inputs to reaction R 4 satisfy the partition generated i.e., (2,2) ( Figure S3). Using these partitions, we fetch the corresponding entry from the table, i.e., 6. If we had restricted the generation of partitions till (k − 1), we would have missed this sub-network, which is derived using the partition of numbers generating a sum of 4 (n × (k − 1)). The final output table can be found in Figure S3.

Demonstration of MetQuest on a toy-network
M 10 M 11 Figure S4: Toy reaction network with 9 reactions (Dark gray rectangles) and 11 metabolites (Light gray circles). This network is used to demonstrate how MetQuest works on a toy-network. The final output of pathways producing the metabolites can be found in Table S6. Note that M 1 and A 1 are the seed metabolites. We intend to find pathways of size β = 7 to target M 11 .
The algorithm starts with the seed metabolite set S, which contains M 1 and A 1 . From the first phase, we obtain the scope M s of S. We then initialise the Table of size |M s | × (β + 1). In the first stage, we fill in column 0 with ∅ against the entries corresponding to seed metabolites. We also check the reaction where these seed metabolites, i.e., M 1 and A 1 , participate and determine the metabolite produced. We then fill the Table(M 2 , 1) as R 1 , since this reaction involves the seed metabolites and produces M 2 . From the next step, we start generating partition of numbers for every reaction, and fetch the corresponding values from the Table if they are present in the system. For instance, consider reaction R 2 , which requires one metabolite M 2 . Since we currently intend to fill column 2, we generate partitions that sum to 1 using 1 input. We then check if there are any entries against Table(M 2 , 1). We then perform a union of this reaction R 1 with the current reaction R 2 and fill it against Table(M 3 , 2). If there are multiple entries, we consider single entries at a time and perform a cross-product on the union of reaction sets. Note that the final Table size may be greater than β, since we compute partitions of numbers till n × (k − 1), where k = 1, . . . , β.  Figure S4.

Dealing with cycles
Besides the linear and branched pathways, MetQuest, due to the nature of formulation, also effectively deals with cycles. As an example, consider a directed cyclic bipartite graph as shown in Figure S5. We apply MetQuest for a cut-off of five, with M 1 ,A 1 and M 4 as the source, seed and target metabolite respectively. In the first step of Guided BFS, all the reactions in the network are marked visited, and the scope of M 1 and A 1 are calculated. In the second step, while generating sub-networks, the table entries are sequentially filled. M 2 is generated by R 4 as a part of the cycle. While filling the entry corresponding to Table(M 2 , 4) using reaction R 4 , we generate partitions using one number whose sum is three (since there is one input to reaction R 4 ). Using the output generated from previous iteration, Table( Since M 4 can be generated using 3 reactions, Table(M 2 , 4) will now be R 1 , R 2 , R 3 , R 4 . In the next iteration k = 5 and R = R 2 , we generate partitions using one number to generate a sum of 4, and fetch the entry corresponding to  Figure S5: Cyclic pathway consisting 4 reactions and 5 metabolites. Light gray circles represent metabolites, dark gray rectangles represent reactions.

Implementation and detailed work-flow
We implemented MetQuest on Python 3.6, and is freely available at https://github.com/RamanLab/metquest. MetQuest is also available as a Python package from PyPI. Detailed documentation can be found at http://metquestdoc.readthedocs. io/. We read the SBML files using the functions in COBRAPy [16] and construct the bipartite graph using NetworkX package [17]. Note that in our implementation, while finding the branched networks, we restrict the combined number of the total sub-networks for each metabolite to ns total (Consider a reaction R1 : A + B → C + D, if there are 50 ways to synthesise A and 100 ways to synthesise B, there are 5000 ways to synthesize C and D. If ns total = 1000, we consider the first 1000 networks only, although this can be changed. Thus, at the end of iterations, we get at least ns total sub-networks.) Further, the implementation of MetQuest also allows the user to seek answers to several interesting questions, such as 1. Can the target metabolite be produced using a pathway with n (≤ β) steps? 2. Can the metabolite be produced using (non-)cyclic pathways?

Modules in MetQuest
The implementation of MetQuest has the following modules: 1. construct graph: This module is used to construct the bipartite graph object using the metabolic models provided by the user. (c) find pathways involving exchange mets: This function identifies the pathways producing the target metabolites, which involve exchange metabolites. This function prints output only when a community of organisms is considered, i.e., when more than one metabolic network is used.
(d) find pathways starting from source: This function finds all the pathways from the source metabolite, excluding the seed metabolites.
(e) print summary: This function prints the results summary obtained from the pathways, i.e., i. Number of metabolites in scope ii. Target metabolite iii. Pathway size cutoff β iv. Number of all branched pathways found from seed v. Number of all branched pathways from seed whose size ≤ Pathway size cutoff β vi. Minimum number of steps to produce target metabolite vii. Number of branched pathways from source whose size ≤ Pathway size cutoff viii. If the target metabolite can be produced using cyclic pathway ix. Number of cyclic pathways whose size ≤ Pathway size cutoff β x. One of the combination of most different pathways producing target metabolite xi. Important reactions based on the frequency of occurrences (f) write output to file: This function writes the pathways of sizes less than or equal to the cutoff from source to the target and seed metabolites to target. This function also writes cyclic pathways of sizes less than or equal to cut off from the source to target.

fetch reactions:
This module gets the data about the reactions and the metabolites from the models of multiple organisms. The functions in this module require as input the pathname where the .xml files are located. From this path, this function reads all the files using the functions in the COBRA toolbox and generates the stoichiometric model for these SBML models.

generate partitions:
This module generates combinations of numbers, which together add up to the desired sum.

get reactions types:
This module determines the type of reactions, namely, reversible, irreversible and exchange reactions from the genome-scale metabolic model. 6. guided bfs: This module carries out the guided breadth-first search (Phase 1), as explained in the main manuscript.
7. pathway assembler: This module contains functions that carry out all the steps of Phase 2 of the algorithm, as explained in the main manuscript.

Formal proof of the algorithm
In this section, we present the correctness of our algorithm. The Phase 1 of the algorithm computes the scope M s of the seed set S and for every metabolite m the value m and for each reaction r the value r . This phase is simply a modification of the standard breadth-first search algorithm, and we assume that M s and m and r are correctly generated. We now argue that the Algorithm MetQuest computes Table correctly. In particular, we prove the following: • Claim 1: There exists a pathway RS of size k in the metabolic network that produces a metabolite m starting with the seed set S if and only if = ⊥ if and only if there does not exist any pathway of size k in the metabolic network that produces m starting at the seed set S.

Proof:
We prove the above statements by induction on the column index k. We begin by noting that for k = 0, the entry of the form [0] is set to ⊥ indicating that m cannot be generated using zero reactions. By induction hypothesis assume that for all values of k = 1, . . . , col − 1, the above two claims are true. We now prove that for k = col both the statements hold. We split the proof into two parts.
• Assume that a pathway RS is generated by our algorithm when the for loop in Line 9 has k = col and the reaction in the for loop in Line 10 is r. We argue that such a pathway is a valid pathway in the metabolic network. For k = col, Algorithm 1 considers all sum-values ranging from col − 1 all the way up to n × (col − 1), where n is the number of inputs to reaction r. Assume that m 1 , m 2 , . . . , m n are the inputs to r. A particular sum-value is partitioned as (p 1 , p 2 , . . . , p n ) (using a call to generatePartitions). Note that this partition allows us to generate a pathway containing r which has a sub-pathway of size p i for each m i . Our algorithm then uses for i = 1, . . . , n, We note a subtle but important point here -our algorithm considers sum-values which are larger than col − 1. However, to generate each partition, the maximum value that any p i can take is at most col − 1 (see Line 19 and Line 21 in Algorithm 1). This ensures that for a particular partition (p 1 , p 2 , . . . , p n ) for a sum value , when we query the entry  Table in the columns up to col − 1 are already computed correctly. Thus, using sub-pathways of size p 1 , p 2 , . . . , p n (if they exist) and combining such sub-pathways with r we generate a pathway RS (Line 7 Algorithm 2) which must belong to the metabolic network.
We also remark that since our algorithm considers sum-values all the way up to n × col − 1, the size of the pathways generated by our algorithm can be larger than col. Thus when the value of k = col our algorithm may add pathways to the entries in Table[m][k ] where k > col. We finally remark on the Step 3-4 of Algorithm 2 (populateTable). Note that for a particular partition (p 1 , p 2 , . . . , p n ) and the reaction r, if the entry Table[m i ][p i ] is ⊥ for some i, then it is clear that there is no way to generate a pathway which has a sub-network (pathway) of size p i for the metabolite m i . Thus our algorithm discards such a partition and proceeds with the next partition.
• Now consider any pathway RS of size k containing a particular reaction r in the metabolic network. We argue that at the end of the k-th iteration of the for loop in Line 9 of Algorithm 1, the pathway is generated by our algorithm. Let r be an n-input reaction with m 1 , m 2 , . . . , m n as its inputs. Any pathway RS of size k containing r must have sub-pathways RS i for each i, where the sub-pathway RS i produces m i and size of RS i is p i . Here 0 ≤ p i ≤ k − 1. Thus there exists a partition (p 1 , p 2 , . . . , p n ) denoting the size of each sub-pathway in RS. We now note the important point. The sum-value = ∑ n i=1 p i can range from k − 1 to n × (k − 1) . The sum can be k − 1 when each of the sub-pathway has distinct p i many reactions. On the other hand, the sum can be n × (k − 1) when there is a single sub-pathway of size k − 1 producing all the n inputs to r. This is precisely the reason our algorithm considers all sum values ranging from k − 1 to n × (k − 1) (Line 13 in Algorithm 1). This step in the algorithm ensures that any partition that can possibly generate the pathway RS is considered by our algorithm.

Pathways from D-glucose to L-phenylalanine of length 28 in S. cerevisiae iMM904
In this section, we enlist the two pathways of length 28 found by MetQuest in S. cerevisiae iMM904. The reactions (in blue colour) represent the target reactions producing L-phenylalanine. pep c + h c + adp c → pyr c + atp c GAPD: g3p c + pi c + nad c → nadh c + h c + 13dpg c PGK: 13dpg c + adp c → 3pg c + atp c ACONT: cit c → icit c SHKK: skm c + atp c → h c + skm5p c + adp c SHK3D: 3dhsk c + h c + nadph c → nadp c + skm c HCO3E: co2 c + h2o c → hco3 c + h c PC: hco3 c + pyr c + atp c → h c + adp c + pi c + oaa c CHORM: chor c → pphn c GLUDyi: akg c + h c + nadph c + nh4 c → nadp c + h2o c + glu DASH L c PFK: f6p c + atp c → fdp c + h c + adp c

Pathways for catechol degradation
Below, we enlist the two pathways (shown in Figure 2 of main manuscript) found by MetQuest that degrade catechols in Pseudomonas putida iJN746. These two pathways represent the most diverse ways of catechol degradation. The reactions (in blue colour) represent the target reactions producing formate.

Comparison with other methods
The complete description of the reactions for the source-target pair listed in Table 1 (main manuscript) are given below: