Reliable and efficient solution of genome-scale models of Metabolism and macromolecular Expression

Constraint-Based Reconstruction and Analysis (COBRA) is currently the only methodology that permits integrated modeling of Metabolism and macromolecular Expression (ME) at genome-scale. Linear optimization computes steady-state flux solutions to ME models, but flux values are spread over many orders of magnitude. Data values also have greatly varying magnitudes. Standard double-precision solvers may return inaccurate solutions or report that no solution exists. Exact simplex solvers based on rational arithmetic require a near-optimal warm start to be practical on large problems (current ME models have 70,000 constraints and variables and will grow larger). We have developed a quadruple-precision version of our linear and nonlinear optimizer MINOS, and a solution procedure (DQQ) involving Double and Quad MINOS that achieves reliability and efficiency for ME models and other challenging problems tested here. DQQ will enable extensive use of large linear and nonlinear models in systems biology and other applications involving multiscale data.


Introduction
summarizes our DQQ procedure for achieving reliability and efficiency for multiscale optimization problems. The main paper reports application of DQQ to three large ME models (TMA ME, GlcAerWT, GlcAlift) and to some other challenging linear optimization problems (the pilot economic models and the Mészáros problematic set). Below we provide the following supplementary information: • Solution of 78 Metabolic models by Double and Quad MINOS, verifying that the Double solver gives reliable results.
• Solution of two slightly different forms of the TMA ME model, showing robustness of solution values with respect to O(10 −6 ) relative perturbations of the data.
• Some details of the Double and Quad MINOS implementations.
• Results with Gurobi on the ME models.
• Results with SoPlex80bit on the ME and problematic models.

Metabolic models with Quad solvers admit biomass synthesis
COBRA models of metabolic networks assume the existence of at least one steady-state flux vector that satisfies the imposed constraints and admits a non-zero optimal objective. Where the objective is to maximize a biomass synthesis reaction, the corresponding FBA problem should admit a nonzero biomass synthesis rate. It is established practice to solve monoscale metabolic FBA problems with Double solvers, so one may ask: do biomass synthesis predictions from metabolic models hold when higher precision solvers are applied to the same FBA problem? We tested 78 M models derived from the BiGG database 1 using Double and Quad MINOS. We downloaded these models in the JSON format and parsed them using the JSON reader in cobrapy 2 . The models were not modified after loading, so all constraints, bounds, and objective coefficients were used as in the original files. All models were feasible using both Double and Quad, and all but five models had an optimal objective value greater than zero. Of these five models, four simply had all-zero objective coefficients, while the remaining (RECON1) model maximized a single reaction (S6T14g) but its optimal value was zero. The maximum difference in objective value between Double and Quad was 2.6 × 10 −12 . The additional precision provided by Quad MINOS enabled us to conclude efficiently and effectively that the 78 metabolic models could be solved reliably using a Double solver. This conclusion is consistent with previous findings by Ebrahim et al. 3 .

Robustness of solution values for TMA ME
TMA ME 4 was the first ME model that we used for Quad experiments. The data S, c, , u came as a Matlab structure with c j = 0, j = 0, u j = 1000 for most j, except four variables had smaller upper bounds, the last variable had moderate positive bounds, and 64 variables were fixed at zero. The objective was to maximize flux v 17533 . We output the data to a plain text file. Most entries of S were integers (represented exactly), but about 5000 S i j values were of the form 8.037943687315e−01 or 3.488862338191e−06 with 13 significant digits. The text data was read into Double and Quad versions of a prototype Fortran 90 implementation of SQOPT 5 .
For the present work, we used the same Matlab data to generate an MPS file for input into MINOS. Since this is limited to 6 significant digits, the values in the preceding paragraph were rounded to 8.03794e−01 and 3.48886e−06 and in total about 5000 S i j values had O(10 −6 ) relative perturbations of this kind. This was a fortuitous limitation for the ME models. We have been concerned that such data perturbations could alter the FBA solution greatly because the final basis matrices could have condition number as large as 10 6 or even 10 12 (as estimated by LUSOL 6 each time SQOPT or MINOS factorizes the current basis B). However, in comparing Quad SQOPT and Quad MINOS with SoPlex 7, 8 and the exact simplex solver QSopt ex 9 , we observe in Table 1 that the final objective values for TMA ME in Matlab data reported by QSopt ex and Quad SQOPT match in every digit. Moreover, the objective value achieved by Quad MINOS on the perturbed data in MPS format agrees to 5 digits of the results from the exact solver QSopt ex on the "accurate" data. These results show the robustness of the TMA ME model and our 34-digit Quad solvers.
More importantly, for the most part even small solution values are perturbed in only the 5th or 6th significant digit. Let v and w be the solutions obtained on slightly different data. Some example values are given in Table 2. Among all j for which max(v j , w j ) > δ 1 = 10 −15 (the feasibility tolerance), the largest relative difference |v j − w j |/ max(v j , w j ) was less than 10 −5 for all but 31 variables. For 22 of these pairs, either v j or w j was primal or dual degenerate (meaning one of them was zero and there are alternative solutions with the same objective value). The remaining 9 variables had v j , w j values shown in Table 3.
We see that the values are small (the same magnitude as the data perturbation) but for each of the nine pairs there is about 1 digit of agreement. We could expect thousands of small solution pairs to differ more, yet for almost all 17535 pairs at least 5 digits agree. Although these observations do not prove robustness of FBA models in general (because we analyzed only one perturbation to one model), they are welcome empirical evidence that the solutions are not extremely unstable. Quad solvers can help evaluate the robustness of future (increasingly large) models of metabolic networks by enabling similar comparison of high-accuracy solutions for slightly different problems.

MINOS implementation
MINOS 10, 11 is a linear and nonlinear optimization solver implemented in Fortran 77 to solve problems of the form where ϕ(v) is a smooth nonlinear function and f (v) is a vector of smooth nonlinear functions. The matrix S and the Jacobian of f (v) are assumed to be sparse. Let Single/Double/Quad denote the floating-point formats defined in the 2008 IEEE 754 standard 12 with about 7/16/34 digits of precision, respectively. Single is not useful in the present context, and Double may not ensure adequate accuracy for multiscale problems. This is the reason for our work. Since release 4.6 of the GCC C and Fortran compilers 13 , Quad has been available via the long double and real(16) data types. Thus, we have made a Quad version of Double MINOS using the GNU gfortran compiler (GNU Fortran 5.2.0).
On today's machines, Double is implemented in hardware, while Quad (if available) is typically implemented in a software library, in this case GCC libquadmath 14 .
For Double MINOS, floating-point variables are declared real(8) (≈ 16 digits). For Quad MINOS, they are real(16) (≈ 34 digits) with the data S, c, , u stored in Quad even though they are not known to that precision. This allows operations such as Sv and S T y to be carried out directly on the elements of S and the Quad vectors v, y. If S were stored in Double, such products would require each entry S i j to be converted from Double to Quad at runtime (many times).
The primal simplex solver in MINOS includes geometric mean scaling 15 , the EXPAND anti-degeneracy procedure 16 , and partial pricing (but no steepest-edge pricing, which would generally reduce total iterations and time). Basis LU factorizations and updates are handled by LUSOL 6 . Cold starts use a Crash procedure to find a triangular initial basis matrix. Basis files are used to preserve solutions between runs and to enable warm starts.
Scaling is commonly applied to linear programs to make the scaled data and solution values closer to 1. Feasibility and optimality tolerances can be chosen more easily for the scaled problem, and LU factors of the basis matrix are more likely to be sparse. For geometric mean scaling, several passes are made through the columns and rows of S to compute a scale factor for each column and row. A difficulty is that the scaled problem may solve to within specified feasibility and optimality tolerances, but when the solution is unscaled it may lie significantly outside the original (unscaled) bounds.
EXPAND tries to accommodate consecutive "degenerate" simplex iterations that make no improvement to the objective function. The problem bounds are effectively expanded a tiny amount each iteration to permit nonzero improvement. Convergence is usually achieved but is not theoretically guaranteed 17 . Progress sometimes stalls for long sequences of iterations.
LUSOL bounds the subdiagonals of L when the current basis matrix B is factorized as P 1 BP 2 = LU with some permutations P 1 , P 2 . It also bounds off-diagonal elements of elementary triangular factors L j that update L in product form each simplex

Conventional iterative refinement
For the biology models, our aim is to satisfy Feasibility and Optimality tolerances of 10 −15 (close to Double precision). It is reasonable to suppose that this could be achieved within a Double simplex solver by implementing iterative refinement (Wilkinson 18 ) for every linear system involving the basis matrix B or B T . This is a more sparing use of Quad precision than our DQQ procedure. For example, each time the current B is factorized directly (typically a new sparse LU factorization every 100 iterations), the constraints Sv = 0 can be satisfied more accurately by computing the primal residual r = 0 − Sv from the current solution v, solving B∆v B = r, and updating v B ← v B + ∆v B . In general, the new v will not be significantly more accurate unless r is computed in Quad. (If B is nearly singular, more than one refinement may be needed.) Similarly for solving B T y = c B after refactorization, and for two systems of the form Bp = a and B T y = c B each iteration of the simplex method. By analogy with DQQ, we implemented the following procedure within a test version of Double MINOS. Note that "iterative refinement" in steps R1, R2 means a single refinement for each B or B T system, with residuals −Sv, a − Bp, c B − B T y computed in Quad as just described.

DRR procedure
Step D Apply Double MINOS with scaling and moderately strict runtime options.
Step R1 Warm-start Double MINOS with scaling, stricter options, and iterative refinement.
Step R2 Warm-start Double MINOS without scaling but with stricter options and iterative refinement.
Step D is the same as for DQQ (with no refinement). The runtime options for each step are the same as for DQQ, except in steps R1, R2 the tolerances 1e−15 were relaxed to 1e−9.
In Table 4 we see that this simplified (cheap) form of iterative refinement is only partially successful, with step R2 achieving only 4, 3, and 2 correct digits in the final objective. For GlcAerWT, steps R1 and R2 encountered frequent near-singularities in the LU factors of B (requiring excessive refactorizations and alteration of B), and in step R2, the single refinement could not always achieve full Double precision accuracy for each system. Additional refinements would improve the final Pinf and Dinf, but would not reduce the excessive factorizations. We conclude that on the bigger ME problems, a Double solver is on the brink of failure even with the aid of conventional (Wilkinson-type) iterative refinement of each system involving B and B T . We conclude that our DQQ procedure is a more expensive but vitally more robust approach.

Results with NEOS/Gurobi
For large linear models, commercial solvers have reached a high peak of efficiency. It would be ideal to make use of them to the extent possible. For example, their Presolve capability allows most of the optimization to be performed on a greatly reduced form of any typical model.   Table 5 summarizes the performance of Gurobi 19 on three large ME models via the NEOS server 20 . The first three results used Gurobi's default runtime options, including Presolve, Dual simplex, and Scaling (with default FeasibilityTol = OptimalityTol = 1e−6). TMA ME seemed to solve successfully, but from the Quad MINOS solution we know that Gurobi's final objective value has no correct digits. GlcAerWT failed with "Numeric error" after many expensive iterations using 80-bit floating-point. GlcAlift also switched to 80-bit floating-point. The scaled problem seemed to solve successfully, but unscaling damaged the primal residual and this casts significant doubt on the final solution. (This is the reason for our research.) For GlcAlift2 we specified NumericFocus 3 with no Presolve and no scaling. These options are appropriate for lifted models 21 . Gurobi did not switch to 80-bit arithmetic, yet achieved 5 correct digits in the objective. This helps confirm the value of the lifting strategy of 21 , and would provide a good starting point for steps Q1 and Q2 of DQQ. However, DQQ permits us to solve the original model GlcAerWT directly (without the lifting transformation). Table 6 summarizes the performance of SoPlex80bit 8 on the three large ME models via NEOS with default options, except the simplifier and lifting options were turned off to ensure that SoPlex80bit was iterating on the same problems as MINOS.

Results with NEOS/SoPlex80bit
SoPlex80bit performed extremely well on all ME models ( Table 6). The first floating-point solves achieved maximum primal and dual feasibilities of order 1e−7 or less, with no sign of scaling or potentially troublesome unscaling, and three rounds of iterative refinement reduced the infeasibilities to order 1e−44(!). The optimal objective values agreed to the 11 digits printed by Quad MINOS. Analogous excellent performance by SoPlex80bit on large ME models is described by Gleixner [22,Ch. 4]. On the problematic set (Table 7), SoPlex80bit solved most problems solved accurately, but with some anomalies. On de063157, the first floating-point solve achieved 5 significant digits in the objective function but with primal and dual infeasibilities of 4e+2 and 1e+4. The first refinement reduced the latter to 2e+1 and 3e−12, and the second refinement achieved 4e−15 and 3e−12. This should have been acceptable, but a further 100 refinements were conducted (at negligible cost) before the run was terminated with no final solution available.

5/7
On gen2 and gen4, the first floating-point solves were very efficient and accurate (41 and 82 seconds respectively). Three refinements achieved primal and dual infeasibilities of order 1e−11 or less. A final rational factorization proved expensive and accounted for 99% of the total times (6016 and 7132 seconds respectively), but confirmed optimality.
On l30, the first floating-point solve performed many iterations but achieved primal and dual infeasibilities of 4e−9 and 1e−10 with objective value −2.5e−11, which should have been acceptable. The first refinement reported numerical troubles after 2702 iterations. It continued to about 154000 iterations and computed an unbounded ray. Nearly 4000 refinements followed (each doing no iterations) before numerical trouble was reported. One final solve performed 3000 iterations before increasing the Markowitz threshold and terminating with no solution.
Details of this nature will change, but some of them hint at the need for higher precision in the floating-point solver to facilitate SoPlex's iterative refinement.

Looking ahead
The large-scale optimizer SNOPT 5 is maintained as a Fortran 77 solver snopt7 23 suitable for step D of the DQQ procedure. An accompanying Fortran 2003 version snopt9 has also been developed, for which Double and Quad libraries can be built with only one line of source code changed. They are ideal for applying DQQ to future multiscale linear and nonlinear optimization models, as long as step D can be terminated early enough when numerical difficulties arise. Quad enhancements to the SoPlex floating-point solver also promise reliability and extreme accuracy for future challenging models.