Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction

Enzyme turnover numbers (kcat) are key to understanding cellular metabolism, proteome allocation and physiological diversity, but experimentally measured kcat data are sparse and noisy. Here we provide a deep learning approach (DLKcat) for high-throughput kcat prediction for metabolic enzymes from any organism merely from substrate structures and protein sequences. DLKcat can capture kcat changes for mutated enzymes and identify amino acid residues with a strong impact on kcat values. We applied this approach to predict genome-scale kcat values for more than 300 yeast species. Additionally, we designed a Bayesian pipeline to parameterize enzyme-constrained genome-scale metabolic models from predicted kcat values. The resulting models outperformed the corresponding original enzyme-constrained genome-scale metabolic models from previous pipelines in predicting phenotypes and proteomes, and enabled us to explain phenotypic differences. DLKcat and the enzyme-constrained genome-scale metabolic model construction pipeline are valuable tools to uncover global trends of enzyme kinetics and physiological diversity, and to further elucidate cellular metabolism on a large scale. Comprehensive information on enzyme catalytic rates is essential to understand the metabolism of cells, but only a small fraction has been determined experimentally. Now, a deep learning model is developed to predict kcat values of metabolic enzymes on a large scale using substrate SMILES and protein sequence information.

T he enzyme turnover number (k cat ), which defines the maximum chemical conversion rate of a reaction, is a critical parameter for understanding the metabolism, proteome allocation, growth and physiology of a certain organism [1][2][3] . There are large collections of k cat values available in the enzyme databases BRENDA 4 and SABIO-RK 5 , which are, however, still sparse compared to the variety of existing organisms and metabolic enzymes, largely due to the lack of high-throughput methods for k cat measurement. Additionally, experimentally measured k cat values have considerable variability due to varying assay conditions such as pH, cofactor availability and experimental methods 6 . Altogether, the sparse collection and considerable noise limit the use of k cat data for global analysis and may mask enzyme evolution trends.
In particular, enzyme-constrained genome-scale metabolic models (ecGEMs), where the whole-cell metabolic network is constrained by enzyme catalytic capacities and thereby able to accurately simulate the maximum growth abilities, metabolic shifts and proteome allocations, rely heavily on genome-scale k cat values 2,7 . Over the past decade, ecGEMs (or models following the concept of enzyme constraints) have been separately developed for several well-studied organisms 7 including Escherichia coli 8,9 , Saccharomyces cerevisiae 2,10 , Chinese hamster ovary cells 11 and Homo sapiens 12 . Due to the limitations of k cat measurements 13 and the reliance on enzyme commission (EC) number annotations to search for k cat values in those developed pipelines 2,8,10 , the reconstruction of ecGEMs for lesser-studied organisms or large-scale reconstruction for multiple organisms has remained a challenge 7,14 . Moreover, even for those well-studied organisms, the k cat coverage is far from complete 13,15,16 . In a S. cerevisiae ecGEM, only 5% of all enzymatic reactions have fully matched k cat values in BRENDA 2 . When data are missing, previous ecGEM reconstruction pipelines typically assume k cat values from similar substrates, reactions or other organisms, which can result in model predictions deviating from experimental observations 7 . There is a clear requirement for obtaining large-scale k cat values to improve model accuracy and yield more reliable phenotype simulations 17 .
Deep learning has been applied and shown great performance in modelling chemical spaces 18 , gene expression 19 , enzyme-related parameters such as enzyme affinity 20 and EC numbers 21 . Previously, Heckmann and colleagues employed machine learning approaches to predict E. coli k cat values based on features such as average metabolic fluxes and catalytic sites obtained from protein structures 16 . However, such features are typically hard to obtain, which allows the application of this approach only to the most well-studied organisms such as E. coli.
To this end, we developed a deep learning approach (DLKcat) that uses substrate structures and protein sequences as inputs, and demonstrated its capability for the large-scale prediction of k cat values for various organisms, as well as for identifying key amino acid residues that affect these predictions. We showcased the predictive power of the deep learning model by predicting genome-scale k cat profiles for 343 yeast/fungi species, accounting for more than 300,000 enzymes and 3,000 substrates. The predicted k cat profiles enabled reconstruction of 343 ecGEMs for the yeast/fungi species through an automatic Bayesian-based pipeline, which can accurately simulate growth phenotypes among yeast species and identify the phenotype-related key enzymes.

Construction of a deep learning approach for k cat prediction.
The deep learning approach DLKcat was developed by combining a graph neural network (GNN) for substrates and a convolutional neural network (CNN) for proteins (Fig. 1). Substrates were represented as molecular graphs converted from the simplified Deep learning-based k cat prediction enables improved enzyme-constrained model reconstruction molecular-input line-entry system (SMILES), and protein sequences were split into overlapping n-gram amino acids (the string of contiguous sequences consisting of n items). We generated a comprehensive dataset from the BRENDA 4 and SABIO-RK 5 databases to train the neural network. Incomplete database entries with missing information and redundant entries were filtered out to ensure a dataset of unique entries with substrate name, substrate SMILES information, EC number, protein sequence, organism name and k cat value. The final dataset contained 16,838 unique entries catalysed by 7,822 unique protein sequences from 851 organisms and converting 2,672 unique substrates (Supplementary Figs. 1 and 2). This dataset was randomly split into training, validation and test datasets by 80%, 10% and 10%, respectively, while five times of random splitting indicated the robustness of the deep learning model ( Supplementary Fig. 3).
Deep learning model performance for k cat prediction. The effects of hyperparameters on deep learning performance were evaluated by learning curves (Supplementary Fig. 4). With the selected optimal parameters (r-radius substrate subgraphs, in which r is the number of hops from a vertex of substrate structure, 2; n-gram amino acids, 3; vector dimensionality, 20; time steps in GNN, 3; number of layers in CNN, 3), the deep learning model was trained. The root mean square error (r.m.s.e.) of k cat predictions gradually decreased with increasing epoch (Fig. 2a), where one epoch is one iteration of the dataset passing through the neural network. A final deep learning model trained and stored for further use had a r.m.s.e. of 1.06 for the test dataset, signifying that predicted and measured k cat values were overall within one order of magnitude (Fig. 2a). A high predictive accuracy could be observed on both the whole dataset (training, validation and test datasets) ( Fig. 2b; Pearson's r = 0.88) and the test dataset ( Supplementary Fig. 5a; Pearson's r = 0.71; Supplementary  Fig. 5b for test dataset where at least either the substrate or enzyme was not present in the training dataset; Pearson's r = 0.70). The predicted k cat values were categorized according to the metabolic context of the enzymes (Supplementary Table 1), and enzymes involved in primary central and energy metabolism yielded significantly higher k cat values than enzymes involved in intermediary and secondary metabolism ( Supplementary Fig. 5c), in agreement with previous observations 6 . The deep learning model was able to show enzyme promiscuity. Understanding enzyme promiscuity and the related underground metabolism is a key topic in evolutionary biology 22,23 . DLKcat-predicted k cat values (Fig. 2c) were higher for preferred substrates (median k cat = 11.07 s -1 ) compared to alternative substrates (median k cat = 6.01 s -1 ; P = 1.3 × 10 -12 ) and random substrates (median k cat = 3.51 s -1 ; P = 9.3 × 10 -6 ) for promiscuous enzymes in the whole dataset, while the same trend was identified in the test dataset (Supplementary Fig. 5d; P < 0.05). The concept of native and underground metabolism 24 could be exemplified with the rich experimental k cat data that are available for human aldo-keto reductase and 61 substrates, where DLKcat could differentiate ( Fig. 2d; P = 0.0039) between native (top 10% experimental k cat values, median = 2.22 s -1 ) and underground (last 10%, median = 0.04 s -1 ) substrates.
Prediction and interpretation of k cat of mutated enzymes. Beyond good overall performance (Fig. 2b), DLKcat was able to capture the effects of amino acid substitutions on the k cat values of individual enzymes. The annotated dataset was divided into wild-type enzymes and mutated enzymes with amino acid substitutions. As the median k cat of mutant enzymes was lower than that of wild-type enzymes ( Supplementary Fig. 6a), the deep learning model was a good k cat predictor for both wild-type enzymes (Fig. 3a Table 2). The predicted and experimentally measured k cat values correlated very well (Pearson's r = 0.94; Fig. 3c). The experimentally measured k cat values were further grouped as within a 0.5-fold to 2.0-fold change of wild-type k cat ('wild-type-like k cat ') or less than a 0.5-fold change of wild-type k cat ('decreased k cat '). The scarcity of mutated enzymes with k cat values over twofold of the wild-type k cat values precluded defining the 'increased k cat ' group 25,26 . DLKcat was able to capture the effects of small changes in protein sequences on the activities of individual enzymes, as the decreased k cat group contained significantly lower predicted k cat values compared to the wild-type-like k cat group, for all enzyme-substrate pairs (Fig. 3d).
To investigate which amino acid residues dominate enzyme activity, we applied a neural attention mechanism to back-trace important signals from the neural network output towards its input 27 . This approach assigns attention weights to each amino acid residue, quantitatively describing its importance for the predicted enzyme activity. Attention weights were calculated for the wild-type H. sapiens purine nucleoside phosphorylase (PNP) with inosine as substrate, as rich mutation data are available for this enzyme-substrate   Table 3). Situating the mutations from the wild-type-like k cat and decreased k cat groups (Fig. 3e) to the wild-type PNP sequence exhibited that residues that were mutated in the decreased k cat group had significantly higher attention weights ( Fig. 3f; P = 0.0014; Supplementary Table 4). The calculation of attention weights from the deep learning model can thereby identify amino acid residues whose mutation would likely have a more substantial effect on enzyme activity.
The k cat prediction for 343 yeast/fungi species. We previously reconstructed GEMs for 332 yeast species plus 11 out-group fungi, but only expanded 14 of them to ecGEMs using the original pipeline 10 due to the limited available k cat data 14 . As DLKcat allows prediction of almost all k cat values for metabolic enzymes against any substrates for any species, this enabled the generation of ecGEMs for all 343 yeast/fungi species, predicting k cat values for around three million enzyme-substrate pairs ( Supplementary Fig. 7). Yeast and fungal specialist enzymes (with narrow substrate specificity) had higher k cat values compared with generalist (that is, promiscuous) enzymes that catalyse more than one reaction in the model ( Supplementary Fig. 8a). This is aligned with the hypothesis that ancestral enzymes with broad substrate specificity and low catalytic efficiency improve their k cat value when they evolve into specialists through mutation, gene duplication or horizontal gene transfer 29 . Sequence conservation also trended with predicted k cat values, where the ratio of non-synonymous over synonymous substitutions (dN/dS) is commonly used to detect proteins undergoing adaptation 30 . Conserved enzymes with lower dN/dS have significantly higher k cat values compared with relatively lesser conserved enzymes (with high dN/dS), implying that conserved yeast/fungi enzymes under evolutionary pressure are adapted to have higher k cat values ( Supplementary Fig. 8b).
Bayesian approach for 343 ecGEM reconstructions. Using the predicted k cat values for 343 yeast/fungi species, we generated 343 'DL-ecGEMs' (ecGEMs parameterized with k cat values from DLKcat). The training data for the deep learning model were primarily measured in vitro, which implies that DLKcat also predicts in vitro k cat values, which is undesired as in vitro k cat values can be considerably different from in vivo 31 . To resolve these uncertainties, we adopted a Bayesian genome-scale modelling approach 32 . Here, we used predicted k cat values as mean values for prior distributions and experimentally measured phenotypes to update these to obtain posterior k cat distributions. For this, experimental growth data on yeast/fungi species were collected, collating 371 entries for 53 species with 16 carbon sources (Supplementary Table 5 and Supplementary Fig. 9). A sequential Monte-Carlo-based approximate Bayesian computation (SMC-ABC) approach 32 was implemented to sample the k cat values, after validating its generality with the ecGEM of S. cerevisiae, which had the most abundant experimental data ( Supplementary Fig. 10). The ecGEMs parameterized with the mean values of sampled posterior k cat values are hereafter represented as posterior-mean-DL-ecGEMs.
The Bayesian learning processes for S. cerevisiae and nonconventional yeast Yarrowia lipolytica are shown as examples ( Fig. 4 and Supplementary Fig. 11). We calculated r.m.s.e. values between measurements and predictions for batch and chemostat growth of S. cerevisiae and Y. lipolytica under different carbon sources. After several generations, the ecGEMs parameterized with sampled posterior k cat values achieved a r.m.s.e. lower than one ( Fig. 4a and Supplementary Fig. 11a), which showed they could accurately describe the experimental observations. For instance, the S. cerevisiae ecGEM captured the metabolic shift at increasing growth rate (Fig. 4b)-known as the Crabtree effect 33 -while Y. lipolytica respired at its maximum growth rate ( Supplementary Fig. 11b). Principal component analysis for all generated k cat sets (9,800 sets for S. cerevisiae and 4,900 sets for Y. lipolytica) showed a gradual move from the prior distribution to the distinct posterior distribution ( Fig. 4c and Supplementary Fig. 11c). The Bayesian learning process affected more variance than mean predicted k cat values (Fig. 4d,e). For S. cerevisiae, 1,057 enzyme-substrate pairs reduced their k cat variance (Šidák-adjusted one-tailed F-test, P < 0.01), while only 532 pairs changed their mean predicted k cat (Šidák-adjusted Welch's t-test, P < 0.01), which were randomly distributed across metabolic subsystems (Supplementary Table 6 values and those present in the whole dataset (training, validation and test datasets) was evaluated. The brightness of colour represents the density of data points. Student's t-test was used to calculate the P value for Pearson's correlation. c, Enzyme promiscuity analysis on the whole dataset. For enzymes with multiple substrates, we divided the substrates into preferred and alternative by their experimentally measured k cat value, and then used the predicted k cat values for this box plot. Random substrates were randomly chosen from the compound dataset in our training data, except for the documented substrates and products for the tested enzyme. We evaluated 945 promiscuous enzymes in the whole dataset (n = 945 for preferred substrates, n = 4,238 for alternative substrates, n = 945 for random substrates). d, Comparison of the predicted k cat values for the native substrates and the underground substrates with the human aldo-keto reductase enzyme as a case study. Here, we defined those substrates with the top 10% catalytic ability (experimental k cat value) as the native substrates (n = 6), while those with the last 10% catalytic ability (experimental k cat value) were considered as the underground substrates as defined in the reference (n = 6) 24 . In each box plot (c and d), the central band represents the median value, the box represents the upper and lower quartiles and the whiskers extend up to 1.5 times the interquartile range beyond the box range. A two-sided Wilcoxon rank sum test was used to calculate the P values in c and d.

Deep learning and Bayesian approaches improve ecGEM quality.
We subsequently generated posterior-mean-ecGEMs from corresponding DL-ecGEMs for all the 343 yeast/fungi species. For comparison, we also built 'original-ecGEMs' for the same species with a k cat parameterization strategy that assigns measured k cat values from BRENDA 4 and SABIO-RK 5 to enzyme/reaction pairs as was done in previous pipelines 2, 8 . We were able to reconstruct original-ecGEMs for all 343 yeast/fungi species only after assuming that orthologs across yeast species had the same EC number annotation as in S. cerevisiae. In case of missing data, certain flexibility was introduced by matching the k cat value to other substrates or organisms, or even introducing wild cards in the EC number. The original-ecGEMs yielded k cat values for ~40% of enzymes and generated enzymatic constraints for ~60% of enzyme-annotated reactions, while DL-ecGEMs and their derived posterior-mean-ecGEMs covered k cat values for ~80% of enzymes and defined enzymatic constraints for ~90% of enzymatic reactions (Fig. 5a,b for 343 yeast/fungi species; Supplementary Fig. 12a,b for S. cerevisiae). While original-ecGEMs had fewer assigned k cat values, their reconstruction pipeline also relied heavily on correct enzyme EC number annotations and available measured k cat values in the databases, contrasting with the DL-ecGEM reconstruction, which relied only on protein sequences and substrate SMILES information while resulting in a higher coverage. In DL-ecGEMs and posterior-mean-ecGEMs the only missing k cat values were for generic substrates without defined SMILES information (such as generic compounds phosphatidate and thioredoxin).
Besides the improved k cat coverage, the posterior-mean-ecGEMs and DL-ecGEMs also outperformed original-ecGEMs in the prediction of exchange rates ( Moreover, we used these three types of ecGEMs to predict required protein abundances and compared this with published quantitative proteomics data from four species with different carbon sources, culture modes and medium set-ups (Supplementary Table 7). Proteome predictions from DL-ecGEMs and posterior-mean-ecGEMs had the lowest r.m.s.e. values, while DL-ecGEMs had already reduced the r.m.s.e. by 30% when compared to original-ecGEMs ( Fig. 5e for four species with absolute proteome data). Combined, the current pipeline not only increases k cat coverage but also contributes to ecGEMs better representing the 343 fungi/yeast species.
The k cat comparison identifies phenotype-related enzymes. The predicted k cat values were furthermore able to distinguish between Crabtree positive and negative yeast species. There is much interest in understanding the presence of the Crabtree phenotype among yeast species 34,35 , and a model of S. cerevisiae energy metabolism has previously been used to interpret this phenotype by comparing protein efficiency (that is, ATP produced per protein mass per time) in its two energy-producing pathways 1 . It was postulated that the Crabtree effect is related to the high-yield (HY) pathway (containing the Embden-Meyerhof-Parnas pathway, the tricarboxylic acid (TCA) cycle and the electron transport chain), having a lower protein efficiency than the low-yield (LY) pathway (containing Embden-Meyerhof-Parnas plus ethanol formation; Fig. 6a) 1 . We here used the posterior-mean-ecGEMs of 102 yeast species with experimental reported Crabtree phenotype (25 positive; 77 negative) to similarly calculate the protein efficiencies of the HY and LY values for all wild-type (a) and mutated (b) enzymes. Colour brightness represents data density. c, Comparison between predicted and measured k cat values for several well-studied enzyme-substrate pairs with rich experimental mutagenesis data. Enzyme abbreviations: DHFR, dihydrofolate reductase; PGDH, d-3-phosphoglycerate dehydrogenase; AKIII, aspartokinase III; DAOCS, deacetoxycephalosporin C synthase; PNP, purine nucleoside phosphorylase; GGPPs, geranylgeranyl pyrophosphate synthase. Substrate abbreviations: G3P, glycerate 3-phosphate; L-Asp, l-aspartate; IPP, isopentenyl diphosphate. In a-c, the student's t-test was used to calculate the P value for the Pearson's correlation. d, Comparison of predicted k cat values on several mutated enzyme-substrate pairs between enzymes with wild-type-like k cat and decreased k cat . P < 0.05 (*), P < 0.01 (**) and P < 0.001 (***), two-sided Wilcoxon rank sum test. Detailed information and sample numbers can be found in Supplementary Table 2. e, Attention weight of residue position in the wild-type PNP enzyme, using inosine as substrate. The mutated residues in each of the mutated enzymes (with both wild-type-like k cat and decreased k cat ) were marked on the curve according to their mutated residue. Dot size indicates the number of mutated enzymes with mutations of that residue. f, Overall attention weights for the PNP-inosine pair, comparing enzymes with wild-type-like k cat and decreased k cat by two-sided Wilcoxon rank sum test. n = 15 for wild-type-like k cat ; n = 72 for decreased k cat . In each box plot (d and f), the central band represents the median value, the box represents the upper and lower quartiles and the whiskers extend up to 1.5 times the interquartile range beyond the box range.
pathways. Of the 102 species, 89% followed the trend that Crabtree positive species have a higher LY efficiency, suggesting that Crabtree positive yeasts' LY pathways are more protein efficient than their HY pathways for producing the same amount of ATP (Supplementary  Table 8). For five commonly studied species, the results are shown in Fig. 6b, and even though ATP yields in their HY pathways may vary across species, primarily due to the presence of respiratory complex I, they still followed the same trend (Supplementary Table  8). Inconsistencies in strains where the HY/LY protein efficiency ratio did not trend with the Crabtree effect might be due to additional regulation not considered in ecGEMs 36 .
With the predicted k cat profiles for yeast species, we could investigate whether key enzymes show different k cat values among 25 Crabtree positive and 77 negative species. Of the enzymes in the energy-producing pathways, only pyruvate kinase, citrate synthase, fumarase and phosphoglucose isomerase had significantly different k cat values (Fig. 6c). Since fumarase and phosphoglucose isomerase can operate in reversible directions, it is unclear how the k cat difference relates to the Crabtree effect. The k cat values of pyruvate kinase were higher in Crabtree positive species (P = 0.006; Fig. 6c). This aligns with the fact that increasing pyruvate kinase activity in the Crabtree positive Schizosaccharomyces pombe increases its fermentation ratio, decreases the growth dependence on respiration and provides resistance to growth-inhibiting effects of antimycin A, which inhibits respiratory complex III (ref. 37 ). Citrate synthase catalyses the first and rate-limiting step of the TCA cycle 38 , condensing acetyl-coenzyme A and oxaloacetate to citrate. The k cat values of citrate synthase of Crabtree negative species are higher (P = 0.008), which would benefit metabolic flux from entering the TCA cycle ( Fig. 6a,c). This is consistent with 13 C-metabolic flux analysis that showed that Crabtree negative species have higher TCA flux 39,40 .

Discussion
The diversity of biochemical reactions and organisms makes it difficult to generate genome-scale k cat profiles. Here we presented the deep learning approach DLKcat to predict k cat values of all metabolic enzymes against their substrates, requiring only the substrate SMILES information and protein sequences of the enzymes as input, yielding a versatile k cat prediction tool for any species.
DLKcat can capture k cat changes towards precise single amino acid substitutions, enabling attention weight calculations that identify the amino acid residues majorly impacting enzyme activity. Amino acid substitution is a powerful technique in the enzyme evolution field and routinely used to probe enzyme catalytic mechanisms 41,42 . Particularly, most substitution experiments perform mutagenesis in the substrate binding site region, since it is hypothesized that the binding region would have a high impact towards catalytic activity. However, it has been reported that remote regions can have a profound impact on catalytic activity 43,44 . Here, we identified not only high attention weights for amino acid residues in the inosine binding region of human PNP enzyme, but also various non-binding residue sites with high attention weights, suggesting that those residues may also majorly impact catalytic activity and deserve further validation. DLKcat can thereby serve as a valuable part of the protein engineering toolbox 45,46 .
Predicted genome-scale k cat profiles can facilitate the reconstruction of enzyme-constrained models of metabolism, from both curated and automatically generated basic (non-ec) GEMs. The deep learning-predicted k cat process proved to be a more comprehensive but still practical alternative to matching in vitro k cat values from the BRENDA 4 and SABIO-RK 5 databases, as is common in original-ecGEM reconstruction pipelines such as the GECKO and MOMENT 2,8,47 . By not depending on EC number annotation, DLKcat is furthermore able to predict isozyme-specific k cat values, while the use of SMILES (matching via the PubChem 48 or MetaNetX 49 databases) avoids the issues of ununified substrate naming between the GEM and BRENDA that original-ecGEM reconstruction pipelines can experience. The DL-ecGEMs can subsequently be adjusted to existing experimental growth data through a Bayesian approach that yields posterior-mean-ecGEMs with physiologically relevant solution spaces. Combined, the current DLKcat-based pipeline is therefore applicable to ecGEM reconstruction for virtually any organism for which a protein sequence FASTA file and a basic GEM is available. Our pipeline hereby improves applicability, and it even improves the number of reactions with enzymatic constraints in comparison with original-ecGEMs that have previously been constr ucted 2,[8][9][10][11][12]50 . Even though the DLKcat-based pipeline yields ecGEMs with superior performance over original-ecGEMs, various challenges remain. For example, while our deep learning model can distinguish alternative from randomly chosen substrates for promiscuous enzymes (Fig. 2c), it still predicts a level of kinetic activity towards random substrates that is likely too high. This behaviour can be explained by the limited availability of negative data: cases where an enzyme-substrate pair did not result in catalysis. Increased reporting of negative datasets, where non-detected activity for enzymesubstrate pairs are reported and collected by enzyme databases, could enhance future deep learning models in terms of defining true negatives 46 . In addition, DLKcat did not consider the effect of environmental factors such as pH and temperature, but combining DLKcat with other emerging machine learning tools, such as for enzyme optimal temperature prediction, would enable future investigation on the impact of environmental parameters on enzyme activities 32 .
Another challenge relates to reactions involving multiple substrates and those catalysed by heteromeric enzyme complexes. The multiple substrate SMILES and protein sequences that can be defined for such reactions can all function with DLKcat, thereby yielding multiple predicted k cat values for one reaction. We currently select the maximum k cat values in those cases, but it would be favourable to devise an approach that can predict one k cat value for each multi-substrate and/or heteromeric enzyme.
In addition, DLKcat-derived DL-ecGEMs and posterior-mean-ecGEMs inherit limitations from basic (non-ec) GEMs, where the steady-state assumption that is central to constraint-based modelling allows one to determine metabolic fluxes but does not readily consider regulatory behaviours. While ecGEMs drastically reduce the solution space of constraint-based models to cellular feasible capacities, k cat is not the only kinetic parameter that determines reaction rate, as for example, affinity constants play influential roles. However, as constraint-based models cannot predict internal metabolite concentrations, it is currently not feasible to readily consider the influence of those parameters. Nonetheless, k cat values are also important parameters in other resource allocation models such as proteome-constrained GEMs 51-53 and metabolism/macromolecular-expression models 7,54,55 . Despite improved predictions and more applications, how to define k cat values has also remained a challenge in the reconstruction of those models. Such resource allocation models and ecGEMs share the assertion that cells need to allocate their limited proteome to different pathways to achieve faster growth or better fitness, while the proteome cost for each reaction is similarly defined by the flux and the kinetic rate of the enzyme. Deep learning-predicted k cat values for the metabolic parts of those models can therefore improve their quality and performance, although other challenging kinetic parameters, for example, ribosomal catalytic rates, to be determined in those model formulations cannot be obtained from DLKcat. In addition, model formulations that particularly focus on describing enzyme kinetics 56 could benefit from deep learning-predicted k cat values, so that our DLKcat approach can find a broad application in the modelling field.
In conclusion, we showed that DLKcat yields realistic k cat values that can be used to direct future genetic engineering, understand Original-ecGEMs were constructed following the pipeline to extract k cat profiles from BRENDA and SABIO-RK; DL-ecGEMs were constructed from DLKcat-predicted k cat profiles; and posterior-mean-ecGEMs were parameterized with mean k cat values from 100 posterior datasets after the Bayesian training process. Culture conditions for the labels on the x axis of those proteome datasets can be found in Supplementary enzyme evolution and reconstruct ecGEMs to predict metabolic fluxes and phenotypes. Besides that, we envision many other possible uses of this deep learning-based k cat prediction tool, such as a tool in genome mining and Genome-Wide Association Studies analysis. The developed automatic Bayesian ecGEM reconstruction pipeline will be instrumental for further use in ecGEM reconstruction, for omics data incorporation and analysis.

Methods
Dataset preparation for deep learning model development. The dataset used for deep learning model construction was extracted from the BRENDA 4 and SABIO-RK databases 5 on 10 July 2020 by customized scripts via application programming interface. We generated a comprehensive dataset including the substrate name, organism information, EC number, protein identifier (UniProt ID), enzyme type and k cat values. As the overall majority of k cat values reported in BRENDA and SABIO-RK do not specify their assay conditions, such as pH and temperature, we decided not to include the features in order to maintain the training dataset size and variety. In addition, substrate SMILES, a string notation to represent the substrate structure, was extracted using substrate name to query the PubChem compound database 48 , which is the largest database of chemical compound information and is easy to access 57 . As different substrates usually have various synonyms in different databases and GEMs, we used a customized Python-based script to ensure that the same canonical SMILES information could be output for the same substrates with various synonyms, which is essential to help filter redundant entries obtained from different databases. Several rounds of data cleaning were performed to ensure quality ( Supplementary Fig. 2). Protein sequences were queried with two methods: for entries with UniProt ID information, the amino acid sequences could be obtained via the application programming interface of the UniProt 58 with the help of Biopython v.1.78 (https:// biopython.org/); and for entries without UniProt ID, the amino acid sequences were acquired from the UniProt 58 and the BRENDA 4 databases based on their EC number and organism information. After that, the sequences of those entries with wild-type enzymes were mapped directly, and the sequences of those entries with mutated enzymes were changed according to the mutated sites. Finally, the remaining entries formed the high-quality dataset for deep learning model construction. Detailed numbers for the data cleaning can be found in Supplementary Fig. 2.

Construction of the deep learning pipeline.
In this work, we developed an end-to-end learning approach for in vitro k cat value prediction by combining a GNN for substrates and a CNN for proteins. The integration of GNN and CNN can be naturally used to handle pairs of data with different structures, that is, molecular graphs and protein sequences. In this approach, substrates are represented as molecular graphs where the vertices are atoms and the edges are chemical bonds, while proteins are represented as sequences in which the characters are amino acids.
For substrates, there are just a few types of chemical atoms (for example, carbon and hydrogen) and chemical bonds (for example, single bond and double bond). To obtain more learning parameters, we employed r-radius subgraphs to get the vector representations, which are induced by the neighbouring vertices and edges within radius r from a vertex 59 . First, substrate SMILES information was converted to a molecular graph using RDKit v.2020.09.1 (https://www.rdkit.org). Given a substrate graph, the GNN can update each atom vector and its neighbouring atom vectors transformed by the neural network via a nonlinear function, for example, ReLU (ref. 60 ). In addition, two transitions were developed in the GNN, including vertex transitions and edge transitions. The aim of transitions is to ensure that the local information of vertices and edges is propagated in the graph by iterating the process and summing neighbouring embeddings. The final output of the GNN is a set of real-valued molecular vector representations for substrates.
Similarly, by using the CNN to scan protein sequences, we can obtain low-dimensional vector representations for protein sequences transformed by the neural network via a nonlinear function, for example, ReLU. To apply the CNN to proteins, we defined 'words' in protein sequence and split a protein sequence into an overlapping n-gram (n = 1, 2, 3) of amino acids 61 . In this work, to avoid low-frequency words in the learning representations, a relatively smaller n-gram number of 1, 2 or 3 was set. Then, we translated protein sequences into various word embeddings. Following this, the CNN used a filter function, shown in equation (1), to compute the hidden vectors from the input word embeddings and weight matrix. After that, we obtained a set of hidden vectors for these split subsequences based on n-gram amino acid splitting.
where f is a nonlinear activation function (for example, ReLU); W conv is the weight matrix and b conv is the bias vector; i and t are the serial numbers of a set of hidden vectors; and c i (t) and c i (t-1) are the hidden vectors for the protein sequence. Also, other important parameters of the neural networks (CNN and GNN) were set as follows: number of convolutional layers in CNN, 2, 3 or 4; number of time steps in GNN, 2, 3 or 4; window size, 11 (fixed); r-radius, 0, 1 or 2; and vector dimensionality, 5, 10 or 20. These different settings were explored based on the coefficient of determination (R 2 ) in equation (2) during the hyperparameter tuning to find which hyperparameter is better for improving the deep learning performance. The R 2 was calculated by scikit-learn v.0.23.2 (https://scikit-learn.org/stable/). And finally, we used the optimal hyperparameters to train our deep learning model.
where y ip is the predicted k cat value, y ie is the experimental k cat value, ȳ is the average of the experimental k cat values and n is the total number of items in the dataset (validation dataset or test dataset). After the acquisition of the substrate molecular vector representations and the protein sequence vector representations, we concatenated them together along  with an output vector (k cat value) to train the deep learning model using the neural attention mechanism 59 . During the training process, all the datasets were shuffled at the first step, and then were randomly split into a training dataset, validation dataset and test dataset at the ratio of 80%:10%:10%. Given a set of substrateprotein pairs and the k cat values in the training dataset, the aim of the training process is to minimize its loss function. The best model was chosen according to the minimal r.m.s.e., shown in equation (3) where y ip is the predicted k cat value, y ie is the experimental k cat value and n is the total number of items in the dataset (validation dataset or test dataset).
Enzyme promiscuity analysis based on deep learning model. For enzyme promiscuity, we explored whether the deep learning model can identify substrate preference for promiscuous enzymes. For each promiscuous enzyme, we defined that the substrate with the highest k cat value was considered as the preferred substrate, while those with k cat values less than the maximum value were classified as alternative substrates. Random substrates were randomly chosen from the compound dataset in our training data, except for the documented substrates and products for the tested enzyme. By using the deep learning model, we further predicted and compared the k cat values for the preferred, alternative and random substrates on various promiscuous enzymes. In order to identify high-quality promiscuous enzymes, entries with an experimentally measured k cat value less than -2 (s -1 ) in a log 10 scale were excluded in this analysis.

Validation of deep learning-based k cat values.
According to the classification of metabolic pathways, metabolic contexts were mainly divided into four different subsystems: (1) primary metabolism (carbohydrate and energy), involving the main carbon and energy metabolism, for example, glycolysis/gluconeogenesis, TCA cycle, pentose phosphate pathway, and so on; (2) primary metabolism (amino acids, fatty acids and nucleotides); (3) intermediate metabolism, related to the biosynthesis and degradation of cellular components, such as coenzymes and cofactors; and (4) secondary metabolism 6 . To explore the metabolic subsystems for all of the wild-type enzymes in the experimental dataset, the module in the KEGG database 62 was used to assign metabolic pathways for enzyme-substrate pairs by linking the detailed metabolic pathway in the KEGG application programming interface with the EC number annotated in each enzyme-substrate pair. Detailed classification can be found in Supplementary Table 1. Using the trained deep learning model, the predicted k cat values were generated for all the enzymesubstrate pairs.
Interpretation of the reasoning of deep learning. To interpretate which subsequences or residue sites are more important for the substrate, the neural attention mechanism was employed by assigning attention weights to the subsequences 27 . A higher attention weight of one residue means that that residue is more important for the enzyme activity towards the specific substrate. Such attention weights were modelled based on the output of the neural network. The mathematical equations for the neural attention mechanism are shown as follows: h substrate = f(Wintery substrate + b) where C is a set of hidden vectors for the protein sequence, c 1 (t) to c n (t) are the sub-hidden vectors for the split subsequences, y substrate is the substrate molecular vector, W inter and b are the weight matrix and the bias vector in the neural network, respectively, f is a nonlinear activation function (for example, ReLU), α i is the final attention weight value, σ is the element-wise sigmoid function, and T is the transpose function.
A defined protein could be split into overlapping n-gram amino acids and calculated as a set of hidden vectors in equation (4). Given a substrate molecular vector y substrate and a set of protein hidden vectors, the substrate embeddings (h substrate ) and subsequence embeddings (h i ) could be output based on the neural network, as shown in equations (5) and (6). By considering the embeddings of y substrate , the attention weight value for each subsequence was accessible in equation (7), which represents the importance signals of the protein subsequence towards the enzyme activity for a certain substrate.
Prediction of k cat values for 343 yeast/fungi species. The GEMs of 343 yeast/fungi species were automatically reconstructed in our previous paper 14 from a yeast/ fungi 'pan-GEM' , which was derived from the well-curated Yeast8 of S. cerevisiae combined with the pan-genome annotation. For each model, all reversible enzymatic reactions were split into forward and backward reactions. Reactions catalysed by isoenzymes were also split into multiple reactions with one enzyme complex for each reaction. Substrates were extracted from the model and mapped to the MetaNetX database to get SMILES information using annotated MetaNet identifiers (IDs) for metabolites 49 . Protein IDs for the enzymes were from the model grRules. Protein sequences were queried by the protein ID in the protein FASTA file for each species. Reaction IDs, substrate names, substrate SMILES information and protein IDs were combined as the input file for the deep learning k cat prediction model.

Analysis of k cat values and dN/dS for yeast/fungi species.
In a previous study, the genomes of 343 yeast/fungi species combined with comprehensive genome annotations were publicly available 63 . The gene-level dN/dS of gene sequences for pairs of orthologous genes from the 343 species were calculated with yn00 from PAML v.4.7 (ref. 64 ). For this computational framework, the input is the single-copy ortholog groups, and the output is the gene-level dN/dS values extracted from the PAML output files. By mapping the predicted k cat values with the gene-level dN/dS values via the bridge of protein ID, a global analysis was performed between the k cat values and the dN/dS values for 343 yeast/fungi species across the out-group (11 fungal species) together with 12 major clades divided by the genus-level phylogeny for 332 yeast species. ecGEM reconstruction. Besides the constraints in basic (non-ec) GEM, shown in equations (8) and (9), ecGEMs are reconstructed by adding enzymatic constraints, shown in equations (10) and (11).
in which S is the stoichiometry matrix and v is the flux vector. This equation is the representative of the steady-state assumption of the metabolic model to constrain the mass balance.
lbj ≤ vj ≤ ubj (9) in which lb and ub are the lower bound and upper bound of the rate for the reaction j.
where v j stands for the metabolic flux (mmol gDW -1 h -1 ; gDW, gram dry weight) of the reaction j; [E i ] stands for the enzyme concentration for the enzyme i that catalyses reaction j; and k i,j cat is the catalytic turnover number for the enzyme catalysing reaction j. This constraint is applied to all enzymatic reactions with available k cat values. Additionally, we added reactions to draw protein mass from the total protein pool to each enzyme, therefore, a mass balance constraint was proposed as: where θ is the fraction of metabolic protein in the total protein content of the cell. This equation means that the sum enzyme usage should be lower or equal to the total metabolic protein abundance.
To compare the different k cat value assignment approaches, we built ecGEMs parameterized with three types of k cat values: original-ecGEMs, DL-ecGEMs and posterior-mean-ecGEMs.
Original-ecGEM reconstruction queried k cat values from the BRENDA database by matching the EC number, a method that relies heavily on the database EC number annotation for the specific species 2,8 . Since more than 200 out of 343 yeast/fungi species are not annotated in UniProt 58 and KEGG 62 , EC numbers for orthologs annotated in S. cerevisiae were borrowed to facilitate the original-ecGEM reconstruction process for all these 343 species. The k cat extraction process used the criteria from process 13 in the reconstruction methods of the reference 47 .
DL-ecGEM reconstruction extracted all k cat values from the deep learning predicted file. To assign a k cat value for each metabolic reaction, we followed these criteria: If the in vitro k cat measurement with matched substrate and enzyme was available, then the measured in vitro k cat values were used rather than the k cat prediction. This pipeline also accepted the user's input for the k cat values. For enzymes with no k cat measurement, predicted k cat values were used after the following steps: k cat values predicted for currency metabolites such as H 2 O and H + were excluded; if there were multiple substrates in the reaction, maximum values among the substrates were kept; and if multiple subunits existed in the enzyme complex, we used the maximum value among all subunits to represent the k cat for the complex. Subunit protein stoichiometry information was multiplied before comparison. We assumed the same enzyme complex stoichiometry information for yeast species as that of S. cerevisiae, which is collected from the Protein Data Bank in Europe database (https://www.ebi.ac.uk/pdbe/) as well as the Complex Portal (www.ebi.ac.uk/complexportal).
Posterior-mean-ecGEM reconstruction was parameterized by mean k cat values from accepted posterior distribution. The k cat values in the DL-ecGEMs combined with the r.m.s.e. (which is 1 in the log 10 scale) of the k cat prediction were used as mean values and variance to make the prior distribution. Each k cat value was described with a log normal distribution N(k cat i , 1). This prior iteratively morphs into a posterior through multiple generations 32 . For each generation, we sampled 126 k cat datasets within the distribution; 100 among those 126 datasets with a smaller distance (see next section for the SMC-ABC distance calculation) between the phenotype measurements and predictions, which can better represent the phenotype, were kept to make the distribution for the next generation. Until the distance was lower than the cut-off (r.m.s.e. for phenotype prediction of 1), we accepted the final distribution as the posterior distribution 32 .

SMC-ABC distance function.
Experimental growth data and related exchange rates in batch and chemostat conditions were collected for the yeast/fungi species, which are available in Supplementary Table 5. The distance function was designed as the r.m.s.e. between the simulated and experimental phenotypes. To have a metric for the variance of phenotype prediction of both flux and maximum growth potential, r.m.s.e. was designed in two parts (each part may contain multiple measurement entries such as growth with a different medium). The first part addressed flux prediction. This part checks whether the model predicts similar fluxes when the carbon uptake rate is constrained, as experimentally measured. In this part, all data points for the species are used, and all measured exo-metabolite exchange fluxes are used for comparison. The second part addresses the prediction of the maximum growth rate potential. This part checks the maximum growth rate of the model prediction against the experimental measurement for one species on a certain experimentally tested medium. In this part, only the batch condition with maximum growth rate measurement was tested. No carbon uptake rate or other exchange rate was constrained in the model. Growth maximization was set as the objective function. After simulation, only the maximum growth rate and the carbon uptake rates were used for comparison with measurement.
After running the above two parts of the simulations, the r.m.s.e. for each part can be calculated. All measured and simulated rates were normalized by multiplying the carbon numbers of the corresponding metabolites before calculation of r.m.s.e. The carbon number for biomass is 41 (the mean value for the molecular weight of 1 carbon moles (Cmol) biomass of yeast is ~24.42 g (ref. 65 ); the biomass equals 1,000 mg). Note that if the substrate or by-product does not contain any carbon, such as O 2 , then the normalizing number is 1. Then the average r.m.s.e. of both simulations was used to represent the distance. The SMC-ABC search stopped once the r.m.s.e. reached the accepted value or reached the maximum generation. The accepted value for the distance was set to be lower than 1, and the maximum generation was set to be 100.

Simulations with ecGEMs.
We performed different kinds of simulations using the ecGEMs, including simulations of growth and protein abundance. Different media and growth conditions were set to match the experiment measurement conditions, for example, using xylose as the carbon source or anaerobic conditions. Since there are no measured total protein abundances in the biomass for all yeast/fungi species, we used the protein content mass to serve as the default total protein abundance for each species and used a factor of 0.5 to serve as the ratio of the metabolic protein to the total protein.
As for the protein abundance simulation, the medium was set to match the experimental condition as mentioned above. For the chemostat condition, the growth rate was fixed as the dilution rate, and the carbon source uptake rate was minimized, which is a normal set-up for the simulation of the chemostat condition. For the batch condition, the growth rate maximization was used as the objective. Then, the simulated protein abundances, which can be extracted from the fluxes, were compared with those in collected proteome datasets. The MATLAB (2019b), COBRA (v.3.2) 66 , RAVEN (v.2.4) 67 and libSBML (v.5.17.0) toolboxes were used in the process with solver IBM ILOG CPLEX optimizer. Violinplot-Matlab (https://github.com/bastibe/Violinplot-Matlab) was used for the visualization of violin plots.
Statistical tests for Bayesian approach. Sampled prior and posterior k cat datasets were compared for the difference in the mean values and the variance. Welch's t-test was used to test the significance for the mean values, while a one-tailed F-test was used for the reduced variances. The cut-off for the significance was set to 0.01 for the adjusted P value corrected by the Šidák method. PVAL_ADJUST (https:// github.com/nunofachada/pval_adjust) was used in the analysis.
Proteome data processing. We normalized the collected relative proteome datasets using the identical condition of the absolute proteome data from the literature following the same method as in ref. 68 . The reference absolute datasets for those relative proteome datasets were documented in the collected file in the GitHub repository.

Calculation of protein cost and efficiency.
To calculate the protein cost of the HY pathway, the glucose uptake rate was fixed at 1 mmol gDW -1 h -1 , and the non-growth associated maintenance energy (NGAM) reaction was maximized. The total protein pool reaction was then minimized by fixing the NGAM reaction at the maximized value. The minimized flux through the total protein pool reaction is the protein cost of the HY pathway for converting one glucose to ATP. As for the protein cost calculation of the LY pathway, the glucose uptake rate was fixed at 1 mmol gDW -1 h -1 , and ethanol production was maximized. Then the ethanol exchange rate was fixed at the maximized value, and NGAM was maximized. After that, NGAM was also fixed at the maximized value, and the total protein pool was minimized to calculate the protein cost for the LY pathway. We also examined the flux distribution to ensure that other energy-producing pathways were all inactive during this simulation. Protein efficiency is defined as the protein cost for producing one flux ATP in each pathway.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability
To facilitate further usage, we provide all codes and detailed instruction in the GitHub repository: https://github.com/SysBioChalmers/DLKcat. A user-friendly example for k cat prediction is also included in the repository.