Machine learning discovery of missing links that mediate alternative branches to plant alkaloids

Engineering the microbial production of secondary metabolites is limited by the known reactions of correctly annotated enzymes. Therefore, the machine learning discovery of specialized enzymes offers great potential to expand the range of biosynthesis pathways. Benzylisoquinoline alkaloid production is a model example of metabolic engineering with potential to revolutionize the paradigm of sustainable biomanufacturing. Existing bacterial studies utilize a norlaudanosoline pathway, whereas plants contain a more stable norcoclaurine pathway, which is exploited in yeast. However, committed aromatic precursors are still produced using microbial enzymes that remain elusive in plants, and additional downstream missing links remain hidden within highly duplicated plant gene families. In the current study, machine learning is applied to predict and select plant missing link enzymes from homologous candidate sequences. Metabolomics-based characterization of the selected sequences reveals potential aromatic acetaldehyde synthases and phenylpyruvate decarboxylases in reconstructed plant gene-only benzylisoquinoline alkaloid pathways from tyrosine. Synergistic application of the aryl acetaldehyde producing enzymes results in enhanced benzylisoquinoline alkaloid production through hybrid norcoclaurine and norlaudanosoline pathways.

S train engineering is now a reliable approach to scale up the production of target metabolites by integrating known genes, and applying simple yet effective metabolic engineering strategies 1 . But engineering the microbial production of secondary metabolites reaches the limitation of the characterized enzymes present in sequence databases, where many annotations are incorrect. In reality, there are millions of enzyme variants to choose from for each desired reaction, and a great abundance of variations are still hidden in nature with unknown sequence and function. In this way, the evolution of nature over millions of years can be viewed as a highly diverse screening resource for synthetic biologists. Accordingly, the rational discovery of homologous enzyme sequences with useful functions is a powerful and inevitable approach to improve microbial bioproduction pathways [2][3][4][5][6] .
Functional prediction of uncharacterized enzyme sequences is a promising approach to increase the number of specialized enzymes that can be applied to biosynthesis 5,7,8 . In our previous study, aromatic acetaldehyde synthase (AAS) was predicted with the enzyme selection software M-path to improve the production of valuable alkaloids 7 . However, only enzyme commission (EC) number could be predicted with M-path and the actual selection of candidate sequences had to be performed by human intuition. To address this issue, a support vector machine (SVM) algorithm was developed to automatically select specific enzyme sequences: 8 an upgrade that enables computer automated Design, Build, Test and Learn (DBTL) cycles.
Conversion of tyrosine to benzylisoquinoline alkaloid (BIA) is a model example of metabolic engineering ( Fig. 1) 1,4,5,7 . BIAs are precursors to opioid analgesic medications that are currently massproduced by industrially grown Papaver somniferum plants, which are a historical target for the human-directed evolution of natural product production. While opioid misuse is a global problem, natural and semi-synthetic opioids derived from the BIA reticuline actually result in fewer deaths than less expensive and overly potent synthetic opioids (CDC Opioid Data Analysis and Resources). With diverse potential, natural BIAs have been shown to inhibit coronavirus 9 , and the BIA norcoclaurine is a β2-adrenergic receptor agonist that is present in edible plants, medicinal herbs and sports supplements 10,11 .
BIA production in Escherichia coli has utilized bacterial monoamine oxidase and insect 3,4-dihydroxyphenylacetaldehyde synthase (DHPAAS) to generate toxic 3,4-dihydroxyphenylacetaldehyde (DHPAA) 12 . However, the DHPAA containing pathways result in rapid loss of unstable catechol containing intermediates 7,[12][13][14][15] . Other reports show that plants use a 4-hydroxyphenylacetaldehyde (4HPAA) pathway to norcoclaurine ( Supplementary Fig. 1), which may be more stable due to lack of a catechol group in early intermediates. Therefore, plant 4HPAA pathways offer the potential to prevent the loss of BIA intermediates in E. coli. Furthermore, the combination of 4HPAA and DHPAA pathways may also improve the utilization of tyrosine and aryl acetaldehydes. Despite success with the 4HPAA pathway in yeast 1,4,16,17 and many discussions on the expected phenylpyruvate decarboxylase (PPDC, EC 4.1.1.43) and AAS (EC 4.1.107-9) 18-20 activities in plants, no enzymes to produce aryl acetaldehydes 4HPAA or DHPAA have been characterized from high alkaloid producing poppy plants 21 . Moreover, no plant sequence annotated as phenylpyruvate decarboxylase can be found from public databases, and numerous P. somniferum cytochrome P450 (CYP450) monooxygenases (EC 1.14.14) require complex clarification. Therefore, this serious limitation in known enzymes is addressed by applying machine learning to predict the specialized missing links in plant alkaloid pathways shown as dotted arrows in Fig. 1 and Supplementary Fig. 1.
In this study, eight refined SVM models are built and applied to automate the selection of target sequences from over 100 candidates present throughout highly duplicated carboxy-lyase and oxidase gene families. Then, to verify the machine learning predictions, approximately 50 strains expressing various combinations of candidate sequences are screened using liquid chromatography-mass spectrometry (LC-MS)-, capillary electrophoresis-MS (CE-MS)-and gas chromatography-MS (GC-MS)-based metabolomics. As a result, AAS and PPDC are identified as potential missing links that mediate uncharacterized branches of the Papaver somniferum alkaloid pathway. The synergistic combination of predicted enzymes together with homologous enzyme templates affords 356 µM norcoclaurine, 240 µM N-methylcoclaurine and 74.9 µM reticuline, without using any genome engineering. The alternative branches of flux from tyrosine to downstream alkaloids are confirmed using dynamic metabolic profiling 5 with mechanism-directed deuterium labeling patterns.

Results
Prediction and discovery of P. somniferum aromatic acetaldehyde synthase. DHPAA and norlaudanosoline (also referred to as tetrahydropapaveroline or THP) are more easily oxidized and more toxic than their corresponding 4-hydroxyphenyl analogues 12 . Therefore, missing link enzymes to 4HPAA and norcoclaurine are explored to test our machine learning enzyme selection models (Fig. 2). Our previous M-path analysis identified 4-hydroxyphenylacetaldehyde synthase (4HPAAS, EC 4.1.1.108) to mediate 4HPAA production from tyrosine; however specific 4HPAAS sequences are incompletely annotated throughout databases. In this study the term AAS is used to cover plant-type AAS enzymes 4HPAAS and phenylacetaldehyde synthase (PAAS, EC 4.1.1.109), as well as insect 3,4-dihydroxyphenylacetaldehyde synthase (DHPAAS, EC 4.1.1.107), because substrate specificities are often mixed throughout these groups.
Unclear variations within the plant-type AAS group, which may act upon a wide range of substrates including phenylalanine, tyrosine, 3,4-dihydroxy-L-phenylalanine (L-DOPA), tryptophan, and histidine, further complicates the selection of a correct sequence based on phylogenetic and structural analyses alone. Accordingly, no AAS enzyme from P. somniferum has been clearly established 21 . To overcome this challenge in enzyme prediction, our SVM-based algorithm 8 is first applied to select AAS from P. somniferum homologs annotated as tyrosine/DOPA decarboxylase (TyDC) (Fig. 2c, d, and Supplementary Table 1).
Separate SVM models for pyridoxal 5-phosphate (PLP)-dependent aromatic amino acid decarboxylase (AAAD) and closely related PLP-dependent AAS were trained using sequences classified as described in the methods section (Supplementary Data 1 and Supplementary Data 2). According to database annotations and previous reports, P. somniferum TyDC (PsTyDC) proteins should be expected to catalyze the decarboxylation of tyrosine to form tyramine, and possibly L-DOPA conversion to dopamine 22,23 . In contrast, SVM decision scores show that while most of the fulllength PsTyDC sequences have high potential for AAAD activity, PsTyDC1-8 also appear in AAS prediction space (Fig. 2d and  Supplementary Table 1). Here, higher positive SVM decision scores indicate sequences that are further from the SVM prediction boundary, deeper within the positive prediction space.
PsTyDC1 contains the unique active site residue L205 (Fig. 2b), further suggesting atypical activity of this test sequence, and PsTyDC1 is therefore first selected to explore demonstration level (Fig. 1c) prediction of AAS. In accordance with the SVM prediction, expression of wild-type PsTyDC1 in E. coli promotes in vivo production of norcoclaurine from tyrosine and dopamine (Fig. 2e). As a positive AAS control, PsTyDC1-Y98F-F99Y-L205N with engineered active site residues transplanted from insect DHPAAS, also produces similar results to those of wild-type PsTyDC1. After substitution of PsTyDC1-L205 to a histidine residue, found in typical AAAD, the decarboxylation product tyramine increases dramatically ( Fig. 2e and Supplementary Fig. 2). Production of norcoclaurine is further confirmed in strains expressing PsTyDC1 with additional variations in the alkaloid pathway (Supplementary Fig. 3 and Supplementary Table 2).
While norcoclaurine is detected in vivo with the expression of PsTyDC1, in vitro production of unstable 4HPAA by PsTyDC1 could not be detected, possibly due to low activity of PsTyDC1. Therefore, the SVM models are investigated further to select a better AAS candidate that might not be suggested by structural analysis. Despite containing AAAD-like active site residues Y98, F99, H205, Y350, and S372 (Fig. 2b), PsTyDC6 scores highest in the AAS prediction model (Fig. 2d and Supplementary Table 1). Therefore, PsTyDC6 is further selected for in vitro characterization. Interestingly, PsTyDC6 and PsTyDC1 share over 98% sequence identity, which is the highest sequence identity among the entire PsTyDC family, and PsTyDC6 is accordingly annotated as 'tyrosine/DOPA decarboxylase 1-like'.
In agreement with the high AAS decision score, PsTyDC6 exhibits AAS activity in the presence of tyrosine and L-DOPA (Fig. 3), thereby demonstrating discovery level (Fig. 1c) prediction of a plant AAS enzyme. Here, the in vitro AAS activity of PsTyDC6 is indicated by detection of 4HPAA by GC-MS, DHPAA by LC-MS as well as production of H 2 O 2 in a peroxidase-based fluorescent assay (Fig. 3b-d). PsTyDC6 also exhibits bifunctional AAAD activity which is indicated by the LC-MS detection of tyramine and dopamine as products of tyrosine and L-DOPA, respectively, and also by the production of downstream norlaudanosoline from L-DOPA and norcoclaurine from L-DOPA and tyrosine (Fig. 3d).
P. somniferum pyruvate decarboxylase promotes an alternative 4HPAA bypass pathway. Thiamine pyrophosphate (TPP)-dependent PPDC is an alternative to PLP-dependent AAS for the production of aryl acetaldehyde intermediates 4HPAA and DHPAA (Fig. 4). Previous reports hypothesize that P. somniferum should contain PPDC with specificity towards 4-hydroxyphenylpyruvate (4HPP) 21 Table 3). Two additional test candidates, P. somniferum PDC2 (PsPDC2) and a 2-hydroxyacyl-CoA ligase-like protein, score lower for PPDC prediction and result in lower in vivo production of downstream 4HPP decarboxylase products, in comparison to that of PsPDC1 (Fig. 4d, e, and Supplementary Fig. 4). The PPDC prediction model also suggests that truncated PsPDC1 isoform X1 (TrcPsPDC1-IX1) is a strong PPDC candidate sequence, and therefore this candidate is prepared for expression ( Fig. 4d and Supplementary Table 7). In vivo screenings with PsPDC1 reveal the alternative alkaloid route through 4HPP, and this PPDC bypass is distinct from the direct aromatic amino acid branch mediated by PsTyDC1 (Fig. 4e). Application of PsPDC1 for conversion of tyrosine through the 4HPP and 4HPAA containing pathway results in improvement in norcoclaurine titers to the >10 μM range (   Fig. 2 Prediction of AAS branch pathway enzymes to produce 4HPAA for norcoclaurine production. a The aromatic acetaldehyde synthase (AAS) branch pathway (green) can produce 4HPAA directly from tyrosine, but this missing link is unreported in P. somniferum. b Structure-based curation of typical aromatic amino acid decarboxylase (AAAD), insect-type AAS and plant-type AAS, as represented by the active site configurations of Pseudomonas putida DDC (PpDDC, blue), Bombyx mori DHPAAS (DHPAAS, grey), and Petroselinum crispum 4HPAAS (Pc4HPAAS, deepteal). AAS candidate PsTyDC1 (green) has a unique active site, while AAS candidate PsTyDC6 (green) has an AAAD-like active site and could not be predicted by a homology or structure-based approach alone. c Cross-validation for correct assignment of AAAD and AAS training sequences is performed using SVM models (blue), Random forests models (green), and by comparing sequence homology of each training sequence to a consensus sequence of AAS training sequences and a consensus sequence of AAAD training sequences (grey), as described in the methods section. d For visual representation, a two-dimensional plot of AAS SVM-based prediction is shown, with positive and negative prediction spaces colored green and white, respectively (left side). Principal component analysis (PCA) is used to compress multidimensional data into two dimensions (PC1 and PC2) for a visual representation. Corresponding high-dimensional SVM decision scores from Supplementary  Table 1 are shown on the right. Decision scores represent the distance from the SVM prediction boundary. PsTyDC1 and PsTyDC6 score highest for AAS prediction and are colored red. e LC-MS detection of products from Thalictrum flavum norcoclaurine synthase (TfNCS) containing strains T1-01-DE3 (wild-type PsTyDC1 + TfNCS), T1-02-DE3 (PsTyDC1-L205H + TfNCS) and T1-03-DE3 (PsTyDC1-Y98F-F99Y-L205N + TfNCS) (Supplementary Table 2 homologs of N-methylcoclaurine 3-hydroxylase (NMCH) are next considered to extend this pathway from N-methylcoclaurine to reticuline (Fig. 5). Currently, Saccharomyces cerevisiae BIA productions utilize characterized Eschscholzia californica NMCH (EcNMCH) for conversion of N-methylcoclaurine to 3-hydroxy-N-methylcoclaurine (3HNMC) 1,16,17,28 . There are several promising P. somniferum CYP450 sequences annotated as NMCH based on skillful characterizations in plants [29][30][31] . However, the presence of many additional CYP450 homologs in the P. somniferum genome complicates the selection of the best candidate sequence by non-experts. To automate the selection of optimal NMCH and CPR sequences, a SVM model was trained using positive training vectors derived from plant CYP80B sequences (Supplementary Data 5). 100 P. somniferum CYP450 sequences were then tested against this model to assist the selection of an optimal candidate (Fig. 5b, Supplementary Table 4). As a result of this demonstration level prediction (Fig. 1c), PsNMCH Isoform 1 (PsNMCH-I1) scored high against the model and was selected. A CYP450 reductase (CPR) redox partner for PsNMCH was selected based on the decision scores of an SVM model trained using the sequences in Supplementary Data 6. While a CPR sequence has been characterized from P. somniferum 32 , the referenced sequence AAC05021.1 is annotated as 'NADPH:ferrihemoprotein oxidoreductase', which may confuse the selection of this sequence as CPR by non-experts. Moreover, there are at least 8 other unique P. somniferum sequences with high CPR homology that have not been characterized. After testing the 8 additional P. somniferum candidates against the CPR SVM model, XP_026404029.1 is selected as a high scoring sequence (Fig. 5c and Supplementary Table 5), and observed to exhibit CPR activity (Fig. 5d). This CPR sequence is annotated as 'NADPH-cytochrome P450 reductase-like', and accordingly it is referred to as PsCPR-L in this manuscript. NMCH activity is evaluated by converting norcoclaurine to stable reticuline using NMCH and CPR variants expressed together with norcoclaurine 6-O-methyltransferase (6OMT), coclaurine N-methyltransferase (CNMT) and 3-hydroxy-Nmethylcoclaurine 4-O-methyltransferase (4OMT) ( Fig. 5d and Supplementary Table 2). N-methylcoclaurine accumulates much more than other intermediates in this system, and therefore stable reticuline titers should reflect the activity of the NMCH bottleneck. In this system, PsNMCH-I1 affords higher amounts of reticuline than that of EcNMCH, when paired with either PsCPR-L or AtATR2 (Fig. 5d). PsNMCH-I1 pairs best with PsCPR-L from the same species, resulting in the highest amount of reticuline. On the other hand, reticuline production with EcNMCH is best with AtATR2 pairing, with no improvement from PsCPR-L pairing.
Just one residue difference is observed when comparing the binding pockets of PsNMCH and EcNMCH: PsNMCH-H203 versus EcNMCH-Y202 (Fig. 5a). SVM prediction of PsNMCH-H203Y and EcNMCH-Y202H sequences results in lower and higher decision scores in comparison to those of wild-type sequences, respectively ( Fig. 5b   In te ns ity (re la tiv e un its )  Early in vivo tests of PsNMCH-I1 without a CPR redox partner in E. coli did not result in detectable NMCH activity, but L-DOPA production from tyrosine was detected ( Supplementary Fig. 3). This led us to hypothesize that PsNMCH-I1 might also have potential tyrosine 3-monooxygenase activity; however, the observed L-DOPA production is probably more likely to be mediated by native E. coli HpaBC. To further clarify this important missing link in P. somniferum, the candidate CYP450 monooxygenase sequences are also explored as potential tyrosine 3-monooxygenase templates (Supplementary Table 6  //www.rcsb.org/structure/2WVA], grey) and candidate PPDC sequence PsPDC1 (green). The modeled PsPDC1 active site contains Y332, which is also present in typical PDC enzymes which decarboxylate pyruvate. In this respect, the PsPDC1 active site is distinct from microbial PPDCs, which all contain smaller residues (red) in place of Y332 (Lactococcus lactis KdcA contains S286 corresponding to Y332). Yet, the presence of Y332 in PsPDC1 does not interfere with the docking of tyrosine into the PsPDC1 active site. c Cross-validation for correct assignment of PPDC model training sequences is performed using SVM models (blue), Random forests models (green), and by comparing sequence homology of each training sequence to a consensus sequence of PPDC training sequences and a consensus sequence of PDC training sequences (grey), as described in the methods section. d SVM-based prediction of putative PPDC sequences visualized in three dimensions by compressing high-dimensional data (Supplementary Table 3,  upper table) into two dimensions (PC1 and PC2) and plotting them together with two-dimensional decision scores. Prediction spaces with two-dimensional decision scores above and below 2 are colored green and grey, respectively. Prediction score trends for truncated PsPDC1 isoform X1 (TrcPsPDC1-IX1, red), PsPDC1 (red), PsPDC2 and Ps2HCLL (2-hydroxyacyl-CoA ligase-like) are similar in high dimensional models (Supplementary Table 3). e PsPDC1 mediates in vivo production of 4-hydroxyphenylethanol (tyrosol) through a 4HPAA intermediate (green), in M9 medium supplemented with 1.2 mM 4HPP at 25°C with 180 rpm shaking. Strain P1-01-AI, which contains PsPDC1 (red), mediates higher tyrosol production than that of strains P2-01-AI and P3-01-AI, which contain PsPDC2 (blue) and Ps2HCLL (magenta), respectively. PsPDC1 (red) mediates downstream production of norcoclaurine (NC) from LB supplemented with 5 mM tyrosine and 3.7 mM dopamine in strain P1-02-AI, at 20-25°C with 180 rpm shaking. Here, tyrosol is detected after 71 h, and norcoclaurine is detected after 41 h, from filtered and dried culture medium as trimethylsilyl (TMS)-derivatives using GC-MS, as shown in the extracted ion chromatograms (EICs). Detection of PsPDC1 products is replicated in Supplementary Fig. 4. relatively high scores in the positive prediction space of both highdimensional models of Supplementary Table 6.
Emergence of dual norcoclaurine and norlaudanosoline pathways via expression of P. somniferum decarboxylases. PsTyDC6 shares over 98% sequence identity with PsTyDC1 and is able to convert tyrosine and L-DOPA to norcoclaurine and norlaudanosoline (Fig. 3). Furthermore, co-expression of PsTyDC1 with TfNCS, PsNMCH-I1, Cj6OMT, CjCNMT, and Cj4OMT, results in a plant-gene only dual pathway through 4HPAA and DHPAA to norcoclaurine and reticuline (Supplementary Fig. 3). Therefore, the potential DHPAAS activity of PsTyDC1 is further explored to construct combined norcoclaurine and norlaudanosoline pathways (Fig. 6a). At the same time, PsPDC1 is also explored as a mediator of DHPAA production via decarboxylation of transaminated L-DOPA.
After incorporating L-DOPA decarboxylase (DDC) from Pseudomonas putida (PpDDC) for in vivo dopamine production and optimization in Terrific Broth (TB), PsPDC1 and PsTyDC1 containing strains produce reticuline from L-DOPA via the DHPAA pathway, with titers reaching the μM range (Fig. 6b). Previously, a single strain containing DHPAAS, 6OMT, CNMT, and 4OMT only produced reticuline titers of 0.2 μM from L-DOPA 7 . These results suggests that PsPDC1 can produce DHPAA from 3,4-dihydroxyphenylpyruvic acid (DHPP) that is supplied by L-DOPA transamination, and that PsPDC1 works synergistically with PsTyDC1 at later production times to promote high reticuline production in E. coli. Accordingly, combinations of PPDC and AAS are next explored to improve BIA titers.
Expanding the prediction models towards template enzyme engineering. The characterizations of PsTyDC1, PsTyDC6, and PsPDC1 indicate that these enzymes promote dual pathways in E. coli. However, the activity of PsTyDC1 is low under the conditions tested, while preliminary experiments show that PsPDC1    Table 4, and high-dimensional SVM decision scores are listed. c Two-dimensional SVM-based prediction of CPR sequences, with positive and negative prediction spaces colored green and white, respectively. With exception to positive training sequence AtATR2 (blue), all points represent tested P. somniferum sequences. PCA is used to compress multi-dimensional data into two dimensions (PC1 and PC2) for the visual representation. Corresponding high-dimensional SVM results are detailed in Supplementary Table 5, and high-dimensional SVM decision scores are listed. d Conversion of 1.2 mM norcoclaurine to reticuline, mediated by various combinations of NMCH and CPR, together with Ps6OMT, PsCNMT and Ps4OMT, in strains N1-01-DE3, N1-02-DE3, N1-03-DE3, N1-04-DE3, N2-01-DE3, N2-02-DE3, N2-03-DE3, N2-04-DE3 (Supplementary Table 2). Here, individual samples were analyzed 3 times (n = 3) to generate bar graphs in Prism 7 version 7.0d with error bars representing mean values +/− standard deviations. Source data are provided as a Source Data file. NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-022-28883-8 ARTICLE and TrcPsPDC1-IX1 expression in E. coli is unstable and toxic. Therefore, in order to quickly achieve in vivo titers high enough for dynamic metabolomic profiling, dual norcoclaurine and norlaudanosoline pathways are re-explored using homologous enzyme templates with stable expression in E. coli (Fig. 7). The concept of template enzyme engineering refers to the approach where useful features are identified from a specialized enzyme and those features are transplanted into a related template to confer some advantages. This is illustrated with the above EcNMCH-Y202H substitution where the corresponding H203 residue from PsNMCH is substituted to improve EcNMCH as the template enzyme. To further develop this methodology, the SVM enzyme selection algorithm is applied to evaluate multiple enzyme engineering substitutions for highly active template sequences, using PpDDC as a specific example (Fig. 7a).
AAS activity analogous to that of PsTyDC1 could be engineered into the bacterial PpDDC template by transplanting DHPAAS specific catalytic residues F79, Y80 and N181 (PpDDC numbering). Rationally engineered PpDDC-Y79F-F80Y-H181N mediates improved norlaudanosoline production in E. coli (Fig. 7a). Switching from PsPDC1 to a S. cerevisiae ARO10 template is observed to improve in vivo turnover of both DHPP (Fig. 7b) and 4HPP (Fig. 8a), in comparison to corresponding strains containing PsPDC1. However, the high activity of ARO10 may come at a specificity tradeoff, as the production of additional aromatic keto acid-derived alkaloids result from ARO10 expression (Fig. 7c).
Combinations of natural and analogous enzyme templates result in improved E. coli BIA production ( Fig. 8a and Supplementary Table 2). Expression of PpDDC-Y79F-F80Y-H181N together with PsPDC1 in strain P1-07-AI selectively promotes the DHPAA pathway in the presence of tyrosine and L-DOPA to produce 61.8 µM reticuline, while the application of ARO10 in strain A1-01-DE3 selectively favors the 4HPAA pathway in the presence of tyrosine and dopamine to produce 356 µM norcoclaurine and 240 µM N-methylcoclaurine. Dual pathway production of 112 µM N-methylcoclaurine and 74.9 µM reticuline is promoted through the combination of PpDDC-Y79F-F80Y-H181N, ARO10 and PsTyDC1 in strain A1-06-AI.
Dynamic metabolomic profiling of AAS and PPDC branch pathways. By tracing the turnover of isotope-labeled precursors and measuring the resulting fractions of isotope-labeled intermediates, metabolic flux can be observed, and this approach is referred to as dynamic metabolic profiling 5,36,37 . While multiple reaction monitoring (MRM) with LC-MS is sensitive, this method does not readily detect isotope-labeled intermediates. After improving BIA titers to μM levels suitable for quantification with high-resolution CE-MS, isotope tracing experiments could be performed. Combinations of PsPDC1, ARO10, PsTyDC1 and PpDDC produce various labeling patterns: tyrosine-13 C to BIA-13 C 2, L-DOPA-d 3 with tyrosine-d 4 to d 6 -labeled BIA, L-DOPA-d 3 to d 5 -labeled BIA, L-DOPA-d 3 with dopamine-d 2 to d 5 -labeled BIA, tyrosine-d 4 with dopamine-d 2 to d 6 -labeled BIA, and tyrosine-d 4 with dopamine to d 4 -labeled BIA (Fig. 8b, c, and Supplementary Fig. 5). The loss of a ring deuterium atom during NCS-mediated condensation of aryl acetaldehydes with ringlabeled dopamine is consistent with the reported NCS mechanism ( Fig. 8b and Supplementary Fig. 5d) 38,39 ; this kind of mechanismdirected deuterium labeling pattern has not been reported for the tracing of BIA [40][41][42] . Isotope tracing from L-DOPA-d 3 to d 5labeled BIA supports the bifunctional decarboxylase and oxidative deamination activities of PpDDC-Y79F-F80Y-H181N (Supplementary Fig. 5d). Improvement of N-methylcoclaurine-d 6 and reticuline-d 5 production via PsTyDC1 in addition to PsPDC1 again demonstrates the synergistic combination of these distinct enzymes (Fig. 8b). Moreover, amounts of N-methylcoclaurine-d 6 and reticuline-d 5 relative to their respective precursors norcoclaurine-d 6 and norlaudanosoline-d 5 (Supplementary Fig. 5b, c) show the bottleneck of the S-adenosylmethionine (SAM)dependent methylation of deuterium-labeled BIA. Furthermore, isotope tracing from tyrosine-13 C supports that PsPDC1 and ARO10 are converting isotope-labeled 4-hydroxyphenylpyruvic acid (4HPP) to downstream BIA (Supplementary Fig. 5a).
Dynamic metabolomic profiling of mixed fractions of unlabeled and labeled BIA, could be performed with high-titer norcoclaurine-d 4 and N-methylcoclaurine-d 4 production (Fig. 8c). In this case, a higher fraction of d 4 -labeled norcoclaurine relative to d 4 -labeled N-methylcoclaurine is consistent with the SAMdependent methyltransferase bottleneck observed previously 1,7 .

Discussion
This report demonstrates that machine learning can uncover missing link enzymes with direct applications to biomanufacturing. While previous studies have also reported machine learning for enzyme prediction, these examples were never applied to the discovery of uncharacterized enzymes [43][44][45][46][47]  kinases B were used as examples to verify their algorithm, however, this test data was obtained from the previous publications 45 . On the other hand, a study by Tietz et al. applied SVM to predict ribosomally synthesized and post-translationally modified peptides precursor peptides, of which some were validated experimentally 48 . Similarly, the current study demonstrates the paired prediction and experimental investigation of four kinds of plant enzymes. Furthermore, the possibility to engineer artificial enzymes is demonstrated by prediction of PsNMCH-H203Y, EcNMCH-Y202H (Fig. 5) and PpDDC-Y79F-F80Y-H181N (Fig. 7a), where scores are in agreement with in vivo test results. Therefore, the SVM prediction models of this study (Supplementary Tables 1 and 3 -6) can readily enable the discovery and engineering of specialized carboxy-lyases (EC 4.1.1.X), CYP450s (EC 1.14.X.X) and CPRs (EC 1.6.2.4). While the current machine learning method is shown to be superior to homologybased selection of PLP-dependent AAS and TPP-dependent PPDC sequences (Fig. 2c and Fig. 4c), additional studies should be pursued to demonstrate improved selection for other classes of enzymes.
PsPDC1 shows potential for in vivo PPDC activity and contains active site residue Y332, which is also present in ZmPDC that is known to only convert small non-aromatic substrates. This active site tyrosine is substituted with smaller residues in characterized yeast and bacterial PPDC enzymes (Fig. 4b), and therefore the structural basis of plant PPDC substrate recognition appears to be determined by other factors. Species-by-species variation in functional residues is also seen with the evolution of AAS variants throughout insects and plants. Insects have evolved a histidine to asparagine active site switch, corresponding to residue 192 of DHPAAS, to promote AAS activity essential for their survival (Fig. 2b) 7,12 . In the plant homologs, tyrosine is commonly substituted with a more hydrophobic phenylalanine (residue 346 of Petroselinum crispum 4HPAAS, Fig. 2b) to switch from AAAD to AAS activity. Yet the active site of PsTyDC6 resembles that of typical AAAD while still promoting AAS activity. These results with PsPDC1 and PsTyDC6 indicate that specialized PPDC and AAS activities may exist in other plant sequences that resemble typical carboxy-lyases. This insight also suggests that combinations of subtle structural features or emergent properties may be underlying the specialized activities of select plant carboxy-lyases. Accordingly, machine learning offers advantages over structural analysis to identify elusive emergent features in enzymes with specialized functions that cannot be predicted from structure or homology alone.
Transplantation of discovered functional residues into highactivity microbial templates is an effective strategy for improving bioproduction, as demonstrated by PpDDC-Y79F-F80Y-H181N with transplanted DHPAAS active site residues. In this example, the design of three amino acid substitutions, including the most critical H181 substitution that corresponds to PsTyDC1-L205, could be guided with the SVM prediction algorithm. Improved protein stability, removal of regulation/inhibition, and improved expression in host cells are additional factors that might contribute to improved templates. While PpDDC-Y79F-F80Y-H181N favors the DHPAA pathway, PsTyDC6 is capable of mediating both DHPAA and 4HPAA containing pathways. Similarly, expression of PsPDC1 and ARO10 are observed to promote the conversion of 4HPP to 4HPAA and DHPP to DHPAA, but strains expressing PsPDC1 favor the DHPAA containing pathway under the conditions tested (Fig. 8a, b). This illustrates that the control of flux through the norcoclaurine route versus the norlaudanosoline route may be controlled by the selection of specific PPDC and AAS templates.
Many sophisticated studies have elucidated key steps of plant BIA metabolism, and as a result many of the discovered enzymes have enabled the production of key intermediates and downstream medicinal compounds in yeast and E. coli 21,[28][29][30][31][32][49][50][51][52] . For high-level production of the key intermediate reticuline, current studies require NCS and plant OMTs, but their selection has generally been limited to a few sequences from P. somniferum, C. japonica and T. flavum 1,7,13-17 . Yeast studies have focused on the natural norcoclaurine route, and require the CYP450 enzyme NMCH to complete the pathway to reticuline. Yet in the recent S. cerevisiae reports, the first well-characterized NMCH from E. californica continues to be selected 1,16,17,28 . In light of this, machine learning-based enzyme prediction offers great potential to expand the choice of additional homologous sequences that might further boost product titers. For example, the predicted pathways with PsNMCH-I1 paired with PsCPR-L and EcNMCH-Y202H paired with AtATR2 produce higher reticuline titers than that of the conventional EcNMCH and AtATR2 combination. For production in E. coli, Matsumura et al. were able to achieve 160 mg/L (S)-reticuline using a MAO-dependent and NMCHindependent pathway through norlaudanosoline in E. coli 13 . In contrast, establishing the natural norcoclaurine pathway in E. coli produces high levels of extracted norcoclaurine (96.7 mg/L) and N-methylcoclaurine (71.8 mg/L) (Fig. 8a). Although yeast is better suited than E. coli for the expression of CYP450s including NMCH, production of expensive N-methylcoclaurine does not require NMCH. In addition, the dual pathway strain of the current study produces 24.6 mg/L reticuline and 33.6 mg/L Nmethylcoclaurine (Fig. 8a).
Considerable experimental evidence has indicated that plant BIA pathways proceed through norcoclaurine and that norlaudanosoline is not a natural metabolite of P. somniferum 50,52 . In contrast, the current characterizations of PsTyDC1, PsTyDC6 and PsPDC1 show that these enzymes promote the production of both norlaudanosoline and norcoclaurine in E. coli. This suggests that the norlaudanosoline pathway might also occur in plants; however, the precise effects of PsTyDC1 53 , PsTyDC6 53 and PsPDC1 within the natural pathway should be carefully elucidated in plants before drawing any final conclusions.
A dual pathway to norcoclaurine and norlaudanosoline in E. coli offers advantages for utilization of tyrosine, and for improving amounts of unstable aryl acetaldehydes relative to dopamine. Accordingly, increased aryl acetaldehyde production by synergistic expression of PsPDC1 together with PsTyDC1 results in increased reticuline through an enhanced norlaudanosoline pathway (Fig. 6 and Fig. 8b). These pathways were further and P1-07-AI contain the same genes, but P1-06-DE3 was induced with only IPTG, without including arabinose for PsPDC1 expression. Cultures containing PpDDC-T and L-DOPA were supplemented with additional sodium ascorbate. The BL21(AI) derived strain P1-07-AI was induced with IPTG and arabinose. For improved N-methylcoclaurine production, A1-01-DE3 was supplemented with the aldehyde reductase/dehydrogenase inhibitor gossypol. Additional culture conditions are described in the methods section. Extracted N-methylcoclaurine and reticuline were TMS-derivatized and analyzed with GC-MS (t = 40 h, n = 3). After extraction, cultures were stored at 4°C and stable norcoclaurine titers from culture medium were analyzed with LC-MS (n = 2). b Isotope profiling of strains P1-02-AI (expressing PsPDC1) and P1-04-AI (expressing PsPDC1 and PsTyDC1), which produce N-methylcoclaurine-d 6 (P1-02-AI -62 nM, P1-04-AI 160 nM) and reticuline-d 5 from tyrosine-d 4 and L-DOPA-d 3 (t = 61 h, n = 3). There is a synergistic improvement in BIA production when combining PsPDC1 and PsTyDC1. Here, NCS catalyzes the loss of a deuterium from in vivo generated dopamine-d 3 . c For tracing aromatic isotope flux from tyrosine to norcoclaurine and N-methylcoclaurine, alkaloids were extracted with ethyl acetate from the A1-01-DE3 culture 40 h after addition of tyrosine-d 3 and dopamine, according to the methods section. Extracted alkaloids were TMS-derivatized and analyzed with GC-MS (n = 3). After extraction, cultures were stored at 4°C and stable BIA titers from culture medium were analyzed with CE-MS (n = 3); the fraction of labeled BIA-d 4 and unlabeled BIA from natural tyrosine in the rich TB broth can be quantified. With exception to unlabeled norcoclaurine (n = 2), all other individual samples were analyzed 3 times (n = 3) to generate bar graphs in Prism 7, with error bars representing mean values +/− standard deviations. Source data are provided as a Source Data file. characterized by dynamic metabolomic profiling, an approach that can readily identify bottleneck targets for increasing metabolic flux to target compounds. In conclusion, machine learning discovery of missing links and homologous enzyme templates is now a realistic approach for assembling alternative routes and relieving bottlenecks in improved metabolic pathways.

Methods
Machine learning prediction. Support vector machine (SVM) Enzyme-models were built from enzyme amino acid sequence information using scikit-learn version 0.21.3 8 , and the resulting computer code was made publically available [https://github.com/nwatanbe/SVM_E_model]. Aromatic amino acid decarboxylase (AAAD), aromatic acetaldehyde synthase (AAS, previously referred to as aromatic aldehyde synthase) and phenylpyruvate decarboxylase (PPDC) prediction models were trained with vectors generated by PROFEAT 54 . AAAD positive training sequences include L-DOPA decarboxylase (DDC) and other typical PLPdependent carboxy-lyases that decarboxylate aromatic amino acids (Supplementary Data 1). The AAAD positive examples all contain a catalytic histidine, corresponding to H181 of PpDDC (Fig. 2b) 55 . A second PPDC specific SVM model was therefore built by training with 19 homologous plant sequences in the same phylogenetic clade as rose PPDC as positive training sequences and 3 negative training sequences which were curated as plant PDC, as suggested by the results of a previous report 56 .
Positive training sequences from AAS and PPDC models were included as negative training sequences for the AAAD model; positive training sequences from AAAD and PPDC models were included as negative training sequences for the AAS model; and positive training sequences from AAS and AAAD models were included as negative training sequences for PPDC models. For all models, general negative training sequences included E. coli, S. cerevisiae and A. thaliana enzymes, excluding sequences classified in the positive training group.
Cytochrome P450 (CYP450) prediction models were trained with vectors generated by ProtVec 57 . To clarify potential N-methylcoclaurine 3-hydroxylase (NMCH) activities of CYP450 monooxygenases, SVM models were trained with CYP80B sequences as positive examples (Supplementary Data 5). CYP450 reductase (CPR) prediction models were trained using sequences listed in Supplementary Data 6. To clarify potential tyrosine 3-monooxygenase activities, SVM models were trained with sequences related to CYP76AD, CYP98A3 and CYP199A2 as positive examples (Supplementary Data 7 and Supplementary Data 8). CYP76AD, CYP98A3 and CYP199A2 enzymes are reported to mediate aromatic hydroxylation of tyrosine and the similarly sized substrate coumaric acid [33][34][35] .
Prediction models were first built with high-dimensional vectors. Cross validation of all high-dimensional SVM models resulted in F-scores above 0.96. Candidate sequences were selected based on high-dimensional scores. Twodimensional and three-dimensional plots were used for visual representation of data in Figures. For two-dimensional plots, high-dimensional vectors were compressed to 2 dimensions using principal component analysis (PCA). 2-dimensional SVM models were then built derived from the PCA compressed vectors. SVM and PCA from the scikit-learn library were used 58 . The threedimensional SVM plot in Fig. 4d was adopted from an SVM illustration by Dr. Saptashwa Bhattacharyya [https://towardsdatascience.com/visualizing-supportvector-machine-decision-boundary-69e7591dacea]. Compressed two-dimensional decision scores from the combined model (Supplementary Table 3, upper table) are used as the third dimension of Fig. 4d.
Random forests E-models were built from enzyme amino acid sequence information using scikit-learn version 0.21.3 8 , with the same datasets and feature extractions as that of the corresponding SVM models. As an additional benchmark, machine learning differentiation of AAS versus AAAD sequences, and PPDC versus PDC sequences, was compared to differentiation based on homology to consensus sequences. To do this, consensus sequences were generated for each group of training sequences (AAS, AAAD, PPDC and PDC), by selecting the amino acid of maximum frequency at each position. If a training sequence has higher sequence identity to the consensus sequence of its correct group, compared to that of its related group, then it was counted as a correct prediction by homology.
Training sequences, cross-validation F-scores and additional parameters for high-dimensional models are available in the Supplementary Data files.
Protein structural modeling and docking analysis. Homology models were built with Modeller 59 run in UCSF Chimera (candidate version 1.15), using template structures of highly similar proteins from the Protein Data Bank [https:// www.rcsb.org] 7 . Multimeric structures and ligands were first prepared in PyMOL version 1.8.7.0. Structures were refined and prepared for docking analysis using Molecular Operating Environment (MOE) version 2020.0901 7 .
Materials and reagents. KOD -Plus-and Ex-Taq HS DNA polymerases were purchased from Toyobo (Tokyo, Japan) and Takara (Tokyo, Japan), respectively. A-attachment Mix was purchased from Toyobo. Primers were ordered from Eurofins Genomics (Tokyo, Japan). A DNA ligation kit and JM109 chemical competent cells were purchased from Takara. The QIAprep Spin Miniprep Kit was obtained from Qiagen (Hilden, DE). BL21(DE3) and BL21-AI competent cells were purchased from Thermo Fisher Scientific (Waltham, MA, USA). Rosetta gami 2 cells were purchased from Sigma-Aldrich (St. Louis, MO, USA). All restriction endonucleases were purchased from New England Biolabs (NEB, Ipswich, MA, USA). Antibiotics were purchased from Nacalai Tesque (Kyoto, Japan), Sigma-cleavage of the chitin-binding domain and intein fusion via the addition of 50 mM DTT to the column. PsTyDC6 was then eluted into Amicon Ultra centrifugal filters and the buffer was changed to PBS (pH 7.0).
For detection of in vitro produced 4HPAA, purified PsTyDC6 and digested PsTyDC6 cell extract were mixed with 5 mM and 4 mM tyrosine, respectively. PsTyDC6 reactions containing 100 μM PLP were started together with control reactions containing 100 μM PLP and 4 mM tyrosine, followed by incubation at 30°C for 3.5 h. Samples were lyophilized and then derivatized by treatment with a pyridine and methoxyamine solution followed by treatment with MSTFA. Derivatized compounds were analyzed by GC-MS. TMS-and methoxyamine-derivatized 4HPAA was identified based on product ions m/z 190.1 and m/z 205.1, and confirmed by running an authentic 4HPAA standard after derivatization using the same method. To detect in vitro production of H 2 O 2 by PsTyDC6, a horseradish peroxidase-based fluorescent assay 7 was performed with the fluorescent substrate Amplex Red (10acetyl-3,7-dihydroxyphenoxazine) together with other components of a Fluorimetric Hydrogen Peroxide Assay Kit (Sigma-Aldrich). For the peroxidase-based assay, PsTyDC6 was prepared in PBS (pH 7.0) with 1 μM PLP. Baseline fluorescence from the control with matching PsTyDC6 and PLP, but with no tyrosine, was subtracted from each tested condition containing tyrosine. Initial rates of fluorescence production were plotted against final tyrosine concentration using the Michaelis-Menten function of Prism 7 version 7.0d.
LC-MS operated in MRM mode was applied to detect in vitro produced DHPAA, dopamine, tyramine, norcoclaurine and norlaudanosoline. For DHPAA, dopamine and norlaudanosoline production, purified PsTyDC6 was mixed with 5 mM L-DOPA. For tyramine and norcoclaurine production, purified PsTyDC6 was mixed with 1.25 mM tyrosine and 2.5 mM L-DOPA. In vitro samples were incubated at 30°C for 80 min. to analyze DHPAA, and for 8 h to analyze norcoclaurine and norlaudanosoline.
Extraction of aromatic compounds for GC-MS quantification. A solution of ammonium carbonate was added to culture samples, followed by addition of ethyl acetate. After vortexing, the organic layer was removed and evaporated under vacuum. The dried extracts were then derivatized in a mixture of BSTFA, TMS-Cl, and ethyl acetate. Quantitative standard curves were produced by extracting alkaloid standards prepared in TB medium, followed by TMS-derivatization in equivalent volumes. The TMS-derivatized samples were analyzed with GC-MS as described above.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Data supporting the findings of this work are available within the paper and its Supplementary Information files. Training sequences and information for machine learning models are included as Supplementary Data files. A reporting summary for this Article is available as a Supplementary Information file. All protein accessions used in this study are available from the National Center for Biotechnology Information (NCBI) database [https://www.ncbi.nlm.nih.gov]. Protein Data Bank (PDB) structures including 2Q5O and 2WVA are available from the PDB database. Source data are provided with this paper.
Code availability