Biocatalysed synthesis planning using data-driven learning

Enzyme catalysts are an integral part of green chemistry strategies towards a more sustainable and resource-efficient chemical synthesis. However, the use of biocatalysed reactions in retrosynthetic planning is hindered by the difficulty of predicting enzymatic activity on unreported substrates and enzyme-specific stereo- and regioselectivity. As of now, only rule-based systems support retrosynthetic planning using biocatalysis, while initial data-driven approaches are limited to forward predictions. Here, we extend data-driven forward reaction and retrosynthetic pathway prediction models based on the Molecular Transformer architecture to biocatalysis. The enzymatic knowledge is learned from an extensive data set of publicly available biochemical reactions with the aid of a new class-token scheme based on the enzyme commission (EC) classification number, which captures catalysis patterns among different enzymes belonging to the same hierarchy. The forward reaction prediction model (top-1 accuracy of 49.6%), the retrosynthetic pathway model (top-1 single-step round-trip accuracy of 39.6%) and the curated data set are made publicly available to facilitate the adoption of enzymatic catalysis in the design of greener chemistry processes.

Supplementary Figure 15: Correlation between forward prediction accuracy and sample count in EC2 (a, b), EC3 (c, d), and EC4 (e, f). We observe a significant correlation between accuracy and sample size in the EC2 and EC3 token schemes. The trend towards lower correlations in higher EC-level token schemes is caused by a further reduction in test cases, due to the selection of unique test products not found in the training sets, and the resulting hit-or-miss accuracies appearing as bands at 0% and 100% accuracy. Increasing k not only increases the accuracy but also lowers the correlation.
Supplementary Figure 16: Correlation between backward prediction accuracy and sample count in EC2 (a, b), EC3 (c, d), and EC4 (e, f). The trend towards lower correlations in higher EC-level token schemes is caused by a further reduction in test cases, due to the selection of unique test products not found in the training sets, and the resulting hit-or-miss accuracies appearing as bands at 0% and 100% accuracy. Increasing k not only increases the accuracy but also lowers the correlation.

Attention Analysis
The analysis of the patterns in the attention weights of the Molecular Transformer provides insights into the interpretability of these complex models and into potential biases [1]. In the case of reaction SMILES, attention weights have been shown to uncover complex reaction information, such as atom mappings, without supervision [2].
In the forward fine-tuned Molecular Transformer, the connection between the reactant and enzyme components and the products is modelled via self-attention and multi-head attention in the encoder/decoder layers. Since the probability distribution over all prediction candidates is computed from the current translation state, summarised by the last multi-head attention and the output layer, we focused our analysis on this last part of the decoder, considering only its attention weights.
We used relevant examples from the test set to analyse the patterns emerging from the mean attention over the heads, investigating how the different heads distribute attention over EC-levels 1-3. We started by analysing all reactions in our test set, focusing at a later stage on the three most frequent enzymatic reaction classes (oxidoreductases, transferases, and hydrolases). Finally, we analysed the correlation between the heads' attention weights to inspect redundancy.
For the EC-level analysis, we filtered weights greater than a noise threshold. The threshold was set to 1/N, where N is the number of tokens in the input. This value corresponds to a baseline in which each output token attends uniformly to all input tokens, i.e., with no specific focus. Masking values at or below this baseline yields an appropriate metric for evaluating attention focus: if a token received weights lower than or equal to the threshold, its value was excluded from contributing to the mean calculation. For the correlation analysis, we randomly selected 20 reactions for each class and extracted the corresponding head weights. For each reaction, we computed pairwise Pearson correlations [3] between the heads' flattened attention matrices. The correlation matrices for each reaction were aggregated by averaging the Fisher-transformed [4] correlation values, and the resulting averaged correlation matrix was obtained by back-transforming the values with the hyperbolic tangent.
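The 1/N threshold masking described above can be sketched as follows; the function name, array shapes, and the toy attention matrix are illustrative, not taken from the paper's code:

```python
import numpy as np

def masked_mean_attention(weights):
    """Average attention per input token, excluding near-uniform noise.

    `weights` is a (output tokens x input tokens) attention matrix from
    one decoder head, with each row summing to 1.
    """
    n_inputs = weights.shape[1]
    threshold = 1.0 / n_inputs  # uniform-attention baseline: no specific focus
    # Weights at or below the baseline are excluded from the mean.
    masked = np.where(weights > threshold, weights, np.nan)
    return np.nanmean(masked, axis=0)

# Toy example: 3 output tokens attending 4 input tokens (threshold = 0.25).
attn = np.array([
    [0.55, 0.10, 0.05, 0.30],
    [0.10, 0.60, 0.20, 0.10],
    [0.10, 0.10, 0.50, 0.30],
])
focus = masked_mean_attention(attn)  # array([0.55, 0.6, 0.5, 0.3])
```

Only the weights above the 0.25 baseline survive the mask, so the per-token means reflect focused attention rather than diffuse background weight.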
We analysed the attention patterns across all reactions (see Supplementary Figure 17) and for the three most representative enzymatic reaction classes: oxidoreductases, transferases and hydrolases (see Supplementary Figure 20).
Specific heads focus their attention on different levels of the EC token, while others attend to the complete enzymatic information, attributing comparable weights to all levels of the token. On average, the heads pay more attention to the first two EC number levels and less to the third, making levels 1 and 2 of the token primarily responsible for forward reaction prediction. The comparison of the mean attention for oxidoreductase, transferase and hydrolase reactions (see Supplementary Figure 20) reveals that the model captures variations in enzymatic reactions, focusing on different EC number levels depending on the reaction type.
Overall, oxidoreductases exhibit higher values on the enzymatic tokens than the other classes. In contrast, transferases present low values, except for head 3, where the EC number class generally receives higher weights with respect to the average. This explains why transferase data sets can be predicted with only a slight loss of accuracy even when paired with wrong EC numbers. Hydrolases show more variation in attention values, with the highest weight given to EC-level 2 by head 3. Across all the reaction classes considered, head 3 always exhibits the highest attention values, while head 2 exhibits the lowest.
In an attempt to capture similarities in attention patterns, we extended our analysis to average correlations between the attention heads (see Supplementary Figure 19; details on the correlation analysis can be found in the Methods Section). The attention weights of heads 3, 6 and 7 tend to focus on single tokens (i.e., atoms and EC-levels) and exhibit highly significant correlation values (ρ3,6 = 0.78, ρ3,7 = 0.65, ρ6,7 = 0.66), providing the inherent mapping between tokens/atoms in the reactants and those in the product. Heads 2 and 4, which tend to focus on structurally larger groups of tokens, e.g., those representing branches, show a weakly positive correlation (ρ2,4 = 0.33). This suggests that the two heads capture distinct aspects of the enzymatic reactions while attending to similar token lengths. The remaining heads are uncorrelated, highlighting the existence of more complex attention patterns captured by the model.
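The aggregation behind these averaged correlations (Fisher z-transform, averaging, back-transform with the hyperbolic tangent, as described in the Methods) can be sketched as follows; the function name and the toy matrices are illustrative:

```python
import numpy as np

def average_correlations(corr_matrices):
    """Aggregate per-reaction head-correlation matrices:
    Fisher z-transform each entry, average over reactions,
    then back-transform with tanh."""
    stacked = np.stack(corr_matrices)
    # Clip to avoid infinities from |r| == 1 (e.g. the diagonal).
    z = np.arctanh(np.clip(stacked, -0.999999, 0.999999))
    return np.tanh(z.mean(axis=0))

# Two toy 2x2 correlation matrices between a pair of heads.
c1 = np.array([[1.0, 0.8], [0.8, 1.0]])
c2 = np.array([[1.0, 0.4], [0.4, 1.0]])
avg = average_correlations([c1, c2])
```

Averaging in z-space rather than directly on the r values avoids the downward bias that arises when averaging bounded correlation coefficients.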
Supplementary Figure 18 focuses on key features of the enzymatic reaction: the centre subject to nucleophilic substitution and the token related to the configurational information. Example (b) reveals the connection between the EC token and the centre of the nucleophilic addition, as well as the introduced nucleophile. Finally, example (c) reveals the connection between the EC token and the stereochemical centre undergoing inversion of configuration. The analysis of the attention weights confirms the capacity of the forward Molecular Transformer to use the EC token to discern the enzymatic reaction centre while capturing enzymatic reaction rules.

Supplementary Figure 18: Analysis of the attention weights in the forward prediction models on reactions (5), (6) and (7) from Supplementary Figure 8 ((a), (b) and (c), respectively). For each reaction, the attention mapping between tokens representing EC numbers is highlighted in purple (reactant atom tokens are connected using grey curves). The curve thickness is proportional to the attention weight computed by the forward Molecular Transformer.

Supplementary Figure 21: Summarised depiction of the most relevant statistics for the curated biocatalysed pathways from Finnigan [5]. In the scatter plot, each enzymatic reaction subclass at EC-level 3 is represented as a point. On the x-axis, we report the percentage of reactions in ECREACT belonging to the class. On the y-axis, we report a bias measure (between 0 and 100) for the EC-level 3 subclass, calculated using the Jensen-Shannon divergence [6] in base 2 between the distribution of EC-level 4 reaction subclasses and a baseline, defined as a uniform distribution of reactions in the EC-level 3 subclass. The bias measures the diversity in the EC-level 3 subclass considered. The point size encodes the number of EC-level 3 reaction subclasses reported in the set of enzymatic reactions from Finnigan [5]. Points are coloured based on the capability of the Molecular Transformer to find a successful route for at least one of the products considered. The depiction shows the high diversity of the reaction subclasses considered in the data sets (bias higher than 70 for all subclasses) and the low sample size for most of the reactions.

Supplementary Figure 22: Step-wise description of the tokenisation process. Starting from an enzymatic reaction (top), a reaction SMILES representation is extracted (middle). The enzymatic reaction SMILES is finally tokenised both at the atom level and at the EC level (bottom).

Supplementary Figure 23: Detailed workflow of the retrosynthesis algorithm adapted from [7]. The hypergraph exploration algorithm, which combines two Molecular Transformer models for forward and backward predictions, is extended to handle EC-level information at each disconnection predicted by the model, encoding it as a reaction class.
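The tokenisation scheme described above (atom-level SMILES tokens plus EC-level class tokens) can be sketched as follows. The regular expression is a commonly used atom-level SMILES tokeniser for Molecular Transformer-style models; the bracketed EC-level tokens and the `|` separator are an illustrative assumption, not necessarily the exact vocabulary used in this work:

```python
import re

# Atom-level SMILES tokeniser pattern (brackets, two-letter halogens,
# ring-closure digits, bonds, and branch symbols as single tokens).
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenise_enzymatic_reaction(smiles, ec_number, ec_level=3):
    """Tokenise a reactant SMILES and append class tokens for the first
    `ec_level` levels of the EC number (token format is hypothetical)."""
    atom_tokens = SMILES_REGEX.findall(smiles)
    levels = ec_number.split(".")[:ec_level]
    ec_tokens = [f"[EC{i + 1}_{v}]" for i, v in enumerate(levels)]
    return atom_tokens + ["|"] + ec_tokens

tokens = tokenise_enzymatic_reaction("CC(=O)O.OCC", "2.4.1.25")
# Atom tokens for the reactants, then one token per EC level:
# [..., '|', '[EC1_2]', '[EC2_4]', '[EC3_1]']
```

Encoding each EC level as its own token lets the model share statistics across enzymes in the same class or subclass, which is the motivation for the hierarchical token scheme.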