Discovery of novel chemical reactions by deep generative recurrent neural network

The “creativity” of Artificial Intelligence (AI) in terms of generating de novo molecular structures opened a novel paradigm in compound design, weaknesses (stability & feasibility issues of such structures) notwithstanding. Here we show that “creative” AI may be as successfully taught to enumerate novel chemical reactions that are stoichiometrically coherent. Furthermore, when coupled to reaction space cartography, de novo reaction design may be focused on the desired reaction class. A sequence-to-sequence autoencoder with bidirectional Long Short-Term Memory layers was trained on on-purpose developed “SMILES/CGR” strings, encoding reactions of the USPTO database. The autoencoder latent space was visualized on a generative topographic map. Novel latent space points were sampled around a map area populated by Suzuki reactions and decoded to corresponding reactions. These can be critically analyzed by the expert, cleaned of irrelevant functional groups and eventually experimentally attempted, herewith enlarging the synthetic purpose of popular synthetic pathways.

The discovery of new organic reactions has always been in the focus of synthetic organic chemistry. Each new reaction enriches the arsenal of synthetic tools and opens new horizons in the development and optimization of new drugs and materials. Such reactions are often given the names of their discoverers, which is the highest recognition of their contribution to organic chemistry. Most of the new reactions have been discovered by plain luck, and it has been up to the chemists to notice the discovery and apply their "chemical intuition" to study it in detail 1 . The beginning of a systematic approach to the search for new reactions was laid in 1967 by Balaban, who applied the graph theory for systematical enumeration of pericyclic reactions proceeding through a 6-membered transition state 2 . In the 1970s, these studies were significantly expanded by Hendrickson 3 , Arens 4-6 , Zefirov, and Tratch 7,8 who considered various formal schemes describing bonds redistribution for different types of pericyclic reactions. Another approach implemented in the IGOR 1,9 and IGOR2 10 programs concerned the algebraic model of constitutional chemistry developed by Dugundji and Ugi 11 . This approach supports the hierarchical representation of organic reactions and deals explicitly with heteroatoms and charges, keeps track of rings in molecules 10 . Its application led to the discovery of previously unknown reactions: the thermal decomposition of α-formyl-oxy ketones 1,9 , and the formation of a cage molecule from N-methoxycarbonyl homopyrrole and tropone 10 . Then, an alternative method based on the generation of the complete sets of non-isomorphic spanning subgraphs of a given graph was suggested. With the help of this approach, new carbene reaction 12 and two new elimination reactions leading to the formation of synthetically important dienes 13 were discovered. The formal-logical approach to organic reactions 7 implemented in the SYMBEQ 14 and ARGENT 15,16 software was used to discover substituted furans 14 .
Despite great expectations, no significant progress in computer-aided reaction design was achieved; approaches, algorithms, and software tools reported so far have not found any widespread popularity among organic chemists. The work with those tools required both extensive knowledge in synthetic organic chemistry and a well-developed intuition to turn abstract schemes of bonds redistribution into specific chemical reactions with particular reagents, catalysts, and experimental conditions. This explains why all reactions computationally  www.nature.com/scientificreports/ A sequence-to-sequence neural network with Bidirectional Long Short-Term Memory 52 layers trained on SMILES/CGR achieved the ability to convert SMILES/CGR to their latent vectors ("encode") and back ("decode"). Generative Topographic Mapping (GTM) was used to visualize the latent space in 2D and to detect a cluster mostly populated with Suzuki reactions (Fig. 2). Then, virtual chemical reactions were generated by sampling the targeted zone followed by the decoding of associated latent vectors to SMILES/CGR. Notice that visualization is not strictly required for clusters identification, but may significantly help to choose a cluster from which the sampling is performed.

Results
Reaction sampling from generative topographic map. A set of 2 424 306 reactions, extracted and curated from the USPTO database 53 , was rendered as CGRs and then as SMILES/CGR strings used to feed the autoencoder. The latter was trained on some 2 million reactions and validated on 450 thousand reactions. The reconstruction rate (a ratio of correctly reconstructed SMILES/CGR) was 98.4% and 97.8% at the training and validation stage, respectively. This is slightly less than reconstruction rates of plain molecular SMILES by stateof-the-art encoders/decoders, but it can be explained by larger complexity and length of SMILES/CGR and an additional source of error: the errors of atom-to-atom mapping in some entries. SMILES/CGR is intrinsically more difficult to learn, with dynamical bonds, dynamical atoms and formal coordination numbers exceeding atomic valency representing novel degrees of freedom in the syntax. Unbalanced or erroneous entries may pass the standardization protocols and thus negatively impact generated SMILES/CGR quality.Nevertheless, reconstruction rates are robust and although LSTM has relatively short memory compared to some other neural networks architectures like transformers and, therefore, may fail to learn relatively complex structural motifs, the bidirectional LSTM used in our work seems to perform acceptably well.
The latent vectors for 100 000 randomly selected reactions were used to construct a Generative Topographic Map (GTM) using in-house software 54 . Then the entire USPTO database was projected onto the map, on which several zones predominantly populated by Suzuki reactions were identified, as shown in Fig. 2.
Random latent vectors were sampled from one of these zones with the highest relative population of Suzuki reactions. As expected, the sampling procedure led to virtual transformations of a similar type. Finally, 10,000 text strings have been generated, followed by their analysis using a complex post-processing protocol (Fig. 2). At the structures verification and standardization stage, the CGRtools.v3 tool was used to discard invalid SMILES/ CGR and to perform valence and aromaticity check. This reduced the dataset to 1099 reactions (some 11% of generated text strings) in which structures of reactants and products were correct. This value is similar to that www.nature.com/scientificreports/ (15-20%) observed for the SMILES strings in our previous studies devoted to generation of individual molecules. Clearly, not every latent space vector corresponds to a valid structure. However, since invalid SMILES/CGR can be discarded algorithmically, they are not a liability but a manageable consequence of exploratory sampling. Also, the USPTO reactions are unevenly distributed in terms of types. Deep learning typically focuses on dense clusters allowing it to extensively capture their associated syntactic rules. The realibility of different combinations of leaving group L and a coupling partner Q in the reaction center BC.QL > > B.CQ.L suggested by AI depends not only on the training set size but also on its diversity, i.e., on the presence in the training set related examples. The USPTO dataset contains very few reactions with Q = O, S and Si which may explain the relatively high rate of invalid SMILES/CGR strings.
Reaction novelty analysis. The main interest of in silico reaction generation is the proposal of novel reactions that a human mind would not spontaneously think of. However, unlike individual compounds, where novelties can be identified as unique scaffolds or particular structural motifs 30 , the definition of reaction novelty was not discussed in the literature. The most descriptive part is the reaction center (RC) 55 , i.e. atoms and bonds directly involved in the transformation. Thus, we consider two levels of reaction novelty: (i) the reaction center is unknown (not present in the training set); (ii) reaction center is known, but its closest neighborhood (1 st atoms and bonds near the RC, RC + 1) is new. The latter can be extended to a more distant neighborhood (n atoms and bonds away, RC + n), but in this work, we only focus on the reaction center and the closest neighbors. To decide whether a reaction is novel, these substructural reaction motifs are encoded by a hashing function as reaction signatures and are compared to all signatures extracted from the initial dataset.
Among 1099 reactions selected using the post-processing workflow (Fig. 2), 436 contain new reaction center RC and 30 reactions are novel at first neighborhood level RC + 1. Some generated reactions have two or more distinct reaction centers, i.e. represent multistep transformations. Note that "novelty" defined as the absence of reaction center from the training set data is per se meaningful, as an illustration of the "creativity" of this Artificial Intelligence, i.e. its ability to generate original reaction centers which can be submitted for empirical feasibility assessment to human experts. Unfortunately, "novelty" as the absence of reaction center from both the training set and public reaction databases is not easy to interpret, for it may both mean that (a) such reactions were tried, but failed and thus were not published or (b) reactions were never explored, thus represent a real asset of innovation. The choice not to publish failed reactions is a major drawback in training reactivity models 44 .

Reactions curation and generalisation.
A close look at the generated reactions reveals several serious drawbacks: (i) unbalanced reaction equations, (ii) presence of likely unstable groups (e.g., R 3 S(=O)H and R-PH(=O)-OR'), and, (iii) transformations which require harsh reaction conditions (e.g., breaking of a C-C bond), or kinetically unfavorable reactions (e.g., cleavage of a leaving group with carbon at attachment point). Some reactions can be corrected or discarded using some heuristic rules ("Chemical Filters").
Output of unbalanced reactions is a direct consequence of the training set composition: almost all USPTO reactions are also unbalanced, e.g., leaving groups are almost never reflected in the reaction equation present in the database. The application of the CGR technology may implicitly solve that problem. Indeed, within the CGR formalism,heavy atoms in reactants and products are implicitly conserved-as the same graph is simply interpreted differently in terms of dynamical bond status in order to convert it to reagents or to products, respectively. Even if the initial reaction was not stoichiometrically balanced (see example in SI), its CGR representation will be-in so far the conversion of an unbalanced transformation to CGR succeeds to produce the correct CGR of the balanced process. However, as the exact state of the leaving group cannot be deduced from the training set, in silico generated CGRs may occasionally decode into reactions by simply substituting a broken bond by a hydrogen atom, leading to a disbalance in terms of implicit hydrogens.This is seen in the example from Fig. 3A in which the products contain 2 hydrogens more than reactants. Furthermore, the postulated product BH(OH) 2 is highly reactive, thus unrealistic. Formally, this is a rather "creative" in silico interpretation of the Suzuki reaction pattern, in which the organic halide R-X is replaced by an amide group: the acyl fragment is assimilated to "R" while the benzylamine is the leaving group X. Formally, a balanced Suzuki process could be formulated as either or, more realistically, with inclusion of the required alkoxy base, typically BuO -: Unfortunately, a sketchily written Suzuki transformation, carelessly ignoring the inorganic leaving groups: converts to a CGR corresponding to  (4), which are nothing but a biased interpretation of Suzuki processes, corrupted by intrisic representation errors in USPTO database entries. The addition of a water molecule as a "formal" basic species leads to a fully balanced reaction (Fig. 3B). Although water is not a perfect base for the Suzuki reaction, it helps to correctly represent the boron-containing leaving group in reaction equations. In silico generated reactions that cannot be "corrected" in this way have been discarded.
We also decided to discard the unfeasible under normal conditions transformations consisting in the cleavage of a C-C bond and assuming a carbon-centric leaving group. Application of these heuristics reduced a considered set of novel reactions to 44 including 31 reactions with new RC and 13 reactions with new RC + 1.
The question arises whether we need to consider explicit chemical structures of generated reactants and products. In our opinion, this is not firmly required if one focuses on the detection of new reaction transformations identified by RC or RC + 1 structural motifs. In this case, a "simplified" reaction in which substrates contain only atoms of reaction center and their closest environment (including second neighbors) could be sufficient, see Fig. 3C. Notice that such simplified reactions correspond to general reactivity patterns. Particular reactants can be selected by chemists as a function of availability, intended conditions, reactivity concerns, etc. For example, the reaction in Fig. 3C looks unfeasible, but it becomes more realistic if the amine leaving group were strengthened by binding to strong electron acceptors (for example, trifluoromethanesulfonyl) or by quaternization.
Notice that the majority of generated reactions have known RC and RC + 1 motifs. All belong to the Suzuki coupling type, as exemplified below.  Table 1 and Table S3 in Supporting Information). Substructural searching with RC as a query in the much larger CAS REACT database (SciFinder) resulted in retrieval of several reactions similar to those "discovered" by Artificial Intelligence. In particular, this concerns reactions with C-Br 56 bond formation and C-Si 57 coupling, and C-C coupling with N-containing 58 and F 59 leaving groups, as well as C-O coupling with organosilicon leaving groups. In total, 5 out of 13 new reaction centers discovered computationally, were found in SciFinder reactions (Table S5 in Supporting Information). Since none of them were used for the autoencoder training, these generated reactions were pure "imagination" of AI. Thus, several "novel" reactions (4 in Table 1, 5, and 7 in Table S5) correspond to a quite interesting C-N bond cleavage with amine as leaving group. A similar reaction has recently been discovered experimentally by Weires et al. 60 who shown that the formation of amides facilitated nickel-catalyzed cleavage of C-N bonds accompanied by C-C coupling (reaction 12 in Table S3). Experimental analogues of C-Si coupling reactions generated by the model (reactions 8 and 9 in Table 1, reactions 19-26 in Table S3) were found in SciFiner (reaction 9 in Table 1 and 19 in Table S5). In the experiment, bromotriarylsilane was used as template 57 (reaction 19 in Table S5) whereas our tool proposed less stable di-substituted silane bromide possibly with heteroatoms surrounding silicon (reactions 8 and 9 in Table 1). The organosilicon leaving group proposed for the C-O coupling (reaction 11 in Table 1) is very similar to that reported by Kori et al. 61 Fluoro-Suzuki reaction proposed by the autoencoder (reaction 5 in Table 1) was observed experimentally in the study by Chi et al. 62 Reaction 7 in Table 1 is not a coupling but boron substitution by bromine; it has been experimentally discovered using N-bromosuccinimide as a donor of bromine in the study by Thiebes et al 56 (reaction 17 in Table S5). However, from the structural point of view, its reaction center looks similar to "classical" Suzuki type reactions (boron substitution by carbon or heteroatom). Some of the reactions still look unfeasible, e.g., the O-I compound seems quite unstable (reaction 2 in Table 1). Nonetheless, such compounds are listed as commercially available (e.g. CAS Nos 3240-34-4, 1338247-47-4). Sulfur-containing compounds are generally unsuitable for Suzuki catalysts. Their generation can be explained by an excessive model's "creativity", which can be hardly controlled in the employed neural network architecture.

Reactions with a new environment of known reaction centers (RC + 1).
Following the novelty detection procedure, 13 reactions that correspond to 3 known reaction centers but an original first environment (RC + 1) were detected, see Table 2 and Table S4 in SI. Two similar reactions have been found in SciFinder. Although the simplified reaction 1 in Table 2 looks unfeasible, a more suitable leaving group might render it possible. For instance, in hydrogenation conditions, a catalyst can facilitate reductive cleavage of C-O bond (in esters, carbamates, benzyl ethers, etc.) followed by a coupling (as in reaction 10 in Table S6).
The use of alkyl and acyl bromides in C-C coupling in reaction 3 (Table 2), was observed experimentally (see reaction 13 in Table S6 in SI). Reaction 2 in Table 2 looks quite feasible because synthesis of acyl iodides was reported in the literature (e.g., CAS 191340-22-4 and CAS 1332596-80-1) whereas carboniodidates can be provided by some vendors (e.g., Enamine BBV-109267542 or BBV-109267541). Notice that similar reactions with chloroformates have been also found in Reaxys 63 .  Reactions feasibility assessment. Strictly speaking, reaction feasibility is defined by both kinetic and thermodynamic factors. However, according to the Bell-Evans-Polanyi principle 66,67 , in a series of similar reactions, the trend of activation energies follows the trend of reaction enthalpies. Thus, favorable thermodynamics, namely reaction enthalpy (∆H), can be considered as weak proof of reaction feasibility. A series of gas-phase DFT calculations were performed to assess ∆H for all simplified reactions with new RC and RC + 1. According to our estimations, almost all reactions are exothermic except for four reactions with Si-containing substrates in which ∆H is positive but close to zero (see Tables S3 and S4 in Supporting Information). This shows that all new computer-generated reactions are feasible, at least, as far as DFT-based thermodynamics estimates can tell. Since DFT is a rather time-consuming method and can hardly be applied for thousands of generated reactions, we also performed a rough estimation of ∆H using the tabulated bond energies in reactants and products [68][69][70] . Although calculated in such a way reaction enthalpies poorly correlate with the DFT values, they generally provide similar conclusions concerning reaction feasibility (see Tables S3 and S4 in Supporting Information).

Conclusion
Here we present the first attempt to generate new chemical reactions using a combination of Condensed Graph of Reaction, Generative Topographic Mapping, and sequence-to-sequence autoencoder. To feed the autoencoder, special reaction SMILES strings (SMILES/CGR) were conceived and implemented. In order to discard the seemingly unfeasible reactions, a special 4-steps post-processing procedure has been implemented. It includes: (i) stoichiometric balancing of reaction equations, (ii) reduction of substrates structure to their simplified form, (iii) discarding chemically infeasible transformations using suggested heuristics ("Chemical filters"), and (iv) assessment of synthetic feasibility using quantum mechanics calculations. The effectivness of the suggested approach was demonstrated on the example of Suzuki-like coupling reactions. Among generated reactions we discovered transformations with 13 new reaction centers which did not occur in the training set. Five out of 13 This study reveals that creativity of Artificial Intelligence is rather limited. Deep learning neural networks, at least, in their current state, are not able to invent completely new type of chemical transformations but rather propose unseen and sometimes not trivial variations of existing ones. Thus, in this study novel (in the context of the training set) C-N, C-O, C-S and C-Si bond formation reactions, as well as nitrogen-and sulfur-containing leaving groups have been suggested by the model. We believe that this opens a way to propose putatively new synthetic pathways in a way that is not affected by the bias of human expertise-with all the benefits and the pitfalls this may bring. It should also be noted that compared to theory-driven quantum chemical models, datadriven DNN is much less time consuming and, practically, is not limited by the reactants size. The more data are used in the neural network training, the more realistic the predicted reactions are. Since the sizes of reaction databases are rapidly growing up, deep learning approach has an obvious perspective as a tool for discovery of novel reactions.

Methods
Datasets and data curation. The dataset used in this project comes from United States Patents and Trademark Office database (1976 to 2016) extracted by Lowe 53 . It contains about 3.5 million reactions. The initial dataset was preprocessed with in-house scripts based on the CGRtools library 51 . The curation includes the standardization (aromatization and functional group standardization), removal of empty reactions (those where the products and reactants are the same, or no reactants or products are recorded) and reactions with valence errors. For curated reactions, atom-to-atom mapping (AAM) was performed using the ChemAxon Automapper tool which is a part of the JChem toolkit 71 . The mapped reactions were converted into CGRs and their reaction centers were extracted with the CGRtools. In total, 165 879 different reaction centers were obtained. Since AAM errors lead to incorrect reaction centers, which are usually rare, only highly populated reaction centers were selected. Thus, the resulting dataset consisted of some 2.5 million reactions (approximately 70% of the initial dataset) which corresponds to 300 most frequent reaction centers.
According to our estimations 72 , the ChemAxon Automapper tool leads to the erroneous AAM for some 25% of USPTO reactions. Most of those concern cycloadditions with complex reaction centers. As far as Suzuki coupling is concerned, this error is around 3%.
Notice that practically all USPTO reactions are stoichiometrically unbalanced. This doesn't prevent to build Condensed Graph of Reaction, but, in some cases, may lead to erroneous atom-to-atom mapping.
Reaction data treatment. CGRtools library (version 3) 51 was used for the reactions cleaning, their transformation to CGRs, conversion of CGRs into SMILES/CGR, and processing of generated SMILES/CGR back into reactions. SMILES/CGR notation. Generally, SMILES/CGR follows the OpenSMILES rules 73 . They differ from regular Daylight SMILES in terms of aromatic atoms and ring closure specification and introduce special "dynamic" bonds and atoms characterizing chemical transformations. Dynamic bonds in CGR characterizing chemical transformations have special labels representing changes in bond orders. Dynamic atom corresponds to change of formal charge or radical state of this atom in reaction. Detailed information about SMILES/CGR syntax is given in Supporting Information. SMILES/CGR generation and parsing, including preparation of canonic SMILES/CGR, are implemented into CGRtools Python library 51 .
Reaction generation algorithm. The network architecture previously applied for molecular SMILES generation 30 has been used in this study. It is based on the autoencoder architecture introduced by Xu et al. 74 . SMILES/CGR transformed into sequences of one-hot encoded characters with padding to constant length (256) were used to feed the encoder. Symbols within square brackets (conventional or dynamic atoms or dynamic bonds) were considered as a single symbol within tokenization. The encoder consists of two bidirectional Long Short-Term Memory (LSTM) 52 layers (128 nodes each), while the decoder is composed of two forward LSTM layers (256 nodes each). The bottleneck dense layer between the encoder and the decoder transforms the states of the encoder LSTMs into latent variables to subsequently feed them to the decoder; it consists of 128 nodes. Finally, the decoder outputs are transformed back to one-hot encoded characters via a single dense layer.
The autoencoder was trained in batch mode, where batches of "one-hot"-encoded sequences were generated on-the-fly from training set SMILES/ CGR strings. The Adam optimizer was used for training, initial learning rate was set to 0.005, and batch size was set to 256 samples per batch. The learning rate was reduced during training if there were no improvement in the validation loss for two epochs. The training was terminated after 34 epochs when no improvements in test set reconstruction accuracy was observed. To generate latent variable vectors for eventual decoding, we use the Generative Topographic Mapping method. It is a non-linear dimensionality reduction method that has been successfully used for chemical space analysis 54,74-82 , comparison of chemical libraries 83 , building classification 43,[74][75][76][77]80,84 , and regression 85,86 models via activity landscapes, as well as for solving the "inverse" QSAR problem 87 . The GTM algorithm operates by embedding a nonlinear two-dimensional manifold into a D-dimensional descriptor space and calculating the distribution of objects of initial space on these two dimensions. In this work, we utilize the autoencoder's latent vectors as an initial descriptor space. Once a map for the entire USPTO database was constructed, the zones corresponding to the desired reaction type (Suzuki reaction) were located, from which the latent vectors for virtual reactions were sampled. These new vectors fed the trained decoder resulting in new SMILES/CGR strings. www.nature.com/scientificreports/ Novelty detection. Novelty detection is based on the comparison of hashed reaction signatures corresponding to reaction centers (RC) and their environment between the database of known reaction (here, USPTO database) and the reactions generated by the autoencoder (Fig. 4). Encoding chemical reactions by CGR significantly simplifies RC detection. Thus, substructural motifs involving the reaction center (RC, RC + 1, RC + 2, …) can easily be extracted from CGR (see Fig. 5). Since any operations with molecular graphs are timeconsuming, each substructural motif was encoded by a unique hash code 51 -a reaction signature uniquely identifying given transformation. In this case, the novelty detection is reduced to the comparison of signature (hash code) of a generated reaction with those of known reactions (Fig. 4). The suggested procedure assures fast and precise novelty detection.
Reaction enthalpy calculations. The difference in energies between reactants and products is calculated in several steps 88 . First, a conformer with the lowest energy is generated for each compound in the reaction using the ChemAxon cxcalc module. Then, the geometry of each compound was optimized using the Priroda16 program with PBE exchange and correlation functional 89 , and the built-in triple-zeta split valence basis set 3z, which is equivalent to Schäfer's TZVP basis set 90 . Relativistic and solvent effects were neglected. The Priroda16 program was chosen as it is one of the fastest DFT software due to the efficient evaluation of density functional exchange-correlation terms based on the electron density expansion 91 . Final energy values were extracted for optimized structures and used for calculation of reaction enthalpy. The additive scheme for estimating reaction enthalpies was implemented using the tabulated chemical bonds increments [68][69][70] .

Data availability
The dataset used in this project comes from the publicly available United States Patents and Trademark Office database (Lowe, https ://doi.org/10.17863 /CAM.16293 ). Curated USPTO dataset is available on GitHub: https :// githu b.com/Labor atoir e-de-Chemo infor matiq ue. All data preprocessing procedures are described in the Methods section and are based on freely available CGRtools library.

Code availability
CGRtools library is used for data preprocessing and creation and treatment of chemical reactions as SMILES/ CGR, and is freely available (https ://githu b.com/cimm-kzn/CGRto ols). Autoencoder model code and the ISIDA/ GTM tool are available upon request.