Automatic mapping of atoms across both simple and complex chemical reactions

Mapping atoms across chemical reactions is important for substructure searches, automatic extraction of reaction rules, identification of metabolic pathways, and more. Unfortunately, the existing mapping algorithms can deal adequately only with relatively simple reactions but not those in which expert chemists would benefit from computer’s help. Here we report how a combination of algorithmics and expert chemical knowledge significantly improves the performance of atom mapping, allowing the machine to deal with even the most mechanistically complex chemical and biochemical transformations. The key feature of our approach is the use of few but judiciously chosen reaction templates that are used to generate plausible “intermediate” atom assignments which then guide a graph-theoretical algorithm towards the chemically correct isomorphic mappings. The algorithm performs significantly better than the available state-of-the-art reaction mappers, suggesting its uses in database curation, mechanism assignments, and – above all – machine extraction of reaction rules underlying modern synthesis-planning programs.


Examples of 10 pairs of reactions with correct mapping in the MIT training
set. The first reaction in each pair has correct mapping and was used with such a mapping to train the MIT's deep neural network for reaction prediction. The second reaction is a closely related transformation (with chemically correct product shown) with which we queried/tested the MIT's network. The question asked is whether the MIT network can or cannot predict the correct product based on reaction substrates. On this set of 10 examples, 90% of outputs are correct………….….page 836 -853 6.3. Examples of 10 pairs of reactions with incorrect mapping in the MIT training set. The first reaction in each pair has incorrect mapping and was used with such a mapping to train the MIT's deep neural network for reaction prediction. The second reaction is a closely related transformation (with chemically correct product shown) with which we queried/tested the MIT's network. The question asked is whether the MIT network can or cannot predict the correct product based on reaction substrates.

Further comments on isomorphic mapping and pseudocodes. The atom-
atom mapping problem is formulated by means of reaction multi graphs as follows. We represent reactants and products molecules as two undirected and potentially disconnected multi graphs R and P. Vertices (also called nodes) of graphs R and P are labeled with disjoint sets of natural numbers N R and N P .
Assume that function n P : N P ⟶ ℙ and n R : N R ⟶ ℙ (1) maps node labels into the set of chemical elements ℙ = {H, He, Li, …}.
We are looking for a mapping ɸ from N R to N P , such that each node in N R is assigned to the corresponding node in N P , i.e., the atom-to-atom assignment reflects the underlying chemical changes in the course of reaction.
To reduce the problem's complexity, atom environments are defined recursively. Let us start with trivial observation that atoms do not change their types in the course of chemical reactions.
Hence for every the following equality holds: Therefore, the problem of finding correct ɸ may be split into several distinct problems for different atom types. The main idea can be viewed as the extension of Morgan's approach to identify other properties of atoms that are invariant in the reaction. These properties are analogues of EC-values and we regard them as hash values from the set ℍ.
Let us introduce two mappings ℎ ∶ ⟶ ℍ and ℎ ∶ ⟶ ℍ. The hash values are defined recursively as follows: where Γ (i) = {j : e(i, j) > 0} is a set of neighbors of the node i and ℎ (Γ( )) is a multiset image of the set (Γ( )) of the function ℎ .
Hash values for reaction products are defined analogously: The exemplary hash-function labels for reaction from for the first level of recursion (i.e. ℎ 0 and ℎ 0 ) and for the second level (i.e., ℎ 1 and ℎ 1 ) are illustrated in the example in Supplementary Note 1.2 ( Supplementary Figures 1a and 1b, respectively).
For each atom, hash function provides the set of features that enable to distinguish them and, at the same time, remain invariant in the course of reaction. Since hash values of the n-th level describe the neighborhood of a given atom within the radius n, they have much more discriminatory power than simple atom types.
However, the usability of hash values is limited to the stable parts of reaction graphs. When the reaction bonds are broken, the incident atoms are rearranged in novel ways. Fortunately, in a typical reaction, the number of broken bonds is relatively small and most of atoms could be matched using hash values.
Having defined the hash labeling, we generate the candidates for isomorphic subgraphs of reagents and products as follows: We execute bucket sort algorithm on sets N R and N P with respect to their hash values. Then, we group together the sets corresponding to pairs of nodes having identical hash values. If in a such pair the number of reagent nodes is different than the number of product nodes, we exclude nodes in all possible ways to make these numbers identical, thus obtaining a set of candidates. The following pseudocode formalizes CreateCandidates procedure together with the several auxiliary functions: 7 Next, for each candidate subgraph, we strive to establish one-to-one mapping (isomorphism) between reagent atoms and product atoms. The isomorphism is constructed separately for each connected component of candidate graphs. The following tests are used to exclude most candidates before checking the actual graph isomorphism (see Supplementary Figure 1  Test 2: Connected components may be matched according to the number of nodes. Test 3: Connected components may be matched according to node types (they have pairwise identical multisets of nodes' hash values).
Test 4: Connected components are pairwise isomorphic.
The following pseudocode formalizes MatchIsomorphicConnectedComponents procedure that implements the abovementioned tests: To test whether graphs are isomorphic, we implemented the constrained search over a space of all partial mappings. Appropriately defined hashing values increase the distinctness of vertices which reduces the size of the search space. The nodes with the most unique labels are chosen as starting points for the search procedure. Then, a decision tree describing all possible isomorphisms is created. The size of this tree cannot exceed the predefined threshold (in the current implementation, 1,000,000 vertices). Failure is admitted by the algorithm when the decision tree is too large.
The described algorithm is similar in spirit to the popular VF2 algorithm. The main difference lies in the fact that VF2 adds single nodes to match sequentially while our algorithm extends the matching simultaneously to all neighbors of a given node.
In order to further reduce computational complexity, we treat hydrogen and fluorine atoms in a special way. We consider them as features of atoms to which they are linked. In this way, we substantially reduce the number of graph nodes for which we look for isomorphism and solve the problem of ambiguity caused by the fact that hydrogen (or fluorine) atoms linked to a given carbon atom are indistinguishable.
When mapping for other atoms is found, hydrogen and fluorine atoms are expanded and a simple mapping procedure is executed for them. In this procedure, we take into account that hydrogen and fluorine atoms may have only one bond and we look for any mapping that preserves as many bonds as possible.
With the key concepts of heuristics discussed at length in the main text, we briefly mention two other issuesnamely, the treatment of aromaticity and disambiguation.
There are two possible representations for aromatic rings in SMILES format: one can mark atoms belonging to a ring as aromatic or point out a sequence of single and double bonds.
However, the internal representation/notation of aromaticity in our algorithm is different and each aromatic bond is counted as one-and-a-half single bonds. We have chosen such notion because the notion of aromatic atoms (rather than bonds) is hard to combine with the concept of minimal chemical distance, whereas the alternating single-double representation requires to apply the algorithm to all isomers of a molecule, which causes a combinatorial explosion for molecules with many aromatic carbon rings.
Disambiguation procedures are another important component of the algorithm. While they do not affect the algorithm's outcome, they make the task tractable. During the whole mapping process, the algorithm operates on collections of candidates for mappings. Disambiguation is a process of elimination of repetitive candidates and is performed after each step of the algorithm: After each execution of the atom mapping procedure, on a set of candidates for which we have found complete matching, and on a set of modified versions of reactions obtained after application of heuristics. Down to technical detail, candidates are redundant if there exists isomorphism between substrates of the first candidate and substrates of the second one and between products of the first candidate and products of the second one. This isomorphism takes into account any partial mapping or labeling that is assigned to candidates. Disambiguation is implemented using the mapping algorithm with the omission of node exclusion step. If the mapping algorithm finds complete isomorphism, we consider candidates identical. Otherwise they are different.

Illustrative example of isomorphic mapping without heuristics
As discussed above and in the main text, we simplify the mapping problem to the isomorphism of subgraphs (rather than full molecular graphs), and first detect atoms whose environments are not changing during the reaction. The environments are defined not by scalar values used in previous approaches but by molecular subgraphs (i.e., smaller graphs within the graph defining the entire molecule) assigned to each atom, extending to within a given "radius" (measured in terms of bond distances), and defined recursively as illustrated in Supplementary Figures 1a and 1b for the "zero-order" environments (atoms themselves) and the first-order ones (nearest neighbors). In general, our algorithm uses environments up to the fourth order. The algorithm then uses the so-called bucket sorting to group together atoms with identical environment subgraphs. This procedure is illustrated in Supplementary Figure 1c which builds on the example from Supplementary Figures 1a,b (again, for the sake of illustration, restricting the analysis to the first-order environments). As seen, the numbers of atoms within each group are not necessarily equal. For instance, the reactant has one more atom of first-order environment type than the product, the product has one more atom of type than the reactant, etc.
In case of such discrepancies, the algorithm excludes extraneous atoms in all possible ways (see Supplementary Note 1.1 above for pseudocode) to make numbers in each type category equal.
These tests exclude the majority of possible candidatesin our example, only candidates G and H survive with Supplementary Figure 1d showing their matching connected components colored in yellow and in cyan.
The most computationally-intensive part of the above operations is to find the correct isomorphism. This is done by creating a decision tree of all possible isomorphisms, with the size of this tree limited to a predefined number of vertices (currently, 1,000,000)evaluation of a candidate reaction typically requires trees with tens of thousands of vertices but for either very large molecules and/or those with many symmetries, it can approach or even exceed the one-million limit. The procedure is similar to the so-called VF2 algorithm with an important difference that VF2 adds single nodes (here, atoms) to the matching while our algorithm extends the matching simultaneously to all immediate neighbors of a given atom. The isomorphisms are first calculated at the level of the fourth-order environments, and then recursively at the third, second, and first levels. The purpose of this recursion is to match the atoms sequentially from the "periphery" of the molecules (where atoms agree to higher-order environments) towards the reaction center (where the conserved neighborhoods become smaller).
In some, rather rare cases, all atoms in the molecule are matched after this proceduremore generally, there remain unmatched nodes such as the uncolored atoms in Supplementary   Figure 1d. For these atoms, mapping is straightforward when their neighbors are already matched and this matching fits environments across the reactionin our example, candidate G yields correspondence between atoms 3 ~ 14 and 11 ~ 18 while for candidate H, 3 ~ 14 and 11 ~ 22. When, in the next step, these correspondences are extended to complete isomorphisms spanning entire molecules, only candidate H passes the test, yielding a mapping that is, above all, chemically correct. In the most general case, however, not all atoms will already have matched neighbors. If this is so, we again resort to combinatorics, sequentially cutting/removing all possible subsets of bonds originating from the unlabeled atomsfirst all possible single bonds then, if needed, pairs of bonds, and so on to up to sets of six bonds. By this bond cutting we strive to identify minimal sets of bonds defining the "reaction center" and whose disconnection gives a full isomorphism between reactants and products. Overall, the above algorithmusing an improved representation of neighborhoods and various original combinatorial tests/procedures is crafted to and solves efficiently sub-graph isomorphism problem for graphs representing chemical molecules, which represent a special class of graphs that are almost always planar and have node degrees of, at most, four. components are colored yellow and blue. In G, atoms 1,2 correspond to atoms 12,13 and atoms 4,5,6,7,8,9,10 correspond to atoms 15,16,17,19,20,21,22,respectively. In H,atoms 1,2 correspond to atoms 12, 13 and atoms 4,5,6,7,8,9,10 correspond to atoms 15,16,17,18,19,20,21, respectively.

Examples of mappings missing some atom assignments but otherwise correct.
Supplementary Figure 5. Examples of mappings missing some atom assignments but otherwise correct. Examples are mostly from Indigo, but similar problems are common in ChemDraw. Unmapped atoms/groups are denoted by yellow circles. a, Opening of an epoxide with methylmagnesium bromide is mapped without assigning numbers to bromine and magnesium atoms. b, Mapping of this silyl-enol ether formation is chemically correct with exception that the chlorine, lithium and carbon  to the carbonyl group are left unmapped. c, Pd(0)-catalyzed cyclization has correctly assigned atom numbering in the product but has no numbers assigned to atoms within the tosyl and acetate groups present in the substrate.

Hard-coded reaction templates for some popular reaction classes.
a b Supplementary Figure 6. Hard-coded reaction templates for some popular reaction classes. If each molecule present in the as-written reaction matches one of the molecules in a template and if the key molecules from the template (drawn in thicker lines) match some molecules in the as-written reaction, then the reaction matches the template. The mapping of such reaction's core is then provided by the template, without application of heuristics.

Quantification of mapping times Supplementary Figure 7. Distribution of mapping times plotted on a double -logarithmic
scale. Transformations were taken from the training and the main test sets (949 reactions in total). Our algorithm maps 91.5% of reactions in less than 1 sec and 96.8% of reactions in less than 10 sec. For our mapper, five reactions (0.5%) were mapped for times above 100 sec (not shown on the plot) and two were unmapped; on the other hand, for Indigo, the program returned no mapping for six reactions. All calculations were performed on Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz. Blue dots correspond to the mapping times obtained by Indigo (the only other mapper allowing for batch-mode mappings) -Indigo is certainly faster but is also significantly less accurate (cf. Figure 5 in the main text ). lacking mechanistically important by-products. This double reduction of an ester is written without specifying the two molecules of the ethanol by-product. As a result, the algorithm incorrectly retains in the product and maps the singly-bonded oxygens #3 and #31 (marked in red) rather than the "dissimilar-looking," doubly-bonded carbonyl oxygens #5 and #30. b, A specialized variant of a [3,3]-sigmatropic rearrangement is undetected by the mapper as it is not covered by the appropriate heuristics (Ia in main-text Figure 2). c, A mechanistically multistep reaction involving a rearrangement that is not recognized (i.e., not present in Figure 2) resulting in misplacing oxygens #1 and #3. d, The algorithm does not properly account for the migration of the silyl group. e, This Au(I)-catalyzed synthesis of bicyclo [3.2.0]heptanes comprises 10 (sic!) mechanistic steps involving 1,3-dipolar cycloaddition and 1,2-alkyl rearrangement (cf.

Examples of chemically nonsensical reactions and mappings in the USPTO database.
Supplementary Figure 9. Examples of chemically nonsensical reactions and mappings in the USPTO database. a, A chemical "typo"in one of the substrates, the silyl ether protecting group does not contain all carbon substituents. We suppose that in the real reaction, it was the -OTBS group, but here two methyl groups are missing and replaced by hydrogens. This difference results in a dramatic change in the reactivity of the substrate: the R 2 SiH 2 groupcompared to the -OTBSis much more reactive and incompatible in the reaction shown. b, "Transmutation"by changing a carbon atom into an oxygen, cyclopentane ring is somehow transformed into the tetrahydrofuran ring. This reaction is certainly not possible. c, Wrong structure of the productin this example of the Wittig reaction, the double bond is created as expected, but is placed at the wrong part of the product (in exomethylene position which is chemically impossible considering the proposed substrates). d, List of chemicals rather than a reactionthere is no known reaction that would convert simultaneously the three substrates on the left side into the three "products" on the right side of the reaction arrow. e, It is impossible to obtain this reaction product from anisole in one step. Atom mappingespecially fate of carbon #7 is also mysterious. Based on these and many other USPTO examples we scrutinized, this collection contains the following (approximate) percentages of nonsensical reactions: for reaction in which one bond is altered, ~15%; for two bonds, ~2%; for three bonds, ~ 6%; for four bonds, ~15%; for five bonds, ~34%; for six bonds, ~58%. Correctness of the mapping MAPPET YES ReactionMap YES Marvin NO Correct mapped SMILES/SMARTS of the reaction: [Br:1] [Br:2]. [O:3] :1].  Reaction no 86
Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: [O-:42] Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction:  :20]. [C:19]. [Cl :31]. [Cl:32]. [O:18]. [O:17] Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: SMILES of the input: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Cl [H] Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Cl [H] Correct mapped SMILES/SMARTS of the reaction: 1C=O.CSC.Cl [H].Cl [H].O=C=O. [ SMILES of the input: [H].Cl [H].O=C=O . [C-] Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction:       [I-]. [I-] Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: . [Br-]. [I-] Correct mapped SMILES/SMARTS of the reaction: . [Br-] Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Cl [H] Correct mapped SMILES/SMARTS of the reaction: Cl [H] Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Cl [H] Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Correct mapped SMILES/SMARTS of the reaction: Cl [H] Correct mapped SMILES/SMARTS of the reaction: Cl [H] Correct mapped SMILES/SMARTS of the reaction:     17]12. [C:18]. [C:19]. [C:20]. [C:21 ]. [O:22]

S5.2. Examples of 10 pairs of reactions with correct mapping in the MIT training set.
The first reaction in each pair has correct mapping and was used with such a mapping to train the MIT's deep neural network for reaction prediction. The second reaction is a closely related transformation (with chemically correct product shown) with which we queried/tested the MIT's network. The question asked is whether the MIT network can or cannot predict the correct product based on reaction substrates. On this set of 10 examples, 90% of outputs are correct.

S5.3. Examples of 10 pairs of reactions with incorrect mapping in the MIT training set.
The first reaction in each pair has incorrect mapping and was used with such a mapping to train the MIT's deep neural network for reaction prediction. The second reaction is a closely related transformation (with chemically correct product shown) with which we queried/tested the MIT's network. The question asked is whether the MIT network can or cannot predict the correct product based on reaction substrates. On this set of 10 examples, only 20% of outputs are correct.