Background & Summary

The ability to predict reaction outcomes for a given set of substrates, known as forward synthesis prediction, plays a crucial role in successful synthetic planning1,2. This encompasses not only the prediction of major products resulting from organic reactions but also the retrosynthetic analysis, which aims to identify a viable synthetic pathway to synthesize a desired target compound3. To ensure the reliability of retrosynthetic analysis, it is essential to couple retrosynthesis prediction models with reliable reaction outcome prediction models. While recent advances in machine learning models have shown promise in predicting reaction outcomes based on reactant sets4,5,6,7,8, these models often overlook the finer details of electron movements, reactive intermediates, and other mechanistic information that are crucial for a comprehensive understanding of the reaction9. Consequently, there is a need for more sophisticated and accurate chemical models that can explicitly capture the underlying reaction mechanisms, which involve a step-by-step sequence of electron movements and reactive intermediates, to gain valuable insights into the stereochemistry, reaction kinetics, formation of byproducts, and other important reaction details.

Arrow-pushing diagrams, commonly used by organic chemists, provide a visual representation of electron rearrangements as bonds form and break10. While an ideal chemical model is expected to predict the same arrow-pushing diagrams as human chemists do, no reliable model of this kind has been developed yet, owing to the lack of a mechanistic reaction dataset. In an early attempt to automate this process, Chen and Baldi developed Reaction Explorer11 to predict major products and mechanistic steps from reactants and reagents using a set of prioritized transformation rules. While it provided detailed and reasonable mechanistic descriptions, the rule prioritization means that introducing a new set of rules to cover more reactions would require revising a significant proportion of the existing ones. Hence, it is difficult to scale up to larger reaction datasets containing diverse reaction types. In a more recent effort, Bradshaw et al. proposed a machine learning-based electron path prediction model called ELECTRO12, which is fast and scalable. However, since the model still uses only the reactant and product information to extract the electron movements, without the chemical guidance of actual mechanisms, most of the electron pathways it predicts are approximate mechanisms. This gives rise to discrepancies between predicted and generally accepted mechanisms, as illustrated by the example in Fig. 1, highlighting the need for a more reliable mechanistic prediction model. Because of these challenges, a comprehensive database of large-scale reaction mechanisms is still lacking.

Fig. 1

Demonstration of the difference between the more reasonable mechanism intuitively accepted by human chemists and the label obtained in Bradshaw et al. for a nucleophilic acyl substitution reaction between an alkoxide and acyl chloride. Breaking and re-forming of carbonyl bonds (green box) account for the observed chemical reactivity and are reflected in the mechanism. On the other hand, the label obtained in Bradshaw et al.12 is focused on the changed atoms mainly, highlighted in green: electrophile carbon, leaving group chlorine and attacking oxygen of alkoxide. The latter mechanism omits carbonyl oxygen which is an integral part of the reaction and responsible for the reactivity of the reacting molecule.

To overcome the limitations of existing methods, namely the intrinsically approximate labels of ELECTRO and the limited scalability of Reaction Explorer, we propose a two-step method called MechFinder that labels reaction mechanisms in a chemically reasonable and scalable manner using automatically extracted reaction templates and expert-coded mechanistic templates. In our method, we extract the most important subgraph of the reaction transformation as a reaction template, and manually label the mechanism of each reaction template, based on chemical knowledge, in the form of a mechanistic template. Employing both reaction templates and mechanistic templates, our method then labels the mechanisms of a large number of chemical reactions automatically by determining, in two separate steps, which reaction template and which mechanistic template apply to each reaction. Our method enables the generation of mechanistic pathways for various reaction types, including pericyclic reactions and those involving multiple steps, such as reductive amination. These labels serve as computational analogs to the conventional arrows employed in the arrow-pushing model. To evaluate the effectiveness of our method, we curated a subset of the USPTO-50K dataset as a benchmark and assessed the coverage and applicability of our approach. We report the first large-scale reaction dataset with chemically reasonable reaction mechanisms, denoted as mech-USPTO-31K, which should help the chemistry community develop reactivity prediction models that are based on molecular mechanisms.

Methods

Reaction data

In this paper, we used the reaction data extracted from USPTO granted patents collected by Lowe13, an organic reaction dataset extensively used in benchmarking various reaction prediction approaches. In particular, we demonstrate the results using the USPTO-50K dataset curated by Schneider et al.14 and atom-mapped by LocalMapper15. Each reaction in this dataset is presented in SMILES16 format. Since our approach only addresses two-electron-based arrow-pushing mechanisms, we removed organometallic and radical reactions based on their reaction templates. This pre-processing leaves 33,099 reactions. We refer to this reaction subset as the USPTO-33K dataset in this paper.

Since a previous study reported that necessary reagents are frequently missing from the chemical reactions recorded in the USPTO dataset15, we intentionally removed the reagent information from all reactions. For instance, approximately 50% of the Suzuki coupling reactions lack a Pd catalyst and 40% of the Mitsunobu reactions do not include diethyl azodicarboxylate (DEAD) or diisopropyl azodicarboxylate (DIAD). This inconsistency makes it challenging to treat reactions with varying data completeness uniformly. Hence, we designed MechFinder to automatically generate the reagents needed for mechanistic labeling (see the Mechanistic Template (MT) section).
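As a minimal sketch of the reagent-removal step, assuming each reaction is stored as a standard three-part reaction SMILES of the form reactants>agents>products, the agent block can be dropped with a plain string operation. The function name `strip_reagents` and the example reaction are hypothetical and are not taken from the MechFinder code.

```python
def strip_reagents(reaction_smiles: str) -> str:
    """Drop the agent block from a 'reactants>agents>products' reaction SMILES.

    Assumption: the string has no trailing extension block and contains
    exactly two '>' separators, as in the standard reaction SMILES layout.
    """
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators: {reaction_smiles!r}")
    reactants, _agents, products = parts
    return f"{reactants}>>{products}"


# Hypothetical example: an esterification recorded with triethylamine as agent.
example = "CC(=O)Cl.OCC>CCN(CC)CC>CC(=O)OCC"
stripped = strip_reagents(example)
print(stripped)  # CC(=O)Cl.OCC>>CC(=O)OCC
```

Stripping every reaction to the same reactants>>products form puts all records on an equal footing before the missing reagents are regenerated by the mechanistic templates.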

Reaction Template (RT)

In our approach, we leverage the reaction rules, which are localized around specific atoms and bonds. This allows us to narrow down the scope of deriving mechanistic labels by focusing only on the atoms involved in the reaction. To obtain the reactivity information of a reaction dataset, we extract a set of reaction templates (RT)17 from each reaction in the dataset. We start by identifying the reaction centers by comparing the chemical environments between the same atoms before and after the reaction. Nonetheless, we recognize that in many cases the electron movement can go beyond the changed atoms, such as in the nucleophilic acyl substitution reaction shown in Fig. 1. Therefore, we also include moieties that are π-conjugated to the changing atoms, such as double, triple, and aromatic bonds, and several mechanistically important special groups, such as carbonyl group and acetal group. This reaction template is simpler than the template extracted by RDChiral18 but more informative than the local reaction template described in LocalRetro19. However, it is important to note that our automated workflow does not differentiate certain mechanisms, such as neighboring group participation and SN2’ mechanisms, from more common reaction mechanisms because the leading assumption we made when developing this method was that the mechanism occurs around the defined reaction center.

The overall template extraction is performed by the following four steps:

  1. Compare the chemical environment of each atom before and after the reaction according to the atom-mapping. Atoms whose chemical environment changes are identified as “changed atoms”.

  2. For each changed atom, identify the neighboring atoms connected to it in the reactants by a double, triple, or aromatic bond as “extended atoms”.

  3. To further extend the scope of the RT for mechanistic labeling, we manually define a set of mechanistically important special groups. If any changed atom belongs to one of the special groups, all the atoms in that group are also added to the “extended atoms” list. After identifying the extended atoms in the reactants, we record the atoms sharing the same atom-map numbers in the product. The details of the RT extension process can be found in Fig. S1 and S2.

  4. Using the RDKit Python package20, extract the chemical fragments in the reactants and products in SMARTS format based on the identified changed atoms and extended atoms, and connect the fragments with the reaction symbol “>>”.
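The first two steps can be illustrated with a small, dependency-free sketch. It is not the authors' RDKit implementation: molecules are reduced to hypothetical dictionaries keyed by atom-map number, the "chemical environment" is assumed to consist of element, formal charge, and bonded neighbours, and atoms absent from the product are assumed to count as changed. The special-group extension and SMARTS generation of steps 3 and 4 are omitted.

```python
from typing import Dict, FrozenSet, Set, Tuple

Atoms = Dict[int, Tuple[str, int]]    # map number -> (element, formal charge)
Bonds = Dict[FrozenSet[int], float]   # {map_a, map_b} -> bond order (1.5 = aromatic)


def environment(idx: int, atoms: Atoms, bonds: Bonds):
    """Element, charge, and the set of (neighbour, bond order) pairs of one atom."""
    neighbours = frozenset(
        (next(iter(pair - {idx})), order)
        for pair, order in bonds.items()
        if idx in pair
    )
    return atoms[idx], neighbours


def changed_atoms(r_atoms: Atoms, r_bonds: Bonds, p_atoms: Atoms, p_bonds: Bonds) -> Set[int]:
    """Step 1: atoms whose environment differs between reactants and products."""
    return {
        idx
        for idx in r_atoms
        if idx not in p_atoms
        or environment(idx, r_atoms, r_bonds) != environment(idx, p_atoms, p_bonds)
    }


def extended_atoms(changed: Set[int], r_bonds: Bonds) -> Set[int]:
    """Step 2: reactant neighbours joined to a changed atom by a multiple or aromatic bond."""
    extended: Set[int] = set()
    for pair, order in r_bonds.items():
        if order > 1 and pair & changed:
            extended |= pair - changed
    return extended


# Acetyl chloride + methoxide -> methyl acetate (heavy atoms only, chloride not recorded).
r_atoms = {1: ("C", 0), 2: ("C", 0), 3: ("O", 0), 4: ("Cl", 0), 5: ("O", -1), 6: ("C", 0)}
r_bonds = {frozenset((1, 2)): 1, frozenset((2, 3)): 2, frozenset((2, 4)): 1, frozenset((5, 6)): 1}
p_atoms = {1: ("C", 0), 2: ("C", 0), 3: ("O", 0), 5: ("O", 0), 6: ("C", 0)}
p_bonds = {frozenset((1, 2)): 1, frozenset((2, 3)): 2, frozenset((2, 5)): 1, frozenset((5, 6)): 1}

changed = changed_atoms(r_atoms, r_bonds, p_atoms, p_bonds)
extended = extended_atoms(changed, r_bonds)
print(sorted(changed), sorted(extended))  # [2, 4, 5] [3]
```

In this toy example the carbonyl oxygen (atom 3) is not a changed atom, yet it enters the template through the double bond to the carbonyl carbon, which is the behaviour motivated by Fig. 1.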

Mechanistic Template (MT)

Since RTs only capture the changes before and after the reaction, simply applying heuristic rules to RTs to generate mechanistic pathways without any in-domain chemistry knowledge has clear limitations, as the example in Fig. 1 shows. Therefore, we additionally introduce the concepts of mechanistic classes (MC) and mechanistic templates (MT) to describe the actual reaction mechanism. The MC is defined as a group of reactions following the same reaction mechanism, including one or multiple RTs. For a given MC, we then hand-code the MT, which describes the direction of electron movements in the form of a sequence of arrow-pushing diagrams, representing the attacking and electron-receiving moieties to incorporate chemistry knowledge. In particular, the designed MTs are able to distinguish different mechanisms sharing the same RT (such as SN1 and SN2 reactions) based on chemically designed criteria. In addition, the necessary reagents for deriving reaction mechanisms are recovered, and essential functional groups (such as electron-withdrawing groups, EWG) are additionally included in the mechanism labeling process by MTs. The proposed MTs are represented by categorizing the arrows that illustrate the movement of electron pairs in organic reactions into four groups: lone pair to atom, lone pair to bond, bond to atom, and bond to bond. Technically, the lone pairs of atoms are simply annotated by their atom-map numbers and the electron pairs from bonds are annotated by pairs of atom-map numbers.

Fig. 2

The four features of a proposed mechanistic template (MT). (a) Multiple RTs can match a single MT if they share the same mechanism. (b) One RT can lead to different MTs according to the criteria of the belonging MT. (c) Necessary reagents are added to complete the mechanistic pathway labeling. (d) Reactions whose mechanisms do not follow an arrow-pushing diagram cannot be labeled.

The proposed MT has four notable features:

  1. Because the atom types are specified in RTs but not in MTs, multiple RTs often share the same MT. For example, different nucleophiles in substitution reactions can lead to different RTs but the same MT (Fig. 2a).

  2. In some cases, a single RT can match different MTs depending on the specific chemical environment. In these cases, we design particular criteria to assign the correct MT to the obtained RT. For example, the choice between SN1 and SN2 depends on the alkyl group connected to the leaving group (Fig. 2b).

  3. Many reactions can only occur when additional reagents are added, and their mechanisms can only be labeled if these reagents are present. For these reactions, we add the necessary reagents to the reactant set to complete the mechanism (Fig. 2c). We define missing reagents as compounds whose atoms, lone pairs, or bonds participate in the mechanism but do not appear in the major products. In our labeled dataset, we add the necessary reagent(s) for approximately 19,000 reactions (60%).

  4. Since the mechanistic pathway labeled by this method is based on the movement of electron pairs, reaction mechanisms beyond this scope, such as organometallic or radical reactions, cannot be labeled by the current method (Fig. 2d). More examples of MTs can be found in Supplementary Information.

Notably, the mechanism derivation for certain groups of reactions inevitably requires additional moieties beyond those present in the extracted RT. To address this limitation of locality, our method includes technical provisions for capturing important mechanistic elements outside the RT. The framework and examples can be found in Supplementary Information.

Labeling reaction mechanism using MechFinder

In this paper, we introduce a dataset generated by a mechanism labeling framework called MechFinder, which uses the reaction templates (RTs) and mechanistic templates (MTs) introduced in the two preceding subsections. The process of using MechFinder to label the reaction mechanisms in a reaction dataset is divided into two phases: the expert annotation phase and the automatic labeling phase, as shown in Fig. 3a,b.

Fig. 3

Overall pipeline of (a) expert annotation and (b) automatic mechanism labeling with MechFinder. (c) Example of the process of labeling an SN2 reaction. First, we extract the RT of the reaction focused on reacting atoms (nitrogen, alkyl group, and chlorine with atom-map numbers of 6, 9, and 8, respectively) along with the lookup table containing one-to-one correspondence between template map numbers and atom-map numbers in the original reaction. Next, we manually identify the MC for extracted RT as “SN2 reaction” which has its unique MT characterized by the inherent electron flow of reactive moieties. Upon replacement of template map numbers in the MT (1, 2, 3) by the corresponding atom-map numbers in the original reaction (6, 9, 8) using the lookup table, we finally obtain a mechanistic sequence for the given reaction.

During the expert annotation phase (Fig. 3a), we first extracted N unique RTs from all the X reactions in the reaction dataset. For each RT, we sampled k representative reactions to manually label the mechanism by three steps shown in Fig. 3c:

  1. RT extraction. We extract the reaction template focused on the reaction center, describing the local changes in atomic configuration upon a chemical transformation. The extraction process also yields an atom-map lookup table, recording the one-to-one atom-map correspondence between the input reaction and the extracted RT.

  2. MT identification. Given the RT for the reaction, the MC and MT are identified by manual labeling in the expert annotation phase (once mapped, this step is automated in the actual large-scale mechanism generation).

  3. Mechanistic sequence acquisition. The mechanistic pathway for the input reaction is labeled by aligning the atom-map numbers from the MT to the input reaction according to the atom-map lookup table.
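The third step amounts to renumbering the MT through the lookup table. The sketch below uses the SN2 example of Fig. 3c, where template map numbers (1, 2, 3) correspond to atom-map numbers (6, 9, 8); the helper names `remap` and `apply_lookup` are hypothetical, and the nested list-and-tuple layout follows the MT notation given in the Data Records section.

```python
def remap(item, lookup):
    """Translate one source or sink from template numbering to reaction numbering."""
    if isinstance(item, (list, tuple)):
        # A bond is written as a list of the map numbers of its atoms.
        return [lookup[i] for i in item]
    # A lone pair or an atom is written as a single map number.
    return lookup[item]


def apply_lookup(mt, lookup):
    """Translate every (source, sink) step of a mechanistic template."""
    return [(remap(source, lookup), remap(sink, lookup)) for source, sink in mt]


sn2_mt = [(1, 2), ([2, 3], 3)]   # MT of the SN2 class, in template numbering
lookup = {1: 6, 2: 9, 3: 8}      # template map number -> atom-map number

label = apply_lookup(sn2_mt, lookup)
print(label)  # [(6, 9), ([9, 8], 8)]
```

Because the MT is written once per mechanistic class in template numbering, the same renumbering applies unchanged to every reaction that matches the template.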

The number of sampled reactions k in the expert annotation phase is defined by the complexity of the encountered RT. For simple reactions like nucleophilic acyl substitution, we only sample one reaction to label the MT. For more complex reactions like SNAr reaction, we sample more reactions to include more cases where the electron withdrawing groups (EWG) are located at different positions (ortho or para) with respect to the leaving group to label the MT with different criteria.

MechFinder’s approach is centered around these sampled reactions, which then enables the automated labeling of the entire dataset. During the expert annotation phase, represented in Fig. 3a, kN reactions (k reactions from each of N templates) are directly observed and labeled, which typically amount to a few hundred reactions that form the basis for the development of our mechanism template library. The automatic labelling phase, shown in Fig. 3b, applies developed mechanistic steps to the entire dataset (X reactions).

In the current dataset, we label 33,099 reactions (X = 33,099) with 100 unique RTs (N = 100), where we generally sample fewer than 5 reactions for each RT (k < 5). In other words, MechFinder in this work was developed by manually labeling fewer than 500 reactions, which is affordable for a group of chemists in a reasonable amount of time.

Data Records

The curated reaction dataset containing 31,364 organic reactions is provided as a single CSV file and it is publicly available at Figshare21. Each entry in the dataset corresponds to a single chemical reaction represented in string (SMILES16) format along with its electron pathways, namely mechanistic labels. The details of the data format are given in Table 1.

Table 1 Format of each data record: description, data label, data type.

The construction of a mechanistic label starts with the representation of source, i.e., attacking electrons and sink i.e., electron receiving moiety, which follows the guidelines described below:

  • Atom - represented by its atom map number.

  • Lone pair - represented by an atom map number of the atom containing it.

  • Bond - represented by a list of map numbers of atoms making it.

  • Source - can be either a lone pair or a bond.

  • Sink - can be either an atom or a bond.

Using representation rules for source and sink, we describe a mechanistic step as a pair of source and sink, which is a computational analogue of a curved arrow deployed in organic reaction mechanism drawing. Tail and head of the curved arrow align with the source and sink, respectively. Once we have represented source, sink and mechanistic step computationally, we define mechanistic template (MT) and mechanistic label:

  • Mechanistic Template (MT) - is characterized by an electron path specific to an assigned mechanistic class (MC). It is a list of mechanistic steps which mirrors an ordered collection of the curved arrows for a single mechanistic pathway drawn on a paper. For example, the MT shown in Fig. 3c is written as [(1, 2), ([2, 3], 3)]. The full list of 63 MTs hand-coded from top-100 RTs extracted from the USPTO dataset can be found in Table S2.

  • Mechanistic Label - is an extracted electron path corresponding to a reaction.
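To make the notation concrete, the following sketch classifies each mechanistic step into one of the four arrow groups named in the Methods section (lone pair to atom, lone pair to bond, bond to atom, bond to bond). It assumes only the convention stated above, namely that a single map number denotes an atom or lone pair and a list of map numbers denotes a bond; the function name `arrow_type` is hypothetical.

```python
def arrow_type(step):
    """Name the arrow group of one (source, sink) mechanistic step."""
    source, sink = step
    # A source is a lone pair (single number) or a bond (list of numbers).
    source_kind = "bond" if isinstance(source, (list, tuple)) else "lone pair"
    # A sink is an atom (single number) or a bond (list of numbers).
    sink_kind = "bond" if isinstance(sink, (list, tuple)) else "atom"
    return f"{source_kind} to {sink_kind}"


mt = [(1, 2), ([2, 3], 3)]   # the MT of Fig. 3c
kinds = [arrow_type(step) for step in mt]
print(kinds)  # ['lone pair to atom', 'bond to atom']
```

Read this way, the MT of Fig. 3c says that the lone pair on atom 1 attacks atom 2, after which the electron pair of the 2-3 bond moves onto atom 3.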

Since the essence of the electron pathway generation method lies in the extraction of reaction template (RT), it is noteworthy to mention how it is defined and represented.

  • Reaction Template (RT) - is a subgraph of a reaction focused on changed atoms, which are defined as the atoms whose chemical environment is altered upon transformation. For example, the RT shown in Fig. 3c is written as “[N:1].[C:2]-[X:3]>>[N:1]-[C:2]”. The full list of top-100 RTs can be found in Table S1. More details about the RT designed in this work can be found in Supplementary Information.

Technical Validation

From a total of 33,099 reactions in the USPTO-33K dataset, we identified 400 RTs based on their precise atom-mappings. Among these, we categorized the 100 most frequently occurring RTs (those shared by the most reactions) into 63 distinct MCs (X = 33,099, N = 100, M = 63), in line with the conceptual framework of reaction mechanisms representable by arrow-pushing diagrams. The chosen top 100 RTs cover 94.8% of the reactions in the USPTO-33K dataset. As a result, our dataset contains mechanistic annotations for 31,364 reactions that can be described by arrow-pushing diagrams. We performed technical validation by applying the mechanistic pathway to the reactants and comparing the resulting structure with the structure of the actual product(s) in the original reaction. We found a 100% match, which implies that all mechanistic pathways gave rise to the expected product.
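The product check can be pictured with a deliberately simplified electron-bookkeeping sketch. It is not the authors' validation code: it tracks only integer bond orders and formal charges between mapped heavy atoms, supports only two arrow types (lone pair to atom, and a bond collapsing onto one of its own atoms), and compares heavy-atom bonds while ignoring hydrogens and charges. All names and the toy SN2 reaction are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Bonds = Dict[FrozenSet[int], int]   # {map_a, map_b} -> integer bond order
Charges = Dict[int, int]            # map number -> formal charge


def apply_step(bonds: Bonds, charges: Charges, step: Tuple) -> None:
    """Apply one curved arrow (source, sink) in place."""
    source, sink = step
    if isinstance(source, int) and isinstance(sink, int):
        # Lone pair on `source` forms a bond (or raises the bond order) to `sink`.
        pair = frozenset((source, sink))
        bonds[pair] = bonds.get(pair, 0) + 1
        charges[source] += 1
        charges[sink] -= 1
    elif isinstance(source, list) and isinstance(sink, int) and sink in source:
        # Bonding pair collapses onto one of its own atoms (heterolysis).
        pair = frozenset(source)
        bonds[pair] -= 1            # KeyError here means the bond did not exist
        if bonds[pair] == 0:
            del bonds[pair]
        other = next(a for a in source if a != sink)
        charges[sink] -= 1
        charges[other] += 1
    else:
        raise NotImplementedError(f"arrow type not covered by this sketch: {step!r}")


def run_mechanism(bonds: Bonds, charges: Charges, label) -> Tuple[Bonds, Charges]:
    """Apply all steps to copies of the reactant state and return the final state."""
    bonds, charges = dict(bonds), dict(charges)
    for step in label:
        apply_step(bonds, charges, step)
    return bonds, charges


def matches_product(bonds: Bonds, product_bonds: Bonds, product_atoms: Set[int]) -> bool:
    """Compare bonds among the atoms that are recorded in the product."""
    kept = {pair: order for pair, order in bonds.items() if pair <= product_atoms}
    return kept == product_bonds


# Toy SN2 reaction: amine N6 (bonded to C5) attacks C9 and displaces Cl8.
reactant_bonds = {frozenset((5, 6)): 1, frozenset((9, 8)): 1}
reactant_charges = {5: 0, 6: 0, 8: 0, 9: 0}
label = [(6, 9), ([9, 8], 8)]

final_bonds, final_charges = run_mechanism(reactant_bonds, reactant_charges, label)
product_bonds = {frozenset((5, 6)): 1, frozenset((6, 9)): 1}
ok = matches_product(final_bonds, product_bonds, {5, 6, 9})
print(ok)  # True
```

A check of this kind confirms that a label reproduces the recorded connectivity, but it cannot tell whether the route taken is chemically sensible.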

There still remains the possibility of labeled mechanisms being deemed unreasonable even though the products generated by MechFinder are the same as the original products, as illustrated in Fig. 1. Therefore, we conducted a qualitative analysis by randomly sampling 100 reactions from the labeled mech-USPTO-31K dataset and asked 10 people with organic chemistry background (5 organic chemistry professors, 2 postdoctoral associates, and 3 graduate students) to evaluate the mechanistic labels generated by MechFinder one by one. Each chemist was instructed to assign either 0 (unreasonable), 0.5 (reasonable but with missing components), or 1 (fully reasonable) to each mechanism. To reduce the effect of the evaluation outliers, we compute the mean and standard deviation after removing the lowest and the highest scores for each evaluated reaction.
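The outlier handling described above is a trimmed statistic: for each reaction, one lowest and one highest score are discarded before averaging. A minimal sketch follows, assuming the population standard deviation (the text does not say whether the sample or population form was used) and a made-up set of ten scores.

```python
from statistics import mean, pstdev


def trimmed_stats(scores):
    """Mean and standard deviation after dropping one lowest and one highest score."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    trimmed = sorted(scores)[1:-1]
    # Assumption: population standard deviation; swap in statistics.stdev
    # if the sample form is intended.
    return mean(trimmed), pstdev(trimmed)


# Hypothetical scores from ten evaluators for a single reaction.
scores = [1, 1, 1, 1, 1, 1, 1, 0.5, 1, 0]
avg, sd = trimmed_stats(scores)
print(avg)  # 0.9375
```

Here the single 0 and one of the 1 scores are removed, so one dissenting evaluator cannot dominate the average, while the remaining 0.5 still lowers it.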

The average score and standard deviation for each evaluated reaction sorted by the number of labeled mechanistic steps is presented in Fig. 4a. The results indicate a consensus among the human chemists regarding the reaction mechanisms generated by MechFinder, where 74 of the reactions receive a perfect average score 1.0, and 97 of them receive scores higher than 0.9. We present the mechanistic class distribution of the sampled reactions in Fig. 4b, along with the scores assigned by the chemists. We also found that there is no significant relationship between the score of the labeled mechanism and the number of labeled mechanistic steps, demonstrating the robustness of MechFinder. In fact, 2 of the total 3 reactions with scores lower than 0.9 originated from the disagreement between SN1 and SN2 reactions. Depending on the substrates and reaction conditions, secondary substrates can engage in either SN1 or SN2 reactions or both. When an enantiomerically enriched substrate is utilized, an SN2 reaction results in a product with inverted stereochemistry, whereas an SN1 reaction leads to a product with ablated stereoinformation.

Fig. 4

The results of the technical evaluation of 100 randomly sampled reactions from the labeled dataset. (a) The mean and standard deviation of chemist score for each reaction mechanism labeled by MechFinder. The error bars are clipped at score 1.0 for valid presentation. The reactions in the evaluation questionnaire are randomly sampled from the dataset but sorted by the number of mechanistic steps in this figure for the ease of visualization. (b) Distribution of randomly sampled reactions across assigned mechanistic classes represented as a bar graph. We show the scores for the mechanistic classes having more than two reactions in the sampled dataset. (c) Example of a nucleophilic substitution reaction with a moderately low score and high standard deviation. This case highlights suggestion for alternative pathway from multiple chemists, namely SN2 process, differing from the SN1 mechanism generated by MechFinder. (d) Example of a reaction with a marginally lower average score. The lower score from a few chemists was attributed to the potential for an alternative route, which yields the major product through a more stable carbocation intermediate formed by the protonation of oxygen adjacent to the aryl ring.

In the instance shown in Fig. 4c, MechFinder assigned an SN1 mechanism by applying a criterion specific to our methodology. Several chemists (four out of eleven) considered the generated mechanism chemically implausible, proposing an SN2 mechanism based on the inverted stereochemistry of the product. The current version of MechFinder does not take stereochemical information into consideration when generating the mechanism. Another example of an alternative pathway suggestion is shown in Fig. 4d. Three out of eleven chemists proposed an alternative mechanism that produces the major product earlier in the pathway, through a carbocation with a positive charge delocalized onto the adjacent aryl ring. In reality, both mechanisms may be operative during the MOM deprotection. Other issues, such as omissions of certain resonance structures (e.g., the acylium ion in Friedel-Crafts acylation) and the need for additional steps (e.g., a prior deprotonation step in DCC condensation), have been identified and are demonstrated in Supplementary Information.

To compare the results generated by MechFinder with those of ELECTRO’s labeling approach12, we show the labeled mechanisms of five representative reactions from the top-15 most popular RTs in Fig. 5. The comparison shows that our labeled mechanisms capture all reacting atoms, accounting for their participation in the chemical reactivity, whereas ELECTRO’s approach often omits certain crucial atoms. The full set of 100 reaction mechanisms labeled by MechFinder, together with the corresponding ELECTRO labels, can be found in the Supplementary Materials.

Fig. 5

Example reactions were chosen from the top-15 popular RTs to demonstrate the accuracy of the labels obtained from MechFinder. (a) Condensation of amine and carboxylic acid to give amide. (b) Nucleophilic aromatic substitution between primary amine and aromatic bromide to give aromatic amine. (c) Demethylation of methyl ether to give alcohol as a major product. (d) Reductive amination. This reaction, widely used to prepare amines, relies on the formation of imines as intermediates which are then reduced to amines using mainly NaBH3CN as a reducing agent in a one-pot synthesis. (e) Nucleophilic substitution at sulfonyl center. The label we provided is similar to nucleophilic acyl substitution which is governed by the breaking and re-forming of the S=O bond through a tetrahedral intermediate.

Notably, for SNAr reactions (5.2% of the labeled dataset, Fig. 5b), our proposed method begins with the extraction of RTs, but the derived mechanisms go beyond the initially identified reaction center atoms. The flexibility of the method allows the derived mechanisms to accommodate additional atoms and bonds outside the template, capturing important chemical concepts like delocalization and tautomerization. This shows the method’s ability to bridge RT-extracted cores with the larger mechanistic landscape, consistent with organic chemistry principles. More details about the technical aspects of mechanism derivation beyond the RT, along with examples, can be found in Supplementary Information.