Abstract
Understanding organic reaction mechanisms is crucial for interpreting the formation of products at the atomic and electronic level, but it remains the domain of expert chemists. The lack of a large-scale dataset with chemically reasonable mechanistic sequences also hinders the development of reliable machine learning models that predict organic reactions based on mechanisms, as human chemists do. Here, we present the first large-scale, high-quality reaction dataset, denoted as mech-USPTO-31K, with chemically reasonable arrow-pushing diagrams validated by synthetic chemists, encompassing a wide spectrum of polar organic reaction mechanisms. The dataset was curated with a simple and flexible method that automatically generates reaction mechanisms from autonomously extracted reaction templates and expert-coded mechanistic templates. We envision that it will become a valuable tool for developing future reaction outcome prediction models and discovering new reactions.
Background & Summary
The ability to predict reaction outcomes for a given set of substrates, known as forward synthesis prediction, plays a crucial role in successful synthetic planning1,2. This encompasses not only the prediction of major products resulting from organic reactions but also the retrosynthetic analysis, which aims to identify a viable synthetic pathway to synthesize a desired target compound3. To ensure the reliability of retrosynthetic analysis, it is essential to couple retrosynthesis prediction models with reliable reaction outcome prediction models. While recent advances in machine learning models have shown promise in predicting reaction outcomes based on reactant sets4,5,6,7,8, these models often overlook the finer details of electron movements, reactive intermediates, and other mechanistic information that are crucial for a comprehensive understanding of the reaction9. Consequently, there is a need for more sophisticated and accurate chemical models that can explicitly capture the underlying reaction mechanisms, which involve a step-by-step sequence of electron movements and reactive intermediates, to gain valuable insights into the stereochemistry, reaction kinetics, formation of byproducts, and other important reaction details.
Arrow-pushing diagrams, commonly used by organic chemists, provide a visual representation of electron rearrangements as bonds form and break10. While an ideal chemical model would predict the same arrow-pushing diagrams that human chemists draw, no reliable model of this kind has yet been developed because a mechanistic reaction dataset is lacking. In an early attempt to automate this process, Chen and Baldi developed Reaction Explorer11 to predict major products and mechanistic steps from reactants and reagents using a set of prioritized transformation rules. While it provided detailed and reasonable mechanistic descriptions, the rules are prioritized, so introducing new rules to cover more reactions would require revising a significant proportion of the existing ones. The approach is therefore difficult to scale to larger reaction datasets containing diverse reaction types. In a more recent effort, Bradshaw et al. proposed a machine learning-based electron path prediction model called ELECTRO12, which is fast and scalable. However, because the model extracts electron movements from reactant and product information alone, without the chemical guidance of actual mechanisms, most of the electron pathways it predicts are approximate mechanisms. This gives rise to discrepancies between predicted and generally accepted mechanisms, as illustrated by the example in Fig. 1, highlighting the need for a more reliable mechanistic prediction model. Because of these challenges, a comprehensive database of large-scale reaction mechanisms is still lacking.
To overcome the limitations of existing methods, namely the intrinsically approximate answers of ELECTRO and the limited scalability of Reaction Explorer, we propose a two-step method called MechFinder that labels reaction mechanisms in a chemically reasonable and scalable manner using automatically extracted reaction templates and expert-coded mechanistic templates. In our method, we extract the most important subgraph of the reaction transformation as a reaction template, and manually label the mechanism of each reaction template, based on chemists' knowledge, in the form of a mechanistic template. Using both reaction templates and mechanistic templates, our method then labels the mechanisms of a large number of chemical reactions automatically by passing each reaction through two separate stages that determine its template types. Our method enables the generation of mechanistic pathways for various reaction types, including pericyclic reactions and those involving multiple steps, such as reductive amination. These labels serve as computational analogs of the conventional arrows employed in the arrow-pushing model. To evaluate the effectiveness of our method, we curated a subset of the USPTO-50K dataset as a benchmark and assessed the coverage and applicability of our approach. We report the first large-scale reaction dataset with chemically reasonable reaction mechanisms, denoted as mech-USPTO-31K, which should help the chemistry community develop mechanism-based models for predicting reactivity.
Methods
Reaction data
In this paper, we used the reaction data extracted from USPTO granted patents collected by Lowe13, an organic reaction dataset extensively used in benchmarking various reaction prediction approaches. In particular, we demonstrate the results using the USPTO-50K dataset curated by Schneider et al.14 and atom-mapped by LocalMapper15. Each reaction in this dataset is presented in SMILES16 format. Since our approach only addresses two-electron arrow-pushing mechanisms, we removed organometallic and radical reactions based on their reaction templates. This pre-processing leaves 33,099 reactions, which we refer to as the USPTO-33K dataset in this paper.
Since a previous study reported that necessary reagents are frequently missing from the chemical reactions recorded in the USPTO dataset15, we intentionally removed the reagent information from all reactions. For instance, approximately 50% of the Suzuki coupling reactions lack a Pd catalyst and 40% of the Mitsunobu reactions do not include diethyl azodicarboxylate (DEAD) or diisopropyl azodicarboxylate (DIAD). This data impurity makes it challenging to treat reactions with varying data completeness consistently. Hence, we designed MechFinder to automatically generate the reagents needed for mechanistic labeling (see Mechanistic Template (MT) section).
Reaction Template (RT)
In our approach, we leverage reaction rules, which are localized around specific atoms and bonds. This allows us to narrow the scope of deriving mechanistic labels by focusing only on the atoms involved in the reaction. To obtain the reactivity information of a reaction dataset, we extract a reaction template (RT)17 from each reaction in the dataset. We start by identifying the reaction centers by comparing the chemical environments of the same atoms before and after the reaction. Nonetheless, we recognize that in many cases the electron movement can go beyond the changed atoms, such as in the nucleophilic acyl substitution reaction shown in Fig. 1. Therefore, we also include moieties that are π-conjugated to the changing atoms through double, triple, or aromatic bonds, and several mechanistically important special groups, such as the carbonyl and acetal groups. This reaction template is simpler than the template extracted by RDChiral18 but more informative than the local reaction template described in LocalRetro19. However, it is important to note that our automated workflow does not differentiate certain mechanisms, such as neighboring group participation and SN2’ mechanisms, from more common reaction mechanisms, because the leading assumption we made when developing this method was that the mechanism occurs around the defined reaction center.
The overall template extraction is performed by the following four steps:
1. Compare the chemical environment of each atom before and after the reaction according to the atom-mapping. The atoms that are found to have changes in chemical environment are identified as “changed atoms”.
2. For each identified changed atom, we identify the neighboring atoms connected to the changed atom in the reactants by a double, triple, or aromatic bond as “extended atoms”.
3. To further extend the scope of the RT for mechanistic labeling, we manually define a set of mechanistically important special groups. If any of the changed atoms are identified in one of the special groups, all the atoms in that special group are also added to the “extended atoms” list. After identifying the extended atoms in the reactants, we record the atoms sharing the same atom-map numbers in the product. The details of the RT extension process can be found in Fig. S1 and S2.
4. Using the RDKit Python package20, we extract the chemical fragments in the reactants and products in SMARTS format based on the identified changed atoms and extended atoms, and connect the fragments by the reaction symbol “>>”.
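Steps 1 and 2 above can be illustrated with a minimal sketch in plain Python. It is not the MechFinder implementation: the bond-dictionary representation, the function names, and the reduction of "chemical environment" to neighbor and bond-order information are all assumptions made for illustration (the actual workflow uses RDKit and may also compare charges, hydrogen counts, and special groups).

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical representation: {frozenset({a, b}): bond order}, keyed by atom-map numbers.
Bonds = Dict[FrozenSet[int], float]

def environment(atom: int, bonds: Bonds) -> Tuple[Tuple[int, float], ...]:
    """Sorted (neighbor map number, bond order) pairs around one atom."""
    env = []
    for pair, order in bonds.items():
        if atom in pair:
            (other,) = pair - {atom}
            env.append((other, order))
    return tuple(sorted(env))

def changed_atoms(atoms: Set[int], reactant_bonds: Bonds, product_bonds: Bonds) -> Set[int]:
    """Step 1: atoms whose environment differs before and after the reaction."""
    return {a for a in atoms
            if environment(a, reactant_bonds) != environment(a, product_bonds)}

def extended_atoms(changed: Set[int], reactant_bonds: Bonds) -> Set[int]:
    """Step 2: reactant-side neighbors joined to a changed atom by a multiple bond."""
    extended: Set[int] = set()
    for pair, order in reactant_bonds.items():
        if order > 1 and pair & changed:
            extended |= pair - changed
    return extended

# Toy nucleophilic acyl substitution: amine N(1) attacks acyl chloride C(2)-Cl(3);
# O(4) is the carbonyl oxygen, C(5) and C(6) are spectator carbons.
reactant_bonds = {frozenset({2, 3}): 1, frozenset({2, 4}): 2,
                  frozenset({2, 5}): 1, frozenset({1, 6}): 1}
product_bonds = {frozenset({1, 2}): 1, frozenset({2, 4}): 2,
                 frozenset({2, 5}): 1, frozenset({1, 6}): 1}

changed = changed_atoms({1, 2, 3, 4, 5, 6}, reactant_bonds, product_bonds)
extended = extended_atoms(changed, reactant_bonds)
```

In this toy case the carbonyl oxygen is not a changed atom, yet it is picked up as an extended atom through its double bond to the carbonyl carbon, which is the situation the nucleophilic acyl substitution example is meant to cover.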
Mechanistic Template (MT)
Since RTs only capture the changes before and after the reaction, simply applying heuristic rules to RTs to generate a mechanistic pathway without any in-domain chemistry knowledge has clear limitations, as the example in Fig. 1 shows. Therefore, we additionally introduce the concepts of mechanistic classes (MC) and mechanistic templates (MT) to describe the actual reaction mechanism. An MC is defined as a group of reactions following the same reaction mechanism, and it may include one or multiple RTs. For a given MC, we then hand-code the MT, which describes the direction of electron movements as a sequence of arrow-pushing diagrams, representing the attacking and electron-receiving moieties, to incorporate chemistry knowledge. In particular, the designed MTs are able to distinguish different mechanisms sharing the same RT (such as SN1 and SN2 reactions) based on chemically designed criteria. In addition, the MTs recover the reagents necessary for deriving reaction mechanisms and include essential functional groups (such as electron-withdrawing groups, EWG) in the mechanism labeling process. The proposed MTs categorize the arrows that illustrate the movement of electron pairs in organic reactions into four groups: lone pair to atom, lone pair to bond, bond to atom, and bond to bond. Technically, the lone pairs of atoms are simply annotated by their atom-map numbers and the electron pairs from bonds are annotated by pairs of atom-map numbers.
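The four arrow groups follow directly from this annotation scheme. The sketch below is only an illustration: it assumes the list-of-pairs notation the paper uses for its example MT, [(1, 2), ([2, 3], 3)], and the function name arrow_type is hypothetical.

```python
from typing import List, Tuple, Union

Atom = int            # an atom (or its lone pair), given by an atom-map number
Bond = List[int]      # a bond, given by the two atom-map numbers of its atoms
Arrow = Tuple[Union[Atom, Bond], Union[Atom, Bond]]  # (source, sink)

def arrow_type(arrow: Arrow) -> str:
    """Classify one curved arrow into the four groups named in the text."""
    source, sink = arrow
    source_kind = "bond" if isinstance(source, (list, tuple)) else "lone pair"
    sink_kind = "bond" if isinstance(sink, (list, tuple)) else "atom"
    return f"{source_kind} to {sink_kind}"

# Example MT in the paper's notation: a lone pair on atom 1 attacks atom 2,
# then the electrons of the 2-3 bond move onto atom 3.
example_mt = [(1, 2), ([2, 3], 3)]
arrow_types = [arrow_type(a) for a in example_mt]
```

Here arrow_types evaluates to ["lone pair to atom", "bond to atom"].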
The proposed MT has four notable features:
1. Because the atom types are specified in RTs but not in MTs, multiple RTs often share the same MT. For example, different nucleophiles in substitution reactions can lead to different RTs but the same MT (Fig. 2a).
2. In some cases, a single RT can match different MTs depending on the specific chemical environment. In these cases, we design particular criteria to assign the correct MT to the obtained RT. For example, the decision between SN1 and SN2 depends on the alkyl group connected to the leaving group (Fig. 2b).
3. Many reactions can only occur when additional reagents are added, and the reaction mechanism can only be labeled if these reagents are present. For these reactions, we put the necessary additional reagents into the reactant set to complete the mechanism (Fig. 2c). We define missing reagents as compounds whose atoms, lone pairs or bonds participate in the mechanism but do not appear in the major products. In our labeled dataset, we add the necessary reagent(s) for approximately 19,000 reactions (60%).
4. Since the mechanistic pathway labeled by this method is based on the movement of electron pairs, reaction mechanisms beyond this scope, such as organometallic or radical reactions, cannot be labeled by the current method (Fig. 2d). More examples of MTs can be found in Supplementary Information.
Notably, the mechanism derivation for certain groups of reactions inevitably requires additional moieties beyond those present in the extracted RT. To address this limitation of locality, we have built additional flexibility into our method so that it can capture these important mechanistic elements. The framework and examples can be found in Supplementary Information.
Labeling reaction mechanism using MechFinder
In this paper, we introduce a dataset generated by a mechanism labeling framework called MechFinder, which uses the reaction templates (RTs) and mechanistic templates (MTs) described in the previous subsections. The process of using MechFinder to label the reaction mechanisms in a reaction dataset is divided into two phases: the expert annotation phase and the automatic labeling phase, as shown in Fig. 3a,b.
During the expert annotation phase (Fig. 3a), we first extracted N unique RTs from all X reactions in the reaction dataset. For each RT, we sampled k representative reactions and manually labeled the mechanism in the three steps shown in Fig. 3c:
1. RT extraction. We extracted a reaction template focused on the reaction center, describing the local changes in atomic configuration upon a chemical transformation. The extraction process also yields an atom-map lookup table, recording the one-to-one atom-map correspondence between the input reaction and the extracted RT.
2. MT identification. Given the RT for the reaction, the MC and MT are identified by manual labeling in the expert annotation phase (but, once mapped, this step is automated in the actual large-scale mechanism generation).
3. Mechanistic sequence acquisition. The mechanistic pathway for the input reaction is labeled by aligning the atom-map numbers from the MT to the input reaction according to the atom-map lookup table.
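The third step amounts to renumbering the MT through the lookup table. The following sketch assumes the table is a dictionary from template atom-map numbers to reaction atom-map numbers; the function name and the map numbers 7, 4 and 12 are invented for illustration and do not come from the dataset.

```python
from typing import Dict, List

def align_mechanism(mt: List[tuple], lookup: Dict[int, int]) -> List[tuple]:
    """Translate an MT from template numbering into the numbering of one reaction."""
    def relabel(part):
        # A bond is a pair of map numbers; an atom or lone pair is a single number.
        if isinstance(part, (list, tuple)):
            return [lookup[i] for i in part]
        return lookup[part]
    return [(relabel(source), relabel(sink)) for source, sink in mt]

# MT in template numbering, using the example notation from the paper.
template_mt = [(1, 2), ([2, 3], 3)]

# Hypothetical lookup table produced during RT extraction:
# template atom 1 is atom 7 in the input reaction, and so on.
lookup_table = {1: 7, 2: 4, 3: 12}

mechanistic_label = align_mechanism(template_mt, lookup_table)
```

With this table, mechanistic_label becomes [(7, 4), ([4, 12], 12)], the same electron path expressed in the atom-map numbers of the input reaction.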
The number of sampled reactions k in the expert annotation phase is determined by the complexity of the encountered RT. For simple reactions like nucleophilic acyl substitution, we only sample one reaction to label the MT. For more complex reactions like SNAr reactions, we sample more reactions to include cases where the electron-withdrawing groups (EWG) are located at different positions (ortho or para) with respect to the leaving group, so that the MT can be labeled with different criteria.
MechFinder’s approach is centered around these sampled reactions, which then enable the automated labeling of the entire dataset. During the expert annotation phase, represented in Fig. 3a, kN reactions (k reactions from each of N templates) are directly inspected and labeled; these typically amount to a few hundred reactions and form the basis of our mechanistic template library. The automatic labeling phase, shown in Fig. 3b, applies the resulting mechanistic templates to the entire dataset (X reactions).
In the current dataset, we label 33,099 reactions (X = 33,099) with 100 unique RTs (N = 100), where we generally sample fewer than 5 reactions for each RT (k < 5). In other words, MechFinder in this work is developed by manually labeling fewer than 500 reactions, which is affordable for a group of chemists in a reasonable amount of time.
Data Records
The curated reaction dataset containing 31,364 organic reactions is provided as a single CSV file and it is publicly available at Figshare21. Each entry in the dataset corresponds to a single chemical reaction represented in string (SMILES16) format along with its electron pathways, namely mechanistic labels. The details of the data format are given in Table 1.
The construction of a mechanistic label starts with the representation of the source, i.e., the attacking electrons, and the sink, i.e., the electron-receiving moiety, which follows the guidelines described below:
- Atom - represented by its atom map number.
- Lone pair - represented by the atom map number of the atom containing it.
- Bond - represented by a list of the map numbers of the atoms making it.
- Source - can be either a lone pair or a bond.
- Sink - can be either an atom or a bond.
Using representation rules for source and sink, we describe a mechanistic step as a pair of source and sink, which is a computational analogue of a curved arrow deployed in organic reaction mechanism drawing. Tail and head of the curved arrow align with the source and sink, respectively. Once we have represented source, sink and mechanistic step computationally, we define mechanistic template (MT) and mechanistic label:
- Mechanistic Template (MT) - is characterized by an electron path specific to an assigned mechanistic class (MC). It is a list of mechanistic steps which mirrors an ordered collection of the curved arrows for a single mechanistic pathway drawn on paper. For example, the MT shown in Fig. 3c is written as [(1, 2), ([2, 3], 3)]. The full list of 63 MTs hand-coded from the top-100 RTs extracted from the USPTO dataset can be found in Table S2.
- Mechanistic Label - is an extracted electron path corresponding to a reaction.
Since the essence of the electron pathway generation method lies in the extraction of the reaction template (RT), it is worth describing how it is defined and represented.
- Reaction Template (RT) - is a subgraph of a reaction focused on changed atoms, which are defined as the atoms whose chemical environment is altered upon transformation. For example, the RT shown in Fig. 3c is written as “[N:1].[C:2]-[X:3]>>[N:1]-[C:2]”. The full list of top-100 RTs can be found in Table S1. More details about the RT designed in this work can be found in Supplementary Information.
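As a small illustration of how such an RT string can be read, the sketch below splits the example template at the conventional ">>" reaction arrow and collects the atom-map numbers on each side using only the standard library. The helper names are hypothetical, and a real workflow would parse the SMARTS with a cheminformatics toolkit such as RDKit rather than a regular expression.

```python
import re
from typing import Set, Tuple

def map_numbers(fragment: str) -> Set[int]:
    """Collect atom-map numbers such as the 2 in '[C:2]'."""
    return {int(n) for n in re.findall(r":(\d+)\]", fragment)}

def split_template(rt: str) -> Tuple[Set[int], Set[int]]:
    """Return the mapped atoms on the reactant side and on the product side."""
    reactant_side, product_side = rt.split(">>")
    return map_numbers(reactant_side), map_numbers(product_side)

reactant_maps, product_maps = split_template("[N:1].[C:2]-[X:3]>>[N:1]-[C:2]")

# Atoms mapped in the reactants but absent from the product template,
# here the leaving group X.
leaving_maps = reactant_maps - product_maps
```

For the example template, the reactant side carries map numbers 1 to 3, the product side carries 1 and 2, and atom 3 is identified as leaving.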
Technical Validation
From a total of 33,099 reactions in the USPTO-33K dataset, we identified 400 RTs based on their precise atom-mappings. Among these, we categorized the 100 most frequently occurring RTs (those shared by the most chemical reactions) into 63 distinct MCs (X = 33,099, N = 100, M = 63), in line with the conceptual framework of reaction mechanisms that can be represented by arrow-pushing diagrams. The chosen top 100 RTs cover 94.8% of the reactions in the USPTO-33K dataset. As a result, our dataset contains mechanistic annotations for 31,364 reactions that can be described by arrow-pushing diagrams. We performed technical validation by applying the mechanistic pathway to the reactants and comparing the resulting structure with the structure of the actual product(s) in the original reaction. We found a 100% match, which means that every mechanistic pathway gives rise to the expected product.
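The product check can be pictured as replaying the arrows on a bond table and comparing the result with the product. The sketch below is a simplified assumption, not the validation code used for the dataset: it handles only the two arrow types that occur in the example MT [(1, 2), ([2, 3], 3)], tracks formal-charge changes in a naive way, and uses hypothetical names throughout.

```python
from typing import Dict, FrozenSet, List, Tuple

Bonds = Dict[FrozenSet[int], int]  # {frozenset({a, b}): bond order}

def apply_mechanism(bonds: Bonds, steps: List[tuple]) -> Tuple[Bonds, Dict[int, int]]:
    """Replay mechanistic steps; return the new bonds and formal-charge changes."""
    bonds = dict(bonds)            # work on a copy
    charges: Dict[int, int] = {}
    for source, sink in steps:
        if isinstance(source, int) and isinstance(sink, int):
            # Lone pair to atom: form (or strengthen) the source-sink bond.
            key = frozenset((source, sink))
            bonds[key] = bonds.get(key, 0) + 1
            donor, acceptor = source, sink
        elif isinstance(source, list) and isinstance(sink, int) and sink in source:
            # Bond to one of its own atoms: heterolysis, electrons end up on the sink.
            key = frozenset(source)
            bonds[key] -= 1
            if bonds[key] == 0:
                del bonds[key]
            (donor,) = set(source) - {sink}
            acceptor = sink
        else:
            raise NotImplementedError("arrow type not covered by this sketch")
        charges[donor] = charges.get(donor, 0) + 1
        charges[acceptor] = charges.get(acceptor, 0) - 1
    return bonds, charges

# Substitution example: N(1) attacks C(2), and the C(2)-X(3) bond breaks.
reactant_bonds = {frozenset({2, 3}): 1}
expected_product_bonds = {frozenset({1, 2}): 1}

final_bonds, charge_changes = apply_mechanism(reactant_bonds, [(1, 2), ([2, 3], 3)])
matches_product = final_bonds == expected_product_bonds
```

In this toy case the replayed bond table equals the expected product, and the charge bookkeeping leaves the nucleophile at +1 and the leaving group at -1, which is the state before any later proton transfer.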
A labeled mechanism may still be unreasonable even when the products generated by MechFinder are the same as the original products, as illustrated in Fig. 1. Therefore, we conducted a qualitative analysis by randomly sampling 100 reactions from the labeled mech-USPTO-31K dataset and asking 10 people with an organic chemistry background (5 organic chemistry professors, 2 postdoctoral associates, and 3 graduate students) to evaluate the mechanistic labels generated by MechFinder one by one. Each chemist was instructed to assign either 0 (unreasonable), 0.5 (reasonable but with missing components), or 1 (fully reasonable) to each mechanism. To reduce the effect of outliers in the evaluation, we computed the mean and standard deviation for each evaluated reaction after removing its lowest and highest scores.
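The outlier-trimmed statistics described here can be sketched as follows. The scores are invented for illustration, the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import List, Tuple

def trimmed_stats(scores: List[float]) -> Tuple[float, float]:
    """Drop one lowest and one highest score, then return (mean, standard deviation)."""
    kept = sorted(scores)[1:-1]
    return mean(kept), stdev(kept)

# Hypothetical scores from ten evaluators for a single reaction.
scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.0]
average, spread = trimmed_stats(scores)
```

For these invented scores, the single 0 and one of the 1.0 values are discarded, and the remaining eight scores average 0.9375.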
The average score and standard deviation for each evaluated reaction, sorted by the number of labeled mechanistic steps, are presented in Fig. 4a. The results indicate a consensus among the human chemists regarding the reaction mechanisms generated by MechFinder: 74 of the reactions receive a perfect average score of 1.0, and 97 of them receive scores higher than 0.9. We present the mechanistic class distribution of the sampled reactions in Fig. 4b, along with the scores assigned by the chemists. We also found no significant relationship between the score of the labeled mechanism and the number of labeled mechanistic steps, demonstrating the robustness of MechFinder. In fact, 2 of the 3 reactions with scores lower than 0.9 stem from disagreement over the assignment of SN1 versus SN2 mechanisms. Depending on the substrates and reaction conditions, secondary substrates can engage in either SN1 or SN2 reactions or both. When an enantiomerically enriched substrate is used, an SN2 reaction results in a product with inverted stereochemistry, whereas an SN1 reaction leads to a product in which the stereochemical information is lost.
In the instance shown in Fig. 4c, MechFinder assigned an SN1 mechanism by applying a criterion specific to our methodology. Several chemists (four out of eleven) considered the generated mechanism chemically implausible, proposing an SN2 mechanism based on the inverted stereochemistry of the product. The current version of MechFinder does not take stereochemical information into consideration when generating the mechanism. Another example of an alternative pathway suggestion is shown in Fig. 4d. Three out of eleven chemists proposed an alternative mechanism that produces the major product earlier in the pathway and through a carbocation with a positive charge delocalized onto the adjacent aryl ring. In reality, both mechanisms may be operative during the MOM deprotection. Other issues, such as omissions of certain resonance structures (e.g., the acylium ion in Friedel-Crafts acylation) and the need for additional steps (e.g., a prior deprotonation step in DCC condensation), have been identified and are demonstrated in Supplementary Information.
To compare the results generated by MechFinder and by ELECTRO’s labeling approach12, we show the labeled mechanisms of five representative reactions from the top-15 most popular RTs in Fig. 5. The comparison shows that our labeled mechanisms capture all reacting atoms, accounting for their participation in the chemical reactivity, whereas ELECTRO’s approach often omits certain crucial atoms. The full set of 100 reaction mechanisms labeled by MechFinder and by ELECTRO’s approach can be found in the Supplementary Materials.
It is noteworthy that for SNAr reactions (5.2% of the labeled dataset, Fig. 5b), our proposed method begins with the extraction of RTs, but the derived mechanisms go beyond the initially identified reaction center atoms. The flexibility of the method allows the derived mechanisms to accommodate additional atoms and bonds outside the template, capturing important chemical concepts like delocalization and tautomerization. This shows the method’s ability to bridge RT-extracted cores with the larger mechanistic landscape, consistent with organic chemistry principles. More details about the technical aspects of mechanism derivation beyond the RT, along with examples, can be found in Supplementary Information.
Code availability
The code for MechFinder introduced in this paper is publicly available at https://github.com/snu-micc/MechFinder.
References
de Almeida, A. F., Moreira, R. & Rodrigues, T. Synthetic organic chemistry driven by artificial intelligence. Nature Reviews Chemistry 3, 589–604 (2019).
Warr, W. A. A short review of chemical reaction database systems, computer-aided synthesis design, reaction prediction and synthetic feasibility. Molecular Informatics 33, 469–476 (2014).
Warren, S. & Wyatt, P. Organic synthesis: the disconnection approach (John Wiley & Sons, 2008).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS central science 5, 1572–1583 (2019).
Coley, C. W. et al. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical science 10, 370–377 (2019).
Sacha, M. et al. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. Journal of Chemical Information and Modeling 61, 3273–3284 (2021).
Chen, S. & Jung, Y. A generalized-template-based graph neural network for accurate organic reactivity prediction. Nature Machine Intelligence 4, 772–780 (2022).
Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology 3, 015022 (2022).
Venkatasubramanian, V. & Mann, V. Artificial intelligence in reaction prediction and chemical synthesis. Current Opinion in Chemical Engineering 36, 100749 (2022).
Levy, D. E. Arrow-pushing in organic chemistry: an easy approach to understanding reaction mechanisms (John Wiley & Sons, 2017).
Chen, J. H. & Baldi, P. No electron left behind: a rule-based expert system to predict chemical reactions and reaction mechanisms. Journal of chemical information and modeling 49, 2034–2043 (2009).
Bradshaw, J., Kusner, M., Paige, B., Segler, M. & Hernández-Lobato, J. A generative model for electron paths. In 7th International Conference on Learning Representations, ICLR 2019, vol. 7 (International Conference on Learning Representations (ICLR), 2019).
Lowe, D. M. Extraction of chemical structures and reactions from the literature. Figshare https://doi.org/10.6084/m9.figshare.5104873.v1 (2012).
Schneider, N., Stiefl, N. & Landrum, G. A. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of chemical information and modeling 56, 2336–2346 (2016).
Chen, S., An, S., Babazade, R. & Jung, Y. Precise atom-to-atom mapping for organic reactions via human-in-the-loop machine learning. Nature Communications 15, 2250 (2024).
Weininger, D. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences 28, 31–36 (1988).
Chen, S. et al. Reaction templates: Bridging synthesis knowledge and artificial intelligence. Accounts of Chemical Research 16952–16968 (2024).
Coley, C. W., Green, W. H. & Jensen, K. F. Rdchiral: An rdkit wrapper for handling stereochemistry in retrosynthetic template extraction and application. Journal of chemical information and modeling 59, 2529–2537 (2019).
Chen, S. & Jung, Y. Deep retrosynthetic reaction prediction using local reactivity and global attention. JACS Au 1, 1612–1620 (2021).
The RDKit Development Team. The RDKit Documentation Accessed: [2023]. (2020).
Chen, S., Babazade, R., Kim, T., Han, S. & Jung, Y. A Large-Scale Reaction Dataset of Mechanistic Pathways of Organic Reactions. Figshare https://doi.org/10.6084/m9.figshare.24797220.v2 (2023).
Acknowledgements
This research was supported by the Digital Research Innovation Institution Program through NRF Korea (RS-2023-00283902) and AI Graduate School Program (RS-2021-II211343). We thank 10 chemists who participated in the qualitative evaluation of MechFinder’s performance for 100 randomly chosen reactions.
Author information
Contributions
S.C. and Y.J. conceived the project. S.C. designed the methods and experiments. R.B. refined the methods and performed the experiments. R.B., T.K. and S.H. analyzed the chemical results. R.B., S.C. and Y.J. wrote the initial draft, and all authors participated in editing the manuscript. Y.J. and S.H. supervised the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, S., Babazade, R., Kim, T. et al. A large-scale reaction dataset of mechanistic pathways of organic reactions. Sci Data 11, 863 (2024). https://doi.org/10.1038/s41597-024-03709-y
DOI: https://doi.org/10.1038/s41597-024-03709-y