Organic reactions are usually assigned to classes containing reactions with similar reagents and mechanisms. Reaction classes facilitate the communication of complex concepts and efficient navigation through chemical reaction space. However, the classification process is a tedious task. It requires identification of the corresponding reaction class template via annotation of the number of molecules in the reactions, the reaction centre and the distinction between reactants and reagents. Here, we show that transformer-based models can infer reaction classes from non-annotated, simple text-based representations of chemical reactions. Our best model reaches a classification accuracy of 98.2%. We also show that the learned representations can be used as reaction fingerprints that capture fine-grained differences between reaction classes better than traditional reaction fingerprints. The insights into chemical reaction space enabled by our learned fingerprints are illustrated by an interactive reaction atlas providing visual clustering and similarity searching.
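To make the idea of a "non-annotated, simple text-based representation" concrete, the sketch below splits a reaction SMILES string into tokens that a sequence model could consume. It is an illustration only. The regular expression is a pattern commonly used for SMILES tokenization and is an assumption here, since the abstract does not specify the tokenizer. The function names and the example esterification reaction are likewise hypothetical.

```python
import re
from typing import List, Tuple

# Assumed tokenization pattern (not taken from the text above): bracket atoms,
# two-letter halogens, organic-subset atoms, bonds, ring closures and the
# reaction arrow each become one token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_reaction_smiles(rxn: str) -> List[str]:
    """Split a reaction SMILES into tokens; fail if any character is dropped."""
    tokens = SMILES_TOKEN_PATTERN.findall(rxn)
    if "".join(tokens) != rxn:
        raise ValueError(f"Could not tokenize reaction SMILES: {rxn!r}")
    return tokens


def split_reaction(rxn: str) -> Tuple[List[str], List[str]]:
    """Separate precursors from products without any role annotation.

    Reactants and reagents are deliberately not distinguished: everything
    left of '>>' is treated as one unordered set of precursor molecules.
    """
    precursors, products = rxn.split(">>")
    return precursors.split("."), products.split(".")


# Hypothetical example: acetic acid + ethanol -> ethyl acetate.
example = "CC(=O)O.OCC>>CC(=O)OCC"
tokens = tokenize_reaction_smiles(example)
precursors, products = split_reaction(example)
print(" ".join(tokens))
print(precursors, products)
```

The string carries no atom mapping, reaction-centre labels or reactant/reagent roles, which is the kind of input the abstract says the models classify directly.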
The Schneider 50k dataset is publicly available (ref. 25). We provide a new reaction dataset (USPTO 1k TPL), derived from the work of Lowe (ref. 50), containing the 1,000 most common reaction templates as classes. It can be accessed through https://rxn4chemistry.github.io/rxnfp. The commercial Pistachio (version 191118) dataset can be obtained from NextMove Software (ref. 38). Pistachio relies on LeadMine (ref. 58) to text-mine patent data. The dataset comes with reaction classes assigned using NameRxn (https://www.nextmovesoftware.com/namerxn.html).
Grzybowski, B. A., Bishop, K. J. M., Kowalczyk, B. & Wilmer, C. E. The ‘wired’ universe of organic chemistry. Nat. Chem. 1, 31–36 (2009).
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Cent. Sci. 3, 1237–1245 (2017).
IBM RXN for Chemistry (IBM); https://rxn.res.ibm.com
Coley, C. W., Barzilay, R., Jaakkola, T. S., Green, W. H. & Jensen, K. F. Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443 (2017).
Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C. & Laino, T. ‘Found in translation’: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci. 9, 6091–6098 (2018).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Thakkar, A., Kogej, T., Reymond, J.-L., Engkvist, O. & Bjerrum, E. J. Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chem. Sci. 11, 154–168 (2020).
Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020).
Vaucher, A. C. et al. Automated extraction of chemical synthesis actions from experimental procedures. Nat. Commun. 11, 3601 (2020).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 5998–6008 (NIPS, 2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics 4171–4186 (Association for Computational Linguistics, 2019).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).
Schwaller, P., Hoover, B., Reymond, J.-L., Strobelt, H. & Laino, T. Unsupervised attention-guided atom-mapping. Preprint at https://doi.org/10.26434/chemrxiv.12298559.v1 (2020).
Toniato, A., Schwaller, P., Cardinale, A., Geluykens, J. & Laino, T. Unassisted noise-reduction of chemical reactions data sets. Preprint at https://doi.org/10.26434/chemrxiv.12395120.v1 (2020).
Miyaura, N. & Suzuki, A. Palladium-catalyzed cross-coupling reactions of organoboron compounds. Chem. Rev. 95, 2457–2483 (1995).
NameRxn (NextMove Software); http://www.nextmovesoftware.com/namerxn.html
Kraut, H. et al. Algorithm for reaction classification. J. Chem. Inf. Model. 53, 2884–2895 (2013).
Daylight Theory Manual Ch. 5 (Daylight Chemical Information Systems); https://www.daylight.com/dayhtml/doc/theory/index.pdf
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Chen, L. & Gasteiger, J. Organic reactions classified by neural networks: Michael additions, Friedel–Crafts alkylations by alkenes, and related reactions. Angew. Chem. Int. Ed. 35, 763–765 (1996).
Chen, L. & Gasteiger, J. Knowledge discovery in reaction databases: landscaping organic reactions by a self-organizing neural network. J. Am. Chem. Soc. 119, 4033–4042 (1997).
Satoh, H. et al. Classification of organic reactions: similarity of reactions based on changes in the electronic features of oxygen atoms at the reaction sites. J. Chem. Inf. Comput. Sci. 38, 210–219 (1998).
Schneider, N., Lowe, D. M., Sayle, R. A. & Landrum, G. A. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. J. Chem. Inf. Model. 55, 39–53 (2015).
Gao, H. et al. Using machine learning to predict suitable conditions for organic reactions. ACS Cent. Sci. 4, 1465–1476 (2018).
Ghiandoni, G. M. et al. Development and application of a data-driven reaction classification model: comparison of an electronic lab notebook and medicinal chemistry literature. J. Chem. Inf. Model. 59, 4167–4187 (2019).
Schneider, N., Stiefl, N. & Landrum, G. A. What’s what: the (nearly) definitive guide to reaction role assignment. J. Chem. Inf. Model. 56, 2336–2346 (2016).
ChemAxon (ChemAxon); https://docs.chemaxon.com/display/ltsargon/Reaction+fingerprint+RF
Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. In Proc. 28th International Conference on Neural Information Processing Systems Vol. 2, 2224–2232 (NIPS, 2015).
Wei, J. N., Duvenaud, D. & Aspuru-Guzik, A. Neural networks for the prediction of organic chemistry reactions. ACS Cent. Sci. 2, 725–732 (2016).
Sandfort, F., Strieth-Kalthoff, F., Kühnemund, M., Beecks, C. & Glorius, F. A structure-based platform for predicting chemical reactivity. Chem 6, 1379–1390 (2020).
Probst, D. & Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminf. 12, 1–13 (2020).
Jorner, K., Brinck, T., Norrby, P.-O. & Buttar, D. Machine learning meets mechanistic modelling for accurate prediction of experimental activation energies. Chem. Sci. https://doi.org/10.1039/D0SC04896H (2020).
Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Preprint at https://doi.org/10.26434/chemrxiv.12758474.v2 (2020).
Socher, R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. 2013 Empirical Methods in Natural Language Processing 1631–1642 (Association for Computational Linguistics, 2013).
Warstadt, A., Singh, A. & Bowman, S. R. Neural network acceptability judgments. Trans. Assoc. Comput. Linguist. 7, 625–641 (2019).
Pistachio (NextMove Software); http://www.nextmovesoftware.com/pistachio.html
Johnson, J., Douze, M. & Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data (2019); https://doi.org/10.1109/TBDATA.2019.2921572
Landrum, G. et al. rdkit/rdkit: 2019_03_4 (q1 2019) release (Zenodo, 2019); https://doi.org/10.5281/zenodo.3366468
Wei, J.-M., Yuan, X.-J., Hu, Q.-H. & Wang, S.-Q. A novel measure for evaluating classifiers. Exp. Syst. Appl. 37, 3799–3809 (2010).
Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta Protein Struct. 405, 442–451 (1975).
Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 28, 367–374 (2004).
Willighagen, E. L. et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas and substructure searching. J. Cheminf. 9, 33 (2017).
Capecchi, A., Probst, D. & Reymond, J.-L. One molecular fingerprint to rule them all: drugs, biomolecules and the metabolome. J. Cheminf. 12, 1–15 (2020).
Probst, D. & Reymond, J.-L. FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web. Bioinformatics 34, 1433–1435 (2017).
Carey, J. S., Laffan, D., Thomson, C. & Williams, M. T. Analysis of the reactions used for the preparation of drug candidate molecules. Org. Biomol. Chem. 4, 2337–2347 (2006).
RXNO Ontology (RSC); http://www.rsc.org/ontologies/RXNO/index.asp
Schneider, N., Lowe, D. M., Sayle, R. A., Tarselli, M. A. & Landrum, G. A. Big data from pharmaceutical patents: a computational analysis of medicinal chemists’ bread and butter. J. Med. Chem. 59, 4385–4402 (2016).
Lowe, D. Chemical reactions from US patents (1976–Sep2016) (Figshare, 2017); https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873
Coley, C. W., Green, W. H. & Jensen, K. F. RDChiral: an RDKit wrapper for handling stereochemistry in retrosynthetic template extraction and application. J. Chem. Inf. Model. 59, 2529–2537 (2019).
Klein, G., Kim, Y., Deng, Y., Senellart, J. & Rush, A. M. OpenNMT: Open-Source Toolkit for Neural Machine Translation (Association for Computational Linguistics, 2017).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 8024–8035 (Curran Associates, 2019).
Wolf, T. et al. Transformers: state-of-the-art natural language processing. Preprint at https://arxiv.org/pdf/1910.03771.pdf (2019).
Haghighi, S., Jasemi, M., Hessabi, S. & Zolanvari, A. PyCM: multiclass confusion matrix library in Python. J. Open Source Softw. 3, 729 (2018).
Lowe, D. M. & Sayle, R. A. LeadMine: a grammar and dictionary driven approach to entity recognition. J. Cheminf. 7, 1–9 (2015).
RXNFP Repository (v0.0.7) (Zenodo, accessed 17 November 2020); https://doi.org/10.5281/zenodo.4277570
D.P. and J.-L.R. acknowledge financial support by the Swiss National Science Foundation (NCCR TransCure). We thank L. Rudin for the careful proofreading of our manuscript.
The authors declare no competing interests.
Peer review information: Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Schwaller, P., Probst, D., Vaucher, A.C. et al. Mapping the space of chemical reactions using attention-based neural networks. Nat Mach Intell 3, 144–152 (2021). https://doi.org/10.1038/s42256-020-00284-w