Abstract
Organic reactions are usually assigned to classes containing reactions with similar reagents and mechanisms. Reaction classes facilitate the communication of complex concepts and efficient navigation through chemical reaction space. However, the classification process is a tedious task. It requires identification of the corresponding reaction class template via annotation of the number of molecules in the reactions, the reaction centre and the distinction between reactants and reagents. Here, we show that transformer-based models can infer reaction classes from non-annotated, simple text-based representations of chemical reactions. Our best model reaches a classification accuracy of 98.2%. We also show that the learned representations can be used as reaction fingerprints that capture fine-grained differences between reaction classes better than traditional reaction fingerprints. The insights into chemical reaction space enabled by our learned fingerprints are illustrated by an interactive reaction atlas providing visual clustering and similarity searching.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 digital issues and online access to articles
$119.00 per year
only $9.92 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
The Schneider 50k dataset is publicly available25. We provide a new reaction dataset (USPTO 1k TPL), derived from the work of Lowe50, containing the 1,000 most common reaction templates as classes. It can be accessed through https://rxn4chemistry.github.io/rxnfp. The commercial Pistachio (version 191118) dataset can be obtained from NextMove Software38. Pistachio relies on Leadmine58 to text-mine patent data. The dataset comes with reaction classes assigned using NameRxn (https://www.nextmovesoftware.com/namerxn.html).
Code availability
The rxnfp code and the experiments on the public datasets, as well as an interactive TMAP, are provided at https://rxn4chemistry.github.io/rxnfp59.
References
Grzybowski, B. A., Bishop, K. J. M., Kowalczyk, B. & Wilmer, C. E. The ‘wired’ universe of organic chemistry. Nat. Chem. 1, 31–36 (2009).
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Cent. Sci. 3, 1237–1245 (2017).
IBM RXN for Chemistry (IBM); https://rxn.res.ibm.com
Coley, C. W., Barzilay, R., Jaakkola, T. S., Green, W. H. & Jensen, K. F. Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443 (2017).
Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C. & Laino, T. ‘Found in translation’: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci. 9, 6091–6098 (2018).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Thakkar, A., Kogej, T., Reymond, J.-L., Engkvist, O. & Bjerrum, E. J. Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chem. Sci. 11, 154–168 (2020).
Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020).
Vaucher, A. C. et al. Automated extraction of chemical synthesis actions from experimental procedures. Nat. Commun. 11, 3601 (2020).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 5998–6008 (NIPS, 2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference on North American Chapter of the Association for Computational Linguistics 4171–4186 (Association for Computational Linguistics, 2019).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).
Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).
Schwaller, P., Hoover, B., Reymond, J.-L., Strobelt, H. & Laino, T. Unsupervised attention-guided atom-mapping. Preprint at https://doi.org/10.26434/chemrxiv.12298559.v1 (2020).
Toniato, A., Schwaller, P., Cardinale, A., Geluykens, J. & Laino, T. Unassisted noise-reduction of chemical reactions data sets. Preprint at https://doi.org/10.26434/chemrxiv.12395120.v1 (2020).
Miyaura, N. & Suzuki, A. Palladium-catalyzed cross-coupling reactions of organoboron compounds. Chem. Rev. 95, 2457–2483 (1995).
NameRXN (Nextmove Software); http://www.nextmovesoftware.com/namerxn.html
Kraut, H. et al. Algorithm for reaction classification. J. Chem. Inf. Model. 53, 2884–2895 (2013).
Daylight Theory Manual Ch. 5 (Daylight Chemical Information Systems); https://www.daylight.com/dayhtml/doc/theory/index.pdf
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Chen, L. & Gasteiger, J. Organic reactions classified by neural networks: Michael additions, Friedel–Crafts alkylations by alkenes, and related reactions. Angew. Chem. Int. Ed. 35, 763–765 (1996).
Chen, L. & Gasteiger, J. Knowledge discovery in reaction databases: landscaping organic reactions by a self-organizing neural network. J. Am. Chem. Soc. 119, 4033–4042 (1997).
Satoh, H. et al. Classification of organic reactions: similarity of reactions based on changes in the electronic features of oxygen atoms at the reaction sites. J. Chem. Inf. Comput. Sci. 38, 210–219 (1998).
Schneider, N., Lowe, D. M., Sayle, R. A. & Landrum, G. A. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. J. Chem. Inf. Model. 55, 39–53 (2015).
Gao, H. et al. Using machine learning to predict suitable conditions for organic reactions. ACS Cent. Sci. 4, 1465–1476 (2018).
Ghiandoni, G. M. et al. Development and application of a data-driven reaction classification model: comparison of an electronic lab notebook and medicinal chemistry literature. J. Chem. Inf. Model. 59, 4167–4187 (2019).
Schneider, N., Stiefl, N. & Landrum, G. A. What’s what: the (nearly) definitive guide to reaction role assignment. J. Chem. Inf. Model. 56, 2336–2346 (2016).
ChemAxon (ChemAxon); https://docs.chemaxon.com/display/ltsargon/Reaction+fingerprint+RF
Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. In Proc. 28th International Conference on Neural Information Processing Systems Vol. 2, 2224–2232 (NIPS, 2015).
Wei, J. N., Duvenaud, D. & Aspuru-Guzik, A. Neural networks for the prediction of organic chemistry reactions. ACS Cent. Sci. 2, 725–732 (2016).
Sandfort, F., Strieth-Kalthoff, F., Khnemund, M., Beecks, C. & Glorius, F. A structure-based platform for predicting chemical reactivity. Chem 6, 1379–1390 (2020).
Probst, D. & Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminf. 12, 1–13 (2020).
Jorner, K., Brinck, T., Norrby, P.-O. & Buttar, D. Machine learning meets mechanistic modelling for accurate prediction of experimental activation energies. Chem. Sci. https://doi.org/10.1039/D0SC04896H (2020).
Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Preprint at https://doi.org/10.26434/chemrxiv.12758474.v2 (2020).
Socher, R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. 2013 Empirical Methods in Natural Language Processing 1631–1642 (Association for Computational Linguistics, 2013).
Warstadt, A., Singh, A. & Bowman, S. R. Neural network acceptability judgments. Trans. Assoc. Comput. Linguist. 7, 625–641 (2019).
Pistachio (Nextmove Software); http://www.nextmovesoftware.com/pistachio.html
Johnson, J., Douze, M. & Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data (2019); https://doi.org/10.1109/TBDATA.2019.2921572
Landrum, G. et al. rdkit/rdkit: 2019_03_4 (q1 2019) release (Zenodo, 2019); https://doi.org/10.5281/zenodo.3366468
Wei, J.-M., Yuan, X.-J., Hu, Q.-H. & Wang, S.-Q. A novel measure for evaluating classifiers. Exp. Syst. Appl. 37, 3799–3809 (2010).
Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta Protein Struct. 405, 442–451 (1975).
Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 28, 367–374 (2004).
Willighagen, E. L. et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas and substructure searching. J. Cheminf. 9, 33 (2017).
Capecchi, A., Probst, D. & Reymond, J.-L. One molecular fingerprint to rule them all: drugs, biomolecules and the metabolome. J. Cheminf. 12, 1–15 (2020).
Probst, D. & Reymond, J.-L. FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web. Bioinformatics 34, 1433–1435 (2017).
Carey, J. S., Laffan, D., Thomson, C. & Williams, M. T. Analysis of the reactions used for the preparation of drug candidate molecules. Org. Biomol. Chem. 4, 2337–2347 (2006).
RXNO Ontology (RSC); http://www.rsc.org/ontologies/RXNO/index.asp
Schneider, N., Lowe, D. M., Sayle, R. A., Tarselli, M. A. & Landrum, G. A. Big data from pharmaceutical patents: a computational analysis of medicinal chemists’ bread and butter. J. Med. Chem. 59, 4385–4402 (2016).
Lowe, D. Chemical reactions from US patents (1976–Sep2016) (Figshare, 2017); https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873
Coley, C. W., Green, W. H. & Jensen, K. F. RDChiral: an RDKit wrapper for handling stereochemistry in retrosynthetic template extraction and application. J. Chem. Inf. Model. 59, 2529–2537 (2019).
Klein, G., Kim, Y., Deng, Y., Senellart, J. & Rush, A. M. OpenNMT: Open-Source Toolkit for Neural Machine Translation (Association for Computational Linguistics, 2017).
BERT code (GitHub); https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 8024–8035 (Curran Associates, 2019).
Wolf, T. et al. Transformers: state-of-the-art natural language processing. Preprint at https://arxiv.org/pdf/1910.03771.pdf (2019).
Probst, D. & Reymond, J.-L. SmilesDrawer: parsing and drawing SMILES-encoded molecular structures using client-side javascript. J. Chem. Inf. Model. 58, 1–7 (2018).
Haghighi, S., Jasemi, M., Hessabi, S. & Zolanvari, A. PyCM: multiclass confusion matrix library in Python. J. Open Source Softw. 3, 729 (2018).
Lowe, D. M. & Sayle, R. A. LeadMine: a grammar and dictionary driven approach to entity recognition. J. Cheminf. 7, 1–9 (2015).
RXNFP Repository (v0.0.7) (Zenodo, accessed 17 November 2020); https://doi.org/10.5281/zenodo.4277570
Acknowledgements
D.P. and J.-L.R. acknowledge financial support by the Swiss National Science Foundation (NCCR TransCure). We thank L. Rudin for the careful proofreading of our manuscript.
Author information
Authors and Affiliations
Contributions
P.S. and A.C.V. conceived the initial idea for the project. P.S., D.P., A.C.V. and V.H.N. trained models, performed the classification experiments and analysed the results. P.S. investigated the reaction fingerprints and wrote the code base. P.S., D.P. and D.K. worked on the reaction atlases. The project was supervised by T.L. and J.-L.R. All authors took part in discussions and contributed to the writing of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Notes 1–3, Figs. 1–3 and Tables 1–5.
Supplementary Data
Interactive reaction atlas.
Rights and permissions
About this article
Cite this article
Schwaller, P., Probst, D., Vaucher, A.C. et al. Mapping the space of chemical reactions using attention-based neural networks. Nat Mach Intell 3, 144–152 (2021). https://doi.org/10.1038/s42256-020-00284-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s42256-020-00284-w
This article is cited by
-
IDSL_MINT: a deep learning framework to predict molecular fingerprints from mass spectra
Journal of Cheminformatics (2024)
-
Improving chemical reaction yield prediction using pre-trained graph neural networks
Journal of Cheminformatics (2024)
-
Prediction of chemical reaction yields with large-scale multi-view pre-training
Journal of Cheminformatics (2024)
-
Rxn-INSIGHT: fast chemical reaction analysis using bond-electron matrices
Journal of Cheminformatics (2024)
-
Transfer learning across different chemical domains: virtual screening of organic materials with deep learning models pretrained on small molecule and chemical reaction data
Journal of Cheminformatics (2024)