Unassisted noise reduction of chemical reaction datasets

Toniato, Alessandra; Schwaller, Philippe; Cardinale, Antonio; Geluykens, Joppe; Laino, Teodoro

doi:10.1038/s42256-021-00319-w

Article
Published: 29 March 2021

Nature Machine Intelligence volume 3, pages 485–494 (2021)

A preprint version of the article is available at arXiv.

Abstract

Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (>90% for natural language processing-based ones). With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the datasets plays a crucial role in the performance of the prediction models. Human curation is prohibitively expensive, so unaided approaches to remove chemically incorrect entries from existing datasets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here, we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We apply this method to the Pistachio collection of chemical reactions and to an open dataset, both extracted from United States Patent and Trademark Office patents. Our results show an improved prediction quality for models trained on the cleaned and balanced datasets. For retrosynthetic models, the roundtrip accuracy metric grows by 13 percentage points and the value of the cumulative Jensen–Shannon divergence decreases by 30% compared to its original record. The coverage remains high at 97%, and the value of the class diversity is not affected by the cleaning. The proposed strategy is the first unassisted rule-free technique to address automatic noise reduction in chemical datasets.
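For readers unfamiliar with the divergence mentioned above, the following is a minimal sketch of the standard discrete Jensen–Shannon divergence between two count vectors. It is not the cumulative variant reported in the abstract, and the function name and example counts are hypothetical, chosen only for illustration.

```python
import math


def jensen_shannon_divergence(p_counts, q_counts):
    """Standard discrete Jensen-Shannon divergence in bits (range 0 to 1).

    Illustrative only: the article reports a *cumulative* variant,
    which this sketch does not implement.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("both distributions need the same number of bins")

    # Normalise raw counts to probabilities.
    total_p, total_q = sum(p_counts), sum(q_counts)
    p = [x / total_p for x in p_counts]
    q = [x / total_q for x in q_counts]

    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins with zero mass in `a` contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical counts over three bins.
identical = jensen_shannon_divergence([10, 20, 30], [10, 20, 30])
disjoint = jensen_shannon_divergence([1, 0], [0, 1])
```

With base-2 logarithms the value is 0 for identical distributions and 1 for distributions with no shared support, so a lower value indicates that the two distributions are more alike.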

**Fig. 2: Results of the forgetting forward experiment.**

**Fig. 4: Comparison with null models.**

**Fig. 5: Retrosynthesis models results.**

**Fig. 6: Multistep retrosynthesis examples.**

Data availability

The data that support the findings of this study are the reaction dataset Pistachio 3 (version release of 18 November 2019) from NextMove Software³. It is derived by text-mining chemical reactions in US patents. We also used two smaller open-source datasets: the dataset by Schneider et al.³⁴, which consists of 50,000 randomly picked reactions from US patents, and the USPTO dataset by Lowe², an open dataset with chemical reactions from US patents (1976 to September 2016). A demonstration of the code on the dataset by Schneider et al. is also available in the GitHub repository (https://github.com/rxn4chemistry/OpenNMT-py/tree/noise_reduction). Source data for the plots in the main manuscript are available at https://figshare.com/articles/journal_contribution/Source_Data/13674496.

Code availability

All code for data cleaning and analysis associated with the current submission is available at https://github.com/rxn4chemistry/OpenNMT-py/tree/noise_reduction³⁵.

References

1. Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis, Univ. Cambridge (2012).
2. Lowe, D. Chemical reactions from US patents (1976–Sep2016). figshare https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873 (2017).
3. NextMove Software Pistachio (NextMove Software, accessed 2 April 2020); https://www.nextmovesoftware.com/pistachio.html
4. Reaxys (Reaxys, accessed 2 April 2020); https://www.reaxys.com
5. Segler, M., Preuss, M. & Waller, M. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
6. Coley, C. W. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365, eaax1566 (2019).
7. Schwaller, P. & Laino, T. Data-driven learning systems for chemical reaction prediction: an analysis of recent approaches. In Machine Learning in Chemistry: Data-Driven Algorithms, Learning Systems and Predictions (eds Pyzer-Knapp, E. O. & Laino, T.) 61–79 (ACS Publications, 2019).
8. Schwaller, P. et al. Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
9. Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020).
10. Öztürk, H., Özgür, A., Schwaller, P., Laino, T. & Ozkirimli, E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov. Today 25, 689–705 (2020).
11. Satoh, H. & Funatsu, K. Sophia, a knowledge base-guided reaction prediction system - utilization of a knowledge base derived from a reaction database. J. Chem. Inf. Comput. Sci. 35, 34–44 (1995).
12. Thakkar, A., Kogej, T., Reymond, J. L., Engkvist, O. & Bjerrum, E. J. Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chem. Sci. 11, 154–168 (2020).
13. Zhu, X. & Wu, X. Class noise vs. attribute noise: a quantitative study of their impacts. Artif. Intell. Rev. 22, 177–210 (2004).
14. Toneva, M. et al. An empirical study of example forgetting during deep neural network learning. In Proc. International Conference on Learning Representations (ICLR, 2019).
15. Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Cent. Sci. 3, 1237–1245 (2017).
16. Segler, M. H. S. & Waller, M. P. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J. 23, 5966–5971 (2017).
17. Somnath, V. R., Bunne, C., Coley, C. W., Krause, A. & Barzilay, R. Learning graph models for template-free retrosynthesis. Preprint at https://arxiv.org/pdf/2006.07038.pdf (2020).
18. Dai, H., Li, C., Coley, C., Dai, B. & Song, L. Retrosynthesis prediction with conditional graph logic network. In Proc. Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8872–8882 (Curran Associates, 2019).
19. Zheng, S., Rao, J., Zhang, Z., Xu, J. & Yang, Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. J. Chem. Inf. Model. 60, 47–55 (2020).
20. Sacha, M., Błaż, M., Byrski, P., Włodarczyk-Pruszyński, P. & Jastrzebski, S. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. Preprint at https://arxiv.org/pdf/2006.15426.pdf (2020).
21. Tetko, I. V., Karpov, P., Van Deursen, R. & Godin, G. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat. Commun. 11, 5575 (2020).
22. McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: the sequential learning problem. Psychol. Learn. Motiv. 24, 109–165 (1989).
23. Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems Vol. 30, 5998–6008 (Curran Associates, 2017).
24. Wallis, S. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. J. Quant. Linguist. 20, 178–208 (2013).
25. IBM RXN for chemistry (IBM, 2020); https://rxn.res.ibm.com
26. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
27. Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).
28. Klein, G., Kim, Y., Deng, Y., Senellart, J. & Rush, A. M. OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL 2017, System Demonstrations 67–72 (ACL, 2017).
29. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems Vol. 32, 8024–8035 (Curran Associates, 2019).
30. Landrum, G. et al. rdkit/rdkit: 2019_03_4 (Q1 2019) Version Release_2019_03_4. Zenodo https://doi.org/10.5281/zenodo.3366468 (2019).
31. Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 623–656 (1948).
32. Rao, M., Chen, Y., Vemuri, B. C. & Wang, F. Cumulative residual entropy: a new measure of information. IEEE Trans. Inf. Theory 50, 1220–1228 (2004).
33. Nguyen, H. V. & Vreeken, J. Non-parametric Jensen–Shannon divergence. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2015. Lecture Notes in Computer Science Vol. 9285 (eds Appice, A. et al.) 173–189 (Springer, 2015).
34. Schneider, N., Stiefl, N. & Landrum, G. A. What’s what: the (nearly) definitive guide to reaction role assignment. J. Chem. Inf. Model. 56, 2336–2346 (2016).
35. Noise reduction repository (v0.1). Zenodo https://zenodo.org/badge/latestdoi/281679964 (2020).

Acknowledgements

We thank M. Manica for help with the model deployment. We also acknowledge all the IBM RXN team for insightful suggestions.

Author information

Authors and Affiliations

IBM Research Europe – Zurich, Rüschlikon, Switzerland
Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens & Teodoro Laino
Department of Chemistry, University of Bern, Bern, Switzerland
Philippe Schwaller
Department of Chemistry, University of Pisa, Pisa, Italy
Antonio Cardinale

Contributions

A.T. and P.S. conceived the idea and performed the experiments. A.T. verified the statistical results of the method. A.C. carried out the chemical evaluation of the results. J.G. helped with the software implementation. T.L. supervised the project. All authors participated in discussions and contributed to the manuscript.

Corresponding author

Correspondence to Alessandra Toniato.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Results of the retrosynthetic forgetting.

a, Statistical experiment: detecting whether the selection of most forgotten examples by the retro model is random. b, Overlap between the data sets removed by forward and retro models. The red line models how the overlap would be if the retro selection were entirely random. c, Percentage of one-precursor reactions in the data set cleaned by the forward forgetting and by the retro forgetting models. The former is able to identify them, while the latter is not.
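The caption does not state how the random-selection baseline in panel b is computed. One natural choice, sketched here purely as an assumption, is the mean of a hypergeometric distribution: if the forward model removes `n_forward` of `n_total` reactions and the retro model removed `n_retro` reactions uniformly at random, the expected overlap is `n_forward * n_retro / n_total`. All function names and numbers below are hypothetical.

```python
def observed_overlap(removed_forward, removed_retro):
    """Number of reaction identifiers removed by both models."""
    return len(set(removed_forward) & set(removed_retro))


def expected_random_overlap(n_total, n_forward, n_retro):
    """Expected overlap if the retro model's removals were uniformly random.

    This is the mean of a hypergeometric distribution; whether the
    article's baseline uses exactly this formula is an assumption.
    """
    if not (0 <= n_forward <= n_total and 0 <= n_retro <= n_total):
        raise ValueError("removed-set sizes must lie between 0 and n_total")
    return n_forward * n_retro / n_total


# Hypothetical toy data: 100 reactions, identified by integers 0..99.
forward_removed = list(range(0, 20))   # 20 reactions removed by the forward model
retro_removed = list(range(15, 25))    # 10 reactions removed by the retro model

actual = observed_overlap(forward_removed, retro_removed)
baseline = expected_random_overlap(100, len(forward_removed), len(retro_removed))
```

In this toy case the observed overlap (5) exceeds the random baseline (2.0). A gap in that direction would indicate that the two models agree more often than chance.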

Supplementary information

Supplementary Information

Supplementary Figs. 1–4, Tables 1–4, PDFs of retrosynthesis code output and discussion.

Supplementary Data 1

Supplementary data: a sample of reactions removed by the algorithms that were manually checked by chemists.

About this article

Cite this article

Toniato, A., Schwaller, P., Cardinale, A. et al. Unassisted noise reduction of chemical reaction datasets. Nat Mach Intell 3, 485–494 (2021). https://doi.org/10.1038/s42256-021-00319-w

Received: 07 July 2020
Accepted: 11 February 2021
Published: 29 March 2021
Issue Date: June 2021
DOI: https://doi.org/10.1038/s42256-021-00319-w
