Abstract
Existing deep learning models applied to reaction prediction in organic chemistry can reach high accuracy (>90% for natural language processing-based ones). Because these models embed no chemical knowledge beyond what they learn from reaction data, dataset quality plays a crucial role in their performance. Human curation is prohibitively expensive, so unaided approaches for removing chemically incorrect entries from existing datasets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here, we propose a machine learning-based, unassisted approach to remove chemically incorrect entries from chemical reaction collections. We apply this method to the Pistachio collection of chemical reactions and to an open dataset, both extracted from United States Patent and Trademark Office patents. Our results show improved prediction quality for models trained on the cleaned and balanced datasets. For retrosynthetic models, the roundtrip accuracy increases by 13 percentage points and the cumulative Jensen–Shannon divergence decreases by 30% relative to its original value. The coverage remains high at 97%, and the class diversity is not affected by the cleaning. The proposed strategy is the first unassisted, rule-free technique to address automatic noise reduction in chemical datasets.
Data availability
The data that support the findings of this study come from the reaction dataset Pistachio 3 (version release of 18 November 2019) from NextMove Software (ref. 3), which is derived by text-mining chemical reactions in US patents. We also used two smaller open-source datasets: the dataset by Schneider et al. (ref. 34), which consists of 50,000 randomly picked reactions from US patents, and the USPTO dataset by Lowe (ref. 2), an open dataset with chemical reactions from US patents (1976 to September 2016). A demonstration of the code on the dataset by Schneider et al. is available in the GitHub repository (https://github.com/rxn4chemistry/OpenNMT-py/tree/noise_reduction). Source data for the plots in the main manuscript are available at https://figshare.com/articles/journal_contribution/Source_Data/13674496.
Code availability
All code for data cleaning and analysis associated with the current submission is available at https://github.com/rxn4chemistry/OpenNMT-py/tree/noise_reduction (ref. 35).
References
Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis, Univ. Cambridge (2012).
Lowe, D. Chemical reactions from US patents (1976–Sep2016). figshare https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873 (2017).
NextMove Software Pistachio (NextMove Software, accessed 2 April 2020); https://www.nextmovesoftware.com/pistachio.html
Reaxys (Reaxys, accessed 2 April 2020); https://www.reaxys.com
Segler, M., Preuss, M. & Waller, M. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Coley, C. W. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365, eaax1566 (2019).
Schwaller, P. & Laino, T. Data-Driven Learning Systems for Chemical Reaction Prediction: An Analysis of Recent Approaches. In Machine Learning in Chemistry: Data-Driven Algorithms, Learning Systems and Predictions (eds. Pyzer-Knapp, E. O. & Laino, T.) 61–79 (ACS Publications, 2019).
Schwaller, P. et al. Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020).
Öztürk, H., Özgür, A., Schwaller, P., Laino, T. & Ozkirimli, E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov. Today 25, 689–705 (2020).
Satoh, H. & Funatsu, K. Sophia, a knowledge base-guided reaction prediction system-utilization of a knowledge base derived from a reaction database. J. Chem. Inf. Comput. Sci. 35, 34–44 (1995).
Thakkar, A., Kogej, T., Reymond, J. L., Engkvist, O. & Bjerrum, E. J. Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chem. Sci. 11, 154–168 (2020).
Zhu, X. & Wu, X. Class noise vs. attribute noise: a quantitative study of their impacts. Artif. Intell. Rev. 22, 177–210 (2004).
Toneva, M. et al. An empirical study of example forgetting during deep neural network learning. In Proc. International Conference on Learning Representations (ICLR, 2019).
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Cent. Sci. 3, 1237–1245 (2017).
Segler, M. H. S. & Waller, M. P. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J. 23, 5966–5971 (2017).
Somnath, V. R., Bunne, C., Coley, C. W., Krause, A. & Barzilay, R. Learning graph models for template-free retrosynthesis. Preprint at https://arxiv.org/pdf/2006.07038.pdf (2020).
Dai, H., Li, C., Coley, C., Dai, B. & Song, L. Retrosynthesis prediction with conditional graph logic network. In Proc. Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8872–8882 (Curran Associates, 2019).
Zheng, S., Rao, J., Zhang, Z., Xu, J. & Yang, Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. J. Chem. Inf. Model. 60, 47–55 (2020).
Sacha, M., Błaż, M., Byrski, P., Włodarczyk-Pruszyński, P. & Jastrzebski, S. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. Preprint at https://arxiv.org/pdf/2006.15426.pdf (2020).
Tetko, I. V., Karpov, P., Van Deursen, R. & Godin, G. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat. Commun. 11, 5575 (2020).
McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: the sequential learning problem. Psychol. Learn. Motiv. 24, 109–165 (1989).
Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems Vol. 30, 5998–6008 (Curran Associates, 2017).
Wallis, S. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. J. Quant. Linguist. 20, 178–208 (2013).
IBM RXN for chemistry (IBM, 2020); https://rxn.res.ibm.com
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).
Klein, G., Kim, Y., Deng, Y., Senellart, J. & Rush, A. M. OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL 2017, System Demonstrations 67–72 (ACL, 2017).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems Vol. 32, 8024–8035 (Curran Associates, 2019).
Landrum, G. et al. rdkit/rdkit: 2019_03_4 (Q1 2019) Version Release_2019_03_4 Zenodo https://doi.org/10.5281/zenodo.3366468 (2019).
Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 623–656 (1948).
Rao, M., Chen, Y., Vemuri, B. C. & Wang, F. Cumulative residual entropy: a new measure of information. IEEE Trans. Inf. Theory 50, 1220–1228 (2004).
Nguyen, H. V. & Vreeken, J. Non-parametric Jensen–Shannon divergence. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2015. Lecture Notes in Computer Science Vol. 9285 (eds. Appice, A. et al.) 173–189 (Springer, 2015).
Schneider, N., Stiefl, N. & Landrum, G. A. What’s what: the (nearly) definitive guide to reaction role assignment. J. Chem. Inf. Model. 56, 2336–2346 (2016).
Noise reduction repository (v0.1). Zenodo https://zenodo.org/badge/latestdoi/281679964 (2020).
Acknowledgements
We thank M. Manica for help with the model deployment. We also acknowledge all the IBM RXN team for insightful suggestions.
Author information
Contributions
A.T. and P.S. conceived the idea and performed the experiments. A.T. verified the statistical results of the method. A.C. carried out the chemical evaluation of the results. J.G. helped with the software implementation. T.L. supervised the project. All authors participated in discussions and contributed to the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Results of the retrosynthetic forgetting.
a, Statistical experiment: detecting whether the selection of the most forgotten examples by the retro model is random. b, Overlap between the data sets removed by the forward and retro models. The red line models what the overlap would be if the retro selection were entirely random. c, Percentage of one-precursor reactions in the data set cleaned by the forward forgetting and by the retro forgetting models. The former is able to identify them, while the latter is not.
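The random baseline mentioned for panel b can be illustrated with a small sketch. This is one plausible reading and not the construction used in the paper: it assumes the retro model's removed set is drawn uniformly at random without replacement, so the overlap with the forward model's removed set follows a hypergeometric distribution. The function names and the toy reaction IDs are hypothetical.

```python
def observed_overlap(removed_forward, removed_retro):
    """Count the reaction IDs removed by both models."""
    return len(set(removed_forward) & set(removed_retro))


def expected_random_overlap(n_total, n_forward, n_retro):
    """Mean overlap if the retro set were chosen uniformly at random.

    Assumption: n_retro reactions are drawn without replacement from
    n_total, of which n_forward were removed by the forward model.
    The overlap is then hypergeometric with the mean returned here.
    """
    if not (0 <= n_forward <= n_total and 0 <= n_retro <= n_total):
        raise ValueError("set sizes must lie between 0 and n_total")
    if n_total == 0:
        return 0.0
    return n_forward * n_retro / n_total


# Toy data (hypothetical): 1,000 reactions, each model removes 100 of them.
forward_ids = range(0, 100)
retro_ids = range(50, 150)

overlap = observed_overlap(forward_ids, retro_ids)
baseline = expected_random_overlap(1000, 100, 100)
print(f"observed overlap: {overlap}, random baseline: {baseline}")
```

An observed overlap close to the baseline would be consistent with a random retro selection, while a much larger overlap would suggest that both models flag the same reactions.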
Supplementary information
Supplementary Information
Supplementary Figs. 1–4, Tables 1–4, PDFs of retrosynthesis code output and discussion.
Supplementary Data 1
Supplementary data: a sample of reactions removed by the algorithms that were manually checked by chemists.
About this article
Cite this article
Toniato, A., Schwaller, P., Cardinale, A. et al. Unassisted noise reduction of chemical reaction datasets. Nat Mach Intell 3, 485–494 (2021). https://doi.org/10.1038/s42256-021-00319-w
DOI: https://doi.org/10.1038/s42256-021-00319-w