Unassisted noise reduction of chemical reaction datasets

A preprint version of the article is available at arXiv.

Abstract

Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (>90% for natural language processing-based models). Because these models embed no chemical knowledge beyond what they learn from reaction data, the quality of the datasets plays a crucial role in their performance. Human curation is prohibitively expensive, so unaided approaches to removing chemically incorrect entries from existing datasets are essential for improving the performance of artificial intelligence models in synthetic chemistry tasks. Here, we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We apply this method to the Pistachio collection of chemical reactions and to an open dataset, both extracted from United States Patent and Trademark Office patents. Our results show improved prediction quality for models trained on the cleaned and balanced datasets. For retrosynthetic models, the roundtrip accuracy metric grows by 13 percentage points and the cumulative Jensen–Shannon divergence decreases by 30% compared with its original value. The coverage remains high at 97%, and class diversity is not affected by the cleaning. The proposed strategy is the first unassisted, rule-free technique to address automatic noise reduction in chemical datasets.
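To make the retrosynthesis metrics quoted above concrete, the sketch below shows one way such a roundtrip accuracy and coverage computation can be set up, in the spirit of the metrics introduced in ref. 9: a retrosynthesis model proposes precursors for each target product, a forward model re-predicts the product from those precursors, and a prediction counts as correct when the roundtrip recovers the original target. This is a minimal illustration, not the released implementation; `predict_precursors` and `predict_product` are hypothetical wrappers around trained models.

```python
from rdkit import Chem


def canonicalize(smiles):
    """Return canonical SMILES, or None if RDKit cannot parse the input."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def roundtrip_metrics(targets, predict_precursors, predict_product):
    """Roundtrip accuracy and coverage over a list of target product SMILES.

    predict_precursors(target) -> precursor SMILES or None   (retro model)
    predict_product(precursors) -> product SMILES or None    (forward model)
    Both callables are hypothetical stand-ins for trained models.
    """
    n_covered = n_correct = 0
    for target in targets:
        canon_target = canonicalize(target)
        precursors = predict_precursors(target)
        if precursors is None:  # retro model produced no valid suggestion
            continue
        n_covered += 1
        product = predict_product(precursors)
        canon_product = canonicalize(product) if product is not None else None
        if canon_product is not None and canon_product == canon_target:
            n_correct += 1
    coverage = n_covered / len(targets)
    roundtrip_accuracy = n_correct / len(targets)
    return roundtrip_accuracy, coverage
```

Here both quantities are normalized by the total number of targets; the exact normalization used in the paper may differ in detail.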

Fig. 1: Overview.
Fig. 2: Results of the forgetting forward experiment.
Fig. 3: Artificial noise addition.
Fig. 4: Comparison with null models.
Fig. 5: Retrosynthesis models results.
Fig. 6: Multistep retrosynthesis examples.

Data availability

The data that support the findings of this study are the reaction dataset Pistachio 3 (version release of 18 November 2019) from NextMove Software3, which is derived by text-mining chemical reactions in US patents. We also used two smaller open datasets: the dataset by Schneider et al.34, consisting of 50,000 randomly selected reactions from US patents, and the USPTO dataset by Lowe2, an open collection of chemical reactions extracted from US patents (1976 to September 2016). A demonstration of the code on the dataset by Schneider et al. is available in the GitHub repository (https://github.com/rxn4chemistry/OpenNMT-py/tree/noise_reduction). Source data for the plots in the main manuscript are available at https://figshare.com/articles/journal_contribution/Source_Data/13674496.
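For readers who want to work with the open reaction sets listed above, the snippet below sketches a minimal loading and validity check. It assumes a plain-text file with one 'reactants>agents>products' reaction SMILES per line; the actual distribution formats of the Lowe and Schneider et al. datasets differ and may require additional parsing.

```python
from rdkit import Chem


def load_valid_reactions(path):
    """Read reaction SMILES (one per line) and keep only RDKit-parsable entries.

    The file layout is an illustrative assumption, not the official format
    of the datasets referenced above.
    """
    kept = []
    with open(path) as handle:
        for line in handle:
            rxn = line.strip().split("\t")[0]
            parts = rxn.split(">")
            if len(parts) != 3:  # expect reactants>agents>products
                continue
            reactants, _agents, products = parts
            if Chem.MolFromSmiles(reactants) is None:
                continue
            if Chem.MolFromSmiles(products) is None:
                continue
            kept.append(rxn)
    return kept
```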

Code availability

All code for data cleaning and analysis associated with this work is available at https://github.com/rxn4chemistry/OpenNMT-py/tree/noise_reduction (ref. 35).

References

  1. Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis, Univ. Cambridge (2012).

  2. Lowe, D. Chemical reactions from US patents (1976–Sep2016). figshare https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873 (2017).

  3. Pistachio (NextMove Software, accessed 2 April 2020); https://www.nextmovesoftware.com/pistachio.html

  4. Reaxys (Reaxys, accessed 2 April 2020); https://www.reaxys.com

  5. Segler, M., Preuss, M. & Waller, M. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).

  6. Coley, C. W. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365, eaax1566 (2019).

  7. Schwaller, P. & Laino, T. Data-Driven Learning Systems for Chemical Reaction Prediction: An Analysis of Recent Approaches. In Machine Learning in Chemistry: Data-Driven Algorithms, Learning Systems and Predictions (eds. Pyzer-Knapp, E. O. & Laino, T.) 61–79 (ACS Publications, 2019).

  8. Schwaller, P. et al. Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).

  9. Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020).

  10. Öztürk, H., Özgür, A., Schwaller, P., Laino, T. & Ozkirimli, E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov. Today 25, 689–705 (2020).

  11. Satoh, H. & Funatsu, K. Sophia, a knowledge base-guided reaction prediction system-utilization of a knowledge base derived from a reaction database. J. Chem. Inf. Comput. Sci. 35, 34–44 (1995).

  12. Thakkar, A., Kogej, T., Reymond, J. L., Engkvist, O. & Bjerrum, E. J. Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chem. Sci. 11, 154–168 (2020).

  13. Zhu, X. & Wu, X. Class noise vs. attribute noise: a quantitative study of their impacts. Artif. Intell. Rev. 22, 177–210 (2004).

  14. Toneva, M. et al. An empirical study of example forgetting during deep neural network learning. In Proc. International Conference on Learning Representations (ICLR, 2019).

  15. Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Cent. Sci. 3, 1237–1245 (2017).

  16. Segler, M. H. S. & Waller, M. P. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J. 23, 5966–5971 (2017).

  17. Somnath, V. R., Bunne, C., Coley, C. W., Krause, A. & Barzilay, R. Learning graph models for template-free retrosynthesis. Preprint at https://arxiv.org/pdf/2006.07038.pdf (2020).

  18. Dai, H., Li, C., Coley, C., Dai, B. & Song, L. Retrosynthesis prediction with conditional graph logic network. In Proc. Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8872–8882 (Curran Associates, 2019).

  19. Zheng, S., Rao, J., Zhang, Z., Xu, J. & Yang, Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. J. Chem. Inf. Model. 60, 47–55 (2020).

  20. Sacha, M., Błaż, M., Byrski, P., Włodarczyk-Pruszyński, P. & Jastrzebski, S. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. Preprint at https://arxiv.org/pdf/2006.15426.pdf (2020).

  21. Tetko, I. V., Karpov, P., Van Deursen, R. & Godin, G. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat. Commun. 11, 5575 (2020).

  22. McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: the sequential learning problem. Psychol. Learn. Motiv. 24, 109–165 (1989).

  23. Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems Vol. 30, 5998–6008 (Curran Associates, 2017).

  24. Wallis, S. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. J. Quant. Linguist. 20, 178–208 (2013).

  25. IBM RXN for chemistry (IBM, 2020); https://rxn.res.ibm.com

  26. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).

  27. Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).

  28. Klein, G., Kim, Y., Deng, Y., Senellart, J. & Rush, A. M. OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL 2017, System Demonstrations 67–72 (ACL, 2017).

  29. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems Vol. 32, 8024–8035 (Curran Associates, 2019).

  30. Landrum, G. et al. rdkit/rdkit: 2019_03_4 (Q1 2019), version Release_2019_03_4. Zenodo https://doi.org/10.5281/zenodo.3366468 (2019).

  31. Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 623–656 (1948).

  32. Rao, M., Chen, Y., Vemuri, B. C. & Wang, F. Cumulative residual entropy: a new measure of information. IEEE Trans. Inf. Theory 50, 1220–1228 (2004).

  33. Nguyen, H. V. & Vreeken, J. Non-parametric Jensen–Shannon divergence. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2015. Lecture Notes in Computer Science Vol. 9285 (eds. Appice, A. et al.) 173–189 (Springer, 2015).

  34. Schneider, N., Stiefl, N. & Landrum, G. A. What’s what: the (nearly) definitive guide to reaction role assignment. J. Chem. Inf. Model. 56, 2336–2346 (2016).

  35. Noise reduction repository (v0.1). Zenodo https://zenodo.org/badge/latestdoi/281679964 (2020).

Acknowledgements

We thank M. Manica for help with the model deployment. We also thank the entire IBM RXN team for insightful suggestions.

Author information

Contributions

A.T. and P.S. conceived the idea and performed the experiments. A.T. verified the statistical results of the method. A.C. carried out the chemical evaluation of the results. J.G. helped with the software implementation. T.L. supervised the project. All authors participated in discussions and contributed to the manuscript.

Corresponding author

Correspondence to Alessandra Toniato.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Results of the retrosynthetic forgetting.

a, Statistical experiment: detecting whether the selection of the most forgotten examples by the retro model is random. b, Overlap between the datasets removed by the forward and retro models. The red line shows the overlap that would be expected if the retro selection were entirely random. c, Percentage of one-precursor reactions in the dataset cleaned by the forward forgetting model and by the retro forgetting model. The former identifies them, whereas the latter does not.
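The random baseline in panel b can be reproduced with a short calculation: if the retro model removed its examples uniformly at random, the overlap with a fixed set removed by the forward model would follow a hypergeometric distribution, whose mean and spread are straightforward to compute. The sketch below uses made-up set sizes purely for illustration; none of the numbers are taken from the paper.

```python
from math import sqrt


def random_overlap(n_total, n_forward, n_retro):
    """Mean and standard deviation of the overlap between a fixed set of
    n_forward examples and a uniformly random set of n_retro examples,
    both drawn from n_total examples (hypergeometric distribution)."""
    p = n_forward / n_total
    mean = n_retro * p
    var = n_retro * p * (1 - p) * (n_total - n_retro) / (n_total - 1)
    return mean, sqrt(var)


# Illustrative numbers only, not values from the paper.
mean, std = random_overlap(n_total=2_000_000, n_forward=400_000, n_retro=400_000)
print(f"expected overlap under random retro selection: {mean:.0f} +/- {std:.0f}")
```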

Supplementary information

Supplementary Information

Supplementary Figs. 1–4, Tables 1–4, PDFs of the retrosynthesis code output and a discussion.

Supplementary Data 1

A sample of the reactions removed by the algorithms, manually checked by chemists.

About this article

Cite this article

Toniato, A., Schwaller, P., Cardinale, A. et al. Unassisted noise reduction of chemical reaction datasets. Nat Mach Intell 3, 485–494 (2021). https://doi.org/10.1038/s42256-021-00319-w
