Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis

Abstract

Most chemical experiments are planned by human scientists and are therefore subject to a variety of human cognitive biases [1], heuristics [2] and social influences [3]. These anthropogenic chemical reaction data are widely used to train machine-learning models [4] that predict organic [5] and inorganic [6,7] syntheses. However, it is known that societal biases are encoded in datasets and are perpetuated in machine-learning models [8]. Here we identify as-yet-unacknowledged anthropogenic biases in both the reagent choices and reaction conditions of chemical reaction datasets using a combination of data mining and experiments. We find that the amine choices in the reported crystal structures of hydrothermal synthesis of amine-templated metal oxides [9] follow a power-law distribution in which 17% of amine reactants occur in 79% of reported compounds, consistent with distributions in social influence models [10,11,12]. An analysis of unpublished historical laboratory notebook records shows similarly biased distributions of reaction condition choices. By performing 548 randomly generated experiments, we demonstrate that the popularity of reactants or the choices of reaction conditions are uncorrelated with the success of the reaction. We show that randomly generated experiments better illustrate the range of parameter choices that are compatible with crystal formation. Machine-learning models that we train on a smaller randomized reaction dataset outperform models trained on larger human-selected reaction datasets, demonstrating the importance of identifying and addressing anthropogenic biases in scientific data.

Fig. 1: Occurrence of amines in structures of reported metal oxide crystals.
Fig. 2: Distribution of the choices of reaction parameters and reaction outcomes.
Fig. 3: Reaction outcomes from randomly generated experiments for popular amines and not-popular (unpopular and absent) amines.

Data availability

The authors declare that all data supporting the findings of this study are available within the article and its supplementary information.

Code availability

The code used for this project is available in the supplementary information files.

References

  1. Tversky, A. & Kahneman, D. Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974).
  2. Gigerenzer, G. & Gaissmaier, W. Heuristic decision making. Annu. Rev. Psychol. 62, 451–482 (2011).
  3. Salganik, M. J., Dodds, P. S. & Watts, D. J. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311, 854–856 (2006).
  4. Henson, A. B., Gromski, P. S. & Cronin, L. Designing algorithms to aid discovery by chemical robots. ACS Cent. Sci. 4, 793–804 (2018).
  5. Coley, C. W., Green, W. H. & Jensen, K. F. Machine learning in computer-aided synthesis planning. Acc. Chem. Res. 51, 1281–1289 (2018).
  6. Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
  7. Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29, 9436–9444 (2017).
  8. Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).
  9. Cheetham, A. K., Férey, G. & Loiseau, T. Open-framework inorganic materials. Angew. Chem. 38, 3268–3292 (1999).
  10. Price, D. D. S. A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inf. Sci. 27, 292–306 (1976).
  11. Candia, C., Jara-Figueroa, C., Rodriguez-Sickert, C., Barabási, A.-L. & Hidalgo, C. A. The universal decay of collective memory and attention. Nat. Hum. Behav. 3, 82–91 (2018).
  12. Carroll, H. A., Toumpakari, Z., Johnson, L. & Betts, J. A. The perceived feasibility of methods to reduce publication bias. PLoS One 12, e0186472 (2017).
  13. Fortunato, S. et al. Science of science. Science 359, (2018).
  14. Greenslade, P., Florentine, S. K., Hansen, B. D. & Gell, P. A. Biases encountered in long-term monitoring studies of invertebrates and microflora: Australian examples of protocols, personnel, tools and site location. Environ. Monit. Assess. 188, 491 (2016).
  15. Boobier, S., Osbourn, A. & Mitchell, J. B. O. Can human experts predict solubility better than computers? J. Cheminform. 9, 63 (2017).
  16. Keserű, G. M., Soós, T. & Kappe, C. O. Anthropogenic reaction parameters – the missing link between chemical intuition and the available chemical space. Chem. Soc. Rev. 43, 5387–5399 (2014).
  17. Varela, J. N., Lammoglia Cobo, M. F., Pawar, S. V. & Yadav, V. G. Cheminformatic analysis of antimalarial chemical space illuminates therapeutic mechanisms and offers strategies for therapy development. J. Chem. Inf. Model. 57, 2119–2131 (2017).
  18. Zdrazil, B. & Guha, R. The rise and fall of a scaffold: a trend analysis of scaffolds in the medicinal chemistry literature. J. Med. Chem. 61, 4688–4703 (2018).
  19. Cleves, A. E. & Jain, A. N. Effects of inductive bias on computational evaluations of ligand-based modeling and on drug discovery. J. Comput. Aided Mol. Des. 22, 147–159 (2008).
  20. Jain, A. N. & Cleves, A. E. Does your model weigh the same as a duck? J. Comput. Aided Mol. Des. 26, 57–67 (2012).
  21. Brown, D. G. & Boström, J. Analysis of past and present synthetic methodologies on medicinal chemistry: where have all the new reactions gone? J. Med. Chem. 59, 4443–4458 (2016).
  22. Brown, D. G., Gagnon, M. M. & Boström, J. Understanding our love affair with p-chlorophenyl: present day implications from historical biases of reagent selection. J. Med. Chem. 58, 2390–2405 (2015).
  23. Kirkwood, J., Hargreaves, D., O’Keefe, S. & Wilson, J. Analysis of crystallization data in the Protein Data Bank. Acta Crystallogr. F 71, 1228–1234 (2015).
  24. Rijssenbeek, J. T., Rose, D. J., Haushalter, R. C. & Zubieta, J. Novel clusters of transition metals and main group oxides in the alkylamine/oxovanadium/borate system. Angew. Chem. 36, 1008–1010 (1997).
  25. Duros, V. et al. Human versus robots in the discovery and crystallization of gigantic polyoxometalates. Angew. Chem. 56, 10815–10820 (2017).
  26. Cao, B. et al. How to optimize materials and devices via design of experiments and machine learning: demonstration using organic photovoltaics. ACS Nano 12, 7434–7444 (2018).
  27. Kahneman, D. & Klein, G. Conditions for intuitive expertise: a failure to disagree. Am. Psychol. 64, 515–526 (2009).
  28. Evans, D. W. et al. Human preferences for symmetry: subjective experience, cognitive conflict and cortical brain activity. PLoS One 7, e38966 (2012).
  29. Liu, Z. & Kersten, D. Three-dimensional symmetric shapes are discriminated more efficiently than asymmetric ones. J. Opt. Soc. Am. A 20, 1331–1340 (2003).
  30. Falcon, A. Aristotle on causality. The Stanford Encyclopedia of Philosophy Spring 2019 edn (ed. Zalta, E. N.) https://plato.stanford.edu/archives/spr2019/entries/aristotle-causality (Stanford Univ., 2019).
  31. Menard, W. H. & Sharman, G. Scientific uses of random drilling models. Science 190, 337–343 (1975).
  32. Menard, W. H. & Sharman, G. Random drilling. Science 192, 206–208 (1976).
  33. McNally, A., Prier, C. K. & MacMillan, D. W. C. Discovery of an α-amino C–H arylation reaction using the strategy of accelerated serendipity. Science 334, 1114–1117 (2011).
  34. Biondo, A. E., Pluchino, A. & Rapisarda, A. The beneficial role of random strategies in social and financial systems. J. Stat. Phys. 151, 607–622 (2013).
  35. Adler, P. et al. Auditing black-box models for indirect influence. Knowl. Inf. Syst. 54, 95–122 (2018).
  36. Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. The Cambridge Structural Database. Acta Crystallogr. B 72, 171–179 (2016).
  37. Landrum, G. RDKit: open-source cheminformatics http://www.rdkit.org (2018).
  38. ChemAxon. JChem cxcalc 5.2.0. http://www.chemaxon.com (2018).
  39. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
  40. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In 31st Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 4765–4774 (Curran Associates, 2017).
  41. Lundberg, S. M. SHAP. (SHapley Additive exPlanations) https://github.com/slundberg/shap (2019).
  42. Scheidegger, C., Falk, C., Friedler, S., Venkatasubramanian, S. & Nix, T. BlackBoxAuditing https://github.com/algofairness/BlackBoxAuditing (2019).

Acknowledgements

We thank G. Cattabriga for software engineering support and X. Weng for proofreading the experimental data entries. This project was funded by the National Science Foundation (award no. DMR-1709351). I.L. was partially supported by a Bryn Mawr College LILAC Summer Internship Funding Program. J.S. acknowledges the Henry Dreyfus Teacher-Scholar Award (TH-14-010).

Author information

Contributions

J.S. and A.J.N. conceived the project. A.R., H.W., X.J. and A.M. devised and performed the human-selected reactions, supervised by A.J.N. X.J. and A.M. collected historical notebook data, supervised by A.J.N.; A.J.N. and X.J. extracted the appropriate structures. A.L., I.L. and J.S. determined the amine counts from these structures. J.S. generated the random reactions. O.H., X.J., M.D. and A.M. performed the randomly generated and test set reactions, supervised by A.J.N. Statistical analysis was performed by A.L. and J.S. S.A.F. performed model construction and analysis. S.A.F., J.S. and A.J.N. performed and interpreted the feature influence analysis calculations. J.S., A.J.N. and S.A.F. wrote the paper. All authors discussed the results and commented on the manuscript.

Corresponding authors

Correspondence to Sorelle A. Friedler or Alexander J. Norquist or Joshua Schrier.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Peer review information: Nature thanks Leroy Cronin, Edward Kim and Hans Conrad Zur Loye for their contribution to the peer review of this work.

Extended data figures and tables

Extended Data Fig. 1 Cambridge Structural Database (CSD) search results for templated metal borates.

a, A plot of the number of unique structures for each amine, ordered from the amine with the fewest structures to the most. b, A plot of cumulative probability versus amine proportion. The grey rectangle represents the Pareto split.
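The cumulative curve in panel b and the Pareto split can be illustrated with a minimal sketch. It assumes that amines are ranked from most to fewest structures before accumulating; the function names (`cumulative_share`, `top_share`) and the counts are hypothetical and invented for illustration, and are not the study's code or data.

```python
from itertools import accumulate


def cumulative_share(counts):
    """Return (amine_proportion, cumulative_structure_share) pairs.

    Assumption: amines are ranked from most to fewest structures.
    """
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    n = len(ranked)
    return [((i + 1) / n, c / total) for i, c in enumerate(accumulate(ranked))]


def top_share(counts, fraction):
    """Share of all structures contributed by the top `fraction` of amines."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return sum(ranked[:k]) / sum(ranked)


# Toy structure counts per amine, invented for illustration only.
toy_counts = [40, 25, 10, 5, 5, 4, 3, 3, 3, 2]
curve = cumulative_share(toy_counts)
print(top_share(toy_counts, 0.2))  # 0.65 for these toy counts
```

With real counts, the point on `curve` where a small amine proportion meets a large cumulative share is the kind of split the grey rectangle marks.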

Extended Data Fig. 2 Amine price and availability.

a, Amine price versus quantity for the randomized reaction amines. The data are separated by amine popularity (popular, unpopular or absent). Amines used in the test set experiments are also included. b, Amine pricing information for those used in the randomized reactions. The price per gram was calculated assuming amine densities of 1 g ml−1. The data presented in the figures above suggest that there is no systematic difference in amine prices between the popular, unpopular and absent amines. Additionally, the distribution of amine pricing for the test set amines is similar to the other distributions, suggesting a representative sample of amines.

Extended Data Fig. 3 Outcome probabilities for not-popular, unpopular and absent organic amines.

The not-popular set includes the unpopular and absent amines.

Extended Data Fig. 4 Average nearest-neighbour distances in the datasets, and the effect of nearest-neighbour choices on model performance.

a, Average distances to the kth nearest neighbour within each training set. b, Average distances from each training set to the kth nearest neighbour within the test set. c, AUC for the kth nearest neighbour classifier for k = 1 to 100.
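The quantities in panels a and b can be sketched as follows. This is an illustration only: it assumes Euclidean distance on already-scaled feature vectors, which the caption does not specify, and the function name `avg_kth_nn_distance` and the toy points are hypothetical.

```python
import math


def avg_kth_nn_distance(queries, reference, k, same_set=False):
    """Average distance from each query point to its kth nearest neighbour.

    Set same_set=True when queries and reference are the same list, so that
    a point is not counted as its own neighbour (panel a). Leave it False
    when measuring from one dataset to another (panel b).
    """
    total = 0.0
    for i, q in enumerate(queries):
        dists = sorted(
            math.dist(q, r)
            for j, r in enumerate(reference)
            if not (same_set and i == j)
        )
        total += dists[k - 1]
    return total / len(queries)


# Toy feature vectors, invented for illustration only.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
test = [(0.0, 1.0), (3.0, 0.0)]

within = avg_kth_nn_distance(train, train, k=1, same_set=True)
across = avg_kth_nn_distance(train, test, k=1)
```

Repeating the call for each k and plotting the result against k gives curves of the kind described for panels a and b.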

Extended Data Fig. 5 Comparison of the influence of direct and indirect features.

a, Direct influence values of descriptors in the human reaction test set versus the random reaction test set. b, Indirect influence values of descriptors in the human reaction test set versus the random reaction test set.

Extended Data Table 1 Structure inclusion and exclusion criteria
Extended Data Table 2 Matthews correlation coefficient (MCC), accuracy and AUC results for each machine-learning algorithm, trained on either the human-selected or randomly generated reaction data using all features
Extended Data Table 3 Feature selection comparison
Extended Data Table 4 Comparison of discrepancies between model predictions and reaction outcomes
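For readers unfamiliar with the metrics named in Extended Data Table 2, the sketch below computes accuracy, MCC and AUC for a binary outcome from their standard definitions. It assumes labels coded as 1 (success) and 0 (failure); the function names and the toy labels and scores are hypothetical and are not taken from the study.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels, predictions and scores, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so it stays informative when successes and failures are unevenly represented.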

Supplementary information

Supplementary Information

The Supplementary Information document contains two figures and 21 tables, and a manifest describing the content of the electronic supplementary information file below.

Supplementary Data

The Supplementary Information zip file contains all experimental and computational data, and computational codes used for this study. A manifest is contained in the Supplementary Information document file.

About this article

Cite this article

Jia, X., Lynch, A., Huang, Y. et al. Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573, 251–255 (2019). https://doi.org/10.1038/s41586-019-1540-5