Most chemical experiments are planned by human scientists and are therefore subject to a variety of human cognitive biases [1], heuristics [2] and social influences [3]. These anthropogenic chemical reaction data are widely used to train machine-learning models [4] that predict organic [5] and inorganic [6,7] syntheses. However, it is known that societal biases are encoded in datasets and are perpetuated in machine-learning models [8]. Here we identify as-yet-unacknowledged anthropogenic biases in both the reagent choices and reaction conditions of chemical reaction datasets using a combination of data mining and experiments. We find that the amine choices in the reported crystal structures of hydrothermal synthesis of amine-templated metal oxides [9] follow a power-law distribution in which 17% of amine reactants occur in 79% of reported compounds, consistent with distributions in social influence models [10,11,12]. An analysis of unpublished historical laboratory notebook records shows similarly biased distributions of reaction condition choices. By performing 548 randomly generated experiments, we demonstrate that neither the popularity of reactants nor the choice of reaction conditions is correlated with the success of the reaction. We show that randomly generated experiments better illustrate the range of parameter choices that are compatible with crystal formation. Machine-learning models that we train on a smaller randomized reaction dataset outperform models trained on larger human-selected reaction datasets, demonstrating the importance of identifying and addressing anthropogenic biases in scientific data.
The authors declare that all data supporting the findings of this study are available within the article and its supplementary information.
The code used for this project is available in the supplementary information files.
Tversky, A. & Kahneman, D. Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974).
Gigerenzer, G. & Gaissmaier, W. Heuristic decision making. Annu. Rev. Psychol. 62, 451–482 (2011).
Salganik, M. J., Dodds, P. S. & Watts, D. J. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311, 854–856 (2006).
Henson, A. B., Gromski, P. S. & Cronin, L. Designing algorithms to aid discovery by chemical robots. ACS Cent. Sci. 4, 793–804 (2018).
Coley, C. W., Green, W. H. & Jensen, K. F. Machine learning in computer-aided synthesis planning. Acc. Chem. Res. 51, 1281–1289 (2018).
Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29, 9436–9444 (2017).
Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).
Cheetham, A. K., Férey, G. & Loiseau, T. Open-framework inorganic materials. Angew. Chem. 38, 3268–3292 (1999).
Price, D. D. S. A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inf. Sci. 27, 292–306 (1976).
Candia, C., Jara-Figueroa, C., Rodriguez-Sickert, C., Barabási, A.-L. & Hidalgo, C. A. The universal decay of collective memory and attention. Nat. Hum. Behav. 3, 82–91 (2018).
Carroll, H. A., Toumpakari, Z., Johnson, L. & Betts, J. A. The perceived feasibility of methods to reduce publication bias. PLoS One 12, e0186472 (2017).
Fortunato, S. et al. Science of science. Science 359, (2018).
Greenslade, P., Florentine, S. K., Hansen, B. D. & Gell, P. A. Biases encountered in long-term monitoring studies of invertebrates and microflora: Australian examples of protocols, personnel, tools and site location. Environ. Monit. Assess. 188, 491 (2016).
Boobier, S., Osbourn, A. & Mitchell, J. B. O. Can human experts predict solubility better than computers? J. Cheminform. 9, 63 (2017).
Keserű, G. M., Soós, T. & Kappe, C. O. Anthropogenic reaction parameters – the missing link between chemical intuition and the available chemical space. Chem. Soc. Rev. 43, 5387–5399 (2014).
Varela, J. N., Lammoglia Cobo, M. F., Pawar, S. V. & Yadav, V. G. Cheminformatic analysis of antimalarial chemical space illuminates therapeutic mechanisms and offers strategies for therapy development. J. Chem. Inf. Model. 57, 2119–2131 (2017).
Zdrazil, B. & Guha, R. The rise and fall of a scaffold: a trend analysis of scaffolds in the medicinal chemistry literature. J. Med. Chem. 61, 4688–4703 (2018).
Cleves, A. E. & Jain, A. N. Effects of inductive bias on computational evaluations of ligand-based modeling and on drug discovery. J. Comput. Aided Mol. Des. 22, 147–159 (2008).
Jain, A. N. & Cleves, A. E. Does your model weigh the same as a duck? J. Comput. Aided Mol. Des. 26, 57–67 (2012).
Brown, D. G. & Boström, J. Analysis of past and present synthetic methodologies on medicinal chemistry: where have all the new reactions gone? J. Med. Chem. 59, 4443–4458 (2016).
Brown, D. G., Gagnon, M. M. & Boström, J. Understanding our love affair with p-chlorophenyl: present day implications from historical biases of reagent selection. J. Med. Chem. 58, 2390–2405 (2015).
Kirkwood, J., Hargreaves, D., O’Keefe, S. & Wilson, J. Analysis of crystallization data in the Protein Data Bank. Acta Crystallogr. F 71, 1228–1234 (2015).
Rijssenbeek, J. T., Rose, D. J., Haushalter, R. C. & Zubieta, J. Novel clusters of transition metals and main group oxides in the alkylamine/oxovanadium/borate system. Angew. Chem. 36, 1008–1010 (1997).
Duros, V. et al. Human versus robots in the discovery and crystallization of gigantic polyoxometalates. Angew. Chem. 56, 10815–10820 (2017).
Cao, B. et al. How to optimize materials and devices via design of experiments and machine learning: demonstration using organic photovoltaics. ACS Nano 12, 7434–7444 (2018).
Kahneman, D. & Klein, G. Conditions for intuitive expertise: a failure to disagree. Am. Psychol. 64, 515–526 (2009).
Evans, D. W. et al. Human preferences for symmetry: subjective experience, cognitive conflict and cortical brain activity. PLoS One 7, e38966 (2012).
Liu, Z. & Kersten, D. Three-dimensional symmetric shapes are discriminated more efficiently than asymmetric ones. J. Opt. Soc. Am. A 20, 1331–1340 (2003).
Falcon, A. Aristotle on causality. The Stanford Encyclopedia of Philosophy Spring 2019 edn (ed. Zalta, E. N.) https://plato.stanford.edu/archives/spr2019/entries/aristotle-causality (Stanford Univ., 2019).
Menard, W. H. & Sharman, G. Scientific uses of random drilling models. Science 190, 337–343 (1975).
Menard, W. H. & Sharman, G. Random drilling. Science 192, 206–208 (1976).
McNally, A., Prier, C. K. & MacMillan, D. W. C. Discovery of an α-amino C–H arylation reaction using the strategy of accelerated serendipity. Science 334, 1114–1117 (2011).
Biondo, A. E., Pluchino, A. & Rapisarda, A. The beneficial role of random strategies in social and financial systems. J. Stat. Phys. 151, 607–622 (2013).
Adler, P. et al. Auditing black-box models for indirect influence. Knowl. Inf. Syst. 54, 95–122 (2018).
Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. The Cambridge Structural Database. Acta Crystallogr. B 72, 171–179 (2016).
Landrum, G. RDKit: open-source cheminformatics http://www.rdkit.org (2018).
ChemAxon. JChem cxcalc 5.2.0. http://www.chemaxon.com (2018).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In 31st Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 4765–4774 (Curran Associates, 2017).
Lundberg, S. M. SHAP. (SHapley Additive exPlanations) https://github.com/slundberg/shap (2019).
Scheidegger, C., Falk, C., Friedler, S., Venkatasubramanian, S. & Nix, T. BlackBoxAuditing https://github.com/algofairness/BlackBoxAuditing (2019).
We thank G. Cattabriga for software engineering support and X. Weng for proofreading the experimental data entries. This project was funded by the National Science Foundation (award no. DMR-1709351). I.L. was partially supported by a Bryn Mawr College LILAC Summer Internship Funding Program. J.S. acknowledges the Henry Dreyfus Teacher-Scholar Award (TH-14-010).
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Peer review information Nature thanks Leroy Cronin, Edward Kim and Hans Conrad Zur Loye for their contribution to the peer review of this work.
Extended data figures and tables
Extended Data Fig. 1 Cambridge Structural Database (CSD) search results for templated metal borates.
a, A plot of the number of unique structures for each amine, ordered from the amine with the fewest structures to the most. b, A plot of cumulative probability versus amine proportion. The grey rectangle represents the Pareto split.
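The curve in panel b can be illustrated with a minimal sketch. It assumes only a list of per-amine structure counts; the function names and the toy counts are hypothetical and are not taken from the authors' code or from the CSD search results. Amines are ranked from most to least used, and the cumulative fraction of structures is tracked against the fraction of amines considered.

```python
def cumulative_share(counts):
    """Return (amine_fraction, structure_fraction) pairs, most-used amine first."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    points = []
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        points.append((rank / n, running / total))
    return points


def share_of_top(counts, top_fraction):
    """Fraction of all structures contributed by the most-used `top_fraction` of amines."""
    points = cumulative_share(counts)
    n_top = max(1, int(round(top_fraction * len(points))))
    return points[n_top - 1][1]


# Invented counts for illustration only; NOT data from the study.
toy_counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
top20 = share_of_top(toy_counts, 0.2)
print(f"Top 20% of amines account for {top20:.0%} of structures")
```

Applied to real per-amine counts, the same calculation would give the kind of split quoted in the abstract (17% of amines in 79% of compounds), although the exact definition of the Pareto split used for the grey rectangle is not stated here.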
a, Amine price versus quantity for the randomized reaction amines. The data are separated by amine popularity (popular, unpopular or absent). Amines used in the test set experiments are also included. b, Amine pricing information for those used in the randomized reactions. The price per gram was calculated assuming amine densities of 1 g ml−1. The data presented in the figures above suggest that there is no systematic difference in amine prices between the popular, unpopular and absent amines. Additionally, the distribution of amine pricing for the test set amines is similar to the other distributions, suggesting a representative sample of amines.
The not-popular set includes the unpopular and absent amines.
Extended Data Fig. 4 Average nearest-neighbour distances in the datasets, and the effect of nearest-neighbour choices on model performance.
a, Average distances to the kth nearest neighbour within each training set. b, Average distances from each training set to the kth nearest neighbour within the test set. c, AUC for the kth nearest neighbour classifier for k = 1 to 100.
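The distance summaries in panels a and b can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes Euclidean distance on numerical descriptor vectors (the metric and any descriptor scaling are not specified here), and the function names and toy points are hypothetical.

```python
import math


def kth_nn_distance(point, others, k):
    """Euclidean distance from `point` to its k-th nearest neighbour in `others`."""
    distances = sorted(math.dist(point, other) for other in others)
    return distances[k - 1]


def mean_kth_within(points, k):
    """Panel a analogue: average k-th neighbour distance inside one set (self excluded)."""
    total = 0.0
    for i, point in enumerate(points):
        others = points[:i] + points[i + 1:]
        total += kth_nn_distance(point, others, k)
    return total / len(points)


def mean_kth_between(source, target, k):
    """Panel b analogue: average distance from each `source` point to its k-th neighbour in `target`."""
    return sum(kth_nn_distance(p, target, k) for p in source) / len(source)


# Toy 2-D descriptor vectors, invented for illustration only.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
test = [(0.0, 1.0), (3.0, 0.0)]
within_k1 = mean_kth_within(train, 1)
between_k1 = mean_kth_between(train, test, 1)
print(within_k1, between_k1)
```

A larger within-set distance indicates a training set that is spread more widely over descriptor space, and a smaller train-to-test distance indicates that the test reactions lie closer to what the training set covers. The brute-force search is quadratic in the number of points and is meant only to make the definition concrete.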
a, Direct influence values of descriptors in the human reaction test set versus the random reaction test set. b, Indirect influence values of descriptors in the human reaction test set versus the random reaction test set.
The Supplementary Information document contains two figures and 21 tables, and a manifest describing the content of the electronic supplementary information file below.
The Supplementary Information zip file contains all experimental and computational data, and computational codes used for this study. A manifest is contained in the Supplementary Information document file.
About this article
Cite this article
Jia, X., Lynch, A., Huang, Y. et al. Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573, 251–255 (2019). https://doi.org/10.1038/s41586-019-1540-5