Abstract
Most chemical experiments are planned by human scientists and are therefore subject to a variety of human cognitive biases [1], heuristics [2] and social influences [3]. These anthropogenic chemical reaction data are widely used to train machine-learning models [4] that predict organic [5] and inorganic [6,7] syntheses. However, it is known that societal biases are encoded in datasets and are perpetuated in machine-learning models [8]. Here we identify as-yet-unacknowledged anthropogenic biases in both the reagent choices and reaction conditions of chemical reaction datasets using a combination of data mining and experiments. We find that the amine choices in the reported crystal structures of hydrothermal synthesis of amine-templated metal oxides [9] follow a power-law distribution in which 17% of amine reactants occur in 79% of reported compounds, consistent with distributions in social influence models [10,11,12]. An analysis of unpublished historical laboratory notebook records shows similarly biased distributions of reaction condition choices. By performing 548 randomly generated experiments, we demonstrate that the popularity of reactants and the choices of reaction conditions are uncorrelated with the success of the reaction. We show that randomly generated experiments better illustrate the range of parameter choices that are compatible with crystal formation. Machine-learning models that we train on a smaller randomized reaction dataset outperform models trained on larger human-selected reaction datasets, demonstrating the importance of identifying and addressing anthropogenic biases in scientific data.
Data availability
The authors declare that all data supporting the findings of this study are available within the article and its supplementary information.
Code availability
The code used for this project is available in the supplementary information files.
References
Tversky, A. & Kahneman, D. Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974).
Gigerenzer, G. & Gaissmaier, W. Heuristic decision making. Annu. Rev. Psychol. 62, 451–482 (2011).
Salganik, M. J., Dodds, P. S. & Watts, D. J. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311, 854–856 (2006).
Henson, A. B., Gromski, P. S. & Cronin, L. Designing algorithms to aid discovery by chemical robots. ACS Cent. Sci. 4, 793–804 (2018).
Coley, C. W., Green, W. H. & Jensen, K. F. Machine learning in computer-aided synthesis planning. Acc. Chem. Res. 51, 1281–1289 (2018).
Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29, 9436–9444 (2017).
Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).
Cheetham, A. K., Férey, G. & Loiseau, T. Open-framework inorganic materials. Angew. Chem. 38, 3268–3292 (1999).
Price, D. D. S. A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inf. Sci. 27, 292–306 (1976).
Candia, C., Jara-Figueroa, C., Rodriguez-Sickert, C., Barabási, A.-L. & Hidalgo, C. A. The universal decay of collective memory and attention. Nat. Hum. Behav. 3, 82–91 (2018).
Carroll, H. A., Toumpakari, Z., Johnson, L. & Betts, J. A. The perceived feasibility of methods to reduce publication bias. PLoS One 12, e0186472 (2017).
Fortunato, S. et al. Science of science. Science 359, (2018).
Greenslade, P., Florentine, S. K., Hansen, B. D. & Gell, P. A. Biases encountered in long-term monitoring studies of invertebrates and microflora: Australian examples of protocols, personnel, tools and site location. Environ. Monit. Assess. 188, 491 (2016).
Boobier, S., Osbourn, A. & Mitchell, J. B. O. Can human experts predict solubility better than computers? J. Cheminform. 9, 63 (2017).
Keserű, G. M., Soós, T. & Kappe, C. O. Anthropogenic reaction parameters – the missing link between chemical intuition and the available chemical space. Chem. Soc. Rev. 43, 5387–5399 (2014).
Varela, J. N., Lammoglia Cobo, M. F., Pawar, S. V. & Yadav, V. G. Cheminformatic analysis of antimalarial chemical space illuminates therapeutic mechanisms and offers strategies for therapy development. J. Chem. Inf. Model. 57, 2119–2131 (2017).
Zdrazil, B. & Guha, R. The rise and fall of a scaffold: a trend analysis of scaffolds in the medicinal chemistry literature. J. Med. Chem. 61, 4688–4703 (2018).
Cleves, A. E. & Jain, A. N. Effects of inductive bias on computational evaluations of ligand-based modeling and on drug discovery. J. Comput. Aided Mol. Des. 22, 147–159 (2008).
Jain, A. N. & Cleves, A. E. Does your model weigh the same as a duck? J. Comput. Aided Mol. Des. 26, 57–67 (2012).
Brown, D. G. & Boström, J. Analysis of past and present synthetic methodologies on medicinal chemistry: where have all the new reactions gone? J. Med. Chem. 59, 4443–4458 (2016).
Brown, D. G., Gagnon, M. M. & Boström, J. Understanding our love affair with p-chlorophenyl: present day implications from historical biases of reagent selection. J. Med. Chem. 58, 2390–2405 (2015).
Kirkwood, J., Hargreaves, D., O’Keefe, S. & Wilson, J. Analysis of crystallization data in the Protein Data Bank. Acta Crystallogr. F 71, 1228–1234 (2015).
Rijssenbeek, J. T., Rose, D. J., Haushalter, R. C. & Zubieta, J. Novel clusters of transition metals and main group oxides in the alkylamine/oxovanadium/borate system. Angew. Chem. 36, 1008–1010 (1997).
Duros, V. et al. Human versus robots in the discovery and crystallization of gigantic polyoxometalates. Angew. Chem. 56, 10815–10820 (2017).
Cao, B. et al. How to optimize materials and devices via design of experiments and machine learning: demonstration using organic photovoltaics. ACS Nano 12, 7434–7444 (2018).
Kahneman, D. & Klein, G. Conditions for intuitive expertise: a failure to disagree. Am. Psychol. 64, 515–526 (2009).
Evans, D. W. et al. Human preferences for symmetry: subjective experience, cognitive conflict and cortical brain activity. PLoS One 7, e38966 (2012).
Liu, Z. & Kersten, D. Three-dimensional symmetric shapes are discriminated more efficiently than asymmetric ones. J. Opt. Soc. Am. A 20, 1331–1340 (2003).
Falcon, A. Aristotle on causality. The Stanford Encyclopedia of Philosophy Spring 2019 edn (ed. Zalta, E. N.) https://plato.stanford.edu/archives/spr2019/entries/aristotle-causality (Stanford Univ., 2019).
Menard, W. H. & Sharman, G. Scientific uses of random drilling models. Science 190, 337–343 (1975).
Menard, W. H. & Sharman, G. Random drilling. Science 192, 206–208 (1976).
McNally, A., Prier, C. K. & MacMillan, D. W. C. Discovery of an α-amino C–H arylation reaction using the strategy of accelerated serendipity. Science 334, 1114–1117 (2011).
Biondo, A. E., Pluchino, A. & Rapisarda, A. The beneficial role of random strategies in social and financial systems. J. Stat. Phys. 151, 607–622 (2013).
Adler, P. et al. Auditing black-box models for indirect influence. Knowl. Inf. Syst. 54, 95–122 (2018).
Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. The Cambridge Structural Database. Acta Crystallogr. B 72, 171–179 (2016).
Landrum, G. RDKit: open-source cheminformatics http://www.rdkit.org (2018).
ChemAxon. JChem cxcalc 5.2.0. http://www.chemaxon.com (2018).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In 31st Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 4765–4774 (Curran Associates, 2017).
Lundberg, S. M. SHAP. (SHapley Additive exPlanations) https://github.com/slundberg/shap (2019).
Scheidegger, C., Falk, C., Friedler, S., Venkatasubramanian, S. & Nix, T. BlackBoxAuditing https://github.com/algofairness/BlackBoxAuditing (2019).
Acknowledgements
We thank G. Cattabriga for software engineering support and X. Weng for proofreading the experimental data entries. This project was funded by the National Science Foundation (award no. DMR-1709351). I.L. was partially supported by a Bryn Mawr College LILAC Summer Internship Funding Program. J.S. acknowledges the Henry Dreyfus Teacher-Scholar Award (TH-14-010).
Author information
Contributions
J.S. and A.J.N. conceived the project. A.R., H.W., X.J. and A.M. devised and performed the human-selected reactions, supervised by A.J.N. X.J. and A.M. collected historical notebook data, supervised by A.J.N. A.J.N. and X.J. extracted the appropriate structures. A.L., I.L. and J.S. determined the amine counts from these structures. J.S. generated the random reactions. O.H., X.J., M.D. and A.M. performed the randomly generated and test set reactions, supervised by A.J.N. Statistical analysis was performed by A.L. and J.S. S.A.F. performed model construction and analysis. S.A.F., J.S. and A.J.N. performed and interpreted the feature influence analysis calculations. J.S., A.J.N. and S.A.F. wrote the paper. All authors discussed the results and commented on the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Peer review information Nature thanks Leroy Cronin, Edward Kim and Hans Conrad Zur Loye for their contribution to the peer review of this work.
Extended data figures and tables
Extended Data Fig. 1 Cambridge Structural Database (CSD) search results for templated metal borates.
a, A plot of the number of unique structures for each amine, ordered from the amine with the fewest structures to the most. b, A plot of cumulative probability versus amine proportion. The grey rectangle represents the Pareto split.
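The two panels described above can be reproduced from a table of per-amine structure counts. The following is a minimal sketch, not the authors' code: the function names `cumulative_share` and `top_share` are hypothetical, and the counts in `toy_counts` are invented purely for illustration and are not taken from the CSD search.

```python
from typing import Dict, List, Tuple


def cumulative_share(counts: Dict[str, int]) -> List[Tuple[float, float]]:
    """Return (fraction of amines, cumulative fraction of structures) points.

    Amines are ordered from fewest structures to most, as in panel a.
    """
    ordered = sorted(counts.values())
    total = sum(ordered)
    points = []
    running = 0
    for i, count in enumerate(ordered, start=1):
        running += count
        points.append((i / len(ordered), running / total))
    return points


def top_share(counts: Dict[str, int], top_fraction: float) -> float:
    """Fraction of all structures contributed by the most-used amines."""
    ordered = sorted(counts.values(), reverse=True)
    n_top = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:n_top]) / sum(ordered)


# Hypothetical counts, invented for illustration only (not real CSD data).
toy_counts = {
    "amine_a": 50, "amine_b": 20, "amine_c": 8, "amine_d": 6, "amine_e": 5,
    "amine_f": 4, "amine_g": 3, "amine_h": 2, "amine_i": 1, "amine_j": 1,
}

if __name__ == "__main__":
    share = top_share(toy_counts, 0.2)
    print(f"Top 20% of amines account for {share:.0%} of structures")
```

A Pareto-style split such as the one reported in the abstract (17% of amines in 79% of compounds) corresponds to reading one point off this cumulative curve.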
Extended Data Fig. 2 Amine price and availability.
a, Amine price versus quantity for the randomized reaction amines. The data are separated by amine popularity (popular, unpopular or absent). Amines used in the test set experiments are also included. b, Amine pricing information for those used in the randomized reactions. The price per gram was calculated assuming amine densities of 1 g ml−1. The data presented in the figures above suggest that there is no systematic difference in amine prices between the popular, unpopular and absent amines. Additionally, the distribution of amine pricing for the test set amines is similar to the other distributions, suggesting a representative sample of amines.
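The price-per-gram conversion mentioned in this caption is simple arithmetic. As a rough sketch under stated assumptions: the function name `price_per_gram` and the unit labels `"g"` and `"ml"` are hypothetical, the currency is left unspecified, and the default density of 1 g ml−1 mirrors the assumption stated in the caption.

```python
def price_per_gram(price: float, quantity: float, unit: str,
                   density_g_per_ml: float = 1.0) -> float:
    """Convert a catalogue price to a price per gram.

    Quantities sold by volume are converted to mass using the given
    density (default 1 g/ml, the assumption stated in the caption).
    """
    if unit == "g":
        grams = quantity
    elif unit == "ml":
        grams = quantity * density_g_per_ml
    else:
        raise ValueError(f"unsupported unit: {unit!r}")
    if grams <= 0:
        raise ValueError("quantity must be positive")
    return price / grams


if __name__ == "__main__":
    # Hypothetical catalogue entries, for illustration only.
    print(price_per_gram(50.0, 25.0, "ml"))  # sold by volume
    print(price_per_gram(30.0, 10.0, "g"))   # sold by mass
```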
Extended Data Fig. 3 Outcome probabilities for not-popular, unpopular and absent organic amines.
The not-popular set includes the unpopular and absent amines.
Extended Data Fig. 4 Average nearest-neighbour distances in the datasets, and the effect of nearest-neighbour choices on model performance.
a, Average distances to the kth nearest neighbour within each training set. b, Average distances from each training set to the kth nearest neighbour within the test set. c, AUC for the kth nearest neighbour classifier for k = 1 to 100.
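The quantities in panels a and b can be sketched as follows. This is an illustrative sketch only: the caption does not state the distance metric or feature scaling, so Euclidean distance on raw feature vectors is an assumption here, and the function names are hypothetical. The classifier AUC in panel c is not covered.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def kth_neighbour_distance(query: Point, pool: List[Point], k: int) -> float:
    """Distance from `query` to its kth nearest point in `pool` (k starts at 1).

    Euclidean distance is an assumption; the source does not name the metric.
    """
    if not 1 <= k <= len(pool):
        raise ValueError("k must be between 1 and the pool size")
    distances = sorted(math.dist(query, p) for p in pool)
    return distances[k - 1]


def mean_kth_distance_within(points: List[Point], k: int) -> float:
    """Panel a analogue: average kth-neighbour distance inside one dataset."""
    total = 0.0
    for i, query in enumerate(points):
        others = points[:i] + points[i + 1:]  # exclude the point itself
        total += kth_neighbour_distance(query, others, k)
    return total / len(points)


def mean_kth_distance_between(train: List[Point], test: List[Point], k: int) -> float:
    """Panel b analogue: average distance from training points to the test set."""
    return sum(kth_neighbour_distance(q, test, k) for q in train) / len(train)


if __name__ == "__main__":
    # Toy one-dimensional-like data, invented for illustration only.
    toy = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    print(mean_kth_distance_within(toy, 1))
    print(mean_kth_distance_between([(0.0, 0.0)], [(3.0, 4.0), (6.0, 8.0)], 1))
```

Larger average distances within a training set suggest broader coverage of the parameter space, which is the kind of comparison these panels are drawing between the human-selected and randomized datasets.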
Extended Data Fig. 5 Comparison of the influence of direct and indirect features.
a, Direct influence values of descriptors in the human reaction test set versus the random reaction test set. b, Indirect influence values of descriptors in the human reaction test set versus the random reaction test set.
Supplementary information
Supplementary Information
The Supplementary Information document contains two figures and 21 tables, and a manifest describing the content of the electronic supplementary information file below.
Supplementary Data
The Supplementary Information zip file contains all experimental and computational data, and computational codes used for this study. A manifest is contained in the Supplementary Information document file.
About this article
Cite this article
Jia, X., Lynch, A., Huang, Y. et al. Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573, 251–255 (2019). https://doi.org/10.1038/s41586-019-1540-5
DOI: https://doi.org/10.1038/s41586-019-1540-5