Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis

Jia, Xiwen; Lynch, Allyson; Huang, Yuheng; Danielson, Matthew; Lang’at, Immaculate; Milder, Alexander; Ruby, Aaron E.; Wang, Hao; Friedler, Sorelle A.; Norquist, Alexander J.; Schrier, Joshua

doi:10.1038/s41586-019-1540-5

Letter
Published: 11 September 2019

Xiwen Jia¹, Allyson Lynch¹, Yuheng Huang¹, Matthew Danielson¹, Immaculate Lang’at¹, Alexander Milder¹, Aaron E. Ruby¹, Hao Wang¹, Sorelle A. Friedler², Alexander J. Norquist¹ & Joshua Schrier¹,³

Nature volume 573, pages 251–255 (2019)

Abstract

Most chemical experiments are planned by human scientists and therefore are subject to a variety of human cognitive biases¹, heuristics² and social influences³. These anthropogenic chemical reaction data are widely used to train machine-learning models⁴ that are used to predict organic⁵ and inorganic⁶,⁷ syntheses. However, it is known that societal biases are encoded in datasets and are perpetuated in machine-learning models⁸. Here we identify as-yet-unacknowledged anthropogenic biases in both the reagent choices and reaction conditions of chemical reaction datasets using a combination of data mining and experiments. We find that the amine choices in the reported crystal structures of hydrothermal synthesis of amine-templated metal oxides⁹ follow a power-law distribution in which 17% of amine reactants occur in 79% of reported compounds, consistent with distributions in social influence models¹⁰,¹¹,¹². An analysis of unpublished historical laboratory notebook records shows similarly biased distributions of reaction condition choices. By performing 548 randomly generated experiments, we demonstrate that the popularity of reactants or the choices of reaction conditions are uncorrelated with the success of the reaction. We show that randomly generated experiments better illustrate the range of parameter choices that are compatible with crystal formation. Machine-learning models that we train on a smaller randomized reaction dataset outperform models trained on larger human-selected reaction datasets, demonstrating the importance of identifying and addressing anthropogenic biases in scientific data.

**Fig. 1: Occurrence of amines in structures of reported metal oxide crystals.**

**Fig. 2: Distribution of the choices of reaction parameters and reaction outcomes.**

**Fig. 3: Reaction outcomes from randomly generated experiments for popular amines and not-popular (unpopular and absent) amines.**

Data availability

The authors declare that all data supporting the findings of this study are available within the article and its supplementary information.

Code availability

The code used for this project is available in the supplementary information files.

References

Tversky, A. & Kahneman, D. Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974).
Gigerenzer, G. & Gaissmaier, W. Heuristic decision making. Annu. Rev. Psychol. 62, 451–482 (2011).
Salganik, M. J., Dodds, P. S. & Watts, D. J. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311, 854–856 (2006).
Henson, A. B., Gromski, P. S. & Cronin, L. Designing algorithms to aid discovery by chemical robots. ACS Cent. Sci. 4, 793–804 (2018).
Coley, C. W., Green, W. H. & Jensen, K. F. Machine learning in computer-aided synthesis planning. Acc. Chem. Res. 51, 1281–1289 (2018).
Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29, 9436–9444 (2017).
Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).
Cheetham, A. K., Férey, G. & Loiseau, T. Open-framework inorganic materials. Angew. Chem. 38, 3268–3292 (1999).
Price, D. D. S. A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inf. Sci. 27, 292–306 (1976).
Candia, C., Jara-Figueroa, C., Rodriguez-Sickert, C., Barabási, A.-L. & Hidalgo, C. A. The universal decay of collective memory and attention. Nat. Hum. Behav. 3, 82–91 (2018).
Carroll, H. A., Toumpakari, Z., Johnson, L. & Betts, J. A. The perceived feasibility of methods to reduce publication bias. PLoS One 12, e0186472 (2017).
Fortunato, S. et al. Science of science. Science 359, (2018).
Greenslade, P., Florentine, S. K., Hansen, B. D. & Gell, P. A. Biases encountered in long-term monitoring studies of invertebrates and microflora: Australian examples of protocols, personnel, tools and site location. Environ. Monit. Assess. 188, 491 (2016).
Boobier, S., Osbourn, A. & Mitchell, J. B. O. Can human experts predict solubility better than computers? J. Cheminform. 9, 63 (2017).
Keserű, G. M., Soós, T. & Kappe, C. O. Anthropogenic reaction parameters – the missing link between chemical intuition and the available chemical space. Chem. Soc. Rev. 43, 5387–5399 (2014).
Varela, J. N., Lammoglia Cobo, M. F., Pawar, S. V. & Yadav, V. G. Cheminformatic analysis of antimalarial chemical space illuminates therapeutic mechanisms and offers strategies for therapy development. J. Chem. Inf. Model. 57, 2119–2131 (2017).
Zdrazil, B. & Guha, R. The rise and fall of a scaffold: a trend analysis of scaffolds in the medicinal chemistry literature. J. Med. Chem. 61, 4688–4703 (2018).
Cleves, A. E. & Jain, A. N. Effects of inductive bias on computational evaluations of ligand-based modeling and on drug discovery. J. Comput. Aided Mol. Des. 22, 147–159 (2008).
Jain, A. N. & Cleves, A. E. Does your model weigh the same as a duck? J. Comput. Aided Mol. Des. 26, 57–67 (2012).
Brown, D. G. & Boström, J. Analysis of past and present synthetic methodologies on medicinal chemistry: where have all the new reactions gone? J. Med. Chem. 59, 4443–4458 (2016).
Brown, D. G., Gagnon, M. M. & Boström, J. Understanding our love affair with p-chlorophenyl: present day implications from historical biases of reagent selection. J. Med. Chem. 58, 2390–2405 (2015).
Kirkwood, J., Hargreaves, D., O’Keefe, S. & Wilson, J. Analysis of crystallization data in the Protein Data Bank. Acta Crystallogr. F 71, 1228–1234 (2015).
Rijssenbeek, J. T., Rose, D. J., Haushalter, R. C. & Zubieta, J. Novel clusters of transition metals and main group oxides in the alkylamine/oxovanadium/borate system. Angew. Chem. 36, 1008–1010 (1997).
Duros, V. et al. Human versus robots in the discovery and crystallization of gigantic polyoxometalates. Angew. Chem. 56, 10815–10820 (2017).
Cao, B. et al. How to optimize materials and devices via design of experiments and machine learning: demonstration using organic photovoltaics. ACS Nano 12, 7434–7444 (2018).
Kahneman, D. & Klein, G. Conditions for intuitive expertise: a failure to disagree. Am. Psychol. 64, 515–526 (2009).
Evans, D. W. et al. Human preferences for symmetry: subjective experience, cognitive conflict and cortical brain activity. PLoS One 7, e38966 (2012).
Liu, Z. & Kersten, D. Three-dimensional symmetric shapes are discriminated more efficiently than asymmetric ones. J. Opt. Soc. Am. A 20, 1331–1340 (2003).
Falcon, A. Aristotle on causality. The Stanford Encyclopedia of Philosophy Spring 2019 edn (ed. Zalta, E. N.) https://plato.stanford.edu/archives/spr2019/entries/aristotle-causality (Stanford Univ., 2019).
Menard, W. H. & Sharman, G. Scientific uses of random drilling models. Science 190, 337–343 (1975).
Menard, W. H. & Sharman, G. Random drilling. Science 192, 206–208 (1976).
McNally, A., Prier, C. K. & MacMillan, D. W. C. Discovery of an α-amino C–H arylation reaction using the strategy of accelerated serendipity. Science 334, 1114–1117 (2011).
Biondo, A. E., Pluchino, A. & Rapisarda, A. The beneficial role of random strategies in social and financial systems. J. Stat. Phys. 151, 607–622 (2013).
Adler, P. et al. Auditing black-box models for indirect influence. Knowl. Inf. Syst. 54, 95–122 (2018).
Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. The Cambridge Structural Database. Acta Crystallogr. B 72, 171–179 (2016).
Landrum, G. RDKit: open-source cheminformatics http://www.rdkit.org (2018).
ChemAxon. JChem cxcalc 5.2.0. http://www.chemaxon.com (2018).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In 31st Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 4765–4774 (Curran Associates, 2017).
Lundberg, S. M. SHAP. (SHapley Additive exPlanations) https://github.com/slundberg/shap (2019).
Scheidegger, C., Falk, C., Friedler, S., Venkatasubramanian, S. & Nix, T. BlackBoxAuditing https://github.com/algofairness/BlackBoxAuditing (2019).

Acknowledgements

We thank G. Cattabriga for software engineering support and X. Weng for proofreading the experimental data entries. This project was funded by the National Science Foundation (award no. DMR-1709351). I.L. was partially supported by a Bryn Mawr College LILAC Summer Internship Funding Program. J.S. acknowledges the Henry Dreyfus Teacher-Scholar Award (TH-14-010).

Author information

Authors and Affiliations

Department of Chemistry, Haverford College, Haverford, PA, USA
Xiwen Jia, Allyson Lynch, Yuheng Huang, Matthew Danielson, Immaculate Lang’at, Alexander Milder, Aaron E. Ruby, Hao Wang, Alexander J. Norquist & Joshua Schrier
Department of Computer Science, Haverford College, Haverford, PA, USA
Sorelle A. Friedler
Department of Chemistry, Fordham University, The Bronx, New York, NY, USA
Joshua Schrier

Contributions

J.S. and A.J.N. conceived the project. A.R., H.W., X.J. and A.M. devised and performed the human-selected reactions, supervised by A.J.N. X.J. and A.M. collected historical notebook data, supervised by A.J.N.; A.J.N. and X.J. extracted the appropriate structures. A.L., I.L. and J.S. determined the amine counts from these structures. J.S. generated the random reactions. O.H., X.J., M.D. and A.M. performed the randomly generated and test set reactions, supervised by A.J.N. Statistical analysis was performed by A.L. and J.S. S.A.F. performed model construction and analysis. S.A.F., J.S. and A.J.N. performed and interpreted the feature influence analysis calculations. J.S., A.J.N. and S.A.F. wrote the paper. All authors discussed the results and commented on the manuscript.

Corresponding authors

Correspondence to Sorelle A. Friedler, Alexander J. Norquist or Joshua Schrier.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Peer review information Nature thanks Leroy Cronin, Edward Kim and Hans Conrad Zur Loye for their contribution to the peer review of this work.

Extended data figures and tables

Extended Data Fig. 1 Cambridge Structural Database (CSD) search results for templated metal borates.

a, A plot of the number of unique structures for each amine, ordered from the amine with the fewest structures to the most. b, A plot of cumulative probability versus amine proportion. The grey rectangle represents the Pareto split.
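The cumulative curve in panel b can be illustrated with a minimal Python sketch. The counts below are made up for illustration and are not data from this study; the function names `cumulative_share` and `share_of_top` are hypothetical, and the exact definition of the Pareto split used for the figure is not given here, so the sketch only reports the share of structures contributed by a chosen top fraction of amines.

```python
from itertools import accumulate


def cumulative_share(counts):
    """Return (amine_fraction, structure_fraction) pairs, most-used amines first."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    return [((i + 1) / n, running / total)
            for i, running in enumerate(accumulate(ordered))]


def share_of_top(counts, fraction):
    """Share of all structures contributed by the top `fraction` of amines."""
    ordered = sorted(counts, reverse=True)
    k = max(1, round(fraction * len(ordered)))  # number of amines in the top group
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical structures-per-amine counts (assumption, NOT data from the paper).
toy_counts = [40, 25, 10, 5, 5, 4, 3, 3, 2, 1, 1, 1]

curve = cumulative_share(toy_counts)
top_quarter = share_of_top(toy_counts, 0.25)
print(f"Top 25% of amines account for {top_quarter:.0%} of structures")
```

A strongly skewed set of counts gives a curve that rises steeply at small amine fractions, which is the qualitative shape associated with a power-law-like popularity distribution.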

Extended Data Fig. 2 Amine price and availability.

a, Amine price versus quantity for the randomized reaction amines. The data are separated by amine popularity (popular, unpopular or absent). Amines used in the test set experiments are also included. b, Amine pricing information for those used in the randomized reactions. The price per gram was calculated assuming amine densities of 1 g ml⁻¹. The data presented in the figures above suggest that there is no systematic difference in amine prices between the popular, unpopular and absent amines. Additionally, the distribution of amine pricing for the test set amines is similar to the other distributions, suggesting a representative sample of amines.

Extended Data Fig. 3 Outcome probabilities for not-popular, unpopular and absent organic amines.

The not-popular set includes the unpopular and absent amines.

Extended Data Fig. 4 Average nearest-neighbour distances in the datasets, and the effect of nearest-neighbour choices on model performance.

a, Average distances to the kth nearest neighbour within each training set. b, Average distances from each training set to the kth nearest neighbour within the test set. c, AUC for the kth nearest neighbour classifier for k = 1 to 100.
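The two distance summaries in panels a and b can be sketched as follows. This is an illustration only: the Euclidean metric, the absence of feature scaling, the toy coordinates and the function names are all assumptions, since the descriptor space and distance measure used for the figure are not stated here.

```python
import math


def kth_nn_distance(point, others, k):
    """Euclidean distance from `point` to its k-th nearest neighbour in `others`."""
    distances = sorted(math.dist(point, other) for other in others)
    return distances[k - 1]


def mean_kth_nn_within(points, k):
    """Average k-th neighbour distance inside one set (each point excludes itself)."""
    total = 0.0
    for i, p in enumerate(points):
        others = points[:i] + points[i + 1:]
        total += kth_nn_distance(p, others, k)
    return total / len(points)


def mean_kth_nn_between(train, test, k):
    """Average distance from each training point to its k-th neighbour in the test set."""
    return sum(kth_nn_distance(p, test, k) for p in train) / len(train)


# Hypothetical two-feature reactions (assumption, not descriptors from the paper).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
test = [(0.0, 1.0), (3.0, 0.0)]

print(mean_kth_nn_within(train, 1))
print(mean_kth_nn_between(train, test, 1))
```

A smaller within-set average indicates a more tightly clustered training set, whereas the between-set average indicates how close the training reactions lie to the test reactions.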

Extended Data Fig. 5 Comparison of the influence of direct and indirect features.

a, Direct influence values of descriptors in the human reaction test set versus the random reaction test set. b, Indirect influence values of descriptors in the human reaction test set versus the random reaction test set.

Extended Data Table 1 Structure inclusion and exclusion criteria

Extended Data Table 2 Matthews correlation coefficient (MCC), accuracy and AUC results for each machine-learning algorithm, trained on either the human-selected or randomly generated reaction data using all features
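For readers unfamiliar with two of the metrics named in this table, the following sketch computes accuracy and the Matthews correlation coefficient from binary labels using their standard definitions. The labels are invented for illustration and the helper names are hypothetical; this is not the evaluation code used in the study, and AUC is not covered here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = success)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that match the observed outcome."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical outcomes and predictions (assumption, not results from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(*counts), mcc(*counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix and therefore remains informative when successful and failed reactions are unevenly represented.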

Extended Data Table 3 Feature selection comparison

Extended Data Table 4 Comparison of discrepancies between model predictions and reaction outcomes

Supplementary information

Supplementary Information

The Supplementary Information document contains two figures and 21 tables, and a manifest describing the content of the electronic supplementary information file below.

Supplementary Data

The Supplementary Information zip file contains all experimental and computational data, and computational codes used for this study. A manifest is contained in the Supplementary Information document file.

About this article

Cite this article

Jia, X., Lynch, A., Huang, Y. et al. Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573, 251–255 (2019). https://doi.org/10.1038/s41586-019-1540-5

Received: 30 December 2018
Accepted: 10 July 2019
Published: 11 September 2019
Issue Date: 12 September 2019
DOI: https://doi.org/10.1038/s41586-019-1540-5
