Like most research fields, materials science has embraced ‘big data’, including machine-learning models and techniques. These are being used to predict new materials and properties, and to devise synthetic routes to existing drugs and chemicals.
But machine learning requires training data, such as those on reagents, conditions and starting materials. These are usually gleaned from the literature, and are human-generated. The choice of reagents that researchers use could come, for example, from experience or from previously published work. It might be based on a recommendation passed from supervisor to graduate student, or simply on how easy reagents are to find or buy. But that subjectivity becomes a potential problem for the accuracy of machine-learning models, as research published this week in Nature shows.
Joshua Schrier at Fordham University in New York City, Alexander Norquist and Sorelle Friedler at Haverford College in Pennsylvania and their colleagues looked at materials called amine-templated vanadium borates. These were chosen because success and failure are easily defined in their synthesis — simply by whether or not crystals form. The researchers compiled a data set of several hundred synthesis conditions that are used to make vanadium borates. They then trained a machine-learning model on this data set to predict the success or failure of reactions. The team found that a model trained on a human-generated data set was less successful in predicting the success or failure of a reaction than one trained on a data set with randomly generated reaction conditions (X. Jia et al. Nature 573, 251–255; 2019).
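The effect the study describes can be illustrated with a toy simulation — this is not the authors’ actual model or data, and the ‘success window’, the clustering of human-chosen conditions and the nearest-neighbour classifier are all invented for illustration. Conditions chosen by habit cluster near a favourite recipe and so carry little information about where reactions fail; uniformly sampled conditions cover the space and let even a simple model learn the success/failure boundary:

```python
import math
import random

def succeeds(temp, conc):
    """Hypothetical ground truth: crystals form only in a narrow window."""
    return 0.3 < temp < 0.7 and 0.3 < conc < 0.7

def nearest_neighbour_predict(train, point):
    """Predict by copying the outcome of the closest training reaction."""
    _, label = min(train, key=lambda item: math.dist(item[0], point))
    return label

def accuracy(train, test_points):
    """Fraction of test conditions whose outcome the model predicts correctly."""
    correct = sum(
        nearest_neighbour_predict(train, p) == succeeds(*p) for p in test_points
    )
    return correct / len(test_points)

rng = random.Random(0)

# 'Human' data set: conditions clustered around one favourite recipe (0.5, 0.5),
# mimicking reagent choices passed from supervisor to student.
human = []
for _ in range(200):
    p = (min(max(rng.gauss(0.5, 0.05), 0.0), 1.0),
         min(max(rng.gauss(0.5, 0.05), 0.0), 1.0))
    human.append((p, succeeds(*p)))

# 'Random' data set: the same number of conditions, drawn uniformly.
randomised = []
for _ in range(200):
    p = (rng.random(), rng.random())
    randomised.append((p, succeeds(*p)))

# Evaluate both on a uniform grid of unseen conditions.
grid = [(i / 20 + 0.025, j / 20 + 0.025) for i in range(20) for j in range(20)]
acc_human = accuracy(human, grid)
acc_random = accuracy(randomised, grid)
print(f"trained on human-chosen conditions: {acc_human:.2f}")
print(f"trained on randomised conditions:   {acc_random:.2f}")
```

The human-biased model sees almost nothing but successes near the favourite recipe, so it wrongly predicts success across most of the space; the randomly sampled data set, containing the same number of experiments, yields a markedly better predictor.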
In some sense, this should be no surprise. It is now well known that when machine-learning techniques are used to pick out patterns in aggregated data, biases in those data can be amplified. For example, facial-recognition algorithms trained mostly on white faces are less able to distinguish between the faces of people of other ethnicities, thereby introducing bias that could lead to entrenched inequality.
Does the existence of bias matter to chemistry and materials science? When the goal of a research project is to find new materials, it could be argued that it’s irrelevant which reagents are used as long as they work.
But there are potential drawbacks to relying on ‘tried and trusted’ methods. A prevalence of favourite protocols — even an unintentional one — in a training data set could hinder the success of machine-learning models that are used to predict new materials, or, as this study reveals, to find more efficient ways to make existing ones.
No one would argue that the consequences of biased chemical data are as serious as those of biases in facial-recognition software, but they share a similar origin. Researchers should be alert to the potential for bias in their chemical data sets, before it gets baked into a machine.