EDITORIAL
11 September 2019

Look out for potential bias in chemical data sets

Materials science has embraced machine learning. As with other disciplines, researchers must be alert to the risks of biased data.

Closeup of a laboratory technician mixing reagents. There might be disadvantages to using tried and trusted methods. Credit: Science Photo Library

Like most research fields, materials science has embraced ‘big data’, including machine-learning models and techniques. These are being used to predict new materials and properties, and devise routes to existing drugs and chemicals.

But machine learning requires training data, such as those on reagents, conditions and starting materials. These are usually gleaned from the literature, and are human-generated. The choice of reagents that researchers use could come, for example, from experience or from previously published work. It might be based on a recommendation passed from supervisor to graduate student, or simply on how easy reagents are to find or buy. But that subjectivity becomes a potential problem for the accuracy of machine-learning models, as research published this week in Nature shows.

Joshua Schrier at Fordham University in New York City, Alexander Norquist and Sorelle Friedler at Haverford College in Pennsylvania and their colleagues looked at materials called amine-templated vanadium borates. These were chosen because success and failure are easily defined in their synthesis: simply by whether or not crystals form. The researchers compiled a data set of several hundred synthetic conditions that are used to make vanadium borates. They then trained a machine-learning model on this data set to predict the success or failure of reactions. The team found that a model trained on a human-generated data set was less accurate at predicting whether a reaction would succeed than one trained on a data set with randomly generated reaction conditions (X. Jia et al. Nature 573, 251–255; 2019).
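The comparison can be pictured with a toy sketch. It is not the method or the data of Jia et al.: the two reaction conditions, the rule deciding whether crystals form, the sample sizes and the nearest-neighbour model are all invented here for illustration. The 'human' data set samples only a narrow band of favourite conditions, whereas the randomized one covers the whole space.

```python
import random

def crystals_form(x, y):
    """Hypothetical ground truth: success only inside a disc of conditions."""
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.16

def sample(n, low, high, rng):
    """Draw n labelled reactions with both conditions uniform in [low, high]."""
    points = [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n)]
    return [(x, y, crystals_form(x, y)) for x, y in points]

def predict(train, x, y):
    """1-nearest-neighbour: copy the outcome of the closest training reaction."""
    nearest = min(train, key=lambda r: (r[0] - x) ** 2 + (r[1] - y) ** 2)
    return nearest[2]

def accuracy(train, test):
    """Fraction of test reactions whose outcome is predicted correctly."""
    hits = sum(predict(train, x, y) == label for x, y, label in test)
    return hits / len(test)

rng = random.Random(0)
human = sample(200, 0.35, 0.65, rng)     # assumed 'favourite' conditions only
randomized = sample(200, 0.0, 1.0, rng)  # conditions spread over the whole space

# Evaluate both models on an even grid covering the whole condition space.
grid = [((i + 0.5) / 25, (j + 0.5) / 25) for i in range(25) for j in range(25)]
test = [(x, y, crystals_form(x, y)) for x, y in grid]

acc_human = accuracy(human, test)
acc_random = accuracy(randomized, test)
print(f"human-selected: {acc_human:.2f}, randomized: {acc_random:.2f}")
```

In this contrived setting every human-selected reaction succeeds, so the model trained on them has never seen a failure and predicts success everywhere. The randomized set includes failures and lets the model find the boundary.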

In some sense, this should be no surprise. It is now well known that when machine-learning techniques are used to pick out patterns in aggregated data, biases in those data can be amplified. For example, facial-recognition algorithms trained mostly on white faces are less able to distinguish between the faces of people of other ethnicities, thereby introducing bias that could lead to entrenched inequality.

Does the existence of bias matter to chemistry and materials science? When the goal of a research project is to find new materials, it could be argued that it’s irrelevant which reagents are used as long as they work.

But there are potential drawbacks to relying on ‘tried and trusted’ methods. A prevalence of favourite protocols in a training data set, even an unintentional one, could hinder the success of machine-learning models that are used to predict new materials or, as this study reveals, more efficient ways to make existing ones.

No one would argue that the consequences of biased chemical data are as serious as those of biases in facial-recognition software, but they share a similar origin. Researchers should be alert to the potential for bias in their chemical data sets, before it gets baked into a machine.

Nature 573, 164 (2019)

doi: https://doi.org/10.1038/d41586-019-02670-w
