Data sets are essential for training and validating machine-learning algorithms. But these data are typically sourced from the Internet, so they encode all the stereotypes, inequalities and power asymmetries that exist in society. These biases are exacerbated by the algorithmic systems that use them, which means that the output of the systems is discriminatory by nature, and will remain problematic and potentially harmful until the data sets are audited and somehow corrected. Although this has long been the case, the first major steps towards overcoming the issue were taken only four years ago, when Joy Buolamwini and Timnit Gebru1 published a report that kick-started sweeping changes in the ethics of artificial intelligence (AI).

As a graduate student in computer science, Buolamwini was frustrated that commercial facial-recognition systems failed to identify her face in photographs and video footage. She hypothesized that this was partly because dark-skinned faces were under-represented in the data sets used to train the computer programs she was studying. This insight led Buolamwini and her collaborator Gebru to undertake a systematic audit of commercial facial-analysis systems, and to demonstrate that such systems perform differently depending on the skin colour and gender of the person in the image. The work became known as the Gender Shades audit.

The authors began by using a skin-type classification system, approved by dermatologists, to assess the composition of two image banks, known as IJB-A and Adience, that were widely used at the time to train facial-recognition software. They found that individuals with light-coloured skin were the subject of 79.6% of the images in IJB-A and of 86.2% of those in Adience. This prompted Buolamwini and Gebru to compile their own set of images — one that offered a broader range of skin tones than did either of the existing options, as well as including similar numbers of men and women (commercial algorithms are typically not capable of dealing with non-binary classifications). To do so, they turned to photographs of politicians from countries with gender parity in their national parliaments. The resulting data set, known as the Pilot Parliaments Benchmark (Fig. 1), contains images of 1,270 individuals from Rwanda, Senegal, South Africa, Iceland, Finland and Sweden.


Figure 1 | A gender-balanced facial image bank with a range of skin tones. On realizing that dark-skinned faces were under-represented in the data sets of images that are used to train facial-recognition software, Buolamwini and Gebru1 compiled their own data set using photographs of politicians from countries with gender parity in their national parliaments. This is a subset of ‘average’ faces made by blending many images from the full data set, which contains photographs of 1,270 individuals from Rwanda, Senegal, South Africa, Iceland, Finland and Sweden. Buolamwini and Gebru used their data set to show that three commercial gender-classification systems misclassified women with darker skin with an error rate that was much higher than that for men with lighter skin. Credit: Dr Joy Buolamwini (CC BY 4.0)
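In practice, this kind of composition audit boils down to tallying how the images in a data set are distributed across skin-tone and gender groups. The minimal Python sketch below illustrates the idea; the file name, column names and Fitzpatrick-style labels are hypothetical stand-ins, not the annotations used in the study.

```python
# Sketch of a data-set composition audit in the spirit of the Gender Shades
# methodology. Assumes a CSV of per-image annotations with hypothetical
# columns 'skin_type' (Fitzpatrick I-VI) and 'gender' ('female'/'male');
# the file name and labels are illustrative only.
import pandas as pd

annotations = pd.read_csv("image_annotations.csv")

# Collapse the six Fitzpatrick types into the binary used in the audit:
# types I-III as 'lighter', types IV-VI as 'darker'.
lighter_types = {"I", "II", "III"}
annotations["skin_group"] = annotations["skin_type"].apply(
    lambda t: "lighter" if t in lighter_types else "darker"
)

# Share of each skin group and of each gender in the data set.
print(annotations["skin_group"].value_counts(normalize=True).round(3))
print(annotations["gender"].value_counts(normalize=True).round(3))

# Joint (intersectional) composition: fraction of images per subgroup.
composition = (
    annotations.groupby(["skin_group", "gender"]).size() / len(annotations)
)
print(composition.round(3))
```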

Buolamwini and Gebru then used their benchmark set to evaluate three commercial gender-classification systems developed by the technology companies Microsoft, Face++ and IBM. Rather than assessing the accuracy of these systems on the basis of gender or of skin type, the authors compared the performance of the classifiers on four intersectional groups that they termed darker female, darker male, lighter female and lighter male. They found that women with darker skin were the most likely to be misclassified, with a maximum classification error rate of 34.7%; by contrast, the maximum error rate for men with lighter skin was 0.8%. All three systems consistently showed poor accuracy for women with darker skin and performed substantially better on men with lighter skin.
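The essence of this intersectional evaluation is to disaggregate the error rate by subgroup rather than report a single overall accuracy. A minimal sketch, using toy labels rather than the study’s data, might look like this:

```python
# Sketch of an intersectional error analysis: misclassification rates are
# computed per subgroup (darker/lighter x female/male) rather than as a
# single aggregate figure. All data here are hypothetical.
from collections import defaultdict

def error_rates_by_group(y_true, y_pred, groups):
    """Return the misclassification rate for each intersectional group."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        if pred != truth:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

# Illustrative toy data, not results from the study.
y_true = ["female", "female", "male", "male", "female", "male"]
y_pred = ["male", "female", "male", "male", "female", "male"]
groups = ["darker female", "lighter female", "darker male",
          "lighter male", "darker female", "lighter male"]

for group, rate in sorted(error_rates_by_group(y_true, y_pred, groups).items()):
    print(f"{group}: {rate:.1%} error")
```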

Impactful research isn’t always understood and acknowledged at first glance, especially when it challenges conventional thinking. At the time of publication, Buolamwini and Gebru’s paper was considered an outlier — not only in the field of computer vision (the study of how computers can be made to automate tasks performed by the human visual system), but also in AI ethics. Since then, a lot has changed, and algorithmic auditing has rapidly become a crucial practice, prompting academic journals and conferences to highlight audit studies.

The downstream effects of the Gender Shades audit can also be seen in how large-scale research data sets are curated. For instance, an initiative reported earlier this year suggests that faces in large image banks, such as the popular ImageNet (go.nature.com/3qukjkn), should be obscured to protect individuals’ privacy2. The study showed that blurring or mosaicking faces in an image had little effect on the accuracy of software designed to recognize other elements of the image. But the authors also noted that this work must be done through crowdsourcing, rather than using commercial software, to avoid the racial bias revealed by the Gender Shades study.
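As a rough illustration of what such obfuscation involves, the sketch below applies a Gaussian blur to a face region whose bounding box is assumed to come from crowdsourced annotation rather than from a commercial face detector; the file paths and coordinates are invented for the example.

```python
# Sketch of face obfuscation for privacy-preserving data-set curation.
# The bounding box is assumed to come from crowdsourced annotation, not
# from a commercial face detector, echoing the point made above about
# avoiding biased detection software. Paths and values are illustrative.
import cv2

def blur_region(image, box, kernel=(51, 51)):
    """Gaussian-blur the region given by box = (x, y, width, height)."""
    x, y, w, h = box
    region = image[y:y + h, x:x + w]
    image[y:y + h, x:x + w] = cv2.GaussianBlur(region, kernel, 0)
    return image

img = cv2.imread("example.jpg")        # illustrative input path
crowdsourced_box = (120, 80, 90, 110)  # (x, y, w, h) supplied by annotators
cv2.imwrite("example_blurred.jpg", blur_region(img, crowdsourced_box))
```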

Although there was resistance to Buolamwini and Gebru’s paper at first, the vendors of the facial-recognition software that they audited eventually responded positively. IBM and Microsoft, for example, pledged to test their facial-recognition algorithms and diversify their training data sets (see, for example, go.nature.com/3rmbo17). Around a year after the paper was published, a follow-up audit found that Microsoft, IBM and Face++ had all succeeded in reducing the performance error of their facial-analysis products3. The most noteworthy improvement was a 30.4% reduction in the error with which the Face++ software classified darker female faces in the Pilot Parliaments Benchmark set, with the Microsoft and IBM algorithms improving by 19.28% and 17.73%, respectively, on this task.
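For readers who want to make this kind of before-and-after comparison in their own audits, the arithmetic is straightforward. The sketch below uses placeholder error rates (not the published figures) and shows two common conventions for expressing improvement: the drop in percentage points and the relative reduction.

```python
# Illustrative arithmetic only: the figures below are placeholders, not the
# published audit results. Both the percentage-point drop and the relative
# reduction are reported, since the two conventions differ.
before = {"darker female": 0.35, "darker male": 0.12,
          "lighter female": 0.07, "lighter male": 0.01}
after = {"darker female": 0.04, "darker male": 0.02,
         "lighter female": 0.02, "lighter male": 0.005}

for group in before:
    point_drop = (before[group] - after[group]) * 100           # percentage points
    relative = (before[group] - after[group]) / before[group]   # share of old error removed
    print(f"{group}: down {point_drop:.1f} points ({relative:.0%} relative reduction)")
```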

But none of these systems has yet overcome racial bias entirely, and many companies have discontinued or temporarily halted facial-recognition technologies. Evidence continues to emerge that AI models mistakenly associate images of Black people with animal classes such as ‘gorilla’ or ‘chimpanzee’ more often than they do for images of people who aren’t Black4.

The study also influenced how facial-analysis technology is regulated. In the United States, the Algorithmic Accountability Act, a bill introduced in 2019, would authorize the Federal Trade Commission (the agency tasked with promoting and enforcing consumer protection) to regulate automated decision systems (go.nature.com/3xguff7). US cities such as San Francisco, California; Boston, Massachusetts; and Portland, Oregon, have banned the use of facial recognition by police, citing biased misidentification that disproportionately affects communities of colour. In Europe, civil-society organizations, activists and technologists have come together to call for a ban on facial-recognition analysis (go.nature.com/3qwzmnq) and on biometric technology in general (go.nature.com/3f7jrka). And the first draft of the European Union’s Artificial Intelligence Act (go.nature.com/3dtgh4x), released in April 2021, indicates that real-time use of facial recognition in public places might be restricted.

Regulations and the risk of liability impel large corporations to change their practices, but even minimal regulations are being undermined (go.nature.com/3yb96kq). Although such deterrents can result in measurably improved outcomes, given the prevalence of facial-analysis technology, the changes that have the most long-term impact are likely to come from shifting public attitudes — something that I think Buolamwini and Gebru’s study has influenced both directly and indirectly. The work even became the subject of the 2020 documentary film Coded Bias (go.nature.com/3fashnf). Unfortunately, the authors (like many other Black female scholars) have also been overlooked by mainstream media: a 2021 television segment on racial bias in facial-analysis technologies, for example, failed to recognize their work and that of their collaborators (go.nature.com/3satrp8).

Over the past few years, the conversation initiated by this work has shifted from a focus on the accuracy and performance of facial-recognition algorithms to larger and more-fundamental questions around surveillance technology. The question of accuracy becomes meaningless when this technology is used to supposedly measure internal states and behaviours from outward appearances. In fact, ‘accurate’ representation boils down to reducing those states and behaviours to outdated social stereotypes5.

Algorithms that claim to detect emotions, predict gender or gauge someone’s trustworthiness have been dubbed ‘AI snake oil’ by some (go.nature.com/3rh7cfp), because such sociocultural attributes cannot reliably be inferred from faces, expressions or gestures6. Others have called for a blanket ban on facial-recognition algorithms, saying that the technology resurrects the pseudosciences of physiognomy and phrenology7.

The ImageNet data set, a large-scale collection of images that is considered the gold standard in computer vision, has had a pivotal role in positioning computer-vision research at the core of the ‘deep-learning revolution’ of the past decade. Facial-recognition technology has subsequently become mainstream and is prevalent in almost all social and public spaces, including concert venues, schools, airports, neighbourhoods and public squares — seriously undermining privacy and enabling worrying surveillance practices. Even if new algorithms are designed on the basis of diverse image sets such as the Pilot Parliaments Benchmark, they are still vulnerable to being used for inherently harmful and oppressive purposes, such as the surveillance of minority communities.

Facial-recognition technology has expanded into other fields of research, such as studies designed to predict facial characteristics from the analysis of DNA8, and others that aim to automate medical diagnoses from images of faces alone9. Given the racial biases inherent in facial-recognition algorithms, these are concerning developments.

Amid what can feel like overwhelming public enthusiasm for new AI technologies, Buolamwini and Gebru instigated a body of critical work that has exposed the bias, discrimination and oppressive nature of facial-analysis algorithms. Their audit was ground-breaking four years ago, and remains an influential reference point to counter the rapid progress of this technology and the threat it poses.