Introduction

Over the past decades, the number of tools based on artificial intelligence (AI) designed to assist human decisions has increased in many professional fields, such as justice1,2, personnel selection3,4 and healthcare5,6,7,8. In the medical realm, the advent of AI-based decision support systems has been welcomed as a promise to minimize errors in human decision-making9,10,11,12,13. Such optimism is founded on the impressive precision that artificial intelligence has reached in some clinical tasks, such as image-based diagnostics, predicting the outcomes of interventions, or recommending treatments13,14,15,16. Thus, AI tools can help health professionals in, for example, diagnosis and triage tasks17 by offering them specific and quite accurate data-driven recommendations18,19. This cooperation between natural and artificial intelligence is hoped to augment clinicians' knowledge and capabilities, to compensate for some of the weaknesses of the human mind, such as cognitive biases20,21,22,23 and vulnerability to fatigue24,25, and thereby to reduce diagnostic errors and improve clinical decisions26. But these AI tools are not free of errors and biases themselves.

Despite their promising benefits, the introduction of AI-based decision support systems in clinical contexts has also raised concerns about the peril of using biased AI to assist medical decisions27,28,29. People tend to perceive artificial intelligence algorithms as objective, secure30 and impartial31, but AI algorithms are a product of human design, so they often inherit our mistakes and biases32,33. Although AI-based tools have proven to be quite accurate, their performance is far from perfect and most of them still need to be validated in real-world settings7. Thus, while AI can help to overcome some of the limitations of human reasoning, new problems may arise in the human-AI interaction34. Biased AI systems could sometimes diminish, rather than augment, the correctness of clinical decisions in this collaborative AI-human decision-making35,36.

Bias in an algorithm is defined as a systematic error in its outputs or processes37,38. One important potential source of bias in AI systems is biases or imbalances in the data used to train the algorithms28,38,39,40. AI algorithms identify patterns and generate predictions based on historical data (i.e. what happened in the past). High-quality data sets used to feed the algorithms are difficult to obtain5, particularly when they pertain to clinical data41. Consequently, if a given data set contains imbalances or biases42, the AI systems trained with these data will learn and potentially reproduce those biases43.

In the healthcare area, the data used to train the algorithms are usually the product of past human decisions, for example, medical decisions about what kind of patient requires a certain test or what kind of patient would benefit more from a given treatment. When the historical record of these choices shows an extended systematic error, such as, for instance, the misdiagnosis of a certain pathology when certain features are present, an AI system trained with such historical data will simply inherit this bias44. It is relevant to note the technical nature of AI bias, as it is, in essence, a mathematical or statistical artefact. In an elementary form, this systematic error could be exemplified by the consistent misidentification of a certain pattern of coloured pixels in an image. However, this bias would have serious implications if such an error in the model were not detected and the AI system were applied to decision-making in high-stakes scenarios, such as AI-based facial recognition systems45 used to identify criminal suspects or AI-based image diagnosis46.

As a consequence, AI bias can also result in discrimination or prejudice towards a person or group of people47, since the patterns embedded in the historical data used to train algorithms often reflect systematic social and economic inequities. When an artificial intelligence system is trained with data that do not represent the diversity of the contexts and population groups to which the AI tool is to be applied42, the model will produce undesirable effects due to the difficulty of generalising its predictions to social groups or environments whose characteristics are underrepresented in those databases45,46,48.

In high-stakes areas such as healthcare, clinicians assume the responsibility and accountability for the decisions of the AI-human team49. Thus, they are expected to supervise the outcomes of their artificial counterparts. Since there is a risk of bias (i.e., systematic errors) in the recommendations of AI-based decision support systems, health professionals should interpret AI advice as just an additional piece of evidence to help them in the decision-making process. Hence, they need to critically supervise and decide whether the AI advice is correct or useful for each decision50,51.

Some evidence has suggested that effective human oversight and control of AI could be possible52,53,54, while another body of research has documented excessive trust in AI recommendations, which calls into question the ability of humans to exercise adequate supervision of algorithmic outcomes55,56,57. The tendency to over-rely on artificial intelligence can lead humans to uncritically adhere to AI recommendations, even incorrect ones50,58,59.

Human over-reliance on AI advice seems to be modulated by the context and the characteristics of the task (e.g., subjective vs objective) in which AI and humans collaborate60,61,62,63,64. In the clinical context, algorithmic advice is usually seen as trustworthy because artificial intelligence is perceived as accurate in objective and analytical tasks55,62,65, and tasks of this kind, such as diagnosis and image classification, are common in the medical domain. Thus, there are reasons to believe that decision-making in a health context could be particularly vulnerable to human over-reliance on AI advice, and therefore humans could tend to accept the recommendations of AI algorithms even when they are noticeably biased or erroneous.

There is some evidence that incorrect AI recommendations can have a detrimental influence on clinicians' decision-making66,67,68,69. As an example, a recent study showed that when prescribing antidepressants in different scenarios, incorrect AI recommendations led to a reduction in the accuracy of clinicians' decisions in comparison to a baseline and a correct-advice condition70. These results suggest that occasional mistakes in AI recommendations can make people err, but our main concern refers to the presence of systematic biases in AI systems and to the potential of humans to perpetuate those biases. Thus, it is necessary to explore how humans react to systematic errors in AI advice, because AI bias may have a more profound impact on human behaviour than occasional and random errors. To our knowledge, very few investigations have addressed the effect of biased artificial intelligence systems on human behaviour54, and even fewer studies have focused specifically on the impact of algorithmic bias on human decisions in a clinical context.

The present research aims to answer two main questions. The first is whether biased recommendations from an AI system can influence human behaviour, specifically decision-making, in a medical context. The recent study by Adam et al.71 represented an interesting starting point for this question, since they directly explored the influence of biased recommendations of an algorithm on the human response to mental health emergencies. These authors reported that the biased recommendations strongly influenced the decisions of experts and non-experts on how to respond to a crisis, while the decisions of participants without the advice of the algorithm were unbiased. In addition, we believe that to fully explore the potential impact of AI biases on human behaviour, it is also necessary to answer a second question: whether, after interacting with a biased AI system, people would reproduce those biases in their own decisions when they move on to a context without their artificial partner. Consider the following scenario: given a group of people accustomed to performing a specific task with the suggestions of a biased AI-based decision support system, is there a risk that the system's biased recommendations will exert a training effect, such that people will reproduce the AI bias when making decisions on their own? Could the system influence the behaviour of these people so that they inherit the AI bias, even in a context without AI recommendations?

The potential inheritance of AI bias and its propagation through human decisions is a phenomenon that remains unexplored. The current research aims to examine how biased AI recommendations can influence people's decision-making in a health-related task and to test whether the impact of the biased advice on human behaviour extends beyond the phase in which the AI recommendations are explicitly present. That is, we will explore whether human decision-makers can inherit the bias of an artificial intelligence system.

Overview of the experiments

In a series of three experiments, we empirically tested whether (a) people follow the biased recommendations offered by an AI system, even if this advice is noticeably erroneous (Experiment 1); (b) people who have performed a task assisted by the biased recommendations will reproduce the same type of errors as the system when they have to perform the same task without assistance, showing an inherited bias (Experiment 2); and (c) performing a task first without assistance will prevent people from following the biased recommendations of an AI and, thus, from committing the same errors, when they later perform the same task assisted by a biased AI system (Experiment 3).

Experiment 1

The first experiment tested the influence of explicitly biased recommendations made by a fictitious AI system on participants’ behaviour using a classification task with a medical-themed story: a simulation of an image-based diagnosis.

Method

Ethics statement

The Ethical Review Board of the University of Deusto reviewed and approved the methodology reported in this article, and the experiments were conducted according to the approved guidelines. Informed consent was obtained from all participants. Due to ethical considerations, as well as to prevent the influence of prior knowledge and beliefs on the experiments' results, the clinical context, the classification task, the artificial intelligence system, its recommendations and its bias, the tissue sample images, the patients and the syndromes used in these experiments were all fictitious.

Participants

A group of 171 Psychology students took part in the experiment, but data from two of them were excluded following the data selection criteria described below. The final sample included 169 students (73.6% female, 20.7% male, 2.4% non-binary, mean age = 18.4, SD = 0.79). Their participation was anonymous, and we did not ask for any personal data other than age and gender. Participants were randomly assigned to one of two groups: AI-assisted (n = 85) or unassisted (n = 84). A post hoc sensitivity analysis showed that, with this sample size, we had a power of 0.80 to detect an effect size of d = 0.38 or higher in a test of the difference between two independent means.
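For transparency, the sketch below illustrates how a comparable sensitivity analysis can be reproduced in Python with statsmodels. The one-tailed alternative and the alpha level of 0.05 are assumptions made only for this illustration; they approximately recover the reported minimum detectable effect size.

```python
# Minimal sensitivity-analysis sketch (illustrative; assumes a one-tailed
# independent-samples t-test at alpha = 0.05, which approximately
# reproduces the reported minimum detectable effect size of d ~ 0.38).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
min_d = analysis.solve_power(
    effect_size=None,   # value to solve for
    nobs1=85,           # AI-assisted group
    ratio=84 / 85,      # unassisted group size relative to nobs1
    alpha=0.05,
    power=0.80,
    alternative="larger",
)
print(f"Minimum detectable effect size: d = {min_d:.2f}")
```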

Materials and procedure

In the three experiments we used the same experimental task: a classification task framed in a fictitious health context. With this procedure we tried to simulate a clinical decision-making process that was assisted by a biased AI system for some participants, while others performed the task without assistance.

The experimental task was constructed through Qualtrics, a platform that also facilitated the data collection. We wanted to simulate a clinical decision-making process with a medical diagnosis task that was simple to learn, but also challenging enough to require sustained attention. With this aim, we used the stimuli created by Blanco et al.72 in their study on causal illusion. Each stimulus consisted of a matrix of 50 × 50 pixels representing a human tissue sample obtained from a given patient. Each tissue sample contained 2500 cells of two colours (dark pink and light yellow), randomly distributed in the matrix space so that no two samples were identical. The proportion of dark and light cells in each tissue sample was variable, so we created a large set of different stimuli with varying levels of discrimination difficulty. The different proportions of dark and light cells for the different stimuli were 80/20, 70/30, 60/40, 40/60, 30/70, and 20/80.
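As an illustration of the stimulus structure (the original images were those created by Blanco et al.72; the sketch below is not their generation code), a tissue sample can be reproduced as a 50 × 50 grid with a given proportion of randomly placed dark cells:

```python
# Illustrative sketch of a tissue-sample stimulus: a 50 x 50 grid of
# dark/light cells in a given proportion, randomly arranged so that
# no two generated samples are identical. (Colours are approximate.)
import numpy as np
import matplotlib.pyplot as plt

def make_tissue_sample(dark_proportion, size=50, seed=None):
    """Return a size x size array where 1 = dark cell, 0 = light cell."""
    rng = np.random.default_rng(seed)
    n_cells = size * size                      # 2500 cells for a 50 x 50 grid
    n_dark = round(n_cells * dark_proportion)  # e.g. 1000 dark cells for 40/60
    cells = np.zeros(n_cells, dtype=int)
    cells[:n_dark] = 1
    rng.shuffle(cells)                         # random spatial arrangement
    return cells.reshape(size, size)

# Example: the critical 40/60 stimulus (40% dark, 60% light cells).
sample = make_tissue_sample(0.40, seed=1)
plt.imshow(sample, cmap="PuRd")
plt.axis("off")
plt.show()
```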

In the classification task, participants were instructed to observe a series of tissue samples and to decide, for each sample, whether or not it was affected by a fictitious disease called Lindsay Syndrome. Each tissue sample had cells of two colours, but one of them was present in a greater proportion, and volunteers were instructed to follow this criterion to identify the presence of the syndrome. A greater proportion of dark than light cells was described in the instructions as “Positive”, that is, affected by the Lindsay Syndrome. If the tissue sample had a greater proportion of light than dark cells, then it should be classified as “Negative”, because it was not affected by the syndrome. It is important to note that the assignment of dark and light colours to the positive/negative categories was randomly decided for each participant. For simplicity, in this section we describe only Instructional Assignment A, but the reader should bear in mind that half of the participants received the opposite assignment.

To ensure that all participants correctly understood the instructions, the experiment began with a practice phase in which participants categorised six tissue samples with different dark/light colour proportions. Each sample was presented twice, so that the practice phase consisted of two blocks of six stimuli, presented one per trial, that is, 12 trials in total. In the first block, the six samples were presented in order of difficulty. In the second block, the same six samples were presented in random order. If participants did not get five correct classifications out of six trials in the second block, they had to repeat the whole practice phase.

Once the volunteers finished the practice phase, they were randomly distributed into two groups, AI-assisted and unassisted. The design of the three experiments is summarized in Table 1.

Table 1 Design summary of the three experiments. n is the number of final participants in each group. The two columns labelled Trials indicate the number of tissue sample images used in Phase 1 and Phase 2 of the classification task. As participants viewed one image per trial, i.e., each tissue sample was shown on a separate screen, the total number of images equals the total number of trials for each phase.

The aim of the participants was to classify a series of tissue samples following the criterion they had just learned in the practice phase. Phase 1 of this experiment comprised 60 tissue samples (i.e., trials). In the AI-assisted group, the simulated tissue sample and the AI recommendation were presented simultaneously in each trial. The recommendation took the form of an orange label with the text “POSITIVE +” or a blue label with the text “NEGATIVE −”, placed above the tissue sample. In the unassisted group, only the tissue sample was presented in each trial. Both groups viewed the same stimuli; the only difference between them was the presence or absence of the advice of the fictitious AI. The sequence of trials included ten stimuli (i.e., tissue samples) of each of the dark/light cell proportions, that is, 80/20, 70/30, 60/40, 40/60, 30/70 and 20/80, resulting in a total of 60 stimuli. Figure 1 shows some examples of trials in both the unassisted and the AI-assisted groups. The simulated AI of our experiment made correct recommendations on 50 tissue samples out of 60. Thus, our hypothetical model correctly classified approximately 80% of the tissue samples during Phase 1. However, this simulated AI model showed a bias, or systematic error: the recommendations for the ten stimuli with the dark/light cell ratio of 40/60 were always wrong. In these samples, there was a contradiction between the evidence (i.e., the number of dark/light cells in the tissue) and the recommendation given by the AI (i.e., the blue or orange labels with the negative or positive suggestions). For example, following the instructions, the 40/60 tissue samples had a greater number of light cells, so they should have been classified as negative; however, the recommendation given by the AI for the 40/60 samples was positive.

Figure 1

Screenshots showing examples of trials in the unassisted and AI-assisted groups. (a) depicts a 40/60 unassisted trial, (b) exemplifies an AI-assisted 30/70 trial where the AI made a correct recommendation (following the Instructional Assignment A, a greater proportion of light-coloured cells should be classified as a negative case). At the bottom of the image, (c) illustrates a biased AI recommendation for a 40/60 trial that should be negative following Instructional Assignment A, but the AI labelled it as positive. Stimuli were adapted from “Are the symptoms really remitting? How the subjective interpretation of outcomes can produce an illusion of causality” by Blanco, F., Moreno-Fernández, M. M., and Matute, H. Judgment and Decision Making, 15, pp. 575 (2020). CC BY 3.0.

The order of the stimuli was randomly assigned across the sequence of 60 trials, with one exception: the 40/60 stimuli, for which the AI recommendation was always erroneous, did not appear during the first 10 trials of the sequence. This manipulation was intended to maintain a certain level of trust in the AI model before the problematic stimuli showed up; we did not want the AI bias to be evident from the very beginning of the task. The ten 40/60 stimuli did appear in the sequence of the next 50 trials, randomly intermixed with the stimuli of the other proportions.
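To make the structure of Phase 1 concrete, the following sketch, written only for illustration and not taken from the Qualtrics implementation, builds a 60-trial sequence with ten trials per proportion, simulated advice that is correct everywhere except on the 40/60 samples, and a shuffle that keeps the biased 40/60 trials out of the first ten positions:

```python
# Illustrative sketch of the Phase 1 trial schedule described above:
# 10 trials per dark/light proportion, AI advice correct except on the
# 40/60 samples (always wrong), and no 40/60 trial in the first 10 positions.
import random

PROPORTIONS = [0.80, 0.70, 0.60, 0.40, 0.30, 0.20]  # proportion of dark cells
BIASED = 0.40                                       # 40/60 samples get wrong advice

def correct_label(dark_prop):
    # Instructional Assignment A: more dark than light cells -> "POSITIVE".
    return "POSITIVE" if dark_prop > 0.5 else "NEGATIVE"

def ai_recommendation(dark_prop):
    truth = correct_label(dark_prop)
    if dark_prop == BIASED:                 # systematic bias: flip the answer
        return "NEGATIVE" if truth == "POSITIVE" else "POSITIVE"
    return truth                            # otherwise the advice is correct

def build_phase1_sequence(seed=0):
    rng = random.Random(seed)
    trials = [{"proportion": p, "advice": ai_recommendation(p)}
              for p in PROPORTIONS for _ in range(10)]          # 60 trials
    while True:                             # reshuffle until the constraint holds
        rng.shuffle(trials)
        if all(t["proportion"] != BIASED for t in trials[:10]):
            return trials

sequence = build_phase1_sequence()
wrong = sum(t["advice"] != correct_label(t["proportion"]) for t in sequence)
print(f"{len(sequence)} trials, {wrong} biased recommendations")  # 60 trials, 10 biased
```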

After the classification task was completed, we asked participants, as a manipulation check, to indicate whether the AI had offered them any recommendations during the task. Next, volunteers answered a series of questions about their own performance and about how they had perceived the AI's performance during the classification task. Participants indicated on a Likert scale from 1 (not at all) to 9 (completely) their level of trust in the artificial intelligence used in this experiment and their trust in artificial intelligence applied to healthcare in general.

The main dependent variable in our experiment was the number of misclassifications of the ten 40/60 tissue samples. Although the 40/60 stimuli were more difficult to classify than the others, the discrimination was still clear, and it was possible to detect the AI's mistakes easily. Thus, we expected the unassisted group to classify them correctly. However, since the AI showed rather high reliability, we expected volunteers in the group assisted by the biased AI to get used to performing the task following the AI recommendations and without careful examination of the tissue sample. As a result, the AI-assisted group would misclassify the 40/60 stimuli, for which the recommendations were erroneous, more often than the unassisted group.

Data selection criteria

If participants failed to reach the threshold of five correct classifications out of six trials in the second attempt at the practice phase, their data were excluded from the analyses. In addition, data from participants who misclassified tissue samples on more than half of the trials of the first phase of the classification task (i.e., more than 30 trials out of 60) were excluded from the analyses, as they were, presumably, paying little attention.

Results and discussion

As expected, participants who performed the task assisted by a biased AI made more errors than unassisted participants (see Fig. 2). The mean number of misclassifications of the 40/60 trials was 2.21 (SD = 3.17) in the AI-assisted group and 0.69 (SD = 1.83) in the unassisted group. This difference was significant, t(167) = 3.81, p < 0.001, d = 0.586. These results show that the explicit recommendations of a biased AI influenced participants' behaviour and increased the number of errors in a health-framed task.
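The group comparison reported above is a standard independent-samples t-test complemented with Cohen's d. The following sketch shows the form of this analysis on hypothetical placeholder data (the arrays below are illustrative, not the real error counts):

```python
# Illustrative analysis sketch: independent-samples t-test and Cohen's d
# on the per-participant count of 40/60 misclassifications.
# The two arrays below are hypothetical placeholders, not the real data.
import numpy as np
from scipy import stats

errors_ai = np.array([0, 1, 2, 5, 10, 0, 3, 7, 0, 1])         # AI-assisted group
errors_unassisted = np.array([0, 0, 1, 0, 2, 0, 0, 1, 0, 0])  # unassisted group

t, p = stats.ttest_ind(errors_ai, errors_unassisted)

# Cohen's d with a pooled standard deviation.
n1, n2 = len(errors_ai), len(errors_unassisted)
pooled_sd = np.sqrt(((n1 - 1) * errors_ai.var(ddof=1) +
                     (n2 - 1) * errors_unassisted.var(ddof=1)) / (n1 + n2 - 2))
d = (errors_ai.mean() - errors_unassisted.mean()) / pooled_sd
print(f"t({n1 + n2 - 2}) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```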

Figure 2

Mean number of misclassifications of the ten 40/60 tissue samples in the AI-assisted group (n = 85) and the unassisted group (n = 84) in Experiment 1. Error bars represent 95% CI.

On average, participants thought they had performed the task fairly well and showed moderate trust in AI in healthcare. In the AI-assisted group we found a positive correlation between participants' incorrect classifications of the 40/60 samples and how helpful they considered the AI of the experiment to be, r = 0.544, p < 0.001, the accuracy they attributed to the AI recommendations during the experiment, r = 0.563, p < 0.001, and their confidence in AI in the health domain in general, r = 0.470, p < 0.001.

These results evidenced human compliance with the AI recommendations during a decision-making process. A large proportion of participants in the AI-assisted group reproduced the AI bias in their own responses to the medical task.

Experiment 2

The purpose of Experiment 2 was (a) to replicate the observation of Experiment 1 that a biased AI can influence human decision-making, (b) to test whether such influence persists when the AI is no longer present, and (c) to test whether this influence generalizes to novel stimuli as well. Additionally, we now also measured and analysed changes in the participants' behaviour over the sequence of trials during the classification task.

Method

Participants

The final sample included 199 participants (65.3% female, 30.7% male, 4% non-binary, mean age = 26.2, SD = 6.95). We initially recruited 200 participants through Prolific Academic, but data from one participant were excluded following the data selection criteria described in Experiment 1. Participation in the experiment was offered only to those applicants in Prolific Academic's pool who spoke English fluently and had not participated in previous experiments from our research team. Volunteers were randomly distributed between two groups, AI-assisted (n = 100) and unassisted (n = 99). A sensitivity analysis showed that, with this sample size, we had a power of 0.80 to detect a small-sized effect (f = 0.09) for a repeated measures ANOVA with a within-between interaction.

Materials and procedure

The procedure was similar to that of Experiment 1. The practice phase and the 60-trial classification phase were identical to those of the previous experiment. The only difference during the 60-trial phase was that in Experiment 2 the order of appearance of each stimulus was randomly assigned to a fixed position in the 60-trial sequence of the task. This fixed randomised sequence ensured that all participants viewed each stimulus in the same position, which facilitated the measurement and comparison of changes in the behaviour of both groups, with particular attention to participants' responses in those trials where the AI recommendation was erroneous.

The main novelty of Experiment 2, with respect to Experiment 1, was the addition of a second phase of 25 trials in which both groups had to classify the tissue samples without assistance (see Table 1). That is, participants in the unassisted group continued performing the task as in the previous phase, while AI-assisted participants switched to performing the task without assistance. An additional feature of this second phase was the introduction of five trials with novel and ambiguous stimuli with a dark/light cell ratio of 50/50, so that it was not possible to assign these stimuli to either of the two categories, positive or negative, based on the instructions received or on specific recommendations made by the AI during the previous phase. For this second phase, we expected that participants in the AI-assisted group would tend to classify both the 40/60 and the 50/50 samples in the same category that the AI's biased recommendations had suggested for the 40/60 stimuli in the previous phase. By contrast, we expected participants in the unassisted group to classify the 40/60 stimuli correctly and to classify the 50/50 ambiguous stimuli randomly. We would interpret these results as an inheritance and a generalization of the AI bias. Thus, the purpose of this experiment was not only to reproduce but also to extend the results of Experiment 1, showing that biased AI recommendations may influence participants' behaviour even when the AI is no longer present and even in novel conditions.

Also, in this phase, stimuli with the dark/light cell ratios 80/20 and 20/80 were not included in the classification task, because we considered them too easy and we were mainly interested in participants' classification of the 40/60 and 50/50 stimuli. Thus, the second phase of the task consisted of five stimuli of each of the 70/30, 60/40, 50/50, 40/60, and 30/70 proportions, resulting in 25 trials. The order of appearance of each trial was randomly assigned to a fixed position in the 25-trial sequence of the task. We measured an additional dependent variable in this phase: the number of times each participant classified the five 50/50 stimuli in the same direction as the bias of the AI recommendations. We named this variable biased classifications.

In addition, in Experiment 2 we intended to measure and analyse within-subject changes in participants' behaviour throughout Phase 1 and Phase 2 of the classification task. In order to analyse those changes in participants' classification errors during Phase 1, we divided the ten trials with the 40/60 tissue samples into two blocks of five trials each. Thus, the first five 40/60 trials of Phase 1 constituted Block 1, and the second five 40/60 trials formed Block 2. These smaller units allowed us to observe the trends in participants' misclassification errors. Furthermore, as Phase 2 comprised five additional 40/60 trials, these five trials formed Block 3 for the purpose of the analyses, so the division of Phase 1 into two blocks also helped to create three equivalent units with which to compare the evolution of the groups' responses during the experiment. At the end of the classification task, participants answered the same post-experimental questions as in Experiment 1. In Experiment 2 we added two questions asking volunteers directly whether they based their answers on the AI recommendations and whether they detected any errors in the AI recommendations. Experiment 2 was preregistered at https://aspredicted.org/3a9wj.pdf.

Results and discussion

As expected, the AI-assisted group made more errors than the unassisted group in the classification of the 40/60 stimuli in both phases of the classification task. Importantly, even in Phase 2, which both groups performed without the support of AI recommendations, the AI-assisted group committed significantly more misclassifications than the unassisted group (see Fig. 3). These impressions were confirmed by a 3 (Block) × 2 (Group) mixed ANOVA that showed a main effect of Group, F(1, 197) = 41.3, p < 0.001, ηp² = 0.173, a main effect of Block, F(2, 394) = 25.3, p < 0.001, ηp² = 0.114, and a Group × Block interaction, F(2, 394) = 17.4, p < 0.001, ηp² = 0.081.
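An analysis of this kind can be implemented, for example, with the pingouin package in Python. The sketch below assumes a long-format data file and hypothetical column names, and is intended only to illustrate the structure of the analysis, not to reproduce the reported values; pingouin's pairwise tests offer corrections such as Holm or Bonferroni, used here only as a stand-in for the Tukey correction reported below.

```python
# Illustrative sketch of the 3 (Block) x 2 (Group) mixed ANOVA, assuming a
# long-format DataFrame with hypothetical column names:
#   participant | group (AI-assisted / unassisted) | block (1-3) | errors
import pandas as pd
import pingouin as pg

df = pd.read_csv("experiment2_long.csv")  # hypothetical file name

aov = pg.mixed_anova(data=df, dv="errors", within="block",
                     subject="participant", between="group")
print(aov[["Source", "F", "p-unc", "np2"]])

# Follow-up pairwise comparisons (Holm adjustment shown as a stand-in;
# the Tukey-corrected values in the text come from the original analysis).
posthoc = pg.pairwise_tests(data=df, dv="errors", within="block",
                            subject="participant", between="group",
                            padjust="holm")
print(posthoc)
```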

Figure 3

Mean number of misclassifications on the three blocks of 40/60 trials in the AI-assisted group (n = 100) and the unassisted group (n = 99) along the two phases of Experiment 2. Blocks 1, 2 and 3 comprise five 40/60 trials each. Error bars represent 95% CI.

Within-group post-hoc comparisons, with Tukey correction, showed an increase in the 40/60 misclassifications in the AI-assisted group between Block 1 and Block 2, t(197) = −4.657, ptukey < 0.001, as well as a decrease between Block 2 and Block 3, in which this group no longer had the AI recommendations, t(197) = 8.87, ptukey < 0.001. In this last phase, although neither of the two groups was assisted by the biased AI, there was still a significant difference between them, t(197) = 3.66, ptukey = 0.004. That is, the AI-assisted group did not reduce the number of errors enough to reach the performance of the unassisted group.

One of the main novelties of the second phase of Experiment 2 was the addition of five tissue samples with a 50/50 dark/light cell proportion. Since neither of the groups received instructions or AI recommendations on how to classify these stimuli, we expected the AI-assisted group to classify the ambiguous novel 50/50 stimuli in the same direction as the AI bias from the previous phase. An independent samples t-test showed that the AI-assisted group made more biased classifications of the 50/50 samples, M = 2.91, SD = 2.13, than the unassisted group, M = 2.15, SD = 2.02, with a small but significant difference between both groups, t(197) = 2.58, p = 0.011, d = 0.366. This suggests that the inherited bias is not constrained to the stimuli on which the AI made errors, but can also generalize to novel stimuli that had not been seen before and had not received any previous AI recommendation.

Regarding the post-experimental questions, 58% of participants from the AI-assisted group detected errors in the AI advice. Moreover, similar to Experiment 1, participants who made more errors in the classification task also found the AI of our experiment more helpful, r = 0.638, p < 0.001, perceived it to be more accurate, r = 0.612, p < 0.001, and trusted more, in general, in the usefulness of AI in the health context, r = 0.492, p < 0.001.

Experiment 3

Experiment 3 tried to replicate the bias inheritance effect observed in Experiment 2 for the AI-assisted group in the 40/60 and 50/50 stimuli during the unassisted phase. In addition, we sought to extend the results of the previous experiments by analysing the effect of the order in which the AI-assisted phase takes place. We hypothesised that performing the classification task without assistance first could have a protective effect against the biased recommendations when participants switched to performing the task assisted by the misleading AI.

Method

Participants

A group of 197 participants (49.2% male, 47.7% female, 3% non-binary, mean age = 27.1, SD = 8.28) from Prolific Academic took part in the experiment (initially we recruited 200 participants, but data from three of them were excluded following the data selection criteria described in Experiment 1). Participation in the experiment was offered only to those applicants in Prolific Academic's pool who spoke English fluently and had not taken part in previous studies carried out by our research team. Volunteers were randomly assigned to groups AI-assisted → unassisted (n = 98) and unassisted → AI-assisted (n = 99). A sensitivity analysis showed that, with this sample size, we had a power of 0.80 to detect a small-sized effect (f = 0.08) for a repeated measures ANOVA with a within-between interaction.

Materials and procedure

The main difference between the design of Experiment 3 and that of the previous experiments was that all participants went through both experimental conditions, that is, assisted by the biased AI recommendations and unassisted, but in a different order (see Table 1). Thus, the procedure of Experiment 3 was similar to that of the previous experiments, but with some modifications to adapt it to the design of the present experiment. Specifically, a change in the number of trials in Phase 1 and Phase 2, and some slight adjustments in the task instructions, were necessary.

In Experiment 3, each of the two phases of the task had 40 trials. Table 2 depicts the number of trials of each type of stimulus in the AI-assisted and the unassisted phases of this experiment. The ten 40/60 stimuli appeared intermixed with stimuli of other proportions, as in the previous experiments. The order of trials within each phase was random and identical for both groups, but in neither phase did any 40/60 stimulus appear among the first five trials of the sequence. The unassisted phase of the task was also characterised by the inclusion of five ambiguous stimuli (see Table 2).

Table 2 Number of trials for each type of tissue sample with different dark/light cell proportions in the AI-assisted and unassisted phases of Experiment 3.

Given that all participants were assisted by the AI in one phase of this experiment, but were unassisted in another phase, the task instructions and screens explicitly informed participants when the AI system was connected and gave them recommendations, and when the AI was off, so that they had to perform the task without assistance. In Experiment 3 two images were included to emphasise this information.

Similar to Experiment 2, in Experiment 3 the ten 40/60 tissue samples from each phase were divided into two blocks to analyse within-subject changes in misclassification errors. Hence, the ten 40/60 trials from Phase 1 were divided into two blocks of five trials each, Blocks 1 and 2. The same procedure was followed for the ten 40/60 trials from Phase 2, which were also divided into two blocks of five trials, Blocks 3 and 4. In total, four blocks of five 40/60 trials were created for analysis purposes.

The post-experimental questions in Experiment 3 were the same as in Experiment 2, but we modified the question about whether the AI had been on, specifying that we were referring to whether it had been connected at any point throughout the experiment. We thought that this specification was necessary because the AI-assisted → unassisted group might have interpreted that this question referred only to the last phase. Experiment 3 was preregistered at https://aspredicted.org/hq54a.pdf.

Results and discussion

The group that completed the first phase (i.e., Blocks 1 and 2) assisted by the biased AI made more errors in the 40/60 trials than the unassisted group. In Experiment 3, as well as in Experiment 1 and Experiment 2, the participants assisted by the AI made errors specifically on 40/60 trials, while participants' misclassification rates for other stimulus classes were close to zero, as can be seen in Table 3. This confirms that the classification task was easy and that the biased AI recommendations were the factor that misled participants.

Table 3 Mean misclassification rates (and standard deviations) for all stimuli of different proportions of dark/light cells used in the classification task in the three experiments. The misclassification rate per stimulus class is the total number of classification errors made on stimuli of a given class divided by the total number of such stimuli in each phase of the experiment. For Experiment 3, the group column shows which phase of the task each group completed first, AI-assisted refers to the AI-assisted → unassisted group, and Unassisted is the Unassisted → AI-assisted group.

Interestingly, when there was a change in task conditions between Blocks 2 and 3, we observed an increase in 40/60 misclassifications for the unassisted → AI-assisted group, but did not observe the opposite trajectory for the AI-assisted → unassisted group. When this group transitioned from the AI-assisted to the unassisted phase, its mean number of errors in 40/60 trials was not reduced. During the second phase (i.e., Blocks 3 and 4) both groups committed a similar number of errors, although in this second phase the unassisted → AI-assisted group received biased recommendations from the AI and the AI-assisted → unassisted group did not. Thus, the group of participants who completed Blocks 1 and 2 assisted by the AI recommendations exhibited the same errors as the AI system when, during the second phase (Blocks 3 and 4), they had to perform the task without guidance (see Fig. 4).

Figure 4

Mean number of misclassifications on the four blocks of 40/60 trials in the AI-assisted → unassisted group (n = 98) and the unassisted → AI-assisted group (n = 99) along the two phases of Experiment 3. Blocks 1, 2, 3 and 4 comprise five 40/60 trials each. Error bars represent 95% CI.

These impressions were confirmed by a 4 (Block) × 2 (Group) mixed ANOVA that showed a main effect of Group, F(1, 195) = 7.59, p = 0.006, ηp² = 0.037, a main effect of Block, F(3, 585) = 26.0, p < 0.001, ηp² = 0.118, and a Group × Block interaction, F(3, 585) = 43.8, p < 0.001, ηp² = 0.183. For the unassisted → AI-assisted group, post-hoc comparisons (Tukey correction) confirmed an increase in 40/60 errors between Block 2 and Block 3, t(195) = 8.75, p < 0.001, when this group switched to performing the task with AI assistance. Conversely, for the AI-assisted → unassisted group an increase in 40/60 errors was detected between Block 1 and Block 2, t(195) = −4.05, p = 0.002, during the AI-assisted phase of the task, but no differences were observed between the other blocks. Thus, there was no significant decrease in the number of 40/60 misclassifications for this group when they moved from Block 2 to Block 3, that is, when they switched to the phase without AI recommendations, t(195) = 2.86, p = 0.085. This suggests that the responses of the AI-assisted → unassisted group reproduced the systematic errors of the AI recommendations during the unassisted phase, a result that supports the bias inheritance effect.

Similarly, the mean number of biased classifications of the 50/50 ambiguous stimuli during the unassisted phase was 3.72 (SD = 2.02) in the AI-assisted → unassisted group and 2.65 (SD = 1.88) in the unassisted → AI-assisted group. This difference was significant, t(195) = −4.28, p < 0.001, d = −0.609. Both results replicate those observed in Experiment 2 and add support to the human inheritance of the AI bias effect.

It is important to note that in this experiment 80.7% of participants detected mistakes in the AI recommendations. Although participants found that the AI was not entirely accurate, many of them still relied on the AI recommendations to perform the task. In contrast to previous experiments, in this experiment we did not observe a positive correlation between the number of 40/60 misclassifications and how much participants considered AI to be helpful, r = 0.051, p = 0.479, accurate, r = 0.055, p = 0.442, and reliable in the health domain, r = 0.083, p = 0.246. Thus, in this experiment, participants' responses seemed to be less influenced by their prior trust in AI.

General discussion

Our results show that biased recommendations made by AI systems can adversely impact human decisions in professional fields such as healthcare. Moreover, they also show that such biased recommendations can influence human behaviour in the long term. Humans reproduce the same biases displayed by the AI, even some time after the end of their collaboration with the biased system, and in response to novel stimuli. Although this bias inheritance effect could have serious implications, to the best of our knowledge it had not yet been explored in any empirical research.

To empirically test the potential human inheritance of AI bias, we conducted three experiments. As hypothesized, the AI-assisted participants made more errors than the unassisted participants in the classification task, both during the AI-aided and non-aided phases, a result consistently observed across the three experiments. These results are robust: we obtained the same effect in a laboratory experiment conducted under controlled conditions with a sample of university students (Experiment 1) and in two online experiments with an international sample recruited via Prolific Academic (Experiments 2 and 3).

Importantly, Experiments 2 and 3 show that participants from the AI-assisted group still misclassified, specifically, the 40/60 stimuli in the non-aided phase of the experiment, meaning that their mistakes when performing the classification task by themselves mimicked the bias previously shown by the AI during the first phase. Moreover, we also observed a tendency in the AI-assisted group to categorize the new 50/50 stimuli during the unassisted phase in the same direction in which the AI had misclassified the 40/60 stimuli during the previous phase. We interpret this as participants using the AI's biased recommendations as a cue to categorise ambiguous stimuli, which could not be classified according to the criteria specified in the instructions. This suggests that participants not only inherited the AI bias, but also generalized it to novel ambiguous stimuli.

In sum, we observed an inheritance effect of the AI's biased recommendations from the first phase of the experiment on the participants' responses during the second phase, where they were no longer supported by the AI. We believe that this is the most interesting and novel result of our present work because it shows an influence of AI bias on human decisions that extends beyond the stage and stimuli in which the AI recommendations are explicitly present. Thus, AI biases could have the potential to propagate through human decisions. To our knowledge, this bias inheritance effect has not been previously revealed in any empirical research.

The presence of erroneous AI recommendations for the 40/60 samples resulted in the AI-assisted group misclassifying significantly more 40/60 samples than the unassisted group. In Experiments 2 and 3, we also analysed changes in participants' behaviour over the sequence of trials and observed an increase in the number of 40/60 misclassifications throughout the AI-assisted phase. This result suggests that participants' monitoring of the information decreased, and their tendency to rely on the AI recommendations to make their judgments increased, as their experience with the AI grew. This could be due to fatigue, but also to habituation and increased trust in the AI recommendations.

The “Negative” or “Positive” recommendation labels, in blue and orange colours, were probably more salient to participants' attention than the tissue samples that they had to classify, which were, in comparison, more complex. Thus, the tissue samples required effortful observation and processing to make a correct judgment about the presence (i.e., positive) or absence (i.e., negative) of the syndrome. The recommendation labels provided by the AI could have interfered with the participants' assessment of the objective information in the tissue samples73. Volunteers may also have been reluctant to engage in a deep assessment of the reliability of each AI recommendation74 and, instead, probably developed general rules about whether or not to follow the AI suggestions, which diminished their ability to detect and correct the biased advice75.

During the unassisted phase of the task, we detected a reduction in the errors made by the AI-assisted group, but they still misclassified the 40/60 stimuli more than the unassisted group. This effect could have two possible interpretations: (a) a difficulty for participants in the AI-assisted group to regain control over the task because they had to change to slower and more conscious processing76 when they no longer had AI support77 or (b) a training effect of the recommendations made by the AI on our participants' responses so that they learnt from the AI bias.

Regarding our first interpretation, if participants trusted the AI recommendations to the detriment of analytic or effortful processing34,74 during Phase 1, then when they moved to Phase 2 and needed to take complete control of the task, they would have needed some time to adjust to the new task demands and redirect attentional resources. It seems that the transition from automated to controlled performance was slow, and participants were still heavily influenced by the AI bias. The errors derived from this slow transition had no serious consequences in our experiments, but there are professional areas, such as healthcare, where the consequences of decisions taken under the influence of an inherited bias could be fatal.

Concerning the second interpretation stated above, in Experiment 3, we did not observe a decrease in the errors made by the AI-assisted group in the transition from the AI-assisted to the unassisted phase. Maybe AI recommendations modelled participants' behaviours so that they learnt a new classification criterion based on the biased AI output. This could explain why the mean number of 40/60 misclassifications remained high in the context without the AI for the participants who had previously been supported by the biased AI in Experiment 3. If errors in the unassisted phase of the task were only due to difficulty in regaining control over the information processing, these errors should have been progressively reduced throughout the successive trials of this phase. Moreover, the fact that the inherited bias generalized to the ambiguous 50/50 stimuli also points to a training effect of the biased AI model on human participants.

The results of our three experiments support a human over-reliance on the recommendations of AI systems70,71, and add the new finding of the inherited bias effect. Human trust in automation influences the tendency to over-accept algorithmic outcomes58,59 even when they are noticeably wrong. This means that humans are willing to rely on AI not only because they are “cognitive misers” who take mental shortcuts when making decisions, but also because they perceive artificial intelligence to be trustworthy65,78. It has been suggested that trust could induce compliance with AI advice due to an authority effect79. In Experiments 1 and 2, we found that participants who perceived the AI of the experiment as more helpful and accurate, and who trusted more in the usefulness of artificial intelligence in healthcare in general, were those who followed the AI recommendations more often and committed more errors in the classification task, as revealed by the positive and significant correlations between the mean number of 40/60 misclassifications in the task and the participants' answers to the post-experimental questions.

As a limitation of our present work, it could be argued that our experimental task was a simulation of clinical decision-making with a fictitious diagnostic process and a fictitious artificial intelligence system. Although our experiments simplify a potential real-world setting, we believe that our controlled experimental task can help to analyse which basic psychological processes mediate human-AI collaboration. We created a classification task with a low level of uncertainty, in which participants were provided with a clear classification criterion, in which the error in the recommendations was systematic, controlled, and noticeable (as the performance of the unassisted participants showed), and in which prior expertise had no influence. These characteristics make our experimental task a fruitful method to study human reliance on AI algorithms and the potential inheritance of their biases.