Mitigating belief projection in explainable artificial intelligence via Bayesian teaching

State-of-the-art deep-learning systems use decision rules that are challenging for humans to model. Explainable AI (XAI) attempts to improve human understanding but rarely accounts for how people typically reason about unfamiliar agents. We propose explicitly modelling the human explainee via Bayesian teaching, which evaluates explanations by how much they shift explainees’ inferences toward a desired goal. We assess Bayesian teaching in a binary image classification task across a variety of contexts. Absent intervention, participants predict that the AI’s classifications will match their own, but explanations generated by Bayesian teaching improve their ability to predict the AI’s judgements by moving them away from this prior belief. Bayesian teaching further allows each case to be broken down into sub-examples (here saliency maps). These sub-examples complement whole examples by improving error detection for familiar categories, whereas whole examples help predict correct AI judgements of unfamiliar cases.

www.nature.com/scientificreports/

Bayesian teaching selects explanatory examples with differing degrees of helpfulness, as judged by the fidelity between the explainee model and the target model. For the explainee model, we used a ResNet-50-PLDA model: a ResNet-50 model in which the last softmax layer is replaced by a probabilistic linear discriminant analysis (PLDA) model. This alteration introduces the probabilistic training required by Bayesian teaching while keeping the architecture of ResNet-50, which is known to fit human labels accurately 39 . In the context of image classification, Bayesian teaching can be expressed as

P_T({τ} | y*, d*) ∝ f_L(y* | d*, {τ}),    (1)

where d* is a target image; y* is the label predicted by the model to be explained, hence the target decision; {τ} is a set of explanatory examples; f_L is the explainee model; the probability produced by f_L(·) is the simulated explainee fidelity; and P_T(·) determines the probability of selecting a set of explanatory examples. See Fig. 1 for an overview and the "Methods" section for further details. Bayesian teaching also allows for the selection of examples at different levels of granularity. For the current task, we consider the selection of entire images as well as pixels in an image as explanations. The pixel-selection process derived from Bayesian teaching turns out to be mathematically equivalent to a type of feature attribution method called Randomized Input Sampling for Explanations (RISE) 42 . Thus the two levels of example granularity evaluated in this paper coincide with two popular methods of explanation: explanation-by-examples and saliency maps. To give a concrete example, consider a trial in which a participant tries to predict whether the target AI classifier will classify a certain image as a barn or a flagpole (see Fig. 2).
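The pixel-level selection can be illustrated with a minimal RISE-style sketch. This is a simplified sketch under stated assumptions, not the paper's implementation: `model_prob` is a hypothetical callback returning the probability of the target label for a masked image, the image is grayscale, and all parameter values are illustrative.

```python
import numpy as np

def rise_saliency(model_prob, image, n_masks=1000, keep_prob=0.5, rng=None):
    """RISE-style saliency: weight each pixel by an estimate of the
    probability that revealing it guides the model to the target decision.

    model_prob(masked_image) -> probability of the target label.
    `image` is a 2-D array; colour channels are omitted for brevity.
    """
    rng = np.random.default_rng(rng)
    saliency = np.zeros(image.shape)
    total = 0.0
    for _ in range(n_masks):
        # reveal each pixel independently with probability keep_prob
        mask = (rng.random(image.shape) < keep_prob).astype(float)
        p = model_prob(image * mask)      # simulated fidelity under this mask
        saliency += p * mask              # credit revealed pixels
        total += p
    return saliency / max(total, 1e-12)   # normalized per-pixel weights
```

Pixels whose presence co-occurs with high target-label probability receive high weight, which is the sense in which feature attribution here coincides with teaching by sub-examples.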
Bayesian teaching operates by selecting the four example images (two of a barn and two of a flagpole) from the training set that are most likely to make the explainee model reach the same judgement as the target model, i.e., achieve high simulated explainee fidelity. The target image and the examples are overlaid with saliency maps in which each pixel is weighted by the probability that showing it will guide the explainee model to the same conclusion as the target model.
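This selection rule (Eq. 1) can be sketched over a small candidate pool. The sketch assumes a black-box `explainee_prob` that returns the simulated explainee fidelity for a candidate example set; the pool contents and helper names are illustrative, not from the study materials.

```python
import itertools
import numpy as np

def teaching_probs(candidate_sets, explainee_prob):
    """P_T({tau} | y*, d*) ∝ f_L(y* | d*, {tau}) over candidate example sets.

    explainee_prob(tau) -> simulated explainee fidelity: the probability the
    explainee model assigns to the target label after seeing the set tau.
    """
    scores = np.array([explainee_prob(tau) for tau in candidate_sets], dtype=float)
    return scores / scores.sum()

def best_example_set(barn_pool, flag_pool, explainee_prob):
    """Pick the 2+2 example set most likely to teach the target decision."""
    candidates = [b + f
                  for b in itertools.combinations(barn_pool, 2)
                  for f in itertools.combinations(flag_pool, 2)]
    probs = teaching_probs(candidates, explainee_prob)
    return candidates[int(np.argmax(probs))]
```

In the actual experiment the scoring model is the ResNet-50-PLDA explainee; here any callable standing in for f_L works.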
Bayesian teaching contributes to the literature on XAI by formalizing the role of the explainee. Explicitly considering the explainee highlights how XAI methods can be validated, and how explanations informed by the explainee model can mitigate human prior beliefs about the AI system. We showcase three criteria to validate explainable AI from the Bayesian teaching perspective: (1) explanations selected by Bayesian teaching improve the fidelity between human predictions of AI classifications and the actual AI classifications; (2) the Bayesian Teacher can correctly infer which explanations humans will prefer; and (3) the Bayesian Teacher can accurately predict both which explanations will improve fidelity and which will decrease it. Additionally, we show how the prior beliefs of human participants can be mitigated by appropriate explanations. Consistent with existing work in psychology 44,45 , we find that human participants project their own beliefs onto the AI system. This belief projection manifests as (4) fidelity being higher when the AI is correct relative to when it is wrong, (5) this impact of AI correctness on fidelity being particularly pronounced for familiar categories, and (6) these effects being mitigated by appropriate explanations. We provide justifications and intuitions for these six points in the following paragraphs. To the best of our knowledge, this is the first paper to empirically explore the implications of human belief projection for explainable AI.

The core prediction of Bayesian teaching is that explanations which lead the explainee model to correct predictions will help humans to better understand the AI. We test this by evaluating whether participants exposed to helpful examples and saliency maps are better able to predict the AI system's classifications than participants who do not view any explanations. Returning to our example in Fig. 2, this means that a participant who is shown the example images and saliency maps is more likely to correctly predict that the AI classified the image as a flagpole rather than a barn, relative to a participant who is only shown the target image without any explanation. This is a generous test of Bayesian teaching, but a necessary one, because failing this test would make all subsequent results moot. Provided that the explainee model matches human users reasonably well, we expect that examples selected to be helpful by the Bayesian Teacher will be preferred over examples selected to be unhelpful or at random. A stricter test of the appropriateness of Bayesian teaching is whether it can predict both explanations that improve the fidelity of human predictions and those that reduce it. Such calibration implies that not every explanation improves fidelity: explanations need to be curated to reach a desired result. In our experimental setup this would manifest as examples judged helpful or detrimental by the Bayesian Teacher respectively increasing or reducing the fidelity of the participants' predictions of the AI's judgments.
If human participants project their beliefs onto the AI system, they will expect the AI classifier to be highly accurate because they themselves perform well at image classification 38 . In our experiment this translates to humans who predict AI classifications achieving higher sensitivity (correctly predicting the AI's correct classifications) than specificity (correctly predicting the AI's mistakes), absent explanation. In the context of our example: since the target image shows a barn, a participant not given any explanation should typically (incorrectly) predict that the AI will classify the image as a barn rather than a flagpole. However, this effect should not be uniform across trials because some categories are easier to distinguish than others. Since more familiar categories should be easier to distinguish, and since participants expect the model to get the right answer on trials they themselves find easy, belief projection implies that familiarity should increase fidelity for model hits. Conversely, familiarity should decrease fidelity for model errors. Introducing a different example: a participant who is familiar with dogs will find the discrimination between Yorkshire terrier and silky terrier easy, whereas someone less familiar with dogs might struggle with the first-order categorization, and consequently be more willing to consider that the AI classifier might have made a mistake.
If explanations generated by Bayesian teaching operate by mitigating belief projection, we would expect them to reduce the gap between sensitivity and specificity by increasing the latter (improving error detection). Additionally, belief projection implies that examples should improve fidelity the most for unfamiliar categories, whereas saliency maps should improve fidelity most for familiar categories. Examples are most beneficial for unfamiliar categories because they can strengthen category distinctions for categories with fuzzier mental representations. In the context of the two breeds of terrier: someone who is unfamiliar with dogs can leverage the examples to better understand what features distinguish the two breeds, and compare those to the features of the target image. Saliency maps, on the other hand, might be most diagnostic for familiar cases because they highlight features that were consequential to the AI system, and determining the appropriateness of these features requires familiarity with the categories. In the context of the barn versus flagpole example: most people can reliably distinguish between the two, and so can notice that the saliency map of the target indicates that the AI classifier pays less attention to the house than to the weathervane, suggesting a potential misclassification.

Results
Methodological overview. User understanding in the context of classification can be captured by how well the user can predict the model's judgement. Throughout this paper we refer to this predictive capacity as fidelity: the agreement between an agent's prediction (either a participant or the explainee model) and the judgement of the classifier. A natural measure of explanation effectiveness is how much the explanations increase such fidelity, relative to a control condition. We designed a two-alternative forced choice (2AFC) task in which participants were asked to predict the model's classification of a target image between two given categories. No trial-by-trial feedback was provided to participants. It is important to note that in this task high fidelity does not imply that participants' judgements match the ground truth of the image, which we refer to as first-order accuracy or simply accuracy. It is possible for a participant to have high accuracy (in that their judgements often match the ground-truth category of the image) but poor fidelity (in that their judgements rarely match the AI's).
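The distinction between fidelity and accuracy can be made precise with a small sketch (variable names are ours, not from the study materials):

```python
def fidelity_metrics(predictions, ai_labels, truth):
    """Fidelity = agreement with the AI's judgements; accuracy = agreement
    with the ground truth. Sensitivity and specificity split fidelity by
    whether the AI itself was correct on that trial."""
    n = len(predictions)
    fid = sum(p == a for p, a in zip(predictions, ai_labels)) / n
    acc = sum(p == t for p, t in zip(predictions, truth)) / n
    hits = [(p, a) for p, a, t in zip(predictions, ai_labels, truth) if a == t]
    errs = [(p, a) for p, a, t in zip(predictions, ai_labels, truth) if a != t]
    sens = sum(p == a for p, a in hits) / len(hits) if hits else float('nan')
    spec = sum(p == a for p, a in errs) / len(errs) if errs else float('nan')
    return {'fidelity': fid, 'accuracy': acc,
            'sensitivity': sens, 'specificity': spec}
```

A participant whose predictions always match the ground truth while the AI errs on most trials scores high on accuracy but low on fidelity, which is exactly the dissociation the task is designed to expose.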
We designed a total of 15 conditions that vary along three dimensions: types of labels, types of examples, and types of saliency maps. Table 1 shows the sample size of each condition. Figure 2 shows a trial where the categories are represented with informative labels, helpful examples, and blur saliency maps. Each trial has three more distinct features beyond the condition it belongs to: the category accuracy, the simulated explainee fidelity, and a familiarity score. Category accuracy refers to the classification accuracy on the category which the target ResNet-50 model predicts that the target image belongs to (see Supplementary Table T1). Note that, in contrast to category accuracy, which is measured at the category level, the simulated explainee fidelity and the familiarity score are measured at the trial level.

When interpreting these results in relation to belief projection it is instructive to consider three idealized scenarios. An agent who picked categories at random would have 50% fidelity, sensitivity (correctly predicting AI classifications when the AI classifier is correct), and specificity (correctly predicting the AI's mistakes). An agent who modelled the AI classifier perfectly would have 100% fidelity, sensitivity, and specificity. Finally, an agent with perfect first-order accuracy who projected their own beliefs onto the AI classifier would have 100% sensitivity, 0% specificity, and 33% overall fidelity, because the experiment contains twice as many AI errors as AI correct classifications (see "Methods" section). Absent intervention, participants behave most like the third, belief-projecting, agent (Fig. 3). The explanation interventions increase overall fidelity by increasing specificity (participants are better able to spot the AI's mistakes) at the cost of some sensitivity. Participants in the control condition have a mean fidelity of 49%. The interventions reduce sensitivity (β = −0.43(0.12), z = −3.68, p = 0.0002), but not enough to offset the specificity gains.
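The three idealized profiles follow from a one-line calculation using the experiment's 50 AI hits and 100 AI errors (trial counts from the "Methods" section):

```python
def overall_fidelity(sensitivity, specificity, n_hits=50, n_errors=100):
    """Overall fidelity implied by a (sensitivity, specificity) pair, given
    the experiment's mix of 50 correct and 100 incorrect AI classifications."""
    return (sensitivity * n_hits + specificity * n_errors) / (n_hits + n_errors)
```

A random agent scores overall_fidelity(0.5, 0.5) = 0.5, a perfect agent scores 1.0, and a perfectly accurate belief projector (sensitivity 1, specificity 0) scores 1/3, matching the 33% figure above.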
Collectively, these results imply that participants attempt to predict the AI by projecting their own beliefs, and that the explanations improve fidelity by mitigating this belief projection.

Figure 3. (A) Idealized profiles showing the fidelity of: a random agent; a perfect agent; and an agent with perfect access to the ground truth who assumes that the AI classifier always mirrors their own predictions (belief projection). (B) Human fidelity most closely matches the belief-projection profile, but the interventions increase specificity (and slightly reduce sensitivity) by making participants better at spotting the AI's errors. The violin plots show the distribution of fidelity within conditions. Black dots show the group mean, with error bars signifying 95% bootstrapped confidence intervals. (C) Individual participants' sensitivity and specificity. The vertices of the triangle show the fidelity of a belief-projecting agent with perfect access to the ground truth (upper left), an agent with a perfect model of the AI classifier (upper right), and an agent choosing at random (lower middle). The control group is clustered at high sensitivity and low specificity towards the upper left, whereas the experimental group is shifted to the right. However, the experimental group also shows greater variance, signifying inter-individual differences in intervention effectiveness. This figure was created using the ggplot2 package (v. 3).

Bayesian teaching can predict which explanations improve and reduce fidelity. Bayesian teaching makes explicit the existence of an explainee and suggests that a sound explainee model should have the capacity to track the inferences of actual explainees. In our experiment the calibration between the explainee model and the participants is captured by the relationship between category accuracy and participant accuracy.
We estimate participant accuracy (their first-order belief about the ground truth) using their fidelity in the control trials (their second-order belief about the AI classifier, with no exposure to explanations). The assumption that their attempts to predict the AI classifier may serve as a proxy for their first-order accuracy is justified given the tendency to belief-project observed in the previous sections. We found that participant fidelity (interpreted as accuracy for the control trials) was positively correlated with category accuracy for trials where the model was correct (β = 1.74(0.20), z = 8.67, p < 0.0001), indicating good calibration between the model and participants in this situation (see Supplementary Fig. F1). We also found a negative interaction between category accuracy and model correctness (β = −2.57(0.23), z = −11.03, p < 0.0001). This suggests poor calibration in the special case in which the model's overall accuracy on the predicted category is high but it misclassifies the particular trial. In sum, these results imply that category accuracy is a good proxy for human ground-truth judgements at the aggregate level, which in turn suggests that our explainee model is appropriate for our participants.

Bayesian teaching should be able to modify participant fidelity by selecting explanations of varying helpfulness. To test this in practice, we ran three nested hierarchical logistic regression models of increasing complexity. Each regression model predicted participant fidelity (whether the participant correctly predicted the AI classifier on a given trial) from the [examples] trials only, as these are the only trials affected by the simulated explainee fidelity, which measures the degree to which the examples would lead the explainee model to the targeted inference.
The first regression model served as a null model, not using simulated explainee fidelity as a predictor and including only category accuracy and a dummy variable encoding AI correctness (whether the AI prediction for that trial matched the ground truth or not). The second regression model added simulated explainee fidelity as a predictor, capturing the hypothesis that the helpfulness of the examples, as determined by Bayesian teaching, covaries with participant fidelity. The third regression model added two two-way interactions: between model correctness (model hit or error) and category accuracy, and between model correctness and simulated explainee fidelity, capturing the hypothesis that helpful examples have a differential impact on error detection relative to hit confirmation. We found that the second regression model fitted the fidelity data better than the first (χ2(1, 4) = 71.68, p < 0.0001). This means that the Bayesian Teacher's perception of the helpfulness of the presented examples predicts participant fidelity above and beyond category accuracy. The third regression model outperformed the second (χ2(3, 7) = 7371.28, p < 0.0001). This indicates that how well category accuracy and/or the modelled helpfulness of the examples predicted fidelity differed between trials with correct and incorrect AI judgements.
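The nested-model comparison can be sketched as follows. This is a simplified version: it fits plain (non-hierarchical) logistic regressions by Newton's method and compares them with a likelihood-ratio χ² test, whereas the reported analyses also include random effects; all data in the usage below are simulated.

```python
import numpy as np
from scipy.stats import chi2

def logit_llf(y, X, n_iter=50):
    """Fit a logistic regression by Newton's method (IRLS) and return the
    maximized log-likelihood. An intercept column is added automatically."""
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (X * W[:, None]) + 1e-9 * np.eye(X.shape[1])  # Hessian + ridge
        beta += np.linalg.solve(H, X.T @ (y - p))
    p = 1 / (1 + np.exp(-X @ beta))
    return np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def lr_test(y, X_null, X_full):
    """Likelihood-ratio chi-square test between two nested logistic models."""
    stat = 2 * (logit_llf(y, X_full) - logit_llf(y, X_null))
    df = X_full.shape[1] - X_null.shape[1]
    return stat, chi2.sf(stat, df)
```

If the added predictor (here standing in for simulated explainee fidelity) truly covaries with the outcome, the test rejects the null model, mirroring the comparison between the first and second regression models.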
To explore how model correctness interacted with category accuracy and simulated explainee fidelity, we examined the parameters of the third regression model. Participants are typically better at predicting the AI classifier when it is correct relative to when it is wrong (β = 0.53(0.06), z = 9.15, p < 0.0001). This aligns with our previous results, which suggest that participants have a sense of the ground truth for most trials, and assume that the AI classifier would make the same judgement that they would make. Category accuracy is positively associated with participant fidelity when the AI is wrong (β = 0.59(0.05), z = 12.30, p < 0.0001), and even more strongly associated with fidelity when the AI classifier is correct (β = 0.93(0.09), z = 10.68, p < 0.0001; see Fig. 4). Because there was a significant positive relationship between ResNet accuracy and participant fidelity for both the control trials and the example trials, it seems plausible that the calibration between model and participant observed in the control condition survives the introduction of explanatory examples, at least partially. Finally, while statistically controlling for category accuracy, simulated explainee fidelity did not predict fidelity on trials when the AI classifier was wrong (β = −0.01(0.03), z = −0.16, p = 0.89) but did so for trials when the AI classifier was correct (β = 0.77(0.05), z = 14.19, p < 0.0001). Because the simulated explainee fidelity determined which examples were shown, the fact that this variable could accurately predict human fidelity above and beyond ResNet accuracy implies that the Bayesian Teacher can successfully predict which explanations improve or impair the fidelity of participant judgements.
Bayesian teaching improves fidelity through belief mitigation. We estimated the effect of each intervention on participant fidelity while controlling for category accuracy and familiarity score. We ran separate analyses for when the AI classifier was correct and when it was wrong, corresponding to the distinction between sensitivity and specificity in previous sections. We treat the ground truth as a proxy for participant first-order beliefs, a defensible assumption given the reported human accuracy on ImageNet in previous work 38 . Based on this assumption, interventions that increase fidelity while also increasing mismatches with the ground truth shift participant predictions of the AI classifier away from their first-order judgements. The [specific labels] are associated with higher fidelity than the [generic labels] regardless of whether the AI classifier is correct (β = 0.24(0.08), z = 3.06, p = 0.002) or not (β = 0.07(0.03), z = 2.13, p = 0.03). Because these effects are small and orthogonal to belief projection, they will not be discussed further.
The presence of the saliency maps in the [map] condition improves fidelity when the AI classifier is wrong (β = 0.43(0.03), z = 14.24, p < 0.0001), but reduces fidelity when the AI classifier is correct (β = −0.56(0.07), z = −7.98, p < 0.0001; see Fig. 5). In both cases, saliency maps reduced the first-order accuracy of the participants (model hit: β = −0.56(0.07), z = −7.98, p < 0.0001; model error: β = −0.43(0.03), z = −14.24, p < 0.0001), meaning that participants were less likely to report that the AI classifier's judgement matched the ground truth of the image. This implies that the saliency maps encourage participants to consider that the AI classifier might be mistaken. One potential explanation for this observation is that the saliency maps reveal when the AI classifier attends to non-sensible features (i.e. parts that are not representative of either of the categories) as well as ambiguous features (e.g. thin metal strips that are present in both the "Electric Fan" and "Buckle" categories).
Figure 4. (A) Simulated explainee fidelity, i.e. the helpfulness of the explanatory examples expected by the Bayesian Teacher, correlates significantly with participant fidelity for correct trials but not for incorrect trials. This suggests that the Bayesian teaching framework can predict which explanations are informative or misleading for trials that are correctly classified by the model, but not for trials that are incorrectly classified. (B) Category accuracy is positively associated with participant fidelity, both for trials when the AI classifier is correct and when it is wrong. A similar trend is observed in the control condition (see Supplementary Fig. F1). This suggests that humans and ResNet-50 find the same categories difficult to discriminate, implying that the ResNet architecture can serve as an appropriate model of human participants in this task. The difference in fidelity between when the AI classifier is correct and when it is wrong suggests that it is harder to teach incorrect judgements, at least in this context. (C) Two-dimensional kernel density with 25 density bins showing the distribution of trials in terms of category accuracy and simulated explainee fidelity. In this study the two are independent. Note that the higher density near perfect simulated explainee fidelity is due to all the helpful examples being selected based on this variable, so they constitute a majority of our example trials. This figure was created using the ggplot2 package.

The familiarity scores capture the ease of the discrimination task in that they are higher for trials involving categories that humans are familiar with. These scores provide clues as to whether participants project their own beliefs onto the AI: if humans use their first-order classifications to model the AI, participants should assume that the AI classifier gets the correct answer on trials that they themselves find easy.
This is indeed what we find: familiarity is positively associated with fidelity when the AI classifier is correct (β = 1.10(0.04), z = 29.28, p < 0.0001), but negatively associated with fidelity for AI errors (β = −0.92(0.02), z = −42.82, p < 0.0001; Fig. 6).
Previously, we showed that saliency maps improved fidelity on trials when the AI classifier was wrong. This could be explained by saliency maps helping participants distinguish between their first-order judgements of the ground truth and their second-order beliefs about the model's classification. This explanation can be evaluated by testing whether the impact of the familiarity scores on fidelity is attenuated by the saliency maps. In other words, if participants are more likely to predict that the AI classifier is correct on trials that they themselves find easy, and the saliency maps work by helping people realize that the AI classifier uses decision processes that differ from their own, then the saliency maps should make participants more willing to consider that the AI classifier might be wrong on trials they themselves find easy. This is what we find (see Fig. 6): the presence of saliency maps reduces the positive impact of familiarity on fidelity when the AI classifier is correct (β = −0.51(0.08), z = −6.31, p < 0.0001). Conversely, saliency maps reduce the negative impact of familiarity on fidelity when the AI is wrong (β = 0.70(0.05), z = 15.22, p < 0.0001; Fig. 6). Collectively these results suggest that the presence of saliency maps helps participants model the AI as an agent with distinct beliefs that may conflict with their own.

Though the presence of examples did not generally impact fidelity, it is possible that they affected judgements specifically for unfamiliar categories. Like the saliency maps, examples typically reduced the impact of familiarity on fidelity, both when the AI classifier is correct (β = −1.01(0.08), z = −12.71, p < 0.0001) and when it is wrong (β = 0.33(0.05), z = 7.35, p < 0.0001). However, in contrast to the saliency maps, examples seem to be most helpful for unfamiliar trials when the AI classifier is correct (see Fig. 6C).
This effect may imply that the examples help participants develop a working representation of the unfamiliar categories, which they are otherwise lacking.

Discussion
Bayesian teaching provides a novel way to think about XAI by explicitly modeling the explainee and their prior beliefs. It suggests that explanations can be evaluated in terms of how well they shift explainees' beliefs away from their prior towards a target. We have presented evidence that a Bayesian Teacher can successfully predict which explanations will improve the fidelity between human predictions and target classifications, as well as which will be preferred by human users. Crucially, our results show that the Bayesian Teacher is well calibrated to human users: it knows both which explanations will improve predictions about the AI and which explanations are problematic. This calibration provides strong evidence that the selection process of the Bayesian Teacher has a causal effect on explainee understanding. Not all examples are created equal, so they need to be appropriately curated.

Figure 6. (A) The fidelity between the participant predictions and the AI classifier's judgements increases with category familiarity when the AI is correct, but decreases with familiarity when it is wrong. This provides further evidence that participants project their own beliefs onto the AI classifier, as they are more likely to predict that the AI makes the correct choice on trials they themselves find easy. (B) Saliency maps decrease the impact of familiarity on participant judgements. For model hits this leads to decreased fidelity, whereas for model errors it leads to improved fidelity. This pattern provides further evidence that the saliency maps work by shifting participants away from using their first-order judgments to model the AI's classifications. (C) Examples also decrease the impact of familiarity on participant judgements. For model hits this improves fidelity for unfamiliar items but decreases fidelity for familiar items, with the opposite pattern for model errors. These results suggest that examples are most beneficial for unfamiliar items when the AI classifier is correct. Error bars represent 95% bootstrapped confidence intervals. All point estimates have confidence intervals, though some are too narrow to see clearly. Shaded areas represent analytic 95% confidence intervals. This figure was created using the ggplot2 package (v. 3).

Multiple strands of evidence from our results suggest that, in the absence of explanations, people project their own first-order beliefs onto the AI classifier. Specifically, we find that participants in the control condition show higher sensitivity than specificity, and that this discrepancy becomes more extreme the more familiar participants are with the trial categories. The finding that participants predict the AI system by projecting their own beliefs onto the AI links research on explainable AI to the rich psychological literature on social prediction. In many social prediction tasks (in contrast to mechanistic prediction tasks) people use their own preferences, judgements, and beliefs as priors for other agents 44,45,48,49,50 . Our results imply that such belief projection can be mitigated by Bayesian teaching. The most compelling evidence that explanations mitigate belief projection is that the impact of familiarity on fidelity is reduced by explanations: explanations make participants more likely to catch AI mistakes on trials they themselves found easy.
Bayesian teaching also gives a coherent framework for comparing and contrasting explanatory methods that have hitherto been considered independent: explanation-by-examples and feature attribution. We apply Bayesian teaching to study explanation-by-examples, a popular method for XAI that previously has lacked a sound theoretical footing. Explanation-by-examples has many strengths: it is model-agnostic, domain-general, and easy to use with other XAI methods. Viewed through a Bayesian teaching lens, this method can be generalized to include feature attribution, another popular post-hoc method, by splitting each example into its component features (i.e. pixels in this study) and considering each pixel individually. When applied to images, such feature attribution at the pixel level generates saliency maps, arguably the most popular XAI method in the image domain. The connection between feature attribution and pixel selection by Bayesian teaching opens up the possibility of reinterpreting all feature attribution methods (e.g. 51,52 ) as a form of teaching. By treating images and saliency maps as explanatory examples at different levels of granularity, we discover that the two explanations show complementary effects. Namely, example images are effective explanations for confirming the model's correct classification of unfamiliar categories, and saliency maps are effective explanations for exposing the model's incorrect classification of familiar categories.
The lack of a coherent theory is currently stifling XAI, as methods are developed around technical innovations without any a priori hypothesis as to whether they are appropriate for the specific use case 53 . Bayesian teaching both exposes this blind spot and offers a solution: effective explanation is a communicative act that depends on a knowledgeable teacher, a good model of the explainee, and an awareness of the context in which inference takes place. Consequently, the framework encourages systematic evaluation of XAI interventions on these dimensions, and provides a way to systematically diagnose how interventions could be improved. In our study we show how such an evaluation applies to explanation-by-examples. We modeled the explainee with a ResNet-50 architecture, focused on two contextual variables (familiarity and model correctness), and surfaced how explanations generated by Bayesian teaching can mitigate mistaken prior beliefs. These results highlight the promise of the Bayesian teaching approach, since the function of explanation is to shape the explainee's inductive reasoning 54 . Furthermore, Bayesian teaching exemplifies how XAI can be improved by considering links to other fields such as education and cognitive science. A balanced synergy between the social sciences and the more technical literature of AI is much needed, as XAI is simultaneously a machine-learning problem and a human-centered endeavor.

Methods
The objective of this study was to explore the effects of explanations, in the form of examples and saliency maps, on users' understanding of high-performing machine learning models (referred to as AI throughout the paper) in the domain of image classification. We probe users' understanding with a two-alternative forced choice (2AFC) task in which users are asked to predict the model's classification of a target image into one of two categories. Experimental conditions vary in terms of the information presented on the screen during each classification. The information presented differs along three dimensions: types of labels, types of examples, and types of saliency maps. All the examples and saliency maps are generated by the Bayesian teaching framework. The fidelity of the participant is captured by sensitivity, specificity, and accuracy.
The model to be explained. The machine learning model to be explained is a ResNet-50 model 39 . For this study, we used the pre-trained version of ResNet-50 in Keras with ImageNet weights. For the selection of saliency maps, the Bayesian teaching framework expects the model to be able to make probabilistic inferences on the image classification task presented in the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012). The ResNet-50 model has this capability, so we can use it without any modification. However, for the selection of examples, the Bayesian teaching framework expects the model to be able to make probabilistic inferences on the 2AFC task, and the ResNet-50 model is deterministic. We therefore replace the fully-connected classification layer of the ResNet-50 model with a probabilistic linear discriminant analysis (PLDA) model 55 . This new PLDA layer is trained using a transfer-learning-like procedure. Training images were first passed through the ResNet-50 model and transformed into feature vectors. Then, the PLDA layer was fit to these feature vectors and the corresponding class labels following the algorithm presented in 55 . Trained on the ImageNet 1K dataset from the ILSVRC2012 38 , this ResNet-50-PLDA model has a top-1 accuracy of 52.86% and a top-5 accuracy of 76.29%. For the actual experiment, we focused on a subset of 100 categories that includes the most difficult, easiest, and most confusable categories (see the next subsection for details). Unless otherwise stated, all the model predictions used to design the experiment are based on the ResNet-50-PLDA model trained on the training data in only these 100 categories.

Stimuli selection. Each experiment consisted of 150 trials. For 50 of the trials, the predictions of the model (or the robot) matched the ground-truth labels of the target images. For the remaining 100, the model predictions did not match the ground-truth labels.
We selected the target images and the classification categories based on the model's confusion matrix, with the aim of covering a wide range of model behavior. First, we calculated the ResNet-50-PLDA model's confusion matrix on ImageNet 1K, which contains 1000 categories. Then, we randomly selected 25 categories from each of the following four subsets: the 100 categories on which the model was most accurate, the 100 categories that were most confusable with these most accurate categories, the 100 categories on which the model was least accurate, and the 100 categories that were most confusable with these least accurate categories. This resulted in 100 categories. We recorded the model's predicted labels for all the training images in these 100 categories and marked all images for which the model predictions were also among these 100 categories. From this subset, where both the image and the top model prediction belonged to our 100 categories, we randomly sampled 50 images for which the model prediction matched the ground-truth label and 100 images for which it did not. For the 50 trials with correctly classified target images, the two classification options participants could choose from were the correct model-predicted category and one of the two most confusable categories (out of our 100 selected categories); which of the two most confusable categories was presented was selected randomly for each trial. For the 100 incorrectly classified trials, the two classification options were simply the ground-truth category and the incorrect model prediction. This procedure resulted in a total of 83 unique categories used in the experiment (Supplementary Table T1). This number is smaller than 100 because not all confusable categories are unique and not all categories were kept during the random sampling. Figure 7 depicts the trial-generating process.
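The category-sampling step described above can be sketched as follows. All names are illustrative, and the confusability criterion used here (total off-diagonal confusion mass flowing into a set of categories) and the tie-breaking are assumptions rather than the paper's actual code.

```python
import numpy as np

def select_categories(conf, rng):
    """Illustrative sketch of the category-sampling step.

    conf: 1000x1000 confusion matrix (rows = ground-truth category,
    columns = model-predicted category).
    """
    acc = np.diag(conf) / conf.sum(axis=1)         # per-category accuracy
    easy = np.argsort(acc)[-100:]                  # 100 most accurate
    hard = np.argsort(acc)[:100]                   # 100 least accurate
    off = conf.astype(float).copy()
    np.fill_diagonal(off, 0.0)                     # drop correct predictions
    # "most confusable with" a set: most off-diagonal mass flowing into it
    conf_easy = np.argsort(off[:, easy].sum(axis=1))[-100:]
    conf_hard = np.argsort(off[:, hard].sum(axis=1))[-100:]
    pools = (easy, conf_easy, hard, conf_hard)
    # 25 random picks from each of the four subsets -> 100 categories
    return np.concatenate([rng.choice(p, 25, replace=False) for p in pools])
```
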
The pairs of categories used in the experiments are listed in Supplementary Table T2.
Experimental design. At the beginning of the experiment, participants were told that a robot has been trained to classify images but sometimes makes mistakes. They were asked to help by guessing how the robot will classify images. On each trial, a target image was displayed along with information about two categories, and the participants were asked to perform the 2AFC task by choosing which of the two categories they think the robot would classify the target image as.
The experimental conditions determined what information was presented during each trial and varied three dimensions: labels, examples, and saliency maps. Table 1 shows the naming conventions for the different conditions in terms of these features. Below, we provide more details on the conditions.

Specific or generic labels. Conditions with informative or generic labels are referred to as [specific labels] and [generic labels], respectively. In the [specific labels] conditions, the English labels of the two categories (e.g., "Flagpole" and "Barn" in Fig. 2) are given. In the [generic labels] conditions, the two categories are named "Category A" and "Category B."

With or without saliency maps. Conditions with and without saliency maps are referred to as [map] and [no map], respectively. A saliency map is an image mask that shows the contribution of each pixel to the model's classification decision. Details on the generation of the saliency maps are provided in the "Selection of saliency maps with Bayesian teaching" subsection below. In the [map] conditions, a saliency map is shown for every image displayed. In the [no map] conditions, no saliency map is shown.
Blur or jet saliency maps. Conditions with blur saliency maps and jet saliency maps are referred to as [blur] and [jet], respectively. The two types of map differ only in the rendering of the mask, not in its generation. The jet saliency map renders the importance of each pixel by colors following the jet color map: in order of decreasing importance, it goes from red to green to blue. The jet color map, overlaid on an image with some level of transparency, is one of the most commonly used renderings of saliency maps. The blur saliency map instead renders importance by blurring the image, keeping important regions sharp while blurring unimportant ones. Blurring is a more naturalistic visual effect than any color-map masking because our visual system constantly experiences a large difference in visual acuity between our fovea and peripheral vision. The implementation details of both renderings are provided in the subsection below on saliency map selection.
Naming convention. As shown in Table 1, not all combinations of the five binary features are allowed. Conditions with generic labels and no examples are not tested because that would make the 2AFC task a game of pure guessing. Furthermore, conditions without examples cannot be paired with helpful or random examples, and conditions without saliency maps cannot be paired with blur or jet maps. This leaves a total of 15 experimental conditions. The naming convention for the conditions is based on filter queries using the database structure presented in Table 1.

Participants. The study protocol was approved by the Rutgers University IRB. All research was performed in accordance with the approved study protocol. An IRB-approved consent page was displayed before the experiment. Table 1 shows the number of participants in each of the 15 conditions. All participants in the [helpful] conditions experienced the same set of 150 trials, i.e., the same 150 combinations of target image, category pair, and example images, but in randomized order. All participants in the [random] conditions experienced another set of 150 trials, also in randomized order. All the category pairs used are listed in Supplementary Table T2.

Figure 1 caption. (A) The inputs to Bayesian teaching are: the model to be explained, data sets from two categories, and a target image that belongs to one of the two categories. The green box depicts the inner workings of Bayesian teaching. Random image pairs are selected from each of the input categories. Along with the target image, two sets of image pairs, one set from each category, are selected at random to form a trial. The explainee model, which is set to have the same architecture as the input model, takes in a large number of random trials to produce the simulated explainee fidelity (unnormalized teaching probabilities according to Eq. (3)). Here, a trial with high fidelity (probability) is selected, exemplifying the trial-generation process in the [helpful] condition. Saliency maps are generated for the target image and the four selected examples using Eq. (6). The final output is a set of ten images: a target image, two examples selected from each of the two input categories, and the saliency maps of the above five images. (B) Trial generation steps peripheral to Bayesian teaching. Our model to be explained is a ResNet-50 trained on ImageNet 1K. A confusion matrix on the 1000 ImageNet categories was computed using the model. Using the confusion matrix, we sampled 25 categories where the model has high accuracy (the "Easy" categories), 25 categories where the model has low accuracy (the "Hard" categories), and the categories that are most confusable with the above 50 categories. To generate a trial, we select at random two categories from the 100 candidates mentioned above as well as a target image that belongs to one of the two selected categories. The model, the target image, and the data associated with the two categories are fed into Bayesian teaching to produce a trial. See Methods for the full details. This figure was created using Adobe Illustrator CS6 (v. 16).

Selection of examples with Bayesian teaching. The goal of Bayesian teaching is to select small subsets of the training data such that the inference made by an explainee model using this small subset will be similar to the inference made by a target model using the entire training data. For this study, the target model is the ResNet-50-PLDA model trained on the 100 selected categories as described earlier. Its inference task is to classify the target image among the 100 categories. The inference task of the explainee model is the 2AFC image classification task presented in each trial. For the explainee model, we search for an ideal-observer model 40,41 that would capture the participant's inference in the 2AFC task.
A good candidate is the ResNet-50-PLDA because it is trained on human-labeled data and achieves high accuracy in predicting humans' labelling behavior. This means that the target model and explainee model share the same parameters (the ResNet-50 weights and the PLDA parameters mentioned after Eq. (4)). We introduce some notation to define the Bayesian teaching probability formally. The two categories that define the 2AFC task in each trial consist of the predicted category of the ResNet-50-PLDA model and an alternative category, which we denote by y* and y′, respectively. The two examples sampled from the model-predicted category are denoted by τ_y*, and the two sampled from the alternative category are denoted by τ_y′. Let the explainee model be denoted by f_L and the target image be denoted by d*. The Bayesian teaching probability, P_T, is defined as the probability that the selected examples, τ_y* and τ_y′, will lead the explainee model to classify the target image as the target model would. Mathematically, this probability can be expressed using Bayes' rule as Eq. (2). The sum in the denominator of Eq. (2) is over all possible candidate sets of the four examples; the set of all candidate sets is denoted by Ω. Equation (2) assumes a uniform prior over Ω so that the prior terms in the numerator and denominator cancel out. Technically, Ω is the Cartesian product of all possible pairings of images in the category y* with all possible pairings of images in the category y′, which is on the order of 10^11 for the dataset in use. Our goal here is to select τ_y* and τ_y′ such that f_L(·), the numerator of Eq. (2), covers the range of values required by the different conditions. We found that the full range can usually be covered by forming a Cartesian product of 1000 random pairings from each category (10^6 combinations). In general, given a target value of f_L(·), one could use a genetic algorithm 56 or another type of discrete optimization method to select the examples.
To sample in proportion to P_T, one could use Markov chain Monte Carlo or variational inference techniques 14,57,58 . These optimization and advanced inference methods would also be more efficient in the case that more than a few examples per category are desired.
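As a concrete illustration of sampling in proportion to P_T, the following is a minimal independence-Metropolis sketch over an enumerated set of candidate example sets. The function name and the uniform proposal are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def mh_sample_indices(fidelities, n_steps, rng):
    """Independence-Metropolis sketch: sample candidate-set indices in
    proportion to their simulated explainee fidelity f_L.

    With a uniform proposal, the acceptance ratio reduces to the ratio
    of target fidelities, so the chain targets P_T up to normalization."""
    f = np.asarray(fidelities, dtype=float)
    cur = rng.integers(len(f))                  # arbitrary starting state
    samples = []
    for _ in range(n_steps):
        prop = rng.integers(len(f))             # uniform proposal
        if rng.random() < min(1.0, f[prop] / f[cur]):
            cur = prop                          # accept the proposal
        samples.append(int(cur))
    return samples
```
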
Recall that the Bayesian teaching probability in Eq. (2) is

$$P_T(\tau_{y^*}, \tau_{y'} \mid y^*, d^*) = \frac{f_L(y^* \mid \tau_{y^*}, \tau_{y'}, d^*)}{\sum_{(\tau'_{y^*},\, \tau'_{y'}) \in \Omega} f_L(y^* \mid \tau'_{y^*}, \tau'_{y'}, d^*)}. \tag{2}$$

Using Bayes' rule again, we express the explainee model's inference of the target image's label given the target image and the examples, the numerator in Eq. (2), as

$$f_L(y^* \mid \tau_{y^*}, \tau_{y'}, d^*) = \frac{f(d^* \mid \tau_{y^*})}{f(d^* \mid \tau_{y^*}) + f(d^* \mid \tau_{y'})}, \tag{3}$$

where f(d* | τ_k) is the probability that the target image, d*, belongs to the category from which the two example images, τ_k, are sampled. Under the PLDA model, one can write this probability in closed form as a normal distribution 21 :
$$f(d^* \mid \tau_k) = \mathcal{N}\!\left(u^*;\ \frac{2\Psi}{2\Psi + I}\,\bar{u}_k,\ I + \frac{\Psi}{2\Psi + I}\right), \qquad \bar{u}_k = \tfrac{1}{2}(u_{k1} + u_{k2}). \tag{4}$$

Here, u is an image transformed in two steps. First, the image is passed through ResNet-50 and transformed into a feature vector; then, this feature vector undergoes an affine transformation with shift vector m and rotation-and-scaling matrix A to become u. Thus, in Eq. (4), u* is the transformed target image, and (u_k1, u_k2) are a pair of transformed examples sampled from category k. The quantities m and A in the second transformation and the Ψ in Eq. (4) are parameters of the PLDA model obtained by training on the images in the 100 selected categories. The precise definitions of these parameters and the training procedure are presented in Fig. 2 in Ioffe's PLDA paper 55 .
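For concreteness, the pair-conditional predictive density can be sketched for a diagonal Ψ, following the posterior-predictive form in Ioffe's PLDA with two examples per category. The function name is hypothetical, and the inputs are assumed to already live in the transformed u-space.

```python
import numpy as np

def plda_pair_predictive_logpdf(u_star, u_k1, u_k2, psi):
    """Sketch of Eq. (4): log N(u*; 2Psi/(2Psi+I) * u_bar, I + Psi/(2Psi+I))
    with a diagonal between-class covariance psi (a 1-D array of variances).

    u_star: transformed target image; u_k1, u_k2: transformed examples
    from category k. Returns the log of f(d* | tau_k)."""
    u_bar = 0.5 * (u_k1 + u_k2)                    # per-category example mean
    mean = (2.0 * psi / (2.0 * psi + 1.0)) * u_bar # predictive mean
    var = 1.0 + psi / (2.0 * psi + 1.0)            # diagonal predictive variances
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * var)
                                + (u_star - mean) ** 2 / var)))
```

With psi = 0 the category carries no information and the density reduces to a standard normal centered at the origin.
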
To summarize this subsection, Eq. (2) defines the Bayesian teaching probability, and Eq. (3) defines its numerator, the simulated explainee fidelity f_L(·), used to select examples in the [examples] conditions. A high f_L(·) means that the selected examples will, with high probability, lead the model of the explainee to classify the target image as the category predicted by the model to be explained. Conversely, a low f_L(·) means that the selected examples will, with high probability, lead the explainee model to classify the target image as the other category in the 2AFC. Note that f_L(·) is trial-specific, as this probability is a function of the target image, d*, the model-predicted label of the target image, y*, and the four examples, (τ_y*, τ_y′), which together define a trial.
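The two quantities summarized here are straightforward to compute once the per-category densities are available. The sketch below, with illustrative names, evaluates Eq. (3) stably in the log domain and normalizes fidelities over an enumerated candidate set as in Eq. (2).

```python
import numpy as np

def two_afc_fidelity(log_f_star, log_f_alt):
    """Eq. (3) in the log domain: probability that the explainee model
    picks the model-predicted category y*, given one example pair per
    category and a uniform prior over the two categories."""
    m = max(log_f_star, log_f_alt)                 # guard against underflow
    a = np.exp(log_f_star - m)
    b = np.exp(log_f_alt - m)
    return a / (a + b)

def teaching_probabilities(fidelities):
    """Eq. (2) restricted to the candidate sets actually enumerated:
    normalize simulated explainee fidelities into selection probabilities."""
    f = np.asarray(fidelities, dtype=float)
    return f / f.sum()
```
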
Selection of saliency maps with Bayesian teaching. A saliency map is an image mask that shows how important each pixel of the image is to the model's inference. In the [map] conditions, we generate a saliency map for every image displayed. To generate a saliency map, one needs to specify a model, an inference task, and a definition of importance. We used ResNet-50 as the model and the classification of an image into the 1000 categories in ImageNet 1K as the inference task. Using the Bayesian teaching framework, we define importance to be the probability that a mask, m, will lead the model to predict the image, d, to be in category y when the mask is applied to the image. This is expressed by Bayes' rule as

$$Q_T(m \mid y, d) = \frac{g_L(y \mid d, m)\, p(m)}{\int_{M} g_L(y \mid d, m')\, p(m')\, dm'}. \tag{5}$$

Here, g_L(y | d, m) is the probability that the ResNet-50 model will predict the image d masked by m to be in category y; p(m) is the prior probability of m; and M = [0, 1]^{W×H} is the space of all possible masks on an image with W × H pixels. The prior probability distribution p(m) on m is a sigmoid-function-squashed Gaussian process prior.
Instead of sampling saliency maps directly from Eq. (5), we find the expected saliency map for each image by Monte Carlo integration:

$$\bar{m}(y, d) = \frac{\sum_{i=1}^{N} g_L(y \mid d, m_i)\, m_i}{\sum_{i=1}^{N} g_L(y \mid d, m_i)}, \qquad m_i \sim p(m), \tag{6}$$

where the m_i are samples from the prior distribution p(m), and N = 1000 is the number of Monte Carlo samples used. To see why an expected map is desirable, imagine the following case. Suppose that an image contains 7 goldfish and its category is "goldfish." In this case, a mask that reveals any one of the goldfish will have a high Q_T value. However, it is more desirable that the mask reveal all the goldfish in the image. The expectation provides this by averaging the masks, appropriately weighted by their Q_T values.

Now, we describe the step-by-step procedure for generating the saliency map for an image, d. First, d is resized to 224-by-224 pixels, which is the size displayed in the experiments (Fig. 2). A set of 1000 2D functions is sampled from a 2D Gaussian process (GP) with an overall variance of 100, a constant mean of −100, and a radial-basis-function kernel with length scale 22.4 pixels in both dimensions. The sampled functions are evaluated on a 224-by-224 grid, and the function values are mostly in the range [−500, 300]. A sigmoid function, 1/(1 + exp(−x)), is applied to the sampled functions to transform each of the function values, x, to be within the range [0, 1]. This results in 1000 masks. The mean of the GP controls how many effective zeros there are in the mask, and the variance of the GP determines how fast neighboring pixel values in the mask change from zero to one. The 1000 masks are the m_i's in Eq. (6). We produce 1000 masked images by element-wise multiplying the image d with each of the masks. The term g_L(y | d, m_i) is ResNet-50's predictive probability that the ith masked image is in category y. Having obtained these predictive probabilities from ResNet-50, we average the 1000 masks according to Eq. (6) to produce the saliency map of image d.
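The mask-sampling and averaging steps can be sketched as follows on a small grid. The experiment uses 224-by-224 grids with mean −100, variance 100, and length scale 22.4; at that size a direct Cholesky factorization like the one below is impractical and a more scalable GP sampler would be needed. Names, demo parameters, and the jitter term are illustrative.

```python
import numpy as np

def sample_gp_masks(n_masks, size, mean, var, length_scale, rng):
    """Sample sigmoid-squashed GP masks on a size x size grid
    (small-grid sketch of the procedure described in the text)."""
    xy = np.stack(np.meshgrid(np.arange(size), np.arange(size)), -1).reshape(-1, 2)
    d2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(-1)      # squared distances
    K = var * np.exp(-0.5 * d2 / length_scale ** 2)            # RBF kernel
    K += 1e-6 * np.eye(size * size)                            # jitter for stability
    L = np.linalg.cholesky(K)
    f = mean + rng.standard_normal((n_masks, size * size)) @ L.T
    return (1.0 / (1.0 + np.exp(-f))).reshape(n_masks, size, size)

def expected_saliency(masks, scores):
    """Eq. (6): average the masks weighted by the model's predictive
    probability g_L(y | d, m_i) for each masked image."""
    w = np.asarray(scores, dtype=float)
    return np.tensordot(w, masks, axes=1) / w.sum()
```
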
If d is a target image, the y used to generate the saliency map is the ResNet-50-PLDA model's prediction. If d is an example, the y is the category from which the example is sampled.
In the [jet] conditions, the saliency maps are rendered with the Matplotlib "jet" colormap at an alpha value of 0.4 and overlaid on the images (see Fig. 2, bottom row). In the [blur] conditions, a saliency map is rendered by blurring the image for which it is generated (Fig. 2, middle row). To generate the blur, each pixel value of a saliency map, z, is assigned a blurring window width, w(z) = ceil(30/(1 + exp(20z − 10))). The jth pixel value of the rendered saliency map is the average pixel value of a patch of the original image, where the patch is w-by-w in size and centered on the jth pixel of the original image. If the jth pixel is close to an edge of the image, the patch becomes rectangular, and the average is taken over whichever pixel values fall inside the w-by-w window.
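The blur rendering follows directly from the window-width formula above. The sketch below uses a grayscale image and a naive double loop for clarity; names are illustrative.

```python
import numpy as np

def blur_window(z):
    """Window width w(z) = ceil(30 / (1 + exp(20 z - 10))) from the text:
    w = 30 for unimportant pixels (z near 0), w = 1 for salient ones."""
    return int(np.ceil(30.0 / (1.0 + np.exp(20.0 * z - 10.0))))

def render_blur(image, saliency):
    """Box-blur each pixel with a window whose width shrinks as saliency
    grows; windows are clipped at the image borders (grayscale sketch)."""
    H, W = image.shape
    out = np.empty_like(image, dtype=float)
    for i in range(H):
        for j in range(W):
            r = blur_window(saliency[i, j]) // 2
            patch = image[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            out[i, j] = patch.mean()
    return out
```
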
To conclude this subsection, we make a few final remarks. First, a PLDA layer is unnecessary in the generation of saliency maps because the ResNet-50 model is capable of generating the probabilities g_L(y | d, m) in Eq. (5). In contrast, the ResNet-50 model cannot be used directly to generate the probabilities f_L(y* | τ_y*, τ_y′, d*) in Eq. (2). Second, while the 2AFC task may be suitable for generating a saliency map for the target image, it cannot be used to generate saliency maps for the examples. This is the main reason we used, as the inference task, the 1000-category image classification task that the ResNet-50 model is trained on. Lastly, Eq. (6) is the same as Eq. (5) in the RISE approach introduced by Petsiuk, Das, and Saenko 42 , a state-of-the-art method for generating saliency maps. Our implementation and theirs differ only in the way the individual masks are sampled. In our implementation, we sampled functions from a GP prior and turned them into masks by applying a sigmoid function. In 42 , random binary matrices are first sampled and subsequently up-sampled to the desired mask size through bilinear interpolation. The expectation is computed in the same way.
Familiarity coding. In addition to the splits by condition presented in Table 1, the analyses also rely on scores of human familiarity with the image categories. The familiarity of each of the [helpful] and [random] trials was manually coded by 7 raters. Each rater was asked to code a trial as "familiar" if they thought they could correctly match the category labels to the images presented in that trial, and "unfamiliar" otherwise. A familiarity score for each pairing of categories was then constructed by coding each rater's judgement as 1 for familiar and 0 for unfamiliar and computing the mean across raters. The 300 trials across the [helpful] and [random] conditions resulted in 167 unique category pairings (counting the ordering of target versus other category), and their familiarity scores are presented in Supplementary Table T2.
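The familiarity score is simply the mean of the binary codes across raters; a one-line sketch:

```python
def familiarity_score(ratings):
    """Mean of binary familiar (1) / unfamiliar (0) codes across raters."""
    return sum(ratings) / len(ratings)
```

For example, five of seven raters coding a trial as familiar yields a score of 5/7.
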
Statistical analysis. Whenever we report how well participants predict the model's classifications (fidelity), or how often their judgements correspond to the image ground truth (accuracy), we used hierarchical logistic regressions with random intercepts per participant and fixed effects for the remaining terms. For analyses of sensitivity and specificity, we used the same logistic regression framework but included only trials corresponding to true positives and false negatives, or to true negatives and false positives, respectively. Sensitivity captures how well participants predict trials on which the AI is correct, and specificity captures how well participants predict trials on which the model is wrong.
To illustrate, the analyses in the "Bayesian teaching improves fidelity" section used the following model on the full set of trials, and on subsets of the trials to capture sensitivity and specificity, respectively:

$$\mathrm{logit}\, P(Y_{ij} = 1) = \beta_0 + u_j + \beta_1\, \mathrm{ExplanationCondition}_{ij},$$

where Y_{ij}, the agreement between a participant's choice and the AI classifier's choice, is a binary variable coded as 1 when the participant correctly predicts the AI classification and 0 otherwise; i is the observation index; j is the participant index; u_j is the random intercept for participant j; and ExplanationCondition is a binary dummy variable coded as 1 if participants experienced heatmaps and helpful examples and 0 if they did not experience any explanations.
For the "Participants prefer helpful examples" section, we compared three hierarchical logistic models: (A) an intercept-only model that treated intercepts as nested within participants; (B) an intercept-only model that treated intercepts as nested within participants and conditions; and (C) model (B) with an added fixed effect for the familiarity score. We then compared the negative log-likelihoods of these models to determine which best accounted for the observed data.
We evaluated whether Bayesian teaching can lead participants to both correct and incorrect inferences by predicting fidelity in the conditions containing examples with three nested models: a model with a fixed effect of SimExplaineeFidelity; a model adding fixed effects of CategoryAccuracy and ModelCorrectness; and a model adding the interaction between SimExplaineeFidelity and ModelCorrectness. Here, SimExplaineeFidelity is the expected probability that the participant picks the same response as the target model, conditional on seeing the examples; CategoryAccuracy is the average classification accuracy of the target ResNet-50 model for the target category; and ModelCorrectness is a dummy variable coding whether ResNet made a correct classification on the particular trial. We then compared the negative log-likelihoods of these three models and reported the coefficients of the best-fitting model (the interaction model).
In the "Bayesian teaching improves fidelity through belief-mitigation" section, we fitted four hierarchical logistic regression models to the full data. These models shared the following form:

$$\mathrm{logit}\, P(Y_{ij} = 1) = \beta_0 + u_j + \beta_1\, \mathrm{FamiliarityScore}_{ij} + \beta_2\, \mathrm{Examples}_{ij} + \beta_3\, \mathrm{Map}_{ij} + \beta_4\, \mathrm{Labels}_{ij},$$

where FamiliarityScore is the proportion of raters who rated the trial categories as familiar, and Examples, Map, and Labels were dummy variables that captured whether examples were shown, whether heatmaps were shown, and whether category labels were informative, respectively.
These four models were distinguished by whether the AI was correct or not and by whether Y corresponded to the participant's judgement matching the ground truth or matching the AI's judgement. We fitted similar models to the [examples] trials only, with the only difference being that the Examples term, which previously captured whether examples were present, was replaced with a dummy variable that captured whether the examples presented were helpful or not. Finally, we fitted two more models predicting fidelity from the full data; these are similar to the models above but add two additional interaction terms. Coefficient tables for these models can be found in Supplementary Table T3. All hierarchical logistic regression models were fitted using the lme4 package (1.1-23) 59 .

Data availability
Raw experimental data and analysis code will be deposited at https:// github. com/ CoDaS-Lab/ XAI-BT-SR upon publication.