Due to their strong performance, machine learning (ML) models increasingly make consequential decisions in several critical domains, such as healthcare, finance and law. However, state-of-the-art ML models, such as deep neural networks, have become more complex and hard to understand. This dynamic poses challenges in real-world applications for model stakeholders who need to understand why models make predictions and whether to trust them. Consequently, practitioners have often turned to inherently interpretable ML models for these applications, including decision lists and sets1,2 and generalized additive models3,4,5, which people can more easily understand. Nevertheless, black-box models are often more flexible and accurate, motivating the development of post hoc explanations that explain the predictions of trained ML models. These explainability techniques either fit faithful models in the local region around a prediction or inspect internal model details, such as gradients, to explain predictions6,7,8,9,10,11.

Yet, recent work suggests that practitioners often have difficulty using explainability techniques12,13,14,15. These challenges stem from difficulty in deciding which explanations to implement, interpreting the explanations and answering follow-up questions beyond the initial explanation. In the past, researchers have proposed several point-and-click dashboard techniques to help overcome these issues, such as the Language Interpretability Tool16, which is designed to understand natural language processing models, and the What-If Tool17, a tool aimed at performing counterfactual analyses for models. However, these methods still require a high level of expertise, because users must know which explanations to run, and lack the flexibility to support arbitrary follow-up questions that users might have. Overall, understanding ML models through simple and intuitive interactions is a key bottleneck in adoption across many applications.

Natural language dialogues are a promising solution for supporting broad and accessible interactions with ML models due to their ease of use, capacity and support for continuous discussion. However, designing a dialogue system that enables a satisfying model understanding experience introduces several challenges. First, the system must handle many conversation topics about the model and data while facilitating natural conversation flow18. For instance, these topics may include explainability questions like the most important features for predictions and general questions such as data statistics or model errors. Further, the system must work for various model classes and data, and it should understand language usage across different settings19. For example, participants will use different terminology in conversations about loan prediction than disease diagnosis. Last, the dialogue system should generate accurate responses that address the users’ core questions20,21. In the literature, researchers have suggested some prototype designs for generating explanations using natural language. However, these initial designs address specific explanations and model classes, limiting their applicability in general conversational explainability settings22,23.

In this Article, we address these challenges by introducing TalkToModel: a system that enables open-ended natural language dialogues for understanding ML models for any tabular dataset and classifier (an overview of TalkToModel is provided in Fig. 1). Users can have discussions with TalkToModel about why predictions occur, how the predictions would change if the data change and how to flip predictions, among many other conversation topics (an example conversation is provided in Fig. 2). Further, they can perform these analyses on any group in the data, such as a single instance or a specific group of instances. For example, on a disease prediction task, users can ask ‘How important is BMI for the predictions?’ or ‘So how would decreasing the glucose levels by 10 change the likelihood of men older than 20 having the disease?’. TalkToModel will respond by describing how, for instance, BMI is the most important feature for predictions, and decreasing glucose will decrease the chance of diabetes by 20%. From there, users can engage further in the conversation by asking follow-up questions. Conversations with TalkToModel make model explainability straightforward because users can talk with the system in natural language about the model, and the system will generate useful responses.

Fig. 1: Overview of TalkToModel.

Instead of writing code, users have conversations with TalkToModel as follows. (1) Users supply natural language inputs. (2) The dialogue engine parses the input into an executable representation. (3) The execution engine runs the operations and the dialogue engine uses the results in its response.

Fig. 2: A conversation with TalkToModel.

A conversation about diabetes prediction, demonstrating the breadth of different conversation points the system can discuss.

To support such rich conversations with TalkToModel, we introduce techniques for both language understanding and model explainability. First, we propose a dialogue engine that parses user text inputs (referred to as user utterances) into a structured query language-like programming language using a large language model (LLM). The LLM performs the parsing by treating the task of translating user utterances into the programming language as a seq2seq learning problem, where the user utterances are the source and parses in the programming language are the targets24. In addition, the TalkToModel language combines operations for explanations, ML error analyses, data manipulation and descriptive text into a single language capable of representing a wide variety of potential conversation topics and most model explainability needs (an overview of the different operations is provided in Fig. 3). To support the system adapting to any dataset and model, we introduce lightweight adaptation techniques to fine-tune LLMs to perform the parsing, enabling strong generalization to new settings. Second, we introduce an execution engine that runs the operations in each parse. To reduce the burden of users deciding which explanations to run, we introduce methods that automatically select explanations for the user. In particular, this engine runs many explanations, compares their fidelities and selects the most accurate ones. Finally, we construct a text interface where users can engage in open-ended dialogues using the system, enabling anyone, including those with minimal technical skills, to understand ML models.

Fig. 3: Overview of the operations supported by TalkToModel.

The operations are incorporated into the conversation to generate responses. Note, Conv. refers to Conversation operations.


In this section, we demonstrate that TalkToModel accurately understands users in conversations by evaluating its language understanding capabilities on ground-truth data. Next, we evaluate the effectiveness of TalkToModel for model understanding by performing a real-world human study on healthcare workers (for example, doctors and nurses) and ML practitioners, where we benchmark TalkToModel against existing explainability systems. We find users both prefer and are more effective using TalkToModel than traditional point-and-click explainability systems, demonstrating its effectiveness for understanding ML models.

Language understanding

Here we quantitatively assess the language understanding capabilities of TalkToModel by creating gold parse datasets and evaluating the system’s accuracy on these data.

Gold parse collection

We construct gold datasets (that is, ground-truth (utterance, parse) pairs) across multiple datasets to evaluate the language understanding performance of our models. To construct these gold datasets, we adopt an approach inspired by ref. 25, which constructs a similar dataset for multitask semantic parsing.

Our gold dataset-generation process is as follows. First, we write 50 (utterance, parse) pairs for the particular task (that is, loan or diabetes prediction). These utterances range from simple (‘How likely are people in the data to have diabetes?’) to complex (‘If these people were not unemployed, what’s the likelihood they are good credit risk? Why?’). We include each operation (Fig. 3) at least twice in the parses, to make sure that there is good coverage. From there, we ask Mechanical Turk workers to rewrite the utterances while preserving their semantic meaning to ensure that the ground-truth parse for the revised utterance is the same but the phrasing differs. We ask workers to rewrite each pair 8 times for a total of 400 (utterance, parse) pairs per task. Next, we filter out low-quality Mechanical Turk revisions. We ask the crowd-sourced workers to rate the similarity between the original utterance and revised utterance on a scale of 1 to 4, where 4 indicates that the utterances have the same meaning and 1 indicates that they do not have the same meaning. We collect 5 ratings per revision and remove (utterance, parse) pairs that score below 3.0 on average. Finally, we perform an additional filtering step to ensure data quality by inspecting the remaining pairs ourselves and removing any bad revisions.
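The rating-based filter described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `filter_revisions`, the dictionary layout and the example utterances and ratings are all hypothetical; only the rule itself (five ratings per revision, drop anything averaging below 3.0) comes from the text.

```python
from statistics import mean


def filter_revisions(revisions, min_mean_rating=3.0):
    """Keep revisions whose mean similarity rating is at least the threshold.

    Each revision is a dict with an "utterance", a "parse" and a list of
    "ratings" on the 1-4 similarity scale (hypothetical layout).
    """
    kept = []
    for revision in revisions:
        if mean(revision["ratings"]) >= min_mean_rating:
            kept.append(revision)
    return kept


# Made-up example revisions, for illustration only.
example_revisions = [
    {"utterance": "What share of people in the data likely have diabetes?",
     "parse": "<gold parse>", "ratings": [4, 4, 3, 4, 4]},   # mean 3.8, kept
    {"utterance": "Is diabetes common in general?",
     "parse": "<gold parse>", "ratings": [1, 2, 1, 2, 1]},   # mean 1.4, removed
    {"utterance": "How probable is diabetes for people in the dataset?",
     "parse": "<gold parse>", "ratings": [3, 3, 3, 3, 3]},   # mean 3.0, kept
]

kept_revisions = filter_revisions(example_revisions)
```

A revision that averages exactly 3.0 is kept here, because the text removes only pairs that score below 3.0.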

As we want to evaluate TalkToModel’s capacity to generalize across different scenarios, we perform this data collection process across three different tasks: Pima Indian Diabetes Dataset26, German credit dataset26 and the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) recidivism dataset27. After collecting revisions and ensuring quality, we are left with 200 pairs for the German credit dataset, 190 for the diabetes dataset and 146 for the COMPAS dataset.


We compare two strategies for using pre-trained LLMs to parse user utterances into the grammar: (1) few-shot GPT-J28 and GPT-3.5 models29 and (2) fine-tuned T530. The GPT-J and GPT-3.5 models are higher capacity and more amenable to in-context learning. This procedure prepends examples of the input and target from the training data to the test instance29,31,32. In contrast, the T5 models require traditional fine-tuning on the input and target pairs. Consequently, the few-shot approach is quicker to set up because it does not require fine-tuning, making it easier for users to get started with the system. However, the fine-tuned T5 leads to improved performance and a better user experience overall while taking longer to set up. We expect that fine-tuned T5 leads to improved performance overall because it has access to all the training data, whereas the few-shot models are limited by the context window size. To train these models through fine-tuning or prompting, we generate synthetic (utterance, parse) pairs because it is impractical to assume that we can collect ground-truth pairs for every new task for which we wish to use TalkToModel. We provide additional training details in Methods.

We evaluate both fine-tuned T5 models and few-shot models on the testing data. We additionally implement a naive nearest-neighbours baseline, where we select the closest user utterance in the synthetic training set according to cosine distance of all-mpnet-base-v2 sentence embeddings and return the corresponding parse33. For the GPT-J models, we compare N-shot performance, where N is the number of (utterance, parse) pairs from the synthetically generated training sets included in the prompt, and sweep over a range of N for each model. For the larger models, we have to use relatively smaller N for inference to fit on a single 48 GB graphics processing unit.

When parsing the utterances, one issue is that the LLMs’ generations are unconstrained and may produce parses outside the grammar, resulting in the system failing to run the parse. To ensure the generations are grammatical, we constrain the decodings to be in the grammar by recompiling the grammar at inference time into an equivalent grammar consisting of the tokens in the LLM’s vocabulary34. While decoding from the LLM, we fix the likelihood of ungrammatical tokens to 0 at every generation step. Because the GPT-3.5 model must be called through an application programming interface, which does not support guided decoding, we decode greedily with temperature set to one.
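The idea of fixing the likelihood of ungrammatical tokens to 0 at each step can be illustrated with a toy decoder. The sketch below makes several simplifying assumptions that are not in the text: the "grammar" is just an enumerated list of valid word-level parses (the paper instead recompiles the grammar over the LLM's own token vocabulary), and `toy_score_fn` is a hypothetical stand-in for an LLM that deliberately prefers an ungrammatical token.

```python
import math

# Hypothetical toy grammar: the complete set of valid parses, as token tuples.
VALID_PARSES = [
    ("filter", "age", "greater", "20", "and", "predict", "<eos>"),
    ("filter", "age", "greater", "20", "and", "explain", "<eos>"),
    ("predict", "<eos>"),
]
VOCAB = ["filter", "age", "greater", "20", "and",
         "predict", "explain", "hello", "<eos>"]


def allowed_next_tokens(prefix):
    """Tokens that keep `prefix` a prefix of at least one valid parse."""
    n = len(prefix)
    return {p[n] for p in VALID_PARSES if len(p) > n and p[:n] == tuple(prefix)}


def toy_score_fn(prefix):
    """Stand-in for an LLM: log-probabilities that favour the bad token 'hello'."""
    prefs = {"hello": 0.5, "explain": 0.2, "filter": 0.1}
    rest = (1.0 - sum(prefs.values())) / (len(VOCAB) - len(prefs))
    return {tok: math.log(prefs.get(tok, rest)) for tok in VOCAB}


def constrained_greedy_decode(score_fn, max_steps=20):
    """Greedy decoding where ungrammatical tokens get probability 0 (log-prob -inf)."""
    prefix = []
    for _ in range(max_steps):
        allowed = allowed_next_tokens(prefix)
        if not allowed:
            break
        scores = score_fn(prefix)
        masked = {tok: (lp if tok in allowed else -math.inf)
                  for tok, lp in scores.items()}
        best = max(masked, key=masked.get)
        prefix.append(best)
        if best == "<eos>":
            break
    return prefix


decoded = constrained_greedy_decode(toy_score_fn)
```

Even though the toy model always ranks `hello` highest, the mask forces every step onto a token the grammar permits, so the output is guaranteed to be executable.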

Evaluating the parsing accuracy

To evaluate performance on the datasets, we use the exact match parsing accuracy25,35,36. This metric measures whether the parse exactly matches the gold parse in the dataset. We perform the evaluation on two splits of each gold parse dataset, in addition to the overall dataset. These splits are the independent and identically distributed (IID) and compositional splits. The IID split contains (utterance, parse) pairs where the parse’s operations and their structure (but not necessarily the arguments) are in the training data. The compositional split consists of the remaining parses that are not in the training data. Because language models struggle with compositional generalization, this split is generally much harder for language models to parse37,38.
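A minimal sketch of the metric and the two splits follows. The parse syntax, the `OPERATIONS` set and the rule for abstracting arguments (every token that is not a known operation becomes a placeholder) are assumptions made for illustration; the text specifies only that the IID split shares operations and structure, but not necessarily arguments, with the training data.

```python
# Hypothetical operation vocabulary; all other tokens are treated as arguments.
OPERATIONS = {"filter", "and", "predict", "explain", "greater", "less"}


def structure(parse):
    """Abstract a parse to its operations and layout, masking arguments."""
    return tuple(tok if tok in OPERATIONS else "<arg>" for tok in parse.split())


def exact_match_accuracy(predictions, golds):
    """Percentage of predicted parses that exactly match the gold parse."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and aligned")
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


def split_iid_compositional(gold_pairs, train_parses):
    """Split gold (utterance, parse) pairs by whether their structure was seen in training."""
    seen = {structure(p) for p in train_parses}
    iid = [(u, p) for u, p in gold_pairs if structure(p) in seen]
    comp = [(u, p) for u, p in gold_pairs if structure(p) not in seen]
    return iid, comp


# Made-up data for illustration.
train_parses = ["filter age greater 30 and predict"]
gold_pairs = [
    ("what do you predict for bmi over 25?", "filter bmi greater 25 and predict"),
    ("why, for people younger than 40?", "filter age less 40 and explain"),
]
iid_split, comp_split = split_iid_compositional(gold_pairs, train_parses)

predictions = ["filter bmi greater 25 and predict", "filter age less 40 and predict"]
accuracy = exact_match_accuracy(predictions, [p for _, p in gold_pairs])
```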


We present the results in Table 1. T5 performs better overall than the few-shot GPT-J and GPT-3.5 models. Notably, the T5 small model performs better than the GPT-J 6B model, which has two orders of magnitude more parameters. While the few-shot models underperform the fine-tuned T5 models overall, GPT-3.5 is the best-performing few-shot model and performs considerably better than the GPT-J models, particularly in the compositional split. Overall, these results suggest using fine-tuned T5 for the best results, and we use T5 large in our human studies.

Table 1 Exact match parsing accuracy (%) for the three gold datasets, on the IID and compositional splits, and overall

Utility of explainability dialogues

The results in the previous section show that TalkToModel understands user intentions to a high degree of accuracy. In this section, we evaluate how well the end-to-end system helps users understand ML models compared with current explainability systems.

Study overview

We compare TalkToModel against ‘explainerdashboard’, one of the most popular open-source explainability dashboards39. This dashboard has similar functionality to TalkToModel in that it provides an accessible way to compute explanations and perform model analyses. Thus, it is a reasonable baseline. We perform this comparison using the diabetes dataset and a gradient-boosted tree trained on the data40. To compare both systems in a controlled manner, we ask participants to answer general ML questions with TalkToModel and the dashboard. Each question is about basic explainability and model analysis, and participants answer using multiple choice, where one of the options is ‘Could not determine’ if they cannot figure out the answer (although it is straightforward to answer all the questions with both interfaces). For example, questions are about comparing feature importances (‘Is glucose more important than age for the model’s predictions for data point 49?’) or model predictions (‘How many people are predicted not to have diabetes but do not actually have it?’). Participants answer ten questions in total. We divide the ten questions into two blocks of five questions each. Both blocks have similar questions but different values to control for memorization (the exact questions are given in Supplementary Section A). Participants use TalkToModel to answer one block of questions and the dashboard for the other block. In addition, we provide a tutorial on how to use each system before showing users the questions for that system. Last, we randomize question, block and interface order to control for biases due to showing interfaces or questions first.


Following previous work on evaluating human and ML coordination and trust, we assessed several metrics to evaluate user experiences41,42,43. We evaluated the following statements along the 1–7 Likert scale at the end of the survey:

  • Easiness: I found the conversational interface easier to use than the dashboard interface

  • Confidence: I was more confident in my answers using the conversational interface than the dashboard interface

  • Speed: I felt that I was able to more rapidly arrive at an answer using the conversational interface than the dashboard interface

  • Likeliness to use: based on my experience so far with both interfaces, I would be more likely to use the conversational interface than the dashboard interface in the future

To control for bias associated with the ordering of the terms conversational interface and dashboard interface, we randomized their ordering. We also measured accuracy and time taken to answer each question. Last, we asked participants to write a short description comparing their experience with both interfaces to capture participants’ qualitative feedback about both systems.


As TalkToModel provides an accessible way to understand ML models, we expect it to be useful for subject-matter experts with a variety of experience in ML, including users without any ML experience. As such, we recruited 45 English-speaking healthcare workers with minimal or no ML expertise to take the survey using the Prolific service44. This group comprises a range of healthcare workers, including doctors, pharmacists, dentists, psychiatrists, healthcare project managers and medical scribes. The vast majority of this group (43) stated they had either no experience with ML or had heard about it from reading articles online, while two members indicated they had experience equivalent to an undergraduate course in ML. As another point of comparison, we recruited ML professionals with relatively higher ML expertise from ML Slack channels and email lists. We received 13 potential participants, all of whom had graduate-course-level ML experience or higher, and included all of them in the study. We received institutional review board approval for this study from the University of California, Irvine, and informed consent from participants.

Metric results

A substantial majority of healthcare workers agreed that they preferred TalkToModel in all the categories we evaluated (Table 2). The same is true for the ML professionals, save for whether they were more likely to use TalkToModel in the future, where 53.8% of participants agreed they would instead use TalkToModel in the future. In addition, participants’ subjective notions around how quickly they could use TalkToModel aligned with their actual speed of use, and both groups arrived at answers using TalkToModel significantly quicker than using the dashboard. The median question answer time (measured as the total time taken from seeing the question to submitting the answer) using TalkToModel was 76.3 s, while it was 158.8 s using the dashboard.

Table 2 User study results for respondents that agree TalkToModel is better than the dashboard

Participants were also much more accurate and completed questions at a higher rate (that is, they did not mark ‘Could not determine’) using TalkToModel (Table 3). While both healthcare workers and ML practitioners clicked ‘Could not determine’ for a quarter of the questions using the dashboard, this was true for 13.8% of healthcare workers and 6.1% of ML professionals using TalkToModel, demonstrating the usefulness of the conversational interface. On completed questions, both groups were much more accurate using TalkToModel than the dashboard. Most surprisingly, although ML professionals agreed that they preferred TalkToModel only about half the time, they answered all the questions correctly using it, while they only answered 62.5% of questions correctly with the dashboard. Finally, we observed that TalkToModel’s conversational capabilities were highly effective. There were only 6 utterances out of over 1,000 total utterances that the conversational aspect of the system failed to resolve. These failure cases generally involved certain discourse aspects like asking for additional elaboration (‘more description’).

Table 3 User study results for completion rate and accuracy across interfaces and participant groups

The largest source of errors for participants using the explainability dashboard was two questions concerning the most important features for individual predictions. The errors for these questions account for 47.4% of healthcare workers’ and 44.4% of ML professionals’ total mistakes. Solving these tasks with the dashboard requires users to perform multiple steps, including choosing the feature importance tab in the dashboard, while the streamlined text interface of TalkToModel made it much simpler to solve these tasks.

Qualitative results

For the qualitative user feedback, we provide representative quotes from similar themes in the responses. Users expressed that they could more rapidly and easily arrive at results, which could be helpful for their professions.


“I prefer the conversational interface because it helps arrive at the answer very quickly. This is very useful especially in the hospital setting where you have hundreds of patients getting check ups and screenings for diabetes because it is efficient and you can work with medical students on using the system to help patient outcomes.” P39, medical worker at a tertiary hospital.

Participants also commented on the user friendliness of TalkToModel and its strong conversational capabilities, stating, “the conversational [interface] was straight to the point, way easier to use” (P35 nurse) and that “the conversational interface is hands-down much easier to use… it feels like one is talking to a human” (P45 ML professional). We did not find any negative feedback surrounding the conversational capabilities of the system. Overall, users expressed strong positive sentiment about TalkToModel due to the quality of conversations, presentation of information, accessibility and speed of use.

Several ML professionals brought up points that could serve as future research directions. Notably, participants stated that they would prefer to look at the data themselves rather than rely on an interface that rapidly provides an answer.


“I would almost always rather look at the data myself and come to a conclusion than getting an answer within seconds.” P11, ML professional.

In the future, it would be worthwhile including visualizations of raw data and analyses performed by the system to increase trust with expert users, such as ML professionals, who may be sceptical of the high-level answers provided by the system currently.


With ML models becoming increasingly complex, there is a need to develop techniques to explain model predictions to stakeholders. Nevertheless, it is often the case that practitioners struggle to use explanations and frequently have many follow-up questions they wish to answer. In this work, we show that TalkToModel makes explainable AI accessible to users from a range of backgrounds by using natural language conversations. Our experiments demonstrate that TalkToModel comprehends users with a high degree of accuracy and can help users understand the predictions of ML models much better than existing systems can. In particular, we showed that TalkToModel is a highly effective way for domain experts such as healthcare workers to understand ML models, like those applied to disease diagnosis. Lastly, we designed TalkToModel to be highly extensible and released the code, data and a demo for the system at, making it straightforward for users and researchers to build on the system. In the future, it will be helpful to investigate applications of TalkToModel ‘in the wild’, such as in doctors’ offices, laboratories or professional settings, where model stakeholders could use the system to understand their models.


In this section, we describe the components of TalkToModel. First, we introduce the dialogue engine and discuss how it understands user inputs, maps them to operations and generates text responses based on the results of running the operations. Second, we describe the execution engine, which runs the operations. Finally, we provide an overview of the interface and the extensibility of TalkToModel.

Text understanding

To understand the intent behind user utterances, the system learns to translate or parse them into logical forms. These parses represent the intentions behind user utterances in a highly expressive and structured programming language TalkToModel executes.

Compared with dialogue systems that execute specific tasks by modifying representations of the internal state of the conversation45,46, our parsing-based approach allows for more flexibility in the conversations, supporting open-ended discovery, which is critical for model understanding. Also, this strategy produces a structured representation of user utterances instead of open-ended systems that generate unstructured free text47. Having this structured representation of user inputs is key for our setting where we need to execute specific operations depending on the user’s input, which would not be straightforward with unstructured text.

TalkToModel performs the following steps to accomplish this: (1) the system constructs a grammar for the user-provided dataset and model, which defines the set of acceptable parses; (2) TalkToModel generates (utterance, parse) pairs for the dataset and model; (3) the system fine-tunes an LLM to translate user utterances into parses; and (4) the system responds conversationally to users by composing the results of the executed parse into a response that provides context for the results and opportunities to follow up.


To represent the intentions behind the user utterances in a structured form, TalkToModel relies on a grammar, defining a domain-specific language for model understanding. While the user utterances themselves will be highly diverse, the grammar creates a way to express user utterances in a structured yet highly expressive fashion that the system can reliably execute. Compared with approaches that treat determining user intentions in conversations as a classification problem48,49, using a grammar enables the system to express compositions of operations and arguments that take on many different values, such as real numbers, that would otherwise be combinatorially impossible in a prediction setting. Instead, TalkToModel translates user utterances into this grammar in a seq2seq fashion, overcoming these challenges24. This grammar consists of production rules that include the operations the system can run (an overview is provided in Fig. 3), the acceptable arguments for each operation and the relations between operations. One complication is that user-provided datasets have different feature names and values, making it hard to define one shared grammar between datasets. Instead, we update the grammar based on the feature names and values in a new dataset. For instance, if a dataset contained only the feature names ‘age’ and ‘income’, these two names would be the only acceptable values for the feature argument in the grammar.
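The dataset-specific grammar update can be pictured as filling in one production rule from the dataset schema. The sketch below is an assumption-laden illustration: the rule names, the handful of operations in `GRAMMAR_TEMPLATE` and the dictionary encoding of production rules are hypothetical and far smaller than the actual TalkToModel grammar. Only the behaviour of the feature rule follows the text.

```python
# Hypothetical dataset-independent production rules.
# Upper-case names are nonterminals; double-quoted strings are terminals.
GRAMMAR_TEMPLATE = {
    "PARSE": [["FILTER", '"and"', "OPERATION"], ["OPERATION"]],
    "FILTER": [['"filter"', "FEATURE", "COMPARISON", "NUMBER"]],
    "OPERATION": [['"predict"'], ['"explain"'], ['"important"', "FEATURE"]],
    "COMPARISON": [['"greater"'], ['"less"'], ['"equal"']],
    "NUMBER": [["<number>"]],  # placeholder for a numeric-literal rule
}


def build_grammar(feature_names):
    """Return a copy of the template with FEATURE limited to the dataset's features."""
    if not feature_names:
        raise ValueError("the dataset must have at least one feature")
    grammar = {lhs: [list(alt) for alt in alts]
               for lhs, alts in GRAMMAR_TEMPLATE.items()}
    grammar["FEATURE"] = [[f'"{name}"'] for name in feature_names]
    return grammar


# For a dataset with only 'age' and 'income', these become the only
# acceptable values of the feature argument.
grammar = build_grammar(["age", "income"])
```

Keeping the operation rules fixed and regenerating only the argument rules is what lets one grammar definition serve many datasets.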

To ensure that our grammar provides sufficient coverage for explainable artificial intelligence (XAI) questions, we verify our grammar supports the questions from the XAI question bank. This question bank was introduced in ref. 50 based on interviews with AI product designers and includes 31 core, prototypical questions XAI systems should answer, excluding socio-technical questions beyond the scope of TalkToModel (for example, ‘What are the results of other people using the [model]’). The prototypical questions address topics such as the input/data to the model (‘What is the distribution of a given feature?’), model output (‘What kind of output does the system give?’), model performance (‘How accurate are the predictions?’), global model behaviour (‘What is the systems overall logic?’), why/why not the system makes individual predictions (‘Why is this instance given this prediction?’) and what-if or counterfactual questions (‘What would the system predict if this instance changes to…?’). To evaluate how well TalkToModel covers these questions, we review each question and evaluate whether our grammar can parse it. We find that our grammar supports 30 out of 31 of the prototypical questions. We provide a table of each question and corresponding parse in Supplementary Tables 6 and 7. Overall, the grammar covers the vast majority of XAI-related questions and therefore has good coverage of XAI topics.

Supporting context in dialogues

User conversations with TalkToModel naturally include complex conversational phenomena such as anaphora and ellipsis51,52,53. That is, conversations refer back to events earlier in the conversation (‘What do you predict for them?’) or omit information that must be inferred from conversation (‘Now show me for people predicted incorrectly’). However, current language models parse only a single input, making it hard to apply them in settings where the context is important. To support context in the dialogues, TalkToModel introduces a set of operations in the grammar that determine the context for user utterances. In contrast with approaches that maintain the conversation state using neural representations45,54, grammar operations allow for much more trustworthy and dependable behaviour while still fostering rich interactions, which is critical for high-stakes settings, and similar mechanisms for incorporating grammar predicates across turns have been shown to achieve strong results53. In particular, we leverage two operations: previous filter and previous operation, which look back in the conversation to find the last filter and last operation, respectively. These operations also act recursively. Therefore, if the last filter is a previous filter operation, TalkToModel will recursively call previous filter to resolve the entire stack of filters. As a result, TalkToModel is capable of addressing instances of anaphora and ellipsis by using these operations to resolve the entity via co-reference or infer it from the previous conversation history. This dynamic enables users to have complex and natural conversations with TalkToModel.
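The recursive look-back performed by previous filter and previous operation can be sketched as follows. The history layout (one dict per turn), the marker strings and the function name `resolve_previous` are hypothetical. The sketch also simplifies by returning the first concrete filter it reaches, rather than modelling how the real system composes a stack of filters.

```python
PREVIOUS_FILTER = "previous_filter"        # hypothetical marker strings
PREVIOUS_OPERATION = "previous_operation"


def resolve_previous(history, key, marker, turn_index=None):
    """Walk back through `history` to find the last concrete value for `key`.

    If the value found is itself the look-back marker, recurse from that
    turn so that chains of references resolve to a concrete value.
    """
    if turn_index is None:
        turn_index = len(history)
    for i in range(turn_index - 1, -1, -1):
        value = history[i].get(key)
        if value is None:
            continue
        if value == marker:
            return resolve_previous(history, key, marker, i)
        return value
    return None  # nothing concrete earlier in the conversation


# Made-up three-turn conversation, for illustration.
history = [
    {"filter": "age greater 20", "operation": "predict"},           # explicit group
    {"filter": PREVIOUS_FILTER, "operation": "explain"},            # "Why, for them?"
    {"filter": PREVIOUS_FILTER, "operation": PREVIOUS_OPERATION},   # "And again?"
]

resolved_filter = resolve_previous(history, "filter", PREVIOUS_FILTER)
resolved_operation = resolve_previous(history, "operation", PREVIOUS_OPERATION)
```

Because resolution is a deterministic walk over stored parses rather than a learned state, the same history always yields the same referent.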

Parsing dataset generation

To parse user utterances into the grammar, we fine-tune an LLM to translate utterances into the grammar in a seq2seq fashion. We use LLMs because these models have been trained on large amounts of text data and are solid priors for language understanding tasks. Thus, they can better understand diverse user inputs than training from scratch, improving the user experience. Further, we automate the fine-tuning of an LLM to parse user utterances into the grammar by generating a training dataset of (utterance, parse) pairs. Compared with dataset-generation methods that use human annotators to generate and label datasets for training conversation models55,56, this approach is much less costly and time consuming, while still being highly effective, and supports users getting conversations running very quickly. This strategy consists of writing an initial set of user utterances and parses, where parts of the utterances and parses are wildcard terms. TalkToModel enumerates the wildcards with aspects of a user-provided dataset, such as the feature names, to generate a training dataset. Depending on the user-provided dataset schema, TalkToModel typically generates anywhere from 20,000 to 40,000 pairs. Last, we have already written the initial set of utterances and parses, so users only need to provide their dataset to set up a conversation.
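The wildcard enumeration step can be illustrated with two toy templates. The template strings, the `{feature}` and `{value}` wildcard names and the parse syntax are hypothetical stand-ins for the authors' template set; the sketch shows only the mechanism of expanding wildcards from a dataset schema.

```python
# Hypothetical (utterance template, parse template) pairs with wildcards.
TEMPLATES = [
    ("how important is {feature} for the predictions?",
     "important {feature}"),
    ("what do you predict for people with {feature} above {value}?",
     "filter {feature} greater {value} and predict"),
]


def generate_pairs(templates, feature_values):
    """Enumerate wildcards with a dataset's features and example values.

    `feature_values` maps each feature name to a list of example values.
    """
    pairs = []
    for utterance_t, parse_t in templates:
        for feature, values in feature_values.items():
            if "{value}" in utterance_t:
                for value in values:
                    pairs.append((utterance_t.format(feature=feature, value=value),
                                  parse_t.format(feature=feature, value=value)))
            else:
                pairs.append((utterance_t.format(feature=feature),
                              parse_t.format(feature=feature)))
    return pairs


# Made-up schema, for illustration.
synthetic_pairs = generate_pairs(TEMPLATES, {"age": [20, 40], "bmi": [25]})
```

With many templates, features and values, this cross-product grows quickly, which is consistent with the tens of thousands of pairs reported above.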

Semantic parsing

Here we provide additional details about the semantic parsing approach for translating user utterances into the grammar. The two strategies for parsing user utterances using pre-trained LLMs that we considered were (1) few-shot GPT-J28 and (2) fine-tuned T530. With respect to the few-shot models, because the LLM’s context window accepts only a fixed number of inputs, we introduce a technique to select the set of most relevant prompts for the user utterance. In particular, we embed all the utterances and identify the closest utterances to the user utterance according to the cosine distance of these embeddings. To ensure a diverse set of prompts, we select only one prompt per template. We prompt the LLM using these (utterance, parse) pairs, ordering the closest pairs immediately before the user utterance because LLMs exhibit recency biases57. Using this strategy, we experiment with the number of prompts included in the LLM’s context window. In practice, we use the all-mpnet-base-v2 sentence transformer model to perform the embeddings33, and we consider the GPT-J 6B, GPT-Neo 2.7B and GPT-Neo 1.3B models in our experiments.
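The prompt-selection procedure (nearest utterances by cosine distance, at most one per template, closest pair placed last) can be sketched as below. To keep the example self-contained, it uses hand-written two-dimensional vectors in place of all-mpnet-base-v2 embeddings; the candidate layout, the prompt format and the function names are likewise assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def select_prompts(query_vec, candidates, n_shots):
    """Pick the closest candidates, one per template, ordered closest-last."""
    ranked = sorted(candidates,
                    key=lambda c: cosine_distance(query_vec, c["embedding"]))
    chosen, seen_templates = [], set()
    for cand in ranked:
        if cand["template_id"] in seen_templates:
            continue  # keep the prompt set diverse
        seen_templates.add(cand["template_id"])
        chosen.append(cand)
        if len(chosen) == n_shots:
            break
    return list(reversed(chosen))  # closest pair ends up right before the query


def build_prompt(user_utterance, prompts):
    """Format the selected pairs followed by the unanswered user utterance."""
    blocks = [f"user: {p['utterance']}\nparse: {p['parse']}" for p in prompts]
    blocks.append(f"user: {user_utterance}\nparse:")
    return "\n\n".join(blocks)


# Toy candidates with made-up embeddings, for illustration.
candidates = [
    {"utterance": "A", "parse": "pa", "template_id": 0, "embedding": [1.0, 0.0]},
    {"utterance": "B", "parse": "pb", "template_id": 0, "embedding": [0.9, 0.1]},
    {"utterance": "C", "parse": "pc", "template_id": 1, "embedding": [0.6, 0.8]},
    {"utterance": "D", "parse": "pd", "template_id": 2, "embedding": [0.0, 1.0]},
]
selected = select_prompts([1.0, 0.05], candidates, n_shots=2)
prompt = build_prompt("new question", selected)
```

Here B is the second-closest candidate but shares a template with A, so it is skipped in favour of C.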

We also fine-tune pre-trained T5 models in a seq2seq fashion on our datasets. To perform fine-tuning, we split the dataset using a 90%/10% train/validation split and train for 20 epochs to maximize the next token likelihood with a batch size of 32. We evaluate the validation loss at the end of each epoch and select the model with the lowest validation loss. We fine-tune with a learning rate of \(1\times 10^{-4}\) and the AdamW optimizer58. Last, our experiments consider the T5 small, base and large variants.

Generating responses

After TalkToModel executes a parse, it composes the results of the operations into a natural language response that it returns to the user. TalkToModel generates these responses by filling in templates associated with each operation based on the results. The responses also include sufficient context to understand the results and opportunities for following up (examples in Table 2). Further, because the system runs multiple operations in one execution, TalkToModel joins response templates, ensuring semantic coherence, into a final response and shows it to the user. Compared with approaches that generate responses using neural methods59, this approach ensures that the responses are trustworthy and do not contain useless information hallucinated by the system, which would be a very poor user experience for the high-stakes applications we consider. Further, because TalkToModel supports a wide variety of different operations, this approach ensures sufficient diversity in responses, so they are not repetitive.
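A minimal Python sketch of template-based response generation is given below. The operation names, template wording and result fields are hypothetical and only illustrate the idea of filling one template per executed operation and joining the pieces into a single response.

```python
# Hypothetical response templates keyed by operation name.
RESPONSE_TEMPLATES = {
    "filter": "For the {count} instances where {condition},",
    "predict": "the model predicts {label} for {percent:.1f}% of them.",
}


def compose_response(results):
    """Fill the template of each executed operation and join the parts.

    results: list of (operation_name, values) tuples in execution order,
        where values holds the fields required by that operation's template.
    """
    parts = [RESPONSE_TEMPLATES[op].format(**values) for op, values in results]
    return " ".join(parts)


results = [
    ("filter", {"count": 12, "condition": "age is greater than 50"}),
    ("predict", {"label": "high risk", "percent": 75.0}),
]
response = compose_response(results)
print(response)
```

Because every word of the response comes from a fixed template or from a computed result, nothing in the output can be invented by a generative model, which is the property the paragraph above emphasizes.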

Executing parses

In this section, we provide an overview of the execution engine, which runs the operations necessary to respond to user utterances in the conversation. Further, this component automatically selects the most faithful explanations for the user, helping ensure explanation accuracy.

Feature importance explanations

At its core, TalkToModel explains to users why the model makes predictions using feature importance explanations. A feature importance explanation ϕ(x, f) → ϕ accepts as input a data point \({{{\bf{x}}}}\in {{\mathbb{R}}}^{d}\) with d features and a model f(x) → y, where \(y\in [0,1]\) is the probability for a particular class, and generates a feature attribution vector \({{{\bf{\upphi }}}}\in {{\mathbb{R}}}^{d}\), where greater magnitudes correspond to higher importance features6,7,60,61,62,63.

We implement the feature importance explanations using post hoc feature importance explanations. Post hoc feature importance explanations do not rely on internal details of the model f (for example, internal weights or gradients); they rely only on the input data x and predictions y to compute explanations, so users are not limited to only certain types of model64,65,66,67,68. Note that our system can easily be extended to other explanations that rely on internal model details, if required4,8,69,70.

Explanation selection

While there exist several post hoc explanation methods, each one adopts a different definition of what constitutes an explanation71. For example, while local interpretable model agnostic explanations (LIME), Shapley additive explanations (SHAP) and integrated gradients all output feature attributions, LIME returns coefficients of local linear models, SHAP computes Shapley values and integrated gradients leverages model gradients. Consequently, we automatically select the most faithful explanation for users, unless a user specifically requests a certain technique. Following previous works, we compute faithfulness by perturbing the most important features and evaluating how much the prediction changes72. Intuitively, if the feature importance ϕ correctly captures the feature importance ranking, perturbing more important features should lead to greater effects.

While previous works65,73 compute the faithfulness over many different thresholds, making comparisons harder, or require retraining entirely from scratch, we introduce a single metric, called the fudge score, that captures the prediction sensitivity to perturbing certain features. This metric is the mean absolute difference between the model’s prediction on the original input and on a fudged version in which the features selected by a binary mask \({{{\bf{m}}}}\in {\{0,1\}}^{d}\) are perturbed

$${{{\rm{Fudge}}}}(\,f,{{{\bf{x}}}},{{{\bf{m}}}})=\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}\ | \,f({{{\bf{x}}}})-f({{{\bf{x}}}}+{{{{\bf{\upepsilon }}}}}_{n}\odot {{{\bf{m}}}})|$$

where \(\odot\) is the element-wise product and \({{{\bf{\upepsilon }}}} \sim {{{\mathcal{N}}}}(0,I\sigma )\) is N × d-dimensional Gaussian noise. To evaluate faithfulness for a particular explanation method, we compute the area under the fudge score curve on the top-k most important features, thereby summarizing the results into a single metric

$${\mathbb{1}}(k,{{{\bf{\upphi }}}})=\left\{\begin{array}{ll}1\quad &{{{\rm{if}}}}{\phi }_{i}\in \arg \mathop{\max }\limits_{{{{{\bf{\upphi }}}}}^{{\prime} }\subset \{1\ldots d\},| {{{{\bf{\upphi }}}}}^{{\prime} }| =k}{\sum }_{i\in {{{{\bf{\upphi }}}}}^{{\prime} }}| {\phi }_{i}| \\ 0\quad &{{{\rm{otherwise}}}}\end{array}\right.$$
$${{{\rm{Faith}}}}({{{\bf{\upphi }}}},\,f,\,{{{\bf{x}}}},\,K)=\mathop{\sum }\limits_{k=1}^{K}{{{\rm{Fudge}}}}(f,\,{{{\bf{x}}}},\,{\mathbb{1}}(k,{{{\bf{\upphi }}}}))$$

where \({\mathbb{1}}(k,{{{\bf{\upphi }}}})\) is the indicator function for the top-k most important features. Intuitively, if a set of feature importances ϕ correctly identifies the most important features, perturbing them will have greater effects on the model’s predictions, resulting in higher faithfulness scores. We compute faithfulness for multiple different explanations and select the highest. In practice, we consider LIME64 with kernel widths [0.25, 0.50, 0.75, 1.0] and KernelSHAP74, leaving all other settings at their defaults. We set σ = 0.05 to ensure that perturbations happen in the local region around the prediction, K to \({{{\rm{floor}}}}(\frac{d}{2})\), and N = 10,000 to sample sufficiently. One complication arises for categorical features, where we cannot apply Gaussian perturbations. For these features, we instead replace the value with one randomly sampled from the dataset column 30% of the time, which guarantees that the feature remains categorical under perturbation. Last, if multiple explanations have similar fidelities (differing by less than δ = 0.01), we break ties using the explanation stability metric proposed in ref. 75, because it is desirable for an explanation to be robust to perturbations7,76. When applying this metric, we compute the Jaccard similarity between feature rankings instead of the L2 norm used in their work. The reason is that the norm might not be comparable between explanation types, because they have different ranges, while the Jaccard similarity should not be affected. Further, we compute the area under the top-k curve using the Jaccard similarity stability metric, as in equation (3), to make this measure more robust.
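The fudge score and the faithfulness metric can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it treats σ as the standard deviation of the noise, uses fewer samples than N = 10,000 by default to keep the example fast, handles only numeric features, and omits the stability-based tie-breaking. The toy model and the two candidate attribution vectors are made up for the example.

```python
import math
import random


def fudge(f, x, mask, n_samples=1000, sigma=0.05, seed=0):
    """Mean absolute change in f when Gaussian noise is added to masked features."""
    rng = random.Random(seed)
    base = f(x)
    total = 0.0
    for _ in range(n_samples):
        # Noise is applied only where mask is 1 (element-wise product with the mask).
        perturbed = [xi + rng.gauss(0.0, sigma) * mi for xi, mi in zip(x, mask)]
        total += abs(base - f(perturbed))
    return total / n_samples


def top_k_mask(phi, k):
    """Binary mask selecting the k features with the largest |phi_i|."""
    order = sorted(range(len(phi)), key=lambda i: abs(phi[i]), reverse=True)
    top = set(order[:k])
    return [1 if i in top else 0 for i in range(len(phi))]


def faith(phi, f, x, K, **fudge_kwargs):
    """Sum of fudge scores over the top-1 ... top-K feature masks."""
    return sum(fudge(f, x, top_k_mask(phi, k), **fudge_kwargs) for k in range(1, K + 1))


def select_explanation(explanations, f, x):
    """Return the name of the most faithful explanation and all scores."""
    K = len(x) // 2  # floor(d / 2), as in the text
    scores = {name: faith(phi, f, x, K) for name, phi in explanations.items()}
    return max(scores, key=scores.get), scores


def toy_model(x):
    """Hypothetical model: only the first two features affect the prediction."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] + 0.1 * x[1])))


x = [0.2, -0.1, 0.5, 1.0]
explanations = {
    "good": [0.9, 0.1, 0.0, 0.0],  # ranks the truly influential features first
    "bad": [0.0, 0.0, 0.9, 0.5],   # ranks features the model ignores first
}
best, scores = select_explanation(explanations, toy_model, x)
print(best, scores)
```

In this toy setting, the attribution vector that ranks the ignored features first receives a faithfulness of exactly zero, because perturbing those features never changes the prediction.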

Additional explanation types

As users will have explainability questions that cannot be answered solely with feature importance explanations, we include additional explanations to support a wider array of conversation topics. In particular, we support counterfactual explanations and feature interaction effects. These methods enable conversations about how to get different outcomes and whether features interact with each other during predictions, supporting a broad set of user queries. We implement counterfactual explanations using diverse counterfactual explanations, which generates a diverse set of counterfactuals77. Having access to many plausible counterfactuals is desirable because it enables users to see a breadth of different, potentially useful, options. Also, we implement feature interaction effects using the partial dependence based approach from ref. 78 because it is effective and quick to compute.

Exploring data and predictions

Because the process of understanding models often requires users to inspect the model’s predictions, errors and the data, TalkToModel supports a wide variety of data and model exploration tools. For example, TalkToModel provides options for filtering data and performing what-if analyses, supporting user queries that concern subsets of data or what would happen if data points change. Users can also inspect model errors, predictions and prediction probabilities, and compute summary statistics and evaluation metrics for individuals and groups of instances. TalkToModel additionally supports summarizing common patterns in mistakes on groups of instances by training a shallow decision tree on the model errors in the group. Also, TalkToModel enables descriptive operations, which explain how the system works, summarize the dataset and define terms to help users understand how to approach the conversation. Overall, TalkToModel supports a rich set of conversation topics in addition to explanations, making the system a complete solution for the model understanding requirements of end users.
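To make the error-summarization idea concrete, the Python sketch below fits the simplest possible shallow tree, a single threshold split (a depth-one tree), to binary error labels using Gini impurity. This is a simplification for illustration: the text above does not specify the tree depth or the training procedure, and the feature names and data are invented.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = model error)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_error_split(rows, errors, feature_names):
    """Find the single threshold split that best separates errors from non-errors."""
    n = len(rows)
    best = None
    for j, name in enumerate(feature_names):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [e for row, e in zip(rows, errors) if row[j] <= threshold]
            right = [e for row, e in zip(rows, errors) if row[j] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best["impurity"]:
                best = {
                    "impurity": impurity,
                    "feature": name,
                    "threshold": threshold,
                    "error_rate_below": sum(left) / len(left),
                    "error_rate_above": sum(right) / len(right),
                }
    return best


# Invented data: (age, income) rows and whether the model erred on each.
rows = [(25, 40), (30, 85), (45, 60), (50, 30), (65, 70), (70, 45), (80, 90), (62, 35)]
errors = [0, 0, 0, 0, 1, 1, 1, 1]

split = best_error_split(rows, errors, ["age", "income"])
print(
    f"Errors concentrate where {split['feature']} > {split['threshold']}: "
    f"error rate {split['error_rate_above']:.0%} versus {split['error_rate_below']:.0%}"
)
```

A deeper tree would apply the same search recursively within each side of the split, yielding short rules that describe where the model tends to be wrong.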


While we implement TalkToModel with several different choices for operations such as feature importance explanations and counterfactual explanations, TalkToModel is highly modular and system designers can easily incorporate new operations or change existing ones by modifying the grammar to best support their user populations. This design makes TalkToModel straightforward to extend to new settings, where different operations may be desired.

Broader impact statement

The TalkToModel system and, more generally, conversational model explainability can be applied to a wide range of applications, including financial, medical or legal applications. Our research could be used to improve model understanding in these situations by improving transparency and encouraging the positive impact of ML systems, while reducing errors and bias. Although TalkToModel has many positive applications, the system makes it easier for those without high levels of technical expertise to understand ML models, which could lead to a false sense of trust in ML systems. In addition, because TalkToModel makes it easier for those with lower levels of expertise to use ML models, there is a risk of inexperienced users applying ML models inappropriately. While TalkToModel includes several measures to prevent such risks, such as qualifying when explanations or predictions are inaccurate and clearly describing the intended purpose of the ML model, it would be useful for researchers to investigate these considerations and for potential adopters to be mindful of them. While completing this research, the authors complied with all relevant ethical regulations of human research.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.