Main

Due to their strong performance, machine learning (ML) models increasingly make consequential decisions in several critical domains, such as healthcare, finance and law. However, state-of-the-art ML models, such as deep neural networks, have become more complex and hard to understand. This dynamic poses challenges in real-world applications for model stakeholders who need to understand why models make predictions and whether to trust them. Consequently, practitioners have often turned to inherently interpretable ML models for these applications, including decision lists and sets1,2 and generalized additive models3,4,5, which people can more easily understand. Nevertheless, black-box models are often more flexible and accurate, motivating the development of post hoc explanations that explain the predictions of trained ML models. These explainability techniques either fit faithful models in the local region around a prediction or inspect internal model details, such as gradients, to explain predictions6,7,8,9,10,11.

Yet, recent work suggests that practitioners often have difficulty using explainability techniques12,13,14,15. These challenges stem from difficulty in figuring out which explanations to use, how to interpret them and how to answer follow-up questions beyond the initial explanation. In the past, researchers have proposed several point-and-click dashboard techniques to help overcome these issues, such as the Language Interpretability Tool16, which is designed for understanding natural language processing models, and the What-If Tool17, a tool aimed at performing counterfactual analyses for models. However, these methods still require a high level of expertise, because users must know which explanations to run, and they lack the flexibility to support arbitrary follow-up questions that users might have. Overall, understanding ML models through simple and intuitive interactions remains a key bottleneck for adoption across many applications.

Natural language dialogues are a promising solution for supporting broad and accessible interactions with ML models because of their ease of use and capacity to support continued discussion. However, designing a dialogue system that enables a satisfying model understanding experience introduces several challenges. First, the system must handle many conversation topics about the model and data while facilitating natural conversation flow18. For instance, these topics may include explainability questions, such as which features are most important for predictions, and general questions, such as data statistics or model errors. Further, the system must work for various model classes and datasets, and it should understand language usage across different settings19. For example, users will employ different terminology in conversations about loan prediction than about disease diagnosis. Last, the dialogue system should generate accurate responses that address the users’ core questions20,21. In the literature, researchers have suggested some prototype designs for generating explanations using natural language. However, these initial designs address specific explanations and model classes, limiting their applicability in general conversational explainability settings22,23.

In this Article, we address these challenges by introducing TalkToModel: a system that enables open-ended natural language dialogues for understanding ML models for any tabular dataset and classifier (an overview of TalkToModel is provided in Fig. 1). Users can have discussions with TalkToModel about why predictions occur, how the predictions would change if the data change and how to flip predictions, among many other conversation topics (an example conversation is provided in Fig. 2). Further, they can perform these analyses on any group in the data, such as a single instance or a specific group of instances. For example, on a disease prediction task, users can ask ‘How important is BMI for the predictions?’ or ‘So how would decreasing the glucose levels by 10 change the likelihood of men older than 20 having the disease?’. TalkToModel will respond by describing how, for instance, BMI is the most important feature for predictions, and decreasing glucose will decrease the chance of diabetes by 20%. From there, users can engage further in the conversation by asking follow-up questions. Conversations with TalkToModel make model explainability straightforward because users can talk with the system in natural language about the model, and the system will generate useful responses.

Fig. 1: Overview of TalkToModel.

Instead of writing code, users have conversations with TalkToModel as follows. (1) Users supply natural language inputs. (2) The dialogue engine parses the input into an executable representation. (3) The execution engine runs the operations and the dialogue engine uses the results in its response.

Fig. 2: A conversation with TalkToModel.

A conversation about diabetes prediction, demonstrating the breadth of different conversation points the system can discuss.

To support such rich conversations with TalkToModel, we introduce techniques for both language understanding and model explainability. First, we propose a dialogue engine that parses user text inputs (referred to as user utterances) into a structured query language-like programming language using a large language model (LLM). The LLM performs the parsing by treating the translation of user utterances into the programming language as a seq2seq learning problem, where the user utterances are the source and parses in the programming language are the targets24. In addition, the TalkToModel language combines operations for explanations, ML error analyses, data manipulation and descriptive text into a single language capable of representing a wide variety of conversation topics, covering most model explainability needs (an overview of the different operations is provided in Fig. 3). To enable the system to adapt to any dataset and model, we introduce lightweight adaptation techniques to fine-tune LLMs to perform the parsing, enabling strong generalization to new settings. Second, we introduce an execution engine that runs the operations in each parse. To reduce the burden on users of deciding which explanations to run, we introduce methods that automatically select explanations for the user. In particular, this engine runs many explanations, compares their fidelities and selects the most accurate ones. Finally, we construct a text interface where users can engage in open-ended dialogues with the system, enabling anyone, including those with minimal technical skills, to understand ML models.

Fig. 3: Overview of the operations supported by TalkToModel.

The operations are incorporated into the conversation to generate responses. Note that ‘Conv.’ refers to conversation operations.

Results

In this section, we demonstrate that TalkToModel accurately understands users in conversations by evaluating its language understanding capabilities on ground-truth data. Next, we evaluate the effectiveness of TalkToModel for model understanding by performing a real-world human study on healthcare workers (for example, doctors and nurses) and ML practitioners, where we benchmark TalkToModel against existing explainability systems. We find users both prefer and are more effective using TalkToModel than traditional point-and-click explainability systems, demonstrating its effectiveness for understanding ML models.

Language understanding

Here we quantitatively assess the language understanding capabilities of TalkToModel by creating gold parse datasets and evaluating the system’s accuracy on these data.

Gold parse collection

We construct gold datasets (that is, ground-truth (utterance, parse) pairs) across multiple datasets to evaluate the language understanding performance of our models. To construct these gold datasets, we adopt an approach inspired by ref. 25, which constructs a similar dataset for multitask semantic parsing.

Our gold dataset-generation process is as follows. First, we write 50 (utterance, parse) pairs for the particular task (that is, loan or diabetes prediction). These utterances range from the simple (‘How likely are people in the data to have diabetes?’) to the complex (‘If these people were not unemployed, what’s the likelihood they are good credit risk? Why?’). We include each operation (Fig. 3) at least twice in the parses to ensure good coverage. From there, we ask Mechanical Turk workers to rewrite the utterances while preserving their semantic meaning, so that the ground-truth parse for the revised utterance is the same but the phrasing differs. We ask workers to rewrite each pair 8 times for a total of 400 (utterance, parse) pairs per task. Next, we filter out low-quality Mechanical Turk revisions. We ask the crowd-sourced workers to rate the similarity between the original utterance and the revised utterance on a scale of 1 to 4, where 4 indicates that the utterances have the same meaning and 1 indicates that they do not. We collect 5 ratings per revision and remove (utterance, parse) pairs that score below 3.0 on average. Finally, we perform an additional filtering step to ensure data quality by inspecting the remaining pairs ourselves and removing any bad revisions.

As we want to evaluate TalkToModel’s capacity to generalize across different scenarios, we perform this data collection process across three different tasks: Pima Indian Diabetes Dataset26, German credit dataset26 and the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) recidivism dataset27. After collecting revisions and ensuring quality, we are left with 200 pairs for the German credit dataset, 190 for the diabetes dataset and 146 for the COMPAS dataset.

Models

We compare two strategies for using pre-trained LLMs to parse user utterances into the grammar: (1) few-shot GPT-J28 and GPT-3.5 models29 and (2) fine-tuned T530. The GPT-J and GPT-3.5 models are higher capacity and more amenable to in-context learning, a procedure that prepends examples of inputs and targets from the training data to the test instance29,31,32. In contrast, the T5 models require traditional fine-tuning on the (input, target) pairs. Consequently, the few-shot approach is quicker to set up because it does not require fine-tuning, making it easier for users to get started with the system. However, fine-tuned T5 leads to better performance and a better user experience overall, at the cost of a longer setup; we attribute this to T5 having access to all the training data, whereas the few-shot models are limited by the context window size. To train these models through fine-tuning or prompting, we generate synthetic (utterance, parse) pairs because it is impractical to assume that we can collect ground-truth pairs for every new task we wish to use TalkToModel for. We provide additional training details in Methods.

We evaluate both fine-tuned T5 models and few-shot models on the testing data. We additionally implement a naive nearest-neighbours baseline, where we select the closest user utterance in the synthetic training set according to cosine distance of all-mpnet-base-v2 sentence embeddings and return the corresponding parse33. For the GPT-J models, we compare N-shot performance, where N is the number of (utterance, parse) pairs from the synthetically generated training sets included in the prompt, and sweep over a range of N for each model. For the larger models, we have to use relatively smaller N for inference to fit on a single 48 GB graphics processing unit.
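To make the baseline concrete, the following is a minimal sketch of the nearest-neighbours approach, assuming the synthetic training utterances and parses are available as parallel lists (the toy data shown are purely illustrative).

```python
# A minimal sketch of the nearest-neighbours baseline: return the parse of the
# closest synthetic training utterance under cosine similarity of
# all-mpnet-base-v2 sentence embeddings.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def nearest_neighbour_parse(user_utterance, train_utterances, train_parses):
    query = encoder.encode(user_utterance, convert_to_tensor=True)
    corpus = encoder.encode(train_utterances, convert_to_tensor=True)
    best = int(util.cos_sim(query, corpus)[0].argmax())
    return train_parses[best]

# Illustrative usage with toy data.
print(nearest_neighbour_parse(
    "how important is glucose?",
    ["how important is bmi for the predictions", "what do you predict for women"],
    ["important bmi", "filter gender female and predict"],
))
```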

When parsing the utterances, one issue is that the LLMs’ generations are unconstrained and may produce parses outside the grammar, causing the system to fail to run the parse. To ensure the generations are grammatical, we constrain the decodings to the grammar by recompiling it at inference time into an equivalent grammar over the tokens in the LLM’s vocabulary34. While decoding from the LLM, we fix the likelihood of ungrammatical tokens to 0 at every generation step. Because the GPT-3.5 model must be called through an application programming interface, which does not support guided decoding, we decode greedily with temperature set to one.
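As a sketch of this guided decoding step, the following assumes a precomputed callable `grammatical_next_tokens` that maps a decoded prefix to the token ids allowed by the grammar; this callable stands in for the recompiled grammar and is an assumption, not the system’s implementation.

```python
# A hedged sketch of grammar-constrained decoding with a Hugging Face
# LogitsProcessor: at each step, ungrammatical tokens are masked out so their
# probability becomes zero.
import torch
from transformers import LogitsProcessor

class GrammarConstrainedLogits(LogitsProcessor):
    def __init__(self, grammatical_next_tokens):
        # Callable: tuple of decoded token ids -> iterable of allowed next ids.
        self.grammatical_next_tokens = grammatical_next_tokens

    def __call__(self, input_ids, scores):
        for i, prefix in enumerate(input_ids.tolist()):
            allowed = list(self.grammatical_next_tokens(tuple(prefix)))
            mask = torch.full_like(scores[i], float("-inf"))
            mask[allowed] = 0.0
            scores[i] = scores[i] + mask  # -inf logits decode with probability 0
        return scores
```

Such a processor would be passed to `model.generate` through a `LogitsProcessorList`.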

Evaluating the parsing accuracy

To evaluate performance on the datasets, we use the exact match parsing accuracy25,35,36. This metric measures whether the generated parse exactly matches the gold parse in the dataset. We perform the evaluation on two splits of each gold parse dataset, in addition to the overall dataset: an independent and identically distributed (IID) split and a compositional split. The IID split contains (utterance, parse) pairs where the parse’s operations and their structure (but not necessarily the arguments) appear in the training data. The compositional split consists of the remaining parses, whose structures are not in the training data. Because language models struggle with compositional generalization, this split is generally much harder to parse37,38.
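For concreteness, a minimal sketch of this metric is shown below.

```python
# Exact match parsing accuracy: the percentage of predictions that match the
# gold parse string exactly (after trimming whitespace).
def exact_match_accuracy(predicted_parses, gold_parses):
    matches = sum(p.strip() == g.strip()
                  for p, g in zip(predicted_parses, gold_parses))
    return 100.0 * matches / len(gold_parses)
```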

Accuracy

We present the results in Table 1. T5 performs better overall than the few-shot GPT-J and GPT-3.5 models. Notably, the T5 small model performs better than the GPT-J 6B model, which has two orders of magnitude more parameters. While the few-shot models underperform the fine-tuned T5 models overall, GPT-3.5 is the best-performing few-shot model and performs considerably better than the GPT-J models, particularly in the compositional split. Overall, these results suggest using fine-tuned T5 for the best results, and we use T5 large in our human studies.

Table 1 Exact match parsing accuracy (%) for the three gold datasets, on the IID and compositional splits, and overall

Utility of explainability dialogues

The results in the previous section show that TalkToModel understands user intentions to a high degree of accuracy. In this section, we evaluate how well the end-to-end system helps users understand ML models compared with current explainability systems.

Study overview

We compare TalkToModel against ‘explainerdashboard’, one of the most popular open-source explainability dashboards39. This dashboard has similar functionality to TalkToModel, as it provides an accessible way to compute explanations and perform model analyses, making it a reasonable baseline. We perform this comparison using the diabetes dataset and a gradient-boosted tree trained on the data40. To compare both systems in a controlled manner, we ask participants to answer general ML questions with TalkToModel and the dashboard. Each question is about basic explainability and model analysis, and participants answer using multiple choice, where one of the options is ‘Could not determine’ if they cannot figure out the answer (although it is straightforward to answer all the questions with both interfaces). For example, questions concern comparing feature importances (‘Is glucose more important than age for the model’s predictions for data point 49?’) or model predictions (‘How many people are predicted not to have diabetes but do not actually have it?’). Participants answer ten questions in total. We divide the ten questions into two blocks of five questions each. Both blocks have similar questions but different values to control for memorization (the exact questions are given in Supplementary Section A). Participants use TalkToModel to answer one block of questions and the dashboard for the other block. In addition, we provide a tutorial on how to use both systems before showing users the questions for the system. Last, we randomize question, block and interface order to control for biases due to showing interfaces or questions first.

Metrics

Following previous work on evaluating human and ML coordination and trust, we assessed several metrics to evaluate user experiences41,42,43. We evaluated the following statements on a 1–7 Likert scale at the end of the survey:

  • Easiness: I found the conversational interface easier to use than the dashboard interface

  • Confidence: I was more confident in my answers using the conversational interface than the dashboard interface

  • Speed: I felt that I was able to more rapidly arrive at an answer using the conversational interface than the dashboard interface

  • Likeliness to use: based on my experience so far with both interfaces, I would be more likely to use the conversational interface than the dashboard interface in the future

To control for bias associated with the ordering of the terms conversational interface and dashboard interface, we randomized their ordering. We also measured accuracy and time taken to answer each question. Last, we asked participants to write a short description comparing their experience with both interfaces to capture their qualitative feedback about both systems.

Recruitment

As TalkToModel provides an accessible way to understand ML models, we expect it to be useful for subject-matter experts with a variety of experience in ML, including users without any ML experience. As such, we recruited 45 English-speaking healthcare workers with minimal or no ML expertise to take the survey using the Prolific service44. This group comprises a range of healthcare workers, including doctors, pharmacists, dentists, psychiatrists, healthcare project managers and medical scribes. The vast majority of this group (43) stated that they had either no experience with ML or had heard about it from reading articles online, while two members indicated they had the equivalent of an undergraduate course in ML. As another point of comparison, we recruited ML professionals with relatively higher ML expertise from ML Slack channels and email lists. We received 13 potential participants, all of whom had graduate-course-level ML experience or higher, and included all of them in the study. We received institutional review board approval for this study from the University of California, Irvine institutional review board and informed consent from participants.

Metric results

A substantial majority of healthcare workers agreed that they preferred TalkToModel in all the categories we evaluated (Table 2). The same is true for the ML professionals, save for whether they were more likely to use TalkToModel in the future, where 53.8% of participants agreed they would instead use TalkToModel in the future. In addition, participants’ subjective notions of how quickly they could use TalkToModel aligned with their actual speed of use, and both groups arrived at answers using TalkToModel significantly quicker than using the dashboard. The median question answer time (measured as the total time taken from seeing the question to submitting the answer) using TalkToModel was 76.3 s, while it was 158.8 s using the dashboard.

Table 2 User study results for respondents that agree TalkToModel is better than the dashboard

Participants were also much more accurate and completed questions at a higher rate (that is, they did not mark ‘Could not determine’) using TalkToModel (Table 3). While both healthcare workers and ML practitioners clicked ‘Could not determine’ for a quarter of the questions using the dashboard, this was true for only 13.8% of questions for healthcare workers and 6.1% for ML professionals using TalkToModel, demonstrating the usefulness of the conversational interface. On completed questions, both groups were much more accurate using TalkToModel than the dashboard. Most surprisingly, although ML professionals agreed that they preferred TalkToModel only about half the time, they answered all the questions correctly using it, while they answered only 62.5% of questions correctly with the dashboard. Finally, we observed that TalkToModel’s conversational capabilities were highly effective. Only 6 utterances out of over 1,000 total utterances failed to be resolved by the conversational aspect of the system. These failure cases generally involved certain discourse aspects, such as asking for additional elaboration (‘more description’).

Table 3 User study results for completion rate and accuracy across interfaces and participant groups

The largest source of errors for participants using the explainability dashboard was two questions concerning the top most important features for individual predictions. The errors on these questions account for 47.4% of healthcare workers’ and 44.4% of ML professionals’ total mistakes. Solving these tasks with the dashboard requires users to perform multiple steps, including choosing the feature importance tab, whereas the streamlined text interface of TalkToModel made them much simpler to solve.

Qualitative results

For the qualitative user feedback, we provide representative quotes from similar themes in the responses. Users expressed that they could more rapidly and easily arrive at results, which could be helpful for their professions.


“I prefer the conversational interface because it helps arrive at the answer very quickly. This is very useful especially in the hospital setting where you have hundreds of patients getting check ups and screenings for diabetes because it is efficient and you can work with medical students on using the system to help patient outcomes.” (P39, medical worker at a tertiary hospital)

Participants also commented on the user friendliness of TalkToModel and its strong conversational capabilities, stating, “the conversational [interface] was straight to the point, way easier to use” (P35 nurse) and that “the conversational interface is hands-down much easier to use… it feels like one is talking to a human” (P45 ML professional). We did not find any negative feedback surrounding the conversational capabilities of the system. Overall, users expressed strong positive sentiment about TalkToModel due to the quality of conversations, presentation of information, accessibility and speed of use.

Several ML professionals brought up points that could serve as future research directions. Notably, participants stated that they would rather look at the data themselves than rely on an interface that rapidly provides an answer.


“I would almost always rather look at the data myself and come to a conclusion than getting an answer within seconds.” (P11, ML professional)

In the future, it would be worthwhile including visualizations of raw data and analyses performed by the system to increase trust with expert users, such as ML professionals, who may be sceptical of the high-level answers provided by the system currently.

Discussion

With ML models becoming increasingly complex, there is a need to develop techniques to explain model predictions to stakeholders. Nevertheless, it is often the case that practitioners struggle to use explanations and frequently have many follow-up questions they wish to answer. In this work, we show that TalkToModel makes explainable AI accessible to users from a range of backgrounds by using natural language conversations. Our experiments demonstrate that TalkToModel comprehends users with a high degree of accuracy and can help users understand the predictions of ML models much better than existing systems can. In particular, we showed that TalkToModel is a highly effective way for domain experts such as healthcare workers to understand ML models, like those applied to disease diagnosis. Lastly, we designed TalkToModel to be highly extensible and released the code, data and a demo for the system at https://github.com/dylan-slack/TalkToModel, making it straightforward for users and researchers to build on the system. In the future, it will be helpful to investigate applications of TalkToModel ‘in the wild’, such as in doctors’ offices, laboratories or professional settings, where model stakeholders could use the system to understand their models.

Methods

In this section, we describe the components of TalkToModel. First, we introduce the dialogue engine and discuss how it understands user inputs, maps them to operations and generates text responses based on the results of running the operations. Second, we describe the execution engine, which runs the operations. Finally, we provide an overview of the interface and the extensibility of TalkToModel.

Text understanding

To understand the intent behind user utterances, the system learns to translate or parse them into logical forms. These parses represent the intentions behind user utterances in a highly expressive and structured programming language that TalkToModel executes.

Compared with dialogue systems that execute specific tasks by modifying representations of the internal state of the conversation45,46, our parsing-based approach allows for more flexibility in the conversations, supporting open-ended discovery, which is critical for model understanding. Also, this strategy produces a structured representation of user utterances, in contrast to open-ended systems that generate unstructured free text47. Having this structured representation of user inputs is key for our setting, where we need to execute specific operations depending on the user’s input, which would not be straightforward with unstructured text.

TalkToModel performs the following steps to accomplish this: (1) the system constructs a grammar for the user-provided dataset and model, which defines the set of acceptable parses; (2) TalkToModel generates (utterance, parse) pairs for the dataset and model; (3) the system fine-tunes an LLM to translate user utterances into parses; and (4) the system responds conversationally to users by composing the results of the executed parse into a response that provides context for the results and opportunities to follow up.

Grammar

To represent the intentions behind user utterances in a structured form, TalkToModel relies on a grammar, which defines a domain-specific language for model understanding. While the user utterances themselves will be highly diverse, the grammar provides a way to express them in a structured yet highly expressive fashion that the system can reliably execute. Compared with approaches that treat determining user intentions in conversations as a classification problem48,49, using a grammar enables the system to express compositions of operations and arguments that take on many different values, such as real numbers, which would otherwise be combinatorially impossible in a prediction setting. Instead, TalkToModel translates user utterances into this grammar in a seq2seq fashion, overcoming these challenges24. The grammar consists of production rules that include the operations the system can run (an overview is provided in Fig. 3), the acceptable arguments for each operation and the relations between operations. One complication is that user-provided datasets have different feature names and values, making it hard to define one shared grammar between datasets. Instead, we update the grammar based on the feature names and values in a new dataset. For instance, if a dataset contained only the feature names ‘age’ and ‘income’, these two names would be the only acceptable values for the feature argument in the grammar.
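To illustrate how such a grammar can be specialized to a dataset, the following minimal sketch uses the Lark parsing library; the rule names and operations are simplified assumptions rather than TalkToModel’s actual grammar.

```python
# A toy grammar whose feature tokens are generated from the dataset's column
# names, so only those names are acceptable arguments.
from lark import Lark

def build_grammar(feature_names):
    features = " | ".join(f'"{name}"' for name in feature_names)
    grammar = f"""
    start: operation ("and" operation)*
    operation: filterop | importance | predict
    filterop: "filter" feature comparator NUMBER
    importance: "important" feature?
    predict: "predict"
    feature: {features}
    comparator: "greater than" | "less than" | "equal to"
    %import common.NUMBER
    %import common.WS
    %ignore WS
    """
    return Lark(grammar, start="start")

parser = build_grammar(["age", "income"])
print(parser.parse("filter age greater than 20 and important income").pretty())
```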

To ensure that our grammar provides sufficient coverage for explainable artificial intelligence (XAI) questions, we verify that it supports the questions from the XAI question bank. This question bank was introduced in ref. 50 based on interviews with AI product designers and includes 31 core, prototypical questions XAI systems should answer, excluding socio-technical questions beyond the scope of TalkToModel (for example, ‘What are the results of other people using the [model]’). The prototypical questions address topics such as the input/data to the model (‘What is the distribution of a given feature?’), model output (‘What kind of output does the system give?’), model performance (‘How accurate are the predictions?’), global model behaviour (‘What is the system’s overall logic?’), why or why not the system makes individual predictions (‘Why is this instance given this prediction?’) and what-if or counterfactual questions (‘What would the system predict if this instance changes to…?’). To evaluate how well TalkToModel covers these questions, we review each question and evaluate whether our grammar can parse it. We find that our grammar supports 30 of the 31 prototypical questions. We provide a table of each question and corresponding parse in Supplementary Tables 6 and 7. Overall, the grammar covers the vast majority of XAI-related questions and therefore has good coverage of XAI topics.

Supporting context in dialogues

User conversations with TalkToModel naturally include complex conversational phenomena such as anaphora and ellipsis51,52,53. That is, conversations refer back to events earlier in the conversation (‘What do you predict for them?’) or omit information that must be inferred from the conversation (‘Now show me for people predicted incorrectly’). However, current language models parse only a single input, making it hard to apply them in settings where context is important. To support context in the dialogues, TalkToModel introduces a set of operations in the grammar that determine the context for user utterances. In contrast with approaches that maintain the conversation state using neural representations45,54, grammar operations allow for much more trustworthy and dependable behaviour while still fostering rich interactions, which is critical for high-stakes settings; similar mechanisms for incorporating grammar predicates across turns have been shown to achieve strong results53. In particular, we leverage two operations, previous filter and previous operation, which look back in the conversation to find the last filter and last operation, respectively. These operations also act recursively: if the last filter is itself a previous filter operation, TalkToModel will recursively call previous filter to resolve the entire stack of filters. As a result, TalkToModel can address instances of anaphora and ellipsis by using these operations to resolve the entity via co-reference or infer it from the previous conversation history. This dynamic enables users to have complex and natural conversations with TalkToModel.
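The following minimal sketch illustrates this resolution mechanism with a simplified list-of-strings parse representation; the operation names are illustrative, not the system’s actual parse format.

```python
# Resolve a previous filter reference by walking back through earlier parses.
def resolve_previous_filter(earlier_turns):
    """Return the most recent concrete filter operations in the conversation."""
    for past_parse in reversed(earlier_turns):
        filters = [op for op in past_parse if op.startswith("filter")]
        if filters:
            return filters
        # If this turn itself used previous_filter, keep walking back
        # (the recursive case described above).
    return []

earlier_turns = [["filter age greater 20", "predict"]]
latest = ["previous_filter", "important all"]
resolved = resolve_previous_filter(earlier_turns) + [op for op in latest
                                                     if op != "previous_filter"]
print(resolved)  # -> ['filter age greater 20', 'important all']
```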

Parsing dataset generation

To parse user utterances into the grammar, we fine-tune an LLM to translate utterances into the grammar in a seq2seq fashion. We use LLMs because these models have been trained on large amounts of text data and are solid priors for language understanding tasks. Thus, they can better understand diverse user inputs than models trained from scratch, improving the user experience. Further, we automate the fine-tuning of an LLM to parse user utterances into the grammar by generating a training dataset of (utterance, parse) pairs. Compared with dataset-generation methods that use human annotators to generate and label datasets for training conversation models55,56, this approach is much less costly and time consuming, while still being highly effective, and enables users to get conversations running very quickly. The strategy consists of writing an initial set of user utterances and parses, where parts of the utterances and parses are wildcard terms. TalkToModel enumerates the wildcards with aspects of a user-provided dataset, such as the feature names, to generate a training dataset. Depending on the user-provided dataset schema, TalkToModel typically generates anywhere from 20,000 to 40,000 pairs. Last, because we have already written the initial set of utterances and parses, users only need to provide their dataset to set up a conversation.
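The following minimal sketch illustrates the wildcard enumeration idea with two hypothetical templates; the real templates and parse syntax differ.

```python
# Enumerate wildcard templates with a dataset's feature names (and example
# values) to produce synthetic (utterance, parse) training pairs.
templates = [
    ("how important is {feature} for the predictions", "important {feature}"),
    ("what do you predict for people with {feature} over {value}",
     "filter {feature} greater {value} and predict"),
]

def generate_pairs(feature_values):
    pairs = []
    for utterance, parse in templates:
        for feature, values in feature_values.items():
            # Enumerate values only when the template has a {value} wildcard.
            fills = values if "{value}" in utterance else [None]
            for value in fills:
                pairs.append((utterance.format(feature=feature, value=value),
                              parse.format(feature=feature, value=value)))
    return pairs

pairs = generate_pairs({"glucose": [100, 150], "bmi": [25, 30]})
print(len(pairs), pairs[-1])
```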

Semantic parsing

Here we provide additional details about the semantic parsing approach for translating user utterances into the grammar. The two strategies for parsing user utterances using pre-trained LLMs that we considered were (1) few-shot GPT-J28 and (2) fine-tuned T530. With respect to the few-shot models, because the LLM’s context window accepts only a fixed number of inputs, we introduce a technique to select the set of most relevant prompts for the user utterance. In particular, we embed all the utterances and identify the closest utterances to the user utterance according to the cosine distance of these embeddings. To ensure a diverse set of prompts, we select only one prompt per template. We prompt the LLM using these (utterance, parse) pairs, ordering the closest pairs immediately before the user utterance because LLMs exhibit recency biases57. Using this strategy, we experiment with the number of prompts included in the LLM’s context window. In practice, we use the all-mpnet-base-v2 sentence transformer model to perform the embeddings33, and we consider the GPT-J 6B, GPT-Neo 2.7B and GPT-Neo 1.3B models in our experiments.
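A minimal sketch of this prompt-selection procedure is shown below, assuming each synthetic pair carries an identifier for the template it was generated from (the data layout is an assumption).

```python
# Select the closest (utterance, parse) pairs as prompts, at most one per
# template, and place the closest pairs immediately before the user utterance.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def build_prompt(user_utterance, training_pairs, n_prompts):
    """training_pairs: list of (utterance, parse, template_id) tuples."""
    query = encoder.encode(user_utterance, convert_to_tensor=True)
    corpus = encoder.encode([u for u, _, _ in training_pairs],
                            convert_to_tensor=True)
    scores = util.cos_sim(query, corpus)[0]
    selected, used_templates = [], set()
    for idx in scores.argsort(descending=True):
        utterance, parse, template_id = training_pairs[int(idx)]
        if template_id in used_templates:
            continue
        selected.append((utterance, parse))
        used_templates.add(template_id)
        if len(selected) == n_prompts:
            break
    # Reverse so the most similar pair sits right before the user utterance.
    prompt = "".join(f"{u}\n{p}\n\n" for u, p in reversed(selected))
    return prompt + user_utterance + "\n"
```

The resulting prompt string would then be passed to the few-shot LLM for completion.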

We also fine-tune pre-trained T5 models in a seq2seq fashion on our datasets. To perform fine-tuning, we split the dataset using a 90%/10% train/validation split and train for 20 epochs to maximize the next token likelihood with a batch size of 32. We select the model with the lowest validation loss at the end of each epoch. We fine-tune with a learning rate of 1 × 10−4 and the AdamW optimizer58. Last, our experiments consider the T5 small, base and large variants.
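A minimal sketch of this fine-tuning setup with the Hugging Face Trainer is shown below; the toy pairs stand in for the synthetic training data, and the released code may organize training differently.

```python
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, T5ForConditionalGeneration,
                          T5TokenizerFast, Trainer, TrainingArguments)

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Toy stand-ins for the synthetic (utterance, parse) pairs.
pairs = [("how important is glucose", "important glucose"),
         ("what do you predict for women", "filter gender female and predict")] * 50

def tokenize(batch):
    enc = tokenizer(batch["utterance"], truncation=True)
    enc["labels"] = tokenizer(batch["parse"], truncation=True)["input_ids"]
    return enc

data = Dataset.from_dict({"utterance": [u for u, _ in pairs],
                          "parse": [p for _, p in pairs]})
data = data.map(tokenize, batched=True).train_test_split(test_size=0.1)  # 90%/10%

args = TrainingArguments(
    output_dir="t5-parser",
    num_train_epochs=20,
    per_device_train_batch_size=32,
    learning_rate=1e-4,                 # AdamW is the Trainer default optimizer
    evaluation_strategy="epoch",        # validation loss at the end of each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(model=model, args=args,
                  data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
                  train_dataset=data["train"], eval_dataset=data["test"])
trainer.train()
```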

Generating responses

After TalkToModel executes a parse, it composes the results of the operations into a natural language response that it returns to the user. TalkToModel generates these responses by filling in templates associated with each operation based on the results. The responses also include sufficient context to understand the results and opportunities for following up (examples in Table 2). Further, because the system runs multiple operations in one execution, TalkToModel joins the response templates into a final response, ensuring semantic coherence, and shows it to the user. Compared with approaches that generate responses using neural methods59, this approach ensures that the responses are trustworthy and do not contain useless information hallucinated by the system, which would make for a very poor user experience in the high-stakes applications we consider. Further, because TalkToModel supports a wide variety of different operations, this approach ensures sufficient diversity in the responses, so they are not repetitive.
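A minimal sketch of template-based response generation is shown below; the templates and result fields are illustrative assumptions.

```python
# Fill in one template per executed operation and join them into a reply.
RESPONSE_TEMPLATES = {
    "predict": "The model predicts {label} for {group} with probability {probability:.0%}.",
    "important": "The most important feature is {top_feature}, followed by {second_feature}.",
}

def generate_response(executed_operations):
    parts = [RESPONSE_TEMPLATES[name].format(**result)
             for name, result in executed_operations]
    return " ".join(parts)

print(generate_response([
    ("predict", {"label": "diabetes", "group": "men older than 20",
                 "probability": 0.64}),
    ("important", {"top_feature": "glucose", "second_feature": "BMI"}),
]))
```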

Executing parses

In this section, we provide an overview of the execution engine, which runs the operations necessary to respond to user utterances in the conversation. Further, this component automatically selects the most faithful explanations for the user, helping ensure explanation accuracy.

Feature importance explanations

At its core, TalkToModel explains to users why the model makes predictions using feature importance explanations. Feature importance explanations $\phi(\mathbf{x}, f) \to \boldsymbol{\phi}$ accept as input a data point $\mathbf{x} \in \mathbb{R}^{d}$ with $d$ features and a model $f(\mathbf{x}) \to y$, where $y \in [0, 1]$ is the probability for a particular class, and generate a feature attribution vector $\boldsymbol{\phi} \in \mathbb{R}^{d}$, where greater magnitudes correspond to more important features6,7,60,61,62,63.

We implement these using post hoc feature importance explanations. Post hoc feature importance explanations do not rely on internal details of the model $f$ (for example, internal weights or gradients) but only on the input data $\mathbf{x}$ and predictions $y$, so users are not limited to only certain types of models64,65,66,67,68. Note that our system can easily be extended to other explanations that rely on internal model details, if required4,8,69,70.

Explanation selection

While there exist several post hoc explanation methods, each adopts a different definition of what constitutes an explanation71. For example, while local interpretable model-agnostic explanations (LIME), Shapley additive explanations (SHAP) and integrated gradients all output feature attributions, LIME returns the coefficients of local linear models, SHAP computes Shapley values and integrated gradients leverages model gradients. Consequently, we automatically select the most faithful explanation for users, unless a user specifically requests a certain technique. Following previous works, we compute faithfulness by perturbing the most important features and evaluating how much the prediction changes72. Intuitively, if the attributions $\boldsymbol{\phi}$ correctly capture the feature importance ranking, perturbing more important features should lead to greater effects.

While previous works65,73 compute faithfulness over many different thresholds, making comparisons harder, or require retraining entirely from scratch, we introduce a single metric, called the fudge score, that captures the sensitivity of the prediction to perturbing certain features. This metric is the mean absolute difference between the model’s prediction on the original input and on a fudged version of the features indicated by $\mathbf{m} \in \{0, 1\}^{d}$

$$\mathrm{Fudge}(f, \mathbf{x}, \mathbf{m}) = \frac{1}{N}\sum_{n=1}^{N} \left|\, f(\mathbf{x}) - f(\mathbf{x} + \boldsymbol{\epsilon}_{n} \odot \mathbf{m}) \right|$$
(1)

where $\odot$ is the element-wise product and $\boldsymbol{\epsilon} \sim \mathcal{N}(0, I\sigma)$ is $N \times d$-dimensional Gaussian noise. To evaluate the faithfulness of a particular explanation method, we compute the area under the fudge score curve over the top-$k$ most important features, thereby summarizing the results into a single metric

$$\mathbb{1}(k, \boldsymbol{\phi})_{i} = \begin{cases} 1 & \text{if } i \in \arg\max_{\boldsymbol{\phi}' \subset \{1, \ldots, d\},\, |\boldsymbol{\phi}'| = k} \sum_{j \in \boldsymbol{\phi}'} |\phi_{j}| \\ 0 & \text{otherwise} \end{cases}$$
(2)
$$\mathrm{Faith}(\boldsymbol{\phi}, f, \mathbf{x}, K) = \sum_{k=1}^{K} \mathrm{Fudge}\left(f, \mathbf{x}, \mathbb{1}(k, \boldsymbol{\phi})\right)$$
(3)

where $\mathbb{1}(k, \boldsymbol{\phi})$ is the indicator vector for the top-$k$ most important features. Intuitively, if a set of feature importances $\boldsymbol{\phi}$ correctly identifies the most important features, perturbing them will have greater effects on the model’s predictions, resulting in higher faithfulness scores. We compute faithfulness for multiple explanations and select the one with the highest score. In practice, we consider LIME64 with kernel widths [0.25, 0.50, 0.75, 1.0] and KernelSHAP74, leaving all other settings at their defaults. We set σ = 0.05 to ensure that perturbations stay in the local region around the prediction, $K = \mathrm{floor}(\frac{d}{2})$, and N = 10,000 to sample sufficiently. One complication arises for categorical features, to which we cannot apply Gaussian perturbations. For these features, we instead resample the value from the corresponding dataset column 30% of the time, guaranteeing that the feature remains categorical under perturbation. Last, if multiple explanations return similar fidelities (within δ = 0.01), we break ties using the explanation stability metric proposed in ref. 75, because it is much more desirable for the explanation to be robust to perturbations7,76. Unlike their work, we compute the Jaccard similarity between feature rankings instead of the L2 norm, because the norm might not be comparable between explanation types with different ranges, while the Jaccard similarity is unaffected. Further, we compute the area under the top-$k$ curve using the Jaccard similarity stability metric, as in equation (3), to make this measure more robust.
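The following minimal sketch implements the fudge score and faithfulness of equations (1)–(3) for continuous features; categorical perturbation and the Jaccard tie-breaking step are omitted, and the defaults follow the values above.

```python
import numpy as np

def fudge_score(predict_proba, x, mask, sigma=0.05, n_samples=10_000):
    """Mean absolute change in prediction when the masked features are perturbed."""
    noise = np.random.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    perturbed = x + noise * mask                    # only masked features move
    original = predict_proba(x[None, :])[:, 1]
    return np.mean(np.abs(original - predict_proba(perturbed)[:, 1]))

def faithfulness(attributions, predict_proba, x, top_k=None):
    """Area under the fudge score curve over the top-k most important features."""
    d = x.shape[0]
    top_k = top_k or d // 2
    ranking = np.argsort(-np.abs(attributions))
    total = 0.0
    for k in range(1, top_k + 1):
        mask = np.zeros(d)
        mask[ranking[:k]] = 1.0                     # indicator of the top-k features
        total += fudge_score(predict_proba, x, mask)
    return total
```

Computing this score for, say, LIME’s and KernelSHAP’s attributions on the same instance and keeping the explanation with the higher value implements the selection rule described above.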

Additional explanation types

As users will have explainability questions that cannot be answered solely with feature importance explanations, we include additional explanations to support a wider array of conversation topics. In particular, we support counterfactual explanations and feature interaction effects. These methods enable conversations about how to obtain different outcomes and whether features interact with each other during predictions, supporting a broad set of user queries. We implement counterfactual explanations using the diverse counterfactual explanations (DiCE) method, which generates a diverse set of counterfactuals77. Having access to many plausible counterfactuals is desirable because it enables users to see a breadth of different, potentially useful, options. Also, we implement feature interaction effects using the partial-dependence-based approach from ref. 78 because it is effective and quick to compute.
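As a hedged sketch, the following shows how these operations could be backed by existing libraries, the dice-ml package for diverse counterfactuals and scikit-learn’s partial dependence for pairwise interactions; the exact configuration used in TalkToModel may differ, and the tiny dataset is purely illustrative.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence
import dice_ml

# Tiny illustrative dataset and model.
df = pd.DataFrame({"glucose": [85, 140, 160, 99, 180, 120],
                   "bmi": [22.0, 31.5, 35.2, 24.1, 38.0, 29.3],
                   "y": [0, 1, 1, 0, 1, 0]})
clf = GradientBoostingClassifier().fit(df[["glucose", "bmi"]], df["y"])

# Diverse counterfactuals: several ways to flip one instance's prediction.
data = dice_ml.Data(dataframe=df, continuous_features=["glucose", "bmi"],
                    outcome_name="y")
model = dice_ml.Model(model=clf, backend="sklearn")
explainer = dice_ml.Dice(data, model, method="random")
cfs = explainer.generate_counterfactuals(df[["glucose", "bmi"]].iloc[[1]],
                                         total_CFs=3, desired_class="opposite")
cfs.visualize_as_dataframe()

# Feature interaction via partial dependence on a pair of features.
interaction = partial_dependence(clf, df[["glucose", "bmi"]],
                                 features=[("glucose", "bmi")])
```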

Exploring data and predictions

Because the process of understanding models often requires users to inspect the model’s predictions, errors and the data, TalkToModel supports a wide variety of data and model exploration tools. For example, TalkToModel provides options for filtering data and performing what-if analyses, supporting user queries that concern subsets of data or what would happen if data points change. Users can also inspect model errors, predictions and prediction probabilities, and compute summary statistics and evaluation metrics for individuals and groups of instances. TalkToModel additionally supports summarizing common patterns in mistakes on groups of instances by training a shallow decision tree on the model errors in the group. Also, TalkToModel enables descriptive operations, which explain how the system works, summarize the dataset and define terms to help users understand how to approach the conversation. Overall, TalkToModel supports a rich set of conversation topics in addition to explanations, making the system a complete solution for the model understanding requirements of end users.
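A minimal sketch of this error-summarization step is shown below; the function and argument names are illustrative.

```python
# Summarize where the model errs by fitting a shallow decision tree to a
# binary "was this prediction wrong?" label over a group of instances.
from sklearn.tree import DecisionTreeClassifier, export_text

def summarize_errors(model, X_group, y_group, feature_names, max_depth=2):
    errors = (model.predict(X_group) != y_group).astype(int)
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X_group, errors)
    return export_text(tree, feature_names=list(feature_names))
```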

Extensibility

While we implement TalkToModel with several different choices for operations such as feature importance explanations and counterfactual explanations, TalkToModel is highly modular and system designers can easily incorporate new operations or change existing ones by modifying the grammar to best support their user populations. This design makes TalkToModel straightforward to extend to new settings, where different operations may be desired.

Broader impact statement

The TalkToModel system and, more generally, conversational model explainability can be applied to a wide range of applications, including financial, medical or legal applications. Our research could be used to improve model understanding in these situations by improving transparency and encouraging the positive impact of ML systems, while reducing errors and bias. Although TalkToModel has many positive applications, the system makes it easier for those without high levels of technical expertise to understand ML models, which could lead to a false sense of trust in ML systems. In addition, because TalkToModel makes it easier for those with lower levels of expertise to use ML models, there is a risk of inexperienced users applying ML models inappropriately. While TalkToModel includes several measures to prevent such risks, such as qualifying when explanations or predictions are inaccurate and clearly describing the intended purpose of the ML model, it would be useful for researchers to investigate, and for potential adopters to be mindful of, these considerations. While completing this research, the authors complied with all relevant ethical regulations of human research.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.