Explaining Machine Learning Models with Interactive Natural Language Conversations Using TalkToModel

Practitioners increasingly use machine learning (ML) models, yet they have become more complex and harder to understand. To address this issue, researchers have proposed techniques to explain model predictions. However, practitioners struggle to use explainability methods because they do not know which to choose and how to interpret the results. We address these challenges by introducing TalkToModel: an interactive dialogue system that enables users to explain ML models through natural language conversations. TalkToModel comprises three components: 1) an adaptive dialogue engine that interprets natural language and generates meaningful responses, 2) an execution component, which constructs the explanations used in the conversation, and 3) a conversational interface. In real-world evaluations, 73% of healthcare workers agreed they would use TalkToModel over existing systems for understanding a disease prediction model, and 85% of ML professionals agreed TalkToModel was easier to use, demonstrating that TalkToModel is highly effective for model explainability.


Introduction
Due to their strong performance, machine learning (ML) models increasingly make consequential decisions in several critical domains, such as healthcare, finance, and law. However, state-of-the-art ML models, such as deep neural networks, have become more complex and harder to understand. This dynamic poses challenges in real-world applications for model stakeholders who need to understand why models make predictions and whether to trust them. Consequently, practitioners have often turned to inherently interpretable machine learning models for these applications, including decision lists and sets [33,8,75,84] and generalized additive models [41,5,14,13,85], which people can more easily understand. Nevertheless, black-box models are often more flexible and accurate, motivating the development of post-hoc explanations that explain the predictions of trained ML models. These techniques either fit faithful models in the local region around a prediction or inspect internal model details, like gradients, to explain predictions [56,66,61,67,58,29,65,68].
Yet, recent work suggests practitioners often have difficulty using explainability techniques [35,31,79]. These challenges stem from the difficulty of figuring out which explanations to implement, how to interpret their results, and how to answer follow-up questions beyond the initial explanation. In the past, researchers have proposed several point-and-click dashboards to help overcome these issues, such as the Language Interpretability Tool (LIT) [73], which is designed for understanding natural language processing (NLP) models, and the "What-If" Tool [2], which is aimed at performing counterfactual analyses for models. However, these methods still require a high level of expertise because users must know which explanations to run, and they lack the flexibility to support arbitrary follow-up questions users might have. Overall, the difficulty of understanding ML models through simple and intuitive interactions is a key bottleneck to adoption across many applications.
Natural language dialogues are a promising solution for supporting broad and accessible interactions with ML models due to their ease of use, expressive capacity, and support for continuous discussion. However, designing a dialogue system that enables a satisfying model understanding experience introduces several challenges. First, the system must handle many conversation topics about the model and data while facilitating natural conversation flow [77]. For instance, these topics may include explainability questions, like the most important features for predictions, and general questions, such as data statistics or model errors. Further, the system must work for a variety of model classes and data, and it should understand language usage across different settings [12]. For example, participants will use different terminology in conversations about loan prediction than in conversations about disease diagnosis. Last, the dialogue system should generate accurate responses that address the users' core questions [51,86]. In the literature, researchers have suggested some prototype designs for generating explanations using natural language. However, these initial designs address specific explanations and model classes, limiting their applicability in general conversational explainability settings [69,20].
In this work, we address these challenges by introducing TalkToModel: a system that enables open-ended natural language dialogues for understanding ML models for any tabular dataset and classifier (an overview of TalkToModel is provided in Figure 1). Users can have discussions with TalkToModel about why predictions occur, how the predictions would change if the data changes, and how to flip predictions, among many other conversation topics (an example conversation is provided in Table 1). Further, they can perform these analyses on any group in the data, such as a single instance or a specific group of instances. For example, on a disease prediction task, users can ask "how important is BMI for the predictions?" or "so how would decreasing the glucose levels by ten change the likelihood of men older than twenty having the disease?" TalkToModel will respond by describing how, for instance, BMI is the most important feature for predictions, and decreasing glucose will decrease the chance by 20%. From there, users can engage further in the conversation by asking follow-up questions like "what if you instead increased glucose by ten for that group of men?" and TalkToModel uses the context to respond accurately. Conversations with TalkToModel make model explainability straightforward because users can simply talk with the system in natural language about the model, and the system will generate useful responses.
To support such rich conversations with TalkToModel, we introduce techniques for both language understanding and model explainability. First, we propose a dialogue engine that parses user text inputs (referred to as user utterances) into an SQL-like programming language using a large language model (LLM). The LLM performs the parsing by treating the translation of user utterances into the programming language as a seq2seq learning problem, where the user utterances are the source and parses in the programming language are the targets [71]. In addition, the TalkToModel language combines operations for explanations, ML error analyses, data manipulation, and descriptive text into a single language capable of representing a wide variety of potential conversation topics, covering most model explainability needs (an overview of the different operations is provided in Table 2). To support adapting the system to any dataset and model, we introduce lightweight adaptation techniques to fine-tune LLMs to perform the parsing, enabling strong generalization to new settings. Second, we introduce an execution engine that runs the operations in each parse. To reduce the burden of users deciding which explanations to run, we introduce methods that automatically select explanations for the user. In particular, this engine runs many explanations, compares their fidelities, and selects the most accurate ones. Finally, we construct a text interface where users can engage in open-ended dialogues with the system, enabling anyone, including those with minimal technical skills, to understand ML models.

Results
In this section, we demonstrate that TalkToModel accurately understands users in conversations by evaluating its language understanding capabilities on ground truth data. Next, we evaluate the effectiveness of TalkToModel for model understanding by performing a real-world human study with healthcare workers (e.g., doctors and nurses) and ML practitioners, where we benchmark TalkToModel against existing explainability systems. We find users both prefer and are more effective using TalkToModel than traditional point-and-click explainability systems, demonstrating its effectiveness for understanding ML models.

Gold Parse Collection. We construct gold datasets (i.e., ground truth (utterance, parse) pairs) across multiple datasets to evaluate the language understanding performance of our models. To construct these gold datasets, we adopt an approach inspired by Yu et al. [83], which constructs a similar dataset for multitask semantic parsing.
Our gold dataset generation process is as follows. First, we write 50 (utterance, parse) pairs for the particular task (i.e., loan or diabetes prediction). These utterances range from simple ("How likely are people in the data to have diabetes?") to complex ("If these people were not unemployed, what's the likelihood they are good credit risk? Why?") and conversational ("What if they were twenty years older?"). We include each operation (Table 2) at least twice in the parses to ensure good coverage. From there, we ask Mechanical Turk workers to rewrite the utterances while preserving their semantic meaning, so that the ground truth parse for the revised utterance is the same but the phrasing differs. We ask workers to rewrite each pair 8 times, for a total of 400 (utterance, parse) pairs per task. Next, we filter out low-quality revisions. We ask the crowdsourced workers to rate the similarity between the original utterance and the revised utterance on a scale of 1-4, where 4 indicates the utterances have the same meaning and 1 indicates they do not. We collect 5 ratings per revision and remove (utterance, parse) pairs that score below 3.0 on average. Finally, we perform an additional filtering step to ensure data quality by inspecting the remaining pairs ourselves and removing any bad revisions.
Since we want to evaluate TalkToModel's capacity to generalize across different scenarios, we perform this data collection process for 3 different tasks: the Pima Indian Diabetes dataset [19], the German Credit dataset [19], and the COMPAS recidivism dataset [9]. After collecting revisions and ensuring quality, we are left with 200 pairs for German Credit, 190 for Diabetes, and 146 for COMPAS.

Table 3: Exact match parsing accuracy (%) for the 3 gold datasets, on the IID and compositional splits, as well as overall. The fine-tuned T5 models perform significantly better than few-shot GPT-J, and T5 Large performed the best. These results demonstrate that TalkToModel can understand user intentions with a high degree of accuracy using the T5 models.

Models. We compare two strategies for using pre-trained LLMs to parse user utterances into the grammar: 1) few-shot GPT-J [76] and 2) fine-tuned T5 [53]. Both models translate user utterances into the TalkToModel grammar in a seq2seq fashion. However, the GPT-J models are higher capacity and more amenable to in-context learning. This procedure prepends examples of the input and target from the training set to the test instance [10,45,80]. On the other hand, the T5 models require traditional fine-tuning on the input and target pairs. Consequently, the few-shot approach is quicker to set up because it does not require fine-tuning, making it easier for users to get started with the system. However, the fine-tuned T5 leads to improved performance and a better user experience overall while taking longer to set up. To train these models through fine-tuning or prompting, we generate synthetic (utterance, parse) pairs because it is impractical to assume that we can collect ground truth pairs for every new task we wish to use TalkToModel for. We provide additional training details in the Methods section.

We evaluate both the fine-tuned T5 models and the few-shot GPT-J models on the testing data. We additionally implement a naive nearest neighbors baseline, where we select the closest user utterance in the synthetic training set according to the cosine distance of all-mpnet-base-v2 sentence embeddings and return the corresponding parse [54]. For the GPT-J models, we compare N-shot performance, where N is the number of (utterance, parse) pairs from the synthetically generated training sets included in the prompt, and sweep over a range of N for each model. For the larger models, we have to use relatively smaller N in order for inference to fit on a single 48GB GPU.
When parsing the utterances, one issue is that the LLMs' generations are unconstrained, so they may produce parses outside the grammar, resulting in the system failing to run the parse and a poor user experience. To ensure the generations are grammatical, we constrain the decodings to be in the grammar [63]. This technique, referred to as guided decoding, constrains the LLM generations to allow only those tokens that can appear next in the grammar at any point during generation. Practically, we accomplish this by recompiling the grammar at inference time into an equivalent grammar consisting of the tokens in the LLM's vocabulary. While decoding from the LLM, we fix the likelihood of ungrammatical tokens to 0 at every generation step. Thus, the LLM only generates grammatical parses.
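As a minimal sketch of this idea (not the released implementation), guided decoding can be realized with the prefix_allowed_tokens_fn hook of Hugging Face's generate; the grammar_allowed_tokens helper below is a hypothetical placeholder for walking the grammar recompiled over the LLM's token vocabulary.

```python
# Minimal sketch of grammar-guided decoding; `grammar_allowed_tokens` is a
# hypothetical stand-in for walking the token-level recompiled grammar.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def grammar_allowed_tokens(prefix_ids):
    # Placeholder: return the token ids the grammar permits after `prefix_ids`.
    # Here we allow everything, which reduces to unconstrained decoding.
    return list(range(len(tokenizer)))

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # generate() calls this at every decoding step; restricting the returned ids
    # effectively sets the likelihood of ungrammatical tokens to zero.
    return grammar_allowed_tokens(input_ids.tolist())

inputs = tokenizer("how important is glucose for the predictions?", return_tensors="pt")
output_ids = model.generate(
    **inputs, max_new_tokens=64, prefix_allowed_tokens_fn=prefix_allowed_tokens_fn
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```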
Evaluating the Parsing Accuracy. To evaluate performance on the datasets, we use exact match parsing accuracy [72,83,28]. This metric measures whether the generated parse exactly matches the gold parse in the dataset. In addition to the overall dataset, we perform the evaluation on two splits of each gold parse dataset: the IID split and the compositional split. The IID split contains (utterance, parse) pairs where the parse's operations and their structure (but not necessarily the arguments) appear in the training data. The compositional split consists of the remaining parses, whose structures are not in the training data. Because LMs struggle with compositional generalization, this split is generally much harder for LMs to parse [48,82].
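For concreteness, exact match parsing accuracy amounts to a simple string comparison; the following is a sketch of the metric as described here, not the paper's evaluation script, and the example parses are hypothetical.

```python
def exact_match_accuracy(predicted_parses, gold_parses):
    """Fraction of predicted parses that string-match the gold parse exactly
    (after trimming surrounding whitespace)."""
    assert len(predicted_parses) == len(gold_parses)
    matches = sum(p.strip() == g.strip() for p, g in zip(predicted_parses, gold_parses))
    return matches / len(gold_parses)

# Hypothetical example:
preds = ["filter age greater than 30 and explain feature importance"]
golds = ["filter age greater than 30 and explain feature importance"]
print(exact_match_accuracy(preds, golds))  # 1.0
```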
Accuracy. We present the results in Table 3. The fine-tuned T5 models perform better overall than the few-shot GPT-J models. In particular, the T5 Large models perform strongly on both the IID and compositional data and can even parse complex compositional phrases. Notably, the T5 Small model performs better than the GPT-J 6B model, which has two orders of magnitude more parameters. This gap is particularly pronounced on the compositional splits of the data, where the GPT-J few-shot models never exceed 10% parsing accuracy. Overall, these results indicate TalkToModel can understand user utterances with a high degree of accuracy using our best performing T5 models. We therefore recommend the T5 Large model for the best results and use it for our remaining evaluation.

User Study: Utility of Explainability Dialogues
The results in the previous subsection show TalkToModel understands user intentions with a high degree of accuracy. In this subsection, we evaluate how well the end-to-end system helps users understand ML models compared to current explainability systems.

Study Overview
We compare TalkToModel against explainerdashboard, one of the most popular open-source explainability dashboards [18]. This dashboard has similar functionality to TalkToModel, as it provides an accessible way to compute explanations and perform model analyses, making it a reasonable baseline. We perform this comparison using the Diabetes dataset and a gradient boosted tree trained on the data [50]. To compare both systems in a controlled manner, we ask participants to answer general ML questions with TalkToModel and the dashboard. Each question is about basic explainability and model analysis, and participants answer using multiple choice, where one of the options is "Could not determine" if they cannot figure out the answer (though it is straightforward to answer all the questions with both interfaces). For example, questions ask about comparing feature importances ("Is glucose more important than age for the model's predictions for data point 49?") or model predictions ("How many people are predicted not to have diabetes but do not actually have it?"). Participants answer 10 total questions. We divide the 10 questions into 2 blocks of 5 questions each. Both blocks have similar questions but different values to control for memorization (exact questions given in Supplementary Information A). Participants use TalkToModel to answer one block of questions and the dashboard for the other block. In addition, we provide a tutorial on how to use both systems before showing users the questions for each system. Last, we randomize question, block, and interface order to control for biases due to showing interfaces or questions first.

Metrics. Following previous work on evaluating human and ML coordination and trust, we assess several metrics to evaluate user experiences [17,21,24]. We evaluate the following statements along a 1-7 Likert scale at the end of the survey:

• Easiness: I found the conversational interface easier to use than the dashboard interface.
• Confidence: I was more confident in my answers using the conversational interface than the dashboard interface.
• Speed: I felt that I was able to more rapidly arrive at an answer using the conversational interface than the dashboard interface.
• Likeliness To Use: Based on my experience so far with both interfaces, I would be more likely to use the conversational interface than the dashboard interface in the future.

To control for bias associated with the ordering of the terms conversational interface and dashboard interface, we randomized their ordering. We also measure accuracy and the time taken to answer each question. Last, we asked participants to write a short description comparing their experience with both interfaces to capture qualitative feedback about both systems.
Recruitment. Since TalkToModel provides an accessible way to understand ML models, we expect it to be useful for subject matter experts with a variety of experience in ML, including users without any ML experience. As such, we recruited 45 English-speaking healthcare workers with minimal or no ML expertise to take the survey using the Prolific service [49]. This group comprises a range of healthcare workers, including doctors, pharmacists, dentists, psychiatrists, healthcare project managers, and medical scribes. The vast majority of this group (43) stated they had either no experience with ML or had heard about it from reading articles online, while two members indicated they had the equivalent of an undergraduate course in ML. As another point of comparison, we recruited ML professionals with relatively higher ML expertise from ML Slack channels and email lists. We received 13 potential participants, all of whom had graduate course level ML experience or higher, and included all of them in the study. We received approval for this study through our institution's IRB process and obtained informed consent from participants.
Metric Results. A significant majority of healthcare workers agreed they preferred TalkToModel in all the categories we evaluated (Table 4). The same is true for the ML professionals, save for whether they were more likely to use TalkToModel in the future, where 53.8% of participants agreed they would instead use TalkToModel in the future. In addition, participants' subjective impressions of how quickly they could use TalkToModel aligned with their actual speed of use, and both groups arrived at answers significantly more quickly using TalkToModel than the dashboard. The median question answer time (measured as the total time from seeing the question to submitting the answer) was 76.3 seconds using TalkToModel, compared with 158.8 seconds using the dashboard.
Participants were also much more accurate and completed questions at a higher rate (i.e., they did not mark "could not determine") using TalkToModel (Table 5). While both healthcare workers and ML practitioners clicked "could not determine" for about a quarter of the questions using the dashboard, this was true for only 13.8% of healthcare workers' and 6.1% of ML professionals' questions using TalkToModel, demonstrating the usefulness of the conversational interface. On completed questions, both groups were much more accurate using TalkToModel than the dashboard. Most surprisingly, though ML professionals agreed they preferred TalkToModel only about half the time, they answered all the questions correctly using it, while they answered only 62.5% of questions correctly with the dashboard. Finally, we observed TalkToModel's conversational capabilities were highly effective. Only 6 utterances out of over 1,000 total utterances failed to be resolved by the conversational aspect of the system. These failure cases generally involved certain discourse phenomena, like asking for additional elaboration ("more description").
The largest source of errors for participants using the explainability dashboard was two questions concerning the top most important features for individual predictions. The errors for these questions account for 47.4% of healthcare workers' and 44.4% of ML professionals' total mistakes. Answering these questions with the dashboard requires users to perform multiple steps, including choosing the feature importance tab in the dashboard, selecting local explanations for the correct instance, and ranking the features according to their importance. In contrast, the streamlined text interface of TalkToModel made it much simpler to answer these questions, resulting in fewer errors.

Qualitative Results
For the qualitative user feedback, we provide representative quotes from common themes in the responses. Users expressed that they could more rapidly and easily arrive at results, which could be helpful for their professions: "I prefer the conversational interface because it helps arrive at the answer very quickly. This is very useful especially in the hospital setting where you have hundreds of patients getting check ups and screenings for diabetes because it is efficient and you can work with medical students on using the system to help patient outcomes." (P39, medical worker at a tertiary hospital).
Participants also commented on the user friendliness of TalkToModel and its strong conversational capabilities, stating "the conversational [interface] was straight to the point, way easier to use" (P35, nurse) and that "the conversational interface is hands-down much easier to use... it feels like one is talking to a human" (P45, ML professional). We did not find any negative feedback surrounding the conversational capabilities of the system. Users also commented on how easy it was to access information compared to the dashboard: "With the conversational interface you can ask whatever you want to know and with the dashboard you need to specifically search information that you don't actually know where it is." (P31, physical therapist).
All in all, users expressed strong positive sentiment about TalkToModel due to the quality of conversations, presentation of information, accessibility, and speed of use.
Several ML professionals brought up points that could serve as future research directions. Notably, some participants stated they would rather look at the data themselves than rely on an interface that rapidly provides an answer: "I would almost always rather look at the data myself and come to a conclusion than getting an answer within seconds." (P11, ML professional).
In the future, it would be worthwhile to include visualizations of the raw data and the analyses performed by the system to increase trust with expert users, such as ML professionals, who may be skeptical of the high-level answers the system currently provides.

Discussion
With ML models becoming increasingly complex, there is a need to develop techniques to explain model predictions to stakeholders. Nevertheless, practitioners often struggle to use explanations and have many follow-up questions they wish to answer. In this work, we show TalkToModel makes explainable AI accessible to users from a range of backgrounds through natural language conversations. Our experimental findings demonstrate TalkToModel both comprehends users with a high degree of accuracy and helps users understand the predictions of ML models much better than existing systems. In particular, we showed TalkToModel is a highly effective way for domain experts, such as healthcare workers, to understand ML models like those applied to disease diagnosis. Last, we designed TalkToModel to be highly extensible and release the code, data, and a demo for the system at https://github.com/dylan-slack/TalkToModel, making it straightforward for explainability users and researchers to build on the system.
In the future, it will be helpful to investigate applications of TalkToModel in the wild, such as in doctors' offices, laboratories, or professional settings, where model stakeholders use the system to understand their models. In addition, it will be helpful to explore how to use language models to generate conversation responses grounded in the results of the operations. Finally, it will also be helpful to evaluate how best to integrate TalkToModel into existing scientific and professional workstreams to promote its impact and usefulness.

Methods
In this section, we describe the components of TalkToModel. First, we introduce the dialogue engine and discuss how it understands user inputs, maps them to operations, and generates text responses based on the results of running the operations. Second, we describe the execution engine, which runs the operations. Finally, we provide an overview of the interface and the extensibility of TalkToModel.

Text Understanding
To understand the intent behind user utterances, the system learns to translate, or parse, them into logical forms. These parses represent the intentions behind user utterances in a highly expressive and structured programming language that TalkToModel executes.
Compared to dialogue systems that execute specific tasks by modifying representations of the internal state of the conversation [15,37], our parsing-based approach allows for more flexibility in the conversations, supporting open-ended discovery, which is critical for model understanding. Also, this strategy produces a structured representation of user utterances, unlike open-ended systems that generate unstructured free text [60]. Having this structured representation of user inputs is key for our setting, where we need to execute specific operations depending on the user's input, which would not be straightforward with unstructured text.
TalkToModel performs the following steps to accomplish this: 1) the system constructs a grammar for the user-provided dataset and model, which defines the set of acceptable parses; 2) TalkToModel generates (utterance, parse) pairs for the dataset and model; 3) the system fine-tunes a large language model (LLM) to translate user utterances into parses; and 4) the system responds conversationally to users by composing the results of the executed parse into a response that provides context for the results and opportunities to follow up.
Grammar. To represent the intentions behind user utterances in a structured form, TalkToModel relies on a grammar, which defines a domain-specific language for model understanding. While the user utterances themselves will be highly diverse, the grammar provides a way to express user utterances in a structured yet highly expressive fashion that the system can reliably execute. Compared with approaches that treat determining user intentions in conversations as a classification problem [39,11], using a grammar enables the system to express compositions of operations and arguments that take on many different values, such as real numbers, which would otherwise be combinatorially impossible in a prediction setting. Instead, TalkToModel translates user utterances into this grammar in a seq2seq fashion, overcoming these challenges [71]. This grammar consists of production rules that include the operations the system can run (an overview is provided in Table 2), the acceptable arguments for each operation, and the relations between operations. One complication is that user-provided datasets have different feature names and values, making it hard to define one shared grammar between datasets. Instead, we update the grammar based on the feature names and values in a new dataset. For instance, if a dataset only contained the feature names age and income, these two names would be the only acceptable values for the feature argument in the grammar.
To ensure our grammar provides sufficient coverage for XAI questions, we verify that our grammar supports the questions from the XAI question bank. This question bank was introduced by Liao et al. [38] based on interviews with AI product designers and includes 31 core, prototypical questions XAI systems should answer, excluding socio-technical questions beyond the scope of TalkToModel (e.g., "What are the results of other people using the [model]?"). The prototypical questions address topics such as the input/data to the model ("What is the distribution of a given feature?"), model output ("What kind of output does the system give?"), model performance ("How accurate are the predictions?"), global model behavior ("What is the system's overall logic?"), why/why not the system makes individual predictions ("Why is this instance given this prediction?"), and what-if or counterfactual questions ("What would the system predict if this instance changes to...?"). To evaluate how well TalkToModel covers these questions, we review each question and evaluate whether our grammar can parse it. Overall, we find our grammar supports 30 of the 31 prototypical questions. We provide a table of each question and corresponding parse in Supplementary Table 6 and Supplementary Table 7. Overall, the grammar covers the vast majority of XAI-related questions and therefore has good coverage of XAI topics.
Supporting Context in Dialogues. User conversations with TalkToModel naturally include complex conversational phenomena such as anaphora and ellipsis [26,74,27]. That is, conversations refer back to events earlier in the conversation ("what do you predict for them?") or omit information that must be inferred from the conversation ("Now show me for people predicted incorrectly."). However, current language models only parse a single input, making it hard to apply them in settings where the context is important. To support context in the dialogues, TalkToModel introduces a set of operations in the grammar that determine the context for user utterances. In contrast with approaches that maintain the conversation state using neural representations [15,23], grammar operations allow for much more trustworthy and dependable behavior while still fostering rich interactions, which is critical for high-stakes settings; similar mechanisms for incorporating grammar predicates across turns have been shown to achieve strong results [27]. In particular, we leverage two operations, previous filter and previous operation, which look back in the conversation to find the last filter and last operation, respectively. These operations also act recursively. Therefore, if the last filter is itself a previous filter operation, TalkToModel will recursively call previous filter to resolve the entire stack of filters. As a result, TalkToModel is capable of addressing instances of anaphora and ellipsis by using these operations to resolve the entity via co-reference or infer it from the previous conversation history. This dynamic enables users to have complex and natural conversations with TalkToModel.
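The following is a simplified sketch of how this recursive previous filter resolution could work; the data structures and names are illustrative, since the released system operates on parse trees rather than plain strings.

```python
def resolve_filter(turn_filters, index):
    """Resolve the filter in scope at a given turn, recursing whenever a turn's
    filter is the special 'previous filter' operation."""
    current = turn_filters[index]
    if current == "previous filter":
        return resolve_filter(turn_filters, index - 1)
    return current

# Filters recorded for each conversation turn (illustrative):
turn_filters = [
    "filter age greater than 30",  # "show me people older than thirty"
    "previous filter",             # "what do you predict for them?"
    "previous filter",             # "and how important is glucose here?"
]
print(resolve_filter(turn_filters, 2))  # -> "filter age greater than 30"
```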
Parsing Dataset Generation. To parse user utterances into the grammar, we fine-tune an LLM to translate utterances into the grammar in a seq2seq fashion. We use LLMs because these models have been trained on large amounts of text data and are strong priors for language understanding tasks. Thus, they can better understand diverse user inputs than a model trained from scratch, improving the user experience. Further, we automate the fine-tuning of an LLM to parse user utterances into the grammar by generating a training dataset of (utterance, parse) pairs. Compared to dataset generation methods that use human annotators to generate and label datasets for training conversation models [22,59], this approach is much less costly and time consuming, while still being highly effective, and it supports users getting conversations running very quickly. This strategy consists of writing an initial set of user utterances and parses, where parts of the utterances and parses are wildcard terms. TalkToModel enumerates the wildcards with aspects of a user-provided dataset, such as the feature names, to generate a training dataset. Depending on the user-provided dataset schema, TalkToModel typically generates anywhere from 20,000 to 40,000 pairs. Last, we have already written the initial set of utterances and parses, so users only need to provide their dataset to set up a conversation.
Semantic Parsing. Here, we provide additional details about the semantic parsing approach for translating user utterances into the grammar. The two strategies for parsing user utterances with pre-trained LLMs that we considered were 1) few-shot GPT-J [76] and 2) fine-tuned T5 [53]. With respect to the few-shot models, because the LLM's context window only accepts a fixed number of inputs, we introduce a technique to select the set of most relevant prompts for the user utterance. In particular, we embed all the utterances and identify the closest utterances to the user utterance according to the cosine distance of these embeddings. To ensure a diverse set of prompts, we only select one prompt per template. We prompt the LLM using these (utterance, parse) pairs, ordering the closest pairs immediately before the user utterance because LLMs exhibit recency biases [87]. Using this strategy, we experiment with the number of prompts included in the LLM's context window. In practice, we use the all-mpnet-base-v2 sentence transformer model to perform the embeddings [54], and we consider the GPT-J 6B, GPT-Neo 2.7B, and GPT-Neo 1.3B models in our experiments.
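A condensed sketch of this prompt selection step is shown below, assuming a pool of synthetic (utterance, parse) pairs with illustrative parses; it omits the one-prompt-per-template constraint and other details of the released code.

```python
# Sketch of few-shot prompt construction by embedding similarity.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-mpnet-base-v2")

def build_prompt(user_utterance, pool, n_shots=10):
    """Select the n_shots most similar synthetic pairs, placing the closest
    pairs immediately before the user utterance (recency-bias heuristic)."""
    pool_embeddings = embedder.encode([u for u, _ in pool], convert_to_tensor=True)
    query_embedding = embedder.encode(user_utterance, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, pool_embeddings)[0]
    top = scores.topk(min(n_shots, len(pool))).indices.tolist()[::-1]
    shots = "\n".join(f"input: {pool[i][0]}\nparse: {pool[i][1]}" for i in top)
    return f"{shots}\ninput: {user_utterance}\nparse:"

pool = [
    ("how important is glucose?", "important glucose"),
    ("what do you predict for people older than 30?",
     "filter age greater than 30 and predict"),
]
print(build_prompt("is glucose important for the predictions?", pool, n_shots=2))
```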
We also fine-tune pre-trained T5 models in a seq2seq fashion on our datasets. To perform fine-tuning, we split the dataset using a 90/10% train/validation split and train for 20 epochs to maximize the next token likelihood with a batch size of 32. We select the model with the lowest validation loss at the end of each epoch. We fine-tune with a learning rate of 1e-4 and the AdamW optimizer [40]. Last, our experiments consider the T5 Small, Base, and Large variants.
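A condensed sketch of this fine-tuning setup follows; it uses a toy in-memory dataset with illustrative parses and omits batching, the validation split, and other details of the released training code.

```python
# Sketch of seq2seq fine-tuning of T5 on synthetic (utterance, parse) pairs.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

train_pairs = [  # illustrative examples; the real pairs are synthetically generated
    ("how important is glucose?", "important glucose"),
    ("predict for people older than thirty", "filter age greater than 30 and predict"),
]

model.train()
for epoch in range(20):                      # 20 epochs, as described above
    for utterance, parse in train_pairs:     # batching and validation omitted
        inputs = tokenizer(utterance, return_tensors="pt")
        labels = tokenizer(parse, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # next-token likelihood objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```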
Generating Responses. After TalkToModel executes a parse, it composes the results of the operations into a natural language response that it returns to the user. TalkToModel generates these responses by filling in templates associated with each operation based on the results. The responses also include sufficient context to understand the results and opportunities for following up (examples in Table 1). Further, because the system runs multiple operations in one execution, TalkToModel joins the response templates into a final response, ensuring semantic coherence, and shows it to the user. Compared to approaches that generate responses using neural methods [62], this approach ensures the responses are trustworthy and do not contain useless information hallucinated by the system, which would be a very poor user experience for the high-stakes applications we consider. Further, because TalkToModel supports a wide variety of different operations, this approach ensures sufficient diversity in responses, so they are not repetitive.

Executing Parses
In this subsection, we provide an overview of the execution engine, which runs the operations necessary to respond to user utterances in the conversation. Further, this component automatically selects the most faithful explanations for the user, helping ensure explanation accuracy.
Feature Importance Explanations. At its core, TalkToModel explains why the model makes predictions using feature importance explanations. A feature importance explanation Φ(x, f) → φ accepts as input a data point x ∈ ℝ^d with d features and a model f(x) → y, where y ∈ [0, 1] is the probability for a particular class, and generates a feature attribution vector φ ∈ ℝ^d, where greater magnitudes correspond to more important features [66,68,56,81,16,6].
We implement the feature importance explanations using post-hoc methods. Post-hoc feature importance explanations do not rely on internal details of the model f (e.g., internal weights or gradients), only on the input data x and predictions y, so users are not limited to only certain types of models [57,43,34,52,36]. Note that our system can easily be extended to other explanations that rely on internal model details, if required [61,47,5,70].
Explanation Selection. While there exist several post hoc explanation methods, each one adopts a different definition of what constitutes an explanation [32]. For example, while LIME, SHAP, and Integrated Gradients all output feature attributions, LIME returns coefficients of local linear models, SHAP computes Shapley values, and Integrated Gradients leverages model gradients. Consequently, we automatically select the most faithful explanation for users, unless a user specifically requests a certain technique. Following prior works, we compute faithfulness by perturbing the most important features and evaluating how much the prediction changes [44]. Intuitively, if the feature attribution φ correctly captures the feature importance ranking, perturbing more important features should lead to greater effects.
While previous works [43,30] compute faithfulness over many different thresholds, making comparisons harder, or require retraining entirely from scratch, we introduce a single metric, called the fudge score, that captures the prediction's sensitivity to perturbing certain features. This metric is the mean absolute difference between the model's prediction on the original input and a fudged version of it,

FUDGE(x, f, φ, k) = (1/N) Σ_{i=1}^{N} |f(x) − f(x + ε_i ⊙ 1(k, φ))|,

where ⊙ is the tensor product, ε ∼ N(0, Iσ) is N × d dimensional Gaussian noise, and 1(k, φ) is the indicator function for the top-k most important features. To evaluate the faithfulness of a particular explanation method, we compute the area under the fudge score curve over the top-k most important features, thereby summarizing the results into a single metric. Intuitively, if a set of feature importances φ correctly identifies the most important features, perturbing them will have greater effects on the model's predictions, resulting in higher faithfulness scores. We compute faithfulness for multiple different explanations and select the highest. In practice, we consider LIME [57] with kernel widths [0.25, 0.50, 0.75, 1.0] and KernelSHAP [42]. We leave all settings at their defaults besides the kernel widths for LIME. In practice, we set σ = 0.05 to ensure perturbations happen in the local region around the prediction, K to ⌊d/2⌋, and N = 10,000 to sample sufficiently. One complication arises for categorical features, where we cannot apply Gaussian perturbations. For these features, we randomly sample a value from the dataset column 30% of the time to guarantee the feature remains categorical under perturbation. Last, if multiple explanations return similar fidelities, we use the explanation stability metric proposed by Alvarez-Melis and Jaakkola [7] to break ties, because it is much more desirable for the explanation to be robust to perturbations [66,3]. To use this stability metric to break ties when the explanations' fidelities are quite close (within δ = 0.01), we compute the Jaccard similarity between feature rankings instead of the ℓ2 norm used in their work. The reason is that the norm might not be comparable between explanation types, because they have different ranges, while the Jaccard similarity should not be affected. Further, we compute the area under the top-k curve using the Jaccard similarity stability metric, as in Equation 3, to make this measure more robust.
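A NumPy sketch of the fudge score and faithfulness computation described above is given below; the variable names are ours, and the released implementation may differ in details (e.g., the handling of categorical features).

```python
import numpy as np

def fudge_score(predict_proba, x, phi, k, sigma=0.05, n_samples=10_000):
    """Mean absolute change in the predicted probability when Gaussian noise is
    added only to the top-k most important features according to phi."""
    d = x.shape[0]
    top_k = np.argsort(-np.abs(phi))[:k]   # indices of the k most important features
    mask = np.zeros(d)
    mask[top_k] = 1.0                      # the indicator 1(k, phi)
    noise = np.random.normal(0.0, sigma, size=(n_samples, d)) * mask
    perturbed = x[None, :] + noise
    return np.mean(np.abs(predict_proba(x[None, :]) - predict_proba(perturbed)))

def faithfulness(predict_proba, x, phi, k_max):
    """Area under the fudge-score curve over the top-k features, k = 1..k_max."""
    return sum(fudge_score(predict_proba, x, phi, k) for k in range(1, k_max + 1))
```

Explanation selection then computes this faithfulness score for each candidate explanation (the LIME variants and KernelSHAP above) and keeps the one with the highest value, falling back to the stability tie-break when the scores are close.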
Additional Explanation Types. Since users will have explainability questions that cannot be answered solely with feature importance explanations, we include additional explanations to support a wider array of conversation topics. In particular, we support counterfactual explanations and feature interaction effects. These methods enable conversations about how to get different outcomes and whether features interact with each other during predictions, supporting a broad set of user queries. We implement counterfactual explanations using DiCE, which generates a diverse set of counterfactuals [46]. Having access to many plausible counterfactuals is desirable because it enables users to see a breadth of different, potentially useful options. Also, we implement feature interaction effects using the partial dependence based approach from Greenwell et al. [25] because it is effective and quick to compute.
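A hedged sketch of the DiCE usage follows; the dataframe, feature names, classifier, and query instance are illustrative stand-ins rather than the paper's code, and the exact calls may vary across dice-ml versions.

```python
import dice_ml

# `train_df` is a pandas DataFrame of features plus the label column, and `clf`
# is a fitted scikit-learn classifier (both illustrative stand-ins).
data = dice_ml.Data(
    dataframe=train_df,
    continuous_features=["glucose", "bmi", "age"],
    outcome_name="diabetes",
)
model = dice_ml.Model(model=clf, backend="sklearn")
explainer = dice_ml.Dice(data, model)

# Ask for several plausible ways to flip the prediction for one instance
# (`query_instance` is a one-row DataFrame of feature values).
counterfactuals = explainer.generate_counterfactuals(
    query_instance, total_CFs=4, desired_class="opposite"
)
counterfactuals.visualize_as_dataframe()
```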
Exploring Data and Predictions. Because the process of understanding models often requires users to inspect the model's predictions, errors, and the data itself, TalkToModel supports a wide variety of data and model exploration tools. For example, TalkToModel provides options for filtering data and performing what-if analyses, supporting user queries that concern subsets of data or what would happen if data points change. Users can also inspect model errors, predictions, and prediction probabilities, and compute summary statistics and evaluation metrics for individuals and groups of instances.
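These operations largely reduce to dataframe manipulation and re-prediction; the sketch below illustrates the filter and what-if pattern with illustrative stand-ins (a feature DataFrame df and a fitted classifier clf).

```python
# "people older than 30" -> filter operation
group = df[df["age"] > 30].copy()

# "what if their glucose decreased by ten?" -> what-if modification
whatif = group.copy()
whatif["glucose"] = whatif["glucose"] - 10

# Re-predict on the original and modified groups and compare.
original = clf.predict_proba(group)[:, 1].mean()
changed = clf.predict_proba(whatif)[:, 1].mean()
print(f"mean predicted likelihood changes from {original:.2f} to {changed:.2f}")
```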
TalkToModel additionally supports summarizing common patterns in mistakes on groups of instances by training a shallow decision tree on the model errors in the group. Also, TalkToModel enables descriptive operations, which explain how the system works, summarize the dataset, and define terms to help users understand how to approach the conversation. Overall, TalkToModel supports a rich set of conversation topics in addition to explanations, making the system a complete solution for the model understanding requirements of end users.
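A brief sketch of this error-summarization idea with scikit-learn is shown below; the variable names (X, y, clf) are illustrative and the released system formats the extracted rules for conversation.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# X: feature DataFrame for the group, y: true labels, clf: the model under study.
errors = (clf.predict(X) != y).astype(int)           # 1 where the model is wrong
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, errors)

# Yields human-readable rules such as "bmi > 26.95 and glucose <= 125.0 -> incorrect".
print(export_text(surrogate, feature_names=list(X.columns)))
```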

Extensibility
While we implement TalkToModel with several different choices for operations, such as feature importance explanations and counterfactual explanations, TalkToModel is highly modular, and system designers can easily incorporate new operations or change existing ones by modifying the grammar to best support their user populations. This design makes TalkToModel straightforward to extend to new settings where different operations may be desired.

A.1 Grammar Details
In this subsection, we provide additional details about the grammar. First, we describe the design of the grammar. After, we provide details about how we update the grammar for new datasets.
Design. Recall from the main paper that the grammar serves as a logical form of user utterances, which TalkToModel can execute. Here, we provide more details about the grammar design. The grammar defines relations between the operations and the acceptable argument values in Table 2. For example, the grammar defines different acceptable values for the comparison argument in the filter operation, such as less than or equal to or greater than. In addition, we structure the grammar to make parses appear closer to natural language text instead of a formal programming language like SQL or Python. The reason is that language models tend to perform better at translating utterances into grammars that are more similar to natural language than to a programming language [63]. Consequently, we design the grammar so that parses appear more like natural language text, without unnecessary parentheses and commas, and with unnecessary arguments omitted where possible. In general, we found that simplifying the grammar and making it as close to natural language as possible considerably improved performance. For example, the question "What are the three most important features for people older than thirty-five?" would simply translate to filter age greater than 35 and topk 3. Note that because TalkToModel is applied to only one dataset at a time, the dataset argument can be omitted for simplicity. Practically, we implement the grammar in Lark [64] because its implementation supports interactive parsing, simplifying the process of implementing the guided decoding strategy.
Updating the Grammar for Datasets and User Utterances. In the main paper, we discussed how we update the grammar based on the dataset. Here, we provide more description of how we do so. We update the grammar based on the feature names and categorical feature values in the dataset. In particular, the acceptable values for the feature argument (Table 2) in the grammar become all the feature names in the dataset. Further, the value argument for a categorical feature becomes the set of categorical feature values that appear in the data for that feature. Because there are many possible values of numeric features, we instead extract potential numeric values from user utterances as they are provided to the system. Specifically, we set the value argument in the grammar for numeric features to contain the set of numeric values that appear in the user utterance. We additionally support string-based numeric values (e.g., "fifty-five" or "twelve") so that the system handles a wider variety of cases.
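To make this dataset-specific updating concrete, the toy Lark fragment below injects illustrative feature names and categorical values into a small grammar; the real TalkToModel grammar covers far more operations and arguments.

```python
from lark import Lark

feature_names = ["age", "income", "job"]                 # taken from the user's dataset
categorical_values = {"job": ["skilled", "unskilled"]}   # values observed in the data

# Build the dataset-specific alternatives for the feature and value rules.
feature_rule = " | ".join(f'"{name}"' for name in feature_names)
value_rule = " | ".join(f'"{v}"' for vals in categorical_values.values() for v in vals)

grammar = f"""
start: action ("and" action)*
action: filter | explain | predict
filter: "filter" feature comparison value
explain: "explain" "feature" "importance"
predict: "predict"
feature: {feature_rule}
comparison: "greater" "than" | "less" "than" | "equal" "to"
value: NUMBER | {value_rule}
%import common.NUMBER
%import common.WS
%ignore WS
"""

parser = Lark(grammar)
print(parser.parse("filter age greater than 35 and explain feature importance").pretty())
```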

A.2 Training Dataset Generation
In this subsection, we provide details about the generation of the (utterance, parse) pair training dataset. To ensure that we generate a diverse and comprehensive set of (utterance, parse) pairs for training, we compose a total of 687 templates that use 6 different wildcard types. The templates consist of a diverse set of utterances that encompass the different operations permitted in the system. The wildcards include categorical feature names, numeric feature names, class names, numeric feature values, categorical feature values, explanation types, and common filtering expressions (e.g., "{NUMERIC_FEATURE} above {NUMERIC_VALUE}"). Because templates can have potentially many wildcards, we recursively enumerate all the wildcard values for each parse. Further, we limit the number of values for certain wildcards to ensure the number of generated training pairs does not become extremely large. In particular, we limit the number of numeric values to 2 per feature. In addition, to prevent templates with many wildcards from dominating the training dataset, we also downsample the number of values per wildcard to 2 after the initial recursion. In this way, the training dataset does not get dominated by a few templates that have many wildcards, which we found improves performance.
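As a simplified sketch of this wildcard enumeration, using the same template discussed in the worked example below (the feature names and numeric values are illustrative):

```python
from itertools import product

utterance_template = "Explain the predictions on data with {NUMERIC_FEATURE} greater than {NUMERIC_VALUE}"
parse_template = "filter {NUMERIC_FEATURE} greater than {NUMERIC_VALUE} and explain feature importance"

wildcards = {
    "NUMERIC_FEATURE": ["glucose", "bmi", "age"],  # numeric feature names from the dataset
    "NUMERIC_VALUE": ["100", "30"],                # two illustrative numeric values
}

pairs = []
for feature, value in product(wildcards["NUMERIC_FEATURE"], wildcards["NUMERIC_VALUE"]):
    utterance = utterance_template.format(NUMERIC_FEATURE=feature, NUMERIC_VALUE=value)
    parse = parse_template.format(NUMERIC_FEATURE=feature, NUMERIC_VALUE=value)
    pairs.append((utterance, parse))

print(len(pairs), pairs[0])  # 6 synthetic pairs from a single template
```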
As an example, an utterance template is "Explain the predictions on data with {NUMERIC_FEATURE} greater than {NUMERIC_VALUE}" and the corresponding parse template is filter {NUMERIC_FEATURE} greater than {NUMERIC_VALUE} and explain feature importance. From there, we enumerate the numeric features in the dataset and a selection of numeric feature values, substituting these into {NUMERIC_FEATURE} and {NUMERIC_VALUE}, respectively, to generate data.

Supplementary Table 4: The prediction gap on important features (PGI) and prediction gap on unimportant features (PGU) results. We bold the statistically significant best result. Overall, explanation selection is the best explanation method in all settings, except for PGU on the German Credit data, where it is better than SHAP but not significantly better than LIME.

To evaluate the explanation selection procedure, we use the faithfulness metrics provided by the widely-used OpenXAI framework [4]. Specifically, we use the Prediction Gap on Important feature perturbation (PGI) and the Prediction Gap on Unimportant feature perturbation (PGU) metrics. These metrics measure the change in the prediction from perturbing the most influential features and the least influential features, respectively. Intuitively, PGI captures that perturbations to influential features should result in more significant changes to predictions (higher PGI is better). PGU captures that perturbations to non-influential features should result in smaller changes to the prediction (lower PGU is better).
We compare our explanation selection method against the SOTA explanation methods LIME and SHAP [55,42]. To make our evaluation more comprehensive, we compare against LIME using 4 different settings of the kernel width hyperparameter [0.25, 0.50, 0.75, 1.0], because this hyperparameter can have significant effects on the resulting explanation. We leave all other settings at their defaults. Further, we perform this comparison using our 3 diverse datasets, Diabetes, COMPAS, and German Credit, and we compute explanations three times for each data point to reduce error due to explanation sampling. We set the important features used for the PGI metric to the most influential 50% of features and the unimportant features used for the PGU metric to the least influential 50% of features to ensure we provide a comprehensive evaluation of the explanation's ranking of all features in the data.
We present the results in Supplementary Table 4 and provide the mean PGU or PGI value for each explanation and dataset. Further, we bold the best statistically significant result according to a Bonferroni-corrected t-test, where we compare the explanation selection procedure to each of the other explanation methods for the respective dataset and metric. Overall, we find that explanation selection performs better than the baseline SOTA explanations across almost every dataset and metric considered, except for the PGU metric on the German Credit dataset, where explanation selection performs on par with the best performing LIME explanations.

B.2 Effects of the number of training templates
In this subsection, we provide experimental details about the effects of the number of training templates on the parsing accuracy of the system. Because we use a template strategy to generate training data (Section 4), we must decide how many templates to write and include in the training scheme. This raises the question of how the number of templates affects model performance. To understand this behavior, we retrain the T5-Base model, randomly sampling different percentages of the original template set. In particular, we sweep over the percentages [20%, 40%, 60%, 80%, 100%] on the Diabetes dataset, downsampling the templates and retraining the model for each percentage. We give the results in Supplementary Figure 1. We see that there are clear accuracy gains over using a smaller number of templates. Further, the gains in compositional performance seem to level off somewhat, but this is not the case for the IID split, suggesting that further templates may help IID performance.

Supplementary Table 5: Percentage of mistakes for few-shot GPT-J 6B where the selected prompts do not include all the operations in the parse of the user utterance. We see that most of the time the operations for the parse of the user utterance are included in the prompts for the 10-shot models, yet these methods still perform relatively poorly compared to fine-tuned T5.

B.3 Parsing Error Analysis
In the main text, we demonstrated that the fine-tuned T5 models performed considerably better than few-shot GPT-J (Table 3). In this subsection, we perform additional error analysis to understand why this occurs: what is the cause of the poor few-shot results? Since the few-shot GPT-J models select the (utterance, parse) pairs from the synthetic dataset using nearest neighbors on a sentence embedding model, it is possible these issues are due to the sentence embedding model failing to select good pairs. In particular, this nearest neighbors technique could fail to select pairs with the operations necessary to parse the user utterance, meaning the errors would not be due to the model failing to learn from the examples. To evaluate whether this is the case, we compute the percentage of GPT-J's mistakes where the prompts do not include the operations necessary to parse the user utterance. The results, provided in Supplementary Table 5, demonstrate that, especially in the 10-shot case, the operations needed to parse the user utterance are usually included in the prompts, indicating the issues are likely due to the model's capacity to learn few-shot rather than the selection mechanism. In this work, we were limited to using up to the 6-billion-parameter GPT-J, but it could be possible to achieve better results with larger models, as results on emergent abilities suggest [78].

B.4 User Study Results: Per Question Likert Scores
In this subsection, we provide additional user study results. In addition to the questions asked at the end of the survey (Table 4 and Table 5), we also asked users to rate their experiences using both interfaces on a 1-7 Likert scale while they were taking the survey. In particular, we asked users how much they agreed with the following statements:

• I am confident I completed my answer correctly.
• Completing this task took me a lot of effort.
• The interface was useful for completing the task.
• Based on my experience so far, I trust the interface to understand machine learning models.
• Based on my experience so far, I would use the interface again to understand machine learning models.
To evaluate these results, we compute the mean and standard deviation of the Likert scores for the 1st through 5th question each user sees (question ordering is randomized, so users see different questions first). We compute this for each statement and interface. The results for the medical workers are provided in Supplementary Figure 2 and for the ML professionals in Supplementary Figure 3. Overall, the medical workers clearly prefer TalkToModel while answering the questions. Interestingly, they seem to gain trust in TalkToModel over time, going from "somewhat agree" to "agree" with the statement "Based on my experience so far, I trust the interface to understand ML models" by the end of the survey.

B.5 XAI Question Bank
Here, we provide parses in the TalkToModel grammar for the prototypical questions given in the XAI question bank [38]. Our grammar can parse 30 of the 31 core, prototypical questions, excluding socio-technical questions, demonstrating the grammar's broad coverage. Note that the questions provided in the question bank vary in how they are phrased regarding whether additional coreference is necessary. For instance, the question bank includes both questions of the form "what do you predict for this?" and "what do you predict for Q?". For conciseness, we write each question in the form where no further coreference is necessary ("what do you predict for Q?"). For cases where additional coreference is necessary, it is straightforward to use the previous_filter operation to resolve the coreference. These results demonstrate the TalkToModel grammar is well equipped to support XAI questions.

Supplementary Figure 3: ML professionals' per-question Likert results: these participants were more mixed about which interface they preferred while taking the survey. Interestingly, while the participants were much more accurate using TalkToModel, this population rated their answers at a similar confidence and said they trusted the interfaces similarly while taking the survey. Error bars are 1 standard deviation.

Figure 1: Overview of TalkToModel. Instead of writing code, users have conversations with TalkToModel as follows: (1) users supply natural language inputs; (2) the dialogue engine parses the input into an executable representation; (3) the execution engine runs the operations, and the dialogue engine uses the results in its response.

Supplementary Figure 1: Randomly sampling prompt templates and re-training T5-Base on the Diabetes dataset. For each down-sample percentage, the prompts are randomly down-sampled and the model is re-trained 5 times. The error bars are 1 standard deviation.
Supplementary Figure 2: Medical worker per-question Likert results: in general, these participants preferred TalkToModel over the dashboard to answer the questions. Participants rated their agreement with each statement (confidence in their answer, effort required, usefulness of the interface, trust in the interface, and willingness to use it again) on a scale from "Completely Disagree" to "Completely Agree" for both the conversational interface and the dashboard. Error bars are 1 standard deviation.

Table 1: A conversation about diabetes prediction, demonstrating the breadth of different conversation points the system can discuss.
"So how important is bmi for the model's predictions on the is highly important for the model's predictions.rank 1 is the most important feature (out of 8 features).Overall, bmi bmi feature is ranked on average 2.86 (95% CI [2.685, 3.035]).Here,For the model's predictions on data with age greater than 30.0, the bmi for these people?" "Ah, so which two features are mostly more important than 2. pedigree_function 1. glucose the top 2 most important features are: For the models predictions on data where age is greater than 30.0,incorrecthere"?"What types of patients is the model typically predicting Want to take a closer look at these rules?on100.0% of 2 samples.³(bmi ≤ 26.95) and (glucose > 180.5) then the model is incorrect <= 0.943) then the model is incorrect on 62.5% of 8 samples ³ (bmi > 26.95) and (glucose > 125.0) and (age > 57.5) then the model is incorrect on 62.3% of 53 samples ³ if (bmi > 26.95) and (glucose ≤ 125.0) and (pedigree_function predicts incorrect:For data with age greater than 30.0, the model typically predict

Table 2: Overview of the operations supported by TalkToModel, which are incorporated into the conversation to generate responses.

Table 4: User study results: % of respondents that agree (> neutral Likert score) that TalkToModel is better than the dashboard on the 4 comparison questions. A significant portion of respondents agreed TalkToModel is better than the dashboard in all categories except for the ML professionals on "Likeliness To Use"; still, a majority agreed TalkToModel was superior in this case.

Table 5: User study results: completion rate and accuracy across interfaces and participant groups. We compute the completion rate as the fraction of questions users provided an answer for and did not mark "could not determine." We measure accuracy on completed questions. Participants answered questions at a higher rate and more accurately using TalkToModel than the dashboard.