Explaining a series of models by propagating Shapley values

Local feature attribution methods are increasingly used to explain complex machine learning models. However, current methods are limited because they are extremely expensive to compute or are not capable of explaining a distributed series of models where each model is owned by a separate institution. The latter is particularly important because it often arises in finance where explanations are mandated. Here, we present Generalized DeepSHAP (G-DeepSHAP), a tractable method to propagate local feature attributions through complex series of models based on a connection to the Shapley value. We evaluate G-DeepSHAP across biological, health, and financial datasets to show that it provides equally salient explanations an order of magnitude faster than existing model-agnostic attribution techniques and demonstrate its use in an important distributed series of models setting.


Introduction
With the widespread adoption of machine learning (ML), series of models (i.e., where the outputs of predictive models are used as inputs to separate predictive models) are increasingly common. Examples include: (1) stacked generalization, a widely used technique [1][2][3][4][5] to improve generalization performance by ensembling the predictions of many models (called base-learners) using another model (called a meta-learner) [6]; (2) neural network feature extraction, where models are trained on features extracted using neural networks [7,8], typically for structured data [9][10][11]; and (3) consumer scores, where predictive models that describe a specific behavior (e.g., credit scores [12]) are used as inputs to downstream predictive models. For example, a bank may use a model to predict customers' loan eligibility on the basis of their bank statements and their credit score, which itself is often a predictive model [13].
Explaining a series of models is crucial for debugging and building trust, all the more so because a series of models is inherently harder to explain than a single model. One popular paradigm for explaining models is local feature attribution, which explains why a model makes a prediction for a single sample (known as the "explicand" [14]). Existing model-agnostic local feature attribution methods (e.g., IME [15], LIME [16], KernelSHAP [17]) work regardless of the specific model being explained. They can explain a series of models, but suffer from two distinct shortcomings: (1) their sampling-based estimates of feature importance are inherently variable, and (2) they have a high computational cost that may not be tractable for large pipelines. Alternatively, model-specific local feature attribution methods (i.e., attribution methods that work for specific types of models) are often much faster than model-agnostic approaches, but generally cannot be used to explain a series of models. Examples include those for (1) deep models (e.g., DeepLIFT [18], Integrated Gradients [19]) and (2) tree models (e.g., Gain/Gini Importance [20], TreeSHAP [21]).
In this paper, we present Generalized DeepSHAP (G-DeepSHAP, hereafter DeepSHAP), a local feature attribution method that is faster than model-agnostic methods and can explain complex series of models that pre-existing model-specific methods cannot. DeepSHAP is based on connections to the Shapley value, a concept from game theory that satisfies many desirable axioms. We make several important contributions:

Generalizing DeepSHAP local explanations
A closely related method to DeepSHAP was designed to explain deep models (f : R^m → R) [17] by performing DeepLIFT [18] with the average input as a single baseline. However, using a single average baseline is not the correct approach for explaining non-linear models, based on connections to Shapley values with an interventional conditional expectation set function and a flat causal graph [30]. Instead, we show that the correct way to obtain interventional Shapley value local feature attributions φ(f, x_e) ∈ R^m for an explicand x_e ∈ R^m (the sample being explained) is to average over single-baseline feature attributions φ(f, x_e, x_b) ∈ R^m, where each baseline x_b ∈ R^m is drawn from a set of baselines D (details in Methods Section 6.2):

φ(f, x_e) = (1/|D|) Σ_{x_b ∈ D} φ(f, x_e, x_b).

DeepLIFT [18] explains deep models by propagating feature attributions through each layer of the deep model. Here, we extend DeepLIFT by generalizing its rescale rule to accommodate more than neural network layers while guaranteeing layer-wise efficiency (details in Methods Section 6.4). For a series of models that can be represented as a composition of functions f = h_k ∘ ⋯ ∘ h_1 (with h_i : R^{m_i} → R^{o_i}, m_1 = m, and o_k = 1) and intermediary models f_i(x) = (h_i ∘ ⋯ ∘ h_1)(x), DeepSHAP attributions are computed recursively as:

ψ_k = φ(h_k, f_{k-1}(x_e), f_{k-1}(x_b)),
ψ_i = φ(h_i, f_{i-1}(x_e), f_{i-1}(x_b)) (ψ_{i+1} ⊘ (f_i(x_e) - f_i(x_b))).

We use Hadamard division ⊘ to denote an element-wise division of a by b that accommodates zero division: if the denominator b_i is 0, we set a_i/b_i to 0. The attributions φ for a particular model in the stack are computed using DeepLIFT with the rescale rule for deep models [18], interventional TreeSHAP for tree models [21], or exactly for linear models. Each intermediate attribution ψ_i serves as a feature attribution that satisfies efficiency for h_i's input features, and the attribution in the raw feature space is given by ψ_1. This approach takes inspiration from the chain rule applied specifically to deep networks in [18], which we extend to more general classes of models.
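As a minimal numeric sketch of this propagation (illustrative weights and values, not the paper's implementation; closed-form single-baseline Shapley attributions are used for each stage), consider a two-model series where a linear stage feeds a product nonlinearity:

```python
import numpy as np

def safe_div(a, b):
    # Hadamard division with the 0/0 -> 0 convention used by the rescale rule
    out = np.zeros_like(a, dtype=float)
    np.divide(a, b, out=out, where=(b != 0))
    return out

# Stage 1: linear model h1(x) = W1 @ x (exact attributions in closed form)
W1 = np.array([[1.0, -2.0],
               [0.5,  1.0]])
h1 = lambda x: W1 @ x
# Stage 2: h2(z) = z[0] * z[1]
h2 = lambda z: z[0] * z[1]
f = lambda x: h2(h1(x))

x_e = np.array([1.0, 2.0])  # explicand
x_b = np.array([0.0, 1.0])  # single baseline

# phi(h1)[i, j]: credit of input i for output j of the linear stage
phi_h1 = (W1 * (x_e - x_b)).T                       # shape (m1, o1) = (2, 2)
z_e, z_b = h1(x_e), h1(x_b)
# Exact Shapley values for the product game with a single baseline
psi2 = np.array([(z_e[0] - z_b[0]) * (z_e[1] + z_b[1]) / 2,
                 (z_e[1] - z_b[1]) * (z_e[0] + z_b[0]) / 2])
# Generalized rescale: divide by stage-1 output differences, multiply through
psi1 = phi_h1 @ safe_div(psi2, z_e - z_b)
# psi1 sums to f(x_e) - f(x_b): the chained attributions remain efficient
```

The final comment mirrors the layer-wise efficiency guarantee: the propagated attributions ψ_1 in the raw feature space sum to the end-to-end prediction difference.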

Incorporating a baseline distribution
We now use DeepSHAP to explain deep models with different choices of baseline distributions to empirically evaluate our theoretical connections to interventional conditional expectations. We show that single-baseline attributions are biased in a CNN that achieves 75.56% test accuracy (hyperparameters in Appendix Section A.2.1) on the CIFAR-10 data set [31]. We aim to demonstrate that single baselines can lead to bias in explanations by comparing attributions using either a single baseline (an all-black image) as in DeepLIFT or a random set of 1000 baselines (random training images) as in DeepSHAP. Although the black pixels in the image are qualitatively important, using a single baseline leads to biased attributions with little attribution mass on black pixels (Figure 2). In comparison, averaging over multiple baselines leads to qualitatively more sensible attributions. Quantitatively, we show that despite the prevalence of darker pixels (pixel distribution plots in Figure 2), single-baseline attributions are biased to give them low attribution, whereas averaging over many baselines more sensibly assigns a large amount of credit to dark pixels (attribution distribution plots in Figure 2). To generalize this finding beyond DeepSHAP, we replicate this bias for IME and IG, two popular feature attribution methods that similarly rely on baseline distributions (Appendix Section A.4.1). To demonstrate the importance of the baseline distribution as a parameter, we explain an MLP (hyperparameters in Appendix Section A.2.2) with 0.872 ROC AUC for predicting fifteen-year mortality in the NHANES I data set. We use DeepSHAP to explain an explicand relative to a baseline distribution drawn uniformly from all samples (Figure 3a (top)). This explanation places substantial emphasis on age and sex because it compares the explicand to a population that includes many younger/female individuals. However, in practice epidemiologists are unlikely to compare a 74-year-old male to the general
population. Therefore, we can manually select a baseline distribution of older males to reveal novel insights, as in Figure 3a (bottom). The impact of gender is gone because we compare only to males, and the impact of age is lower because we compare only to older individuals. Furthermore, the impact of physical activity is much higher because being physically active stands out more relative to older individuals, who are less active than the general population. This example illustrates that the baseline distribution is an important parameter for feature attributions.
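The single-baseline bias described above can be reproduced on a tiny two-feature toy model (illustrative numbers; f(x) = x0·x1 admits closed-form single-baseline Shapley values): with an all-zeros baseline, a zero-valued "dark" feature can never receive credit, whereas averaging over a diverse baseline set assigns it substantial attribution.

```python
import numpy as np

# Exact single-baseline interventional Shapley values for f(x) = x[0] * x[1]
def shap_product(x_e, x_b):
    return np.array([(x_e[0] - x_b[0]) * (x_e[1] + x_b[1]) / 2,
                     (x_e[1] - x_b[1]) * (x_e[0] + x_b[0]) / 2])

x_e = np.array([0.0, 3.0])  # feature 0 is "dark" (value 0, like a black pixel)

# Single all-zeros baseline (the all-black-image analogue): both attributions
# are exactly zero, so the dark feature looks unimportant
single = shap_product(x_e, np.zeros(2))

# Averaging over a diverse baseline distribution instead:
baselines = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
averaged = np.mean([shap_product(x_e, b) for b in baselines], axis=0)
# Now the dark feature receives the largest-magnitude credit
```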

Natural scientific questions with baseline distributions
To provide a more principled approach to choosing the baseline distribution parameter, we propose k-means clustering to select a baseline distribution (details in Methods Section 6.3). Previous work analyzed clustering in the attribution space or contrasting to negatively/positively labelled samples [22]. In Figure 3b, we show clusters according to age and gender. Then, we explain many older male explicands using either a general population or an older male population baseline distribution (Figure 3c). When we compare to the older male baselines, the importance of age is centered around zero, sex is no longer important, and the importance orderings of the remaining features change. Further, the inquiry we make changes from "What features are important for older males relative to a general population?" to "What features are important for older males relative to other older males?". To quantitatively evaluate whether our attributions answer the second inquiry, we ablate features in order of their positive/negative importance by masking with the mean of the older male baseline distribution (Figure 3d; Methods Section 6.8). In both plots, lower curves indicate attributions that better estimate positive and negative importance. For both tests, attributions with a baseline distribution chosen by k-means clustering substantially outperform a baseline distribution drawn from the general population.
We find that our clustering-based approach to selecting a baseline distribution has a number of advantages. Our recommendation is to choose baseline distributions by clustering according to non-modifiable, yet meaningful, features like age and gender. This yields explanations that answer questions relative to inherently interpretable subpopulations (e.g., older males). The first advantage is that choosing baseline distributions in this way decreases variance in the features that determined the clusters and subsequently reduces their importance to the model. This is desirable for age and gender because individuals typically cannot modify their age or gender in order to reduce their mortality risk. Second, this approach could potentially reduce model evaluation on off-manifold samples when computing Shapley values [32,33] by considering only baselines within a reasonable subpopulation. The final advantage is that the flexibility of choosing a baseline distribution allows feature attributions to answer natural contrastive scientific questions [22] that improve model comprehensibility, as in Figure 3c.
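A sketch of this clustering-based baseline selection on simulated cohort data, with a plain Lloyd's k-means as a stand-in for any clustering routine (feature values and cluster count are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy cohort clustered on non-modifiable features (age, sex):
young = np.column_stack([rng.normal(30, 5, 50), rng.integers(0, 2, 50)])
old_male = np.column_stack([rng.normal(75, 5, 50), np.ones(50)])
X = np.vstack([young, old_male])

def kmeans(X, k=2, iters=25, seed=0):
    # Plain Lloyd's algorithm; a library implementation would also work
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

centers, labels = kmeans(X)

# Baseline distribution: the cluster whose centroid is nearest the explicand
x_e = np.array([74.0, 1.0])  # a 74-year-old male
cluster = int(np.argmin(((x_e - centers) ** 2).sum(-1)))
baselines = X[labels == cluster]  # explain x_e relative to similar individuals
```

The explicand is then explained against `baselines` only, so non-modifiable features (age, sex) contribute little attribution mass.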
Explaining a series of models

DeepSHAP (and DeepLIFT) have been shown to be very fast and performant methods for explaining deep models [18,34,35]. In this section, we instead focus on evaluating our extension of DeepSHAP to accommodate a series of mixed models (trees, neural networks, and linear models) and address four impactful applications.

Group attributions identify meaningful gene sets
We explain two MLPs trained to predict Alzheimer's disease status and breast cancer tumor stage from gene expression data, with test ROC AUCs of 0.959 and 0.932, respectively. We aim to demonstrate that our approach to propagating attributions to groups contributes to model interpretability by validating our discoveries against the scientific literature. Gene expression data is often extremely high dimensional; as such, solutions such as gene set enrichment analysis (GSEA) are widely used [36]. In contrast, we aim to attribute importance to gene sets while maintaining efficiency by proposing a group rescale rule (Methods Section 6.7). This rule sums attributions for the genes belonging to each group and then normalizes according to the excess attribution mass due to multiple groups containing the same gene. It generalizes to arbitrary groups of features beyond gene sets, such as categories of epidemiological features (e.g., laboratory measurements, demographic measurements, etc.).
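A sketch of the group rescale rule's bookkeeping, assuming one plausible normalization for shared genes (dividing each gene's attribution by its number of set memberships); the gene sets and attribution values here are illustrative, not results from the paper:

```python
# Hypothetical per-gene attributions for a single explicand
attr = {"SERPINA3": 0.40, "UBTD2": -0.10, "UBE2C": 0.25,
        "GENE4": 0.05, "GENE5": -0.20}

# Curated gene sets; genes may appear in several sets or in none
gene_sets = {"pathway_A": ["SERPINA3", "UBTD2"],
             "pathway_B": ["UBTD2", "UBE2C"]}

# Count memberships so genes shared by sets do not contribute excess mass
counts = {g: sum(g in members for members in gene_sets.values()) for g in attr}

group_attr = {name: sum(attr[g] / counts[g] for g in members)
              for name, members in gene_sets.items()}
# Residual for genes in no set, so the group totals still satisfy efficiency
group_attr["residual"] = sum(a for g, a in attr.items() if counts[g] == 0)
```

By construction, the group attributions plus the residual sum to the same total as the per-gene attributions, preserving efficiency at the group level.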
We can validate several key genes identified by DeepSHAP. For Alzheimer's disease, the overexpression of SERPINA3 has been closely tied to prion diseases [37], and UBTD2 has been connected to frontotemporal dementia, a neurodegenerative disorder [38]. For breast cancer tumor stage, UBE2C was positively correlated with tumor size and histological grade [39]. In addition to understanding gene importance, higher-level understanding can be obtained using gene sets, i.e., groups of genes defined by biological pathways or co-expression. We obtain gene set attributions by grouping genes according to a curated gene set collection: the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database (additional gene set attributions in Appendix Section A.4.3).

Figure 4: Propagating attributions to gene sets enables higher-level understanding. (a) Gene and gene set attributions for predicting Alzheimer's disease using gene expression data. (b) Gene and gene set attributions for predicting breast cancer tumor stage using gene expression data. Residuals in the gene set attributions summarize contributions from genes that are not present in any gene set and describe variation in the output not described by the pathways we analyzed. Note that (a) and (b) show summary plots (Appendix Section A.3.3).
Next, we verify important gene sets identified by DeepSHAP. For Alzheimer's disease, the glyoxylate and dicarboxylate metabolism pathway was independently identified based on metabolic biomarkers [40]; several studies have demonstrated aberrations in the TCA cycle in the Alzheimer's disease brain [41]; and alterations of purine-related metabolites are known to occur in early stages of Alzheimer's disease [42]. For breast cancer, many relevant proteins are involved in ubiquitin-proteasome pathways [43], and purine metabolism was identified as a major metabolic pathway differentiating a highly metastatic breast cancer cell line from a slightly metastatic one [44]. Identifying these phenotypically relevant biological pathways demonstrates that our group rescale rule identifies important pathways.

Loss attributions provide insights to model behavior
We examine an NHANES (1999-2014) mortality prediction GBT model (0.868 test set ROC AUC) to show how explaining the model's loss (loss explanations) provides important insights different from those revealed by explaining the model's output (output explanations). DeepSHAP lets us explain transformations of the model's output. For instance, we can explain a binary classification model in terms of its log-odds predictions, its probability predictions (often easier for non-technical collaborators to understand; see Appendix Section A.4.2), or its loss computed from the prediction. Here, we focus on local feature attributions that explain per-sample loss. We train our model on the first five release cycles of the NHANES data (1999-2008) and evaluate it on a test set of the last three release cycles (2009-2014) (Figure 5a). As a motivating example, we simulate a covariate shift in the weight variable by re-coding it to be measured in pounds, rather than kilograms, in release cycles 7 and 8 (Figure 5b). Then, we ask, "Can we identify the impact of the covariate shift with feature attributions?" Comparing the train and test output attributions, release cycles 7 and 8 are skewed, but they mimic the same general shape of the training set attributions. If we did not color by release cycle, it might be difficult to identify the covariate shift. In contrast, for loss attributions with positive labels, we can identify that the falsely increased weight leads to many misclassified samples where the loss weight attribution exceeds the expected loss. Although such debugging is powerful, it is not perfect. Note that in the negatively labelled samples, we cannot clearly identify the covariate shift because higher weights are protective and lead to more confident negative mortality predictions.
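The idea of explaining a transformation of the output can be sketched on a toy two-feature logistic model (illustrative weights; label y = 1), attributing both the prediction and the per-sample log loss with exact single-baseline Shapley values:

```python
import numpy as np

def shap2(g, x_e, x_b):
    # Exact Shapley values for a two-feature function and a single baseline
    v = lambda S: g(np.where(np.isin([0, 1], list(S)), x_e, x_b))
    phi0 = 0.5 * ((v([0]) - v([])) + (v([0, 1]) - v([1])))
    phi1 = 0.5 * ((v([1]) - v([])) + (v([0, 1]) - v([0])))
    return np.array([phi0, phi1])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
f = lambda x: sigmoid(2 * x[0] - 3 * x[1])   # probability output
loss = lambda x: -np.log(f(x))               # log loss for a positive label

x_e, x_b = np.array([1.0, 2.0]), np.array([0.0, 0.0])
out_attr = shap2(f, x_e, x_b)      # explains the prediction
loss_attr = shap2(loss, x_e, x_b)  # explains the per-sample loss
# Feature 1 lowers the predicted probability (negative output attribution)
# and therefore raises the loss (positive loss attribution)
```

Each attribution vector satisfies efficiency for its own target: the output attributions sum to f(x_e) - f(x_b) and the loss attributions to the loss difference.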
Next, we examine the natural generalization gap induced by covariate shift over time, which produces dramatically different losses in the train and test sets (Figure 5c). We can see that output attributions are similarly shaped between the train and test distributions; however, the loss attributions in the test set are much higher than in the training set. We can quantitatively verify that blood lead affects model performance more in the test set by ablating blood lead for the top 10 samples in the train and test sets according to their loss attributions. From this, we can see that blood lead constitutes a substantial covariate shift in the model's loss and helps explain the observed generalization gap.
As an extension of the quantitative evaluation in Figure 5c, we can visualize the impact on the model's loss of ablating by output attributions compared to ablating by loss attributions (Figure 5d). This ablation test (Methods Section 6.8) asks "What features are important to the model's performance (loss)?" Ablating the positive and negative output attributions both increase the mean model loss by hiding features central to making predictions. However, ablating by the negative loss attribution directly increases the loss far more drastically than ablating by the output. Moreover, ablating positive loss attributions clearly decreases the mean loss, which is not achievable by output attribution ablation. Finally, we compare loss attributions computed using either a model-agnostic approach or DeepSHAP. In this setting, DeepSHAP is two orders of magnitude faster than model-agnostic approaches (IME, KernelSHAP, and LIME) while showing extremely competitive positive loss ablation performance and the best negative loss ablation performance.

We compare DeepSHAP explanations to a number of model-agnostic explanations for a series of two models: a CNN feature extractor fed into a GBT model that classifies MNIST zeros with 0.998 test accuracy. In this example, non-linear transformations of the original feature space improve the performance of the downstream model (Appendix Section A.4.4) but make model-specific attributions impossible. Qualitatively, we can see that DeepSHAP and IME are similar, whereas KernelSHAP is similar for certain explicands but not others (Figure 6a). LIME's attributions show the shape of the original digit, but there is a consistent attribution mass around the surrounding parts of the digit. Overall, we observe that the DeepSHAP attributions are sensible: the pixels that constitute the zero digit and the absence of pixels in the center of the zero are important for a positive zero classification.
In terms of quantitative evaluations, we report the runtime and performance of the different approaches in Figure 6b. We see that DeepSHAP is an order of magnitude faster than model-agnostic approaches, with KernelSHAP being the second fastest. Then, we ablate the top 10% of important positive or negative pixels to see how the model's prediction changes. If we ablate positive pixels, we would expect the model's predictions to drop, and vice versa for negative pixels; both tests show that DeepSHAP outperforms KernelSHAP and LIME, and performs comparably to IME at greatly reduced computational cost.
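The ablation test used throughout can be sketched as follows: mask features in attribution order with the baseline mean and track the model's prediction (a toy linear model, for which these attributions are exact; all values are illustrative):

```python
import numpy as np

def ablation_curve(f, x_e, attr, baseline_mean, order="positive"):
    # Mask features one at a time, most positive (or most negative) first,
    # replacing each with the baseline distribution's mean value
    idx = np.argsort(-attr) if order == "positive" else np.argsort(attr)
    x = x_e.astype(float).copy()
    preds = [f(x)]
    for i in idx:
        x[i] = baseline_mean[i]
        preds.append(f(x))
    return np.array(preds)

# Toy linear model: Shapley attributions are exactly w * (x_e - baseline)
w = np.array([2.0, -1.0, 0.5])
f = lambda x: float(w @ x)
x_e = np.array([1.0, 1.0, 1.0])
baseline_mean = np.zeros(3)
attr = w * (x_e - baseline_mean)  # [2.0, -1.0, 0.5]

curve = ablation_curve(f, x_e, attr, baseline_mean, order="positive")
# Ablating positive attributions first drives the prediction down fastest
```

Better attributions produce steeper curves, which is what the ablation plots in the figures compare across methods.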

Explaining distributed proprietary models
We evaluate DeepSHAP explanations for a consumer scoring example that feeds a simulated GBT fraud score model and a simulated MLP credit score model into a GBT bank model, which classifies good risk performance (0.681 test ROC AUC) (Figure 7). Consumer scores (e.g., credit scores, fraud scores, health risk scores, etc.) describe individual behavior with predictive models [12]. A vast industry of data brokers generates consumer scores based on a plethora of consumer data. For instance, a single data broker in a 2014 FTC study had 3000 data segments on nearly every consumer in the United States, and another broker added three billion new records to its databases each month [45]. Unfortunately, explaining the models that use consumer scores can obscure important features. For instance, explaining the bank model in Figure 7a will tell us that the fraud and credit scores are important (Figure 7c), but these scores are inherently opaque to consumers [12]. The truly important features may instead be those that these scores use. A better solution might be model-agnostic methods that explain the entire pipeline at once. However, model-agnostic approaches require access to all models. In Figure 7a, a single institution would have to obtain access to the fraud, credit, and bank models to use standard model-agnostic approaches (Figure 7b (left)). This may be fundamentally impractical because each of these models is proprietary. This opacity is concerning given the growing desire for transparency in artificial intelligence [12,45,46].
DeepSHAP naturally addresses this obstacle by enabling attributions to the original features without forcing companies to share their proprietary models, provided each institution in the pipeline agrees to work together and uses a consistent set of baselines. Furthermore, DeepSHAP can incorporate any other efficiency-satisfying feature attribution method in an analogous way (e.g., integrated/expected gradients [19]). Altogether, DeepSHAP constitutes an effective way to "glue" together explanations across distributed models in industry. In particular, in Figure 7a, the lending institution can explain its bank model in terms of bank features and the fraud and credit scores. The bank then sends the fraud and credit score attributions to their respective companies, who can use them to generate DeepSHAP attributions to the original fraud and credit features. The fraud and credit institutions then send these attributions back to the bank, which can provide explanations in terms of the original, more interpretable features to its applicants (Figure 7d).
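The message flow described above can be sketched with toy linear sub-models, so that each institution's local attributions are exact; the institution weights, features, and the single shared baseline are all hypothetical:

```python
import numpy as np

def safe_div(a, b):
    # Rescale-rule division with the 0/0 -> 0 convention
    return 0.0 if b == 0 else a / b

# Each institution holds only its own model (linear here, so exact)
w_fraud, w_credit = np.array([1.0, -2.0]), np.array([0.5, 1.5])
w_bank = np.array([2.0, -1.0, 0.3])  # [fraud score, credit score, bank feature]

x_e_f, x_b_f = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # fraud features
x_e_c, x_b_c = np.array([2.0, 1.0]), np.array([1.0, 1.0])  # credit features
x_e_k, x_b_k = 3.0, 0.0                                    # bank's own feature

# Step 1 (bank): attribute its model over [scores, bank features]
s_e = np.array([w_fraud @ x_e_f, w_credit @ x_e_c, x_e_k])
s_b = np.array([w_fraud @ x_b_f, w_credit @ x_b_c, x_b_k])
bank_attr = w_bank * (s_e - s_b)

# Step 2 (fraud/credit): each receives ONLY its score's attribution and
# rescales its own local attributions -- no model changes hands
fraud_attr = (w_fraud * (x_e_f - x_b_f)) * safe_div(bank_attr[0], s_e[0] - s_b[0])
credit_attr = (w_credit * (x_e_c - x_b_c)) * safe_div(bank_attr[1], s_e[1] - s_b[1])

# Step 3 (bank): assemble attributions in the original feature space
full_attr = np.concatenate([fraud_attr, credit_attr, [bank_attr[2]]])
# full_attr sums to the end-to-end prediction difference (efficiency)
```

Each party only ever exchanges attribution vectors, never model parameters, which is what makes this protocol viable across separate institutions.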
We first quantitatively verify that the DeepSHAP attributions for this pipeline are comparable to the model-agnostic approaches in Figure 7b. We once again see that DeepSHAP attributions are competitive with the best performing attribution methods for ablating the top 5 most important positive or negative features. Furthermore, we see that DeepSHAP is several orders of magnitude faster than the best performing ablation methods (KernelSHAP and IME) and an order of magnitude faster and much more performant than LIME.
We can qualitatively verify the attributions in Figures 7c-d. In Figure 7c, we find that the fraud and credit scores are extremely important to the final prediction. In addition, bank features including low revolving balance divided by credit limit ("NetFractionRevolvingBurden") and a low number of months since the most recent inquiry ("MSinceMostRecentInqExcl7Days") are congruously important to good risk performance. Then, in Figure 7d we use the generalized rescale rule to obtain attributions in the original feature space. Doing so uncovers important variables hidden by the fraud and credit scores. In particular, we see that the fraud score heavily relied on a high number of months since the applicant's oldest trade ("MSinceOldestTradeOpen"), and the credit score relied on a low number of months since the most recent delinquency ("MSinceMostRecentDelq") to identify applicants likely to have good risk performance. Importantly, the pipeline we analyze in Figure 7a also constitutes a stacked generalization ensemble, which we analyze more generally in Appendix Section A.4.5.

Figure 7: Explaining a stacked generalization pipeline of models for the HELOC data set (details in Appendix Section A.1.7). (a) A simulated model pipeline in the financial services industry. We partition the original set of features into fraud, credit, and bank features. We train a model to predict risk using fraud data and a model to predict risk using credit data. Then, we use the outputs of the fraud and credit models as scores alongside additional bank features to predict the final customer risk. (b) Ablation tests (ablating the top 5 positive/negative features out of a total of 22 features) comparing model-agnostic approaches (LIME, KernelSHAP, IME with 100 samples), which require access to all models in the pipeline, and DeepSHAP, which allows institutions to keep their proprietary models private. (c) Summary plot of the top six features the bank model uses to predict risk (TreeSHAP). (d) Summary plot of the top six features the entire pipeline uses to explain risk (DeepSHAP). The green features originate from the fraud data, and the yellow features from the credit data. We explain 1000 randomly sampled explicands using 100 randomly sampled baselines for all attribution methods. Note that (c) and (d) show summary plots (Appendix Section A.3.3).

Discussion
In this manuscript, we presented examples where explaining a series of models is critical. Series of models are prevalent in a variety of applications (health, finance, environmental science, etc.), where understanding model behavior contributes important insights. Furthermore, having a fast approach to explain these complex pipelines may be a major desideratum for a diagnostic tool to debug ML models.
The practical applications we focus on in this paper include gene set attribution, where the number of features far surpasses the number of samples. In this case, we provide a rule that aggregates feature attributions into higher-level groups of features while maintaining efficiency. Second, we demonstrate the utility of explaining transformations of a model's default output (Appendix Section A.4.2). Explaining the probability output rather than the log-odds output of a logistic model yields more naturally interpretable feature attributions. Furthermore, explaining the loss of a logistic model enables debugging model performance and identifying covariate shift. A third application is neural network feature extraction, where pipelines may include transformations of the original features fed into a different model. In this setting we demonstrate the computational tractability of DeepSHAP compared to model-agnostic approaches. Finally, because our approach propagates feature attributions through a series of models while satisfying efficiency at each step (Methods Section 6.5), the intermediary attributions at each part of the pipeline can be interpreted as well. We use this to understand the importance of both consumer scores and the original features used by the consumer scores.
In consumer scoring, distributed proprietary models (i.e., models that exist in different institutions) have historically been an obstacle to transparency. This lack of transparency is particularly concerning given the prevalence of consumer scores, with some data brokers having thousands of data segments on nearly every American consumer [45]. In addition, many new consumer scores fall outside the scope of previous regulations (e.g., the Fair Credit Reporting Act and the Equal Credit Opportunity Act) [12]. In fact, new consumer scores that depend on features correlated with protected factors (e.g., race) can reintroduce discrimination hidden behind proprietary models, an issue that has historically been a concern for credit scores (the oldest existing example of a consumer score) [12]. DeepSHAP naturally enables feature attributions in this setting, takes a significant and practical step towards increasing the transparency of consumer scores, and provides a tool to help safeguard against hidden discrimination.
It should be noted that we focus specifically on evaluating DeepSHAP for a series of mixed model types. Previous work evaluates the rescale rule for explaining deep models specifically. The original presentation of the rescale rule [18] demonstrates its applicability to deep networks in explaining digit classification and regulatory DNA classification. Schwab & Karlen show that, for explaining deep networks, DeepSHAP is a very fast yet performant approach in terms of an ablation test for explaining MNIST and CIFAR images. Finally, Sixt et al. show that many modified backpropagation feature attribution techniques are independent of the parameters of later layers, with the exception of DeepLIFT. This particularly significant finding suggests that, compared to most fast backpropagation-based deep feature attribution approaches, DeepSHAP (which relies on the same rescale rule as DeepLIFT) is not ignorant of later layers in the network.
Although DeepSHAP works very well for explaining a series of mixed model types in practice, an inherent limitation is that it is not guaranteed to satisfy the desirable axioms (e.g., implementation invariance) that other feature attribution approaches satisfy (assuming exact solutions to their intractable problem formulations) [15,17,19]. This suggests that DeepSHAP may be more appropriate for model debugging or for identifying scientific insights that warrant deeper investigation, particularly in settings where the models or the input dimension are large and tractability is a major concern. For applications where high-stakes decision making is important, it may be more appropriate to run axiomatic approaches to completion or to use interpretable models [47]. However, in many real-world circumstances, such as distributed proprietary models based on credit risk scores, exact axiomatic approaches and interpretable models are not feasible. In these cases DeepSHAP represents a promising direction that allows multiple agents to collaboratively build explanations while maintaining separation of model ownership.

The Shapley value
The Shapley value is a solution concept for allocating credit among players M = {1, ..., m} in an m-person game. The game is fully described by a set function v : P(M) → R that maps each subset of players S ⊆ M to a scalar value. The Shapley value for player i is the average marginal contribution of that player over all possible permutations of the players:

φ_i(v) = (1/|Σ(M)|) Σ_{P ∈ Σ(M)} [v(S_P ∪ {i}) - v(S_P)],

where Σ(M) denotes the finite symmetric group (the set of all permutations of M) and S_P is the set of players before player i in the permutation P. The Shapley value is a provably unique solution under a set of axioms (Appendix Section A.5.1). One axiom that we focus on in this paper is efficiency:

Σ_{i ∈ M} φ_i(v) = v(M) - v(∅).

Adapting the Shapley value for feature attribution of ML models

Unfortunately, the Shapley value cannot assign credit for an ML model (f(x) : R^m → R) directly, because most models require inputs with values for every feature rather than a subset of features. Accordingly, feature attribution approaches based on the Shapley value define a new set function v(S) that is a "lift" μ of the original model [48]. In this paper, we focus on local feature attributions, which describe a model's behavior for a single sample, called an explicand (x_e). One common lift is the observational conditional expectation, where the lift is the conditional expectation of the model's output holding the features in S fixed to their explicand values, and X is a multivariate random variable with joint distribution D:

v(S) = E[f(X) | X_S = (x_e)_S].

Another common lift is the interventional conditional expectation with a flat causal graph, where we "intervene" on features by breaking the dependence between the features in X_S and the remaining features using the causal inference do-operator [30]:

v(S) = E[f(X) | do(X_S = (x_e)_S)].

Both approaches have tradeoffs that have been described elsewhere [14,22,32,33,49]. Here, we focus on the interventional approach because it is most closely related to DeepSHAP and does not require estimating the joint density of X.
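The permutation form of the Shapley value, and the efficiency axiom, can be checked directly on a small illustrative game (the payoffs are hypothetical):

```python
import numpy as np
from itertools import permutations

def shapley_values(v, m):
    # phi_i: player i's marginal contribution averaged over all orderings
    phi = np.zeros(m)
    perms = list(permutations(range(m)))
    for P in perms:
        S = set()
        for i in P:
            phi[i] += v(S | {i}) - v(S)
            S.add(i)
    return phi / len(perms)

# Illustrative 3-player game: base payoffs plus a bonus when 0 and 1 cooperate
base = [1.0, 2.0, 3.0]
def v(S):
    return sum(base[i] for i in S) + (2.0 if {0, 1} <= S else 0.0)

phi = shapley_values(v, 3)
# The bonus is split equally between players 0 and 1; player 2 gets none of it
```

The attributions sum to v(M) - v(∅), illustrating the efficiency axiom used throughout the paper.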
The Shapley values computed for any lift will satisfy efficiency in terms of the lift μ. However, for the interventional and observational lifts described above, the Shapley value will also satisfy efficiency in terms of the model's prediction:

Σ_{i=1}^{m} φ_i(f, x_e) = f(x_e) - E_D[f(X)].

This means that attributions can naturally be understood to be in the scale of the model's predictions (e.g., log-odds or probability for binary classification).

Interventional Shapley values with a baseline distribution
We can define a single-baseline lift using a spliced sample χ_S, where (χ_S)_i = (x_e)_i if i ∈ S and (χ_S)_i = (x_b)_i otherwise. Then, we can decompose the Shapley value φ_i(f, x_e) for the interventional conditional expectation lift (eq. 8), henceforth referred to as the interventional Shapley value, into an average of Shapley values with single-baseline lifts (proof in Appendix Section A.5.2):

φ_i(f, x_e) = (1/|D|) Σ_{x_b ∈ D} φ_i(f, x_e, x_b),

where each φ_i(f, x_e, x_b) is the Shapley value for a single-baseline lift. Here, D is an empirical distribution with equal probability for each sample in a baseline data set. An analogous result exists for the observational conditional distribution lift using an input distribution [22]. In the original DeepLIFT paper, [18] recommend two heuristic approaches to define baseline distributions: (1) choosing a sensible single baseline and (2) averaging over multiple baselines. In addition, DeepSHAP, as previously described in [17], created attributions with respect to a single baseline equal to the expected value of the inputs. In this paper, we show that from the perspective of Shapley values with an interventional conditional expectation lift, averaging over feature attributions computed with single baselines drawn from an empirical distribution is the correct approach. One exception is linear models, where taking the average as the baseline is equivalent to averaging over many single-baseline feature attributions [49]. Interventional Shapley values computed with a single baseline satisfy efficiency in terms of the model's prediction:

Σ_{i=1}^{m} φ_i(f, x_e, x_b) = f(x_e) - f(x_b).
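This decomposition can be verified by exact enumeration on a tiny model (an illustrative sketch with m = 3 features and a hand-picked baseline set D): the Shapley values of the interventional lift coincide with the average of single-baseline Shapley values, because the Shapley value is linear in the set function.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley(v, m):
    # Exact Shapley values of a set function v over m players
    phi = np.zeros(m)
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for k in range(m):
            for S in combinations(others, k):
                w = factorial(k) * factorial(m - k - 1) / factorial(m)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

def splice(x_e, x_b, S):
    # chi_S: explicand values on S, baseline values elsewhere
    chi = x_b.copy()
    chi[list(S)] = x_e[list(S)]
    return chi

f = lambda x: x[0] * x[1] + x[2]  # a small non-linear model
x_e = np.array([1.0, 2.0, 3.0])
D = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [2.0, 2.0, 1.0]])
m = 3

# Interventional lift: v(S) = average of f over spliced baselines
interventional = shapley(lambda S: np.mean([f(splice(x_e, b, S)) for b in D]), m)
# Average of single-baseline Shapley values over the same D
averaged = np.mean([shapley(lambda S, b=b: f(splice(x_e, b, S)), m) for b in D],
                   axis=0)
# Both agree, and both sum to f(x_e) - E_D[f(X)]
```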

Selecting a baseline distribution
As in the previous section, we define a baseline distribution $\mathcal{D}$ over which we compute Shapley values with single-baseline lifts. This baseline distribution is naturally chosen to be a distribution over the training data $X^{train}$, where each sample $x^j \in \mathbb{R}^m$ has equal probability. The interpretation of this distribution is that the explicand is compared to each baseline in $\mathcal{D}$. This means that the interventional Shapley values implicitly create attributions that explain the model's output relative to a baseline distribution.
Although the entire training distribution is a natural and interpretable choice of baseline distribution, it may be desirable to use others. To automate the process of choosing such an interpretable baseline distribution, we turn to unsupervised clustering: we utilize k-means clustering on a reduced version of the training data with a reduced set of features $M_r$ (e.g., age and sex). Then, the cluster selected as the baseline distribution for explaining an explicand $x^e$ is the cluster whose mean is closest to the explicand on the reduced features:

$$\arg\min_{l \in \{1, \dots, k\}} \big\| \tilde{x}^e - \mu_l \big\|^2.$$

The attributions for these single-baseline games are also analogous to Baseline Shapley in [14].
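A minimal sketch of this selection step, assuming a nearest-centroid rule over the reduced features; the feature indices, cluster contents, and toy values are hypothetical:

```python
def nearest_cluster_baseline(x_e, cluster_means, clusters, reduced_idx):
    """Pick the baseline cluster whose k-means centroid (over the reduced
    features M_r, e.g. age and sex) is nearest to the explicand.
    The nearest-centroid rule is an assumed reading of the selection step."""
    def sq_dist(mu):
        return sum((x_e[i] - mu_j) ** 2 for i, mu_j in zip(reduced_idx, mu))
    best = min(range(len(cluster_means)), key=lambda l: sq_dist(cluster_means[l]))
    return clusters[best]

# Toy setup: features are [age, sex, bmi]; clusters were fit on [age, sex].
means = [[30.0, 0.0], [70.0, 1.0]]                      # younger female / older male
clusters = [[[28.0, 0.0, 22.0]], [[72.0, 1.0, 27.0]]]   # members per cluster
x_e = [68.0, 1.0, 31.0]                                 # older male explicand
baseline = nearest_cluster_baseline(x_e, means, clusters, reduced_idx=[0, 1])
assert baseline == clusters[1]                          # older male cluster selected
```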

A generalized rescale rule to explain a series of models
We define a generalized rescale rule to explain an arbitrary series of models that propagates approximate Shapley values with an interventional conditional expectation lift for each model in the series. To describe the approach, we define a series of models to be a composition of functions $f(x) = h_k(h_{k-1}(\cdots h_1(x)))$, and we define intermediary models $f_i(x) = h_i(h_{i-1}(\cdots h_1(x)))$, with $f_0(x) = x$. We define the domain and codomain of each model in the series as $h_i(x): \mathbb{R}^{m_i} \to \mathbb{R}^{o_i}$. Then, we can define the propagation for a single baseline recursively:

$$\psi_k = \phi\big(h_k,\, f_{k-1}(x^e),\, f_{k-1}(x^b)\big),$$
$$\psi_i = \phi\big(h_i,\, f_{i-1}(x^e),\, f_{i-1}(x^b)\big)\,\Big(\psi_{i+1} \oslash \big(f_i(x^e) - f_i(x^b)\big)\Big) \quad \text{for } i < k.$$

We use Hadamard division $\oslash$ to denote an element-wise division of $a$ by $b$ that accommodates zero division: if the denominator $b_i$ is 0, we set $a_i / b_i$ to 0. Additionally, $\phi$ is an appropriate feature attribution technique that approximates interventional Shapley values while crucially satisfying efficiency for the model $h_i$ it is explaining. In this paper, we utilize DeepLIFT (Rescale) for deep models, TreeSHAP for tree models, and exact interventional Shapley values for linear models. We define efficiency as

$$1_{1 \times m_i}\, \phi(h_i, x^e, x^b) = h_i(x^e) - h_i(x^b),$$

where $1_{a \times b}$ is a matrix of ones with shape $a \times b$ and the approximate Shapley value functions $\phi$ return matrices in $\mathbb{R}^{m_i \times o_i}$. The final attributions in the original feature space are $\psi_1$. Furthermore, this approach yields intermediate attributions that serve as meaningful feature attributions. In particular, $\psi_i$ can be interpreted as the importance of the inputs to the model $h_k \circ \cdots \circ h_i$, where the new explicand and baseline are $f_{i-1}(x^e)$ and $f_{i-1}(x^b)$, respectively. This approach takes inspiration from the chain rule applied specifically to deep networks in [18], but we extend it to more general classes of models. We prove by induction that each intermediate attribution satisfies efficiency.
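A minimal sketch of this propagation for a two-model series, assuming a linear model followed by a sigmoid so that exact linear Shapley values feed the rescale multiplier; the coefficients and inputs are illustrative:

```python
import math

def linear_shapley(beta, x_e, x_b):
    # Exact single-baseline interventional Shapley values for h1(x) = beta·x
    return [b * (xe - xb) for b, xe, xb in zip(beta, x_e, x_b)]

def propagate(phi_h1, psi_next, delta_out):
    # psi_i = phi(h_i) * (psi_{i+1} ⊘ (f_i(x_e) - f_i(x_b))), with 0/0 -> 0
    mult = psi_next / delta_out if delta_out != 0 else 0.0
    return [p * mult for p in phi_h1]

# Series f = h2 ∘ h1 with h1(x) = beta·x and h2(z) = sigmoid(z)
beta, x_e, x_b = [1.0, -2.0], [1.0, 0.25], [0.0, 0.0]
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
z_e = sum(b * v for b, v in zip(beta, x_e))
z_b = sum(b * v for b, v in zip(beta, x_b))
psi_2 = sigmoid(z_e) - sigmoid(z_b)      # attribution for h2's single input
psi_1 = propagate(linear_shapley(beta, x_e, x_b), psi_2, z_e - z_b)
# Efficiency: psi_1 sums to f(x_e) - f(x_b)
assert abs(sum(psi_1) - (sigmoid(z_e) - sigmoid(z_b))) < 1e-9
```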

Efficiency for intermediate attributions
For simplicity of notation, denote $\hat{\phi}_i = \phi\big(h_i,\, f_{i-1}(x^e),\, f_{i-1}(x^b)\big)$.

Assumption: each $\hat{\phi}_i$ satisfies efficiency, so its column sums equal the change in that model's outputs: $1_{1 \times m_i}\, \hat{\phi}_i = \big(f_i(x^e) - f_i(x^b)\big)^{\top}$.

Base case: by our assumption, $1_{1 \times m_k}\, \psi_k = 1_{1 \times m_k}\, \hat{\phi}_k = f_k(x^e) - f_k(x^b)$.

Induction step: assume $1_{1 \times m_{i+1}}\, \psi_{i+1} = f_k(x^e) - f_k(x^b)$ (recall $m_{i+1} = o_i$). Then

$$1_{1 \times m_i}\, \psi_i = \big(1_{1 \times m_i}\, \hat{\phi}_i\big)\Big(\psi_{i+1} \oslash \big(f_i(x^e) - f_i(x^b)\big)\Big) = \big(f_i(x^e) - f_i(x^b)\big)^{\top}\Big(\psi_{i+1} \oslash \big(f_i(x^e) - f_i(x^b)\big)\Big) = 1_{1 \times m_{i+1}}\, \psi_{i+1} = f_k(x^e) - f_k(x^b),$$

where components with zero denominators contribute zero by the Hadamard-division convention.

Conclusion: by the principle of induction, each intermediate attribution satisfies efficiency (eq. 19).
Then, because the interventional Shapley value with a baseline distribution is the average of many single-baseline attributions, it satisfies a related notion of efficiency:

$$\sum_{i=1}^{m} \phi_i(f, x^e) = f(x^e) - \mathbb{E}_{x^b \sim \mathcal{D}}\big[ f(x^b) \big].$$

This can be naturally interpreted as the difference between the explicand's prediction and the expected value of the function across the baseline distribution.
An additional property of the generalized rescale rule is that, although it is an approximation to the interventional Shapley values in the general case, if every model in the composition is linear ($h_i(x) = \beta x$), then this propagation exactly yields the interventional Shapley values (Appendix Section A.5.3).

Connecting DeepLIFT's rules to the Shapley values
Now we can connect the Shapley values to DeepLIFT's Rescale and RevealCancel rules. Both rules aim to satisfy an efficiency axiom (what they call summation to delta) and can be connected to an interventional conditional expectation lift with a single baseline (as in Section 6.3).
In fact, multi-layer perceptrons are a special case where the models in the series are non-linearities applied to linear functions. We first represent deep models as a composition of functions $h_i$. The Rescale and RevealCancel rules canonically apply to a specific class of function, $h_i(x) = f(g(x))$, where $f$ is a non-linear function and $g(x) = \beta x$ is a linear function parameterized by $\beta \in \mathbb{R}^m$. We can interpret both rules as approximations to interventional Shapley values based on the following definition.

Definition 6.1 (k-partition approximation). A k-partition approximation to the Shapley values splits the features in $x \in \mathbb{R}^m$ into $K$ disjoint sets. Then, it exactly computes the Shapley value for each set and propagates it linearly to each component of the set.
The Rescale rule can be described as a 1-partition approximation to the interventional Shapley values for $h_i(x)$, while the RevealCancel rule can be described as a 2-partition approximation that splits features according to whether $\beta_i x_i > t$, with the threshold $t = 0$. This k-partition approximation lets us consider alternative variants of the Rescale and RevealCancel rules that incur exponentially larger costs in terms of $K$, as well as variants with different choices of thresholds.
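A sketch of the k-partition approximation for $h(x) = \sigma(\beta \cdot x)$: treat each feature set as a single player, compute its exact Shapley value, and spread that credit linearly within the set. The ReLU nonlinearity and toy inputs are assumptions for illustration; the 2-partition below happens to match RevealCancel's sign split:

```python
from itertools import permutations

def k_partition_attrib(sigma, beta, x_e, x_b, parts):
    """K-partition approximation for h(x) = sigma(beta·x): each feature set
    in `parts` is one player; its Shapley value is computed exactly, then
    spread linearly by beta_i * (x_e_i - x_b_i) within the set."""
    def z(chosen_parts):
        chosen = {i for part in chosen_parts for i in part}
        return sum(b * (xe if i in chosen else xb)
                   for i, (b, xe, xb) in enumerate(zip(beta, x_e, x_b)))
    K = len(parts)
    phi_set = [0.0] * K
    orders = list(permutations(range(K)))
    for order in orders:
        S = []
        for k in order:
            before = sigma(z(S))
            S.append(parts[k])
            phi_set[k] += sigma(z(S)) - before
    phi_set = [p / len(orders) for p in phi_set]
    phi = [0.0] * len(beta)
    for k, part in enumerate(parts):
        delta = {i: beta[i] * (x_e[i] - x_b[i]) for i in part}
        total = sum(delta.values())
        for i, d in delta.items():
            phi[i] = phi_set[k] * d / total if total != 0 else 0.0
    return phi

relu = lambda v: max(v, 0.0)
beta, x_e, x_b = [1.0, 1.0], [2.0, -1.0], [0.0, 0.0]
phi_rescale = k_partition_attrib(relu, beta, x_e, x_b, [[0, 1]])   # K=1 (Rescale-like)
phi_reveal = k_partition_attrib(relu, beta, x_e, x_b, [[0], [1]])  # K=2 (RevealCancel-like)
# Both satisfy efficiency: they sum to relu(1) - relu(0) = 1
assert abs(sum(phi_rescale) - 1.0) < 1e-9
assert abs(sum(phi_reveal) - 1.0) < 1e-9
```

Note how the two partitions distribute the same total credit differently: the 1-partition gives [2, -1], while the 2-partition gives [1.5, -0.5], reflecting the ReLU's cancellation.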

Explaining groups of input features
Here, we further generalize the Rescale rule to support groupings of features in the input space. Such a method is particularly useful when explaining models with very large numbers of features that are more understandable in higher-level groups. One natural example is gene expression data, where the number of features is often extremely large.
We introduce a group rescale rule that facilitates a higher-level understanding of feature attributions. We can define a set of groups $G_1, \dots, G_o$ whose members are the input features $x_i$. If each group is disjoint and covers the full set of features, then a natural group attribution that satisfies efficiency is the sum:

$$\phi_{G_j} = \sum_{i \in G_j} \phi_i.$$

If the groups are not disjoint or do not cover all input features, then the above attributions do not satisfy efficiency. To address this, we define a residual group $G_R$ that covers all input features not covered by the remaining groups. Then, the new attributions are a rescaled version of eq. 32. We can naturally extend this approach to accommodate non-uniform weighting of group elements, although we do not experiment with this in our paper.
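The disjoint-and-covering case, plus the residual-group construction, can be sketched directly; the toy "gene" attributions and "pathway" groups are hypothetical:

```python
def group_attributions(phi, groups):
    """Sum per-feature attributions into groups; if the groups are disjoint
    and cover every feature, the group attributions inherit efficiency."""
    return [sum(phi[i] for i in g) for g in groups]

def with_residual(phi, groups):
    """Add a residual group G_R holding every feature the named groups miss,
    so the group-level attributions again sum to the feature-level total."""
    covered = {i for g in groups for i in g}
    residual = [i for i in range(len(phi)) if i not in covered]
    return group_attributions(phi, groups + [residual])

phi = [0.5, -0.2, 0.3, 0.1]      # per-gene attributions (toy values)
pathways = [[0, 1], [2]]         # named groups miss feature 3
g = with_residual(phi, pathways)
# Efficiency: group attributions (incl. residual) sum to the feature total
assert abs(sum(g) - sum(phi)) < 1e-9
```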

Ablation Tests
We evaluate our feature attribution methods with ablation tests [21,50]. In particular, we rely on a simple yet intuitive ablation test. For a matrix of explicands $X^e \in \mathbb{R}^{n_e \times m}$, we can get attributions $\phi(f, X^e) \in \mathbb{R}^{n_e \times m}$. The ablation test is defined by three parameters: (1) the feature ordering, (2) an imputation sample $x^b \in \mathbb{R}^m$, and (3) an evaluation metric. The test then replaces features one at a time with the baseline's feature value, in the order given by the feature attributions, to assess the impact on the evaluation metric. We can iteratively define the ablation test based on modified versions of the original explicands:

$$X^{e,k}_{j,i} = \begin{cases} x^b_i & \text{if } G^k_{j,i} = 1 \\ X^e_{j,i} & \text{otherwise} \end{cases}, \quad G^k = \operatorname{argmax}_{k,\,\text{axis}=1}\big(\phi(f, X^e)\big),$$

where $\operatorname{argmax}_{k,\,\text{axis}=1}(G)$ returns an indicator matrix of the same size as $G$, with 1 indicating that the element was among the maximum $k$ elements across a particular axis. Then, the ablation test measures the mean model output (e.g., the predicted log-odds, predicted probability, the loss, etc.) after ablating $k$ features, as the average over the predictions for each ablated explicand:

$$A(k) = \frac{1}{n_e} \sum_{j=1}^{n_e} f\big(X^{e,k}_j\big).$$

Note that for our ablation tests we focus on either the positive or the negative elements of $\phi$, since the expected change in model output is clear only if we ablate by positive or negative attributions alone. Ablation tests are a natural approach to test whether feature attributions are correct for a set of explicands. For feature attributions that explain the predicted log-odds, a natural choice of model output for the ablation test is the mean of the log-odds predictions. Then, as we ablate increasing numbers of features, we expect to see the model's output change. When we ablate the most positive features, the mean model output should decrease substantially; as we ablate additional features, the mean model output should still decrease, but less drastically so. This implies that, for positive ablations, lower curves imply attributions that better describe the model's behavior. In contrast, for negative ablations, higher curves imply better attributions.
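The positive-ablation variant can be sketched as follows; the additive toy model (whose attributions against a zero baseline are simply the feature values) is an assumption for illustration:

```python
def ablation_curve(model, X_e, attribs, x_b, k_max):
    """Positive-ablation test: for each k, replace each explicand's top-k
    most positively attributed features with the baseline value and record
    the mean model output."""
    curve = []
    for k in range(k_max + 1):
        outputs = []
        for x, attr in zip(X_e, attribs):
            order = sorted(range(len(x)), key=lambda i: attr[i], reverse=True)
            x_abl = list(x)
            for i in order[:k]:
                x_abl[i] = x_b[i]
            outputs.append(model(x_abl))
        curve.append(sum(outputs) / len(outputs))
    return curve

# Toy additive model: attribution of feature i w.r.t. a zero baseline is x_i
model = lambda x: sum(x)
X_e = [[1.0, 2.0, 3.0]]
attribs = [[1.0, 2.0, 3.0]]
curve = ablation_curve(model, X_e, attribs, [0.0, 0.0, 0.0], k_max=3)
# Ablating the most positive features first drives the output down fastest
assert curve == [6.0, 3.0, 1.0, 0.0]
```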

Data availability
The NHANES I, NHANES 1999-2014, CIFAR, and MNIST data sets are all publicly available. The HELOC data set can be obtained by accepting the data set usage license (https://community.fico.com/s/explainable-machine-learning-challenge?tabset-3158a=a4c37). Metabric data access is restricted and requires approval through the Sage Bionetworks Synapse website: https://www.synapse.org/#!Synapse:syn1688369 and https://www.synapse.org/#!Synapse:syn1688370. ROSMAP data access is restricted and requires approval through the Sage Bionetworks Synapse website: https://www.synapse.org/#!Synapse:syn3219045, and the data are available as part of the AD Knowledge Portal [51].

The Metabric data set contains transcriptomic information from breast cancer tumors. Although the original analyses used the tumors' transcriptomic information for a variety of purposes, we utilize it purely to predict tumor status.

A.1.5 CIFAR
The CIFAR10 data set consists of 32 × 32 color images with 10 possible classes that are a labeled subset of the 80 million tiny images data set [31]. The mutually exclusive classes are airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The images were collected by colleagues at MIT and NYU, using natural images gathered from a number of search engines.

A.1.6 MNIST
The MNIST database consists of 28 × 28 black-and-white handwritten digits [28]. The digits are size-normalized and centered in a fixed-size image. There are ten possible classes corresponding to the digits 0, …, 9.

A.1.7 HELOC
The Home Equity Line of Credit (HELOC) data set [29] is an anonymized data set of HELOC applications from real homeowners. A HELOC is a line of credit offered by a bank as a percentage of home equity. The outcome is whether the applicant will repay their HELOC account within two years. Financial institutions use predictions of loan repayment to decide whether applicants qualify for a line of credit. The data set was released as part of a FICO xML Challenge (https://community.fico.com/s/explainable-machine-learning-challenge) and can be obtained under appropriate agreement to a data set usage license.

A.2.1 CIFAR Multiple Baseline
Model Hyperparameters: The model explained is a CNN with the following sequence of layers: a convolutional layer with 32 filters of shape 3 by 3 with a ReLU activation, a convolutional layer with 32 filters of shape 3 by 3 with a ReLU activation, a max pooling layer with pools of size 2 by 2, a dropout layer with 0.25 probability, a convolutional layer with 64 filters of shape 3 by 3 with a ReLU activation, a convolutional layer with 64 filters of shape 3 by 3 with a ReLU activation, a max pooling layer with pools of size 2 by 2, a dropout layer with 0.25 probability, a dense layer with 512 nodes with ReLU activation, a dropout layer with probability 0.5, and a dense output layer with softmax activation. RMSprop with a learning rate of 0.0001 and decay of 1 × 10⁻⁶ is used to optimize the network for a categorical cross-entropy loss over 100 epochs with batch sizes of 32. The test accuracy achieved by the model is 75.56%.
Experimental setup: In Figure 2 we explain three explicands with black objects: a plane, a horse, and an ostrich. For the single-baseline attributions we utilize DeepSHAP with a single black image as the baseline. For multiple baselines we utilize DeepSHAP with a baseline distribution of 1000 randomly sampled images from the training data set. The feature attribution plots take the local feature attributions for the softmax output corresponding to the true label. For simple visualization we take the absolute value of the attributions and average across channels to get a grayscale image, which we plot after normalizing the attribution values between zero and one. The pixel distributions are the number of pixels in the grayscale version of the explicand image that fall within ten equally sized grayscale value bins. The attribution distribution is the sum of the attribution mass for the pixels in the original image that correspond to each grayscale value bin.

A.2.2 NHANES Multiple Baseline
Model Hyperparameters: The model we explain is an MLP with four hidden layers of 100 nodes each. The hidden layers have ReLU activation functions with dropout layers in between. The final output node has a sigmoid activation and is trained to minimize binary cross-entropy loss for mortality classification. RMSprop with a learning rate of 0.001 is used to optimize the network over 50 epochs with batch sizes of 128. The test ROC AUC achieved by the model is 0.872.
Experimental setup: In Figure 3a, the explicand is a randomly chosen older male individual from the NHANES data set. In the top force plot we show DeepSHAP attributions for the explicand with a baseline distribution of 1000 randomly chosen samples from the training set. In the bottom force plot we show DeepSHAP attributions for the same explicand with a baseline distribution of 1000 randomly chosen samples from older (>60 years old) males in the training set. In Figure 3b, the clusters are obtained by k-means clustering (k = 8) of the training data using only two features: age and sex. In Figure 3c, the explicands are the older male cluster in the training data (n = 1137). We show two summary plots, where the top uses DeepSHAP attributions with a baseline distribution of 1000 randomly chosen samples from the training set and the bottom uses the older male cluster as the baseline distribution. In Figure 3d, we perform an ablation test that ablates all explicands in the older male cluster according to either their most positive or most negative local feature attributions. When ablating, we impute by the mean feature value in the older male cluster. Then we evaluate the model's prediction across all of the explicands after ablating features one at a time.

A.2.3 Gene set explanations
Model hyperparameters: We train two XGBoost classifiers (an implementation of gradient boosting trees) to predict our binary phenotypes (Alzheimer's and breast cancer tumor stage) based on transcriptomic data. The classifiers are trained with a learning rate of 0.3, a max tree depth of 6, and automatic heuristic tree construction. We train with a validation set and 10 early-stopping rounds. For Alzheimer's classification we achieve a test ROC AUC of 0.959, and for breast cancer tumor stage classification we achieve a test ROC AUC of 0.932.
Experimental setup: The feature attributions for the tree model are obtained using Interventional Tree Explainer [21]. These attributions correspond to the importance of each gene to the log-odds of the output phenotypes. In order to explain these attributions in terms of groups, we utilize our group rescale rule to propagate the gene attributions to pathway attributions. For Alzheimer's we fix the baseline distribution to be the training data set; for breast cancer, which has more samples, we fix a baseline distribution of 100 random samples from the training set.

A.2.4 NHANES Loss explanations
Model hyperparameters: We train an XGBoost classifier (an implementation of gradient boosting trees) to predict mortality based on epidemiological features. The classifier is trained with a learning rate of 0.3, a max tree depth of 6, and automatic heuristic tree construction. We train with a validation set and 10 early-stopping rounds. For the weight-shifted test set we achieve an ROC AUC of 0.860, and for the non-shifted test set we achieve an ROC AUC of 0.868.
Experimental setup: In Figure 5b, we generate output feature attributions using Interventional Tree Explainer and use DeepSHAP's generalized rescale rule to explain the loss in addition to the output. The loss and output attributions are explained with respect to the same baseline distribution of 1000 random samples from the training set. The loss attributions for positively labelled explicands and negatively labelled explicands are very different, leading us to plot them as separate dependence plots. In Figure 5c, we generate output and loss feature attributions as before. For the ablation, we perform a simplified univariate ablation where we impute blood lead to the mean of the baseline distribution for samples selected based on the largest loss attributions. In Figure 5d, we perform an ablation test that ablates 1000 explicands from the training set according to either their loss or output local feature attributions. When ablating, we impute by the mean value of a given feature in the explicands. Then we evaluate the model's prediction across all of the explicands after ablating features one at a time.

A.2.5 MNIST Feature Extraction
Model hyperparameters: We train a CNN model to classify all digits in MNIST. The CNN model consists of a convolutional layer with 32 filters of size 3 by 3 with ReLU activation, a max pooling layer with pools of size 2 by 2, a convolutional layer with 64 filters of size 3 by 3 with ReLU activation, a max pooling layer with pools of size 2 by 2, a dense layer with 100 nodes and ReLU activation, and a dense output layer with 10 nodes and softmax activation. We utilize categorical cross-entropy loss and an Adam optimizer with learning rate 0.001, and train for 10 epochs. Then, in order to utilize the model to extract higher-level features from raw MNIST images, we remove the final output layer. The XGBoost model we train to predict zeros

A.4.2 Probability vs. log-odds explanations
In Figure 10 we illustrate the difference between explanations in log-odds versus probability space, using attributions obtained by rescaling the log-odds explanations provided by TreeSHAP.
Figure 10: The summary plot for the log-odds model output differs from the summary plot for the probability output in terms of the ordering of features. This is to be expected because of the non-linear mapping between log-odds and probability. Oftentimes it can be useful to communicate scientific findings in terms of the probability output of the model, although the log-odds output is also natural because it is the output margin.
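One simple way to map log-odds attributions into probability space is a single rescale multiplier; whether this matches the exact rescaling used for Figure 10 is an assumption, but it preserves efficiency:

```python
import math

def logodds_attrib_to_prob(phi_logodds, z_e, z_b):
    """Rescale log-odds attributions into probability space with a single
    multiplier (sigmoid(z_e) - sigmoid(z_b)) / (z_e - z_b). An assumed
    reading of "rescaling the log-odds explanations", not necessarily the
    exact procedure behind Figure 10."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    mult = (sig(z_e) - sig(z_b)) / (z_e - z_b) if z_e != z_b else 0.0
    return [p * mult for p in phi_logodds]

sig = lambda z: 1.0 / (1.0 + math.exp(-z))
# Log-odds attributions summing to z_e - z_b = 2.0 (efficiency in log-odds)
phi = logodds_attrib_to_prob([1.5, 0.5], z_e=2.0, z_b=0.0)
# Efficiency now holds in probability space instead
assert abs(sum(phi) - (sig(2.0) - sig(0.0))) < 1e-9
```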

A.4.3 Additional gene sets
We present attributions aggregated by the Reactome canonical pathway gene set and the Biological Process gene ontology gene set in Figure 11.

A.4.5 Stacked generalization
We compare five bagged MLP base-learners (feature attributions in Figures 14-18) and three meta-learners (average voting, logistic regression, and gradient boosting trees) that use the base-learners' predictions as features for NHANES (1999-2014) mortality prediction, with performance shown in Figure 13a. We see that average voting outperforms any individual MLP and is improved upon by a non-uniform weighting scheme (logistic regression). Finally, stacked generalization with a gradient boosted tree meta-model outperforms both linear approaches.
Since our framework enables attributions that satisfy efficiency at each layer, we obtain the importance each meta-model assigns to each base-learner (Figure 13b), which is much harder to do with model-agnostic methods because it would require separately estimating the importance at each layer. Although the average voting scheme assigns equal importance to each base model, each MLP's predictions differ, leading to the different shapes in the summary plots. In contrast, the logistic regression model downweights MLP0 and MLP3 and primarily relies on MLP2 and MLP4, which achieved the highest performance. The gradient boosting tree model uses the base-learners in a non-linear fashion; for MLP4, high predictions actually decrease the overall prediction of the meta-learner. These meta-level explanations reveal novel insights that explanations in the original feature space would not. Finally, we can also propagate the meta-level explanations back to the original input space and verify that most models give similarly reasonable feature attributions in Figure 7c.

A.5.1 Shapley value axioms
The Shapley values satisfy a number of desirable properties in terms of the set function $v$. They are uniquely defined by three axioms:

• Efficiency: The sum of the Shapley values for all players equals the value of the game with the set of all players (the grand coalition): $\sum_{i \in M} \phi_i(v) = v(M) - v(\varnothing)$.

• Monotonicity: If a player $i$ always increases game $v_1$'s value at least as much as game $v_2$'s for all possible sets of remaining players, then $i$'s attribution for $v_1$ should be greater than or equal to their attribution for $v_2$: if $v_1(S \cup \{i\}) - v_1(S) \ge v_2(S \cup \{i\}) - v_2(S)$ for all $S \subseteq M \setminus \{i\}$, then $\phi_i(v_1) \ge \phi_i(v_2)$.

• Missingness: Players $i$ that never change the game's value must have no attribution: if $v(S \cup \{i\}) = v(S)$ for all $S \subseteq M \setminus \{i\}$, then $\phi_i(v) = 0$.

While the above three axioms determine the Shapley values as a unique solution concept for credit allocation, the Shapley values have a number of additional desirable properties:

• Symmetry: If two players have the same marginal impact for all subsets, then they should have the same Shapley value: if $v(S \cup \{i\}) = v(S \cup \{j\})$ for all $S \subseteq M \setminus \{i, j\}$, then $\phi_i(v) = \phi_j(v)$.

• Linearity: The Shapley values for a linear combination of games equal the linear combination of the Shapley values for each game: $\phi_i(a\, v_1 + b\, v_2) = a\, \phi_i(v_1) + b\, \phi_i(v_2)$.
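These axioms can be verified numerically on small games by brute force; the two games and coefficients below are illustrative:

```python
from itertools import permutations

def shapley(v, players):
    """Exact Shapley values via all orderings (brute force, small games only)."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        S = set()
        for p in order:
            before = v(frozenset(S))
            S.add(p)
            phi[p] += v(frozenset(S)) - before
    return {p: s / len(orders) for p, s in phi.items()}

M = [1, 2, 3]
v1 = lambda S: float(len(S) ** 2)
v2 = lambda S: 3.0 if 1 in S else 0.0      # players 2 and 3 never matter
a, b = 2.0, -1.0
p1, p2 = shapley(v1, M), shapley(v2, M)
pc = shapley(lambda S: a * v1(S) + b * v2(S), M)
for p in M:
    # Linearity: attributions for a linear combination of games combine linearly
    assert abs(pc[p] - (a * p1[p] + b * p2[p])) < 1e-9
# Missingness: players that never change v2's value get zero credit
assert p2[2] == 0.0 and p2[3] == 0.0
# Efficiency: attributions sum to v(M) - v(empty set)
assert abs(sum(pc.values()) - (a * 9.0 + b * 3.0)) < 1e-9
```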

A.5.2 Baseline distribution proof for interventional Shapley values
Proof. Define $\mathcal{D}$ to be the data distribution, $N$ to be the set of all features, and $f$ to be the model being explained. Additionally, define $\mathcal{X}(x, x', S)$ to return a sample where the features in $S$ are taken from $x$ and the remaining features from $x'$. Define $C$ to be all combinations of the set $N \setminus \{i\}$ and $P$ to be all permutations of $N$.

Figure 1 :
Figure 1: DeepSHAP estimates Shapley value feature attributions to explain a series of models using a baseline distribution. (a) Local feature attributions with DeepSHAP require explicands (samples being explained), a baseline distribution (samples being compared to), and a model that is comprised of a series of models. They can be visualized to understand model behavior (Appendix Section A.3). (b) Theoretical motivation behind DeepSHAP (Methods Sections 6.1 and 6.4). (c) The baseline distribution is an important, but often overlooked, parameter that changes the scientific question implicit in the local feature attributions we obtain. (d) Explaining a series of models enables us to explain groups of features, model loss, and complex pipelines of models (deep feature extraction and stacked generalization). Experimental setups are described in Appendix Section A.2.

Figure 2 :
Figure 2: Using a single all-black baseline image (DeepLIFT) leads to biased attributions compared to attributions with a randomly sampled baseline distribution (DeepSHAP). The image is the explicand. The attribution plots are the sum of the absolute value of the feature attributions for the three channels of the input image. The pixel distribution is the distribution of pixels in terms of their grayscale values. The attribution distribution is the amount of attribution mass on a group of pixels binned by their grayscale values.

Figure 3 :
Figure 3: The baseline distribution is an important parameter for model explanation. (a) Explaining an older male explicand with both a general-population baseline distribution and an older male baseline distribution. (b) Automatically finding baseline distributions using 8-means clustering on age and sex. (c) Explaining the older male subpopulation (62-75 years old) with either a general-population baseline or an older male baseline. (d) Quantitative evaluation of the feature attributions via positive and negative ablation tests where we mask with the mean of the older male subpopulation. Note that (b) shows summary plots (Appendix Section A.3.3) and (c) shows dependence plots (Appendix Section A.3.2).

Figure 5 :
Figure 5: Explanations of the model's loss rather than the model's prediction yield new insights. (a) We train on the first five cycles of NHANES (1999-2008) and test on the last three cycles (2009-2014). (b) We identify a simulated covariate shift in cycles 7-8 (2011-2014) by examining loss attributions. (c) Under a natural covariate shift, we identify and quantitatively validate test samples for which blood lead greatly increases the loss in comparison to training samples. (d) We ablate output attributions (DeepSHAP) and loss attributions (DeepSHAP, IME, KernelSHAP, and LIME) to show their respective impacts on model loss. We compare only to model-agnostic methods for loss attributions because explaining model loss requires explaining a series of models. Note that (b) and (c) show dependence plots (Appendix Section A.3.2).

Figure 6 :
Figure 6: Explaining a series of models comprised of a convolutional neural network feature extractor and a gradient boosted tree classifier. (a) Explanations from DeepSHAP and state-of-the-art model-agnostic approaches. Each model-agnostic approach has a "number of samples" parameter, which we set to 100,000. (b) Quantitative evaluation of the approaches, including runtime and ablation of the top 10% of positive and negative features.
utilize k-means clustering on a reduced version of the training data ($\tilde{X}^{train}$), comprised of $\tilde{x}^j = [x^j_i\ \forall i \in M_r]$ with a reduced set of features ($M_r$). The output of the k-means clustering is a set of clusters $C_1, \dots, C_k$ with means $\mu_1, \dots, \mu_k$ that minimize the following objective on the reduced training data:

$$\sum_{j} \min_{l \in \{1, \dots, k\}} \big\| \tilde{x}^j - \mu_l \big\|^2.$$

Theorem 1 .
As one might expect, each intermediate attribution $\psi_i$ satisfies efficiency: each attribution $\psi_i \in \mathbb{R}^{m_i}$, $\forall i \in \{1, \dots, k\}$, satisfies efficiency and sums to $f_k(x^e) - f_k(x^b)$:

$$1_{1 \times m_i}\, \psi_i = f_k(x^e) - f_k(x^b).$$

Figure 9 :
Figure 9: Demonstrating bias of a single baseline for integrated/expected gradients.

Figure 11 :
Figure 11: Additional gene set attributions for the Reactome canonical pathway gene set and the Biological Process gene ontology gene set, analogous to the attributions in Figure 4.

Figure 12 :
Figure 12: Investigating the utility of CNN feature extraction in MNIST. We compare a CNN on the raw digits, a GBT model trained on the raw digits, and a GBT model trained to classify digits on the basis of the features extracted by the trained CNN. We vary the number of estimators in XGBoost to investigate how well underparameterized trees classify digits with different features. Overall, GBT models using the features extracted from the CNN yield much higher accuracy for trees with the same number of estimators.

Figure 13 :
Figure 13: Explaining stacked generalization by looking at meta-level and raw feature explanations. (a) The test-set performance of the five MLP models and three meta-models that make predictions based on the MLP models' predictions. (b) Intermediary explanations for the meta-models that assign credit based on which MLP was important to the meta-model's prediction. (c) Raw feature explanations obtained by propagating the credit for each meta-model in (b) to the original feature space. (b) and (c) show summary plots (Appendix Section A.3.3).