Abstract
Post-hoc interpretability methods are critical tools to explain neural-network results. Several post-hoc methods have emerged in recent years but they produce different results when applied to a given task, raising the question of which method is the most suitable to provide accurate post-hoc interpretability. To understand the performance of each method, quantitative evaluation of interpretability methods is essential; however, currently available frameworks have several drawbacks that hinder the adoption of post-hoc interpretability methods, especially in high-risk sectors. In this work we propose a framework with quantitative metrics to assess the performance of existing post-hoc interpretability methods, particularly in time-series classification. We show that several drawbacks identified in the literature are addressed, namely, the dependence on human judgement, retraining and the shift in the data distribution when occluding samples. We also design a synthetic dataset with known discriminative features and tunable complexity. The proposed methodology and quantitative metrics can be used to understand the reliability of the results that interpretability methods produce in practical applications. In turn, they can be embedded within operational workflows in critical fields that require accurate interpretability results, for example, for regulatory policies.
Main
Time series (sequences of indexed data that follow a specific time order) are ubiquitous. They can describe physical systems^{1} such as the state of the atmosphere and its evolution, social and economic systems^{2} such as the financial market, and biological systems^{3} such as the heart and the brain via electrocardiogram (ECG) and electroencephalogram signals, respectively. The availability of this type of data is increasing, and so is the need for automated analysis tools that are capable of extracting interpretable and actionable knowledge from them. To this end, although established and more interpretable time-series approaches remain competitive for many tasks^{4,5,6}, artificial intelligence (AI) technologies and neural networks in particular are opening the path towards highly accurate predictive tools for an increasing number of time-series regression^{7,8,9} and classification^{10,11} learning tasks. Yet the adoption of AI technologies as black-box tools is problematic in several applied contexts. To address this issue, numerous interpretability methods have been proposed in the literature, especially in the context of neural networks. These different methods usually produce tangibly different results, preventing practitioners from fully unlocking the interpretability of the results, which is increasingly needed. Figure 1 shows four different post-hoc interpretability methods applied to time-series classification, in which the neural network is tasked with identifying the pathology associated with a patient’s ECG. The four interpretability methods produce remarkably different results for the same model. Hence the question: which method produced an interpretability map closer to the one actually adopted by the neural network to make its prediction? In this paper we answer this question quantitatively while addressing the issues found in the existing literature on the evaluation of interpretability methods.
Aside from research purposes, understanding the accuracy of interpretability methods is de facto mandatory in critical sectors (such as healthcare) for legal and ethical reasons^{12}. Failing to understand the performance of interpretability methods may prevent their adoption and, in turn, lead practitioners to avoid using neural-network tools altogether, in favour of more white-box and interpretable tools.
Different definitions of what it means for a neural-network model to be interpretable have been formulated. Most of these definitions can be summarized under two categories: transparency and post-hoc interpretability^{13}. Transparency refers to how a model and its individual constituents work, whereas post-hoc interpretability refers to how a trained model makes predictions and uses the input features it is given. In this work we consider post-hoc interpretability applied to time-series classification, as it is seen as key to meeting recent regulatory requirements^{12} and translating current research efforts into real-world applications, especially in high-risk areas such as healthcare^{14}. Post-hoc interpretability methods assign a relevance to each feature of a sample, reflecting its importance to the model for the classification task being performed. The ability to express the specific features used by a neural network to classify a given sample can help humans assess the reliability of the classification produced and allows one to compare the model’s predictions with existing knowledge. It also provides a way to understand possible model biases that could lead to the model’s failure in a real-world setting.
A range of methods to provide post-hoc interpretability of classification results have been developed in the past few years. These are mainly focused on natural language processing and image classification tasks. With the more recent growing interest in neural-network interpretability, leading actors in the machine learning community built a range of post-hoc interpretability methods. As part of this effort, Facebook recently released the Captum library to group a large number of interpretability methods under a single developmental framework^{15}. Although these initiatives allow researchers to more easily use the different methods, they do not provide a systematic and comprehensive evaluation of those methods on data with different characteristics and across neural-network architectures. A systematic methodology that provides an accurate evaluation of these methods is of paramount importance to allow their wider adoption and to measure how trustworthy their results are.
The evaluation of interpretability methods was initially based on a heuristic approach in which the relevance attributed to the different features was compared with the expectation of an observer for common image classification tasks^{16}, or of a domain expert for more complex tasks^{17,18}. However, these works shared a common pitfall: they assumed the representation of a task learned by a neural network should use the same features as a human expert. The community later moved towards the idea that the evaluation should be independent of human judgement^{19}. This paradigm shift was supported by the evidence that certain saliency methods—while looking attractive to human experts—produced results independent of the model they aimed to explain, thereby failing the interpretability task^{20}. More recent evaluations were performed by occluding (also referred to as corrupting) the most relevant features identified and comparing the drop in score observed between model predictions on the initial and modified samples^{21}. This evaluation method was later questioned, as corrupting the images changes the distribution of the values of the sample and therefore the observed drop in score might be caused by this shift in distribution rather than actual information being removed^{22}. An approach named ROAR was proposed to address this issue^{22}, in which important pixels are removed in both the training and testing sets. The model is then retrained on the corrupted (that is, occluded) samples, with the drop in score being retained on this newly trained model. This method has the benefit of maintaining a similar distribution across the training and evaluation sets with the modified samples. Yet we argue that it does not necessarily explain which features the initial network used to make its prediction as the similarity between neural network models is only maintained if the models are trained on datasets sampled from the same distribution^{23}. 
In their case, the distribution is changed as the model is retrained on a corrupted dataset and therefore the post-hoc interpretability of the retrained model is not constrained to being similar to that of the initial model. The post-hoc interpretability instead highlights the properties of the dataset with regard to its target, such as the redundancy of the information present in the features that are indicative of a given class, a limitation that was acknowledged by the authors^{21}.
Neural-network interpretability for time-series data was only recently explored. Initial efforts applied some of the interpretability methods introduced for natural language processing and image classification on univariate time series, and evaluated the drop in score obtained by corrupting the most relevant parts (also referred to as time steps) of the signal^{24}. An evaluation of some interpretability methods was recently proposed^{25}, with a dataset designed to address the issue of retaining equal distribution between the initial and occluded datasets; however, this work may have two crucial drawbacks: the proposed dataset contains static discriminative properties (for example, the mean of the sample) and it is not independent of human judgement. The former issue can lead the model to learn from static properties and thus the dataset might not reflect the complexity of real-world time-series classification tasks, in which time dependencies usually play the discriminative role. The latter is related to the assumption that the model uses:
(1) all the discriminative information synthetically provided (comprising a static shift applied to a portion of the time series), and
(2) no information outside of it.
We argue that this assumption does not necessarily hold as the model might require just a subset of the discriminative information provided and might use information from outside of the discriminative portion.
In this work we propose an approach for the model-agnostic evaluation of interpretability methods for time-series classification that addresses the various issues just highlighted. The approach consists of two new metrics, namely \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\). The first is the area under the top curve, and aims to measure how well the highest relevance values capture the most important time steps for the neural network, whereas the second (the modified F1 score) is a harmonic mean reflecting the capability of the different interpretability methods to capture both the most and least important time steps. These two metrics evaluate how interpretability methods order time steps according to their importance, referred to as relevance identification. In this paper we also aim to qualitatively evaluate the capacity of the different interpretability methods to reflect the importance of each time step relative to the others. The latter evaluation is referred to as relevance attribution. We note that a key aspect of this work is the training of the models with a random level of perturbation for each batch, in a similar fashion to widely used data-augmentation methods^{26}. This perturbation is later used to corrupt the signal when evaluating interpretability methods such that the distribution is maintained across the training and perturbed datasets used for the evaluation. This addresses one of the main concerns found in the literature (that is, the shift in distribution when occluding samples in the evaluation set) and does not require retraining through the ROAR approach.
The six interpretability methods we considered are: (1) DeepLift^{27}, (2) GradShap^{17}, (3) Integrated Gradients^{28}, (4) KernelShap^{17}, (5) DeepLiftShap^{17} and (6) Shapley Value Sampling (also referred to as Shapley sampling or simply Shapley)^{29}. These were chosen to capture a broad range of available interpretability methods while keeping the problem computationally tractable for all of the models presented. These interpretability methods are applied to three neural-network architectures; namely, convolutional (CNN), bidirectional long short-term memory (BiLSTM) and Transformer neural networks. The evaluation of the interpretability methods for time-series classification is performed on a new synthetic dataset as well as on two datasets adopted in practical applications. The overall code framework is part of the InterpretTime library, freely available on GitHub (https://github.com/hturbe/InterpretTime).
In summary, the approach proposed and the new synthetic dataset we outlined address the following points:

1. The need for a robust and quantifiable approach to evaluate and rank the performance of interpretability methods over different neural-network architectures trained for the classification of time series. Our approach addresses the issues found in the literature by providing novel quantitative metrics for the evaluation of interpretability methods independent of human judgement^{19}, using an occluded dataset^{21} and without retraining the model^{22}.

2. The lack of a synthetic dataset with tunable complexity that can be used to assess the performance of interpretability methods, and that is able to reproduce time-series classification tasks of arbitrary complexity. We note that our synthetic dataset differs from ref. ^{25} as the neural network must learn the time dependencies in the data. Furthermore, the dataset encodes a priori knowledge of the discriminative features, analogous to the BlockMNIST synthetic dataset^{30}. Finally, the classification task is multivariate by design, as the neural network must learn at least two features to predict the correct class. This is a desirable property as real-world datasets are commonly multivariate.
We first present an evaluation of six interpretability methods using the proposed framework across different datasets and model architectures. These results are then discussed to highlight the main trends as well as the potential for the developed metrics to build trust in post-hoc interpretability methods. In Methods we outline the new framework to evaluate interpretability methods for time-series classification, including the novel method used to maintain a constant distribution between the training and evaluation sets, the new metrics and the synthetic dataset.
Results
All of the metrics presented in this section are built on the relevance, denoted by R, that an interpretability method provides along the time series (a more detailed explanation for R is provided in Table 1 and Methods). An example for the ECG time series is depicted in Fig. 1, in which the contour maps represent R and the black lines represent the actual time series the neural network is using to make the prediction. The higher the relevance, the more important the portion of the time series associated with it is for the neural-network classification task. The metrics are evaluated on three different datasets (synthetic, ECG and FordA) and three different architectures (BiLSTM, CNN and Transformer). The hyperparameters and classification metrics for the different models are presented in Supplementary Sections 2 and 3.
We next focus on evaluating the effectiveness of an interpretability method in ordering time steps according to their importance in explaining the neural network’s predictions. This crucial aspect of the evaluation of interpretability methods is also referred to as relevance identification; it is measured by \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\), which are described in Table 1 and Methods.
The ordering of the time steps obtained using the relevance is used to corrupt the top-k and bottom-k elements with positive relevance. Here, k refers to the percentage of time steps in the time series that are corrupted with respect to the total number of time steps with positive relevance. Top-k elements refers to a corruption strategy that corrupts time steps starting with higher relevance and descending to lower relevance. Similarly, bottom-k elements refers to a corruption strategy starting from time steps with lower relevance and ascending to higher relevance. We note that k is only used for calculating the number of elements to corrupt; however, the evaluation of the interpretability methods is performed with respect to the total number of elements in the sample, denoted by \(\tilde{N}\). This was done such that the evaluation of different interpretability methods is independent of the number of time steps assigned with positive relevance, and instead is based on the total number of time steps. Figure 2 shows \(\tilde{S}\), that is, the normalized change in score (see equation (5) in Methods) for a Transformer trained on the newly created synthetic dataset. Results for the BiLSTM and CNN architectures are presented in Extended Data Figs. 1 and 2. These \(\tilde{S}\) versus \(\tilde{N}\) curves constitute the basis for computing \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\). The figure contains all six interpretability methods considered in this work, and a baseline (depicted in black). The baseline illustrates the \(\tilde{S}\) versus \(\tilde{N}\) curve for a random assignment of the relevance. Similar figures for the ECG dataset are presented in Extended Data Figs. 3, 4 and 5, whereas the results obtained on the FordA dataset are in Supplementary Section 1.1. Both \(\tilde{S}\) and \(\tilde{N}\) are detailed in Table 1 and Methods.
As mentioned above, the points \(\tilde{N}\) are removed in two ways: from the most important to the least important points identified by the interpretability method (top-k strategy), resulting in the top curve, and from the least important to the most important points (bottom-k strategy), resulting in the bottom curve in Fig. 2.
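The two corruption orderings described above can be sketched in a few lines. This is a minimal illustration and not the InterpretTime implementation: the function names `indices_to_corrupt` and `corrupted_fraction` are hypothetical, the relevance is assumed to be flattened into a single list, and rounding the number of corrupted steps up with `ceil` is an assumption.

```python
import math


def indices_to_corrupt(relevance, k, strategy="top"):
    """Return the indices of the time steps to corrupt.

    relevance: flat list of relevance values, one per time step.
    k: fraction in [0, 1] of the positively relevant time steps to corrupt.
    strategy: "top" starts from the highest relevance and descends,
              "bottom" starts from the lowest positive relevance and ascends.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    if strategy not in ("top", "bottom"):
        raise ValueError("strategy must be 'top' or 'bottom'")
    # Only time steps with positive relevance are candidates for corruption.
    positive = [i for i, r in enumerate(relevance) if r > 0]
    ordered = sorted(positive, key=lambda i: relevance[i],
                     reverse=(strategy == "top"))
    # Assumption: round the number of corrupted steps up.
    n_corrupt = math.ceil(k * len(positive))
    return ordered[:n_corrupt]


def corrupted_fraction(relevance, k):
    """Corrupted steps as a fraction of ALL time steps in the sample."""
    n_positive = sum(1 for r in relevance if r > 0)
    return math.ceil(k * n_positive) / len(relevance)
```

Note that `k` only sets how many positively relevant steps are corrupted, whereas `corrupted_fraction` expresses the same count relative to the full sample length, mirroring the distinction between k and \(\tilde{N}\) in the text.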
The higher the value of \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\), the better the interpretability method has identified which points were the most important for the model to assign the correct class. The smaller the area under the bottom curve, the better the interpretability method has identified which points were least important for the model to assign the correct class. A good trade-off between the two therefore shows that the interpretability method has identified both the most and least important points. The \({\mathrm{F}}1\tilde{S}\) metric represents the harmonic mean between the capacity to extract the most and least relevant time steps. A higher score, as with the \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) metric, represents a better relevance identification performance.
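To make the relationship between the two curves and the two metrics concrete, the sketch below integrates a curve with the trapezoidal rule and combines the top and bottom areas with a harmonic mean. The exact definitions are given by the equations in Methods, which are not reproduced here; the function names, the use of the trapezoidal rule and the specific harmonic-mean form (combining the top area with one minus the bottom area, both assumed to lie in [0, 1]) are assumptions made for illustration only.

```python
def area_under_curve(n_tilde, s_tilde):
    """Trapezoidal area under a curve of S-tilde versus N-tilde.

    n_tilde: increasing fractions of corrupted time steps.
    s_tilde: normalized change in score at each of those fractions.
    """
    if len(n_tilde) != len(s_tilde) or len(n_tilde) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    area = 0.0
    for j in range(1, len(n_tilde)):
        width = n_tilde[j] - n_tilde[j - 1]
        area += 0.5 * (s_tilde[j] + s_tilde[j - 1]) * width
    return area


def harmonic_f1(auc_top, auc_bottom):
    """Assumed form: harmonic mean of auc_top and (1 - auc_bottom).

    A large top area and a small bottom area both push the score up.
    """
    keep = 1.0 - auc_bottom
    if auc_top + keep == 0.0:
        return 0.0
    return 2.0 * auc_top * keep / (auc_top + keep)
```

Under this assumed form, a method scores highly only if corrupting its top-ranked steps changes the score quickly and corrupting its bottom-ranked steps changes it slowly.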
Table 2 shows \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\) for all of the datasets, interpretability methods and neural-network architectures considered. Furthermore, the observed drop in accuracy for samples being progressively corrupted is presented in Extended Data Figs. 6 and 7 for the synthetic and ECG datasets, respectively, whereas the results for the FordA dataset are presented in Supplementary Section 1.2.
The relevance identification evaluated through the two metrics presented above focuses on assessing how the relevance produced by interpretability methods allows ordering of time steps to extract the most (or least) relevant time steps for a model. Another important aspect of interpretability methods is their capacity to estimate the relative effect of a given time step on the final prediction. We call this aspect relevance attribution. Building on properties of the interpretability methods included in this work, the relevance attribution of interpretability methods is evaluated qualitatively using curves of the adjusted normalized change in score \({\tilde{S}}_{\mathrm{A}}\) (defined in equation (11) in Methods) versus the time-series information content (TIC) index, the latter of which measures the proportion of positive relevance contained in the corrupted portions of the time series. Figure 3 shows \({\tilde{S}}_{\mathrm{A}}\) as a function of the TIC index measured on the ECG dataset, and allows a qualitative evaluation of the relevance attribution performance of interpretability methods. If a curve is above the theoretical unit linear slope (depicted as dashed black lines), the interpretability method underestimates the influence of the corrupted time steps with regard to their effect on the model’s prediction. The opposite is true if the curve stands below the unit slope. The evaluation of the relevance attribution can therefore be seen as a measure of how well calibrated an interpretability method is in terms of the relevance it assigns to the different time steps with respect to their importance for the model to make its predictions. Similar figures for the synthetic and FordA datasets are presented in Extended Data Fig. 8 and Supplementary Section 1.3, respectively.
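The TIC index as described in words above (the proportion of positive relevance contained in the corrupted portions) can be sketched as follows. The function name `tic_index` and the flat-list representation of the relevance are assumptions; the formal definition is the one given in Methods.

```python
def tic_index(relevance, corrupted_indices):
    """Share of the total positive relevance carried by the corrupted steps.

    relevance: flat list of relevance values, one per time step.
    corrupted_indices: indices of the time steps that were corrupted.
    """
    positive_total = sum(r for r in relevance if r > 0)
    if positive_total == 0:
        raise ValueError("sample has no positive relevance")
    corrupted = set(corrupted_indices)
    # Negative or zero relevance does not contribute to the index.
    removed = sum(r for i, r in enumerate(relevance)
                  if i in corrupted and r > 0)
    return removed / positive_total
```

Plotting the adjusted change in score against this index gives the calibration reading described in the text: points above the unit slope indicate that the corrupted steps mattered more to the model than their share of relevance suggested.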
Discussion
This paper presents a new evaluation method and a set of evaluation metrics for post-hoc interpretability to answer the question posed in the introduction: which method produced an interpretability map closer to the one actually adopted by the neural network to make its prediction? The two new metrics, \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\), allow quantification of the relevance identification performance of an interpretability method and can be used to, for example, rank interpretability methods. These two metrics agree in identifying Shapley as the best performing method (see Table 2).
Focusing on the \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) values presented in Table 2, Shapley consistently outperforms the other interpretability methods across the different datasets and architectures (except for the CNN trained on the ECG dataset). The \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) metric reflects the capacity of Shapley to extract the most important time steps for a model prediction. Shapley is, however, the most computationally intensive interpretability method of the ones tested in this paper. It is therefore convenient to look for alternatives, which depend on the type of architecture selected. Integrated Gradients is the second best interpretability method for BiLSTM networks, whereas DeepLiftShap is ranked second for CNN. The results are slightly less clear for Transformer, where GradShap and Integrated Gradients have very similar performances.
In addition to \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\), the \({\mathrm{F}}1\tilde{S}\) metric measures the ability of different interpretability methods not only to select the most important time steps but also the least important ones. The rankings produced using the two metrics are consistent with one another for both Transformer and BiLSTM, while favouring DeepLiftShap for CNN.
We addressed the issues identified in the literature to obtain reliable results for \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\). In particular, we evaluated the interpretability methods avoiding human judgement, and did not retrain the model, while also avoiding a distribution shift between the training and occluded sets used to evaluate the interpretability methods. The distribution shift is one of the main concerns found in the interpretability literature. The method proposed in this paper (described in detail in Methods) addresses this issue and thus the drop in score observed in Fig. 2 as the samples are progressively corrupted cannot be attributed to a distribution shift. We also note that a larger drop in score is systematically observed when corrupting the most relevant time steps (identified by the interpretability method) as compared with corrupting a random selection of time steps (black lines in Fig. 2), as expected. The approach presented in this paper allows quantitative evaluation of interpretability methods without retraining, while avoiding a distribution shift between the training and evaluation sets.
The two metrics, \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\), along with the majority of the literature on interpretability methods evaluation, focus on relevance identification (that is, ranking time steps according to their importance). In this work we also make a first step towards evaluating relevance attribution. This evaluates how the relevance reflects the relative importance of each time step compared with the others. The attribution is qualitatively evaluated using the \({\tilde{S}}_{\mathrm{A}}\) versus TIC curves (Fig. 3). These curves provide an understanding of the ability of an interpretability method to correctly weigh relevance and are compared with a newly derived theoretical estimation (derived in Supplementary Section 4). The relevance attribution performance consistently varies between the different neural networks tested and it also changes between datasets. The common denominator is the inability of the interpretability methods to follow the theoretical estimate. This indicates that the relevance attributed to each time step does not reflect the relative importance of this time step in the classification task. The attributed relevance instead acts more as a ranking of the most important time steps among themselves. For example, a point with a relevance of 0.1 for a total classification score of 1 might not necessarily account for 10% of the final prediction, but will be more important than a point with a relevance equal to 0.05. Albeit qualitative, these curves may be used to visually assess whether an interpretability method provides a balanced relevance.
As part of this work, we also provide a new synthetic dataset that can be used to evaluate interpretability methods (a sample of which can be found in Fig. 4). The new dataset forces the neural network to learn time dependencies as opposed to static information, and the discriminative portions of the time series are known a priori. Furthermore, the dataset is multivariate by construction, which is a desirable property especially when trying to mimic real-world (that is, non-synthetic) datasets. The performance of the interpretability methods on the new synthetic dataset is consistent with the performance obtained on the two real-world datasets tested as part of this work, namely FordA and ECG. The designed dataset hence acts as a good proxy for real-world classification tasks with two convenient properties: its complexity and properties are tunable, and its generation is lightweight. As such, it can complement real-world datasets for a range of different research objectives within the context of evaluating post-hoc interpretability methods in time-series classification, given its known multivariate and time-dependent discriminative properties.
Finally, we assessed the usefulness of interpretability methods validated with our evaluation framework in an operational setting. In particular, we used the ECG clinical dataset because it provides a good example of how interpretability can be used once the interpretability methods have been evaluated. The use of clinical data was favoured because the healthcare sector will probably become highly regulated and thus require accurate interpretability of AI technologies^{12}. To this end, we interacted with clinicians to identify a common disease that is representative in the context of ECGs, and that can be of interest to them. This turned out to be the well-studied cardiac disease, right bundle branch block (RBBB). In the classification task presented in Fig. 1 for the ECG data, the neural networks were trained to classify RBBB. Shapley is able to pinpoint a specific and compact region of interest in the time series, whereas the other methods provide interpretability maps that are less compact (in the case of KernelShap, a sparse map without a clear region). The feature highlighted by the Shapley relevance map corresponds to one of the morphological features cardiologists look at to diagnose the disease of interest^{31}. The interpretability method also shows that the trained model relies almost entirely on a single lead to predict the disease in question, namely RBBB; however, other diagnostic criteria focusing on different leads are commonly used. This type of analysis provides practical insights to understand how trained models will perform in an applied operational setting, and may help identify possible biases, spurious correlations and potential corrective actions. Moving forwards, it is of interest to understand how these analyses could be implemented for regulatory purposes, for example, and deployed as part of AI-based technologies in new high-risk applications.
Methods
Tackling distribution shift
A long-standing issue when evaluating interpretability methods has been the shift in distribution between the training and corrupted datasets used for the evaluation. Interpretability methods have frequently been evaluated by comparing the drop in score when the most relevant time steps are corrupted with the score of the initial sample. The ROAR approach was proposed to address this issue^{22}; however, retraining the model of interest comes with its own drawbacks, as discussed in the main text. The training method presented next aims to address this issue, thereby maintaining a constant distribution between the training dataset and the one used to evaluate the interpretability method. To achieve this, the models presented in this paper were trained with random perturbations applied to the time series. This method was inspired by data-augmentation strategies commonly used when training models for image classification, object detection and other image-based tasks. On these tasks, random cropping has been shown to improve the classification performance of the developed model as well as its robustness^{32}. In this work, the aim of the perturbations in the training set is not to improve the performance of the model but instead to maintain an identical distribution between the training dataset and the corrupted samples used to evaluate the interpretability methods.
Similarly to the random cropping applied to images, part of the time series is corrupted by substituting the initial time steps with points drawn from a normal distribution \(\sim {{{\mathcal{N}}}}(0,1)\). This distribution follows the normalization applied as a preprocessing step to the samples. In a similar fashion to DropBlock^{33}, consecutive time steps (or blocks) are corrupted. The augmentation is applied per batch, with the overall fraction of the time series being corrupted (γ) and the size of the blocks (β) being sampled from the following uniform distributions:
Given the method described above, when specific time steps are corrupted to evaluate the interpretability methods, the distribution is retained with the samples used when training the model; β is chosen to reflect the range of consecutive time steps above the median positive relevance empirically observed over the used datasets; γ was also empirically chosen to cover most of the samples, where the positive relevance is rarely assigned to more than 80% of the total number of time steps in a given sample. The change in score observed when corrupting time steps can therefore not be attributed to a shift in the distribution, and hence fully reflects a loss of information for the model as measured by the interpretability methods. This approach addresses the distribution shift in the evaluation of interpretability methods without requiring retraining the model. The latter point is important as it is not possible to assert that the retrained model uses the same time steps as the initial one which the interpretability methods aim to explain.
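The block-wise corruption described above can be sketched as follows. This is an illustrative sketch, not the authors' training code: the function name `corrupt_blocks` is hypothetical, a single univariate series is assumed for simplicity, and γ and β are passed in by the caller because the bounds of their uniform distributions are not reproduced in this section.

```python
import random


def corrupt_blocks(series, gamma, beta, rng=None):
    """Replace about a fraction gamma of the time steps with N(0, 1) draws.

    series: list of floats (assumed already normalized).
    gamma: overall fraction of time steps to corrupt, in [0, 1].
    beta: number of consecutive time steps in each corrupted block (>= 1).
    Returns the corrupted copy and a boolean mask of the corrupted steps.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if beta < 1:
        raise ValueError("beta must be at least 1")
    rng = rng or random.Random()
    n = len(series)
    target = int(round(gamma * n))
    corrupted = list(series)  # the input series is left untouched
    mask = [False] * n
    n_masked = 0
    while n_masked < target:
        start = rng.randrange(n)
        # Corrupt a block of consecutive steps, truncated at the series end.
        for t in range(start, min(start + beta, n)):
            if not mask[t] and n_masked < target:
                mask[t] = True
                corrupted[t] = rng.gauss(0.0, 1.0)
                n_masked += 1
    return corrupted, mask
```

Applying the same routine during training (with γ and β drawn afresh for each batch) and during evaluation (with the mask chosen from the relevance ordering rather than at random) is what keeps the corrupted evaluation samples within the training distribution.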
Novel approach for evaluating posthoc interpretability methods
The time-series classification task considered in this paper can be formalized as follows (the symbols and notation adopted are also reported in Table 1). Given a trained neural-network model, f, we aim to map a set of features \({{{\bf{X}}}}\in {{\mathbb{R}}}^{M\times T}\) to a labelled target \(\,{{\mbox{C}}}\,\in {{\mathbb{N}}}^{{N}_{c}}\) for each sample i contained in a given dataset \({{{{\mathcal{D}}}}}_{i}={[{{{\bf{X}}}},{{\mbox{C}}}]}_{i}\,\,{{\mbox{for}}}\,\,i=1,\ldots ,J\), where M is the number of features in X, T is the number of ordered time steps per feature, \({N}_{c}\) is the number of labels and J is the total number of samples available. Typically, a final dense layer will produce logits that are then fed to a softmax layer to output the probability that sample i belongs to a given class C \(\in {{\mathbb{N}}}^{{N}_{c}}\).
To quantitatively assess time-series interpretability methods, we developed novel metrics or indices to evaluate how closely an interpretability method reflects the representation learned by the model of interest. Interpretability methods produce an attribution scheme \({{{\mathcal{A}}}}\) that assigns relevance R to the input X for a specific class c ∈ C such that \({{{{\mathcal{A}}}}}_{c}:{{{\bf{X}}}}\to \left\{{{{\bf{R}}}}\in {{\mathbb{R}}}^{M\times T}\right\}\), where X = (x_{m,t}) and R = (r_{m,t}), with m and t being the indices associated with M and T, respectively. For simplicity, the class that the attribution scheme aims to explain is dropped for the rest of the paper and the attribution scheme is denoted by \({{{\mathcal{A}}}}\). The new metrics are built on the relevance that an interpretability method provides along the time series. The relevance can be positive or negative (except for some interpretability methods, where it is only positive; see, for example, the saliency method^{34}). A positive relevance means that the neural network is using that portion of the time series to make its prediction. A negative relevance indicates that the neural network sees the portion of the time series as going against its prediction. As we are interested in how the network is using data to make its predictions, we use the positive relevance to build the new metrics. Logits have often been used as the input for interpretability methods^{34,35}. However, Srinivas and colleagues^{36} demonstrate that pre-softmax outputs are related to a generative model that is uninformative of the discriminative model used for the classification task. For the rest of the paper, the interpretability methods are therefore applied to the post-softmax outputs of the models, and they are evaluated using the changes in these outputs when samples are corrupted. The post-softmax output is denoted by \(S:{{\mathbb{R}}}^{M\times T}\to {[0,1]}^{{N}_{c}}\).
In this work we aim to build a framework for evaluating the performance of interpretability methods. To this end, we chose six interpretability methods that capture a broad range of the methods available, while keeping the problem computationally tractable. These are: (1) DeepLift^{27}, (2) GradShap^{17}, (3) Integrated Gradients^{28}, (4) KernelShap^{17}, (5) DeepLiftShap^{17} and (6) Shapley^{29}. Their implementation uses the Captum library^{15}.
Relevance identification and attribution
We need to tackle two aspects to assess the performance of interpretability methods: relevance identification and relevance attribution. We next detail these two concepts along with the methods developed to measure them.
Relevance identification
The concept behind relevance identification is that interpretability methods should correctly identify and order, according to their relevance, the set of points (in our case time steps) used by the model to make its predictions. Extending the assumption formulated by Shah and co-workers^{30}, time steps with larger relevance are more relevant for the model to make a prediction than the ones with smaller relevance. The relevance produced by an attribution scheme can be used to create an ordering that ranks feature m at time step t (namely x_{m,t}) from a sample X according to its importance for the model’s prediction. It is important to remember that here we only focus on the positive relevance. The ordering can then be used to define \({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{top}}}}}\) and \({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{bottom}}}}}\), which represent the samples with the top-k and bottom-k time steps corrupted, respectively. The time steps are ordered using the assigned relevance and corrupted as follows:
where \({Q}_{{R}^{+}}(p)\) denotes the p-quantile (where p = 1 − k for \({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{top}}}}}\) and p = k for \({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{bottom}}}}}\)) of the set of positive attributed relevance in a given sample R^{+}, measured over sample X using attribution scheme \({{{\mathcal{A}}}}\), with R^{+} = {r_{m,t}∣r_{m,t} > 0}. The rest of the analysis is performed for the following set of top-k and bottom-k fractions of time steps with positive relevance: k ∈ [0.05, 0.15, 0.25, …, 0.95, 1].
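A minimal sketch of this quantile-based selection is given below, under several assumptions: the sample and its relevance are flattened to one-dimensional lists, the quantile uses linear interpolation, points exactly at the threshold are included, and corrupted points are replaced by draws from a standard normal distribution. The names `quantile` and `corrupt_by_relevance` are hypothetical helpers introduced for the example.

```python
import random


def quantile(values, p):
    """p-quantile with linear interpolation (one of several conventions)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("quantile of an empty set is undefined")
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def corrupt_by_relevance(x, relevance, k, mode, rng):
    """Corrupt the top-k or bottom-k fraction of positively relevant points.

    `x` and `relevance` are flattened (M * T) lists of equal length.
    Returns the corrupted copy of `x` and the boolean selection mask.
    """
    positive = [r for r in relevance if r > 0]
    if not positive:
        return list(x), [False] * len(x)
    if mode == "top":
        threshold = quantile(positive, 1.0 - k)
        selected = [r > 0 and r >= threshold for r in relevance]
    elif mode == "bottom":
        threshold = quantile(positive, k)
        selected = [r > 0 and r <= threshold for r in relevance]
    else:
        raise ValueError("mode must be 'top' or 'bottom'")
    corrupted = [rng.gauss(0.0, 1.0) if s else v for v, s in zip(x, selected)]
    return corrupted, selected
```

With k = 1 and mode "top", every point with positive relevance is selected, which corresponds to the reference case in which all of the positive relevance is corrupted.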
In the general case of a modified sample \(\bar{{{{\bf{X}}}}}\), where \(\bar{N}\) points along the time series are corrupted, we can define the normalized difference in score:
It is possible to build \(\tilde{S}\) versus \(\tilde{N}\) curves for the top-k or bottom-k points (or time steps) corrupted, where \(\tilde{N}=\frac{\bar{N}}{N}\) is the fraction of points removed with respect to the total number of time steps N = M × T present in the time series. The area under the \(\tilde{S}\)–\(\tilde{N}\) curve is denoted as:
Using equation (6), we can define \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\). The former aims to evaluate the ability of an interpretability method to recover the most important time steps for a model’s prediction. In this sense, the area under the drop in score when the top-k time steps are progressively corrupted should be maximized. To normalize for the interpretability methods assigning a different number of time steps with positive relevance, the \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) is measured on a modified \(\tilde{S}\)–\(\tilde{N}\) curve. This modified curve is created by adding an extra point with coordinates \(\left(\tilde{N}=1;\tilde{S}=\tilde{S}({\bar{{{{\bf{X}}}}}}_{k = 1}^{{{{\rm{top}}}}})\right)\). Adding this point favours interpretability methods that achieve a large drop in score with a minimal number of time steps assigned positive relevance.
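The following sketch illustrates how such an area could be computed. It rests on assumptions that are not confirmed by the text above: the normalized drop in score is taken as the drop relative to the original score, the area is integrated with the trapezoidal rule, and the curve is assumed to start at the origin (no corruption, no drop). The function names are hypothetical.

```python
def normalized_drop(score_original, score_corrupted):
    """Assumed form of the normalized difference in score."""
    return (score_original - score_corrupted) / score_original


def auc_top(n_fracs, drops):
    """Trapezoidal area under the drop-in-score curve for top-k corruption.

    `n_fracs` are fractions of corrupted points and `drops` the matching
    normalized drops. The extra point (1, last drop) is appended so that
    methods reaching a large drop with few points are favoured.
    """
    points = sorted(zip(n_fracs, drops))
    if points[0][0] > 0.0:
        points.insert(0, (0.0, 0.0))  # assumed starting point of the curve
    if points[-1][0] < 1.0:
        points.append((1.0, points[-1][1]))  # extra point at N~ = 1
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area
```

Under these assumptions, a method that needs only 20% of the points to reach the full drop obtains a larger area than a method that needs 80% of the points to reach the same drop.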
The \({\mathrm{F}}1\tilde{S}\) aims to build a harmonic mean between the ability of an attribution scheme to correctly rank the time steps with the highest and smallest relevance, respectively. Corrupting time steps with high relevance should result in a substantial drop in score (the model’s outputs are greatly affected). Corrupting time steps with small relevance should result in a negligible drop in score (the model’s outputs are negligibly affected). The best attribution scheme should have maximized \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) for top-k point corruptions, and minimized \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{bottom}}}}}\) for bottom-k point corruptions. Regarding this desired property, we can define the following F1 score:
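A minimal sketch of a score with this property is given below. It assumes the textbook harmonic mean between the area for top-k corruption and one minus the area for bottom-k corruption; the constant factor and the handling of the degenerate case are assumptions of the example, and the function name `f1_score` is hypothetical.

```python
def f1_score(auc_top_value, auc_bottom_value):
    """Harmonic mean of AUC_top and (1 - AUC_bottom), standard F1 form."""
    low_bottom = 1.0 - auc_bottom_value
    denominator = auc_top_value + low_bottom
    if denominator == 0.0:
        return 0.0  # both terms vanish: worst possible ranking
    return 2.0 * auc_top_value * low_bottom / denominator
```

The score is high only when both conditions hold: a large drop when the most relevant time steps are corrupted and a small drop when the least relevant ones are corrupted.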
Relevance attribution
The idea behind relevance attribution is that the relevance should not only serve to order the time steps but also reflect the individual contribution of each time step relative to the others towards the model’s predicted score. All interpretability methods presented in this work are additive feature attribution methods, as defined by Lundberg and colleagues^{17}. The produced relevance therefore aims to linearly reflect the effect of each feature on the model’s outputs. Focusing on the positive relevance, the difference between the model’s prediction on an initial given sample S(X) and a version with the top-k points corrupted \(S({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{top}}}}})\) depends on the positive relevance corrupted between the two samples. The ratio of the relevance attributed to the corrupted time steps to the total positive relevance of the initial sample is summarized by the time information content (TIC) index, defined as:
where the following sets are defined:
The TIC index reflects the ratio of the relevance attributed to the top-k set of points to the total positive relevance. Taking the model’s output when all of the positive relevance is corrupted as a reference, we can normalize the change in score as follows:
where \(S({\bar{{{{\bf{X}}}}}}_{k = 1}^{{{{\rm{top}}}}})\) corresponds to the model’s output when all time steps with positive relevance of sample X are corrupted. We name the quantity in equation (11) the adjusted normalized drop in score. Given the linear additivity property of the relevance, the index \({\tilde{S}}_{A}(k)\) should be equal to the TIC(k) index (see Supplementary Section 4) such that the information ratio satisfies
Given this theoretical approximation, it is possible to evaluate how different interpretability methods over- or underestimate the role of different time steps in the model’s prediction. An information ratio larger than one indicates that the relevance of the points under the quantile of interest was underestimated, whereas an information ratio smaller than one indicates that it was overestimated. An example of an \({\tilde{S}}_{A}(k)\)–TIC(k) curve is depicted in Fig. 3, where we report the results for every interpretability method considered as well as the theoretical linear relationship (dashed line).
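The quantities involved in relevance attribution can be illustrated with the sketch below. It assumes that the TIC index is the sum of the corrupted positive relevance divided by the total positive relevance, that the adjusted normalized drop is the drop for top-k corruption divided by the drop when all positive relevance is corrupted, and that the information ratio is their quotient. All function names are hypothetical.

```python
def tic(relevance, selected):
    """Share of the total positive relevance carried by the selected points."""
    total = sum(r for r in relevance if r > 0)
    corrupted = sum(r for r, s in zip(relevance, selected) if s and r > 0)
    return corrupted / total


def adjusted_drop(score_original, score_top_k, score_top_all):
    """Drop for top-k corruption, relative to the drop for k = 1."""
    return (score_original - score_top_k) / (score_original - score_top_all)


def information_ratio(score_original, score_top_k, score_top_all,
                      relevance, selected):
    """Ratio close to 1 when relevance reflects the drop in score linearly."""
    drop = adjusted_drop(score_original, score_top_k, score_top_all)
    return drop / tic(relevance, selected)
```

For instance, if the corrupted points carry half of the positive relevance and corrupting them produces half of the maximal drop in score, the ratio equals one, which is the behaviour expected from a linearly additive attribution.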
Datasets
The new interpretability evaluation approach has been applied to: a new synthetic dataset created for this work, a standard univariate dataset for anomaly classification and a biomedical dataset based on ECG signals. The three datasets are described below.
A new synthetic dataset
The evaluation of interpretability methods for timeseries classification has been lacking a dataset: (1) where the discriminative features are known and (2) that replicates the complexity of common timeseries classification tasks with time dependencies across features. The developed dataset is inspired by the BlockMNIST dataset, which is derived from the MNIST dataset^{30}. Each sample in the dataset comprises six features, each with 500 time steps corresponding to Δt = 2 ms. Each feature comprises a sine wave with its amplitude multiplied by 0.5 and frequency \(\sim {{{\mathcal{U}}}}(2,5)\), which serves as a random baseline. In two of the features—picked randomly for each sample—sine waves with a support of 100 time steps are added to the baseline at a random position in time.
The respective frequencies of these sine waves, f_{1} and f_{2}, are drawn from a discrete uniform distribution \(\sim {{{\mathcal{U}}}}(10,50)\). In the remaining four features, a square wave is included with a probability of 0.5 and frequency \(\sim {{{\mathcal{U}}}}(10,50)\).
The classification task then consists of predicting whether the sum of the two frequencies (f_{1} and f_{2}) is above or below a given threshold τ. For the presented task, τ was set to 60 to balance the classes of the classification target y such that:
The main idea behind the developed dataset is to force the network to learn temporal dependencies, that is, the frequency of the sine wave with closed support, as well as dependencies across features, the sum of the frequencies (f_{1} and f_{2}). An example of a generated sample is presented in Fig. 4. The closed support sine waves used to create the classification target are observed in features 2 and 6. We note that the synthetic dataset proposed here can be regarded as a family of datasets, as the number of features, length of time series, class imbalance and discriminative features are tunable.
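A simplified generator for one such sample is sketched below. It follows the description above where the text is explicit (six features, 500 time steps, Δt = 2 ms, baseline amplitude 0.5, support of 100 time steps, τ = 60). The remaining choices are assumptions made for the example: zero initial phase, unit amplitude for the added waves, the same support of 100 time steps for the square waves, and the convention that the positive class corresponds to a sum above τ. The released code should be consulted for the exact procedure.

```python
import math
import random


def make_sample(rng, n_features=6, n_steps=500, dt=0.002, support=100, tau=60):
    """Generate one synthetic sample and its binary label."""
    times = [i * dt for i in range(n_steps)]
    x = []
    for _ in range(n_features):
        # Random baseline: sine wave with amplitude 0.5 and frequency U(2, 5).
        f_base = rng.uniform(2.0, 5.0)
        x.append([0.5 * math.sin(2.0 * math.pi * f_base * t) for t in times])
    informative = sorted(rng.sample(range(n_features), 2))
    freqs = []
    for m in range(n_features):
        start = rng.randrange(n_steps - support + 1)
        if m in informative:
            # Discriminative sine wave with a frequency from discrete U(10, 50).
            f = rng.randint(10, 50)
            freqs.append(f)
            for i in range(start, start + support):
                x[m][i] += math.sin(2.0 * math.pi * f * (i - start) * dt)
        elif rng.random() < 0.5:
            # Distractor square wave, present with probability 0.5.
            f = rng.uniform(10.0, 50.0)
            for i in range(start, start + support):
                phase = math.sin(2.0 * math.pi * f * (i - start) * dt)
                x[m][i] += 1.0 if phase >= 0.0 else -1.0
    label = int(sum(freqs) > tau)  # assumed convention for the positive class
    return x, label, informative, freqs
```

Because the generator returns the indices of the two informative features and their frequencies, the discriminative time steps of every sample are known by construction.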
FordA
The FordA dataset is part of the UCR Time Series Classification Archive, which aims to group different datasets for time-series classification^{37}. FordA is a univariate, binary classification task. The data originate from an automotive subsystem and the classification task aims to find samples with a specific anomaly. The dataset comes with training (n = 3,601) and testing (n = 1,320) splits, which were retained in this paper. The dataset is of interest as it has often served as a benchmark for classification algorithms^{38} as well as for benchmarking interpretability methods^{24}.
ECG dataset
To mimic a real-world classification task, we applied the interpretability framework to an ECG dataset. An electrocardiogram records the electrical activity of the heart and typically produces twelve signals, corresponding to twelve sensors or leads. For this task, a subset of the Classification of twelve-lead ECGs dataset (The PhysioNet/Computing in Cardiology Challenge 2020^{39}, published under Creative Commons Attribution 4.0 License) was used. The dataset was narrowed down to the CPSC subset^{40}, which included 6,877 ECGs annotated for nine cardiovascular diseases. Among these annotations, we chose to classify the ECGs for the presence or absence of a right bundle branch block (RBBB). The dataset includes 5,020 cases showing no sign of an RBBB and 1,857 cases annotated as carrying an RBBB; RBBB was found to be associated with higher cardiovascular risks as well as mortality^{41}.
The data were first denoised using different techniques for low- and high-frequency artifacts. The baseline wander as well as low-frequency artifacts were first removed by performing empirical mode decomposition (EMD). The instantaneous frequency is computed and averaged across each intrinsic mode resulting from EMD to obtain an average frequency of the modes. Modes with an average frequency below 0.7 Hz are then discarded and the signal is reconstructed with the remaining modes. The threshold is a parameter based on the literature, where the thresholds range between 0.5 and 1 Hz (refs. ^{42,43,44}). Given the difficulty of separating high-frequency noise using EMD, powerline and other high-frequency noise is removed by thresholding the wavelet transform coefficients using the ‘universal threshold’^{45}.
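For reference, the universal threshold of Donoho and Johnstone^{45} is σ√(2 ln n), where n is the number of samples and σ is the noise level. The sketch below assumes that σ is estimated from the median absolute value of the detail coefficients divided by 0.6745 and that soft thresholding is applied; neither choice is stated above, and the wavelet decomposition itself is left out. The function names are hypothetical.

```python
import math
import statistics


def universal_threshold(detail_coeffs, n_samples=None):
    """Universal threshold sigma * sqrt(2 ln n) for wavelet coefficients."""
    n = n_samples if n_samples is not None else len(detail_coeffs)
    # Robust noise estimate from the median absolute coefficient (assumed).
    sigma = statistics.median([abs(c) for c in detail_coeffs]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n))


def soft_threshold(coeff, threshold):
    """Shrink a coefficient towards zero by `threshold` (assumed variant)."""
    return math.copysign(max(abs(coeff) - threshold, 0.0), coeff)
```

Coefficients whose magnitude falls below the threshold are set to zero, and the denoised signal is obtained by inverting the wavelet transform on the thresholded coefficients.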
To obtain an average beat, the R-peaks of each ECG are extracted using the BioSPPy library^{46}. The beats centred around the R-peaks are then extracted from the ECG by taking 0.35 s before and 0.55 s after each R-peak. The mean of the extracted beats is then computed to obtain an average beat for each lead. An example of the initial signal alongside the transformed one for a subset of the twelve leads is presented in Extended Data Figs. 9 and 10. The resulting modified twelve leads were used to train the model.
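The beat-averaging step for a single lead can be sketched as follows, assuming that the R-peak positions have already been detected and are given as sample indices. The sampling frequency `fs` is a parameter of the example, beats whose window exceeds the recording are discarded (an assumption), and the function name `average_beat` is hypothetical.

```python
def average_beat(signal, r_peaks, fs, before=0.35, after=0.55):
    """Average the beats cut around each R-peak of a single lead.

    Each beat spans `before` seconds before and `after` seconds after the
    R-peak; incomplete windows at the edges of the recording are skipped.
    """
    n_before = round(before * fs)
    n_after = round(after * fs)
    beats = [
        signal[p - n_before:p + n_after]
        for p in r_peaks
        if p - n_before >= 0 and p + n_after <= len(signal)
    ]
    if not beats:
        raise ValueError("no complete beat could be extracted")
    n_beats = len(beats)
    return [sum(b[i] for b in beats) / n_beats
            for i in range(n_before + n_after)]
```

Applying the function to each of the twelve leads with the same R-peak positions yields one average beat per lead.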
Baseline for interpretability methods
The interpretability methods used as part of this work require setting an uninformative baseline as a reference. Most methods (Integrated Gradients, DeepLift, Shapley, KernelShap) require a single sample set as the baseline. For those methods, the baseline was set as the mean taken across samples for each time step. GradShap and DeepLiftShap use a distribution of baselines, which was constructed by taking 50 random samples from the test set.
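The two kinds of baseline can be sketched as below, with each sample stored as a nested list of M features by T time steps. The helper names `mean_baseline` and `baseline_distribution` are hypothetical, and the set of samples over which the mean is taken is left to the caller, as it is not specified above.

```python
import random


def mean_baseline(samples):
    """Single baseline: mean across samples for each feature and time step."""
    n_samples = len(samples)
    n_features = len(samples[0])
    n_steps = len(samples[0][0])
    return [
        [sum(s[m][t] for s in samples) / n_samples for t in range(n_steps)]
        for m in range(n_features)
    ]


def baseline_distribution(samples, rng, n_baselines=50):
    """Distribution of baselines: random samples drawn without replacement."""
    return rng.sample(samples, min(n_baselines, len(samples)))
```

The first function would serve the methods that expect a single reference sample, and the second the methods that average their attributions over several baselines.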
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets and trained models used in this paper have been made available on Zenodo (https://zenodo.org/record/7534770#.Y8lkkXbMI2w)^{47}. The ECG dataset is based on the public dataset released as part of The PhysioNet/Computing in Cardiology Challenge 2020^{39}, available at https://doi.org/10.13026/f4ab0814. The FordA dataset comes from the UEA & UCR Time Series Classification Repository^{37}. The synthetic dataset used as part of this study can be generated using the code shared on GitHub: https://github.com/hturbe/InterpretTime.
Code availability
The full code used to perform the analysis is available at https://github.com/hturbe/InterpretTime. The specific version of the code used to generate the results presented in this article is archived in Zenodo^{48}.
References
Weyn, J. A., Durran, D. R. & Caruana, R. Improving data driven global weather prediction using deep convolutional neural networks on a cubed sphere. J. Adv. Modell. Earth Syst. 12, e2020MS002109 (2020).
Yang, R. et al. Big data analytics for financial Market volatility forecast based on support vector machine. Int. J. Inf. Manage. 50, 452–462 (2020).
Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 1, 1–10 (2018).
Dau, H. A. et al. The UCR time series archive. IEEE/CAA J. Autom. Sin. 6, 1293–1305 (2019).
Manibardo, E. L., Laña, I. & Del Ser, J. Deep learning for road traffic forecasting: does it make a difference? IEEE Trans. Intell. Transp. Syst. 23, 6164–6188 (2021).
Ye, L. & Keogh, E. Time series shapelets: a new primitive for data mining. In Proc. 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 947–956 (ACM, 2009).
Hewamalage, H., Bergmeir, C. & Bandara, K. Recurrent neural networks for time series forecasting: current status and future directions. Int. J. Forecast. 37, 388–427 (2021).
Lim, B., Arık, S. Ö., Loeff, N. & Pfister, T. Temporal fusion transformers for interpretable multihorizon time series forecasting. Int. J. Forecast. 37, 1748–1764 (2021).
Tang, B. & Matteson, D. S. Probabilistic transformer for time series analysis. In Advances in Neural Information Processing Systems Vol. 34, 23592–23608 (NeurIPS, 2021).
Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L. & Muller, P. A. Deep learning for time series classification: a review. Data Min. Knowl. Discov. 33, 917–963 (2019).
Hong, S., Zhang, W., Sun, C., Zhou, Y. & Li, H. Practical lessons on 12-lead ECG classification: meta-analysis of methods from PhysioNet/computing in cardiology challenge 2020. Front. Physiol. https://doi.org/10.3389/fphys.2021.811661 (2022).
Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts COM/2021/206 final (European Commission, DirectorateGeneral for Communications Networks, Content and Technology, 2021); https://eurlex.europa.eu/legalcontent/EN/ALL/?uri=CELEX:52021PC0206
Lipton, Z. C. The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16, 31–57 (2018).
Shad, R., Cunningham, J. P., Ashley, E. A., Langlotz, C. P. & Hiesinger, W. Designing clinically translatable artificial intelligence systems for highdimensional medical imaging. Nat. Mach. Intell. 3, 929–935 (2021).
Kokhlikyan, N. et al. Captum: a unified and generic model interpretability library for PyTorch. Preprint at https://arxiv.org/abs/2009.07896 (2020).
Montavon, G., Bach, S., Binder, A., Samek, W. & Müller, K. R. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recogn. 65, 211–222 (2017).
Lundberg, S. & Lee, S.I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) Vol. 30 (NeurIPS, 2017).
Neves, I. et al. Interpretable heartbeat classification using local modelagnostic explanations on ECGs. Comput. Biol. Med. 133, 104393 (2021).
Jacovi, A. & Goldberg, Y. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proc. of the 58th Annual Meeting of the Association for Computational Linguistics 4198–4205 (Association for Computational Linguistics, 2020).
Adebayo, J. et al. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems Vol. 31 (2018).
Samek, W., Binder, A., Montavon, G., Lapuschkin, S. & Müller, K. R. Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Netw. Learn. Syst. 28, 2660–2673 (2016).
Hooker, S., Erhan, D., Kindermans, P. J. & Kim, B. A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019).
Hacohen, G., Choshen, L. & Weinshall, D. Let’s agree to agree: neural networks share classification order on real datasets. In International Conference on Machine Learning 3950–3960 (PMLR, 2020).
Schlegel, U., Arnout, H., El-Assady, M., Oelke, D. & Keim, D. A. Towards a rigorous evaluation of XAI methods on time series. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) 4197–4201 (IEEE, 2019); https://doi.org/10.1109/ICCVW.2019.00516
Ismail, A. A., Gunady, M., Corrada Bravo, H. & Feizi, S. Benchmarking deep learning interpretability in time series predictions. In Advances in Neural Information Processing Systems Vol. 33, 6441–6452 (2020).
Liu, B., Wang, X., Dixit, M., Kwitt, R. & Vasconcelos, N. Feature space transfer for data augmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 9090–9098 (IEEE, 2018).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning 3145–3153 (PMLR, 2017).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning 3319–3328 (PMLR, 2017).
Castro, J., Gómez, D. & Tejada, J. Polynomial calculation of the Shapley value based on sampling. Comput. Oper. Res. 36, 1726–1730 (2009).
Shah, H., Jain, P. & Netrapalli, P. Do input gradients highlight discriminative features? In Advances in Neural Information Processing Systems Vol. 34, 2046–2059 (NeurIPS, 2021).
Surawicz, B., Childers, R., Deal, B. J. & Gettes, L. S. AHA/ACCF/HRS recommendations for the standardization and interpretation of the electrocardiogram: part III: intraventricular conduction disturbances: a scientific statement from the American Heart Association Electrocardiography and Arrhythmias Committee, Council on Clinical Cardiology; the American College of Cardiology Foundation; and the Heart Rhythm Society Endorsed by the International Society for Computerized Electrocardiology. J. Am. College Cardiol. 53, 976–981 (2009).
Cubuk, E. D., Zoph, B., Shlens, J. & Le, Q. RandAugment: practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems (eds Larochelle H. et al.) Vol. 33, 18613–18624 (NeurIPS, 2020).
Ghiasi, G., Lin, T. Y. & Le, Q. V. Dropblock: a regularization method for convolutional networks. In Advances in Neural Information Processing Systems Vol. 31 (NeurIPS, 2018).
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. In Proc. of the 2nd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2014).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV) 618–626 (IEEE, 2017).
Srinivas, S. & Fleuret, F. Rethinking the role of gradient-based attribution methods for model interpretability. In 2021 International Conference on Learning Representations (ICLR, 2021).
Bagnall, A., Lines, J., Bostrom, A., Large, J. & Keogh, E. The Great Time Series Classification Bake Off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 31, 606–660 (2017).
Yang, C. H. H., Tsai, Y. Y. & Chen, P. Y. Voice2Series: Reprogramming acoustic models for time series classification. In Proc. 38th International Conference on Machine Learning (eds Meila M. & Zhang, T.) Vol. 139, 11808–11819 (PMLR, 2021); https://proceedings.mlr.press/v139/yang21j.html
Perez Alday, E. A. et al. Classification of 12-lead ECGs: The PhysioNet/Computing in Cardiology Challenge 2020 (PhysioNet, 2022); https://physionet.org/content/challenge2020/1.0.2/
Liu, F. et al. An open access database for evaluating the algorithms of electrocardiogram rhythm and morphology abnormality detection. J. Med. Imaging Health Inform. 8, 1368–1373 (2018).
Bussink, B. E. et al. Right bundle branch block: prevalence, risk factors, and outcome in the general population: results from the Copenhagen City Heart Study. European Heart J. 34, 138–146 (2012).
Thakor, N. V. & Zhu, Y. S. Applications of adaptive filtering to ECG analysis: noise cancellation and arrhythmia detection. IEEE Trans. Biomed. Eng. 38, 785–794 (1991).
Van Alste, J. A. & Schilder, T. S. Removal of baseline wander and powerline interference from the ECG by an efficient FIR filter with a reduced number of taps. IEEE Trans. Biomed. Eng. BME32, 1052–1060 (1985).
van Alsté, J. A., van Eck, W. & Herrmann, O. E. ECG baseline wander reduction using linear phase filters. Comput. Biomed. Res. 19, 417–427 (1986).
Donoho, D. L. & Johnstone, I. M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455 (1994).
Carreiras, C. et al. BioSPPy: Biosignal Processing in Python (GitHub, 2018); https://github.com/PIAGroup/BioSPPy/
Turbé, H., Bjelogrlic, M., Lovis, C. & Mengaldo, G. Dataset: Evaluation of PostHoc Interpretability Methods in TimeSeries Classification (Zenodo, 2023); https://doi.org/10.5281/zenodo.7534770
Turbé, H., Bjelogrlic, M., Lovis, C. & Mengaldo, G. hturbe/InterpretTime: Initial Release to Replicate Results of the Submitted Article (Zenodo, 2023); https://doi.org/10.5281/zenodo.7560836
Acknowledgements
G.M. acknowledges Singapore’s Ministry of Education support through MOE Tier 1 grant 224900A00010. We also thank A. Gualandi for the fruitful discussions and valuable feedback he provided as part of this research. We thank the anonymous reviewers for their insightful comments, which helped to improve the paper considerably.
Funding
Open access funding provided by University of Geneva.
Author information
Contributions
H.T. conceived the initial research idea with input from all of the authors. M.B., G.M. and H.T. performed the experiments and data analysis. M.B., G.M. and H.T. wrote the paper with input from all of the authors. C.L. and G.M. supervised the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Massimo Rivolta, Eamonn Keogh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 \(\tilde{S}\) as a function of the ratio of points removed with respect to the total number of time steps in the sample, \(\tilde{N}\).
Each subfigure represents one of the six interpretability methods considered for a BiLSTM trained on the synthetic dataset.
Extended Data Fig. 2 \(\tilde{S}\) as a function of the ratio of points removed with respect to the total number of time steps in the sample, \(\tilde{N}\).
Each subfigure represents one of the six interpretability methods considered for a CNN trained on the synthetic dataset.
Extended Data Fig. 3 \(\tilde{S}\) as a function of the ratio of points removed with respect to the total number of time steps in the sample, \(\tilde{N}\).
Each subfigure represents one of the six interpretability methods considered for a BiLSTM trained on the ECG dataset.
Extended Data Fig. 4 \(\tilde{S}\) as a function of the ratio of points removed with respect to the total number of time steps in the sample, \(\tilde{N}\).
Each subfigure represents one of the six interpretability methods considered for a CNN trained on the ECG dataset.
Extended Data Fig. 5 \(\tilde{S}\) as a function of the ratio of points removed with respect to the total number of time steps in the sample, \(\tilde{N}\).
Each subfigure represents one of the six interpretability methods considered for a Transformer trained on the ECG dataset.
Extended Data Fig. 6 Change in accuracy as a function of the ratio of points removed with respect to the total number of time steps in the sample, \(\tilde{N}\) for the six interpretability methods considered using the synthetic dataset.
Results depicted for (a) BiLSTM, (b) CNN and (c) Transformer.
Extended Data Fig. 7 Change in accuracy as a function of the ratio of points removed with respect to the total number of time steps in the sample, \(\tilde{N}\) for the six interpretability methods considered using the ECG dataset.
Results depicted for (a) BiLSTM, (b) CNN and (c) Transformer.
Extended Data Fig. 8 \({\tilde{S}}_{A}\) as a function of the TIC index for the six interpretability methods considered using the synthetic dataset.
Results depicted for (a) BiLSTM, (b) CNN and (c) Transformer.
Extended Data Fig. 9
Raw ECG signal for two selected leads from a given sample.
Extended Data Fig. 10
Processed ECG signal for two selected leads from a given sample.
Supplementary information
Supplementary Information
Supplementary Figs. 1–5, Tables 1–16 and theoretical estimation.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Turbé, H., Bjelogrlic, M., Lovis, C. et al. Evaluation of posthoc interpretability methods in timeseries classification. Nat Mach Intell 5, 250–260 (2023). https://doi.org/10.1038/s4225602300620w
DOI: https://doi.org/10.1038/s4225602300620w