Main

Time-series—sequences of indexed data that follow a specific time order—are ubiquitous. They can describe physical systems1 such as the state of the atmosphere and its evolution, social and economic systems2 such as the financial market, and biological systems3 such as the heart and the brain via electrocardiogram (ECG) and electroencephalogram signals, respectively. The availability of this type of data is increasing, and so is the need for automated analysis tools capable of extracting interpretable and actionable knowledge from them. Although established and more interpretable time-series approaches remain competitive for many tasks4,5,6, artificial intelligence (AI) technologies, and neural networks in particular, are opening the path towards highly accurate predictive tools for an increasing number of time-series regression7,8,9 and classification10,11 learning tasks. Yet the adoption of AI technologies as black-box tools is problematic in several applied contexts. To address this issue, numerous interpretability methods have been proposed in the literature, especially in the context of neural networks. These methods usually produce tangibly different results, preventing practitioners from fully unlocking the interpretability that is increasingly needed. Figure 1 shows four different post-hoc interpretability methods applied to time-series classification, in which the neural network is tasked with identifying the pathology associated with a patient’s ECG. The four interpretability methods produce remarkably different results for the same model. Hence the question: which method produced an interpretability map closest to the one actually adopted by the neural network to make its prediction? In this paper we answer this question quantitatively while addressing the issues found in the existing literature on the evaluation of interpretability methods. Aside from research purposes, understanding the accuracy of interpretability methods is de facto mandatory in critical sectors (such as healthcare) for legal and ethical reasons12. Failing to understand the performance of interpretability methods may prevent their adoption and, in turn, lead practitioners to avoid neural-network tools altogether in favour of more transparent, white-box alternatives.

Fig. 1: Relevance produced by four post-hoc interpretability methods.

Relevance produced by four post-hoc interpretability methods, obtained on a time-series classification task in which a Transformer neural network must identify the pathology of a patient from ECG data. Two signals (V1 and V2) are depicted in black, and the contour maps represent the relevance produced by each interpretability method. Red indicates positive relevance, whereas blue indicates negative relevance. The former marks portions of the time series that the interpretability method deemed important for the neural-network prediction, whereas the latter marks portions deemed to weigh against the prediction.

Different definitions of what it means for a neural-network model to be interpretable have been formulated. Most of these definitions can be summarized under two categories: transparency and post-hoc interpretability13. Transparency refers to how a model and its individual constituents work, whereas post-hoc interpretability refers to how a trained model makes predictions and uses the input features it is given. In this work we consider post-hoc interpretability applied to time-series classification, as it is seen as key to meeting recent regulatory requirements12 and to translating current research efforts into real-world applications, especially in high-risk areas such as healthcare14. Post-hoc interpretability methods assign a relevance to each feature of a sample, reflecting its importance to the model for the classification task being performed. The ability to express the specific features used by a neural network to classify a given sample can help humans assess the reliability of the classification produced and allows one to compare the model’s predictions with existing knowledge. It also provides a way to understand possible model biases that could lead to the model’s failure in a real-world setting.

A range of methods to provide post-hoc interpretability of classification results has been developed in the past few years, mainly focused on natural language processing and image classification tasks. With the recent growth of interest in neural-network interpretability, leading actors in the machine learning community have built a range of post-hoc interpretability methods. As part of this effort, Facebook recently released the Captum library, which groups a large number of interpretability methods under a single development framework15. Although these initiatives allow researchers to more easily use the different methods, they do not provide a systematic and comprehensive evaluation of those methods on data with different characteristics and across neural-network architectures. A systematic methodology for accurately evaluating these methods is of paramount importance to allow their wider adoption and to measure how trustworthy the results they provide are.

The evaluation of interpretability methods was initially based on a heuristic approach in which the relevance attributed to the different features was compared with the expectation of an observer for common image classification tasks16, or of a domain expert for more complex tasks17,18. However, these works shared a common pitfall: they assumed that the representation of a task learned by a neural network should use the same features as a human expert. The community later moved towards the idea that the evaluation should be independent of human judgement19. This paradigm shift was supported by the evidence that certain saliency methods—while looking attractive to human experts—produced results independent of the model they aimed to explain, thereby failing the interpretability task20. More recent evaluations were performed by occluding (also referred to as corrupting) the most relevant features identified and comparing the drop in score between model predictions on the initial and modified samples21. This evaluation method was later questioned: corrupting the images changes the distribution of the values of the sample, and the observed drop in score might therefore be caused by this shift in distribution rather than by actual information being removed22. An approach named ROAR was proposed to address this issue22, in which important pixels are removed in both the training and testing sets. The model is then retrained on the corrupted (that is, occluded) samples, and the drop in score is measured on this newly trained model. This method has the benefit of maintaining a similar distribution across the training and evaluation sets with the modified samples. Yet we argue that it does not necessarily explain which features the initial network used to make its prediction, as the similarity between neural-network models is only maintained if the models are trained on datasets sampled from the same distribution23. With ROAR, the distribution is changed because the model is retrained on a corrupted dataset, and the post-hoc interpretability of the retrained model is therefore not constrained to be similar to that of the initial model. The post-hoc interpretability instead highlights properties of the dataset with respect to its target, such as the redundancy of the information present in the features that are indicative of a given class—a limitation that was acknowledged by the authors21.

Neural network interpretability for time-series data was only recently explored. Initial efforts applied some of the interpretability methods introduced for natural language processing and image classification on univariate time series, and evaluated the drop in score obtained by corrupting the most relevant parts (also referred to as time steps) of the signal24. An evaluation of some interpretability methods was recently proposed25, with a dataset designed to address the issue of retaining equal distribution between the initial and occluded datasets; however, this work may have two crucial drawbacks: the proposed dataset contains static discriminative properties (for example, the mean of the sample) and it is not independent of human judgement. The former issue can lead the model to learn from static properties and thus the dataset might not reflect the complexity of real-world time-series classification tasks, in which time dependencies usually play the discriminative role. The latter is related to the assumption that the model uses:

  • all the discriminative information synthetically provided (in that work, a static shift applied to a portion of the time series),

  • no information outside of it.

We argue that this assumption does not necessarily hold as the model might require just a subset of the discriminative information provided and might use information from outside of the discriminative portion.

In this work we propose an approach for the model-agnostic evaluation of interpretability methods for time-series classification that addresses the various issues just highlighted. The approach consists of two new metrics, namely \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\). The first is the area under the top curve and measures how well the top relevance captures the most important time steps for the neural network, whereas the second—the modified F1 score—is a harmonic mean reflecting the capability of the different interpretability methods to capture both the most and least important time steps. These two metrics evaluate how interpretability methods order time steps according to their importance, referred to as relevance identification. In this paper we also aim to qualitatively evaluate the capacity of the different interpretability methods to reflect the importance of each time step relative to the others. The latter evaluation is referred to as relevance attribution. We note that a key aspect of this work is the training of the models with a random level of perturbation for each batch, in a similar fashion to widely used data-augmentation methods26. This perturbation is later used to corrupt the signal when evaluating interpretability methods, such that the distribution is maintained across the training and perturbed datasets used for the evaluation. This addresses one of the main concerns found in the literature (that is, the shift in distribution when occluding samples in the evaluation set) and does not require retraining through the ROAR approach.

The six interpretability methods we considered are: (1) DeepLift27, (2) GradShap17, (3) Integrated Gradients28, (4) KernelShap17, (5) DeepLiftShap17 and (6) Shapley Value Sampling (also referred to as Shapley sampling or simply Shapley)29. These were chosen to capture a broad range of available interpretability methods while keeping the problem computationally tractable for all of the models presented. These interpretability methods are applied to three neural-network architectures; namely, convolutional (CNN), bidirectional long short-term memory (Bi-LSTM) and Transformer neural networks. The evaluation of the interpretability methods for time-series classification is performed on a new synthetic dataset as well as on two datasets adopted in practical applications. The overall code framework is part of the InterpretTime library, freely available on GitHub (https://github.com/hturbe/InterpretTime).

In summary, the approach proposed and the new synthetic dataset we outlined address the following points:

  1.

    The need for a robust and quantifiable approach to evaluate and rank the performance of interpretability methods over different neural-network architectures trained for the classification of time series. Our approach addresses the issues found in the literature by providing novel quantitative metrics for the evaluation of interpretability methods independent of human judgement19, using an occluded dataset21 and without retraining the model22.

  2.

    The lack of a synthetic dataset with tunable complexity that can be used to assess the performance of interpretability methods, and that is able to reproduce time-series classification tasks of arbitrary complexity. We note that our synthetic dataset differs from ref. 25 as the neural network must learn the time dependencies in the data. Furthermore, the dataset encodes a priori knowledge of the discriminative features, analogous to the BlockMNIST synthetic dataset30. Finally, the classification task is multivariate by design, as the neural network must learn at least two features to predict the correct class. This is a desirable property as real-world datasets are commonly multivariate.

We first present an evaluation of six interpretability methods using the proposed framework across different datasets and model architectures. These results are then discussed to highlight the main trends as well as the potential for the developed metrics to build trust in post-hoc interpretability methods. In Methods we outline the new framework to evaluate interpretability methods for time-series classification, including the novel method used to maintain a constant distribution between the training and evaluation sets, the new metrics and the synthetic dataset.

Results

All of the metrics presented in this section are built on the relevance—denoted by R—that an interpretability method provides along the time series (a more detailed explanation for R is provided in Table 1 and Methods). An example for the ECG time series is depicted in Fig. 1, in which the contour maps represent R and the black lines represent the actual time series the neural network uses to make its prediction. The higher the relevance, the more important the associated portion of the time series is for the neural-network classification task. The metrics are evaluated on three different datasets (synthetic, ECG and FordA) and three different architectures (Bi-LSTM, CNN and Transformer). The hyperparameters and classification metrics for the different models are presented in Supplementary Sections 2 and 3.

Table 1 Notations and symbols used in this paper

We next focus on evaluating the effectiveness of an interpretability method in ordering time steps according to their importance for the neural network’s predictions. This crucial aspect of the evaluation of interpretability methods is also referred to as relevance identification; it is measured by \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\), which are described in Table 1 and Methods.

The ordering of the time steps obtained using the relevance is used to corrupt the top- and bottom-k elements with positive relevance. Here, k refers to the percentage of time steps corrupted with respect to the total number of time steps with positive relevance. The top-k strategy corrupts time steps starting from those with the highest relevance and descending towards lower relevance. Similarly, the bottom-k strategy starts from time steps with the lowest relevance and ascends towards higher relevance. We note that k is only used for calculating the number of elements to corrupt; the evaluation of the interpretability methods is instead performed with respect to the total number of elements in the sample, denoted by \(\tilde{N}\). This makes the evaluation of different interpretability methods independent of the number of time steps assigned positive relevance, basing it on the total number of time steps instead. Figure 2 shows \(\tilde{S}\), that is, the normalized change in score (see equation (5) in Methods) for a Transformer trained on the newly created synthetic dataset. Results for the Bi-LSTM and CNN architectures are presented in Extended Data Figs. 1 and 2. These \(\tilde{S}-\tilde{N}\) curves constitute the basis for computing \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\). The figure contains all six interpretability methods considered in this work, and a baseline (depicted in black) that illustrates \(\tilde{S}-\tilde{N}\) for a random assignment of the relevance. Similar figures for the ECG dataset are presented in Extended Data Figs. 3, 4 and 5, whereas the results obtained on the FordA dataset are in Supplementary Section 1.1. Both \(\tilde{S}\) and \(\tilde{N}\) are detailed in Table 1 and Methods. As mentioned above, points are removed in two ways: from the most important to the least important points identified by the interpretability method (top-k strategy), resulting in the top curve, and from the least important to the most important points (bottom-k strategy), resulting in the bottom curve in Fig. 2.

Fig. 2: \(\tilde{S}\) as a function of the ratio of points removed with respect to \(\tilde{N}\).

Each subfigure represents one of the six interpretability methods considered for a Transformer trained on the synthetic dataset.

The higher the value of \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\), the better the interpretability method has understood which points were the most important for the model to assign the correct class. The smaller the area under the bottom curve, the better the interpretability method has understood which points were least important for the model to assign the correct class. A good trade-off between the two therefore shows that the interpretability method has identified both the most and least important points. The \({\mathrm{F}}1\tilde{S}\) metric represents the harmonic mean between the capacity to extract the most and least relevant time steps. A higher score—as with the \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) metric—represents a better relevance identification performance.

Table 2 shows \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\) for all of the datasets, interpretability methods and neural-network architectures considered. Furthermore, the observed drop in accuracy for samples being progressively corrupted is presented in Extended Data Figs. 6 and 7 for the synthetic and ECG datasets, respectively, whereas the results for the FordA datasets are presented in Supplementary Section 1.2.

Table 2 \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\) metrics for all of the datasets, interpretability methods and neural networks considered in this work

The relevance identification evaluated through the two metrics presented above focuses on assessing how the relevance produced by interpretability methods allows ordering of time steps to extract the most (or least) relevant time steps for a model. Another important aspect of interpretability methods is their capacity to estimate the relative effect of a given time step on the final prediction. We call this aspect relevance attribution. Building on properties of the interpretability methods included in this work, the relevance attribution of interpretability methods is evaluated qualitatively using curves of the adjusted normalized change in score \({\tilde{S}}_{\mathrm{A}}\) (defined in equation (11) in Methods) versus the time-series information content (TIC) index, which measures the proportion of positive relevance contained in the corrupted portions of the time series. Figure 3 shows \({\tilde{S}}_{\mathrm{A}}\) as a function of the TIC index measured on the ECG dataset, and allows a qualitative evaluation of the relevance attribution performance of interpretability methods. If a curve is above the theoretical unit linear slope (depicted as dashed black lines), the interpretability method underestimates the influence of the corrupted time steps on the model’s prediction. The opposite is true if the curve lies below the unit slope. The evaluation of the relevance attribution can therefore be seen as a measure of how well calibrated an interpretability method is in terms of the relevance it assigns to the different time steps with respect to their importance for the model’s predictions. Similar figures for the synthetic and FordA datasets are presented in Extended Data Fig. 8 and Supplementary Section 1.3, respectively.

Fig. 3: \({\tilde{S}}_{\mathrm{A}}\) as a function of the TIC index for the six interpretability methods considered using the ECG dataset.

a–c, Results for the Bi-LSTM (a), CNN (b) and Transformer (c) architectures.

Discussion

This paper presents a new evaluation method and a set of evaluation metrics for post-hoc interpretability to answer the question posed in the introduction: which method produced an interpretability map closest to the one actually adopted by the neural network to make its prediction? The two new metrics, \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\), allow quantification of the relevance identification performance of an interpretability method and can be used to, for example, rank interpretability methods. These two metrics agree in identifying Shapley as the best performing method (see Table 2).

Focusing on the \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) values presented in Table 2, Shapley consistently outperforms the other interpretability methods across the different datasets and architectures (except for the CNN trained on the ECG dataset). The \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) metric reflects the capacity of Shapley to extract the most important time steps for a model prediction. Shapley is, however, the most computationally intensive interpretability method of the ones tested in this paper. It is therefore convenient to look for alternatives, which depend on the type of architecture selected. Integrated Gradients is the second best interpretability method for Bi-LSTM networks, whereas DeepLiftShap is ranked second for CNN. The results are slightly less clear for Transformer, where GradShap and Integrated Gradients have very similar performances.

In addition to \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\), the \({\mathrm{F}}1\tilde{S}\) metric measures the ability of different interpretability methods not only to select the most important time steps but also the least important ones. The rankings produced using the two metrics are consistent with one another for both Transformer and Bi-LSTM, while favouring DeepLiftShap for CNN.

We addressed the issues identified in the literature to obtain reliable results for \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\). In particular, we evaluated the interpretability methods avoiding human judgement, and did not retrain the model, while also avoiding a distribution shift between the training and occluded sets used to evaluate the interpretability methods. The distribution shift is one of the main concerns found in the interpretability literature. The method proposed in this paper (described in detail in Methods) addresses this issue and thus the drop in score observed in Fig. 2 as the samples are progressively corrupted cannot be attributed to a distribution shift. We also note that a larger drop in score is systematically observed when corrupting the most relevant time steps (identified by the interpretability method) as compared with corrupting a random selection of time steps (black lines in Fig. 2), as expected. The approach presented in this paper allows quantitative evaluation of interpretability methods without retraining, while avoiding a distribution shift between the training and evaluation sets.

The two metrics, \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\), along with the majority of the literature on the evaluation of interpretability methods, focus on relevance identification (that is, ranking time steps according to their importance). In this work we also make a first step towards evaluating relevance attribution, which assesses how the relevance reflects the relative importance of each time step compared with the others. The attribution is qualitatively evaluated using the \({\tilde{S}}_{A}-\,{{\mbox{TIC}}}\,\) curves (Fig. 3). These curves provide an understanding of the ability of an interpretability method to correctly weigh relevance and are compared with a newly derived theoretical estimation (derived in Supplementary Section 4). The relevance attribution performance varies between the different neural networks tested and also changes between datasets. The common denominator is the inability of the interpretability methods to follow the theoretical estimate. This indicates that the relevance attributed to each time step does not reflect the relative importance of this time step in the classification task. The attributed relevance instead acts more as a ranking of the most important time steps among themselves. For example, a point with a relevance of 0.1 for a total classification score of 1 might not necessarily account for 10% of the final prediction, but will be more important than a point with a relevance equal to 0.05. Albeit qualitative, these curves may be used to visually assess whether an interpretability method provides a balanced relevance.

As part of this work, we also provide a new synthetic dataset that can be used to evaluate interpretability methods (a sample of which can be found in Fig. 4). The new dataset forces the neural network to learn time dependencies as opposed to static information, and the discriminative portions of the time series are known a priori. Furthermore, the dataset is multivariate by construction, which is a desirable property especially when trying to mimic real-world (that is, non-synthetic) datasets. The performance of the interpretability methods on the new synthetic dataset is consistent with the performance obtained on the two real-world datasets tested as part of this work, namely FordA and ECG. The designed dataset hence acts as a good proxy for real-world classification tasks with two convenient properties: its complexity and properties are tunable, and its generation is lightweight. As such, it can complement real-world datasets for a range of different research objectives within the context of evaluating post-hoc interpretability methods in time-series classification, given its known multivariate and time-dependent discriminative properties.

Fig. 4: Sample from the synthetic dataset.

Subfigures show the six features of the sample. The classification task for the synthetic dataset is to classify whether the sum of the frequencies of two sine waves with a compact support of 100 time steps is above a specific threshold. These waves can be observed in features 2 and 6 of this sample.

Finally, we assessed the usefulness of interpretability methods validated with our evaluation framework in an operational setting. In particular, we used the ECG clinical dataset because it provides a good example of how interpretability can be used once the interpretability methods have been evaluated. The use of clinical data was favoured because the healthcare sector will probably become highly regulated and thus require accurate interpretability of AI technologies12. To this end, we interacted with clinicians to understand a common disease that is representative in the context of ECGs, and that can be of interest to them. This turned out to be the well-studied cardiac disease, right bundle branch block (RBBB). In the classification task presented in Fig. 1 for the ECG data, the neural networks were trained to classify RBBB. Shapley is able to pinpoint a specific and compact region of interest in the time series, whereas the other methods provide interpretability maps that are less compact (in the case of KernelShap, a sparse map without a clear region). The feature highlighted by the Shapley relevance map corresponds to one of the morphological features cardiologists look at to diagnose the disease of interest31. The interpretability method also shows that the trained model relies almost entirely on a single lead to predict the disease in question, namely RBBB; however, other diagnostic criteria focusing on different leads are commonly used. This type of analysis provides practical insights into how trained models will perform in an applied operational setting, and may help identify possible biases, spurious correlations and potential corrective actions. Moving forwards, it is of interest to understand how these analyses could be implemented for regulatory purposes, for example, and deployed as part of AI-based technologies in new high-risk applications.

Methods

Tackling distribution shift

A long-standing issue when evaluating interpretability methods has been the shift in distribution between the training dataset and the corrupted dataset used for the evaluation. Interpretability methods have frequently been evaluated by comparing the drop in score when the most relevant time steps are corrupted with the score of the initial sample. The ROAR approach was proposed to address this issue22; however, retraining the model of interest comes with its own drawbacks, as discussed in the main text. The training method presented next addresses this issue by maintaining a constant distribution between the training dataset and the one used to evaluate the interpretability methods. To achieve this, the models presented in this paper were trained with random perturbations applied to the time series. This method was inspired by data-augmentation strategies commonly used when training models for image classification, object detection and other image-based tasks, for which random cropping has been shown to improve the classification performance of the developed model as well as its robustness32. In this work, the aim of the perturbations in the training set is not to improve the performance of the model but to maintain an identical distribution between the training dataset and the corrupted samples used to evaluate the interpretability methods.

Similarly to the random cropping applied to images, part of the time series is corrupted by substituting the initial time steps with points drawn from a normal distribution \(\sim {{{\mathcal{N}}}}(0,1)\). This distribution follows the normalization applied as a preprocessing step to the samples. In a similar fashion to DropBlock33, consecutive time steps (or blocks) are corrupted. The augmentation is applied per batch, with the overall fraction of the time series being corrupted (γ) and the size of the blocks (β) being sampled from the following uniform distributions:

$$\gamma \sim {{{\mathcal{U}}}}(0,0.8)$$
(1)
$$\beta \sim {{{\mathcal{U}}}}(1,7)$$
(2)

Given the method described above, when specific time steps are corrupted to evaluate the interpretability methods, the distribution matches that of the samples used to train the model; β is chosen to reflect the range of consecutive time steps above the median positive relevance empirically observed over the datasets used; γ was also empirically chosen to cover most of the samples, as positive relevance is rarely assigned to more than 80% of the total number of time steps in a given sample. The change in score observed when corrupting time steps therefore cannot be attributed to a shift in the distribution, and hence fully reflects a loss of information for the model as measured by the interpretability methods. This approach addresses the distribution shift in the evaluation of interpretability methods without requiring retraining of the model. The latter point is important as it is not possible to assert that a retrained model uses the same time steps as the initial one, which is the model the interpretability methods aim to explain.
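As an illustration, a minimal per-batch implementation of this block corruption could look as follows. The placement of the blocks and the discretization of β are our assumptions; this is a sketch of the idea, not the InterpretTime training code.

```python
import torch

def corrupt_batch(x: torch.Tensor) -> torch.Tensor:
    """Randomly corrupt a batch of normalized time series of shape
    (batch, n_features, n_timesteps) by replacing consecutive blocks of
    time steps with N(0, 1) noise, in a DropBlock-like fashion."""
    batch, m, t = x.shape
    gamma = torch.empty(1).uniform_(0.0, 0.8).item()   # fraction of corrupted time steps
    beta = int(torch.randint(1, 8, (1,)).item())       # block length, sampled in {1, ..., 7}
    n_blocks = int(gamma * t / beta)
    x = x.clone()
    for _ in range(n_blocks):
        # block positions are shared across the batch here for simplicity (an assumption)
        start = int(torch.randint(0, t - beta + 1, (1,)).item())
        x[:, :, start:start + beta] = torch.randn(batch, m, beta)
    return x
```

Applying `corrupt_batch` to each training batch keeps the corrupted evaluation samples within the distribution seen during training.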

Novel approach for evaluating post-hoc interpretability methods

The time-series classification task considered in this paper can be formalized as follows (the symbols and notation adopted are also reported in Table 1). Given a trained neural-network model, f, we aim to map a set of features \({{{\bf{X}}}}\in {{\mathbb{R}}}^{M\times T}\) to a labelled target \(\,{{\mbox{C}}}\,\in {{\mathbb{N}}}^{{N}_{c}}\) for each sample i contained in a given dataset \({{{{\mathcal{D}}}}}_{i}={[{{{\bf{X}}}},{{\mbox{C}}}]}_{i}\,\,{{\mbox{for}}}\,\,i=1,\ldots ,J\), where M is the number of features in X, T is the number of ordered time steps per feature, Nc is the number of labels and J is the total number of samples available. Typically, a final dense layer produces logits that are then fed to a softmax layer to output the probability of sample i belonging to a given class C \(\in {{\mathbb{N}}}^{{N}_{c}}\).

To quantitatively assess time-series interpretability methods, we developed novel metrics, or indices, to evaluate how closely an interpretability method reflects the representation learned by the model of interest. Interpretability methods produce an attribution scheme \({{{\mathcal{A}}}}\) that assigns relevance R to the input X for a specific class c ∈ C such that \({{{{\mathcal{A}}}}}_{c}:{{{\bf{X}}}}\to \left\{{{{\bf{R}}}}\in {{\mathbb{R}}}^{M\times T}\right\}\), where X = (xm,t) and R = (rm,t), with m and t being the indices associated with M and T, respectively. For simplicity, the class that the attribution scheme aims to explain is dropped for the rest of the paper and the attribution scheme is denoted by \({{{\mathcal{A}}}}\). The new metrics are built on the relevance that an interpretability method provides along the time series. The relevance can be positive or negative (except for some interpretability methods, where it is only positive; see, for example, the saliency method34). A positive relevance means that the neural network is using that portion of the time series to make its prediction. A negative relevance indicates that the neural network sees the portion of the time series as going against its prediction. As we are interested in how the network is using data to make its predictions, we use the positive relevance to build the new metrics. Logits have often been used as the input for interpretability methods34,35. However, Srinivas and colleagues36 demonstrate that pre-softmax outputs are related to a generative model that is uninformative of the discriminative model used for the classification task. For this reason, for the rest of the paper the interpretability methods are applied to the post-softmax output of the models and evaluated with changes in this output when corrupting samples, denoted by \(S:{{\mathbb{R}}}^{M\times T}\to {[0,1]}^{{N}_{c}}\).

In this work we aim to build a framework for evaluating the performance of interpretability methods. To this end, we chose six interpretability methods that capture a broad range of the methods available, while keeping the problem computationally tractable. These are: (1) DeepLift27, (2) GradShap17, (3) Integrated Gradients28, (4) KernelShap17, (5) DeepLiftShap17 and (6) Shapley29. Their implementation uses the Captum library15.
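For illustration, applying two of these methods through Captum on a toy model looks roughly as follows. The toy model, input shapes and zero baseline are placeholders of our choosing; only the Captum calls reflect the library's interface.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients, ShapleyValueSampling

# Toy stand-in for a trained classifier mapping (batch, M, T) to post-softmax probabilities
model = nn.Sequential(nn.Flatten(), nn.Linear(6 * 100, 2), nn.Softmax(dim=1))

x = torch.randn(1, 6, 100)              # one multivariate time-series sample (placeholder)
baseline = torch.zeros_like(x)          # uninformative reference (see 'Baseline' below)
target = int(model(x).argmax(dim=1))    # explain the predicted class

# The returned relevance has the same shape as the input, (1, 6, 100)
relevance_ig = IntegratedGradients(model).attribute(x, baselines=baseline, target=target)
relevance_sh = ShapleyValueSampling(model).attribute(x, baselines=baseline, target=target)
```

Shapley Value Sampling is the most expensive of the two, as it requires many forward passes per feature.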

Relevance identification and attribution

We need to tackle two aspects to assess the performance of interpretability methods: relevance identification and relevance attribution. We next detail these two concepts along with the methods developed to measure them.

Relevance identification

The concept behind relevance identification is that interpretability methods should correctly identify and order, according to their relevance, the set of points (in our case time steps) used by the model to make its predictions. Extending the assumption formulated by Shah and co-workers30, time steps with larger relevance are more important for the model to make a prediction than those with smaller relevance. The relevance produced by an attribution scheme can be used to create an ordering that ranks feature m at time step t (namely xm,t) from a sample X according to its importance for the model’s prediction. It is important to remember that here we only focus on the positive relevance. The ordering can then be used to define \({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{top}}}}}\) and \({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{bottom}}}}}\), which represent the samples with the top- and bottom-k time steps corrupted, respectively. Time steps are ordered using the assigned relevance and corrupted as follows:

$${\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{top}}}}}=\left\{\begin{array}{ll}{x}_{m,t}\quad &\,{{\mbox{if}}}\,\,{r}_{m,t} < {Q}_{{R}^{+}}(1-k)\\ {{{\mathcal{N}}}}(0,1)\quad &\,{{\mbox{otherwise}}}\,\end{array}\right.$$
(3)
$${\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{bottom}}}}}=\left\{\begin{array}{ll}{x}_{m,t}\quad &\,{{\mbox{if}}}\,\,{r}_{m,t} > {Q}_{{R}^{+}}(k)\\ {{{\mathcal{N}}}}(0,1)\quad &\,{{\mbox{otherwise}}}\,\end{array}\right.$$
(4)

where \({Q}_{{R}^{+}}(p)\) denotes the p-quantile (where p = 1 − k for \({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{top}}}}}\) and p = k for \({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{bottom}}}}}\)) of the set of positive attributed relevance in a given sample, R+, measured over sample X using attribution scheme \({{{\mathcal{A}}}}\), with R+ = {rm,t ∣ rm,t > 0}. The rest of the analysis is performed for the following set of top- and bottom-k percentages of time steps with positive relevance: k ∈ {0.05, 0.15, 0.25, …, 0.95, 1}.
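A literal NumPy transcription of equations (3) and (4) might read as follows; the function and variable names are ours, with x a sample and r its relevance, both of shape (M, T).

```python
import numpy as np

def corrupt(x, r, k, strategy="top", rng=None):
    """Corrupt the top- (or bottom-) k fraction of the positively relevant time
    steps by replacing them with draws from N(0, 1), per equations (3)-(4)."""
    rng = rng or np.random.default_rng()
    r_pos = r[r > 0]
    if strategy == "top":
        keep = r < np.quantile(r_pos, 1 - k)   # keep points below the (1 - k)-quantile of R+
    else:
        keep = r > np.quantile(r_pos, k)       # keep points above the k-quantile of R+
    return np.where(keep, x, rng.standard_normal(x.shape))
```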

In the general case of a modified sample \(\bar{{{{\bf{X}}}}}\), where \(\bar{N}\) points along the time series are corrupted, we can define the normalized difference in score:

$$\tilde{S}(\bar{{{{\bf{X}}}}})=\frac{S({{{\bf{X}}}})-S(\bar{{{{\bf{X}}}}})}{S({{{\bf{X}}}})}$$
(5)

It is possible to build \(\tilde{S}\) versus \(\tilde{N}\) curves for top- or bottom-k points (or time steps) corrupted, where \(\tilde{N}=\frac{\bar{N}}{N}\) is the fraction of points removed with respect to the total number of time steps N = M × T present in the time series. The area under the \(\tilde{S}-\tilde{N}\) curve is denoted as:

$${\mathrm{AUC}}\tilde{S}=\int\nolimits_{0}^{1}\tilde{S}\,\,{{\mbox{d}}}\,\tilde{N}$$
(6)

Using equation (6), we can define \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) and \({\mathrm{F}}1\tilde{S}\). The former evaluates the ability of an interpretability method to recover the most important time steps for a model’s prediction. In this sense, the area under the drop in score when the top-k time steps are progressively corrupted should be maximized. To normalize for interpretability methods assigning positive relevance to a different number of time steps, \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) is measured on a modified \(\tilde{S}-\tilde{N}\) curve, created by adding an extra point with coordinates \(\left(\tilde{N}=1;\tilde{S}=\tilde{S}({\bar{{{{\bf{X}}}}}}_{k = 1}^{{{{\rm{top}}}}})\right)\). Adding this point favours interpretability methods that achieve a large drop in score with a minimal number of time steps assigned positive relevance.
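A sketch of the area computation of equation (6), including the extra end point for the top curve, is given below; the use of trapezoidal integration and of a (0, 0) starting point are our assumptions.

```python
import numpy as np

def auc_s(n_tilde, s_tilde, extend_to_one=True):
    """Area under an S~-N~ curve (equation (6)). For the top curve, the last S~
    value is carried to N~ = 1, favouring methods that achieve a large drop in
    score with few positively relevant time steps."""
    n = np.concatenate(([0.0], np.asarray(n_tilde, dtype=float)))
    s = np.concatenate(([0.0], np.asarray(s_tilde, dtype=float)))
    if extend_to_one and n[-1] < 1.0:
        n = np.append(n, 1.0)
        s = np.append(s, s[-1])   # S~(X_bar_{k=1}^{top}) carried to N~ = 1
    return np.trapz(s, n)
```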

The \({\mathrm{F}}1\tilde{S}\) metric builds a harmonic mean between the ability of an attribution scheme to correctly rank the time steps with the highest and smallest relevance, respectively. Corrupting time steps with high relevance should result in a substantial drop in score (the model’s outputs are greatly affected). Corrupting time steps with small relevance should result in a negligible drop in score (the model’s outputs are negligibly affected). The best attribution scheme should therefore maximize \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}\) for top-k corruptions and minimize \({\mathrm{AUC}}{\tilde{S}}_{{{{\rm{bottom}}}}}\) for bottom-k corruptions. Reflecting this desired property, we define the following F1 score:

$${\mathrm{F}}1\tilde{S}=\frac{{\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}(1-{\mathrm{AUC}}{\tilde{S}}_{{{{\rm{bottom}}}}})}{{\mathrm{AUC}}{\tilde{S}}_{{{{\rm{top}}}}}+(1-{\mathrm{AUC}}{\tilde{S}}_{{{{\rm{bottom}}}}})}$$
(7)
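A direct transcription of equation (7), assuming the two areas have been computed as sketched above (the function name is ours):

```python
def f1_s(auc_top: float, auc_bottom: float) -> float:
    """Equation (7): large when the top-curve area is large
    and the bottom-curve area is small."""
    return auc_top * (1.0 - auc_bottom) / (auc_top + (1.0 - auc_bottom))
```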

Relevance attribution

The idea behind relevance attribution is that the relevance should not only serve to order the time steps but also reflect the individual contribution of each time step, relative to the others, towards the model’s predicted score. All interpretability methods presented in this work are additive feature attribution methods, as defined by Lundberg and colleagues17. The produced relevance therefore aims to linearly reflect the effect of each feature on the model’s outputs. Focusing on the positive relevance, the difference between the model’s prediction on an initial sample, S(X), and on a version with the top-k points corrupted, \(S({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{top}}}}})\), depends on the positive relevance corrupted between the two samples. The proportion of the total positive relevance attributed to the corrupted time steps is summarized by the time-series information content (TIC) index, defined as:

$$\,{{\mbox{TIC}}}\,(k)=\frac{{\sum }_{{R}_{k}^{+}}{r}_{m,t}}{{\sum }_{{R}^{+}}{r}_{m,t}+\epsilon }$$
(8)

where the following sets are defined:

$${R}^{+}=\{{r}_{m,t}| {r}_{m,t} > 0\}$$
(9)
$${R}_{k}^{+}=\{{r}_{m,t}\,|\,{r}_{m,t}\in {R}^{+}\wedge {r}_{m,t}\ge {Q}_{{R}^{+}}(1-k)\}$$
(10)

The TIC index reflects the ratio of the relevance attributed to the top-k set of points to the total positive relevance. Taking the model’s output when all of the positive relevance is corrupted as a reference, we can normalize the change in score as follows:

$${\tilde{S}}_{A}(k)=\frac{S({{{\bf{X}}}})-S\left({\bar{{{{\bf{X}}}}}}_{k}^{{{{\rm{top}}}}}\right)}{S({{{\bf{X}}}})-S\left({\bar{{{{\bf{X}}}}}}_{k = 1}^{{{{\rm{top}}}}}\right)}$$
(11)

where \(S({\bar{{{{\bf{X}}}}}}_{k = 1}^{{{{\rm{top}}}}})\) corresponds to the model’s output when all time steps with positive relevance in sample X are corrupted. We name the quantity in equation (11) the adjusted normalized drop in score. Given the linear additivity property of the relevance, the index \({\tilde{S}}_{A}(k)\) should be equal to the TIC(k) index (see Supplementary Section 4), such that the information ratio satisfies

$$\,{{\mbox{IR}}}\,=\frac{{\tilde{S}}_{A}(k)}{\,{{\mbox{TIC}}}\,(k)}=1$$
(12)

Given this theoretical approximation, it is possible to evaluate how different interpretability methods over- or underestimate the role of different time steps in the model’s prediction. An information ratio larger than one indicates that the relevance of the corrupted points was underestimated, whereas the opposite is true for an information ratio smaller than one. An example of a \({\tilde{S}}_{A}(k)-\,{{\mbox{TIC}}}\,(k)\) curve is depicted in Fig. 3, where we report the results for every interpretability method considered as well as the theoretical linear relation (dashed).
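The two quantities plotted in Fig. 3 can be sketched as follows; the function names and the small eps follow our own conventions, and scores are the post-softmax outputs for the explained class.

```python
import numpy as np

def tic(r, k, eps=1e-12):
    """Time-series information content (equation (8)): share of the total
    positive relevance carried by the top-k positively relevant time steps."""
    r_pos = r[r > 0]
    r_top = r_pos[r_pos >= np.quantile(r_pos, 1 - k)]
    return r_top.sum() / (r_pos.sum() + eps)

def s_tilde_a(s_x, s_top_k, s_top_all):
    """Adjusted normalized drop in score (equation (11)): drop caused by the
    top-k corruption, relative to corrupting all positively relevant steps."""
    return (s_x - s_top_k) / (s_x - s_top_all)
```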

Datasets

The new interpretability evaluation approach has been applied to: a new synthetic dataset created for this work, a standard univariate dataset for anomaly classification and a biomedical dataset based on ECG signals. The three datasets are described below.

A new synthetic dataset

The evaluation of interpretability methods for time-series classification has been lacking a dataset (1) where the discriminative features are known and (2) that replicates the complexity of common time-series classification tasks with time dependencies across features. The developed dataset is inspired by the BlockMNIST dataset, which is derived from the MNIST dataset30. Each sample in the dataset comprises six features, each with 500 time steps corresponding to Δt = 2 ms. Each feature comprises a sine wave with an amplitude of 0.5 and a frequency drawn from \({{{\mathcal{U}}}}(2,5)\), which serves as a random baseline. In two of the features—picked randomly for each sample—sine waves with a support of 100 time steps are added to the baseline at a random position in time.

The respective frequencies of these sine waves, f1 and f2, are drawn from a discrete uniform distribution \({{{\mathcal{U}}}}(10,50)\). In the remaining four features, a square wave is included with a probability of 0.5 and a frequency drawn from \({{{\mathcal{U}}}}(10,50)\).

The classification task then consists of predicting whether the sum of the two frequencies (f1 and f2) is above or below a given threshold τ. For the presented task, τ was set to 60 to balance the classes of the classification target y such that:

$$y=\left\{\begin{array}{ll}0,\quad &\,{{\mbox{if}}}\,\,{f}_{1}+{f}_{2} < \tau \\ 1,\quad &\,{{\mbox{otherwise}}}\,\end{array}\right.$$
(13)

The main idea behind the developed dataset is to force the network to learn temporal dependencies, that is, the frequency of the sine waves with compact support, as well as dependencies across features, that is, the sum of the frequencies f1 and f2. An example of a generated sample is presented in Fig. 4. The compact-support sine waves used to create the classification target are observed in features 2 and 6. We note that the synthetic dataset proposed here can be regarded as a family of datasets, as the number of features, the length of the time series, the class imbalance and the discriminative features are all tunable.
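A minimal generator following this description is sketched below (six features, 500 steps at Δt = 2 ms, two compact-support sine waves of 100 steps, label from equation (13)). The amplitudes of the compact-support and square waves are not specified above and are our assumptions.

```python
import numpy as np

def make_sample(rng, n_features=6, n_steps=500, dt=2e-3, support=100, tau=60):
    """Generate one synthetic sample x of shape (n_features, n_steps) and its label y."""
    t = np.arange(n_steps) * dt
    # Random baseline in every feature: sine wave with amplitude 0.5 and frequency ~ U(2, 5)
    base_freq = rng.uniform(2, 5, size=n_features)
    x = 0.5 * np.sin(2 * np.pi * base_freq[:, None] * t[None, :])

    # Two randomly picked features carry compact-support sine waves of 100 steps
    f = rng.integers(10, 51, size=2)                      # discrete U(10, 50)
    chosen = rng.choice(n_features, size=2, replace=False)
    for feat, freq in zip(chosen, f):
        start = rng.integers(0, n_steps - support + 1)
        x[feat, start:start + support] += np.sin(2 * np.pi * freq * t[:support])

    # Each remaining feature includes a square wave with probability 0.5
    for feat in set(range(n_features)) - set(chosen.tolist()):
        if rng.random() < 0.5:
            freq = rng.integers(10, 51)
            x[feat] += 0.5 * np.sign(np.sin(2 * np.pi * freq * t))  # amplitude is an assumption

    y = int(f[0] + f[1] >= tau)   # equation (13); tau = 60 balances the classes
    return x, y

rng = np.random.default_rng(0)
x, y = make_sample(rng)           # x.shape == (6, 500)
```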

FordA

The FordA dataset is part of the UCR Time Series Classification Archive, which groups different datasets for time-series classification37. FordA is a univariate, binary classification task. The data originate from an automotive subsystem and the classification task aims to find samples with a specific anomaly. The dataset comes with a training (n = 3,601) and testing (n = 1,320) split, which was retained in this paper. The dataset is of interest as it has often served as a benchmark for classification algorithms38 as well as for benchmarking interpretability methods24.
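One possible way to obtain FordA is through the tslearn wrapper around the UCR/UEA archive; the paper does not prescribe a specific loader, so this is only a convenience example.

```python
from tslearn.datasets import UCR_UEA_datasets

# Downloads the archive on first use; FordA is univariate with 500 time steps per sample
X_train, y_train, X_test, y_test = UCR_UEA_datasets().load_dataset("FordA")
print(X_train.shape, X_test.shape)   # expected: (3601, 500, 1) (1320, 500, 1)
```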

ECG dataset

To mimic a real-world classification task, we applied the interpretability framework to an ECG dataset. Electrocardiography records the electrical activity of the heart and typically produces twelve signals, corresponding to twelve sensors or leads. For this task, a subset of the classification of twelve-lead ECGs challenge (The PhysioNet—Computing in Cardiology Challenge 202039, published under a Creative Commons Attribution 4.0 License) was used. The dataset was narrowed down to the CPSC subset40, which includes 6,877 ECGs annotated for nine cardiovascular diseases. Among these annotations, we chose to classify the ECGs for the presence or absence of a right bundle branch block (RBBB). The dataset includes 5,020 cases showing no sign of an RBBB and 1,857 cases annotated as carrying an RBBB; RBBB has been found to be associated with higher cardiovascular risks as well as mortality41.

The data were first denoised using different techniques for low- and high-frequency artifacts. The baseline wander as well as low-frequency artifacts were first removed by performing empirical mode decomposition (EMD). The instantaneous frequency is computed and averaged across each intrinsic mode resulting from the EMD to obtain an average frequency per mode. Modes with an average frequency below 0.7 Hz are then discarded and the signal is reconstructed with the remaining modes. The threshold is based on the literature, where thresholds range between 0.5 and 1 Hz (refs. 42,43,44). Given the difficulty of separating high-frequency noise using EMD, power-line and other high-frequency noise is removed by thresholding the wavelet-transform coefficients using the ‘universal threshold’45.

To obtain an average beat, the R-peaks of each ECG are extracted using the BioSPPy library46. The beats centred around the R-peaks are then extracted from the ECG by taking 0.35 s before and 0.55 s after each R-peak. The mean of the extracted beats is then computed to obtain an average beat for each lead. An example of the initial signal alongside the transformed one for a subset of the twelve leads is presented in Extended Data Figs. 9 and 10. The resulting twelve averaged leads were used to train the model.
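A sketch of this beat-averaging step for a single denoised lead, using BioSPPy for R-peak detection, is given below. A 500 Hz sampling rate is assumed here, and the EMD and wavelet denoising steps described above are omitted.

```python
import numpy as np
from biosppy.signals import ecg

def average_beat(lead: np.ndarray, fs: float = 500.0,
                 pre: float = 0.35, post: float = 0.55) -> np.ndarray:
    """Detect R-peaks in one ECG lead, extract windows of `pre` s before and
    `post` s after each peak, and return their mean (the average beat)."""
    rpeaks = ecg.ecg(signal=lead, sampling_rate=fs, show=False)["rpeaks"]
    w_pre, w_post = int(pre * fs), int(post * fs)
    beats = [lead[r - w_pre:r + w_post] for r in rpeaks
             if r - w_pre >= 0 and r + w_post <= len(lead)]
    return np.mean(beats, axis=0)
```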

Baseline for interpretability methods

The interpretability methods used as part of this work require an uninformative baseline to be set as a reference. Most methods (Integrated Gradients, DeepLift, Shapley, KernelShap) require a single sample as the baseline. For those methods, the baseline was set as the mean taken across samples for each time step. GradShap and DeepLiftShap use a distribution of baselines, which was constructed by taking 50 random samples from the test set.
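In code, the two kinds of baselines described above could be built as follows. X_test is a placeholder array standing in for the real set of average beats of shape (n_samples, 12, T); the count of 50 samples follows the text.

```python
import numpy as np
import torch

X_test = np.random.randn(200, 12, 450).astype(np.float32)   # placeholder for the real test set

# Single-sample baseline (Integrated Gradients, DeepLift, Shapley, KernelShap):
# the mean over samples at every lead and time step
baseline_single = torch.from_numpy(X_test.mean(axis=0, keepdims=True))

# Distribution of baselines (GradShap, DeepLiftShap): 50 random test samples
rng = np.random.default_rng(0)
baseline_dist = torch.from_numpy(X_test[rng.choice(len(X_test), size=50, replace=False)])
```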

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.