InterpretTime: a new approach for the systematic evaluation of neural-network interpretability in time series classification

We present a novel approach to evaluate the performance of interpretability methods for time series classification, and propose a new strategy to assess the similarity between domain experts' and machines' data interpretation. The novel approach leverages a new family of synthetic datasets and introduces new interpretability evaluation metrics. It addresses several common issues encountered in the literature, and clearly depicts how well an interpretability method captures a neural network's data usage, providing a systematic interpretability evaluation framework. The new methodology highlights the superiority of Shapley Value Sampling and Integrated Gradients for interpretability in time-series classification tasks.


Introduction
Time series, sequences of indexed data that follow a specific time order, are ubiquitous. They can describe physical systems [1], such as the state of the atmosphere and its evolution; social and economic systems [2], such as the financial market; and biological systems [3], such as the heart and the brain via ECG and EEG signals, respectively. The availability of this type of data is increasing, and so is the need for automated analysis tools capable of extracting interpretable and actionable knowledge from it. To this end, artificial intelligence (AI) technologies, and neural networks in particular, are opening the path towards highly accurate predictive tools for time-series regression [4] and classification [5] learning tasks. Yet, the interpretability of the results produced by these tools is still lacking, undermining their wider adoption in critical sectors. A key issue here is the lack of a systematic and accurate evaluation methodology for interpretability methods. This prevents practitioners from adopting the most suitable and accurate interpretability method for the task at hand, an aspect that is now strongly demanded in several applications. The lack of a systematic and accurate interpretability evaluation framework is the gap we aim to fill in this work.
Different definitions of what it means for a neural-network model to be interpretable have been formulated. Most of these definitions can be summarised under two categories: transparency and post-hoc explanation [6]. Transparency refers to how a model and its individual constituents work. Post-hoc explanation refers to how a trained model makes predictions and uses the input features it is given. In this work, we consider post-hoc explanation applied to time series classification, because it is seen as the key to meeting recent regulatory requirements [7] and translating current research efforts into real-world applications, especially in high-risk areas such as healthcare [8]. Post-hoc explanation methods assign a relevance score to each feature of a sample, reflecting its importance to the model for the classification task being performed. The ability to express the specific features used by a neural network to classify a given sample can help humans assess the reliability of the classification produced, and allows comparing the model's predictions with existing knowledge. It also provides a way to uncover possible model biases which could lead to the failure of the model in a real-world setting.
A range of methods to provide post-hoc explanation have been developed in the past few years to interpret classification results. These are mainly focused on natural language processing (NLP) and image classification tasks. More recently, with the growing interest in neural-network interpretability, leading actors in the machine learning community built a range of post-hoc interpretability methods. As part of this effort, Facebook recently released the Captum library to group a large number of interpretability methods under a single development framework [9]. While these initiatives allow researchers to use the different interpretability methods more easily, they do not provide a systematic and comprehensive evaluation of those methods on data with different characteristics and across neural-network architectures. A systematic methodology that provides an accurate evaluation of these methods is of paramount importance to allow their wider adoption, and to measure how trustworthy the results they provide are.
The evaluation of interpretability methods was initially based on a heuristic approach, where the relevance attributed to the different features was compared to the expectation of an observer for common image classification tasks [10], or of a domain expert for more complex tasks [11,12]. However, these works had a common pitfall: they assumed the representation of a task learned by a neural network should use the same features as a human expert. The community later moved towards the idea that the evaluation should be independent of human judgement [13]. This paradigm shift was supported by the evidence that certain saliency methods, while looking attractive to human experts, produced results independent of the model they aimed to explain, thereby failing the interpretability task [14]. More recent evaluations were performed by occluding (also referred to as corrupting) the most relevant features identified, and comparing the drop in score observed between the model's predictions on the initial and modified samples [15]. This evaluation method was later questioned: corrupting the images changes the distribution of the sample's values, so the observed drop in score might be caused by this shift in distribution rather than by actual information being removed [16]. To address this issue, an approach named ROAR was proposed [16], where important pixels are removed from both the train and test sets. The model is then retrained on the corrupted (i.e., occluded) samples, and the drop in score is measured on this newly trained model. This method has the benefit of maintaining a similar distribution across the train and test sets. Yet, we argue that it does not necessarily explain which features the initial network used to make its prediction. Rather, it highlights properties of the dataset with regard to its target, such as the redundancy of the information present in the features that are indicative of a given class, a limitation that was acknowledged by the authors.
Neural-network interpretability for time-series data was only recently explored. Initial efforts applied some of the interpretability methods introduced for NLP and image classification to univariate time series, and evaluated the drop in score obtained by corrupting the most relevant parts of the signal [17]. An evaluation of some interpretability methods was recently proposed [18], with a dataset designed to address the issue of retaining an equal distribution between the initial and the occluded dataset. A possible drawback of the proposed dataset is that static properties of the samples, such as the mean, can be used by the neural network to classify a sample. Hence, this task might not reflect the complexity of "real-world" time-series classification tasks, where time dependencies usually play the discriminative role. In addition, the paper lacks a robust evaluation of the different methods independent of human judgement, as it is expected that the model uses all the redundant information provided in the dataset.
In this work, we address the several issues plaguing existing interpretability evaluation studies, providing a new, accurate approach for the model-agnostic evaluation and benchmarking of interpretability methods for time series classification. In figure 1, we depict the interpretability workflow underlying the new approach (part of the InterpretTime library freely available on GitHub), where we (a) train different neural-network architectures on a new family of synthetic datasets (and on ECG datasets), and (b) apply six interpretability methods, namely: i) DeepLift [19], ii) GradShap [11], iii) Integrated Gradients [20], iv) KernelShap [11], v) Saliency [21], and vi) Shapley Value Sampling (also referred to as Shapley Sampling) [22]. These were chosen to capture a broad range of available interpretability methods, while keeping the problem computationally tractable for all the models presented. These interpretability methods are applied to three neural-network architectures, namely convolutional (CNN), bidirectional long short-term memory (bi-LSTM), and transformer neural networks. The evaluation of the interpretability methods for time-series classification is carried out on a new family of synthetic datasets as well as on an ECG dataset.
The new family of datasets aims to mimic arbitrarily complex multivariate time-series data; it is based on a nonlinear transformation of chaotic dynamical systems and composed of three datasets. The first synthetic dataset, denoted by SD1, simply applies the nonlinear transformation to the dynamical systems, while the second (SD2) and third (SD3) additionally corrupt the time series with different patterns of white noise. Further details are presented in methods 2.1.
The ECG dataset was chosen because the automatic classification of ECGs has seen a growing interest, with recent studies focused on post-hoc neural-network explainability [23,24], and is of practical interest in "real-world" applications. Additional details on the ECG dataset used here are presented in methods 2.2.
The new approach to interpretability methods' evaluation is based on novel metrics introduced for time series classification, namely TIC, S_E, AUC_SE, and HMI. These metrics aim to capture both (i) whether the interpretability method reflects the data representation learned by the model, and (ii) how this data representation compares to the one of a domain (human) expert. We call the former relevance identification and attribution; its key new metrics are TIC, S_E, and AUC_SE, described in methods 1 and 1.1. We call the latter human-machine interpretability; its key new metric is HMI, described in methods 1.2.
In summary, the new synthetic family of datasets, along with the novel interpretability evaluation metrics just outlined, addresses the following key points:
1. The necessity for a robust and quantifiable approach to evaluate and rank interpretability methods' performance over different neural-network architectures trained for the classification of time series. Our approach addresses the issues found in the literature by providing an evaluation of interpretability methods independent of human judgement [13], using an occluded dataset [15], and without a retrained model [16].
2. The need for a quantitative approach to assess the overlap between a human expert and a neural network in terms of data interpretation. We call this aspect human-machine interpretability.
3. The lack of a synthetic family of datasets with tunable complexity that can be used to assess the performance of interpretability methods, and that is able to reproduce time-series classification tasks of arbitrary complexity.
This paper is organized as follows. In section 2, we present the key results. In section 3, we discuss the results and summarise the main conclusions. In methods, we outline the new approach to interpretability evaluation for time series classification, including the new family of synthetic datasets and the novel metrics.

Results
The results are divided into neural-network interpretability methods' evaluation (section 2.1) and human-machine interpretability assessment (section 2.2).
All metrics presented in this section are built on the relevance score that an interpretability method provides along the time series.An example for ECG time series is depicted in figure 2, where the red line is the relevance score, while the blue line is the actual time series the neural network is using to make the prediction.

Interpretability evaluation: relevance identification and attribution
In figure 3, we show the new key metrics for evaluating interpretability methods, for the three different neural networks considered in this work. Each row in the figure corresponds to a given neural-network architecture, where the first row corresponds to CNN, the second to bi-LSTM, and the third to transformer. Each column represents a different metric for all the six interpretability methods considered in this work, and a baseline which illustrates a random assignment of the relevance.
In particular, the first column shows the normalized drop in score with respect to the neural network's expectancy, S_E, vs. the amount of points deemed important by the interpretability method, Ñr, and consequently removed to compute the drop in score. Both S_E and Ñr are detailed in methods 1.1.
These S_E-Ñr curves allow evaluating the relevance identification performance of interpretability methods (i.e., the effectiveness of an interpretability method in identifying the relevant portions of the time series that were used by the neural network to make its prediction). The higher the curve, the better the interpretability method's performance.
The second column shows S_E as a function of the time series information content (TIC) index (detailed in methods 1). The latter measures the amount of information contained in the portions of the time series that were used by the neural network to make its prediction. These S_E-TIC curves allow a qualitative evaluation of the relevance attribution performance of interpretability methods. Indeed, the difference between the initial classification score and the expectancy of the model's prediction E[S(X)] should be linearly proportional to the relevance removed, as measured by the TIC index. This theoretical linear trend is depicted as a dashed line in the plot. The interpretability method closest to this trend has the best relevance attribution performance.
Finally, the third column shows the accuracy of the neural network predictions as a function of Ñr. The accuracy drop has often been used to evaluate interpretability methods. It is, however, very dependent on the underlying class distribution of the sample, especially for unbalanced datasets such as the ECG dataset presented in this paper. For this reason, S_E was favored as the evaluation metric.
The results presented in figure 3 correspond to the second synthetic dataset, SD2, using an occlusion method (i.e., corruption of the underlying time series) based on normal sampling, as described in methods 1.1. Corresponding figures for SD1 and SD3 are reported in extended data figures 1 and 2.
Possible biases related to not maintaining an i.i.d. distribution between the train and test sets are addressed by replicating the occlusion of the signals using a permutation of the time steps (as opposed to normal sampling). Figures for all three synthetic datasets using permutation as the occlusion method are reported in extended data figures 3, 4 and 5. Results on the ECG dataset can be found in extended data figures 6 and 7. The area under the S_E-Ñr curve, namely AUC_SE, is used as an indicator of an interpretability method's performance in identifying the correct portions of the time series that were used by the neural network to make its prediction. Hence, AUC_SE is used to rank the interpretability methods considered in this work. In table 1, we show AUC_SE for the three synthetic datasets introduced in methods 2.1, namely synthetic dataset 1 (SD1), 2 (SD2), and 3 (SD3), when the signal is corrupted with noise sampled from a normal distribution. We also report the ranking of the interpretability methods based on the average AUC_SE across the three datasets, where bold indicates the best performing method (ranking equal to 1).
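As a concrete illustration, AUC_SE can be obtained from the measured (Ñr, S_E) pairs with a simple trapezoidal rule. The values below are hypothetical and for illustration only; in practice each pair is measured on the model, one per occlusion quantile.

```python
import numpy as np

# Hypothetical (N_r, S_E) pairs, one per occlusion quantile (illustrative only)
n_r = np.array([0.0, 0.1, 0.25, 0.5, 0.75, 1.0])  # fraction of points occluded
s_e = np.array([0.0, 0.35, 0.6, 0.8, 0.9, 0.95])  # normalized drop in score

auc_se = np.trapz(s_e, n_r)  # area under the S_E-N_r curve
print(round(auc_se, 4))  # 0.7075
```

A higher area means the method concentrates the score drop on the earliest (most relevant) occluded points.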

Human-machine interpretability evaluation
In table 3, we show the human-machine interpretability (HMI) index. This is evaluated on the two synthetic datasets which include time steps where the initial signal has been replaced with non-informative content, SD2 and SD3. On these two datasets we can attribute an expert score ω_de equal to 0 on the portions of the signal with white noise, mimicking the role of a domain expert who knows these portions of the signal are not useful to the classification task. For the rest of the signal, ω_de is set to 1, as by design we know these sections carry information about the target. In bold, we highlight the methods that best match human-expert and neural-network data interpretation. We observe that these are rather different from the best performing interpretability methods measured in terms of relevance identification. This emphasizes that there is an important distinction between how a human expert might interpret the data as compared to a neural network.

Discussion and conclusion
The results presented in section 2 show how the new interpretability metrics are able to correctly assess whether an interpretability method pinpoints the relevant parts of the time series that were actually used by the neural network to make its predictions. In particular, AUC_SE allows quantifying the relevance identification performance of an interpretability method, and is used to rank the methods considered. The S_E-TIC curves allow a qualitative understanding of the relevance attribution performance. Overall, these two metrics together provide a systematic way of assessing the performance of different interpretability methods, and overcome the several issues highlighted in the existing literature, thereby providing an accurate benchmarking framework for time series classification interpretability.
More specifically, the results shown in tables 1 and 2 highlight an advantage of the Shapley Value Sampling and Integrated Gradients methods across both the synthetic and ECG datasets. This advantage is retained both when corrupting the signal with white noise and when permuting the time steps (extended data tables 1 and 2). This invariance to the corruption method also addresses a key issue found in the literature: the drop in performance of an interpretability method has been attributed to a shift in distribution between the initial and the corrupted dataset. In our analysis, we show that a larger drop in score is systematically observed both when permuting and when corrupting the most relevant time steps (identified by the interpretability method), as compared to permuting or corrupting a random selection of time steps.
We also note that the Shapley Value Sampling method was the most computationally intensive among the methods tested as part of this research. In this regard, Integrated Gradients can offer a good compromise, given its performance across all models and datasets tested and its shorter running time.
For the two interpretability methods above, our analysis shows that the most important time steps for a neural network to classify a sample are correctly grasped. However, we observe a discrepancy between the theoretical estimate and the actual curves produced by the different interpretability methods (middle column in figure 3). This indicates that the relevance attributed to each time step does not reflect the relative importance of this time step in the classification task. Instead, the attributed relevance acts more as a ranking of the most important time steps among themselves. For example, a point with a relevance of 0.1 for a total classification score of 1 might not necessarily account for 10% of the final prediction, but will be more important than a point with a relevance equal to 0.05.
A limitation of the designed methodology is the assumption that time steps are independent, and hence that their relevance can be estimated separately by occluding the time steps of interest. This assumption does not hold for neural networks, which learn dependencies across time steps. This limitation is, however, mitigated by occluding the quantile of the most relevant time steps: by occluding the most important time steps together, and not separately, the most important dependencies should be captured. Further post-processing of the relevance extracted with the interpretability methods should be developed to highlight these dependencies across time steps as well as across features.
Along with the ranking of interpretability methods' performance, we introduced the HMI score. This provides a way of assessing whether domain experts and neural networks agree on the interpretation of time series data, an aspect that, to the authors' knowledge, has not been formalized before. The results presented on the synthetic family of datasets (table 3) show that the interpretability method that best matches the intuition of an "expert" is not the interpretability method which achieved the highest AUC_SE. This difference between the two metrics reinforces the idea that, while it might be of interest to compare the parts of a time series that are important for a given task between a trained model and an expert, such a comparison should not be used to assess the performance of an interpretability method.
A key enabler of the aforementioned results is the novel family of synthetic datasets based on chaotic dynamical systems. This addresses several drawbacks present in the existing literature, and forces the neural network to learn time dependencies as opposed to static information. As part of the presented research, it was shown that the interpretability methods performing best on the synthetic datasets were also the best performing on the ECG dataset, emphasising how the designed datasets act as a good proxy for "real-world" classification tasks, and as such might be used for a range of different research objectives.
Finally, the interpretability computed on the ECG dataset shows how such information might be useful in practical healthcare applications. The relevance presented in figure 2 shows that the most important lead for the model to classify an ECG with a specific cardiac disease called a right bundle branch block (RBBB) is lead V1, which is also a diagnostic criterion for cardiologists [25]. Of great interest, the interpretability method is also able to show that the trained model relies almost entirely on this lead to predict an RBBB, even though other diagnostic criteria exist for this disease. This type of analysis provides practical insights to understand how trained models will perform in an applied setting, and may help identify possible biases and potential corrective actions.

Methods 1 Novel approach to interpretability evaluation
The time-series classification task considered in this paper can be formalized as follows. Given a trained neural-network model N, we aim to map a set of features X ∈ R^(M×T) to a labelled target C ∈ N^(N_c), for each sample i contained in a given dataset D_i = [X, C]_i, for i = 1, ..., N, where M is the number of features, T is the number of ordered elements per feature, N_c is the number of labels, and N is the total number of samples available. Typically, this is achieved by a score S(X_i) that the trained neural network N provides, and that is then used as an input to a softmax layer to output the probability of sample i belonging to a given class.
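As a minimal sketch of this last step, the mapping from per-class scores to class probabilities can be written as follows. The scores are hypothetical stand-ins; in practice they come from the trained network N.

```python
import numpy as np

def class_probabilities(scores):
    """Softmax over the per-class scores S(X_i) produced by the network."""
    e = np.exp(scores - scores.max())  # shift by the max for numerical stability
    return e / e.sum()

# Hypothetical per-class scores for one sample with N_c = 3 classes
p = class_probabilities(np.array([2.0, 0.5, -1.0]))
```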
To assess time-series interpretability methods quantitatively, we developed novel metrics, or indices, which encompass two key areas: 1. how closely an interpretability method reflects the representation learned by the model of interest, and 2. how the learned representation compares to the one an expert would use to approach the classification task.
While these two points have often been used interchangeably to evaluate interpretability methods, we reassert the importance of distinguishing them. Hence, we developed independent metrics to evaluate them separately.
The new metrics are built on the relevance score that an interpretability method provides along the time series. The relevance score can be positive or negative (except for some interpretability methods, where it is only positive; see e.g., the saliency method [21]). A positive relevance score means that the neural network is using that portion of the time series to make its prediction. A negative relevance score indicates that the neural network sees the portion of the time series as going against its prediction. As we are interested in how the network is using data to make its predictions, we use the positive relevance score to build the new metrics.
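This filtering step can be sketched as follows, assuming the relevance comes as a signed array over time steps (the array below is illustrative):

```python
import numpy as np

# Signed relevance along a toy time series (illustrative values)
relevance = np.array([0.4, -0.2, 0.1, 0.0, -0.5, 0.3])

r_pos = np.clip(relevance, 0.0, None)   # keep only the positive relevance
i_plus = np.flatnonzero(relevance > 0)  # I+: time steps with positive relevance
```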
In this work, we considered six interpretability methods: i) DeepLift [19], ii) GradShap [11], iii) Integrated Gradients [20], iv) KernelShap [11], v) Saliency [21], vi) Shapley Value Sampling [22]. These methods were chosen to capture a broad range of available interpretability methods, while maintaining the problem computationally tractable for all the models presented. The implementation of the interpretability methods leveraged those provided in the Captum library [9]. The objective is to evaluate each of the methods above in terms of their performance, as well as to assess them in terms of human-machine data interpretation.
Central to the analysis of the relevance distribution, we define the time information content (TIC) index as:

TIC(q) = Σ_{τ ∈ I+_q} R(τ) / (Σ_{t ∈ I+} R(t) + ε),

where R(t) is the positive relevance at time step t, I+ is the set of points with positive relevance, I+_q is the set of points with positive relevance above a relevance quantile q, τ = t ∈ I+_q, and ε is a small number that prevents the ratio from diverging to infinity. The TIC index reflects the ratio of the relevance attributed to the set of points I+_q to the total positive relevance over I+, integrated along the time series. In the rest of the study, TIC is calculated for each quantile within the following set: Q = {0.05, 0.15, 0.25, ..., 0.95}. The TIC constitutes a key index both for ranking interpretability methods' performance and for the human-machine interpretability evaluation described next.
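A minimal sketch of a TIC-style computation consistent with this description follows; the quantile handling is an assumption for illustration, not the paper's exact implementation.

```python
import numpy as np

def tic(relevance, q, eps=1e-8):
    """Share of total positive relevance carried by the points above the
    relevance quantile q (sketch of the TIC index described in the text)."""
    r = np.clip(relevance, 0.0, None)   # positive relevance only
    pos = r[r > 0]
    if pos.size == 0:
        return 0.0
    thresh = np.quantile(pos, q)        # relevance quantile over I+
    return r[r >= thresh].sum() / (r.sum() + eps)
```

Higher quantiles select fewer, more relevant points, so TIC decreases as q grows.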

Relevance identification and attribution
Assessing interpretability methods' performance requires tackling two aspects: relevance identification and relevance attribution. We detail these two concepts, along with the methods developed to measure them, next.
Relevance identification. The concept behind relevance identification is that interpretability methods should correctly identify, and order according to their relevance, the set of points used by the model to make its predictions. To verify that this is the case, we occlude the set of points I+_q for each quantile q in the quantile set Q. This results in the modified sample X̃_i, where N_{I+} points along the time series are occluded. Given the discussion provided in section 1 and the comments on maintaining an i.i.d. distribution across the different datasets, we adopt two different occlusion methods. The first method consists in sampling new values from a normal distribution with mean equal to 0 and standard deviation equal to 1/(2√3) to replace the set of values of interest I+_q. This first technique might violate the i.i.d. property. The second method randomly shuffles the values in I+_q, instead of sampling them from a normal distribution. This second technique has the benefit of maintaining a constant distribution between the initial samples and the modified ones. A random baseline is created by occluding a set of N_{I+} random points, to provide a comparison between the interpretability metrics and a random assignment of the relevance. We then run our trained model on the modified sample X̃_i and define the normalised drop in score S_E with respect to the network's expectancy E[S(X)], where the network's expectancy is the score average over the entire dataset. The new metric S_E is defined as:

S_E = (S(X) − S(X̃)) / (S(X) − E[S(X)]).

Using this newly created metric, it is possible to produce S_E vs. Ñr curves, where Ñr is the fraction of points removed with respect to the total number of points T present in the time series, i.e. Ñr(q) = N_{I+_q}/T. These curves allow us to understand the effectiveness of interpretability methods vs. the amount of points flagged by the relevance score, and consequently removed. The number of points flagged is strictly linked to the quantile adopted: as we decrease the quantile, we increase Ñr. The overall effectiveness of the interpretability method can be summarised as the area under the S_E-Ñr curve described above, denoted as

AUC_SE = ∫₀¹ S_E(Ñr) dÑr,

where S_E(Ñr) is extended to pass through the origin, and S_E(1) is measured at the smallest quantile of the quantile set Q. The aim is to have a fair comparison across all methods, which might attribute positive relevance to different numbers of points.

Relevance attribution. The idea behind relevance attribution is that relevance should reflect the individual contribution of each time step towards the model's predicted score. The local accuracy property states that the sum of the relevance should be equal to the difference between the score and the expectancy of the network E[S(X)] [11]. Extending this property, the difference between the initial score and the expectancy of the model's predictions E[S(X)] should be linearly proportional to the relevance removed, as measured by the TIC index. Given this theoretical approximation, it is possible to evaluate how different interpretability methods over- or under-estimate the role of different time steps in the model's prediction. Indeed, the index S_E should be linearly proportional to the TIC index, so that the information ratio (IR) satisfies

IR = S_E / TIC ≈ 1.

A slope larger than one indicates that the relevance of the points under the quantile of interest was under-estimated, while the opposite is true for a slope smaller than one. An example of these curves is depicted in figure 3, in the middle column, where we report the theoretical linear line (dashed), and the S_E-TIC curves for every interpretability method considered.
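The two occlusion strategies and the normalized score drop can be sketched as follows. The function names and the (M, T) sample layout are illustrative assumptions, and the normalization shown (score drop over the gap between the original score and the expectancy) is one way to realize the definition in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def occlude(x, idx, method="normal"):
    """Occlude the flagged time steps idx of an (M, T) sample, either by
    sampling N(0, 1/(2*sqrt(3))) or by permuting the flagged values."""
    x_mod = x.copy()
    if method == "normal":  # may break the i.i.d. property
        x_mod[:, idx] = rng.normal(0.0, 1 / (2 * np.sqrt(3)),
                                   size=(x.shape[0], idx.size))
    else:  # permutation keeps the sample's value distribution intact
        x_mod[:, idx] = x_mod[:, rng.permutation(idx)]
    return x_mod

def normalized_drop(s_orig, s_occluded, s_expect):
    """S_E-style metric: score drop normalized by the gap to E[S(X)]."""
    return (s_orig - s_occluded) / (s_orig - s_expect)
```

A fully successful occlusion drives the occluded score down to the expectancy, giving a normalized drop of 1.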

Human-machine interpretability
To quantify how well the relevance attributed by the interpretability method follows the expectations of domain experts, the TIC index is modified to include domain-expert weights ω_de, yielding the human-machine interpretability (HMI) score. In its definition, ΔN_wI+ is the difference between the number of points with positive relevance identified by the interpretability method and by the human expert, N_{I+} is the number of points with positive relevance identified by the interpretability method, and τ is the set of time steps with positive relevance I+. The domain-expert weights ω_de aim to reflect the importance of the different time steps as seen by a domain expert, and can be either zero or one. A value of 0 means that the associated portion of the time series is not important to make a decision. A value of 1 means that the associated portion of the time series is important. Consequently, HMI produces a perfect interpretability score equal to 1 if ω_de is equal to 1 only on those portions of the time series that were flagged by the relevance score (i.e., by the machine) as important. In this case, the neural network's interpretation of the time series would match exactly the one given by the domain expert. However, in the more general case, the domain expert might also flag other parts of the time series as important (the limit case being the domain expert flagging the entire time series as important). In this case, it is necessary to penalize the parts of the time series that do not overlap with the relevance score produced by the machine. This is achieved by penalising the HMI score by the factor γ_p. The latter is equal to 0 if ΔN_wI+ = 0 (exact overlap of human and machine interpretation), and equal to 1 if ΔN_wI+ = N_{I+}. We further observe that the fraction constituting γ_p could be larger than one; if this is the case, we bound the value to 1, for easier interpretation of the score (i.e., HMI = 0 means that there is no overlap between human and machine data interpretation).
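The description above can be sketched as follows. This is an illustrative approximation of an HMI-style score, not the paper's exact formula: the overlap term and the definition of ΔN_wI+ used here are assumptions.

```python
import numpy as np

def hmi_sketch(relevance, w_expert):
    """Illustrative HMI-style score: share of positive relevance falling on
    expert-flagged time steps, penalized by the flag-count mismatch gamma_p."""
    r = np.clip(relevance, 0.0, None)
    n_iplus = int((r > 0).sum())                # N_I+: machine-flagged steps
    if n_iplus == 0 or r.sum() == 0:
        return 0.0
    overlap = r[w_expert.astype(bool)].sum() / r.sum()
    delta = abs(int(w_expert.sum()) - n_iplus)  # ΔN_wI+ (assumed definition)
    gamma_p = min(delta / n_iplus, 1.0)         # bounded to 1, as in the text
    return overlap * (1.0 - gamma_p)
```

An exact match between expert and machine flags yields 1; an expert flagging everything drives the penalty to its bound and the score to 0.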

Datasets
The new interpretability evaluation approach has been applied to two sets of data, namely a new family of synthetic datasets created for this work, and an ECG dataset.The two are described next.

A new family of synthetic datasets
The new family of datasets envisioned for the interpretability of time series classification is based on five different and well-known chaotic dynamical systems, namely Chua [26], Duffing [27], Lorenz [28], Rikitake [29], and Rössler [30]. Each dynamical system (also referred to as an attractor) constitutes a different class (or label) in the classification task, and is composed of three time series characterizing the nonlinear dynamics (three state variables) of the attractor.
Dynamical systems are described by a set of differential equations of the form dx/dt = f(x; p), where x is the state vector and p a set of parameters. The equations for each of the five dynamical systems used in this paper are defined below. Each system includes a range of parameters which are either fixed or randomly sampled for each attractor from a given interval in order to generate different behaviours. The first system is the Chua system, which replicates an electronic circuit with chaotic behavior [26] and can be formulated as in [31], with constants a = 15.6, ν1 = −1.143, ν2 = −0.714, and b randomly sampled in the interval [25, 51].
The second system is the Duffing oscillator, which models a forced oscillator [27], with a = 0.1 and b sampled from the interval [0.1, 0.65].
The third system is the Lorenz system, a simplified model of atmospheric convection [28]. Fourth is the Rikitake system, which aims to explain the reversal of the Earth's magnetic field [29]. Fifth is the Rössler system, designed to showcase chaotic behavior with fractal properties [30], with a = 0.2, b = 0.2, and c sampled from the interval [4, 18].
All samples were generated by integrating the above systems of equations in time using the 5th-order Runge-Kutta method from the desolver Python package. The samples were integrated for 3500 time steps, with the first 1000 time steps discarded and the resulting time series downsampled by a factor of 10. For each attractor class, 500 samples are generated, with initial conditions and system parameters drawn from a uniform distribution within a prescribed range (the ranges adopted are defined in table 4). Each sample thus includes 250 time steps.
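The generation procedure can be sketched as follows for the Rössler system. The paper uses desolver's 5th-order Runge-Kutta; this dependency-free sketch substitutes a classical RK4, and the step size dt is an assumption (it is not stated in the text):

```python
import numpy as np

def rossler(s, a=0.2, b=0.2, c=5.7):
    # Rossler system [30]; a and b are fixed, c is sampled per sample
    x, y, z = s
    return np.array([-y - z, x + a * y, b + z * (x - c)])

def rk4(f, x0, dt, n_steps):
    """Classical 4th-order Runge-Kutta integrator (the paper uses the
    5th-order scheme from desolver; RK4 keeps this sketch dependency-free)."""
    traj = np.empty((n_steps, len(x0)))
    x = np.asarray(x0, dtype=float)
    for i in range(n_steps):
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        traj[i] = x
    return traj

rng = np.random.default_rng(0)
c = rng.uniform(4, 18)                    # per-sample parameter, as in the text
x0 = rng.uniform(-1, 1, size=3)           # illustrative initial condition
traj = rk4(lambda s: rossler(s, c=c), x0, dt=0.02, n_steps=3500)  # dt assumed
sample = traj[1000:][::10].T              # drop transient, downsample: (3, 250)
```

Discarding the first 1000 of 3500 steps and downsampling by 10 yields the 250 time steps per sample stated above.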
In order to replicate "real-world" time series, which typically come with data corruption due to, e.g., sensor issues or external disturbances, each sample is modified using a nonlinear transformation function with parameters a, b, c, d ∈ R^{1×3}, also sampled from a uniform distribution, which allow tuning the complexity of the classification task.
To force the neural networks to acquire time-series knowledge as opposed to using static knowledge, we rescaled the three time series associated with each attractor so that the maximum norm over the three quantities is one. The mean was also removed from each of the three quantities. These two steps are key to preventing static metrics from discriminating between the different attractors (i.e., classes). In addition to these transformed and rescaled attractors, which we refer to as synthetic dataset 1 (SD1), we created two additional datasets with data portions replaced by white noise, forming the following dataset family:
1. transformed attractors (synthetic dataset 1, or SD1),
2. transformed attractors where random locations are corrupted with white noise (synthetic dataset 2, or SD2),
3. transformed attractors with the first 100 time steps of each feature corrupted with white noise (synthetic dataset 3, or SD3).
The rationale behind choosing chaotic dynamical systems for the new synthetic dataset family lies in the similarity between discriminating different dynamics, belonging to different phenomena, and "real-world" classification tasks such as ECG- and EEG-based disease identification.
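A minimal sketch of the rescaling step, assuming the maximum norm is taken over all three channels jointly and that the mean is removed before rescaling (the order of the two operations is not stated explicitly):

```python
import numpy as np

def normalise_sample(sample):
    """Remove the per-channel mean, then rescale so that the maximum
    absolute value over the three channels is one.  Performing the mean
    removal first is an assumption; the text lists rescaling first."""
    centred = sample - sample.mean(axis=1, keepdims=True)
    return centred / np.abs(centred).max()

# toy (3, n_steps) sample: three channels with different offsets and scales
x = np.array([[1.0, 3.0, 5.0],
              [10.0, 20.0, 30.0],
              [-2.0, 0.0, 2.0]])
y = normalise_sample(x)
```

After this step, per-channel means and amplitudes no longer discriminate between classes, so a classifier must rely on the temporal dynamics.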
Overall, the new synthetic dataset family, comprising three synthetic datasets, allows the metrics developed in this work to be evaluated, and produces consistent results in terms of both interpretability-method ranking and human-machine interpretability.

ECG dataset
To mimic a real-world classification task, we applied the interpretability framework to an ECG dataset. The dataset was narrowed down to the CPSC subset [32], which includes 6877 ECGs annotated for 9 cardiovascular diseases. Among these annotations, we chose to classify the ECGs for the presence/absence of a Right Bundle Branch Block (RBBB). The dataset includes 5020 cases showing no sign of an RBBB and 1857 cases annotated as carrying an RBBB. RBBB has been found to be associated with higher cardiovascular risk as well as mortality [33].
In order to obtain an average beat, the R-peaks of each ECG are extracted using the BioSPPy library [34]. The beats centred around the R-peaks are then extracted from the ECG by taking 0.35 seconds before and 0.55 seconds after each R-peak. The mean of the extracted beats is then computed to obtain an "average" beat for each ECG. An example of the initial signal, alongside the transformed one used as a feature for the classification task, is presented in figures 4 and 5.
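The beat-averaging step can be sketched as follows. The sampling rate and the R-peak indices below are placeholders; in the paper the peaks come from BioSPPy's R-peak detector:

```python
import numpy as np

FS = 500                                      # sampling rate in Hz (assumed)
PRE, POST = int(0.35 * FS), int(0.55 * FS)    # 0.35 s before / 0.55 s after

def average_beat(signal, r_peaks):
    """Average the windows centred on the R-peaks.  In the paper the
    peaks are detected with BioSPPy; here they are passed in directly
    so the sketch stays self-contained."""
    beats = [signal[r - PRE:r + POST] for r in r_peaks
             if r - PRE >= 0 and r + POST <= len(signal)]
    return np.mean(beats, axis=0)

ecg = np.sin(np.linspace(0, 40 * np.pi, 5000))  # toy single-lead trace
peaks = [500, 1500, 2500, 3500]                 # illustrative R-peak indices
beat = average_beat(ecg, peaks)                 # length PRE + POST samples
```

Windows that would extend past the signal boundaries are discarded before averaging, so every retained beat has the same length.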
Figure 4: Raw ECG signal for two selected leads from a given sample

B Classification Metrics
B.1 Synthetic datasets
Each synthetic dataset was split 0.70/0.15/0.15 between the train, validation, and test sets.
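A minimal sketch of such a split over shuffled indices (the seed and the shuffling itself are assumptions; the text only states the proportions):

```python
import numpy as np

def split_indices(n, seed=0):
    """0.70 / 0.15 / 0.15 train-validation-test split over shuffled
    indices (the seed and shuffling are assumptions; the text only
    states the proportions)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 2500 samples per synthetic dataset (5 attractor classes x 500 samples)
train_idx, val_idx, test_idx = split_indices(2500)
```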
Accuracy for the three model types across the three synthetic datasets is presented in table B1.

Figure 2 :
Figure 2: Plot of a sample from the ECG dataset correctly classified as presenting an RBBB by the Transformer model. The signal of the two selected features (out of 12) is shown in blue, alongside the attributed importance (relevance score) in the classification, shown in red. The presented relevance was measured using the Shapley Value Sampling method.

Figure 3 :
Figure 3: Metrics SE as a function of ratio of points removed Ñr (left column) and TIC (middle column), and accuracy as a function of ratio of points removed (right column), for (a) CNN, (b) bi-LSTM, and (c) transformer, using the synthetic dataset 2 (SD2) and normal sampling.
An ECG records the electrical activity of the heart and typically produces 12 signals, corresponding to 12 sensors or leads. For this task, we used a subset of the "Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology Challenge 2020" dataset, published under a CC Attribution 4.0 license.

Figure 5 :
Figure 5: Processed ECG signal for two selected leads from a given sample
Extended data figure 2: Metrics SE as a function of ratio of points removed (left column) and TIC (middle column), and accuracy as a function of ratio of points removed (right column), for (a) CNN, (b) bi-LSTM, and (c) transformer, using the synthetic dataset 3 (SD3) and normal sampling.
Extended data figure 3: Metrics SE as a function of ratio of points removed (left column) and TIC (middle column), and accuracy as a function of ratio of points removed (right column), for (a) CNN, (b) bi-LSTM, and (c) transformer, using the synthetic dataset 1 (SD1) and permuting the time steps.
Extended data figure 4: Metrics SE as a function of ratio of points removed (left column) and TIC (middle column), and accuracy as a function of ratio of points removed (right column), for (a) CNN, (b) bi-LSTM, and (c) transformer, using the synthetic dataset 2 (SD2) and permuting the time steps.
Extended data figure 5: Metrics SE as a function of ratio of points removed (left column) and TIC (middle column), and accuracy as a function of ratio of points removed (right column), for (a) CNN, (b) bi-LSTM, and (c) transformer, using the synthetic dataset 3 (SD3) and permuting the time steps.
Extended data figure 6: Metrics SE as a function of ratio of points removed Ñr (left column) and TIC (middle column), and accuracy as a function of ratio of points removed (right column), for (a) CNN, (b) bi-LSTM, and (c) transformer, using the ECG dataset and normal sampling.
Extended data figure 7: Metrics SE as a function of ratio of points removed (left column) and TIC (middle column), and accuracy as a function of ratio of points removed (right column), for (a) CNN, (b) bi-LSTM, and (c) transformer, using the ECG dataset and permuting the time steps.

Figure B1 :
Figure B1: Confusion matrices for the CNN model trained on the ECG task, for the train, validation, and test sets

Figure B3 :
Figure B3: Confusion matrices for the Transformer model trained on the ECG task, for the train, validation, and test sets

Table 1 :
AUC_SE score for networks trained on the three synthetic datasets SD1, SD2, and SD3, along with the average AUC_SE score across the three datasets, and the ranking of the interpretability methods based on the average score (bold indicates the best-performing interpretability method in terms of AUC_SE).
The same results on the ECG dataset are presented in table 2. Results obtained when corrupting the signal by shuffling the time steps are presented in extended data table 1 for the synthetic datasets and in extended data table 2 for the ECG dataset, revealing similar results. All models' hyperparameters and the classification scores of the trained models are reported in appendices A and B.

Table 2 :
AUC_SE metric for networks trained on the ECG dataset, along with the ranking of the interpretability methods (bold indicates the best-performing interpretability method in terms of AUC_SE).
We observe that the relevance-identification performance measured by the AUC_SE-Ñr curve is consistent across the three neural-network architectures, with Shapley Value Sampling and Integrated Gradients showing the best performance (except for the CNN on SD3, where DeepLift ranks second together with Integrated Gradients). In terms of relevance attribution, Shapley Value Sampling and Integrated Gradients overshoot the theoretical linear trend significantly compared with all other interpretability methods; yet, none of the methods closely follows the theoretical estimate.

Table 3 :
HMI score on the two synthetic datasets with added white noise, SD2, and SD3 (bold indicates best agreement between domain-expert and neural-network data interpretation).
The HMI index shown in table 3 opens the opportunity to measure, in practical scenarios, how human domain experts (e.g., cardiologists in the case of ECG time-series classification of heart disease) interpret data compared with AI solutions. This point is of crucial importance given recent regulatory requirements for the use of AI technologies in critical sectors.

Table 4 :
Ranges from which the initial conditions of the different samples are sampled.
Extended data table 1: Evaluation of the interpretability metrics for networks trained on the synthetic datasets with the signal corrupted by permuting time steps (bold indicates the best-performing interpretability method for each dataset).

Extended data table 2 :
Evaluation of the interpretability metrics for the networks trained on the ECG dataset with the signal corrupted by permuting time steps (bold indicates the best-performing interpretability method for each network).

Table B1 :
Classification accuracy across the three synthetic datasets for the train, validation (valid.), and test sets.
Accuracy, along with precision and recall, is reported for the classification task on the ECG dataset in table B2. In addition, confusion matrices for the three models are presented in figures B1, B2, and B3.

Table B2 :
Classification metrics for the three models trained on the ECG task, reported for the train, validation (valid.), and test sets.