Introduction

Brain–Computer Interface (BCI) systems provide a means by which brain activity, measured either invasively or non-invasively, can be used to interact with an external device or the environment1. These systems record brain activity, process the signal, and translate relevant features into commands that can be used to control a virtual or physical machine such as a computer, robot, exoskeleton, prosthetic, or even a digital avatar2. BCI systems are currently used both in assistive modes, such as enabling individuals who are paralyzed to control external devices or communicate, and as rehabilitation tools to promote or improve recovery3. BCI systems have also proved useful in assisting individuals with neuromuscular and neurological disorders such as spinal cord injury4, stroke5, and cerebral palsy6. Thus, BCI systems can compensate for, restore, or replace reduced functional capabilities and facilitate neural recovery.

A typical BCI system contains multiple stages of pre- and post-processing. The artifact removal stage contains different pre-processing steps that handle most of the artifacts contaminating the brain signals. This is usually followed by a feature engineering stage wherein the most relevant features for the particular task of interest are identified. These features are then used to train a classifier/regression model to generate the commands for controlling an external device7.

Recent advances in machine learning and deep learning-based decoders have led to significant improvements in decoding capabilities using electroencephalography (EEG). With the advancements in deep learning (DL), the number of studies adopting such models as decoders has increased exponentially. DL models use a computational framework with multiple layers that learn representations at multiple levels of abstraction. In addition to improving predictive power, the utility of DL is largely motivated by the possibility of removing the multi-stage processing of EEG. Many studies use deep learning models in an end-to-end manner, wherein the same model is expected to handle artifacts, identify relevant features, and perform the decoding8, 9. Indeed, over 60–70% of studies do not handle artifacts when using deep learning models8,9,10. Not having to handcraft the features required for decoding is another advantage of DL models: the model can automatically identify the relevant features, so decoding is not limited to hand-picked or pre-selected features. A review by Roy et al.9 reported a median gain in decoding accuracy of 5.4% for DL algorithms over traditional ML algorithms, demonstrating the benefit of using DL models as decoders. However, these models suffer from poor interpretability and explainability, which limits their widespread adoption in spite of the performance improvement, especially in industries such as healthcare11, 12. There are therefore concerns about whether this improvement in decoding comes from learning the true underlying data distribution or from learning spurious artifacts present in the data13, 14. Interpreting how and why a model arrives at specific decisions will be critical to eliminating such biases in these algorithms.

Even though many variants of algorithms have been developed to interpret neural network models, the broad majority of them fall into three categories: model distillation, visualization methods, and intrinsic methods15. A summary of these different types of model explanations is given in Fig. 1.

1. Distillation/approximation methods: A group of approaches that approximate the DL model with a simpler model whose input–output behavior mimics that of the DL model. By interpreting the simpler model, insights into how the complex model works can be obtained. These approaches are broadly labeled under the category of distillation methods. One of the most popular among them is the Local Interpretable Model-agnostic Explanations (LIME) method16.

2. Visualization: Visualization methods highlight, through different visualizations, the most important features or attributes of the input that affect the decision of the model. One of the most common outputs is the saliency map, which highlights the important segments of the input. These methods can be further divided into categories based on how they are implemented. The majority of approaches in this category are based on back-propagation17, in which the gradient/relevancy score for a particular class or neuron is back-propagated in some form. The most common and oldest approach is the Gradient approach18, which estimates the gradient of the output with respect to the input. More robust and less noisy variants of this simple approach have been developed, such as FullGrad19, Input X Gradient (IxG), Layer-wise Relevance Propagation (LRP)20, and DeepLift21, as well as different class activation map approaches such as GradCAM22, GradCAM++23, LayerCAM24, Guided GradCAM25, and ScoreCAM26. A few methods attempt to reverse the forward operations ('inversion') in a CNN, such as Deconvolution27 and Guided Backpropagation28. Other approaches, such as activation maximization, involve an additional 'optimization' step wherein an input is created that maximizes the score for a particular class/filter of interest18. Through all of these methods, the researcher gains additional context, through different forms of scientific visualization, on what drives a model's decision.

3. Intrinsic methods: Intrinsic methods involve either developing models that provide an explanation for the decision as part of their output, or models from which explanations can be extracted from the architecture in a rather straightforward way15. Common examples include models using the attention mechanism29, which generates a contextual vector for downstream processing by learning a conditional distribution over the input. Other studies engineer the deep network to perform specific, meaningful functions that are easily interpretable. One such approach is SincNet30, which is based on parameterized sinc functions wherein the model learns cutoff frequencies for the filter banks. This allows for more easily interpretable filters, as the most highly activated units correspond to a particular frequency band.

The emphasis on explainability has not kept pace with the popularity of deep learning for EEG applications, and the adoption of explainability methods in EEG research remains rare. In the EEG literature, the majority of model explanations are visualization methods based on backpropagation; a brief literature review is detailed in the section below. The scope of this paper is therefore limited to the visualization approach, which is also the most extensively developed explainability approach in other domains31, 32. Limiting the scope to these methods further allows for a more straightforward comparison of their effectiveness. Recent research in computer vision has shown that many of these visualization-based approaches have reliability issues when applied to images33, 34. Adebayo et al. showed that visual inspection of model explanations alone can be misleading: many commonly used explainability methods lack sensitivity to the model and to the data-generating process34. In that study, the authors randomized the labels and, separately, reinitialized the model weights, reasoning that if an explanation is specific to the data and the trained model, it should change substantially under these randomizations. However, they found that many methods were invariant to these manipulations, and only Gradients and GradCAM passed their sanity checks. In a separate study, Kindermans et al. showed that many methods do not satisfy input invariance either33. Most EEG studies limit visualization to either a single example or an average over one subject, so it is not clear whether the proposed methods generalize to other datasets. Therefore, it remains unclear which explainability methods are robust and reliable when applied to EEG data, and whether these methods are sensitive only to certain features in EEG. Sensitivity is just as important as robustness because, unlike images, EEG carries equally relevant features in the temporal, spectral, and spatial domains, and inspecting a raw time series is less intuitive than inspecting an image. Moreover, establishing ground truth in real EEG is challenging, particularly at low signal-to-noise ratios (SNR). Even repetitions of the same task can show large variability due to the nature of brain activity, the influence of the environment, etc. Knowing the exact temporal location of a particular feature can also be difficult to ascertain in individual trials, and multiple features and noise often superimpose, making it difficult to know which feature the model is sensitive to.

To address some of these challenges, in this study we introduce a framework in which simulated EEG is used to compare different deep learning explanation methods for EEG applications. Simulated data allow distinct EEG features to be isolated and varied in a selective and controlled manner. Here, we test twelve heatmap-based methods on simulated EEG to assess their ground-truth sensitivity and robustness at varying levels of SNR. The sensitivity to three fundamental attributes of EEG, specifically its temporal, spectral, and spatial properties, is evaluated. We provide an objective assessment of the robustness and sensitivity of these explainability approaches to these different attributes of EEG, compare their strengths and weaknesses to highlight pitfalls, and provide recommendations for their appropriate application in EEG research.

Figure 1

Different types of explanation approaches in neural network models.

Literature review

To quantify the number of studies that employ explainability approaches when using deep learning on EEG, a literature review was conducted using the Web of Science. The advanced search option was used with the criterion ((AB = (EEG) OR AB = (Electroencephalography)) AND (AB = (neural network) OR AB = (deep learning) OR AB = (CNN) OR AB = (Convolutional Neural Network) OR AB = (Recurrent Neural Network) OR AB = (LSTM) OR AB = (GRU))) AND (ALL = (interpretability) OR ALL = (explainability) OR ALL = (interpretable)). The search, conducted in November 2021, returned a total of 65 publications. Among these, 30 did not use any specific explainability method; they either referred to interpretability/explainability only for discussion purposes or were not relevant. A few papers that include interpretability in the title/abstract used hand-crafted features to train the model and refer to them as "interpretable models"; these were also excluded. Two papers were excluded because of poor quality. After removing these papers, 33 studies remained that used some form of model explanation. In contrast, the same search without the (ALL = (interpretability) OR ALL = (explainability) OR ALL = (interpretable)) clause returned a total of 5951 papers, suggesting that fewer than 0.6% of studies currently include model explanations.

The types of methods used in the 33 studies are summarized in Fig. 2. The majority of the studies use some form of heatmap approach, which highlights the part of the input the model relies on to arrive at its prediction. The most commonly used method (Saliency) is also the simplest, wherein the gradient w.r.t. the input is computed35,36,37,38,39,40,41,42. The next most common method is plotting the convolutional filters directly, usually the convolutional filters whose kernels span all EEG channels (spatial convolutional layer weights)43,44,45,46. However, the raw weights do not directly indicate whether they correspond to class-specific features. Given the large number of filters, the combination of filters that contribute positively to the prediction is difficult to discern. Moreover, previous studies have shown that significant non-zero weights can be observed for channels whose activity is independent of the underlying cortical activity47. Many other studies used occlusion-based model explanations, wherein parts of the input are occluded or zeroed out to identify the most sensitive region. However, occlusion methods are not ideal when there are dependencies between non-local features; in that case, how to define the mask to include these dependencies (width, shape, etc.) has to be known a priori. Other studies have used more complex back-propagation approaches. For example, Sturm et al. used Layer-wise Relevance Propagation (LRP) to identify scalp relevancy associated with motor imagery48. Similarly, Lawhern et al. used the Deep Learning Important FeaTures (DeepLift) method46 for motor imagery and error-related negativity response tasks. Ravindran et al. used GradCAM to demonstrate that a CNN was learning from perturbation-evoked potentials in single-trial EEG49, and later developed an approach combining clustering and GradCAM explanations to demonstrate that these decoders were not biased by artifacts50; however, when artifacts were not handled, the model learned from artifacts. Another approach used by a good number of studies is activation maximization18, which synthetically generates inputs that maximally activate a particular neuron, typically in the final layer51,52,53. A few studies attempted a perturbation approach in which the input is perturbed and the change in output evaluated54, 55. The remaining category includes studies that use less common approaches; most of them either visualize clustering of hidden-layer activations to show class separation56 or correlate hidden-layer activations with different features57, 58.

Figure 2

Left: Pie chart showing the distribution of methods used in the screened studies from the Web of Science search. Right: Trend showing the number of EEG publications using deep learning, with and without explainability (not screened).

Results

The five-fold cross-validated decoding accuracy for all combinations of SNR and condition is summarized in Table 1. As expected, the decoding performance decreases with lower SNR. The chance levels for the spatial, spectral, and temporal conditions are 50%, 25%, and 25%, respectively, since the classes are balanced. The cross-validated robustness and sensitivity measures were estimated for each of the three conditions at different SNR levels. The following subsections give the comparison for each condition.

Table 1 Cross-validated test accuracy for different sets of simulated data.

Event-related potential component (temporal precision)

The averaged cross-validated performance metrics are summarized in Fig. 3. Based on the Relevance Mass Accuracy (RMA) measure, DeepLift was the most accurate/sensitive at localizing the ERP component, followed by LRP and I × G, and then by Guided GradCAM and LayerCAM. GradCAM++ had the worst temporal precision, followed closely by GradCAM and ScoreCAM.

When the similarity of the original explanations to those obtained with randomized labels was compared, methods such as GradCAM++, FullGrad, and Saliency produced very similar explanations, suggesting that their explanations are not class-specific. Deconvolution and Guided Backpropagation also yielded a high correlation with the original explanation. DeepLift, LRP, I × G, and GradCAM were the most robust.

In the case of randomized weights, Deconvolution and Guided Backpropagation had the highest R-values, followed by GradCAM++. For the Structural Similarity Index (SSIM), GradCAM++ had the highest value, followed by Saliency, FullGrad, and ScoreCAM. DeepLift, LRP, and I × G still had low values.

Overall, Deeplift was found to be the best closely followed by LRP and I × G,. They had a good trade-off in both robustness and sensitivity whereas GradCAM++ was the worst. Even though Saliency, Guided Backpropagation, and LayerCAM had good sensitivity, they were not very robust to randomizing labels and weights.

Spectral perturbation (frequency)

The averaged cross-validated performance metrics are summarized in Fig. 4. Based on the RMA measure, most methods have high accuracy, but DeepLift was still the most accurate/sensitive, closely followed by LRP, I × G, Guided GradCAM, Guided Backpropagation, Saliency, Deconvolution, and FullGrad. ScoreCAM and GradCAM++ were the worst, followed by GradCAM and LayerCAM.

When the similarity of the original explanations to those with randomized labels was compared, as before, GradCAM++, FullGrad, Saliency, Guided Backpropagation, and Deconvolution produced very similar explanations, suggesting that their explanations are not class-specific. LayerCAM and ScoreCAM also yielded a high correlation with the original explanation. DeepLift, LRP, I × G, GradCAM, and Guided GradCAM were the most robust.

In the case of randomized weights, GradCAM++, Deconvolution, and Guided Backpropagation had the highest R-values, followed by Saliency, FullGrad, ScoreCAM, LayerCAM, and Guided GradCAM. For SSIM, GradCAM++ had the highest value, followed by Saliency, FullGrad, and ScoreCAM. DeepLift, LRP, I × G, and GradCAM still had low values.

Overall, DeepLift was found to be the best, closely followed by LRP and I × G. They had a good trade-off between robustness and sensitivity, whereas GradCAM++ was the worst. Even though Saliency, Deconvolution, Guided Backpropagation, Guided GradCAM, and FullGrad had good sensitivity, they were not very robust to randomizing labels and weights.

Scalp distribution (spatial)

The averaged cross-validated performance metrics are summarized in Fig. 5. Here, cosine similarity was used instead of RMA because a non-zero ground-truth value exists in all channels due to volume conduction. Based on cosine similarity with the true explanation, GradCAM and ScoreCAM had the highest sensitivity, followed by GradCAM++, FullGrad, and LayerCAM. DeepLift, LRP, and I × G still had high values, but lower than those methods. Guided Backpropagation, Deconvolution, and Guided GradCAM were the worst for spatial relevancy. Even though GradCAM has high sensitivity, its performance drops much faster than that of the other methods at SNRs below −19 dB.

However, when the similarity of the original explanations to those with randomized labels was compared, the methods with the highest sensitivity to the ground truth (GradCAM++, ScoreCAM, and FullGrad) also had the most similarity to the randomized-label explanation. Saliency and Guided Backpropagation also had high similarity to the original explanation. DeepLift, LRP, I × G, GradCAM, and Guided GradCAM were the most robust.

Similarly, in the case of randomized weights, GradCAM++, ScoreCAM, and FullGrad, which had the highest sensitivity to the ground truth, also had the most similarity to the randomized-weight explanation. Saliency and Guided Backpropagation also had high similarity to the original explanation, and randomizing the weights yielded a high similarity for LayerCAM as well. DeepLift, LRP, I × G, GradCAM, and Guided GradCAM remained the most robust.

Overall, GradCAM, DeepLift, LRP, and I × G were the better approaches and had a good trade-off between robustness and sensitivity, whereas GradCAM++ was the worst. Even though ScoreCAM, FullGrad, LayerCAM, and Saliency had good sensitivity, they were not robust to randomizing labels and weights.

Figure 3

Comparison of the cross-validated metrics for different explanation methods with and without label/model weight randomization for detecting ERP components.

Figure 4

Comparison of the cross-validated metrics for different explanation methods with and without label/model weight randomization for detecting spectral perturbation features.

Figure 5

Comparison of the cross-validated metrics for different explanation methods with and without label/model weight randomization for detecting spatial features.

Discussion

Including explainability approaches in deep learning is critical to understanding how a model operates, identifying the most relevant features with discriminative power, and generating scientific insights about the datasets. However, choosing among these approaches requires a good understanding of the strengths and weaknesses of the available methods when applied to EEG. Twelve heatmap-based visualization methods were systematically compared for their ability to detect different fundamental attributes of EEG. Using a simulation framework allows us to control and know exactly which feature the model can learn from. With real EEG, it is very difficult to ensure that the model is learning only from a particular feature and to know the true ground truth, its location, duration, etc.; for the same reason, it would also be very difficult to compare how well the methods capture the ground-truth signal. The robustness and accuracy of these methods for varying temporal, spectral, and spatial attributes of EEG at different signal-to-noise ratios were compared. Figure 6 gives a high-level summary of the different comparisons. Methods with a mean sensitivity measure greater than 0.55 (higher is better) are indicated in dark blue. Red indicates that the method is not class-specific for the condition being considered (robustness measure > 0.5; higher is worse), and orange indicates a method that is not class-specific with a robustness measure between 0.3 and 0.5. If a method is not model-specific, it is indicated by asterisks: "**" if the robustness measure is > 0.5 and "*" if the robustness measure is between 0.3 and 0.5.

Figure 6

High-level summary of the comparison of the different explanation methods with and without label/model weight randomization across the temporal, spectral, and spatial conditions; LRP: layer-wise relevance propagation; IxG: Input × Gradient; G-GCAM: Guided GradCAM; Guided BP: guided back-propagation; Deconv: deconvolution.

Evaluating the robustness and sensitivity measures shows that, even though many methods have high accuracy/sensitivity to the feature of interest, they are not class- or model-specific. For example, Saliency/Gradient is a basic yet very commonly used model explanation method in EEG38, 40,41,42, and it has high sensitivity for detecting spectral perturbations and relevant channels. However, randomizing the model weights or labels yielded explanations very similar to the original, suggesting that they are neither model- nor label-specific. Therefore, this method should be used with caution. A similar observation was made for many of the other methods, such as Deconvolution, Guided Backpropagation, ScoreCAM, FullGrad, LayerCAM, and GradCAM++. GradCAM++ was one of the least reliable explanation methods.

On the other hand, DeepLift, Input × Gradient, and LRP were found to be both accurate and robust in all three cases (spatial, temporal, and spectral). Looking at the explanation metrics, LRP with the epsilon rule and Input × Gradient share very significant similarities. This is expected, as previous studies have shown that when all the non-linearities involved are ReLUs, epsilon-rule LRP approximates Input × Gradient59. Multiple studies in computer vision have assessed the unreliability of saliency-map-based approaches34, 60; however, these studies do not measure the accuracy of the explanation methods. This is an important question because, in the study by Adebayo et al.34, GradCAM was identified as one of the most reliable/robust explanation methods available. In this study, we show that even though its robustness is preserved in all three conditions, GradCAM is not ideal for the spectral perturbation and temporal conditions. The reason follows from the framework itself: GradCAM, like class activation maps in general, computes the model explanation w.r.t. the last convolutional block. With successive pooling and convolution operations, the temporal resolution of the activations in the final convolutional layer becomes small, and these methods estimate the relevant input by bilinear interpolation back up to the input dimension. This leads to reduced temporal resolution, a key attribute of EEG. However, when the temporal aspect is not of interest and the goal is spatial relevancy, GradCAM was found to be the most accurate method. Another limitation of GradCAM that needs to be checked for is that its performance decreased much faster than that of other methods as the SNR decreased, i.e., when the model confidence dropped. An additional point to keep in mind for researchers planning to use GradCAM is that many existing EEG architectures use a spatial convolutional layer in the initial layers. This spatially mixes the information across channels, so the succeeding layers no longer carry channel-independent data. Using GradCAM in such a case will not produce channel relevancy because the last convolution is purely temporal. Therefore, this study recommends that researchers adopting heatmap-based model explanation methods use either DeepLift or Layer-wise Relevance Propagation in general to explain deep learning models. However, unless the decoding is poor, GradCAM is still a good alternative for estimating spatial relevancy.

There are several reasons why DeepLift could be performing better than the other methods tested. Traditional gradient-based methods, such as saliency maps and gradient backpropagation, suffer from saturation61 when using certain activation functions: in the saturation regions, the gradient becomes zero, effectively masking their effects. These methods also have issues with "dead neurons", i.e., neurons that have no activity62, which is particularly relevant when using ReLU as it zeroes out the activity for certain inputs. DeepLift avoids these issues because it uses a baseline and has the conservation property, which ensures that the sum of the contribution scores for the input adds up to the difference between the model prediction for the specific input and that for the baseline. By redistributing each neuron's contribution across the input features w.r.t. the baseline, DeepLift overcomes the saturation problem and is therefore more flexible with activation functions that exhibit saturation or "dead neuron" effects. Also, since DeepLift distributes a neuron's activation across the input features using the conservation property, it may capture interactions between input features better. DeepLift also provides a layer-by-layer breakdown, unlike methods such as GradCAM that are limited to a specific layer of interest; it considers the contributions of each layer and neuron to the final prediction and can thus provide both local and global explanations.

Overall, LRP and DeepLift were the most reliable methods of all, and they were also the most accurate in identifying the ground truth. Even though GradCAM is one of the most robust methods, it fails when the SNR is low or when temporal precision is critical.

These methods, when added to existing studies, provide additional context for evaluating whether a model is biased toward spurious correlations or artifacts. There are multiple ways in which integrating explainable methods can be beneficial when developing a BCI. For instance, consider a model for decoding motor imagery for stroke rehabilitation. Ideally, the model should learn neural features from regions of the brain that represent limb movement (typically motor-related signals). If a DL model exhibits a significant performance gain compared to traditional models but is learning from irrelevant noise signals instead of motor-related potentials, the rehabilitation will not be effective and the performance increase becomes meaningless. The predictive models should utilize neural features associated with the task-specific region to induce neurorecovery, rather than unrelated neural activity that is not associated with the motor task. Similarly, if explanations can be provided on what the model was looking at when making its decisions, researchers and end users who are hesitant to use these models may be more open to their adoption.

One of the advantages of using DL is the possibility of avoiding hand-crafted features, so-called feature engineering: the model can automatically identify the relevant patterns required for decoding. This is another important area in which explainability offers tremendous possibilities. Understanding what the models are looking at could lead to new scientific discoveries and move the field forward. It can also inform practical implementation decisions. For example, in a recent study, Ravindran et al.50 demonstrated that balance perturbations could be detected from single-trial EEG. From the model explanations, the subset of channels deemed most important can be identified. When developing a wearable system for fall prevention, this information could be used to select a smaller montage, thus reducing the set-up time, cost, etc. of the system. Similar decision-making could be possible for decoders developed for other BCI tasks.

Additionally, explainability approaches allow us to understand the failure modes of a model, giving valuable insights about it. These methods help debug the model by identifying some of its limitations and mistakes, thereby improving it. Overall, there are a multitude of ways in which explainability approaches could be directly useful in improving BCIs.

Some of the limitations and future directions of the analysis are discussed below:

Approximation error

Synthetic EEG is only an approximation of measured EEG. Many physiological and non-physiological signals and artifacts that are generally present in measured EEG are not contained in the synthetic EEG. This is both an advantage and a limitation of the simulation approach: there is a possibility of missing some key EEG properties when modeling by simulation. In this study, however, objectivity of the comparison between methods was prioritized. Moreover, EEG data are quasi-stationary, context-dependent, and influenced by learning; interpretability models must also account for these factors if they are part of the experimental design. In future studies, the developed framework can be used to investigate identified confounds and more complex modeling.

High level explanations

The scope of this research is limited to visualization methods that highlight key segments of the input data, so determining which specific EEG feature caused a correct prediction can still be difficult. However, combining methods can help develop such insights. Knowing the scalp relevance heatmap can help isolate the relevant channels; checking the relevancy over time then provides temporal localization. Following this with activation maximization53 on these channels, or with other feature perturbation approaches54, can give insight into the relevant frequency bands or the feature being perturbed. This can be followed up with traditional signal processing methods focused on the relevant regions to gain additional insights. Such a workflow can also identify features the model is not sensitive to (if any), as well as regions that are unimportant and can be ignored.

Other approaches

Although this research limited the analysis to visualization-based approaches, there are other types of model explanations, as summarized in the introduction. Some of these methods could provide better insights; however, exploring all of them is outside the scope of this study and will be addressed in future work.

Conclusion and future directions

The approach used here can serve as a benchmark for future researchers to become familiar with the robustness and effectiveness of multiple explainable techniques, specifically different heatmap-based attribution methods. The research provides a summary and recommendations on when some of these methods fail and what they can capture in EEG. This study is limited to features that are commonly reported in the tasks studied here; there could be many other features to test for, and the set is not exhaustive. We kept the scope to a generic CNN architecture, particularly since many of the methods, such as deconvolution or guided backpropagation, are only applicable to CNN architectures. This choice was also informed by the literature, which suggests that the majority of EEG studies use CNN models8. However, there exist different variants of deep learning models, such as CNNs with attention modules and/or residual connections, Transformer models, recurrent neural networks, etc., and evaluating them would require further detailed analysis. We hope this work introduces a framework wherein future studies can leverage simulation and the lessons learned here to answer some of these interesting questions, expand to such model variants, and improve upon the framework introduced here.

Overall, this research identified that some of the most used model explanation methods, such as Saliency/Gradient, are not class- or model-specific. DeepLift was found to be consistently accurate as well as robust in detecting the three key attributes tested here. GradCAM, although consistently robust, does not have good temporal precision; however, it is still good for detecting spatial patterns in signals with high SNR. Specific recommendations and best practices for the use of back-propagation-based visualization methods in EEG-based decoder design are provided.

Methods

Convolutional neural network

The architecture of the model is summarized in Fig. 7. The intention was to use a very generic CNN model without any specialized architectural changes, to ensure generalizability to existing studies. The input to the model is a 1 s EEG window (batch size × 250 samples × 62 channels); two channels were removed as they are not contained in the forward model. The model consisted of 5 temporal convolution layers of 32 units each (5 × 1 kernel size with a stride length of 1) and 1 spatial convolution layer of 32 units (1 × 62 kernel size). The number of convolutional layers was kept at 6, as the majority of prior studies used 6 or fewer convolutional layers8. The filter size was selected such that the total receptive field of the final convolutional block would span at least half the sampling rate (125 Hz). A temporal pooling layer with a 2 × 1 pooling dimension and a stride of 2 was used after every convolutional layer except the last two blocks. The output from the convolutional layers was flattened and fed into a dense, fully connected layer of 32 hidden units, followed by an output layer with softmax activation.

A dropout layer with a dropout probability of 0.5 was added between the dense layer and the output layer to reduce overfitting. Except for the output layer, the model used ReLU as the activation function, as this is also the most popular activation function in the literature (70% of studies8). The model was implemented in Python 3.7 using the PyTorch library63. For each of the conditions (temporal, spatial, and spectral), an independent model was trained to classify the corresponding classes. Five-fold cross-validation was performed, and the model explanations and comparison metrics were estimated on the test set from each fold. The values across folds were then compared between the different types of model explanations.
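The sketch below is a minimal PyTorch reconstruction of this generic architecture, given for orientation only. The ordering of the temporal and spatial convolutions, the absence of padding, the class name, and the default of 4 output classes are our assumptions rather than details reported above.

```python
# Minimal, hedged sketch of the generic CNN described above (assumptions:
# temporal convolutions precede the spatial convolution, no padding, 4 classes).
import torch
import torch.nn as nn

class GenericEEGCNN(nn.Module):
    def __init__(self, n_channels=62, n_samples=250, n_classes=4):
        super().__init__()
        layers, in_ch = [], 1
        for block in range(5):                                   # five temporal convolution blocks
            layers += [nn.Conv2d(in_ch, 32, kernel_size=(5, 1), stride=1), nn.ReLU()]
            if block < 4:                                        # pooling after all but the last two conv blocks
                layers.append(nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1)))
            in_ch = 32
        layers += [nn.Conv2d(32, 32, kernel_size=(1, n_channels)), nn.ReLU()]  # spatial convolution
        self.features = nn.Sequential(*layers)

        with torch.no_grad():                                    # infer the flattened feature size
            n_flat = self.features(torch.zeros(1, 1, n_samples, n_channels)).numel()

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 32),                               # dense layer with 32 hidden units
            nn.ReLU(),
            nn.Dropout(p=0.5),                                   # dropout between dense and output layers
            nn.Linear(32, n_classes),                            # softmax applied via the loss / at inference
        )

    def forward(self, x):                                        # x: (batch, 1, 250 samples, 62 channels)
        return self.classifier(self.features(x))

model = GenericEEGCNN()
print(model(torch.randn(8, 1, 250, 62)).shape)                   # torch.Size([8, 4])
```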

Figure 7

Model architecture: each block corresponds to a different type of layer in the model. The dotted line illustrates the dropout operation during the training phase, aimed at reducing overfitting. During inference, all units were retained.

Simulated data

To compare the relative performance of different model explanation methods, the SEREEGA library64 was used to simulate ground-truth EEG features. The typical workflow used to simulate EEG activity with SEREEGA is summarized in Fig. 8. The process starts by defining the lead field matrix and the head model; the New York head model was used to generate the lead field matrix65. The toolbox supports a pre-generated leadfield that includes 75,000 source locations which can be projected to 228 sensor locations on the scalp. The New York head model provides a detailed segmentation of six tissue types (scalp, skull, cerebrospinal fluid, gray matter, white matter, and air cavities). Next, the source location from which to project the feature was selected; it can either be chosen randomly or manually based on Montreal Neurological Institute (MNI) coordinates66. The orientation of the dipoles was then chosen. Each source has a default orientation associated with it, but an orientation tangential or perpendicular to the scalp can also be chosen for each dipole. In this study, all dipoles were set perpendicular to the scalp surface to improve the localization of the ground-truth scalp projection.

Once the source and orientation are selected, an activation/signal is added to these sources. SEREEGA offers systematic deflections in the time domain to simulate event-related potentials, as well as systematic modulations of oscillatory activity to simulate event-related spectral perturbations. The toolbox also allows the simulation of different types of additive noise (pink, white, brown, etc.). Once the appropriate signal and noise are added, they can be mixed in varying proportions such that different signal-to-noise ratios (SNR) are achieved in the projected scalp EEG. In addition, uncorrelated white noise was added to simulate sensor noise. Using the combination of signal, noise, source location, and orientation, the toolbox allows the creation of ground-truth simulated EEG with controlled temporal, spatial, and spectral properties.

Figure 8

Steps present in generating different types of features in simulated EEG using the SEREEGA toolbox; ERSP: event-related spectral perturbation; ERP: event-related potential; NY: New York; SNR: signal-to-noise-ratio.

For all simulations, the leadfield matrix projecting the sources onto the actiCAP64 channel configuration was used. The sampling rate was set to 250 Hz, with each simulated epoch being a 1 s window. To replicate brain noise, a number of sources equal to the number of channels minus the number of signal dipoles was selected uniformly at random across the brain surface, and 5 μV pink noise was added to these sources, similar to the simulation replication done by Krol et al.64. For each condition, to evaluate the performance impact under varying SNR, noise was added to yield the following SNRs: −3.5 dB, −12 dB, −16 dB, −19 dB, and −23 dB. Figure 9 shows an example of a simulated ERP component with noise added at varying SNRs.
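For reference, the relation between the reported SNR levels and the signal/noise amplitudes can be expressed as in the hypothetical helper below, which assumes a power-ratio definition of SNR in dB; it is not part of SEREEGA, which performs the signal-to-noise mixing internally.

```python
# Hypothetical NumPy helper (not part of SEREEGA) illustrating a
# power-ratio definition of SNR in dB and noise scaling to a target SNR.
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """SNR in dB from the power ratio of the projected signal and noise."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

def mix_to_target_snr(signal: np.ndarray, noise: np.ndarray, target_db: float) -> np.ndarray:
    """Rescale the noise so that signal + noise reaches a target SNR, e.g. -19 dB."""
    scale = 10.0 ** ((snr_db(signal, noise) - target_db) / 20.0)
    return signal + scale * noise
```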

Figure 9

Representative example to demonstrate the effect of varying SNR on an ERP component.

Event-related potential components

To evaluate how well the different model explanations localize the temporal aspect of EEG, different ERP components were simulated. Four distinct classes of ERP components were simulated with N = 10,000 epochs per class. For each epoch, the source location was sampled from one of the 10 source locations in Table 2. Even though the precise location is not critical, to impose some constraint, source locations were selected corresponding to perturbation-evoked potentials (PEP) based on the ranges suggested by source analysis results from prior studies67,68,69. The source locations in MNI coordinates are shown in Fig. 10.

Table 2 MNI coordinates of the ERP sources.

The following attributes for the source components were tested in the simulation.

1. Class 1: Time-locked positive deflection of EEG. Class 1 contained a positive component centered at a latency of 60 ± 8 ms with a peak width of 50 ± 2 ms. The amplitude of the component was sampled uniformly at random between 1 and 13 μV. The component's magnitude and width closely resemble the characteristic range of the P1 component in perturbation-evoked potentials70. One of the first 5 source locations from Table 2 was selected randomly as the source location.

2. Class 2: Same properties as Class 1 but a different latency (latency difference). Class 2 contained a positive component centered at a latency of 900 ± 5 ms with a peak width of 100 ± 4 ms. The amplitude of the component was the same as that of Class 1. The latencies were shifted to avoid overlap between the two classes so that the explainability techniques could be better quantified and compared. One of the first 5 source locations from Table 2 was selected randomly as the source location.

3. Class 3: Same magnitude as Classes 1 and 2 but a negative deflection instead of a positive one (sign difference). Class 3 consisted of an ERP component with the same amplitude as Class 2 but inverted, with a latency centered at 500 ± 8 ms and a width of 100 ± 4 ms. One of the first 5 source locations from Table 2 was selected randomly as the source location.

4. Class 4: Same magnitude and sign as Class 3 but a different source location (source difference). Class 4 consisted of a signal with the same properties as Class 3 except that the source location was different: one of the source locations 6–10 from Table 2 was selected randomly.

Figure 10

Dipole locations in MNI coordinates for the ERP components.

Spectral perturbations

To test the sensitivity for detecting spectral perturbation events, four classes of data were simulated, each corresponding to spectral perturbation events in a separate frequency band. The magnitude of the signal was set to 0.5–3 μV64. For each epoch, the magnitude and latency were kept the same across classes, so the classes differed only in their spectral content/frequency. The latency of the center of the spectral burst for each epoch was sampled uniformly between 200 and 500 ms to add a source of variability, and the burst width was sampled randomly between 400 and 600 ms. The MNI coordinates used for the sources are summarized in Table 3. The source locations were based on dipoles associated with motor imagery/execution in prior literature71,72,73. For each epoch, one of the dipole locations was selected at random to act as the source. All dipole locations are shown in Fig. 11.

1. Class 1: Spectral perturbation in the frequency band of 3–8 Hz. The magnitude, latency, and width of the burst were randomized between epochs.

2. Class 2: Spectral perturbation in the frequency band of 8–13 Hz. The magnitude, latency, and width of the burst were randomized between epochs.

3. Class 3: Spectral perturbation in the frequency band of 14–30 Hz. The magnitude, latency, and width of the burst were randomized between epochs.

4. Class 4: Spectral perturbation in the frequency band of 30–58 Hz. The magnitude, latency, and width of the burst were randomized between epochs.

Table 3 MNI coordinates of the dipoles selected for the spectral perturbation and spatial condition simulations.
Figure 11

Dipole locations in MNI coordinates for both the spectral and spatial conditions.

A representative example of simulated EEG from each of the classes is shown in Fig. 12.

Figure 12

Representative example to demonstrate the effect of varying SNR on a spectral component.

Spatial precision

Different ERP components and spectral perturbations with identical properties but different dipole locations were simulated to assess channel specificity. The only separation between the two classes is the location of the source signal: Class 1 had dipoles localized in the left hemisphere and Class 2 had dipoles in the right hemisphere. Here the model is expected to learn all the distinct features and localize the correct scalp projection. The dipole source location for Class 1 was randomly selected from the left-hemisphere source locations in Table 3, whereas Class 2 corresponds to the right-hemisphere locations in Table 3.

Robustness and sensitivity analysis

For each condition, simulated EEG with the respective properties was generated as discussed above, forward projected, and then mixed with noise at varying signal-to-noise ratios. To obtain the ground-truth explanation, the tapered window corresponding to the signal location was forward projected using the same lead field matrix. The segment outside of the projected signal was assigned a value of 0, and the segment containing the signal (across all channels) was normalized by dividing by the maximum value. The sensitivity/accuracy of each method was assessed by evaluating the performance metrics (discussed below) w.r.t. this ground-truth data.

To test the robustness of each explanation method, the approach used by Adebayo et al.34 was adopted. Once the original explanation was obtained, the explanation was re-computed after independently randomizing the labels and the model weights; this tests whether the explanations are class- or model-specific. The similarity of these explanations to the original explanation was estimated using the absolute Pearson's correlation coefficient and the SSIM measure (detailed later). Ideally, if a method is accurate, its explanation should have high similarity to the ground truth; and if its explanations are model- and/or class-specific, the randomization should yield explanations very dissimilar from the original. If the explanations remain very similar after randomization, the explanation method is not robust. The process was repeated for each type of signal/condition and SNR level for all the explanation methods being compared.
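A sketch of the model-randomization part of this sanity check is given below. The helper names are ours, `trained_model`, `x_test`, and `y_test` are placeholders, and Captum's Saliency merely stands in for any of the attribution methods described in the next section.

```python
# Sketch of the sanity-check procedure adapted from Adebayo et al.
# (assumed helper names; placeholders for the trained model and test data).
import copy
import torch
from captum.attr import Saliency

def randomize_weights(model: torch.nn.Module) -> torch.nn.Module:
    """Return a deep copy of the model with all learnable parameters re-initialized."""
    randomized = copy.deepcopy(model)
    for module in randomized.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()
    return randomized

def explanation(model, x, target):
    """Placeholder attribution call; any method from the next section can be swapped in."""
    return Saliency(model).attribute(x, target=target).detach()

# heat_true   = explanation(trained_model, x_test, y_test)                      # original explanation
# heat_rand_w = explanation(randomize_weights(trained_model), x_test, y_test)   # weight randomization
# For label randomization, the model is retrained on permuted labels before the
# explanation is recomputed; a robust method yields low similarity between
# heat_true and both randomized explanations, and high similarity to the ground truth.
```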

Explanation methods

The different types of visualization-based explanation methods compared in this study are detailed below. All methods were implemented in Python using the PyTorch 1.7.0 framework63 with either Captum 0.4.074 or the pytorch-grad-cam toolbox75.

Gradient/saliency (S)

Gradient, or the basic saliency map (Sal) as it is referred to in some studies, is probably the earliest and still one of the most commonly used model explanation approaches. The gradient gives a measure of how a change in the input x would change the prediction S(x) in a small neighborhood around the input18. It is given by

$$\begin{aligned} Saliency/Gradient = \frac{\partial S}{\partial x}. \end{aligned}$$
(1)
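As an illustration, this attribution can be computed with Captum (the library used in this study) roughly as follows; `model` is assumed to be a trained classifier (e.g. the CNN sketched earlier), `x` a batch of EEG windows, and the target class index 0 is arbitrary.

```python
# Gradient/saliency attribution with Captum; `model` is an assumed trained classifier.
import torch
from captum.attr import Saliency

x = torch.randn(4, 1, 250, 62)                        # stand-in for a batch of simulated EEG epochs
attribution = Saliency(model).attribute(x, target=0)  # |dS/dx|, same shape as x (absolute value by default)
heatmap = attribution.detach().cpu().numpy()
```

Other Captum attribution classes used below (e.g. InputXGradient, DeepLift, LRP) follow the same `attribute(inputs, target=...)` pattern.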

Deconvolution

Deconvolution (Deconv) can be thought of as reversing the operations performed in a convolutional neural network27, essentially attempting to recreate the input from the output activation by running the CNN in reverse, top-down. Convolutions are replaced with deconvolutions (also called transposed convolutions), whose filters are the transposed filters of the forward pass. The process also replaces max-pooling layers with unpooling operations, wherein the feature map is upsampled according to the pooling parameters while retaining the maximum values; this is done by storing the positions of the maxima during the forward pass of the CNN. The process is repeated from the layer whose filter is to be visualized back to the input space.

Guided backpropagation

Guided backpropagation28 (Guided-BP) builds upon deconvolution. It combines vanilla backpropagation at ReLUs (keeping only the elements that were positive in the preceding feature map) with DeconvNet behavior (keeping only positive gradients).

Input × gradient

Input × Gradient is another attribution method, wherein the gradient is multiplied by the input x76. It is computed as

$$\begin{aligned} Input \times Gradient = \frac{\partial S}{\partial x} \cdot x. \end{aligned}$$
(2)

GradCAM

GradCAM is a generalization of the Class Activation Map (CAM) approach: CAM requires the CNN to have a global average pooling layer at the end of the convolutional blocks22, whereas GradCAM does not.

For the \(k^{th}\) feature map activation \(A_k\) in the final convolutional layer of a CNN, the gradient of the score \(y_c\) for the class c of interest is first computed. The gradient is then averaged over all nodes in the feature map to obtain an importance value \(\alpha _{k,c}\) for that feature map. The equation to estimate \(\alpha _{k,c}\) is

$$\begin{aligned} \alpha _{k,c} = \frac{1}{m \cdot n}\sum _{i=1}^{m}\sum _{j=1}^{n}\frac{\partial {y_c}}{\partial {A_{k,i,j}}}. \end{aligned}$$
(3)

Here, \(A_{k,i,j}\) is a single neuron/node at position (i, j) in the feature map \(A_k\) of dimension m × n. GradCAM then linearly combines the importance scores for all feature maps and passes the result through a ReLU, so that the total relevance score map equals

$$\begin{aligned} GradCAM = \textrm{ReLU} \left (\sum _k^K \alpha _{k,c}A_k \right). \end{aligned}$$
(4)

The relevancy score map is then upsampled to the input dimension using bilinear interpolation.
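A didactic sketch of Eqs. (3) and (4) in plain PyTorch is shown below; this is not the pytorch-grad-cam implementation used in the study, and the split of the model into a convolutional `features` module and a `classifier` head (as in the architecture sketch earlier) is our assumption.

```python
# Didactic GradCAM sketch implementing Eqs. (3) and (4) directly.
import torch
import torch.nn.functional as F

def grad_cam(features, classifier, x, target_class):
    A = features(x)                                           # feature maps A_k: (batch, K, m, n)
    y_c = classifier(A)[:, target_class].sum()                # class score y_c
    dA = torch.autograd.grad(y_c, A)[0]                       # dy_c / dA_k
    alpha = dA.mean(dim=(2, 3), keepdim=True)                 # Eq. (3): per-map importance alpha_{k,c}
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))        # Eq. (4): weighted sum followed by ReLU
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)  # normalized relevance map

# usage (assumed module names): cam = grad_cam(model.features, model.classifier, x, target_class=0)
```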

GradCAM++

GradCAM++ can be considered a generalized formulation of GradCAM23. This method uses second- and third-order derivatives of the gradients to obtain the gradient weights.

Guided GradCAM

Guided GradCAM is a combination of GradCAM and Guided Backpropagation that yields a pixel-level, granular GradCAM representation25: the two maps are combined by an element-wise product.

Layer wise relevance propagation

Layer-wise Relevance Propagation (LRP) redistributes the prediction score for a particular class of interest back to the input via a custom backward pass through the model, following a conservation principle20.

DeepLift

DeepLift is similar to LRP in the sense that it decomposes the output prediction for a particular input by backpropagating the contributions of all neurons in the model to each feature of the input21. DeepLift measures the change in output from a "reference" output w.r.t. the change in input from a "reference" input, where the reference is a task-irrelevant input. Here, an array of zeros with the same dimensions as the input is used46.
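With Captum, the DeepLift attribution with the zero baseline described above can be obtained roughly as follows; `model` and `x` are assumed to exist and the target class index 0 is arbitrary.

```python
# DeepLift attribution with a zero ("reference") baseline using Captum.
import torch
from captum.attr import DeepLift

baseline = torch.zeros_like(x)                               # task-irrelevant reference input
attribution = DeepLift(model).attribute(x, baselines=baseline, target=0)
# By the conservation (summation-to-delta) property, the attributions sum
# approximately to model(x)[:, 0] - model(baseline)[:, 0].
```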

ScoreCAM

ScoreCAM is a perturbation-based extension of the class activation map framework26. ScoreCAM masks parts of the input and observes the change in the prediction score for the class of interest, similar to the occlusion approach. However, unlike occlusion, the mask is obtained from the feature map activations of an initial forward pass: these are up-sampled to the input dimension and normalized to values between 0 and 1. The input is then masked based on these activation maps, and the masked input is fed into the CNN to compute the prediction score, which serves as the weight for the corresponding feature map. This process is repeated for all filters in the final convolutional layer, and the weighted maps are pooled to obtain the final ScoreCAM representation.
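The sketch below follows this description step by step; it is a didactic simplification rather than the pytorch-grad-cam implementation used in the study, and it again assumes the `features`/`classifier` split from the architecture sketch.

```python
# Didactic ScoreCAM sketch following the description above.
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam(features, classifier, x, target_class):
    A = features(x)                                              # (batch, K, m, n) activations
    maps = F.interpolate(A, size=x.shape[2:], mode="bilinear", align_corners=False)
    mn = maps.amin(dim=(2, 3), keepdim=True)                     # normalize each upsampled map to [0, 1]
    mx = maps.amax(dim=(2, 3), keepdim=True)
    masks = (maps - mn) / (mx - mn + 1e-8)

    weights = []
    for k in range(masks.shape[1]):                              # one forward pass per filter
        scores = classifier(features(x * masks[:, k:k + 1]))     # prediction on the masked input
        weights.append(F.softmax(scores, dim=1)[:, target_class])
    w = torch.stack(weights, dim=1)[..., None, None]             # (batch, K, 1, 1) per-map weights

    cam = F.relu((w * masks).sum(dim=1, keepdim=True))           # pooled, weighted relevance map
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)
```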

FullGrad

FullGrad is an attribution method that aggregates gradient information from the entire network by decomposing the prediction score into input sensitivity and per-neuron sensitivity components; it computes the gradients with respect to the biases across the entire network and sums them19.

LayerCAM

LayerCAM builds on GradCAM: class activation maps are extracted from all layers instead of only the final convolutional layer, as is done in CAM/GradCAM24.

Metrics

The visualization approaches assign relevancy or importance scores to each pixel/data point in the input. To compare different explanation methods, metrics are needed that quantify both the similarity of explanations after randomization and the efficiency in capturing the true underlying ground truth. For the robustness measure, both Pearson's correlation and the Structural Similarity Index (SSIM)77 were used to compare explanations before and after randomization. The outputs of the visualization methods compared here can be treated as images with relevancy scores on a pixel basis. SSIM has been demonstrated to agree well with human observers when using reference images by quantifying perceptual differences, and it has been shown to perform better than both the mean squared error and the peak signal-to-noise ratio. The correlation coefficient additionally quantifies the linear relationship between the two explanations. Ideally, for a robust method, the original explanation should be uncorrelated or only minimally correlated with the explanation obtained after randomization.

Robustness metrics

The measures used to compare the similarities between the explanations are adapted from Adebayo et al.34.

1. Pearson's Correlation Coefficient: The sample correlation between the (flattened) explanations, yielding a measure of the strength and direction of the linear relationship between the two. It is computed as

    $$\begin{aligned} r = \frac{cov(x,y)}{\sqrt{var(x)} \cdot \sqrt{var(y)}}. \end{aligned}$$
    (5)
2. Structural Similarity Index (SSIM): A measure of the perceptual similarity between two images. Given two images/inputs, SSIM quantifies distortion along the luminance, contrast, and correlation dimensions78 (a short code sketch of both robustness measures is given below).

Here, cov(x, y) is the covariance between x and y, and var denotes the variance.
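Both robustness measures can be computed, for example, with SciPy and scikit-image; the heatmaps are assumed to be 2-D NumPy arrays.

```python
# Robustness measures: absolute Pearson correlation (Eq. 5) and SSIM between an
# original explanation and the explanation after randomization.
import numpy as np
from scipy.stats import pearsonr
from skimage.metrics import structural_similarity

def robustness_metrics(heat_original: np.ndarray, heat_randomized: np.ndarray):
    r, _ = pearsonr(heat_original.ravel(), heat_randomized.ravel())
    data_range = max(heat_original.max() - heat_original.min(), 1e-8)
    ssim = structural_similarity(heat_original, heat_randomized, data_range=data_range)
    return abs(r), ssim          # lower values indicate a more robust explanation method
```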

Sensitivity metrics

To compare the effectiveness of these methods in identifying the true signal of interest, two measures are used to quantify sensitivity. The main goal of these measures is to verify that the majority of the top relevancy scores fall within the ground-truth region of the data. The ground-truth mask is a binary array with a value of one assigned to all non-zero data points in the ground truth and zero elsewhere. The relevance mass accuracy quantifies how much of the total relevancy assigned by a method is localized within the ground-truth region, giving a measure of accuracy.

1. Relevance Mass Accuracy (RMA): The ratio of the total relevancy inside the ground-truth mask to the total relevancy assigned over the input79. The equation to compute RMA is

    $$\begin{aligned} Relevance Mass Accuracy = \frac{R_{within}}{R_{total}}. \end{aligned}$$
    (6)

Here, \(R_{within}\) is the relevancy assigned by each method that falls within the ground truth, whereas \(R_{total}\) is the total relevancy assigned by the method. Since, in the simulation, the source signal is assigned to a dipole that projects onto the scalp surface, a non-zero ground-truth value is present in all channels due to volume conduction. Therefore, to compare the similarity with the ground-truth topoplot representation, a different similarity measure is used for the spatial data.

2. Cosine Similarity (for spatial sensitivity): The cosine of the angle between two non-zero vectors, which is equivalent to the inner product of the vectors after normalizing them to unit length80 (a short code sketch of both sensitivity measures is given after the equation below). The equation to compute cosine similarity is

    $$\begin{aligned} Cosine\ Similarity = \frac{A \cdot B}{\Vert A \Vert \, \Vert B \Vert } = \frac{\sum _i A_i B_i}{\sqrt{\sum _i A_i^2}\,\sqrt{\sum _i B_i^2}}. \end{aligned}$$
    (7)
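The two sensitivity measures can be computed, for example, as follows; relevance maps and the binary ground-truth mask are assumed to be NumPy arrays, and the use of absolute relevance values in the RMA is our assumption.

```python
# Sensitivity measures: relevance mass accuracy (Eq. 6) against a binary ground-truth
# mask, and cosine similarity (Eq. 7) for the spatial condition.
import numpy as np

def relevance_mass_accuracy(relevance: np.ndarray, gt_mask: np.ndarray) -> float:
    r = np.abs(relevance)                                              # assumption: absolute relevance
    return float(r[gt_mask.astype(bool)].sum() / (r.sum() + 1e-12))    # R_within / R_total

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```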