Introduction

Significant advancements in computing hardware, such as graphics processing units and field-programmable gate arrays, along with the availability of large datasets, have enabled researchers to develop highly effective neural networks in the last decade. However, training and utilizing these networks often consumes a large amount of energy, which restricts the deployment of neural networks in data- and/or energy-limited settings, typically applications in dynamic/mobile environments. In contrast, biology-inspired neural networks need only very few or even only one data point to perform at a competitive level compared to “traditional” neural networks (see page 54 in ref. 1). Therefore, machine learning architectures more closely resembling biological neural networks are quickly gaining in popularity. One such example is the spiking neural network (SNN)2,3,4, which mimics biological neural networks through its layers composed of spiking neurons. These neurons more closely resemble the synaptic connections between neurons in biological neural networks because they emit aperiodic spikes rather than the floating-point numbers output by traditional artificial neurons. This sparse and discrete behavior of SNNs has been shown to reduce energy consumption by orders of magnitude when implemented on emerging neuromorphic hardware5. However, as with standard multi-layer perceptrons, shallow SNNs can be insufficient to detect patterns that occur at random times/locations in tasks such as object detection/segmentation. This has inspired the development of hybrid convolutional and spiking neural networks, referred to as convolutional spiking neural networks (CSNNs)6,7,8, which combine the convolutional layer’s power of extracting spatio-temporal features with the energy efficiency of spiking neuron layers.
In the past few years, CSNNs have received increased attention in diverse applications such as computer vision9,10, speech recognition11, hand-gesture recognition12 and detection of Alzheimer’s disease13, reinforcing their utility in deciphering complex and multi-dimensional data.

The main contribution of this study is to evaluate the use of CSNNs in advanced driver-assist systems (ADAS), specifically those approaches that utilize electroencephalograms (EEGs). ADAS can be summarized as a group of assistive technologies designed to decrease the cognitive load associated with the driving task by assisting with driving and/or parking decisions, thus aiding the driver in safely operating their vehicle. This technology has been rapidly introduced in modern vehicles and has been shown to greatly improve road safety and reduce traffic accidents14. EEG is a method of measuring and recording electrical potentials from across various points in the human brain, thus serving as a primary method of discerning a person’s current cognitive activity. EEG-based applications are commonly explored in the field of brain-computer interface (BCI)15, which has contributed to the development of machine learning models dedicated specifically to the analysis and interpretation of EEG signals, for example “EEGNet”16. The inclusion of EEG as an auxiliary input source effectively fuses the fields of ADAS and BCI and gives subsequently developed technologies the advantage of an accurate real-time measure of otherwise unknown aspects of the driver state17,18,19,20 and also allows for the prediction of a driver’s intended action (e.g. braking)21,22,23 before it occurs. Literature has reported anticipatory potentials being observed as early as 130 ms24 and 320 ± 200 ms25 before action onset. The present study focuses on the latter advantage of EEG and seeks to train a CSNN as the predictive classifier to detect these anticipatory brain potentials and thus predict braking intention.

Although some initial studies have demonstrated the effectiveness of shallow SNNs in typical BCI applications26,27,28,29,30,31,32, the proliferation of other convolutional networks in the realm of BCI (e.g. EEGNet) and the reported success of deep learning methods in EEG decoding problems33 suggest that the inclusion of deep learning methods, such as the addition of convolutional layers, leads to a performance gain in classification tasks involving EEG data. Furthermore, the relative ease with which SNNs and their deep learning counterparts, CSNNs, can be mapped to emerging high-efficiency neuromorphic-computing hardware5,34,35,36,37 makes them ideally suited for deployment in mobile, energy-limited applications. The use of energy-efficient neuromorphic hardware becomes even more advantageous when implementing various learning methods for online continuous learning or one-shot learning38 in energy-constrained applications.

To the authors’ knowledge, the potential of CSNNs for EEG-based ADAS has not yet been explored, making this a novel contribution. To achieve a fairer juxtaposition than directly comparing the CSNN’s performance in this study to other methods in the literature, additional neural network models were trained on the same dataset to provide clearer context. These models include: i) a CNN of similar architecture; ii) EEGNet; and iii) three graph neural networks (GNNs). The CNN was chosen as a direct comparison of the spiking architecture to a non-spiking architecture. EEGNet was chosen as the “state of the art” benchmark model because it has previously generalized better across different BCI paradigms and achieved higher performance than existing CNNs and traditional approaches16. Lastly, GNNs were included as an alternative to standard CNNs because of their similar performance on adjacent EEG decoding tasks39,40,41,42.

Related work

Previous studies on BCI-based driver intent detection present a gamut of technical approaches that mainly differ in the pre-processing strategies and various classifiers used. Popular approaches include linear and quadratic discriminant analysis methods. Teng et al.43 combined regularized linear discriminant analysis with the sequential forward-floating search method of feature engineering, Kim et al.44 combined EEG, tibialis anterior electromyography (EMG) and brake pedal signal, and Haufe et al.24 compared EEG, EMG and brake pedal response to determine the input features that predict braking intention the fastest. Khaliliardali et al.25,45, on the other hand, used quadratic discriminant analysis in conjunction with bandpass filtering the EEG inputs in a low frequency range of 0.1–1 Hz.

As competitors to discriminant analysis methods, other classification methods prevalent in the literature are shallow and deep neural networks. For example, Hernandez et al.46 investigated support vector machines and convolutional neural networks (CNNs) to differentiate normal driving and braking intention EEG signals, achieving a reported average accuracy of 71% and 72% for support vector machines and CNNs, respectively. Nguyen et al.47 featured a multilayer perceptron neural network in a comparison of EEG band power-based and autoregressive-based feature selection methods, reporting a better accuracy of 91% with the autoregressive-based method. Lee et al.48 used recurrent convolutional neural networks (RCNNs) for an EEG braking intention decoding task, achieving an AUC score of 0.86. It is evident from the literature that a variety of methods have been used with mixed results. Although there are some examples of neural network usage, the use of SNNs for the braking intention EEG decoding problem is noticeably absent.

The EEG pattern studied here is the contingent negative variation (CNV), which is a type of slow cortical potential (SCP) that occurs prior to movement in the central region of the brain. The CNV, in particular, manifests when a subject is given a warning stimulus followed closely by an imperative stimulus, that is, a stimulus requiring an action. It is featured in previous movement intention literature that also focused on driver braking intent detection21,25,45 and has a strong theoretical foundation as a key anticipatory signal. Using CNV as a measure of how global and local temporal prediction affects expectancy implementation, it has been shown that the signal has a larger amplitude when a given stimulus is expected to occur versus when it is unexpected49. Furthermore, Mento50 provided a comprehensive treatment of CNV as a marker for cognitive expectancy and motor preparation for the warning-imperative stimulus pair paradigm. However, the CNV is not the only EEG pattern used for intention detection in the literature. Event-related desynchronization (ERD) and Bereitschaftspotentials (readiness potentials) are both alternative indications of movement preparation. ERD is an EEG phenomenon occurring in the mu and beta frequency bands up to two seconds before movement is realized. It is marked by a decrease in the spectral power of EEG within those bands that is not restored until after the movement is completed. On the other hand, Bereitschaftspotentials are slowly building neural signals which occur 1–2 s before movement onset, similar to CNV.

Previous research illustrating ERD as a movement preparation indicator includes a study to find a suitable classifier for ERD using data from self-determined reaching movement experiments51, the development of a novel algorithm for using ERDs to detect hand movement intention using adaptive wavelet transform52 and the use of ERDs as model inputs to reduce false positives in a motor imagery for rehabilitation application using a two-phase classifier design53. Bereitschaftspotentials have also received attention, with advancements made in their understanding and prediction. Nguyen et al.54 conducted a study integrating simultaneously acquired EEG and fMRI through computational modeling and determined that reciprocal connections between the SMA and anterior mid-cingulate cortex are important to maintain sustained activity of the readiness potential before movement. Mirzabagherian et al.55 developed two convolutional neural networks composed of temporal-spatial, separable and depth-wise layers and used these networks to detect a type of Bereitschaftspotential known as movement-related cortical potentials (MRCPs), indicating five different hand movements performed by patients with cervical spinal cord injury. Gatti et al.56 also studied harnessing MRCPs for movement speed and force intent detection. Mussini and Di Russo57 investigated how anxiety can affect anticipatory brain functions by observing its effect on pre-stimulus ERP and the Bereitschaftspotential when performing tasks with and without feedback. Both ERD and Bereitschaftspotentials are popular and useful EEG-related signals that could serve as an alternative to slow cortical potentials for movement intention related applications. To the authors’ knowledge, the use of SNNs for predicting ERD or Bereitschaftspotentials is yet to be explored.

Results

Identification of braking intention signature in EEG signals

An example of the pre-processed EEG signals from 19 channels with data markers signifying the temporal locations of the audible countdown commands is shown in Fig. 1a. Each marker is denoted by its associated countdown number from 5 to 1 and ending with “STOP” when the stop command was given. The Cz grand average of the pre-processed data is shown in Fig. 1b along with scalp plots visualizing how the channel grand averages changed over time. The grand averages were calculated by averaging the Cz electrode signal from all participants and across all trials, similar to the procedure followed by Khaliliardali, Chavarriaga, Gheorghe and Millan25. The following observations can be made from Fig. 1.

1. The negative EEG potential, termed contingent negative variation (CNV) potential, started after the “2” count marker and reached the maximum negative value between the “1” count marker and “Stop” command.
2. The negativity rate sharply increased at the “1” count marker and the potential rate became sharply positive midway between the “1” count marker and “Stop” command.
3. Anticipatory potentials were clearly observed before the actual braking action.
4. The more negative potentials were spatially localized in the centro-medial electrodes.

The results obtained are consistent with other past studies on CNV21,25,45.

Figure 1

Pre-processed EEG signals. (a) Channel potentials with associated countdown and “Stop” command markers and scale. (b) Cz grand average signal with scalp maps representing the grand average at the midpoint between two neighboring markers and color bar on the right displaying the potential in \(\upmu V\).

Table 1 Classification performance with floating-point EEG input data (best performance in each classification measure highlighted in bold font).
Table 2 Five-channel ablation study with floating-point EEG input data (best performance in each classification measure highlighted in bold font).

Classification performance—case 1: 32-bit single precision floating-point EEG input data

The final pre-processed dataset used for training the models included 10,702 data segments collected from 15 participants in an experiment described in the Experimental Design section and preprocessed according to the procedure outlined in the Data Preprocessing section: 8573 data segments labelled as class ‘0’ (no intention signal) and 2129 data segments labelled as class ‘1’ (intention signal). The models were evaluated by means of tenfold stratified cross-validation where the training and testing partitions of each fold maintained the original class distribution. The data segments in each fold were shuffled before training and testing. To mitigate the skewed distribution of the classes, wherein approximately 80% of the dataset was negative class and approximately 20% was positive class, the samples belonging to each class were given a weight used in the training loss function. The class weights were calculated as:

$$\begin{aligned} w_{c,i} = \frac{N_{D}}{2N_{c,i}} \end{aligned}$$
(1)

where \(N_{D}\) and \(N_{c,i}\) are the total number of data points in the training partition and the number of class i data points in the training partition, respectively.
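As an illustrative check of Eq. (1), the sketch below evaluates the two weights using the whole-dataset class counts quoted above (8573 and 2129). In the study the counts come from the training partition of each fold, so the actual per-fold values would differ slightly. The function name `class_weights` is hypothetical.

```python
def class_weights(class_counts):
    """Eq. (1): w_c = N_D / (2 * N_c) for a two-class problem."""
    n_total = sum(class_counts.values())
    return {label: n_total / (2 * n) for label, n in class_counts.items()}


# Whole-dataset counts, used here only for illustration (an assumption;
# the paper computes the weights on each training partition).
counts = {0: 8573, 1: 2129}
weights = class_weights(counts)
print(weights)

# Each class then carries the same total weight, N_D / 2.
weighted_totals = {label: weights[label] * counts[label] for label in counts}
print(weighted_totals)
```

With this weighting, the minority class receives a weight of roughly 2.5 and the majority class roughly 0.62, so both classes contribute equally to the summed loss.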

The convolutional layers in the CSNN and CNN used a kernel size of \(5\times 5\), a stride of 1, a padding of 0, and 12 and 64 filters for the first and second convolutional layers, respectively. All max pooling layers of both architectures used a \(2\times 2\) pooling region with a stride of 2. The sigmoid surrogate function smoothness value, k, for the surrogate backpropagation of the CSNN was chosen as 0.25. The GCN and GCS graph convolutional layers had output sizes of 115, 28, 14 and 3. The GIN had graph convolutional layers with output sizes of 924, 462 and 231. The multi-layer perceptrons used in the GIN contained 5 hidden layers, each with 256 hidden neurons. All GNN architectures utilized in this work were implemented using the Spektral Python package58. The CSNN and CNN models were implemented using PyTorch59. EEGNet was implemented using TensorFlow60 with default values with the exception of the length of the 2D convolutional kernel, which was set to 250.

All architectures were trained for up to 1000 epochs per fold with a batch size of 8, with early termination if the loss did not improve for 50 consecutive epochs. Each architecture used the same weight initialization in every fold. The CSNN Leaky-Integrate-and-Fire (LIF) layer had a memory decay rate of 0.5, a spiking threshold of 0.5 and used 25 input steps. All models used the Adam optimizer with learning rate \(\gamma = 5\times 10^{-4}\), running average coefficients \(\beta _{1} = 0.9\) and \(\beta _{2} = 0.999\), stability parameter \(\epsilon = 10^{-8}\), and binary or categorical (for EEGNet) cross entropy as the loss function.
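The early-termination rule can be sketched as a simple patience counter. This is a minimal illustration, assuming that "improve" means a strictly lower loss than the best seen so far; which loss was monitored is not specified here, and the function name and the loss sequence are hypothetical.

```python
def epochs_until_stop(losses, patience=50, max_epochs=1000):
    """Return how many epochs run before training halts.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs, or after `max_epochs` epochs.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(losses), max_epochs)


# Hypothetical loss curve: improves for two epochs, then plateaus.
toy_losses = [1.0, 0.9] + [0.95] * 100
stopped_at = epochs_until_stop(toy_losses)
print(stopped_at)
```

In this toy curve the best loss occurs at epoch 2, so the counter reaches 50 at epoch 52 and training halts there.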

Table 1 shows the mean and standard deviation for predictive accuracy (Acc), true positive rate (TPR), true negative rate (TNR), F1-score, number of epochs trained, and total training time for each model. The table also shows each model’s inference time for a single data point. The p-values from two-sample t-tests comparing the CSNN with the other models are shown for each classification metric. No attempt was made to optimize the learning parameters of each network. The results indicate that the CSNN outperformed all the other models in every classification metric category albeit closely followed by EEGNet. The CSNN also showed small standard deviations showcasing the consistency in results across folds. The GNNs exhibited a tendency to get stuck in local minima, evidenced by large standard deviations in all the performance metrics. However, the CSNN had the largest average training time and largest inference time out of all of the models.
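For reference, the four classification measures in Table 1 follow from the confusion-matrix counts of a test fold. The sketch below uses the standard definitions with made-up counts that are not results from this study; the function name is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, true positive rate, true negative rate and F1-score."""
    return {
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "TPR": tp / (tp + fn),   # sensitivity: intention segments found
        "TNR": tn / (tn + fp),   # specificity: non-intention segments found
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Toy counts for illustration only (hypothetical, not from the paper).
metrics = classification_metrics(tp=8, fp=2, tn=18, fn=2)
print(metrics)
```

With roughly 80% of segments in the negative class, a classifier that always predicts class ‘0’ would reach about 80% accuracy with a TPR of zero, which illustrates why accuracy alone can be misleading on this dataset.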

Classification performance—case 2: ablation study with 32-bit single precision floating-point EEG input data from five channels

Laboratory experiments offer the privilege of using state-of-the-art equipment that is typically unhindered in its data collection methods. However, real-world scenarios may present unique challenges where a full 20 channel EEG headset is not possible or practical. To study how a reduced number of available channels would affect the performance of the classifier, a five-channel analysis was performed. The Cz channel and four surrounding channels (Pz, C3, C4 and Fz) were considered for the ablation study. Table 2 shows results for the classification performance with a reduced number of channels.

Classification performance—case 3: delta-modulated spike train input data

The effect of processing the input EEG data from all 19 channels into spike train data before passing to the CSNN network was also studied, namely its impact on classification performance of the network. The filtered and segmented EEG data, normalized to lie in the range of 0 to 1 inclusively, was transformed into a 19 channel array of spike train data by monitoring the change in value of successive data points in each channel. If the value change was greater than a threshold value, a spike ’1’ was recorded. If not, a ’0’ was recorded. The result was a binary array of the same dimensions as the floating-point input. The threshold value was varied from 0.05 to 1 which resulted in spike trains of differing densities. The CSNN model used had the same parameters as in Case 1.
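The conversion described above can be sketched in a few lines. Two details are assumptions made here for illustration: only increases larger than the threshold produce a spike (the signed difference is compared, not its magnitude), and the first sample of each channel, which has no predecessor, is recorded as ‘0’ so that the output keeps the input dimensions. The function name and the toy segment are hypothetical.

```python
def delta_modulate(segment, threshold):
    """Convert a normalized EEG segment into binary spike trains.

    segment: list of channels, each a list of samples in [0, 1].
    Returns a list of the same shape containing 0s and 1s.
    """
    spikes = []
    for channel in segment:
        train = [0]  # first sample has no predecessor (assumption)
        for prev, curr in zip(channel, channel[1:]):
            train.append(1 if curr - prev > threshold else 0)
        spikes.append(train)
    return spikes


# Two toy channels with five samples each (hypothetical values).
segment = [[0.0, 0.1, 0.8, 0.9, 0.2],
           [0.5, 0.5, 0.4, 1.0, 1.0]]
spikes = delta_modulate(segment, threshold=0.5)
print(spikes)
```

Under these assumptions, a threshold of 1 can never be exceeded by data normalized to the range 0 to 1, so the resulting spike train is empty, while a very small threshold marks nearly every increase as a spike.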

Table 3 shows the tenfold classification results for Case 3. The results indicate that good predictive performance can still be achieved when the floating-point EEG data are converted into spike trains prior to being input to the network, provided a suitable threshold value is selected. Spike train conversion thresholds that are too small or too large constrain the ability of the CSNN to learn effectively, most likely because features important to the correct classification of data are obscured. If the threshold is too small, the spike train becomes saturated, making it difficult for the network to determine the important features. If the threshold is too large, important features may not be captured at all.

The best results were obtained with a threshold of 0.5. The sensitivity of the classification results to the threshold value was studied using two-sample t-tests comparing the classification measures corresponding to the 0.5 threshold with the measures corresponding to the other thresholds. The t-test results shown in Table 3 indicate significant performance degradation above a threshold of 0.625 and below 0.375. This implies that a range of threshold values could be used to obtain statistically similar results, offering flexibility in the threshold selection. The performance of the CSNN using a threshold value of 0.5 was comparable to the EEGNet results when trained on the floating-point data. A two-sample t-test showed that TNR for the CSNN was statistically better (p-value was 0.041), but the accuracy and F1-score metrics were statistically similar (p-values were 0.700 and 0.537, respectively). On the other hand, the TPR for EEGNet was significantly better (p-value was 0.028).

Table 3 Classification performance with delta modulated spike train input data (best performance in each classification measure highlighted in bold font).

Discussion

The results presented here show that the CSNN can be used as a classifier for detecting features in EEG data that predict braking intention before the physical action occurs. To benchmark the CSNN performance, results were compared to a standard CNN, EEGNet and three GNN models using a 10-fold cross-validation scheme, with the CSNN achieving the highest and most consistent performance. The p-values from two-sample t-tests in Table 1 show a significantly higher performance of the CSNN over the GNNs in almost every metric category (except TNR of the GCN network, where the p-value was slightly above 0.05). This result is not surprising when the means of the metrics are compared. The p-values for the CNN and EEGNet tell a different story: the CNN had noticeably lower mean values than the CSNN yet was statistically similar, whereas EEGNet had the closest mean values to the CSNN of any model yet was statistically different. This can be explained by the fact that the CNN had enough folds containing results that were nearly identical to the CSNN performance, but also had a couple of folds with poor performance that biased the grand averages. As a result, the overall model performance was not significantly different from that of the CSNN. On the other hand, although EEGNet performance was competitive and consistent, it did not quite match the CSNN performance on any fold. Therefore, the p-values indicated statistically significant higher performance of the CSNN compared with EEGNet. The authors hypothesize that the CSNN’s success arises because converting the floating-point numbers to spike trains allows it to filter more efficiently, passing the most important feature maps to the next layer.

Despite the CSNN’s success in classification, it had the longest average training time and largest inference time by a large margin. While computational efficiency could be improved by deploying CSNNs on neuromorphic hardware, those gains were not realized in this study where only a von Neumann computer was utilized. Neuromorphic chips, such as the Intel Loihi chip, have been shown to produce faster training times than von Neumann processors and deep learning accelerators5.

Results of the five-channel ablation study indicate that a few strategically chosen channels may be sufficient with the CSNN’s classification performance being nearly identical to that attained with all 19 channels. It can also be seen that the CNN performance increased considerably compared to the results shown in Table 1 with all 19 channels. Although the mean performance of the CNN was substantially lower when all 19 channels were used, the p-values in Table 1 indicate that the results were not significantly different from the CSNN. As noted previously, a couple of folds with poor performance biased the CNN grand average. Therefore, the boost in CNN performance to the level of CSNN in the ablation study is not altogether surprising and could simply have resulted from more consistent performance across folds when using only five channels.

The spike train conversion results shown in Table 3 have significant implications because converting the floating-point EEG data to spike trains at the outset could allow for additional energy savings, thereby taking full advantage of any neuromorphic computing hardware used to implement the CSNN for large, complex datasets and/or in real-time applications. Findings from this study can be exploited in future work to implement the CSNN on a neuromorphic platform to study the actual energy efficiency and feasibility for on-line learning in real-time applications.

The Cz grand average calculation and the analysis itself are based on the assumption of zero phase shift of the CNV pattern relative to the external stimuli. Lew et al.21 and Khaliliardali et al.25 discuss in great detail SCPs, of which CNVs are a subset. In these studies, the SCP was determined to begin as early as 1.5 s before the onset of movement and therefore the crucial aspects of the CNV should be contained within the data between the “1” count marker and the stop marker. Variations of the exact timing of the pattern will occur between trials but, as mentioned above, the grand average pattern obtained in this study is consistent with the results reported in the literature. Also, it is well known that EEG has poor spatial resolution. For this reason, the CNV pattern in Fig. 1 appears to occur across the entire brain instead of the central area as expected. Results with a higher number of channels than the 19 channels considered in this study have reported greater regional localization25. Furthermore, ablation study results presented in Table 2 confirm that the CNV is occurring in the central area.

The EEG data were collected from 15 participants with the dataset containing a total of 3244 trials that were cleaned and segmented into 10,702 data segments. Although the number of participants in this study is relatively small, it is on par with other studies in the literature. The classification experiment conducted in-house involved using a simulated-realistic testbed with the participant operating a remote-controlled vehicle using a live video feed under ideal conditions. Although this study focused on a narrow frequency band of 0.1–1 Hz as suggested by Garipelli et al.22 and similar to other studies21,25,45, a much wider 0.1–45 Hz bandpass filter has also been investigated44,46,49. A possible extension to this work could examine the performance of the CSNN using the full neural dynamics. It would also be interesting to study the performance of the CSNN with participants operating under cognitive stress, for example, when under fatigue or in the presence of distractions, or to study the abilities of the CSNN in other EEG-BCI applications such as P300, motor imagery, motor-related cortical potentials and steady-state evoked potentials. Exploring the use of CSNNs for movement intention detection using Bereitschaftspotentials, which are well-known as an indicator of movement preparation within the brain, much like CNVs, is also compelling. Using these potentials with CSNNs to detect braking intention, or the intention to perform other driving-related tasks, would be another interesting research direction.

Methods

Figure 2

Schematic of neural network architectures. (a) CSNN. (b) CNN.

Figure 3

Schematic diagram of EEGNet.

Convolutional spiking neural networks

CSNNs are deep networks comprised of standard convolutional layers that extract feature maps from the input data before passing these feature maps to subsequent spiking layers. These combination layers are referred to here as “convolutional-spiking” layers. For classification, the output layer is composed of a fully connected layer with linear activation function followed by a spiking layer. Figure 2a shows a schematic of the specific CSNN used in this study. It is comprised of two convolutional-spiking layers followed by the output layer. Each convolutional-spiking layer is comprised of a two-dimensional convolution layer followed by a two-dimensional max pooling layer and ending with an output spiking layer. The spiking layer in each convolutional-spiking layer is composed of a tensor of LIF neurons having the same shape as the shape of the input to the layer. In the output layer, the fully connected layer and the subsequent output LIF spiking layer both have two neurons for the two classes in the EEG dataset. The predicted output of the CSNN was determined by counting the number of spikes output by these two neurons and setting the predicted label to the class represented by the neuron which produced the most spikes. A tie would result in a predicted class of ‘0’.
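The spike-count decision rule and the LIF settings quoted in the Results (decay rate 0.5, threshold 0.5, 25 input steps) can be illustrated with a short sketch. The membrane update and the reset-by-subtraction rule below are assumptions, since the exact LIF formulation is not spelled out here, and the constant input currents are hypothetical stand-ins for the outputs of the fully connected layer.

```python
def lif_spike_train(inputs, beta=0.5, threshold=0.5):
    """Simulate one leaky integrate-and-fire neuron over a sequence of inputs."""
    mem, spikes = 0.0, []
    for current in inputs:
        mem = beta * mem + current        # leaky integration
        spike = 1 if mem > threshold else 0
        mem -= spike * threshold          # reset by subtraction (assumption)
        spikes.append(spike)
    return spikes


def predict_class(spikes_class0, spikes_class1):
    """Pick the class whose output neuron spiked most; a tie yields class 0."""
    return 1 if sum(spikes_class1) > sum(spikes_class0) else 0


steps = 25
out0 = lif_spike_train([0.2] * steps)  # weak drive, hypothetical
out1 = lif_spike_train([0.3] * steps)  # stronger drive, hypothetical
print(sum(out0), sum(out1), predict_class(out0, out1))
```

In this toy case the weakly driven neuron never crosses the threshold (its membrane potential settles near 0.4), while the more strongly driven neuron fires every third step, so class ‘1’ is predicted.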

Due to the non-differentiable nature of the output of spiking neurons, training SNNs is difficult and requires special approaches beyond simply using standard backpropagation. If the spiking behavior of a neuron is represented as:

$$\begin{aligned} S[t] = \Theta (U_{mem}[t]-\theta ) \end{aligned}$$
(2)

where \(\Theta (\cdot )\) is the Heaviside function, \(U_{mem}[t]\) is the membrane potential of the neuron and \(\theta\) is the spiking threshold, the derivative of (2) with respect to \(U_{mem}\) is the Dirac delta function:

$$\begin{aligned} \frac{\partial S}{\partial U_{mem}[t]}=\delta (U_{mem}[t] - \theta ) \in \{0,\infty \} \end{aligned}$$
(3)

which is defined as zero for all time except where \(U_{mem} = \theta\) at which it is infinity. This leads to the “dead neuron” problem for training using backpropagation. To mitigate this, the surrogate gradient approach is employed wherein during the “backward-pass”, when the gradient of the loss function with respect to the network parameters is being computed, the Heaviside function is approximated using a sigmoidal function, thereby creating a readily differentiable function. The exact function used as the surrogate in this paper is described as:

$$\begin{aligned} {\tilde{S}} = \frac{U_{mem}[t]-\theta }{1+k|U_{mem}[t]-\theta |} \end{aligned}$$
(4)

where k is known as the ’slope’ and determines the smoothness of the surrogate function. The derivative of (4) is then obtained as:

$$\begin{aligned} \frac{\partial {\tilde{S}}}{\partial U_{mem}[t]} = \frac{1}{(k|U_{mem}[t]-\theta |+1)^2} \end{aligned}$$
(5)

From (5) it can be seen that as k increases, (5) converges to (3). For a more detailed explanation of SNNs and their training, the reader is referred to61.
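As a quick numerical sanity check, the surrogate in (4) and its derivative in (5) can be written out directly and compared against a finite-difference estimate. The sketch assumes the values quoted in the Results (k = 0.25 and a threshold of 0.5) as defaults; the function names and the evaluation point are illustrative.

```python
def fast_sigmoid(u_mem, theta=0.5, k=0.25):
    """Surrogate of Eq. (4): x / (1 + k|x|) with x = U_mem - theta."""
    x = u_mem - theta
    return x / (1.0 + k * abs(x))


def fast_sigmoid_grad(u_mem, theta=0.5, k=0.25):
    """Derivative of the surrogate, Eq. (5): 1 / (k|x| + 1)^2."""
    return 1.0 / (k * abs(u_mem - theta) + 1.0) ** 2


# Compare the analytic gradient with a central finite difference.
u, h = 1.3, 1e-6
numeric = (fast_sigmoid(u + h) - fast_sigmoid(u - h)) / (2 * h)
analytic = fast_sigmoid_grad(u)
print(numeric, analytic)
```

Unlike the Dirac delta in (3), this gradient is finite and non-zero everywhere, peaking at a value of 1 when the membrane potential equals the threshold.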

Convolutional neural network

For the sake of a direct spiking vs. non-spiking comparison, a CNN with an architecture largely identical to that of the CSNN was considered, in order to quantify any differences in performance that may arise from the addition of the spiking layers. As shown in Fig. 2b, the CNN has two convolutional layers, each including a max pooling layer, followed by a fully connected linear layer and ending with a logistic sigmoid output layer for class prediction.

EEGNet

EEGNet is a single CNN architecture designed for classification tasks across multiple EEG-based BCI domains (P300 visual-evoked potentials, error-related negativity responses, movement-related cortical potentials and sensory motor rhythms)16. It consists of two blocks of convolutional layers followed by a dense layer and, finally, a softmax layer. It is compact in terms of the number of model parameters (see Fig. 3). In the first block, two convolutional operations are done in sequence. This block starts with a temporal convolution to learn frequency filters followed by “depthwise” convolution to learn frequency-specific spatial filters. The second block also includes two convolutional operations. The first is another “depthwise” convolution to individually learn the temporal feature map and the second is “pointwise” convolution to optimally combine the feature maps. These two convolutions are combined into one layer, termed “Separable 2D Convolution”. The output of the second block of layers is then flattened and passed to the dense and softmax layer for generation of the predicted class.

Graph neural networks

GNNs are a specialized version of neural networks designed to operate on graph data. A graph is a grouping of data with defined internal relationships (edges) between objects (nodes) where these relationships may or may not be Euclidean in nature. Mathematically, a graph is typically represented as: \({\mathscr {G}}=({\mathscr {V}},{\mathscr {E}},A)\) where \({\mathscr {V}}\) represents a finite set of nodes with cardinality \(|{\mathscr {V}}| = N\), \({\mathscr {E}}\) is a set of edges between the nodes, and \(A \in {\mathbb {R}}^{N \times N}\) is the adjacency matrix containing the edge weights. Graph data is input to a GNN as a matrix of node feature vectors: \(X \in {\mathbb {R}}^{N \times n}\), where N is the number of nodes and n is the number of node features, along with its adjacency matrix, A, and sometimes a set of edge features, E. Some graph operations also include the diagonal degree matrix D where \(D_{ii} = \sum _{j} A_{ij}\). For more details on graph theory and a detailed survey providing a comprehensive overview of GNNs, the reader is referred to62,63. Because of the spatial relationship between electrodes, EEG data can naturally be represented as graphs with each data input sharing the same node and edge structure, thereby differing only in node data. Expressing the data as graphs allows for the non-Euclidean spatial relationships to be exploited as extra information available to the classifier. For this reason, EEG-BCI applications have used GNNs previously, to notable effect.

Figure 4

Schematic diagrams of GNNs. (a) GCNConv, (b) GCSConv, (c) GINConv.

As shown in Fig. 4, the three GNN architectures used in this work differ only in their initial graph processing layers and share the same last three layers. Each network has a global attention sum pool graph aggregation layer, which feeds into a fully-connected hidden layer with rectified linear unit (ReLU) activation. The final layer is a classification output layer consisting of one neuron with a logistic sigmoid activation function. The global attention sum pool layer computes

$$\begin{aligned} X^{'}= \sum _{i=1}^{N}\alpha _{i}X_{i}, \quad \alpha = \text {softmax}(Xa) \end{aligned}$$
(6)

where a is a trainable weight vector, X is the layer input tensor, N is the number of nodes in the input graph, and the softmax operation is applied over nodes instead of features.
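To make Eq. (6) concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. The function names and the toy values of X and a are illustrative assumptions; in the networks themselves, a is learned during training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention_sum_pool(X, a):
    """Pool an N x n node-feature matrix X into one length-n vector (Eq. 6).

    X is a list of N rows (one per node); a is the attention vector of length n.
    """
    # One attention score per node: (Xa)_i = sum_k X[i][k] * a[k]
    scores = [sum(x_k * a_k for x_k, a_k in zip(row, a)) for row in X]
    # Softmax over nodes (not over features), so the N weights sum to 1
    alpha = softmax(scores)
    n_features = len(X[0])
    # Attention-weighted sum of the node feature vectors
    return [sum(alpha[i] * X[i][k] for i in range(len(X))) for k in range(n_features)]


# Toy graph: 3 nodes with 2 features each (hypothetical values)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a = [0.0, 0.0]  # a zero attention vector gives every node the same weight
pooled = global_attention_sum_pool(X, a)
```

With a zero attention vector the layer reduces to a plain average over nodes, which is a convenient sanity check; a trained a shifts weight toward the nodes it scores highest.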

As shown in Fig. 4a, the first architecture, the GCNConv network64, consists of four graph convolutional (GCN) layers. Each GCN layer performs the operation:

$$\begin{aligned} Y = {\hat{D}}^{-1/2}{\hat{A}}{\hat{D}}^{-1/2}XW+b \end{aligned}$$
(7)

where Y is the output of the layer, \({\hat{A}} = A + I\) is the adjacency matrix of the input graph plus the identity matrix of appropriate shape, \({\hat{D}}\) is the diagonal degree matrix of \({\hat{A}}\) with \({\hat{D}}_{ii} = \sum _{j} {\hat{A}}_{ij}\), X is the layer input tensor, W is the layer weights, and b is the layer bias.

Figure 4b shows the second architecture, which is the GCSConv network. This consists of four GCS layers, which are GCN layers with an added, trainable skip connection. The GCS layer operation is described by:

$$\begin{aligned} Y = D^{-1/2}AD^{-1/2}XW_{1} + XW_{2} + b \end{aligned}$$
(8)

where Y is the output of the layer, D is the degree matrix, A is the adjacency matrix, X is the node feature matrix, \(W_{1}\) and \(W_{2}\) are the two sets of layer weights, and b is the layer bias.
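The difference between Eqs. (7) and (8) can be seen side by side in the short sketch below. It is written in plain Python for illustration only and is not the code used in this work. The helper names, the zero-degree guard, and the toy graph are assumptions.

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def normalize(A):
    """Return D^-1/2 A D^-1/2, where D is the diagonal degree matrix of A."""
    deg = [sum(row) for row in A]
    # Assumption: isolated nodes (degree 0) get a scale of 0 to avoid dividing by zero
    scale = [d ** -0.5 if d > 0 else 0.0 for d in deg]
    n = len(A)
    return [[scale[i] * A[i][j] * scale[j] for j in range(n)] for i in range(n)]


def add_bias(Y, b):
    """Add the bias vector b to every row (node) of Y."""
    return [[y + b_k for y, b_k in zip(row, b)] for row in Y]


def gcn_layer(A, X, W, b):
    """Eq. (7): self-loops are added to A before normalization."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return add_bias(matmul(matmul(normalize(A_hat), X), W), b)


def gcs_layer(A, X, W1, W2, b):
    """Eq. (8): no self-loops; a trainable skip term X W2 is added instead."""
    propagated = matmul(matmul(normalize(A), X), W1)
    skip = matmul(X, W2)
    summed = [[p + s for p, s in zip(p_row, s_row)]
              for p_row, s_row in zip(propagated, skip)]
    return add_bias(summed, b)


# Toy graph: two connected nodes with one feature each (hypothetical values)
A = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0], [3.0]]
Y_gcn = gcn_layer(A, X, W=[[2.0]], b=[0.5])
Y_gcs = gcs_layer(A, X, W1=[[2.0]], W2=[[1.0]], b=[0.0])
```

In the GCN layer a node's own features enter only through the self-loop and share the weights W with its neighbors, whereas the GCS layer gives them a separate weight matrix \(W_{2}\).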

The third architecture is the GIN architecture65 and is shown in Fig. 4c. This architecture contains three graph isomorphism network (GIN) layers, where each layer performs the following operation for each node in the input matrix:

$$\begin{aligned} Y_{i} = MLP\left( (1+\varepsilon ) \cdot x_{i} + \sum _{j \in N(i)} x_{j}\right) \end{aligned}$$
(9)

where \(MLP(\cdot )\) is a multi-layer perceptron, \(\varepsilon\) is a learned parameter, \(x_{i}\) is the feature vector of the ith node of the input matrix, and \(N(i)\) is the set of neighbors of node i.
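The node-wise update in Eq. (9) can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: neighbors are taken to be the nodes with a non-zero adjacency entry, the MLP is passed in as an arbitrary callable, and the one-layer ReLU stand-in and toy graph are hypothetical.

```python
def gin_layer(A, X, mlp, eps=0.0):
    """Apply Eq. (9) to every node of a graph.

    A is the N x N adjacency matrix, X the N x n node-feature matrix,
    mlp a callable mapping a feature vector to a feature vector,
    and eps the (normally learned) scalar epsilon.
    """
    n = len(A)
    outputs = []
    for i in range(n):
        # Sum the feature vectors of the neighbors N(i)
        # (assumption: j is a neighbor of i when A[i][j] is non-zero and j != i)
        aggregated = [0.0] * len(X[i])
        for j in range(n):
            if j != i and A[i][j] != 0:
                aggregated = [s + x for s, x in zip(aggregated, X[j])]
        # Combine with the node's own features, scaled by (1 + eps)
        combined = [(1.0 + eps) * x + s for x, s in zip(X[i], aggregated)]
        outputs.append(mlp(combined))
    return outputs


def toy_mlp(v):
    """Hypothetical stand-in for the MLP: one linear unit (weight 2) with ReLU."""
    return [max(0.0, 2.0 * t) for t in v]


# Path graph 0 - 1 - 2 with one feature per node (hypothetical values)
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0], [2.0], [4.0]]
Y = gin_layer(A, X, toy_mlp, eps=0.5)
```

For node 1, the update adds both neighbors to 1.5 times its own feature before the MLP is applied, which shows how \(\varepsilon\) controls the weight of a node relative to its neighborhood.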

The adjacency matrix, which was common to all data samples and to all three GNNs, was calculated as follows:

$$\begin{aligned} A = |P| - I \end{aligned}$$
(10)

where \(A \in {\textbf{R}}^{N \times N}\) is the adjacency matrix, \(|P| \in {\textbf{R}}^{N \times N}_{+}\) is the element-wise absolute value of the Pearson correlation coefficient matrix of the dataset, and \(I \in {\textbf{R}}^{N \times N}\) is the identity matrix.
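As a rough illustration of Eq. (10), the sketch below builds A from a small set of channel time series. How the correlation was pooled over the whole dataset is not specified here, so the sketch simply assumes one non-constant time series per channel; the function names and sample values are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length, non-constant series."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def correlation_adjacency(channels):
    """Build A = |P| - I from a list of channel time series.

    Because P_ii = 1 for every channel, subtracting I is equivalent to
    leaving the diagonal at zero, which is what this function does.
    """
    n = len(channels)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i][j] = abs(pearson(channels[i], channels[j]))
    return A


# Three hypothetical channels with four samples each
channels = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with the first channel
    [1.0, -1.0, 1.0, -1.0],  # weakly anti-correlated with the first channel
]
A = correlation_adjacency(channels)
```

Taking the absolute value means that strongly anti-correlated channels are connected as strongly as positively correlated ones, and the zero diagonal removes self-loops from the graph.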

Experimental design

The total number of participants in the experiment was 15 (13 male, 2 female). They consisted of Missouri University of Science and Technology students and professors, all in healthy condition. The participants had normal or corrected-to-normal vision and normal hearing. The experiment received approval from the University of Missouri Institutional Review Board, and all experiments were performed in accordance with relevant guidelines and regulations. Written informed consent was obtained from all subjects and/or their legal guardian(s) prior to their participation. Further, written informed consent was obtained for publication of identifying information/images in an online open-access publication.

The objective of the experiment was to induce a predictable response in the participants such that any anticipatory signals that may occur can be reliably measured and recorded using an EEG. The experiment simulated a real-world driving environment wherein the participants operated an open-source remote-controlled robot called JetBot (built using Waveshare’s Jetbot AI Kit and Nvidia’s 4GB Jetson-Nano) on a novel testbed designed to simulate urban roadways (see picture inserted in Fig. 5). The testbed boundary was marked with standard masking tape and the track material was Delxo’s anti-slip tape (with 80-grit granularity) to provide additional traction to the JetBot wheels. The participants navigated the JetBot in the testbed lanes using a Logitech G29 Driving Force racing wheel and pedal setup, while watching a live video feed cast to a computer monitor from an on-board camera. The JetBot was programmed to drive at a constant speed without the participant pressing the acceleration pedal, necessitating only the use of the steering wheel and brake pedal for full control. There were no other ’vehicles’ or obstructions on the testbed lanes and participants were free to navigate anywhere within the testbed boundaries.

The EEG signals of the participants were recorded using a Neuroelectrics ENOBIO 20 EEG headset. The electrode setup used was the international 10–20 standard and the sampling frequency during the experiment was 500 Hz. Data was collected from 19 channels, with a high-conductivity Signagel saline gel applied to the electrodes to increase the quality of data capture. The data acquisition software used was the Neuroelectrics NIC2 software, which featured its own EEG signal quality monitor. The quality monitor assessed the EEG signal by computing a quality index (QI) that was dependent on: (i) line noise, which was defined as electrical noise originating from surrounding power lines; (ii) main noise, which was defined as the signal power of the standard EEG band; and (iii) offset, which was the mean value of the waveform. Specifically, QI was calculated as66:

$$\begin{aligned} QI(t) = \tanh \Bigg (\sqrt{\bigg (\frac{\zeta _{L}(t)}{W_{\zeta _{L}}}\bigg )^2 + \bigg (\frac{\zeta _{m}(t)}{W_{\zeta _{m}}}\bigg )^2 + \bigg (\frac{O(t)}{W_{O}}\bigg )^2}\Bigg ) \end{aligned}$$
(11)

where \(\zeta _{L}(t)\) and \(W_{\zeta _{L}}\) denote the line noise and line noise normalizing weight (= 100 \(\upmu\)V), respectively, \(\zeta _{m}(t)\) and \(W_{\zeta _{m}}\) denote the main noise and main noise normalizing weight (= 250 \(\upmu\)V), respectively, and O(t) and \(W_O\) denote the offset and offset normalizing weight (= 280 mV), respectively. The NIC2 software indicators used a color scheme to indicate different levels of QI. A green indicator meant that the signal had a QI between 0 and 0.5, an orange indicator meant a QI between 0.5 and 0.8, and a red indicator meant a QI between 0.8 and 1. For this experiment, green indicators for all channels were the standard; however, brief periods of orange indicators were considered acceptable. The data was filtered using a 60 Hz filter during capture to help reduce electrical noise, and all channels were captured with reference to the Common Mode Sense channel (20th channel), which was fixed to the participant’s right ear lobe.
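A small worked example may help in reading Eq. (11) and the color bands. The sketch below is not the NIC2 implementation; it assumes all three inputs are given in microvolts, and the handling of values that fall exactly on a band boundary is also an assumption.

```python
import math

# Normalizing weights from Eq. (11), expressed in microvolts
W_LINE_UV = 100.0
W_MAIN_UV = 250.0
W_OFFSET_UV = 280_000.0  # 280 mV


def quality_index(line_noise_uv, main_noise_uv, offset_uv):
    """Evaluate Eq. (11) for one time instant; 0 is best, values near 1 are worst."""
    radicand = (
        (line_noise_uv / W_LINE_UV) ** 2
        + (main_noise_uv / W_MAIN_UV) ** 2
        + (offset_uv / W_OFFSET_UV) ** 2
    )
    return math.tanh(math.sqrt(radicand))


def indicator_color(qi):
    """Map a QI value to the indicator color.

    Assumption: a value exactly on a boundary is assigned to the worse band.
    """
    if qi < 0.5:
        return "green"
    if qi < 0.8:
        return "orange"
    return "red"


# Hypothetical readings: 30 uV line noise, 100 uV main noise, no offset
qi_good = quality_index(30.0, 100.0, 0.0)
# Hypothetical readings at the normalizing weights for line and main noise
qi_bad = quality_index(100.0, 250.0, 0.0)
colors = (indicator_color(qi_good), indicator_color(qi_bad))
```

In the first case the three normalized terms combine to 0.5 under the square root, so the QI is tanh(0.5), which lies in the green band. In the second case the QI is tanh(√2), which lies in the red band.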

Figure 5

Experimental design illustration of a trial. Photo by Micheal Pierce/Missouri S&T.

The experimental design was based, in part, on the experiment conducted in25 and is illustrated in Fig. 5. Each participant underwent eight sets of 30 trials each, for a total of 240 trials, with short 5–10 min breaks between sets. Each trial consisted of a set of audible commands issued by MATLAB: a “Start” command, upon hearing which the participant would release the brake and allow the JetBot to move, followed by a countdown from 5 to 1, and finally a “Stop” command, at which the participant would immediately stop the JetBot by pressing the brake. To ensure that the participants responded in a timely fashion, activity at the brake pedal was monitored, and any trial in which the brake pedal depression did not register a numerical reading higher than 0.05% of its total depression range within 0.25 s of the issuance of the “Stop” command was marked for removal. Trials where the participant braked too early were manually marked for removal as well. The EEG recording began concurrently with the issuance of the “Start” command, markers corresponding to the countdown numbers were applied to the data concurrently with each audio count, and the recording stopped at detection of braking action or after the 0.25 s delay used to check brake pedal depression.
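The automatic part of the trial-rejection rule can be expressed as a short check. The sketch below is an illustration only: the data layout (time relative to the “Stop” command paired with pedal depression as a fraction of the full range) and all names are assumptions, and the manual removal of early-braking trials is not modeled.

```python
BRAKE_THRESHOLD = 0.0005  # 0.05% of the total depression range, as a fraction
RESPONSE_WINDOW_S = 0.25  # allowed delay after the "Stop" command, in seconds


def braked_in_time(samples, threshold=BRAKE_THRESHOLD, window_s=RESPONSE_WINDOW_S):
    """Return True if the pedal exceeded the threshold within the response window.

    samples is a list of (t, depression) pairs, where t is the time in seconds
    relative to the "Stop" command and depression is a fraction between 0 and 1.
    """
    return any(
        0.0 <= t <= window_s and depression > threshold
        for t, depression in samples
    )


# Hypothetical pedal logs for two trials
timely_trial = [(0.05, 0.0), (0.20, 0.30)]
late_trial = [(0.10, 0.0), (0.40, 0.50)]
keep_flags = [braked_in_time(timely_trial), braked_in_time(late_trial)]
```

A trial for which the check returns False corresponds to one that would be marked for removal.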

Data preprocessing

Only the trials that were not marked for removal during the experiments were processed. The data was processed in several steps using the open-source EEG toolkit EEGLAB67 along with the TBT plugin68 (see Fig. 6).

  1. Each trial was:

    (a) Spectrally filtered using a FIR bandpass filter from 0.1 Hz to 1 Hz, as suggested in22.

    (b) Cleaned using EEGLAB’s built-in automated cleaning function “Clean_RawData and ASR” (Artifact Subspace Reconstruction) to remove bad channels. A channel was removed if it: (i) was flat for more than 5 s; (ii) correlated at less than 0.8 with an estimate based on nearby channels; or (iii) contained more than four standard deviations of line noise relative to its signal. The ASR corrected bad data periods containing high-amplitude artifacts69, such as eye blinks, and its maximum acceptable 0.5 s window standard deviation limit was set to a conservative 20 standard deviations. The values used in this step were the standard default values in EEGLAB and were also used as part of a pre-processing scheme in a previous study70.

    (c) Segmented by slicing the data according to the markers corresponding to the countdown numbers or the “Stop” command. For example, as shown in Fig. 5, the first segment consisted of only the data that occurred between the “5” count marker and the “4” count marker. The segments from 5 to 1 were given a “0” label and were regarded as not containing an intention signal, whereas the segment between “1” and “Stop” was labelled as “1” and was regarded as containing the signal of interest. Each segment was then baseline corrected by subtracting the mean value of the segment from every value in the segment.

  2. Each segment was then further cleaned using the TBT plugin to remove high-amplitude noise. Channels were removed from the segment if either of the following two criteria was met for 10% or more of the segment duration70: (i) if they exceeded \(\pm 100\) \(\upmu\)V in magnitude, or (ii) if the joint probabilities (i.e., probabilities of activity) exceeded 3 standard deviations for local or global thresholds. If either criterion was met for less than 10% of a segment, then the offending data period was removed and subsequently interpolated. Segments with more than 50% of channels removed (i.e., 10 or more of the 19 channels) were omitted entirely. Any removed channels were re-interpolated after the cleaning was finished for consistency in input dimension. Similar segment-by-segment cleaning strategies utilizing the TBT plugin have been employed in a previous study49.

  3. Every data point (i.e., segment) was padded to a uniform length of 1848 samples, the length of the longest segment, by appending zeros to the end. Padding data should not significantly change the prediction results and is commonly done when using machine learning algorithms on datasets containing images of different sizes. Furthermore, padding should alter the dataset less than truncation, which risks deleting key features.

  4. Finally, each channel was normalized such that the data lay within a range of 0 to 1. The particular equation used was:

    $$\begin{aligned} {\overline{X}}_{i} = \frac{X_{i} - \min (X_{i})}{\max (X_{i}) - \min (X_{i})} \end{aligned}$$
    (12)

    where \(X_{i}\) is the data vector corresponding to the \(i{\rm th}\) data channel, and \(\max (\cdot )\) and \(\min (\cdot )\) are the maximum and minimum channel values per segment, respectively.
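The per-segment arithmetic in the steps above (baseline correction, zero padding, and the min-max normalization of Eq. 12) can be sketched in a few lines of plain Python. This is an illustration rather than the EEGLAB-based pipeline itself: the function names are hypothetical, the operations are applied in the order in which they are listed, and the handling of a constant channel is an assumption, since Eq. (12) is undefined in that case.

```python
TARGET_LENGTH = 1848  # uniform segment length in samples, as stated above


def baseline_correct(channel):
    """Subtract the channel's mean over the segment from every sample."""
    mean = sum(channel) / len(channel)
    return [v - mean for v in channel]


def zero_pad(channel, length=TARGET_LENGTH):
    """Append zeros to the end of the channel until it has the target length."""
    if len(channel) > length:
        raise ValueError("channel is longer than the target length")
    return channel + [0.0] * (length - len(channel))


def min_max_normalize(channel):
    """Scale the channel to the range [0, 1] following Eq. (12)."""
    lo, hi = min(channel), max(channel)
    if hi == lo:
        # Assumption: a constant channel is mapped to all zeros
        return [0.0] * len(channel)
    return [(v - lo) / (hi - lo) for v in channel]


def preprocess_segment(segment, length=TARGET_LENGTH):
    """Apply the three operations to every channel of one segment.

    segment is a list of channels, each a list of samples.
    """
    return [
        min_max_normalize(zero_pad(baseline_correct(channel), length))
        for channel in segment
    ]


# Hypothetical one-channel segment, padded to 5 samples for readability
example = preprocess_segment([[1.0, 2.0, 3.0]], length=5)
```

In this example the padded zeros end up at 0.5 rather than 0, because normalization is applied after padding and the baseline-corrected samples lie on both sides of zero.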

Figure 6

Data processing pipeline.