Interpretable EEG seizure prediction using a multiobjective evolutionary algorithm

Seizure prediction might be the solution to tackle the apparent unpredictability of seizures in patients with drug-resistant epilepsy, who comprise about a third of all patients with epilepsy. Designing seizure prediction models involves defining the pre-ictal period, a transition stage between inter-ictal brain activity and the seizure discharge. This period is typically a fixed interval, although some recent studies report evaluating different patient-specific pre-ictal intervals. Authors have increasingly used deep learning models, given their ability to automatically perform pre-processing, feature extraction, and classification, and to handle temporal and spatial dependencies. As these approaches create black-box models, clinicians may not have sufficient trust to use them in high-stakes decisions. With these problems in mind, we developed an evolutionary seizure prediction model that identifies the best set of features while automatically searching for the pre-ictal period and accounting for patient comfort. This methodology provides patient-specific interpretable insights, which might contribute to a better understanding of seizure generation processes and explain the algorithm's decisions. We tested our methodology on 238 seizures and 3687 h of continuous data, recorded from the scalp of 93 patients with several types of focal and generalised epilepsies. We compared the results with a seizure surrogate predictor and obtained a performance above chance for 32% of patients. We also compared our results with a control method based on the standard machine learning pipeline (pre-processing, feature extraction, classifier training, and post-processing), where the control marginally outperformed our approach by validating 35% of the patients.
In total, 54 patients performed above chance for at least one method: our methodology or the control one. Of these 54 patients, 21 (≈38%) were solely validated by our methodology, while 24 (≈44%) were only validated by the control method. These findings may evidence the need for different methodologies concerning different patients.

and REM (R)), and surgery decision (performed (s), offered but not performed (o), not offered (n), and invasive monitoring required (i)).

Features Description
We justify here the use of our features: statistical moments, Hjorth parameters, relative spectral power, wavelet coefficient energy, and spectral edge frequency. We only used linear features, that is, mathematical measures that use the phase/frequency and amplitude information of the signal and comply with the linearity property. When extracting linear features, one assumes the quasi-stationarity of the EEG signal within each time window obtained in the data segmentation step of the signal pre-processing stage.

Statistical Moments
Several studies 1-8 report the use of statistical moments for characterising the amplitude distribution of the EEG time series. The first moment is the mean, the second is the variance, the third is the skewness, and the fourth is the kurtosis. The skewness is zero for symmetric amplitude distributions and non-zero for asymmetric ones. The kurtosis measures the relative peakedness or flatness of an amplitude distribution 2. These statistical measures have shown significant changes during the pre-ictal period compared to the inter-ictal state 3,5. A decrease in the variance and an increase in the kurtosis have been observed in the pre-ictal period when compared with inter-ictal data 4,9.
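As a minimal sketch, the four moments above could be extracted per window as follows. The window length and sampling rate are illustrative (not the study's settings), and SciPy's default Fisher kurtosis convention is assumed (a Gaussian scores 0).

```python
import numpy as np
from scipy.stats import skew, kurtosis

def statistical_moments(window):
    """First four statistical moments of a 1-D EEG window."""
    window = np.asarray(window, dtype=float)
    return {
        "mean": np.mean(window),
        "variance": np.var(window),
        "skewness": skew(window),      # 0 for symmetric amplitude distributions
        "kurtosis": kurtosis(window),  # Fisher definition: 0 for a Gaussian
    }

# Example on a 5-second window of synthetic data, assuming 256 Hz sampling
rng = np.random.default_rng(0)
feats = statistical_moments(rng.standard_normal(5 * 256))
```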

Hjorth Parameters
Hjorth parameters are three time-domain measures of brain activity that authors have been using in seizure anticipation. These parameters were intended to constitute a clinically helpful tool for quantitatively describing the electroencephalogram (EEG) 2. Activity, mobility, and complexity measure the mean power, the root-mean-square frequency, and the root-mean-square frequency spread, respectively. Authors have reported a significant increase in the mobility and complexity of the EEG during the pre-ictal period 3,5.
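The three parameters can be computed from the signal's variance and the variances of its first and second finite-difference derivatives. The sketch below follows the standard Hjorth definitions; the input length is illustrative.

```python
import numpy as np

def hjorth_parameters(x):
    """Hjorth activity, mobility, and complexity of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)    # first derivative (finite differences)
    ddx = np.diff(dx)  # second derivative
    activity = np.var(x)                                       # mean power
    mobility = np.sqrt(np.var(dx) / activity)                  # rms frequency
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility  # rms frequency spread
    return activity, mobility, complexity

rng = np.random.default_rng(1)
act, mob, comp = hjorth_parameters(rng.standard_normal(1280))
```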

Relative Spectral Power
Different physiological and pathological processes are reflected by activity in different frequency ranges of the power spectrum of the EEG 2. The relative spectral power of a band can be calculated by first computing the Power Spectral Density (PSD) of the time series within a time window. It is essential to mention that the calculation of the PSD assumes the signal in each window is short enough to be considered quasi-stationary and long enough to capture the brain's low-frequency activity. Although there are several methods for spectral estimation, the simplest consists of applying the Fast Fourier Transform (FFT) and then averaging its squared coefficients that belong to the frequency band of interest. A normalised spectral power provides a more robust measure of the fluctuations in a patient's daily life to reveal the pre-ictal state 3. Authors have reported that brain activity increases or decreases at specific frequency bands during certain events. Before seizure onset, there may be a transfer of power from the lower to the higher frequencies 3-5, 10-12. For example, Mormann et al. 5 showed a decrease in Delta band power, coupled with a relative power decrease in the other sub-bands. Bandarabadi et al. 10 showed that relative combinations of sub-band spectral powers across channel pairs might be used for tracking gradual changes preceding seizures.
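The FFT-based estimate described above can be sketched as follows. The band edges and sampling rate are illustrative assumptions, not the study's exact settings.

```python
import numpy as np

def relative_band_power(x, fs, bands):
    """Relative spectral power per band via the FFT periodogram.

    `bands` maps band names to (low, high) frequency edges in Hz; each band's
    summed squared FFT coefficients are normalised by the total power.
    """
    x = np.asarray(x, dtype=float)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2  # squared FFT coefficients
    total = psd.sum()
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum() / total
            for name, (lo, hi) in bands.items()}

# Illustrative band edges (the study's exact edges are not specified here)
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "low-gamma": (30, 50), "high-gamma": (50, 100)}

fs = 256
t = np.arange(fs * 5) / fs
rbp = relative_band_power(np.sin(2 * np.pi * 10 * t), fs, BANDS)  # 10 Hz tone
```

For the pure 10 Hz tone, essentially all relative power lands in the alpha band, which is a quick sanity check on the band masking.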

Wavelet Coefficients Energy
Wavelet transform is a time-frequency domain transform that can be an alternative to the FFT, as it decomposes the signal into different resolution levels according to different frequency ranges 1,3,4. In other words, wavelets provide a time-variant decomposition adapted to the signal, capturing minor details and sudden changes by providing higher frequency resolution for lower frequencies and higher time resolution for higher frequencies. With the signal decomposed, it is possible to compute several measures using the wavelet coefficients, as in the case of signal energy. By computing the energy of the signals originating from the decomposition, a measure of the energy in different frequency ranges can be obtained 3, 4, 13.
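A minimal, dependency-free sketch of the idea: a multilevel Haar decomposition whose detail-coefficient energies summarise different frequency ranges. The Haar wavelet is an assumption made here only to keep the example self-contained; the mother wavelet actually used in the study may differ.

```python
import numpy as np

def haar_wavelet_energies(x, levels):
    """Energy of the detail coefficients at each Haar decomposition level.

    Each level halves the signal: the high-pass branch yields detail
    coefficients (D1, D2, ...) whose energies are returned; the low-pass
    branch (approximation) feeds the next level.
    """
    x = np.asarray(x, dtype=float)
    energies = []
    approx = x
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        n = min(len(even), len(odd))
        detail = (even[:n] - odd[:n]) / np.sqrt(2)  # high-pass branch
        approx = (even[:n] + odd[:n]) / np.sqrt(2)  # low-pass branch
        energies.append(np.sum(detail ** 2))        # energy per level
    return energies

rng = np.random.default_rng(2)
e = haar_wavelet_energies(rng.standard_normal(1024), levels=5)
```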

Spectral Edge Frequency
Usually, most of the EEG signal's power lies within the frequency band from 0 Hz up to 40 Hz 2. The spectral edge frequency (SEF) quantifies the power contained in the lower frequencies with respect to a given threshold: it indicates the frequency below which x per cent of the overall signal power is contained, and it is frequently used in seizure prediction 1, 3-5. As a power transfer from low to high frequencies has been reported during the pre-ictal stage, the SEF may also be capable of capturing these dynamics.
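The SEF can be read off the cumulative power spectrum, as sketched below. The 40 Hz cap and the 50% edge are taken from the description above; the sampling rate is an illustrative assumption.

```python
import numpy as np

def spectral_edge_frequency(x, fs, edge=0.5, fmax=40.0):
    """Frequency below which `edge` (e.g. 50%) of the power up to `fmax` Hz lies."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(np.asarray(x, dtype=float))) ** 2
    band = freqs <= fmax                       # restrict to 0..fmax Hz
    cumulative = np.cumsum(psd[band])
    idx = np.searchsorted(cumulative, edge * cumulative[-1])
    return freqs[band][idx]

fs = 256
t = np.arange(fs * 4) / fs
sef50 = spectral_edge_frequency(np.sin(2 * np.pi * 6 * t), fs, edge=0.5)
```

For the pure 6 Hz tone, the cumulative power jumps entirely at the 6 Hz bin, so SEF50 sits at 6 Hz.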

Multiobjective Evolutionary Algorithm (MOEA) Configuration Details
Once every individual has been evaluated, parents are selected to reproduce and generate offspring. In this study, considering a population of N individuals, a group of N/2 parents is selected and recombined until N offspring are produced. The offspring are then subjected to mutation (with a given probability, just as before for recombination), and a replacement strategy is put in place to select the N individuals that will make up the following generation. For these steps, we first need to rank the population, for which we used non-dominated sorting.

For each individual, its rank is equal to the number of individuals dominating it plus one (e.g. all non-dominated individuals are assigned rank 1). Afterwards, a fitness value was assigned by interpolating from the best individual (rank 1) to the worst (rank n ≤ N) within the individuals of each rank. In other words, we iteratively searched for all the non-dominated solutions in the population that had not been labelled as belonging to a previous front. After labelling each new front, a front counter was incremented and the process was repeated until all solutions had been ranked. Afterwards, the individuals within each rank were sorted according to a measure, the crowding distance, which corresponds to the average side length of the cuboid defined by an individual's nearest neighbours in the same front 14. In essence, the larger the crowding distance, the fewer solutions occupy the vicinity of a given individual. It is computed, for each objective m, in the following manner: sort the individuals according to their fitness score f, assign an infinite value to the boundary solutions (so that they are always selected), and compute the distance measure for the remaining solutions as given by Eq. 1. The overall crowding distance was calculated as the sum of the distance values over all objectives.
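The crowding-distance computation described above (per-objective sort, infinite boundary values, summed normalised cuboid side lengths) can be sketched as:

```python
import numpy as np

def crowding_distance(front):
    """Crowding distance for one front of objective vectors (n_points x n_objs).

    Boundary solutions of each objective get infinity; interior ones accumulate
    the normalised side lengths of the cuboid defined by their nearest
    neighbours in the front, summed over objectives.
    """
    front = np.asarray(front, dtype=float)
    n, m = front.shape
    dist = np.zeros(n)
    for j in range(m):
        order = np.argsort(front[:, j])
        fmin, fmax = front[order[0], j], front[order[-1], j]
        dist[order[0]] = dist[order[-1]] = np.inf  # always keep boundaries
        if fmax > fmin:
            for k in range(1, n - 1):
                dist[order[k]] += (front[order[k + 1], j]
                                   - front[order[k - 1], j]) / (fmax - fmin)
    return dist

d = crowding_distance([[1.0, 5.0], [2.0, 3.0], [4.0, 1.0]])
```

Here the middle solution receives a finite distance while both extremes are kept unconditionally, matching the selection bias towards sparsely populated regions.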
For parent selection, the population was ranked using non-dominated sorting and then, within each rank, individuals were sorted by their crowding distance (the higher the distance, the better the position assigned). Parents were then chosen using binary tournaments and reproduced until N offspring were generated. With respect to the replacement strategy, we used an elitist approach. First, after evaluating the newly created offspring, the 2N individuals (current generation plus offspring) were ranked with non-dominated sorting. Then, entire fronts were added to the new generation, starting from the rank 1 individuals, followed by rank 2, and so on, until a front could no longer be fully accommodated. This last front was then sorted according to the crowding distance, and the best individuals were chosen to fill out the rest of the new population.
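The front-by-front ranking underlying both the tournament selection and the elitist replacement can be sketched as a simple non-dominated sort. All objectives are assumed minimised here; maximised objectives would be negated first.

```python
def non_dominated_fronts(objectives):
    """Split a population (list of objective tuples, all minimised) into
    Pareto fronts: front 0 holds the non-dominated solutions, front 1 the
    solutions dominated only by front 0, and so on."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    remaining = list(range(len(objectives)))
    fronts = []
    while remaining:
        # Solutions not dominated by any other still-unranked solution
        front = [i for i in remaining
                 if not any(dominates(objectives[j], objectives[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

fronts = non_dominated_fronts([(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)])
```

This quadratic-per-front sketch favours clarity; NSGA-II 14 uses a faster bookkeeping scheme with the same result.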

Neighbourhood details
We provide here more details concerning the established neighbourhoods. As in Pinto et al. 15, the window length and time instant genes have a straightforward ordering, as their values correspond to increasing/decreasing discrete time intervals. The electrodes' neighbourhood is based on their scalp position. As there is no increasing/decreasing relationship among the mathematical operators (mean, median, variance, integral), all their genes were considered neighbours of each other. In this work, the difference lies in the features' genes, as significantly more options are available.
The extracted features, however, can be divided into several groups and sub-groups, as seen in Fig. 3.b). Firstly, they can be divided according to their time/frequency domain. Then, time-domain features can be divided into statistical moments (ordered by their n-th moment: mean, variance, skewness, kurtosis) or Hjorth parameters (ordered as: activity, mobility, complexity). Frequency-domain features can be divided into spectral edge features (ordered by their percentage: 50%, 75%, and 90%) or frequency band ones. Frequency band features can be divided into relative spectral power bands (ordered by their frequency range: delta, theta, alpha, beta, low-gamma, high-gamma) and the energy from wavelet decomposition levels (ordered by their decomposition levels: D1, D2, D3, D4, D5, D6, D7, and D8). Please note that this overall division might lead to discussion, as wavelet transforms may be considered forms of time-frequency representations and not solely belonging to the frequency domain. Nevertheless, we justify our choice as we believe that these features are conceptually similar to relative spectral power bands.
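One hypothetical way to encode this hierarchy is a nested mapping whose ordered sub-group lists define each gene's neighbours. The group and feature names below mirror the text; the "adjacent within the sub-group" rule is a simplification for illustration only.

```python
# Hypothetical encoding of the feature-gene neighbourhood hierarchy
FEATURE_GROUPS = {
    "time": {
        "statistical_moments": ["mean", "variance", "skewness", "kurtosis"],
        "hjorth": ["activity", "mobility", "complexity"],
    },
    "frequency": {
        "spectral_edge": ["SEF50", "SEF75", "SEF90"],
        "relative_power": ["delta", "theta", "alpha", "beta",
                           "low-gamma", "high-gamma"],
        "wavelet_energy": [f"D{i}" for i in range(1, 9)],
    },
}

def within_group_neighbours(feature):
    """Features adjacent to `feature` in its ordered sub-group."""
    for domain in FEATURE_GROUPS.values():
        for members in domain.values():
            if feature in members:
                i = members.index(feature)
                return [m for m in members[max(0, i - 1):i + 2] if m != feature]
    raise KeyError(feature)

nbrs = within_group_neighbours("variance")
```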

Control method
We used a control methodology to verify the performance of our approach, inspired by the existing clinical trial dealing with EEG seizure susceptibility detection, namely the Seizure Advisory System Feasibility Study (NCT01043406) 16, and by other previous works, such as Direito et al. 1. To simplify the process, its development was merely inspired by these and did not fully reproduce them.
The Neurovista Advisory System used a patient-specific layered structure, which includes filtering, feature extraction, and classification. Input signals were filtered by a set of octave-wide digital filters covering the 8 Hz to 128 Hz range; notch filters were also optionally available. The signals were first segmented into windows of 5 seconds, from which the average energy, Teager-Kaiser energy, and line-length were extracted; these features were then normalised. This resulted in a total of 288 features, obtained from the combination of 16 available iEEG input channels, 6 filter/normalisation options, and 3 available features. During training, the 16 best-performing features were selected using a backward elimination feature selection method based on the Hilbert-Schmidt Independence Criterion (BAHSIC). The selected features were used to train a classification model inspired by two types of classifiers: decision trees and k-nearest neighbours (kNN). Here, a given feature vector was classified using several decision surfaces. Simply put, the input was first compared to the first decision surface to determine on which binary side of the surface the point lay. Then, depending on the previous binary answer, a new decision surface was selected to perform a second comparison. This process continued until reaching a total of ten decision layers; thus, the feature space was divided into 2^10 partitions. The surfaces were chosen during training, and each was assigned a relative measure of seizure risk according to its time distance to a seizure event. The training phase used a hold-out or cross-fold validation method to obtain the required configuration parameters. Finally, the classifier output was filtered and thresholded to prevent rapid changes and thus be robust to noise.
We replaced the filtering/decomposition and feature extraction processes of the advisory system with our extracted features. In this way, we used the same data as in the EA. Then, from all possible combinations of features and electrodes, we selected the best set. Feature selection comprised two phases: two filter methods 17 followed by a wrapper one. Concerning the filter methods, we first selected the best 100 features with Pearson's linear correlation coefficient and then the best 25 with the Area Under the Curve (AUC). Next, we used the wrapper method Recursive Feature Elimination (RFE) to select an optimal set of features. The filter methods had the objective of substantially decreasing the number of features, and thus the computational cost, as running the RFE on all features would take a substantially long time. There are two important configuration options when using the RFE: the algorithm used to help select the features (the estimator) and the number of features. For the estimator, we used a linear Support Vector Machine (SVM) 1. For the number of features, we used the first three chronological seizures in a grid search (3, 5, 7, 10, 15, 20 features) that optimises the product of Sample Sensitivity (S_ss) and Sample Specificity (S_sp): S_ss * S_sp (also named here as fitness). We also used this grid search to find the best pre-ictal period (30, 35, 40, 45, 50, 55, 60 minutes). For the grid search, we used a 3-fold scheme as in Direito et al. 1: fold 1 used seizures #1 and #2 for training and #3 for validation; fold 2 used seizures #1 and #3 for training and #2 for validation; and fold 3 used seizures #2 and #3 for training and #1 for validation. The remaining seizures comprised the testing group. Please note that we used the RFE with the SVM estimator instead of a backward feature elimination based on the BAHSIC criterion, as its implementation was more easily available in the Python scikit-learn library.
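The two filter stages followed by RFE could be sketched as below, assuming scikit-learn. The data, dimensions, and cut-offs are illustrative; only the structure (correlation filter, AUC filter, RFE with a linear SVM) follows the description above.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def select_features(X, y, n_corr=100, n_auc=25, n_final=10):
    """Two filter stages then an RFE wrapper, returning column indices of X."""
    # Filter 1: top features by |Pearson correlation| with the labels
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.argsort(corr)[::-1][:n_corr]
    # Filter 2: top features by per-feature AUC (folded to be >= 0.5)
    auc = []
    for j in keep:
        a = roc_auc_score(y, X[:, j])
        auc.append(max(a, 1 - a))
    keep = keep[np.argsort(auc)[::-1][:n_auc]]
    # Wrapper: RFE with a linear SVM estimator
    rfe = RFE(SVC(kernel="linear"),
              n_features_to_select=n_final).fit(X[:, keep], y)
    return keep[rfe.support_]

rng = np.random.default_rng(3)
X = rng.standard_normal((120, 200))
y = (X[:, 5] + X[:, 17] > 0).astype(int)  # two informative features
selected = select_features(X, y, n_corr=50, n_auc=20, n_final=5)
```

In the actual control method, the number of final features and the pre-ictal period are chosen by the 3-fold grid search on the first three seizures rather than fixed in advance.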
We adapted the classification model by using a Random Forest, since the one from the advisory system consists of an ensemble of a large number of individual decision-tree-like classifiers, each trained with a different data distribution. We thus intended to mimic the ten-layer structure and the different decision surfaces with the random forest algorithm. In testing, we also implemented the Firing Power to smooth the output over time, with a threshold value similar to our approach's (0.7), and we used refractory periods. This methodology is patient-specific.
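A minimal sketch of the Firing Power post-processing with refractory periods: the fraction of positive classifier outputs within a sliding window is thresholded, and further alarms are suppressed for a refractory interval. The window and refractory lengths here are illustrative sample counts, not the study's settings.

```python
import numpy as np

def firing_power_alarms(labels, window, threshold=0.7, refractory=None):
    """Smooth binary classifier outputs with the Firing Power and raise alarms.

    `labels` are 0/1 classifier outputs per sample; `window` is the number of
    samples spanning the pre-ictal period; after each alarm, `refractory`
    samples are skipped (assumed equal to the window when not given).
    """
    labels = np.asarray(labels, dtype=float)
    refractory = window if refractory is None else refractory
    alarms, i = [], window - 1
    while i < len(labels):
        fp = labels[i - window + 1:i + 1].mean()  # fraction of pre-ictal votes
        if fp >= threshold:
            alarms.append(i)
            i += refractory                       # enter refractory period
        else:
            i += 1
    return alarms

out = firing_power_alarms([0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1], window=4)
```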


Electrodes and Lobes Discussion
We present here some examples of the histograms of the number of different electrodes and lobes, specifically for patients 1200, 12702, 55202, 81402, and 1319203, whose most common sets (modes) of electrodes and lobes were, respectively: 4 and 3; 2 and 1; 3 and 2; 2 and 2; and 4 and 2. On the GitHub page of our paper, one can find these histograms for all patients. We believe these findings evidence the relevance of the patient comfort metric in the MOEA.