Power efficient refined seizure prediction algorithm based on an enhanced benchmarking

Deep learning techniques have led to significant advancements in seizure prediction research. However, the benchmarks used in published results are not uniform. Moreover, inappropriate training and evaluation processes used in various works create overfitted models, making prediction performance unstable or unreliable. In this study, we analyzed the various data preparation and dataset partition methods in related works and explained their impacts on the prediction algorithms. We then applied a robust processing procedure that considers appropriate sampling parameters and the leave-one-out cross-validation method to avoid possible overfitting and to provide prerequisites for easy benchmarking. Moreover, a deep learning architecture that takes advantage of a one-dimensional convolutional neural network and a bi-directional long short-term memory network is proposed for seizure prediction. The architecture achieves 77.6% accuracy, 82.7% sensitivity, and 72.4% specificity, outperforming the indicators of other prior-art works. The proposed model is also hardware friendly; it has 6.274 k parameters and requires only 12.825 M floating-point operations, which is advantageous for memory- and power-constrained device implementations.

www.nature.com/scientificreports/

The seizure occurrence period (SOP) and the seizure prediction horizon (SPH) are two important parameters for sampling. As shown in Fig. 1b, the SOP is the interval in which the seizure is expected to occur, and the period between the alarm and the beginning of the SOP is the SPH 16 . A seizure prediction alarm is considered successful if a seizure occurs after the SPH and within the SOP. The SPH is 0 in some studies 10,17,18 and ranges from 10 s to 4 h in others 8,9,19 . The SOP also varies from 5 min to 1 h across studies 8,20,21,22 . The SPH mentioned in Park et al. 23 is actually the SOP definition used in this study, which easily leads to misunderstandings and confusion. Furthermore, not all seizures can be used to collect preictal samples; the study of leading seizures has a much higher value for possible clinical intervention 24 . However, the definition of leading seizure has been entirely overlooked in some studies 8,10,18 , and even in those studies that consider this concept, the value of the seizure-free time (T) varies from 30 min to 4 h 16,17,19 . Under the various choices for T, SOP, and SPH, the generated sample set for training will be completely different even with the same EEG database. Additional details are discussed in the "Materials and methods" section. For model training and evaluation, the sample set should be split into training, validation, and testing sets. A variety of approaches are used for this task. Tsiouris et al. 25 shuffled all samples and then used a stratified 10-fold cross-validation to evaluate the prediction performance. Zhang et al. 26 separated the total dataset into a training set and a testing set according to the ratio of 8:2. The leave-one-out cross-validation method was used in two other studies 16,21 . However, the number of available seizures is not equal under different definitions of leading seizures, which results in a different number of cross-validation experiments.
These validation methods are so different that the reliability of the model cannot be guaranteed.
Furthermore, the main aim of seizure prediction research is to improve the patients' quality of life; thus efficient hardware implementation is very important 11,27 . However, most research involves complex algorithms that are time-consuming and unfavorable for practical application.
In this work, we propose a processing procedure as a reference for reliably training seizure prediction algorithms and for facilitating fair benchmarking of subsequent prediction methods. We investigate the SOP, SPH, and T selections to provide trustworthy labels for training and validation. A one-dimensional CNN combined with a Bi-LSTM network is proposed; its evaluation on the CHB-MIT database shows that it helps improve prediction performance, and the demonstrated power and cost efficiency are advantageous for implant devices.
The remaining sections of this paper are organized as follows: the "Materials and methods" section introduces the dataset, the data processing method, and the proposed model structure. The "Results" section gives the evaluation results of the proposed model and the comparison between this work and others. The "Discussion" section examines several issues raised by the experimental results, including the different choices of sampling parameters and dataset partition methods, which greatly influence model performance. The "Conclusion" section summarizes the contributions of this work.

Materials and methods
Sampling parameters. To clarify the standards for sampling, several important concepts must be defined.
As seizures often occur in clusters, the interval between two clusters is defined as the seizure-free time T; the first seizure in each cluster is called a leading seizure. Chen et al. 24 claimed that T should be based on an analysis of natural seizure clusters. In particular, they recommended listing all of the seizures in chronological order, observing how they naturally cluster, and then recording the longest duration among all natural clusters as the value of T. However, most existing EEG recordings in public datasets are too short to observe the natural seizure clusters. The value of T can then only be chosen as large as possible to obtain more true leading seizures. However, the larger the value of T, the fewer positive samples can be obtained. Considering the trade-off between the number of positive samples and obtaining true leading seizures, T = 4 h is chosen in our study; it is also the most commonly used value in related studies 19,23 .
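The clustering rule described above can be sketched in a few lines; the onset times in the example are hypothetical and only illustrate how a 4-h seizure-free time T selects leading seizures:

```python
def leading_seizures(onsets_s, t_free_s=4 * 3600):
    """Return onsets of leading seizures: the first seizure of each
    cluster, where a new cluster starts whenever the gap to the
    previous seizure is at least the seizure-free time T."""
    leaders = []
    prev = None
    for onset in sorted(onsets_s):
        if prev is None or onset - prev >= t_free_s:
            leaders.append(onset)
        prev = onset
    return leaders

# Example: five onsets (seconds); 30-min gaps keep seizures in one
# cluster, while a 5-h gap starts a new cluster.
print(leading_seizures([0, 1800, 3600, 21600, 22200]))  # → [0, 21600]
```

With T = 4 h, only the two cluster-opening seizures survive, which is exactly why the number of usable preictal segments shrinks as T grows.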
SOP and SPH are two important parameters for locating the preictal state accurately on EEG signals. However, the SPH and SOP are often clinically unknown 19 and are usually chosen based on assumptions. In one study 19 , the SPH and SOP were 4 h and 1 h, respectively; this period is too long, as even if the model gives a correct alarm, the earliest seizure may occur 4 h later and the latest 5 h later, which causes patients to experience premature anxiety. Thus, suitable values for the SOP and SPH not only need to allow the patients enough time to take protective measures (i.e., the SPH should be greater than 0) but also to ensure enough preictal samples to train the model (the SOP should be appropriately long); furthermore, there is a need to avoid unnecessary stress for patients (i.e., neither the SOP nor the SPH should be excessively long). According to this analysis and most of the previous works, the sampling parameters used in this work are presented in Table 1. The SPH and SOP are 5 and 30 min, respectively, to locate the preictal segments on the long-term EEG recordings. The interictal state is considered to be at least 2 h before or after a seizure, to ensure that the segment is far enough away from the preictal period 16,21 . These segments are further divided into smaller fragments by a moving-window method to generate the sample set.
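As a sketch of how these parameters locate segments on a recording (times in seconds; note that, for simplicity, the 2-h interictal margin here is measured from seizure onsets, whereas the paper measures it from the seizure itself):

```python
def preictal_segment(onset_s, sop_s=30 * 60, sph_s=5 * 60):
    """Start and end times (s) of the preictal segment for a leading
    seizure: an SOP-long window that ends SPH before the onset."""
    end = onset_s - sph_s
    return max(0, end - sop_s), end

def is_interictal(t_s, onsets_s, margin_s=2 * 3600):
    """True if time t lies at least 2 h away from every seizure onset
    (a simplification of the paper's interictal definition)."""
    return all(abs(t_s - onset) >= margin_s for onset in onsets_s)

# A seizure at t = 10,000 s gives a 30-min preictal window ending
# 5 min before onset.
print(preictal_segment(10000))  # → (7900, 9700)
```

The moving-window step then slices these segments into fixed-length samples (5-s windows in this work).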
Sample set partition and model evaluation. To train and test the model, the sample set should be split into training, validation, and testing sets. For other machine learning problems, the most common method is to divide the dataset according to a ratio such as 8:1:1 or to use K-fold cross-validation. But seizure prediction is a time-series problem; our purpose is to predict events that may happen in the future based on present data, so the testing set must be invisible throughout the training process. Thus, the leave-one-out cross-validation method is used in this work to simulate the real-life clinical situation. Namely, suppose a patient has N seizures in a dataset. In that case, the preictal samples of N-1 seizures are used as the training set, and the remaining seizure is considered as an impending seizure for evaluating the model performance; this process is repeated N times. We consider that samples from the same preictal segment share some identical features; the leave-one-out cross-validation method guarantees that the characteristics of the testing set will not leak into the training set, thereby ensuring the model's reliability.
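The leave-one-(seizure-)out procedure can be sketched as a split generator; `preictal_by_seizure` is a hypothetical mapping from each leading seizure to its preictal samples:

```python
def leave_one_seizure_out(preictal_by_seizure):
    """Yield (held_out, train, test) triples: each split withholds all
    samples of one seizure, so no preictal segment contributes to both
    the training and the testing set."""
    seizures = sorted(preictal_by_seizure)
    for held_out in seizures:
        train = [s for sz in seizures if sz != held_out
                 for s in preictal_by_seizure[sz]]
        yield held_out, train, preictal_by_seizure[held_out]

# With N = 3 seizures, N experiments are run, one per held-out seizure.
splits = list(leave_one_seizure_out({1: ["a"], 2: ["b"], 3: ["c"]}))
print(len(splits))  # → 3
```

Because the split boundary is the seizure, not the sample, shuffled K-fold's within-segment leakage cannot occur.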
Dataset. The CHB-MIT EEG dataset was used in this study. This dataset was collected at the Children's Hospital Boston and consists of EEG recordings from pediatric subjects with intractable seizures, collected by surface electrodes on the patients' scalps. The recordings were acquired using 23 channels at a sampling rate of 256 Hz. The recordings, grouped into 23 cases, were collected from 22 subjects (5 males, ages 3-22; and 17 females, ages 1.5-19) 28 . The data collection followed a protocol approved by the Ethics Committee on Clinical Investigations at the Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology. All methods were performed in accordance with the relevant guidelines and regulations. Informed written patient consent was also obtained.
Proposed model. The EEG sample was a matrix with time on the horizontal axis and channels on the vertical axis. Almost all related works treat all channels as a whole, i.e., they feed the entire matrix into the CNN model as the input. Inspired by univariate features in machine learning-based prediction algorithms, we argued that a CNN could also extract features in a single-channel fashion, and we therefore designed our CNN model using one-dimensional convolution. The features extracted by the CNN are fed into a classifier to obtain the prediction result. We compared two classifier structures: one uses fully connected layers as the classifier directly; the other adds a Bi-LSTM network between the CNN and the fully connected layers. The Bi-LSTM network 29 is a kind of RNN; as an improved version of the traditional LSTM, it has better feature extraction ability in the temporal dimension and is widely used to process time-series data 21,30 . Through experiments, we found that the model's performance was improved with the Bi-LSTM; the comparison results are presented in Table 2.
The structure of the proposed model is shown in Fig. 2a. There were five one-dimensional convolutional layers in our proposed model; each layer was followed by batch normalization and a ReLU activation function. The kernel sizes of the convolutional layer and max-pooling layer were 1 and 2, respectively. The Bi-LSTM network had only one recurrent layer, and the number of features in the hidden state was 10. A fully connected layer with a dropout rate of 0.5 and SoftMax function is included in the final classifier.
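The parameter budget of such a model can be checked with simple counting formulas (a sketch: the paper does not list per-layer channel widths here, so the widths in the usage example are assumptions, not the configuration that yields the reported 6.274 k total):

```python
def conv1d_params(c_in, c_out, k=1, batchnorm=True):
    """Weights + bias of a 1-D convolution, plus the learnable
    scale/shift of a following batch-normalization layer."""
    return c_out * (c_in * k + 1) + (2 * c_out if batchnorm else 0)

def bilstm_params(inp, hidden):
    """One bi-directional LSTM layer: per direction, 4 gates with
    input and recurrent weights and two bias vectors (PyTorch-style)."""
    return 2 * (4 * hidden * (inp + hidden) + 8 * hidden)

def linear_params(inp, out):
    """Fully connected layer: weights + bias."""
    return out * (inp + 1)

# Hypothetical widths, for illustration only: single-channel input,
# two conv layers, a Bi-LSTM with hidden size 10, and a 2-class head.
total = (conv1d_params(1, 4) + conv1d_params(4, 8)
         + bilstm_params(8, 10) + linear_params(20, 2))
```

Counting like this makes it easy to see why the Bi-LSTM (quadratic in its hidden size) dominates the budget once the convolutions use kernel size 1.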
As a single channel of the signal contains less information than an entire matrix, it was reasonable that the model's accuracy would decrease when each vector sample was directly classified. However, with N channels of EEG signals from the same period, using the idea of voting from ensemble learning in the evaluation stage could improve the accuracy. As shown in Fig. 2b, one matrix sample was divided into N vector samples, where N was the number of channels, and the label of each vector was the same as that of the original matrix; the vector samples were then used as the model's input. The model calculated the categories of the N channels separately and then counted the numbers of 0s and 1s. When the number of positive samples exceeded a set threshold (e.g., N/2), the current EEG sample was considered positive; otherwise, it was negative. By splitting EEG signals into individual channels for training and inference, one-dimensional convolutional and max-pooling layers can be used to design the model, significantly reducing the number of parameters and related computations compared with two-dimensional counterparts. Moreover, there is no data pre-processing in this work, which further reduces the computation burden. For low-power processing, our proposed one-dimensional CNN model is convenient for hardware implementation.
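The voting step can be sketched as follows; the per-channel predictions in the example are hypothetical:

```python
def vote(channel_preds, threshold=None):
    """Fuse per-channel 0/1 predictions into one sample label: positive
    when the count of positive channels exceeds the threshold
    (N/2 by default, matching the ensemble-voting idea above)."""
    if threshold is None:
        threshold = len(channel_preds) / 2
    return 1 if sum(channel_preds) > threshold else 0

print(vote([1, 1, 1, 0, 0]))  # 3 of 5 channels positive → 1
print(vote([1, 0, 0, 0]))     # 1 of 4 channels positive → 0
```

Raising the threshold above N/2 trades sensitivity for specificity, since more channels must agree before an alarm is raised.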

Results
In order to evaluate and compare our proposed model, we used the public CHB-MIT database 28 of EEG signals. It contains long-duration recordings of 23 cases, collected by surface electrodes on the patients' scalps using 23 channels at a sampling rate of 256 Hz.
The proposed model is compared with five state-of-the-art seizure prediction algorithms 16,21,26,31,32 that were previously trained and evaluated with the CHB-MIT database. Some algorithms 25,33,34 were excluded from the comparison because their complex feature engineering processes are unfriendly for implant devices.
For a fair comparison, all models were trained and evaluated with the CHB-MIT database; T, SOP, and SPH were 4 h, 30 min, and 5 min, respectively, and the interictal state was 2 h before or after a seizure. Different sampling lengths and feature extraction methods were applied according to the corresponding descriptions in their works. The studies 21,32 and our proposed approach used a 5-s moving window with no overlap for sampling; the other studies used the window lengths given in their original descriptions. Truong et al. 16 converted the EEG samples into two-dimensional images by short-time Fourier transform. The Pearson correlation coefficient (PCC) for each pair of channels was required to generate the correlation matrix 26 . Raw EEG signals were used in the studies 21,31,32 and in our proposed model. In the following, we compare the different models in terms of prediction performance and implementation feasibility for implant devices.
With a 4-h seizure-free time T, the available seizures (true leading seizures) of each patient decrease, and the leave-one-out validation method requires at least three seizures for cross-validation. Combining these two conditions, only subjects 1, 6, 8, 9, 10, 18, and 22 have at least three leading seizures; the experimental results on these subjects are presented in Table 3. The second column indicates which leading seizure is used as the testing set, and the last row shows the average accuracy, sensitivity, and specificity of each model. Our model achieves the highest accuracy (80.0%) and sensitivity (87.7%), significantly higher than those of the second-ranked model, which are 75.3% and 81.5%, respectively. Although the 72.3% specificity is not the highest, it ranks second among all the models, and the other two indicators of our model are much higher than those presented in Xu et al. 31 . As a result, with fewer parameters and a smaller computing resource requirement, the proposed model achieves the highest accuracy and sensitivity. Table 4 shows the required number of model parameters and the number of floating-point operations. The number of parameters is related to the memory required for storing the model and therefore affects the size of the implant devices. For implantable devices integrated on chips, the on-chip memory must be small, up to 200 kb. In addition to model parameters, the intermediate calculation results also need to be stored on the silicon chip. Thus, models with fewer parameters are more feasible for area-critical implant chips. The models of Daoud et al. 21 and Xu et al. 31 require giga-level operations, making them impossible to implement in low-power biochips. The studies 16,26,32 require a few hundred mega operations; although proved feasible for low-power microcontroller unit platforms 35 , they are only suitable for wearable devices where power consumption is still at the level of a few mW.
The required floating-point operations of our model total only 12.825 M, more than twentyfold fewer than those of the other models; this low computational requirement gives unique advantages for low-power implementation in implants. Our model has the fewest floating-point operations and requires the least power consumption, and its number of parameters is only slightly more than that of Lawhern et al. 32 ; it is therefore more suitable for hardware implementation.

Table 3. Evaluation results of using the leave-one-out cross-validation method to test different models. The first two columns refer to the subjects and the number of seizures used as the test set. ACC: accuracy; TPR: true positive rate (sensitivity); TNR: true negative rate (specificity). (The table body, comparing Daoud et al. 21 , Truong et al. 16 , Zhang et al. 26 , Xu et al. 31 , Lawhern et al. 32 , and this work, is not reproduced here.)

Discussion
In Table 3, when testing seizure 1 of chb08, seizure 1 of chb10, and seizure 2 of chb22, all the models show high sensitivity, but the accuracy is approximately 50% and the specificity is close to 0, indicating that all samples are predicted as preictal stages. The indicators of seizure 4 of chb06 and seizure 4 of chb10 are also abnormal, as the accuracy is much lower than 50%. The performance of a CNN model is often related to three factors: the data, the model, and the training process. Among these six models, different feature extraction methods (raw data, STFT, and PCC), different model structure designs, and multiple cross-validations were used, yet all models showed similar abnormalities on the same test set. Thus, we consider that these abnormalities are related only to the training data, which can be explained from two aspects: the sampling parameters and the sample set partition method. As discussed in the "Materials and methods" section, without knowing the ground truth, the sampling parameters SOP and SPH are only hypothetical values. The ground-truth SOP and SPH can vary greatly among individuals; even in the same subject, the values can change over time 36 . Hence, the preictal segment and the prototypical physiological preictal signature (the ground truth) cannot match perfectly 37 . All the samples from the preictal segments are labeled as 1 (positive samples), as they are assumed to have similar characteristics 38 . But some samples without such characteristics may be labeled incorrectly due to the mismatch with the ground truth. The greater the number of false samples in the training set, the more likely the model is to predict a negative sample as positive. When such a model is tested on a normal testing set (with no or just a few false samples), it generates many false-positive predictions (negative samples predicted as positive).
This explains why the specificity is lower than the sensitivity for seizure 1 of chb08, seizure 1 of chb10, and seizure 2 of chb22. However, if the testing set itself has many false samples, the situation becomes more complex, as shown by the low accuracy for seizure 4 of chb06 and seizure 4 of chb10. The two sampling parameters SPH and SOP are set to 5 min and 30 min, respectively, in this work to fit the ideal prediction situation, but these values are not suitable for those abnormal cases.
A grid search method can be used to find the optimal values for the SOP and SPH, but when we increased the SPH to 30 min or longer, we could not obtain any preictal samples from some seizures, owing to missing data.
In addition to the SOP and SPH, the seizure-free time T is another sampling parameter. It takes into account the distinction between leading and follow-up seizures during sampling. As shown in Fig. 3, the postictal duration of seizure A is T1, the SOP and SPH of seizure B are T2 and T3, respectively, and the time interval between seizures A and B is T. If T is too short, i.e., less than T1+T2+T3, so that no distinction is made between leading and follow-up seizures, there will be some overlap between the postictal (of seizure A) and preictal (of seizure B) segments (Fig. 3a), meaning that the preictal segment contains some features of the postictal stage. The postictal stage is the period it takes for the patient to return to normal after a seizure occurs, and it is much easier to predict than the true preictal segment 24 . Thus, the model may achieve overly optimistic results in this situation. On the contrary, if T is greater than T1+T2+T3 (Fig. 3b), the preictal data is clean and contains no other noise signals or features, which is important for building a reliable prediction model. However, the duration of the postictal stage is unknown, as the existing databases do not contain such annotations; we can only make T as large as possible to avoid overlap between the preictal and postictal segments.
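The no-overlap condition above amounts to a single inequality; a minimal sketch (times in seconds, function name hypothetical):

```python
def preictal_is_clean(t_gap_s, t1_postictal_s, t2_sop_s, t3_sph_s):
    """True when the interval T between seizures A and B exceeds
    T1 + T2 + T3, so B's preictal window cannot overlap A's
    postictal stage (the Fig. 3b case)."""
    return t_gap_s > t1_postictal_s + t2_sop_s + t3_sph_s

# T = 4 h comfortably clears a (hypothetical) 1-h postictal stage
# with SOP = 30 min and SPH = 5 min.
print(preictal_is_clean(4 * 3600, 3600, 1800, 300))  # → True
```

Since T1 is never annotated in the public databases, the check can only be applied with a pessimistic (large) assumed postictal duration.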
In addition to the leave-one-out cross-validation method, Fig. 4 shows another two commonly used preictal sample set partition methods. The top frame is assumed to be the preictal sample set of a certain patient whose EEG signals include five seizures in total. The numbers indicate from which seizure the preictal samples were collected. Owing to missing data, the sample sizes of the preictal segments are not equal, as reflected in the different widths of the rectangles in the top frame. The sample set is split into ten parts in the middle frame, one as the validation set, one as the testing set, and the remaining eight parts as the training set, i.e., K-fold cross-validation. In the bottom frame, the sample set is split into training, validation, and testing sets at a ratio of 8:1:1. As mentioned in the "Materials and methods" section, the last two dataset partition methods do not take the real-life situation into account but simply divide the sample set mathematically. When predicting an epileptic seizure, all the data of this seizure should be unknown. Thus, the testing set should simulate such a situation, and the training set can only contain historical data. In Fig. 4, the training, validation, and testing sets are marked in blue, green, and orange, respectively. It can be noted that, except for the leave-one-out cross-validation method, the other two methods divide the preictal samples of seizure No. 5 into two parts (dashed lines); namely, samples from the same seizure (seizure No. 5) exist in both the training and testing sets, which leads to data leakage from the testing set into the training set. The situation of the validation set is similar to that of the testing set; it cannot be guaranteed that the model is not overfitting. In fact, we observed around a 10% performance increase due to this overfitting. Thus, models usually do not perform "so well" when the leave-one-out cross-validation method is used. According to Hussein et al. 39 , samples with the same state (preictal or interictal) have similar features, but the features can change over time. This means there are some differences between two preictal segments. In other words, preictal samples have some common features (global features) and some specific features (local features).
As such, the model's performance may not always be good on different cross-validation sets. A model can easily achieve high accuracy when ignoring the preceding rules; nevertheless, we should focus on the model's generalization ability instead of simply pursuing high performance on a small dataset. This is why we emphasized that the proposed procedure should be followed. Furthermore, high-quality data and reliable annotation are two critical issues in supervised machine learning and deep learning problems. Although the SPH and SOP are clinically unknown, other prior knowledge may also be helpful for seizure prediction research. For example, by observing the states of the patients, specialist physicians might be able to determine the duration of the postictal state, seizure types, temperatures, and other signs in the patients, along with which seizures belong to the same cluster.

Conclusion
In this work, we analyzed the effects of T, SOP, and SPH on the sample set. We also discussed the different evaluation methods and the corresponding dataset partition methods. All these factors have a great impact on the model's performance. We also proposed a CNN model and a novel single-channel training and inference method. The model consists of one-dimensional convolutional layers and a Bi-LSTM network, has 6.274 k parameters, and requires 12.825 M floating-point operations, which is faster and more lightweight than previous models. The proposed model achieves 77.6% accuracy, 82.7% sensitivity, and 72.4% specificity on the public CHB-MIT dataset, outperforming state-of-the-art works.

Figure 4. Different dataset partition methods. Assuming the patient has 5 seizures in total, the top frame is the leave-one-out cross-validation; the 5 labeled rectangles correspond to the preictal segments of the 5 different seizures. The samples are split as K-fold and 8:1:1, respectively, in the following two frames, where K is 10 in this instance.