A ResNet-LSTM hybrid model for predicting epileptic seizures using a pretrained model with supervised contrastive learning

In this paper, we propose a method for predicting epileptic seizures using a pre-trained model utilizing supervised contrastive learning and a hybrid model combining residual networks (ResNet) and long short-term memory (LSTM). The proposed training approach encompasses three key phases: pre-processing, pre-training as a pretext task, and training as a downstream task. In the pre-processing phase, the data is transformed into a spectrogram image using the short-time Fourier transform (STFT), which extracts both time and frequency information. This step compensates for the inherent complexity and irregularity of electroencephalography (EEG) data, which often hampers effective data analysis. During the pre-training phase, augmented data is generated from the original dataset using techniques such as band-stop filtering and temporal cutout. Subsequently, a ResNet model is pre-trained with a supervised contrastive loss, learning the representation of the spectrogram image. In the training phase, a hybrid model is constructed by combining ResNet, initialized with weight values from the pre-trained model, and LSTM. This hybrid model extracts image features and time information to enhance prediction accuracy. The proposed method's effectiveness is validated using datasets from CHB-MIT and Seoul National University Hospital (SNUH), and its generalization ability is confirmed through leave-one-out cross-validation. Measuring accuracy, sensitivity, and false positive rate (FPR), the proposed method achieved 91.90%, 89.64%, and 0.058 on CHB-MIT, and 83.37%, 79.89%, and 0.131 on SNUH, respectively. The experimental results demonstrate that the proposed method outperforms conventional methods.

Epilepsy is a chronic neurological disorder that affects about 50 million people, approximately 1% of the world's population. Seizures are typical clinical manifestations of epilepsy, characterized by sudden and temporary neurobehavioral symptoms caused by abnormally hypersynchronous electrical discharges from overexcited neurons in the brain 1,2. Except for a few special cases, seizures occur irregularly, and patients' premonitory symptoms are uncertain. Moreover, the exact onset time cannot be estimated because it differs among individuals. Because of this unpredictability, people with epilepsy are limited in their social activities and exposed to trauma and danger, which substantially impacts their quality of life 3. Furthermore, patients with severe epilepsy are hospitalized and managed around the clock by medical personnel. However, medical personnel are too few to manage all patients, and correct judgments cannot be made solely from monitoring patient behavior. As a result, various studies related to epilepsy are being conducted to ensure stable daily lives for epilepsy patients 4 and to enable precise prevention and treatment with limited medical resources.
Because EEG detects the electrical signals generated by the brain during seizures, absence seizures and focal seizures without awareness can also be identified 5. Therefore, from the 1970s to the present, EEG data has

Database
The datasets used in this research can be classified by reference electrode selection into two methods: 'unipolar reference' and 'bipolar reference'. The SNUH dataset was measured using the unipolar reference method, while the CHB-MIT dataset used the bipolar reference method. In the unipolar reference method, the GND value is determined by averaging all the electrodes and converting them into digital signals, with all electrodes sharing the same GND. The difference between each individual signal and the signal measured at the common ground is recorded. However, this method is susceptible to fine noise and common-mode signals, which can be amplified and output. In the bipolar reference method, on the other hand, each adjacent electrode is used as the GND when converting to a digital signal. This method is resistant to noise common to the electrode attachment points, as the measurement procedure eliminates it; however, it makes it difficult to observe brain waves at a specific location. A description of the two datasets is included below.
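The common-mode cancellation that makes the bipolar reference noise-resistant can be illustrated with a minimal numpy sketch. This is not the datasets' actual montage; the channel count, pairing, and noise model are illustrative assumptions.

```python
import numpy as np

def to_bipolar(unipolar: np.ndarray, pairs: list) -> np.ndarray:
    """Derive bipolar channels by differencing electrode pairs.

    unipolar: array of shape (channels, samples); pairs: (anode, cathode) indices.
    """
    return np.stack([unipolar[a] - unipolar[b] for a, b in pairs])

rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 1000))       # 4 clean unipolar channels (toy data)
common_noise = rng.standard_normal(1000)   # common-mode interference
noisy = ref + common_noise                 # every electrode picks up the same noise

bipolar = to_bipolar(noisy, [(0, 1), (2, 3)])
clean = to_bipolar(ref, [(0, 1), (2, 3)])
# The shared noise term cancels in each electrode difference,
# so the bipolar signals from noisy and clean recordings coincide.
```

This illustrates why common-mode signals vanish under a bipolar montage, while any signal localized to a single electrode survives only as a difference, which is why observing activity at one specific location becomes harder.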

Pre-processing
The amount of data, the model, and the characteristics of the data all significantly influence the performance of models in data-based supervised learning. EEG data has three disadvantages: a class imbalance between pre-ictal and inter-ictal data, an insufficient data quantity, and complexity and irregularity that make analysis difficult. These disadvantages directly affect the model's performance. We addressed these issues during the pre-processing phase.

Data sampling
As illustrated in Fig. 1, we defined the period before ictal onset as "pre-ictal" and set the durations to 10, 15, and 30 min. "Inter-ictal" is defined as the period more than 3 hours away from the seizure, when the seizure waveform is absent from the EEG 30. The validation datasets, CHB-MIT and SNUH, exhibit a class imbalance between pre-ictal and inter-ictal data due to the relatively small number of ictals in comparison to the total recording length. When there is a large difference in the number of classes in the dataset, classes with a high distribution are given more weight during training. In the case of a seizure dataset with a substantial proportion of inter-ictal data, overall accuracy may increase while sensitivity decreases. As sensitivity is directly related to the patient's life, resolving the difference in distribution between the two classes can lead to improved performance. To resolve the imbalance, we employed undersampling to extract inter-ictal data of the same length as the pre-ictal data, as depicted in Fig. 1. Additionally, oversampling was conducted to supplement the existing limited data and compensate for information loss during undersampling. As shown in Fig. 2, the window size was set to 10 s, and the sliding window algorithm was applied every 1 s to generate overlapping data. Through data sampling, the data imbalance was resolved, and insufficient data were supplemented.
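The oversampling step above can be sketched as follows. The window size (10 s) and stride (1 s) come from the text; the sampling rate, channel count, and segment length are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def sliding_windows(x: np.ndarray, fs: int, win_s: int = 10, stride_s: int = 1) -> np.ndarray:
    """Slice a multichannel segment into overlapping windows.

    x: array of shape (channels, samples);
    returns shape (n_windows, channels, win_s * fs).
    """
    win, stride = win_s * fs, stride_s * fs
    starts = range(0, x.shape[1] - win + 1, stride)
    return np.stack([x[:, s:s + win] for s in starts])

fs = 256                               # sampling rate (illustrative)
segment = np.zeros((18, 60 * fs))      # one 60 s, 18-channel segment (toy data)
windows = sliding_windows(segment, fs)
# A 60 s segment yields 51 overlapping 10 s windows at a 1 s stride.
```

Each 10 s window overlaps its neighbor by 9 s, so a single segment yields many training examples, which is what compensates for the data discarded by undersampling.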

STFT
The irregular and complex raw EEG data presented in Fig. 3a was transformed using STFT into a spectrogram image, with the x-axis indicating time and the y-axis representing frequency, as shown in Fig. 3b. In a spectrogram, the power of each frequency band at a particular time can be easily observed, and the data can be analyzed using both the time information on the x-axis and the image characteristics. The raw signal is converted into a discrete time-frequency representation using the STFT in Equation (1):

X(m, ω) = Σ_n x[n] w[n − m] e^{−jωn}   (1)

Here, x[n] represents the raw signal in the time domain, m and n denote the time axes, and ω signifies the frequency axis; w[·] refers to the window function. For continuous data analysis, a Hanning window with a length of 1 s and 50% overlap was applied to enhance the time resolution 31. As depicted in Fig. 3c, the data were constructed using only the information-rich 0–60 Hz band. The difficult-to-analyze EEG was thus transformed into spectrograms containing time-frequency information, a preprocessing step that facilitates feature extraction from the data.
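The STFT step can be sketched in numpy under the paper's stated settings (1 s Hann window, 50% overlap, keeping only 0–60 Hz). The sampling rate and test tone are illustrative assumptions, and the exact frame count depends on the padding convention, so it may differ slightly from the dimensions reported later.

```python
import numpy as np

def stft_spectrogram(x: np.ndarray, fs: int, fmax: float = 60.0) -> np.ndarray:
    """Power spectrogram of a 1-D signal: 1 s Hann window, 50 % overlap.

    Returns an image of shape (freq_bins <= fmax, time_frames).
    """
    win = fs                       # 1 s window
    hop = win // 2                 # 50 % overlap
    w = np.hanning(win)
    frames = [x[s:s + win] * w for s in range(0, len(x) - win + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2   # power
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    return spec[:, freqs <= fmax].T

fs = 256                                                  # illustrative rate
x = np.sin(2 * np.pi * 10 * np.arange(10 * fs) / fs)      # 10 s, 10 Hz tone
img = stft_spectrogram(x, fs)
# The spectrogram concentrates power in the 10 Hz bin across all frames.
```

With a 1 s window the frequency resolution is 1 Hz, so cropping at 60 Hz leaves 61 bins (0–60 Hz inclusive) on the y-axis of the image.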

Pretext task: pre-training
We conducted pre-training to achieve high performance with limited data. The original data were augmented using a band-stop filter and temporal cutout, and then trained in a model consisting of ResNet and a supervised contrastive loss. Training with augmented data can prevent overfitting, and the image representation is acquired in advance. Even with a small dataset, the training model could determine optimal parameters through the use of the pre-trained ResNet.

Data augmentation
The augmentation method has primarily been employed in image processing within the field of vision 32, and it has also found applications in signal processing 33 and other domains. For EEG data, which contains both signal information and STFT-converted image data, a band-stop filter and temporal cutout were employed to satisfy both requirements. When specific frequency-band and time-zone information is removed, the STFT-applied image takes the form of a horizontally and vertically cropped representation. The images shown in Fig. 3d, e were generated through augmentation. The temporal cutout involved vertical cropping, removing 6 out of 10 s at random. Experiments were performed to determine the removed time span and the width of the removed frequency band. Augmented images were used as input for the pre-trained model.
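Applied to the spectrogram image, the two augmentations amount to zeroing a horizontal (frequency) strip and a vertical (time) strip. The sketch below assumes this image-level formulation; the stopped band's position and width are illustrative, while the "6 of 10 s" cutout length follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def band_stop(img: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Zero out frequency rows [lo, hi) of a (freq, time) spectrogram."""
    out = img.copy()
    out[lo:hi, :] = 0.0            # horizontal strip removed
    return out

def temporal_cutout(img: np.ndarray, cut_cols: int) -> np.ndarray:
    """Zero out a random contiguous block of time columns."""
    out = img.copy()
    start = rng.integers(0, img.shape[1] - cut_cols + 1)
    out[:, start:start + cut_cols] = 0.0   # vertical strip removed
    return out

img = np.ones((61, 10))            # toy spectrogram: 61 freq bins, 10 time steps
aug1 = band_stop(img, 20, 30)      # illustrative 10-bin stop band
aug2 = temporal_cutout(img, 6)     # remove 6 of 10 time steps at random
```

Both views of the same source image then serve as the positive pair (or class members) for the contrastive pre-training stage.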

Residual learning
CNN 34, which is effective in analyzing patterns in images, has been widely utilized in the field of computer vision. Deeper layers within CNN models are recognized as crucial for determining the model's performance. However, contrary to initial expectations, increasing the depth of layers in CNN-based models often leads to degradation issues 35. ResNet was introduced as a solution to this degradation problem. It employs the model structure of VGGNet (Visual Geometry Group Net) 36 and incorporates shortcut connections that add input values to output values 35.
In this study, we employ ResNet-18, the shallowest model in the ResNet family. This decision is influenced by the experimental dataset, which consists of small images with dimensions of (21 × 60). Smaller images inherently carry less information, making it more challenging to effectively capture essential features and patterns within deeper networks. ResNet-18 consists of five blocks, as illustrated in Fig. 4. Each block includes batch normalization, Rectified Linear Unit (ReLU), and max pooling. The input dimensions for the CHB-MIT and SNUH datasets are (18 × 21 × 60) and (21 × 21 × 60), respectively. These dimensions represent the number of electrodes, the temporal information derived from a window size of 1 s with 50% overlap, and the frequency components. For CHB-MIT, the resulting feature maps from each block are (64 × 21 × 60), (64 × 21 × 60), (128 × 11 × 30), (256 × 6 × 15), and (512 × 3 × 8). The final feature map obtained from ResNet is transformed into a 512-dimensional vector through adaptive average pooling and a flatten layer. During pre-training and training, the output of ResNet is used as the input to the supervised contrastive loss and the LSTM layer, respectively.
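The spatial sizes listed above follow from each stride-2 stage halving height and width with ceiling rounding (as "same"-padded stride-2 convolutions do). A small sketch reproduces the progression; it assumes three stride-2 stages after the stem, which matches the reported shapes rather than being stated explicitly in the text.

```python
import math

def halve(hw):
    """Spatial size after a stride-2, 'same'-padded stage: ceil(dim / 2)."""
    return (math.ceil(hw[0] / 2), math.ceil(hw[1] / 2))

shapes = [(21, 60)]            # input spatial size of one spectrogram
for _ in range(3):             # three stride-2 stages (assumed from the shapes)
    shapes.append(halve(shapes[-1]))
# shapes: (21, 60) -> (11, 30) -> (6, 15) -> (3, 8), matching the feature maps.
```

Only the channel counts (64, 128, 256, 512) change between the stages; adaptive average pooling then collapses the final (3 × 8) map into the 512-dimensional feature vector.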

Supervised contrastive learning
Contrastive learning has its origins in metric learning 37 and is currently studied primarily as a technique for pre-trained models. Notable approaches include self-supervised contrastive learning 38 and supervised contrastive learning 39. Self-supervised contrastive learning is an unsupervised learning algorithm appropriate for large quantities of unlabeled data, but it cannot outperform supervised learning. Supervised contrastive learning was proposed to address this deficiency: in contrast to self-supervised contrastive learning, loss values are allocated based on class; in other words, it is a method of supervised learning using labeled data. Equation (2) represents the self-supervised contrastive loss, while Equation (3) denotes the supervised contrastive loss:

L_self = −Σ_{i∈I} log [ exp(z_i · z_{j(i)}/τ) / Σ_{a∈A(i)} exp(z_i · z_a/τ) ]   (2)

L_sup = Σ_{i∈I} (−1/|P(i)|) Σ_{p∈P(i)} log [ exp(z_i · z_p/τ) / Σ_{a∈A(i)} exp(z_i · z_a/τ) ]   (3)

The symbol · represents the dot product, and τ is a temperature hyperparameter. When the batch size is N and I ≡ {1 … 2N} is the index set of the augmented samples, 2N indexes are included. In Equation (2), z_{j(i)} represents the single positive sample, the other image augmented from the same source, while the 2(N − 1) remaining indexes A(i) supply the negative samples z_a. The denominator term z_i · z_a computes the similarity against each of the 2(N − 1) other samples, while the numerator z_i · z_{j(i)} compares similarity only with the one image augmented from the same source; with the exception of that one augmented image, all images are considered negative. In Equation (3), P(i) represents the samples from the same class, all of which are considered positive. Positive and negative samples are separated by class, and the loss is calculated as the mean similarity value over all positive samples 39. Because the supervised contrastive loss compares data within the same batch during training, the larger the positive sample size and batch size, the better the performance. We conducted pre-training using supervised contrastive learning, which clearly separates the classes (Fig. 5).
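A minimal numpy sketch of the supervised contrastive loss (Equation (3)): for each anchor, every other sample sharing its label is a positive. The temperature and the toy embeddings are illustrative assumptions, not the paper's values.

```python
import numpy as np

def supcon_loss(z: np.ndarray, labels: np.ndarray, tau: float = 0.1) -> float:
    """Supervised contrastive loss over a batch of embeddings z (n, d)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalise
    sim = z @ z.T / tau                                # pairwise similarities
    n = len(z)
    loss = 0.0
    for i in range(n):
        mask = np.arange(n) != i                       # A(i): everyone but the anchor
        pos = mask & (labels == labels[i])             # P(i): same-class samples
        if not pos.any():
            continue
        log_denom = np.log(np.exp(sim[i, mask]).sum())
        loss += -np.mean(sim[i, pos] - log_denom)      # mean over positives
    return loss / n

labels = np.array([0, 0, 1, 1])
# Same-class embeddings aligned, classes orthogonal -> low loss.
tight = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
# Positives orthogonal, negatives aligned -> high loss.
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

The loss drops as same-class embeddings cluster and different-class embeddings repel, which is exactly the representation structure the pre-training stage aims for.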
LSTM, a variant of the recurrent neural network (RNN), has been widely applied to sequential tasks such as speech recognition, language modeling, and translation. Additionally, to address the gradient vanishing phenomenon that occurs with the long-term dependencies of RNN data, the cell state allows information to be transmitted over long distances without loss. Figure 6 illustrates the internal structure of the LSTM cell state, encompassing the forget gate, input gate, and output gate.
The LSTM's calculation procedure is as follows: C_t represents the cell state value, h_t denotes the hidden state value, x_t is the input value, σ signifies the sigmoid function, tanh is the hyperbolic tangent function, and f_t, i_t, C̃_t, and o_t represent the output values of each gate.
(a) Equation (4) represents the forget gate:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)   (4)

The sigmoid function produces a value ranging between 0 and 1, indicating the extent to which past information should be discarded; a value closer to 0 implies less retention of information.
(b) Equations (5) and (6) correspond to the input gate, responsible for selecting crucial information from incoming data. Equation (5) defines the value to be updated using the sigmoid function, while Equation (6) calculates a new candidate vector C̃_t, which will contribute to updating the cell state:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)   (5)

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)   (6)

(c) Equation (7) updates C_{t−1} to C_t. The new cell state is obtained through a combination of addition and multiplication involving the data from the preceding steps: the previous cell state C_{t−1} is multiplied by the forget-gate output f_t and then updated by adding the product of the input-gate values:

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t   (7)

(d) Equations (8) and (9) represent the output gate, responsible for generating the final output. In Equation (8), the sigmoid function determines which part of the state to output; ultimately, in Equation (9), the output is obtained by multiplying the result of Equation (8) with tanh(C_t):

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)   (8)

h_t = o_t ⊙ tanh(C_t)   (9)
As demonstrated in the previous equations, the cell state selectively discards irrelevant past information, incorporates pertinent current information, and iteratively updates itself using the gates. This enables the LSTM model to exhibit outstanding performance, even when dealing with data that exhibits long-term dependencies 40.
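One step of the gate equations above can be sketched in numpy. The tiny dimensions and zero-initialized weights are illustrative only; with zero weights every gate sigmoid evaluates to 0.5, which makes the update easy to verify by hand.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the f, i, c-tilde, o parameters in order."""
    z = W @ x_t + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate cell values
    c_t = f * c_prev + i * g                       # cell state update
    h_t = o * np.tanh(c_t)                         # hidden state output
    return h_t, c_t

hidden, inp = 3, 2
W = np.zeros((4 * hidden, inp))
U = np.zeros((4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = lstm_step(np.ones(inp), np.zeros(hidden), np.ones(hidden), W, U, b)
# With zero weights: f = i = o = 0.5, candidate g = 0,
# so c_t = 0.5 * c_prev and h_t = 0.5 * tanh(0.5).
```

In the downstream model, a step like this is applied to the sequence of 512-dimensional ResNet feature vectors, with the cell state carrying information across the time frames of the spectrogram.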

ResNet-LSTM hybrid model
The STFT pre-processing step produces image data with time information along the x-axis and frequency information along the y-axis. In this study, both time and frequency information were used to extract data characteristics with a hybrid model combining ResNet and LSTM. ResNet extracts the image features as a 512-dimensional vector, which is delivered to the LSTM as input. Time-series analysis is performed on the extracted features by an LSTM with one hidden layer, and the result is classified by a linear classifier with dropout and ReLU layers in the output layer.

Result and discussion
EEG data has three disadvantages for seizure prediction: complexity and irregularity, a small amount of data, and class imbalance. Patient-specific seizure prediction is further restricted by the separation of patient-specific data. Therefore, we developed a pre-trained model that can be applied to seizure prediction. To address the potential issue of overfitting due to limited training data, we employed the model described in the Pretext task section, as depicted in Fig. 5. This approach reduced the risk of overfitting and improved the generalization of our model. Moreover, it provided initial weight values for determining the optimal training model parameters. The proposed method's pseudocode is shown in Algorithms 1 and 2.
We defined a single data instance as 10 s and predicted seizures by classifying pre-ictal and inter-ictal data. Leave-one-out cross-validation was employed to aggregate the results effectively. In this approach, a pair of pre-ictal and inter-ictal data instances was treated as a single unit, with N − 1 units used for training while the remaining unit served as the test set; this process was iterated N times. Evaluation metrics such as sensitivity, specificity, accuracy, and false positive rate (FPR) were employed and are detailed in Table 1. Furthermore, we conducted statistical testing on the means of each patient's performance using a paired t-test. Training used the PyTorch framework on Windows and the stochastic gradient descent (SGD) optimizer, which demonstrates better generalization than adaptive optimization methods 41, an advantage when addressing overfitting concerns with limited data. For the pre-training phase, a batch size of 512, 300 epochs, and a learning rate of 0.05 were employed. For the subsequent training phase, 100 epochs, a learning rate of 0.01, and the same batch size were used. Each hyperparameter was determined through a series of experiments.
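The evaluation metrics can be computed directly from the confusion-matrix counts. In the sketch below, label 1 denotes pre-ictal and label 0 inter-ictal; the toy predictions are illustrative, not experimental results.

```python
def metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and FPR from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn)            # sensitivity: recall on pre-ictal
    spec = tn / (tn + fp)            # specificity: recall on inter-ictal
    acc = (tp + tn) / len(y_true)    # overall accuracy
    fpr = fp / (fp + tn)             # false positive rate = 1 - specificity
    return sens, spec, acc, fpr

y_true = [1, 1, 1, 1, 0, 0, 0, 0]    # toy ground truth
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]    # toy predictions: one miss, one false alarm
sens, spec, acc, fpr = metrics(y_true, y_pred)
```

Because FPR counts false alarms on inter-ictal data, it is the complement of specificity; this is why an improvement in specificity in the tables appears as a matching drop in FPR.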
In this paper, the SNUH and CHB-MIT datasets were utilized for validation. Both datasets share the same number of patients, as detailed in the "Database" section. However, the SNUH dataset contains 78 fewer ictal instances and was measured using the noisier unipolar reference method. In our experimental setup, we defined the pre-ictal period as 10, 15, and 30 min, with the evaluation metrics confirming the outcomes for each respective period. Tables 2, 3, 4 and 5 present the patient-specific results; Tables 3 and 5 are the results of applying the pre-trained model. A summary of the performance is provided in Tables 6 and 7. According to Table 6, all results using pre-training were enhanced, with sensitivity showing more improvement than specificity for the 10 and 15 min data. In the case of the 30 min data, a higher rate of increase in specificity and FPR led to an improvement in accuracy. Similar to the CHB-MIT results, all SNUH results in Table 7 also improved, and the sensitivity of the 10 and 15 min results improved even further; in addition, specificity was enhanced for the 30 min data. Experiments conducted on both datasets yielded comparable outcomes. The 30 min pre-ictal period yielded fewer extractable data than the 10 and 15 min periods, and seizure signs tended to weaken with distance from the ictal event. Consequently, when comparing the 30 min data to the other intervals, further enhancements in specificity were observed. In the context of seizure prediction, defining the pre-ictal period is a significant consideration: extending it offers the advantage of advance patient preparation, but at the cost of reduced accuracy and increased patient anxiety. As demonstrated in Fig.
7, across the two datasets, CHB-MIT had the highest value at 15 min while SNUH had the highest value at 10 min, and both datasets had similar values at 10 and 15 min. Even with a small amount of data, accuracy at 10 and 15 min was ensured for SNUH, and in the paired t-test results of Tables 6 and 7, the values with and without pre-training showed a significant difference in the overall results (p < 0.05), indicating that pre-training plays a significant role in improving performance. STFT conversion transforms the EEG data into a spectrogram image with time represented on the x-axis and frequency on the y-axis. For training, we used a hybrid model that combines ResNet and LSTM to reflect both types of information. The experimental outcomes for pre-train + ResNet and pre-train + ResNet-LSTM are outlined in Table 8. The experiment showed improved results for both datasets, confirming the benefits of the hybrid model.
Table 9 shows previous studies on patient-specific seizure prediction using the CHB-MIT dataset. Contemporary research trends involve extracting frequency-domain features and utilizing machine learning and deep learning methodologies as classifiers. Ongoing investigations aim to enhance sensitivity and reduce FPR by addressing challenges such as data imbalance and insufficient samples, both inherent in EEG. Jemal et al. 46 obtained a high sensitivity of 96.1% from 23 patients but with low specificity, and they employed 5-fold cross-validation instead of leave-one-out cross-validation as the performance validation method. Table 9 includes two approaches 42,44 that employ STFT, the same method applied in this study. Among these, the experimental results of Yang et al. 42 demonstrated low sensitivities of 59.9%, 66%, and 56% for patients 2, 9, and 14, respectively. For patients 2 and 9, limited pre-ictal data relative to the total duration was a factor, while patient 14 had a shorter recording duration, indicating less effective training. The majority of studies on seizure prediction using CHB-MIT reported poor outcomes for these patients due to the aforementioned issues. As demonstrated in Table 3, the experimental results of our study revealed that the 10 min sensitivity for all three patients exceeded 80%, and patient 9's sensitivity improved by nearly 40%. The inter-ictal weight concentration phenomenon was resolved by addressing the class imbalance. By generating a pre-trained model, the representation was acquired in advance, enabling the model to determine the optimal weight values during the actual training process. Through these interventions, we succeeded in enhancing outcomes for patients with previously low sensitivity. Table 9 does not present results based on all 24 patients, as certain experimental patient data was lacking and there was no common channel. The proposed method's experimental results are presented for all patients, including those used in the previous method 42. We obtained higher sensitivity and lower FPR compared to conventional methods.

Conclusion
In this paper, we propose a method for predicting epileptic seizures based on a pre-trained model that employs supervised contrastive learning and a hybrid model that combines ResNet and LSTM. In the pre-processing phase, the data were transformed using STFT so that the training model could efficiently perform feature analysis, and the class imbalance between pre-ictal and inter-ictal data as well as the insufficient data were addressed by undersampling and oversampling. During pre-training, data were augmented and pre-trained with a ResNet and supervised contrastive loss model so that the training model could find the optimal parameters with less data.
During the training phase, image features and time-series information were extracted using a hybrid model comprising a pre-trained ResNet and LSTM. The experimental results reveal that CHB-MIT demonstrates optimal performance for the 15 min pre-ictal period, while SNUH performs best for the 10 min pre-ictal period. We demonstrated higher sensitivity and a lower FPR than conventional methods.

Figure 2 .
Figure 2. The sliding window algorithm is employed to apply oversampling with a 10 s window sliding in 1 s steps.

Figure 3 .
Figure 3. The pre-processing procedure for a single channel. The data shown in (a) is in its original form, referred to as raw EEG data. The spectrogram image depicted in (b) has undergone the application of STFT. Image (c) represents data truncated to a frequency range of 0–60 Hz. For the pre-training phase, data augmentation techniques were applied to produce images (d, e); specifically, temporal cutout and band-stop filters were utilized.

Figure 5 .
Figure 5. Our proposed method comprises two key modules: the Pretext task and the Downstream task. In the Pretext task, data augmentation involving a band-stop filter and temporal cutout is applied, followed by the training of a pre-trained ResNet model with a supervised contrastive loss; this yields a pre-trained representation for the augmented data. In the Downstream task, fine-tuning is performed on the LSTM using the pre-trained ResNet, and training is conducted on the preprocessed original data.

Figure 7 .
Figure 7. The graph compares two models, ResNet-LSTM and Pre-train + ResNet-LSTM. The evaluation metrics comprise sensitivity, accuracy, and FPR.

Table 2 .
Seizure prediction results obtained with the CHB-MIT dataset. We include experimental results from all 24 patients. The experiment utilized pre-ictal intervals of 10, 15, and 30 min. The corresponding table represents the outcome of ResNet-LSTM without pre-training. Patient 11's 30 min result was excluded from the analysis due to insufficient data.

Table 3 .
Seizure prediction results obtained with the CHB-MIT dataset. We include experimental results from all 24 patients. The experiment utilized pre-ictal intervals of 10, 15, and 30 min. The corresponding table demonstrates the outcomes of applying the pre-trained model to ResNet-LSTM. Patient 11's 30 min result was excluded from the analysis due to insufficient data. Significant values are in bold.

Table 4 .
Seizure prediction results obtained with the SNUH dataset. We include experimental results from all 24 patients. The experiment utilized pre-ictal intervals of 10, 15, and 30 min. The corresponding table represents the outcome of ResNet-LSTM without pre-training.

Table 5 .
Seizure prediction results obtained with the SNUH dataset. We include experimental results from all 24 patients. The experiment utilized pre-ictal intervals of 10, 15, and 30 min. The corresponding table demonstrates the outcomes of applying the pre-trained model to ResNet-LSTM. Significant values are in bold.

Table 6 .
The corresponding table presents a comparison and summary of the experimental results obtained using pre-trained models on the CHB-MIT dataset. The left and right sides of the table show the results before and after the pre-trained model application, respectively. The p-value represents the result of the paired t-test. Significant values are in bold.

Table 8 .
The following table displays the hybrid model's performance verification results. The left side of the table presents outcomes for the pre-trained ResNet, while the right side displays results for the hybrid model that integrates both the pre-trained ResNet and LSTM. The table is organized with the outcomes for the CHB-MIT dataset in the upper section and the SNUH dataset in the lower section. Significant values are in bold.

Table 9 .
This table presents the outcomes of conventional seizure prediction methods on the CHB-MIT dataset. "This work 1" corresponds to the validation results for the same patients as in the prior method 42. "This work 2" signifies the pre-ictal 15 min overall outcomes encompassing all patients. Significant values are in bold.