Multiband entropy-based feature-extraction method for automatic identification of epileptic focus based on high-frequency components in interictal iEEG

Presurgical investigations for categorizing focal patterns are crucial, leading to localization and surgical removal of the epileptic focus. This paper presents a machine learning approach using information-theoretic features extracted from high-frequency subbands to detect the epileptic focus from the interictal intracranial electroencephalogram (iEEG). High-frequency subbands (>80 Hz) are known to include important biomarkers, such as high-frequency oscillations (HFOs), for identifying the epileptic focus, commonly referred to as the seizure onset zone (SOZ). In this analysis, the multi-channel interictal iEEG signals were split into segments, and each segment was decomposed into multiple high-frequency subbands. Different types of entropy were calculated for each subband, and sparse linear discriminant analysis (sLDA) was applied to select the prominent entropy features. Because SOZ channels are far fewer than non-SOZ channels in iEEG data, standard machine learning techniques are difficult to apply directly. To deal with this imbalanced learning problem, an adaptive synthetic oversampling approach (ADASYN) with a radial basis function kernel-based SVM was used to detect the focal segments. Finally, the epileptic focus was identified based on the detection of focal segments on SOZ and non-SOZ channels. Eight patients were examined to observe the efficiency of the automatic detector. The experimental results and statistical tests indicate that the proposed automatic detector can identify the epileptic focus accurately and efficiently.

Epilepsy is one of the most common neurological disorders, affecting people of all ages worldwide. According to the World Health Organization (WHO), approximately 50 million people globally have been diagnosed with epilepsy, which causes social impairment and carries a higher risk of death 1,2. Epilepsy is characterized by repeated and unpredictable seizures caused by abnormal neuronal firing in the brain 3. Physicians classify seizures as either focal (partial) or generalized, based on the location of abnormal brain activity and its propagation 4,5. Most patients are prescribed inexpensive daily medication to control epileptic seizures, but some become resistant to it; for these patients, surgical resection of the epileptic focus may provide the best chance of seizure control 6,7. Therefore, the localization of the epileptic focus is crucial for epilepsy treatment. The standard diagnostic modalities for epileptic focus detection are the investigation of seizure semiology, MRI, and EEG. When the epileptologist cannot determine an epileptic focus using noninvasive methods, the implantation of intracranial electrodes to record iEEG during both the interictal and ictal phases is indicated.
In practice, accurate detection of the epileptic focus is generally achieved by epileptologists observing long-term iEEG and categorizing the patterns of the seizures. The visual examination of long-term iEEG is a time-consuming and laborious process, as the detection of the seizures from the interictal time in iEEG is the [...]
Effect of Feature Selection. A combination of eight entropies was used to extract features from each subband, as defined in Eq. (16). We hypothesized that some of the eight entropy features would be more effective than others for recognition. Therefore, we used the sLDA weights obtained from the training set in Eq. (17) to select the prominent features of each subband, namely those with non-zero weights. In this study, we set the sparsity parameters δ = 3 and δ1 [...] based on the training set to achieve satisfactory results, where the absolute value of δ1 corresponds to the desired number of variables. Figure 1 shows the colormap of sLDA weights, with the subbands on the vertical axis and the different types of entropy on the horizontal axis. The figure [...]
Performance Analysis with Different Cases. To evaluate the performance of the different algorithms (FbA, FbA/ADA, and FbA/FS/ADA), the AUC was computed for the eight patients, as shown in Table 1. In cases of imbalanced learning, the AUC of a method is equivalent to the probability of ranking a randomly chosen positive instance higher than a randomly chosen negative instance 29. In this experiment, we set all parameters for ADASYN according to the study 30 to balance the training features. The algorithm FbA/ADA with feature selection (FbA/FS/ADA) achieves the best AUC for all eight patients. The reason for the lower performance of the FbA method is the highly imbalanced distribution between the minority (focal segments) and majority (non-focal [...]
Results with Focal Segments-spotting. In this section, we provide the results of individual focal segment detection based on the optimal algorithm (FbA/FS/ADA) with eight epilepsy patients. The results in the above section showed that the selection of prominent features and the oversampling method can significantly improve the performance of an automatic system. Since we are the first to use high-frequency components (ripple and fast ripple) from interictal iEEG, the performance of localizing individual segments was assessed in terms of sensitivity, specificity, precision, fall-out, and F-score, computed from the confusion matrix (shown in Table 6), for comparison with HFO-based and low-frequency-based related studies 9,10,19,20,31-35. Table 2 shows that the proposed method achieved the highest performance for localizing individual segments with the adult patients Pt5 (SEN: 79.25%; fall-out: 2.50%), Pt6 (SEN: 54.82%; fall-out: 3.58%), and Pt8 (SEN: 88.52%; fall-out: 1.46%). The positive likelihood ratios (PLRs) were also used to evaluate the studies in different research 36-38 [...]
Results with Channel Identification. Figure 3 shows a graphical representation of the localization of the focal and non-focal segments, which helps epileptologists in two ways: (1) to observe the localization of the focal and non-focal segments over the duration of the iEEG, and (2) to see the number of detected focal segments corresponding to the seizure onset and non-seizure onset channels. This provides useful information about the active electrodes located close to the epileptic focus. The vertical axis of the color map (left) represents the electrodes and the horizontal axis shows the segment index. Each yellow spot in the color map represents a detected focal segment.
The right side of the color map for each patient indicates the number of detected focal segments (x-axis) at each electrode (y-axis), in which the group of red bars represents "SOZ" and the black bars without color indicate "non-SOZ". It is observed from Fig. 3 [...] Pt7 with deep-seated FCD were widely distributed through the non-SOZ. A possible reason for this wide distribution is that the patient had BOS-type FCD, for which the surgeon implanted vertical electrodes deep in the brain. A similar scenario was observed in the case of the pediatric patients (Pt1, Pt3, and Pt4). Table 3 shows the results of identifying the epileptic focus for each patient by measuring the AUC across all possible thresholds based on the detected focal segments in the SOZ and non-SOZ (see Fig. 3).
Table 2. Experimental results for individual segments using the optimal method (FbA/FS/ADA).
Computational cost. The average computational time for each entropy with 10 subbands was measured using Python on an iMac Pro (with an Intel Xeon W processor and 128 GB RAM). Note that the averages are estimated over 100 runs of the testing phase to detect a single segment. Table 4 shows the average computational time (in seconds) of each entropy with 10 subbands. The phase entropy requires the highest computational time, whereas Ts, Ren, Sh, and PE require shorter times than the others. Since we used the combination of all eight entropies to design the whole system, the average computational time with eight entropies and 10 subbands was 56.51 s to test a single segment.

Discussion
According to the clinical guidelines related to epilepsy surgery, the epilepsy surgeon should consider implanting intracranial electrodes to observe the seizure onset zone (SOZ), irritative zone, and symptomatic zone before resection of the epileptic focus in patients with medically intractable epilepsy. The epileptic zone includes the SOZ and parts of the irritative and symptomatic zones. To determine the epileptic focus, epileptologists need to analyze and label 3 to 7 days of iEEG data, depending on the patient's condition. In the proposed automatic system, only 30 min of interictal iEEG needs to be labeled to localize the epileptic focus, instead of 3 to 7 days. Automatic detection or estimation of the SOZ from a short period of interictal recording would greatly assist epileptologists and could increase the number of iEEG analyses for patients with intractable epilepsy. A recent study developed EPINETLAB, a multi-graphic user interface (GUI) automated software, to help researchers and clinicians detect HFOs and identify the SOZ using iEEG/MEG data 39. For a preliminary validation analysis of EEG data, its authors used six patients with drug-resistant epilepsy and analyzed only the ripple frequency band (80-250 Hz). However, a number of recent studies have found that fast ripples (200-600 Hz) could be more valid and reliable biomarkers than ripples to guide epilepsy surgery 21-23.
Several studies exploiting the Bern-Barcelona and Bonn EEG datasets 9,10 have been reported for epilepsy-related signal classification. For instance, the Bonn collection consists of five EEG datasets denoted as Set A (normal: healthy awake and eyes open), Set B (normal: healthy awake and eyes closed), Set C (epileptic: interictal), Set D (epileptic: interictal), and Set E (epileptic: ictal), each with 100 single-channel recordings of 23.6 s duration. Nicolaou et al. used approximate entropy as a feature and employed an SVM classifier for identifying normal vs ictal EEG with an average accuracy of 93.55% 15. Guo et al. proposed an automatic epileptic seizure detection system with approximate entropy features derived from the multiwavelet transform, combined with an artificial neural network to classify the existence or absence of seizure, with an average accuracy of 99.85% for two cases: normal vs ictal, and the combination of normal and interictal vs ictal 40. In another study, wavelet packet entropy and hierarchical EEG classification were proposed with an average accuracy of 99.44% for normal vs ictal 41. A method based on discrete wavelet transforms (DWT) with entropy features was proposed, leading to a classification accuracy of 84% using k-nearest neighbor (kNN), probabilistic neural network (PNN), fuzzy classifier, and least squares support vector machine (LS-SVM) 42. Mursalin et al. proposed an automated epileptic seizure detection approach using improved correlation-based feature selection and a random forest classifier (RFC) with an average accuracy of 98.44% 43. Other studies have investigated a variety of EEG and iEEG datasets, including Freiburg 19, CHB-MIT 20, and Children's Hospital Boston 44, to identify seizure events with machine learning approaches. All the above studies used only lower frequency bands (0.5-150 Hz) for limited pairs of electrodes with well-balanced problems.
In this study, we have considered the full clinical setting, including high-frequency components (100-600 Hz) and the multi-channel imbalanced problem, which makes the method practical for clinical use. Because the iEEG data are highly imbalanced, we used the AUC instead of classification accuracy as the system evaluation criterion. The proposed method detects the epileptic focus for patients with different epilepsy mechanisms (age and pathological type) with an average AUC of 0.86. We also observed that entropy features such as APE, PE, and Sp are more discriminative in high-frequency bands. These findings may provide an excellent tool for locating the epileptic focus when an appropriate methodology combines the high- and low-frequency bands.
Recently, Ullah et al. proposed an automated system for epilepsy detection on the Bonn dataset based on a deep learning approach, yielding 99.1% accuracy 45. A similar dataset was used to design a deep convolutional neural network (CNN) with 13 layers for categorizing the normal, preictal, and seizure classes, which obtained an average accuracy of 88.7%, a specificity of 90%, and a sensitivity of 95% 46. Although deep-learning-based automatic systems have improved performance compared to the simpler SVM classifier, they need a large amount of training data to show such remarkable performance. In contrast to deep learning, the SVM method is easy to understand and provides consistent performance. Epileptologists can efficiently interpret the classifier outcome to make the right medical decision.
Table 4. Average computational time (s) with each entropy for 10 subbands.
A comparison study with the Bern-Barcelona dataset using time-domain multiband analysis, including EMD and BEMD, was reported by Itakura et al. 18, improving the performance of the system to an 86.89% average accuracy for identifying the seizure patterns. However, this type of analysis is only suitable for single and bivariate iEEG signals. In the case of multi-channel iEEG signals, the number of extracted bands is not consistent across channels. Thus, the EMD and BEMD methods are not suitable for multi-channel iEEG signals. Considering real-time implementation, a filter-bank technique is more convenient for decomposing high-frequency bands (100-600 Hz), has little computational cost, and decreases the system complexity when compared with other multiband approaches.
In epilepsy studies for identifying the epileptic focus, the visual inspection of iEEG time series has demonstrated that HFOs may occur during ictal, preictal, and interictal states 47-50, and the rate of HFOs tends to be higher in the SOZ 21,50-52. To detect HFOs, several automatic HFO detectors have been proposed, including methods of artifact rejection and of estimating the energy of the signal using the Root Mean Square (RMS) amplitude, short-time line length, or others 53-56. However, HFO-related studies that identify the possible seizure onset channels need long-term iEEG data to calculate the baseline. Jrad et al. proposed automatic HFO detection with a multi-class SVM in depth-EEG signals 57. In their study, the system was evaluated in terms of sensitivity and false discovery rate (FDR). The reason for using the FDR was that the number of true negatives (TN) is very large in the HFO detection task. They achieved average results with five drug-resistant epilepsy patients for ripple (Sen: 81.1% and FDR: 30.2%) and fast ripple (Sen: 74.6% and FDR: 6.3%). Guo et al. proposed a magnetoencephalography-based (MEG) HFO detector using a stacked sparse autoencoder (SSAE) for distinguishing HFOs from normal control (NC) in a well-balanced problem, achieving 89.9% accuracy 58. A CNN was used by Johansen et al. 59 for identifying spikes and HFOs with five epilepsy patients, with an average AUC of 0.94. To detect spikes, ripples, and ripples-on-spikes (RonS), a long short-term memory neural network (LS-MNN) with a balanced number of training samples was used by Medvedev et al., achieving more than 90% accuracy 31. Zuo et al. 33 proposed a CNN-based method for identifying the two kinds of HFOs, ripples and fast ripples, separately and achieved average sensitivity (77.04% and 83.23% for ripples) and specificity (72.27% and 79.36% for fast ripples) compared to four traditional automated methods proposed in the RIPPLELAB toolbox 32. A combination of short-time energy (STE) and a CNN was also used in a recent study for identifying HFOs 60. In that study, the system was evaluated in terms of sensitivity and FDR and compared with three related existing studies 32,36,57. They achieved higher average results with five adult patients for ripple (Sen: 81.1% and FDR: 30.2%) and fast ripple (Sen: 74.6% and FDR: 6.3%). However, the above studies focused on the detection of HFOs in ripple and fast ripple iEEG data separately, and their performance evaluation metrics were chosen mainly according to whether the problem was balanced or imbalanced. Compared to the above HFO-related studies, we used only 30 min of iEEG data with the SOZ and combined the ripple and fast ripple bands in a multi-band fashion to identify electrodes related to epileptic events. The average sensitivity, specificity, and fall-out for individual segment identification with eight patients (including different pathological types with pediatric and adult patients) were 52.70%, 90.75%, and 9.24%, respectively. The average AUC for identifying the epileptic focus was 0.86 across eight patients. However, some channels in each patient produced a block of epileptic activity detections (see Fig. 3). Due to the complex nature of biological systems, the interictal iEEG is strongly non-stationary, which prevents linear methods from adapting perfectly over all time windows; this is a main reason for these blocks of epileptic activity detections.
In order to achieve a more practical system for real-life applications, we have considered further improvements in the following directions. First, we used only 30 min of interictal signal, from which an epileptologist can predict the epileptic focus using the proposed automatic system; the system needs to be extended to use the detection of seizure discharges. Second, the influential parameters of the system were set based on previous studies 30,42,61. The values of these parameters, as well as the choice of optimal subbands in the multi-band analysis, need to be adjusted in a data-adaptive manner to further improve the system. Third, this study evaluates a subject-dependent system based on the SOZ from the discharges of habitual seizures. Due to the subject-specific nature of iEEG signals, the distributions of extracted features were distinct among patients. In machine-learning research, several studies have proposed domain transfer learning to adapt the different distributions of features extracted from different subjects 62,63. We believe that domain transfer learning could be one of the best ways to implement a subject-independent system in a future study. However, the problem is very challenging due to the very different electrode locations and the subject-specific nature of epileptic events. In addition, Islam et al. 26 reported that the appropriate selection of operational subbands can significantly improve system performance due to the subject-specific nature of EEG signals. Therefore, a possible extension of this study is to detect the most significant subbands in the high-frequency components, which may further improve our system performance. Thus, there are several avenues for further research on designing the automatic system with feature extraction and classification.

Conclusion
This study developed an effective epileptic focus detection method from high-frequency components of interictal iEEG data. Eight feature-extraction methods with multi-band analysis were proposed and tested. We evaluated the proposed method on eight epilepsy patients of different ages and pathological types (adult and pediatric) to investigate its efficiency in the high-frequency components (ripple and fast ripple). The detection results were broader around the SOZ electrode ranges for the patient with BOS-type FCD pathology. Moreover, the AUC varied across patients, with the pediatric patients tending toward lower sensitivity than the adult patients.

Materials and Methods
Dataset. More than 100 patients with focal cortical dysplasia (FCD) were studied at the Juntendo University-Epilepsy Center in Tokyo, Japan. This study was approved by the ethics committee of Juntendo University Hospital as well as the Tokyo University of Agriculture and Technology, Japan. All methods were performed in accordance with relevant guidelines and regulations. All the patients signed the informed consent for a research protocol. During pre-surgical evaluation, several non-invasive diagnostic protocols, such as seizure semiological evaluation, interictal scalp EEG, MRI, molecular imaging, and psychomotor-development testing, were performed to determine the electrode locations for each patient. Video-EEG monitoring was also indicated for drug-resistant epilepsy cases as a pre-surgical evaluation.
The subdural electrodes (4-mm diameter and 10-mm distance) (UNIQUE MEDICAL Co, Tokyo, Japan) were implanted and covered almost the entire surface over the FCD and the adjacent cortex. In patients with the bottom of sulcus (BOS) type of dysplasia, the surgeon dissected the cortical sulcus and implanted small electrodes on the vertical sulcus. The iEEGs were acquired using the Neuro Fax digital video EEG system (NIHON-KODEN, Tokyo, Japan) with a sampling rate of 2 kHz. The number of electrodes was defined for each patient based on an epileptologist's review during iEEG data recording. Among 100 patients, epileptologists selected only eight patients with SOZ. A positive (focal) label was assigned to each channel judged by epileptologists to be a seizure onset electrode, and a negative (non-SOZ) label was given to the rest of the channels. Therefore, data on eight patients obtained from SOZ and non-SOZ electrodes were used to evaluate the proposed method. Table 5 shows the summary of the iEEG dataset from the eight patients. The 3D representation of the brain and the electrode positions during recording are shown in Fig. 4 for each patient. The red circle represents the SOZ 21,22 marked by epileptologists.
Table 6. Confusion matrix for a two-class problem.
Data pre-processing. The multi-channel interictal iEEG data were recorded for at least three days, until an adequate number of habitual seizures had been obtained for analysis. The epileptologists reviewed the interictal iEEG recordings and annotated the electrodes that possibly indicated the SOZ. In this study, we used 30 min of interictal iEEG data from each patient and split the 30-min iEEG signals into 20-s segments, resulting in a total of 90 segments. Since the label (SOZ/non-SOZ) was given to each electrode, we assumed that all segments of SOZ channels are focal segments and that the segments of all other channels are non-focal segments. A third-order Butterworth bandpass filter was applied to extract the high-frequency components (100-600 Hz) from each interictal iEEG segment.
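The segmentation arithmetic above can be checked with a short sketch. It assumes non-overlapping segments, which is what the stated total of 90 segments implies; the names `FS_HZ`, `segment_bounds`, and `bounds` are illustrative and not taken from the original implementation.

```python
# Recording parameters stated in the text.
FS_HZ = 2000        # sampling rate (2 kHz)
RECORD_MIN = 30     # minutes of interictal iEEG used per patient
SEGMENT_S = 20      # segment length in seconds


def segment_bounds(n_samples, seg_len):
    """Return (start, stop) sample indices of consecutive, non-overlapping segments."""
    return [(s, s + seg_len) for s in range(0, n_samples - seg_len + 1, seg_len)]


n_samples = RECORD_MIN * 60 * FS_HZ   # samples per channel in 30 min
seg_len = SEGMENT_S * FS_HZ           # samples per 20-s segment
bounds = segment_bounds(n_samples, seg_len)
print(len(bounds), seg_len)           # 90 40000
```

Each of the 90 index pairs would then be used to slice one channel before band-pass filtering.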
Multiband Analysis. In practice, EEG time series exhibit nonstationary behavior with a variety of neurological events and may contain noise that can deteriorate the performance of a system based on a single-band approach. Therefore, a filter bank, which is an array of bandpass filters, was applied to decompose an EEG signal into a set of analysis signals exhibiting multiple subband frequency components 64,65. To detect brain activities related to specific mental tasks more accurately, EEG-based studies have proposed the filter-bank method to divide wide frequency ranges into narrow subbands 24-26. More specifically, Higashi et al. proposed a filter-bank approach to improve the performance of MI-BCI, which decomposed the 4-40 Hz frequency range into 6 subbands with a bandwidth of 6 Hz each 66. Ang et al. divided a similar frequency range (4-40 Hz) into narrow subbands with a bandwidth of 4 Hz each 24. In signal processing studies, the subbands should be as narrow as possible to achieve more accurate detection, as in these EEG-BCI systems 24-26,66. However, the choice of dividing wide frequency bands into narrow subbands depends on system performance, real-time applications, and the reduction of system complexity 18,25,67. Considering system performance as well as the reduction of system complexity, the proposed multiband approach divided the high-frequency bands, including ripple (100-250 Hz) and fast ripple (250-600 Hz), into 10 subbands, each with a bandwidth of 50 Hz. The subbands are labeled [...]
Approximate Entropy. [...] 68. It is extensively used in many areas of biomedical signal processing, such as EEG 69 and ECG signal analysis 70. To estimate the approximate entropy (APE) from each segment, let $x(i)$ denote the time series of the $n$-th subband $S_n$ of a channel. The time series $x(i)$ can be represented by $L-d+1$ vectors $X(1), X(2), \ldots, X(L-d+1)$, where $L$ is the length of the signal (in our case $L = 40{,}000$ for each channel of a segment, due to the 2 kHz sampling frequency). Each vector $X(i)$ can be expressed as:
$$X(i) = [x(i), x(i+1), \ldots, x(i+d-1)] \tag{1}$$
where $d$ is the embedding dimension and $1 \le i \le L-d+1$. APE is defined as:
$$\mathrm{APE}(d, r) = \frac{1}{L-d+1} \sum_{i=1}^{L-d+1} \ln C_i^{d}(r) - \frac{1}{L-d} \sum_{i=1}^{L-d} \ln C_i^{d+1}(r) \tag{2}$$
where $C_i^{d}(r)$ is a correlation integral indicating the probability that the vector $X(i)$ remains similar to $X(j)$ within the tolerance limit $r$. $C_i^{d}(r)$ is defined as 71,72:
$$C_i^{d}(r) = \frac{1}{L-d+1} \sum_{j=1}^{L-d+1} \mathbb{1}\big(\mathrm{dist}(X(i), X(j)) \le r\big) \tag{3}$$
where $\mathbb{1}(\cdot)$ is the indicator function and $\mathrm{dist}(\cdot)$ represents the distance between the two vectors $X(i)$ and $X(j)$. In this study, the parameter $r$ is chosen as 0.2 times the standard deviation of the data, and $d = 2$, as used in the study 68.
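As an illustration of the approximate entropy computation with d = 2 and r equal to 0.2 times the standard deviation, the following is a minimal pure-Python sketch. It is not the implementation used in the study: the Chebyshev (maximum) distance is an assumption, since the text only speaks of a distance, and the helper names `_phi` and `approximate_entropy` and the two toy signals are hypothetical. The quadratic cost makes it suitable only for short series.

```python
import math
import random


def _phi(x, d, r):
    """Mean of ln C_i^d(r) over all d-point templates (self-matches included)."""
    n = len(x) - d + 1                          # number of templates X(1..L-d+1)
    templates = [x[i:i + d] for i in range(n)]
    total = 0.0
    for a in templates:
        # Assumed distance: Chebyshev (maximum absolute coordinate difference).
        matches = sum(
            1 for b in templates
            if max(abs(p - q) for p, q in zip(a, b)) <= r
        )
        total += math.log(matches / n)
    return total / n


def approximate_entropy(x, d=2, r_factor=0.2):
    """APE(d, r) with r = r_factor * standard deviation of x."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    r = r_factor * std
    return _phi(x, d, r) - _phi(x, d + 1, r)


# Toy signals: a perfectly regular series and uniform noise.
periodic = [0.0, 1.0] * 20
rng = random.Random(0)
noise = [rng.random() for _ in range(200)]

ape_periodic = approximate_entropy(periodic)
ape_noise = approximate_entropy(noise)
print(ape_periodic, ape_noise)
```

The regular series yields a value close to zero, while the noise yields a clearly larger value, which is the behavior expected of a regularity measure.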
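Sample Entropy. Sample entropy (Sp) is a modified version of APE intended to resolve a weakness of APE 73. The main drawback of APE is a biased estimate due to self-matches of templates. Sp reduces the bias caused by the use of the self-matches in the computation of APE 73. In its standard form, Sp is defined for a given time series $x(i)$ as:
$$\mathrm{Sp}(d, r) = -\ln \frac{B^{d+1}(r)}{B^{d}(r)} \tag{4}$$
where $B^{d}(r)$ is defined as 71,72:
$$B^{d}(r) = \frac{1}{L-d} \sum_{i=1}^{L-d} \frac{1}{L-d-1} \sum_{j=1,\, j \ne i}^{L-d} \mathbb{1}\big(\mathrm{dist}(X(i), X(j)) \le r\big) \tag{5}$$
where $X(i)$ is the vector from Eq. (1) and $\mathbb{1}(\cdot)$ is the indicator function, which counts the number of true conditions excluding the self-matches 68. $B^{d+1}(r)$ is obtained in the same way from the $(d+1)$-dimensional vectors. In this study, the parameters $r$ and $d$ were set to the same values as for the approximate entropy.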
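A minimal sketch of sample entropy in its standard formulation follows, shown only to make the exclusion of self-matches concrete. The Chebyshev distance, the function names, and the toy signals are assumptions for illustration and are not taken from the original implementation.

```python
import math
import random


def _count_pairs(x, m, r, n_templates):
    """Count template pairs (i < j) of length m within tolerance r, so no self-matches."""
    count = 0
    for i in range(n_templates):
        for j in range(i + 1, n_templates):
            if max(abs(x[i + k] - x[j + k]) for k in range(m)) <= r:
                count += 1
    return count


def sample_entropy(x, d=2, r_factor=0.2):
    """Sp(d, r) = -ln(A / B) with r = r_factor * standard deviation of x."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    r = r_factor * std
    n = len(x) - d                        # same template count for d and d + 1
    b = _count_pairs(x, d, r, n)          # matches of length d
    a = _count_pairs(x, d + 1, r, n)      # matches of length d + 1
    if a == 0 or b == 0:
        return float("inf")               # undefined for too few matches
    return -math.log(a / b)


periodic = [0.0, 1.0] * 20
rng = random.Random(0)
noise = [rng.random() for _ in range(200)]

sp_periodic = sample_entropy(periodic)
sp_noise = sample_entropy(noise)
print(sp_periodic, sp_noise)
```

Because both pair counts share the same normalization, only their ratio enters the result, and a perfectly periodic series gives zero.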
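Permutation Entropy. Permutation entropy (PE) is a simple and robust method for estimating the complexity of a time series and has been used for automated seizure prediction 74. For a given time series $x(i)$, each vector is formed with embedding dimension $d$ and time delay $\tau$ as:
$$X(i) = [x(i), x(i+\tau), \ldots, x(i+(d-1)\tau)] \tag{6}$$
Then, we can define the order pattern of $X(i)$ as the permutation $\Pi$ that arranges its elements in increasing order. For the set of vectors from Eq. (6), the probability of each possible permutation $\Pi_k$ ($k = 1, 2, \ldots, d!$) can be introduced as:
$$p(\Pi_k) = \frac{C(\Pi_k)}{L-(d-1)\tau} \tag{7}$$
where $L$ is the length of the time series $x(i)$ and $C(\Pi_k)$ is the number of occurrences of the order pattern $\Pi_k$. The PE can be defined as:
$$\mathrm{PE} = -\sum_{k=1}^{d!} p(\Pi_k) \ln p(\Pi_k) \tag{8}$$
In this study, the parameters $d$ and $\tau$ were set to 3 and 1, respectively.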
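The order-pattern counting behind permutation entropy can be sketched as follows, with d = 3 and τ = 1 as in the text. The natural logarithm, the absence of normalization by ln(d!), and the tie-breaking by position (from Python's stable sort) are assumptions; the function name and toy series are illustrative.

```python
import math
from collections import Counter


def permutation_entropy(x, d=3, tau=1):
    """Entropy of the distribution of order patterns of d samples spaced by tau."""
    n = len(x) - (d - 1) * tau            # number of embedded vectors
    patterns = Counter(
        # Order pattern: indices of the d samples sorted by value
        # (ties keep their original order because sorted() is stable).
        tuple(sorted(range(d), key=lambda k: x[i + k * tau]))
        for i in range(n)
    )
    return -sum((c / n) * math.log(c / n) for c in patterns.values())


ramp = [float(v) for v in range(50)]      # strictly increasing: one pattern only
alternating = [0.0, 1.0] * 20             # two patterns, equally frequent

pe_ramp = permutation_entropy(ramp)
pe_alternating = permutation_entropy(alternating)
print(pe_ramp, pe_alternating)
```

A monotonic ramp has a single order pattern and therefore zero entropy, while the alternating series has two equally likely patterns and an entropy of ln 2.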
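(2020) 10:7044 | https://doi.org/10.1038/s41598-020-62967-z | www.nature.com/scientificreports
Spectral Entropy. Spectral entropies quantify the complexity of a time series based on the power spectrum 75. Several studies have proposed the use of spectral entropies, including Shannon (Sh) and Rényi entropy (Ren), to characterize seizure activities 75-77. To obtain the power level $P_f$ at each frequency $f$, the Fourier transform of the time series $x(i)$ is used. The normalized power $p_f$ was estimated as:
$$p_f = \frac{P_f}{\sum_{f} P_f} \tag{9}$$
In their standard forms, the Shannon and Rényi spectral entropies are then:
$$\mathrm{Sh} = -\sum_{f} p_f \ln p_f \tag{10}$$
$$\mathrm{Ren} = \frac{1}{1-\alpha} \ln \sum_{f} p_f^{\alpha} \tag{11}$$
where $\alpha \ne 1$ is the order of the Rényi entropy.
Phase Entropy. Phase entropies are defined through the bispectrum, a higher order spectrum 78. The bispectrum of a time series $x(i)$ can be defined as:
$$B(f_1, f_2) = E\big[F(f_1)\, F(f_2)\, F^{*}(f_1+f_2)\big] \tag{12}$$
where $E$ represents the expectation operator of a random variable, $F$ is the Fourier transform of the time series $x(i)$, and $F^{*}$ is its conjugate. In the usual normalized-bispectrum form, the two types of phase entropy, $S_1$ and $S_2$, can be defined as:
$$S_1 = -\sum_{k} p_k \ln p_k, \qquad p_k = \frac{|B(f_1, f_2)|}{\sum_{\Omega} |B(f_1, f_2)|} \tag{13}$$
$$S_2 = -\sum_{k} q_k \ln q_k, \qquad q_k = \frac{|B(f_1, f_2)|^{2}}{\sum_{\Omega} |B(f_1, f_2)|^{2}} \tag{14}$$
where $k$ indexes the frequency pairs $(f_1, f_2)$ in the non-redundant region $\Omega$ of the bispectrum.
Tsallis Entropy. Tsallis entropy is a generalized version of Shannon entropy and controls the trade-off between the contributions from the tails and the main mass of the distribution 79. Tsallis entropy is defined as 79:
$$\mathrm{Ts} = \frac{1}{q-1} \Big(1 - \sum_{f} p_f^{q}\Big) \tag{15}$$
where $p_f$ is the normalized power computed from Eq. (9) and $q$ is a real number, frequently called the entropic index, that characterizes the degree of non-extensivity of the system 79,80. In this study, we set $q = 2$.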
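The spectrum-based entropies can be illustrated with a small standard-library sketch that normalizes a one-sided power spectrum and evaluates the Shannon, Rényi, and Tsallis forms. The naive DFT, the one-sided spectrum including the DC bin, and the Rényi order of 2 are assumptions made for the example (the text does not state the Rényi order); all function names are hypothetical.

```python
import cmath
import math


def normalized_power(x):
    """One-sided power spectrum of x (naive DFT), normalized to sum to 1."""
    n = len(x)
    power = []
    for f in range(n // 2 + 1):                       # bins 0 .. n/2 (assumption)
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    return [p / total for p in power]


def shannon_entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)


def renyi_entropy(p, alpha=2.0):                      # alpha = 2 is an assumed order
    return math.log(sum(v ** alpha for v in p)) / (1.0 - alpha)


def tsallis_entropy(p, q=2.0):                        # q = 2 as stated in the text
    return (1.0 - sum(v ** q for v in p)) / (q - 1.0)


# A pure sinusoid concentrates its power in a single bin.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
p_tone = normalized_power(tone)
# A flat spectrum is the most "complex" case for a given number of bins.
p_flat = [0.25] * 4

print(shannon_entropy(p_tone), tsallis_entropy(p_tone))
print(shannon_entropy(p_flat), renyi_entropy(p_flat), tsallis_entropy(p_flat))
```

For the flat four-bin spectrum, the Shannon and Rényi entropies both equal ln 4 and the Tsallis entropy with q = 2 equals 0.75, whereas the single tone gives values near zero.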
Feature Selection. In machine learning, one of the challenges is selecting the best feature set from the whole available feature space, as reported in different studies 25,81-83. Selecting among the entropy features extracted from interictal iEEG data could provide a more accurate classification than using the whole set of features. To select the more relevant entropy features, we used sparse LDA, a recently advanced technique 84,85 that finds discriminant directions using only a few variables instead of all the variables used in standard LDA 86,87. After extracting entropy features from an interictal iEEG segment, the entropies of the $n$-th subband for a channel can be defined as:
$$v_n = [v_{n,1}, v_{n,2}, \ldots, v_{n,D}] \tag{16}$$
where $v_n$ denotes the combination of entropies ($D = 8$) extracted from the $n$-th subband of a channel using the above feature-extraction methods. We calculated the entropies for all channels in each segment and stacked all the segments to form the training features $M_n \in \mathbb{R}^{H \times D}$, where $H = ch \times s$, such that $ch$ and $s$ are the total numbers of channels and segments, respectively. The sparse LDA criterion for the training features $M_n$ and class $C_n$ of the $n$-th subband is defined sequentially as in 88 (Eq. (17)), giving the weight vector $\hat{\beta}_n$. The parameters $\delta$ and $\delta_1$ were tuned such that $\hat{\beta}_n$ has $G$ non-zero elements. Let us define the indices of the non-zero elements of $\hat{\beta}_n$ as $k_1, k_2, \ldots, k_G$. The features $\tilde{v}_n$ for the $n$-th subband of a channel can be defined as:
$$\tilde{v}_n = [v_{n,k_1}, v_{n,k_2}, \ldots, v_{n,k_G}] \tag{18}$$
Finally, the feature $V^{*}$ is defined by concatenating the features $\tilde{v}_n$ of the $N$ subbands for a channel as:
$$V^{*} = [\tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_N] \tag{19}$$
After applying sLDA, the selected training features for $N$ subbands can be written as:
$$\nu = \{v^{*}_{F}(i_{in})\}_{i_{in}=1}^{I_{in}} \cup \{v^{*}_{N}(j_{in})\}_{j_{in}=1}^{J_{in}} \tag{20}$$
The feature vector of the $i_{in}$-th sample of the SOZ channels is denoted by $v^{*}_{F}(i_{in})$ and the feature vector of the $j_{in}$-th sample of the non-SOZ channels is denoted by $v^{*}_{N}(j_{in})$. Note that the dataset is generally imbalanced, i.e., $I_{in} \ll J_{in}$.
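The selection-by-non-zero-weight step and the concatenation across subbands can be sketched as below. The weight values are invented placeholders standing in for sLDA output (the sparse LDA fit itself is not reproduced here), and the function names are hypothetical.

```python
def select_nonzero(v_n, beta_n):
    """Keep the entropies of one subband whose (sparse) weight is non-zero."""
    keep = [k for k, w in enumerate(beta_n) if w != 0.0]
    return [v_n[k] for k in keep]


def concatenate_selected(subband_features, subband_weights):
    """Concatenate the selected entropies of all subbands into one feature vector."""
    selected = []
    for v_n, beta_n in zip(subband_features, subband_weights):
        selected.extend(select_nonzero(v_n, beta_n))
    return selected


# Toy example: N = 2 subbands, D = 8 entropies each.
features = [
    [0.1 * k for k in range(8)],
    [1.0 + 0.1 * k for k in range(8)],
]
# Hypothetical sparse weights with G = 3 non-zero entries per subband.
weights = [
    [0.0, 0.7, 0.0, -0.2, 0.0, 0.0, 0.4, 0.0],
    [0.5, 0.0, 0.0, 0.0, -0.1, 0.0, 0.0, 0.3],
]

v_star = concatenate_selected(features, weights)
print(len(v_star))   # 6
```

With G selected entropies in each of N subbands, the concatenated vector has N × G entries, here 2 × 3 = 6.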
Imbalanced Learning Problem. In the case of epileptic focus detection, the number of non-SOZ electrodes, representing the majority class, is much higher than the number of SOZ electrodes (minority class). This can produce several difficulties in standard machine learning methods due to the imbalance in class distribution and concept complexity 30,89,90. Sampling methods address this in imbalanced learning applications by modifying the imbalanced data set through some mechanism in order to provide a balanced distribution 89,90. Recent studies have shown that a balanced data set with several base classifiers provides improved classification performance compared to an imbalanced data set 30,89-91. In this section, we generate surrogate data using the adaptive synthetic (ADASYN) approach, which is one of the solutions used to solve the imbalanced learning problem. The balanced set $\tilde{\nu}$ can be defined from the training feature set $\nu$ of Eq. (20) as:
$$\tilde{\nu} = \nu \cup \{\tilde{v}^{*}_{F}(g)\}_{g=1}^{G} \tag{21}$$
The following algorithm, proposed by He et al. 30,89, is employed here to generate the surrogate samples $\tilde{v}^{*}_{F}$.
Step 1 Calculate the number of synthetic data examples that need to be generated for the entire focal class by:
$$G = (J_{in} - I_{in}) \times \beta \tag{22}$$
Here $\beta$ is an arbitrary number in the range of 0 to 1 that specifies the desired balance level after the synthetic data generation process. We set $\beta$ in Eq. (22) to 1, which corresponds to fully balanced data 30.
Step 2 For each example $v^{*}_{F}(i_{in})$ in the focal class, find the K-nearest neighbors according to the Euclidean distance and calculate the ratio $\Gamma_{i_{in}}$ as follows:
$$\Gamma_{i_{in}} = \frac{\Theta_{i_{in}} / K}{Z}, \qquad i_{in} = 1, \ldots, I_{in} \tag{23}$$
where $\Theta_{i_{in}}$ is the number of samples among the K-nearest neighbors of $v^{*}_{F}(i_{in})$ that belong to the non-focal class and $Z$ is a normalization factor such that $\Gamma_{i_{in}}$ is a distribution function ($\sum_{i_{in}} \Gamma_{i_{in}} = 1$).
Step 3 Determine the number of synthetic samples to be generated for each $v^{*}_{F}(i_{in})$ in the focal class as:
$$g_{i_{in}} = \Gamma_{i_{in}} \times G \tag{24}$$
Step 4 Generate $g_{i_{in}}$ synthetic data samples for each sample of the focal class using the SMOTE algorithm 92 as:
$$\tilde{v}^{*}_{F} = v^{*}_{F}(i_{in}) + \big(v^{*}_{F}(z_{in}) - v^{*}_{F}(i_{in})\big) \times \delta \tag{25}$$
where $v^{*}_{F}(z_{in})$ is a randomly chosen focal data example from the K-nearest neighbors ($K = 5$) of $v^{*}_{F}(i_{in})$ and $\delta$ denotes a random number belonging to [0,1]. The other parameters for ADASYN were used at their default settings 30.
Cross-validation Design. To evaluate the developed system, we needed to divide the data into training and test sets, which was a critical step due to the imbalanced number of focal and non-focal channels. To divide the data into training and test sets, this study uses the k-fold cross-validation technique ($k = 10$), dividing the 90 segments into $k$ subsets of equal size. Among the $k$ subsets, a single subset is retained for testing the model, and the remaining $k - 1$ subsets are used for training. As mentioned, ADASYN was applied to the highly imbalanced feature sets in the training stage to balance the class features. The cross-validation process is then repeated $k$ times and the result of the system is obtained by averaging all the runs.
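Steps 1 to 4 of the ADASYN procedure can be sketched in plain Python as follows. This is an illustrative reading of the steps under stated assumptions, not the reference implementation of He et al.: rounding of the per-sample counts, the uniform fallback when no focal sample has non-focal neighbors, and the names `adasyn`, `focal`, and `non_focal` are all assumptions. At least two focal samples are required.

```python
import math
import random


def adasyn(focal, non_focal, beta=1.0, k=5, seed=0):
    """Return synthetic focal samples generated following Steps 1 to 4."""
    rng = random.Random(seed)
    # Step 1: total number of synthetic samples, G = (J - I) * beta.
    g_total = int(round((len(non_focal) - len(focal)) * beta))
    # Step 2: share of non-focal points among the K nearest neighbors.
    pool = [(p, True) for p in focal] + [(p, False) for p in non_focal]
    ratios = []
    for i, x in enumerate(focal):
        others = [pool[j] for j in range(len(pool)) if j != i]
        others.sort(key=lambda item: math.dist(x, item[0]))
        ratios.append(sum(1 for _, is_focal in others[:k] if not is_focal) / k)
    z = sum(ratios)
    if z > 0:
        gammas = [r / z for r in ratios]
    else:
        gammas = [1.0 / len(focal)] * len(focal)   # assumed fallback
    # Step 3: number of synthetic samples per focal sample (rounded).
    counts = [int(round(g * g_total)) for g in gammas]
    # Step 4: interpolate toward a random focal neighbor (SMOTE-style).
    synthetic = []
    for i, (x, g_i) in enumerate(zip(focal, counts)):
        neighbours = sorted(
            (focal[j] for j in range(len(focal)) if j != i),
            key=lambda p: math.dist(x, p),
        )[:k]
        for _ in range(g_i):
            partner = rng.choice(neighbours)
            delta = rng.random()                   # random number in [0, 1)
            synthetic.append(tuple(a + (b - a) * delta for a, b in zip(x, partner)))
    return synthetic


# Toy data: 4 focal points on the unit square, 12 distant non-focal points.
focal = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
non_focal = [(10.0 + i, 10.0) for i in range(12)]
synthetic = adasyn(focal, non_focal)
print(len(synthetic))
```

In the cross-validation design described above, such oversampling would be applied to the training folds only, so that the test fold keeps its original class imbalance.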
Performance Measurement for Segments. Instead of using classification accuracy as a system evaluation criterion for imbalanced datasets, a set of assessment metrics related to receiver operating characteristics (ROC) graphs 29 was used as performance measurements. Under the imbalanced learning condition, the classification accuracy is not sufficient as a standard performance measure 29,93-95. Therefore, the classification performance is derived from the confusion matrix, as illustrated in Table 6. Based on Table 6, the evaluation metrics can be defined as:
• Sensitivity (SEN) or recall:
$$\mathrm{SEN} = \frac{TP}{TP + FN}$$
where TP is the number of correctly detected segments among the focal segments in the SOZ channels and FN is the number of incorrectly detected segments among the focal segments in the SOZ channels.
• Specificity (SPE):
$$\mathrm{SPE} = \frac{TN}{TN + FP}$$
where TN is the number of correctly detected segments among the non-focal segments in the non-SOZ channels and FP is the number of incorrectly detected segments among the non-focal segments in the non-SOZ channels.
• F1 score, the harmonic mean of precision and sensitivity, defined as:
$$F_1 = \frac{2 \times \mathrm{PRE} \times \mathrm{SEN}}{\mathrm{PRE} + \mathrm{SEN}}, \qquad \mathrm{PRE} = \frac{TP}{TP + FP}$$
Performance Measurement for Channels. The performance of identifying the epileptic focus for each patient was estimated using the AUC-ROC 96. The sensitivity (SEN) and false positive rate (FPR) of the channels for each fold were computed at each threshold value (in our case, from zero to the maximum number of detected focal segments for each fold). After obtaining the SEN and FPR of the channels for each fold, we estimated the AUC using the trapezoid rule 96 and averaged over all the folds to obtain the final results.
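Both levels of evaluation can be sketched in a few lines. The segment metrics follow the confusion-matrix definitions above, with fall-out taken as FP/(FP + TN). For the channel-level AUC, a channel is called focal when its count of detected focal segments reaches the threshold; one extra threshold above the maximum count is added here so that the ROC curve ends at the origin, which is an assumption of this sketch. Function names and the toy counts are hypothetical.

```python
def segment_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, precision, fall-out, and F1 from a confusion matrix."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    pre = tp / (tp + fp)
    fall_out = fp / (fp + tn)
    f1 = 2 * pre * sen / (pre + sen)
    return {"SEN": sen, "SPE": spe, "PRE": pre, "fall-out": fall_out, "F1": f1}


def channel_auc(counts, is_soz):
    """AUC over thresholds on the number of detected focal segments per channel."""
    n_pos = sum(1 for s in is_soz if s)
    n_neg = len(is_soz) - n_pos
    points = []
    for thr in range(0, max(counts) + 2):          # extra threshold closes the curve
        flagged = [c >= thr for c in counts]
        tp = sum(1 for f, s in zip(flagged, is_soz) if f and s)
        fp = sum(1 for f, s in zip(flagged, is_soz) if f and not s)
        points.append((fp / n_neg, tp / n_pos))    # (FPR, SEN)
    points.sort()
    # Trapezoid rule over consecutive ROC points.
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


metrics = segment_metrics(tp=40, fn=10, tn=90, fp=10)
auc_good = channel_auc([9, 8, 1, 0], [True, True, False, False])
auc_bad = channel_auc([9, 8, 1, 0], [False, False, True, True])
print(metrics, auc_good, auc_bad)
```

When the SOZ channels collect the most detections the AUC is 1, and when the labels are reversed it drops to 0; the per-fold values would then be averaged as described above.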