A Long Short-Term Memory neural network for the detection of epileptiform spikes and high frequency oscillations

Over the last two decades, evidence has been growing that, in addition to epileptic spikes, high frequency oscillations (HFOs) are important biomarkers of epileptogenic tissue. New methods of artificial intelligence such as deep learning neural networks can provide additional tools for automated analysis of EEG. Here we present a Long Short-Term Memory (LSTM) neural network for the detection of spikes, ripples and ripples-on-spikes (RonS). We used intracranial EEG (iEEG) from two independent datasets. The first dataset (7 patients) was used for network training and testing. The second dataset (5 patients) was used for cross-institutional validation. For each class (spike, RonS, ripple and baseline), 1000 events were selected from the candidates initially found using a novel threshold method. Network training was performed using random selections of 50-500 events (per class) from all patients from the first dataset. This 'global' network was then tested on other events for each patient from both datasets. The network detected events with good generalisability: total accuracy and specificity exceeded 90% for each class in all cases, and sensitivity fell below 86% in only two cases (82.5% for spikes in one patient and 81.9% for ripples in another). Deep learning networks can significantly accelerate the analysis of iEEG data and increase its diagnostic value, which may improve surgical outcome in patients with localization-related intractable epilepsy.


Methods
Patient information and intracranial recordings.
We used intracranial EEG records from 12 patients with drug-resistant epilepsy obtained strictly for clinical purposes during their evaluation for surgery. Records from 7 patients were taken from the open access database ieeg.org (files from the Mayo Clinic; the MC dataset). The second dataset (5 patients) was obtained from the Medical College of Georgia, Augusta University Health (the AUH dataset). The study was carried out in accordance with the relevant ethical guidelines and regulations and was approved by the Georgetown-MedStar Institutional Review Board and the Medical College of Georgia's Institutional Review Board. Informed consent was obtained for a secondary analysis of patients' de-identified iEEG records. The MC dataset was used for all network training, cross-validation and analysis of network parameters. The network trained on all seven MC patients was then tested on five AUH patients in order to check its generalisability across institutions. All patients were diagnosed with localization-related epilepsy and had typical focal impaired awareness seizures (FIAS), with focal to bilateral tonic-clonic seizures (FBTCS) in two patients (ILAE 2017 seizure classification 21). All MC patients underwent surgery with a good surgical outcome after the two-year follow-up period (Engel Class I). Information on surgery and surgical outcome was not available for the AUH dataset.
Subdural grid and/or intracerebral electrode placement was dictated strictly by clinical requirements and therefore electrode configuration varied between patients. Intracranial EEGs from multiple sites of temporal, frontal and parietal cortices as well as the amygdala-hippocampus complex were recorded at a sampling rate of 500 Hz, with the number of channels varying from 16 to 108 across patients. Demographic and clinical characteristics of patients as well as the amount of data used in our study are presented in Table 1.

Selection of electrographic events as the 'ground truth'.
To obtain a more complete representation of interictal activity in each patient, we used the following criteria to select segments for analysis. First, we scanned the whole record available for each patient and selected segments free of gross artifacts such as large deviations from the zero line (possibly due to patient movements or technical problems). Then for each MC patient we selected 6-9 one-hour segments of interictal activity (regardless of the sleep-wake cycle). Because records from the AUH dataset were significantly shorter, we were able to extract only 10-min segments of interictal activity from each patient. For both datasets, the only criterion was that interictal segments had to be separated by at least one hour from the preceding and the following seizures. Also, in order to capture possible modulations of interictal patterns as much as possible in the MC dataset, we aimed to select segments from different parts of the whole record, which could span up to several days. As a result, the selected segments could be separated from each other by several hours and together covered many hours of recordings (from 21 to 103 hours from the beginning of the first segment to the end of the last segment, depending on the total length of the record; Table 1).
Signal pre-processing for both datasets included a check for and removal of any missing data points and highpass filtering at 0.5 Hz (300-order FIR, forward-backward filtering). If a distinctive line interference was present in the signal, notch filtering was also applied (a second-order IIR notch filter, Q factor = 35). A continuous power spectrogram (short-time FFT with the Hanning window) of the pre-processed data was calculated with time and frequency resolution of 0.25 s and 4 Hz, respectively. The spectrogram was then averaged over frequencies within the following eight spectral bands: theta (4-8 Hz), alpha (8-13 Hz), beta (13-30 Hz), gamma1 (30-56 Hz), gamma2 (64-116 Hz), rip1 (124-176 Hz), rip2 (184-196 Hz) and rip3 (204-236 Hz), giving the eight spectral power values used as features in further analysis. The division of the ripple band (>124 Hz) into three sub-bands (rip1, rip2 and rip3) was necessary in order to avoid the frequencies of 120, 180 and 200 Hz, at which interference noise was sometimes present.
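The band-averaging step lends itself to a compact sketch. The following Python illustration (the study itself used MATLAB) assumes one power value per 4 Hz frequency bin per 0.25 s time bin and rounds band edges down to the 4 Hz grid; both choices are assumptions, not the authors' exact implementation.

```python
# Sketch: averaging one time bin of a power spectrogram over the eight
# spectral bands listed in the text. Frequency bin k covers k*4..(k+1)*4 Hz.

FREQ_RES = 4  # Hz per frequency bin (the paper's frequency resolution)

# (name, low_hz, high_hz) - band edges as listed in the text
BANDS = [
    ("theta", 4, 8), ("alpha", 8, 13), ("beta", 13, 30),
    ("gamma1", 30, 56), ("gamma2", 64, 116),
    ("rip1", 124, 176), ("rip2", 184, 196), ("rip3", 204, 236),
]

def band_average(spectrogram_column):
    """Average the per-4-Hz power values of one 0.25 s time bin within
    each band, giving the eight feature values per time bin."""
    features = {}
    for name, lo, hi in BANDS:
        k0, k1 = lo // FREQ_RES, hi // FREQ_RES  # edges rounded to grid
        values = spectrogram_column[k0:k1]
        features[name] = sum(values) / len(values)
    return features
```
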
To find events which would be considered as the ground truth for network training and testing, we used an automated threshold-based approach to find event candidates. These candidates were then visually verified for their inclusion into or exclusion from the ground truth set of events. The initial screening for spike, ripple and RonS candidates was done using the following method. For each spectral band in a one-hour spectrogram, we calculated the empirical probability density distribution of the log-transformed power values. The logarithm of power density is known to be well approximated by a normal distribution 22. In our case this was also confirmed by the Kolmogorov-Smirnov normality test (p < 0.05). The maximum (peak) of this normal distribution is the mode, i.e., the most likely (or most probable) value of spectral power. However, due to the discrete nature of binning when the empirical distribution is calculated, the position of its peak may be slightly displaced from the exact mode by random variations from bin to bin. To reduce this effect, we calculated the mean of the most probable log-transformed power values, which constituted 20% of the whole distribution. In other words, we took the mean value in the 20% vicinity of the peak of the power density distribution. This value was taken as representing the baseline value of spectral power when the events of interest (spikes and ripples) are not present. All power values were then normalized by dividing them by the corresponding baseline power for each spectral band. These relative spectrograms were used for visual reviews of the event candidates.
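The baseline estimate can be sketched as follows; this is an illustrative Python version, and the histogram binning (50 bins over the observed range) is an assumption rather than the authors' choice:

```python
# Sketch of the baseline-power estimate described above: instead of
# mean + k*SD, take the mean of the log-power values lying in the most
# probable 20% of the empirical distribution (a neighbourhood of the
# histogram peak, robust to rare high-power events).

def baseline_log_power(log_powers, n_bins=50, fraction=0.20):
    lo, hi = min(log_powers), max(log_powers)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant signal

    # Build the empirical distribution: assign each value to a bin.
    bins = [[] for _ in range(n_bins)]
    for v in log_powers:
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)

    # Rank values by how populated their bin is ("most probable" first),
    # then average the top `fraction` of all values.
    def bin_pop(v):
        return len(bins[min(int((v - lo) / width), n_bins - 1)])

    ranked = sorted(log_powers, key=lambda v: -bin_pop(v))
    top = ranked[: max(1, int(fraction * len(log_powers)))]
    return sum(top) / len(top)
```

Because the estimate only uses the neighbourhood of the distribution peak, rare high-amplitude events shift it far less than they would shift the overall mean.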
To find event candidates within each time bin (0.25 s), we used the following threshold values applied to the relative spectrograms. To be identified as a spike, the spectral power had to be at least 4 times the baseline power in both beta and gamma bands but not in the ripple band (the 'spike' criterion: a 4-fold increase in power in the beta-gamma bands relative to the baseline power). For ripples, the spectral power had to be at least 7 times the baseline power within at least one ripple sub-band (rip1 or rip2 or rip3) but not in the beta-gamma bands (the 'ripple' criterion). For RonS, both the 'spike' and the 'ripple' criteria had to be satisfied. Lastly, the baseline candidates were selected as time bins where none of the above criteria was met. The higher threshold for ripples compared to spikes (7 versus 4) was chosen based on our preliminary analysis showing that the relative increase in power during ripples was indeed higher than the power changes during spikes.
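These screening rules reduce to a few boolean tests per time bin. In the Python sketch below, "both beta and gamma bands" is read as beta plus both gamma sub-bands, and each "but not" exclusion is implemented through the opposing criterion; both readings are assumptions where the text is ambiguous:

```python
# Sketch of the threshold-based candidate screen. Input is one 0.25 s
# time bin of the relative spectrogram (power divided by baseline power,
# per band); thresholds are 4x (spike) and 7x (ripple) as stated.

SPIKE_BANDS = ("beta", "gamma1", "gamma2")
RIPPLE_BANDS = ("rip1", "rip2", "rip3")

def classify_bin(rel_power, spike_thr=4.0, ripple_thr=7.0):
    spike = all(rel_power[b] >= spike_thr for b in SPIKE_BANDS)
    ripple = any(rel_power[b] >= ripple_thr for b in RIPPLE_BANDS)
    if spike and ripple:
        return "RonS"      # both criteria satisfied
    if spike:
        return "spike"     # beta-gamma increase without a ripple-band increase
    if ripple:
        return "ripple"    # ripple-band increase without a beta-gamma increase
    return "baseline"      # neither criterion met
```
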
All final ground truth events were selected from the candidates using visual inspection and verification at the appropriate temporal resolution (for spikes) and with the help of additional highpass filtering of the iEEG at >100 Hz for ripples and RonS. Highpass filtering allowed us to verify that an HFO burst contained more than 3 oscillations, as required to qualify as a ripple. This criterion helps avoid 'false' HFOs which are a by-product of highpass filtering of spikes 23,24. After visual inspection of approximately 1200-1300 event candidates of each class (spike, ripple, RonS and baseline), we selected 1000 events (per class) for every patient from the MC dataset. Importantly, these 1000 events (per event class and per patient) were randomly selected from all channels and all one-hour segments in each MC patient.
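The oscillation-count criterion can be checked mechanically once a candidate segment has been highpass-filtered. The sketch below counts cycles via sign changes around the segment mean; this is one simple way to implement the ">3 oscillations" rule, not necessarily the procedure used by the authors:

```python
# Sketch of the ">3 oscillations" check for ripple candidates. The input
# is assumed to be a segment already highpass-filtered at >100 Hz.

def count_oscillations(filtered_segment):
    """Estimate full oscillation cycles via sign changes about the mean."""
    mean = sum(filtered_segment) / len(filtered_segment)
    centered = [v - mean for v in filtered_segment]
    sign_changes = sum(
        1 for a, b in zip(centered, centered[1:]) if a * b < 0
    )
    return sign_changes // 2  # two zero-crossings per full cycle

def qualifies_as_ripple(filtered_segment, min_cycles=4):
    # "more than 3 oscillations" -> at least 4 full cycles
    return count_oscillations(filtered_segment) >= min_cycles
```
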
The LSTM network.
The major component of all LSTM networks is a hidden layer (the LSTM layer) consisting of memory cells. Each memory cell has an internal structure including the cell itself and three gates which control the cell's behaviour across time. The cell gates are essential for the ability of the cell to act as a memory unit, and they allow the network to detect long-term dependencies in the stream of input data. The input comes to the network through a sequence input layer, and after the LSTM layer there are a fully connected layer and the output layer (usually realizing the softmax operation) which conclude the classification architecture of the network. The network for this study was created using MATLAB R2018b (MathWorks, Inc., Natick, MA). The main architecture was based on one bidirectional LSTM layer (Fig. 1). The bidirectional layer of memory cells allows the network to analyse information in both directions (from past to future and backwards), which allows the network to understand the context better. As part of our exploration of the network parameter space, we also used several modifications of the network architecture, namely a single LSTM layer or two consecutively connected LSTM layers, as well as dropout layers after each LSTM layer.
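For reference, the gated memory cell described above is conventionally written as follows (the standard LSTM formulation, with \(\sigma\) the logistic sigmoid, \(\circ\) element-wise multiplication, \(x_t\) the input and \(h_t\) the hidden state at time \(t\)):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t && \text{(cell state update)}\\
h_t &= o_t \circ \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive cell-state update \(c_t\) is what lets gradients propagate over long time lags; a bidirectional layer simply runs one such cell forward and one backward in time and concatenates their hidden states.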
As an input to the network, we used relative spectral power values within all 8 frequency bands for each time bin as well as the preceding and the following bin. Thus, the feature input vector had a dimension of 8 × 3 (the number of frequency bands by the number of time points). Three consecutive time bins served to avoid a possible 'split' of an event between two adjacent bins as well as to provide the temporal pattern and context of the events to the network. Network training was performed on the MC dataset using different scenarios as follows: (1) training and testing within each MC patient (within-subject validation); (2) training on one patient's data and testing on other patients (between-subjects validation); (3) training and testing on all MC patients ('global' training and testing); and finally (4) the 'global' network trained on the whole MC dataset was tested on each patient from the AUH dataset ('between-institutions' validation). All training sessions were performed 10 times using random selections of 50-500 events (per class) from the ground truth set, while the other 500 events (per class) were used for testing. Network performance was assessed by the mean values and standard deviations (in %) of the total accuracy Acc across all four classes of events as well as the sensitivity Sens(i) and specificity Spec(i) for each class i = 1, ..., 4. These values were calculated using the following formulae, where TP_i, TN_i, FP_i and FN_i denote the numbers of true positives, true negatives, false positives and false negatives for class i, and N is the total number of test events:

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{4} \mathrm{TP}_i \times 100\%$$

$$\mathrm{Sens}(i) = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i} \times 100\%, \qquad \mathrm{Spec}(i) = \frac{\mathrm{TN}_i}{\mathrm{TN}_i + \mathrm{FP}_i} \times 100\%$$
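Total accuracy and the per-class sensitivity and specificity follow from a 4 × 4 confusion matrix in a one-vs-rest fashion. A minimal Python sketch (the class ordering is an illustrative assumption):

```python
# Sketch of the per-class performance metrics for the four event classes.
# confusion[t][p] counts events of true class t predicted as class p.

CLASSES = ["spike", "RonS", "ripple", "baseline"]  # assumed ordering

def metrics(confusion):
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    acc = 100.0 * correct / total  # total accuracy across all classes
    sens, spec = [], []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                       # missed class i
        fp = sum(confusion[t][i] for t in range(n)) - tp  # wrongly called i
        tn = total - tp - fn - fp
        sens.append(100.0 * tp / (tp + fn))
        spec.append(100.0 * tn / (tn + fp))
    return acc, sens, spec
```
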

Results
Waveforms of the ground truth events.
Representative spike, RonS and ripple events detected by our threshold method and confirmed by two experts (A.V.M. and A.M.M.) using visual and spectral analysis are shown in Fig. 2. The problem with spikes is that highpass filtering of a spike produces a burst of high frequency oscillations which can be mistakenly marked as a ripple, resulting in a false positive error for ripples and RonS events. This is a well-known effect of highpass filtering of sharp waves producing sharp transients 23. The spectrogram helps avoid this problem because the sharp transients of a spike or sharp wave do not usually have significant spectral power in the ripple bands (>120 Hz), and using a threshold of 7 helped us tell apart single spikes (Fig. 2, left) from RonS (Fig. 2, right).
The major focus of this study was to explore the ability of the network to generalise (a) across different segments of iEEG within each patient (within-subject validation); (b) across different patients from the same dataset (between-subjects validation); and finally (c) across datasets from different institutions for the network with optimal parameters ('between-institutions' validation). In the majority of calculations we used a bidirectional LSTM network with the number of hidden units (HU) equal to 200. Non-overlapping training and testing sets each contained 500 events per class (spike, ripple, RonS and baseline) randomly selected from the ground truth set, with the number of randomizations equal to 10.
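The random, non-overlapping train/test selection described above can be sketched as follows; the event objects and the seeding scheme are placeholders:

```python
# Sketch of the training/testing protocol: from the ground-truth events of
# each class, draw a random training subset and a disjoint 500-event test
# subset; repeating with different seeds gives the 10 randomizations.
import random

def split_events(events_per_class, n_train=500, n_test=500, seed=0):
    rng = random.Random(seed)
    train, test = {}, {}
    for cls, events in events_per_class.items():
        shuffled = events[:]          # leave the ground-truth set intact
        rng.shuffle(shuffled)
        train[cls] = shuffled[:n_train]
        test[cls] = shuffled[n_train:n_train + n_test]  # non-overlapping
    return train, test
```
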

Within-subject and between-subjects validations.
During the within-subject validation, network testing showed high values of accuracy exceeding 90% for spikes, ripples and baseline segments for all patients (the main diagonal entries in Table 2). The results of RonS detection were slightly lower in accuracy being in the range 85-90% for 4 out of 7 patients in the MC dataset (Table 2).
During the 'between-subjects' validation, the network was trained on each patient's data and then tested on all other patients within the MC dataset. Here the network showed poorer results, with accuracy below 86% in 4 and 7 cases for spikes and ripples, respectively (out of 42 pair-wise comparisons per class), as well as below 80% in 19 and below 60% in 7 out of 42 pair-wise comparisons for RonS (the off-diagonal entries in Table 2). This was evidence that a network trained on one patient does not necessarily generalise when applied to other patients, likely due to inter-individual variability of electrographic waveforms, especially for RonS events (Table 2).

Training and testing of a 'global' network.
To overcome the limited ability of the network to generalise from one patient to another, we trained a network on all patients from the MC dataset. Thus, a global feature set of 1000 × 4 × 7 = 28,000 events was created (1000 events per event class per patient). Network training and testing on this global set was done in a similar way by random selections of 500 events (per class) from each MC patient. These 500 × 4 × 7 = 14,000 events were used for training while the other 14,000 events were used for testing, and the process was repeated 10 times. For the 'between-institutions' validation, the 'global' network was tested on all 10-min interictal segments from the AUH dataset. The 'global' network was able to recognize all events from both datasets with total accuracy and specificity greater than 90% in all cases and sensitivity less than 86% in only two cases (for spikes in patient MC6 and ripples in patient MC7; Table 3). Overall, the average sensitivity for spikes and RonS was slightly lower (91.3 ± 5.1% and 92.5 ± 4.1%, respectively) compared to ripples (96.7 ± 4.9%).
For all parameters that determine the network size and its trained state (the number of training iterations, the number of hidden units and the size of the training set), testing was always performed on 500 events (per class per patient) not used in training. At N = 500, the network accuracy and sensitivity reached a maximum level at about 1000 iterations and practically plateaued at this level for the next 1000 iterations for all event classes (Fig. 3). A slight decrease in sensitivity (most likely due to overfitting) was observed only for RonS and ripples from 1000 to 2000 iterations (Fig. 3). This result demonstrates the network's robustness to overfitting. With regard to the number of hidden units, the network showed a slight decline in sensitivity from HU = 200 to HU = 50, with a more noticeable decline at HU = 20. Remarkably, with only 20 hidden units the network was still able to detect more than 90% of events correctly when the number of iterations was greater than 1000 (Fig. 3).
A decrease in the size of the training set from N = 500 to N = 50 also had only a minimal effect on network performance, with a drop in sensitivity for RonS from 92.04% to 86.89% (statistically significant by paired t-test, p = 0.03; Table 4). Changing the network structure (one LSTM layer versus two consecutive layers, with or without dropout layers) did not have any noticeable effect on performance compared to the bidirectional LSTM network (the standard model in this study; Table 4).

Discussion
In this study we applied, for the first time, a Long Short-Term Memory deep learning network for the detection of electrographic events such as epileptic spikes, ripples and ripples-on-spikes in intracranial EEG records from patients with intractable epilepsy. The LSTM network was trained on a ground truth set of events which was initially created through a novel automated threshold-based approach with subsequent visual verification and selection of events of four classes, namely spikes, ripples and ripples-on-spikes as well as baseline activity.
Our threshold method for automated detection of events differs from the commonly used methods in that it avoids calculating the standard deviation of spectral power and operates on relative values of spectral power (relative to baseline). The standard deviation is used in traditional methods to calculate the threshold as the mean power (or RMS) plus several (e.g., 4 or more) standard deviations. However, if the standard deviation is calculated over the entire EEG segment, it is unavoidably inflated by the presence of the events of interest (spikes and ripples). These events are rare but have a higher amplitude than baseline activity and therefore may significantly skew the (log-transformed) power density distribution toward higher values. This results in an inflated standard deviation and an unnecessarily elevated threshold. As a result, many events of interest may be missed. This problem can be addressed if the standard deviation is calculated not over the entire segment but over the periods of baseline activity (i.e., when the events of interest are not present), but this requires finding those events in the first place. Our method also avoids manual or automated selection of baseline segments, which would ideally lack any events of interest. Instead, the mean spectral power of baseline activity is calculated from the empirical distribution of spectral power density by taking the mean of the most probable spectral power values. This is justified under the assumption that relatively rare events of interest, although skewing the power density distribution at the high end, do not significantly affect the most probable values around the peak of this distribution. Indeed, the most probable power values are determined by baseline activity, which constitutes up to 80-90% of activity (by duration) during interictal or even preictal periods.
After the calculation of the baseline power within each spectral band, our method uses relative thresholds for a ratio between the current-bin spectral power and the corresponding baseline power (for each spectral band). Specific values of these relative thresholds used in this study were selected on the basis of our preliminary analyses comparing our threshold method with the visual inspection of the events.
We are aware of only a few studies where deep neural networks were applied to the study of epileptiform events such as spikes or HFOs. A convolutional neural network (CNN) was used by Johansen et al. (2016) for detecting epileptic spikes in scalp EEG from five patients 25. The authors report that the CNN achieved an area under the ROC curve (AUC) of 0.947. This was higher than the best performing model based on a Support Vector Machine, which achieved an AUC of 0.912 25. A CNN was also used by Zuo et al. (2019) to detect ripples and fast ripples in iEEG recordings from 6 patients with intractable epilepsy. The authors report that the CNN was able to detect HFOs with a higher sensitivity (77.04% for ripples and 83.23% for fast ripples) and specificity (72.27% for ripples and 79.36% for fast ripples) than four traditional automated methods implemented in the RIPPLELAB toolbox 26. A deep learning approach has also been reported to outperform threshold-based models, achieving 89.9% in accuracy, 88.2% in sensitivity and 91.6% in specificity 27. The LSTM architecture is better suited to the analysis of time series data than other neural network architectures, and LSTM networks have started to attract interest in the field of EEG analysis. Thus, an LSTM recurrent network has recently been used for recognition of human emotional states 28 and for human decision prediction 29 from scalp EEG, with the reported results outperforming traditional analytic machine learning algorithms.
In this study, we applied for the first time an LSTM network for the detection of distinctive electrographic events such as epileptic spikes, ripples and ripples-on-spikes. After creating a ground truth set of events through our threshold method and subsequent confirmation of events by visual analysis, the LSTM network was trained and tested on different subsets of data from individual patients. The main results can be summarized as follows. First, the LSTM network trained on the data from an individual patient can detect other events within the same patient with high accuracy but shows poorer performance on other patients. This is most likely due to individual variations in the spectral and temporal profiles of the events across patients. Limited generalisability, in which a network trained on one dataset shows worse performance on other datasets, is one of the problems that can limit the application of artificial neural networks in many fields of science and technology. The lack of generalisability is often a result of overfitting, when a network is over-trained on a specific set of data features which makes the network well-tuned and over-sensitive to those features only. One powerful solution to avoid overfitting is the inclusion of dropout layers in the network architecture. A dropout layer forbids weight adjustment on some randomly selected units within the hidden layer on each training step, thus preventing overfitting. In this study we did not see any prominent effects of overfitting within the explored range of network parameters (Fig. 3), and thus we attribute the poor 'between-subjects' generalisability of the network trained on just one patient to individual variation of the electrographic events (such as the amplitude and duration of spikes and ripples, which can vary from patient to patient). The important step to achieve good generalisability in our study was to train the network on all seven patients from the MC dataset.
Also of importance, the ground truth training set was created from events randomly selected from all channels in each MC patient, thus increasing the intra- and inter-subject variability of data features in the training set. The network trained on seven MC patients was able to recognize other events in the same patients as well as in the dataset from a different institution (the AUH dataset) with high accuracy, sensitivity and specificity (Table 3). A slightly lower sensitivity for spikes and RonS compared to ripples across the two datasets is likely due to a greater variability of spike amplitude, which varies considerably within and between patients. This result emphasises the importance of creating a fully representative training set for supervised learning of neural networks, covering a broad range of feature variations across many individuals and different institutions.
Our results demonstrate that the LSTM deep learning networks can be used for automated detection of epileptiform events such as spikes, RonS and ripples within intracranial EEG records. Most importantly, the network trained on an extended dataset from 7 patients showed excellent generalisability detecting events with high accuracy, sensitivity and specificity in all records from two independent datasets. Deep learning networks can significantly accelerate the analysis of prolonged iEEG data and increase its diagnostic value as well as provide additional information for a more accurate localization of the seizure onset zone with potential improvement of surgical outcomes.

Data availability
The data of the secondary analysis in this study are available on request from the corresponding author (A.V.M.). The original iEEG data are publicly available on the IEEG portal (www.ieeg.org).