Introduction

Epilepsy is among the most common neurological disorders worldwide, with an estimated 5 million people diagnosed each year1. Epileptic seizures are characterized by pathological electrical activity in regions of the brain that manifests as functional disturbances, which may be transient2. Although first-line treatment to control seizures consists of antiepileptic drugs, more than 30% of patients are pharmacoresistant and at high risk for premature mortality3,4,5. Stereoelectroencephalography (SEEG) is a method for localizing epileptogenic foci in patients with drug-resistant epilepsy (DRE) that involves placing depth macroelectrodes into the brain, followed by continuous monitoring in a specialized epilepsy monitoring unit (EMU)6,7,8. Epileptologists must quickly recognize abnormal SEEG waveforms, and EMU staff must monitor patients for signs of clinical seizures around the clock, making this a highly time- and resource-intensive process.

Deep learning-based approaches are promising solutions for automated seizure detection, but they are not without limitations9,10,11,12,13. Previous studies have: (1) used algorithms engineered to classify previously recorded EEG sequences without a framework for real-time event detection, (2) required large training datasets, extensive annotations, and pre-screening for artifacts to achieve adequate results, and (3) produced high false positive rates, commonly due to static thresholding methods applied in the decision function. These shortcomings limit clinical utility, particularly in the context of the large-scale data produced by continuous in-hospital recordings. Furthermore, acquiring large, annotated datasets and screening for artifacts is time- and cost-prohibitive, which diminishes utility unless the pre-trained models are exceptionally well-generalizable. Given the variety of waveforms, dynamic noise, and other idiosyncrasies often present in patient recordings, seizure detection remains challenging.

We present our results from training individually tailored, self-supervised Long Short-Term Memory (LSTM) deep neural networks on continuous in-hospital multichannel SEEG and video recordings with no explicitly labeled data (Fig. 1). Here, we define seizure detection as the task of anomaly detection in high-dimensional sequences. A dynamic thresholding method, developed by NASA for use on the Mars rover, was adapted to improve detection sensitivity and mitigate false positives, suggesting its feasibility as a new paradigm for real-time anomaly detection in video and electroencephalographic data14.

Figure 1

Overview of the workflow for continuous monitoring with video and SEEG and real-time analysis in the epilepsy monitoring unit. Patients with DRE receive continuous monitoring of their intracranial SEEG leads (red) and simultaneous video recording in their hospital beds (blue). A convolutional LSTM autoencoder (CNN + LSTM) was applied to the video recordings to calculate a regularity score for each frame over time. This regularity score time series and the SEEG time series (green sequence, bottom left) were then separately fed into an LSTM network to reconstruct their signals (blue sequence, bottom middle) and calculate a reconstruction error (red sequence, bottom right), which was then subjected to a self-supervised dynamic thresholding method to identify anomalous events in real-time.

Concurrent SEEG and video signals, totaling over 2000 h across all patients and channels analyzed, were processed by adapting previously described methods (“Signal processing”)15,16. LSTMs and convolutional LSTM autoencoders were trained for each patient as described in “LSTM training and parameters”. Dynamic thresholding was compared to conventional static thresholding, crossover experiments (Figs. 2, 3) were performed to characterize models’ patient-specificity, and joint models incorporating SEEG and video detection were constructed to assess the added benefit of multimodal detection. Model outputs were compared to ground truth anomalous sequences agreed upon by three fellowship-trained epileptologists who were blinded to the results of the model. The positive predictive value (PPV), sensitivity, and F1 scores were compared between models. Mean absolute percent error and minimal duration of recording data to train each model were also noted.

Figure 2

Design of crossover experiments to assess patient-specificity of models. LSTM models were trained on recordings from one patient and tested on recordings from another patient.

Figure 3

Crossover testing produces a large increase in the number of false positive results, indicating that trained models are attuned to the unique electrical signal of a given patient. Green shading marks prediction mismatches that correspond to correctly identified anomalies, whereas red shading marks prediction mismatches that correspond to false positives.

Results

Dynamic vs. static thresholding

The dynamic threshold with error pruning was compared to a baseline, fixed threshold, label-free anomaly detection approach. Inspection of the threshold demonstrated that the mathematical optimization in each window found localized levels that effectively categorized anomalies in real-time (Fig. 4A,B). Compared to static thresholding, dynamic thresholding did not improve sensitivity significantly (difference: 14.3%; 95% CI − 21.7 to 50.3%; Wilcoxon–Mann–Whitney test; N = 14; p = 0.42) but did significantly increase PPV (difference: 39.0%; 95% CI 4.5–73.5%; Wilcoxon–Mann–Whitney test; N = 14; p = 0.03). Additionally, F1 scores were significantly higher for the dynamic threshold (difference: 0.31; 95% CI 0.1–0.61; Wilcoxon–Mann–Whitney test; N = 14; p = 0.04).

Figure 4

Self-supervised error thresholding for real-time detection of anomalies in SEEG and video data. An LSTM network is trained to predict the next window of values in the test time series sequence (A, blue). These values are compared to the actual values (A, orange), and a smoothed error is calculated for each value in the sequence (B, red sequence). Prediction mismatches (A, purple) manifest as higher errors. A self-supervised dynamic threshold (B, magenta line) enables effective local classification of true anomalous sequences (B, green bar) while omitting many of the false positives (B, red bars) that result from traditional static thresholding methods (B, blue line). Concurrently acquired video recordings for each patient were considerably noisier and signal reconstruction was not as robust, demonstrated by the higher reconstruction errors (C, red sequence). While video sequences captured all of the true seizure events in the study population (C, middle, green bar), they also captured several false positive events, such as nurse visits (C, far right, red bars).

Crossover experiments

Crossover experiments assessed whether the learned features of each patient-trained model generalized to testing sequences derived from different DRE patients (Figs. 2, 3). With six distinct crossover combinations, anomaly detection sensitivity remained comparable to the non-crossover experiments (difference: 4.8%; 95% CI − 38.4 to 47.9%; Wilcoxon–Mann–Whitney test; N = 14; p = 0.82), but PPV (difference: 56.5%; 95% CI 25.8–87.3%; Wilcoxon–Mann–Whitney test; N = 14; p = 0.002) and F1 scores significantly declined (difference: 0.38; 95% CI 0.08–0.67; Wilcoxon–Mann–Whitney test; N = 14; p = 0.02). After training on continuous data from a given patient, testing the network on an unseen sequence derived from the same patient resulted in high fidelity of predicted sequences, with most of the prediction mismatches (Fig. 3, left, top, green circle) corresponding to true anomalies (Fig. 3, left, bottom, green bar). Testing this same model on an unseen, normalized sequence derived from a different patient produced considerably more prediction mismatches (Fig. 3, right, top, red and green circles), resulting in higher false positive rates (Fig. 3, right, bottom, red bars).

Multimodal detection

Joint models incorporating self-supervised anomaly detection in video and SEEG recordings were constructed to determine the potential added benefit of multimodal detection (Fig. 4C). Multimodal detection significantly improved sensitivity (difference: 25.0%; 95% CI 0.2–49.9%; Wilcoxon–Mann–Whitney test; N = 14; p < 0.05) over dynamically thresholded SEEG recordings, but decreased PPV, though not significantly (difference: 21.3%; 95% CI − 10.3 to 52.9%; Wilcoxon–Mann–Whitney test; N = 14; p = 0.17). Relative to video detection alone, the combined workflow also improved the PPV (difference: 28.5%; 95% CI 4.6–52.4%; Wilcoxon–Mann–Whitney test; N = 14; p = 0.02) and F1 scores (difference: 0.22; 95% CI − 0.01 to 0.44; Wilcoxon–Mann–Whitney test; N = 14; p = 0.06).

Discussion

This study is the first to implement a multimodal self-supervised deep learning workflow for intracranial seizure detection in DRE patients. While previous studies have used bedside recordings to classify hypermotor seizures, few have jointly evaluated video and electroencephalographic feeds to detect seizures17,18. This study provides a novel proof-of-concept in this arena by demonstrating the potential of self-supervised anomaly thresholding to improve the sensitivity and PPV of automated seizure detection on continuous multimodal recordings in real-time. Because error residuals in anomaly detection are often non-Gaussian, the nonparametric dynamic thresholding method used here for error classification overcomes a major limitation of prior studies, whose parametric thresholding methods assumed a distribution that the residuals do not follow.

The pipeline presented in this work utilizes an LSTM network and a convolutional LSTM autoencoder to enable real-time detection of anomalous events in high-resolution SEEG and video data, respectively, making this approach valuable in a prospective setting. Models were trained on only 5–10 min of SEEG recordings, which did not necessarily include a seizure event, and no labeled data were required, thereby reducing the time and cost of analysis. Crossover studies suggested that the self-learned representations of SEEG recordings are patient-specific, which provides confidence in the ability of our algorithm to identify clinically relevant features given the diversity of signal properties between patients. Taken together, clinical translation of this work could personalize the care of patients and augment the workflow of staff in the EMU. By ingesting a few initial minutes of a patient's recording, this pipeline would enable continuous long-term monitoring of ictal events and reduce frequent false alarms triggered by subtle environmental changes, which would otherwise be time-intensive and cost-prohibitive to review.

Earlier methods for inpatient epileptic seizure detection have traditionally relied on constant monitoring of patient recordings by trained personnel. Such monitoring is available in approximately 56–80% of EMUs, whereas automatic online EEG warning systems are present in only 15–19% of EMUs19. While clinical seizure semiology provides some critical information to help elucidate the zone of onset and propagation pathways, periictal behavioral assessments facilitate an even more comprehensive understanding of these details. Most algorithms for EEG-based seizure detection in clinical settings center on multiple-channel rather than single-channel analyses19. Following data acquisition by the electrodes, these systems typically employ a method for artifact rejection followed by an algorithm for event detection, usually involving analysis of the electrographic changes during seizures in terms of amplitude, frequency, or rhythmicity. Previous algorithms for patient-specific seizure detection have performed these analyses with both linear and nonlinear time–frequency signal analysis techniques20,21,22. More recent studies focusing on automated seizure detection have relied on other machine learning techniques, including support vector machines, k-nearest neighbors, and convolutional neural networks, which require complete electroencephalograms before determining whether anomalies are present23,24,25. Such properties limit the application of these approaches primarily to retrospective data. Furthermore, unlike deep learning methods, which learn the most useful features automatically, these older methods require manual feature extraction and careful engineering to obtain acceptable results. Other work has focused on developing large pre-trained models with the goal of successful generalization to other patients26. Of note, several generalized, commercially available seizure detection algorithms are currently on the market, including Persyst-Reveal27, IdentEvent28, BESA29, and EpiScan30. The primary limitation of these methods, however, is that they may not generalize well to other patients given the wide variety of signal characteristics that can arise from recording quality, patient disease and electrophysiological characteristics, or other uncontrollable factors. This, in turn, may limit clinical efficacy. In contrast, as described previously, the workflow presented in this study could be rapidly deployed in clinical settings to create patient-specific models with improved adaptability for prospective prediction.

Limitations

This study’s limitations include using retrospective data for training and a relatively small patient cohort, which could introduce selection bias. While overfitting is always a concern in deep learning, we controlled for this by holding out data for validation for each patient and using early stopping criteria during model training (“LSTM training and parameters”). Additionally, although incorporating videos improved sensitivity, it also increased false positives. Developing more sophisticated tiered or weighted systems for escalating anomalies detected in concurrent multimodal recordings could reduce false positives in this workflow. Future work is underway to adapt these methods to a prospective, randomized format to confirm the utility of self-supervised dynamic thresholding for seizure detection in a clinical setting.

Conclusions

Self-supervised dynamic thresholding of patient-specific models significantly improves the PPV of seizure detection in continuous SEEG recordings from DRE patients compared to traditional static thresholds. Incorporating concurrent video recordings into multimodal models significantly improved sensitivity, but reduced PPV, though not significantly. The characteristics of these models are promising for future deployment in clinical settings to improve the speed, precision, and cost-effectiveness of epilepsy monitoring, which may ultimately improve the safety profile of SEEG monitoring for our patients.

Methods

Study protocol

Patients with drug-resistant epilepsy (DRE) at an academic medical center were retrospectively enrolled in the study. Subjects with significant progressive disorders or unstable medical conditions requiring acute intervention, those taking more than three concomitant antiepileptic drugs (AEDs) or with changes in their AED regimen within 28 days, and patients who had begun epilepsy treatment less than two years prior to enrollment were excluded from the study. In total, 14 consecutive DRE patients underwent surgical implantation of 10–18 multichannel SEEG leads from 2018 to 2019 per standard hospital protocols (average: 15 leads, 147 channels), followed by in-hospital video and SEEG monitoring for 4–8 days (average: 6 days). Patients were 16–38 years old (average: 24.5 years), 57% were female, and 71.4% were taking AEDs during the recording period. All patients had recordings with at least one epileptiform event (Table 1). This study was approved by the Mount Sinai Health System Institutional Review Board (IRB). Informed consent was waived by the IRB with oversight from the Program for the Protection of Human Subjects Office. All methods were performed in accordance with the relevant guidelines and regulations.

Table 1 Characteristics of the patient population.

Signal processing

High-resolution SEEG recordings sampled at 512 Hz were obtained from the Natus NeuroWorks platform, filtered with a one-pass, zero-phase, non-causal 50 Hz low-pass finite impulse response filter, and scaled to (−1, 1). Concurrent video recordings for each patient in the monitoring unit were acquired at 480p resolution and 30 frames per second. Videos were segmented into sequential clips, converted to .tiff image files using FFmpeg, and fed into a convolutional LSTM autoencoder structured with 2 convolutional layers, 3 convolutional LSTM layers, and 2 deconvolutional layers16. A regularity score time series was calculated across all video frames by computing the reconstruction error of each frame as the sum of its pixel-wise errors, as described by Hasan et al.15. Signal processing was conducted using MNE 0.17.1 and SciPy Signal in Python 3.7.
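For illustration, a minimal sketch of this preprocessing is shown below, assuming the raw SEEG is held in a channels-by-samples NumPy array; the per-channel min-max scaling and the regularity score helper are our illustrative readings of the scaling step and of the Hasan et al. score, not the authors' exact code.

```python
# Sketch of the SEEG preprocessing and video regularity score described above.
# Assumptions: raw SEEG as a (n_channels, n_samples) float array; channels are
# non-constant so min-max scaling is well defined.
import numpy as np
import mne

SFREQ = 512.0  # Hz, per the acquisition settings above

def preprocess_seeg(raw, sfreq=SFREQ):
    """50 Hz low-pass FIR filter (MNE's default one-pass, zero-phase,
    non-causal design), then scale each channel into (-1, 1)."""
    filtered = mne.filter.filter_data(raw.astype(np.float64), sfreq,
                                      l_freq=None, h_freq=50.0,
                                      method="fir", phase="zero")
    lo = filtered.min(axis=1, keepdims=True)
    hi = filtered.max(axis=1, keepdims=True)
    return 2.0 * (filtered - lo) / (hi - lo) - 1.0

def regularity_scores(frame_errors):
    """Regularity score s(t) = 1 - (e(t) - min e) / max e, where e(t) is the
    sum of pixel-wise reconstruction errors of frame t (Hasan et al.)."""
    e = np.asarray(frame_errors, dtype=np.float64)
    return 1.0 - (e - e.min()) / e.max()
```

MNE's FIR defaults (method="fir", phase="zero") correspond to the one-pass, zero-phase, non-causal design described above.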

LSTM training and parameters

A self-supervised training regimen was established in which each channel from the SEEG recordings and the regularity score time series was divided into training and testing sequences using variable train:test splits ranging from 20:80 to 50:50. Of the recordings, 29% in the training set contained epileptiform events, compared with 86% in the test set (Table 2). An LSTM network with 80 hidden layers was initialized for each channel and trained on the unlabeled training sequence for up to 35 epochs (or until early stopping criteria were met) with a sequence length typically between 250,000 and 750,000 elements, spanning roughly 10 to 30 min overall, which did not necessarily include known anomalies. To mitigate the risk of overfitting, early stopping with a training patience of 5 was applied: if 5 consecutive training iterations failed to decrease the loss metric by at least 0.003, training was halted. Each LSTM used a mean-squared-error loss, an Adam optimizer, and a dropout of 0.3. Within the training sequences, 20% of the data was set aside for validation before testing. After training, the performance of each model was assessed on the unseen test sequences: the network was evaluated on its ability to predict future values in real-time (Fig. 4A), compare the predictions to the actual values, and compute a smoothed error based on the difference between actual and predicted values (Fig. 4B). LSTMs and convolutional autoencoders were implemented using TensorFlow.
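A minimal TensorFlow/Keras sketch of this regimen is shown below. The windowing helper and window lengths are illustrative assumptions, and the two stacked 80-unit LSTM layers are a simplified stand-in for the architecture described above; the loss, optimizer, dropout, early stopping criteria, and validation split follow the text.

```python
# Sketch of per-channel LSTM training with the early stopping criteria above.
import numpy as np
import tensorflow as tf

def make_windows(series, in_len=250, out_len=10):
    """Slice a 1-D sequence into sliding (input window -> next values) pairs."""
    X, y = [], []
    for i in range(len(series) - in_len - out_len):
        X.append(series[i:i + in_len])
        y.append(series[i + in_len:i + in_len + out_len])
    return np.array(X)[..., None], np.array(y)

def build_and_train(train_series):
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(80, return_sequences=True),  # hidden size of 80
        tf.keras.layers.Dropout(0.3),                     # dropout of 0.3
        tf.keras.layers.LSTM(80),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(10),  # predict the next window of values
    ])
    model.compile(optimizer="adam", loss="mse")  # Adam + mean-squared error
    # Stop early when 5 consecutive epochs fail to cut val loss by >= 0.003.
    stopper = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                               min_delta=0.003, patience=5)
    X, y = make_windows(train_series)
    model.fit(X, y, epochs=35,            # up to 35 epochs
              validation_split=0.2,       # 20% of training data for validation
              callbacks=[stopper], verbose=0)
    return model
```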

Table 2 Neural network specifications and results.

Self-supervised dynamic thresholding method

A novel dynamic thresholding approach, developed by the NASA Jet Propulsion Laboratory to detect real-time anomalies in telemetry data from the Mars rover Curiosity, was adapted to our models to label anomalies based on the error values from the time series predictions14. In contrast to the conventional static thresholds frequently used for anomaly detection (e.g., mean ± 2 standard deviations), this dynamic method uses a sliding-window approach to find optimal local thresholds, such that excluding the values above the chosen threshold maximizes the percent decrease in the mean and standard deviation of the smoothed error within the window. To mitigate false positives, an error pruning procedure was implemented in which the sequence of smoothed errors was stepped through incrementally and the percent decrease between steps was computed; steps with a percent change greater than 10% remained anomalies, while steps with a change of less than 10% were reclassified as normal.
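A minimal sketch of this thresholding and pruning logic, loosely following the cited JPL method14, is given below; the candidate z-value grid and the penalty on the number of flagged points are illustrative choices rather than parameters specified in the text.

```python
# Sketch of nonparametric dynamic thresholding with error pruning.
import numpy as np

def dynamic_threshold(errors, z_values=np.arange(2.0, 10.5, 0.5)):
    """Choose eps = mu + z*sigma maximizing the combined percent decrease in
    the window's mean and std when supra-threshold errors are excluded."""
    e = np.asarray(errors, dtype=np.float64)
    mu, sigma = e.mean(), e.std()
    best_eps, best_score = mu + 2.0 * sigma, -np.inf
    for z in z_values:
        eps = mu + z * sigma
        below = e[e <= eps]
        if below.size in (0, e.size):
            continue  # threshold excludes nothing or everything
        score = (mu - below.mean()) / mu + (sigma - below.std()) / sigma
        score /= (e > eps).sum()  # penalize flagging many points (our choice)
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps

def prune(anomaly_peak_errors, min_drop=0.10):
    """Step through candidate anomalies' peak errors in descending order and
    keep only those preceding the last percent decrease greater than 10%."""
    order = np.argsort(anomaly_peak_errors)[::-1]
    sorted_e = np.asarray(anomaly_peak_errors, dtype=np.float64)[order]
    keep_upto = 0
    for i in range(1, len(sorted_e)):
        if (sorted_e[i - 1] - sorted_e[i]) / sorted_e[i - 1] > min_drop:
            keep_upto = i  # anomalies above this drop remain anomalous
    return set(order[:keep_upto])  # indices of retained anomalies
```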

Crossover and multimodal video/SEEG detection experiments

To evaluate the patient-specific nature of the LSTM models, crossover experiments were conducted, in which models were trained on recordings from one patient and tested on another, while all other conditions remained identical to previous testing conditions, including the dynamic thresholding and error pruning methods (Fig. 2). Fourteen combinations of train and test sequences derived from the study population were randomly selected to conduct the crossover experiments.

To assess the added value of multimodal detection, the concurrent video and SEEG recordings for each patient were separately fed into the corresponding deep neural networks described previously. The anomalous sequence predictions made by the self-supervised dynamic threshold in the LSTM decision function for each detection modality were then pooled before comparing the predicted anomaly times with the consensus labels of the expert panel of epileptologists. We did not encounter any disagreements among the panel regarding consensus labeling within this dataset. The results of model performance on individual patient recordings are detailed in Table 3, along with each patient's clinical and electrophysiologic seizure manifestations.
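The pooling step can be illustrated with a short sketch that takes the union of anomalous intervals flagged by the two modalities; the (start, end) interval representation and the merge tolerance are hypothetical conveniences, not details specified in the text.

```python
# Sketch of pooling anomaly predictions from the SEEG and video detectors.
def pool_intervals(seeg_events, video_events, tol=0.0):
    """Union of (start, end) anomaly intervals in seconds, merging intervals
    that overlap or fall within tol seconds of each other."""
    events = sorted(seeg_events + video_events)
    pooled = []
    for start, end in events:
        if pooled and start <= pooled[-1][1] + tol:
            pooled[-1][1] = max(pooled[-1][1], end)  # merge into previous
        else:
            pooled.append([start, end])
    return [tuple(iv) for iv in pooled]

# An SEEG detection overlapping a video detection merges into one event:
print(pool_intervals([(120.0, 180.0)], [(150.0, 210.0), (400.0, 410.0)]))
# [(120.0, 210.0), (400.0, 410.0)]
```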

Table 3 Patient-specific clinical and electrophysiologic seizure manifestations, as well as model performance on individual patient recordings.

Metrics for assessing signal reconstruction quality

We assessed the models' ability to capture the underlying signal using the standard time series metric of mean absolute percentage error (MAPE), computed for each recording channel reconstructed by the LSTM for each patient. MAPEs ranged from 0.15 to 1.57% across patients (average: 0.75%; Table 2), suggesting generally excellent reconstruction of the SEEG signal by the LSTM. Video regularity score signals were noisier due to the diverse events occurring during recording, leading to higher MAPEs (average: 19.95%; Table 2).
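For reference, a minimal sketch of the MAPE computation for one reconstructed channel; the epsilon guard against division by near-zero samples is our own choice, not specified in the text.

```python
# Sketch of mean absolute percentage error (MAPE) for one channel.
import numpy as np

def mape(actual, predicted, eps=1e-12):
    """MAPE = 100 * mean(|(y - y_hat) / y|), with eps guarding zero division."""
    actual = np.asarray(actual, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    return 100.0 * np.mean(np.abs((actual - predicted) /
                                  np.maximum(np.abs(actual), eps)))
```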

Statistics

For continuous variables in this study, the Kolmogorov–Smirnov test was first used to assess normality. Because the data were not normally distributed, continuous variables were compared using the Wilcoxon–Mann–Whitney test. A threshold of p < 0.05 with two-tailed testing was used to determine statistical significance. Statistics were conducted using Prism 7.
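Although the analyses were performed in Prism 7, an equivalent sketch of this procedure in SciPy might look as follows; standardizing each sample before the Kolmogorov–Smirnov test is our illustrative choice.

```python
# Sketch of the normality check followed by the two-sided rank test.
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    """Kolmogorov-Smirnov normality check on each group, then a two-sided
    Wilcoxon-Mann-Whitney (Mann-Whitney U) comparison."""
    normal = all(stats.kstest(stats.zscore(x), "norm").pvalue >= alpha
                 for x in (a, b))
    u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    return {"both_normal": normal, "U": float(u), "p": float(p),
            "significant": p < alpha}
```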