Heart rate variability analysis for the identification of the preictal interval in patients with drug-resistant epilepsy

Electrocardiogram (ECG) recordings, lasting hours before epileptic seizures, have been studied in the search for evidence of the existence of a preictal interval that follows a normal ECG trace and precedes the seizure’s clinical manifestation. The preictal interval has not yet been clinically parametrized. Furthermore, the duration of this interval varies for seizures both among patients and from the same patient. In this study, we performed a heart rate variability (HRV) analysis to investigate the discriminative power of the features of HRV in the identification of the preictal interval. HRV information extracted from the linear time and frequency domains as well as from nonlinear dynamics were analysed. We inspected data from 238 temporal lobe seizures recorded from 41 patients with drug-resistant epilepsy from the EPILEPSIAE database. Unsupervised methods were applied to the HRV feature dataset, thus leading to a new perspective in preictal interval characterization. Distinguishable preictal behaviour was exhibited by 41% of the seizures and 90% of the patients. Half of the preictal intervals were identified in the 40 min before seizure onset. The results demonstrate the potential of applying clustering methods to HRV features to deepen the current understanding of the preictal state.


Study assumptions
The main goal of this study is to characterize the preictal interval using unsupervised methods. In sum, we applied seven clustering methods (K-means, agglomerative hierarchical, density-based spatial clustering of applications with noise using ε = 1, 2, 3 and 4 and Gaussian mixture models) to all combinations of three features in the F = 32 feature dataset (4960 feature combinations per seizure). We obtained a total of 34720 clustering solutions per seizure that we analysed aiming at characterizing the preictal interval. Accordingly, we settled the following assumptions (represented in Fig. S1) to identify what we considered could be valid clustering solutions: (i) Seizures separated by at least 240 minutes were considered independent events.
(ii) Although 240 minutes of data were analysed, only the cardiac changes observed within the 120 minutes before the seizure discharge were influenced by that event, with the data in the 240-120-minute interval before seizure onset considered the minimum data required to represent interictal data (see Fig. S1).
(iii) The search for the preictal interval was undertaken for solutions comprising two clusters.
(iv) Given the higher probability of a preictal interval lasting less than an interictal interval, the smaller cluster found in each two-cluster solution represented the preictal interval (see Fig. S1).
(v) This interval may not occur strictly near the seizure onset but could be captured as an ECG-related event eventually preceding an EEG seizure onset (see Examples 2 and 4 in Fig. S1).
2 Seizure metadata Table S1 contains information regarding each patient, namely, sex, age at hospital admission and onset age (corresponding to the occurrence of the first epileptic event), epileptic focus lateralization and the number of seizures analysed for each patient. In Fig. S2 it is possible to find information about the seizure classification, vigilance state and onset hour, respectively, annotated for each non-discarded subsequent seizure. Fig. S3 depicts a histogram of the number of seizures (out of a total of 238) for which the time interval between that seizure and the previous one lies in each of the intervals defined in the x axis.   3 Extracting HRV from ECG

ECG signals pre-processing
The ECG pre-processing step (represented in Fig. S4) started by inspecting 5-min non-overlapping windows of the ECG raw data. According to the Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology, five minutes is considered to be the minimum time interval required to compute HRV metrics 1 . In each 5-min window, a notch filter (IIR filter design and zero-phase distortion by forward and backward filtering) was applied at 50 Hz frequency to remove the powerline interference 5 . The next step consisted of applying the Discrete Wavelet Transform (DWT) to the 5-min non-overlapping windows, to obtain the frequencies of interest in ECG as well as to remove its baseline wander. Each window underwent a multilevel one-dimensional wavelet analysis, using a mother wavelet belonging to the biorthogonal family (bior3.3 ). This mother wavelet very closely resembles the morphology of the QRS complex 2 . As a result, ten frequency bands were obtained (ten decomposition levels and ten approximation levels). From the wavelet analysis, two frequency bands were further inspected: the approximation coefficient containing frequency content up to 45 Hz, and the approximation coefficient containing frequency components up to 0.8 Hz. The latter, comprising the baseline wander frequency components, was then subtracted from the former 3-5 .
Subsequently, a R-peak detection algorithm, based on the Pan & Tompkins algorithm 6 was used to identify the R-peaks in each 5-min non-overlapping window. Similarly to Pan & Tompkins, two thresholds were computed. The first one corresponds to the application of a two-second moving average filter to the moving-window integration (using a 0.2 second window). The second one consists of a baseline threshold intended to prevent the analysis of lead-off periods. In other words, to prevent the moving average threshold from reaching zero, a baseline threshold was defined, corresponding to the minimum of the moving average threshold plus 10% of the difference between the median and the minimum of that threshold (defined experimentally). In this way, it was possible to improve the algorithm's robustness regarding the specific characteristics of the ECG of each lead and each patient.
The ECG signals were then inspected regarding the identification of noisy segments. Towards that end, a simple thresholding approach was developed, based on the computation of two measures in each 5-min window: the first detail coefficient, returned by the DWT and the zero-crossing rate. The latter was particularly useful in identifying for instance periods of lead-off signal.

RR interval series editing
The RR interval (RRI) series is obtained by computing the time intervals between adjacent QRS complexes, taking the R-peak as the reference fiducial point 7,8 . At this point, it is important to note the distinction between normal-to-normal (NN) intervals and RRIs. The former are observed between adjacent heartbeats resulting from sinus node depolarizations, whereas the latter comprises all interbeat intervals, including both the NN intervals and abnormal intervals 1 . Abnormal RRIs are obtained by analysing false beats introduced by (i) physiological artifact in the form of ectopic beats, also known as premature beats, and (ii) technical artifact originated by problems in signal acquisition, e.g., subject movement or electrode detachment, or by the  Figure S4: ECG pre-processing and RRIs series extraction. emergence of false positives or false negatives during peak detection phase 3,7,8 . The presence of ectopic beats has been reported to undermine the HRV analysis specifically by leading to an overestimation of the values of power in the LF and HF frequency bands (described in section 4.2), and consequently of the LF to HF ratio 7,8 .
To prevent the analysis of abnormal RRIs, a simple criteria has been included in this HRV analysis: if a given RRI falls outside the range of 20% around the mean of the previous ten RRIs, then either it is removed or it is replaced by an interpolated RRI value. This criterion has been built on the assumption that abrupt changes in the HR are unlikely to occur during sinus node activity 3 . When the abnormal RRI is higher than the range of values previously mentioned, it is replaced using linear interpolation 3,7,8 .
From now on, the ECG signals were spanned using a 5-min (300 seconds) window with 98.33% (295 seconds) of overlapping. The RRIs corresponding to that 5-min window were pulled from RRI series, being ready for HRV-feature computation. This means that each feature sample corresponds to a time increase of five seconds. The overlap percentage was set in order to, in a future study, allow for the fusion of information from both ECG and EEG data. EEG recordings are typically analysed with a five second non-overlapping spanning window.
Furthermore, if more than 80% of the RRI series, comprised in a 5-min window, contains abnormal RRIs, that window is considered as noise and no features are extracted from it 8 . RRI and NN intervals notation will henceforth be used interchangeably to denote corrected RRIs.

HRV feature engineering
HRV captures the variability of the intervals between consecutive R-peaks. Such short-term oscillations reflect the neurocardiac function or, in other words, the modulation of the heart-brain interactions. In fact, an HRV analysis is typically conducted to inspect the influence of the ANS on the sinoatrial (SA) node and, this way, to assess the relative balance between the sympathetic and parasympathetic branches of the ANS (sympatho-vagal balance) 1,9 . The computation of the HRV features is intended to capture the influence of the two branches of the ANS, sympathetic and parasympathetic, responsible for maintaining homeostasis. Studies have demonstrated that, when the sympathetic nervous activity is triggered during seizures, it typically results in increased HR and blood pressure, and possible occurrence of tachycardia and tachypnea 10,11 . When the parasympathetic response predominates, for a considerably low number of seizures, the normal cardiorespiratory function is altered with regard to the decreasing of heart and respiration rates and also blood pressure 10,11 .
All HRV features used in this study are presented in Table S2, with additional details regarding (i) the units of measurement; (ii) the minimum length of clean ECG signal required to compute these HRV-derived features; (iii) the number of missing values obtained for each feature resulting from the existence of noisy 5-min windows; (iv) which autonomous nervous system response has been associated with each feature (it can be both) according to the literature and (v) the computational time required to compute each feature.

HRV linear time domain features
Time domain measures of HRV are the simplest to obtain, as they are basically the result of the application of descriptive statistical methods. These measures are considered robust to a previous preprocessing of the RRI series towards artifact and/or ectopic beat removal 7 . Two types of features can be derived from the RRI series: (i) from the series of RRIs; and (ii) from the time-series resulting from the difference between successive RRIs 1 .
The following time-domain features were computed: N N 50: number of RRIs that last more than 50 ms.
pN N 50: percentage of RRIs that last more than 50 ms.
SDN N : standard deviation of RRIs.
RM SSD: square root of the mean squared differences of successive RRIs.
SDSD: standard deviation of the differences between successive RRIs.
RRM ean, RRM in, RRM ax and RRV ar: respectively, mean, minimum, maximum and variance of the RRI time-series.
SDNN corresponds to the standard deviation of the RRIs, which is equivalent to the square root of variance and, therefore, to the total power of the frequency spectrum in a given signal segment. This measure captures the cyclic components introducing variability in the ECG segment under analysis. It is important to highlight that, as the total variance of the HRV is known to increase with the length of the ECG segment, the SDNN will also be influenced by that factor. Consequently, to perform comparisons among SDNN measures, it is mandatory that the different segments analysed have the same length 1,9 .

HRV linear frequency domain features
The HRV frequency analysis was performed by computing the Lomb-Scargle periodogram 15 . This nonparametric estimation method was chosen to handle the unevenly sampled RRI series, without requiring data interpolation and, hence, do not assuming any underlying model. Using this method, it is possible to avoid the shifting of the spectral peaks towards the low frequency components and, therefore, the overestimation of the total power present in LF and HF bands 3,7 .
The frequency spectrum (see Fig. S5) is inspected towards the assessment of three different spectral components: very low frequency (VLF) in the 0.0033-0.04 Hz range, low frequency (LF) corresponding to the 0.04-0.15 Hz band, and the high frequency (HF) component found in the 0.15-0.4 Hz range 1, 3, 7, 9 . The power in LF and HF frequency bands can be normalized to the T otal power − V LF power entity, yielding LF norm and HF norm and thus emphasizing the influence each branch of the ANS has on the HRV regulation. Equivalently, by normalizing the frequency bands, it is possible to decrease the impact of the changes in the Total power, particularly on the values of LF and HF powers. Furthermore, it also allows for direct comparisons of power measures between two subjects. In fact, significant differences were reported between the values of Total power and power in each frequency band, among healthy subjects 1, 16-20 .
Eight features were then extracted from the frequency spectrum of each 5-min window: T otal power: window's total power.
V LF power: power in the VLF band.
LF power: power in the LF band.
HF power: power in the HF band.
LF norm = LF power T otal power − V LF power .
HF norm = HF power T otal power − V LF power .
Additionally, HF spectral component (or respiratory sinus arrhythmia), as the name suggests, captures HR variations associated with the respiratory cycle, therefore, reflecting the vagal or parasympathetic activity. Conversely, LF component has not yet been associated with a clear interpretation. While it is considered by the majority of the research community as an indicator of sympathetic activation, other studies suggest that LF power may be influenced by both parasympathetic and sympathetic nervous systems. As a result, the ratio of Figure S5: Examples of frequency analysis. The frequency spectrum was computed for two 5-min windows ranging from (a) 78 to 73 minutes and (b) 26 to 21 minutes before seizure onset, respectively, for the first seizure of patient 2.
LF to HF powers (typically assumed to characterize sympathovagal balance) and the VLF power remain HRV features that are not to date known to clearly manifest activity from either branch 1,10,13,14,21,22 . Finally, Total power is useful in capturing the variance of all the RRIs 14 .

HRV nonlinear features
HRV is known to be the result of the complex interactions between haemodynamic, electrophysiological and humoral body functions. Additionally, the influence of the autonomic and central nervous systems' controlling actions will also be reflected in HRV metrics 1,23 . A nonlinear analysis might provide new complimentary information regarding such complex interactions. Nonlinear indices, even though not being regarded to directly capture ANS responses and, therefore, having a less clear interpretation comparing to linear measures, are known to provide higher reliability (repeatability) across measurements, when compared to linear ones 12,24 . A considerable set of nonlinear features was assembled and is described briefly below.
From the Poincaré plot representation (see Fig. S6), obtained by computing a scatter plot of each RRI against the previous one 9, 25 , it is possible to derive three features:   In other words, SD 1 and SD 2 can be regarded as measures of short-and long-term RRI variability, respectively. Accordingly, SD 1 translates parasympathetic activity, whereas SD 2 captures overall HRV 13 . Lastly, the unpredictability of the RRI series can be determined by computing SD 1 /SD 2 9 . The computation of the remaining nonlinear features, which will be described hereafter, assumes that the RRI series is evenly sampled. As such, a cubic spline interpolation was performed on the irregularly sampled RRI series. This method was reported to introduce the lowest error when computing LF and HF powers 7 . An RRI time-series was obtained, characterized by a sampling frequency of 4 Hz that allows for spectral components up to 2 Hz to be captured, according to the Nyquist theorem 3,7,19 .
Detrended fluctuation analysis (DFA) was performed to explore the extent of long-range correlations in the RRI time-series for different time scales and, this way, to inspect the fractal scaling properties of HRV [25][26][27] . Particularly, two scaling exponents were returned from DFA:  By inspecting these features, it is possible to know what is the likelihood that similar sequences found for m points, within a tolerance, r, will also be found for m + 1 points. Both measures require, therefore, the definition of the number of points, m, comprised in the sequences further compared, and the tolerance value, r, for which matches are accepted. In an HRV analysis, value of m parameter is typically set to 2. The tolerance, taken as the similarity criterion, was set to 0.2 × SD, SD corresponding to the standard deviation of the 5-min window of the interpolated RRI signal (of length N ) 19,[28][29][30] . A detailed description of the remaining process to compute these measures can be found in Delgado-Bonal and Marshak paper 30 . Low values of ApEn or SampEn are related to the existence of patterns, whereas high values can be found for less predictable and more complex processes 29 .
By using the information of an RRI time-series to reconstruct the underlying m-dimensional dynamical system, it is possible to compute the following features: Correlation dimension (CD).
The former quantifies the complexity of the system, whereas the latter provides information regarding the overall predictability of the system, i.e., the evolution of the trajectories in the phase space. A zero value of LLE can be found for periodic signals, whilst more chaotic systems are associated with an increase of the LLE 25,31,32 . Similarly, an increase in complexity corresponds to an increase in the value of CD 31,33 . To compute such measures, the reconstruction of the underlying system's phase space from the available measured data (RRI time-series) must be performed. In this study, the two parameters required for phase space reconstruction, the embedding dimension, m, and the time delay, τ , were estimated (for each 5-min window) using the False Nearest Neighbour algorithm and first local minimum of the average mutual information method, respectively [34][35][36] . The LLE was then obtained using Rosenstein et al. (1993) 37 method, which is one of the most widely used for this purpose, whereas CD was computed based on Grassberger & Procaccia (1983) 38 method.
A recurrence quantification analysis (RQA) is typically conducted to find hidden periodicities in the aforementioned phase space trajectory. Such information can be accessed by computing the recurrence plot (RP) of a given time-series 39 . The first step to obtain such representation is to compute the square matrix with dimensions N × N (known as colour recurrence plot and depicted in Fig. S7), containing the pairwise Euclidean distance (given by the norm || · ||) between all samples of the trajectory, N , in the m-dimensional space 32, 40 . Afterwards, a threshold distance ε is used to define a sphere centred at the state x i . If x j falls within that sphere, then the Heaviside function Θ(·) decides for R i,j = 1, meaning the states are close to each other. Otherwise, R i,j = 0. This binary matrix, symmetric along the identity line, can therefore be visualized in a black (R i,j = 1) and white plot (R i,j = 0), the RP. In this study, the value of ε was estimated for each 5-min window, corresponding to 10% of the maximum phase space diameter 41 . The RQA returned seven RP measures of complexity, which quantify recurrence point density and indicate the existence of diagonal and/or vertical lines in the RP. A comprehensive description of the computation of these measures can be found in [40][41][42] . The following RQA features were derived: Recurrence rate (REC).

Determinism (DET).
Average diagonal line length (L). Length of the longest diagonal line (L max ).

Laminarity (LAM).
Trapping time (TT).   Table S1). The feature combination c = 2159, corresponding to features RRMean, LF Power and SampEn, is depicted. Seizure onset occurs at 0 min, as the x-axis represents the time anticipating seizure onset. The vertical dashed lines indicate the location of the 5-min windows depicted in Figures  S5, S6 and S7.