Biomarker discovery in neurological and psychiatric disorders critically depends on reproducible and transparent methods applied to large-scale datasets. Electroencephalography (EEG) is a promising tool for identifying biomarkers. However, recording, preprocessing, and analysis of EEG data is time-consuming and researcher-dependent. Therefore, we developed DISCOVER-EEG, an open and fully automated pipeline that enables easy and fast preprocessing, analysis, and visualization of resting state EEG data. Data in the Brain Imaging Data Structure (BIDS) standard are automatically preprocessed, and physiologically meaningful features of brain function (including oscillatory power, connectivity, and network characteristics) are extracted and visualized using two open-source and widely used Matlab toolboxes (EEGLAB and FieldTrip). We tested the pipeline in two large, openly available datasets containing EEG recordings of healthy participants and patients with a psychiatric condition. Additionally, we performed an exploratory analysis that could inspire the development of biomarkers for healthy aging. Thus, the DISCOVER-EEG pipeline facilitates the aggregation, reuse, and analysis of large EEG datasets, promoting open and reproducible research on brain function.
Biomarkers that relate brain function to cognitive and clinical phenotypes can help in the prediction, treatment, monitoring, and diagnosis of neurological and psychiatric disorders1,2. The successful identification of biomarkers crucially depends on the application of reproducible and transparent methods3 to large-scale datasets4. Furthermore, to translate biomarkers into clinical practice, they need to be generalizable, interpretable, and easy to deploy in clinical settings.
Electroencephalography (EEG) is a promising tool for biomarker discovery, as it is non-invasive, safe, widely used in clinical and research contexts, portable, and cost-efficient. Consequently, EEG biomarker candidates have been described in depression5,6,7, post-traumatic stress disorder7,8, and chronic pain9. Most of these biomarker candidates have been discovered in resting state data, during which spontaneous neural activity is captured. Still, their translation into clinical practice has not yet been successful1. This is partly due to small sample sizes and the low availability of objective, transparent, and reproducible EEG preprocessing and analysis methods.
In recent years, notable efforts have been made to automatize, speed up, and increase the transparency of EEG research. A standardized EEG Brain Imaging Data Structure (EEG-BIDS) has been created, which allows for the efficient organization, sharing, and reuse of EEG data10. Moreover, automatic preprocessing pipelines have been developed, some for specific populations, settings, and study designs, including pediatric populations11,12, mobile brain-body imaging13, and event-related potentials (ERPs)14,15, and others for more generic purposes15,16,17,18. Beyond, guidelines for the standardized reporting of EEG studies have been established19. A logical next step is to integrate these solutions into an automatic workflow that can preprocess and extract physiologically informative features in large EEG datasets.
Here, we present DISCOVER-EEG, a comprehensive EEG pipeline for resting state data that extends current preprocessing pipelines by extracting and visualizing physiologically relevant EEG features for biomarker identification. As translation of biomarkers benefits from being neuroscientifically plausible and interpretable1, DISCOVER-EEG extracts EEG features, such as oscillatory power, connectivity, and network characteristics that have previously been related to brain dysfunction in neurological and psychiatric disorders20,21,22,23. It builds upon and combines two open-source and widely-used Matlab toolboxes (EEGLAB24 and FieldTrip25) and adheres to COBIDAS-MEEG guidelines for reproducible MEEG research19. It facilitates the aggregation and analysis of large-scale datasets, as it applies to a wide range of EEG setups, and fosters sharing and reusability of the data by handling EEG-BIDS standardized data10.
We tested DISCOVER-EEG in two large and openly-available datasets, the LEMON dataset26, which includes resting state EEG recordings of 213 young and old healthy participants, and the TDBRAIN dataset27, which includes resting state EEG recordings of 1274 participants mainly with psychiatric conditions. In both datasets, we demonstrate the capability of the pipeline to capture a well-known EEG effect: the reduction of alpha power during eyes open compared to eyes closed28. Finally, using the LEMON dataset, we present an example analysis investigating differences in EEG features between old and young healthy populations that could inspire the development of biomarkers of healthy aging. Thus, the DISCOVER-EEG pipeline facilitates the preprocessing and analysis of large EEG datasets, promoting open and reproducible research on brain function.
Open-source and FAIR code
We developed an automated workflow for fast preprocessing, analysis, and visualization of resting state EEG data (Fig. 1) following open science and FAIR principles (Findability, Accessibility, Interoperability, and Reusability)29. The code of the DISCOVER-EEG pipeline is published on GitHub and co-deposited at Zenodo, where it is uniquely referenced by a DOI30 (Findability). The code can be easily downloaded (Accessibility) and receive contributions (please refer to the section Code availability). To ensure its Interoperability and Reusability, the pipeline is based on two open-source Matlab toolboxes, EEGLAB24 and FieldTrip25, which are widely used, maintained, and supported by the developers and the neuroimaging community (i.e., through forums and mailing lists). Basing the pipeline on validated and established software ensures its compatibility with future software updates. Moreover, it facilitates interaction with experts in EEG analysis, who also gave advice and supported the pipeline during its development. The code of the current pipeline is intended to represent a basis that will integrate and benefit from feedback from the neuroimaging community.
Data reusability and large-scale data handling
As biomarker discovery needs large datasets, we designed the pipeline in view of data reusability and large-scale data handling. To this end, we followed the FAIR principles of scientific data management31 and incorporated the EEG-BIDS standardized data structure10 as a mandatory input of the pipeline. Most EEG setups and electrode configurations are compatible with the pipeline thanks to the EEGLAB plugin bids-matlab-tools. We refer the reader to the Results section for a demonstration of performance across two published datasets with different characteristics, such as participant sample, number of channels, recording length, and sampling rate.
The pipeline was designed for resting state EEG data, during which spontaneous neural activity is recorded. Resting state data can be recorded easily in healthy and patient populations, in different study designs (e.g., cross-sectional, longitudinal), and in different types of neuropsychiatric disorders. Therefore, the use of resting state data facilitates the application to different settings and research questions. EEG recordings might also be accompanied by standardized patient-reported outcomes, such as the PROMIS questionnaires32, which can assess symptoms (e.g., pain, fatigue, anxiety, depression) across different neuropsychiatric disorders and, thus, enable cross-disorder analyses. Together, these considerations contribute to the scalability and generalizability of the workflow.
Ease of use, transparency, and interpretability
The pipeline consists of a main function, main_pipeline.m, in which the preprocessing, feature extraction, and visualization of the data are carried out, and a params.json file, in which parameters for preprocessing and feature extraction are defined. This latter file is the only one that needs to be configured by the user and can be easily adapted to dataset and/or user-specific demands. When executing the pipeline, parameters are saved to a separate json file to ensure reproducibility. We recommend DISCOVER-EEG users reporting all parameter configurations as well as software versions when reporting their findings.
Additionally, the pipeline focuses on transparency and interpretability of results. For that reason, single images and an optional PDF report can be generated for each recording to visualize the intermediate steps of the preprocessing (Fig. 2) and the extracted EEG features (Fig. 3). These visualizations can serve as quality control checkpoints and help to detect shorter or corrupted files, misalignment of electrodes, or missing data through fast visual inspection33. In that way, we do not exclude any recordings during preprocessing based on quality criteria. Instead, we provide visualization tools that let the users decide how conservative to be in their analysis. Along with their visualization, the preprocessed data and the extracted features of each recording are saved to separate files that can be later used for statistical group analysis and biomarker discovery. These output files can be easily imported to other statistical packages or working environments, such as R34 or Python35, e.g., for applying machine learning or deep learning models.
We developed a Matlab-based EEG processing pipeline that complements previous approaches in other programming languages, such as the MNE-BIDS-pipeline (https://mne.tools/mne-bids-pipeline/1.3/index.html) in Python. Matlab is widely used by the EEG community and enabled us to use well-established Matlab-based EEG toolboxes (please refer to the Design principles section) that provide robust functions for computing functional connectivity measures. Thus, we followed a pragmatic approach towards preprocessing and adopted a simple, established, and automatic workflow in EEGLAB proposed by Pernet et al.14 and originally developed for ERP data. We adapted this pipeline to resting-state data and detail the seven preprocessing steps below.
Loading the data
The pipeline input must be EEG data in BIDS format10, including all mandatory sidecar files. We recommend checking the BIDS compliance with the BIDS validator (https://bids-standard.github.io/bids-validator). By default, only EEG channels with standard electrode positions in the 10-5 system36 are preprocessed. However, it is possible to preprocess non-standard channels if all electrode positions and their coordinate system are specified in the BIDS sidecar file. Optionally, after loading, the data can be downsampled to a frequency specified by the user after applying an anti-aliasing filter.
1. Line noise removal
Line noise is removed with the EEGLAB function pop_cleanline(). The CleanLine plugin adaptively estimates and removes sinusoidal artifacts using a frequency-domain (multi-taper) regression technique. CleanLine, compared to band-stop filters, does not introduce gaps in the power spectrum and avoids the frequency distortions filters create. This step was added to the Pernet et al.14 pipeline to explore brain activity at gamma frequencies (>30 Hz). The line noise frequency (e.g., 50 Hz or 60 Hz) must be specified in the BIDS file sub-<label>_eeg.json to be appropriately removed.
2. High pass filtering and bad channel rejection
Artifactual channels are detected and removed with the function pop_clean_rawdata(). The first step of this function is the application of a high pass filter with the function clean_drifts() with a default transition band of 0.25 to 0.75 Hz. A channel is considered artifactual if it meets any of the following criteria: 1) If it is flat for more than 5 seconds, 2) If the z-scored noise-to-signal ratio of the channel is higher than a threshold set to 4 by default, 3) If the channel’s time course cannot be predicted from a randomly selected subset of remaining channels at least 80% of the recorded time. Channels marked as artifactual are removed from the data. The mentioned parameters are the defaults proposed by Pernet et al.14.
Data is re-referenced to the average reference with the function pop_reref(). Optionally, the time course of the original reference channel can be reconstrued and added back to the data if the user specifies it in the file params.json. The name of the reference electrode name must be specified in the BIDS file sub-<label>_eeg.json.
4. Independent Component Analysis and automatic IC rejection
Independent Component Analysis (ICA) is performed with the function pop_runica() using the algorithm runica. Artifactual components are automatically classified into seven distinct categories (‘Brain’, ‘Muscle’, ‘Eye’, ‘Heart’, ‘Line Noise’, ‘Channel Noise’, and ‘Other’) by the ICLabel classifier37. Note that ICA is performed only on clean channels, as bad channels detected in step 2 were removed from the data. Therefore, the category ‘Channel Noise’ in the IC classification refers to channel noise remaining after bad channel removal in step 2. By default, only components whose probability of being ‘Muscle’ or ‘Eye’ is higher than 80% are subtracted from the data14. Due to the non-deterministic nature of the ICA algorithm, its results vary across repetitions. That is, every repetition of the ICA algorithm leads to small differences in the reconstructed time series after removing artifactual components. While these deviations are small, we noticed that they affect the removal of bad time segments (step 6). For that reason, the pipeline performs 10 times steps 4, 5, and 6 by default and selects the bad time segment mask most similar to the average bad time segment mask across all repetitions. This reduces the variability of rejected time segments. However, for computational economy, the number of iterations may be set to one in the params.json file.
5. Interpolation of removed channels
Channels that were removed in step 2 are interpolated with the function pop_interp() using spherical splines38. This step was added to the Pernet et al.14 pipeline to hold the number of channels constant across participants.
6. Bad time segment removal
Time segments containing artifacts are removed with the Artifact Subspace Reconstruction (ASR) method39 implemented in the function pop_clean_rawdata(). This method automatically removes segments in which power is abnormally strong. First, a clean data segment is identified according to the default ASR settings and used for calibration. Calibration data contains all data points in which less than 7.5% of channels are noisy. Here, a channel is defined as noisy if the standard deviation of its RMS is higher than 5.5. Therefore, the length of the calibration data depends on the specific recording. Then, in a sliding window fashion, the whole EEG signal is decomposed via PCA, and the principal subspaces of the window segment are compared with those of the calibration data. Segments with principal subspaces deviating from the calibration data (20 times higher variance) are removed. Again, default parameters are in line with Pernet et al.14.
7. Data segmentation into epochs
Lastly, the continuous data are segmented into epochs with the function pop_epoch(). By default, data are segmented into 2-second epochs with a 50% overlap. Although longer epochs might be desirable for the Alpha Peak Frequency estimation to increase frequency resolution, short epochs favor the reliability of functional connectivity measures40,41. Thus, we propose 2-second epochs to establish a balance between frequency resolution, stationarity of the signal, and reliability of the later extracted features. Fifty percent overlap was chosen to provide a smooth estimation of the power spectra and mitigate the loss of signal due to tapering42. Epochs containing a discontinuity (e.g., because a segment containing an artifact was discarded) are rejected automatically. Data segmentation was adapted from Pernet et al.14, which focused on event-related data.
EEG feature extraction
DISCOVER-EEG extracts EEG features that have previously been related to different neurological and psychiatric disorders and have the potential to be translated into neuroscientifically plausible and interpretable biomarkers.
In electrode space, power spectra and the Alpha Peak Frequency, i.e., the frequency at which a peak in the power spectrum in the alpha range occurs, are extracted. Power changes in different frequency bands have been found in a broad spectrum of neuropsychiatric disorders20, and APF has been correlated with behavioral and cognitive characteristics43 in aging and disease44.
In source space, two measures of functional connectivity in the theta, alpha, beta, and gamma bands, are extracted. Additionally, brain networks derived from these connectivity matrices are further characterized with common graph theory measures45. Functional connectivity is a promising biomarker candidate, as it has been successfully used for the classification, tracking, and stratification of patients with several neurological and psychiatric disorders, such as Alzheimer’s disease46, Post Traumatic Stress Disorder8, and Major Depressive Disorder7,47 (for a recent review see48). In EEG, functional connectivity measures are commonly classified as phase-based or amplitude-based, each type capturing different and complementary communication processes in the brain49. This pipeline thus includes a phase-based measure, the debiased weighted Phase Lag Index (dwPLI)50, and an amplitude-based measure, the orthogonalized Amplitude Envelope Correlation (AEC)51. Both functional connectivity measures are undirected and have low susceptibility to volume conduction. The dwPLI is also more sensitive and capable of capturing non-linear relationships compared to other phase-based measures42. For each connectivity matrix, two local graph measures (the degree and the clustering coefficient) are calculated at each source location, and three global graph measures (the global clustering coefficient, the global efficiency, and the smallworldness) summarize the whole network in one value.
Power and connectivity features are computed using the preprocessed and segmented data in FieldTrip. Graph theory measures are computed with the Brain Connectivity Toolbox45. Specific parameters on feature extraction are defined in the file params.json and detailed below.
1. Power spectrum
Power spectra are computed with the FieldTrip function ft_freqanalysis between 1 and 100 Hz using Slepian multitapers with +/−1 Hz frequency smoothing. For 2-second epochs, the maximum frequency resolution of the power spectrum is, by definition, the inverse of the epoch length, i.e., 0.5 Hz. As frequency band limits are determined by the spectral resolution, the pipeline zero-pads the epochs to 5 seconds, which yields a resolution of 0.1 Hz, to better capture the frequency range of the bands. Thus, the frequency band limits for theta, alpha, beta, and gamma are 4 to 7.9 Hz, 8 to 12.9 Hz, 13 to 30 Hz, and 30.1 to 80 Hz, respectively. These limits are defined by the COBIDAS-MEEG19 guidelines and maintained throughout all features. Power spectra are computed for every channel and then averaged across epochs and channels to obtain a single global power spectrum. This global power spectrum is saved to sub-<label>_power.mat files and visualized with the different frequency bands highlighted in different colors (Fig. 3).
2. Alpha Peak Frequency
APF is computed based on the global power spectrum in the alpha range (8 to 12.9 Hz). There are two widely accepted strategies to assess the APF in the literature52. The first one reports the frequency at which the highest peak in the alpha range occurs (peak maximum). This comes with the problem that not always a peak is present in the power spectrum. Therefore, the second strategy, calculating the center of gravity (c.o.g.) of the power spectra in the alpha band is also frequently used53. DISCOVER-EEG reports both measures for estimating APF. The peak maximum is computed with the Matlab function findpeaks(). If there is no peak in the alpha range, no value is returned. The center of gravity is computed as the weighted average of frequencies in the alpha band, each frequency being weighted by their power53. Both measures are visualized in the same figure as the global power spectrum (Fig. 3). Individual APF values are also saved to the sub-<label>_peakfrequency.mat files.
3. Source reconstruction
To mitigate the volume conduction problem when computing functional connectivity measures54, we perform a source reconstruction of the preprocessed data with an atlas-based beamforming approach55. For each frequency band, the band-pass filtered data from sensor space is projected into source space using an array-gain Linear Constrained Minimum Variance (LCMV) beamformer56. As source model, we selected the centroids of 100 regions of interest (ROIs) of the 7-network version of the Schaefer atlas57. This atlas is a refined version of the Yeo atlas and follows a data-driven approach in which 100 parcellations are clustered and assigned to 7 brain networks (Visual, Somato-Motor, Dorsal attention, Salience-Ventral attention, Limbic, Control, and Default networks). This atlas can be easily changed according to the user’s preferences in the params.json file. The lead field is built using a realistically shaped volume conduction model based on the Montreal Neurological Institute (MNI) template available in FieldTrip (standard_bem.mat) and the source model. Spatial filters are finally constructed with the covariance matrices of the band-passed filtered data and the described lead fields. A 5% regularization parameter is set to account for rank deficiencies in the covariance matrix, and the dipole orientation is fixed to the direction of the maximum variance following the most recent recommendations58. The power for each source location is estimated using the spatial filter and band-passed data. A visualization of the source power in each frequency band (Fig. 3) is provided by projecting the band-specific source power to a cortical surface model provided as a template in FieldTrip (surface_white_both.mat).
4. Functional connectivity
After creating spatial filters in the four frequency bands, virtual time series in the 100 source locations are reconstructed for each frequency band by applying the respective band-specific spatial filter to the band-pass filtered sensor data. Then, the two functional connectivity measures (dwPLI and AEC) are computed for each frequency band and combination of the 100 reconstructed virtual time series. Average connectivity matrices for each band are visualized (Fig. 3) and saved to separate files (sub-<label>_<conmeasure>_<band>.mat).
The phase-based connectivity measure dwPLI is computed using the FieldTrip function ft_connectivityanalysis with the method wpli_debiased, which requires a frequency structure as input. Therefore, Fourier decompositions of the virtual time series are calculated in each frequency band with a frequency resolution of 0.5 Hz. Thereby, a connectivity matrix is obtained for each frequency of interest in the current frequency band. Connectivity matrices are then averaged across each frequency band resulting in one 100 × 100 connectivity matrix for each frequency band.
The amplitude-based connectivity measure AEC is computed according to the original equations with a custom function compute_aec, as the original implementation was not available in FieldTrip. For each epoch, the analytical signal of the virtual time series is extracted at each source location with the Hilbert transform. For each source pair, the analytical signal at source A is orthogonalized with respect to the analytical signal at source B, yielding the signal A⊥B. Then, the Pearson correlation is computed between the amplitude envelope of signals B and A⊥B. To obtain the average connectivity between sources A and B, the Pearson correlation between the amplitude envelopes of analytical signal A and signal B⊥A is also computed, and the two correlation coefficients are averaged. In this way, we obtain a 100 × 100 connectivity matrix for each epoch. We finally average the connectivity matrices across epochs, resulting in one 100 × 100 connectivity matrix for each frequency band.
5. Brain network characteristics
Graph measures were computed on thresholded and binarized connectivity matrices45. Matrices were binarized by keeping the 20% strongest connections, as this threshold delivers fairly reproducible graph measures based on dwPLI and AEC connectivity59. Nevertheless, it is good practice to test the reliability of final results with different binarizing thresholds60. This threshold can be easily changed in the file params.json.
The computed local network measures are the degree and the clustering coefficient. The degree is the number of connections of a node in the network. The clustering coefficient is the percentage of triangle connections surrounding a node. The measures, thus, assess the global and local connectedness of a node, respectively.
The computed global network measures are the global clustering coefficient, the global efficiency, and the smallworldness. The global clustering coefficient is a measure of functional segregation in the network and is defined as the average clustering coefficient of all nodes. Global efficiency is a measure of functional integration in the network and is defined as the average of the inverse shortest path length between all pairs of nodes. High global efficiency indicates that information can travel efficiently between regions that are far away. Smallworldness compares the ratio between functional integration and segregation in the network against a random network of the same size and degree. Smallworld networks are highly clustered and have short characteristic path length (average shortest path length between all nodes) compared to random networks61. Local and global measures are visualized per recording and saved to files named sub-<label>_graph_<conmeasure>_<band>.mat (Fig. 3).
Testing DISCOVER-EEG in two large, public datasets
We tested the DISCOVER-EEG pipeline in two well-documented and openly-available resting state EEG datasets, the LEMON dataset26,62, including 213 healthy participants, and the TDBRAIN dataset27,63, including 1274 participants, mainly with a psychiatric condition. Below we detail the characteristics of both datasets and the effects of preprocessing using the DISCOVER-EEG pipeline for resting state recordings under eyes open and eyes closed conditions. These results demonstrate that the pipeline works successfully across different EEG systems, electrode numbers and layouts, sampling rates, and recording lengths.
We additionally show that the pipeline can capture a well-known neurophysiological phenomenon in both datasets: the difference in oscillatory alpha power between eyes open and eyes closed conditions28. A comparison of power spectra between eyes open and eyes closed has previously been reported for the TDBRAIN dataset to prove the neurophysiological validity of the data27. Reproducing this effect with a different preprocessing strategy for the TDBRAIN dataset and confirming it for the first time for the LEMON dataset provides evidence for the capability of the DISCOVER-EEG pipeline to capture well-known features of EEG data.
The LEMON dataset is a publicly available resting state EEG dataset of 213 young and old healthy participants acquired in Leipzig (Germany) to study mind-body-emotion interactions26. This dataset is divided into two groups based on the age of the participants: a ‘young’ group (20 to 35 years old; N = 143; 43 females) and an ‘old’ group (55 to 80 years old; N = 70; 35 females). One participant (male) had an intermediate age and was not included in the posterior statistical analyses (next Results section). Resting state EEG was recorded with a BrainAmp MR plus amplifier using 62 active ActiCAP electrodes (61 scalp electrodes in the 10-10 system positions and 1 VEOG below the right eye) provided by Brain Products Gmbh, Gilching, Germany. The ground electrode was located at the sternum, and the reference electrode was FCz. The recording sampling rate was 2500 Hz. Recordings contained 16 blocks of 1-minute duration, 8 with eyes closed and 8 with eyes open in an interleaved fashion. Before the execution of the pipeline, all blocks corresponding to eyes closed and eyes open conditions were extracted and concatenated by condition, including boundaries between blocks. In both conditions, two recordings were discarded because the data were truncated. In the eyes closed condition, an additional recording was discarded because there were no eyes closed marker events. Therefore, the final sample sizes were 210 recordings for eyes closed and 211 for eyes open. For both conditions, recordings of 8 minutes duration were entered into the pipeline.
The TDBRAIN dataset is a publicly available, heterogenous resting state EEG dataset aggregated to obtain neurophysiological insights into psychiatric disorders27. It includes 1274 participants, most of them suffering from a psychiatric disorder. The primary diagnoses are Major Depressive Disorder (N = 426), Attention Deficit Hyperactivity Disorder (N = 271), Subjective Memory Complains (N = 119), and Obsessive-Compulsive Disorder (N = 75). The dataset also includes healthy participants (N = 47) and participants with unknown diagnoses (N = 255). Resting state EEG was recorded with a Compumedics Quickcap or ANT-Neuro Wavegurard Cap using 26 EEG Ag/AgCl electrodes positioned according to the 10-10 system. Additionally, five electrodes recorded vertical and horizontal eye movements, one electrode recorded muscle activity at the masseter, and one electrode captured the electrocardiogram at the clavicle bone. The ground electrode was placed at AFz, and recordings were referenced offline to averaged mastoids (A1 and A2). The recording sampling frequency was 500 Hz. Resting state data were acquired with eyes closed and eyes open in two respective blocks of 2 minutes duration. Some participants had more than one recording session, but only one recording per participant was entered into the pipeline. In total, resting state recordings for both eyes closed and eyes open conditions were available for 1274 individuals.
Table 1 and Fig. 4 show an overview of the DISCOVER-EEG preprocessing results, comparing datasets with different numbers of EEG electrodes (61 for the LEMON and 26 for the TDBRAIN dataset) and recording lengths (8 minutes for the LEMON and 2 minutes for the TDBRAIN datasets). These results point to a fair amount of data remaining after preprocessing and are in the same range as other automatic preprocessing pipelines11,16. They could be useful in future studies interested in benchmarking different EEG configurations or EEG preprocessing strategies.
To validate the pipeline’s outcomes, we replicated the well-known physiological effect of alpha power attenuation in eyes open compared to eyes closed conditions in both datasets. To this end, we estimated the power spectrum for each recording and condition as described in the feature extraction section. For visualization, we performed a grand average across participants and channels for each dataset and condition (Fig. 4). To test for differences between conditions, we performed a dependent samples cluster-based permutation test64 across frequencies in the range of 1 to 100 Hz. The tests were two-tailed and based on 500 randomizations. Cluster-level statistics were calculated by taking the sum of the t-values within each cluster. The cluster threshold and alpha level were set to 0.05. In the LEMON dataset, a significant positive cluster indicated higher power values in the eyes closed condition in the 1 to 37.2 Hz range (p = 0.002, cluster statistic = 1310). In the TDBRAIN dataset, a positive cluster was found in the 1 to 23.2 Hz range (p = 0.002, cluster statistic = 2037) along with a negative cluster in the 25.2 to 99 Hz range (p = 0.002, cluster statistic = −3254). These results indicate lower power in the eyes open condition in the alpha range (8 to 12.9 Hz) for both datasets, as shown with previous analyses of the TDBRAIN dataset27.
Example analysis with DISCOVER-EEG
To exemplify how DISCOVER-EEG features could be aggregated and analyzed, we conducted an exploratory analysis of the LEMON dataset to investigate age-related differences in resting state EEG. Recently, the measure of brain age, i.e., the expected level of cognitive function of a person with the same chronological age, has been proven a valuable marker of cognitive decline in healthy and clinical populations65,66. Brain age is usually estimated by machine and deep learning models trained to predict the chronological age of a participant based on neuroimaging data. One criticism of these methods is their limited neuroscientific interpretability, as it is not straightforward which features the models used to predict brain age65. It is, therefore, highly useful to explore which physiologically meaningful brain features change with age.
To this end, we statistically compared the features automatically extracted by the pipeline in the ‘old’ and ‘young’ groups using Bayesian statistics performed in Matlab with the bayesFactor package67 (Fig. 5). We also provide the scripts used to perform this group analysis and the visualization of group-level data together with the pipeline code.
We tested whether old participants had different APF values than young participants with two-sided independent samples Bayesian t-tests. Two tests were performed, one for each APF measure (the local maximum and the center of gravity). Results show very strong evidence in favor of the alternative hypothesis, i.e., a lower APF in the old compared to the young group, when the APF is computed as local maximum peak (BF10 = 325.5), but inconclusive evidence when the APF is computed as the center of gravity (BF10 = 1.1) (Fig. 5, first row).
We further tested whether there were differences in the connectivity matrices between young and old participants. For each connectivity measure (dwPLI and AEC) and frequency band (theta, alpha, beta, and gamma), we compared the connectivity values of each undirected source pair between the young and the old groups. Thus, we performed 9900 two-sided independent samples Bayesian t-tests per connectivity matrix. In Fig. 5, second row, we depict t-values color-coded to show the direction of effects. Statistical tests showing strong evidence in favor (BF10 > 30) or against (BF10 < 1/30) of the alternative hypothesis are not faded out. We observed strong evidence in favor of a reduction of phase-based connectivity in old participants, predominantly in the alpha band as well as a reduction of amplitude-based connectivity (Fig. 5, second row, non-masked blue values).
We finally tested whether there were any differences between old and young participants in the graph measures. For the local graph measures, we performed one test per source location, i.e., 100 independent sample Bayesian t-tests for each graph measure, connectivity measure, and frequency band. For the global measures, we performed a two-sided independent sample Bayesian t-test per graph measure, connectivity measure, and frequency band. With regard to global measures, the most prevalent differences appeared at low frequencies (theta and alpha) using the AEC (Fig. 5, third row, blue and red dots in brain sketches have BF10 > 30). The strongest effects concerning global measures are a reduction of global efficiency and smallworldness of the older group in the beta band for the dwPLI and the alpha band for the AEC (Fig. 5, third row, raincloud plots with inset BF10 indicating strong evidence). Together, these results show a reduction of local connectivity at theta and alpha frequencies and an increase in network integration at alpha and beta frequencies in older participants.
Overall, the current findings align with previous EEG literature, which has reported a slowing of APF and a general decrease in functional connectivity and network integrity in older individuals44,68.
This example analysis could inspire future studies aimed at discovering explainable biomarkers of healthy aging or risk biomarkers of cognitive decline. However, it does not intend to present a ready-to-use biomarker for aging, which would require validation beyond the scope of this manuscript.
Here, we present DISCOVER-EEG, an open and fully automated pipeline that enables fast and easy aggregation, preprocessing, analysis, and visualization of resting state EEG data. The current pipeline builds upon state-of-the-art automated preprocessing elements and extends them by including the computation and visualization of physiologically relevant EEG features. These EEG features follow recent COBIDAS guidelines for MEEG research19, are implemented in widely used EEG toolboxes, and have been repeatedly associated with cognitive and behavioral measures in healthy as well in neuropsychiatric populations. Therefore, they could represent promising biomarker candidates for neurological and psychiatric disorders.
Importantly, this pipeline presents one of many ways of preprocessing and analyzing resting state data and is not intended to be the ultimate solution for EEG analysis. Instead, it aims to represent a reasonable and pragmatic realization for accelerating the acquisition, preprocessing, and analysis of large-scale datasets with the potential to discover physiologically plausible and interpretable biomarkers. To this end, we adopted a previously published, simple, and robust preprocessing strategy and selected a well-defined set of EEG features that could be useful for biomarker discovery. However, this pipeline might not suit all populations, settings, or research paradigms, such as children, real-life ecological EEG assessments, or paradigms assessing event-related potentials. Additionally, specific EEG layouts, such as those with few electrodes and limited scalp coverage, might not be adequate for source localization or average referencing. However, it is compatible with different types of EEG systems, and its focus on resting state data makes it suitable for patients and healthy populations. The features extracted by DISCOVER-EEG represent a basis for developing neurophysiologically plausible and interpretable biomarkers of neurological and psychiatric disorders. They can be easily extended and adapted to specific study designs, as the modular configuration of the code allows for substituting, removing, or adding specific steps of the preprocessing and feature extraction.
By default, the pipeline uses a realistically shaped volume conduction model based on the Montreal Neurological Institute (MNI) template for source localization. For optimal accuracy of source reconstruction, individual MRIs or at least individual electrode positions could be recorded along with the EEG data. However, this would substantially increase the effort and time needed to acquire data and would hinder the fast generation of large new datasets. For that reason, the pipeline uses a generic template for source localization. However, it might be desirable to have different templates that better reflect the variability of head shapes in the future.
On a broader scope, it should always be considered whether datasets used for biomarker discovery are representative of the population of interest or whether they are biased towards young, Caucasian, highly educated populations, as is often the case69. The participants of the studies used here to test the pipeline were recruited based on convenience sampling and, therefore, might not cover the entire population. The creation and adoption of data standards such as BIDS will help to mitigate this fact by promoting collaboration and data sharing around the world.
Our intention with this pipeline was to push the field of EEG biomarker discovery forward to acquiring and analyzing large datasets, as needed in neuroimaging and artificial intelligence. Moreover, the provided example analysis can serve as a starting point for researchers who want to use complex measures of brain function. However, we do not claim to present a ready-to-use biomarker for aging, nor that the approach we propose is the best or only way to develop such a biomarker. Instead, it is intended as an offer to the community to assess physiologically plausible and interpretable EEG features in an efficient, transparent, and reproducible manner. Therefore, it can help the discovery of EEG-based biomarkers in neuropsychiatric disorders and promotes and facilitates open and reproducible assessments of brain function in EEG communities and beyond.
The EEG pipeline code is available at GitHub under the CC-BY 4.0 license, and it is co-deposited in Zenodo, and referenced with a unique DOI30.
The pipeline was created and tested in Matlab 2020b (The Mathworks, Inc.) on Ubuntu 18.04.5 LTS with the Signal Processing and Statistical and Machine Learning Toolboxes installed. EEGLab (v2022.0)24 with the plugins bids-matlab-tools (v6.1), bva-io (v1.7), firfilt, (v2.4), cleanLine (v2.0), ICLabel (v1.3), clean_rawdata (v2.6) and dipfilt (v4.3) were installed and used for preprocessing. FieldTrip (revision ee916f5e5)25 was used for source reconstruction and EEG feature extraction, and the Brain Connectivity Toolbox (version 03 2019)45 was used for network analysis.
Woo, C. W., Chang, L. J., Lindquist, M. A. & Wager, T. D. Building better biomarkers: brain models in translational neuroimaging. Nat Neurosci 20, 365–377, https://doi.org/10.1038/nn.4478 (2017).
FDA-NIH Biomarker Working Group. BEST (Biomarkers, EndpointS, and other Tools) Resource, https://www.ncbi.nlm.nih.gov/books/NBK326791/ (2016).
Botvinik-Nezer, R. et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88, https://doi.org/10.1038/s41586-020-2314-9 (2020).
Marek, S. et al. Reproducible brain-wide association studies require thousands of individuals. Nature 603, 654–660, https://doi.org/10.1038/s41586-022-04492-9 (2022).
Drysdale, A. T. et al. Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat Med 23, 28–38, https://doi.org/10.1038/nm.4246 (2017).
Wu, W. et al. An electroencephalographic signature predicts antidepressant response in major depression. Nat Biotechnol 38, 439–447, https://doi.org/10.1038/s41587-019-0397-3 (2020).
Zhang, Y. et al. Identification of psychiatric disorder subtypes from functional connectivity patterns in resting-state electroencephalography. Nat Biomed Eng 5, 309–323, https://doi.org/10.1038/s41551-020-00614-8 (2021).
Toll, R. T. et al. An Electroencephalography Connectomic Profile of Posttraumatic Stress Disorder. Am J Psychiatry 177, 233–243, https://doi.org/10.1176/appi.ajp.2019.18080911 (2020).
Ta Dinh, S. et al. Brain dysfunction in chronic pain patients assessed by resting-state electroencephalography. Pain 160, 2751–2765, https://doi.org/10.1097/j.pain.0000000000001666 (2019).
Pernet, C. R. et al. EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Sci Data 6, 103, https://doi.org/10.1038/s41597-019-0104-8 (2019).
Gabard-Durnam, L. J., Mendez Leal, A. S., Wilkinson, C. L. & Levin, A. R. The Harvard Automated Processing Pipeline for Electroencephalography (HAPPE): Standardized Processing Software for Developmental and High-Artifact Data. Front Neurosci 12, 97, https://doi.org/10.3389/fnins.2018.00097 (2018).
Debnath, R. et al. The Maryland analysis of developmental EEG (MADE) pipeline. Psychophysiology 57, e13580, https://doi.org/10.1111/psyp.13580 (2020).
Klug, M. et al. The BeMoBIL Pipeline for automated analyses of multimodal mobile brain and body imaging data. bioRxiv https://doi.org/10.1101/2022.09.29.510051 (2022).
Pernet, C. R., Martinez-Cancino, R., Truong, D., Makeig, S. & Delorme, A. From BIDS-Formatted EEG Data to Sensor-Space Group Results: A Fully Reproducible Workflow With EEGLAB and LIMO EEG. Front Neurosci 14, 610388, https://doi.org/10.3389/fnins.2020.610388 (2020).
Pedroni, A., Bahreini, A. & Langer, N. Automagic: Standardized preprocessing of big EEG data. NeuroImage 200, 460–473, https://doi.org/10.1016/j.neuroimage.2019.06.046 (2019).
Rodrigues, J., Weiss, M., Hewig, J. & Allen, J. J. B. EPOS: EEG Processing Open-Source Scripts. Frontiers in Neuroscience 15, https://doi.org/10.3389/fnins.2021.660449 (2021).
Bailey, N. W. et al. Introducing RELAX: An automated pre-processing pipeline for cleaning EEG data- Part 1: Algorithm and application to oscillations. Clinical Neurophysiology 149, 178–201, https://doi.org/10.1016/j.clinph.2023.01.017 (2023).
Appelhoff, S. et al. MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. J Open Source Softw 4, https://doi.org/10.21105/joss.01896 (2019).
Pernet, C. et al. Issues and recommendations from the OHBM COBIDAS MEEG committee for reproducible EEG and MEG research. Nat Neurosci 23, 1473–1483, https://doi.org/10.1038/s41593-020-00709-0 (2020).
Newson, J. J. & Thiagarajan, T. C. EEG Frequency Bands in Psychiatric Disorders: A Review of Resting State Studies. Frontiers in human neuroscience 12, https://doi.org/10.3389/fnhum.2018.00521 (2019).
Ploner, M. & Tiemann, L. Exploring Dynamic Connectivity Biomarkers of Neuropsychiatric Disorders. Trends in cognitive sciences 25, 336–338, https://doi.org/10.1016/j.tics.2021.03.005 (2021).
Bassett, D. S., Xia, C. H. & Satterthwaite, T. D. Understanding the Emergence of Neuropsychiatric Disorders With Network Neuroscience. Biol Psychiatry Cogn Neurosci Neuroimaging 3, 742–753, https://doi.org/10.1016/j.bpsc.2018.03.015 (2018).
de Lange, S. C. et al. Shared vulnerability for connectome alterations across psychiatric and neurological brain disorders. Nat Hum Behav 3, 988–998, https://doi.org/10.1038/s41562-019-0659-6 (2019).
Delorme, A. & Makeig, S. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of neuroscience methods 134, 9–21, https://doi.org/10.1016/j.jneumeth.2003.10.009 (2004).
Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J. M. FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Computational intelligence and neuroscience 2011, 156869, https://doi.org/10.1155/2011/156869 (2011).
Babayan, A. et al. A mind-brain-body dataset of MRI, EEG, cognition, emotion, and peripheral physiology in young and old adults. Sci Data 6, 180308, https://doi.org/10.1038/sdata.2018.308 (2019).
van Dijk, H. et al. The two decades brainclinics research archive for insights in neurophysiology (TDBRAIN) database. Sci Data 9, 333, https://doi.org/10.1038/s41597-022-01409-z (2022).
Adrian, E. D. & Matthews, B. H. C. The Berger rhythm potential changes from the occipital lobes in man. Brain 57, 355–385, https://doi.org/10.1093/brain/57.4.355 (1934).
Barker, M. et al. Introducing the FAIR Principles for research software. Sci Data 9, 622, https://doi.org/10.1038/s41597-022-01710-x (2022).
Gil Ávila, C., Bott, F. S., Gross, J. & Ploner, M. crisglav/discover-eeg: 1.0.0 v. 1.0.0. Zenodo https://doi.org/10.5281/zenodo.8207523 (2023).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018, https://doi.org/10.1038/sdata.2016.18 (2019).
Cella, D. et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group During its First Two Years. Medical Care 45, S3–S11, https://doi.org/10.1097/01.mlr.0000258615.42478.55 (2007).
Niso, G. et al. Open and reproducible neuroimaging: From study inception to publication. NeuroImage 263, 119623, https://doi.org/10.1016/j.neuroimage.2022.119623 (2022).
R Core Team. R: A Language and Environement for Statistical Computing R Foundation for Statistical Computing, Vienna, Austria (2021).
Van Rossum, G. a. D., Fred L. Python 3 Reference Manual. (CreateSpace, 2009*).
Oostenveld, R. & Praamstra, P. The five percent electrode system for high-resolution EEG and ERP measurements. Clinical Neurophysiology 112, 713–719, https://doi.org/10.1016/s1388-2457(00)00527-7 (2001).
Pion-Tonachini, L., Kreutz-Delgado, K. & Makeig, S. ICLabel: An automated electroencephalographic independent component classifier, dataset, and website. NeuroImage 198, 181–197, https://doi.org/10.1016/j.neuroimage.2019.05.026 (2019).
Perrin, F., Pernier, J., Bertrand, O. & Echallier, J. F. Spherical splines for scalp potential and current density mapping. Electroencephalogr Clin Neurophysiol 72, 184–187, https://doi.org/10.1016/0013-4694(89)90180-6 (1989).
Mullen, T. R. et al. Real-Time Neuroimaging and Cognitive Monitoring Using Wearable Dry EEG. Ieee Transactions on Biomedical Engineering 62, 2553–2567, https://doi.org/10.1109/Tbme.2015.2481482 (2015).
Fraschini, M. et al. The effect of epoch length on estimated EEG functional connectivity and brain network organisation. Journal of neural engineering 13, 036015, https://doi.org/10.1088/1741-2560/13/3/036015 (2016).
van Diessen, E. et al. Opportunities and methodological challenges in EEG and MEG resting state functional brain network research. Clinical neurophysiology: official journal of the International Federation of Clinical Neurophysiology 126, 1468–1481, https://doi.org/10.1016/j.clinph.2014.11.018 (2015).
Cohen, M. X. Analyzing Neural Time Series Data: Theory and Practice. (MIT Press, 2014).
Haegens, S., Cousijn, H., Wallis, G., Harrison, P. J. & Nobre, A. C. Inter- and intra-individual variability in alpha peak frequency. NeuroImage 92, 46–55, https://doi.org/10.1016/j.neuroimage.2014.01.049 (2014).
Scally, B., Burke, M. R., Bunce, D. & Delvenne, J. F. Resting-state EEG power and connectivity are associated with alpha peak frequency slowing in healthy aging. Neurobiol Aging 71, 149–155, https://doi.org/10.1016/j.neurobiolaging.2018.07.004 (2018).
Rubinov, M. & Sporns, O. Complex network measures of brain connectivity: uses and interpretations. NeuroImage 52, 1059–1069, https://doi.org/10.1016/j.neuroimage.2009.10.003 (2010).
Badhwar, A. et al. Resting-state network dysfunction in Alzheimer’s disease: A systematic review and meta-analysis. Alzheimers Dement (Amst) 8, 73–85, https://doi.org/10.1016/j.dadm.2017.03.007 (2017).
Gallo, S. et al. Functional connectivity signatures of major depressive disorder: machine learning analysis of two multicenter neuroimaging studies. Mol Psychiatry https://doi.org/10.1038/s41380-023-01977-5 (2023).
Cao, J. et al. Brain functional and effective connectivity based on electroencephalography recordings: A review. Hum Brain Mapp 43, 860–879, https://doi.org/10.1002/hbm.25683 (2022).
Engel, A. K., Gerloff, C., Hilgetag, C. C. & Nolte, G. Intrinsic coupling modes: multiscale interactions in ongoing brain activity. Neuron 80, 867–886, https://doi.org/10.1016/j.neuron.2013.09.038 (2013).
Vinck, M., Oostenveld, R., van Wingerden, M., Battaglia, F. & Pennartz, C. M. An improved index of phase-synchronization for electrophysiological data in the presence of volume-conduction, noise and sample-size bias. NeuroImage 55, 1548–1565, https://doi.org/10.1016/j.neuroimage.2011.01.055 (2011).
Hipp, J. F., Hawellek, D. J., Corbetta, M., Siegel, M. & Engel, A. K. Large-scale cortical correlation structure of spontaneous oscillatory activity. Nat Neurosci 15, 884–890, https://doi.org/10.1038/nn.3101 (2012).
Corcoran, A. W., Alday, P. M., Schlesewsky, M. & Bornkessel-Schlesewsky, I. Toward a reliable, automated method of individual alpha frequency (IAF) quantification. Psychophysiology 55, e13064, https://doi.org/10.1111/psyp.13064 (2018).
Klimesch, W., Schimke, H. & Pfurtscheller, G. Alpha frequency, cognitive load and memory performance. Brain Topogr 5, 241–251, https://doi.org/10.1007/BF01128991 (1993).
Bastos, A. M. & Schoffelen, J. M. A Tutorial Review of Functional Connectivity Analysis Methods and Their Interpretational Pitfalls. Frontiers in systems neuroscience 9, 175, https://doi.org/10.3389/fnsys.2015.00175 (2015).
Pellegrini, F., Delorme, A., Nikulin, V. & Haufe, S. Identifying good practices for detecting inter-regional linear functional connectivity from EEG. NeuroImage 277, 120218, https://doi.org/10.1016/j.neuroimage.2023.120218 (2023).
Van Veen, B. D., van Drongelen, W., Yuchtman, M. & Suzuki, A. Localization of brain electrical activity via linearly constrained minimum variance spatial filtering. IEEE transactions on bio-medical engineering 44, 867–880, https://doi.org/10.1109/10.623056 (1997).
Schaefer, A. et al. Local-Global Parcellation of the Human Cerebral Cortex from Intrinsic Functional Connectivity MRI. Cereb Cortex 28, 3095–3114, https://doi.org/10.1093/cercor/bhx179 (2018).
Westner, B. U. et al. A unified view on beamformers for M/EEG source reconstruction. NeuroImage 246, 118789, https://doi.org/10.1016/j.neuroimage.2021.118789 (2022).
Pourmotabbed, H., de Jongh Curry, A. L., Clarke, D. F., Tyler-Kabara, E. C. & Babajani-Feremi, A. Reproducibility of graph measures derived from resting-state MEG functional connectivity metrics in sensor and source spaces. Hum Brain Mapp 43, 1342–1357, https://doi.org/10.1002/hbm.25726 (2022).
Adamovich, T., Zakharov, I., Tabueva, A. & Malykh, S. The thresholding problem and variability in the EEG graph network parameters. Scientific Reports 12, 18659, https://doi.org/10.1038/s41598-022-22079-2 (2022).
Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442, https://doi.org/10.1038/30918 (1998).
Babayan, A. et al. Functional Connectomes Project International Neuroimaging Data-Sharing Initiative https://doi.org/10.15387/fcp_indi.mpi_lemon (2018).
van Dijk, H. et al. Two Decades - Brainclinics Research Archive for Insights in Neuroscience (TD-BRAIN). Synapse https://doi.org/10.7303/syn25671079 (2021).
Maris, E. & Oostenveld, R. Nonparametric statistical testing of EEG- and MEG-data. Journal of neuroscience methods 164, 177–190 (2007).
Cole, J. H. & Franke, K. Predicting Age Using Neuroimaging: Innovative Brain Ageing Biomarkers. Trends Neurosci 40, 681–690, https://doi.org/10.1016/j.tins.2017.10.001 (2017).
Engemann, D. A. et al. A reusable benchmark of brain-age prediction from M/EEG resting-state signals. NeuroImage 262, 119521, https://doi.org/10.1016/j.neuroimage.2022.119521 (2022).
Krekelberg, B. BayesFactor: Release 2022 v. 2.3.0. Zenodo https://doi.org/10.5281/zenodo.7006300 (2022).
Onoda, K., Ishihara, M. & Yamaguchi, S. Decreased Functional Connectivity by Aging Is Associated with Cognitive Decline. J Cognitive Neurosci 24, 2186–2198, https://doi.org/10.1162/jocn_a_00269 (2012).
Henrich, J., Heine, S. J. & Norenzayan, A. The weirdest people in the world? Behav Brain Sci 33, 61–83; discussion 83–135, https://doi.org/10.1017/S0140525X0999152X (2010).
Allen, M., Poggiali, D., Whitaker, K., Marshall, T. R. & Kievit, R. A. Raincloud plots: a multi-platform tool for robust data visualization. Wellcome Open Res 4, 63, https://doi.org/10.12688/wellcomeopenres.15191.1 (2019).
The study has been supported by the TUM Innovation Network Neurotechnology for Mental Health (NEUROTECH), the Deutsche Forschungsgemeinschaft (PL321/14-1) and the TUM School of Medicine (KKF).
Open Access funding enabled and organized by Projekt DEAL.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Gil Ávila, C., Bott, F.S., Tiemann, L. et al. DISCOVER-EEG: an open, fully automated EEG pipeline for biomarker discovery in clinical neuroscience. Sci Data 10, 613 (2023). https://doi.org/10.1038/s41597-023-02525-0