Background & Summary

The human brain’s ability to rapidly comprehend linguistic information and generate corresponding linguistic expressions is an indicator of its complex processing capabilities1. When exposed to linguistic stimuli, the human brain encodes the semantic information through neural activities2. By analyzing such neural activities, we can uncover the encoding mechanisms of semantics in the brain3. A variety of neural signals, including electroencephalography (EEG), functional magnetic resonance imaging (fMRI), and electrocorticography (ECoG), are employed in language-related tasks, from academic research such as investigating language processing mechanisms in the brain to practical applications such as language decoding in brain-computer interfaces (BCI)4,5,6,7,8,9. Recently, many neurolinguistic studies have used both traditional machine learning methods and modern deep learning methods from natural language processing (NLP) to explore language-related problems10,11,12,13,14,15,16. However, these data-driven methods rely heavily on massive and comprehensive datasets17. In the field of NLP, it is relatively easy to collect large amounts of natural language data. In contrast, acquiring a large volume of neural signals generated in response to natural language stimuli poses significant challenges. To exploit the power of modern data-driven methods, it is important to scale neural datasets to a level commensurate with state-of-the-art NLP, so that they encompass the wide range of language expressions encountered in daily life. Among all neuroimaging techniques, EEG holds great potential to meet this demand. EEG is non-invasive and cost-effective18, which allows the creation of long-duration neural signal datasets enriched with semantic information. Meanwhile, EEG features high temporal resolution19, which enables it to precisely capture the brain’s rapid dynamic changes during language processing.

Despite the abundance of EEG datasets for natural visual stimuli (e.g., THINGS-EEG)20,21,22,23, those for natural language stimuli remain scarce. Currently, only a few language-related EEG datasets exist, such as the ZuCo dataset24. However, the majority of these datasets are collected using stimuli from English language corpora. This leads to limited research on the neural representations of other languages like Chinese. The brain’s processing mechanisms differ for various languages. For example, the brain exhibits specificity in response to Chinese compared to English25. Therefore, it is important to create an EEG dataset based on other language stimuli. Chinese, being distinct from English in both structure and semantics, provides an opportunity to expand our understanding of neural responses to linguistic stimuli. An EEG dataset stimulated by Chinese corpora can facilitate the investigation of cross-linguistic commonalities and variations in language processing in the brain, bringing new perspectives to our understanding of language processing mechanisms.

To address these gaps, we have collected an EEG dataset named “ChineseEEG” (Chinese Linguistic Corpora EEG Dataset)26,27. It contains high-density EEG data and simultaneous eye-tracking data recorded from 10 participants, each silently reading Chinese text for about 13 hours. The text materials are sourced from two well-known novels, the Little Prince and Garnett Dream, both in their Chinese versions. The dataset further comprises multiple versions of pre-processed EEG sensor-level data generated under different parameter settings, offering researchers a range of options. Additionally, we provide embeddings of the Chinese text materials encoded by the BERT-base-chinese model, a pre-trained NLP model for Chinese28, to aid researchers in exploring the alignment between text embeddings from NLP models and information representations in neural signals.

ChineseEEG26,27 is a pilot EEG dataset specifically stimulated by Chinese text. It offers several advantages. Firstly, each participant was exposed to around 13 hours of diverse Chinese linguistic stimuli, encompassing a broad spectrum of semantic information. This extensive exposure is valuable for studying the long-term neural dynamics of language processing in the brain. Secondly, we recorded high-density EEG with 128 channels, which offers higher spatial resolution than low-density montages and supports more precise localization of brain regions involved in language processing. In addition, with a sampling rate of 1 kHz, the recordings capture the fast dynamics of neural representations during reading. Furthermore, the inclusion of pre-processed EEG data and text embeddings benefits scholars in neuroscience and computer science who lack interdisciplinary experience, enabling them to directly use well-processed data from fields they may not be familiar with.

ChineseEEG26,27 can serve as a valuable resource in neuroscience, linguistics, and other related fields. EEG data generated from Chinese language stimuli will support research within the Chinese context, aiding researchers in revealing the characteristics of brain signal representations under Chinese stimuli, and promoting the development of brain-to-text translation, semantic decoding, and other practical applications tailored to the Chinese context. The dataset can also bring diversity to the languages used in related research, encouraging the exploration of similarities and differences in language processing across languages. It can also aid multilingual alignment in NLP by aligning multilingual brain signals with natural languages29,30. Because the dataset includes text materials that are widely used in multilingual neuroimaging research, such as the Little Prince, ChineseEEG can be combined with prior datasets to extend its potential. For example, combining ChineseEEG with neural signal datasets for auditory language comprehension tasks under similar semantic stimuli31,32 can help to uncover the neural mechanisms of language understanding across perceptual modalities. In addition, researchers can integrate the semantically rich EEG data in ChineseEEG with other neuroimaging modalities used in language comprehension tasks, such as fMRI and MEG33,34, to characterize the brain’s spatio-temporal dynamics more precisely and thus enhance the understanding of the neural mechanisms of language processing.

Methods

Participants and task overview

We recruited 15 participants (18–29 years old, mean age 21.9 years, 9 males). Three participants took part in a pre-experimental test before the official experiment to verify the soundness of the experimental procedure and the stability of the devices. In the official experiment, 2 participants withdrew partway through due to scheduling conflicts, after communicating with the experimenter. In total, data from 10 participants were used (18–29 years old, mean age 22.7 years, 5 males). No participant reported a neurological or psychiatric history. All participants were right-handed and had normal or corrected-to-normal vision. Each participant enrolled voluntarily, signed the informed consent form before the experiment, and received a coupon worth approximately 50 MOP (MOP is the official currency of the Macao Special Administrative Region of China) for each experimental run (25 runs in total). This study complied with the Declaration of Helsinki and was performed according to the ethics committee approval of the Institutional Review Board of the University of Macau (Approval No. BSERE20-APP011-ICI).

Experimental material

The experimental materials consist of two novels, both in the genre of children’s literature. The first is the Chinese translation of the Little Prince (http://www.xiaowangzi.org/index.html) and the second is Garnett Dream (https://www.feiku6.com/read/s3-langwangmeng/18242419.html), both sourced from the Internet. Using novels, especially children’s literature, provides several advantages for research within a naturalistic paradigm. Firstly, given their length, these novels offer vast and diverse linguistic content, encompassing the majority of frequently used Chinese characters and daily expressions. Secondly, children’s literature can create an engaging environment for participants, keeping them focused and emotionally engaged in the experiment.

Each novel was used as the material for a single session in the experiment. Each session was divided into several runs. For the Little Prince, the preface was used as the material for the practice reading phase. The main body of the novel was then used for seven runs in the formal reading phase. The first six runs each include 4 chapters of the novel, while the seventh run includes the last 3 chapters. For Garnett Dream, the first 18 chapters were used for 18 runs in the formal reading phase, with each run including one complete chapter. Due to the loss of markers during EEG collection, run 18 of ses-GarnettDream of sub-07 is unusable. We asked this participant to redo the reading task using chapter 19 of Garnett Dream.

To present the text properly on the screen during the experiment, the content of each run was segmented into a series of units, with each unit containing no more than 10 Chinese characters. The segmented contents were saved in Excel (.xlsx) format for subsequent use. During the experiment, three adjacent units from each run’s content were displayed on the screen in three separate lines, with the middle line highlighted for the participant to read. The relevant code has been uploaded to the GitHub repository; see the Code availability section for details.
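As an illustration of this segmentation, the following is a minimal sketch and not the code from the repository. It assumes a greedy rule: a unit is closed once it holds 10 non-punctuation characters, and trailing punctuation stays with the unit it follows. The function names `segment` and `pages`, and the use of empty strings to pad the first and last page, are assumptions made for this example.

```python
import unicodedata

MAX_CHARS = 10  # limit stated in the text: at most 10 Chinese characters per unit


def is_punctuation(ch: str) -> bool:
    """True for Unicode punctuation (general category P*), e.g. '，' or '。'."""
    return unicodedata.category(ch).startswith("P")


def segment(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily cut text into units of at most max_chars non-punctuation characters."""
    units, current, count = [], "", 0
    for ch in text:
        if ch.isspace():
            continue  # line breaks and spaces are not displayed
        if not is_punctuation(ch) and count == max_chars:
            units.append(current)  # unit is full, start a new one
            current, count = "", 0
        current += ch
        if not is_punctuation(ch):
            count += 1
    if current:
        units.append(current)
    return units


def pages(units: list[str]) -> list[tuple[str, str, str]]:
    """One three-line page per unit: (upper line, highlighted middle line, lower line)."""
    padded = [""] + units + [""]  # assumed padding for the first and last page
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]
```

With this rule, consecutive pages share two lines, so the lower line of one page becomes the highlighted middle line of the next, which matches the scrolling behaviour described for the reading phase.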

The overview of experimental materials is shown in Table 1. In summary, a total of 115,233 characters (24,324 in the Little Prince and 90,909 in Garnett Dream), of which 2,985 characters are unique, are used as experimental stimuli in ChineseEEG dataset.

Table 1 An overview of the experiment.

Experimental procedures

Participants were instructed to sit in an adjustable chair, with their eyes positioned approximately 67 cm away from the monitor (Dell, width: 54 cm, height: 30.375 cm, resolution: 1,920 × 1,080 pixels, vertical refresh rate: 60 Hz), see Fig. 1b. They were tasked with reading a novel and were required to keep their heads still and keep their gaze on the highlighted (red) Chinese characters moving across the screen, reading at a pace set by the program. Eye tracking was used to confirm that participants followed the highlighted characters.

Fig. 1

Overview of the experiment and the modalities included in the dataset. (a) Equipment utilized in the experiment, including the EGI device for collecting EEG data and the Tobii Pro Glasses 3 eye-tracker for tracking eye movements. (b) The experiment setup. Participants were instructed to sit quietly approximately 67 cm from the screen and sequentially read the highlighted text. (c) The experimental protocol. Participants’ 128-channel EEG signals and eye-tracking data were recorded while reading the highlighted text. (d) The data modalities in the dataset. The dataset comprises raw data such as the original textual stimuli, eye movement data, EEG data, and derivatives such as text embeddings from pre-trained NLP models and pre-processed EEG data.

Each participant was required to complete 1 practice reading phase and 2 formal reading sessions. The Little Prince session was divided into 7 experimental runs and the Garnett Dream session into 18 experimental runs. Participants were required to finish all experimental runs over the span of 8 days. The total daily reading duration was set at approximately 1.5 hours to avoid fatigue. Specifically, the reading tasks for the first day comprised the practice reading phase and runs 1–4 of the Little Prince session. The tasks for the second day comprised runs 5–7 of the Little Prince session. From the third to the eighth day, each day’s reading tasks comprised 3 runs of the Garnett Dream session. While participants were given the flexibility to adjust their schedules, they were required to complete all reading tasks within one month.

Each experimental run lasted approximately 30 minutes and was divided into two phases: the eye-tracker calibration phase and the reading phase.

Phase 1: Eye-tracker calibration phase

At the beginning of each run, participants were required to undergo an eye-tracker calibration process. Initially, the message “Hello! Please press the spacebar to start calibration” was displayed at the screen’s center. Participants were instructed to keep their gaze at a fixation point, which sequentially appeared at the four corners and the center of the screen, each for 5 seconds. If the calibration failed, participants were prompted to start another calibration. Upon successful calibration, the message “Calibration successful! The page will automatically redirect in 5 seconds” was displayed at the center of the screen.

During the reading process, the accuracy of eye-tracking data can be influenced by several factors, including drift errors resulting from involuntary eye movements, as well as head movements and equipment positioning discrepancies. By performing calibrations at the beginning of each experimental run, these potential errors can be effectively mitigated, thereby ensuring the precision of the eye-tracking data.

Phase 2: Reading phase

After the calibration phase, participants were automatically directed to the reading phase. During the reading process, the screen initially displayed the serial number of the current chapter. Subsequently, the text appeared with three lines per page, ensuring each line contained no more than ten Chinese characters (excluding punctuation). On each page, the middle line was highlighted as the focal point, while the upper and lower lines were displayed with reduced intensity as the background. Each character in the middle line was sequentially highlighted with red color for 0.35 s, and participants were required to read the novel content following the highlighted cues. To facilitate a smooth reading experience, the text was designed to scroll automatically on the screen. Once participants finished reading the highlighted middle line, the text would scroll, moving the third line up to become the new middle line on the subsequent page.

The reading speed, which is slower than the typical speeds reported in previous studies35, was deliberately chosen. This speed was selected based on feedback from the pre-experimental test to maintain participants’ attention and minimize fatigue throughout the relatively long experimental run. The reading speed was fixed to enable character-level alignment between EEG segments and text. Additionally, fixed speed can also minimize the impact of external interference in the experiment and eliminate the impact of different reading speeds of different participants on subsequent analyses.
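Because the pace is fixed, the EEG window for each character follows from simple arithmetic. The sketch below is illustrative only: it assumes that the onset of a displayed line is known in seconds (for example from an event marker) and derives per-character sample ranges from the nominal 0.35 s highlight duration. The function name `char_windows` is hypothetical, and recorded marker onsets should be preferred over nominal timing where they are available.

```python
CHAR_DURATION_S = 0.35  # highlight duration per character, from the text
RAW_SFREQ = 1000        # sampling rate during recording (Hz), from the text
PREPROC_SFREQ = 256     # sampling rate after downsampling (Hz), from the text


def char_windows(line_onset_s, n_chars, sfreq):
    """Return one (start_sample, stop_sample) pair per character, stop exclusive."""
    windows = []
    for i in range(n_chars):
        start = line_onset_s + i * CHAR_DURATION_S
        stop = start + CHAR_DURATION_S
        windows.append((round(start * sfreq), round(stop * sfreq)))
    return windows


# A full line of 10 characters lasts 10 * 0.35 s = 3.5 s.
raw_windows = char_windows(0.0, 3, RAW_SFREQ)          # 350 samples per character
preproc_windows = char_windows(0.0, 3, PREPROC_SFREQ)  # 89 or 90 samples per character
```

At 256 Hz a character spans 89.6 samples, so rounding produces windows of 89 or 90 samples; analyses that need equal-length segments would have to crop or pad accordingly.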

To ensure the accuracy of both EEG and eye-tracking data, participants were instructed to consistently focus on the highlighted text, while avoiding significant body movements to maintain a stable reading position. This protocol was strictly enforced to reduce any potential drifts and artifacts in the recordings.

After each run, participants were given sufficient time to rest. They were instructed to start the subsequent run only when they explicitly reported being ready to proceed. Adequate rest time can mitigate fatigue and enable participants to sustain their attention throughout the experiment, thus ensuring the quality of both EEG and eye-tracking data. The experimenter also evaluated each participant’s performance and fatigue level through oral inquiries after each experimental run to ensure they could fully maintain their attention in subsequent runs. During each rest period, the experimenter replenished the saline solution on the electrodes of the EEG cap, which helped to maintain a low impedance, ensuring the collection of high-quality EEG data. Additionally, the experimenter checked the power status of the eye-tracker and replaced the batteries as necessary to ensure its continuous operation.

It should be noted that during the initial participation in the experiment, participants were required to complete a practice reading phase. The preface chapter of the Little Prince was selected as the reading material for this phase. All settings remained the same as those of the formal reading stage, to familiarize participants with the eye-tracker calibration process and the reading task.

The presentation of stimuli was managed using PsychoPy36 (v2023.2.3), with the EGI PyNetstation v1.0.1 module providing the connection between PsychoPy and EGI Netstation. We also used the g3pylib package to control the eye-tracker and record the eye movement trajectories of the participants.

Data collection and analysis

This section describes the data collection, pre-processing, and data analysis procedures. The modalities included in our dataset26,27 are shown in Fig. 1d and comprise raw data and derivatives. The raw data comprise the raw EEG data, eye-tracking data, and raw text materials; the derivatives comprise pre-processed EEG data and text embeddings generated by the pre-trained NLP model BERT-base-chinese.

EEG data collection

EEG data was acquired using an EGI 128-channel cap based on the GSN-HydroCel-128 montage with the Geodesic Sensor Net system (see Fig. 1a). The egi-pynetstation v1.0.1 package was used to control the EGI system. Before recording, the experimenter used a soft ruler to locate the position of the Cz electrode (i.e., the vertex at the center of the scalp) for each participant, ensuring consistent electrode placement across experimental runs. During recording, the sampling rate was 1 kHz. The impedance of each electrode was kept below 50 kΩ during the experiment. Setups and recording parameters are similar to those of our previous EEG dataset37. To precisely co-register EEG segments with individual characters during the experiment, we marked the EEG data with triggers (Table 2). The raw EEG data was exported to metafile format (.mff) files on the macOS system.

Table 2 EEG triggers.

Eye-tracking data collection

Eye-tracking data was acquired using Tobii Pro Glasses 3. The device features 16 illuminators and 4 eye cameras integrated into scratch-resistant lenses, along with a wide-angle scene camera, allowing for a comprehensive capture of participant behavior and environmental context (see Fig. 1a). Due to the long duration of our experiments, a lightweight eye-tracker was prioritized, and Tobii Pro Glasses 3 fulfilled this criterion. Tobii Pro Glasses 3 has a maximum sampling rate of 100 Hz. Given the relatively slow reading speed in our experiment, a sampling rate of 100 Hz is adequate for capturing the eye movement trajectories of the participants and assessing whether they were fixating on the highlighted text at specific moments. More information about Tobii Pro Glasses 3 can be found on the official website (https://www.tobii.com/products/eye-trackers/wearables/tobii-pro-glasses-3). We used the g3pylib package to control the glasses. The raw data was exported to .rar files.

EEG data pre-processing

To retain the maximum amount of valid information in the data, we performed minimal pre-processing, allowing researchers to further process the data according to their specific research needs. The pre-processing pipeline is shown in Fig. 2. The pre-processing steps include data segmentation, downsampling, powerline filtering, band-pass filtering, bad channel interpolation, independent component analysis (ICA), and re-referencing. The MNE38 package (v1.6.0) was used to implement all pre-processing steps.

Fig. 2

EEG pre-processing pipeline. (a) Data segmentation: Data is segmented based on markers, retaining only the data from the formal reading phase. (b) Band-pass filtering: Two versions of filtered data are provided, with band-pass ranges of 0.5–30 Hz and 0.5–80 Hz respectively. (c) Bad channel interpolation: Our bad channel detection includes automatic detection implemented with the pyprep package and manual checking. For interpolation, the spherical spline interpolation implemented in MNE is utilized. (d) ICA denoising: In this part, the automatic labeling method in mne-iclabel package is utilized followed by a manual checking to remove noisy independent components such as eye movements and heartbeats. (e) Dataset organization: Our dataset is organized in the BIDS41,42 format. The detailed file structure is shown in Fig. 3.

During the data segmentation phase, we only retained data from the formal reading phase of the experiment. Based on the event markers during the data collection phase, we segmented the data, removing sections irrelevant to the formal experiment such as calibration and preface reading. To minimize the impact of subsequent filtering steps on the beginning and end of the signal, an additional 10 seconds of data was retained before the start of the formal reading phase. Subsequently, the signal was downsampled to 256 Hz. This specific sampling rate ensures effective capture of information related to language comprehension while reducing the burden of subsequent data processing and storage. Additionally, it aligns with the principle of minimal pre-processing, leaving necessary room for researchers to conduct personalized pre-processing based on their needs.

Following downsampling, a 50 Hz notch filter was applied to remove the powerline noise from the signal. Next, we applied a band-pass overlap-add FIR filter to the signal to eliminate the low-frequency direct current components and high-frequency noise. Two versions of filtered data are offered: the first has a pass band of 0.5–80 Hz and the second a pass band of 0.5–30 Hz. Researchers can choose the appropriate version based on their specific needs. After filtering, we interpolated bad channels. The bad channels were detected automatically using pyprep39 (v0.4.3), a Python package for EEG pre-processing. After automatic detection, we manually checked the results to avoid mislabeling or errors before interpolation. The spherical spline interpolation in the MNE package was used in this process.

Independent Component Analysis (ICA) was then applied to the data, using the infomax algorithm available in the MNE package. The number of independent components was set to 20, so that they contain the majority of the information without being so numerous as to increase the burden of manual processing. Additionally, we set the random seed of the ICA algorithm to 97 to ensure the reproducibility of the ICA results. Components were first inspected and labeled automatically using mne-iclabel40 (v0.5.1), a Python package for automatic independent component labeling. By manually inspecting the independent components after automatic labeling, we excluded obvious noise components such as electrooculographic (EOG) and electrocardiographic (ECG) artifacts. Finally, the data was re-referenced using the average method.

The process of manually identifying bad channels and excluding independent components during the ICA step can be conducted through annotations in a Graphical User Interface (GUI), making the annotation process quicker and more user-friendly.
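The order of the steps in this section can be summarized in a short MNE-Python sketch. This is an illustrative outline under stated assumptions, not the released pipeline: the names `PIPELINE` and `preprocess` are hypothetical, the input is assumed to be a preloaded `mne.io.Raw` object that is already cropped to the formal reading phase and carries electrode positions, and the bad channels and excluded components are passed in by hand rather than obtained from pyprep and mne-iclabel.

```python
# Parameter values are quoted from the text; all names are illustrative.
PIPELINE = {
    "sfreq": 256,                         # downsampling target (Hz)
    "notch": 50,                          # powerline frequency (Hz)
    "bands": [(0.5, 80.0), (0.5, 30.0)],  # the two band-pass versions (Hz)
    "n_ica": 20,                          # number of independent components
    "ica_seed": 97,                       # random seed used for ICA
}


def preprocess(raw, band, bad_channels, exclude_components):
    """Apply the described steps in order to a preloaded mne.io.Raw object."""
    from mne.preprocessing import ICA  # lazy import; requires MNE-Python

    l_freq, h_freq = band
    raw = raw.copy().resample(PIPELINE["sfreq"])
    raw.notch_filter(freqs=PIPELINE["notch"])
    raw.filter(l_freq=l_freq, h_freq=h_freq, method="fir")
    raw.info["bads"] = list(bad_channels)   # e.g. from pyprep plus manual checking
    raw.interpolate_bads(reset_bads=True)   # spherical splines for EEG channels
    ica = ICA(n_components=PIPELINE["n_ica"], method="infomax",
              random_state=PIPELINE["ica_seed"])
    ica.fit(raw)
    ica.exclude = list(exclude_components)  # chosen after labeling and inspection
    ica.apply(raw)
    raw.set_eeg_reference("average")
    return raw, ica
```

A call such as `preprocess(raw, PIPELINE["bands"][0], bad_channels=[...], exclude_components=[...])` would correspond to the 0.5–80 Hz version. Stopping after the `raw.filter` call would correspond to the filtered-only derivatives.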

Data Records

The full dataset is publicly accessible via the ChineseNeuro Symphony community (CHNNeuro) in the Science Data Bank (ScienceDB) platform (https://doi.org/10.57760/sciencedb.CHNNeuro.00007)26 or via the OpenNeuro platform (https://doi.org/10.18112/openneuro.ds004952.v1.2.0)27. Public data is distributed under the Creative Commons Attribution 4.0 International Public License (https://creativecommons.org/licenses/by/4.0/).

Data organization

The dataset26,27 is organized following the EEG-BIDS41,42 specification, which is an extension to the brain imaging data structure for EEG. The overview directory tree of our dataset is shown in Fig. 3. The dataset contains some regular BIDS files, 10 participants’ data folders, and a derivatives folder. The stand-alone files offer an overview of the dataset: i) dataset_description.json is a JSON file depicting the information of the dataset, such as the name, dataset type and authors; ii) participants.tsv contains participants’ information, such as age, sex, and handedness; iii) participants.json describes the column attributes in participants.tsv; iv) README.md contains a detailed introduction of the dataset.

Fig. 3

File structure of the dataset. (a) Eye-tracking data: Each experimental run is associated with a .rar file that contains eye-tracking data. (b) Electrode information files: These include detailed information of electrodes such as the location, type, and sampling rate, as well as information on any channels marked as bad during pre-processing. (c) EEG data and event-related files: Including EEG data in BrainVision format and event files that record marker information. (d) ICA-related files: Containing independent components in numpy format, records of removed components during pre-processing, and topographic maps of the components. (e) Text materials: Containing original and segmented text. (f) Text embedding files: Each file corresponds to an experimental run and is stored in .npy format. (g) Raw EEG data.

Each participant’s folder contains two folders named ses-LittlePrince and ses-GarnettDream, which store the data recorded while this participant read the two novels, respectively. Each of the two folders contains a folder eeg and one file sub-xx_scans.tsv. The tsv file contains information about the scanning time of each file. The eeg folder contains the raw EEG data of the runs, together with channel and marker event files. Each run includes an eeg.json file, which contains detailed information for that run, such as the sampling rate and the number of channels. Events are stored in events.tsv with onset and event ID. The EEG data is converted from the raw metafile format (.mff file) to BrainVision format (.vhdr, .vmrk, and .eeg files) since EEG-BIDS41,42 is not officially compatible with the .mff format. All data is formatted to EEG-BIDS41,42 using the mne-bids42,43 package (v0.14) in Python.

The derivatives folder contains six folders: eyetracking_data, filtered_0.5_80, filtered_0.5_30, preproc, novels, and text_embeddings. The eyetracking_data folder contains all the eye-tracking data. Each eye-tracking recording is stored as a .rar file, with eye movement trajectories and other parameters such as the sampling rate saved in separate files. The filtered_0.5_80 and filtered_0.5_30 folders contain data that has been processed up to and including the 0.5–80 Hz and 0.5–30 Hz band-pass filtering step, respectively. This data is suitable for researchers who have specific requirements and want to customize the subsequent pre-processing steps, such as ICA and re-referencing. The preproc folder contains minimally pre-processed EEG data that has passed through the whole pre-processing pipeline. It includes four additional types of files compared to the participants’ raw data folders in the root directory: i) bad_channels.json contains the bad channels marked during the bad channel rejection phase; ii) ica_components.npy stores the values of all independent components from the ICA phase; iii) ica_components.json lists the independent components excluded in ICA (the ICA random seed is fixed, allowing for reproducible results); iv) ica_components_topography.png shows the topographic maps of all independent components, with the excluded components labeled in grey. The novels folder contains the original and segmented text stimuli. The original novels are saved in .txt format and the segmented novels corresponding to each experimental run are saved in Excel (.xlsx) files. The text_embeddings folder contains embeddings of the two novels. The embeddings corresponding to each experimental run are stored in NumPy (.npy) files.

Technical Validation

Classic sensor-level EEG analysis

The EEG data in the dataset26,27 can be used for classic time-frequency analysis. In this section, pre-processed EEG data was used to extract neural oscillations in different frequency bands. Specifically, we targeted the segment corresponding to the sentence “Draw me a sheep” in the Little Prince from the 0.5–80 Hz filtered pre-processed data of sub-07. The analysis focused exclusively on the C3 electrode to investigate the neural activities at the scalp location overlying the temporal lobe, which is a language processing related area.

To dissect the frequency components inherent in the C3 electrode’s signal, we applied the Fast Fourier Transform (FFT) algorithm to the data. This mathematical technique transforms the time-domain signal into the frequency domain, revealing the spectrum of frequencies present in the neural recordings. We defined frequency bands of interest—Theta (4–8 Hz), Alpha (8–12 Hz), Beta (12–30 Hz), and Gamma (30–100 Hz)—to categorize the neural oscillations according to their respective frequency ranges.

For each frequency band, we separated the components from the FFT results and conducted an inverse FFT to retrieve the time-domain signal representing the band’s oscillatory activity. This step allows for the quantitative analysis of the amplitude of oscillations within each frequency band, offering insights into the neurophysiological activity in these specific ranges. The results of different frequency bands are shown in Fig. 4.
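The band separation described here can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' analysis code: the function name `band_signal` is hypothetical, and the half-open interval (lower edge included, upper edge excluded) is an assumption made so that a frequency bin on a shared boundary such as 8 Hz is assigned to exactly one band.

```python
import numpy as np

# Band edges in Hz, as defined in the text.
BANDS = {"theta": (4, 8), "alpha": (8, 12), "beta": (12, 30), "gamma": (30, 100)}


def band_signal(x, sfreq, low, high):
    """Zero all FFT bins outside [low, high) and transform back to the time domain."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sfreq)
    mask = (freqs >= low) & (freqs < high)
    return np.fft.irfft(spectrum * mask, n=len(x))


def decompose(x, sfreq):
    """Return one band-limited time course per entry in BANDS."""
    return {name: band_signal(x, sfreq, lo, hi) for name, (lo, hi) in BANDS.items()}
```

For a single-channel segment `x` sampled at 256 Hz, `decompose(x, 256)` returns four arrays of the same length as `x`. Masking FFT bins acts as an ideal (brick-wall) filter, which can introduce ringing at the edges of short segments.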

Fig. 4

EEG time course and the neural oscillations in different frequency bands (i.e., Theta, Alpha, Beta, and Gamma) corresponding to the Chinese sentence meaning “Draw me a sheep”. The pre-processed EEG data using the 0.5–80 Hz band-pass filter from ses-LittlePrince of sub-07 was used in the analysis. We illustrate the EEG signals from electrode C3, which is located at a language processing related area overlying the temporal lobe.

EEG source reconstruction

Apart from the sensor level analysis, the EEG data allows for conducting source localization. Here, three segments of the data were utilized as an example to perform the source-level analysis using the MNE package. The fsaverage MRI template (https://surfer.nmr.mgh.harvard.edu/fswiki/FsAverage)44 in MNE package was utilized to complete the surface reconstruction process. A 3-layer Boundary Element Method (BEM) model with 15360 triangles and conductivities of 0.3 S/m, 0.006 S/m, and 0.3 S/m for the brain, skull, and scalp compartments respectively was created. Source spaces consisted of 10242 sources per hemisphere. Three segments of the pre-processed EEG data with a band-pass frequency band of 0.5–80 Hz corresponding to one line displayed in the experiment were used to calculate the inverse solution. Inverse solutions were calculated using dynamic Statistical Parametric Maps (dSPM). The method was selected because it is widely used by researchers and is representative of currently used methods45. We offer the code of source reconstruction in our GitHub repository. See Code availability section for detailed information.

The visualization of the source activities is shown in Fig. 5b. Results for the left and right hemispheres are presented separately. The moments of peak activation in the left and right brain regions are chosen for visualization. The source localization results for the first segment reveal a dispersed activation area, encompassing the anterior temporal lobe and temporo-parietal region, which are associated with language comprehension and primary processing46. The results of the second segment exhibit more focused activation, particularly near the left middle temporal gyrus, an area (encompassing Wernicke’s area) intimately related to language comprehension47. The activation areas for the third segment are localized in the left temporal and frontal lobes, potentially representing high-level stages of language processing, including sentence construction, semantic processing, and language expression48. Fig. 5c presents plots of source activities over time, derived from 12 sources in the corresponding region with strongest activities. The first two curves in each plot correspond to sources in the left and right hemispheres that reach maximum peak values.

Fig. 5

EEG source localization analysis. (a) EEG sensor-level data: Three segments of pre-processed EEG data, band-pass filtered at 0.5–80 Hz, were selected for analysis. The corresponding text segments are shown above the EEG segments. (b) Visualization of brain activation after source analysis: The dSPM method was used to solve the inverse problem. Results for the left and right hemispheres are presented separately, each at the moment of peak activation in that hemisphere. (c) Plots of source activity over time: Each plot contains the activities of 12 sources in the region with the strongest activity.

Text embeddings with pre-trained language model

To help researchers explore the alignment between EEG and text representations, as well as EEG-based text decoding, this study provides embeddings of the two novels computed with a pre-trained language model, together with the code used to compute them. This work employed Google’s pre-trained language model BERT-base-Chinese28, which is pre-trained on Chinese corpora and effectively encodes Chinese semantic features. Because Chinese characters are the smallest units of Chinese writing and cannot be further decomposed49, BERT-base-Chinese adopts character-based tokenization, treating each Chinese character as one token. In the experiment, each displayed line of text contains n Chinese characters. The model processes these n characters and yields an embedding of size (n, 768), where n is the number of Chinese characters and 768 is the dimensionality of the embedding. So that displayed lines of varying length have embeddings of the same shape, the embedding is averaged over its first dimension, which standardizes the embedding size to (1, 768) for each line. This procedure was implemented with the Hugging Face Transformers package (v4.36.2)50.
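The pooling step described above can be sketched as follows. This is an illustrative sketch rather than the repository code: the function names mean_pool and embed_line are hypothetical, taking the model's last hidden layer as the embedding and dropping the [CLS] and [SEP] positions before averaging are assumptions, and the tokenizer may split digits or Latin text in a line differently from one token per character.

```python
def mean_pool(token_vectors):
    """Average an (n, d) list of token vectors over n, returning shape (1, d)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    return [[sum(vec[j] for vec in token_vectors) / n for j in range(dim)]]


def embed_line(line, model_name="bert-base-chinese"):
    """Return a (1, 768) nested list for one displayed line of Chinese text."""
    # Imported lazily so mean_pool can be used without these packages.
    import torch
    from transformers import BertModel, BertTokenizer

    # Loaded on every call for brevity; cache these when embedding many lines.
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(line, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (n + 2, 768)

    # Assumption: drop the [CLS] and [SEP] positions, keeping n character vectors.
    char_vectors = hidden[1:-1].tolist()
    return mean_pool(char_vectors)
```

For example, mean_pool([[1.0, 2.0], [3.0, 4.0]]) returns [[2.0, 3.0]], so lines of any length map to a single row.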

Temporal alignment between EEG, text sequences, and eye-tracking data

This section explains how to align the EEG data with the corresponding text content and eye-tracking data in the temporal domain.

To facilitate semantic decoding, each piece of text must be aligned with its corresponding EEG segment in the temporal domain. During data collection, markers were inserted at the start and end of each line of the stimuli, so that each text line can be aligned with a corresponding segment of EEG data. Because every character is highlighted for the same duration, the EEG segment of a line can be divided into equal parts, one per character. In the GitHub repository, we offer a script that aligns the EEG segments with their corresponding text and text embeddings.
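The equal division of a line's EEG segment can be sketched as follows. This is a minimal illustration, not the repository's align_eeg_with_sentence.py: the function names are hypothetical, and the line boundaries are assumed to be given as sample indices taken from the start and end markers, with the end index exclusive.

```python
def split_segment_by_characters(start_sample, end_sample, n_chars):
    """Tile [start_sample, end_sample) with n_chars near-equal sample ranges."""
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")
    if end_sample <= start_sample:
        raise ValueError("end_sample must be greater than start_sample")
    length = end_sample - start_sample
    # Integer arithmetic keeps boundaries exact and the ranges contiguous.
    bounds = [start_sample + (i * length) // n_chars for i in range(n_chars + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def align_characters(line_text, start_sample, end_sample):
    """Pair each character of a line with its (start, end) sample range."""
    ranges = split_segment_by_characters(start_sample, end_sample, len(line_text))
    return [(char, lo, hi) for char, (lo, hi) in zip(line_text, ranges)]
```

When the segment length is not a multiple of the number of characters, the ranges differ by at most one sample, and together they still cover the whole segment.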

The recorded eye-tracking data can be aligned with the EEG data to verify whether participants were focusing on the text as expected. The eye-tracking data capture both the scene viewed by the participant at each moment and the coordinates of the gaze points. During each experimental run, the markers “EYES” and “EYEE” were inserted into the EEG recordings when the eye-tracker was activated and deactivated, respectively. These markers enable precise alignment between the eye-tracking data and the EEG recordings. Once the alignment is complete, markers in the EEG recordings can be used to extract the eye-tracking data segments that correspond to particular EEG segments. These segments can be used to check whether the eye fixation locations match the anticipated positions on the screen, which in turn reflects the quality of the EEG data.
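One way to use the “EYES” marker for this extraction is sketched below. It is a hedged illustration under explicit assumptions: the function names are hypothetical, gaze samples are assumed to be (time, x, y) tuples with time in seconds counted from the moment the eye-tracker was activated, the EEG marker times are assumed to be in seconds on the EEG clock, and any clock drift between the two devices is ignored.

```python
def eeg_to_eye_time(t_eeg, eyes_onset):
    """Convert an EEG-clock time (s) to eye-tracker time (s).

    Assumes the eye-tracker clock starts at zero at the "EYES" marker.
    """
    return t_eeg - eyes_onset


def extract_gaze_segment(gaze_samples, seg_start, seg_end, eyes_onset):
    """Return the gaze samples that fall inside an EEG segment.

    gaze_samples: iterable of (t_eye, x, y) tuples, t_eye in seconds.
    seg_start, seg_end: segment boundaries on the EEG clock (end exclusive).
    eyes_onset: time of the "EYES" marker on the EEG clock.
    """
    if seg_end <= seg_start:
        raise ValueError("seg_end must be greater than seg_start")
    lo = eeg_to_eye_time(seg_start, eyes_onset)
    hi = eeg_to_eye_time(seg_end, eyes_onset)
    return [sample for sample in gaze_samples if lo <= sample[0] < hi]
```

The interval between the “EYES” and “EYEE” markers can additionally be compared with the duration of the eye-tracking recording as a sanity check on the offset-only assumption.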

Usage Notes

The code for the experiment and data analysis is available on GitHub at https://github.com/ncclabsustech/Chinese_reading_task_eeg_processing.

The code repository contains four main modules, each including the scripts required to reproduce the experiment and data analysis procedures. The script cut_chinese_novel.py in the novel_segmentation_and_text_embeddings module contains the code to prepare the stimulation materials from the source materials. The script play_novel.py in the experiment module contains the code for the experiment, including text stimuli presentation and control of the EGI device and the Tobii Pro Glasses 3 eye-tracker. The script preprocessing.py in the data_preprocessing_and_alignment module contains the main code for pre-processing the EEG data. The script align_eeg_with_sentence.py in the same module contains the code to align the EEG segments with the corresponding text contents and text embeddings. The docker module contains the Docker image required for deploying and running the code, as well as tutorials on how to use Docker for environment deployment.

The code for EEG data pre-processing is highly configurable, permitting flexible adjustments of various pre-processing parameters, such as data segmentation range, downsampling rate, filtering range, and choice of ICA algorithm, thereby ensuring convenience and efficiency. Researchers can modify and optimize this code according to their specific requirements.

Before using our ChineseEEG dataset26,27, we encourage all users to check the README.md and the updated information in the GitHub repository.