ChineseEEG: A Chinese Linguistic Corpora EEG Dataset for Semantic Alignment and Neural Decoding

An electroencephalography (EEG) dataset built on rich text stimuli can advance the understanding of how the brain encodes semantic information and contribute to semantic decoding in brain-computer interfaces (BCIs). Addressing the scarcity of EEG datasets featuring Chinese linguistic stimuli, we present the ChineseEEG dataset, a high-density EEG dataset complemented by simultaneous eye-tracking recordings. The dataset was compiled while 10 participants each silently read Chinese text from two well-known novels for approximately 13 hours. It provides long-duration EEG recordings, along with pre-processed EEG sensor-level data and semantic embeddings of the reading materials extracted by a pre-trained natural language processing (NLP) model. As a pilot EEG dataset derived from natural Chinese linguistic stimuli, ChineseEEG can significantly support research across neuroscience, NLP, and linguistics. It establishes a benchmark dataset for Chinese semantic decoding, aids in the development of BCIs, and facilitates the exploration of alignment between large language models and human cognitive processes. It can also aid research into the brain’s mechanisms of language processing within the context of the Chinese natural language.


Background & Summary
The human brain's ability to rapidly comprehend linguistic information and generate corresponding linguistic expressions is an indicator of its complex processing capabilities 1. When exposed to linguistic stimuli, the human brain encodes the semantic information through neural activities 2. By analyzing such neural activities, we can uncover the encoding mechanisms of semantics in the brain 3. A variety of neural signals, including EEG, functional magnetic resonance imaging (fMRI), and electrocorticography (ECoG), are employed in language-related tasks, from academic research such as investigating language processing mechanisms in the brain to practical applications such as language decoding in BCIs [4][5][6][7][8][9]. Recently, many neurolinguistic studies have used both traditional machine learning methods and modern deep learning methods from NLP to explore language-related problems [10][11][12][13][14][15][16]. However, these data-driven methods rely heavily on massive and comprehensive datasets 17. In the field of NLP, it is relatively easy to collect large amounts of natural language data. In contrast, acquiring a large volume of neural signals generated in response to natural language stimuli poses significant challenges. To exploit the strength of modern data-driven methods, it is important to scale neural datasets to a size commensurate with state-of-the-art NLP corpora, so that they encompass the wide range of language expressions encountered in daily life. Among all neuroimaging techniques, EEG holds great potential to meet this demand. EEG is non-invasive and cost-effective 18, which allows the creation of long-duration neural signal datasets enriched with semantic information. Meanwhile, EEG features high temporal resolution 19, which enables it to precisely capture the brain's rapid dynamic changes during language processing.
Despite the abundance of EEG datasets for natural visual stimuli (e.g., THINGS-EEG) [20][21][22][23], those for natural language stimuli remain scarce. Currently, only a few language-related EEG datasets exist, such as the ZuCo dataset 24. However, the majority of these datasets are collected using stimuli from English language corpora. This leads to limited research on the neural representations of other languages like Chinese. The brain's processing mechanisms differ for various languages. For example, the brain exhibits specificity in response to Chinese compared to English 25. Therefore, it is important to create an EEG dataset based on other language stimuli. Chinese, being distinct from English in both structure and semantics, provides an opportunity to expand our understanding of neural responses to linguistic stimuli. An EEG dataset stimulated by Chinese corpora can facilitate the investigation of cross-linguistic commonalities and variations in language processing in the brain, bringing new perspectives to our understanding of language processing mechanisms.
To address these gaps, we have collected an EEG dataset, named "ChineseEEG" (Chinese Linguistic Corpora EEG Dataset) 26,27. It contains high-density EEG data and simultaneous eye-tracking data recorded from 10 participants, each silently reading Chinese text for about 13 hours. The text materials are sourced from two well-known novels, the Little Prince and Garnett Dream, both in their Chinese versions. The dataset further comprises multiple versions of pre-processed EEG sensor-level data generated under different parameter settings, offering researchers a range of options. Additionally, we provide embeddings of the Chinese text materials encoded by the BERT-base-chinese model, a pre-trained NLP model built specifically for Chinese 28, aiding researchers in exploring the alignment between text embeddings from NLP models and brain information representations in neural signals.
ChineseEEG 26,27 is a pilot EEG dataset specifically stimulated by Chinese text. It offers several advantages. Firstly, each participant was exposed to around 13 hours of diverse Chinese linguistic stimuli, encompassing a broad spectrum of semantic information. This extensive exposure is significant for studying the long-term neural dynamics of language processing in the brain. Secondly, we recorded 128 channels of high-density EEG data, which offers superior spatial resolution for precise localization of brain regions involved in language processing. In addition, with a sampling rate of 1 kHz, the recording effectively captures the dynamics of neural representations during reading. Furthermore, the inclusion of the pre-processed EEG data and text embeddings is beneficial for scholars in both neuroscience and computer science who lack interdisciplinary experience, enabling them to directly use well-processed data from fields they may not be familiar with.
ChineseEEG 26,27 can serve as a valuable resource in neuroscience, linguistics, and other related fields. EEG data generated from Chinese language stimuli will significantly support research within the Chinese context, aiding researchers in revealing the characteristics of brain signal representations under Chinese stimuli, and promoting the development of brain-to-text translation, semantic decoding and other practical applications tailored to the Chinese context. The dataset can also bring diversity to the languages used in related research, encouraging the exploration of similarities and differences in language processing stimulated by different languages. It can also aid in multi-linguistic alignment in NLP by aligning multi-lingual brain signals with natural languages 29,30. Given the dataset's inclusion of text materials widely used in multilingual neuroimaging research, like the Little Prince, ChineseEEG can be combined with prior datasets to extend its potential. For example, combining ChineseEEG with neural signal datasets for auditory language comprehension tasks under similar semantic stimuli 31,32 can help to uncover the neural mechanisms of language understanding in multi-modal perceptions. Besides, researchers can integrate the semantically rich EEG data in ChineseEEG with other neuroimaging modalities, such as fMRI and MEG, in language comprehension tasks 33,34 to precisely uncover the brain's spatio-temporal dynamics and thus enhance the understanding of neural mechanisms of language processing in the brain.

Methods
Participants and task overview. We recruited 15 participants (aged 18-29 years, mean age 21.9 years, 9 males). Three participants took part in a pre-experimental test before the official experiment to ensure the rationality of the experimental procedure and the stability of the devices. In the official experiment, 2 participants withdrew halfway due to scheduling conflicts (they decided to withdraw after communicating with the experimenter). In total, data from 10 participants were used (aged 18-29 years, mean age 22.7 years, 5 males). No participant reported a neurological or psychiatric history. All participants are right-handed and have normal or corrected-to-normal vision. Each participant enrolled voluntarily, signed the informed consent form before the experiment, and received coupon compensation of approximately 50 MOP (MOP is the official currency of the Macao Special Administrative Region of China) for each experimental run (25 runs in total). This study complied with the Declaration of Helsinki and was performed according to the ethics committee approval of the Institutional Review Board of the University of Macau (Approval No. BSERE20-APP011-ICI).
Experimental material. The experimental materials consist of two novels, both in the genre of children's literature. The first is the Chinese translation of the Little Prince (http://www.xiaowangzi.org/index.html) and the second is Garnett Dream (https://www.feiku6.com/read/s3-langwangmeng/18242419.html), both sourced from the Internet. Using novels, especially children's literature, provides several advantages for research, particularly within a naturalistic paradigm. Firstly, given their extensive size, these novels offer vast and diverse linguistic content, encompassing the majority of frequently used Chinese characters and daily expressions. Besides, children's literature can create an engaging environment for participants, making them more focused and emotionally engaged in the experiment.
Each novel was used as the material for a single session in the experiment. Each session was divided into several runs. For the Little Prince, the preface was used as the material for the practice reading phase. The main body of the novel was then used for seven runs in the formal reading phase. The first six runs each include 4 chapters of the novel, while the seventh run includes the last 3 chapters. For Garnett Dream, the first 18 chapters were used for 18 runs in the formal reading stage, with each run including a complete chapter. Due to the loss of markers during the EEG collection process, run 18 of ses-GarnettDream of sub-07 is unusable. We asked this participant to redo the reading task using chapter 19 of Garnett Dream.
To properly present the text on the screen during the experiment, the content of each run was segmented into a series of units, with each unit containing no more than 10 Chinese characters. These segmented contents were saved in Excel (.xlsx) format for subsequent usage. During the experiment, three adjacent units from each run's content were displayed on the screen in three separate lines, with the middle line highlighted for the participant to read. The relevant code has been uploaded to the GitHub repository. See the Code availability section for detailed information.
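The segmentation and three-line display described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' cut_chinese_novel.py: the function names `segment_text` and `display_window` are hypothetical, Chinese characters are detected with the basic CJK Unified Ideographs range only, and punctuation is assumed to stay attached to a unit without counting toward the 10-character limit.

```python
def is_chinese_char(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def segment_text(text: str, max_chars: int = 10) -> list[str]:
    """Split text into units holding at most `max_chars` Chinese characters.

    Assumption of this sketch: punctuation and other symbols are kept in the
    unit but are not counted toward the limit.
    """
    units, current, count = [], "", 0
    for ch in text:
        # Start a new unit before adding an 11th Chinese character.
        if is_chinese_char(ch) and count == max_chars:
            units.append(current)
            current, count = "", 0
        current += ch
        if is_chinese_char(ch):
            count += 1
    if current:
        units.append(current)
    return units


def display_window(units: list[str], middle: int) -> tuple[str, str, str]:
    """Return the (upper, middle, lower) lines shown when `middle` is highlighted."""
    upper = units[middle - 1] if middle > 0 else ""
    lower = units[middle + 1] if middle + 1 < len(units) else ""
    return upper, units[middle], lower
```

Calling `display_window(units, i)` for successive values of `i` reproduces the scrolling behaviour in which the lower line becomes the next highlighted middle line.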
The overview of experimental materials is shown in Table 1. In summary, a total of 115,233 characters (24,324 in the Little Prince and 90,909 in Garnett Dream), of which 2,985 characters are unique, are used as experimental stimuli in the ChineseEEG dataset.
Experimental procedures. Participants were instructed to sit in an adjustable chair, with their eyes positioned approximately 67 cm away from the monitor (Dell, width: 54 cm, height: 30.375 cm, resolution: 1,920 × 1,080 pixels, vertical refresh rate: 60 Hz), see Fig. 1b. They were tasked with reading a novel and were required to keep their heads still and keep their gaze on the highlighted (red) Chinese characters moving across the screen, reading at a pace set by the program. Eye tracking was used to confirm that participants followed the highlighted characters.
Each participant was required to complete 1 practice reading phase and 2 formal reading sessions. The Little Prince session was divided into 7 experimental runs and the Garnett Dream session was divided into 18 experimental runs. The schedule for the entire experiment was as follows: participants were required to finish all experimental runs over the span of 8 days. The total daily reading duration was set at approximately 1.5 hours to avoid fatigue. Specifically, the reading tasks for the first day comprised the practice reading phase and runs 1-4 of the Little Prince session. The tasks for the second day comprised runs 5-7 of the Little Prince session. From the third to the eighth day, each day's reading tasks comprised 3 runs of the Garnett Dream session. While participants were afforded the flexibility to adjust their schedules, they were required to complete all reading tasks within one month.
Each experimental run lasted approximately 30 minutes and was divided into two phases: the eye-tracker calibration phase and the reading phase.
Table 1. An overview of the experiment.
Phase 1: Eye-tracker calibration phase. At the beginning of each run, participants were required to undergo an eye-tracker calibration process. Initially, the message "Hello! Please press the spacebar to start calibration" was displayed at the screen's center. Participants were instructed to keep their gaze on a fixation point, which sequentially appeared at the four corners and the center of the screen, each for 5 seconds. If the calibration failed, participants were prompted to start another calibration. Upon successful calibration, the message "Calibration successful! The page will automatically redirect in 5 seconds" was displayed at the center of the screen. During the reading process, the accuracy of eye-tracking data can be influenced by several factors, including drift errors resulting from involuntary eye movements, as well as head movements and equipment positioning discrepancies. Performing a calibration at the beginning of each experimental run mitigates these potential errors and helps to ensure the precision of the eye-tracking data.
Phase 2: Reading phase. After the calibration phase, participants were automatically directed to the reading phase. During the reading process, the screen initially displayed the serial number of the current chapter. Subsequently, the text appeared with three lines per page, ensuring each line contained no more than ten Chinese characters (excluding punctuation). On each page, the middle line was highlighted as the focal point, while the upper and lower lines were displayed with reduced intensity as the background. Each character in the middle line was sequentially highlighted in red for 0.35 s, and participants were required to read the novel content following the highlighted cues. To facilitate a smooth reading experience, the text was designed to scroll automatically on the screen. Once participants finished reading the highlighted middle line, the text would scroll, moving the third line up to become the new middle line on the subsequent page.
The reading speed, which is slower than the typical speeds reported in previous studies 35, was deliberately chosen. This speed was selected based on feedback from the pre-experimental test to maintain participants' attention and minimize fatigue throughout the relatively long experimental run. The reading speed was fixed to enable character-level alignment between EEG segments and text. Additionally, a fixed speed can minimize the impact of external interference in the experiment and eliminate the impact of different reading speeds of different participants on subsequent analyses.
To ensure the accuracy of both EEG and eye-tracking data, participants were instructed to consistently focus on the highlighted text, while avoiding significant body movements to maintain a stable reading position. This protocol was strictly enforced to reduce any potential drifts and artifacts in the recordings.
After each run, participants were given sufficient time to rest. They were instructed to start the subsequent run only when they explicitly reported being ready to proceed. Adequate rest time can mitigate fatigue and enable participants to sustain their attention throughout the experiment, thus supporting the quality of both EEG and eye-tracking data. The experimenter also evaluated each participant's performance and fatigue level through oral inquiries after each experimental run to ensure they could fully maintain their attention in subsequent runs. During each rest period, the experimenter replenished the saline solution on the electrodes of the EEG cap, which helped to maintain a low impedance for the collection of high-quality EEG data. Additionally, the experimenter checked the power status of the eye-tracker and replaced the batteries as necessary to ensure its continuous operation.
It should be noted that during the initial participation in the experiment, participants were required to complete a practice reading phase. The preface chapter of the Little Prince was selected as the reading material for this phase. All settings remained the same as those of the formal reading stage, to familiarize participants with the eye-tracker calibration process and the reading task.
The presentation of stimuli was managed using PsychoPy v2023.2.3 36, with the EGI PyNetstation v1.0.1 module facilitating the connection between PsychoPy and EGI Netstation. We also used the g3pylib package to control the eye-tracker and record the eye movement trajectories of the participants.
Data collection and analysis. This section details the data collection, pre-processing, and data analysis procedures. The modalities included in our dataset 26,27 are shown in Fig. 1d, including raw data and derivatives. The raw data contain the raw EEG data, eye-tracking data, and raw text materials, and the derivatives contain pre-processed EEG data and text embeddings generated by the pre-trained NLP model BERT-base-chinese.
EEG data collection. EEG data was acquired using an EGI 128-channel cap based on the GSN-HydroCel-128 montage with the Geodesic Sensor Net system (see Fig. 1a). The egi-pynetstation v1.0.1 package was used to control the EGI system. Before recording, the experimenter used a soft ruler to locate the position of the Cz electrode (i.e., the vertex of the head) for each participant, ensuring consistent alignment of the electrodes across experimental runs. During recording, the sampling rate was 1 kHz. The impedance of each electrode was kept below 50 kΩ during the experiment. Setups and recording parameters are similar to our previous EEG dataset 37. To precisely co-register EEG segments with individual characters during the experiment, we marked the EEG data with triggers (Table 2). The raw EEG data was exported to metafile format (.mff) files on the macOS system.
Eye-tracking data collection. Eye-tracking data was acquired using Tobii Pro Glasses 3. The device features 16 illuminators and 4 eye cameras integrated into scratch-resistant lenses, along with a wide-angle scene camera, allowing for a comprehensive capture of participant behavior and environmental context (see Fig. 1a). Due to the extensive duration of our experiments, a lightweight eye-tracker was prioritized, and Tobii Pro Glasses 3 fulfilled this criterion. Tobii Pro Glasses 3 has a maximum sampling rate of 100 Hz. Given the relatively slow reading speed in our experiment, a sampling rate of 100 Hz is adequate for capturing the eye movement trajectories of the participants and assessing whether they were fixating on the highlighted text at specific moments. More information about Tobii Pro Glasses 3 can be found on the official website (https://www.tobii.com/products/eye-trackers/wearables/tobii-pro-glasses-3). We utilized the g3pylib package to control the glasses. The raw data was exported to .rar files.
EEG data pre-processing. To retain the maximum amount of valid information in the data, we performed minimal pre-processing, allowing researchers to further process the data according to their specific research needs. The pre-processing pipeline is shown in Fig. 2. The pre-processing steps include data segmentation, downsampling, powerline filtering, band-pass filtering, bad channel interpolation, independent component analysis (ICA), and re-referencing. The MNE v1.6.0 38 package was utilized to implement all pre-processing steps.
During the data segmentation phase, we only retained data from the formal reading phase of the experiment. Based on the event markers inserted during data collection, we segmented the data, removing sections irrelevant to the formal experiment such as calibration and preface reading. To minimize the impact of subsequent filtering steps on the beginning and end of the signal, an additional 10 seconds of data was retained before the start of the formal reading phase. Subsequently, the signal was downsampled to 256 Hz. This sampling rate ensures effective capture of information related to language comprehension while reducing the burden of subsequent data processing and storage. Additionally, it aligns with the principle of minimal pre-processing, leaving room for researchers to conduct personalized pre-processing based on their needs.
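The bookkeeping behind these two steps can be illustrated with a small sketch. It is not taken from the authors' preprocessing.py; the helper names `crop_window` and `rescale_sample` are hypothetical, and the sketch only shows how a crop window with a 10-second lead-in and a marker position at 1 kHz might be translated to the 256 Hz time base.

```python
def crop_window(start_s: float, end_s: float, pad_s: float = 10.0) -> tuple[float, float]:
    """Return (tmin, tmax) in seconds, keeping `pad_s` seconds before the start.

    The window is clipped at 0 so it never begins before the recording does.
    """
    return max(0.0, start_s - pad_s), end_s


def rescale_sample(sample: int, old_sfreq: float = 1000.0, new_sfreq: float = 256.0) -> int:
    """Map a sample index recorded at `old_sfreq` to the nearest index at `new_sfreq`."""
    return round(sample * new_sfreq / old_sfreq)


# Hypothetical example: formal reading starts 25 s into the recording and ends at 1800 s.
tmin, tmax = crop_window(25.0, 1800.0)
first_marker_256 = rescale_sample(25_000)  # marker at sample 25,000 of the 1 kHz recording
```

Because 256/1000 is not an integer ratio, marker positions generally fall between samples after downsampling, which is why the sketch rounds to the nearest index.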
Following downsampling, a 50 Hz notch filter was applied to remove the powerline noise from the signal. Next, we applied a band-pass overlap-add FIR filter to the signal to eliminate the low-frequency direct current components and high-frequency noise. Two versions of filtered data are offered. The first has a filter band of 0.5-80 Hz and the second has a filter band of 0.5-30 Hz. Researchers can choose the appropriate version based on their specific needs. After filtering, we interpolated bad channels. The bad channels were selected automatically using pyprep v0.4.3 39, a Python EEG pre-processing package. After detection, we manually checked the results to avoid mislabeling or errors before interpolation. The spherical spline interpolation in the MNE package was utilized in this process.
Independent Component Analysis (ICA) was then applied to the data, utilizing the infomax algorithm available in the MNE package. The number of independent components was set to 20, so that they contain the majority of the information without being so numerous as to increase the burden of manual processing. Additionally, we set the random seed of the ICA algorithm to 97 to ensure the reproducibility of the ICA results. An automatic method was used to inspect and label components. It was implemented using mne-iclabel v0.5.1 40, a Python package for automatic independent component labeling. By manually inspecting the independent components after automatic labeling, we excluded obvious noise components such as electrooculography (EOG) and electrocardiogram (ECG) components. Finally, the data was re-referenced using the average method.
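As an illustration of the final step, average re-referencing subtracts the mean across channels from every channel at each time point. The sketch below uses plain NumPy on a synthetic array rather than the MNE call used in the actual pipeline; the function name `average_reference` and the array layout (channels by samples) are assumptions made for this example.

```python
import numpy as np


def average_reference(data: np.ndarray) -> np.ndarray:
    """Re-reference EEG to the common average.

    data: array of shape (n_channels, n_samples).
    Returns an array of the same shape in which the across-channel mean
    has been subtracted at every sample.
    """
    return data - data.mean(axis=0, keepdims=True)


# Synthetic stand-in for one second of 128-channel data at 256 Hz.
rng = np.random.default_rng(97)
eeg = rng.normal(size=(128, 256))
eeg_avg_ref = average_reference(eeg)
```

After this operation the channels sum to zero at every sample, so the data lose one degree of freedom, which is worth keeping in mind for any later rank-sensitive analysis.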
The process of manually identifying bad channels and excluding independent components during the ICA step can be conducted through annotations in a Graphical User Interface (GUI), making the annotation process quicker and more user-friendly.

Technical Validation
Classic sensor-level EEG analysis. The EEG data in the dataset 26,27 can be used for classic time-frequency analysis. In this section, pre-processed EEG data was used to extract neural oscillations in different frequency bands. Specifically, we targeted the segment corresponding to the sentence "Draw me a sheep" in the Little Prince from the 0.5-80 Hz filtered pre-processed data of sub-07. The analysis focused exclusively on the C3 electrode to investigate the neural activities at the scalp location overlying the temporal lobe, which is a language processing related area.
To dissect the frequency components inherent in the C3 electrode's signal, we applied the Fast Fourier Transform (FFT) algorithm to the data. This mathematical technique transforms the time-domain signal into the frequency domain, revealing the spectrum of frequencies present in the neural recordings. We defined four frequency bands of interest, Theta (4-8 Hz), Alpha (8-12 Hz), Beta (12-30 Hz), and Gamma (30-100 Hz), to categorize the neural oscillations according to their respective frequency ranges.
For each frequency band, we separated the components from the FFT results and conducted an inverse FFT to retrieve the time-domain signal representing the band's oscillatory activity. This step allows for the quantitative analysis of the amplitude of oscillations within each frequency band, offering insights into the neurophysiological activity in these specific ranges. The results of different frequency bands are shown in Fig. 4.
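The band separation described above can be sketched with NumPy's real FFT routines. This is a minimal illustration on a synthetic signal, not the authors' analysis script: the function name `band_component` is hypothetical, the band edges follow the definitions given in the text, and the half-open interval [low, high) is an assumption chosen here so that a frequency on a shared edge (for example 8 Hz) is assigned to only one band.

```python
import numpy as np

# Band edges in Hz, as defined in the text.
BANDS = {
    "theta": (4.0, 8.0),
    "alpha": (8.0, 12.0),
    "beta": (12.0, 30.0),
    "gamma": (30.0, 100.0),
}


def band_component(signal: np.ndarray, sfreq: float, low: float, high: float) -> np.ndarray:
    """Return the time-domain part of `signal` with frequencies in [low, high).

    The signal is transformed with a real FFT, all bins outside the band are
    zeroed, and the result is transformed back with the inverse real FFT.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sfreq)
    mask = (freqs >= low) & (freqs < high)
    return np.fft.irfft(spectrum * mask, n=signal.size)


# Synthetic single-channel example: 2 s at 256 Hz with a 6 Hz and a 10 Hz component.
sfreq = 256.0
t = np.arange(512) / sfreq
signal = np.sin(2 * np.pi * 6 * t) + 0.5 * np.sin(2 * np.pi * 10 * t)
components = {name: band_component(signal, sfreq, lo, hi) for name, (lo, hi) in BANDS.items()}
```

Zeroing FFT bins acts as an ideal (brick-wall) filter, which is simple to reason about but can introduce ringing on short segments with sharp transients.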

EEG source reconstruction.
Apart from the sensor-level analysis, the EEG data allows for source localization. Here, three segments of the data were utilized as an example to perform the source-level analysis using the MNE package. The fsaverage MRI template (https://surfer.nmr.mgh.harvard.edu/fswiki/FsAverage) 44 in the MNE package was utilized to complete the surface reconstruction process. A 3-layer Boundary Element Method (BEM) model with 15360 triangles and conductivities of 0.3 S/m, 0.006 S/m, and 0.3 S/m for the brain, skull, and scalp compartments respectively was created. Source spaces consisted of 10242 sources per hemisphere. Three segments of the pre-processed EEG data with a band-pass frequency band of 0.5-80 Hz, each corresponding to one line displayed in the experiment, were used to calculate the inverse solution. Inverse solutions were calculated using dynamic Statistical Parametric Maps (dSPM). The method was selected because it is widely used by researchers and is representative of currently used methods 45. We offer the code for source reconstruction in our GitHub repository. See the Code availability section for detailed information.
The visualization of the source activities is shown in Fig. 5b. Results for the left and right hemispheres are presented separately. The moments of peak activation in the left and right brain regions are chosen for visualization. The source localization results for the first segment reveal a dispersed activation area, encompassing the anterior temporal lobe and temporo-parietal region, which are associated with language comprehension and primary processing 46. The results of the second segment exhibit more focused activation, particularly near the left middle temporal gyrus, an area (encompassing Wernicke's area) intimately related to language comprehension 47. The activation areas for the third segment are localized in the left temporal and frontal lobes, potentially representing high-level stages of language processing, including sentence construction, semantic processing, and language expression 48. Fig. 5c presents plots of source activities over time, derived from 12 sources in the corresponding region with strongest activities. The first two curves in each plot correspond to sources in the left and right hemispheres that reach maximum peak values.
Fig. 4 EEG time course and the neural oscillations under different frequency bands (i.e., Theta, Alpha, Beta, and Gamma) corresponding to the Chinese sentence meaning "Draw me a sheep". The pre-processed EEG data using 0.5-80 Hz band-pass filter from ses-LittlePrince of sub-07 was used in the analysis. We illustrated the EEG signals from electrode C3, which is located at a language processing related area overlying the temporal lobe.
Text embeddings with pre-trained language model. To assist researchers in efficiently exploring the alignment between EEG and text representations, as well as in text decoding based on EEG, this study provides embeddings of the two novels calculated using a pre-trained language model, accompanied by the code to compute these embeddings. This work employed Google's pre-trained language model BERT-base-chinese 28. This model, pre-trained on Chinese corpora, effectively encodes Chinese semantic features. Given that Chinese characters are the smallest unit of composition in Chinese writing and cannot be further decomposed 49, BERT-base-chinese adopts a character-based tokenization approach, treating each Chinese character as a token for embedding. During the experimental procedure, each displayed line of text contains n Chinese characters. The BERT-base-chinese model processes these n Chinese characters, yielding an embedding of size (n, 768), where n represents the number of Chinese characters and 768 the dimensionality of the embedding. So that displayed lines of varying length have embeddings of the same shape, the embeddings are averaged along the first dimension, which standardizes the embedding size to (1, 768) for each line. This procedure was implemented using the Hugging Face Transformers v4.36.2 50 package.
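The pooling step can be sketched as follows. To keep the example self-contained, random arrays stand in for the per-character model outputs, so this is not a call into the Transformers library. The function name `pool_line_embedding` is hypothetical, and the sketch assumes the input already contains exactly the n character embeddings (how special tokens are handled before averaging is not specified here).

```python
import numpy as np

EMBED_DIM = 768  # hidden size stated in the text


def pool_line_embedding(char_embeddings: np.ndarray) -> np.ndarray:
    """Average per-character embeddings of shape (n, 768) into shape (1, 768)."""
    return char_embeddings.mean(axis=0, keepdims=True)


# Random stand-ins for model outputs of two displayed lines with 7 and 10 characters.
rng = np.random.default_rng(0)
line_a = rng.normal(size=(7, EMBED_DIM))
line_b = rng.normal(size=(10, EMBED_DIM))

pooled = np.concatenate([pool_line_embedding(line_a), pool_line_embedding(line_b)], axis=0)
```

Mean pooling makes every line comparable regardless of length, at the cost of discarding character order within the line.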
Temporal alignment between EEG, text sequences, and eye-tracking data. This section explains how to align the EEG data with its corresponding text content and eye-tracking data in the temporal domain.
To facilitate semantic decoding, it is necessary to align specific text with its corresponding EEG segment in the temporal domain. During data collection, the start and end of each line of the stimuli were marked, thereby enabling the alignment of each text line with a corresponding segment of EEG data. Given the consistent highlighting duration for each character, the EEG segment can be equally divided to match the corresponding characters. In the GitHub repository, we offer the script to align the EEG segments to their corresponding text and text embeddings.
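A minimal sketch of the character-level division is given below. It is not the authors' align_eeg_with_sentence.py; the function name `split_line_by_character` is hypothetical, and the example assumes a line segment stored as a (channels, samples) array at 256 Hz with 0.35 s per character, as described in the text.

```python
import numpy as np


def split_line_by_character(segment: np.ndarray, n_chars: int) -> list[np.ndarray]:
    """Divide one line's EEG segment into `n_chars` nearly equal time windows.

    segment: array of shape (n_channels, n_samples) between the line's start
    and end markers. Windows differ by at most one sample when n_samples is
    not a multiple of n_chars.
    """
    return np.array_split(segment, n_chars, axis=1)


# Synthetic line with 10 characters: 10 * 0.35 s * 256 Hz = 896 samples.
sfreq, char_duration_s, n_chars = 256, 0.35, 10
n_samples = round(n_chars * char_duration_s * sfreq)
line_segment = np.zeros((128, n_samples))
char_segments = split_line_by_character(line_segment, n_chars)
```

At 256 Hz a 0.35 s character window spans 89.6 samples, so the per-character windows necessarily alternate between 89 and 90 samples; analyses that need equal-length inputs may want to trim or pad accordingly.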
The recorded eye-tracking data can be aligned with the EEG data to verify whether participants were focusing on the text as expected. The eye-tracking data captures both the scene viewed by the participants at each moment and the coordinates of the gaze points. During each experimental run, the markers "EYES" and "EYEE" were inserted into the EEG recordings when the eye-tracker was activated and deactivated, respectively. These markers enable precise alignment between the eye-tracking data and the EEG recordings. Once the alignment is complete, markers in the EEG recordings enable the extraction of specific eye-tracking data segments corresponding to particular EEG segments. These eye-tracking data segments can be used to check whether the eye fixation locations align with the anticipated positions on the screen, thus reflecting the quality of the EEG data.
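One way to picture this alignment is to express an EEG interval relative to the "EYES" marker and convert it to eye-tracking sample indices. The sketch below is illustrative only: the function name `eeg_to_eye_samples` is hypothetical, and it assumes that the first eye-tracking sample coincides with the "EYES" marker, that the eye-tracker runs at a constant 100 Hz, and that there is no clock drift between the two devices.

```python
def eeg_to_eye_samples(
    seg_start_s: float,
    seg_end_s: float,
    eyes_marker_s: float,
    eye_sfreq: float = 100.0,
) -> tuple[int, int]:
    """Map an EEG interval (seconds on the EEG clock) to eye-tracking sample indices.

    eyes_marker_s is the EEG time of the "EYES" marker, taken here as the
    moment of eye-tracking sample 0 (an assumption of this sketch).
    """
    if seg_start_s < eyes_marker_s:
        raise ValueError("segment starts before the eye-tracker was activated")
    first = round((seg_start_s - eyes_marker_s) * eye_sfreq)
    last = round((seg_end_s - eyes_marker_s) * eye_sfreq)
    return first, last


# Hypothetical example: "EYES" at 2.0 s, a text line shown from 12.0 s to 15.5 s.
first_idx, last_idx = eeg_to_eye_samples(12.0, 15.5, eyes_marker_s=2.0)
```

The gaze coordinates between `first_idx` and `last_idx` could then be compared with the on-screen position of the highlighted line.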

Usage Notes
The code for the experiment and data analysis has been uploaded to GitHub to facilitate sharing and utilization, and is accessible at https://github.com/ncclabsustech/Chinese_reading_task_eeg_processing. The code repository contains four main modules, each including the scripts required to reproduce the experiment and data analysis procedures. The script cut_chinese_novel.py in the novel_segmentation_and_text_embeddings folder contains the code to prepare the stimulation materials from source materials. The script play_novel.py in the experiment module contains code for the experiment, including text stimuli presentation and control of the EGI device and Tobii Pro Glasses 3 eye-tracker. The script preprocessing.py in the data_preprocessing_and_alignment module contains the main part of the code to apply pre-processing on EEG data. The script align_eeg_with_sentence.py in the same module contains code to align the EEG segments with corresponding text contents and text embeddings. The docker module contains the Docker image required for deploying and running the code, as well as tutorials on how to use Docker for environment deployment.
The code for EEG data pre-processing is highly configurable, permitting flexible adjustments of various pre-processing parameters, such as data segmentation range, downsampling rate, filtering range, and choice of ICA algorithm. Researchers can modify and optimize this code according to their specific requirements.
Before using our ChineseEEG dataset 26,27, we encourage all users to check the README.md and the updated information in the GitHub repository.

Fig. 1 Overview of the experiment and the modalities included in the dataset. (a) Equipment utilized in the experiment, including the EGI device for collecting EEG data and the Tobii Pro Glasses 3 eye-tracker for tracking eye movements. (b) The experiment setup. Participants were instructed to sit quietly approximately 67 cm from the screen and sequentially read the highlighted text. (c) The experimental protocol. Participants' 128-channel EEG signals and eye-tracking data were recorded while reading the highlighted text. (d) The data modalities in the dataset. The dataset comprises raw data such as the original textual stimuli, eye movement data, EEG data, and derivatives such as text embeddings from pre-trained NLP models and pre-processed EEG data.

Fig. 2 EEG pre-processing pipeline. (a) Data segmentation: Data is segmented based on markers, retaining only the data from the formal reading phase. (b) Band-pass filtering: Two versions of filtered data are provided, with band-pass ranges of 0.5-30 Hz and 0.5-80 Hz respectively. (c) Bad channel interpolation: Our bad channel detection includes automatic detection implemented with the pyprep package and manual checking. For interpolation, the spherical spline interpolation implemented in MNE is utilized. (d) ICA denoising: In this part, the automatic labeling method in the mne-iclabel package is utilized, followed by manual checking to remove noisy independent components such as eye movements and heartbeats. (e) Dataset organization: Our dataset is organized in the BIDS 41,42 format. The detailed file structure is shown in Fig. 3.

Fig. 3 File structure of the dataset. (a) Eye-tracking data: Each experimental run is associated with a .rar file that contains eye-tracking data. (b) Electrode information files: These include detailed information on electrodes such as the location, type, and sampling rate, as well as information on any channels marked as bad during pre-processing. (c) EEG data and event-related files: Including EEG data in BrainVision format and event files that record marker information. (d) ICA-related files: Containing independent components in numpy format, records of removed components during pre-processing, and topographic maps of the components. (e) Text materials: Containing original and segmented text. (f) Text embedding files: Each file corresponds to an experimental run and is stored in .npy format. (g) Raw EEG data.

Fig. 5 EEG source localization analysis. (a) EEG sensor-level data: Three segments of pre-processed EEG data using 0.5-80 Hz band-pass filter were selected for analysis, accompanied by the corresponding text segments shown above the EEG segments. (b) Visualization of brain activation after source analysis: The dSPM method was utilized to solve the inverse problem. Results for the left and right hemispheres are presented separately. The moments of peak activation in the left and right brain regions are chosen for visualization. (c) Plots of source activity over time: Each plot contains the activities of 12 sources in the region with the strongest activity.