Introducing MEG-MASC a high-quality magneto-encephalography dataset for evaluating natural speech processing

The “MEG-MASC” dataset provides a curated set of raw magnetoencephalography (MEG) recordings of 27 English speakers who listened to two hours of naturalistic stories. Each participant performed two identical sessions, involving listening to four fictional stories from the Manually Annotated Sub-Corpus (MASC) intermixed with random word lists and comprehension questions. We time-stamp the onset and offset of each word and phoneme in the metadata of the recording, and organize the dataset according to the ‘Brain Imaging Data Structure’ (BIDS). This data collection provides a suitable benchmark to large-scale encoding and decoding analyses of temporally-resolved brain responses to speech. We provide the Python code to replicate several validations analyses of the MEG evoked responses such as the temporal decoding of phonetic features and word frequency. All code and MEG, audio and text data are publicly available to keep with best practices in transparent and reproducible research.


Background & Summary
Humans have the unique ability to produce and comprehend an infinite number of novel utterances.This capacity of the human brain has been the subject of vigorous studies for decades.Yet, the core computational mechanisms upholding this feat remain largely unknown [1][2][3] .
To tackle this issue, a common experimental approach has been to decompose language processing into elementary computations using highly controlled factorial designs.This approach allows experimenters to compare average brain responses to carefully chosen stimuli and make inferences based on the select ways that those stimuli were designed to differ.The field has learnt a lot about the neurobiology of language by taking this approach; however, factorial designs also face several key challenges 4 .First, this method has led the community to study language processing in atypical scenarios (e.g. using unusual text fonts 5 , meaningless syntactic constructs 6,7 , or words and phrases isolated from context 8,9 ).Presenting language in this unconventional manner runs the risk of studying phenomena that are not representative of how language is naturally processed.Second, high-level cognitive functions can be difficult to fully orthogonalize in a factorial design.For instance, comparing brain responses to words and sentences matched in length, syntactic structure, plausibility and pronunciation is often close to impossible.In the best case, experimenters will be forced to make concessions on how well the critical contrasts are controlled.In the worst case, unidentified confounds may drive differences associated with experimental contrasts, leading to incorrect conclusions.
During the past decade, several studies have complemented the factorial paradigm with more natural experiments.In these studies, participants listen to continuous speech [10][11][12] , read continuous prose 13,14 or watch videos that include verbal communication 15 .This approach is more likely to recruit neural computations that are representative of day-to-day language processing.Complications arising from correlated language features can be overcome by explicitly modeling properties of interest, in tandem with potential confounds.This allows variance belonging to either source to be appropriately distinguished.
To analyze the brain responses to the complex stimulation that natural language provides, a variety of encoding and decoding methods have proved remarkably effective 10,[16][17][18][19][20][21] .Consequently, language studies based on naturalistic designs have since flourished 11,22 .The popularity of this approach has some of its roots in the rise of natural language processing (NLP) algorithms, which map remarkably onto brain responses to written and spoken sentences [23][24][25][26][27][28][29][30] .Such tools also allow experimenters to annotate the language stimuli for features of interest, without relying on time-consuming annotations done by hand.These data have allowed researchers to identify the main semantic components 10 , recover the hierarchy of integration constants in the language network 31 , distinguish syntax and semantics hubs 32 and to track the hierarchy of predictions elicited during speech processing 28,33,34 .More generally, brain responses to natural stories have proved useful in keeping participants engaged, while studying the neural representations of phonemes, word surprisal and entropy 22,35,36 .
While large and high-quality functional Magnetic Resonance Imaging (fMRI) datasets related to language processing have recently been released 37,38 , there is currently little publicly available high-quality temporally-resolved brain recordings acquired during story listening.The most extensive databases of such data include: • van Essen et al. 39 : 72 subjects recorded with fMRI and MEG as part of the Human Connectome Project, listening to 10 minutes of short stories, no repeated session 39  41 Until the present release of our dataset, there existed no public magneto-encephalography (MEG) with (1) several hours of story listening (2) multiple sessions (3) a systematic audio, phonetic and word annotations (4) a standardized data structure.Thus, our dataset offers a powerful resource to the scientific community.
In the present study, 27 English-speaking subjects performed ~two hours of story listening, punctuated by random word lists and comprehension questions in the MEG scanner.Except if stated otherwise, each subject listened to four distinct fictional stories twice.

Methods
Participants.Twenty-seven English-speaking adults were recruited from the subject pool of NYU Abu Dhabi (15 females; age: M = 24.8,SD = 6.4).All participants provided a written informed consent and were compensated for their time.Participants reported having normal hearing and no history of neurological disorders.All participants were right-handed, as evaluated using the Edinburgh Handedness Inventory questionnaire 42 .All but one participant (S21) were native English speakers -this person was a native speaker of Hindi, and learned English at 10 years old.All but five participants (S3, S12, S16, S20, S21) performed two identical one-hour-long sessions.These two recording sessions were separated by at least one day and at most two months depending on the availability of the experimenters and of the participants.The study was approved by the Institutional Review Board (IRB) ethics committee of New York University Abu Dhabi.
Procedure.Within each ∼1 h recording session, participants were recorded with a 208 axial-gradiometer MEG scanner built by the Kanazawa Institute of Technology (KIT), and sampled at 1,000 Hz, and online band-pass filtered between 0.01 and 200 Hz while they listened to four distinct stories through binaural tube earphones (Aero Technologies), at a mean level of 70 dB sound pressure level.
Before the experiment, participants were exposed to 20 sec of each of the distinct speaker voices used in the study to (i) clarify the structure of the session and (ii) familiarize the participants with these voices.The sound files and scripts are available in ('/stimuli/exp_intro/').
The order in which the four stories were presented was assigned pseudo-randomly, thanks to a "Latin-square design" across participants.The story order for each participant can be found in 'participants.tsv' .This participant-specific order was used for both recording sessions.Our motivation for running two identical sessions was to (i) give researchers the ability to average the data across the two recordings to boost signal-to-noise; (ii) provide a like-for-like data reliability measure; (iii) give the opportunity for matched train and test datasets if attempting to run cross validated analyses.
To ensure that the participants were attentive to the stories, they responded, every ∼3 min and with a button press, to a two-alternative forced-choice question relative to the story content (e.g.'What precious material had Chuck found?Diamonds or Gold').Participants performed this task with an average accuracy of 98%, confirming their engagement with and comprehension of the stories.The questions and answers are provided in ('stimuli/task/question_dict.py').
Participants who did not already have a T1-weighted anatomical scan usable for the present study were scanned in a 3 T Magnetic-Resonance-Imaging (MRI) scanner after the MEG recording to avoid magnetic artefacts.Twelve participants returned for their T1 scan.
Before each MEG session, the head shape of each participant was digitized with a hand-held FastSCAN laser scanner (Polhemus), and co-registered with five head-position coils.The positions of these coils with regard to the MEG sensors were collected before and after each recording and stored in the 'marker' file, following the KIT's system.The experimenter continuously monitored head position during the acquisition to ensure that the participants did not move.
Stimuli.Four English fictional stories were selected from the Manually Annotated Sub-Corpus (MASC) which is part of the larger Open American National Corpus 43 .MASC is distributed without license or other restrictions (https://anc.org/data/masc/corpus/577-2/): • 'LW1': a 861-word story narrating an alien spaceship trying to find its way home (5 min, 20 sec) • 'Cable Spool Boy': a 1,948-word story narrating two young brothers playing in the woods (11 min) • 'Easy Money': a 3,541-word fiction narrating two friends using a magical trick to make money (12 min, 10 sec) • 'The Black Willow': a 4,652-word story narrating the difficulties an author encounters during writing (25 min, 50 sec) An audio track corresponding to each of these stories was synthesized using Mac OS Mojave © version 10.14 text-to-speech.To help decorrelate language features from acoustic representations, we varied both voices and speech rate every 5-20 sentences.Specifically, we used three distinct synthetic voices:' Ava' , 'Samantha' and ' Allison' speaking between 145 and 205 words per minute.Additionally, we varied the silence between sentences between 0 and 1,000 ms.Both speech rate and silence duration were sampled from a uniform distribution between the min and max values.
Each story was divided into ~3 min sound files.In between these sounds -approximately every 30 s -we played a random word list generated from the unique content words (nouns, proper nouns, verbs, adverbs and adjectives) selected from the preceding 5 min segment presented in random order.We decided to include word lists to allow data users to compare brain responses to content words within and outside of context, following experimental paradigms of previous studies 38,44 .In addition, a very small fraction (<1%) of non-words were inserted into the natural sentences, on average every 30 words.We decided to include non-words to allow comparisons between phonetic sequences that do and do not have an associated meaning.
Hereafter, and following the BIDS labeling 45 , each "task" corresponds to the concatenation of these sentences and word lists.Each subject listened to the exact same set of four tasks, in a different block order.
Preprocessing.MEG.The MEG dataset and its annotations are shared raw (i.e.not preprocessed) organized according to the Brain Imaging Data Structure 45 MNE-BIDS 46 .
For four subjects (02, 06, 07, 19) we were unable to record structural MRIs, and so instead we provide the scaled FreeSurfer average MRIs in their place.
The alignment between the spaces of (1) the head-position coils, (2) the MEG sensors and (3) the T1 MRI was co-registered manually with MNE-Python 48 .
Both sentences and word lists were annotated for phoneme boundaries and labels (107 phoneme labels, detailing phoneme category and its location in the word (Beginning; Internal; End) using the 'Gentle aligner' from the Python module lowerquality https://lowerquality.com/gentle/.However, the inclusion of the original audio leaves the possibility for future research to develop more advanced alignment technique and recover additional features.
For each phoneme and word, we indicate the corresponding voice, speech rate, wav file, story, word position within the sequence, and sequence position within the story, and whether the sequence is a word list or a sentence.

Data Records
The dataset is organized according to Brain Imaging Data Structure (BIDS) 1.2.1 45 and publicly available on the Open Science Framework data repository 52 https://doi.org/10.17605/OSF.IO/AG3KJ under a Creative Common Licence 0. An image of the folder structure is provided in Fig. 1.The detailed description of the BIDS file system is available at http://bids.neuroimaging.io/.In summary, • './dataset_description.json'describes the dataset • './participants.tsv'indicates the age and gender of each participant, the order in which they heard the stories, whether they have an anatomical MRI scan, and how many recording sessions they completed • './stimuli/' contains the original texts, the modified texts (i.e. with word lists), the synthesized audio tracks.
• Each' ./sub-SXXX'contains the brain recordings of a unique participant divided by session (e.g.'ses-0' and'ses-1') • In each session folder lies the anatomical and the meg data, and the timestamp annotations (see Fig. 4).
• Sessions are numbered by temporal order (s0 is first; s1 is second).• Tasks are numbered by a unique integer common across participants.
• The dataset can be read directly with MNE-BIDS 46 .

technical Validation
We checked that the present dataset complies with the standardized brain imaging data structure by using the Bids-Validator (https://github.com/bids-standard/bids-validator).
MEG recordings are notoriously noisy and thus challenging to validate empirically.In particular, MEG can be corrupted by environmental noise (nearby electronic systems) and physiological noise (eye movement, heart activity, facial movements) 53 .To address this issue, several labs have proposed a myriad of preprocessing techniques based on temporal and spatial filtering 54 and trial and channel rejection 55 .However, there is currently no accepted standard for the selection and ordering these preprocessing steps.Consequently, we here opted for (1) a minimalist preprocessing pipeline derived from MNE-Python's default pipeline 48 followed by (2) median evoked responses and (3) standard single-trial linear decoding analyses.
Evoked. Figure 2 displays the median evoked responses across participants and words onset and after phoneme onsets, respectively.Both of these topographies are typical of auditory activity in MEG 36 .

Decoding.
For each recording independently, our objective was to verify the alignment between the word annotations and the MEG recordings.To this end, we trained a linear classifier W ∈ Rd across all d = 208 magnetometers (X ∈ Rn × d), for each time sample relative to word (or phoneme) onset independently, and for each subject separately.The classifier consisted of a standard scaler, followed by a linear discriminant regression implemented by scikit-learn 50 using model = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis()) • decode high versus low median zipf-frequency of each word, as defined by the WordFreq package 56 .
• decode whether the phoneme is voiced or not.
The decoding pipeline was trained and evaluated using a five-split cross-validation scheme (with shuffling) using cv = KFold(5, shuffle = True, random_state = 0) The scoring metric reported is Pearson R correlation between the continuous probabilistic output of the classifier on each trial, and the ground truth label (high vs. low for word frequency; voiced vs. voiceless for voicing).The full decoding pipeline can be found in the script check_decoding.py.
The results displayed in Fig. 3 show a reliable decoding at the phoneme and at the word level, across both subjects and tasks (i.e.stories).
The success of our decoding analysis demonstrates: (i) the data have been correctly time-stamped relative to phoneme and word onset, in order to elicit a zero-aligned decoding timecourse; (ii) the data contain reliable signals that contain speech-related properties, suitable for further investigation; (iii) information at multiple levels (phonetic and lexical) are present in the data, allowing users to test hypotheses at different linguistic levels of description.We anticipate that encoding models would provide equally compelling results.Note that a decoding performance of Pearson R = 0.08 is typical for single-trial MEG data of continuous listening, and is of the same magnitude that has been reported in previous studies 38 .We have a large number of events (tens of thousands of phonemes; thousands of words), and this dataset has been demonstrated to provide sufficient statistical power to yield significant results, despite small effect sizes 36,57 .

Fig. 2
Fig. 2 Median (across subjects) evoked response to all words.The gray area indicates the global field power (GFP).

Fig. 3
Fig. 3 (a) Average (mean) decoding of whether the phoneme is voiced or not as a function of time following phoneme onset.The four colors refer to the four tasks (stories + word lists).Error bar are SEM across subjects.(b) Same as A for the decoding of words' zipf frequency as a function of word onset.(c) Decoding of voicing (average across all tasks) for each participant, as a function of time following phoneme onset.(d) Same as C for decoding of word frequency (average across all tasks) for each participant, as a function of time following word onset.

•
Brennan and Hale 40 : 33 subjects recorded with EEG, listening to 12 min of a book chapter, no repeated session 40 • Broderick et al. 11 : 9-33 subjects recorded with EEG, conducting different speech tasks, no repeated sessions 11 • Schoffelen et al. 38 : 100 subjects recorded with fMRI and MEG, listening to de-contextualised sentences and word lists, no repeated session 38 • Armeni et al. 41 : 3 subjects recorded with MEG, listening to 10 h of Sherlock Holmes, no repeated session