Decoding speech perception from non-invasive brain recordings

Decoding speech from brain activity is a long-awaited goal in both healthcare and neuroscience. Invasive devices have recently led to major milestones in that regard: deep learning algorithms trained on intracranial recordings now start to decode elementary linguistic features (e.g. letters, words, spectrograms). However, extending this approach to natural speech and non-invasive brain recordings remains a major challenge. Here, we introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from the non-invasive recordings of a large cohort of healthy individuals. To evaluate this approach, we curate and integrate four public datasets, encompassing 175 volunteers recorded with magneto- or electro-encephalography (M/EEG) while they listened to short stories and isolated sentences. The results show that our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and more than 80% in the very best participants, a performance that allows the decoding of words and phrases absent from the training set. The comparison of our model to a variety of baselines highlights the importance of (i) a contrastive objective, (ii) pretrained representations of speech and (iii) a common convolutional architecture simultaneously trained across multiple participants. Finally, the analysis of the decoder's predictions suggests that they primarily depend on lexical and contextual semantic representations. Overall, this effective decoding of perceived speech from non-invasive recordings delineates a promising path to decode language from brain activity, without putting patients at risk for brain surgery.


Introduction
Every year, traumatic brain injuries, strokes and neurodegenerative diseases cause thousands of patients to lose their ability to speak or even communicate [22,23,56,72,79,87]. Brain-computer interfaces (BCI) have raised high expectations for detecting [13,22,49,72] and restoring communication abilities in such patients [17,37,48,66,88,93]. Over the past decades, several teams have used BCI to efficiently decode phonemes, speech sounds [4,78], hand gestures [88,93] and articulatory movements [9,66] from electrodes implanted in the cortex or over its surface. For instance, Willett et al. [93] decoded 90 characters per minute (with 94% accuracy, i.e. roughly 15-18 words per minute) from a spinal-cord injury patient recorded in the motor cortex during 10 hours of writing sessions. Similarly, Moses et al. [66] decoded 15.2 words per minute (with 74.4% accuracy, and using a vocabulary of 50 words) in an anarthria patient implanted in the sensorimotor cortex and recorded over 48 sessions spanning 22 hours. Finally, Metzger et al. [61] recently showed that a patient with severe limb and vocal-tract paralysis, implanted over the sensorimotor cortex, could efficiently spell words using a code word that represented each English letter (e.g. "alpha" for "a"): this approach leads to a 6.13% character error rate and 29.4 characters per minute, and hence starts to provide a viable communication channel for such patients.
However, such invasive recordings face a major practical challenge: these high-quality signals require brain surgery and can be difficult to maintain chronically. Several laboratories have thus focused on decoding language from non-invasive recordings of brain activity like magneto- and electro-encephalography (M/EEG). MEG and EEG are sensitive to macroscopic changes of electric and magnetic signals elicited in the cortex, and can be acquired with a safe and potentially wearable setup [14]. However, these two devices produce notoriously noisy signals that vary greatly across sessions and across individuals [33,50,82]. It is thus common to engineer pipelines that output hand-crafted features, which, in turn, can be learned by a decoder trained on a single participant [20,57,58,67,68,74].
In sum, decoding language from brain activity is, to date, limited either to invasive recordings or to impractical tasks. Interestingly, both of these approaches follow a similar method: (1) training a model on a single patient and (2) aiming to decode a limited set of interpretable features (Mel spectrogram, letters, phonemes, small set of words).
Instead, we here propose to decode speech from non-invasive brain recordings by using (1) a single architecture trained across a large cohort of participants and (2) deep representations of speech learnt with self-supervised learning on a large quantity of speech data. We focus the present work on speech perception in healthy volunteers rather than speech production in patients, in order to design a deep-learning architecture that effectively addresses two core challenges: (1) the fact that non-invasive brain recordings can be extremely noisy and variable across trials and participants and (2) the fact that the nature and format of language representations in the brain remain largely unknown. For this, we introduce a convolutional neural network stacked onto a "Subject Layer" and trained with a contrastive objective to predict the representations of the audio waveform learnt by wav2vec 2.0 pretrained on 56k hours of speech [10] (Figure 1). To validate our approach, we curate and integrate four public M/EEG datasets, encompassing the brain activity of 175 participants passively listening to sentences of short stories. With a sample of 3 seconds of M/EEG signals, our model identifies the matching audio segment (i.e. zero-shot decoding) with up to 72.5% top-10 accuracy (out of 1,594 segments) for MEG and up to 19.1% top-10 accuracy (out of 2,604 segments) for EEG.
This approach provides three main contributions for the development of non-invasive BCI. First, it shows how pretrained speech models can be leveraged to decode speech from brain activity, without exposing volunteers to a tedious repetition of every single word targeted by the decoder. Second, this study shows how our design choices (including contrastive learning and our multi-subject architecture) lead to an efficient processing of continuous EEG and MEG recordings, and thus offers a data-driven guideline for the development of future BCI. Finally, our analyses suggest that the resulting decoder primarily relies on high-level representations of words and phrases.

Method
We first formalize the general task of neural decoding and then describe and motivate the different components of our model, before describing the datasets, preprocessing, training and evaluation.

Problem formalization
We aim to decode speech from a time series of high-dimensional brain signals recorded with non-invasive magneto-encephalography (MEG) or electro-encephalography (EEG) while healthy volunteers passively listened to spoken sentences in their native language. How spoken words are represented in the brain is largely unknown [19,39,41]. Thus, it is common to train decoders in a supervised manner to predict a latent representation of speech known to be relevant to the brain [4,6,7,54,55]. For example, the Mel spectrogram is often targeted for neural decoding because it represents sounds similarly to the cochlea [60]. We formalize this problem as follows. Let $X \in \mathbb{R}^{C \times T}$ be a segment of a brain recording of a given subject while she listens to a speech segment of the same duration, with $C$ the number of M/EEG sensors and $T$ the number of time steps. Let $Y \in \mathbb{R}^{F \times T}$ be the latent representation of speech, using the same sample rate as $X$ for simplicity, here the Mel spectrogram with $F$ frequency bands. In this formalization, supervised decoding consists of finding a decoding function $f_{\mathrm{reg}} : \mathbb{R}^{C \times T} \to \mathbb{R}^{F \times T}$ such that $f_{\mathrm{reg}}$ predicts $Y$ given $X$. We denote by $\hat{Y} = f_{\mathrm{reg}}(X)$ the representation of speech decoded from the brain. When $f_{\mathrm{reg}}$ belongs to a parameterized family of models like deep neural networks, it can be trained with a regression loss $L_{\mathrm{reg}}(Y, \hat{Y})$ (e.g. the Mean Square Error):
$$\min_{f_{\mathrm{reg}}} \sum_{X, Y} L_{\mathrm{reg}}(Y, f_{\mathrm{reg}}(X)) \qquad (1)$$
This direct regression approach appears to be dominated by a non-distinguishable broadband component when speech is present (Figure 2.A-B). This challenge motivates our three main contributions: the introduction of a contrastive loss, a pre-trained deep speech representation, and a dedicated brain decoder.
Figure 1: Method. We aim to decode speech from the brain activity of healthy participants recorded with magnetoencephalography (MEG) or electroencephalography (EEG) while they listen to stories and/or sentences. For this, our model extracts the deep contextual representations of 3 s speech signals (Y of F features by T time samples) from a pretrained 'speech module' (wav2vec 2.0: Baevski et al. [10]) and learns the representations (Z) of the brain activity on the corresponding 3 s window (X of C recording channels by T time samples) that maximally align with these speech representations with a contrastive loss (CLIP: Radford et al. [80]). The representation Z is given by a deep convolutional network. At evaluation, we input the model with left-out sentences and compute the probability of each 3 s speech segment given each brain representation. The resulting decoding can thus be "zero-shot" in that the audio snippets predicted by the model need not be present in the training set. This approach is thus more general than standard classification approaches where the decoder can only predict the categories learnt during training.

Contrastive loss
We reasoned that regression may be an ineffective loss because it departs from our objective, which is to maximally distinguish different speech segments from one another. Indeed, a regression objective stems from the principle that all of the dimensions of the Mel spectrogram are (1) equally important and (2) scaled appropriately: the L2 objective inclines the model to predict low and high frequencies equally well, even if (1) some frequencies (e.g. very low) may be irrelevant to speech and (2) some frequencies may be orders of magnitude lower than others. To relax this constraint, we opted for a contrastive objective and thus replaced the regression loss with the "CLIP" loss (originally for Contrastive Language-Image Pre-Training) by Radford et al. [80], which was originally designed to match latent representations in two modalities, text and images. Unlike the regression objective, this contrastive loss leads the model to find a combination of features that maximally discriminates samples in the batch. Consequently, the model is naturally inclined to focus on the informative dimensions of the Mel spectrograms, and to scale them appropriately. We implement the CLIP loss as follows. Let $X$ be a brain recording segment and $Y \in \mathbb{R}^{F \times T}$ the latent representation of its corresponding sound (a.k.a. "positive sample"). We sample $N - 1$ negative samples $\bar{Y}_{j \in \{1, \ldots, N-1\}}$ over our dataset and we add the positive sample as $\bar{Y}_N = Y$. We want our model to predict the probabilities $\forall j \in \{1, \ldots, N\},\ p_j = \mathbb{P}[\bar{Y}_j = Y]$. We thus train a model $f_{\mathrm{clip}}$ mapping the brain activity $X$ to a latent representation $Z = f_{\mathrm{clip}}(X) \in \mathbb{R}^{F \times T}$. The estimated probability can then be approximated by the dot product of $Z$ and the candidate speech latent representations $\bar{Y}_j$, followed by a softmax:
$$\hat{p}_j = \frac{e^{\langle Z, \bar{Y}_j \rangle}}{\sum_{j'=1}^{N} e^{\langle Z, \bar{Y}_{j'} \rangle}} \qquad (2)$$
with $\langle \cdot, \cdot \rangle$ the inner product over both dimensions of $Z$ and $\bar{Y}_j$. We then train $f_{\mathrm{clip}}$ with a cross-entropy between $p_j$ and $\hat{p}_j$. Note that for a large enough dataset, we can neglect the probability of sampling the same segment twice, so that we have $p_j = \mathbb{1}_{j=N}$, and the cross-entropy simplifies to
$$L_{\mathrm{CLIP}}(p, \hat{p}) = -\log \hat{p}_N = -\langle Z, Y \rangle + \log \sum_{j'=1}^{N} e^{\langle Z, \bar{Y}_{j'} \rangle} \qquad (3)$$
Following [80], we use the other elements of the batch as negative samples at train time. At test time, the negative samples correspond to all of the segments of the test set but the positive one.
Figure 2 (partial caption): C. Replacing the regression loss with a CLIP loss [80] improves reconstruction in the same subject, still using the mel-spectrogram as the speech representation. D. Now replacing the mel-spectrogram with wav2vec 2.0 [10]. The probabilities given by Eq. (2) are used to rebuild a mel-spectrogram. E. Architecture of the brain module. Architecture used to process the brain recordings. For each layer, we note first the number of output channels, while the number of time steps is constant throughout the layers. The model is composed of a spatial attention layer, then a 1x1 convolution without activation. A "Subject Layer" is selected based on the subject index s, which consists in a 1x1 convolution learnt only for that subject with no activation. Then, we apply five convolutional blocks made of three convolutions. The first two use residual skip connection and increasing dilation, followed by a BatchNorm layer and a GELU activation. The third convolution is not residual, and uses a GLU activation (which halves the number of channels) and no normalization. Finally, we apply two 1x1 convolutions with a GELU in between.
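As an illustration of the contrastive objective described above, the following is a minimal NumPy sketch rather than the released implementation. It assumes that each brain latent in a batch is paired with the speech latent of the same index and that the remaining batch elements serve as negatives; the function name `clip_loss` and the toy shapes are hypothetical.

```python
import numpy as np

def clip_loss(z, y):
    """Contrastive (CLIP-style) loss for a batch.

    z: brain latents, shape (N, F, T); y: speech latents, shape (N, F, T).
    Row i of `z` is paired with row i of `y`; the other rows act as negatives.
    Returns the mean loss and the (N, N) matrix of estimated probabilities.
    """
    n = z.shape[0]
    # Inner product over both the feature and time dimensions -> (N, N) scores.
    scores = np.einsum("ift,jft->ij", z, y)
    # Numerically stable log-softmax over the candidate axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Cross-entropy with a one-hot target on the matching segment.
    loss = -log_probs[np.arange(n), np.arange(n)].mean()
    return loss, np.exp(log_probs)

rng = np.random.default_rng(0)
y = rng.standard_normal((8, 4, 10))
z_good = y.copy()                          # perfectly aligned brain latents
z_rand = rng.standard_normal((8, 4, 10))   # unrelated brain latents
loss_good, probs = clip_loss(z_good, y)
loss_rand, _ = clip_loss(z_rand, y)
```

In this toy setting, aligned latents yield a much lower loss than unrelated ones, which is the behaviour the objective is meant to reward.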

Brain module
For the brain module, we introduce a deep neural network $f_{\mathrm{clip}}$, which takes as input the raw M/EEG time series $X$ and a one-hot encoding of the corresponding subject $s$, and outputs the latent brain representation $Z$, with the same sample rate as $X$. This architecture consists of (1) a spatial attention layer over the M/EEG sensors, followed by (2) a subject-specific 1x1 convolution designed to leverage inter-subject variability, which feeds into (3) a stack of convolutional blocks. An overview of the model is given in Figure 2. In the following, given a tensor $U$, we write $U^{(i, \ldots)}$ to access specific entries of the tensor.
Spatial attention. The brain data is first remapped onto $D_1 = 270$ channels with a spatial attention layer based on the location of the sensors. The 3D sensor locations are first projected on a 2D plane obtained with the MNE-Python function find_layout [30], which uses a device-dependent surface designed to preserve the channel distances. Their 2D positions are finally normalized to $[0, 1]$. For each output channel, a function over $[0, 1]^2$ is learnt, parameterized in the Fourier space. The weights over the input sensors are then given by the softmax of the function evaluated at the sensor locations. Formally, each input channel $i$ has a location $(x_i, y_i)$ and each output channel $j$ is attached to a function $a_j$ over $[0, 1]^2$ parameterized in the Fourier space as $z_j \in \mathbb{C}^{K \times K}$ with $K = 32$ harmonics along each axis, i.e.
$$a_j(x, y) = \sum_{k=1}^{K} \sum_{l=1}^{K} \mathrm{Re}(z_j^{(k,l)}) \cos(2\pi(kx + ly)) + \mathrm{Im}(z_j^{(k,l)}) \sin(2\pi(kx + ly)) \qquad (4)$$
The output is given by a softmax attention based on the evaluation of $a_j$ at each input position $(x_i, y_i)$:
$$\forall j \in [D_1],\quad \mathrm{SA}(X)^{(j)} = \frac{1}{\sum_{i=1}^{C} e^{a_j(x_i, y_i)}} \sum_{i=1}^{C} e^{a_j(x_i, y_i)} X^{(i)} \qquad (5)$$
with SA the spatial attention. In practice, as $a_j$ is periodic, we scale down $(x, y)$ to keep a margin of 0.1 on each side. We then apply a spatial dropout by sampling a location $(x_{\mathrm{drop}}, y_{\mathrm{drop}})$ and removing from the softmax each sensor that is within a distance of $d_{\mathrm{drop}} = 0.2$ of the sampled location. We then add a 1x1 convolution (i.e. with a kernel size of 1) without activation and with the same number $D_1$ of output channels.
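The spatial attention can be sketched in a few lines of NumPy. This is a simplified illustration, not the released code: it assumes that the real and imaginary parts of the Fourier coefficients weight cosine and sine harmonics indexed from 1 to K, it omits the spatial dropout and the subsequent 1x1 convolution, and the function name and toy sizes are hypothetical.

```python
import numpy as np

def spatial_attention(x, pos, z):
    """x: (C, T) sensor signals; pos: (C, 2) sensor positions in [0, 1]^2;
    z: (D, K, K) complex Fourier coefficients, one K x K grid per output channel.
    Returns the (D, T) remapped signals and the (D, C) attention weights."""
    k = np.arange(1, z.shape[1] + 1)
    # Phase 2*pi*(k*x + l*y) for every sensor and harmonic pair -> (C, K, K).
    phase = 2 * np.pi * (k[None, :, None] * pos[:, 0, None, None]
                         + k[None, None, :] * pos[:, 1, None, None])
    # a[j, i]: value of the learnt function a_j at the location of sensor i.
    a = (np.einsum("jkl,ikl->ji", z.real, np.cos(phase))
         + np.einsum("jkl,ikl->ji", z.imag, np.sin(phase)))
    a = a - a.max(axis=1, keepdims=True)      # stabilise the softmax
    w = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)
    return w @ x, w

rng = np.random.default_rng(0)
C, T, D, K = 5, 7, 3, 4                       # toy sizes (the text uses D1 = 270, K = 32)
pos = 0.1 + 0.8 * rng.random((C, 2))          # positions scaled to keep a 0.1 margin
z = rng.standard_normal((D, K, K)) + 1j * rng.standard_normal((D, K, K))
out, w = spatial_attention(rng.standard_normal((C, T)), pos, z)
```

Each output channel is thus a convex combination of the input sensors, with weights that depend only on sensor positions, which is what allows the same layer to be shared across devices with different sensor layouts.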
Subject Layer. To leverage inter-subject variability, we learn a matrix $M_s \in \mathbb{R}^{D_1 \times D_1}$ for each subject $s \in [S]$ and apply it after the spatial attention layer along the channel dimension. This is similar to, but more expressive than, the subject embedding used by Chehab et al. [21] for MEG encoding, and follows decades of research on subject alignment [35,94].
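Because a 1x1 convolution acts independently at each time step, the subject layer amounts to a per-subject matrix product along the channel dimension. The sketch below illustrates this with NumPy; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def subject_layer(x, subject, m):
    """x: (B, D1, T) batch of signals; subject: (B,) integer subject indices;
    m: (S, D1, D1) one learnt matrix per subject."""
    # Select each sample's matrix, then mix channels at every time step.
    return np.einsum("bij,bjt->bit", m[subject], x)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5))
m = np.stack([np.eye(3), 2 * np.eye(3)])      # subject 0: identity, subject 1: doubling
out = subject_layer(x, np.array([0, 1]), m)
```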
Residual dilated convolutions. We then apply a stack of five blocks of three convolutional layers. For the $k$-th block, the first two convolutions are applied with residual skip connections (except for the very first one, where the number of dimensions potentially does not match), output $D_2 = 320$ channels and are followed by batch normalization [43] and a GELU activation [36]. The two convolutions are also dilated to increase their receptive field, respectively by $2^{(2k) \bmod 5}$ and $2^{(2k+1) \bmod 5}$ (with $k$ zero-indexed). The third layer in a block outputs $2D_2$ channels and uses a GLU activation [26], which halves the number of channels. All convolutions use a kernel size of 3 over the time axis, a stride of 1, and sufficient padding to keep the number of time steps constant across layers. The output of the model is obtained by applying two final 1x1 convolutions: first with $2D_2$ outputs, followed by a GELU, and finally with $F$ channels as output, thus matching the dimensionality of speech representations. Given the expected delay between a stimulus and its corresponding brain responses, we further shift the input brain signal by 150 ms into the future to facilitate the alignment between $Y$ and $Z$.
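To make the dilation schedule concrete, the short sketch below enumerates the dilations of the two residual convolutions in each block, reading the exponents as taken modulo 5, and the padding that keeps the number of time steps constant for a kernel of size 3 and a stride of 1. The helper names are hypothetical.

```python
def dilation_schedule(n_blocks=5):
    """Dilations of the two residual convolutions in each block (k zero-indexed)."""
    return [(2 ** ((2 * k) % 5), 2 ** ((2 * k + 1) % 5)) for k in range(n_blocks)]

def same_padding(dilation, kernel_size=3):
    """Padding on each side that preserves the sequence length at stride 1."""
    return dilation * (kernel_size - 1) // 2

schedule = dilation_schedule()
paddings = [(same_padding(d1), same_padding(d2)) for d1, d2 in schedule]
# schedule -> [(1, 2), (4, 8), (16, 1), (2, 4), (8, 16)]
```

Under this reading, the dilation cycles through powers of two up to 16, so that successive layers alternate between short and long temporal contexts.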

Speech module
The Mel spectrogram is a low-level representation of speech inspired by the cochlea and is thus unlikely to match the rich variety of cortical representations [39]. Consequently, we replaced the Mel spectrograms with latent representations of speech. For this, we propose to either learn these representations end-to-end ("Deep Mel" model) or to rely on those learnt by an independent self-supervised speech model ("wav2vec 2.0", Baevski et al. [10]).
End-to-end speech representations with Deep Mel. The "Deep Mel" module uses the same deep convolutional architecture as the brain module, devoid of the subject block, and thus simultaneously learns to extract speech and M/EEG representations such that they are maximally aligned. By definition, and unlike wav2vec 2.0, Deep Mel only sees the audio used in the M/EEG datasets. As this end-to-end approach proved to be less efficient than its pretrained counterpart based on wav2vec 2.0, we will thereafter focus on the latter.
Pretrained speech representations with wav2vec 2.0. Wav2vec 2.0 is trained with audio data only to transform the raw waveform with convolutional and transformer blocks to predict masked parts of its own latent representations. Baevski et al. [10] showed that the resulting model can be efficiently fine-tuned to achieve state-of-the-art performance in speech recognition. Besides, this model effectively encodes a wide variety of linguistic features [1,62]. In particular, recent works show that the activations of wav2vec 2.0 linearly map onto those of the brain [63,92]. Consequently, we here test whether this model effectively helps the present decoding task. In practice, we use wav2vec2-large-xlsr-53 [71], which has been pre-trained on 56k hours of speech from 53 different languages.

Datasets
We test our approach on four public datasets, two based on MEG recordings and two on EEG. All datasets and their corresponding studies were approved by the relevant ethics committee and are publicly available for fundamental research purposes. Informed consent was obtained from each human research participant. We provide an overview of the main characteristics of the datasets in Table 1, including the number of train and test segments and vocabulary size over both splits. For all datasets, healthy adult volunteers passively listened to speech sounds (accompanied by some memory or comprehension questions to ensure participants were attentive), while their brain activity was recorded with MEG or EEG. In Schoffelen et al. [83], Dutch-speaking participants listened to decontextualized Dutch sentences and word lists (Dutch sentences for which the words are randomly shuffled). The study was approved by the local ethics committee (CMO, the local "Committee on Research Involving Human Subjects" in the Arnhem-Nijmegen region). The data is publicly and freely available after registration on the Donders Repository. In Gwilliams et al. [32], English-speaking participants listened to four fictional stories from the Masc corpus [42] in two identical sessions of one hour [31]. The study was approved by the Institutional Review Board (IRB) ethics committee of New York University Abu Dhabi. In Broderick et al. [16], English-speaking participants listened to extracts of "The old man and the sea". The study was approved by the Ethics Committees of the School of Psychology at Trinity College Dublin, and the Health Sciences Faculty at Trinity College Dublin. In Brennan and Hale [15], English-speaking participants listened to a chapter of "Alice in Wonderland". The study was approved by the University of Michigan Health Sciences and Behavioral Sciences Institutional Review Board (HUM00081060). See Section A.1 in the Appendix for more details.

Preprocessing
M/EEG is generally considered to capture neural signals from relatively low frequency ranges [33]. Consequently, we first resampled all brain recordings down to 120 Hz with Torchaudio [95] and then split the data into training, validation, and testing splits with a size roughly proportional to 70%, 20%, and 10%. We define a "sample" as a 3 s window of brain recording with its associated speech representation. A "segment" is a unique 3 s window of speech sound. As the same segment can be presented to multiple subjects (or even within the same subject in Gwilliams et al. [32]), the splits are defined so that one segment is always assigned to the same split across repetitions. We ensure that there are no identical sentences across splits. Furthermore, we exclude all segments overlapping over different splits. For clarity, we restrict the test segments to those that contain a word at a fixed location (here, segments that start 500 ms before word onset).
M/EEG data can suffer from large artifacts, e.g. eye movements, or variations in the electro-magnetic environment [33]. To limit their impact, we apply a "baseline correction" (i.e. we subtract from each input channel its average over the first 0.5 s) and a robust scaler with scikit-learn [77]. We normalize the data and clamp values greater than 20 standard deviations to minimize the impact of large outlier samples. For the Mel spectrogram, we use 120 Mel bands (see Supplementary Section A.2) [96], with a normalized STFT with a frame size of 512 samples and hop length of 128 samples, using audio sampled at 16 kHz. We apply log-compression, i.e. $\log(\epsilon + \mathrm{mel})$, with $\epsilon = 10^{-5}$. When using wav2vec 2.0, we average the activations of the last four layers of its transformer. We use standard normalization for both representations.
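The minimal preprocessing of the brain signals can be sketched as follows. This NumPy illustration makes several simplifying assumptions: the robust scaling (median and interquartile range) is a stand-in for the scikit-learn robust scaler and is computed on a single window rather than fitted on a whole recording, and the function name is hypothetical.

```python
import numpy as np

def preprocess(x, sfreq=120.0, baseline_s=0.5, clamp=20.0):
    """x: (C, T) window of M/EEG. Returns a baseline-corrected, scaled, clamped copy."""
    n_base = int(round(baseline_s * sfreq))
    # Baseline correction: subtract each channel's mean over the first 0.5 s.
    x = x - x[:, :n_base].mean(axis=1, keepdims=True)
    # Robust scaling per channel: centre on the median, divide by the interquartile range.
    q25, q50, q75 = np.percentile(x, [25, 50, 75], axis=1, keepdims=True)
    x = (x - q50) / np.maximum(q75 - q25, 1e-12)
    # Clamp extreme values so that isolated artifacts cannot dominate the loss.
    return np.clip(x, -clamp, clamp)

rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 360))           # 4 channels, 3 s at 120 Hz
raw[0, 200] = 1e6                             # simulated large artifact
clean = preprocess(raw)
```

The simulated artifact is reduced to the clamping bound, while the remaining samples keep their relative scale.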

Training
One training epoch is defined as 1,200 updates using Adam [51] with a learning rate of $3 \cdot 10^{-4}$ and a batch size of 256. We stop training when no improvement is observed on the validation set for 10 epochs and keep the best model based on the validation loss. For the direct regression of the Mel spectrogram, we use the MSE loss. We use two V100 GPUs with 16GB of memory.

Evaluation
Mel reconstructions. In Figure 2, we illustrate some reconstructed Mel spectrograms using different models. With a regression loss, the Mel spectrogram is generated directly. With a CLIP loss, we plot the weighted average across all test segments, where the weight corresponds to the probability of the segment being the true one, as estimated with the CLIP loss. Specifically, given a segment and its matching audio (here the sentence "Thank you for coming Ed"), we retrieve the predicted distribution over the 1,363 segments given by Eq. (2). We then use this distribution to average the Mel spectrogram of each candidate segment.
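The probability-weighted reconstruction used with the CLIP loss reduces to an expectation over the candidate spectrograms. A minimal NumPy sketch, with a hypothetical function name and toy shapes, is:

```python
import numpy as np

def expected_mel(probs, candidate_mels):
    """probs: (N,) distribution over candidates; candidate_mels: (N, F, T).
    Returns the (F, T) probability-weighted average spectrogram."""
    return np.tensordot(probs, candidate_mels, axes=1)

rng = np.random.default_rng(0)
mels = rng.standard_normal((4, 3, 5))         # 4 candidates, 3 bands, 5 frames
confident = expected_mel(np.array([0.0, 0.0, 1.0, 0.0]), mels)
uncertain = expected_mel(np.full(4, 0.25), mels)
```

A confident decoder thus returns the spectrogram of a single candidate, whereas an uncertain one returns a blurred average of many candidates.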
Segment-level evaluation. The top-10 segment accuracy indicates whether the true segment is among the 10 most likely segments predicted by the decoder. We favor reporting this metric over the standard top-1 accuracy given the large number of possible segments: the model may be able to decode useful information without necessarily guessing the exact speech segment.
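For concreteness, the top-k segment accuracy can be computed as in the following NumPy sketch; the function name and the toy probabilities are hypothetical.

```python
import numpy as np

def top_k_accuracy(probs, targets, k=10):
    """probs: (n_samples, n_segments); targets: (n_samples,) index of the true segment."""
    # Indices of the k highest-probability segments for each sample (order irrelevant).
    top_k = np.argpartition(-probs, k - 1, axis=1)[:, :k]
    return float((top_k == targets[:, None]).any(axis=1).mean())

probs = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.2, 0.3, 0.5]])
targets = np.array([1, 2, 0])
acc_top1 = top_k_accuracy(probs, targets, k=1)   # only the first sample is correct
acc_top2 = top_k_accuracy(probs, targets, k=2)   # the second sample is also recovered
```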
Word-level evaluation. To evaluate the model at the word level (e.g. Figure 4), we select a 3 s segment for each word of the test set (from -500 ms to 2.5 s). We input the model with the corresponding brain recordings, and output the probability distribution over all test segments. To obtain the distribution over the vocabulary, we group the candidate segments by the corresponding word (i.e. starting at t=0) and sum the probabilities of the same words spoken in different segments. Top-1 and top-10 word-level accuracy then quantify whether the true word is the most likely prediction or within the ten most likely predictions of the model, respectively.
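The aggregation from segments to words can be sketched as follows, using only the Python standard library. The function names and the toy probabilities are hypothetical.

```python
from collections import defaultdict

def word_distribution(segment_probs, segment_words):
    """Sum the probabilities of segments that share the same word at t=0."""
    totals = defaultdict(float)
    for p, w in zip(segment_probs, segment_words):
        totals[w] += p
    return dict(totals)

def in_top_k(dist, true_word, k):
    """Whether `true_word` is among the k most probable words."""
    ranked = sorted(dist, key=dist.get, reverse=True)
    return true_word in ranked[:k]

# "you" is spoken in two candidate segments; their probabilities add up.
dist = word_distribution([0.25, 0.3, 0.15, 0.3], ["you", "thank", "you", "for"])
```

In this example, no single "you" segment is the most probable candidate, yet "you" becomes the top-1 word once the probabilities of its two segments are summed.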
Prediction analysis. To further inspect the predictions of the decoder, we quantify the extent to which they can be predicted from well-defined features $\tilde{Y} \in \mathbb{R}^{n \times f_i}$. For this, we extract the phonetic features (d = 60) with Phonemizer [12], the 'zipf' frequency (d = 1) with Wordfreq [85], the part-of-speech tags (d = 15) and the word embedding (d = 300) of each word with spaCy [3], as well as the phrase embedding of the corresponding 3 s speech segment (d = 1,024) with Laser [84]. We refer to the sum of word embeddings over the segment as 'bag-of-words'. We then train a ridge regression with scikit-learn's default parameters [77] to predict the softmax probability of the true word output by the decoder, and estimate, with a five-split cross-validation, the correspondence between these two values with a Pearson R correlation. In sum, this analysis quantifies how well the feature predicts the probability of being selected by the decoder.
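The logic of this analysis can be illustrated with a NumPy-only sketch. It is not the pipeline used in the study: ridge regression is solved in closed form with a penalty of 1.0 (the default of scikit-learn's Ridge), folds are contiguous, and the Pearson R is averaged across folds, which is an assumption. Function names and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit_predict(x_tr, y_tr, x_te, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    mu_x, mu_y = x_tr.mean(axis=0), y_tr.mean()
    xc = x_tr - mu_x
    w = np.linalg.solve(xc.T @ xc + alpha * np.eye(xc.shape[1]), xc.T @ (y_tr - mu_y))
    return (x_te - mu_x) @ w + mu_y

def cv_pearson(x, y, n_splits=5, alpha=1.0):
    """Mean Pearson R between held-out ridge predictions and the target."""
    folds = np.array_split(np.arange(len(y)), n_splits)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred = ridge_fit_predict(x[train_idx], y[train_idx], x[test_idx], alpha)
        scores.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return float(np.mean(scores))

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 5))      # stand-in for, e.g., word embeddings
target = features @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)
r_informative = cv_pearson(features, target)
r_unrelated = cv_pearson(rng.standard_normal((200, 5)), target)
```

A feature set that carries information about the decoder's output yields a high cross-validated R, whereas an unrelated feature set yields an R close to zero.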
Statistics. Statistical comparison is performed on the test set. We use a Wilcoxon test across participants to compare different models on the same datasets. We use a Mann-Whitney test across participants to compare different datasets.

Code availability
The code to reproduce the present study is available at github.com/facebookresearch/brainmagick.

For more than half of samples, the true audio segment is ranked first or second in the decoder's predictions. Interestingly, these performances can reach high top-1 accuracy in the best performing subjects: e.g. above 80% top-1 accuracy in the best participant of the dataset of Gwilliams et al. [31] (Figure 3A). For comparison, a model that predicts a uniform distribution over the vocabulary ('random model') only achieves less than 1% top-10 accuracy on the same MEG datasets. Decoding performance for EEG datasets is lower: our model reaches 17.7% and 25.7% top-10 accuracy for the two EEG datasets presently analyzed. While modest, these scores are much higher than the random baseline.

Is MEG really much better than EEG?
To investigate whether these performances depend on the total recording duration and/or the number of recording sensors, we train our model on a subset of the data that homogenizes recording time, number of sensors and number of participants. For this, we discard the dataset from Brennan and Hale [15], to avoid overly restricting the data available for this analysis. We then match all datasets to the smallest number of channels of the three remaining datasets by keeping a random but fixed subset of channels (e.g. 128). We keep only 19 subjects per dataset, again aligning on the smallest of the three datasets. Finally, we keep the same average duration per subject for all three datasets, by dropping some training segments (i.e. the same segments are dropped for all subjects or repetitions within one subject). All test segments are kept to maximize reliability. Overall, this subsampling diminishes decoding performance (e.g. top-10: 30.3% for Schoffelen et al. [83] and 31.7% for Gwilliams et al. [32]), but MEG decoding remains much better than EEG decoding (Mann-Whitney across MEG and EEG subjects: all $p < 10^{-6}$). While these results should be confirmed by presenting the same stimuli to participants recorded with either EEG or MEG, they suggest that the difference in decoding performance observed between studies is mainly driven by the type of device.

'Speech module' evaluation.
To evaluate our approach, we compare these decoding performances to those obtained with models that target different representations of speech (Figure 3 and Table 2). While a model trained to predict the Mel spectrogram with a regression objective ('Base model' in Table 2) performs systematically above chance, the use of a contrastive loss ('+ CLIP') leads to decoding gains that range from 2% (for Brennan and Hale [15]) to 42.7% (for Gwilliams et al. [32]). This gain is further increased by targeting latent representations of the Mel spectrogram ('+ Deep Mel'). The latent representations of speech sounds, however, appear to be best identified with a pretrained speech module, i.e. by using 'wav2vec 2.0', a model pretrained with self-supervised learning on speech sounds only, rather than by jointly learning speech and M/EEG representations (Table 2). Overall, these results show the importance, for decoding, of targeting latent representations of speech.
To evaluate the elements of the brain module, we performed a series of ablation experiments, and trained the corresponding models on the same data. Overall, several elements impact performance. Performance systematically decreases when removing skip connections, the spatial attention module, or the initial or final convolutional layers (Table 3). These results also show how essential clamping is to train the model. Additional experiments confirm that the present end-to-end architecture is robust to M/EEG artefacts, and thus requires little preprocessing (Supplementary sections A.3 and A.2).
Importantly, these ablation analyses also reveal the importance of the subject layer. Note that this gain is modest when compared to the performance obtained with the subject embedding we introduced recently [21]. To further investigate whether our model effectively leverages inter-individual variability, we trained it on a variable number of subjects and computed its accuracy on the first 10% of subjects. As shown in Figure 3C, decoding performance steadily increases as the model is trained with more subjects on the two MEG datasets.

Decoded representations best correlate with phrase embeddings.
What type of representations does our model use to decode speech from brain signals? This interpretability question is notoriously difficult to address [6,50]. In an attempt to nonetheless shed light on this issue, we analyze the single-word and single-segment predictions of our model. Figure 4 and Supplementary Figure A.1 illustrate such predictions, i.e. the probability of each word given the MEG data of five representative subjects, and five representative segments of Gwilliams et al. [32], respectively. Then, we train a linear regression to predict the softmax probability of the true word estimated by the decoder, given different sets of features, ranging from low-level (e.g. phonemes) to high-level representations (e.g. phrase embedding, see Methods for details). The results, displayed in Figure 5, show that the part-of-speech ($p < 0.004$), word embedding ($p < 10^{-8}$), bag-of-words ($p < 10^{-23}$) and phrase embedding ($p < 10^{-23}$) significantly predict the single-trial decoding predictions. Given that phrase embeddings are known to capture semantic and syntactic representations [18,19,38], these correlations suggest that our model decodes relatively high-level representations of speech.

Discussion
Overall, our model accurately identifies, from 3 seconds of non-invasive recordings of brain activity, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities. This performance, which sometimes reaches 80% in the very best participants, allows the decoding of perceived words and phrases absent from the training set.
To decode speech perception from M/EEG, two major challenges must be overcome.First, these signals can be very noisy, making it difficult to extract useful information.Second, it is unclear which features of speech are, in fact, represented in the brain.In this section, we will discuss how our "brain" and "speech" modules respectively address these two issues in the case of speech perception.Finally, we evaluate the performance of our model in comparison to previous works and outline the necessary steps to be taken before hoping to deploy this approach for the decoding of speech production in clinical settings.

Efficiently extracting brain signals.
Non-invasive recordings are notoriously noisy: these signals present large variations across trials and subjects and they are often contaminated by large artifacts [33,50,82]. Historically, solving this problem has involved complex preprocessing pipelines that include a variety of techniques (independent component analysis, outlier detection, artifact correction) and that end with a linear model specific to each participant [34,47,69]. More recently, several deep learning architectures have proved successful in solving simple classification tasks trained on single-subject recordings [24,81].
Building on these efforts, our end-to-end architecture requires minimal preprocessing of M/EEG signals and can be trained with a variety of participants, devices, and stimuli.As decoding speech production can be challenged by the presence of muscular activity, we here evaluate this model on four public datasets where healthy participants listened to natural speech.Our analyses suggest that advanced M/EEG preprocessing does not provide a major advantage in the current decoding task and that a simple baseline correction followed by a robust scaler and clamping suffices.In addition, not only does our subject-specific layer improves decoding performance, but this performance increases with the amount of participants present in the training set.These findings, combined with both the rise of publicly available datasets and the potential to learn informative features from unannotated data [11,91] suggest that this brain module may be a stepping stone for building a foundational model of brain recordings.

How is language represented in the brain?
Separating noise and signal in brain recordings is not, however, the only challenge. The nature of these representations in terms of their acoustic, phonetic, lexical, and semantic properties remains poorly known. Consequently, determining the representations most suitable for decoding is an unresolved problem. To tackle this issue, previous studies have primarily used supervised models targeting well-defined features of language, such as individual letters, phonemes, or frequency bands of the audio spectrogram [5,24,44,46,57,64,70,74,76,93]. Although this approach has demonstrated clear successes, it may impede the speed at which words are decoded from brain activity: for instance, spelling out a word letter by letter could be a slow and laborious process. As an alternative, others have proposed to learn to classify a small set of words [20,25,28,53,67]. This approach, however, is difficult to scale to a vocabulary size adequate for natural language. Finally, word semantics may be directly decoded from fMRI signals [2,27,29,40,75,90]. However, the corresponding performances remain currently modest at the single trial level.
Figure 4: Word-level predictions for five representative subjects (between the 20% (top) and the 80% percentiles (bottom) of the cohort) of Gwilliams et al. [32] while they listened to the sentence "Thank you for coming, Ed". Blue words correspond to the correct word, black words to negative candidates. Text size is proportional to the log-probability output by our model.
Figure 5: The R values quantify the extent to which phonemes, word frequency, part-of-speech, word embedding, and phrase embedding respectively predict the probability of the predicted word to be correct. Error bars are the SEM across participants.
Here, we show how a neural network pretrained on a large corpus of speech sounds provides representations of language that are particularly valuable for brain decoding. Specifically, we leverage the recent discovery that these self-supervised speech models learn features that linearly relate to those of the brain [63,92] to build our speech module. By applying contrastive learning, we can effectively identify the most appropriate features for identifying new speech segments. Our analyses confirm that this approach outperforms (1) a supervised decoding of the Mel spectrogram as well as (2) 'Deep Mel', i.e. latent representations of the Mel spectrogram optimized for decoding solely from the present M/EEG datasets. Finally, the inspection of the decoding predictions suggests that our model primarily relies on the lexical and contextual features captured by modern word embeddings and language models. To date, however, what these high-level features represent and how these representations are structured and manipulated remain to be determined.
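As a rough illustration of the contrastive objective discussed above, the sketch below scores each brain latent against every candidate speech latent with a dot product, turns the scores into probabilities with a softmax, and penalizes the negative log-probability of the matching segment. It is a simplified, pure-Python approximation under stated assumptions: latents are flat vectors rather than feature-by-time arrays, no normalization is applied, and the function names (`clip_probabilities`, `clip_loss`, `top_k_accuracy`) are hypothetical rather than taken from the released code.

```python
import math


def clip_probabilities(brain, speech):
    """For each brain latent, softmax over its dot products with all speech latents."""
    probs = []
    for z in brain:
        scores = [sum(a * b for a, b in zip(z, y)) for y in speech]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def clip_loss(brain, speech):
    """Mean negative log-probability of the matching pair (brain[i], speech[i])."""
    probs = clip_probabilities(brain, speech)
    return -sum(math.log(row[i]) for i, row in enumerate(probs)) / len(probs)


def top_k_accuracy(probs, k):
    """Fraction of rows whose true index i is among the k most probable candidates."""
    hits = 0
    for i, row in enumerate(probs):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(probs)


# Hypothetical toy latents: two speech segments and two brain windows.
speech = [[1.0, 0.0], [0.0, 1.0]]
brain = [[5.0, 0.0], [0.0, 5.0]]
loss = clip_loss(brain, speech)
accuracy = top_k_accuracy(clip_probabilities(brain, speech), k=1)
```

Restricting `speech` to a smaller candidate list (for example 50 segments instead of more than 1,000) only changes the denominator of the softmax, which is why the reported accuracy depends on the size of the candidate set.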

Comparison to previous works
Comparing the performance of our model to previous works is difficult because the variety of experimental paradigms is not compensated by a profusion of open datasets and reproducible code. Two elements may, nonetheless, substantiate such a comparison.
First, the size of the vocabulary presently considered exceeds previous attempts, often by several orders of magnitude. M/EEG studies typically used supervised decoders to discriminate a very small set of words [20,25,28,53,67] or sublexical classes (e.g. phonemes, syllables, tones) [5,44,46,57,70,74]. For example, Sun and Qin [89], Sree and Kavitha [86], and Moinnereau et al. [65] developed decoders to classify 11, 5 and 2 distinct imagined phonemes, respectively, from EEG signals. Similarly, Chan et al. [20], Lopopolo and van den Bosch [58], and Nguyen et al. [68] respectively developed an MEG decoder to classify 6 distinct parts of speech (with 48% accuracy), 10 words (83% accuracy) and 3 words (70% accuracy). The limited vocabulary used in these non-invasive studies contrasts with the present approach, which accurately distinguishes several hundreds of words. Furthermore, the performances of our model are based on vocabularies which do not fully overlap with those used in the training set (Table 1). For example, for the Gwilliams dataset, decoding performance reaches 40% in spite of the fact that nearly 36% of the words were never presented during training. Overall, such zero-shot decoding shows the versatility of this approach and opens the possibility to decode from even larger vocabularies.
Second, although our model's performance remains modest, it may not be too distant from the performance obtained with invasive recordings of brain activity. Indeed, decoding the perception of isolated words from a vocabulary of n=50 words leads to a top-1 accuracy of 22.7% on average, but up to 42.9% in the best participants (Supplementary Figure A.2). In comparison, Moses et al. [66] report decoding produced words from intracranial recordings with a top-1 accuracy of 39.5% for isolated words out of n=50 words. Similarly, still restricting the number of candidates to 50 and, this time, within the context of a sentence, our model's top-1 accuracy is above 72.5% on average across participants, and the very best participants reach between 92.2% [83] and 95.9% ([31], Figure 3B), whereas Moses et al. [66] reached a top-1 accuracy of 74%. While the comparison between the decoding of perceived and produced words should be considered with caution given their different brain bases, the performance of the current model leads us to be optimistic about its potential applicability in a speech production context.

Remaining steps to decode speech production in the clinics
Our non-invasive approach focuses on speech perception. To reach the performance obtained with clinical recordings [7,8,37,52,59,61,66,93], decoding intended communication will thus require addressing several challenges. Three specific challenges stand out.
First, the current model needs to be adapted to speech production. This can, in principle, be achieved by replacing the speech module with a neural network pre-trained on production tasks such as handwriting or speech production.
Second, the current contrastive learning objective can only identify the most likely word or speech segment from a predetermined set. The model thus needs to be supplemented with a generative module that can estimate the most likely phoneme, word, or sentence given brain activity without requiring this set of candidates, similarly to what is being achieved with fMRI [73,90].
Finally, our study reveals striking differences between EEG and MEG. While EEG is known to be less precise than MEG, we did not expect such a large difference in decoding performance. Adapting current MEG systems to the clinics will require substantial efforts, however: while new room-temperature sensors already show signal-to-noise ratios comparable to the superconducting quantum interference devices (SQUIDs) used in the present study, these systems are not commonly deployed in clinical settings, whose magnetic environment can be extremely noisy. Combined with A.I. systems, these new devices could nevertheless contribute to improving the diagnosis, prognosis and restoration of language processing in non- or poorly-communicating patients without putting them at risk for brain surgery. In that regard, we hope that the release of a fully-reproducible pipeline will contribute to the development of safe and scalable non-invasive methods for decoding intended communication.

Data Availability
The data from Schoffelen et al. [83] was provided (in part) by the Donders Institute for Brain, Cognition and Behaviour with a "RU-DI-HD-1.0" licence. The data for Gwilliams et al. [32] is available under CC0 1.0 Universal. The data for Broderick et al. [16] is available under the same licence. Finally, the data from Brennan and Hale [15] is available under the CC BY 4.0 licence. All audio files were provided by the authors of each dataset.

Code Availability
The complete source code for processing the datasets and for training and evaluating the models and method presented here is available at github.com/facebookresearch/brainmagick.

Competing interests
The authors declare no competing interests.

A Supplementary information
Clamping is here motivated by the fact that electro-magnetic recordings can be subject to perturbations orders of magnitude larger than brain activity. As explained in Section 2.4, clamping is performed as follows: for each recording independently, we apply a robust scaler such that the data range [-1, 1] maps to the [0.25, 0.75] quantile range. The resulting M/EEG signals are thus expected to have a scale of the order of 1. In Table A.2, we provide the top-10 accuracy for our model, similar to Table 2. Extending the clamping range to 100 standard deviations does not appear to help extract more information. This result suggests that clamping effectively takes care of outliers without removing useful information. On the contrary, the removal of clamping leads to a collapse of decoding performance. This is expected, as extreme outliers will impact, for instance, the BatchNorm mean and standard deviation estimates, and one outlier can thus impact the entire batch. Outliers can also cause extreme gradients and throw off the optimization process. Overall, these analyses emphasize the importance of using clamping for M/EEG analyses.
To evaluate the potential usefulness of advanced M/EEG preprocessing techniques, we compare our model to one trained on M/EEG data preprocessed with the Autoreject package [45]. This package aims to detect and correct corrupted channels based on their spatial neighborhood. Note that this package can also reject samples that are too corrupted. However, as this procedure would change the definition of the test set, we do not consider it here. Due to the added complexity of applying Autoreject, which requires in particular the loading of the full dataset in memory, we only evaluated it on the first 16 recordings of Gwilliams et al. [32]. Overall, similar performances were obtained with and without Autoreject. This result suggests that our model does not trivially benefit from advanced preprocessing techniques.
Our model uses a fixed delay of 150 ms to align speech and EEG/MEG representations. This decision was originally motivated by the non-compressible delay that exists between the cochlea's response and the cortical activations. To further investigate this choice, we here study the impact of this delay on the Gwilliams2022 dataset. We observe only a small impact of this parameter: setting it to 0 reduces the top-10 accuracy by just 0.5%.
Are the models trained to decode Mel features (as opposed to latent representations) impeded by the number of Mel bands? To study this issue, we evaluate the impact of the number of frequency bands used for the Mel spectrogram for different versions of the model, while keeping the minimum and maximum frequencies fixed. For clarity, we only provide the average top-10 accuracy over all datasets. We observe a small increase in accuracy when using more Mel bands for all the models.
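The robust scaling and clamping procedure described earlier in this appendix can be sketched as follows. This is a minimal illustration for a single channel, not the released pipeline: the function name `robust_scale_and_clamp`, the `limit` parameter and its default of 20 are assumptions made for the example. The linear map sends the 0.25 and 0.75 quantiles to -1 and 1, following the description of the robust scaler, and any value beyond `limit` is then clipped.

```python
import statistics


def robust_scale_and_clamp(samples, limit=20.0):
    """Map the [0.25, 0.75] quantile range to [-1, 1], then clip to [-limit, limit].

    `samples` is one channel of one recording; `limit` is a hypothetical
    clamping threshold expressed in units of the scaled signal.
    """
    q25, _, q75 = statistics.quantiles(samples, n=4, method="inclusive")
    spread = q75 - q25
    if spread == 0:
        raise ValueError("constant signal: the quantile range is empty")
    scaled = [2.0 * (x - q25) / spread - 1.0 for x in samples]
    return [max(-limit, min(limit, x)) for x in scaled]


# A clean channel keeps its shape; a large artifact is clipped to the limit.
clean = robust_scale_and_clamp([0.0, 1.0, 2.0, 3.0, 4.0])
with_artifact = robust_scale_and_clamp([0.0, 1.0, 2.0, 3.0, 4.0, 1000.0])
```

Because the quantiles are barely affected by a single extreme value, the artifact changes the scale of the remaining samples only slightly, whereas a mean and standard deviation estimate would be dominated by it.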

Figure 2: Design choices. A. Illustration of a 3 s speech sound segment (bottom) and its corresponding Mel spectrogram (top). B. Mel spectrogram predicted with a direct regression loss L_reg of a brain decoder (orange). C. Replacing the regression loss with a CLIP loss [80] improves reconstruction in the same subject, still using the Mel spectrogram as the speech representation. D. Now replacing the Mel spectrogram with wav2vec 2.0 [10]. The probabilities given by Eq. (2) are used to rebuild a Mel spectrogram. E. Architecture of the brain module. Architecture used to process the brain recordings. For each layer, we note first the number of output channels, while the number of time steps is constant throughout the layers. The model is composed of a spatial attention layer, then a 1x1 convolution without activation. A "Subject Layer" is selected based on the subject index s, which consists of a 1x1 convolution learnt only for that subject with no activation. Then, we apply five convolutional blocks made of three convolutions. The first two use residual skip connections and increasing dilation, followed by a BatchNorm layer and a GELU activation. The third convolution is not residual, and uses a GLU activation (which halves the number of channels) and no normalization. Finally, we apply two 1x1 convolutions with a GELU in between.

Figure 3: A. Each dot represents the top-10 accuracy of a single subject, as estimated either with the full test set (blue) or with 50 possible segments (orange). B. Same as A, for top-1 accuracy. C. Top-10 accuracy as a function of the number of participants in the training set (blue line) as evaluated on the first 10% of the participants. The error bar indicates the standard error of the mean (SEM) across participants (gray lines).

Figure A.1: Similar illustration to Figure 4 but for five representative speech segments. The top and bottom segments are the easiest and hardest to decode, respectively. For each segment, we plot the predictions obtained for the subject with the median decoding scores across the cohort.

A.5 Impact of the number of Mels

Figure A.2: Decoding performance for single words, using a vocabulary of 1,000 (red) or 50 (blue) words.

Figure A.3: Attention weights. Red indicates that the M/EEG sensor is, on average, associated with a higher spatial attention weight. With the exception of the Brennan dataset, the topographies highlight channels typically activated during auditory stimulation.

Table 2: Results. Top-10 segment-level accuracy (%) for a random baseline model that predicts a uniform distribution over the segments ('random'), a convolutional network trained to predict the Mel spectrograms with a regression loss ('base'), the same model trained with a contrastive loss ('+ clip') and our model, i.e. trained to predict the features of wav2vec 2.0 with a contrastive loss ('+ wav2vec 2.0'). ± indicates the standard deviation across three random initializations of the model's weights. Our model predicts the correct segment, out of more than 1,000 possibilities, with a top-10 accuracy up to 70.7% on average across MEG subjects (Table 2; top-1 accuracy up to 41.3%, Table A.1).

Table 3: Ablations. Top-10 segment-level accuracy (%) for our model and its ablated versions. Stars indicate significant gain (p < 0.001) across participants. Confidence intervals are computed as Standard Error of the Mean (SEM) over 3 runs.
The code is provided under the CC-NC-BY 4.0 license. We also provide a fixed version of the code as of the release of the present article, on Zenodo under the DOI 10.5281/zenodo.8114374.

7 Acknowledgments
This work was funded in part by FrontCog grant ANR-17-EURE-0017 to JRK for his work at PSL.

8 Statement of contributions
The project was led by Alexandre Défossez and Jean-Remi King. Jean-Remi King took care of the data curation. The training pipeline was built by Jérémy Rapin, Ori Kabeli and Alexandre Défossez. Ori Kabeli and Alexandre Défossez handled model training and hyper-parameter search. The speech module and evaluation pipeline were built by Charlotte Caucheteux. Alexandre Défossez, Charlotte Caucheteux, Ori Kabeli and Jean-Remi King provided in-depth analysis of the data and results. The present paper was written by Alexandre Défossez, Charlotte Caucheteux, and Jean-Remi King.