Main

Every year, traumatic brain injuries, strokes and neurodegenerative diseases cause thousands of patients to lose their ability to speak or even communicate1,2,3,4,5,6. Brain–computer interfaces (BCIs) have raised high expectations for the detection4,5,7,8 and restoration of communication abilities in such patients9,10,11,12,13,14. Over recent decades, several teams have used BCIs to efficiently decode phonemes, speech sounds15,16, hand gestures11,12 and articulatory movements13,17 from electrodes implanted in the cortex or over its surface. For instance, Willett et al. 12 decoded 90 characters per minute (with a 94% accuracy, that is, roughly 15–18 words per minute) from a patient with a spinal-cord injury, recorded in the motor cortex during 10 hours of writing sessions. Similarly, Moses et al. 13 decoded 15.2 words per minute (with 74.4% accuracy, and using a vocabulary of 50 words) in a patient with anarthria and a BCI implanted in the sensori-motor cortex, recorded over 48 sessions spanning 22 hours. Finally, Metzger et al. 18 recently showed that a patient with severe limb and vocal-tract paralysis and a BCI implanted in the sensori-motor cortex could efficiently spell words using a code word that represented each English letter (for example, ‘alpha’ for ‘a’): this approach leads to a character error rate of 6.13% and a speed of 29.4 characters per minute, and hence starts to provide a viable communication channel for such patients.

However, such invasive recordings face a major practical challenge: these high-quality signals require brain surgery and can be difficult to maintain chronically. Several laboratories have thus focused on decoding language from non-invasive recordings of brain activity such as magneto-encephalography (MEG) and electro-encephalography (EEG). MEG and EEG are sensitive to macroscopic changes of electric and magnetic signals elicited in the cortex, and can be acquired with a safe and potentially wearable set-up19. However, these two devices produce notoriously noisy signals that vary greatly across sessions and across individuals20,21,22. It is thus common to engineer pipelines that output hand-crafted features, which, in turn, can be learned by a decoder trained on a single participant23,24,25,26,27,28.

In sum, decoding language from brain activity is, so far, limited either to invasive recordings or to impractical tasks. Interestingly, both of these approaches tend to follow a similar method: that is, (1) training a model on a single participant and (2) aiming to decode a limited set of interpretable features (Mel spectrogram, letters, phonemes, small set of words).

Instead, here we propose to decode speech from non-invasive brain recordings by using (1) a single architecture trained across a large cohort of participants and (2) deep representations of speech learned with self-supervised learning on a large quantity of speech data. We focus the present work on speech perception in healthy volunteers rather than speech production in patients to design a deep-learning architecture that effectively addresses two core challenges: (1) the fact that non-invasive brain recordings can be extremely noisy and variable across trials and across participants and (2) the fact that the nature and format of language representations in the brain remain largely unknown. For this, we introduce a ‘brain module’ and train it with contrastive learning to align its representations to those of a pretrained ‘speech module’, namely, wav2vec 2.0 (ref. 29) (Fig. 1). We train a single model for all participants, sharing most of the weights except for one participant-specific layer. Figure 1 provides a broad overview of our approach.

Fig. 1: Model approach.
figure 1

We aim to decode speech from the brain activity of healthy participants recorded with MEG or EEG while they listen to stories and/or sentences. For this, our model extracts the deep contextual representations of 3 s speech signals (Y, of F features by T time samples) from a pretrained ‘speech module’ (wav2vec 2.0: ref. 29) and learns the representations (Z) of the brain activity on the corresponding 3 s window (X, of C recording channels by T time samples) that maximally align with these speech representations with a contrastive loss (CLIP: ref. 44). The representation Z is given by a deep convolutional network. At evaluation, we input the model with left-out sentences and compute the probability of each 3 s speech segment given each brain representation. The resulting decoding can thus be ‘zero shot’ in that the audio snippets predicted by the model need not be present in the training set. This approach is thus more general than standard classification approaches where the decoder can only predict the categories learnt during training.

To validate our approach, we curate and integrate four public MEG and EEG datasets, encompassing the brain activity of 175 participants passively listening to sentences or short stories (see Table 1 for details). For each MEG and EEG recording, we evaluate our model on its ability to accurately identify the corresponding audio segment from a large set of more than 1,500 segments (that is, ‘zero shot’ decoding).

Table 1 Datasets

This study provides three main contributions for the development of a non-invasive method to decode speech from brain activity. First, it shows how pretrained speech models can leverage the decoding of speech in the brain, without exposing volunteers to a tedious repetition of every single word targeted by the decoder. Second, it shows how specific design choices—including contrastive learning and our multi-participant architecture—improve the processing of continuous EEG and MEG recordings. Finally, our results suggest that the speech decoder is primarily based on high-level and semantic representations of speech.

Results

Accurately decoding speech from MEG and EEG recordings

Our model predicts the correct segment, out of more than 1,000 possibilities, with a top-10 accuracy up to 70.7% on average across MEG participants (Table 2, top-1 accuracy up to 41.3%, and Extended Data Fig. 1). For more than half of the samples, the true audio segment is ranked first or second in the decoders’ predictions. Interestingly, these performances can reach high top-1 accuracy in the best-performing participants: for example, above 80% top-1 accuracy in the best participant of the Gwilliams 2022 dataset30 (Fig. 2a). For comparison, a model that predicts a uniform distribution over the vocabulary (‘random model’) achieves less than 1% top-10 accuracy on the same MEG datasets. Decoding performance for EEG datasets is substantially lower: our model reaches 17.7% and 25.7% top-10 accuracy for the two EEG datasets currently analysed. While modest, these scores are much higher than the random baseline.

Table 2 Results
Fig. 2: Decoding accuracy across subjects and datasets.
figure 2

a, Each dot represents the top-10 accuracy of a single participant, as estimated either with the full test set (blue) or with 50 possible segments (orange). b, The same as in a, but for top-1 accuracy. c, Top-10 accuracy as a function of the number of participants in the training set (blue line) as evaluated on the first 10% of the participants. The error bars indicate the s.e.m. across participants (grey lines).

Is MEG really much better than EEG?

To investigate whether these performances depend on the total recording duration and/or the number of recording sensors, we train our model on a subset of the data that homogenizes recording time, the number of sensors and the number of participants. For this, we discard the dataset of Brennan and Hale31, to avoid overly limiting the analysis. We then match all datasets to the smallest number of channels of the three remaining datasets by keeping a random but fixed subset of channels (for example, 128). We keep only 19 participants per dataset, again matching the smallest of the three datasets. Finally, we keep the same average duration per participant for all three datasets, by dropping some training segments (that is, the same segments are dropped for all participants or repetitions within one participant). All test segments are kept to maximize reliability. Overall, this subsampling diminishes decoding performance (for example, top 10: 30.3% for the Schoffelen dataset32 and 31.7% for the Gwilliams dataset33), but MEG decoding remains much better than EEG decoding (Mann–Whitney across MEG and EEG participants: all \(P < 10^{-6}\)). Although these results should be confirmed by presenting the same stimuli to participants recorded with both EEG and MEG, they suggest that the difference in decoding performance observed between studies is mainly driven by the type of device.

‘Speech module’ evaluation

To evaluate our approach, we compare these decoding performances to those obtained with models that target different representations of speech (Table 2). While a model trained to predict the Mel spectrogram with a regression objective (‘Base model’ in Table 2) performs systematically above chance, the use of a contrastive loss (‘+ Contrastive’) leads to decoding gains that range from 2% (for the Brennan and Hale dataset31) to 42.7% (for the Gwilliams dataset33). This gain is further increased by targeting the latent representations of the Mel spectrogram (‘+ Deep Mel’). The latent representations of speech sounds, however, appear to be best identified with a pretrained speech module, that is, by using wav2vec 2.0, a model pretrained with self-supervised learning on speech sounds only, rather than by jointly learning speech and MEG and EEG representations (Table 2). Overall, these results show the importance, for decoding, of targeting deep representations of speech.

‘Brain module’ evaluation

To evaluate the elements of the brain module, we performed a series of ablation experiments, and trained the corresponding models on the same data (Extended Data Fig. 2). Overall, these ablations show that several elements impact performance: performance systematically decreases when removing skip connections, the spatial attention module, and the initial or final convolutional layers of the brain module. These results also show the importance of clamping the MEG and EEG signals. Finally, additional experiments show that the present end-to-end architecture is robust to MEG and EEG artefacts, and requires little preprocessing of the MEG and EEG signals (Supplementary Sections A.3 and A.4).

Impact of the number of participants

To test whether our model effectively leverages the inter-individual variability, we trained it on a variable number of participants and computed its accuracy on the first 10% of participants. As shown in Fig. 2c, decoding performance steadily increases as the model is trained with more participants on the two MEG datasets. This result shows that our model effectively learns neural representations that are common across participants, while also accommodating participant-specific representations through the participant layer described in Methods.

Decoded representations best correlate with phrase embeddings

What type of representation does our model use to decode speech from brain signals? This interpretability question is notoriously difficult to address22,34. Figure 3 illustrates this issue: it shows the probability of each word given the MEG data of five representative participants listening to the phrase ‘Thank you for coming, Ed’. Extended Data Fig. 3 shows additional predictions for five representative segments of the Gwilliams dataset33. In both cases, it can be difficult to judge whether the decoder’s errors tend to relate to the phonology or to the semantics of the actual sentence.

Fig. 3: Word-level predictions.
figure 3

Word-level predictions for five representative participants (between the 20th (top) and 80th (bottom) percentiles of the cohort) of the Gwilliams dataset33 while they listened to the sentence ‘Thank you for coming, Ed’. Blue words correspond to the correct word and black words correspond to negative candidates. Text size is proportional to the log-probability output by our model.

To address this issue, we analyse the single-word and single-segment predictions of our model with a linear model.

Specifically, we train a linear regression to predict the softmax probability of the true word estimated by the decoder, given different sets of features, ranging from low-level representations (for example, phonemes) to high-level representations (for example, phrase embeddings; see Methods for details). The results (Fig. 4) show that part-of-speech (\(P < 0.004\)), word embedding (\(P < 10^{-8}\)), bag-of-words embedding (\(P < 10^{-23}\)) and phrase embedding (\(P < 10^{-23}\)) features significantly predict the single-trial decoding predictions. Overall, the higher level the representation, the more it accounts for the decoder’s predictions. Given that phrase embeddings are known to capture semantic and syntactic representations35,36,37, these results suggest that our decoder primarily relies on high-level and semantic representations of speech.

Fig. 4: Decoding predictions mainly rely on high-level semantic features.
figure 4

The R values quantify the extent to which phonemes, word frequency, part-of-speech, word embedding and phrase embedding predict the probability that the predicted word is correct. Error bars are the s.e.m. across participants (Table 1).

Methods

We formalize the general task of neural decoding and then describe and motivate the different components of our model, before describing the datasets, preprocessing, training and evaluation.

Problem formalization

We aim to decode speech from a time series of high-dimensional brain signals recorded with non-invasive MEG or EEG while healthy volunteers passively listened to spoken sentences in their native language. How spoken words are represented in the brain is largely unknown37,38,39. Thus, it is common to train decoders in a supervised manner to predict a latent representation of speech known to be relevant to the brain16,34,40,41,42. For example, the Mel spectrogram is often targeted for neural decoding because it represents sounds in a way similar to the cochlea43. We formalize this problem as follows. Let \(X\in\mathbb{R}^{C\times T}\) be a segment of a brain recording of a given participant while they listen to a speech segment of the same duration, with C the number of MEG or EEG sensors and T the number of time steps. Let \(Y\in\mathbb{R}^{F\times T}\) be the latent representation of speech, using the same sample rate as X for simplicity, here the Mel spectrogram with F frequency bands. In this formalization, supervised decoding consists of finding a decoding function \(\mathbf{f}_{\mathrm{reg}}:\mathbb{R}^{C\times T}\to\mathbb{R}^{F\times T}\) such that \(\mathbf{f}_{\mathrm{reg}}\) predicts Y given X. We denote by \(\hat{Y}=\mathbf{f}_{\mathrm{reg}}(X)\) the representation of speech decoded from the brain. When \(\mathbf{f}_{\mathrm{reg}}\) belongs to a parameterized family of models like deep neural networks, it can be trained with a regression loss \(L_{\mathrm{reg}}(Y,\hat{Y})\) (for example, the mean square error)

$$\min_{\mathbf{f}_{\mathrm{reg}}}\;\sum_{X,Y} L_{\mathrm{reg}}\big(Y, \mathbf{f}_{\mathrm{reg}}(X)\big).$$
(1)
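
As a concrete illustration, here is a minimal PyTorch sketch of this regression set-up. The shapes follow the notation above; the architecture of \(\mathbf{f}_{\mathrm{reg}}\) and the dimension values are placeholders for illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: C sensors, F Mel bands, T time steps of a 3 s window at 120 Hz.
C, F, T = 208, 120, 360

# Placeholder decoder f_reg: any network mapping (batch, C, T) -> (batch, F, T).
f_reg = nn.Sequential(
    nn.Conv1d(C, 320, kernel_size=3, padding=1),
    nn.GELU(),
    nn.Conv1d(320, F, kernel_size=1),
)

def regression_step(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """One supervised step: predict the Mel spectrogram Y from brain data X (equation (1))."""
    Y_hat = f_reg(X)                          # (batch, F, T)
    return nn.functional.mse_loss(Y_hat, Y)   # L_reg, here the mean square error

loss = regression_step(torch.randn(8, C, T), torch.randn(8, F, T))
loss.backward()
```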

This direct regression approach appears to be dominated by a non-distinguishable broadband component when speech is present (Extended Data Fig. 4a,b). This challenge motivates our three main contributions: the introduction of a contrastive loss, a pretrained deep speech representation and a dedicated brain decoder.

Model

Contrastive loss

We reasoned that regression may be an ineffective loss because it departs from our objective, which is to maximally distinguish different speech segments from one another. Indeed, a regression objective stems from the principle that all of the dimensions of the Mel spectrogram are (1) equally important and (2) scaled appropriately: the L2 objective inclines the model to predict low and high frequencies equally well, even if (1) some frequencies (for example, very low) may be irrelevant to speech and (2) some frequencies may vary by orders of magnitude less than others. To relax this constraint, we opted for a contrastive objective and thus replaced the regression loss with the ‘CLIP’ loss (Contrastive Language-Image Pre-Training) of ref. 44, which was originally designed to match latent representations in two modalities, text and images. Unlike the regression objective, this contrastive loss leads the model to find a combination of features that maximally discriminates samples in the batch. Consequently, the model is naturally inclined to focus on the informative dimensions of the Mel spectrogram and to scale them appropriately. We implement the CLIP loss as follows. Let X be a brain recording segment and \(Y\in\mathbb{R}^{F\times T}\) the latent representation of its corresponding sound (also known as the ‘positive sample’). We sample N − 1 negative samples \(\bar{Y}_{j\in\{1,\ldots,N-1\}}\) from our dataset and add the positive sample as \(\bar{Y}_N=Y\). We want our model to predict the probabilities \(\forall j\in\{1,\ldots,N\},\; p_j=\mathbb{P}\left[\bar{Y}_j=Y\right]\). We thus train a model \(\mathbf{f}_{\mathrm{clip}}\) mapping the brain activity X to a latent representation \(Z=\mathbf{f}_{\mathrm{clip}}(X)\in\mathbb{R}^{F\times T}\). The estimated probability can then be approximated by the dot product of Z and the candidate speech latent representations \(\bar{Y}_j\), followed by a softmax:

$$\hat{p}_{j}=\frac{\mathrm{e}^{\langle Z,\bar{Y}_{j}\rangle}}{\sum_{j'=1}^{N}\mathrm{e}^{\langle Z,\bar{Y}_{j'}\rangle}},$$
(2)

with \(\langle\cdot,\cdot\rangle\) the inner product over both dimensions of Z and \(\bar{Y}_j\). We then train \(\mathbf{f}_{\mathrm{clip}}\) with a cross-entropy between \(p_j\) and \(\hat{p}_j\). Note that for a large enough dataset, we can neglect the probability of sampling the same segment twice, so that we have \(p_j=\mathbb{1}_{j=N}\), and the cross-entropy simplifies to

$$L_{\mathrm{CLIP}}(p,\hat{p})=-\log(\hat{p}_{N})=-\langle Z,Y\rangle+\log\left(\sum_{j'=1}^{N}\mathrm{e}^{\langle Z,\bar{Y}_{j'}\rangle}\right).$$
(3)

Following ref. 44, we use the other elements of the batch as negative samples at train time. At test time, the negative samples correspond to all of the segments of the test set but the positive one.
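
For concreteness, a minimal PyTorch sketch of this batch-wise contrastive loss is given below. It assumes, as described above, that the negative samples are the other elements of the batch; it is a sketch, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def clip_loss(Z: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Batch-wise CLIP loss of equation (3).

    Z: (batch, F, T) latent brain representations f_clip(X).
    Y: (batch, F, T) speech latents; the other batch items serve as negatives.
    """
    # Inner product over both the feature and time dimensions -> (batch, batch) logits.
    logits = torch.einsum('bft,nft->bn', Z, Y)
    # The positive sample for row i is Y[i], i.e. the diagonal entry.
    targets = torch.arange(Z.shape[0], device=Z.device)
    return F.cross_entropy(logits, targets)
```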

Brain module

For the brain module, we introduce a deep neural network fclip that takes as input the raw MEG and EEG time series X together with a one-hot encoding of the corresponding participant s, and outputs the latent brain representation Z with the same sample rate as X. This architecture consists of (1) a spatial attention layer over the MEG and EEG sensors, followed by (2) a participant-specific 1 × 1 convolution designed to leverage inter-individual variability, whose output is fed to (3) a stack of convolutional blocks. An overview of the model is given in Extended Data Fig. 4e. In the following, given a tensor U, we write U(i,…) to denote access to specific entries of the tensor.

Spatial attention

The brain data are first remapped onto D1 = 270 channels with a spatial attention layer based on the location of the sensors. The three-dimensional sensor locations are first projected onto a two-dimensional plane obtained with the MNE-Python function find_layout45, which uses a device-dependent surface designed to preserve the channel distances. Their two-dimensional positions are then normalized to [0, 1]. For each output channel, a function over \([0,1]^2\) is learnt, parameterized in Fourier space. The weights over the input sensors are then given by the softmax of this function evaluated at the sensor locations. Formally, each input channel i has a location (xi, yi) and each output channel j is associated with a function aj over \([0,1]^2\), parameterized in Fourier space as \(z_j\in\mathbb{C}^{K\times K}\) with K = 32 harmonics along each axis, that is

$$a_{j}(x,y)=\sum_{k=1}^{K}\sum_{l=1}^{K}\mathrm{Re}\big(z_{j}^{(k,l)}\big)\cos\big(2\pi(kx+ly)\big)+\mathrm{Im}\big(z_{j}^{(k,l)}\big)\sin\big(2\pi(kx+ly)\big).$$
(4)

The output is given by a softmax attention based on the evaluation of aj at each input position (xi, yi):

$$\forall j\in\{1,\ldots,D_{1}\},\quad \mathrm{SA}(X)^{(j)}=\frac{1}{\sum_{i=1}^{C}\mathrm{e}^{a_{j}(x_{i},y_{i})}}\left(\sum_{i=1}^{C}\mathrm{e}^{a_{j}(x_{i},y_{i})}\,X^{(i)}\right)$$
(5)

with SA the spatial attention. In practice, as aj is periodic, we scale down (x, y) to keep a margin of 0.1 on each side. We then apply a spatial dropout by sampling a location (xdrop, ydrop) and removing from the softmax each sensor that is within a distance of ddrop = 0.2 of the sampled location. The initial motivation for spatial attention was to define a cross-dataset model that generalizes across devices with different numbers and locations of sensors. Interestingly, we observed that this layer introduces an inductive bias that is beneficial to the prediction accuracy (Extended Data Fig. 2). See Extended Data Fig. 4 for a visualization of the learnt attention maps over each dataset. We then add a 1 × 1 convolution (that is, with a kernel size of 1) without activation and with the same number D1 of output channels.
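
A minimal PyTorch sketch of such a Fourier-parameterized spatial attention is given below. The margin rescaling and spatial dropout described above are omitted, and the initialization and tensor layout are our assumptions rather than the reference code.

```python
import math
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention layer of equations (4) and (5)."""

    def __init__(self, out_channels: int = 270, harmonics: int = 32):
        super().__init__()
        self.K = harmonics
        # Real and imaginary Fourier coefficients z_j for each output channel (assumed init).
        self.weights = nn.Parameter(torch.randn(out_channels, harmonics, harmonics, 2) * 0.01)

    def forward(self, X: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # X: (batch, C, T) brain signals; positions: (C, 2) sensor (x, y) coordinates in [0, 1].
        k = torch.arange(1, self.K + 1, device=X.device, dtype=X.dtype)
        # phase[i, k, l] = 2 * pi * (k * x_i + l * y_i)
        phase = 2 * math.pi * (positions[:, 0, None, None] * k[:, None]
                               + positions[:, 1, None, None] * k[None, :])
        cos, sin = torch.cos(phase), torch.sin(phase)        # (C, K, K)
        re, im = self.weights[..., 0], self.weights[..., 1]  # (D1, K, K)
        # Evaluate a_j at every sensor location -> (D1, C).
        a = torch.einsum('jkl,ckl->jc', re, cos) + torch.einsum('jkl,ckl->jc', im, sin)
        attn = torch.softmax(a, dim=1)                       # softmax over input sensors
        return torch.einsum('jc,bct->bjt', attn, X)          # (batch, D1, T)
```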

Participant layer

To leverage inter-individual variability, we learn a matrix \(M_s\in\mathbb{R}^{D_1\times D_1}\) for each participant \(s\in[S]\) and apply it after the spatial attention layer along the channel dimension. This is similar to, but more expressive than, the participant embedding used by ref. 46 for MEG encoding, and follows decades of research on participant alignment47,48.
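
A minimal sketch of this participant-specific linear layer follows; the identity initialization is an assumption made for illustration, not a detail from the paper.

```python
import torch
import torch.nn as nn

class ParticipantLayer(nn.Module):
    """One learned D1 x D1 matrix per participant, applied along the channel dimension."""

    def __init__(self, n_participants: int, channels: int = 270):
        super().__init__()
        # (S, D1, D1) stack of matrices, initialized to the identity (assumption).
        self.weights = nn.Parameter(torch.eye(channels).repeat(n_participants, 1, 1))

    def forward(self, Z: torch.Tensor, participant: torch.Tensor) -> torch.Tensor:
        # Z: (batch, D1, T); participant: (batch,) integer participant indices.
        M = self.weights[participant]             # (batch, D1, D1)
        return torch.einsum('bij,bjt->bit', M, Z)
```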

Residual dilated convolutions

We then apply a stack of five blocks of three convolutional layers. For the kth block, the first two convolutions are applied with residual skip connections (except for the very first convolution, whose input and output dimensions may not match), output D2 = 320 channels and are followed by batch normalization49 and a GELU (Gaussian Error Linear Unit) activation50. These two convolutions are also dilated to increase their receptive field, by \(2^{2k\,\mathrm{mod}\,5}\) and \(2^{(2k+1)\,\mathrm{mod}\,5}\) (with k zero-indexed), respectively. The third layer in a block outputs 2D2 channels and uses a GLU (Gated Linear Unit) activation51, which halves the number of channels. All convolutions use a kernel size of 3 over the time axis, a stride of 1 and sufficient padding to keep the number of time steps constant across layers. The output of the model is obtained by applying two final 1 × 1 convolutions: first with 2D2 outputs, followed by a GELU, and finally with F channels as output, thus matching the dimensionality of the speech representations. Given the expected delay between a stimulus and its corresponding brain response, we further shift the input brain signal by 150 ms into the future to facilitate the alignment between Y and Z. The impact of this offset is considered in Supplementary Section A.5.
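
The following PyTorch sketch illustrates one such block. The exact placement of the residual connections, normalization and padding reflects our reading of the description above and is not the reference code.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Sketch of the k-th residual dilated convolution block (k zero-indexed)."""

    def __init__(self, k: int, in_channels: int, hidden: int = 320):
        super().__init__()
        d1, d2 = 2 ** (2 * k % 5), 2 ** ((2 * k + 1) % 5)  # dilations of the first two convs
        self.conv1 = nn.Sequential(
            nn.Conv1d(in_channels, hidden, 3, dilation=d1, padding=d1),
            nn.BatchNorm1d(hidden), nn.GELU())
        self.conv2 = nn.Sequential(
            nn.Conv1d(hidden, hidden, 3, dilation=d2, padding=d2),
            nn.BatchNorm1d(hidden), nn.GELU())
        # Third layer: 2 * hidden outputs, GLU halves them back to `hidden`.
        self.conv3 = nn.Sequential(nn.Conv1d(hidden, 2 * hidden, 3, padding=1), nn.GLU(dim=1))
        self.match = in_channels == hidden  # residual only when shapes match

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv1(x)
        y = y + x if self.match else y   # first skip connection (skipped when dims differ)
        z = self.conv2(y) + y            # second residual connection
        return self.conv3(z)
```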

Speech module

The Mel spectrogram is a low-level representation of speech inspired by the cochlea and is thus unlikely to match the rich variety of cortical representations38. Consequently, we replaced the Mel spectrograms with latent representations of speech. For this, we propose either to learn these representations end-to-end (‘Deep Mel’ model) or to rely on those learnt by an independent self-supervised speech model (wav2vec 2.0; ref. 29).

End-to-end speech representations with Deep Mel

The ‘Deep Mel’ module uses the same deep convolutional architecture as the brain module, without the participant block, and thus simultaneously learns to extract speech and MEG and EEG representations such that they are maximally aligned. By definition, and unlike wav2vec 2.0, Deep Mel sees only the audio used in the MEG and EEG datasets. As this end-to-end approach proved to be less efficient than its pretrained counterpart based on wav2vec 2.0, we hereafter focus on the latter.

Pretrained speech representations with wav2vec 2.0

Wav2vec 2.0 is trained with audio data only: it transforms the raw waveform with convolutional and transformer blocks to predict masked parts of its own latent representations. A previous study29 showed that the resulting model can be efficiently fine-tuned to achieve state-of-the-art performance in speech recognition. Moreover, this model effectively encodes a wide variety of linguistic features52,53. In particular, recent studies have shown that the activations of wav2vec 2.0 linearly map onto those of the brain54,55. Consequently, we here test whether this model effectively helps the present decoding task. In practice, we use the wav2vec2-large-xlsr-53 model (ref. 56), which has been pretrained on 56,000 hours of speech from 53 different languages.
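
As an illustration, extracting such speech latents with the Hugging Face transformers library could look as follows. Only the checkpoint name (XLSR-53) and the averaging of the last four transformer layers (see Preprocessing) come from this paper; the model identifier string and the rest of the code are assumptions for the sake of a runnable sketch.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed Hugging Face identifier for the XLSR-53 checkpoint used in the paper.
model_id = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2Model.from_pretrained(model_id).eval()

def speech_latents(waveform, sample_rate: int = 16_000) -> torch.Tensor:
    """Return speech latents Y as the average of the last four transformer layers."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Stack the last four hidden states and average them -> (time, features).
    return torch.stack(out.hidden_states[-4:]).mean(0).squeeze(0)
```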

Datasets

We test our approach on four public datasets, two based on MEG recordings and two based on EEG recordings. All datasets and their corresponding studies were approved by the relevant ethics committee and are publicly available for fundamental research purposes. Informed consent was obtained from each human research participant. We provide an overview of the main characteristics of the datasets in Table 1, including the number of training and test segments and vocabulary size over both splits. For all datasets, healthy adult volunteers passively listened to speech sounds (accompanied by some memory or comprehension questions to ensure participants were attentive), while their brain activity was recorded with MEG or EEG. In Schoffelen et al.32, Dutch-speaking participants listened to decontextualized Dutch sentences and word lists (Dutch sentences for which the words are randomly shuffled). The study was approved by the local ethics committee (the local Committee on Research Involving Human Subjects in the Arnhem–Nijmegen region). The data are publicly and freely available after registration on the Donders Repository. In Gwilliams et al. 33, English-speaking participants listened to four fictional stories from the Masc corpus57 in two identical sessions of 1 hour30. The study was approved by the institutional review board ethics committee of New York University Abu Dhabi. In Broderick et al. 58, English-speaking participants listened to extracts of The Old Man and the Sea. The study was approved by the ethics committees of the School of Psychology at Trinity College Dublin and the Health Sciences Faculty at Trinity College Dublin. In Brennan and Hale31, English-speaking participants listened to a chapter of Alice in Wonderland. See Supplementary Section A.1 for more details. The study was approved by the University of Michigan Health Sciences and Behavioral Sciences institutional review board (HUM00081060).

Preprocessing

MEG and EEG are generally considered to capture neural signals from relatively low-frequency ranges20. Consequently, we first resampled all brain recordings down to 120 Hz with Torchaudio59 and then split the data into training, validation and testing splits comprising roughly 70%, 20% and 10% of the data, respectively. We defined a ‘sample’ as a 3 s window of brain recording with its associated speech representation. A ‘segment’ is a unique 3 s window of speech sound. As the same segment can be presented to multiple participants (or even several times within the same participant in ref. 33), the splits were defined so that one segment is always assigned to the same split across repetitions. We ensured that there were no identical sentences across splits. Furthermore, we excluded all segments overlapping across different splits. For clarity, we restricted the test segments to those that contain a word at a fixed location (here, the segment starts 500 ms before the word onset).

MEG and EEG data can suffer from large artefacts, for example, eye movements or variations in the electromagnetic environment20. To limit their impact, we applied a ‘baseline correction’ (that is, we subtracted from each input channel its average over the first 0.5 s) and a robust scaler with scikit-learn60. We normalized the data and clamped values greater than 20 s.d. to minimize the impact of large outlier samples. In Supplementary Section A.3, we study the effect of clamping and show that it is essential to ensure proper training. In Supplementary Section A.4, we further show that this approach is as effective as more complex data-cleaning procedures such as autoreject61.
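
A minimal sketch of this preprocessing is given below, assuming one recording passed as a channels × time array already resampled to 120 Hz; the ordering of the steps and whether they are applied per recording or per window are assumptions.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

def preprocess(recording: np.ndarray, sfreq: int = 120, clamp_sd: float = 20.0) -> np.ndarray:
    """Baseline correction, robust scaling and clamping for one (channels x time) array."""
    # Baseline correction: subtract each channel's average over the first 0.5 s.
    baseline = recording[:, : int(0.5 * sfreq)].mean(axis=1, keepdims=True)
    centred = recording - baseline
    # Robust scaling per channel (scikit-learn's median / inter-quartile-range scaling).
    scaled = RobustScaler().fit_transform(centred.T).T
    # Clamp values beyond +/- 20 s.d. to limit the impact of large outliers.
    return np.clip(scaled, -clamp_sd, clamp_sd)
```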

For the Mel spectrogram, we used 120 Mel bands (Supplementary Section A.6) (ref. 62), computed from a normalized STFT (short-time Fourier transform) with a frame size of 512 samples and a hop length of 128 samples, using audio sampled at 16 kHz. We applied log-compression, that is, \(\log(\epsilon+\mathrm{Mel})\), with \(\epsilon=10^{-5}\). When using wav2vec 2.0, we averaged the activations of the last four layers of its transformer. We used standard normalization for both representations.
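
For illustration, such a log-Mel target can be computed with Torchaudio roughly as follows; this is a sketch using the parameters above, and the `normalized` flag is our interpretation of the "normalized STFT" mentioned in the text.

```python
import torch
import torchaudio

# Log-Mel target: 120 Mel bands, normalized STFT, frame 512, hop 128, 16 kHz audio.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=512, hop_length=128, n_mels=120, normalized=True)

def log_mel(waveform: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # waveform: (1, n_samples) at 16 kHz -> (120, n_frames), log-compressed.
    return torch.log(eps + mel_transform(waveform)).squeeze(0)
```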

Training

One training epoch is defined as 1,200 updates using Adam63 with a learning rate of \(3\times10^{-4}\) and a batch size of 256. We stopped training when no improvement was observed on the validation set for ten epochs and kept the best model based on the validation loss. For the direct regression of the Mel spectrogram, we used the MSE (mean square error) loss. We used two V100 graphics processing units with 16 GB of memory. See Supplementary Section A.7 for an analysis of the impact of the training hyperparameters.

Evaluation

Mel reconstructions

In Extended Data Fig. 4, we illustrate some reconstructed Mel spectrograms using different models. With a regression loss, the Mel spectrogram is generated directly. With a CLIP loss, we plot the weighted average across all test segments, where the weight corresponds to the probability, estimated with the CLIP loss, that the segment is the true one. Specifically, given a segment and its matching audio (here the sentence ‘Thank you for coming Ed’), we retrieve the predicted distribution over the 1,363 segments given by equation (2). We then use this distribution to average the Mel spectrograms of the candidate segments.

Segment-level evaluation

The top-10 segment accuracy indicates whether the true segment is among the 10 most likely segments predicted by the decoder. We favour this metric over the standard top-1 accuracy given the large number of possible segments: the model may decode useful information without necessarily guessing the exact speech segment.
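
A minimal sketch of this metric, assuming the decoder's segment scores are available as a matrix of logits (the variable names are hypothetical):

```python
import torch

def top_k_segment_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 10) -> float:
    """Fraction of samples whose true segment is among the k most likely segments.

    logits: (n_samples, n_segments) decoder scores; targets: (n_samples,) true segment indices.
    """
    topk = logits.topk(k, dim=1).indices          # (n_samples, k)
    hits = (topk == targets[:, None]).any(dim=1)  # True when the target is in the top k
    return hits.float().mean().item()
```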

Word-level evaluation

To evaluate the model at the word level, we select a 3 s segment for each word of the test set (from −500 ms to 2.5 s around the word onset). We input the model with the corresponding brain recordings and output the probability distribution over all test segments. To obtain the distribution over the vocabulary, we group the candidate segments by the corresponding word (that is, the word starting at t = 0) and sum the probabilities of the same word spoken in different segments. Top-1 and top-10 word-level accuracy then quantify whether the true word is the most likely, or among the 10 most likely, predictions of the model, respectively.
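
A minimal sketch of this grouping step, assuming the segment probabilities and the word at t = 0 of each candidate segment are given (hypothetical inputs):

```python
from collections import defaultdict
import torch

def word_probabilities(segment_probs: torch.Tensor, segment_words: list[str]) -> dict[str, float]:
    """Collapse a distribution over test segments into a distribution over the vocabulary
    by summing the probabilities of all segments that share the same word at t = 0."""
    vocab_probs: dict[str, float] = defaultdict(float)
    for prob, word in zip(segment_probs.tolist(), segment_words):
        vocab_probs[word] += prob
    return dict(vocab_probs)
```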

Prediction analysis

To further inspect the predictions of the decoder, we quantify the extent to which they can be predicted from well-defined features \(\tilde{Y}\in\mathbb{R}^{n\times f_i}\). For this, we extract the phonetic features (d = 60) with Phonemizer64, the ‘zipf’ frequency (d = 1) with Wordfreq65, the part-of-speech tags (d = 15) and the word embedding (d = 300) of each word with spaCy66, as well as the phrase embedding of the corresponding 3 s speech segment (d = 1,024) with Laser67. We refer to ‘bag-of-words’ as the sum of word embeddings over the segment. We then train a ridge regression with scikit-learn’s default parameters60 to predict the softmax probability of the true word output by the decoder, and estimate, with a five-split cross-validation, the correspondence between these two values with Pearson’s R correlation. In sum, this analysis quantifies how well each feature predicts the probability of being selected by the decoder.
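
A minimal sketch of this analysis for one feature set follows; the shuffling and random seed of the cross-validation splits are assumptions, while the ridge defaults and the five splits match the description above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from scipy.stats import pearsonr

def feature_correlation(features: np.ndarray, true_word_prob: np.ndarray) -> float:
    """Predict the decoder's softmax probability of the true word from one feature set
    (e.g. phrase embeddings) with ridge regression and report the cross-validated Pearson R."""
    preds = np.zeros_like(true_word_prob)
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
        model = Ridge().fit(features[train], true_word_prob[train])  # scikit-learn defaults
        preds[test] = model.predict(features[test])
    return pearsonr(preds, true_word_prob)[0]
```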

Statistics

Statistical comparison was performed on the test set. We used a Wilcoxon test across participants to compare different models on the same datasets. We used a Mann–Whitney test across participants to compare different datasets.

Discussion

Our model accurately identifies, from three seconds of non-invasive recordings of brain activity, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities. This performance, sometimes reaching 80% in the best participants, allows the decoding of perceived words and phrases that are absent from the training set.

To decode speech perception from MEG and EEG, two major challenges must be overcome. First, these signals can be very noisy, making it difficult to extract useful information. Second, it is unclear which features of speech are, in fact, represented in the brain. Here we discuss how our ‘brain’ and ‘speech’ modules respectively address these two issues in the case of speech perception. Finally, we evaluate the performance of our model compared with previous works and outline the necessary steps to be taken before hoping to deploy this approach for the decoding of speech production in clinical settings.

Efficiently extracting brain signals

Non-invasive recordings are notoriously noisy: these signals present large variations across trials and participants and they are often contaminated by large artefacts20,21,22. Historically, solving this problem has involved complex preprocessing pipelines that include a variety of techniques—independent component analysis, outlier detection, artefact correction—that end with a linear model specific to each participant68,69,70. More recently, several deep-learning architectures have proved successful in solving simple classification tasks trained on single-participant recordings71,72.

Building on these efforts, our end-to-end architecture requires minimal preprocessing of MEG and EEG signals and can be trained with a variety of participants, devices and stimuli. As decoding speech production can be challenged by the presence of muscular activity, we here evaluate this model on four public datasets where healthy participants listened to natural speech. Our analyses suggest that advanced MEG and EEG preprocessing does not provide a major advantage in the current decoding task and that a simple baseline correction followed by a robust scaler and clamping suffices. In addition, not only does our participant-specific layer improve decoding performance, but this performance also increases with the number of participants in the training set. These findings, combined with both the rise of publicly available datasets and the potential to learn informative features from unannotated data73,74, suggest that this brain module may be a stepping stone for building a foundational model of brain recordings.

How is language represented in the brain?

Separating noise and signal in brain recordings is not, however, the only challenge. The nature of these representations in terms of their acoustic, phonetic, lexical and semantic properties remains poorly known. Consequently, determining the representations most suitable for decoding is an unresolved problem. To tackle this issue, previous studies have primarily used supervised models targeting well-defined features of language, such as individual letters, phonemes or frequency bands of the audio spectrogram12,23,24,72,75,76,77,78,79,80. Although this approach has demonstrated clear successes, it may impede the speed at which words are decoded from brain activity: for instance, spelling out a word letter by letter could be a slow and laborious process. As an alternative, others have proposed to learn to classify a small set of words26,28,81,82,83. This approach, however, is difficult to scale to a vocabulary size adequate for natural language. Finally, word semantics may be directly decoded from functional MRI signals84,85,86,87,88,89. However, the corresponding performances currently remain modest at the single-trial level.

Here we show how a neural network pretrained on a large corpus of speech sounds provides representations of language that are particularly valuable for brain decoding. Specifically, we leverage the recent discovery that these self-supervised speech models learn features that linearly relate to those of the brain54,55 to build our speech module. By applying contrastive learning, we can effectively identify the most appropriate features for identifying new speech segments. Our analyses confirm that this approach outperforms (1) a supervised decoding of the Mel spectrogram as well as (2) ‘Deep Mel’, that is, latent representations of the Mel spectrogram optimized for decoding solely from the present MEG and EEG datasets. Finally, the inspection of the decoding predictions suggests that our model primarily captures the lexical and contextual features captured by modern word embeddings and language models. So far, however, what these high-level features represent and how these representations are structured and manipulated remain to be determined.

Comparison with previous works

Comparing the performance of our model to previous works is difficult because the variety of experimental paradigms is not matched by a profusion of open datasets and reproducible code. Two elements may, nonetheless, substantiate such a comparison.

First, the size of the vocabulary considered here exceeds that of previous attempts, often by several orders of magnitude. For example, MEG and EEG studies typically used supervised decoders to discriminate a very small set of words26,28,81,82,83 or sublexical classes (for example, phonemes, syllables, tones)23,24,77,78,79,80. For instance, several studies90,91,92 developed decoders to classify 11, 5 and 2 distinct imagined phonemes from EEG signals, respectively. Similarly, several studies25,26,27 developed MEG decoders to classify 6 distinct parts-of-speech (with 48% accuracy), 10 words (83% accuracy) and 3 words (70% accuracy), respectively. The limited vocabulary used in these non-invasive studies contrasts with the present approach, which accurately distinguishes several hundred words. Furthermore, the performances of our model are based on vocabularies that do not fully overlap with those used in the training set (Table 1). For example, for the Gwilliams dataset, the decoding performance reaches 40% even though nearly 36% of the words were never presented during training. Overall, such zero-shot decoding shows the versatility of this approach and opens the possibility of decoding from even larger vocabularies.

Second, although our model’s performance remains modest, it may not be too distant from that obtained with invasive recordings of brain activity. Indeed, decoding the perception of isolated words from a vocabulary of n = 50 words leads to a top-1 accuracy of 22.7% on average, but up to 42.9% in the best participants (Supplementary Section A.9). In comparison, Moses et al. 13 reported decoding produced words from intracranial recordings with a top-1 accuracy of 39.5% for isolated words out of n = 50 words. Similarly, still restricting the number of candidates to 50 and, this time, within the context of a sentence, our model reaches a top-1 accuracy above 72.5% on average across participants, and the best participants reach between 92.2% (ref. 32) and 95.9% (ref. 30; Fig. 2b), whereas Moses et al. 13 reached a top-1 accuracy of 74%. While comparing the decoding of perceived versus produced words should be considered with caution given their different brain bases, the performance of the current model leads us to be optimistic about its potential applicability in a speech production context.

Remaining steps to decode speech production in the clinics

Our non-invasive approach focuses on speech perception. To reach the performance obtained with clinical recordings10,12,13,18,40,93,94,95, decoding intended communication will thus require addressing several challenges. Three specific challenges stand out.

First, the current model needs to be adapted to speech production. This can, in principle, be achieved by replacing the speech module with a neural network pretrained on production tasks such as handwriting or speech production.

Second, the current contrastive-learning objective can identify only the most likely word or speech segment from a predetermined set. The model thus needs to be supplemented with a generative module that can estimate the most likely phoneme, word or sentence given brain activity without requiring this set of candidates, similarly to what is being achieved with functional MRI89,96.

Finally, our study reveals striking differences between EEG and MEG. While EEG is known to be less precise than MEG, we did not expect such an important difference in decoding performance. Adapting current MEG systems to the clinics will require substantial efforts, however: while new room-temperature sensors already show signal-to-noise ratios comparable to those of the superconducting quantum interference devices (SQUIDs) used in the present study, these systems are not commonly deployed in clinical settings, whose magnetic environment can be extremely noisy. Combined with artificial intelligence systems, these new devices could nevertheless contribute to improving the diagnosis, prognosis and restoration of language processing in non-communicating or poorly communicating patients without putting them at risk of brain surgery. In that regard, we hope that the release of a reproducible pipeline will contribute to the development of safe and scalable non-invasive methods for decoding intended communication.