Combining predictive coding and neural oscillations enables online syllable recognition in natural speech

On-line comprehension of natural speech requires segmenting the acoustic stream into discrete linguistic elements. This process is argued to rely on theta-gamma oscillation coupling, which can parse syllables and encode them in decipherable neural activity. Speech comprehension also strongly depends on contextual cues that help predict speech structure and content. To explore the effects of theta-gamma coupling on bottom-up/top-down dynamics during on-line syllable identification, we designed a computational model (Precoss—predictive coding and oscillations for speech) that can recognise syllable sequences in continuous speech. The model uses predictions from internal spectro-temporal representations of syllables and theta oscillations to signal syllable onsets and duration. Syllable recognition is best when theta-gamma coupling is used to temporally align spectro-temporal predictions with the acoustic input. This neurocomputational modelling work demonstrates that the notions of predictive coding and neural oscillations can be brought together to account for on-line dynamic sensory processing.

Reviewer 1
The authors present a model that combines a predictive coding approach with an existing model of an oscillator-based system. I think this work makes a solid and worthwhile contribution to the literature, and I commend the authors for the clarity with which their work was presented. However, the impact of the work could be strengthened by showing the same result on non-normalized data, testing the model against existing neural data, generating predictions through simulation, and/or testing against non-oscillating ASR models. The case for predictive coding and oscillations is not a sweeping theoretical novelty, but is nonetheless an important line to pursue if the above strengthening points can be added. That said, my major concerns are (i) that the authors normalize their speech data without justification, and (ii) that they make claims that are just not supported by what they show.
1. I see a few problems that need to be addressed.

(i)
First, the approach, while a step forward from the authors' previous models, still completely ignores the actual behavioral goal of speaking, which is, at the least, communicating meaning (viz., words, phrases, sentences, linguistic structure). Unless the authors want to claim that syllable detection is somehow orthogonal to or independent from word recognition or prosodic processing, they need to dial back their claims about speech and anything higher than the syllable. They are presenting a model of syllable onset detection, not of speech processing or sentence processing. All references to "speech" and "sentence processing" should be changed to "syllable" and "syllable processing," respectively, including, importantly, in the title. Speech processing is much more than syllable detection, and prediction of the timing of normalized syllables (such that they are forced to be the same length) - what is shown in this paper - has little to do with perceiving words, never mind sentence processing.

(ii)
For example, what evidence is there that the cochlea (or even the auditory cortex) normalizes the length of syllables? The authors say several times in the paper that biological oscillators are more flexible and can deal with the non-stationarity and aperiodicity of speech, but then why do they normalize their training data?

We thank Reviewer 1 for bringing up these important points about the model's goal and performance. We admit that several statements we made were more ambiguous than intended, and we have adjusted the text to clarify them.
(i) We fully acknowledge that using sentences with normalized syllables was a weakness of the model and we have addressed the issue in the revised version of the model.
Although we never claimed the auditory system does any sort of syllable normalization, we had several technical reasons for doing so in the previous version of the model. They were mostly related to the methodology used to create the model's "gamma sequence", and to the stored information about spectrotemporal patterns of each syllable. Syllable normalization allowed us to have a streamlined representation of each syllable in the model's memory.
Following the reviewers' comments (the same concern about using normalized syllables was raised by Reviewer 2 as well), we have modified several components of the model, which now enables us to use natural sentences (with natural syllable durations) for simulations.

(ii) We can only agree with the reviewer that speech perception is much more than just recognizing syllables. However, "on-line" syllable identification within natural sentences (what the model does) is a key step towards that goal. The model implements dual-scale decoding of sentences that structurally incorporates the notion of endogenous syllable representations and top-down control, a notion that is absent from most models, including ASR algorithms (ref. 3).
(iii) Predictive coding was used to predict the spectrotemporal decomposition of the sound waveform, but not explicitly syllable duration. Admittedly, the model had intrinsic information about syllable duration (associated with the duration of the gamma sequence) and it attempted to extract/filter syllable onsets from possible cues in the envelope, but those two functions were mostly unrelated to the predictive aspect of the model. The model uses predictive coding to derive the dynamics of the input envelope and to change the activity level of the syllable units in the process of syllable identification. Syllable units change their activation level based on bottom-up prediction errors, and their activation level determines the model's prediction about the spectrotemporal pattern in the input at each moment. As a result, the model identifies individual syllables on-line from the continuous sentence.

Furthermore, we now include a model configuration with no internal syllable duration information, in which the model only "knows" the sequential spectral patterns of syllables (in this case represented by the 8 spectral vectors in the spectral space, one per gamma unit). In this degraded variant of the model, "happy" and "haaaappy" would be indistinguishable.

We agree that the use of the term 'optimal' was somewhat improper, as our goal was not to develop an optimal system that could challenge current ASR systems, but to explore how the brain could possibly make use of different information encoding principles, that is, neural oscillations for information packaging and predictive coding for the continuous dynamic interaction of bottom-up and top-down information flows. We have thus removed the term "optimization" from the title, and we use it cautiously in the manuscript. We now also compare more model variants (with implicit and explicit theta oscillations, and a model without any explicit oscillatory activity).
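The error-driven update of the syllable units described above can be sketched in a few lines. This is an illustrative toy, not the authors' Precoss implementation: the function name, array sizes, templates, and learning rate are all our own assumptions.

```python
import numpy as np

def pc_step(activations, templates, observed, lr=0.1):
    """One prediction-error-driven update of syllable unit activations.

    activations: (n_syllables,) activation level of each syllable unit
    templates:   (n_syllables, n_channels) stored spectral pattern per syllable
    observed:    (n_channels,) current spectral frame of the input
    """
    prediction = activations @ templates           # top-down spectral prediction
    error = observed - prediction                  # bottom-up prediction error
    activations = activations + lr * (templates @ error)  # error-driven update
    return activations, error

rng = np.random.default_rng(0)
templates = np.eye(5) + 0.1 * rng.random((5, 5))   # 5 toy syllables, 5 channels
observed = templates[2]                            # input matches syllable 2
act = np.full(5, 0.2)                              # all units start equally active
for _ in range(200):
    act, err = pc_step(act, templates, observed)
# the unit whose template matches the input ends up with the highest activation
```

This is simply gradient descent on the squared prediction error; the point it illustrates is the one made in the response: identification emerges from the competition between syllable units driven by bottom-up errors, not from a separate classifier.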
Finally, we have increased the number of sentences used in model simulations (220 sentences instead of 30).

Finally, in the end, the model still only tracks the syllable envelope - is that what the authors think speech is for? In other words, the point of listening to natural sentences is not to accurately predict syllable durations. So, the case for why what this model captures is crucial needs to be made more strongly.
We hope that it is now clearer from the text that the model does not only track the syllable envelope, but "identifies" syllables in an "on-line" manner using predictive coding; that is, it tracks and predicts both the detailed spectrotemporal content and changes in the envelope. What we argue is that temporal predictions about syllable duration are necessary to predict/derive the expected spectrotemporal component of the syllables. However, it is the spectrotemporal component that the model uses to identify the correct syllable. Furthermore, as we indicate in response to the previous point, we have also added model configurations without an implicit theta rhythm, hence without internal expectations about syllable duration. Those model configurations only "know" the spectral structure of syllables as a sequence of 8 spectral points that form a syllable, without any expectations about their overall duration.
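The duration-free representation described here (a syllable as a fixed sequence of 8 spectral vectors, one per gamma unit) can be illustrated with a small sketch. The function name and array sizes are assumptions for illustration, not the paper's code.

```python
import numpy as np

def gamma_sequence(spectrogram, n_units=8):
    """Average the frames of a (n_channels, n_frames) spectrogram into
    n_units equal time bins, yielding one spectral vector per gamma unit."""
    n_frames = spectrogram.shape[1]
    edges = np.linspace(0, n_frames, n_units + 1).astype(int)
    return np.stack([spectrogram[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=1)

rng = np.random.default_rng(1)
short_syl = gamma_sequence(rng.random((6, 24)))   # a short syllable
long_syl = gamma_sequence(rng.random((6, 60)))    # a longer syllable
# both map onto the same 8-step sequence of spectral vectors: shape (6, 8)
```

This makes the limitation discussed in the response concrete: because both a short and a long syllable collapse onto the same 8-step sequence, such a configuration carries no information about overall duration ("happy" vs "haaaappy").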

Minor points:

(i) Citation 9 isn't quite accurate; that paper doesn't deal with "what is going to be said next" in a linguistically sophisticated way at all. A reference that deals with actual linguistic structure would be more appropriate.

(ii) Change the Figure 1 caption to replace "sentence processing" with "syllable processing" and remove the word 'natural' - syllables were normalized after all.

(iii) Hierarchical encoding of phonemes within syllables is conjecture - I am aware that this is the main tenet of the Giraud & Poeppel paper, but where is the evidence that phonemes are encoded and not emergent? Recent ECoG work from the Chang group suggests that phonetic features might reorganize to form syllables, and this might mean that features are what gamma is encoding (even though I am against functional interpretation of bands).
(i) We thank the reviewer for the suggestion; we have updated citation 9. The intention was to indicate that the speech perception process can be split into two information components: what occurred in the signal (syllable identity) and when it occurred (e.g., syllable onset information carried by theta, providing the readout window for syllable identification).

(ii) We have updated the title of the manuscript and made the corresponding changes in the manuscript text and figure captions.

(iii) Finally, we must clarify that we did not claim that gamma in the model encodes phonemes. Even though the timescales of phonemes in natural speech overlap with the typical range of gamma cortical oscillations, in our model there is no precise correspondence between gamma units and phonemes. The model explores whether there is an advantage to having gamma-range encoding within syllable boundaries, by considering theta-gamma nesting.

Reviewer 2
This study reports on a computational speech recognition model that combines a theta-gamma oscillatory architecture (used and established before in a bottom-up manner by the same group; Hyafil et al., eLife 2015) with a predictive-coding-based model architecture.

The larger framework here is provided by the senior author's model of nested theta-gamma oscillations being involved in neural speech comprehension processes. This is an interesting extension of the extant model, borrowing heavily from a computational model by the Kiebel group (Yildiz et al.) on bird song.
As is to be expected from such a paper, a family of model variants is compared on performance (i.e., % syllables correctly identified). The final results (Fig. 3) are compelling. However, the manuscript seems not to provide some of the critical comparisons or model variants that would make this a compelling new model (see below for more detailed comments on this).
Put more generally, while I might have missed some subtleties of the computational models and the merit these might have for experts who have implemented and used the Hyafil and Yildiz models themselves, I did not walk away from this manuscript convinced that a truly new architecture has been shown here, or that we should now think differently about speech recognition and speech representation.
Further comments: 5. - It was a bit disappointing, if understandable, that the units in the model were tuned to very rigid/unnatural rates (i.e., the theta module operated strictly at 5 Hz throughout; there were always precisely 8 gamma units, etc.). Seeing how the model deals with precisely the kind of variations that occur in natural speech (as the authors have tried with the varying syllable SOAs, to be fair) might have been more informative for future application of the model in cognitive neuroscience/automatic speech recognition.
We thank Reviewer 2 for bringing up these issues. The revised version of the model now uses natural sentences with non-normalized syllables of natural duration. Furthermore, the frequency of the "theta" oscillation in the different model variants is now stimulus-driven and not rigidly fixed to 5 Hz, which is the operating frequency during rest (when there is no signal). Finally, even though we still have 8 gamma units per syllable, the duration of each unit is not fixed and can dynamically change either based on prediction errors or informed by the theta module.
6. -Related, it remained puzzling to me why such a submission would operate only/report only on 30 sentences. I am not asking for a closer link to neural data or human speech comprehension data per se (although this would have also made for a stronger paper, most likely). But the breadth and depth of the speech materials these models are tested on should be increased.
We thank the reviewer for prompting us to increase the speech material. We now present results from simulations on 220 sentences (all sentences in the training dataset of TIMIT corresponding to one dialect). These sentences contain on average around 13 syllables (more details in Figure 2 of the manuscript).
7. - Most importantly, how can we generalise from these results that a theta module is the critical one, if no differently-tuned module was tested (e.g., varying the low rate between 1 and 10 Hz)? Again, a richer set of test cases would have made for a more trustworthy claim here. (cf. line 274: "For the sake of parsimony, this was not implemented in the present work.")

Although we did not include the following results in the revised manuscript, we have tested the model performance when the theta rhythm is tuned to different values, varying from 1.25 Hz to 10 Hz. We tested three model variants: two with an exogenous theta rhythm (A and C, with and without preferred gamma speed, respectively) and one with an endogenous theta rhythm (variant A', for which we also had ideal onsets) on 44 sentences (2 sentences randomly selected from each speaker). We also set the precision of the causal states of the gamma sequence to a higher value so that the dynamics of the endogenous slow oscillation are less distorted (for variant A'). The performance was significantly higher when the frequency of the theta neuron was < 10 Hz. When the preferred gamma speed was set by an endogenous theta rhythm, the best performance occurred for physiological theta values (2.5 Hz to 8 Hz).

8. - At first glance more of a detail: why the very conservative false discovery rate threshold of q = 0.001? Figure 3 and Table S2 give the impression that the model variants differed quite profoundly from each other, but this also raises suspicion - why such vast differences between models, and would this have been different with more/more diverse speech materials as test cases?
We thank the reviewer for raising these concerns about the statistical tests performed for the evaluation of the model's performance. As already mentioned, we have increased the number of sentences in the data set from 30 to 220, and we have updated the statistical methods used for the evaluation of the model's performance. Moreover, we now report results based on more traditional Bonferroni corrections for multiple comparisons.
This is already the case when we decrease the number of syllables in the sentence. All versions of the model perform better with short than with long sentences, confirming that it is easier for the model(s) to pick the correct syllables from fewer choices (as would be the case if we used natural speech statistics). As our formulation might indeed be confusing, we removed it from the manuscript.

[Figure: performance of the model variants (Figure 3 in the manuscript) depending on the number of syllables in the sentence; the highest performance for each variant is normalized to 1 to better visualize the decrease in performance with an increased number of potential syllables.]
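For reference, the Bonferroni procedure mentioned above can be sketched with hypothetical pairwise comparisons between model variants; the pair labels and p-values below are made up purely for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and significance decisions:
    each raw p-value is multiplied by the number of comparisons (capped at 1)."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p < alpha for p in adjusted]

# Hypothetical raw p-values for three pairwise model comparisons
raw = {"A vs B": 0.001, "A vs C": 0.020, "B vs C": 0.300}
adjusted, significant = bonferroni(list(raw.values()))
for (pair, p), p_adj, sig in zip(raw.items(), adjusted, significant):
    print(f"{pair}: raw p = {p:.3f}, adjusted p = {p_adj:.3f}, significant = {sig}")
```

Compared with an FDR threshold of q = 0.001, this family-wise correction at a conventional alpha is both more familiar and, with only a handful of variant comparisons, not overly punishing.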
3. One of the authors' main points is the importance of the contribution of top-down information, yet they are again not modelling the complexity of the problem the brain solves - they don't include word-level or higher information or make use of prosodic structure or the full contents of the natural envelope. I think the limitations this model clearly faces need to be discussed much more. It is a model of syllable tracking, not of speech or language processing.
The goal of the current study is to propose a model of "on-line" syllable parsing and identification from natural sentences, and as argued in the manuscript, this is an essential step toward natural speech processing. Lexico-semantic top-down processing is beyond the scope of this study, which intends to explore the possible advantage of combining generic mechanisms, namely predictive coding with neural oscillatory activity acting as a temporal constraint on top-down/bottom-up information flows. We certainly agree that speech perception is a more complex process than what the model describes, and that content-related top-down processing will have to be taken into account in possible follow-ups of this work. We have further clarified this point in the discussion section (lines 273-287).