The role of isochrony in speech perception in noise

The role of isochrony in speech (the hypothetical organisation of speech units into equal-duration intervals) has been the subject of a long-standing debate. Current approaches in neuroscience have brought new perspectives to this debate through the theoretical frameworks of predictive coding and cortical oscillations. Here we assess the comparative roles of naturalness and isochrony in the intelligibility of speech in noise for French and English, two languages representative of two well-established contrastive rhythm classes. We show that both top-down predictions associated with the natural timing of speech and, to a lesser extent, bottom-up predictions associated with isochrony at a syllabic timescale improve intelligibility. We found a similar pattern of results for both languages, suggesting that the temporal characterisation of speech from different rhythm classes could be unified around a single core speech unit, with a neurophysiologically defined duration and a linguistically anchored temporal location. Taken together, our results suggest that isochrony is not a primary dimension of speech processing, but may instead be a consequence of neurobiological processing constraints, manifesting in behavioural performance and ultimately explaining why isochronous stimuli occupy a particular status in speech and human perception in general.

Example stimuli illustrating P-centre annotation
• fr_p_acc.wav is the French example sentence mixed with audible tones marking accent-group P-centres
• fr_p_syl.wav is the French example sentence mixed with audible tones marking syllable P-centres
• en_p_acc.wav is the English example sentence mixed with audible tones marking accent-group P-centres
• en_p_syl.wav is the English example sentence mixed with audible tones marking syllable P-centres
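The tone-marked examples above can be produced by mixing a short beep into the waveform at each annotated P-centre time. The sketch below is a minimal illustration, not the authors' actual pipeline: the tone frequency, duration, gain, and the P-centre times are all hypothetical placeholders.

```python
import numpy as np

def add_pcentre_tones(signal, sr, pcentres, freq=1000.0, dur=0.02, gain=0.3):
    """Mix a short, tapered sine tone into `signal` at each P-centre time (s)."""
    out = signal.astype(float)            # work on a float copy
    n = int(dur * sr)                     # tone length in samples
    t = np.arange(n) / sr
    tone = gain * np.sin(2 * np.pi * freq * t) * np.hanning(n)  # tapered beep
    for pc in pcentres:
        start = int(pc * sr)
        end = min(start + n, len(out))
        out[start:end] += tone[: end - start]
    return out

# Demo on 1 s of silence with two hypothetical P-centre times
sr = 16000
silence = np.zeros(sr)
marked = add_pcentre_tones(silence, sr, pcentres=[0.25, 0.50])
```

In practice the tone would be mixed into the clean sentence recording before adding noise, so that the markers remain audible against the original speech.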

Example stimuli illustrating temporal modifications
• fr_iso_acc.wav is the French example sentence isochronously retimed at the accent-group level
• fr_iso_syl.wav is the French example sentence isochronously retimed at the syllable level
• fr_ani_acc.wav is the French example sentence anisochronously retimed at the accent-group level
• fr_ani_syl.wav is the French example sentence anisochronously retimed at the syllable level
• en_iso_acc.wav is the English example sentence isochronously retimed at the accent-group level
• en_iso_syl.wav is the English example sentence isochronously retimed at the syllable level
• en_ani_acc.wav is the English example sentence anisochronously retimed at the accent-group level
• en_ani_syl.wav is the English example sentence anisochronously retimed at the syllable level
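Isochronous retiming maps the natural segment durations (syllables or accent groups) onto equal-duration intervals while preserving total sentence duration. A minimal sketch of the timing computation is given below, assuming retiming is realised by per-segment uniform time-stretching (e.g. with a phase vocoder or WSOLA, not shown here); the boundary values are hypothetical.

```python
import numpy as np

def isochronous_retiming_factors(boundaries):
    """Given segment boundaries (seconds), return per-segment time-stretch
    factors that map each segment onto equal-duration intervals while
    keeping the total sentence duration unchanged."""
    boundaries = np.asarray(boundaries, dtype=float)
    durations = np.diff(boundaries)          # natural segment durations
    total = boundaries[-1] - boundaries[0]   # total duration to preserve
    target = total / len(durations)          # equal-duration target
    return target / durations                # stretch factor per segment

# Hypothetical syllable boundaries (s) for a short sentence
bounds = [0.0, 0.18, 0.45, 0.60, 0.95]
factors = isochronous_retiming_factors(bounds)
```

Applying each factor to its segment yields identical segment durations summing to the original sentence length; an anisochronous control can reuse the same machinery with deliberately unequal target durations.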

Word-level and talker acoustics analysis
Word-level intelligibility (Figure 1) and individual talker acoustic analysis (Figure 2) indicate that both syntactic differences in how sentences are formed in the two languages and individual talkers' differences in speaking-rate variation across sentences could contribute to a contrastive tendency between French and English.
As seen in Figure 1, the two languages present three marked differences. First, for NAT sentences, intelligibility in French drops sharply after the first word and then increases, whereas in English it is highest for the first three words and then decreases markedly towards the end of the sentence. This pattern seems to hold in each language for the other conditions as well, combined with an additional decrease in intelligibility. Second, the isochronous advantage established for English at the accent level appears to develop in the second part of the sentence; that is, accent-isochronous sentences show a milder decrease as the sentences unfold compared to other conditions. Third, the intelligibility of initial keywords in accent-retimed French sentences is similar to that of NAT speech. This could be attributed to the fact that accent-group boundaries tend to fall on the last syllable of words in French (contrasting with trochaic stress patterns in English), so the early part of the retimed sentence is in effect unmodified until that point in these conditions. Taken together, these differences in word-level recognition patterns suggest that the syntactic and rhythmic structures of the two languages are likely to affect how isochronous modifications are implemented by talkers and processed by listeners.
As seen in Figure 2, there is a wider F0 declination and a stronger intensity declination in English than in French, potentially due to the gender difference and individual speaking styles of the two talkers. There is also a noticeable difference between French and English in average speaking rate at both the accent-group and syllable levels. Finally, we note that speaking rate is more irregular across the sentence for the female talker, especially at the syllable level. In sum, both idiosyncratic and language-specific factors seem to underlie differences in the three acoustic descriptors. These prosodic variations in English, with a faster rate and lower intensity towards the end of the sentences, could well explain the effect of keyword position in Figure 1.
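Speaking rate and its regularity across a sentence can be quantified from the same segment durations used for retiming. The sketch below uses the coefficient of variation of segment durations as one common irregularity measure; this is an illustrative choice, not necessarily the descriptor used in the study, and the duration values are hypothetical.

```python
import numpy as np

def rate_and_irregularity(durations):
    """Summarise speaking rate from segment durations (seconds).
    rate         = segments per second
    irregularity = coefficient of variation of durations (0 = perfectly regular)
    """
    d = np.asarray(durations, dtype=float)
    rate = len(d) / d.sum()     # segments per second
    cv = d.std() / d.mean()     # relative spread of durations
    return rate, cv

# Perfectly regular vs. irregular syllable durations (hypothetical values)
rate_reg, cv_reg = rate_and_irregularity([0.20, 0.20, 0.20, 0.20])
rate_irr, cv_irr = rate_and_irregularity([0.10, 0.30, 0.10, 0.30])
```

Both example sequences have the same average rate, but only the second shows a non-zero irregularity, mirroring the contrast described above between the two talkers.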