We subdivided a digitized sentence into segments of fixed duration (say, 50 ms). Every segment was then time-reversed without smoothing the transition borders between the segments. The entire spoken sentence was therefore globally contiguous, but locally time-reversed, at every point (A+B in Fig. 1). Listeners report perfect intelligibility of the sentence for segment durations up to 50 ms, and partial intelligibility for segment durations exceeding 100 ms (Fig. 1, bottom), with 50% intelligibility occurring at about 130 ms; by psychoacoustic standards, such segment distortions are very long. Many defining features of speech sounds are rapid temporal transitions with durations well within the reversal window.
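The segment-wise reversal described above can be sketched in a few lines. This is a minimal illustration rather than the procedure used to generate the stimuli: the function name `locally_time_reverse`, the use of NumPy, and the handling of a final partial segment (reversed like any other) are assumptions not stated in the text.

```python
import numpy as np

def locally_time_reverse(signal, sample_rate, segment_ms=50.0):
    """Reverse each fixed-duration segment while keeping the segment order."""
    signal = np.asarray(signal)
    # Segment length in samples (assumed: rounded to the nearest sample).
    seg_len = max(1, int(round(sample_rate * segment_ms / 1000.0)))
    out = signal.copy()
    for start in range(0, len(signal), seg_len):
        # Reverse the samples inside this window only; the borders between
        # segments are left unsmoothed, as in the description above.
        # Assumption: a trailing partial segment is reversed as well.
        out[start:start + seg_len] = signal[start:start + seg_len][::-1]
    return out

# Toy usage: a ramp stands in for a digitized sentence
# (4-sample segments at a hypothetical 1 kHz sample rate).
ramp = np.arange(10)
reversed_ramp = locally_time_reverse(ramp, sample_rate=1000, segment_ms=4)
```

Applying the function a second time with the same segment duration restores the original waveform, because the segment boundaries do not move.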

Figure 1: Segments of speech showing the effects of time reversal.

Top, 50-ms segments of original and locally time-reversed (A+B) speech. Bottom, subjective intelligibility rating by seven subjects (different symbols). The solid line shows the average rating score.

Perception of speech is robust against local time reversal even if alternating segments are shifted in time (A+delayed B). Speech remains intelligible if odd-numbered segments are displaced forwards in time by two or three times the duration of the window. For example, for segments of 100 ms, shifting the odd-numbered segments forward in time by 200 ms reduces the intelligibility rating by only 15%. For segments of 50 ms, intelligibility is not significantly affected by a displacement of 100 or 200 ms, but the speech does sound more echoic. Furthermore, the results are not changed if half the segments (A in Fig. 1) are presented to one ear and the other half (B in Fig. 1) to the other ear.
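One way to picture the A+delayed B manipulation is to split the signal into two interleaved streams and delay one of them by a whole number of segment durations before summing. The sketch below rests on several assumptions that the text does not specify: the two streams are mixed by simple addition, "odd-numbered" counts segments from 1, and the displacement is a delay. The name `delay_alternate_segments` is hypothetical.

```python
import numpy as np

def delay_alternate_segments(signal, seg_len, delay_segments=2):
    """Delay every other segment by `delay_segments` segment durations."""
    signal = np.asarray(signal, dtype=float)
    delay = delay_segments * seg_len
    seg_index = np.arange(len(signal)) // seg_len
    # Assumption: stream B holds the 1st, 3rd, 5th, ... segments.
    is_b = (seg_index % 2 == 0)
    stream_a = np.where(is_b, 0.0, signal)   # B slots silenced
    stream_b = np.where(is_b, signal, 0.0)   # A slots silenced
    # Assumption: the streams are summed; the output grows by the delay.
    out = np.zeros(len(signal) + delay)
    out[:len(signal)] += stream_a
    out[delay:] += stream_b
    return out

# Toy usage: 2-sample segments, stream B delayed by two segment durations.
mixed = delay_alternate_segments([1, 2, 3, 4, 5, 6, 7, 8], seg_len=2)
```

With a delay of an even number of segment durations, the delayed B segments fall into the silent slots of stream A, so the two streams do not overlap in this sketch; with an odd number they are superimposed. For dichotic presentation, `stream_a` and `stream_b` would be routed to separate channels instead of being summed.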

When subjects listen repeatedly to locally time-reversed sentences with moderately long windows (100 ms), they report that previously unintelligible words become clear. This type of ‘learning’ is not simply due to an improvement in identification, as subjects say they can now hear actual words, indicating some form of cognitive recalibration. The experience is similar to familiarization with a newly heard accent.

These findings lend support to recent theories (refs 7, 8) of speech encoding that state, contrary to conventional thinking, that a detailed auditory analysis of the short-term acoustic spectrum is not essential to the speech code. Rather, the ultralow-frequency modulation envelopes on the order of 3 to 8 Hz are critical cues to intelligibility. Although the amplitude spectrum of a waveform is unaffected by time reversal, the temporal envelopes, as well as the fine structure of the running spectrum, are highly distorted for such sounds. The advantage of a robust speech-encoding system that uses higher-order corrective measures and ultralow-frequency cues is obvious in noisy environments where the listener needs to perceptually extract and identify a stream of speech cues that compete with extraneous noise, as in the ‘cocktail party effect’ (ref. 9).
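The statement that time reversal leaves the amplitude spectrum unchanged follows from a standard Fourier identity. As a short worked step, assume a real-valued waveform $x(t)$ of duration $T$ with Fourier transform $X(f)$, and let $y(t) = x(T - t)$ be its time-reversed version; these symbols are introduced here for illustration only.

```latex
\begin{aligned}
Y(f) &= \int_{-\infty}^{\infty} x(T - t)\, e^{-i 2\pi f t}\, dt
      = e^{-i 2\pi f T} \int_{-\infty}^{\infty} x(u)\, e^{+i 2\pi f u}\, du
      \qquad (u = T - t) \\
     &= e^{-i 2\pi f T}\, X(-f)
      = e^{-i 2\pi f T}\, X^{*}(f)
      \qquad (\text{since } x \text{ is real}) \\
\lvert Y(f) \rvert &= \lvert X(f) \rvert .
\end{aligned}
```

Only the phase of the spectrum changes. The identity holds for the waveform being reversed, here each segment taken on its own; the running spectrum and the temporal envelope of the reassembled sentence are not preserved.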