Auditory motion perception emerges from successive sound localizations integrated over time

Humans rely on auditory information to estimate the path of moving sound sources. But unlike in vision, the existence of motion-sensitive mechanisms in audition is still open to debate. Psychophysical studies indicate that auditory motion perception emerges from successive localizations, but existing models fail to predict experimental results; notably, these models do not account for any temporal integration. We propose a new model that tracks motion using successive localization snapshots integrated over time. The model is derived from psychophysical experiments on the upper limit for circular auditory motion perception (UL), defined as the speed above which humans can no longer identify the direction of sounds spinning around them. Our model predicts the ULs measured with different stimuli using solely static localization cues: temporal integration blurs these localization cues, rendering them unreliable at high speeds, which gives rise to the UL. Our findings indicate that auditory motion perception does not require motion-sensitive mechanisms.


Comparison with previous findings by Féron et al. 2
Our data demonstrate the effect of spectral content on the UL, which increases with both BW and CF. This result might seem to contradict those of Féron et al. 2, shown in Figure 3, who reported higher ULs for low-pitched than for high-pitched harmonic sounds. A closer look at the stimuli used, however, explains this discrepancy. We present a time-frequency analysis of the four harmonic sounds used (see Figure 3). The analysis reveals that the high-pitched sounds had a narrower BW (as the sounds had no energy above 5 kHz, see Table 1). Based on our findings, we therefore predict lower ULs for high-pitched sounds, since their BWs are narrower. Féron et al. 2 also used band-limited noises but observed no differences across noises, which might also seem inconsistent with our findings. This difference can, however, be attributed to the filters used: while we used eighth-order filters with very steep slopes, Féron et al. used second-order filters. As a result, all of their band-limited noises contained the high-frequency content necessary to achieve the optimal UL, which explains their results.

Model implementation details
The psychometric relation between Q_x and the front-back confusion rate
We use a psychometric function with the standard S-shape of a logistic function, adapted to the bounded support [0, Q_max]. The parameters α and β control the slope and the asymmetry of the curve, while e controls the maximum achievable success rate. The parameters e, α, and β are obtained by non-linear least squares regression on Langendijk's 3 data (lsqcurvefit function in MATLAB).
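As a rough illustration of this fitting step, the sketch below uses scipy's curve_fit as a stand-in for MATLAB's lsqcurvefit. The analytic form of psychometric, the value of Q_max, and the arrays Q_data and rate_data are all assumptions made for illustration; they are not the paper's actual function or Langendijk's data.

```python
import numpy as np
from scipy.optimize import curve_fit

Q_max = 1.0  # upper bound of the cue support (placeholder value)

def psychometric(Q, e, alpha, beta):
    # Assumed S-shaped form on the bounded support [0, Q_max]; the
    # paper's exact expression is not reproduced here. e sets the
    # maximum achievable success rate, alpha the slope, and beta the
    # asymmetry/position of the curve.
    z = np.clip(Q / Q_max, 1e-9, 1.0 - 1e-9)
    return e / (1.0 + beta * ((1.0 - z) / z) ** alpha)

# Hypothetical stand-ins for the front-back confusion data.
Q_data = np.linspace(0.05, 0.95, 10)
rate_data = psychometric(Q_data, 0.95, 2.0, 1.0) + 0.02 * np.random.randn(10)

# Non-linear least squares regression, analogous to lsqcurvefit.
(e_fit, alpha_fit, beta_fit), _ = curve_fit(
    psychometric, Q_data, rate_data, p0=[0.9, 1.0, 1.0],
    bounds=([0.0, 0.0, 0.0], [1.0, np.inf, np.inf]))
```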

Frequency interpolation
The gradient along frequencies is computed using finite differences, which yield N−1 gradient samples for N central frequencies. In order to keep the number of frequency samples equal to the number of gradient samples, we pair each gradient sample with the interpolated frequency sample 0.5(f_n + f_{n+1}), for n ∈ {1, …, N−1}.
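For concreteness, a minimal sketch of this midpoint pairing (variable names are illustrative, not the paper's):

```python
import numpy as np

def log_energy_gradient(log_E, f):
    # Finite-difference gradient across the N gammatone bands: the
    # N-1 differences are paired with the interpolated frequency
    # samples 0.5*(f_n + f_{n+1}), so both arrays have N-1 entries.
    grad = np.diff(log_E) / np.diff(f)
    f_mid = 0.5 * (f[:-1] + f[1:])
    return grad, f_mid
```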

Numerical issues in spectral gradient computations
The computation of the spectral gradient in Equation 4 of the main text suffers from numerical issues: differences in log-energy can be large when the energy is close to zero. Perceptually, these gradients are masked by the background noise. To address this problem, we apply the filtering of the sound x only after the gradient computation. To do so, we set to zero the gammatone bands whose central frequency lies outside the cut-off frequencies. In addition, for the two gammatone bands nearest the cut-off frequencies, we approximate the remaining energy by linear interpolation with the energy of the gammatone band immediately after the cut-off frequency. Consider, for example, a band-pass filter with a high cut-off frequency f_h such that f_i < f_h < f_{i+1}, where f_i is the central frequency of the i-th gammatone band; the approximate log-energy gradient is then given by Equation (2).
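A minimal sketch of this order of operations is given below; the linear interpolation of Equation (2) near the cut-offs is omitted, and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

def masked_spectral_gradient(log_E, f, f_l, f_h):
    # Compute the gradient on the *unfiltered* gammatone log-energies
    # first, then keep only gradient samples whose frequency lies
    # inside the pass band [f_l, f_h]; out-of-band samples are zeroed,
    # mimicking filtering applied only after the gradient computation.
    grad = np.diff(log_E) / np.diff(f)
    f_mid = 0.5 * (f[:-1] + f[1:])
    return np.where((f_mid >= f_l) & (f_mid <= f_h), grad, 0.0), f_mid
```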

Model discussion
Binaural vs. monaural model of spectral cues
Our model is formally binaural because of Equation 3 in the main text. An alternative would be to consider both ears as independent: one would compute the front-back cues for each ear and sum them, Q_x = Q_x^r + Q_x^l, so that the model combines two monaural cues. However, this model would raise more numerical issues than those discussed in section 3.C: considering a single ear, the HRTF energy in the contralateral directions is very low, potentially resulting in large values for the computed gradient. In addition, such a model would require a binaural weighting such as the one used by Majdak 4. Our implementation avoids both issues.

Figure 3 (caption). The estimation method of the UL differs from ours, but the UL has been shown to be robust to different estimation methods 5. In Féron et al. 2, the sound was either accelerating or decelerating, and participants were asked to indicate when they became unable (or, respectively, able) to perceive the direction, hence the strong hysteresis observed. Bottom: spectral content of Féron's harmonic sounds, with fundamental frequencies of 330 Hz, 440 Hz, 880 Hz, and 1760 Hz. We use a −50 dB threshold for the time-frequency representation, which is below the absolute threshold of audition. The analyses show that the higher the pitch, the narrower the BW, explaining the observed decrease in UL.