Spectral Weighting Underlies Perceived Sound Elevation

The brain estimates the two-dimensional direction of sounds from the pressure-induced displacements of the eardrums. Accurate localization in the horizontal plane (azimuth angle) is enabled by binaural difference cues in timing and intensity. Localization in the vertical plane (elevation angle), including frontal and rear directions, relies on spectral cues that arise from the elevation-dependent filtering by the idiosyncratic pinna cavities. However, the problem of extracting elevation from the sensory input is ill-posed, since the sensed spectrum results from a convolution of the source spectrum with the particular head-related transfer function (HRTF) associated with the source elevation, both of which are unknown to the system. It is not clear how the auditory system deals with this problem, or which implicit assumptions it makes about source spectra. By varying the spectral contrast of broadband sounds around the 6–9 kHz band, which falls within the human pinna's most prominent elevation-related spectral notch, we here suggest that the auditory system performs a weighted spectral analysis across different frequency bands to estimate source elevation. We explain our results with a model in which the auditory system weights the different spectral bands, and compares the convolved, weighted sensory spectrum with stored information about its own HRTFs and with spatial prior assumptions.


Stimulus spectra.
Figure S1 provides two examples of stimulus power spectra (calibrated in dBA) as used in the experiments. The spectra were calculated from the white-noise samples generated in Matlab; the solid black line shows a windowed running average through the data. The stimuli correspond to the top-left and bottom-right sounds of Figure 6, respectively.
Figure S1. The calculated power spectra for a bandstop (at -36 dB) (A), and a bandpass (at +36 dB) (B) acoustic stimulus, as used in the experiments. The band extends from 6-9 kHz.
In Figure S2 we present the influence of spectral contrast and stimulus level on the azimuth response components of all subjects. The azimuth localization results for different values of NRI and ORI for the 25 stimuli of subject P1 are highlighted by the open symbols (individual trials) and bold black regression lines, with the regression results given in the lower-right of each subpanel. This listener was quite precise in localizing the azimuth components of the stimuli (evidenced by the high r² values, in combination with high gains and low biases), regardless of the changes in NRI and ORI. The regression results for the other subjects are indicated by the green lines and were quite consistent. Despite the large variation in NRI and ORI, over a 36 dB range in both acoustic dimensions, regression lines were nearly optimal, with only a modest scatter of the data around the regression line (high r²). Performance was slightly affected at the lowest intensities and largest positive contrasts only (bottom row of the stimulus matrix), as evidenced by an increased variability around the regression line.
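The band-shaped stimuli described above can be sketched in a few lines. The snippet below (in Python, although the paper's stimuli were generated in Matlab; the sampling rate, duration, and analysis bands are illustrative choices of ours, only the 6-9 kHz band and the ±36 dB contrast follow the text) scales the 6-9 kHz band of Gaussian white noise in the frequency domain and verifies the realized spectral contrast:

```python
import numpy as np

def band_shaped_noise(fs=48_000, dur=0.5, f_lo=6_000.0, f_hi=9_000.0,
                      band_gain_db=-36.0, seed=None):
    """Gaussian white noise whose f_lo..f_hi band is scaled by band_gain_db
    (negative -> bandstop-like, positive -> band-boosted stimulus)."""
    rng = np.random.default_rng(seed)
    n = int(fs * dur)
    x = rng.standard_normal(n)
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(n, 1.0 / fs)
    X[(f >= f_lo) & (f <= f_hi)] *= 10.0 ** (band_gain_db / 20.0)
    return np.fft.irfft(X, n)

# Check the realized spectral contrast (should be about -36 dB in power):
fs = 48_000
x = band_shaped_noise(fs=fs, band_gain_db=-36.0, seed=1)
P = np.abs(np.fft.rfft(x)) ** 2
f = np.fft.rfftfreq(len(x), 1.0 / fs)
in_band = P[(f >= 6_500) & (f <= 8_500)].mean()
out_band = P[(f >= 2_000) & (f <= 5_000)].mean()
contrast_db = 10.0 * np.log10(in_band / out_band)
```

Shaping in the frequency domain keeps the transition at the band edges sharp, which is convenient for imposing a well-defined contrast between the notch band and the rest of the spectrum.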

Average behavior in azimuth.
To quantify a potential systematic azimuth dependency on spectral contrast and sound level, Figure S3 shows the spatial gains (left-hand column), biases (center column) and r² values (right-hand column), averaged across all subjects per stimulus, in the same format as for the elevation responses in Figure 3. In contrast to the elevation response components, a nearly uniform pattern emerges for the response gain and bias, which did not vary significantly from stimulus to stimulus. There is only a small effect on the azimuth precision at the highest spectral contrasts and lowest sound levels (right-hand column). Responses are quite consistent across subjects.
Figure S3. Average influence of the combination of NRI and ORI (A), of the spectral contrast (NRI-ORI) (B), and of the absolute sound level (C), on the azimuth gain (left-hand column), azimuth bias (center column) and azimuth precision (right-hand column) for all subjects. The thick solid orange line and shaded orange area correspond to the average and standard deviation across all subjects. Note that all subjects demonstrated very similar behaviour.
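The gain, bias, and r² measures used throughout can be obtained from an ordinary least-squares regression of responses on target locations. The sketch below (with simulated data; all numeric values are illustrative, not the subjects' actual parameters) shows the computation:

```python
import numpy as np

def response_regression(targets, responses):
    """Fit response = gain * target + bias and return (gain, bias, r^2),
    the per-stimulus localization measures used in the analysis."""
    gain, bias = np.polyfit(targets, responses, 1)
    pred = gain * targets + bias
    ss_res = ((responses - pred) ** 2).sum()
    ss_tot = ((responses - responses.mean()) ** 2).sum()
    return gain, bias, 1.0 - ss_res / ss_tot

# Simulated azimuth data: near-unity gain, small bias, modest scatter.
rng = np.random.default_rng(0)
t = rng.uniform(-60.0, 60.0, 200)            # target azimuths (deg)
resp = 0.95 * t + 1.5 + rng.normal(0.0, 3.0, 200)
g, b, r2 = response_regression(t, resp)
```

A gain near 1, a bias near 0, and a high r² together indicate accurate and precise localization; the r² value in particular captures the trial-to-trial scatter around the regression line.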

The Bayesian prediction for response gain.
In our model we assume that, in the first processing stage, the auditory system cross-correlates the (weighted) spectral sensory input, e.g., as measured by the auditory nerve and dorsal cochlear nucleus, with all learned and stored spectral pinna filters. After rectification, the result leads to a likelihood function of potential target locations in elevation, ε, which depends on the current stimulus location, say at ε*, here called L(ε|ε*). To ensure optimal localization, with minimal absolute localization errors (best accuracy) and minimal variability (best precision), the final estimation process involves the contribution of a spatial prior, P(ε), which we assume is centered around straight ahead. Bayes' rule then transforms the likelihood function into a more precise posterior distribution, which specifies the probability of finding the target at elevation ε, given the noisy sensory evidence and the spatial prior information:

P(ε|ε*) ∝ L(ε|ε*) · P(ε)

Multiplication of two Gaussian distributions again yields a Gaussian distribution, with a mean that lies between the two means of the original Gaussian distributions, and with a variance that is smaller than that of either of the two. Suppose that the prior and likelihood function can be described by the following two Gaussian distributions:

P(ε) = N_P · exp(−ε²/(2σ_P²))   and   L(ε|ε*) = N_L · exp(−(ε − ε*)²/(2σ_T²))

where N_P and N_L are normalization constants, and σ_P and σ_T are the widths of the distributions. In that case, it follows that their product yields a posterior distribution for which the mean and variance are calculated as:

μ_POST = ε* · σ_P²/(σ_P² + σ_T²)   and   σ_POST² = σ_P² σ_T²/(σ_P² + σ_T²)

The optimal Bayesian response is found by taking the maximum of the posterior distribution (the Maximum-A-Posteriori, or MAP, estimate). Over the course of many trials, these maxima are found at the mean of the posterior. We thus define the optimal response gain as

g_OPT = μ_POST/ε* = σ_P²/(σ_P² + σ_T²)

Note that in the absence of sensory noise, σ_T ≈ 0, the response gain g_OPT ≈ 1. When the noise is large, i.e., σ_T ≫ σ_P, the response gain g_OPT ↓ 0.
The former condition is typically obtained for the azimuth response components, for which the binaural difference cues (per frequency channel) can be measured quite reliably, except at the poorest SNRs. The latter condition (low gain), however, is readily obtained for the elevation response components under poor spectral conditions: at low SNRs (soft sounds), or with poor spectral resolution.
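Under the Gaussian assumptions above, the MAP gain can be checked numerically. This sketch (with arbitrary illustrative values for the prior and likelihood widths) compares the analytic gain σ_P²/(σ_P² + σ_T²) with the peak of a gridded posterior:

```python
import numpy as np

def map_gain(sigma_p, sigma_t):
    """Optimal response gain for a zero-centered Gaussian prior (width
    sigma_p) and a Gaussian likelihood (width sigma_t) at the target."""
    return sigma_p**2 / (sigma_p**2 + sigma_t**2)

# Numerical check: the posterior peak, divided by the true elevation,
# equals the analytic gain.
eps = np.linspace(-90.0, 90.0, 36001)        # elevation grid (deg)
eps_true, s_p, s_t = 40.0, 30.0, 15.0        # illustrative values
prior = np.exp(-eps**2 / (2 * s_p**2))
likelihood = np.exp(-(eps - eps_true)**2 / (2 * s_t**2))
posterior = prior * likelihood               # Bayes' rule (unnormalized)
g_numeric = eps[np.argmax(posterior)] / eps_true
```

With σ_P = 30 and σ_T = 15 the analytic gain is 900/(900 + 225) = 0.8, so a 40 deg target is localized at 32 deg; the two limiting cases (σ_T ≈ 0 and σ_T ≫ σ_P) follow directly from the same expression.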

Other priors.
Interestingly, our extended model in Figure 4 also accounts for the low response gains obtained in cases where the sensory evidence is actually very strong (e.g., when the 6-9 kHz notch region is boosted; lower-right corner of the stimulus matrix in Figure 6), yet points at a particular, fixed elevation (a considerable upward bias for high positive spectral-contrast stimuli). The reason for this counterintuitive behavior (at least in terms of the simple Bayesian framework described above) is that our model adopts four independent priors, instead of only one: (i) the pinna prior: pinna filters are uncorrelated (i.e., they are unique for each elevation angle); (ii) the prior on source spectra: natural source spectra do not correlate with any of the pinna filters; (iii) the prior on frequency bands: some frequency bands are more informative about changes in elevation than others; and (iv) the spatial prior: the expected distribution of potential target locations.
The low-gain/high-bias responses at high-positive spectral contrasts are thus caused by a strong dominance of the third prior, leading to a peak in the likelihood function that is unrelated to the actual stimulus location, but points consistently to upward stimulus positions.

Model simulations.
To simulate the qualitative elevation response patterns of the subjects, as summarized in Figure 3A-C, with the model of Figure 4, we approximated human HRTFs by a canonical set of spectral shape functions (Figure S4A). The elevation-independent ear-canal resonance was positioned at 2.5 kHz (modeled by a Gaussian spectral filter with a width of 1 kHz and an amplitude of +15 dB). The center of the spectral notch ran between 6 and 9 kHz in an elevation-dependent way (linear dependence, for simplicity), with a notch width that decreased systematically with elevation. At high elevations (> 35 deg), a peak emerged in the HRTFs at 8 kHz, whose center shifted towards lower frequencies with increasing elevation (compare with Figure 1). See section S6 for the details.

Figure S4. (A) Parameterized approximation of human HRTFs (cf. Figure 1) as used in the model simulations of Figure 5. (B) The spectral weighting function as used in the simulations.
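The canonical shape functions can be parameterized along these lines. In the sketch below, the 2.5 kHz resonance (+15 dB, 1 kHz width), the 6-9 kHz elevation-dependent notch, and the high-elevation peak near 8 kHz follow the description above, but the notch depth, the width schedules, and the elevation range are illustrative choices of ours:

```python
import numpy as np

def gauss(f, center, sigma, amp_db):
    """Gaussian spectral bump/dip in dB, centered at `center` kHz."""
    return amp_db * np.exp(-(f - center)**2 / (2 * sigma**2))

def canonical_hrtf(elev_deg, f_khz, elev_min=-55.0, elev_max=85.0):
    """Toy HRTF magnitude (dB) vs. frequency (kHz) for one elevation."""
    frac = (elev_deg - elev_min) / (elev_max - elev_min)   # 0..1
    h = gauss(f_khz, 2.5, 1.0, 15.0)                       # ear-canal resonance
    notch_center = 6.0 + 3.0 * frac                        # 6 -> 9 kHz, linear
    notch_sigma = 1.2 - 0.7 * frac                         # narrows with elevation
    h += gauss(f_khz, notch_center, notch_sigma, -20.0)    # elevation notch
    if elev_deg > 35.0:                                    # high-elevation peak
        peak_center = 8.0 - 1.5 * (elev_deg - 35.0) / (elev_max - 35.0)
        h += gauss(f_khz, peak_center, 0.5, 10.0)
    return h

# The deepest spectral point should move from ~6 kHz (lowest elevation)
# to ~9 kHz (highest elevation):
f = np.linspace(0.2, 12.0, 1200)
low, high = canonical_hrtf(-55.0, f), canonical_hrtf(85.0, f)
notch_low = f[np.argmin(low)]
notch_high = f[np.argmin(high)]
```

Such a parameterization keeps the simulated spectral-correlation stage analytically transparent, since every elevation is fully described by a handful of Gaussian parameters rather than a measured transfer function.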
Data availability. The experimental data can be obtained from the corresponding author upon reasonable request.