Signal envelope and speech intelligibility differentially impact auditory motion perception

Our acoustic environment contains a plethora of complex sounds that are often in motion. To gauge approaching danger and communicate effectively, listeners need to localize and identify sounds, which includes determining sound motion. This study addresses which acoustic cues impact listeners’ ability to determine sound motion. Signal envelope (ENV) cues are implicated in both sound motion tracking and stimulus intelligibility, suggesting that these processes could be competing for sound processing resources. We created auditory chimaera from speech and noise stimuli and varied the number of frequency bands, effectively manipulating speech intelligibility. Normal-hearing adults were presented with stationary or moving chimaeras and reported perceived sound motion and content. Results show that sensitivity to sound motion is not affected by speech intelligibility, but shows a clear difference for original noise and speech stimuli. Further, acoustic chimaera with speech-like ENVs which had intelligible content induced a strong bias in listeners to report sounds as stationary. Increasing stimulus intelligibility systematically increased that bias and removing intelligible content reduced it, suggesting that sound content may be prioritized over sound motion. These findings suggest that sound motion processing in the auditory system can be biased by acoustic parameters related to speech intelligibility.


Results
In this study, we tested the relative impacts of signal ENV (noise vs. speech) and signal intelligibility on sound motion perception. To do so, we created acoustic chimaeras from speech and SMN tokens (for details, see "Methods"). The full stimulus set comprised seven conditions with a speech-like ENV and seven conditions with an SMN ENV. For each ENV type, there were five chimaera stimuli created with 2, 4, 6, 8 and 16 frequency bands and two control stimuli. Chimaera stimuli that contained a speech-like ENV are referred to as Speech Chimaera (SC), while chimaera stimuli that contained a noise-like ENV are referred to as Noise Chimaera (NC). Subscripts indicate the number of frequency bands within each SC and NC (SC 2, SC 4, SC 6, SC 8, SC 16; NC 2, NC 4, NC 6, NC 8, NC 16). Two stimulus types were added as controls: (1) to test whether sound motion perception is affected differently for chimaera and non-chimaera stimuli, we added the original stimuli for each ENV type (original speech: OR S, original SMN: OR SMN); (2) to test whether sound motion perception is affected differently when the ENV type remains, but the stimulus content becomes unintelligible, we added a reversed 16-band chimaera for each ENV type (reversed speech: R S, reversed SMN: R SMN). In a sound-proofed chamber, each stimulus condition was presented from a frontal position both as stationary and as moving leftward or rightward across a 10° angular range. After sound presentation, normal-hearing (NH) listeners reported their perceived sound motion in a one-alternative forced-choice task, and verbally repeated what they understood.

Speech intelligibility.
Speech intelligibility was measured as the ratio of the number of stimuli whose content was correctly understood to the number of all presented stimuli.
Figure 1 plots the mean speech intelligibility score (± standard error of the mean, sem) as a function of condition for each ENV type (color legend), across the five chimaera conditions and both control conditions. Overall, speech intelligibility followed the expected trend for both SC (blue) and NC (red) stimuli as previously shown 8. As the number of frequency bands increased, speech intelligibility increased for SC conditions, from less than 10% (SC 2) to 89% (SC 16), but decreased for NC conditions, from 45% (NC 2) to 0% (NC 16). Speech intelligibility scores in the present study were generally lower than those reported by Smith and colleagues 8. However, this is to be expected, as we tested performance for single words, while Smith et al. (2002) utilized whole sentences, which provide more semantic context that is known to aid recognition. As expected, speech intelligibility was best for OR S, reaching ceiling performance at 100%, and poorest for OR SMN and both reversed stimuli, R S and R SMN, which showed floor performance at 0%.

www.nature.com/scientificreports/

Sensitivity to sound motion.
Recent work by Warnecke and colleagues showed no difference in sensitivity to sound motion between 8-band acoustic chimaeras with speech- or noise-like ENVs 24. However, the results of that study suggest that listeners have better sensitivity to sound motion for unprocessed stimuli and little to no response bias for all unprocessed and chimaera stimuli with a noise-like ENV. This led us to predict that listeners would show better sensitivity to sound motion for unprocessed stimuli with noise-like ENVs compared to speech-like ENVs. To evaluate how well listeners could distinguish between stationary and moving stimuli, we calculated sensitivity (d') for each condition. Figure 2 illustrates listeners' mean sensitivity to sound motion (± sem) across conditions, where a greater d' value indicates better sensitivity for sound motion.

Figure 1. Speech intelligibility across conditions for each ENV type. Speech intelligibility (mean ± sem) was measured as a function of chimaera and control conditions, with stimulus ENV type indicated by color. As the number of frequency bands increased, intelligibility of sound content increased for chimaera with a speech-like ENV (blue), but decreased for chimaera with a noise-like ENV (red). During control conditions, listeners' performance on sound content intelligibility reached ceiling levels for original speech (OR S), but floor levels for SMN (OR SMN) and both reversed control stimuli (R S/R SMN).

Figure 2. Sensitivity to sound motion across conditions for each ENV type. Sensitivity scores (d'; mean ± sem) are plotted as a function of chimaera and control conditions, with ENV type indicated by color. Higher scores indicate better discriminability. Across chimaera conditions, listeners were significantly more sensitive to sound motion for stimuli with noise-like ENVs (red) compared to speech-like ENVs (blue). In control conditions, original stimuli did not differ from chimaera stimuli, but listeners were significantly less sensitive to sound motion when stimuli were speech (OR S) compared to SMN (OR SMN).

To evaluate the impact of signal ENV on how well listeners could distinguish between stationary and moving sounds, we analyzed the chimaera conditions using a 5 × 2 repeated measures analysis of variance (ANOVA). The number of frequency bands (2, 4, 6, 8, 16) and ENV type (noise/speech) were entered as within-subject factors. As expected, we found a significant main effect for ENV type (F 1,21 = 11.27, p = 0.003), indicating that listeners were more sensitive to sound motion when presented with chimaera stimuli that had a noise-like ENV (mean = 2.08, sem = 0.12), compared to a speech-like ENV (mean = 1.72, sem = 0.11). There was no main effect for the number of frequency bands (F 4,84 = 1.5, p = 0.2) and no interaction (F 4,84 = 0.92, p = 0.45). Our experiment contained two control stimulus types to evaluate (1) whether sensitivity to sound motion would differ between chimaeric stimuli (SC 16/NC 16) and non-chimaeric stimuli (OR S/OR SMN), and (2) whether sensitivity would be impacted when the speech intelligibility between chimaeric stimuli differed (SC 16/NC 16 vs. R S/R SMN). We tested the difference in listeners' sensitivity to sound motion for each of these controls using factorial ANOVA analyses. First, a 2 × 2 ANOVA for ENV type (noise/speech) and stimulus type (Chimaera/Original) showed a main effect of ENV type (F 1,84 = 12.44, p = 0.0007; noise: mean = 2.03, sem = 0.16; speech: mean = 1.22, sem = 0.16), but no effect of the stimulus type (F 1,84 = 1.22, p = 0.30), and no interaction (F 1,84 = 1.3, p = 0.25).
These results indicate that ENV was a prominent cue for sensitivity to sound motion, independently of whether stimuli were modified to be chimaeric, or not. Further, in a second 2 × 2 ANOVA for ENV type (noise/speech) and stimulus type (Chimaera/Reversed Chimaera), we found no significant main effect or interaction (whole model test: F 3,84 = 0.8, p = 0.45). This indicates that listeners' sensitivity to sound motion was not substantially impacted by the intelligibility of a stimulus' content; a finding that is further supported by the non-significant impact of number of frequency bands on sound motion sensitivity (see above).
Previous research indicates that there is no significant difference between localizing stationary speech and broadband signals, such as SMN [24][25][26][27]. However, it is unclear whether this trend extends to listeners' sensitivity for detecting the sound motion of speech and SMN signals. We therefore evaluated sound motion sensitivity for the stimuli that represented the original speech (OR S) and broadband (OR SMN) signals. A two-tailed t-test between the OR S and OR SMN conditions showed that listeners were significantly less sensitive to sound motion (t(42) = -3.36, p = 0.0016) when presented with OR S stimuli (mean = 0.97, sem = 0.21) as compared to OR SMN (mean = 2.05, sem = 0.23), indicating that it was more difficult to detect sound motion when stimuli were original speech compared to SMN.

Response bias for sound motion.
Studies on psychophysical phenomena commonly use d' indices of sensitivity as a primary measure of interest. However, a subject's decision-making process can be biased. For example, an equally detectable signal can yield the same percent correct performance, but opposite response biases, in two subjects 28. To fully assess performance, it is thus imperative to also evaluate the response bias, also known as the decision criterion (c). Previous research found that listeners showed a response bias for stimuli with speech-like ENVs, judging them as stationary, whereas stimuli with noise-like ENVs did not show that bias 24. However, the content of stimuli with speech-like ENVs was also perceived as intelligible speech more often than the content of stimuli with noise-like ENVs. This created a confound as to what was driving the stationary response bias.
One possibility is that the previously observed response bias was influenced exclusively by the signal ENV. In that case, independently of how well listeners understand the content of each stimulus, all stimuli with a speech-like ENV (SC 2-16, OR S and R S) should induce a stationary response bias, i.e. a bias criterion value less than 0. Moreover, independently of how well listeners understand the content of each stimulus, all stimuli with a noise-like ENV (NC 2-16, OR SMN and R SMN) should then show a non-stationary bias, i.e. a criterion value greater than or equal to 0. By contrast, if the response bias was influenced exclusively by how well listeners understood the content of each stimulus, i.e. its speech intelligibility, then an increase in intelligibility scores should induce an increase in response bias to report sounds as stationary, independently of the signal's ENV (e.g. NC 2, SC 4-16, OR S). Importantly, if content intelligibility contributes to stationary bias, then removing only content intelligibility (R S) should remove the bias. Finally, it is possible that an interaction of speech intelligibility and ENV type promotes response biases to sound motion. Figure 3 plots listeners' mean response bias criterion (± sem) across conditions. While a criterion of 0 indicates no response bias, bias criteria greater than 0 or less than 0 indicate a bias towards judging sounds as moving or stationary, respectively. Perfect scores were adjusted by 1/(2N), the ratio of 1 to twice the total number of hits and misses (N). Previous work informed a priori predictions for the response biases of SC and NC stimuli (see above), leading us to evaluate whether signal ENV biased listeners towards judging sounds as stationary or moving using one-tailed t-tests for each chimaera condition. As we tested simple effects for each condition, we did not adjust our alpha and tested at α = 0.05. The details of statistical testing can be found in Table 1.
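The sensitivity and bias measures used in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in the study. A "hit" is taken to be a "moving" response to a moving sound and a "false alarm" a "moving" response to a stationary sound; the function name `dprime_and_bias` is hypothetical; the 1/(2N) adjustment is also applied to rates of exactly 0, which the text does not state; and the bias is computed as 0.5 × (z(H) + z(FA)), so that positive values indicate a bias towards "moving" and negative values a bias towards "stationary", as described above. This sign is the opposite of the common textbook criterion, c = -0.5 × (z(H) + z(FA)).

```python
from statistics import NormalDist


def dprime_and_bias(hits, misses, false_alarms, correct_rejections):
    """Return (d', bias) for a stationary/moving judgement task.

    Assumed mapping (not stated in the text): a hit is a "moving" response
    to a moving sound; a false alarm is a "moving" response to a
    stationary sound.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF

    def adjusted_rate(count, n_trials):
        # Rates of exactly 0 or 1 give infinite z-scores, so they are
        # moved inward by 1 / (2N). Applying this to 0 is an assumption.
        rate = count / n_trials
        if rate == 1.0:
            return 1.0 - 1.0 / (2 * n_trials)
        if rate == 0.0:
            return 1.0 / (2 * n_trials)
        return rate

    hit_rate = adjusted_rate(hits, hits + misses)
    fa_rate = adjusted_rate(false_alarms, false_alarms + correct_rejections)

    d_prime = z(hit_rate) - z(fa_rate)
    # Sign chosen so that > 0 means a "moving" bias and < 0 means a
    # "stationary" bias, matching the convention in this section.
    bias = 0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, bias


# Hypothetical counts for one listener in one condition (15 trials each).
d_prime, bias = dprime_and_bias(hits=6, misses=9,
                                false_alarms=0, correct_rejections=15)
print(f"d' = {d_prime:.2f}, bias = {bias:.2f}")
```

With these made-up counts the listener rarely responds "moving", so the bias comes out negative, that is, towards "stationary".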
All conditions with stimuli containing a noise-like ENV (NC 2-16 , OR SMN , R SMN ) had bias criteria equal to or greater than 0. When testing the NC conditions, all but one condition with four or more frequency bands were significantly larger than 0 (see Table 1; NC 4 : mean = 0.24, sem = 0.12; NC 8 : mean = 0.26, sem = 0.14; NC 16 : mean = 0.33, sem = 0.16). NC 6 approached significance (Table 1; mean = 0.22, sem = 0.13), indicating that these conditions showed a small bias for being judged as moving. Analogously, conditions with a speech-like ENV (SC 2-16 , OR S ) except R S , had bias criteria less than 0, and all SC stimuli with at least six frequency bands were significantly smaller than 0 (see Table 1; SC 6 : mean = -0.42, sem = 0.1; SC 8 : mean = -0.37, sem = 0.1; SC 16 : mean = -0.74, sem = 0.1). This confirms that listeners were substantially biased towards judging SCs with at least six frequency bands as stationary.
We evaluated whether there was a difference in response bias for our control conditions to understand the impact of (1) chimaeric vs non-chimaeric stimulus types and (2) 16 ), which effectively removed response bias for the broadband signal (OR SMN , mean = 0, sem = 0.17), and revealed the strongest response bias towards a stationary percept for the speech signal (OR S , mean = -1.23, sem = 0.11; Fig. 3). These findings corroborate results from Warnecke and colleagues 24 . Second, to test whether the intelligibility of a stimulus' content impacted response bias, a 2 × 2 ANOVA for ENV type (noise/speech) and stimulus type (Chimaera/Reversed Chimaera) showed that both main effects and their interaction were significant (ENV type: F 1,84 = 13.66, p = 0.0004; Chimaera/Reversed Chimaera: F 1,84 = 10.7, p = 0.0016; interaction: F 1,84 = 13.42, p = 0.0004). This indicates that reversing the stimulus in an effort to remove the intelligibility of its content substantially removed listeners' response bias. Importantly, post-hoc comparisons using Tukey's HSD at α = 0.05 indicated that SC 16 differed significantly from all other control conditions (NC 16 , R S , R SMN ). This interaction highlights that the decrease in response bias for this control condition was driven by the difference between the two chimaera conditions with a speech-like ENV that manipulated intelligibility of stimulus content (SC 16 vs. R S ). Further, R S , the temporally-reversed SC 16 , which retained the speech-like ENV but had no intelligible content (R S , Fig. 1), was not significantly different from 0 in a subsequent two-tailed t-test to evaluate its response bias (t(21) = 1.8, p = 0.086; mean = 0.27, sem = 0.15). This indicates that there was no considerable response bias for this stimulus. 
Collectively, our results indicate that sound motion perception is impacted by an interaction of ENV type and speech intelligibility, because substantial stationary bias is (1) not observed for all stimuli with a speech-like ENV, and (2) not observed for all stimuli whose content is at least in part intelligible to listeners (e.g. NC 2). Instead, noise-like ENVs induced either no bias or a small bias toward a moving percept, while speech-like ENVs induced either no bias or a strong bias toward a stationary percept. Importantly, only for stimuli with speech-like ENVs, increasing stimulus content intelligibility increased stationary bias (e.g. SC 6, SC 8, SC 16, OR S), and removing content intelligibility removed stationary bias for sound motion detection (e.g. SC 16 vs. R S).

Figure 3. Response bias for sound motion across conditions for each ENV type. Response bias (c; mean ± sem) is plotted as a function of chimaera and control conditions, with ENV type indicated by color. Positive numbers indicate a bias towards judging sounds as moving, while negative numbers indicate a bias towards judging sounds as stationary (grey arrows). Statistical significance (see "Results") is indicated by stars. Listeners showed a stationary perceptual bias for chimaera stimuli that had a speech-like ENV (blue) and at least six frequency bands. By contrast, chimaera stimuli with a noise-like ENV (red) and 4, 8 or 16 frequency bands induced a moving perceptual bias. Bias differed significantly for control conditions, in which listeners showed no bias for OR SMN, while the strongest stationary bias was induced for OR S. Removing sound content intelligibility, R S, removed the stationary bias that was observed in listeners when the same sound's content was intelligible, SC 16.

Discussion
In real-life listening situations we are confronted with a cacophony of sounds, which can be stationary but are often in motion relative to a listener's head. Spatial auditory perception provides awareness of our surroundings. Crucially, it serves to localize sounds, categorize them as useful or dangerous, and enhance communication by selectively guiding attentional resources to a talker or location of interest 2,29,30. Under ideal conditions, humans can localize stationary broadband sounds with impressive acuity [31][32][33][34][35]. The ability to localize sounds in space depends on the integration of several binaural acoustic cues, namely ITDs and ILDs 31,34. A few studies have shown that listeners are able to localize stationary speech sounds in the horizontal plane as accurately as non-speech broadband stimuli 26,27,36, and recent work showed that there is no difference in localization accuracy for a stationary acoustic chimaera with speech-like or noise-like ENVs 24. Extending these findings of stationary sound localization, we investigated listeners' sensitivity to sound motion. While we found no differences in motion sensitivity as the number of frequency bands (i.e. stimulus content intelligibility) changed, results showed that sound motion was overall more difficult to detect for stimuli with speech-like ENVs compared to stimuli with noise-like ENVs (Fig. 2). Further, we found a significant difference in sound motion sensitivity for the original speech stimulus (OR S) compared to the original noise stimulus (OR SMN), two stimuli which show no difference in their stationary localizability [24][25][26][27]. Future work on this result would benefit from further controlling the effective speech duration to more aptly compare these two types of stimuli. Together, the sensitivity results point to an important difference in the ability to locate a sound vs. detect its motion, suggesting that different underlying mechanisms could contribute to sound motion detection.
Moving sounds create dynamic binaural changes in ITDs and ILDs. Previous work suggests that dynamic ILD cues may be more salient than ITD cues to discriminate sound velocity [21][22][23]31 , an important component of auditory motion perception 20 . We recently proposed that listeners may be tracking changes in the short-term level of a moving signal across spatial locations 24 . That is, when a moving acoustic signal contains a quickly-changing noise-like ENV, a listener would be comparing approximately equal energy levels from one spatial location to the next, indicating that perceived changes in ILDs are mostly due to changes in spatial location. This may help the listener to detect whether a sound moved. In support of that hypothesis, the present study found that listeners were more sensitive to sound motion for any stimuli that contained a noise-like ENV (Fig. 2). In contrast, when a moving acoustic signal contains a speech-like ENV, a listener would be comparing energy levels that vary more slowly in their short-term level across time, while they also move from one spatial location to the next. In such a case, when the sound is in motion, ILD changes could be due to changes in spatial location or changes in the temporal short-term level from the signal itself. These co-occurring cues may confound the judging of sound motion, leading listeners to perceive sounds with a speech-like ENV as stationary.
This conjecture suggests that sound ENV contributes to sound motion detection. However, the results of the present study indicate that not all stimuli with a speech-like ENV result in a stationary bias (e.g. SC 2 , SC 4 , R S ). Hence, at least one other factor may influence sound motion detection.
Stationary response bias increased when listeners' understanding of the content of the stimulus increased, indicating that speech intelligibility may impact sound motion detection when coupled with a speech-like ENV (Fig. 3). Increasing the number of frequency bands of the SC effectively also increased their spectro-temporal resolution, improving the stimulus content intelligibility (Fig. 1). As the intelligibility of stimulus content increased, so did the stationary response bias, reaching statistical significance for any SC with at least 6 frequency bands (Fig. 3). Comparing the SC with highest content intelligibility, SC 16, to its reversed version, R S, provides strong evidence that speech intelligibility is a salient component of the stationary response bias for sound motion, because reversing SC 16 retained all temporal and spectral components of the chimaera, but removed the intelligibility of its content (Fig. 1). An alternative control in future work could involve foreign word stimuli to exclude the possibility that differences in speech onset and decay time contributed to this effect 37. In combination with the systematic increase in SC stimuli being categorized as stationary when their content intelligibility improved (Fig. 3), the results demonstrate that sound content intelligibility is one contributor to stationary biases for moving sounds. Note that sounds were not biased to be perceived as stationary in the NC 2 condition, where stimuli had reached about 45% intelligibility. It is possible that, independently of stimulus ENV, a stationary bias for a moving sound may only manifest itself when there is a certain minimum stimulus content intelligibility (for example, stationary bias for SC stimuli was only observed when stimulus content intelligibility was at least 62%).
Collectively, our results show that sensitivity to sound motion is accurate for the SMN stimulus, as evident from listeners' high sensitivity scores in that condition (OR SMN, Fig. 2) and the absence of a response bias (OR SMN, Fig. 3). Further, and importantly, sound motion perception in the auditory system appears biased: acoustic chimaera stimuli with a noise-like ENV induce a small moving bias, while both chimaeric and non-chimaeric stimuli with a speech-like ENV induce a stationary bias. Importantly, increasing the content intelligibility of stimuli with a speech-like ENV systematically increases stationary bias (Fig. 3), and presenting listeners with clear speech induces the strongest bias toward a stationary percept (OR S, Fig. 3). Further, removing intelligible content from acoustic stimuli removes the stationary perceptual bias (R S, Fig. 3).
Scientific Reports | (2021) 11:15117 | https://doi.org/10.1038/s41598-021-94662-y
Previous work suggested that the processing of speech is similar to that of non-speech stimuli 1, but it has since been argued that speech might be treated differently than non-speech sounds in auditory perception 38,39. In fact, our results show that with regard to sound motion perception, the two types of stimuli show different perceptual biases. Recent neuroimaging studies demonstrated distinct neural systems in the auditory processing of intelligible compared to unintelligible speech (e.g. 40,41), giving evidence for an anterior auditory "what" pathway. Subsequent work demonstrated differential activation of the auditory "where" and "what" pathways when listeners were asked to attend to sound location or feature 42, outlining parallel processing for sounds in the auditory "where" and "what" pathways. Using magnetoencephalographic imaging, researchers determined the time scale of pathway activation and showed that a dissociation between the "where" and "what" pathways started after about 100 ms. Importantly, the "where" pathway was activated about 30 ms earlier than the "what" pathway. In the present study, a dual-task was implemented whereby listeners were asked to report sound motion (thereby needing to localize sounds) and the content of the sound. It is likely that the processing of this task engaged the auditory "where" and "what" pathways. Assuming that the activation of the "where" pathway occurred about 30 ms earlier than that of the "what" pathway, determining sound location could have preceded processing of sound content. Evidence from speech recognition studies has demonstrated that incoming information, such as natural speech, is segmented and processed on the basis of a sliding window of temporal integration.
In the context of speech, psychophysical research and theoretical work suggest that two temporal windows may be sliding in parallel in the temporal domain, processing the acoustic signals at the syllabic (~ 125-200 ms) and the phonemic level (~ 25-50 ms) [43][44][45]. If, after initial determination of sound location, the sound content was perceived to be intelligible, subsequent speech processing of our disyllabic stimuli may have prevented the detection of simultaneously occurring changes in sound location of the moving sound, thereby facilitating the stationary percept of these stimuli. Additional reinforcement of the perceptual bias for the processing of moving speech may have been driven by selective attention: human neurophysiological work has revealed that selectively attending to phonetic content increases the neural response in the "what" pathway, while selectively attending to sound location increases the neural response in the "where" pathway 42. This could reflect selectivity for task-relevant information. In the present study, listeners were not told to attend to either sound location or content; rather, they performed a dual-task. If a listener's attention was momentarily guided toward the content of acoustic stimuli which contained linguistically relevant content, it is possible that this selective attention contributed to processing of the sound content at the expense of processing its motion.
Our work shows that listeners are biased in processing acoustic information, indicating that the auditory system can be misled. By understanding the system's bias, we can utilize the knowledge of misperceptions to guide studies on auditory processing and representations, improve processing algorithms for devices that aid listeners with hearing impairments, and refine the creation of virtual acoustic environments.

Methods
Participants. Twenty-two listeners (ages 19 to 24 years, avg. 20 years) participated in this study, and received either university credit or payment for their participation. All listeners passed hearing screening at octave frequencies between 250 and 8000 Hz, defined as thresholds ≤ 20 dB HL, and none had extensive experience as research participants in psychoacoustic studies. All participants were naïve to the study's experimental design and purpose and gave written informed consent prior to the experiment. All experimental procedures followed the regulations set by the National Institutes of Health and were approved by the University of Wisconsin's Health Sciences Institutional Review Board.
Test stimuli and experimental design. In the experiment described below, speech tokens were 420 unique disyllabic words from the Isolated Words corpus 46 , spoken by multiple male and female talkers and recorded at 44.1 kHz. All words started with a consonant. The collection of words had an average duration of 515 ms, ranging from 381 to 656 ms.
To create variations of individual speech tokens, we followed several steps of stimulus creation. First, a matching SMN was created for each word by synthesizing noise of the same power spectrum and duration as each speech token, via randomizing the phase of its Fourier spectrum. Second, to create chimaeras of the speech and SMN signals, we utilized Smith et al.'s (2002) chimaera-generating approach 8. The original speech token (OR S) and SMN stimulus (OR SMN) were each bandpass-filtered into 2, 4, 6, 8 and 16 frequency bands between 200 and 8000 Hz according to the Greenwood function 47. High-frequency content (> 8 kHz) is needed for accurate localization in the vertical, polar plane, but not the horizontal, medial plane 48, which we tested here. Subsequently, for each frequency band, the ENV and TFS of both the OR S and OR SMN signals were extracted using the Hilbert transform and exchanged, such that the ENV of one signal was superimposed on the TFS of the other, and vice versa. The newly created signals were summed in the time domain to form a multi-band chimaera. As such, for each pair of a speech token, OR S, and its matching SMN, OR SMN, and for each number of frequency bands, two chimaeras were created: one containing a speech-like ENV and the other containing an SMN ENV. For this study, speech chimaera (SC) refers to chimaera stimuli with a speech-like ENV and SMN TFS, whereas noise chimaera (NC) refers to chimaera stimuli with an SMN ENV and speech TFS. Third, to test the impact of envelope independently of speech intelligibility, a set of acoustic stimuli was created by reversing the 16-band SC (R S) and the 16-band NC (R SMN) stimulus in the time domain. In total, the tested conditions contained 7 stimuli with a speech-like ENV (SC 2, SC 4, SC 6, SC 8, SC 16, OR S, R S), as well as 7 stimuli with an SMN ENV (NC 2, NC 4, NC 6, NC 8, NC 16, OR SMN, R SMN).
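Two of the steps above, placing band edges on a cochlear frequency map and exchanging ENV and TFS, can be illustrated with a short, self-contained Python sketch. It is not the stimulus-generation code used in the study and makes several assumptions: the Greenwood parameters (165.4, 2.1, 0.88) are the commonly cited human values, which the text does not specify; the bands are taken to be equally wide in cochlear position; the bandpass filterbank itself is omitted, so `swap_env_tfs` operates on a single band; and a naive discrete Fourier transform stands in for an FFT library so that the example needs only the standard library. All function names are hypothetical.

```python
import cmath
import math

# Commonly cited human parameters of the Greenwood function
# F = GW_A * (10 ** (GW_ALPHA * x) - GW_K), x = relative cochlear position.
# Assumption: the text does not state which parameter values were used.
GW_A, GW_ALPHA, GW_K = 165.4, 2.1, 0.88


def greenwood_band_edges(n_bands, f_low=200.0, f_high=8000.0):
    """Edges (Hz) of n_bands bands equally spaced in cochlear position."""
    def position(freq):
        return math.log10(freq / GW_A + GW_K) / GW_ALPHA

    x_low, x_high = position(f_low), position(f_high)
    step = (x_high - x_low) / n_bands
    return [GW_A * (10 ** (GW_ALPHA * (x_low + i * step)) - GW_K)
            for i in range(n_bands + 1)]


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for short demos)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        out.append(total / n if inverse else total)
    return out


def analytic_signal(samples):
    """Analytic signal via the frequency-domain Hilbert transform."""
    n = len(samples)
    weights = [0.0] * n
    weights[0] = 1.0                  # keep DC
    if n % 2 == 0:
        weights[n // 2] = 1.0         # keep the Nyquist bin
    for k in range(1, (n + 1) // 2):
        weights[k] = 2.0              # double the positive frequencies
    spectrum = dft(samples)
    return dft([s * w for s, w in zip(spectrum, weights)], inverse=True)


def swap_env_tfs(env_source, tfs_source):
    """Single-band chimaera: ENV of env_source on the TFS of tfs_source."""
    env = [abs(v) for v in analytic_signal(env_source)]
    tfs = [math.cos(cmath.phase(v)) for v in analytic_signal(tfs_source)]
    return [e * c for e, c in zip(env, tfs)]


print([round(f) for f in greenwood_band_edges(4)])
```

In a full pipeline, `swap_env_tfs` would be applied to each bandpass-filtered pair of speech and SMN signals, and the per-band outputs summed in the time domain.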
A total of 5,880 stimuli were created from these fourteen conditions for each of the 420 disyllabic words. Each of the fourteen conditions was presented 15 times as stationary, and 15 times as moving, totaling 420 trials per participant. All stimuli and sound motions (stationary/moving) were pseudo-randomly assigned. All sounds started at 0°, and while stationary sounds remained there for their duration, moving sounds traveled a 10° angular distance, leftward or rightward, and always ended at ± 10° azimuth. We calculated sensitivity and response bias as a measure of sound motion identification for each condition.
Procedure. Prior to the main experiment, participants were familiarized with the experimental stimuli by listening to 28 randomly selected examples of stimuli that represented each of the fourteen conditions, once as stationary, and once as moving. These 28 stimuli were not part of a participant's stimulus set for the main experiment. During familiarization, participants were not told that the stimuli would be stationary or moving and the only task was to repeat what they understood. During the main experiment, listeners were asked to identify both the motion of the sound (i.e., classify the sound as "stationary" or "moving") and the content of the sound. As such, the experiment employed a dual-task utilizing a one-alternative forced-choice testing paradigm. Participants pressed the touchscreen to start a trial. Subsequently, a sound (stationary or moving) played from the central location of the horizontal loudspeaker array. Participants were instructed to face forward and keep their head still during sound presentation; after sound offset they could move their head. After sound offset, the frontal touchscreen displayed the sentence "The sound I just heard was …" and two selection boxes labeled "stationary" and "moving." Participants indicated their perceived sound motion by selecting a choice.
To indicate the perceived sound content, participants verbally repeated what they understood, which the experimenter, who was seated outside of the testing chamber, manually recorded. Subsequently, participants could start a new trial by pressing the touchscreen.
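The trial structure described under "Test stimuli and experimental design" can be sketched as follows. This is an illustrative reconstruction, not the experiment software: the text only says that stimuli and motions were pseudo-randomly assigned, so using each word exactly once per participant and drawing the motion direction at random are assumptions, and the condition labels, word placeholders and function name are hypothetical.

```python
import random

SPEECH_ENV = ["SC2", "SC4", "SC6", "SC8", "SC16", "OR_S", "R_S"]
NOISE_ENV = ["NC2", "NC4", "NC6", "NC8", "NC16", "OR_SMN", "R_SMN"]
CONDITIONS = SPEECH_ENV + NOISE_ENV      # fourteen conditions
MOTIONS = ("stationary", "moving")
REPS_PER_MOTION = 15                     # per condition and motion type


def build_trial_list(words, seed=0):
    """Pseudo-randomly pair words with (condition, motion) cells."""
    rng = random.Random(seed)
    cells = [(cond, motion)
             for cond in CONDITIONS
             for motion in MOTIONS
             for _ in range(REPS_PER_MOTION)]
    if len(words) < len(cells):
        raise ValueError("need at least one word per trial")
    # Assumption: each word is used once per participant.
    chosen = rng.sample(words, len(cells))
    rng.shuffle(cells)
    trials = []
    for word, (cond, motion) in zip(chosen, cells):
        # All sounds start at 0 degrees; moving sounds end at -10 or +10.
        # Assumption: direction is drawn at random on each moving trial.
        end_azimuth = rng.choice((-10, 10)) if motion == "moving" else 0
        trials.append({"word": word, "condition": cond, "motion": motion,
                       "start_azimuth": 0, "end_azimuth": end_azimuth})
    return trials


words = [f"word_{i:03d}" for i in range(420)]   # placeholder tokens
trials = build_trial_list(words)
print(len(trials), len(CONDITIONS) * len(words))  # prints: 420 5880
```

The two printed numbers reproduce the counts given in the text: 14 conditions × 2 motion types × 15 repetitions = 420 trials per participant, and 14 conditions × 420 words = 5,880 stimuli in total.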

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.