Rapid but specific perceptual learning partially explains individual differences in the recognition of challenging speech

Banai, Karen; Karawani, Hanin; Lavie, Limor; Lavner, Yizhar

doi:10.1038/s41598-022-14189-8

Article
Open access
Published: 15 June 2022

Scientific Reports volume 12, Article number: 10011 (2022) Cite this article

Abstract

Perceptual learning for speech, defined as long-lasting changes in speech recognition following exposure or practice, occurs under many challenging listening conditions. However, this learning is also highly specific to the conditions in which it occurred, such that its function in adult speech recognition is not clear. We used a time-compressed speech task to assess learning following either brief exposure (rapid learning) or additional training (training-induced learning). Both types of learning were robust and long-lasting. Individual differences in rapid learning explained unique variance in recognizing natural-fast speech and speech-in-noise with no additional contribution for training-induced learning (Experiment 1). Rapid learning was stimulus specific (Experiment 2), as in previous studies on training-induced learning. We suggest that rapid learning is key for understanding the role of perceptual learning in online speech recognition whereas longer training could provide additional opportunities to consolidate and stabilize learning.

Introduction

The recognition of connected speech (i.e., utterances longer than one word) under adverse conditions (e.g., distortion, background noise), which are abundant in daily listening environments, can be challenging¹, but practice can lead to substantial improvements^{2,3,4,5,6,7,8,9}. These improvements reflect perceptual learning, defined as relatively long-lasting changes in the ability to extract information from the environment following experience or practice^10,11,12. However, stimulus specificity is considered a hallmark of perceptual learning^11,13,14,15, and perceptual learning of speech is indeed often specific to the acoustic characteristics of the stimuli encountered during practice^16,17,18,19. On the other hand, many sources of acoustic variability are present in typical listening environments (e.g., different talkers, different types and levels of background noise etc.)^1,20, making it unlikely that acoustically-specific past learning could support ‘real-world’ future speech recognition. Alternatively, rapid learning could support online perception in challenging conditions by allowing listeners to quickly adapt to the acoustic characteristics of a broad range of conditions^21,22. Consistent with this hypothesis, recent studies suggest that perceptual learning ‘clusters’ across a range of conditions and could thus form an individual capacity^23,24,25,26. Furthermore, we found that rapid improvements in the recognition of one type of acoustically challenging speech are associated with individual differences in the recognition of other forms of challenging speech^22,27,28. Nevertheless, as we focused on rapid changes, it is not clear whether this learning reflects perceptual learning as defined above (that is: relatively long-lasting). 
Therefore, in the current study we ask whether rapid perceptual learning that is retained over time is associated with individual differences in speech recognition (Experiment 1), and whether this learning is as stimulus specific as found in previous training studies^17,29 (Experiment 2).

In the context of speech, perceptual learning has been studied for both speech-parts (e.g., syllables, phonemic categories) and connected speech (e.g., sentence recognition)¹². Whereas the former clearly reflects perceptual learning, it could be argued that improvements in the recognition of connected speech under adverse conditions reflect higher-level linguistic or attentional processes, especially when they occur rapidly. However, models of perceptual learning explicitly acknowledge the role of these processes. For example, according to the Reverse Hierarchy Theory (RHT³⁰), rapid perceptual learning operates on high-level representations of speech that are behaviorally relevant in a speech recognition scenario. Others³¹ argued that the key to perceptual learning is sufficient stimulus-related neural activity. In connected speech recognition, linguistic context can enhance this activity and therefore support perceptual learning^1,4. Our definition of perceptual learning is neutral with respect to the processes modified by learning. Instead, we focus on the longevity and stimulus specificity of rapid learning. There is some indication that even rapid learning of speech is maintained over time^32,33. In contrast, results for the specificity of either rapid or training-induced perceptual learning of connected speech are mixed^{5,6,16,18,28,34,35,36,37,38,39,40}. For time-compressed speech, some studies found that learning was not specific to the compression rate and even transferred from time-compressed to natural-fast speech^5,37, but others did not^28,29,41. Methodological differences make the outcomes hard to compare across studies. Therefore, one goal of the current study was to test the talker specificity of rapid learning of time-compressed speech across different rapid learning protocols and test times (immediate and delayed).

The potential role of perceptual learning in speech recognition

Theories of both perceptual learning³⁰ and speech processing^42,43 suggest that encounters with speech input trigger implicit and largely automatic processes which attempt to match this input to long-held representations. However, in daily listening situations automatic matching can fail due to lack of agreement between new inputs and long-term representations (e.g., due to sources of acoustic variability like noise or accent). Theoretically, this failure can trigger a learning process that gradually allows listeners to resolve finer-grained acoustic details and helps them recognize previously unrecognizable input³⁰. Because learning is triggered by a specific input, learning is at least partially specific to the acoustics of the input^7,30,42. This specificity probably constrains the role of learning in complex communication environments. One option is that intensive experience is required to yield learning that supports speech recognition. However, training-induced learning of challenging speech is often quite specific to the trained stimuli^18,40,44,45. Therefore, it can support future speech perception only to the extent that newly encountered situations replicate the conditions encountered in training. Consequently, intensive training studies are not a good analogue for real-life conditions, in which a practice period is unlikely and the acoustics can change rapidly (e.g., in a multi-talker conversation). Consistent with this view, training often fails to yield quantifiable benefits in any untrained conditions, despite good learning on the trained ones, even in listeners with perceptual difficulties (e.g., due to hearing loss)^19,46,47. Studies on learning new speech categories [e.g., 13, 33] are also not a good approximation for daily environments because they usually do not use connected speech.

Another potential role of perceptual learning which we pursue here is based on rapid learning: if learning occurs rapidly, it could serve as a skill listeners can recruit whenever they encounter new acoustic challenges. Accordingly, specific learning could afford optimal adaptation to the particulars of a new acoustic challenge without more general and undesirable changes in speech perception. Rapid-learning studies are more representative of real-world challenges than training studies because they often include little stimulus repetition and connected speech materials^{4,5,28,34,35,48,49}. Therefore, this account is more ecological than accounts based on the generalization of past learning. Consistent with the idea that perceptual learning is a general resource, recent findings show that learning is correlated across different tasks and even across modalities^23,24,26.

Overview of the current study

We conducted two experiments using time-compressed speech to elicit learning. In Experiment 1, we compared learning and retention between rapid and training-induced learning of time-compressed speech, to determine whether rapid learning conforms to the definition of perceptual learning. We also asked whether the two types of learning are differentially correlated with speech recognition in two different tasks—speech-in-noise and natural-fast speech. We report that rapid learning was maintained over time, consistent with the definition of perceptual learning. Furthermore, perceptual learning of time-compressed speech was associated with the perception of natural-fast speech and speech-in-noise, with no apparent differences between rapid and training-induced learning. Experiment 2 focused on the characteristics of rapid learning by exploring the effects of stimulus repetition and talker variability on rapid perceptual learning of time-compressed speech. Comparison of the outcomes to those of previous studies on learning following training^16,17,29 suggests that rapid learning of time-compressed speech is as specific as training-induced learning.

Experiment 1

Methods

Participants

A total of 160 university students or recent graduates (ages 18–35 years, Mean = 26, SD = 3, 91 female and 69 male) participated in this experiment. Participants were volunteers and reported they were native Hebrew speakers, with normal hearing and no history of attention, learning or language deficits and no experience with time-compressed speech. The study was performed in accordance with the declaration of Helsinki. All aspects of the study were approved by the ethics committee of the Faculty of Social Welfare and Health Sciences at the University of Haifa (permit #199/12). Informed consent was obtained from all participants. Participants were tested as described below; no other tests were conducted.

Participants were randomly divided into two groups: a rapid-learning group that was exposed to time-compressed speech during testing but received no additional training, and a training group that was tested like the other group and also completed additional training as described below. Both groups completed two test sessions on separate days, in which they performed the speech tasks described below. We note that parts of the data from the rapid-learning group were previously published in conference proceedings⁵⁰ and were re-analyzed for the purpose of the current study. One participant had missing data and was not included in data analysis, so we report data from 79 listeners in the rapid-learning group (age: Mean = 26, SD = 4; 38 female, 41 male) and 80 listeners in the training group (age: Mean = 26, SD = 3; 52 female, 28 male).

Overall design

As shown in Fig. 1, the experiment comprised two sessions, 5 to 9 days apart. In each session, all participants completed three speech recognition tests (time-compressed speech, natural-fast speech and speech-in-noise) in a counterbalanced order. The training group received additional training on time-compressed speech at the end of the first session. Participants completed the experiment in a quiet room on campus or in their homes. Stimuli were delivered to the two ears through headphones (Sennheiser HD-205 or HD-215) at a comfortable listening level, using custom software⁴⁴. The time-compressed speech task was used to assess learning within and between sessions. Comparisons between the rapid-learning and the training groups were used to assess differences between rapid learning induced by the time-compressed speech tests and training-induced learning. The other two tasks were used to determine whether perceptual learning of one type of speech is related to recognition of other types of challenging speech.

Stimuli and tasks

Stimuli

290 simple sentences in Hebrew (based on Prior and Bentin⁵¹), were used. Sentences were five to six words long and had a subject-verb-object grammatical structure. Half of the sentences were semantically plausible (e.g., “the talented poet wrote a poem”) and half the sentences were implausible (e.g., “the angry shopkeeper fired the rabbit”).

Stimuli for the speech-in-noise and time-compressed speech tests were recorded by Talker 1, a female native speaker of Hebrew with an average speech rate of 111 words/min (SD = 17). Stimuli for the natural-fast speech test were recorded by Talker 2, also a female, native speaker of Hebrew, at an average natural-fast rate of 214 words/minute (SD = 26) because prior testing²² suggested that natural-fast speech by Talker 1 was not fast enough to challenge university students who are native speakers of Hebrew. Sentences were recorded in a sound attenuating room at a sampling rate of 44.1 kHz, with a standard microphone, and edited in Audacity® software© 2.1.3 to remove remaining noise and equate root-mean-square (RMS) amplitude across sentences.

Speech recognition tests

Sentences were randomly divided across tests such that on each test half the sentences were plausible, and half were implausible. Different sentences were used on each test and session. Within a test, the order of the sentences was random but fixed across participants, with no sentence repetition. Sentence delivery was self-paced. Participants were asked to transcribe the sentences as accurately as they could, and the number of correctly transcribed words was counted for each sentence. Only perfectly transcribed words (ignoring homophonic spelling errors) were counted as correct. The proportion of correct words per sentence was used as an index of recognition accuracy. The order of the three tests was counterbalanced across participants.
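
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `proportion_correct` is hypothetical, words are split on whitespace, matching ignores word order, and the handling of homophonic spelling errors mentioned in the text is not implemented because the source does not specify how it was done.

```python
def proportion_correct(target: str, response: str) -> float:
    """Fraction of target words that appear, spelled exactly, in the response.

    Assumption: matching is order-independent and each response word
    can be credited at most once.
    """
    target_words = target.split()
    remaining = response.split()
    hits = 0
    for word in target_words:
        if word in remaining:
            remaining.remove(word)  # consume the matched response word
            hits += 1
    return hits / len(target_words)


# Example with the plausible sentence quoted in the Stimuli section:
# the listener misses "poet", so 5 of 6 target words are credited.
score = proportion_correct(
    "the talented poet wrote a poem",
    "the talented poem wrote a",
)
```

Averaging these per-sentence proportions within a test gives the recognition accuracy values reported in the Results.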

Speech-in-noise tests

On each session participants had to transcribe 25 different sentences. Sentences produced by Talker 1 were mixed with 4-talker babble noise⁴⁴ at a signal-to-noise ratio of –6 dB.
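
As a rough illustration of what a –6 dB signal-to-noise ratio means, the sketch below scales a noise signal so that the RMS-based ratio between speech and noise matches a target value before the two are summed. The source does not state how the ratio was computed, so the RMS definition, the function names, and the synthetic signals (a sine tone and Gaussian noise standing in for a sentence and babble) are all assumptions.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 20*log10(rms(speech)/rms(noise)) equals snr_db, then add."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


random.seed(0)
fs = 44100  # sampling rate used for the recordings
speech = [math.sin(2 * math.pi * 220 * t / fs) for t in range(fs)]  # placeholder signal
noise = [random.gauss(0.0, 1.0) for _ in range(fs)]                 # placeholder noise

mixture, scaled_noise = mix_at_snr(speech, noise, -6.0)
achieved_snr_db = 20 * math.log10(rms(speech) / rms(scaled_noise))
```

At –6 dB the noise RMS is about twice the speech RMS, which is consistent with the fairly low accuracy reported for this task.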

Natural-fast speech tests

On each session participants had to transcribe 20 different sentences produced by Talker 2.

Time-compressed speech tests

On each session participants transcribed 10 sentences produced by Talker 1. To afford isolation of the rapid learning effects, we used the minimal number of sentences thought to yield rapid learning in the majority of participants based on previous work²². Sentences were compressed to 30% of their natural duration using a WSOLA algorithm⁵².
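
To put the 30% compression in perspective, the short calculation below converts Talker 1's mean natural rate into an approximate effective rate after compression. It does not implement WSOLA; it only assumes that compressing a sentence to a fraction of its duration divides the speech rate by that fraction. The function name is hypothetical.

```python
def compressed_rate(words_per_min: float, duration_ratio: float) -> float:
    """Approximate speech rate after compressing to `duration_ratio` of the original duration."""
    return words_per_min / duration_ratio


# Talker 1 spoke at a mean of 111 words/min; compression was to 30% of natural duration.
effective_rate = compressed_rate(111, 0.30)   # roughly 370 words/min
natural_fast_rate = 214                       # Talker 2's mean natural-fast rate
```

Under this assumption the time-compressed sentences were considerably faster than the natural-fast sentences, which fits the much lower baseline accuracy on the time-compressed task.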

Training

Three blocks of 60 sentences each produced by Talker 1 were delivered on Session 1. In the first block, participants had to transcribe sentences compressed to 30% of their natural duration, as described above. The additional two blocks were adaptive. For each sentence participants had to determine whether it was semantically plausible or not. This procedure was used to give participants extra training without overburdening them. In these adaptive blocks, initial compression was 50%. Subsequently a 2-down/1-up staircase procedure was used to adjust compression based on participants’ responses. The training phase took 30–45 min to complete.
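
The adaptive rule can be sketched as follows. The source gives only the starting compression (50%) and the 2-down/1-up rule; the step size, the floor and ceiling, and the function name used here are hypothetical values chosen for illustration. In this sketch, two consecutive correct plausibility judgments make the next sentence more compressed, and any error makes it less compressed.

```python
def staircase_2down_1up(responses, start=0.50, step=0.05, floor=0.10, ceiling=1.00):
    """Return the duration ratio presented on each trial, plus the level after the last trial.

    responses: iterable of booleans (True = correct judgment).
    A smaller ratio means stronger compression (a harder trial).
    """
    level = start
    streak = 0
    track = [level]
    for correct in responses:
        if correct:
            streak += 1
            if streak == 2:                      # two correct in a row: harder
                level = round(max(floor, level - step), 2)
                streak = 0
        else:                                    # one error: easier
            level = round(min(ceiling, level + step), 2)
            streak = 0
        track.append(level)
    return track


track = staircase_2down_1up([True, True, True, False, True, True])
# track == [0.5, 0.5, 0.45, 0.45, 0.5, 0.5, 0.45]
```

A 2-down/1-up rule of this kind is commonly used because it tends to settle near the stimulus level at which about 71% of responses are correct.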

The training phase itself was not the focus of this study, and detailed analyses of learning during this phase were published elsewhere^16,17,29,41. However, to determine that training did elicit learning, we analyzed the data from the non-adaptive recognition block as follows: For each participant a learning curve was constructed based on average performance in 6 ‘mini-blocks’ of 10 sentences each, and the slopes of these curves were calculated. Slopes were positive in 75/82 participants, suggesting that most participants learned during this phase. The average slope (Mean = 0.024, SD = 0.020) was significantly larger than zero (t(81) = 11.20, p < 0.001) with a large effect size (Cohen’s d = 1.24).
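
A minimal sketch of the learning-curve slope described above, assuming an ordinary least-squares fit of mini-block mean accuracy against mini-block index (the source does not say how the slopes were estimated). The function name and the data are made up for illustration.

```python
def miniblock_slope(sentence_scores, block_size=10):
    """Least-squares slope of mean accuracy across consecutive mini-blocks."""
    blocks = [
        sentence_scores[i:i + block_size]
        for i in range(0, len(sentence_scores), block_size)
    ]
    means = [sum(b) / len(b) for b in blocks]
    xs = list(range(1, len(means) + 1))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(means) / len(means)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, means))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Made-up listener: 60 sentences, accuracy rising by 0.02 per mini-block.
scores = [0.10 + 0.02 * (i // 10) for i in range(60)]
slope = miniblock_slope(scores)   # about 0.02 per mini-block
```

With this definition a positive slope means accuracy increased over the course of the 60-sentence block, which is the sense in which slopes were positive in most participants.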

Data analysis

Recognition accuracy data (proportions of correctly identified words) were analyzed in R⁵³ with a series of generalized linear mixed models using the lme4 package⁵⁴. Generalized models were used because they require fewer assumptions on the distributions of the data and are more suitable for proportion data⁵⁵; mixed models were used because they are recommended for individual differences studies with language data^56,57. Figures were created in Matlab (R2019b; https://www.mathworks.com/) and Microsoft Office 365 (https://www.office.com/).

Learning analysis

We used data from the time-compressed speech tests to assess rapid perceptual learning within and between sessions as well as training-induced learning (see “Results”). Learning between the two test sessions was our main index of learning because it manifests the retention of learning over time. To this end, for each participant the proportion of words correctly transcribed across all sentences within a session was averaged and the difference between the averages of the two sessions served as a learning index. For the rapid-learning group, this is an index of the rapid learning induced by completing the tests. For the training group, the value is a mixture of the rapid learning that occurred during the tests and the additional contribution of training-induced learning. Group effects in the statistical models described in the Results were used to statistically separate rapid and training-induced learning. Within-session learning across sentences was also modeled to further assess rapid learning and how it may interact with training-induced learning.

Results

Rapid learning of time-compressed speech is perceptual

Time-compressed speech recognition in the two groups and sessions is shown in Fig. 2. In the rapid-learning group, mean recognition accuracy was 0.20 (SD = 0.14) in session 1, and 0.33 (SD = 0.21) in session 2. In the training group, mean accuracy was 0.26 (SD = 0.18) in session 1 and 0.47 (SD = 0.22) in session 2. Our first goal was to determine whether learning of time-compressed speech occurred between the two sessions and whether it differed between the two groups. Learning, defined as the amount of improvement on time-compressed speech recognition accuracy between the two sessions, is also shown in Fig. 2. It suggests that recognition accuracy of the majority of participants in both groups improved between the two sessions.

To determine whether this learning was significant, and whether it was modulated by additional practice, mixed modelling was conducted. Random effects included random intercept for participants, as well as a sentence by participant random slope to account for the possibility that learning rates (changes in accuracy over sentences) vary across participants. Fixed effects included group (rapid-learning, dummy coded as 0 and training, coded as 1), sentence number (coded 1 to 10) and session (session 1 coded 0 and session 2 coded as 1). A binomial regression with logistic link function was used (as recommended for proportion data⁵⁵). Three models were constructed. A model that included the random effects only (AIC = 11,485), a model with additional main effects for each of the three fixed factors (AIC = 10,670), and a “full” model that included all possible interaction terms between the fixed factors (AIC = 10,558). Model comparisons (using the anova() function in R) suggested that the model with main effects fits the data significantly better than the model with random effects only (χ²₍₃₎ = 821, p ≤ 0.001) and the full model fits the data better than the model with only main effects (χ²₍₄₎ = 120, p ≤ 0.001).
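
The reported χ² values can be cross-checked against the reported AIC values. For nested models fitted by maximum likelihood to the same data, AIC = 2k − 2 log L, so the likelihood-ratio statistic equals the drop in AIC plus twice the number of added parameters. The sketch below applies that identity to the rounded AIC values quoted above; the function name is hypothetical, and small discrepancies elsewhere in the paper can arise from rounding.

```python
def lrt_chi2_from_aic(aic_reduced: float, aic_full: float, extra_params: int) -> float:
    """Likelihood-ratio chi-square for nested models, recovered from their AICs.

    chi2 = -2 * (logL_reduced - logL_full)
         = (AIC_reduced - AIC_full) + 2 * extra_params
    """
    return (aic_reduced - aic_full) + 2 * extra_params


# Random-effects-only model vs. main-effects model (3 added fixed effects).
main_vs_random = lrt_chi2_from_aic(11485, 10670, 3)   # 821
# Main-effects model vs. full model (4 added interaction terms).
full_vs_main = lrt_chi2_from_aic(10670, 10558, 4)     # 120
```

Both values match the statistics reported in the text, and the degrees of freedom of each test equal the number of added terms.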

The effects in the full model (see Table 1) were used to determine whether learning occurred, whether it was maintained over time and whether it differed between the two groups. As expected, a significant main effect of sentence was present, confirming that rapid learning of time-compressed speech occurred within session. Overall performance was more accurate in the second session (main effect of session). A test of simple main effects confirmed that accuracy increased in both groups (rapid-learning: estimate = 0.82, Z = 14.75, p < 0.001; training: estimate = 1.15, Z = 22.25, p < 0.001). Between-session improvements were larger in the training group (significant group by session interaction), suggesting that training resulted in greater learning than brief exposure. On the other hand, the magnitude of learning within a session was smaller in the second session (significant sentence by session interaction). Although Fig. 3 suggests that the magnitude of decline in rapid learning between sessions could have been larger in the training group, the group by session by sentence interaction was not significant.

Table 1 Fixed effects and interactions from the full learning model.

To help interpret the effects from the statistical model, within-session learning is presented in Fig. 3. Each listener transcribed 10 (different) time-compressed sentences in each session, and learning was defined as the difference in transcription accuracy between the final and first 5 sentences in a session. Figure 3 suggests that in the first session this learning was larger than zero (t(161) = 12.87, p < 0.001, Cohen’s d = 1.01) with no significant difference between the two groups (Mean = 0.14 and 0.15; SD = 0.14 and 0.15 in the rapid-learning and training groups, respectively; t(160) = − 0.43, p = 0.623). As suggested by the sentence by session interaction in the full model, in the second session learning was still significant (t(161) = 2.42, p = 0.008) but of smaller magnitude (Cohen’s d = 0.19; Mean = 0.07 and − 0.01, SD = 0.13 and 0.16, in the two groups respectively). While the group by sentence by session interaction was not significant in the full model, in the second session learning was significant in the rapid-learning group (t(79) = 4.7, p < 0.001; Cohen’s d = 0.53) but not in the training group (t(81) = − 0.51, p = 0.61, Cohen’s d = − 0.06). Furthermore, whereas learning during the second session was observed in 56/79 participants in the rapid-learning group (with a median of 0.06 and interquartile range from 0 to 0.13), only 41/80 participants in the training group continued to improve during session 2 (Median = 0.003, IQR = − 0.087 to 0.087; χ² = 6.44, p = 0.011).
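
The comparison of the proportions of improvers in the two groups (56/79 vs. 41/80) can be illustrated with a Pearson chi-square test on the 2 × 2 table. The source does not name the test, so the choice of an uncorrected Pearson statistic is an assumption; the function name is also hypothetical. For one degree of freedom the p-value has a closed form via the complementary error function.

```python
import math


def pearson_chi2_2x2(a: int, b: int, c: int, d: int):
    """Uncorrected Pearson chi-square and p-value (df = 1) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Rows: rapid-learning group, training group.
# Columns: improved during session 2, did not improve.
chi2, p_value = pearson_chi2_2x2(56, 79 - 56, 41, 80 - 41)
```

Under this assumption the statistic comes out at about 6.44 with p of about 0.011, in line with the values reported above.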

Taken together, these data suggest that rapid learning occurred during the first test session, and to a lesser extent during the second session in participants that did not receive additional training (the rapid-learning group); additional training resulted in additional learning between sessions, and reduced rapid learning in the second session. Furthermore, rapid learning was maintained between sessions, conforming to the definition of perceptual learning.

Rapid learning and individual differences in speech recognition

One of the goals of this experiment was to determine whether perceptual learning of time-compressed speech was associated with speech perception in independent tasks (natural-fast speech and speech-in-noise), and if so, whether rapid and training-induced learning differed in this respect.

Speech perception in the two groups and sessions is shown in Fig. 4 (Natural-fast speech: Session 1: Mean = 0.86, SD = 0.10 and Mean = 0.85, SD = 0.09; Session 2: Mean = 0.89, SD = 0.09 and Mean = 0.89, SD = 0.11 in the rapid-learning and training groups, respectively; Speech-in-noise: Session 1: Mean = 0.41, SD = 0.15 and Mean = 0.46, SD = 0.20; Session 2: Mean = 0.30, SD = 0.15 and Mean = 0.35, SD = 0.16 in the rapid-learning and training groups, respectively). Similar performance in the two groups in Session 2 would suggest that the training provided to the training group during Session 1 had no significant contribution to performance in those tasks, and thus that associations between between-session learning on time-compressed speech and Session 2 performance on the other speech tasks reflect rapid perceptual learning. Therefore, speech perception data in each task was modelled as a function of group, session and group by session interaction as fixed effects and random intercepts for participants and individual sentences. Model comparisons suggested that the model with all fixed effects (AIC = 13,832) was a better fit to the natural-fast speech data than the model with random effects only (AIC = 13,939; χ²₍₃₎ = 112, p ≤ 0.001). The fixed effects (see Table 2) suggested that natural-fast speech recognition was more accurate in Session 2, but as both the group effect and the session by group interaction were insignificant, this is not due to generalization of training-induced learning of time-compressed speech in the training group. Similarly, for speech-in-noise, model comparisons suggested that the model with fixed effects (AIC = 25,334) was a better fit for the data than the model with random effects only (AIC = 25,959; χ²₍₃₎ = 631, p ≤ 0.001). Although speech-in-noise recognition was poorer in Session 2 than in Session 1, there was no indication that this is due to training (see Table 2). 
Therefore, Session 2 speech perception data were used in the following analyses to assess the associations between perception and learning of time-compressed speech.

Table 2 Natural-fast speech and speech-in-noise perception as a function of group and session.

Speech recognition is plotted in Fig. 5 as a function of perceptual learning. To determine how perceptual learning contributed to speech recognition in the two tasks, data were again modelled with mixed-effects binomial regression with a logistic link function. For each speech task, the following models were constructed: (1) a “random” model with random intercepts for participant and sentence; (2) a “main effects” model which included three additional main effects: group (rapid-learning coded as 0 and training coded as 1), perceptual learning (the difference between Session 2 and Session 1 as plotted in Fig. 2) and baseline recognition of time-compressed speech (mean of the first 5 sentences from session 1), with the two continuous predictors scaled; and (3) an “interaction” model in which the group by learning interaction was also included. Model comparisons were used to determine whether the “main effects” model fits the speech data better than the model with random effects only. Then the “main effects” and “interaction” models were compared to determine whether the contribution of perceptual learning to speech perception differed between the rapid-learning and training groups.

For natural-fast speech, the “main effects” model (AIC = 5991) significantly improved data fit over the “random” model (AIC = 6030, χ²₍₃₎ = 45, p ≤ 0.001). Adding the group by learning interaction in the “interaction” model had no significant effect (AIC = 5993, χ²₍₁₎ = 0.69, p = 0.406, see Table 3 for the parameters of the best fitting model). For speech-in-noise, the “main effects” model (AIC = 11,538) fitted the data significantly better than the “random” model (AIC = 11,586, χ²₍₃₎ = 54, p ≤ 0.001). Addition of the group by learning interaction had no significant impact on the fit (AIC = 11,539, χ²₍₁₎ = 0.77, p = 0.381, see Table 3 for the parameters of the best fitting model).

Table 3 Prediction models for speech recognition as a function of perceptual learning.

Together, these results suggest that perceptual learning of time-compressed speech contributes to the recognition of both natural-fast speech and speech-in-noise. Additional training did not modify these associations significantly (the group by learning interactions were not significant). While similar associations were reported before for within-session learning²², the current findings suggest that the association reflects perceptual learning that is retained over time rather than transient effects. Furthermore, the current findings suggest that the contribution of perceptual learning is not attributable to generalization across speech tasks.

Experiment 2

The outcomes of Experiment 1 suggest that rapid perceptual learning of time-compressed speech is associated with individual differences in other speech tasks. They also suggest that additional training yields additional learning on the trained task. On the other hand, the characteristics of rapid learning, and particularly its stimulus specificity, are not well understood. Both talker variability and stimulus repetition were previously suggested to influence the specificity of perceptual learning for speech^58,59,60. Although we found no effect for either of these factors in past training studies with time-compressed speech^17,29, they could still influence more rapid learning on this task. Experiment 2 therefore explored the effects of repetition (5 repetitions of each of 4 sentences and 20 repetitions of a single sentence) and talker variability (1 vs. 5 talkers) on rapid learning of time-compressed speech and its talker specificity.