The functional relations among motor-based prediction, sensory goals and feedback in learning non-native speech sounds: Evidence from adult Mandarin Chinese speakers with an auditory feedback masking paradigm

Liu, Xiaoluan; Tian, Xing

doi:10.1038/s41598-018-30399-5

Published: 09 August 2018

Scientific Reports volume 8, Article number: 11910 (2018)

Abstract

Previous studies in speech production and acquisition have mainly focused on how feedback vs. goals and feedback vs. prediction regulate learning and speech control. The present study investigated a less studied relation, prediction vs. goals, in the context of adult Mandarin speakers’ acquisition of non-native sounds, using an auditory feedback masking paradigm. Participants were asked to learn two types of non-native vowels, /ø/ and /ɵ/ (the former being less similar than the latter to Mandarin vowels), in either feedback-available or feedback-masked conditions. The results show that there was no significant improvement in learning the two targets when auditory feedback was masked. This suggests that motor-based prediction could not be directly compared with sensory goals in adult second language acquisition. Furthermore, auditory feedback can help achieve learning only if the competition between prediction and goals is minimal, i.e., when target sounds are distinct from existing sounds in one’s native speech. The results suggest that motor-based prediction and sensory goals may share a similar neural representational format, which could result in competition for neural resources in speech learning. Feedback can conditionally overcome such interference between prediction and goals. Hence, the present study further probed the functional relations among key components (prediction, goals and feedback) of sensorimotor integration in speech learning.

Introduction

In speech acquisition and control, three factors collectively contribute to the development and maintenance of accurate speech production: auditory feedback, motor-based prediction and sensory goals^1,2,3,4. A considerable amount of scholarly effort has been devoted to the relations of feedback vs. prediction and feedback vs. goals in speech learning and control. The present study aims to address the less studied issue, i.e., prediction vs. goals, through the lens of adult Mandarin Chinese speakers’ acquisition of different types of non-native speech sounds.

Auditory feedback, motor-based prediction and sensory goals in speech production

Auditory feedback has attracted considerable scholarly interest due to its importance in speech production. Previous research using empirical and modelling approaches to motor-sensory integration in speech production^5,6,7 suggests that auditory feedback is mainly responsible for ensuring successful vocal achievement of speech targets because feedback enables comparison between the speaker’s own speech and the target speech. More specifically, feedback adjusts the motor control system according to online speech production in such a way that it corrects and updates the internal motor prediction system until speech errors are eliminated and target speech is achieved⁸.

Examples abound as to the importance of auditory feedback in speech production and acquisition. Hearing impairments in early life can severely affect children’s normal speech production and development⁹. Adults suffering from hearing loss also tend to encounter difficulties in maintaining accurate speech production, especially in terms of speech rate, intensity and F0^10,11. The intelligibility of vowels and consonants is also affected if speakers are deprived of proper auditory feedback in adulthood¹². When the hearing environment becomes less optimal (e.g., in noise), the need for auditory feedback is reflected in the Lombard effect¹³, i.e., speakers’ involuntary increase in vocal loudness in response to ambient noise.

However, due to the delay in neural processing of sensory feedback (e.g., neural conduction and central processing), effective control of motor movement also requires the involvement of motor-based prediction, i.e., an internally maintained representation of motor control which predicts consequences of motor movement¹⁴. In speech production, motor-based prediction (prediction henceforth) refers to the internal estimate of the current state of vocal tract dynamics and subsequent auditory results^3,15,16. Such prediction is based on prior knowledge of the causal relation between speech motor commands and sensory output^1,2,3,4,15,16,17,18,19,20. In particular, the causal relation reflects a key dimension of motor learning (e.g., speech acquisition), the primary computation of which requires the formalization of the motor plan based on sensory targets (i.e., sensory-to-motor transformation). Establishing such computation and verifying the correctness of the sensory-to-motor transformation require an online estimate of how good the results of the planned motor movement will be. Therefore, prediction about sensory consequences of the motor system is needed.

Prediction is hypothesized to be realized through a system termed the internal forward model^1,7,21, which estimates the possible outcome of articulatory movement before auditory feedback is received. Such a mechanism thus enables fast online motor correction of non-standard articulatory output. Auditory feedback, meanwhile, serves to update and maintain the internal forward model². Thus, prediction allows one to deal with complex speech situations where various extents of articulatory and perceptual demands could challenge accurate speech production^2,22. Evidence for the existence of prediction can be found in perturbation studies where either speech articulation^23,24 or perceptual feedback^25,26,27 was artificially perturbed, e.g., by shifting either pitch or formants from a standard baseline. It was found that perturbation generally elicited a compensatory speech motor response opposite in direction to the manipulated shift in feedback. These observations imply that an internal prediction of the speech target is used to guide the online compensation for the artificially introduced feedback perturbation.

Both auditory feedback and prediction help monitor and verify the achievement of sensory goals, which are a major factor in speech production^3,28,29. As Fairbanks³⁰ postulated, the goal of speech production is represented by sensory outcomes. Specifically, sensory goals define the final target of speech production, and they could be stored in the auditory and phonological systems. Thus, the function of sensory goals in speech production is to activate speech targets stored in memory during speech acquisition². The most obvious example of the existence of sensory goals is that children’s speech production patterns are determined by the acoustic input they obtain². Adult speech production is also influenced by ambient speech because adults tend to automatically reproduce phonetic patterns (e.g., pitch, vowel features) introduced in their surrounding acoustic environment³¹. Moreover, speakers with greater sensory acuity tend to produce larger phonological contrasts (e.g., vowels), which implies the recruitment of sensory goals in speech production²⁹. In addition, several studies have shown the activation of auditory-related brain areas (e.g., posterior superior temporal sulcus/gyrus) in accessing phonological representations in speech production^32,33. Sensory goals thus have been modeled as a major stage in speech production^1,3,5,34,35.

Overall, auditory feedback, prediction and sensory goals are important factors in speech production. The relations between the factors in speech production have been captured in prominent theoretical models such as the Directions Into Velocities of Articulators (DIVA) model^1,28, Task Dynamic (TD) model³⁶, State Feedback Control (SFC) model⁴, and similar models such as those proposed in Tian & Poeppel^15,16 and Hickok³. In those models, an efference copy is sent out to an internal model of the vocal tract by motor commands to predict the current state of the vocal tract. The sensory outcome of the motor commands is estimated by an additional auditory efference copy. Errors occur when there is a deviation between the predicted and actual sensory feedback, which will then be fed back to the internal model to guide motor movement until the goal is correctly achieved.
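The error-driven loop shared by these models can be illustrated with a deliberately simplified sketch. It is not an implementation of DIVA, TD or SFC: the scalar "vocal tract" gain, the learning rate and all function and parameter names are hypothetical, chosen only to show how a mismatch between predicted and actual feedback can update an internal model, and how that error signal disappears when feedback is switched off.

```python
def simulate_learning(goal, true_gain, n_trials=10,
                      feedback_available=True, learning_rate=0.5):
    """Toy forward-model loop (illustrative assumption, not a published model).

    goal:       desired auditory outcome (arbitrary units, must be non-zero)
    true_gain:  how the (unknown) vocal tract maps a command to sound
    Returns the absolute goal-vs-outcome error on every trial.
    """
    estimated_gain = 1.0              # internal model inherited from prior practice
    command = goal / estimated_gain   # motor plan that the model predicts will hit the goal
    errors = []
    for _ in range(n_trials):
        produced = true_gain * command        # actual auditory outcome
        predicted = estimated_gain * command  # prediction from the efference copy
        errors.append(abs(goal - produced))
        if feedback_available:
            # Prediction error (feedback minus prediction) updates the internal model...
            estimated_gain += learning_rate * (produced - predicted) / command
            # ...and the command is re-planned so the new prediction matches the goal.
            command = goal / estimated_gain
        # Without feedback the prediction already equals the goal,
        # so no error signal is available and nothing is updated.
    return errors


with_feedback = simulate_learning(goal=2.0, true_gain=1.5)
without_feedback = simulate_learning(goal=2.0, true_gain=1.5,
                                     feedback_available=False)
```

In this sketch the error shrinks across trials only when feedback is available; with feedback removed, the error stays at its initial value because prediction and goal coincide by construction.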

Relations among prediction, goals, and feedback in adult acquisition of different types of non-native speech sounds

The mechanisms of feedback, prediction and goals in speech production discussed above are particularly evident in human vocal learning^1,2,5: goals are compared with feedback in establishing the causal relation between motor commands and sensory output. Child L1 learning and adult L2 learning are two important domains of vocal learning. Compared with infants and children, adults tend to encounter more difficulties in learning the phonetic/phonological aspects of a foreign language^37,38. From a sensorimotor perspective, such difficulties could be attributed to the already well-established sensory-to-motor and motor-to-sensory transformations (i.e., the mechanism for motor-based prediction) in adults as a result of long-term practice of their native language. Young children, on the other hand, usually do not achieve mature, adult-like motor control of speech production until 11 or 12 years of age³⁹. Consequently, in early stages of native speech acquisition, auditory feedback is heavily relied upon to monitor speech production errors. Adults, by contrast, rely more on motor-based prediction mechanisms in speech production because they can accurately estimate the causal relation between motor commands and sensory consequences^40,41.

When it comes to learning a foreign language, the mature motor-sensory loops established in adults’ native language could become a barrier in effectively forming new motor-sensory transformation for the foreign language, hence the ‘critical period hypothesis’⁴² and ‘fundamental difference hypothesis’⁴³ on the neurocognitive differences between child L1 and adult L2 learners. That helps explain why a high similarity between L1 and L2 sounds (i.e., an overlap in phonological space between L1 and L2) could hinder adult L2 learners’ ability to effectively establish the representation of L2 sounds^37,38,44. The reason could lie in the difficulty in adjusting the already mature motor-to-sensory establishment for L1 to the new L2 sounds that are close in perception and production to L1 sounds. Supporting evidence mainly comes from speech perception studies. A typical example is that Japanese speakers tend to learn English /ɹ/ better than /l/ due to the fact that /ɹ/ is not as close as /l/ to their corresponding Japanese phonemes⁴⁵. For speakers of a tonal language (e.g., Mandarin), acquisition of novel L2 tones may not be easy either. Studies have found that there does not necessarily exist a tone-language advantage in learning foreign tones^46,47,48. For example, speakers of Hmong (a tonal language) were worse than English speakers in Mandarin tone identification⁴⁸ (Wang, 2006). It was also found that Cantonese speakers, whose native language has six tones, still encountered difficulty in differentiating tonal contrasts not found in Cantonese⁴⁹. These findings suggest that the native tonal system could exert a negative or even interfering influence on acquisition of non-native tones⁴⁸.

The present study: investigation of prediction vs. goals

In sum, the above review on motor control in speech learning suggests that a considerable amount of scholarly effort has been spent on the relations between feedback vs. goals and feedback vs. prediction. In terms of feedback vs. goals, the review in section 1.1 suggests that sensory goals are used to compare with auditory feedback in establishing both sensory-to-motor and motor-to-sensory transformations. The evidence can be found in empirical studies and computational modelling of children’s acquisition of native speech sounds, where comparison between auditory feedback and sensory goals serves to update motor commands for the achievement of speech targets^2,5. With regard to feedback vs. prediction, studies on self-speech induced suppression (SIS)^18,50,51,52 and feedback perturbation^26,27 suggest that sensory feedback should match motor prediction in speech production and perception, otherwise an enhanced auditory response (e.g., as indexed by M100 which is an auditory response around 100 ms measured in magnetoencephalography) could be triggered to monitor and correct the deviation between the feedback and prediction.

What remains less investigated is prediction vs. goals (i.e., the relation between prediction and goals), especially in the case of adults’ acquisition of non-native speech sounds with different degrees of similarity to their native speech. The present study therefore attempts to address the following question: can prediction be compared directly with goals in speech acquisition? We investigate this question in the context of adults’ acquisition of L2 because, as reviewed in section 1.2, motor-based prediction is available via the established motor-to-sensory transformation in adults as a result of long-term practice of their native language. Thus, there could be a competing relation between L1 and L2 in terms of motor demand on speech production. Moreover, studies on adult L2 perception suggest that the extent of similarity between L1 and L2 sound targets correlates with the degree of difficulty for adults to accurately establish representations of L2 targets (goals), as reviewed in section 1.2. Thus, we further investigate how similarity between prediction established in L1 (via motor-sensory transformation) and goals in L2 (speech targets) affects adults’ speech learning.

To investigate the above research questions, in this study online auditory feedback was manipulated in two ways: feedback masked with noise (the masked condition) and without noise (the unmasked condition) when participants learned to produce two different vowels: /ø/ (e.g., in German) and /ɵ/ (e.g., in Dutch). The first vowel is less similar than the latter vowel to the closest existing pronunciation of Mandarin Chinese vowels (more details in the following sections). In the masked condition, participants can only rely on prediction and goals for speech learning. Given that the participants have not established a precise mapping between neuromuscular command for the articulatory trajectory (i.e., motor prediction) and desired sensory output for the new L2 sounds (i.e., sensory goals), the lack of auditory feedback would render the formation of such mapping more difficult, which could consequently limit their ability to directly compare between prediction and goals to calculate speech errors for update of motor commands.

Therefore, we hypothesize that prediction and goals cannot be directly compared for adult L2 speech learning, regardless of the type of L2 sounds. Thus, we predict that participants can learn neither of the two vowels when feedback is unavailable, i.e., in the masked condition. In the unmasked condition, by contrast, auditory feedback is available for learning, and we predict that the extent of similarity between prediction in L1 (via motor-sensory transformation) and speech targets in L2 (goals) will affect learning performance. More specifically, based on the above review, learning /ø/ should be better than learning /ɵ/ in the unmasked condition. This is because, as shown in previous literature, a high degree of similarity between L1 and L2 speech sounds makes adult L2 acquisition more difficult (as reviewed in section 1.2). From a sensory-motor point of view, this could suggest that the greater the similarity between motor prediction for L1 and sensory goals in L2, the more interference there could be between them in adult L2 learning. The interference would render goals less accurate during the course of learning. Since /ɵ/ is more similar than /ø/ to existing pronunciations in Mandarin Chinese, there would be more interference from participants’ native language (i.e., motor prediction established through L1) in maintaining the goal for /ɵ/ than for /ø/, which as a result would render auditory feedback less effective in improving the learning performance of /ɵ/.

Two experiments were conducted to test the above predictions. Experiment 1 used a within-subject design while Experiment 2 used a between-subject design with two additional listening control tasks. A between-subject design was needed to test whether the results of the within-subject design could be replicated. Moreover, a between-subject design could filter out the practice effect which often results from a within-subject repeated measures design. Two additional listening tasks were needed to control for perception and memory accuracy, which could be confounds. Specifically, they were used to rule out the possibility that inaccurate perception and memory of the target speech sounds could lead to the participants’ poor speech production performance.

Experiment 1

Methods

The study was approved by NYU Shanghai Research Ethical Committee. All experiments were performed according to relevant guidelines and regulations. Informed consent was obtained from all participants for the experiments. Data sets are accessible from https://osf.io/cnvh7/.

Stimuli

Two vowels were selected as stimuli: /ø/ and /ɵ/. They were chosen because they represent different degrees of similarity to Mandarin vowels, which is a key factor in this study as discussed in the Introduction section. Specifically, /ɵ/ sounds perceptually similar to the existing Mandarin vowels /ɤ/ and /ɚ/ (Fig. 1), while /ø/ is harder to relate perceptually to an existing Mandarin vowel.

The stimuli were recorded by a male and a female native speaker of the respective language (German for /ø/ and Dutch for /ɵ/) in a sound-attenuated booth. They were asked to produce the stimuli five times in their native language. They then picked the one token of their recording that they were most satisfied with, which was used as the stimulus for this study. Each stimulus was repeated 10 times (i.e., 10 tokens of the same stimulus) in this study. The average duration of the two vowels was 500 ms.

To justify our selection of the two vowels as stimuli, a production and a perception validation test were conducted. We invited twelve Mandarin speakers (age M = 28, SD = 3.8, six females) to produce the seven Mandarin vowels (/a/, /o/, /ɤ/, /i/, /u/, /y/, /ɚ/) prompted on a computer screen, with each vowel produced three times. The participants were then asked to reproduce the two target vowels /ø/ and /ɵ/ delivered auditorily, three times for each vowel. Then they were asked to perceptually compare the two target vowels with their own production of the Mandarin vowels by rating the perceptual similarity between the target vowels and each one of the Mandarin vowels on a scale of 0 to 5, in which 0 means ‘not similar at all’ and 5 means ‘very similar’, with the remaining numbers falling in between the two extremes.

The results were consistent with our expectation. In terms of production, the mean Euclidean distance between /ø/ and existing Mandarin vowels produced by the participants was 2.82 (SEM = 0.08), which was larger than the Euclidean distance between /ɵ/ and existing Mandarin vowels (M = 2.58, SEM = 0.05), and the difference was significant (F (1, 11) = 16.72, p < 0.001, η²_p = 0.6). A linear mixed model was then fitted with gender as a fixed factor and individual speaker as a random factor, with the Euclidean distance between the Mandarin vowels and the two target sounds /ø/ and /ɵ/ calculated from raw F1 and F2 values in Hertz. The results showed that gender significantly affected the results in the /ø/ condition (F (1, 10) = 35.7, p < 0.001) and the /ɵ/ condition (F (1, 10) = 13, p = 0.005). Specifically, females showed larger Euclidean distances than males in both conditions (for /ø/, females: M = 597.11 Hz, SEM = 18.55 Hz; males: M = 432.75 Hz, SEM = 20.3 Hz; for /ɵ/, females: M = 521.13 Hz, SEM = 17.22 Hz; males: M = 435.74 Hz, SEM = 14.62 Hz).
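As a rough illustration of the production measure, the sketch below averages the Euclidean distance in F1/F2 space between one target vowel and each vowel of a speaker's inventory. The function name is hypothetical, and the formant values are invented placeholders rather than measurements from this study.

```python
import math
from statistics import mean


def mean_distance_to_inventory(target, inventory):
    """Mean Euclidean distance between a target (F1, F2) point and every
    (F1, F2) point in a vowel inventory given as {label: (F1, F2)}."""
    return mean(math.dist(target, formants) for formants in inventory.values())


# Placeholder formant values in Hz, for illustration only (not study data).
mandarin_vowels = {
    "a": (850, 1300), "o": (550, 900), "ɤ": (500, 1200), "i": (300, 2300),
    "u": (350, 750), "y": (300, 2000), "ɚ": (550, 1500),
}
hypothetical_target = (400, 1600)  # placeholder (F1, F2) for a non-native vowel

distance_hz = mean_distance_to_inventory(hypothetical_target, mandarin_vowels)
```

A larger value for one target than for another would indicate that the first target lies, on average, further from the speaker's native vowel space.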

With regard to the perception test, the results showed that on average, the similarity rating score was significantly higher for /ɵ/ (M = 1.74, SEM = 0.1) than for /ø/ (M = 0.93, SEM = 0.08) [F (1, 11) = 54.34, p < 0.001, η²_p = 0.83], suggesting that participants found /ɵ/ to be more similar to existing Mandarin vowels than /ø/. A closer look at the data revealed that participants found /ɵ/ to be particularly similar to the Mandarin vowels /ɤ/ (average rating score: 4.56) and /ɚ/ (average rating score: 4.25). Therefore, the production and perception tests support our selection of /ø/ and /ɵ/ as representing different degrees of similarity to existing Mandarin vowels.

Participants

Seventeen adult native speakers of Mandarin Chinese (nine females and eight males, age M = 26, SD = 2.1) participated in Experiment 1. All of them were without prior knowledge of the speech stimuli (detailed in the following section). They reported no hearing or speech impairments.

Procedure

Participants were tested individually in a sound-attenuated booth. The experiment was presented and controlled using Python 3.5 on a Lenovo computer running Windows 10. The stimuli were presented in two blocks, 10 tokens of each vowel per block. The female version of the speech targets was presented to female participants while the male version of the speech targets was presented to male participants. Participants were asked to reproduce the sound after hearing each auditory token of the speech targets (10 tokens per speech target, via Siemens HD280 headphones). They had 4000 ms to complete the production and recording of each sound token. The auditory feedback of participants’ own voice was manipulated in two conditions: masked with noise (the masked condition) and without noise (the unmasked condition). They completed the masked condition before the unmasked condition. Pink noise generated in MATLAB (2014a) was used as the masking noise due to its effectiveness for masking speech⁵³. The noise lasted 4000 ms with a constant sound pressure level throughout. The intensity of the noise was adjusted according to each participant’s response so that it was presented at a relatively comfortable level that was still loud enough to mask their own voice feedback.
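The masking noise in the study was generated in MATLAB; the exact routine is not described. As a hedged sketch of how a 4000 ms pink-noise masker with a constant level could be produced, the following uses the Voss-McCartney approximation in plain Python. The sample rate, number of rows, target RMS and function names are all assumptions made for illustration.

```python
import math
import random

SAMPLE_RATE = 44100   # Hz; assumed, not stated in the paper
DURATION_S = 4.0      # the 4000 ms masker duration reported above


def pink_noise(n_samples, n_rows=16, seed=0):
    """Approximate pink (1/f) noise with the Voss-McCartney scheme:
    several white-noise rows are summed, and row k is refreshed only
    every 2**(k + 1) samples, so slower rows add low-frequency energy."""
    rng = random.Random(seed)
    rows = [rng.uniform(-1.0, 1.0) for _ in range(n_rows)]
    total = sum(rows)
    samples = []
    for i in range(1, n_samples + 1):
        k = (i & -i).bit_length() - 1   # number of trailing zero bits of i
        if k < n_rows:
            total -= rows[k]
            rows[k] = rng.uniform(-1.0, 1.0)
            total += rows[k]
        samples.append(total / n_rows)
    return samples


def scale_to_rms(samples, target_rms):
    """Rescale samples so their root-mean-square level equals target_rms."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return [x * target_rms / rms for x in samples]


# target_rms=0.1 is an arbitrary placeholder level.
masker = scale_to_rms(pink_noise(int(SAMPLE_RATE * DURATION_S)), target_rms=0.1)
```

In practice the playback level would still need to be calibrated per participant, as described above; the RMS scaling here only fixes the digital level of the signal.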

Data analyses

The following acoustic parameters were extracted using Praat⁵⁴: F1 (the first formant of the vowel) and F2 (the second formant of the vowel). Formant trajectory tracking in Praat was manually corrected where necessary. Following Flege et al.⁵⁵, Hertz values were converted to Barks (B1 and B2 respectively) to minimize male vs. female differences. The extent to which non-native participants’ production of /ø/ and /ɵ/ differed from native speakers’ production of the vowels was measured by the Euclidean distance of each participant’s B1 and B2 values from the native speakers’ B1 and B2 values. Therefore, the dependent variable of the present study was the Euclidean distance between the learned and target vowels.
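The dependent variable can be sketched as follows. The paper does not state which Hertz-to-Bark formula was applied, so the Traunmüller (1990) approximation used here is an assumption, as are the function names and the example formant values.

```python
import math


def hz_to_bark(frequency_hz):
    """Hertz-to-Bark conversion using the Traunmüller (1990) approximation.
    Assumption: the study's actual conversion formula is not specified."""
    return 26.81 * frequency_hz / (1960.0 + frequency_hz) - 0.53


def bark_distance(produced_hz, target_hz):
    """Euclidean distance in (B1, B2) space between a learner's vowel and
    the native-speaker target, each given as an (F1, F2) pair in Hz."""
    b1_produced, b2_produced = (hz_to_bark(f) for f in produced_hz)
    b1_target, b2_target = (hz_to_bark(f) for f in target_hz)
    return math.hypot(b1_produced - b1_target, b2_produced - b2_target)


# Hypothetical (F1, F2) values in Hz, for illustration only.
example_distance = bark_distance(produced_hz=(450, 1400), target_hz=(400, 1600))
```

A distance of zero would mean the learner's vowel coincides with the target in B1/B2 space; smaller values therefore indicate better reproduction of the target vowel.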

Results

Figure 2(a,b) shows the Euclidean distance between the learned and target vowels (/ø/ and /ɵ/) in the two auditory feedback conditions. A two-way (vowels and auditory feedback conditions) repeated measures ANOVA showed a significant main effect of auditory feedback conditions [F (1, 16) = 9.35, p = 0.008, η²_p = 0.37] and a significant interaction between the two factors [F (1, 16) = 4.58, p = 0.048, η²_p = 0.22]. For /ø/, a follow-up one-way (masked vs. unmasked) repeated measures ANOVA showed a significant main effect of availability of auditory feedback [F (1, 16) = 8.53, p = 0.01, η²_p = 0.35]. Specifically, the Euclidean distance in the unmasked condition was smaller (M = 1.00, SEM = 0.16) than that in the masked condition (M = 1.42, SEM = 0.18). Furthermore, follow-up trial-by-trial analyses showed that for the feedback masked condition (Fig. 3a), there was no significant difference between the highest-value trial and the lowest-value trial [F (1, 16) = 1.23, p = 0.28]. For the feedback unmasked condition (Fig. 3b), however, the difference between the highest-value trial and the lowest-value trial was significant [F (1, 16) = 8.05, p = 0.01, η²_p = 0.34]. These results suggest that vocal learning performance on /ø/ was better in the unmasked condition than in the masked condition.
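For a within-subject factor with only two levels (such as masked vs. unmasked), a one-way repeated measures ANOVA is equivalent to a paired t-test, with F(1, n − 1) = t², and partial eta squared can be recovered from the F value and its degrees of freedom. The sketch below shows both relations; the function names and the four-participant data are hypothetical and are not taken from the study's data set.

```python
import math
from statistics import mean, stdev


def paired_f(condition_a, condition_b):
    """F statistic for a two-level within-subject factor, computed as the
    squared paired t statistic. Returns (F, (df_effect, df_error))."""
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t * t, (1, n - 1)


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F value and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# Hypothetical Euclidean distances for four participants (illustration only).
masked = [1.4, 1.6, 1.2, 1.5]
unmasked = [1.0, 1.1, 1.0, 0.9]
f_value, (df_effect, df_error) = paired_f(masked, unmasked)
effect_size = partial_eta_squared(f_value, df_effect, df_error)

# Applied to the main effect reported above, F (1, 16) = 9.35:
reported_effect = partial_eta_squared(9.35, 1, 16)
```

Rounded to two decimals, `reported_effect` reproduces the η²_p of 0.37 given for the main effect of auditory feedback.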

For /ɵ/, the masked condition showed a numerically greater distance from the standard speech target (M = 1.36, SEM = 0.11) than the unmasked condition (M = 1.35, SEM = 0.09). However, a one-way repeated measures ANOVA showed no significant main effect of auditory feedback on the Euclidean distance [F (1, 16) = 0.007, p = 0.94]. Furthermore, a one-way repeated measures ANOVA showed that for both the feedback masked and unmasked conditions (Fig. 3c,d), there was no significant difference between the highest-value trial and the lowest-value trial [F (1, 16) = 2, p = 0.18 for the masked condition; F (1, 16) = 2.21, p = 0.16 for the unmasked condition]. These findings suggest that vocal learning performance on /ɵ/ was equally poor in the masked and unmasked conditions.