Introduction

Language change is observable not only over a large time scale, but also on a synchronic level, where language systems may be undergoing changes imperceptible to the general population. Importantly, both internal and external factors can contribute to language change (Labov, 1994; 2001). Previous research on language variation and change has typically focused on the dissemination of variation within communities, with a particular emphasis on the relationship between language variations and social factors such as gender, age, social class, and nation (Calder and King, 2022; Zhang, 2019; Al Shboul, 2021; Boberg, 2020). Conversely, language is not in a constant state of flux; it also exhibits relative stability. This relative stability can be examined to ascertain the underlying reasons for why and among whom language variations could occur. To address the interplay between variations in human cognition and language change, this study investigates the relationship between the merging of certain tone pairs in Macau Cantonese and cognitive functions among native speakers.

Cantonese is widely spoken in the Pearl River Delta region in southern China, including Hong Kong, Macau, and Guangdong province. Compared to Mandarin Chinese, modern Cantonese has a more complex tonal system (cf. Figure 1.1 of Ou, 2012), featuring six lexical tones (T1 to T6) and three allotones (with similar pitch values to T1, T3, and T6 but differing in duration, namely T7, T8, and T9) (Mok et al., 2013; Xie et al., 2017; Zhang et al., 2019; see Footnote 1). Several similar pairs of Cantonese tones, such as T2–T5 (two rising tones), T3–T6 (two level tones), and T4–T6 (a low-falling tone and a low-level tone), share the same or similar tonal contour, differing only in subtle pitch distinctions. This can lead to confusion in speech perception, causing difficulties in speech categorization (Holt and Lotto, 2010). Previous research has shown that these pairs are prone to confusion in both speech perception and production (Bauer et al., 2003; Mok et al., 2013). However, research on Cantonese tonal processing has traditionally focused on Guangzhou and Hong Kong (e.g., Jia et al., 2015; Xie et al., 2017), with very limited research on Macau Cantonese. A notable exception is Zhang (2019), who investigated the tone mergers in Cantonese across Hong Kong, Macau, and Zhuhai (a city in mainland China adjacent to Macau) and found that the three pairs of tones in Macau Cantonese are also undergoing merging. Specifically, T2–T5 can be considered as fully merged with the fastest merging speed among the younger population, followed by T3–T6 with widespread merging in the Macau community. In contrast, T4–T6 is still at the initial stage of the merging process, as evidenced by its slowest pace. Interestingly, while almost all tone pairs have merged in Zhuhai Cantonese and there is a deliberate effort to slow down tone merging in Hong Kong Cantonese (see Footnote 2), the speed of tone merging in Macau Cantonese is moderate. Thus, Macau constitutes an effective window for observing the dynamics of tone merging in Cantonese and understanding how various factors might modulate this process.

While the impact of cognitive factors (e.g., cognitive functions) on Cantonese tone processing has been observed recently, most studies have focused on Hong Kong Cantonese, yielding mixed results. Cognitive functions, typically referred to as a set of cognitive processes (McAfoose and Baune, 2009), encompass attention, memory, executive function, thinking, and language abilities. Previous research has concentrated on attention, memory, and executive function in relation to Cantonese tone processing. Law et al. (2013), using an oddball experiment with tone pairs T1–T6 and T4–T6, found that the group able to distinguish the tones not only generated an MMN component but also a P3a, reflecting higher-level mental activities such as attentional switching or working memory. In a further study, Ou et al. (2015) recruited three groups of participants representing different merging patterns for tone perception tasks, production tasks, and cognitive function tests, respectively. The results indicated that reaction time and accuracy in speech perception and production were modulated by attentional switching and working memory capacity.

Why could cognitive functions affect tone processing? Some scholars have begun to explore the underlying neural mechanisms. Ou and Law (2017), through electrophysiological experiments, pointed out that the quality of acoustic cue representation is influenced by attentional switching, thereby modulating the quality of speech perception and production. However, the three groups of participants recruited by Ou et al. (2015) were not assigned using the same criterion: the selection of “good perception but poor production” speakers was based on the T2–T5 tone pair, while the selection of “poor perception but good production” speakers was based on the T4–T6 tone pair. Differences between tone pairs may have confounded the results. Ou and Law (2017) addressed this limitation but focused only on the T2–T5 tone pair, so its results may not fully reflect the role of cognitive functions in tone merging. In addition, previous studies did not rigorously distinguish between auditory and visual modalities, even though visual information has been shown to influence speech comprehension (McGurk effect, McGurk and MacDonald, 1976). More importantly, a recent study (Li and Deng, 2023) indicated differences in attentional focusing and filtering between visual and auditory modalities: visual information can be more effectively focused on or filtered out, whereas auditory information is more challenging to filter out, which may disrupt task performance. To address these issues, the current study conducted tasks exclusively in the auditory modality and considered all three pairs of tones undergoing merging in Macau Cantonese. This provides a new approach for examining the origins of sound change from a holistic perspective.

Another related question is the relationship between speech perception and production during tone processing. Although there is no consensus yet on the interactive mechanisms between speech perception and production, it is generally acknowledged that they interact in specific ways. The Motor Theory (Liberman and Mattingly, 1985) posited a strong link between perception and production, suggesting a specialized speech module in the brain for controlling both. Articulatory movements exert a mandatory influence on speech perception, implying that changes in production can affect perception. Conversely, the Directions into Velocities of Articulators model (DIVA, Guenther, 1995) contended that auditory perceptual representations in the brain serve as acoustic templates to calibrate the articulatory behavior of speech organs, indicating that speech production relies on perception. However, in the case of Hong Kong Cantonese, the merging of tone perception and production is not synchronous. In some tone pairs, people are more likely to exhibit good perception yet poor production, such as T2–T5, while in others, the opposite is observed, as in T4–T6 (Law et al., 2013; Ou et al., 2015). Nevertheless, this does not imply that the T4–T6 tones will always exhibit a “poor perception and good production” pattern. As an ongoing language change, all merging patterns coexist at a synchronic level (Fung and Wong, 2010a, b). This raises a new question: why is the relationship between perception and production inconsistent in the Cantonese tone system? This suggests the importance of scrutinizing the reasons why perception and production patterns are different among various tone pairs. Hence, this study also aims to investigate the role of cognitive functions in speech perception and production and whether they differ in Macau Cantonese.

In the current study, we reselected tasks to assess cognitive functions, enabling the evaluation of participants’ attention, working memory, and executive function solely in the auditory modality. We also analyzed whether these cognitive functions correlated with their performance in the perception and production tasks of the three tone pairs. Specifically, we used the Test of Everyday Attention (TEA, Robertson et al., 1994) to assess participants’ attention abilities, selecting only subtests related to auditory modality. For working memory, we used the Digit Span Test, including Digit Span Backward, Digit Span Sequence, and Digit Span Forward (Wechsler, 2010). Regarding executive function, we employed an auditory-based Stroop test, which has demonstrated good validity in healthy populations (Kestens et al., 2021). By comparing participants’ performance in these cognitive functions with the discrimination rate and reaction time of perception tasks and acoustic distances in production, this study addressed the following questions: (1) What is the current merging status of the Cantonese tone system in Macau? (2) Does the influence of cognitive functions on tone merging remain consistent in tone pairs at different merging stages? (3) Does the impact of cognitive functions on tone merging stay consistent in tone perception and production? The findings of this study can broaden our understanding of how cognitive functions affect speech processing and provide new insights into the origins of sound changes. The conclusions will offer a fresh perspective on the relationship between speech perception and production.

Methods

Participants

A total of 44 participants (24 females) born and raised in Macau were recruited for this study. The participants’ ages ranged from 17 to 28 years (M = 20.09, SD = 2.38), capturing the tone characteristics of the younger generation of Cantonese speakers in Macau. Cantonese is the exclusive family language for all participants, with Macau Cantonese being strongly represented. At least one of their parents was born in Macau. None of the participants had been absent from Macau for an extended period in the three years leading up to the study. All participants were right-handed, and reported no hearing or vocal disorders. Additionally, they had no history of mental or neurodegenerative illnesses. This study received prior approval from the Research Ethics Committee at the local institution. Informed consent was obtained from all participants, who received monetary compensation for their participation.

Materials

Materials for the production task

Four CV roots (i.e., /si/, /ji/, /fu/, and /se/) were selected to derive the six lexical tones because of their structural simplicity in Cantonese and their compatibility with all six tones. Based on the criteria involving single-use, common-use, and non-homographic words, and considering previous studies (Ou et al., 2015; Zhang and Peng, 2017), we identified 24 words (4 syllables × 6 tones) for the test stimuli. Three native speakers who did not participate in the experiment confirmed that all test words met these criteria. Additionally, to prevent participants from discerning the study’s purpose, we included another set of 24 words as fillers, randomly interspersed within the word list. A total of 48 words are presented in Table 1.

Table 1 The test stimuli and fillers of production task.

Materials for the cognitive function tasks

Following previous studies (Ou et al., 2015; Ou and Law, 2017), a series of published tests (i.e., the TEA, Digit Span Test and auditory Stroop test) were selected to assess participants’ cognitive functions, with greater emphasis on the auditory modality. We assessed participants’ attention, working memory, and executive function; however, only subtests relevant to auditory processing were included in this study.

To assess attention, we utilized the Test of Everyday Attention (Robertson et al., 1994). This test comprises nine subtests that address four fundamental components of attention: selective attention, divided attention, sustained attention, and attentional switching (Ou and Law, 2017; Cohen and Burke, 1993). We selected six subtests, namely Elevator Counting, Elevator Counting with Distraction, Elevator Counting with Reversal, Telephone Searching, Telephone Searching while Counting, and Lottery, to comprehensively assess these four cognitive components. Although Telephone Searching is not directly related to auditory processing, it is necessary for computing the Telephone Searching while Counting score, reflecting the ability in dual-task processing. However, it was not factored into the final analysis due to its irrelevance to auditory processing.

Participants’ auditory working memory capacity was assessed using the digit span task from the Wechsler Adult Intelligence Scale-IV (WAIS-IV, Wechsler, 2010), which includes three subtests. Each subtest consists of 16 strings of digits, with the number of digits ranging from 2 to 9. The three subtests differ in their instructions, requiring participants to repeat the strings in forward, backward, or sequence order, respectively. Previous research (Ou et al., 2015; Ou and Law, 2017) adopted only the digit span backward subtest. We included the simpler digit span forward and the more complex digit span sequence to provide additional insights into our research questions.

Given that previous studies may not effectively measure executive function (see Footnote 3), the current study employed an auditory-specific Stroop test to assess participants’ executive function. The Stroop test, first proposed by American psychologist Stroop, is used to assess the ability to inhibit cognitive interference, which occurs when the processing of one stimulus feature affects the simultaneous processing of another attribute of the same stimulus (Stroop, 1935; Scarpina and Tagini, 2017). Specifically, the Auditory Stroop test has been consistently validated to measure cognitive inhibition and flexibility using auditory stimuli (Kestens et al., 2021). We adapted this test into a Cantonese version, consisting of Stroop-tones and Stroop-words. In the Stroop-words test, two Cantonese spoken words (i.e., 高/gou1/, meaning “high”; and 低/dai1/, meaning “low”) were each produced at two frequencies, 250 Hz and 220 Hz. Thus, there were two versions of “高” (250 Hz and 220 Hz) and two versions of “低” (250 Hz and 220 Hz). In the Stroop-tones test, we generated two pure tones with frequencies of 250 Hz and 220 Hz using Praat. Participants’ executive function scores were calculated based on the differences between the Stroop-tones and Stroop-words tests (see details in Kestens et al., 2021). The two spoken words, obtained from an adult female speaker in Macau, were modified using Praat to adjust the pitch height. All six stimuli were set to a duration of 500 ms, with an intensity of 70 dB.
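The pure tones in the study were created in Praat. Purely to make the stimulus parameters concrete, the following minimal sketch generates 250 Hz and 220 Hz sine tones of 500 ms in Python. The sampling rate, the peak amplitude, and the helper name `pure_tone` are assumptions made for this illustration and are not taken from the study; scaling to an intensity of 70 dB is not reproduced here.

```python
import math

SAMPLE_RATE = 44100  # Hz; assumed for illustration, not reported for the stimuli


def pure_tone(freq_hz, duration_s=0.5, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return a list of samples for a sine tone (hypothetical helper)."""
    n_samples = int(round(duration_s * sample_rate))
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


high_tone = pure_tone(250.0)  # higher pure tone of the Stroop-tones test
low_tone = pure_tone(220.0)   # lower pure tone of the Stroop-tones test
```

Both lists have the same length because duration and sampling rate are shared; only the frequency differs.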

Materials for the perception task

The stimuli of the perception task were the same as the test stimuli in the production task. A total of 24 words were paired in two sets, AA and AB (e.g., 夫/夫 and 夫/富), based on their respective syllables and different tones, thus forming an AX discrimination paradigm. Thirty-two tone pairs (20 AA pairs and 12 AB pairs) that implicated tone merging were set as test pairs while sixteen tone pairs (4 AA pairs and 12 AB pairs) were set as fillers because no merging would occur among them (see Table 2), preventing any strategic responses by the participants. These stimuli were also recorded from the speaker used for the executive function task. Additionally, some studies (e.g., Bei and Xiang, 2016) have shown that the duration of different lexical tones in Cantonese varies. To eliminate the possible influence of duration on discrimination, all stimuli were modified to a duration of 500 ms and an intensity of 70 dB using Praat.

Table 2 Tone pairs in the perception task.

Procedures

All the tasks took place in the soundproof behavioral laboratory at the University of Macau. In the production task, 24 test words and 24 fillers were randomly presented on slides in black Kai font against a pale-yellow background. Only one word was displayed on each slide, and participants were instructed to read aloud each word at an appropriate pace three times. All slides were played twice. Word production was recorded using the OLYMPUS LS-100 portable recorder and AKG C-420 microphone, with a sampling rate of 44,100 Hz and a bit depth of 24 bits.

The cognitive function tasks (i.e., TEA, Digit Span Test, and auditory Stroop test) immediately followed the production task. To mitigate potential priming effects, participants returned to the same laboratory two weeks later for the perception task. The perception task was developed using E-Prime 3.0 software (E-Prime Psychology Software Tools Inc., Pittsburgh, USA). Each trial started with a fixation point lasting for 300 ms, followed by a 300 ms blank screen. Subsequently, the first tone of the AX pair was played, and the second was played after an interval of 500 ms. The program automatically recorded key presses and measured reaction times from the onset of the second tone. Participants were required to judge whether the two sounds they heard were the same or different. If the two sounds in a tone pair were the same, they pressed the “F” key; if different, they pressed the “J” key (F/J buttons were counterbalanced among all participants). Participants had a maximum of 3 seconds to make their judgment. If there was no response within 3 seconds, the program proceeded to the next trial, and that trial was marked as incorrect. Each trial ended with a randomly varying inter-trial interval (800–1000 ms) before the next trial began. The schematic procedure of each trial is shown in Fig. 1. All pairs were presented in a random order and repeated 10 times, resulting in a total of 480 trials (4 syllables × 12 pairs × 10 repetitions) for each participant.
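The trial structure described above can be sketched as a simple trial list. This is only an illustration in Python, not the E-Prime implementation: the pair labels are placeholders (the actual pairs are listed in Table 2), the function name `build_trials` is hypothetical, and shuffling across all 480 trials is an assumption, since the text does not state whether randomization was applied within each repetition or across the whole session.

```python
import random

SYLLABLES = ["si", "ji", "fu", "se"]
# Placeholder labels; the actual 12 pairs per syllable are given in Table 2.
PAIR_LABELS = [f"pair{k:02d}" for k in range(1, 13)]
REPETITIONS = 10


def build_trials(seed=None):
    """Return a shuffled list of trial dicts with fixed timings and a jittered ITI."""
    rng = random.Random(seed)
    trials = [
        {"syllable": syl, "pair": pair, "repetition": rep}
        for syl in SYLLABLES
        for pair in PAIR_LABELS
        for rep in range(1, REPETITIONS + 1)
    ]
    rng.shuffle(trials)  # assumption: one shuffle across the whole session
    for trial in trials:
        trial["fixation_ms"] = 300
        trial["blank_ms"] = 300
        trial["isi_ms"] = 500               # gap between first and second tone
        trial["response_window_ms"] = 3000  # no response counts as incorrect
        trial["iti_ms"] = rng.randint(800, 1000)
    return trials


trials = build_trials(seed=1)
```

The list length reproduces the 4 × 12 × 10 = 480 trials per participant reported in the text.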

Fig. 1: The procedure of the perception task.

Data analysis

Praat was used to extract F0 from the recorded audio files, applying the same standardized procedure to each participant’s recordings. Initially, fillers from the word list were removed, and only the first of the three repetitions of each word was retained (Liang, 2017; see Footnote 4). Thus, each participant retained a total of 48 tokens (6 tones × 4 words × 2 repetitions). Using self-developed Praat scripts, in conjunction with manual verification, we identified the pitch-carrying segments of each token for every participant. These segments were then divided into ten equal parts, and ten F0 points were extracted. According to Zhang (2018), applying a semitone transformation relative to each speaker’s average pitch in hertz is the optimal method for tone variation studies (see Footnote 5). Therefore, the current study employed this method to normalize the F0 data for each participant (hereafter referred to as ST).
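The normalization step can be illustrated with a short sketch. It assumes the standard semitone formula, ST = 12 × log2(F0 / reference), with the reference taken as the speaker’s mean F0 in hertz; the exact variant recommended by Zhang (2018) may differ in detail. The function names and the F0 values below are invented for illustration.

```python
import math


def to_semitones(f0_hz, reference_hz):
    """Convert an F0 value in Hz to semitones relative to a reference in Hz."""
    return 12.0 * math.log2(f0_hz / reference_hz)


def normalize_speaker(contours_hz):
    """Normalize every F0 contour of one speaker against that speaker's mean F0.

    contours_hz: list of contours, each a list of F0 values in Hz
    (ten equidistant points per token in the study).
    """
    all_points = [f0 for contour in contours_hz for f0 in contour]
    speaker_mean_hz = sum(all_points) / len(all_points)
    return [
        [to_semitones(f0, speaker_mean_hz) for f0 in contour]
        for contour in contours_hz
    ]


# Made-up two-token example; the speaker mean here is 150 Hz.
example = normalize_speaker([[200.0, 200.0], [100.0, 100.0]])
```

Because the reference is speaker-specific, values above the speaker’s mean come out positive and values below it negative, which makes contours comparable across speakers with different pitch ranges.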

Participants’ performance in attention and working memory tasks was scored according to the guidelines provided in the manual. For the executive function task, reaction times (RT) falling outside three standard deviations (SD) from the mean were initially excluded. Each participant’s executive function score was calculated.

For the perception task, the exclusion of fillers, AA pairs, and data with RT falling outside 3 SD was conducted first. As the perception task assesses the participants’ ability to perceive similar tones, discrimination rate reflects only the strength of their perception ability. A low discrimination rate does not prove that they did not participate seriously. Therefore, we did not eliminate trials with incorrect judgments. Mean reaction time and discrimination rate for all remaining trials were calculated.
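The trimming and scoring steps for the perception data might look as follows. This is a minimal sketch under stated assumptions: the trial records are dictionaries with hypothetical keys `rt_ms` and `response`, only AB test trials are passed in, and the mean and SD are computed over the trials supplied (the text does not specify whether trimming was done per participant or over the pooled data).

```python
from statistics import mean, stdev


def trim_by_sd(trials, n_sd=3.0):
    """Keep trials whose RT lies within n_sd standard deviations of the mean RT."""
    rts = [t["rt_ms"] for t in trials]
    m, sd = mean(rts), stdev(rts)
    return [t for t in trials if abs(t["rt_ms"] - m) <= n_sd * sd]


def summarize(trials):
    """Return (discrimination rate, mean RT) over AB trials, errors included."""
    kept = trim_by_sd(trials)
    rate = sum(1 for t in kept if t["response"] == "different") / len(kept)
    return rate, mean(t["rt_ms"] for t in kept)


# Invented data: 20 ordinary trials (18 correct) plus one very slow response.
ab_trials = (
    [{"rt_ms": 900, "response": "different"} for _ in range(18)]
    + [{"rt_ms": 900, "response": "same"} for _ in range(2)]
    + [{"rt_ms": 3000, "response": "different"}]
)
rate, mean_rt = summarize(ab_trials)
```

Incorrect "same" responses stay in the data, in line with the decision not to eliminate trials with incorrect judgments; only the RT outlier is dropped.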

Results

Group- and individual-level results

The current study applied growth curve analysis (GCA; Mirman, 2014) to the production data, as this method takes the overall pitch contour of each tone into account. By fitting the contours with orthogonal polynomials, GCA decomposes each curve into characteristic coefficients such as mean, slope, and curvature (Li et al., 2020). These three coefficients were interpreted in light of the features of Cantonese tones. Specifically, for T2–T5, both the mean and slope need to be significantly different to determine that the two tones are not merged, while for T3–T6 and T4–T6, only one of them needs to be significantly different to determine that the two tones are not merged (cf. Si, 2022 for details). The results shown in Fig. 2 indicated that, at the group level, the younger generation in Macau has completed the merging of the T2–T5 tones, with no significant differences in mean, slope, or curvature between the two tones (ps > 0.05). The T3–T6 tones have not yet completed the merging, showing a significant difference in mean (p = 0.01). Additionally, the T4–T6 tones have not completed the merging either, with significant differences in all three features (ps < 0.05).
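To make the decomposition into mean, slope, and curvature more tangible, the sketch below projects a ten-point contour onto orthonormal constant, linear, and quadratic polynomial terms. It illustrates only the orthogonal-polynomial step; GCA as described by Mirman (2014) estimates such terms within a multilevel regression and tests them statistically, which is not reproduced here. The function names and the example contours are assumptions for illustration.

```python
def orthogonal_basis(n_points, degree):
    """Build orthonormal polynomial vectors over equally spaced time points
    by applying Gram-Schmidt to 1, t, t^2, ..."""
    times = [float(i) for i in range(n_points)]
    basis = []
    for d in range(degree + 1):
        vec = [t ** d for t in times]
        for b in basis:
            # Remove the component along each lower-order term.
            proj = sum(v * w for v, w in zip(vec, b))
            vec = [v - proj * w for v, w in zip(vec, b)]
        norm = sum(v * v for v in vec) ** 0.5
        basis.append([v / norm for v in vec])
    return basis


def contour_coefficients(contour_st):
    """Return [constant, linear, quadratic] coefficients of one contour."""
    basis = orthogonal_basis(len(contour_st), degree=2)
    return [sum(c * b for c, b in zip(contour_st, vec)) for vec in basis]


# Invented contours in semitones: a level tone and a steadily rising tone.
level = contour_coefficients([2.0] * 10)
rising = contour_coefficients([float(i) for i in range(10)])
```

A level contour loads only on the constant term, whereas a straight rising contour adds a positive linear term and no quadratic term, which is why mean and slope are the informative coefficients for level and rising tone pairs respectively.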

Fig. 2: Tone system of Macau Cantonese among the younger generation.

From the group mean level, we can conclude that T3–T6 and T4–T6 have not yet merged, but this does not indicate whether they are in the process of language change. Therefore, we also conducted GCA on the production data of each participant at an individual level, making a qualitative judgment on whether a speaker has merged on each tone pair. The results showed that for T2–T5, all 44 participants have completely merged; for T3–T6, 22 participants have undergone merging; for T4–T6, only 8 participants have undergone merging.

In terms of speech perception, following the criteria of Ou and Law (2016), participants with a discrimination rate of AX task below 95% were considered to have merged the corresponding tone pairs in perception. The results indicated that for T2–T5, all 44 participants have completely merged; for T3–T6, 18 participants have undergone merging; for T4–T6, 15 participants have undergone merging (as shown in Table 3).
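The perceptual classification rule stated above amounts to a simple threshold check. The sketch below applies it to invented discrimination rates; the participant identifiers and the function name are hypothetical.

```python
MERGE_THRESHOLD = 0.95  # a discrimination rate below this counts as merged


def classify_perception(rates_by_participant, threshold=MERGE_THRESHOLD):
    """Map participant id -> 'merged' or 'unmerged' for one tone pair."""
    return {
        pid: ("merged" if rate < threshold else "unmerged")
        for pid, rate in rates_by_participant.items()
    }


# Invented discrimination rates for three participants on one tone pair.
labels = classify_perception({"P01": 0.98, "P02": 0.83, "P03": 0.95})
```

Note that a rate of exactly 95% is treated as unmerged, because the criterion is "below 95%".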

Table 3 The number of participants whose tones merged for each tone pair.

To validate our classification, we conducted independent sample Mann–Whitney U tests between the merged and unmerged groups for each tone pair. The results showed that for T3–T6, individuals who have not merged perceptually (M = 0.98, SD = 0.02) exhibited higher discrimination rate than those who have merged (M = 0.83, SD = 0.13), p < 0.001. In terms of reaction time, individuals who have not merged perceptually (M = 957.53 ms, SD = 204.22 ms) had a faster judgment speed than those who have merged (M = 1064.39 ms, SD = 186.25 ms), p = 0.045. In the production task, individuals who have not merged (M = 1.16 ST, SD = 0.63 ST) also showed a larger average ST difference than those who have merged (M = 0.54 ST, SD = 0.64 ST), p = 0.001. For T4–T6, individuals who have not merged perceptually (M = 0.98, SD = 0.02) had a higher discrimination rate than those who have merged (M = 0.87, SD = 0.07), p < 0.001, and individuals who have not merged in production (M = 0.36, SD = 0.17) also exhibited a larger slope difference than those who have merged (M = 0.20, SD = 0.18), p < 0.01. Therefore, it can be concluded that the classification criteria we adopted reflected the characteristics of the merged and unmerged groups.

The results above suggested that there are varying degrees of merging phenomena for both T3–T6 and T4–T6 tone pairs, indicating that ongoing changes are occurring. However, this population shows a higher degree of merging in perception of T4–T6 than in production, while T3–T6 exhibits a similar degree of perception and production merging.

Relationship between perception and production

As T2–T5 tones have actually merged into one tone, we analyzed the relationship between perception and production for the two merging tone pairs, T3–T6 and T4–T6, respectively. The results of independent sample t-tests revealed that, for T3–T6, individuals who have not merged perceptually (M = 1.04 ST, SD = 0.69 ST) exhibited a significantly larger average ST difference than those who have merged (M = 0.58 ST, SD = 0.64 ST), t (42) = −2.25, p = 0.03. However, for T4–T6, there was no significant difference in slope between individuals who have not merged perceptually (M = 0.32 ST, SD = 0.16 ST) and those who have merged (M = 0.36 ST, SD = 0.22 ST), t (42) = −0.66, p = 0.51. When comparing individuals who have not merged and those who have merged in production, for both T3–T6 and T4–T6, there were no significant differences in perception discrimination rate and reaction time (ps > 0.05) (see Fig. 3).
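The reported degrees of freedom, t (42), are consistent with a pooled-variance independent-samples t-test on 44 participants (44 − 2 = 42). As a minimal sketch of that statistic, assuming Student’s pooled-variance form was used, the following Python function computes t and its degrees of freedom for two groups; the data are invented, and a p-value would additionally require the t distribution (for example via `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Return (t, df) for Student's independent-samples t-test (pooled variance)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    se = sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(group_a) - mean(group_b)) / se, df


# Invented ST differences for a merged and an unmerged group.
t_value, df = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The sign of t depends only on the order in which the two groups are passed to the function.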

Fig. 3: Comparisons of production/perception between the merged and unmerged group in perception/production among T3–T6 and T4–T6 pairs.

(1a) and (2a) show the production distinctions between the merged and unmerged group in perception among T3–T6 and T4–T6 pairs, respectively. (1b) and (2b) show the differences in perception discrimination rate between the merged and unmerged group in production among T3–T6 and T4–T6 pairs, respectively. (1c) and (2c) show the differences in perception reaction time between the merged and unmerged group in production among T3–T6 and T4–T6 pairs, respectively. * denotes p ≤ 0.05.

Interestingly, we observed that the relationship between perception and production was inconsistent for the two tone pairs. We therefore explored the role of cognitive functions in this context. The results of independent sample t-tests (or Mann–Whitney U tests if the assumption of normal distribution was not met, hereafter) showed that, for T3–T6, individuals who have not merged perceptually exhibited higher levels of attention (M = 52.42, SD = 4.97) and working memory (M = 37.96, SD = 5.83) than those who have merged (attention: M = 47.61, SD = 7.53; working memory: M = 32.67, SD = 6.14), ps < 0.05. There was no significant difference in executive function (p > 0.05). For T4–T6, individuals who have not merged perceptually showed a higher level of attention (M = 52.68, SD = 4.67) than those who have merged (M = 46.53, SD = 7.92), p < 0.01. The difference in working memory also reached significance, such that individuals who have not merged demonstrated higher working memory capacity (M = 37.28, SD = 5.89) than those who have merged (M = 32.93, SD = 6.72), t (42) = −2.21, p = 0.03, while there were no differences in executive function performance (p = 0.81) (see (1a)–(1c) and (3a)–(3c) of Fig. 4). In terms of production, for both T3–T6 and T4–T6, there were no significant differences in the performance of any cognitive function between the unmerged and merged groups (ps > 0.05) (see (2a)–(2c) and (4a)–(4c) of Fig. 4).

Fig. 4: Comparisons of cognitive functions between the merged and unmerged group in perception/production among T3–T6 and T4–T6 pairs.

(1a)–(1c) show the differences in attention, working memory and executive function between the merged and unmerged group in T3–T6 perception, respectively. (2a)–(2c) show the differences in attention, working memory and executive function between the merged and unmerged group in T3–T6 production, respectively. (3a)–(3c) show the differences in attention, working memory and executive function between the merged and unmerged group in T4–T6 perception, respectively. (4a)–(4c) show the differences in attention, working memory and executive function between the merged and unmerged group in T4–T6 production, respectively. * denotes p ≤ 0.05. ** denotes p ≤ 0.01.

Relationship between cognitive functions and tone processing

To reveal the correlations between cognitive functions and tone perception or production, Pearson correlation analyses were conducted between all indicators reflecting abilities of tone perception and production and the performance of the three cognitive functions. The results showed that, for T2–T5, there were no significant correlations between cognitive functions and either discrimination rate or reaction time (ps > 0.05) (see Fig. 5). For T3–T6, there was a moderate positive correlation between discrimination rate and attention (r (42) = 0.55, p < 0.001) and working memory (r (42) = 0.53, p < 0.001), indicating that higher levels of attention and working memory capacity were associated with higher discrimination rate in T3–T6. Moreover, the reaction time showed a weak negative correlation with working memory (r (42) = −0.33, p = 0.03), suggesting that higher working memory capacity was associated with faster judgments in T3–T6 (see Fig. 6). For T4–T6, there was a moderate positive correlation between discrimination rate and attention (r (42) = 0.60, p < 0.001) and working memory (r (42) = 0.48, p < 0.01), indicating that higher levels of attention and working memory capacity were associated with higher discrimination rate in T4–T6. The reaction time showed a weak negative correlation with working memory (r (42) = −0.30, p = 0.048), suggesting that higher working memory capacity was associated with faster judgments in T4–T6 (see Fig. 7). There were no significant correlations between executive function and perception abilities for either tone pair (ps > 0.05). As for production, only the average ST difference of the T3–T6 pair showed a significant correlation with attention, r (42) = 0.28, p = 0.03 (see (3a) of Fig. 6).
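For readers less familiar with the notation r (42), the following minimal sketch computes a Pearson correlation coefficient together with its degrees of freedom (n − 2, hence 42 for 44 participants). The scores are invented and the function name is hypothetical; significance testing is not included.

```python
from math import sqrt


def pearson_r(x, y):
    """Return (r, df) for paired samples x and y, with df = n - 2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy), n - 2


# Invented attention scores and discrimination rates for five participants.
attention = [45.0, 48.0, 50.0, 53.0, 56.0]
discrimination = [0.80, 0.86, 0.90, 0.95, 0.99]
r, df = pearson_r(attention, discrimination)
```

A positive r here would correspond to the pattern reported above, in which higher attention scores go together with higher discrimination rates.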

Fig. 5: Correlation between various indicators of perception and production of T2–T5 and various cognitive functions.

(1a)–(1c) show the correlations between perception discrimination rate and attention, working memory and executive function, respectively. (2a)–(2c) show the correlations between perception reaction time and attention, working memory and executive function, respectively. (3a)–(3c) show the correlations between production distinctions and attention, working memory and executive function, respectively.

Fig. 6: Correlation between various indicators of perception and production of T3–T6 and various cognitive functions.

(1a)–(1c) show the correlations between perception discrimination rate and attention, working memory and executive function, respectively. (2a)–(2c) show the correlations between perception reaction time and attention, working memory and executive function, respectively. (3a)–(3c) show the correlations between production distinctions and attention, working memory and executive function, respectively. * denotes p ≤ 0.05. ** denotes p ≤ 0.01. *** denotes p ≤ 0.001.

Fig. 7: Correlation between various indicators of perception and production of T4–T6 and various cognitive functions.

(1a)–(1c) show the correlations between perception discrimination rate and attention, working memory and executive function, respectively. (2a)–(2c) show the correlations between perception reaction time and attention, working memory and executive function, respectively. (3a)–(3c) show the correlations between production distinctions and attention, working memory and executive function, respectively. * denotes p ≤ 0.05. ** denotes p ≤ 0.01. *** denotes p ≤ 0.001.

Based on the above results, it could be inferred that the relationships between tone perception and production among different tone pairs were not always consistent, and they might be modulated by cognitive functions. The degree to which cognitive functions affect tone perception and production varies across different tone pairs. The correlational results are summarized in Table 4.

Table 4 Correlations between cognitive functions and abilities of perception and production in each tone pair.

Discussion

By examining the abilities of tone perception and production, as well as a series of cognitive functions in native Cantonese speakers, the current study provides the most recent investigation into the tonal patterns of the younger generation in Macau. We explored the potential role of cognitive functions in the merging of Cantonese tones, revealing that the influence of cognitive functions varies across different tone pairs, and there are differences in the correlation between perception and production. Specifically, for T2–T5, our results indicated no significant correlations between cognitive functions and either their perception or production. In contrast, T3–T6 and T4–T6 exhibited varying degrees of correlation between their perception and production aspects and cognitive functions. Furthermore, T3–T6 showed significant correlations with cognitive functions in both perception and production, while T4–T6 only exhibited such correlations in perception (see Table 4). Consistent with prior research (Ou et al., 2015; Ou and Law, 2017), we also identified the crucial roles of attention and working memory in tone processing. Importantly, our study measured the relationship between cognitive functions and perception as well as production within the different tone pairs, respectively. This revealed that the role of cognitive functions in tone processing is distinct across different tone pairs, thereby expanding our understanding of the interplay between speech perception and production.

Considering the inherent differences in contour among these three tone pairs, it is pertinent to explore why cognitive functions correlate differently with each pair, in light of the differences in tonal contours and merging speed. In terms of tonal contours, T2–T5 consists of two rising tones, and native speakers can distinguish them based on slope, F0 offset, and, as suggested by Ou and Law (2016), rise time (i.e., the time taken for a sound to reach its maximum amplitude). T3–T6 comprises two level tones, and native speakers can rely only on the average F0 to differentiate them (Liu, 2015). Additionally, these two level tones in Cantonese tend to be perceived continuously, with a significant overlap in auditory space (Xie et al., 2017), making it challenging for speakers to establish distinct categorical boundaries. Therefore, perception of T3–T6 is relatively difficult. T4–T6 involves a low-falling tone and a low-level tone, with inherent differences in tonal contours. Furthermore, T4 has a shorter duration, allowing differentiation in both duration and contour. Mok et al. (2013) pointed out that T4 is more stable than T6, as the end of T4 is at a very low F0 level, often accompanied by a creaky voice, serving as an additional cue for differentiation (Yu and Lam, 2011). Thus, T3–T6 relies on fewer distinguishing features, potentially making it more challenging, and members of the Macau speech community may need to exert more cognitive effort to distinguish T3–T6, implying a faster merging pace for those unable to meet this cognitive demand. However, it is noteworthy that T2–T5, which appears to have more distinguishing features, should theoretically have a slower merging pace than T3–T6. This is contrary to what was observed. This study thus suggests that, although rise time and slope differ between the two tones, both features fall within the single dimension of “contour” and therefore provide limited assistance.
Moreover, findings from child acquisition studies suggest that the production of rising tones may not be as stable and accurate as that of level and falling tones (Li and Thompson, 1977; Snow, 1998), making confusion more likely for T2–T5 than for T3–T6. Mok et al. (2013) also investigated the impact of word frequency on the merging of T2–T5. Through analysis of a Hong Kong newspaper corpus, they found that in Cantonese, the frequency of T2 words is higher than that of T5 words, while there is no significant difference in the frequency of T3 and T6 words. Further, Kong (2020) pointed out that analyzing the phoneme load of initials, vowels, and tones in Chinese dialects can reveal the direction of speech changes. In Cantonese, the phoneme load of T5 is smaller than that of T2, making them prone to interaction and resulting in the merging of T5 into T2, consistent with existing research (Zhang, 2019). Considering all these factors, the earlier merging of T2–T5 compared to T3–T6 can be better understood. T4–T6, by contrast, relies on more distinguishing features, which operate in different dimensions such as duration, contour, and phonation type. Therefore, T4–T6 is easier to distinguish than T3–T6, especially in production, and places lower cognitive demands on speakers.

In terms of merging speed, as our results suggested, T2–T5 has essentially completed merging and has become the sole rising tone in mental representation, and native speakers’ perception of the pair may rely solely on slope. In other words, whenever they hear a rising tone, speakers in Macau can assign it to the same tone category, eliminating the need to invest cognitive resources in determining which of the two tones is higher or lower. On the other hand, T3–T6 is currently undergoing widespread merging in the speech community, with a merging speed in production comparable to that in perception, and faster than that of T4–T6 production (see Table 3). In this community, members encounter more extensive, ambiguous, or even mixed usage of T3 and T6 words. Ambiguous pronunciations that spread through lexical diffusion (see Wang and Cheng, 1977) affect community members’ perceptual representations of T3–T6. Over time, as their perceptual representations become less stable, more cognitive resources are needed to aid in identifying T3–T6 while simultaneously attempting to maintain the stability of perceptual representations. In contrast, the merging speed of T4–T6 in the community is slower, requiring relatively less cognitive effort to distinguish T4–T6. This may be explained by speech perception theory (Holt and Lotto, 2010). Not all phonemes we hear every day are standard; they vary significantly depending on factors such as gender, age, and dialect. Speech perception is the process that maps these high-variability acoustic signals onto linguistic representations. A recent study also suggested that high-variability speech can facilitate tone perception (Wei et al., 2022). When speech in a community loses this high variability, members’ ability to effectively perceive similar sounds may decline, requiring more cognitive resources to discriminate them.

Regarding the role of cognitive functions, this study aligns with previous research (Ou et al., 2015; Ou and Law, 2017) in indicating that attention plays a positive role in tone processing. In addition, this study provides novel evidence regarding the role of working memory in T3–T6 perception. Speech is composed of a continuous stream of signals, and listeners, on hearing these signals, first need to decode them and accurately map them onto corresponding phonetic representations (Cutler and Clifton, 1999). It is essential to keep these representations active for perceptual processing. Ou et al. (2015) evaluated participants’ verbal working memory using a series of tasks, yet found that the digit span backward task made no significant contribution to tone processing. Their explanation was that the digit span task does not assess speed and thus might not effectively capture the relationship between memory resources and processing speed. However, our study not only found a significant role of verbal working memory in the T3–T6 discrimination rate but also revealed a significant negative correlation between this function and reaction time in T3–T6 perception. That is, higher working memory capacity was associated with shorter judgment times. In particular, the phonological loop is a subcomponent of working memory, involving a subvocal rehearsal system that can preserve memory traces for a few seconds. These memories primarily depend on their acoustic or phonological characteristics, and tokens with similar pronunciations are more challenging to retain in short-term memory (for a review, see Baddeley, 2003). Therefore, we speculate that participants were likely to prioritize memorizing the acoustic characteristics of the numbers when performing the digit span task.
Besides the digit span backward task, our working memory test included a simpler digit span forward task and a more challenging digit span sequence task, with the latter imposing higher operational demands on working memory. This task selection could therefore reflect participants’ verbal working memory performance more comprehensively, potentially providing a more accurate understanding of the relationship between working memory and language processing.

As for executive function, even though we used the auditory Stroop test, which we believe is more effective for assessing verbal executive function, it showed no significant relationship with tone processing. We acknowledge that the auditory Stroop test adapted to Cantonese might be flawed. The ideal Stroop test requires at least two attributes to be inconsistent between the stimuli, while the tones of “高” (“high”) and “低” (“low”) that we chose are the same in Cantonese. We therefore artificially lowered the pitch of “低”, which may have made the adapted Stroop test less authentic. Another possible reason is that our participants were all healthy, young native speakers. Previous studies have shown that executive function is highly related to age (Ferguson et al., 2021), and plays an important role in neurodegenerative diseases (e.g., Alzheimer’s disease, Kirova et al., 2015), multilingual ability (e.g., Carlson and Meltzoff, 2008; Zhang et al., 2015) and other fields. The difficulty of our current tasks may still not be sufficient to engage executive function in these young native speakers.

Some other issues concerning the relationship between speech perception and production also warrant discussion. Zhang (2019) found that, compared to Hong Kong Cantonese, the merging process of T2–T5 and T3–T6 in Macau Cantonese is faster. This leads to the question of whether the current merging stage of T2–T5 in Hong Kong Cantonese is similar to that of T3–T6 in Macau Cantonese. Our results for T3–T6 corroborated the findings for T2–T5 in Hong Kong, as both studies identified the crucial role of attention in their respective tone pairs (Ou and Law, 2017). Moreover, T2–T5 in Hong Kong Cantonese also exhibited a relationship between perception and production (Ou and Law, 2016). However, in Macau Cantonese, there was no longer a connection between perception and production for T2–T5. Combining these results, we can observe that the relationship between perception and production is connected to the stage of the tone merging process. T4–T6, in the early stage of merging, shows a weaker relationship between perception and production. T3–T6 in Macau Cantonese and T2–T5 in Hong Kong Cantonese, which are in an accelerated merging stage, exhibit a stronger relationship between perception and production. T2–T5 in Macau Cantonese, in the completed merging stage, no longer shows a relationship between perception and production. Furthermore, the roles of cognitive functions co-vary with the perception and production of the three tone pairs: the T4–T6 pair is affected only in perception, T3–T6 in Macau and T2–T5 in Hong Kong are affected in both perception and production, and T2–T5 in Macau Cantonese shows no association in either perception or production. Cognitive functions in this process seem to work like a “sliding window”, exerting influence initially on the most unstable tone pair, progressing from perception to production, and then moving to the more stable tone pair.
This supports the DIVA model (Guenther, 1995; see the Introduction), in which accurate perception is a prerequisite for accurate production. Because our knowledge of this topic is still limited, more research is needed to better understand how this “window” drives language change.

Conclusion

In summary, the current study provided a preliminary exploration of the role of cognitive functions in different merging tone pairs, revealing varying impacts on perception and production across tone pairs. Introducing the perspective of cognitive functions offers a new angle for understanding the origin of tone variation. Furthermore, the study observed that the influence of cognitive functions on perception and production was not static. This suggests that, depending on the inherent features of a tone system and the spread of a merger within a community, cognitive functions might play distinct roles at different stages. Considering cognitive functions as a mediator, our results indicated that the strength of the link between perception and production may evolve over time. This raises the question of whether changes in this relationship drive tone merging, and what role cognitive functions play in this process.

However, our study also has certain limitations, such as sample imbalance within the same tone pair. Notably, for some pairs (e.g., T4–T6), there were significantly fewer instances of production merging than of non-merging. This could contribute to the lack of correlation between T4–T6 production and cognitive functions. Future research should address this limitation and explore the neurobiological basis of speech processing to unravel the relationship between neural representations of perception and production, shedding light on the sources and mechanisms of language change over time.