Introduction

Sound symbolism refers to the relationship between phonemes and particular perceptual and/or semantic elements (Sidhu and Pexman, 2018). The iconic association between linguistic forms and meanings not only facilitates lexical processing and learning (Perniss and Vigliocco, 2014; Perniss et al. 2010; Imai and Kita, 2014) but also enhances the vividness and effectiveness of verbal communication (Di Cesare, 2022; Lockwood and Dingemanse, 2015). Many studies probe sound symbolic phenomena with behavioral tasks by asking the perceiver to make a forced-choice judgment (e.g., choosing between round or spiky shapes) based on sounds/pseudowords (e.g., embedding phonemes with similar phonological features in CVCV combinations such as “bouba” (back vowels) or “kiki” (front vowels); Köhler, 1929). To improve the robustness of the results, others have asked listeners to rate perceptual or semantic properties on continuous Likert scales (Etzi et al. 2016; Sidhu et al. 2022). These paradigms address two key issues in sound symbolism research: the first is to identify the important phonological or acoustic cues for sound symbolic associations, and the second is to investigate the underlying mechanism of sound symbolism, that is, how these associations occur.

Acoustic and phonological representation of iconic perception

The iconic perception of phonemes can extend from low-level perceptual (e.g., visual and tactile) dimensions to high-level cognitive (e.g., emotional and interpersonal) dimensions. The features of the phonemes may underlie how such perception is formed. These features can be manifested at the segmental level, whether phonological or acoustic, such as the backness or the formants of a vowel. They may also be represented at the suprasegmental level, such as pitch and duration.

Previous studies largely used phonological features to describe sound symbolic associations between phonemes and the properties of these dimensions (e.g., using the front-back distinction of vowels to characterize the bouba-kiki effect), while the accompanying acoustic parameters of sounds received less attention (Knoeferle et al. 2017). While the importance of phonological features in characterizing sound symbolic associations should be recognized, taking acoustic features into consideration is also crucial. Some have argued that phonological features cannot fully reflect how speech sounds are perceived by listeners (Parker, 1977). One phonological feature may correspond to a range of acoustic variations and can be realized with different acoustic features across speakers. Being more fundamental to human hearing, acoustic features may allow a more detailed and comprehensive characterization of sound symbolic associations. Moreover, acoustic features can be applied universally across languages precisely because they are fundamental to human hearing, whereas phonological rules differ by language. In this sense, it may be advantageous to rely mainly on acoustic features when describing sound symbolic associations. Nevertheless, some acoustic parameters are associated with phonological features, which establishes connections between new findings on acoustic parameters and previous studies using phonological features. For example, crucial phonological features such as the height, backness, and roundedness of a vowel are related to the first, second, and third formants, respectively (Wu and Lin, 2014). Therefore, the crucial phonological features identified by previous studies remain important references for acoustic examination.

Although sounds have been found to associate with properties of many modalities, the mappings that have received the most attention are sound-vision and sound-touch mappings. Sound-vision mappings here refer to the relationship between phonemes and visual properties, the fundamental features of objects that can be perceived through the visual system, such as color, shape, and size (Kennedy, 2007; Feldman, 2003; McCormick et al. 2018). Former studies on the relationship between visual features and phonemes have yielded consistent results. For example, previous research revealed the importance of the first and second formants in sound-size (Knoeferle et al. 2017; Ohtake and Haryu, 2013) and sound-brightness mappings (Asano and Yokosawa, 2011), as well as the crucial role of the second and third formants in sound-shape associations (Knoeferle et al. 2017). Apart from formants, pitch was also found to influence the perception of size (Bien et al. 2012).

Although touch and vision are both low-level perceptual dimensions, the features determining sound-touch associations are yet to be specified. Tactile sound symbolism explores how phonemes are related to the sense of touch (Lo et al. 2017). Etzi et al. (2016) found that, compared with sounds associated with spiky shapes, those related to round shapes (e.g., maluma) were more likely to correspond with smooth textures, which was consistent with the findings of Blazhenkova and Kumar (2018). However, Sakamoto and Watanabe (2018) held the opposing view that central vowels such as /o/ and /a/ tended to be paired with roughness; the same study further reported that hardness and dryness were linked to central vowels. Fricative consonants have also been found to be associated with roughness (Lo et al. 2017). Besides segmental features, other studies have explored the relationship between suprasegmental features and tactile properties. For example, Lin et al. (2021) found that the length of vowels in Japanese sound-symbolic words could affect tactile imageability: words containing long vowels (e.g., buubuu) were rated as more tactilely imageable than those containing short vowels (e.g., bubu), pointing to the potential impact of duration. Moreover, Eitan and Rothschild (2011) asked participants to rate 4-second instrumental sounds varying in pitch and loudness. They found that high pitch was associated with increased dryness, roughness, hardness, and lightness, and that quieter sounds were rated as smoother, softer, lighter, and hotter than louder sounds. Given the limited research and inconsistent results, it is necessary to further pin down the important acoustic cues (both segmental and suprasegmental) for different dimensions of sound-touch mappings.

In addition to low-level perceptual dimensions, high-level cognitive processes may also be involved in forming sound symbolic associations. Studies on the sound symbolism of valence (the evaluation of positivity/negativity) maintain that emotion recognition based on simple phonemes could be embodied in the movements of facial muscles during articulation (Körner and Rummer, 2021; Yu et al. 2021): /i/ is rated as more positive than /y/, /o/, and /u/ because articulating /i/ promotes smiling while articulating the other three phonemes hinders it. Studies have not only investigated the direct mapping between emotion factors and phonemes but also suggested that emotion factors could underlie multiple sound symbolic associations. For example, arousal has been proposed to be the link between sound and shape (Aryani et al. 2020). The multidimensional study by Sidhu et al. (2022) also supported the significant roles of emotional factors such as valence, activity, and potency in multiple sound symbolic relationships. In this study, participants were asked to rate pseudowords on 25 different dimensions (such as round-sharp, good-bad, fast-slow, and masculine-feminine). An exploratory factor analysis revealed that these 25 associations could be grouped into four higher-order factors: valence, activity, potency, and novelty, the first three of which are crucial in forming meanings (Osgood et al. 1957) and in categorizing emotions (Russell and Mehrabian, 1977).

As social attitudes are tightly intertwined with emotional factors (Moine and Obin, 2020; Schirmer and Adolphs, 2017), the growing attention to the association between phonemes and emotional factors also points to the importance of exploring sound-attitude mappings. Social attitudes are generalized evaluations of people, objects, places, ideas, or issues (Ajzen and Fishbein, 1977; Fujisaki et al. 2016). However, previous studies mainly focused on how the prosodic features of sentences contribute to attitude perception (Vergis et al. 2020; Caballero et al. 2018; Guldner et al. 2020; Hosbach, 2009; Truesdale and Pell, 2018), and research on how phonemes are symbolically associated with social attitudes is lacking. Interpersonal perception, moreover, extends beyond the perception of social attitudes: it entails the cognitive process through which individuals observe, interpret, and respond to the intentions, personalities, behaviors, and attitudes of others during social interactions (Kenny, 1988; Malloy and Albright, 1990; Fiske et al. 2007). Some have investigated the relationship between phonemes and interpersonal information other than attitudes, such as social intentions and personalities. For example, it was found that in different social settings, the formant frequencies of vowels could be altered to express different intentions (Eckert and Labov, 2017). A study on the pairing of sound with personality demonstrated that people whose first names contained sonorant phonemes (such as Mona and Owen) were considered to have more agreeable and conscientious personalities (Sidhu et al. 2019). Although few studies have directly examined sound-attitude associations in the context of sound symbolism, previous research thus hints at a potential link between phonemes and other interpersonal information such as intentions and personalities. The current study focuses on social attitudes in speech, which are highly influential in interpersonal relationships (Ajzen and Fishbein, 1977) and can even define interpersonal interactions (Fujisaki et al. 2016).

To sum up, one focal purpose of the current study is to determine the significant acoustic cues characterizing multiple sound symbolic associations, especially in the less-explored tactile modality and the rarely investigated interpersonal aspect. Moreover, given the connection between visual/tactile properties and emotional or interpersonal perception (Etzi et al. 2016; Blazhenkova and Kumar, 2018; Etzi et al. 2014), sound symbolic associations with different modalities may share similar acoustic cues. It is therefore also important to compare the crucial acoustic cues across these three modalities to see what similarities or differences emerge.

Meanwhile, controlling for the effect of sex is also crucial for investigating the acoustic representation of sound symbolism. Former studies showed that sex stereotypes arise from the perceived sex of speakers (Roche et al. 2023; Roche et al. 2022). For example, Roche et al. discovered that listeners typically associate rising intonation with less confidence in female voices than in male voices. Sex bias was also related to the listeners’/observers’ sex (Hu et al. 2023; Yu, 2022; Jiang and Pell, 2017; Jiang and Pell, 2016a; Jiang and Pell, 2016b; Jiang and Pell, 2015; Jiang and Zhou, 2015; Etling and Young, 2007). Some studies reported that female listeners were more sensitive to certain acoustic/semantic cues when making social evaluations (Jiang and Pell, 2017; Jiang and Pell, 2016a; Jiang and Pell, 2016b; Jiang and Pell, 2015), while other studies showed that male listeners were more sensitive (Hu et al. 2023; Jiang and Zhou, 2015).

Mechanisms underlying sound symbolism

Beyond exploring the significant acoustic cues of sound symbolic associations, critical theoretical questions arise regarding how these associations occur and whether different types of associations (e.g., sound-shape and sound-size associations) share the same mechanisms (Sidhu and Pexman, 2018; Ohtake and Haryu, 2013; Spence, 2011; Parise, 2016). Sound symbolism is closely tied to cross-modal correspondence (Spence, 2011), which refers to “a tendency for the sensory attribute in one modality, either physically present or merely imagined, to be matched with the feature in another sensory modality” (Spence and Parise, 2012, p.410). Some mechanisms proposed for cross-modal correspondences could also help to explain sound symbolic associations. The main proposals relevant to these phenomena focus on statistical regularity (Spence, 2011; Gallace and Spence, 2006), language patterns (Bergen, 2004; Magnus, 1998), and shared properties (French, 1977; Rummer et al. 2014; Palmer et al. 2013; Velasco et al. 2015).

The statistical regularity account holds that stimuli from two different modalities become associated because they tend to co-occur frequently in the natural environment (Spence, 2011). Fort and Schwartz (2022) demonstrated the statistical co-occurrence between sound and round/spiky objects: when round objects hit or rolled on a hard plane surface, they tended to make sounds with lower-frequency spectra and greater temporal continuity than spiky objects. These two parameters proved significant for listeners in the round/spiky visual judgement of pseudowords. The statistical regularity account predicts common sound symbolic mappings for speakers of different languages, as the natural physical environment is shared. Consistent with Fort and Schwartz’s proposal, the bouba-kiki effect has proved robust across different cultures and writing systems (Ćwiek et al. 2022).

Closely related to the statistical regularity account is the language pattern account, which emphasizes the crucial role of the language environment rather than the natural physical environment (Sidhu and Pexman, 2018; Bergen, 2004; Magnus, 1998). It argues that people could form sound symbolic associations between phonemes and semantic features based on their frequent co-occurrence in a particular language environment. For example, the phoneme cluster “gl” often appears in English words about light, such as “glow” (Bergen, 2004), and this cluster was in turn utilized by participants to indicate brightness in a novel word creation task (Magnus, 1998). Unlike the statistical regularity account, which predicts common sound symbolic associations across languages, the language pattern account predicts that speakers from different linguistic backgrounds could form different sound symbolic associations.

While previous studies provided evidence of statistical regularity for some sound symbolic phenomena, a recent multidimensional study of 25 sound symbolic associations hinted that a property shared between phonemes and perceptual/semantic elements, especially emotional factors, could underlie different sound symbolic associations (Sidhu et al. 2022). The shared property account argues that stimuli from different modalities become associated with certain phonemes because they share some common property, which could be a perceptual feature, a linguistic label, an emotional factor, or a mediating dimension (Sidhu and Pexman, 2018). Sidhu et al. (2022) pointed to the possibility that emotional factors could be the shared property uniting phonemes (embedded in pseudowords) and perceptual/semantic dimensions. A similar view is held by the Emotion Mediation Hypothesis, which proposes that emotion can serve as a mediator linking a property in one modality to a property in another (Palmer et al. 2013; Velasco et al. 2015). Ketron and Spears (2019) investigated the mediating effect of arousal in the relationship between vowel backness and consumers’ willingness to pay. Using mediation analysis, they found that people were more willing to spend in stores whose names contained back vowels than in stores whose names contained front vowels. This effect was mediated by the perceived arousal of store names: back vowels were considered more arousing, and higher arousal pointed to greater willingness to pay.

Different from the Emotion Mediation Hypothesis, which focuses on emotional factors, the Transitivity Proposal, which also belongs to the shared property account, highlights the role of a mediating dimension as the shared property (French, 1977; Deroy et al. 2013). According to French (1977), there are only a limited number of direct links between phonemes and properties in other modalities. The majority of cross-modal correspondences are mediated by these directly linked properties, given the transitive nature of cross-modal correspondences (i.e., the Transitivity Proposal: A-B and B-C associations could lead to an A-C association). For example, the relationship between high-front vowels and smallness might be a direct one; smallness as the mediating dimension could then facilitate indirect mappings from high-front vowels to other smallness-related properties such as thinness and lightness (Fig. 1a). Unlike the Emotion Mediation Hypothesis, the Transitivity Proposal assumes that both emotional and perceptual features can work as the shared property, which licenses testing the mediating role of dimensions more basic to human perception (such as visual and tactile dimensions) in forming sound symbolic associations. The link between speech and action could further help to explain the Transitivity Proposal, especially when tactile properties work as a mediating dimension. The motor theories of speech perception propose that hearing a sound can activate action representations related to the speaker’s vocal production (Liberman and Mattingly, 1985; Stasenko et al. 2015; Franken et al. 2022). Previous studies on the vitality form of speech have also shown that different levels of certain acoustic features (such as pitch, intensity, and speech rate) affected the kinematic parameters of subsequent actions (Di Cesare et al. 2017; Lombardi et al. 2021; Rizzolatti et al. 2021). Moreover, it was found that inhibition of orofacial mimicry (holding a pen in one’s mouth) led to poorer performance in judging the authenticity of vocal emotion (Vilaverde et al. 2024). This evidence points to the important role of tactile experience in the high-level cognitive assessment of auditory information. As tactile experiences involve sensory-motor representations, the inherent relationship between speech and action may underlie sound-touch associations, which in turn helps to mediate the association between sound and other dimensions.

Fig. 1: Mediation and suppression effects.

Illustration of the mediation (a) and suppression (b) effects resulting from the Transitivity Proposal. The dotted line stands for the indirect effect.

It is also important to point out that, as Deroy et al. (2013) noted, the prediction of the Transitivity Proposal can contradict how dimensions are actually associated, and little research has offered further interpretation of such inconsistencies. One example concerns the transitivity from high pitch through fruity odors to round shapes. Fruity odors were found to correspond with both high pitch and round shapes, so if fruity odors played a mediating role, high pitch should be associated with round shapes. In reality, however, high pitch was associated with angular shapes, contradicting the prediction of the Transitivity Proposal (Fig. 1b).

Here we propose that, statistically speaking, such an inconsistent result could be interpreted as a suppression effect, which may be evidence for the interaction between the Transitivity Proposal and another sound symbolic mechanism. In mediation analysis, a mediation effect means that the indirect effect shares the same sign as the total effect (as in Fig. 1a), whereas suppression effects occur when “an indirect effect has a sign that is opposite to that of the total effect, and thus omission of the suppressor could lead the total effect to appear small or nonsignificant” (Rucker et al. 2011, p.366; as in Fig. 1b). The “inconsistent” transition is therefore likely to result from the transition underlying the indirect formation of the A-C association (via A-B and B-C associations) interfering with a direct A-C’ mapping formed by another mechanism.
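
In the standard decomposition used in mediation analysis, the total effect $c$ of a predictor on an outcome splits into a direct effect $c^{\prime}$ and an indirect effect $ab$ running through the mediator:

$$c = c^{\prime} + ab$$

As a schematic illustration with hypothetical numbers (not values from the cited studies), a direct effect of $c^{\prime} = -0.5$ combined with an indirect effect of $ab = +0.2$ yields a total effect of $c = -0.3$: the indirect path has the sign opposite to that of the total effect and shrinks it in magnitude, which is exactly the suppression pattern described above.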

The report of the suppression effect could support the coexistence of multiple mechanisms underlying sound symbolism and demonstrate how different mechanisms might interact with each other.

Therefore, the second goal of the current study is to examine the interactions between different sound symbolic dimensions, which would not only broaden the current understanding of the relationships among visual, tactile, and interpersonal sound symbolic mappings but might also contribute to the theories behind sound symbolism.

The current study

The current study has two main goals. The first goal is to determine and compare the acoustic cues crucial to characterizing visual, tactile, and interpersonal sound symbolic associations for Mandarin rimes. The major focus here is the interpersonal sound symbolic associations as few studies have explored the underlying acoustic cues. The study selected four interpersonal (polite/rude – politeness, friendly/hostile – friendliness, encouraging/authoritative – authoritativeness, passionate/indifferent – indifference) dimensions widely studied by social cognitive research involving human voice (Vergis et al. 2020; Caballero et al. 2018; Guldner et al. 2020; Hosbach, 2009; Truesdale and Pell, 2018).

Interestingly, these interpersonal dimensions have been found to be associated with some visual and tactile dimensions in studies on cross-modal correspondences (Schirmer and Adolphs, 2017; Etzi et al. 2016; Blazhenkova and Kumar, 2018; Etzi et al. 2014; Brunet et al. 2012; Ackerman et al. 2010; Kaspar, 2013). This raises the question of whether the acoustic cues crucial to characterizing interpersonal sound symbolism overlap with those defining visual or tactile sound symbolism, given the cross-modal correspondences between visual/tactile and interpersonal dimensions, or whether the defining acoustic cues vary with the modalities/dimensions they predict. To answer this question, four visual (spiky/round – shape, small/large – size, bright/dark – brightness, thin/thick – thickness) and four tactile (smooth/rough – roughness, light/heavy – weight, cold/hot – temperature, hard/soft – hardness) dimensions were selected. Not only are these dimensions associated with the four interpersonal dimensions chosen, but they were also of great interest to previous studies on sound symbolism (Knoeferle et al. 2017; Ohtake and Haryu, 2013; Asano and Yokosawa, 2011; Tsur, 2006; Etzi et al. 2016; Blazhenkova and Kumar, 2018; Sakamoto and Watanabe, 2018; Motoki et al. 2022). The selection of critical dimensions in the current study thus reflects the interests of preceding research and facilitates the examination and comparison of the crucial acoustic parameters underlying these sound symbolic associations in native Mandarin speakers.

The second goal of the current study is to examine the relationships among low-level perceptual and high-level cognitive sound symbolic associations. As mentioned above, visual and tactile properties are included because they are the two most commonly investigated low-level perceptual modalities in studies of sound symbolism, while social attitudes are selected because they represent high-level cognitive processes. Major theories of sound symbolism, such as the Emotion Mediation Hypothesis and the Transitivity Proposal, assume different roles for low-level and high-level dimensions in forming sound symbolic associations. To further distinguish between these two accounts, it is vital for the study to include both low-level and high-level dimensions, which may carry theoretical implications for the mechanisms underlying sound symbolism.

In the current study, Mandarin rimes were chosen to form the auditory stimuli. Mandarin syllables consist of initials (“声母, shēngmǔ” in Mandarin Chinese) and rimes (“韵母, yùnmǔ”), as specified by the Pinyin system. Initials and rimes in Mandarin roughly correspond to consonants and vowels, respectively. Mandarin rimes are also sometimes referred to as Mandarin rhymes or finals (Leong et al. 2005; Zou et al. 2020; Li et al. 2016). They can be categorized into three types: simple rimes, compound rimes, and nasal rimes (Wu and Lin, 2014). Simple rimes contain only single vowels, i.e., monophthongs. Compound rimes contain more than one vowel or component, namely diphthongs and triphthongs. Some compound rimes, such as /ai/, /au/, /ɛi/, and /əu/ in Mandarin, are referred to as falling diphthongs (Bussmann et al. 2006), in which the first component is more prominent than the second; others, such as /ia/, /ua/, /uo/, /iɛ/, and /yɛ/, are referred to as rising diphthongs (Wu and Lin, 2014; Bussmann et al. 2006). The first component of rising diphthongs and of triphthongs (/iau/, /iəu/, /uai/, and /uɛi/) is sometimes called a “prenuclear glide” and is less prominent than the following component(s). Lastly, nasal rimes in Mandarin start with a monophthong or a diphthong and end with an alveolar or velar nasal coda (/n/ or /ŋ/) (Chen, 2000). Simple and compound rimes are made up entirely of vowels (monophthongs, diphthongs, and triphthongs), while only nasal rimes contain vowels plus nasal codas. The current study not only investigates the simple vowels (monophthongs, as in simple rimes) in Mandarin but also includes all compound rimes and nasal rimes, all of which are important in forming Mandarin syllables. Unlike in other languages, rimes in Mandarin are taught as wholes (containing both vowels and nasal codas) in formal education, forming an essential part of formal language learning. Rimes are therefore fundamental to the perception and production of Mandarin for native speakers from the time they first learn the phonological rules of the language (Lin et al. 2020).

Two distinctive features of Mandarin rimes add novelty to previous findings on sound symbolism: formant transitions and nasal codas. Although previous studies found that the formants of a vowel may contribute to sound symbolic effects (Knoeferle et al. 2017; Ohtake and Haryu, 2013; Asano and Yokosawa, 2011), it remains largely unknown how such effects are characterized when formant values change within a diphthong. Diphthongs are crucial for Mandarin as they make up half of the Mandarin rime inventory. Therefore, formant transitions were included as predictors to characterize sound symbolic relationships in Mandarin diphthongs. Another distinctive feature of Mandarin rimes is that some vowels are followed by an alveolar or velar nasal coda (/n/ or /ŋ/). Nasal codas may also contribute to sound symbolic effects, as a previous study on Japanese mimetics revealed that words containing nasals are more frequently used to express heaviness (Saji et al. 2013).

Method

Participants

Forty Mandarin-speaking college students (21 female; mean age = 21.15 yrs, SD = 1.81) took part in the online survey and rated the sound stimuli. All participants were native Mandarin speakers. Fifteen participants reported that they did not use any dialect other than Mandarin, and two reported that they understood a dialect but could not speak it. The remaining participants reported using a dialect different from standard Mandarin, spanning regions from Guangdong, Jiangxi, Henan, and Shandong to Northeast China. Details of the participants’ information can be found in Table S1. Despite their dialect backgrounds, the national language policy in China has promoted Mandarin so strongly that younger generations (such as the participants in the current study) have been exposed to Mandarin from birth. The policy also requires all students to use Mandarin from the start of compulsory education in primary school. In the current study, participants started compulsory education at a mean age of 6.4 years and had received 15 years of education on average. Moreover, five of the forty participants had received vocal training (mean duration = 11.2 months) and 15 of the 40 had received training in public speaking, hosting, or broadcasting (mean duration = 3.67 months). Participants were also asked to self-report their Mandarin proficiency on a 10-point scale, with “10” indicating excellent proficiency. Mean scores for hearing, speaking, reading, and writing proficiency were 8.4, 8.0, 8.3, and 7.8, respectively, demonstrating a high level of mastery of Mandarin Chinese. All participants were required to use headphones during the experiment.

Materials

Two native Mandarin speakers (both male, aged 38 and 25 years) recorded the Mandarin rimes in a level tone using Praat (Boersma and Weenink, 2021) at a sample rate of 44,100 Hz. Both were native Mandarin speakers, although one had a dialect background from northeastern China and the other from Shanghai. The level tone was used to avoid the confounding effects of different tones on sound symbolism (Chang et al. 2021). Speakers were asked to produce the rimes naturally at a normal speed without any communicative intention in mind, reading each rime from the material list neutrally and without emotion, as if informing others of a piece of news in daily communication.

Two identical rimes produced by the same speaker were concatenated with a silent interval of 50 ms to form a single sound stimulus (within-speaker concatenation, e.g., /an/-/an/). Single syllables in Mandarin are typically associated with lexical semantic information; to avoid such contamination, we duplicated each syllable to form a two-syllable stimulus. In this way, all stimuli can be treated as pseudowords, which follow Mandarin phonological rules but do not convey lexical meanings (Jiang et al. 2015; Pell et al. 2022). Similar procedures for creating stimuli have been used in former research such as Cui et al. (2023).
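
For illustration, the concatenation step could be scripted as in the following minimal R sketch, assuming the single-rime recordings are stored as WAV files (the file names and the tuneR-based workflow are our illustrative assumptions; the original stimuli were prepared from Praat recordings):

```r
library(tuneR)

rime <- readWave("speaker1_an.wav")   # one recorded rime (44,100 Hz; hypothetical file name)

# 50 ms silent interval matching the recording's sample rate and bit depth
gap <- silence(duration = 0.05, samp.rate = rime@samp.rate,
               bit = rime@bit, xunit = "time")

# within-speaker concatenation: /an/-/an/
stim <- bind(rime, gap, rime)
writeWave(stim, "stim_an_an.wav")
```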

Because it was impossible for speakers to record the two rimes /ɿ/ and /ʅ/ without preceding consonants in Mandarin, these were removed from the materials so that all recorded rimes were produced alone, free from the influence of preceding consonants. Altogether, 35 rimes (14 monophthongs, 17 diphthongs, and 4 triphthongs) were recorded separately by the two speakers, resulting in 70 sound stimuli in total. All sound files were scaled to an average intensity of 70 dB. The list of 35 rimes and their coded features can be found in Table 1.

Table 1 List of sound stimuli.

Procedure

Participants were asked to rate the sound stimuli on five-point (−2 to 2) scales using an online survey platform (QuestionPro; www.questionpro.com). The study investigated sound associations in three aspects: vision, touch, and attitude, with four specific dimensions per aspect. Each aspect formed one questionnaire, and all 70 sound stimuli were presented in randomized order in each questionnaire. In the questionnaire concerning visual properties, pictures clearly depicting spiky/round, small/large, bright/dark, and thin/thick were presented for reference. These pictures illustrated the two ends of each scale, and each point of the scale carried a corresponding linguistic label (Fig. S1). In the second questionnaire, targeting tactile properties, pictures and labels indicating smooth/rough, light/heavy, cold/hot, and hard/soft properties were likewise shown (Fig. S2). All pictures were black and white, of the same size, and varied only in the attributes of interest. Participants were explicitly asked to imagine touching objects with the specific features mentioned in each scale and were reminded to base their decisions solely on the features mentioned in the linguistic labels. The final questionnaire included textual descriptions of eight scenarios suggesting eight social attitudes (polite/rude, friendly/hostile, encouraging/authoritative, and passionate/indifferent) (Fig. S3). In this questionnaire, participants were asked to imagine social circumstances in which they were interacting with other people; the detailed textual descriptions of the opposite interpersonal attitudes in each scale were designed to facilitate understanding of the linguistic labels. Six participants (4 female, mean age = 22.50 yrs, SD = 1.76, all native Mandarin speakers) who were not included in the main experiment rated the compatibility between the eight textual descriptions and the corresponding labels on a scale of 1 to 5 (very incompatible to very compatible). All eight textual descriptions obtained an average rating above 4 (polite: 4.67, rude: 4.67, friendly: 4.83, hostile: 4.33, encouraging: 4.83, authoritative: 4.33, passionate: 4.83, and indifferent: 4.50), showing that the descriptions were detailed and salient enough for participants to understand and distinguish the two poles of social attitudes suggested by the linguistic labels. During the experiment, participants were also asked to focus solely on the auditory features of the sound stimuli while making decisions.

Participants first completed the survey on the visual dimensions, then the tactile dimensions, and finally the interpersonal dimensions. This order followed the increasing abstractness of stimulus presentation, making it easier for participants to judge the sound stimuli first on directly depicted visual dimensions, then on tactile dimensions presented indirectly through pictures, and finally on the most abstract interpersonal dimensions presented through textual descriptions. This simple-to-complex order could reduce the initial cognitive load and the influence of frustration at the beginning of the task (Brosnan et al. 2021; Harrington et al. 2019).

It is important to note that the current study mainly used visual demonstrations of the instructions (pictures and linguistic labels) for the three target modalities. Features of the visual modality were presented directly through pictures, supported by linguistic labels. Although the stimuli in the tactile dimensions were not directly touchable, previous studies reported consistent cross-modal mappings across different presentation modes of target stimuli (imagined/indirect vs. actual) (Sidhu et al. 2022; Blazhenkova and Kumar, 2018; Doizaki et al. 2017). As for the interpersonal features, although participants did not interact with real people in a social setting, they were able to understand these features by reading textual descriptions and linguistic labels, as indicated by manipulations used in previous studies on sound-emotion associations (Yu et al. 2021; Aryani et al. 2020; Bradley and Lang, 1994). Therefore, apart from sound-vision associations, the use of visual stimuli (pictures, linguistic labels, or textual descriptions) to present features of other modalities has frequently been adopted in investigations of sound-touch and sound-emotion associations and is considered effective for sound-symbolic research.

Acoustic feature extraction

After annotating the steady state of the vowels, we extracted mean formants (F1–F4), mean fundamental frequency (F0), and duration using Praat. Formants and pitch were averaged across the entire steady state. For rimes containing one vowel, there was only one set of formants; for rimes containing two or three vowels/components, two or three sets of formants were extracted separately based on the annotated steady state of each component (Ji et al. 2022). For vowels with a nasal coda, formant extraction was restricted to the preceding vowel(s), and nasality was coded as a categorical variable (zero nasal [baseline], alveolar nasal, and velar nasal) (Wu et al. 2016).

Data analysis

The study separated the analyses of rimes containing monophthongs (monophthongs only / monophthongs + nasal codas) from those of rimes containing diphthongs (diphthongs only / diphthongs + nasal codas), as the analyses of diphthongs involved additional formant transitions between the two vowels. Trials with triphthongs were excluded from data analysis because of the small number of such stimuli. All statistical analyses were conducted in R (Version 4.2.0; R Core Team, 2022). The study fitted linear mixed-effects models to reveal the role of acoustic features in sound symbolic associations and ran mediation analyses to determine how different sound symbolic associations interact with each other. Machine learning models were built with the XGBoost algorithm.

LMEMs

The study ran linear mixed-effects models (LMEMs) with perceptual ratings as the dependent measure and fixed effects for F0, F1–F4, nasality, and duration. Participants’ sex was added to the model as a control variable. The function lmer() from the lmerTest package was used (Kuznetsova et al. 2017). In the analyses of diphthongs, the models included the fundamental frequency and the formants of the first vowel as fixed effects, plus the differences in formant values between the two vowels (formant frequency of the following vowel minus that of the preceding vowel, as in Eq. (1)). We used the Variance Inflation Factor (VIF) to index collinearity in the LMEMs. The VIF was below 4.621 for all variables in the monophthong model. In the diphthong model, the VIF initially reached 11.156 for F1 and 9.164 for ΔF1; following the criterion of keeping VIF below 10 (Marquaridt, 1970; Neter et al. 1989; O’brien, 2007), the difference in the first formant (ΔF1) was removed, after which all VIFs in the diphthong model were below 5.274. Participant and Item were included as random intercepts. Some variables were on very different scales, which might affect the numerical stability of model estimation (Gelman, 2008; Kims et al. 2005); therefore, the fundamental frequency and all formants were divided by 1,000. Duration was not rescaled as it was measured in seconds. After rescaling, all predictors ranged between −1.60 and 4.68. Because participants rated 12 perceptual aspects, the study ran 12 LMEMs for monophthong trials (Eq. (2)) and 12 for diphthong trials (Eq. (3)). All 252 p-values reported by the 24 LMEMs were adjusted using the Benjamini-Hochberg correction to account for multiple comparisons (Benjamini and Hochberg, 1995), with false discovery rate (FDR) levels set at 0.05 and 0.01.

$$\Delta f_i = f_{i_2} - f_{i_1},\quad i \in \{1, 2, 3, 4\}$$
(1)
$$\mathrm{lmer}(rating \sim 1 + f_0 + f_1 + f_2 + f_3 + f_4 + nasality + duration + sex + (1 \mid participant) + (1 \mid item))$$
(2)
$$\mathrm{lmer}(rating \sim 1 + f_0 + f_{1_1} + f_{2_1} + f_{3_1} + f_{4_1} + \Delta f_2 + \Delta f_3 + \Delta f_4 + nasality + sex + duration + (1 \mid participant) + (1 \mid item))$$
(3)
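
As a concrete illustration, the monophthong model in Eq. (2) could be fitted as in the following minimal R sketch, assuming a long-format data frame `dat` with one row per participant-by-item rating (the column names and the pooled p-value vector `all_p` are hypothetical):

```r
library(lmerTest)      # lmer() with Satterthwaite-approximated p-values
library(performance)   # check_collinearity() as one way to obtain VIFs

# Rescale frequency predictors (Hz -> kHz) for numerical stability
freq_cols <- c("f0", "f1", "f2", "f3", "f4")
dat[freq_cols] <- dat[freq_cols] / 1000

m_mono <- lmer(rating ~ 1 + f0 + f1 + f2 + f3 + f4 + nasality + duration +
                 sex + (1 | participant) + (1 | item), data = dat)
summary(m_mono)

check_collinearity(m_mono)  # VIFs for the fixed effects

# Benjamini-Hochberg correction pooled across the p-values of all 24 models
p_adj <- p.adjust(all_p, method = "BH")
```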

Machine learning models: XGBoost

To further investigate the performance of acoustic parameters in classifying the perceptual ratings, the study used XGBoost (eXtreme Gradient Boosting), a supervised machine learning algorithm, via the caret and xgboost packages in R (Kuhn, 2008; Chen et al. 2022; Jiang and Pell, 2018). Before running the models, the perceptual ratings were coded into dichotomous variables: the two lowest ratings (−2 and −1) were recoded as 0 and the two highest ratings (1 and 2) as 1. The medium rating (0) was excluded as it indicated that participants could not discriminate between the two perceptual contrasts. The predictor variables used in the linear mixed-effects analyses were included in the XGBoost models (except for participants’ sex, which was not a feature of the stimuli), and all tuning parameters were kept the same across models. To estimate prediction accuracy, the study implemented 5-fold cross-validation repeated ten times, a common practice for small samples. Because positive and negative cases were slightly unbalanced, the proportion of the majority class was taken as the chance level (50.8–66%). One-sample t-tests were used to examine whether accuracy exceeded the chance level (p-values corrected with the Benjamini-Hochberg procedure). Apart from the effect of the acoustic features in predicting sound symbolic phenomena, the study was also interested in the relationships among different dimensions, especially how the interpersonal perception of sound interacts with the visual or tactile perception of sound. To this end, XGBoost models trained on one dimension were tested on data from dimensions of the two other aspects (e.g., using the pattern from the visual dimension of spiky-round to predict the four tactile and four interpersonal dimensions). Given the large number of models, all hyperparameters were kept identical to make the results comparable across models. Further methodological details concerning the directionality of training and testing labels can be found in the Supplementary Information (“XGBoost analyses” section).
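
A within-dimension classifier of this kind could be set up as in the following minimal sketch, assuming a stimulus-level data frame `d` with the acoustic predictors and a dichotomized rating `label`; the tuning values shown in `tuneGrid` are illustrative placeholders, not the authors’ settings:

```r
library(caret)  # train() wraps the xgboost package via method = "xgbTree"

d$label <- factor(d$label, levels = c(0, 1), labels = c("low", "high"))

# 5-fold cross-validation repeated ten times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

fit <- train(label ~ f0 + f1 + f2 + f3 + f4 + nasality + duration,
             data = d, method = "xgbTree", trControl = ctrl,
             tuneGrid = expand.grid(nrounds = 100, max_depth = 3, eta = 0.3,
                                    gamma = 0, colsample_bytree = 1,
                                    min_child_weight = 1, subsample = 1))

# One-sample t-test of the resampled accuracies against a majority-class
# chance level (0.508 is simply the lower bound reported above)
t.test(fit$resample$Accuracy, mu = 0.508, alternative = "greater")
```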

Mediation and model comparison

Following the LMEMs and the XGBoost models, the study conducted mediation analyses and compared competing models using the mediation and lavaan packages in R (Tingley et al. 2014; Rosseel, 2012; Arlot and Celisse, 2010), in order to further characterize the relationships among different dimensions.

Results

LMEMs

The linear associations between each acoustic feature and the perceptual ratings are briefly reported in Table 2 (monophthongs) and Table 3 (diphthongs), in which only the β values and significance levels are shown. The overall model performance, random effects, t-values, and raw p-values of each linear mixed-effects model are reported in Table S2 (monophthongs) and Table S3 (diphthongs).

Table 2 Linear mixed-effects models for rimes with monophthongs (β-value).
Table 3 Linear mixed-effects models for rimes with diphthongs (β-value).

In the linear mixed-effects models of the visual dimensions, the formants most basic to characterizing vowels (F1 and F2) turned out to be significant predictors (Fig. 2). The distinction between spiky and round rimes with monophthongs depended on F2 (β = −0.944, p < 0.001), as back vowels were perceived to be rounder. For diphthongs, the F2 (β = −0.867, p < 0.001) of the initial portion and the ΔF2 (β = −0.929, p < 0.001) between the initial and final portions were strong indicators of spikiness/roundness: diphthongs whose initial portion displayed a lower F2 were considered rounder (e.g., /ua/ rounder than /ia/), and a smaller F2 difference between the two portions also pointed to roundness (e.g., /au/ rounder than /ai/). As for judging the size of rimes with monophthongs, participants relied on the F1 value (β = 1.590, p < 0.001) and duration (β = 3.156, p < 0.001): low vowels and vowels with longer duration were considered larger. The brightness of rimes containing monophthongs depended on both F1 (β = −1.763, p < 0.001) and F2 (β = −0.601, p < 0.001): front vowels and low vowels were considered brighter.

Fig. 2: Plots of the significant relationships between perceptual ratings of visual dimensions and LMEM parameters.

(Rimes with monophthongs (a–c): F1, F2, and duration; rimes with diphthongs (d–e): F2 and ΔF2). Each point in the plots stands for a single auditory item.

For the tactile perception of sounds, F2, ΔF2, duration, and nasality proved to be the four critical parameters (Fig. 3). When judging the roughness of rimes with diphthongs, participants relied on F2 (β = 0.439, p < 0.001), ΔF2 (β = 0.395, p = 0.002), and duration (β = 2.066, p = 0.003). Rimes with diphthongs starting with a back vowel were considered smoother than those starting with a front vowel, and a lower ΔF2 between the initial and final portions as well as a shorter duration also pointed to increased smoothness (e.g., /au/ smoother than /ai/). F2 was also found to influence the perception of temperature in rimes with monophthongs (β = −0.371, p < 0.001), in that front vowels were more likely than back vowels to be associated with coldness. In terms of the weight (light-heavy) of sounds, nasality worked as a critical predictor. Compared to monophthongs with no nasal, rimes containing monophthongs with velar nasals (β = 0.595, p < 0.001) were perceived as heavier, and monophthongs with alveolar nasals were perceived as lighter than those with velar nasals (t = −3.52, p < 0.001). Velar nasals (β = 0.355, p < 0.001) and longer duration (β = 2.509, p < 0.001) in rimes containing diphthongs were also associated with increased heaviness.

Fig. 3: Plots of the significant relationships between perceptual ratings of tactile dimensions and LMEM parameters.

(Rimes with monophthongs (a)–(b): F2 & cold-hot, Nasality & light-heavy; Rimes with diphthongs (c)–(f): F2 & smooth-rough, ΔF2 & smooth-rough, Nasality & light-heavy, Duration). Each point in the plots stands for a single auditory item.

As for the interpersonal perception of sounds, most acoustic parameters proved nonsignificant in characterizing the relationship between sound and attitude; only participants’ sex and nasality showed significant effects (Fig. 4). When judging the politeness of a monophthong, female participants tended to consider the sound stimuli politer than male participants did (β = −0.252, p = 0.003). The two interpersonal aspects susceptible to the influence of nasality were friendly-hostile and polite-rude: compared to diphthongs with zero nasals, those with alveolar nasals were considered politer (β = −0.709, p < 0.001) and friendlier (β = −0.447, p < 0.001), while those with velar nasals were also considered politer (β = −0.563, p = 0.001) (Fig. 4). The comparison between diphthongs with alveolar and velar nasals revealed no significant difference (t = −0.39, p = 0.69).

Fig. 4: Plots of the significant relationships between perceptual ratings of interpersonal dimensions and LMEM parameters.

(Rimes with monophthongs (a): Participants’ Sex & polite-rude; Rimes with diphthongs (b)–(c): Nasality & polite-rude, Nasality & friendly-hostile). Each point in the plots stands for a single auditory item.

XGBoost

The linear mixed-effects models identified the critical acoustic parameters for the multidimensional perception of sounds. Machine learning algorithms were further implemented to simulate sound symbolic associations based on these acoustic parameters.

In the 24 within-dimension models, the XGBoost algorithm achieved above-chance performance (chance level ranging from 50.8 to 66%) in 23 models (all except the smoothness-roughness perception of rimes with diphthongs) (Figs. 5 and 6). Using the given acoustic parameters, the XGBoost models could thus simulate humans’ multidimensional perception of sounds well. Details of model performance and the results of the t-tests for the 24 within-dimension models can be found in Table S4.

Fig. 5: Accuracy above chance-level for within- and cross-dimension tests (rimes with monophthongs).

The vertical axis stands for the training dimensions and the horizontal axis stands for the testing dimensions. Darker shades mean higher accuracy. The accuracies of the models with performance below the chance level were left blank.

Fig. 6: Accuracy above chance-level for within- and cross-dimension tests (rimes with diphthongs).

The vertical axis stands for the training dimensions and the horizontal axis stands for the testing dimensions. Darker shades mean higher accuracy. The accuracies of the models with performance below the chance level were left blank.

The study was also interested in the relationships among dimensions from different aspects (vision, touch, and attitude), particularly in how an XGBoost model trained on one aspect predicted the perceptual outcomes in other aspects (Figs. 5 and 6). Details of model performance and the results of the t-tests for these cross-dimension models can be found in Tables S5 and S6 (only significant tests are included).

The results showed that all visual dimensions except bright-dark managed to predict the weight perception of rimes with monophthongs. Only the models trained to simulate the size and brightness perception of rimes with monophthongs achieved above-chance performance in predicting the interpersonal perception of monophthongs (for the polite-rude and/or encouraging-authoritative aspects). In the diphthong group, all four visual dimensions generalized to at least one tactile dimension, with size perception predicting up to three tactile dimensions. In comparison, only one visual dimension, bright-dark, could be generalized to the interpersonal perception of politeness.

Among the models trained for the tactile perception of monophthongs, light-heavy and cold-hot successfully predicted three of the four visual dimensions; smooth-rough predicted only one visual dimension (small-large), while hard-soft predicted none. All tactile dimensions could be generalized to at least one interpersonal dimension, and smooth-rough and light-heavy even predicted three of the four interpersonal dimensions. As for the models trained on the diphthong group, all tactile dimensions succeeded in predicting one or two visual properties; the smooth-rough dimension extended to three of the four interpersonal properties and the light-heavy dimension generalized to two of them.

As for the interpersonal perception of Mandarin rimes, the four interpersonal dimensions of the monophthong trials and the four of the diphthong trials achieved above-chance performance in predicting one or two visual dimensions and the tactile model of light-heavy. The four models trained to simulate the interpersonal perception of rimes with diphthongs additionally predicted the tactile dimension of hard-soft.

In short, for the relationship between visual and interpersonal perception, most interpersonal dimensions generalized to one or two visual dimensions, while only one visual model in each of the monophthong and diphthong groups extended to the interpersonal dimensions, hinting at an asymmetry in the generalizability of these two aspects in predicting each other. For the touch-attitude relationship, both aspects could predict some dimensions of the other, pointing to a tighter connection than that of the vision-attitude relationship, especially between light-heavy and all four interpersonal dimensions. For the vision-touch relationship in the monophthong group, the tactile aspect generalized to visual dimensions better than the visual aspect generalized to tactile dimensions; in the diphthong group, all models extended to some dimensions of the other aspect, suggesting that the vision-touch connection is stronger for the diphthong group than for the monophthong group.

Mediation analysis

The LMEMs revealed a shared acoustic predictor, nasality, for the perception of weight, politeness, and friendliness in Mandarin diphthongs. The study therefore conducted mediation analyses to capture the relationships between diphthong-touch and diphthong-attitude mappings. In these analyses, the categorical nasality variable was recoded numerically: zero nasal was coded as −1, alveolar nasal as 0, and velar nasal as 1. Nasality and duration were set as the two predictors for the light-heavy dimension, as indicated by the LMEMs, while nasality was the only predictor for the polite-rude and friendly-hostile dimensions. In the tests of the weight-politeness and weight-friendliness relationships, we first treated weight perception as the mediator and then treated politeness/friendliness perception as the mediator, yielding two competing models for each relationship.
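
A sketch of the first direction (weight perception mediating the nasality-politeness link) is given below, assuming an item-level data frame `d` with numeric nasality, duration, and mean weight and politeness ratings; the column names and the use of simple lm() mediator/outcome models are illustrative assumptions:

```r
library(mediation)

# Mediator model: nasality (and duration) predict perceived weight
med_model <- lm(weight ~ nasality + duration, data = d)

# Outcome model: politeness regressed on nasality and the mediator
out_model <- lm(polite ~ nasality + weight + duration, data = d)

med <- mediate(med_model, out_model,
               treat = "nasality", mediator = "weight", sims = 1000)
summary(med)  # reports ACME (indirect), ADE (direct), and total effect
```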

We used the lavaan package in R to fit the four models shown in Fig. 7 and compared the AIC and BIC values of Model (a) and Model (b), as a difference larger than 10 suggests strong evidence in favor of the model with the lower AIC/BIC value (Burnham and Anderson, 2004). Model (a) was better than Model (b) for both relationships (weight-politeness (a): AIC = 8077.272, BIC = 8108.563; weight-politeness (b): AIC = 8113.282, BIC = 8144.573; weight-friendliness (a): AIC = 7893.112, BIC = 7924.403; weight-friendliness (b): AIC = 7931.948, BIC = 7963.239).
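
The model comparison could be coded as in the following minimal lavaan sketch for the weight-politeness pair, where Model (b) reverses the mediator and outcome roles (variable names are hypothetical, matching the sketch above):

```r
library(lavaan)

# Model (a): weight perception mediates the nasality -> politeness path
model_a <- '
  weight ~ a * nasality + duration
  polite ~ c * nasality + b * weight
  indirect := a * b        # ACME analogue
  total    := c + a * b    # total effect
'

# Model (b): politeness perception mediates the nasality -> weight path
model_b <- '
  polite ~ a * nasality
  weight ~ c * nasality + b * polite + duration
'

fit_a <- sem(model_a, data = d)
fit_b <- sem(model_b, data = d)

# A difference larger than 10 favors the model with the lower AIC/BIC
c(AIC(fit_a), AIC(fit_b))
c(BIC(fit_a), BIC(fit_b))
```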

Fig. 7: Competing models for mediation analysis.

Model (a) for both weight-politeness and weight-friendliness associations achieved lower AIC and BIC values than Model (b).

The mediation analyses showed a significant suppression effect in the weight-politeness (a) and weight-friendliness (a) models (Rucker et al. 2011), indicating that the diphthong-weight mapping negatively affected the formation of the diphthong-politeness and diphthong-friendliness mappings (Table 4). The total effect in both models was negative, showing that before considering the suppressor light-heavy, alveolar and velar nasals were perceived as politer and friendlier than zero nasals (though nonsignificantly so in the weight-friendliness (a) model). After accounting for the suppressor, the average direct effects (ADE) remained negative but became larger in magnitude. In contrast, the estimated average causal mediation effects (ACME) were positive in both models, indicating that increased nasality was related to greater perceived rudeness and hostility through the suppressor dimension light-heavy.

Table 4 Results of mediation analysis.

This mediation analysis suggested that when listeners reported the interpersonal impression of diphthongs, their judgment could be influenced by the relationship between diphthongs and weight perception and the relationship between weight and interpersonal perception.

The LMEMs also reported shared features for the visual and tactile dimensions, which enabled mediation analyses as well. The results can be found in the supplementary materials (Figs. S4–S6, Tables S7–S9).

Discussion

The present study collected perceptual ratings of Mandarin rimes on twelve dimensions involving visual, tactile, and interpersonal properties. Our findings complement relevant studies, few of which have collected sound symbolic ratings involving social attitudes. The distinctive vowel spectrum of Mandarin also allowed us to examine the roles played by formant transitions in diphthongs and by nasal codas in sound symbolism.

Effects of acoustic cues and listeners’ sex on sound symbolism in Mandarin

Our first research question was to pin down the significant acoustic parameters for the 12 sound symbolic associations. Formant transitions and nasal codas, segmental features under-explored in previous acoustic studies, proved important for the iconic perception of Mandarin rimes. Specifically, ΔF2 successfully predicted the roundness and smoothness of Mandarin rimes with diphthongs, which may build on the evidence that the second formant also affects the perceived roundness and smoothness of monophthongs (Etzi et al. 2016; Knoeferle et al. 2017; Blazhenkova and Kumar, 2018). Moreover, rimes containing diphthongs followed by an alveolar or velar nasal were perceived as heavier, politer, and friendlier than those without one. The link between nasals and perceived heaviness relates to a former study on Japanese, in which mimetics containing nasals were often used to express heaviness (Saji et al. 2013). As for the association between nasal codas and friendliness as well as politeness, a systematic analysis of Chinese children’s literature showed that nasals tended to appear in the names of positive characters, which might provide some evidence of their co-occurrence in the context of language use (Wang, 2022). However, as indicated by this corpus study, the association should then also hold between nasal codas and the other two interpersonal dimensions, encouragement and passion. Future studies are needed to examine this phenomenon in larger corpora and to test whether the co-occurrence between nasals and positivity exists only for politeness and friendliness but not for encouragement and passion. A small piece of supporting evidence for the connection between politeness and alveolar nasals may come from the use of “您” (/nín/, with an alveolar nasal coda: the polite second-person pronoun, usually used to address people of higher social status such as the elderly) versus “你” (/nǐ/, the common second-person pronoun) in the daily communication of native Mandarin speakers. But again, the relationship between nasal sounds and positive evaluations in Chinese needs more evidence from future studies.

In addition to F2 and ΔF2, we also identified important roles for other acoustic cues in sound symbolism. For example, F1 was found to influence judgments of size and brightness. As F1 is related to the high/low distinction of a vowel, this result is corroborated by previous studies investigating the role of this phonological feature in sound-size (Ohtake and Haryu, 2013; Thompson and Estes, 2011) and sound-brightness associations (Asano and Yokosawa, 2011). Lin et al. (2021) likewise supported the importance of duration in tactile perception, such as judgments of roughness and heaviness.
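Schematically, the LMEMs described here regress a perceptual rating on the acoustic cues with random effects for listeners. A minimal sketch in Python with statsmodels follows; the column names are hypothetical stand-ins, and the random-effects structure is simplified to a by-participant intercept.

```python
# Schematic LMEM: a perceptual rating predicted by acoustic cues, with
# listener sex as a control and a random intercept per participant.
# Column names are hypothetical stand-ins for the study's variables.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ratings.csv")  # hypothetical per-trial rating data

model = smf.mixedlm(
    "size_rating ~ F1 + F2 + duration + nasality + sex",
    data=df,
    groups=df["participant"],  # random intercept for each listener
)
result = model.fit()
print(result.summary())  # the F1 beta indexes the sound-size association
```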

It is important to note that, apart from nasality, all of the features chosen to represent each auditory stimulus were acoustic. Although some of the results can be supported by previous research using a limited set of phonological features, the combination of multiple acoustic features in the current study provided a more comprehensive and accurate description of sound symbolic relationships. The accuracy of the machine learning models demonstrated that the limited features chosen for model construction were sufficient to achieve above-chance prediction of the perceptual ratings, which showcases the effectiveness of using mainly acoustic features to characterize sound symbolic associations. Moreover, the current study included nasal rimes in Mandarin, which involve assimilation to the nasal coda. It would be inaccurate to describe nasal rimes with phonological features alone, as the same phoneme (with the same phonological features) is pronounced differently before an alveolar versus a velar nasal (e.g., /a/ in /an/ and /ang/). This difference, however, can be captured by acoustic features and further controlled in the statistical models. The current study thus supports the use of acoustic features in characterizing sound symbolic associations, and the results can be compared with studies of other languages.
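The above-chance check can be made concrete by comparing a cross-validated classifier trained on the acoustic features against a chance-level baseline, as in the hedged sketch below; the feature set and the binarized rating are illustrative, not the study's exact model.

```python
# Sketch of the above-chance test: cross-validated accuracy of a
# classifier trained on acoustic features versus a chance-level
# baseline. Features and the binarized label are illustrative only.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("ratings.csv")  # hypothetical data
X = df[["F1", "F2", "delta_F2", "duration", "nasality"]]
y = (df["round_rating"] > df["round_rating"].median()).astype(int)

model_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
chance_acc = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)

print(f"model accuracy:  {model_acc.mean():.2f}")
print(f"chance baseline: {chance_acc.mean():.2f}")  # model should exceed this
```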

In the LMEMs, participants’ sex was included as a control variable and proved significant in predicting polite-rude judgments of monophthongs: female listeners judged monophthongs to be politer than male listeners did. This is in line with previous studies showing that female listeners are more sensitive to social cues. For instance, when judging the level of confidence in utterances, female listeners better distinguished between confident and unconfident speech (Jiang and Pell, 2017; Jiang and Pell, 2016a; Jiang and Pell, 2015), and they were also more sensitive to incongruency between semantics and prosody in terms of confidence level (Jiang and Pell, 2016b). Future studies on sound symbolism could test individual attributes of participants, such as level of empathy, to further identify the reasons behind this sex difference in social judgment.

Mechanisms of sound symbolism: suppression effects and the Transitivity Proposal

Based on the findings of the linear mixed-effects models, we conducted mediation analyses on the tactile dimension (light/heavy), the interpersonal dimensions (politeness and friendliness), and their common acoustic predictor (nasality). The sense of touch evokes affective responses and is an effective tool for modulating interpersonal relationships (Schirmer and Adolphs, 2017; Schirmer et al. 2022). The role of heaviness as a suppressor of the nasality-attitude relationship may stem from shared sensory-motor representations between articulating nasal rimes and the sense of heaviness. According to motor theories of speech perception (Liberman and Mattingly, 1985; Stasenko et al. 2015; Franken et al. 2022), hearing alveolar and velar nasals could activate the articulatory mechanisms of nasal sound production. Producing non-nasal sounds typically involves only oral closure; in comparison, articulating alveolar and velar nasals requires oral closure plus additional velar lowering, and thus more muscular engagement, to achieve nasal resonance. Nasal sounds require the coordination of multiple articulators, such as the tongue and the velum, which increases the complexity and hence the effort of articulatory configuration. Therefore, the link between alveolar/velar nasals and heaviness could be grounded in the increased effort and complexity of the articulatory mechanism, paralleling the increased bodily effort experienced with heavier objects.

As for the tactile perception of weight, studies on embodied perception have reported that the bodily experience of weight can alter participants’ interpersonal judgments of people or events. For example, Kaspar (2013) found that participants holding heavier clipboards considered symptoms of diseases and side effects of drugs to be more serious. Hartmann et al. (2023) reported that participants with angry, sad, fearful, or depressed emotions tended to color human body silhouettes in ways indicating increased bodily heaviness. For Chinese participants specifically, Zhao et al. (2016) showed a sensitivity to pairing negative words with heaviness and positive words with lightness. These findings demonstrate how high-level cognitive processes interact with low-level sensory input; most relevant to the current study, heaviness has been associated with negative evaluations (here, rudeness and hostility).

Results of the mediation analyses and model comparisons showed that treating the tactile dimension as the mediator yielded better model fits, and a closer investigation of the models revealed that the perceptual rating on the light/heavy dimension worked as a suppressor between sound and social attitude mappings. Because significant indirect effects were indeed present, the current result patterns support a shared-property account of sound symbolic mappings. The Transitivity Proposal differs from the Emotion Mediation Hypothesis in that it allows low-level sensory dimensions to act as mediators, whereas the Emotion Mediation Hypothesis posits mediation by high-level affective factors. The current results therefore engage more with the Transitivity Proposal (French, 1977; Deroy et al. 2013) and cannot be accommodated by the Emotion Mediation Hypothesis. Further evidence supporting the Transitivity Proposal can be found in the supplementary information, which shows the mediating roles of low-level tactile or visual properties (Tables S7-S9).

It should be noted that we interpret the results of the mediation analyses with caution and propose only preliminary conjectures based on the current result patterns; future research needs to validate these results, as no previous study of sound symbolism has reported such a suppression effect. Though unprecedented, our results may add to the Transitivity Proposal by providing evidence for a suppression effect of weight perception. If the association between sound and social attitude were solely attributable to, or enhanced by, the shared tactile property, a full mediation effect should be observed and the indirect path should run in the same direction as the direct path. The suppression effect found here instead showed opposing forces of the indirect and direct paths, hinting at the possible coexistence of two mechanisms and an interaction between them. The mechanism underlying the indirect path could be the shared-property account, supported by the significant influence of shared weight perception on the effect of nasality on social attitudes. As for the direct path linking nasal codas with social attitudes, the underlying mechanism might be a social inferencing process based on the cooccurrence of nasals with referential expressions toward targets with specific characteristics, as indicated by Wang’s (2022) research on Chinese children’s literature. When participants judged the perceived politeness and friendliness of nasality, their decisions were largely dominated by the direct association between nasal codas and social attitudes but partly counteracted by the indirect association transmitted through weight perception. Had the mechanism underlying the indirect path not interfered with that of the direct path, no significant suppression effect would have been observed. This suppression effect may be among the first demonstrations that sound symbolic mechanisms not only reinforce but may also interfere with each other.
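In standard mediation notation (a restatement for clarity, not the study's own formulas), the suppression pattern can be summarized as follows:

```latex
% c: total effect of nasality on politeness/friendliness
% c': direct effect after including the suppressor (ADE)
% ab: indirect effect through perceived weight (ACME)
c = c' + ab
% Suppression: \operatorname{sign}(ab) \neq \operatorname{sign}(c'),
% hence |c| < |c'|: the direct effect grows in magnitude
% once the suppressor is entered into the model.
```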

To our knowledge, few studies of sound symbolism have reported such suppression effects. The study by Ketron and Spears (2019) supports our use of mediation analysis, but they focused on the interaction of two high-level cognitive sound symbolic associations (between vowels and arousal, and between vowels and consumer purchase decisions). Our study examined the relationship between low-level tactile and high-level interpersonal sound symbolism and uncovered a suppression effect, which may add to the existing understanding of sound symbolic mechanisms.

Limitations and future implications

Despite its innovations, the study has several limitations.

The suppressing role of low-level tactile perception in interpersonal sound symbolism should be evaluated with caution, as the current finding relied on statistical model comparison. It appears, however, that the finding cannot easily be attributed to the order of the questionnaires: a subsequent experiment testing the same materials with a similar order (tactile dimensions presented before interpersonal dimensions) did not observe the same suppression effect of tactile perception in interpersonal sound symbolism (Li et al. 2023). Nevertheless, future studies could reverse the order of the tactile and social attitude questionnaires to further examine the current results.

The prosody of the sound stimuli could also play a role. The current study used mainly emotionally neutral stimuli; however, emotionally laden stimuli may highlight the role of affective factors, making the high-level interpersonal dimension a mediator or a suppressor. Future studies are encouraged to include both neutral and emotional sound stimuli and to adopt more ecologically valid experimental settings with authentic tactile stimuli.

What’s more, the current study used naturally pronounced rimes, which preserved the ecological validity of the audio stimuli to a greater extent than experimentally manipulated ones. However, as multiple acoustic parameters covaried across rimes, these natural stimuli made the study less controlled. Although the confounding effects of different parameters were controlled by entering them into the same statistical model, which was effective in showing their distinct contributions (e.g., the β values estimated by the LMEM for different parameters could be compared directly), future studies could further validate the effects of the significant parameters using experimentally manipulated stimuli (e.g., adjusting only the F1 of a vowel while keeping the other acoustic parameters unchanged). Also, the intensity of the stimuli was scaled to an average of 70 dB and therefore not included as a predictor in the LMEMs. Given the important role of intensity in sound symbolic associations (Aryani et al. 2020; Eitan and Rothschild, 2011), further analysis of intensity with more naturally presented stimuli would also contribute to the acoustic representation of sound symbolism.
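For reference, the 70 dB normalization mentioned above corresponds to a single Praat operation; a minimal parselmouth sketch, with hypothetical file names, is shown below.

```python
# Minimal sketch of the 70 dB intensity normalization via parselmouth.
# File names are hypothetical.
import parselmouth

snd = parselmouth.Sound("rime_chain.wav")
snd.scale_intensity(70.0)               # set average intensity to 70 dB
snd.save("rime_chain_70dB.wav", "WAV")  # write the normalized stimulus
```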

Two native male Mandarin speakers were invited to record the sounds because the articulation of male speakers is more stable than that of female speakers (e.g., male speakers were found to have a more stable formant structure, while female speakers showed more variation) (Liu et al. 2023; Hou and Chen, 2019; Li and Chen, 2016; Ping, 2006). Future studies could include more speakers of both sexes to validate the results. Moreover, the homogeneity of the materials (all rimes) might have affected participants’ responses. Although the current study turned single rimes into pseudowords in the form of rime chains, future studies could add fillers with consonants to increase the heterogeneity of the materials.

Lastly, the study recruited only native Mandarin speakers, some of whom reported experience with different dialects. Although the diverse dialect backgrounds of the participants prevented the sound symbolic effects from being biased by any single dialect, future studies could examine the effect of dialect on sound symbolism by comparing two or more groups of participants from different dialect backgrounds. This may help answer the question of how different dialects affect the iconic perception of Mandarin rimes. Future studies could also quantify the effect of dialect by treating the acoustic distance between two dialects as an independent variable (Ran, 2020; Ran and Ding, 2023).
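One possible (hypothetical) operationalization of such an acoustic distance is the Euclidean distance between dialects' mean z-scored formant vectors, sketched below; the cited studies may define the measure differently.

```python
# A hypothetical operationalization of acoustic distance between two
# dialects: Euclidean distance between mean z-scored formant vectors.
# The cited studies may define the measure differently.
import numpy as np

def acoustic_distance(formants_a: np.ndarray, formants_b: np.ndarray) -> float:
    """Each input is an (n_tokens, n_features) array, e.g. [F1, F2, F3] per vowel."""
    pooled = np.vstack([formants_a, formants_b])
    mean, std = pooled.mean(axis=0), pooled.std(axis=0)
    za = (formants_a - mean) / std  # z-score against the pooled tokens
    zb = (formants_b - mean) / std
    return float(np.linalg.norm(za.mean(axis=0) - zb.mean(axis=0)))
```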

Conclusion

Our study is among the few attempts to compare visual, tactile, and interpersonal sound symbolic associations simultaneously. Taking advantage of Mandarin rimes, we demonstrated the effects of formant transitions and nasal codas in sound symbolism, which had seldom been investigated before. While visual perception depended mostly on the basic formants characterizing rimes, tactile perception additionally relied on duration and nasality; for interpersonal perception, only nasality proved significant. Machine learning models further confirmed the important roles of the selected rime features and achieved above-chance performance. Through model comparisons and mediation analyses, we found that the perceived weight of rimes worked as a suppressor in the diphthong-politeness and diphthong-friendliness associations. This suppression effect highlighted that the shared-property mechanism can interfere with other mechanisms, such as the language pattern account, in explaining sound symbolism. These findings bolster our understanding of the acoustic representations and working mechanisms underlying sound symbolism and carry implications for multi-modal perception in speech communication.