Brain Activity Related to Sound Symbolism: Cross-modal Effect of an Aurally Presented Phoneme on Judgment of Size

Sound symbolism is the idea that a sound makes a certain impression (e.g., phoneme “p” is associated with an impression of smallness) and could be the psychological basis of the word–meaning association. In this study, we investigated the neural basis of sound symbolism. Subjects were required to compare the visual sizes of standard and target stimuli while listening to syllables assumed to create either a larger or smaller impression. Stimulus–response congruence is defined as the agreement between the target size and the syllable’s impression. Behavioral data showed that the subjects displayed a longer reaction time under the incongruent condition than under the congruent condition, indicating that they tended to associate the object size with certain syllables. We used functional magnetic resonance imaging to evaluate the cerebral activity during the task, and found that both semantic- and phonetic-process-related areas of the brain (left middle temporal gyrus and right superior temporal gyrus, respectively) were activated under the incongruent condition. These results suggest that these regions are associated with the incongruence of sound symbolism.

www.nature.com/scientificreports www.nature.com/scientificreports/ modalities of stimulus presentation (visual vs auditory) affect the activities of the right temporal and other brain regions.
In this study, we examined the effect of a phoneme (the basic unit of sound within a word) on the judgment of the size of a visual stimulus and the related brain activity. Subjects were required to perform two types of task, a comparison task and a control task (Fig. 1). In the comparison task, they compared the visual sizes of standard and target stimuli. The target stimulus was either smaller or larger than the standard by ±5%, ±10%, or ±20% of its diameter, and was displayed to the subjects while they were listening to the syllable "bobo" or "pipi". Previous research into sound symbolism predicted a tendency to pair the former sound with a larger object and the latter with a smaller object 13. The control task was designed to prevent subjects from ignoring the sound stimuli.

Results
Figure 1. (A) Each trial started with a rest period (3,000 ms). The standard stimulus (520 ms) was then presented, followed by the target stimulus (520 ms) after a 300 ms interstimulus interval (ISI). Subjects were instructed to respond to the task during the response period (3,000 ms) via a button device, using the fingers of their left hand. A red cross was always presented as the fixation point. (B) Flow chart illustrating the experimental process. Subjects were instructed to push the middle button to respond to the control task and to push the left or right button to respond to the comparison task. Button assignment (which button represented a specific answer) in the comparison task was changed between subjects.
We analyzed the reaction times (RTs) for the comparison task. Several studies have shown that the phonemes "b", "d", "g", and "o" elicit "larger" reactions, whereas "p", "t", "k", and "i" elicit "smaller" reactions 13,19. Therefore, we assumed that "bobo" and "pipi" would induce "larger" and "smaller" impressions, respectively. We defined the congruent and incongruent conditions as follows and analyzed the data accordingly. Under the congruent condition, the target visual stimulus was consistent with the reaction to the sound (i.e., the larger target was presented with "bobo" or the smaller with "pipi"). Under the incongruent condition, the target visual stimulus was inconsistent with the reaction to the sound (i.e., the larger target was presented with "pipi" or the smaller one with "bobo").
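The congruence rule defined above can be written as a small lookup. The following is only a minimal sketch: the function name `classify_trial` and the signed-percentage encoding of the target size are illustrative assumptions, not part of the original analysis code.

```python
# Size impression assumed for each syllable, as stated in the text:
# "bobo" -> larger, "pipi" -> smaller.
SOUND_IMPRESSION = {"bobo": "larger", "pipi": "smaller"}


def classify_trial(target_sound: str, size_change_pct: int) -> str:
    """Label a comparison-task trial as 'congruent' or 'incongruent'.

    target_sound is the syllable heard with the target stimulus.
    size_change_pct is the signed target-vs-standard difference in percent
    (hypothetical encoding: -20, -10, -5, 5, 10, or 20).
    """
    if size_change_pct == 0:
        raise ValueError("the target must differ in size from the standard")
    target_relation = "larger" if size_change_pct > 0 else "smaller"
    if SOUND_IMPRESSION[target_sound] == target_relation:
        return "congruent"
    return "incongruent"
```

For instance, a +20% target heard with "pipi" is labeled incongruent, whereas the same target heard with "bobo" is labeled congruent.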
Eleven subjects performed well on the comparison task. The mean correct response rate was 94.1%. The differences in the RTs under the congruent and incongruent conditions are shown in Fig. 2. The mean RT decreased as the size difference between the targets increased from ±5% to ±20%. The mean RT under the incongruent condition was longer than that under the congruent condition for all target sizes. However, this difference was statistically significant only when the target was ±20% of the standard (t = −5.93, p < 0.001, t test with Bonferroni correction).
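As an illustration of the kind of test reported above, the sketch below runs a paired t test on per-subject mean RTs and applies a Bonferroni correction. Several points are assumptions rather than facts from the study: the RT values are synthetic, the number of comparisons (three, one per target-size level) is assumed, and the sign convention (congruent minus incongruent) is chosen so that slower incongruent responses give a negative t. Only the Python standard library is used; the p value is obtained by numerically integrating the Student t density.

```python
import math
import statistics


def paired_t(x, y):
    """Return (t, df) for a paired t test on the differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


def t_two_sided_p(t, df, steps=20000):
    """Two-sided p value via Simpson integration of the Student t density."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return coef * (1 + u * u / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)


def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_comparisons)


# Synthetic mean RTs (ms) for 11 hypothetical subjects; NOT data from the study.
congruent = [612, 580, 655, 701, 590, 634, 668, 720, 605, 643, 677]
incongruent = [640, 611, 670, 742, 615, 661, 690, 765, 628, 676, 700]

t_value, df = paired_t(congruent, incongruent)
p_adjusted = bonferroni(t_two_sided_p(t_value, df), n_comparisons=3)
print(f"t({df}) = {t_value:.2f}, Bonferroni-adjusted p = {p_adjusted:.4g}")
```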
In this study, we investigated the brain region associated with sound symbolism by contrasting the incongruent condition with the congruent. The bilateral anterior cingulate cortex (ACC) was activated under the incongruent minus congruent condition for all target sizes (Fig. 3A, Table 1).
The activation of the right lingual gyrus, left middle temporal gyrus (MTG), and right superior temporal gyrus (STG) was observed under the ±20% incongruent minus ±20% congruent condition (Fig. 3B, Table 2). The contrast (incongruent − congruent) for each target size condition in each of these brain regions is presented in Fig. 4. There was no significant activation under the ±20% congruent minus ±20% incongruent condition.
The contrast estimates (incongruent − congruent) for the right STG and left MTG under each condition were evaluated with a region-of-interest (ROI) analysis with leave-one-subject-out cross-validation, and are shown in Fig. 4, in which a positive value indicates stronger activation during the incongruent condition than during the congruent condition. The difference at ±20% was more prominent than that at ±5% or ±10% in each region.

Discussion
The RT under the congruent condition was shorter than that under the incongruent condition, suggesting that the effect of sound symbolism was observed under our experimental setting (the phoneme as a sound stimulus). However, the efficacy of sound symbolism differed with the target size, and only the ±20% condition yielded a significant difference between the congruent and incongruent conditions. Because the RTs were more strongly affected by the target size than by the congruent-incongruent difference (Fig. 2), it is not surprising that the effect of sound symbolism was minimal under some target conditions.
The bilateral ACCs were more strongly activated under the incongruent condition than under the congruent condition. The activation of the dorsal ACC has been associated with cognitively demanding tasks, typically involving response conflict, such as the Stroop task 20,21 . In this study, we observed a more pronounced BOLD response in the dorsal ACC, suggesting that the task performed under the incongruent condition was more cognitively demanding for the subjects. Together with the longer RTs, the activation of the dorsal region of the ACC may reflect the Stroop-like interference between the phoneme and the visual magnitude of the stimulus.
The left MTG was more strongly activated under the ±20% incongruent condition than under the ±20% congruent condition. The results of the ROI analysis of the peak region are plotted in Fig. 4. The left MTG has been identified in previous studies as a brain region related to semantic association (see review by Price) 22. This region is also very similar to the area that was activated under the incongruent condition in a priming study using EEG 23, in which congruence was based on the relationship between a picture and an environmental sound (i.e., an animal and its vocalization). That study demonstrated that the left MTG is involved in cross-modal semantic-matching processes. Our results imply that sound-symbolic matching between the target size and a phoneme is also processed in the MTG.
Figure 3B. Shown are the brain regions with greater activation during the ±20% incongruent condition than during the ±20% congruent condition (p < 0.005 with a cluster-level FWE correction of p < 0.05). Significant regions were superimposed on a standard brain template from MRIcro software.
The right STG was more strongly activated under the ±20% incongruent condition than under the ±20% congruent condition. The ROI analysis of the peak region (Table 2) revealed that the activation of the STG (incongruent minus congruent) became more prominent as the size difference increased from ±5% to ±20% (Fig. 4), indicating that the efficacy of the sound symbolism correlated with the activation of the region. Many previous studies have identified the STG as a primary region for speech perception 22. Interestingly, activities in the right STG have been associated with the incongruence between emotional prosodic cues and other information (speech content 24 and facial expression 25).
Our results suggest that the right STG is part of a brain network involved in processing conflict in phonemic sound symbolism in addition to emotional prosodic information.
Kanero and colleagues (2014) reported that the right posterior STG was more strongly activated when subjects were evaluating matched pairs of mimetic words and moving images than when they were evaluating mismatched pairs. However, our analysis detected no significant activation in this region. Although the functional localization of the right STG is still contentious, several researchers have reported that the anterior region is strongly associated with phoneme perception 26,27. The stimulus difference (mimetic words in Kanero's study versus phonemes in our study) could contribute to the difference in the activated area. In addition to the stimulus difference, the tasks differed: they asked their subjects to actively report their impression of the mimetic words, whereas our subjects were engaged in a visual discrimination task and were not asked to report their impression of the phoneme. The observed differences in activation area could be attributable to these task differences. Further research is required to investigate the sound symbolic effect of both phonemes and mimetic words within the same experimental paradigm, to detect the functional segregation, if any, of the right STG into the different aspects of sound symbolism.
There are several limitations of this study. The number of subjects is relatively small, and only two types of speech sounds were tested; therefore, this work should be treated as a pilot study, and caution is needed in drawing general conclusions about the neural processing of sound symbolism. Nevertheless, we believe that our data indicate the involvement of bilateral temporal regions (i.e., the right STG and left MTG) in the sound-meaning association.

Materials and Methods
Subjects. Fourteen subjects (four females and ten males; aged 21-26 years) participated in the fMRI experiment after they had provided their written informed consent. All the subjects were right-handed native Japanese speakers. None of the participants had any knowledge of sound symbolism or the experiment. The data from three participants were excluded because of artifacts (e.g., head movement >3 mm) or inadequate task performance. The experimental protocol was approved by the Research Ethics Committee on human subjects of Doshisha University, and the study was carried out in accordance with the guidelines of the committee.
Figure 4. Contrast estimate for each target size, calculated by subtracting the congruent from the incongruent condition, in the right superior temporal gyrus and left middle temporal gyrus. The ROIs were spheres with a radius of 4 mm. Error bars represent standard errors of the means. The contrast estimate was greater for a difference of ±20% than for a difference of ±5% or ±10% in each region.
to respond. Experiment control software (Presentation ® ; Neurobehavioral Systems, Inc., Albany, CA, USA) was used to synchronize the experimental procedures with the fMRI scans.
Visual stimuli. In this experiment, we examined the effect of sound symbolism when judging the size of a visual stimulus. The visual stimulus, a gray circle that looked like a doughnut, was presented on a frosted screen at the end of the scanner bore, and could be seen by the subject via a mirror mounted on the head coil. The standard stimulus had an outer circle of 300 pixels and an inner circle of 280 pixels. The target stimulus was either smaller or larger than the standard stimulus by ±5%, ±10%, or ±20% of its diameter. In total, seven sizes (one for the standard and six for the targets) were used. Each stimulus was presented twice for 200 ms, with an interstimulus interval (ISI) of 120 ms (Fig. 1A), giving a total presentation time of 520 ms. A red cross (34 pixels) was always presented as the fixation point at the center of the screen.
Sound stimuli. The sound stimuli were "bobo" and "pipi". A publicly available sound dataset (FW03; NTT Communication Science Laboratories, Kanagawa, Japan) was used to create the sounds. All sounds were recorded at a sampling frequency of 48 kHz and a quantization of 16 bits. The single-syllable utterances "bo" and "pi" were spoken by a male, and these were duplicated to produce the sound stimuli "bobo" and "pipi", respectively. These sound stimuli had no accent and therefore were not similar to any word. The duration of the sound was 520 ms, and the stimulus amplitude was 64 dB sound pressure level (SPL). According to previous research, the sound "pi" was louder than "bo" by about 1.4 dB in subjective loudness 28. The sound stimulus was synchronized with the visual stimulus. In an additional experiment, the participants were asked to evaluate the size impression of each sound stimulus by selecting labeled pictures of five Russian nested dolls of different sizes (doll 5 was the largest) 8,11. The mean scores for "bobo" and "pipi" were 4.0 and 2.3, respectively, and no participant ever scored "bobo" smaller than "pipi".
Therefore, we confirmed that "bobo" created a larger impression than "pipi".
fMRI parameters. Functional images of brain activity were acquired with a 1.5-T MRI system (Echelon Vega, Hitachi Medical Corporation, Tokyo, Japan) as T2*-weighted images using a gradient-echo echo-planar imaging (GE-EPI) sequence with a resolution of 3 × 3 × 5 mm voxels (30 axial
Procedure. The subjects were asked to judge the difference in the sizes of the standard and target stimuli 29. Each trial began with a 3,000 ms rest period. The standard stimulus was then presented for 520 ms, followed by a 300 ms ISI. The target stimulus was then presented for 520 ms, followed by a response period of 3,000 ms (Fig. 1A). The screen was black during the rest and response periods. After the subject had responded to the task by pressing a button, the next trial began automatically. When the sound (sound 2) presented with the target stimulus was identical to the sound (sound 1) presented with the standard stimulus, the subject had to press the middle button, regardless of the visual stimulus (control task). When the sound stimulus presented with the target stimulus differed from that presented with the standard stimulus, the subject had to respond according to whether the target circle was smaller or larger than the standard (comparison task; Fig. 1B). In the comparison task, the subjects used their index and ring fingers to press the left and right buttons, respectively. The button assignment (i.e., which button represented which answer) was changed between subjects. There were six combinations of visual stimuli (one standard × six targets) and four combinations of sound stimuli (two sound 1 × two sound 2), yielding a total of 24 stimulus combinations. The entire stimulus set was randomized to create one block (24 trials), and each session consisted of three blocks. In the behavioral experiment, each subject completed one session (72 trials).
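The trial structure described in this procedure can be sketched as follows. This is an illustrative reconstruction, not the Presentation script used in the study: the function names, the dictionary layout, the random seed, and the assumption that the 300-pixel outer circle scales linearly with the stated percentage are all hypothetical.

```python
import itertools
import random

SIZE_CHANGES = [-20, -10, -5, 5, 10, 20]  # target vs. standard, percent of diameter
SOUNDS = ["bobo", "pipi"]
STANDARD_OUTER_PX = 300                   # outer circle of the standard stimulus

# Phase durations from the text; the response period ends early on a button press.
TRIAL_PHASES_MS = [("rest", 3000), ("standard", 520), ("isi", 300),
                   ("target", 520), ("response", 3000)]


def make_block(rng):
    """One block: every size change crossed with every (sound 1, sound 2) pair."""
    sound_pairs = itertools.product(SOUNDS, SOUNDS)
    trials = []
    for change, (sound1, sound2) in itertools.product(SIZE_CHANGES, sound_pairs):
        trials.append({
            "size_change_pct": change,
            "target_outer_px": STANDARD_OUTER_PX * (100 + change) / 100,
            "sound_standard": sound1,
            "sound_target": sound2,
            # Identical sounds -> control task (middle button);
            # different sounds -> comparison task (left or right button).
            "task": "control" if sound1 == sound2 else "comparison",
        })
    rng.shuffle(trials)
    return trials


def make_session(seed=0, n_blocks=3):
    """A session is three independently randomized blocks (72 trials)."""
    rng = random.Random(seed)
    return [trial for _ in range(n_blocks) for trial in make_block(rng)]


session = make_session()
max_trial_ms = sum(duration for _, duration in TRIAL_PHASES_MS)
print(len(session), "trials; at most", max_trial_ms, "ms per trial")
```

Under this design, half of the trials in each block are control trials, because two of the four sound pairings repeat the same syllable.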
Analysis. The functional imaging data were processed with the SPM12 software (Wellcome Department of Cognitive Neurology, London, UK). After realignment, the images were spatially normalized to an EPI template in Montreal Neurological Institute (MNI) space, and smoothed with an 8 mm full width at half maximum Gaussian kernel. To examine the brain activation associated with phonetic symbolism, we created different types of contrast images (congruent minus incongruent for all target sizes combined, incongruent minus congruent for all target sizes combined, congruent minus incongruent for ±20% target sizes, and incongruent minus congruent for ±20% target sizes). These effects were modeled as stick functions convolved with the canonical hemodynamic response function. The movement parameters of the realignment corrections were included in the model as covariates of no interest. Activated areas were depicted on a standard human brain template from MRIcro software (version 1.40, www.mccauslandcenter.sc.edu/crnl/mricro).
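To make the phrase "stick functions convolved with the canonical hemodynamic response function" concrete, the sketch below builds a predicted response from event onsets. It is not SPM12 code. The double-gamma shape used here (a response gamma with shape 6 minus one sixth of an undershoot gamma with shape 16) is a common approximation of the canonical response and is an assumption, as are the 1 s sampling grid, the 32 s kernel length, and the example onsets.

```python
import math


def gamma_pdf(t, shape, scale=1.0):
    """Gamma probability density; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)) / (math.gamma(shape) * scale ** shape)


def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=6.0):
    """Assumed double-gamma approximation of a canonical HRF (t in seconds)."""
    return gamma_pdf(t, peak_shape) - gamma_pdf(t, undershoot_shape) / ratio


def convolve_sticks(onsets_s, n_samples, dt=1.0, hrf_length_s=32.0):
    """Convolve unit sticks placed at the onsets with the sampled HRF."""
    kernel = [double_gamma_hrf(k * dt) for k in range(int(hrf_length_s / dt) + 1)]
    sticks = [0.0] * n_samples
    for onset in onsets_s:
        index = int(round(onset / dt))
        if 0 <= index < n_samples:
            sticks[index] += 1.0
    predicted = [0.0] * n_samples
    for i, stick in enumerate(sticks):
        if stick == 0.0:
            continue
        for k, h in enumerate(kernel):
            if i + k < n_samples:
                predicted[i + k] += stick * h
    return predicted


# Hypothetical onsets (s) of incongruent-condition targets, for illustration only.
regressor = convolve_sticks([0.0, 10.0], n_samples=40)
```

In a general linear model, one such regressor per condition would be entered alongside the movement parameters, and a contrast such as incongruent minus congruent would be computed from the fitted weights.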
To further clarify the effect size, we used an ROI analysis of the contrast estimates in the right STG and left MTG. Each ROI was constructed with leave-one-subject-out cross-validation to avoid any bias arising from nonindependence 30. Briefly, each subject was left out of the group in turn. The group analysis of the remaining N − 1 subjects provided the peak coordinates that defined the ROIs (spheres with a 4 mm radius) for the subject who was left out, and the average responses of that subject within the ROIs in the right STG and left MTG were estimated.
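The leave-one-subject-out logic can be sketched on toy data as follows. This is a simplified illustration under stated assumptions: each subject's contrast image is represented as a dictionary from voxel index to contrast value, the group peak is taken as the voxel with the largest mean contrast (the study used group-level SPM analyses), and the toy maps are synthetic. The 3 × 3 × 5 mm voxel size and 4 mm radius follow the text.

```python
import math

VOXEL_MM = (3.0, 3.0, 5.0)  # voxel size reported in the text


def to_mm(voxel):
    """Convert a voxel index to millimetre coordinates (voxel centres)."""
    return [index * size for index, size in zip(voxel, VOXEL_MM)]


def peak_voxel(group_maps):
    """Voxel with the largest mean contrast across the given subjects."""
    voxels = group_maps[0].keys()
    return max(voxels, key=lambda v: sum(m[v] for m in group_maps) / len(group_maps))


def sphere_mean(subject_map, center, radius_mm=4.0):
    """Mean contrast over voxels whose centres lie within radius_mm of center."""
    inside = [value for voxel, value in subject_map.items()
              if math.dist(to_mm(voxel), to_mm(center)) <= radius_mm]
    return sum(inside) / len(inside)


def loso_roi_estimates(subject_maps):
    """For each subject, locate the ROI from the other N - 1 subjects only."""
    estimates = []
    for i, held_out in enumerate(subject_maps):
        others = subject_maps[:i] + subject_maps[i + 1:]
        center = peak_voxel(others)  # the held-out subject never influences the ROI
        estimates.append(sphere_mean(held_out, center))
    return estimates


def toy_map(peak_value):
    """Synthetic 3 x 3 x 3 contrast map with a single active voxel (illustration only)."""
    grid = {(x, y, z): 0.0 for x in range(3) for y in range(3) for z in range(3)}
    grid[(1, 1, 1)] = peak_value
    return grid


estimates = loso_roi_estimates([toy_map(v) for v in (1.0, 2.0, 3.0)])
```

Because the ROI for each subject is located without using that subject's data, the averaged estimates are not biased by the voxel selection step, which is the purpose of the cross-validation described above.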