Speech Prosodies of Different Emotional Categories Activate Different Brain Regions in Adult Cortex: an fNIRS Study

Emotional expressions of others embedded in speech prosodies are important for social interactions. This study used functional near-infrared spectroscopy to investigate how speech prosodies of different emotional categories are processed in the cortex. The results demonstrated several cerebral areas critical for emotional prosody processing. We confirmed that the superior temporal cortex, especially the right middle and posterior parts of the superior temporal gyrus (BA 22/42), primarily works to discriminate between emotional and neutral prosodies. Furthermore, the results suggested that categorization of emotions occurs within a high-level brain region, the frontal cortex, since the brain activation patterns were distinct when positive (happy) prosody was contrasted with negative (fearful and angry) prosody in the left middle part of the inferior frontal gyrus (BA 45) and the frontal eye field (BA 8), and when angry prosody was contrasted with neutral prosody in bilateral orbital frontal regions (BA 10/11). These findings verified and extended previous fMRI findings in the adult brain and also provided a “developed version” of brain activation for our subsequent neonatal study.

the only exception, see the fMRI study 22 which found the activation was stronger for positive relative to negative prosody). Furthermore, although it is well known that the activation pattern of the human brain is not the same for all emotions 23,24 , the question of how verbal expressions of different emotional categories elicit activation in temporal and frontal regions has scarcely been investigated 8 (for the only exception, see the fMRI study by Kotz et al., 15 who found that the bilateral superior middle frontal gyrus had enhanced activation for angry relative to neutral prosody while the left IFG had enhanced activation for happy relative to neutral prosody). In addition, fMRI studies on the effects of emotional sounds are unavoidably contaminated by the gradient noise of the scanner, so fMRI-based results need to be verified and complemented by a silent imaging method such as functional near-infrared spectroscopy (fNIRS) 25 . However, as far as we know, speech prosody has never been investigated using the fNIRS technique, and there are only three relevant fNIRS studies, which examined nonverbal expressions or nonhuman sounds [25][26][27] . Therefore, the first aim of the present study was to provide fNIRS-based knowledge of how speech prosodies of different emotional categories elicit activation in the adult brain.
Another purpose of the current study was to provide a "developed version" of auditory response pattern to an on-going neonatal experiment in our lab. It is worth stressing that the use of fNIRS is irreplaceable for this purpose, because alternative methods such as fMRI and electroencephalography (EEG) cannot map the brain activation of conscious newborns with a high spatial resolution. To further make the results comparable between this study and the neonatal one, we required the adult subjects in this study to passively listen to affective prosodies because passive listening is the only feasible task for neonates (see neonatal studies 28,29 ). Furthermore, since speech comprehension is largely immature in neonates' undeveloped brain, we used semantically meaningless pseudosentences in these two studies so as to provide subjects with only prosody rather than both prosody and semantic information.
It was expected that while the voice-sensitive regions in the STC (including the primary/secondary AC) would be strongly activated by prosodies irrespective of emotional valence 5,8 , frontal regions such as IFC and OFC may have a crucial role in discrimination of verbal expressions of different emotional categories 7,9 . Since there is little knowledge of the brain activity associated with different categories of affective prosody, no hypothesis was made regarding the exact (if any) frontal areas that take part in decoding distinct affective cues embedded in happy, angry and fearful prosodies.

Methods
Participants. Twenty-two healthy subjects (12 females; age range = 18-24 years, 20.8 ± 0.4 years (mean ± std)) were recruited from Shenzhen University as paid participants. All subjects were right-handed and had normal hearing ability. Written informed consent was obtained prior to the experiment. The experimental protocol was approved by the Ethics Committee of Shenzhen University and this study was performed strictly in accordance with the approved guidelines.
Stimuli. The emotional prosodies were selected from the Database of Chinese Vocal Emotions 30 . The database consists of "language-like" pseudosentences in Mandarin Chinese, which were constructed by replacing content words with semantically meaningless words (i.e. pseudowords) while maintaining function words to convey grammatical information. All pseudosentences had the same structure (subject + predicate + object). The duration of each pseudosentence was approximately 1 to 2 sec.
Four kinds of emotional prosodies, i.e., fearful, angry, happy and neutral prosodies, were examined in this study. To construct four 15-sec segments for the four emotional conditions, we separately concatenated 11, 11, 8 and 9 pseudosentences of fearful, angry, happy and neutral prosodies, respectively. Among these pseudosentences, 6 had the same construction (but different emotions) across the four conditions. The mean speech rates of the four kinds of prosodies were 6.33, 6.53, 5.07 and 5.27 syllables/sec, respectively. The number of syllables for the four kinds of prosodies was 9.5 ± 1.0, 8.9 ± 1.8, 9.5 ± 1.6 and 8.8 ± 0.83 per sentence (mean ± std). All the selected emotional prosodies were spoken by female native Mandarin Chinese speakers, and the mean intensity was equalized. Before the experiment, the emotion recognition rate (mean = 0.80; select one emotion label from anger, happiness, sadness, fear, disgust, surprise, and neutral) and emotional intensity (5-point scale, mean = 3.1) were counterbalanced among the four conditions (the two measurements were from the database 30 ). After the fNIRS recording, all the participants were required to classify each prosodic pseudosentence into one of four emotion categories. The mean recognition rate was 0.99 ± 0.04, 0.95 ± 0.08, 0.94 ± 0.08 and 0.97 ± 0.06 for angry, fearful, happy and neutral prosodies, respectively.
Procedure. Sounds were presented via two speakers (R26T, EDIFIER, Dongguan, China) approximately 50 cm from the participants' head. The speaker sound had a sound pressure level (SPL) of 60 to 70 dB (1353S, TES Electrical Electronic Corp., Taipei, Taiwan). The mean background noise level (without prosody presentation) was 30 dB SPL.
The experiment lasted for 25 min (Fig. 1). Resting-state NIRS data were first recorded for 5 min (eyes open), followed by a 20-min passive listening task. Each of the four 15-sec segments (corresponding to the four emotions) was repeated ten times. Thus there were 40 blocks in the study, which were presented in a random order. The inter-block interval (silent period) varied randomly between 14 and 16 sec.
Data recording. The NIRS data were recorded in continuous-wave mode with the NIRScout 1624 system (NIRx Medical Technologies, LLC, Los Angeles, USA), which consisted of 16 LED emitters (intensity = 5 mW/wavelength) and 23 detectors at two wavelengths (760 and 850 nm). Based on previous findings 6,7 , we placed optodes over the frontal and temporal regions of the brain, using a NIRS-EEG compatible cap (EASYCAP, Herrsching, Germany) with respect to the international 10/5 system (Figs. 2A and 3). There were 54 useful channels (Fig. 2B), where source and detector were at a mean distance of 3.2 cm (range = 2.8 to 3.6 cm) from each other. The data were continuously sampled at 4 Hz. Detector saturation never occurred during the recording.
To evaluate the cortical structures underlying the NIRS channels, the Matlab toolbox NFRI (http://brain.job.affrc.go.jp/tools/) 31 was used to estimate the MNI coordinates of optodes with respect to the EEG 10/5 positions. The locations of NIRS channels were defined at the central zone of the light path between each adjacent source-detector pair (Table 1).
Data preprocessing. The data were processed with the nirsLAB analysis package (v2016.05, NIRx Medical Technologies, LLC, Los Angeles, USA). Four of the 22 datasets were excluded because the intensity (in volts) of more than 5 channels showed low values (the gain setting of the NIRx device >7). Thus a total of 18 datasets were analyzed in this study.
There are mainly two forms of movement artifacts in NIRS data, i.e., transient spikes and abrupt discontinuities. First, spikes were smoothed by a semi-automated procedure that replaces contaminated data by linear interpolation. Second, discontinuities (or "jumps") were automatically detected and corrected by nirsLAB (std threshold = 5). Third, a band-pass filter (0.01 to 0.2 Hz) was applied to attenuate slow drifts and high-frequency noise such as respiratory and cardiac rhythms. Then the intensity data were converted into optical density changes (ΔOD) (refer to the supplementary material for the detailed procedure), and the ΔOD of both measured wavelengths were transformed to relative concentration changes of oxyhemoglobin and deoxyhemoglobin (Δ[HbO] and Δ[Hb]) by employing the modified Beer-Lambert law 32 . The source-detector distance of the first channel was 3.1 cm, and the exact distance of the other 53 channels was calculated by nirsLAB according to optode locations. The differential path length factor was assumed to be 7.25 for the wavelength of 760 nm and 6.38 for the wavelength of 850 nm 33 .
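The conversion from optical density to hemoglobin concentration described above can be sketched as a 2 × 2 linear solve. This is only an illustration of the modified Beer-Lambert law, not the nirsLAB implementation. The function name `mbll` and the extinction coefficients in the round-trip check are hypothetical placeholders chosen to exercise the arithmetic; the source-detector distance (3.1 cm) and the differential path length factors (7.25 and 6.38) are the values stated in the text.

```python
def mbll(d_od_760, d_od_850, eps_760, eps_850,
         distance_cm=3.1, dpf_760=7.25, dpf_850=6.38):
    """Solve the two-wavelength modified Beer-Lambert system.

    eps_760 and eps_850 are (eps_HbO, eps_Hb) pairs at each wavelength.
    Returns (d_hbo, d_hb); units follow those of the extinction coefficients.
    """
    # Effective optical path length at each wavelength.
    path_760 = distance_cm * dpf_760
    path_850 = distance_cm * dpf_850
    # d_od(wavelength) = (eps_HbO * d_hbo + eps_Hb * d_hb) * path(wavelength)
    a11, a12 = eps_760[0] * path_760, eps_760[1] * path_760
    a21, a22 = eps_850[0] * path_850, eps_850[1] * path_850
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("extinction matrix is singular")
    # Cramer's rule for the 2x2 system.
    d_hbo = (a22 * d_od_760 - a12 * d_od_850) / det
    d_hb = (a11 * d_od_850 - a21 * d_od_760) / det
    return d_hbo, d_hb


# Hypothetical extinction coefficients (placeholders, NOT from the paper).
EPS_760 = (0.6, 1.5)
EPS_850 = (1.1, 0.7)

# Round trip: simulate optical density changes from known concentration
# changes, then recover them with mbll().
true_hbo, true_hb = 0.002, -0.001
od_760 = (EPS_760[0] * true_hbo + EPS_760[1] * true_hb) * 3.1 * 7.25
od_850 = (EPS_850[0] * true_hbo + EPS_850[1] * true_hb) * 3.1 * 6.38
est_hbo, est_hb = mbll(od_760, od_850, EPS_760, EPS_850)
```

Passing the extinction coefficients as arguments keeps the sketch independent of any particular tabulated spectrum; a real analysis would take them from a published table.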

Statistical analyses. Statistical significance of concentration changes was determined based on a general linear model of the canonical hemodynamic response function (parameters in nirsLAB = [6 16 1 1 6 0 32]), with a discrete cosine transformation used for temporal filtering (high-pass cutoff period = 128 sec). Although both Δ[HbO] and Δ[Hb] signals were obtained, we only chose Δ[HbO] to perform statistical analyses due to its superior signal-to-noise ratio relative to Δ[Hb]. When estimating beta, nirsLAB used an SPM-based algorithm (restricted maximum likelihood) to compute a least-squares solution to an overdetermined system of linear equations.
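As an illustration of the regressor that enters such a general linear model, the sketch below builds a double-gamma hemodynamic response function and convolves it with a 15-sec block. It assumes that the parameter vector [6 16 1 1 6 0 32] follows the SPM convention (delay of response, delay of undershoot, the two dispersions, response-to-undershoot ratio, onset, and kernel length in seconds); that reading, the unit-sum normalisation, and the function names `gamma_pdf`, `canonical_hrf` and `block_regressor` are assumptions of this sketch rather than details confirmed by the text. The 0.25-sec step corresponds to the 4 Hz sampling rate.

```python
import math


def gamma_pdf(t, shape, scale=1.0):
    """Gamma probability density; zero for t <= 0."""
    if t <= 0:
        return 0.0
    x = t / scale
    return x ** (shape - 1) * math.exp(-x) / (math.gamma(shape) * scale)


def canonical_hrf(dt=0.25, peak=6.0, undershoot=16.0, disp_peak=1.0,
                  disp_under=1.0, ratio=6.0, onset=0.0, length=32.0):
    """Double-gamma HRF sampled every dt seconds, normalised to unit sum."""
    n = int(length / dt) + 1
    hrf = []
    for i in range(n):
        t = i * dt - onset
        response = gamma_pdf(t, peak / disp_peak, disp_peak)
        dip = gamma_pdf(t, undershoot / disp_under, disp_under) / ratio
        hrf.append(response - dip)
    total = sum(hrf)
    return [h / total for h in hrf]


def block_regressor(onsets_s, duration_s, n_samples, dt=0.25):
    """Convolve a boxcar (1 during each block, 0 elsewhere) with the HRF."""
    boxcar = [0.0] * n_samples
    for onset in onsets_s:
        start = int(round(onset / dt))
        stop = min(n_samples, int(round((onset + duration_s) / dt)))
        for i in range(start, stop):
            boxcar[i] = 1.0
    hrf = canonical_hrf(dt)
    regressor = [0.0] * n_samples
    for i in range(n_samples):
        acc = 0.0
        for k in range(min(i + 1, len(hrf))):
            acc += boxcar[i - k] * hrf[k]
        regressor[i] = acc
    return regressor


# One hypothetical 15-sec block starting 10 sec into a 60-sec recording.
example = block_regressor([10.0], 15.0, n_samples=240)
```

In a full analysis, one such regressor per condition would form the columns of the design matrix whose beta weights are then estimated.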
To statistically analyze the data, we first performed a one-way ANOVA on the beta values associated with Δ[HbO] (five levels: silence, neutral, fearful, angry and happy prosody), resulting in a thresholded (corrected p < 0.05) F-statistic map. Six follow-up pairwise comparisons were then performed, restricted to the significant channels revealed by the thresholded F-statistic map. This study was interested in the Δ[HbO] difference between (1) prosody and silence, (2) emotional and neutral prosody, (3) positive and negative prosody, (4) happy and neutral prosody, (5) angry and neutral prosody, and (6) fearful and neutral prosody. The first two pairwise comparisons were used to verify and replicate the results of previous relevant studies; the last four were designed to explore activation differences between different emotional prosodies. The statistical results in individual channels were corrected for multiple comparisons across channels by the false discovery rate (FDR), following the Benjamini and Hochberg 34 procedure implemented in Matlab (v2015b, The MathWorks, Inc., Natick, USA).
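The Benjamini and Hochberg step-up correction mentioned above can be sketched in a few lines. The study used a Matlab implementation; the Python function below is an independent illustration, and its name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical uncorrected p-values for four channels.
channel_p = [0.01, 0.04, 0.03, 0.005]
corrected = benjamini_hochberg(channel_p)
significant = [p < 0.05 for p in corrected]
```

A channel is declared significant when its adjusted p-value falls below the chosen FDR level (0.05 in the text).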

Waveform visualization. In addition to statistic maps, we also displayed waveforms of Δ[HbO] and Δ[Hb] in the four emotional conditions (Figure S1 in supplementary material). This study considered Δ[HbO] and Δ[Hb] in a time window from −5 to 25 sec relative to the onset of emotional prosodies. The mean concentration in the 5 sec immediately before each block (i.e., −5 to 0 sec) was used as baseline (see also other studies [35][36][37] ).
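The epoching and baseline subtraction described above can be sketched as follows, assuming the 4 Hz sampling rate given in the Methods, so that the 5-sec baseline spans 20 samples and the full window spans 120 samples. The function name `baseline_correct` and the toy signal are hypothetical.

```python
def baseline_correct(signal, onset_index, fs=4.0, pre_s=5.0, post_s=25.0):
    """Cut an epoch from -pre_s to +post_s around a block onset and
    subtract the mean of the pre-onset samples."""
    n_pre = int(round(pre_s * fs))
    n_post = int(round(post_s * fs))
    start = onset_index - n_pre
    stop = onset_index + n_post
    if start < 0 or stop > len(signal):
        raise ValueError("epoch extends beyond the recording")
    # Mean concentration over the 5 sec immediately before the block.
    baseline = sum(signal[start:onset_index]) / n_pre
    return [x - baseline for x in signal[start:stop]]


# Toy concentration trace: flat at 2.0, stepping to 3.0 at the block onset.
toy = [2.0] * 40 + [3.0] * 120
epoch = baseline_correct(toy, onset_index=40)
```

Averaging such epochs over the ten repetitions of a condition would give one waveform per channel and condition.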

Results
Main effect of experimental conditions. The one-way ANOVA showed that 11 fNIRS channels (3, 8, 15, 20, 24, 30, 34-36, 48 and 51) had different activation patterns across the five experimental conditions (silence, neutral prosody and the three emotional prosodies). The thresholded (corrected p < 0.05) F-statistic map is shown in Fig. 4, and the F values are summarized in Table 2. To measure the variation of beta values across individuals, the standard deviation of the beta values is reported in Table 3.
Follow-up pairwise comparisons. Contrast 1: prosody > silence. First, we examined the brain regions associated with both emotional and neutral prosodies. The t-test showed that compared to the resting state (silence), four fNIRS channels had significantly enhanced activations in response to prosodies (Channel 20:  Figure S1) Furthermore, the activations within the primary/secondary AC showed leftward lateralization (paired-samples t-test: t(17) = 3.34, p = 0.004; Figure S1A).
In addition, another two channels showed significant deactivations (negative t values) in response to prosodies (Channel 8: t(17) = −5.84, p < 0.001, corrected p = 0.001; Channel 36: t(17) = −5.30, p < 0.001, corrected p = 0.002). The two channels correspond to the dorsolateral prefrontal cortex (DLPFC) and the frontopolar prefrontal cortex (PFC).
Contrast 2: emotional > neutral prosody. Second, we examined the brain regions that were more activated for emotional compared to neutral prosodies. The t-test showed that compared to neutral prosodies, two channels had significantly enhanced activations in response to emotional prosodies, corresponding to brain  Figure S1A and B).
Contrast 3: positive > negative prosody. Third, we examined the brain regions that were more activated for happy contrasted with fearful and angry prosody. The t-test showed that compared to negative prosody, two channels had significantly enhanced activations in response to happy prosody. The associated brain regions were left pars triangularis (middle IFG, Figure S1C).
Contrast 4: happy > neutral prosody. Fourth, we examined the brain regions that were more activated for happy contrasted with neutral prosody. The t-test showed that Channel 15 had significantly enhanced activation in response to happy prosody (t(17) = 4.12, p < 0.001, corrected p = 0.039). The associated brain region was the left pars triangularis (middle IFG, BA 45).
Contrast 6: fearful > neutral prosody. Finally, we examined the brain regions that were more activated for fearful contrasted to neutral prosody. No channels were significantly activated even before multiple comparison correction.

Discussion
The superior temporal cortex: decoding speech prosodies irrespective of emotional valence.
The STC has been demonstrated to play a critical part in decoding vocal expressions of emotions (see meta-analysis 8 ). (Note: The STC is comprised of the STG, the MTG, and the superior temporal sulcus 8 . The primary/secondary AC lies in the middle STG.) While the lower-level structures of the STC (i.e. the primary AC and mid-STC) analyze acoustic features in auditory expressions, the higher-level structures of the STC integrate the decoded auditory properties and build up percepts of vocal expressions 7,21 . Consistent with this notion, the current study found that while speech prosodies activated the left primary AC (BA 42) most significantly when contrasted with silence, emotional prosodies activated the right STG (middle and posterior, BA 22/42) when contrasted with neutral prosodies. The right STG is the major structure of the "emotional voice area" 38 ; its anterior 20 , middle (or the primary and secondary AC) 6,9,17,[39][40][41][42] and especially posterior portions 6,9,13,17,41,[43][44][45] have been reported to show peak activations for emotional compared to neutral vocal expressions.
Our finding provides further evidence to clarify the lateralization of emotional prosody processing in the STC. It is observed that presentation of speech stimuli (i.e. prosody contrasted to silence) showed significant leftward lateralization in the primary/secondary AC and posterior STG, which is in line with the notion that the left hemisphere is better equipped for the analysis of rapidly changing phonetic representations in speech 15,17,21 . However, our data showed a strong right lateralization for affective prosody perception within the STC 7,15,17,25,44,46 , which is consistent with the finding that the right hemisphere is more sensitive to slow-varying acoustic profiles of emotions (e.g. tempo and pausing) 5,9,43,47 .
It is also worth noting that although we explored the cortical responses within six contrasts (i.e. follow-up pairwise comparisons), the STC showed significant activations only within the first two contrasts (i.e. prosody contrasted with silence and emotional contrasted with neutral prosodies). This result suggests that the STC may be implicated in a general response to affective prosodies irrespective of valence or emotional category, which is in line with many previous studies showing a U-shaped dependency between valence of prosodies and brain activation in the STC 14,18,42,48 .
In addition, we observed two channels in the frontal cortex (BA 9/10) showing deactivation in response to prosodies (contrasted with silence). This area is located near, but does not coincide with, the default mode network (in particular, the medial prefrontal cortex) reported in fMRI studies. We speculate that this is due to technical limitations of NIRS (see the Limitations subsection for details).
The frontal cortex: discriminating speech prosodies of different emotional categories. One novel finding is that the left IFG (pars triangularis, BA 45) and the frontal eye field (BA 8) were significantly activated for happy relative to fearful/angry prosodies. It has been reported that the pars triangularis of the IFG plays a critical role in semantic comprehension 21,49 . In this study, the stronger tendency to semantically process happy relative to fearful and angry prosodies may be due to the positivity offset 50 , i.e., the participants felt less stressed in the happy than in the fearful or angry condition, so they were more motivated to comprehend happy prosodies even though they were only required to listen passively. Since pseudosentences were used in the study, this potential semantic processing may also activate BA 8, which is involved in the management of uncertainty 51 . Three previous studies examined the neural bases of happy prosody processing. While Kotz et al. 15,22 found that happy (but not angry) relative to neutral prosodies activated the left IFG, Johnstone et al. 52 observed enhanced activation in the right IFG for happy relative to angry prosodies. The inconsistent lateralization of IFG activation may be due to differences in stimuli, i.e., the participants in this study and in Kotz et al. 15,22 only listened to speech prosodies, whereas the participants in Johnstone et al. 52 listened to prosodies and watched congruent or incongruent facial expressions at the same time. The contrast of happy to neutral prosody in this study is consistent with the finding of Kotz et al. 15,22 .
Another interesting finding is the significant activation in bilateral OFC (BA 10/11) for angry contrasted with neutral prosody, which is largely consistent with the finding of Kotz et al. 15 . The OFC, which is a key neural correlate of anger 23 , plays an important role in conflict resolution and suppression of inappropriate behavior such as aggression 53,54 . Patients with bilateral damage to the OFC were found to be impaired in voice expression identification and had significant changes in their subjective emotional state 55 . Previous fMRI studies contrasting angry with neutral prosodies have reached different results: while some researchers believe that bilateral frontal regions such as the OFC are always recruited regardless of whether the task is implicit or explicit 48,56 , others found that the bilateral OFC responded to angry prosodies only in explicit tasks 39,41 . Considering the passive listening task in this study, we think the present finding supports the former view.
Surprisingly, no significant brain activations were found for fearful contrasted with neutral prosody. The result appears inconsistent with the notion of "the negativity bias" that favors the processing of fearful faces/pictures/words 50,57 . We propose that, whereas visual emotional stimuli can be processed quickly, which helps individuals to initiate timely fight-or-flight behavior, emotional prosodies communicate no biologically salient cues because their fine-grained features (e.g. pitch, loudness contour, and rhythm) evolve on a long time scale (i.e. longer than several seconds) 5 .
Limitations. Finally, three limitations should be pointed out for an appropriate interpretation of the current results. First, the NIRS technique can only measure brain activation at the surface of the cortex. Some brain regions that are highly involved in the processing of emotional prosodies (e.g. superior temporal sulcus, medial frontal cortex, ventral OFC and amygdala) are partially or totally inaccessible. This may be the reason for the non-significant OFC activation after FDR correction in the follow-up pairwise comparison (angry > neutral prosody). Also, ventral frontal channels and channels across the midline of the frontal cortex (the influence of cerebrospinal fluid) did not show significant deactivation when prosody was contrasted with the silence condition. Second, in order to provide comparable results for the ongoing neonatal study, the adult subjects in the current study were required to passively listen to the prosodies (see also other studies 12,27,42,58,59 ). This task setting is suitable, and may be the only feasible one, for neonates, but it may generate unnecessary voluntary perception and evaluation of emotional prosodies in the adult brain. Since the activation pattern of the brain is task dependent 8 , a further adult study with a more rigorous task design (e.g., explicit/implicit tasks in some studies 6,20,48 ) is needed to verify and complement the current findings. Third, this study did not use a set of pseudosentences that contained exactly the same words in the four emotional conditions, because the speech rate was different across emotions 30 (i.e., although all pseudosentences had the same structure, a small part of the pseudosentences did not contain the same words across emotions). This issue, though inherent in affective prosody studies, may influence the results.

Conclusion
In this study, we used fNIRS to investigate how speech prosodies of different emotional categories are processed in the cortex. Taken together, the current findings suggest that while processing of emotional prosodies within the STC primarily works to discriminate between emotional and neutral stimuli, categorization of emotions might occur within a high-level brain region, the frontal cortex. The results verified and extended previous fMRI findings in the adult brain and also provided a "developed version" of brain activation for the subsequent neonatal study.