Introduction

Perceiving emotion in social interactions is important for inferring the emotional states and intentions of our counterparts. Communicated emotions expressed through the face, body language and voice can be perceived and discriminated through multiple sensory channels1. However, while the literature on emotion perception is well advanced with respect to the visual domain (e.g. see the review of facial expression studies2), the picture is far less complete for the auditory modality3,4,5. Far from being a mere by-product of talking, speech prosody, the affective melody carried by the human voice (with frequency, intensity, rhythm, etc. as features), provides a rich source of emotional information that affects us consciously or nonconsciously6. (Note: in addition to speech prosody, other sound types convey emotional information, such as environmental sounds, nonverbal expressions, singing and music7. This study focused only on speech prosodies.) Proper decoding of these emotional cues allows adaptive behavior in accordance with the social context8.

With the advent of functional magnetic resonance imaging (fMRI), widespread cerebral networks have been proposed as the neural bases of prosody decoding6,7,9. In particular, auditory temporal regions including the primary/secondary auditory cortex (AC) and the superior temporal cortex (STC)8,10,11,12,13,14,15, frontal areas such as the inferior frontal cortex (IFC)16 and orbitofrontal cortex (OFC)15,17, the insula18, and subcortical structures such as the amygdala19 are widely acknowledged to be involved in the perception and comprehension of emotional prosody. Furthermore, a hierarchical model has been proposed for the processing of affective prosody6,13,20. The model suggests that (1) the extraction of acoustic parameters is carried out by voice-sensitive structures of the AC and mid-STC; (2) the posterior part of the right STC contributes to the identification of affective prosody by means of multimodal integration; and (3) further processing concerned with the evaluation and semantic comprehension of vocally expressed emotions is accomplished in the bilateral inferior frontal gyrus (IFG) and OFC9,21.

While the above-mentioned studies have laid a solid groundwork for understanding emotional prosody perception, these studies rarely found activation differences between positive and negative prosody (for the only exception, see the fMRI study22 that found stronger activation for positive relative to negative prosody). Furthermore, although it is well known that the activation pattern of the human brain is not the same for all emotions23,24, the question of how verbal expressions of different emotional categories elicit activation in temporal and frontal regions has scarcely been investigated8 (for the only exception, see the fMRI study by Kotz et al.15, who found that the bilateral superior middle frontal gyrus showed enhanced activation for angry relative to neutral prosody while the left IFG showed enhanced activation for happy relative to neutral prosody). In addition, fMRI studies on the effects of emotional sounds are unavoidably confounded by the gradient noise of the scanner, so fMRI-based results need to be verified and complemented by a silent imaging method such as functional near-infrared spectroscopy (fNIRS)25. However, as far as we know, speech prosody has never been investigated with the fNIRS technique, and only three relevant fNIRS studies have examined nonverbal expressions or nonhuman sounds25,26,27. Therefore, the first aim of the present study was to provide fNIRS-based evidence of how speech prosodies of different emotional categories elicit activation in the adult brain.

Another purpose of the current study was to provide a "developed version" of the auditory response pattern for an ongoing neonatal experiment in our lab. It is worth stressing that the use of fNIRS is irreplaceable for this purpose, because alternative methods such as fMRI and electroencephalography (EEG) cannot map the brain activation of conscious newborns with high spatial resolution. To make the results of this study comparable with the neonatal one, we asked the adult subjects to passively listen to affective prosodies, because passive listening is the only feasible task for neonates (see neonatal studies28,29). Furthermore, since speech comprehension is largely immature in the neonatal brain, we used semantically meaningless pseudosentences in both studies so as to provide subjects with prosody alone rather than both prosody and semantic information.

We expected that while the voice-sensitive regions in the STC (including the primary/secondary AC) would be strongly activated by prosodies irrespective of emotional valence5,8, frontal regions such as the IFC and OFC would play a crucial role in discriminating verbal expressions of different emotional categories7,9. Since little is known about the brain activity associated with different categories of affective prosody, no hypothesis was made regarding which (if any) frontal areas take part in decoding the distinct affective cues embedded in happy, angry and fearful prosodies.

Methods

Participants

Twenty-two healthy subjects (12 females; age range = 18–24 years, 20.8 ± 0.4 years (mean ± std)) were recruited from Shenzhen University as paid participants. All subjects were right-handed and had normal hearing ability. Written informed consent was obtained prior to the experiment. The experimental protocol was approved by the Ethics Committee of Shenzhen University and this study was performed strictly in accordance with the approved guidelines.

Stimuli

The emotional prosodies were selected from the Database of Chinese Vocal Emotions30. The database consists of "language-like" pseudosentences in Mandarin Chinese, which were constructed by replacing content words with semantically meaningless words (i.e. pseudowords) while keeping function words to convey grammatical information. All pseudosentences had the same structure (subject + predicate + object). The duration of each pseudosentence was approximately 1 to 2 sec.

Four kinds of emotional prosodies, i.e., fearful, angry, happy and neutral, were examined in this study. To construct four 15-sec segments for the four emotional conditions, we concatenated 11, 11, 8 and 9 pseudosentences of fearful, angry, happy and neutral prosody, respectively. Among these pseudosentences, 6 had the same construction (but different emotions) across the four conditions. The mean speech rate of the four kinds of prosodies was 6.33, 6.53, 5.07 and 5.27 syllables/sec, and the number of syllables per sentence was 9.5 ± 1.0, 8.9 ± 1.8, 9.5 ± 1.6 and 8.8 ± 0.83 (mean ± std). All the selected prosodies were pronounced by native female Mandarin Chinese speakers, and the mean intensity was equalized across stimuli. Before the experiment, the emotion recognition rate (mean = 0.80; one emotion label selected from anger, happiness, sadness, fear, disgust, surprise, and neutral) and emotional intensity (5-point scale, mean = 3.1) were matched among the four conditions (both measurements were taken from the database30). After the fNIRS recording, all participants were required to classify each prosodic pseudosentence into one of the four emotion categories. The mean recognition rate was 0.99 ± 0.04, 0.95 ± 0.08, 0.94 ± 0.08 and 0.97 ± 0.06 for angry, fearful, happy and neutral prosodies, respectively.
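As a rough illustration of how such segments can be assembled (concatenation of pseudosentence waveforms with equalized mean intensity), a minimal Python sketch is given below. It is not the actual stimulus-preparation code; the function names and the target RMS value are hypothetical.

```python
import numpy as np

def equalize_rms(waveform, target_rms=0.05):
    """Scale a waveform so its root-mean-square intensity matches a target value
    (the target here is an arbitrary illustrative number)."""
    rms = np.sqrt(np.mean(waveform ** 2))
    return waveform * (target_rms / rms)

def build_segment(pseudosentences, target_rms=0.05):
    """Concatenate several pseudosentence waveforms (1-2 sec each) into one
    ~15-sec stimulus segment after equalizing their intensity."""
    return np.concatenate([equalize_rms(s, target_rms) for s in pseudosentences])
```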

Procedure

Sounds were presented via two loudspeakers (R26T, EDIFIER, Dongguan, China) placed approximately 50 cm from the participant's head. The sounds were presented at a sound pressure level (SPL) of 60 to 70 dB, measured with a sound level meter (1353S, TES Electrical Electronic Corp., Taipei, Taiwan). The mean background noise level (without prosody presentation) was 30 dB SPL.

The experiment lasted 25 min (Fig. 1). Resting-state NIRS data were first recorded for 5 min (eyes open), followed by a 20-min passive listening task. Each of the four 15-sec segments (corresponding to the four emotions) was repeated ten times, yielding 40 blocks that were presented in random order. The inter-block interval (silent period) varied randomly between 14 and 16 sec.
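The block structure can be made concrete with a short script. The sketch below, with hypothetical names, simply generates one possible 40-block timeline (random block order, 15-sec blocks, 14-16-sec silent intervals after a 5-min rest); it is not the presentation software used in the experiment.

```python
import random

CONDITIONS = ["fearful", "angry", "happy", "neutral"]
BLOCK_DURATION = 15.0      # sec, each concatenated prosody segment
REST_DURATION = 5 * 60.0   # sec, initial resting-state recording

def build_timeline(seed=0):
    """Return a list of (onset_sec, condition) pairs for the 40 blocks."""
    rng = random.Random(seed)
    blocks = CONDITIONS * 10          # 4 conditions x 10 repetitions = 40 blocks
    rng.shuffle(blocks)               # random presentation order
    timeline, t = [], REST_DURATION   # task begins after the 5-min rest
    for condition in blocks:
        timeline.append((round(t, 1), condition))
        t += BLOCK_DURATION + rng.uniform(14.0, 16.0)  # block + silent interval
    return timeline

# Total duration is about 5 min rest + 40 x (15 + ~15) sec, i.e. roughly 25 min.
```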

Figure 1

Schematic diagram of the timeline of the experiment.

Data recording

The NIRS data were recorded in continuous-wave mode with the NIRScout 1624 system (NIRx Medical Technologies, LLC, Los Angeles, USA), which consisted of 16 LED emitters (intensity = 5 mW/wavelength) and 23 detectors operating at two wavelengths (760 and 850 nm). Based on previous findings6,7, we placed the optodes over the frontal and temporal regions of the brain using a NIRS-EEG compatible cap (EASYCAP, Herrsching, Germany) positioned according to the international 10/5 system (Figs. 2A and 3). There were 54 usable channels (Fig. 2B), with a mean source-detector distance of 3.2 cm (range = 2.8 to 3.6 cm). The data were continuously sampled at 4 Hz. Detector saturation never occurred during the recording.

Figure 2

The locations of optodes and channels with respect to the EEG 10/5 system. (A) The locations of sources (red dots) and detectors (blue dots). (B) The 54 channels (green lines).

Figure 3

The locations of optical sources (red dots, n = 16) and detectors (yellow dots, n = 23) on a standardized 3D head. (A) Front view. (B) Top view. (C) Left view. (D) Right view.

To identify the cortical structures underlying the NIRS channels, the Matlab toolbox NFRI (http://brain.job.affrc.go.jp/tools/)31 was used to estimate the MNI coordinates of the optodes from their EEG 10/5 positions. The location of each NIRS channel was defined as the midpoint of the light path between the corresponding source-detector pair (Table 1).

Table 1 The MNI coordinates and cortical regions of the NIRS channels.
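For illustration, the midpoint rule used to define channel positions can be expressed in a few lines of Python (the coordinates below are made-up values, not entries from Table 1):

```python
import numpy as np

# Hypothetical MNI coordinates (mm) of one source and one detector.
source_mni = np.array([-62.0, -20.0, 8.0])
detector_mni = np.array([-58.0, -46.0, 14.0])

# The channel location is taken as the midpoint of the light path.
channel_mni = (source_mni + detector_mni) / 2.0
print(channel_mni)  # -> [-60. -33.  11.]
```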

Data preprocessing

The data were processed with the nirsLAB analysis package (v2016.05, NIRx Medical Technologies, LLC, Los Angeles, USA). Four of the 22 datasets were excluded because more than 5 channels showed low intensity values (in volts; gain setting of the NIRx device > 7). Thus, a total of 18 datasets were analyzed in this study.

Two main forms of movement artifact were present in the NIRS data: transient spikes and abrupt discontinuities. First, spikes were smoothed with a semi-automated procedure that replaces contaminated data by linear interpolation. Second, discontinuities (or "jumps") were automatically detected and corrected by nirsLAB (std threshold = 5). Third, a band-pass filter (0.01 to 0.2 Hz) was applied to attenuate slow drifts and high-frequency noise such as respiratory and cardiac rhythms. The intensity data were then converted into optical density changes (ΔOD; see the supplementary material for the detailed procedure), and the ΔOD at the two measured wavelengths was transformed into relative concentration changes of oxyhemoglobin and deoxyhemoglobin (Δ[HbO] and Δ[Hb]) using the modified Beer-Lambert law32. The source-detector distance of the first channel was 3.1 cm, and the exact distances of the other 53 channels were calculated by nirsLAB from the optode locations. The differential pathlength factor was assumed to be 7.25 at 760 nm and 6.38 at 850 nm33.
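To make the conversion steps explicit, a minimal Python sketch of the band-pass filtering, optical-density conversion and modified Beer-Lambert law is given below. This is not the nirsLAB implementation; the extinction coefficients are approximate values for illustration and may differ from the tabulation used by the software.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 4.0                       # sampling rate (Hz)
DPF = {760: 7.25, 850: 6.38}   # differential pathlength factors (as in the paper)

# Approximate molar extinction coefficients in 1/(mM*cm); illustrative values only.
EXT = {760: {"HbO": 0.1486, "Hb": 0.3843},
       850: {"HbO": 0.2526, "Hb": 0.1798}}

def bandpass(x, low=0.01, high=0.2, fs=FS, order=3):
    """Zero-phase Butterworth band-pass removing slow drifts and cardiac/respiratory noise."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def intensity_to_dod(intensity):
    """Convert raw light intensity to optical density changes relative to its mean."""
    return -np.log(intensity / np.mean(intensity))

def mbll(dod_760, dod_850, distance_cm=3.1):
    """Modified Beer-Lambert law: solve a 2x2 system per sample for Delta[HbO] and Delta[Hb]."""
    A = np.array([[EXT[760]["HbO"] * DPF[760], EXT[760]["Hb"] * DPF[760]],
                  [EXT[850]["HbO"] * DPF[850], EXT[850]["Hb"] * DPF[850]]]) * distance_cm
    dhbo, dhb = np.linalg.solve(A, np.vstack([dod_760, dod_850]))
    return dhbo, dhb   # relative concentration changes (mM)
```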

Statistical analyses

Statistical significance of the concentration changes was determined with a general linear model based on the canonical hemodynamic response function (nirsLAB parameters = [6 16 1 1 6 0 32]), with a discrete cosine transform used for temporal filtering (high-pass cutoff = 128 sec). Although both Δ[HbO] and Δ[Hb] signals were obtained, only Δ[HbO] was used for statistical analyses because of its superior signal-to-noise ratio relative to Δ[Hb]. When estimating the beta values, nirsLAB used an SPM-based algorithm (restricted maximum likelihood) to compute a least-squares solution to an overdetermined system of linear equations.
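The GLM step can be sketched as follows: the block boxcar of each condition is convolved with a canonical (double-gamma) HRF and the beta values are estimated per channel. The sketch below uses ordinary least squares and omits the discrete cosine high-pass filtering for brevity, whereas nirsLAB uses a restricted maximum likelihood estimator, so it should be read as an approximation of the procedure rather than the exact algorithm.

```python
import numpy as np
from scipy.stats import gamma

FS = 4.0  # Hz

def canonical_hrf(fs=FS, duration=32.0, peak=6.0, undershoot=16.0, ratio=6.0):
    """Simplified SPM-style double-gamma HRF (nirsLAB's exact parameterization may differ)."""
    t = np.arange(0, duration, 1.0 / fs)
    h = gamma.pdf(t, peak) - gamma.pdf(t, undershoot) / ratio
    return h / h.sum()

def glm_betas(dhbo, condition_onsets, block_dur=15.0, fs=FS):
    """Estimate one beta per condition from a single channel's Delta[HbO] time course.
    condition_onsets: list (one entry per condition) of block onset times in seconds."""
    n = len(dhbo)
    hrf = canonical_hrf(fs)
    X = np.ones((n, 1))                       # constant (baseline) regressor
    for onsets in condition_onsets:
        boxcar = np.zeros(n)
        for onset in onsets:
            i = int(onset * fs)
            boxcar[i:i + int(block_dur * fs)] = 1.0
        X = np.column_stack([X, np.convolve(boxcar, hrf)[:n]])
    betas, *_ = np.linalg.lstsq(X, dhbo, rcond=None)
    return betas[1:]                          # drop the constant term
```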

To analyze the data statistically, we first performed a one-way ANOVA on the beta values associated with Δ[HbO] (five levels: silence, neutral, fearful, angry and happy prosody), resulting in a thresholded (corrected p < 0.05) F-statistic map. Six follow-up pairwise comparisons were then performed, restricted to the significant channels revealed by the thresholded F-statistic map. This study was interested in the Δ[HbO] differences between (1) prosody and silence, (2) emotional and neutral prosody, (3) positive and negative prosody, (4) happy and neutral prosody, (5) angry and neutral prosody, and (6) fearful and neutral prosody. The first two pairwise comparisons were intended to replicate the results of previous relevant studies; the last four were designed to explore activation differences between different emotional prosodies. The statistical results of individual channels were corrected for multiple comparisons across channels by controlling the false discovery rate (FDR), following the Benjamini and Hochberg34 procedure implemented in Matlab (v2015b, The MathWorks, Inc., Natick, USA).
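For reference, the Benjamini-Hochberg step applied across channels can be written compactly; the following is a generic sketch of the procedure, not the exact Matlab code used in this study.

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg procedure: return a rejection mask and adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)              # p_(i) * m / i
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adj_sorted, 0, 1)
    return adjusted <= q, adjusted

# Example with made-up p-values from several channels:
reject, p_corrected = fdr_bh([0.001, 0.004, 0.002, 0.030, 0.200, 0.040])
```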

Waveform visualization

In addition to the statistical maps, we also display the waveforms of Δ[HbO] and Δ[Hb] in the four emotional conditions (Figure S1 in the supplementary material). This study considered Δ[HbO] and Δ[Hb] in a time window from −5 to 25 sec relative to the onset of the emotional prosodies. The mean concentration in the 5 sec immediately before each block (i.e., −5 to 0 sec) was used as the baseline (see also other studies35,36,37).
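The baseline correction can be illustrated with a short epoching function; this is a sketch with assumed variable names rather than the plotting code actually used.

```python
import numpy as np

FS = 4.0  # Hz

def block_average(signal, onsets, fs=FS, pre=5.0, post=25.0):
    """Cut -5 to 25 sec epochs around each block onset, subtract the mean of the
    5-sec pre-onset baseline, and average the epochs across blocks."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = []
    for onset in onsets:
        i = int(onset * fs)
        seg = signal[i - n_pre : i + n_post].astype(float).copy()
        seg -= seg[:n_pre].mean()        # baseline: -5 to 0 sec before the block
        epochs.append(seg)
    return np.mean(epochs, axis=0)
```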

Results

Main effect of experimental conditions

The one-way ANOVA showed that 11 fNIRS channels (3, 8, 15, 20, 24, 30, 34–36, 48 and 51) had different activation patterns across the five experimental conditions (silence, neutral prosody and the three emotional prosodies). The thresholded (corrected p < 0.05) F-statistic map is shown in Fig. 4, and the F values are summarized in Table 2. To quantify the variation of beta values across individuals, the standard deviation of the beta values is reported in Table 3.

Figure 4

The F-statistic map showing brain regions that had different activation patterns across the five conditions (silence, neutral, fearful, angry and happy prosody). Reported F values are thresholded at p < 0.05 (corrected for multiple comparisons using FDR). (A) Front view. (B) Top view. (C) Left view. (D) Right view. Green labels denote channel numbers.

Table 2 Brain regions showing different activation patterns across the experimental conditions (silence, neutral, fearful, angry and happy prosody).
Table 3 Standard deviation (across 18 subjects) of the beta values in the 54 channels.

Follow-up pairwise comparisons

Contrast 1: prosody > silence

First, we examined the brain regions responding to both emotional and neutral prosodies. Compared to the resting state (silence), four fNIRS channels showed significantly enhanced activation in response to prosodies (Channel 20: t(17) = 4.54, p < 0.001, corrected p = 0.003; Channel 24: t(17) = 4.10, p < 0.001, corrected p = 0.007; Channel 34: t(17) = 3.28, p = 0.004, corrected p = 0.020; Channel 48: t(17) = 3.79, p = 0.002, corrected p = 0.010). The four channels correspond to the bilateral primary/secondary AC (Brodmann area (BA) 42), the left posterior superior temporal gyrus (STG, BA 22), and the right pars triangularis (middle IFG, BA 45). Among these areas, only the left primary/secondary AC (Channel 20) had convergent waveforms of Δ[HbO] and Δ[Hb] across the four conditions (Figure S1A). (Note: the time course of Δ[HbO] differed across the four conditions in the other three significant channels; see the waveforms at Channels 24, 34 and 48 in Figure S1.) Furthermore, the activation within the primary/secondary AC showed leftward lateralization (paired-samples t-test: t(17) = 3.34, p = 0.004; Figure S1A).

In addition, another two channels showed significant deactivation (negative t values) in response to prosodies (Channel 8: t(17) = −5.84, p < 0.001, corrected p = 0.001; Channel 36: t(17) = −5.30, p < 0.001, corrected p = 0.002). These two channels correspond to the dorsolateral prefrontal cortex (DLPFC) and the frontopolar prefrontal cortex (PFC).

Contrast 2: emotional > neutral prosody

Second, we examined the brain regions that were more strongly activated by emotional compared to neutral prosodies. The t-test showed that, compared to neutral prosodies, two channels had significantly enhanced activation in response to emotional prosodies, corresponding to the right posterior STG (BA 22, Channel 51; t(17) = 4.02, p < 0.001, corrected p = 0.035) and the right primary/secondary AC (BA 42, Channel 48; t(17) = 3.46, p = 0.003, corrected p = 0.044). Notably, while the main effect of prosody (i.e. prosody contrasted to silence) showed leftward lateralization in the posterior STG (paired-samples t-test: t(17) = 2.66, p = 0.017) and the primary/secondary AC, the contrast of emotional versus neutral prosody within these areas showed rightward lateralization (AC: t(17) = −3.70, p = 0.002; STG: t(17) = −3.78, p = 0.001; Figure S1A and B).

Contrast 3: positive > negative prosody

Third, we examined the brain regions that were more strongly activated by happy than by fearful and angry prosody. The t-test showed that, compared to negative prosody, two channels had significantly enhanced activation in response to happy prosody. The associated brain regions were the left pars triangularis (middle IFG, BA 45, Channel 15; t(17) = 3.75, p = 0.002, corrected p = 0.039) and the frontal eye field (superior frontal gyrus, BA 8, Channel 35; t(17) = 3.60, p = 0.002, corrected p = 0.039). Notably, while the main effect of prosody (i.e. prosody contrasted to silence) showed rightward lateralization in the middle IFG (paired-samples t-test: t(17) = −2.92, p = 0.010), the contrast of happy versus fearful/angry prosody showed leftward lateralization (t(17) = 2.78, p = 0.013; Figure S1C).

Contrast 4: happy > neutral prosody

Fourth, we examined the brain regions that were more strongly activated by happy than by neutral prosody. The t-test showed that Channel 15 had significantly enhanced activation in response to happy prosody (t(17) = 4.12, p < 0.001, corrected p = 0.039). The associated brain region was the left pars triangularis (middle IFG, BA 45).

Contrast 5: angry > neutral prosody

Fifth, we examined the brain regions that were more strongly activated by angry than by neutral prosody. The t-test showed that two symmetrical channels had enhanced activation in response to angry prosody, corresponding to the frontopolar and orbitofrontal areas (part of the OFC, BA 10/11). However, the activation did not remain significant after multiple comparison correction (Channel 3: t(17) = 3.56, p = 0.002, corrected p = 0.070; Channel 30: t(17) = 3.74, p = 0.002, corrected p = 0.070; Figure S1D).

Contrast 6: fearful > neutral prosody

Finally, we examined the brain regions that were more strongly activated by fearful than by neutral prosody. No channels were significantly activated, even before multiple comparison correction.

Discussion

The superior temporal cortex—decoding speech prosodies irrespective of emotional valence

The STC has been shown to play a critical part in decoding vocal expressions of emotion (see the meta-analysis8). (Note: the STC comprises the STG, the middle temporal gyrus (MTG) and the superior temporal sulcus8; the primary/secondary AC lies in the middle STG.) While the lower-level structures of the STC (i.e. the primary AC and mid-STC) analyze the acoustic features of auditory expressions, the higher-level structures of the STC integrate the decoded auditory properties and build up percepts of vocal expressions7,21. Consistent with this notion, the current study found that while speech prosodies activated the left primary AC (BA 42) most significantly when contrasted to silence, emotional prosodies activated the right STG (middle and posterior, BA 22/42) when contrasted to neutral prosodies. The right STG is the major structure of the "emotional voice area"38; its anterior20, middle (i.e. the primary and secondary AC)6,9,17,39,40,41,42 and especially posterior portions6,9,13,17,41,43,44,45 have been reported to show peak activations for emotional compared to neutral vocal expressions.

Our findings provide further evidence to clarify the lateralization of emotional prosody processing in the STC. The presentation of speech stimuli (i.e. prosody contrasted to silence) showed significant leftward lateralization in the primary/secondary AC and the posterior STG, in line with the notion that the left hemisphere is better equipped for the analysis of rapidly changing phonetic representations in speech15,17,21. However, our data showed a strong rightward lateralization for affective prosody perception within the STC7,15,17,25,44,46, consistent with the finding that the right hemisphere is more sensitive to the slowly varying acoustic profiles of emotions (e.g. tempo and pausing)5,9,43,47.

It is also worth noting that although we explored the cortical responses in six contrasts (i.e. the follow-up pairwise comparisons), the STC showed significant activation only in the first two (i.e. prosody contrasted to silence and emotional contrasted to neutral prosody). This result suggests that the STC may be implicated in a general response to affective prosodies irrespective of valence or emotional category, in line with many previous studies showing a U-shaped relationship between the valence of prosodies and brain activation in the STC14,18,42,48.

In addition, we observed two channels in the frontal cortex (BA 9/10) showing deactivation in response to prosodies (contrasted to silence). This area is located near, but does not coincide with, the default mode network (in particular, the medial prefrontal cortex) reported in fMRI studies. We speculate that this is due to technical limitations of NIRS (see the Limitations subsection for details).

The frontal cortex—discriminating speech prosodies of different emotional categories

One novel finding is that the left IFG (pars triangularis, BA 45) and the frontal eye field (BA 8) were significantly more activated for happy than for fearful/angry prosodies. The pars triangularis of the IFG has been reported to play a critical role in semantic comprehension21,49. In this study, the apparently greater tendency to semantically process happy relative to fearful and angry prosodies may reflect the positivity offset50: the participants felt less stressed in the happy than in the fearful or angry condition and may therefore have been more motivated to comprehend the happy prosodies, even though they were only required to listen passively. Since pseudosentences were used, this potential semantic process may also have activated BA 8, which is involved in the management of uncertainty51. Three previous studies have examined the neural bases of happy prosody processing. While Kotz et al.15,22 found that happy (but not angry) relative to neutral prosodies activated the left IFG, Johnstone et al.52 observed enhanced activation in the right IFG for happy relative to angry prosodies. The inconsistent lateralization of IFG activation may be due to differences in stimuli: the participants in this study and in Kotz et al.15,22 only listened to speech prosodies, whereas the participants in Johnstone et al.52 simultaneously listened to prosodies and watched congruent or incongruent facial expressions. The contrast of happy to neutral prosody in this study is consistent with the finding of Kotz et al.15,22.

Another interesting finding is the activation in the bilateral OFC (BA 10/11) for angry contrasted to neutral prosody (significant before, but not after, multiple comparison correction), which is largely consistent with the finding of Kotz et al.15. The OFC, a key neural correlate of anger23, plays an important role in conflict resolution and in the suppression of inappropriate behavior such as aggression53,54. Patients with bilateral OFC damage were found to be impaired in voice expression identification and showed significant changes in their subjective emotional state55. Previous fMRI studies contrasting angry to neutral prosodies have reached divergent results: some researchers hold that bilateral frontal regions such as the OFC are recruited regardless of whether the task is implicit or explicit48,56, whereas others found that the bilateral OFC responded to angry prosodies only in explicit tasks39,41. Considering the passive listening task used in this study, we think the present finding supports the former view.

Surprisingly, no significant brain activation was found for fearful contrasted to neutral prosody. This result appears inconsistent with the notion of a "negativity bias" that favors the processing of fearful faces/pictures/words50,57. We propose that whereas visual emotional stimuli can be processed quickly, helping individuals to initiate a timely fight-or-flight response, emotional prosodies do not provide such rapidly available, biologically salient cues because their fine-grained features (e.g. pitch, loudness contour, and rhythm) evolve over a longer time scale (i.e. longer than several seconds)5.

Limitations

Finally, three limitations should be pointed out for an appropriate interpretation of the current results. First, the NIRS technique can only measure brain activation near the cortical surface. Some brain regions that are strongly involved in the processing of emotional prosodies (e.g. the superior temporal sulcus, medial frontal cortex, ventral OFC and amygdala) are therefore partially or entirely inaccessible. This may explain why the OFC activation in the follow-up pairwise comparison (angry > neutral prosody) did not survive FDR correction, and why ventral frontal channels and channels across the midline of the frontal cortex (affected by cerebrospinal fluid) did not show significant deactivation when prosody was contrasted to the silence condition. Second, in order to provide results comparable to the ongoing neonatal study, the adult subjects in the current study were required to listen passively to the prosodies (see also other studies12,27,42,58,59). This task setting is suitable for neonates, and may indeed be the only feasible task for them, but it may induce unintended voluntary perception and evaluation of the emotional prosodies in the adult brain. Since the activation pattern of the brain is task dependent8, a further adult study with a more rigorous task design (e.g. the explicit/implicit tasks used in some studies6,20,48) is needed to verify and complement the current findings. Third, this study did not use a set of pseudosentences containing exactly the same words in all four emotional conditions, because the speech rate differed across emotions30 (i.e., although the structure of the pseudosentences was identical, a small portion of them did not contain the same words across emotions). This issue, though inherent in affective prosody studies, may have influenced the results.

Conclusion

In this study, we used fNIRS to investigate how speech prosodies of different emotional categories are processed in the cortex. Taken together, the current findings suggest that while prosody processing within the STC primarily serves to discriminate emotional from neutral stimuli, the categorization of emotions may occur within a higher-level region, the frontal cortex. The results verify and extend previous fMRI findings in the adult brain and also provide a "developed version" of brain activation for the forthcoming neonatal study.