Introduction

Social communication through voice entails semantic as well as prosodic meaning, the latter being generally defined as the melody of the human voice. The processing of human voice prosody leads to widespread changes in multiple cerebral regions, especially in the superior temporal and inferior frontal cortices1,2,3,4. Given their tripartite functional compartmentalization, whereby the basal ganglia (BG) are subdivided into territories linked to the motor, associative or limbic cortex5,6, there is every reason to suppose that the BG might play a major role in emotional processing in humans. This assertion is reinforced by both the BG’s intrinsic function and their functional and effective connectivity with the rest of the brain7, as revealed by functional magnetic resonance imaging (fMRI)8, electrophysiological data9, lesion studies10 and deep brain stimulation of the BG11. There is growing evidence for the involvement of the BG in vocal emotional processing, not only directly, but also through their connections with structures known to be involved in emotional processing, such as the superior frontal and temporal gyri, the amygdala and the cerebellum12.

Evidence gathered from fMRI and lesion models has led to the hypothesis that the BG play a critical and potentially direct role in vocal emotion processing, by promoting efficient decoding of emotional information from vocal cue sequences and rhythmic aspects of speech13,14. The highly connected, closed-loop nature of the BG makes them well situated to coordinate activity in other cortical and subcortical regions related to emotional voice perception. The subthalamic nucleus (STN) may synchronize neural oscillations within a broader limbic network in order to facilitate efficient processing of auditory and emotional information11. This synchronization would strengthen cortical representations of repeated stimulus–response pairings to form “chunks” of behavioural/cognitive response patterns that could be processed more automatically with learning15. Simultaneously, these chunks may be modified by the cerebellum to minimize the prediction error of an internal model based on its representation of the current sensory state and the expected outcome of ongoing auditory processing16,17. Furthermore, the BG and cerebellum may analyse temporal patterns in acoustic stimuli to extract salient emotional cues to feed back to the cortex. Nevertheless, the way in which these subcortical and cortical structures couple (or decouple) to allow the emergence of a cognitive process such as emotional prosody recognition (i.e., functional integration) remains largely unexplored in affective neuroscience, especially regarding the patterns of connectivity between the BG and the cerebellum in vocal emotion decoding7,12.

As with the subthalamic nucleus, the BG can be divided into at least three functional compartments according to their cortical efferents: motor, associative and limbic5,6,7. In the present study, the BG regions of interest were the striatum, the globus pallidus (internal and external parts) and the STN7,18,19,20,21. These regions also play a critical role in selecting a relevant response pattern (and inhibiting irrelevant ones) and in reward feedback and anticipation7. Efferent projections also connect the BG more directly to the cerebellum, which can likewise be separated into motor, associative/cognitive and limbic subparts22, as recently highlighted by resting-state functional connectivity23, specific task-based parcellation24 and cerebellar topography22. Within the scope of the present study, the cerebellum would help fine-tune the selected response initiated in the BG, generate an internal model of current goal states and contribute to closing the loop of reward encoding7,25, while simultaneously assessing auditory timing for further iterations of vocal emotion decoding across time, as suggested by lesion studies26,27,28. Specific cerebellar areas associated with (vocal) emotion processing are Crus I and Crus II of the ansiform lobule, cerebellar lobules IV, V, VI, VIIb, VIII and IX, the Vermis7,12,22,29,30,31,32 and the deep cerebellar nuclei, especially the dentate7 and fastigial nuclei33,34.

Recent neuroimaging studies have provided new insights into the role(s) of the BG in emotion processing but still present shortcomings that need to be overcome. These studies failed to take advantage of high-resolution scanning of the BG; to investigate the functional and effective connectivity among the BG, between the BG and the different subparts of the temporal regions35 that sustain emotional prosody processing and, more crucially, between the BG and the cerebellum; and to assess the impact of low-level acoustic parameters on voice prosody processing in the BG or cerebellum, despite their known impact on temporal and frontal brain regions36,37.

Considering the abovementioned literature, the present study was designed to improve our current understanding of the functional integration of the BG and cerebellum during emotional prosody processing in humans, taking into account low-level acoustic parameters of interest such as synthesized fundamental frequency (f0) and energy, using high-resolution fMRI in healthy participants. We therefore hypothesized: (i) an increase in BOLD signal in the STN, striatum, globus pallidus (internal, GPi; external, GPe) and cerebellum (Crus I-II, Vermis, cerebellar lobules IV-IX) during the processing of emotional (angry and happy) voices, as opposed to emotionally neutral voice prosody; (ii) a similar increase for emotional voices after removing the variance explained by low-level acoustics (synthesized energy and f0); (iii) enhanced BOLD signal in the BG (STN, striatum, globus pallidus) for the angry voice envelope (synthesized energy); (iv) functional connectivity between the BG, especially the STN and GPi/GPe, the cerebellum (Vermis and cerebellar lobules IV-IX, dentate nucleus) and temporal (superior temporal gyrus) and frontal voice areas (inferior frontal cortex, orbitofrontal cortex) when contrasting emotional to neutral voices (independently of synthesized energy and f0); and (v) enhanced effective coupling within the BG (striatum, STN, GPi/GPe) for angry and/or happy voices.

Results

Fifteen participants (8 female, 7 male) were included in the final analysis of the present study. Their task was a one-back task on neutral, happy and angry sentences of pseudowords (‘ne kali bam sud molen!’) presented binaurally through MR-compatible headphones. Both the original voices and synthesized versions of them—synthesized energy and synthesized f0 voices—were included as stimuli across two runs of about 10 min each, in pseudorandom order. The factors of interest were therefore Emotion, voice type (Acoustic Parameters) and the interaction between these two factors. More details on the task and paradigm can be found in the Methods section.

Wholebrain results

We performed voxel-level general linear model analyses subdivided into three different models in order to identify enhanced brain activity related to the factorial design of our data. The models of interest were models 1 and 2, in which we modelled the Emotion factor and the two-way interaction between the Emotion and Acoustic Parameters factors, respectively. The former analysis revealed emotion-specific enhanced patterns of activity that are presented in this section (for the general effect of Emotion, see Fig. S1), whereas the full interaction between factors did not yield any significant results. We present, however, one significant result of interest, as part of our hypotheses, for the rhythmicity of angry voices (synthesized energy of angry > neutral prosody). Finally, results for model 3, the main effect of Acoustic Parameters, are reported in the supplementary data (Supplementary Tables 1–3).

Main effect of emotion factor

Wholebrain results for the Emotion factor revealed significantly enhanced activity for both the angry > neutral voices (Supplementary Table 4) and happy > neutral voices (Supplementary Table 5) contrasts. Enhanced activations for emotional (angry and happy) compared to neutral voices were also significant, especially in the superior temporal cortex and inferior frontal cortex, bilaterally (see Supplementary Table 6). Brain activity specific to angry voices (angry > neutral voices) replicated the involvement of the temporal cortex for processing such stimuli, especially in the anterior part of the middle temporal gyrus (aMTG) and the posterior superior temporal gyrus and sulcus (pSTG and pSTS, respectively), bilaterally (Fig. 1a,b,g). Enhanced activity was also observed in medial brain areas such as the anterior cingulate cortex (ACC), the parahippocampal gyrus and the fusiform gyrus (Fig. 1c,d). Activity in the basal ganglia was restricted to the external globus pallidus (GPe), while we also observed enhanced activity in several parts of the thalamus (Fig. 1e). Finally, large parts of the cerebellum were also more active (Fig. 1g) during angry as opposed to neutral voice processing, namely the Crus II area (Fig. 1b), lobules IV-V and VI (Fig. 1c,f), Vermis area VI (Fig. 1d) as well as deep nuclei such as the dentate (Fig. 1c,f) and fastigial nucleus (Fig. 1f). More details are available in Supplementary Table 4.

Figure 1
figure 1

Enhanced brain measures for implicitly processing angry compared to neutral voices, corrected for multiple comparisons (wholebrain voxel-wise p < .05 FDR, k > 10 voxels). (a,b) Lateral activations rendered on a sagittal image highlighting middle and superior temporal regions. (c,d) Medial activations of the anterior cingulate cortex, parahippocampal cortex and cerebellum. (e) Subcortical activity in the thalamus and globus pallidus displayed on an axial slice. (f) Cerebellar activations displayed on an axial slice. (g) Percentage of signal change extracted using singular value decomposition on 9 voxels around each peak in a subset of regions with individual values (circles), mean values (bars) and standard error of the mean (error bars) for angry and neutral voices. Pink circles: angry voices; Black circles: neutral voices. L: left; R: right; IFGop: inferior frontal gyrus pars opercularis; STG: superior temporal gyrus; STS: superior temporal sulcus; MTG: middle temporal gyrus; INS: insula; SMG: supramarginal gyrus; FG: frontal gyrus; FFG: fusiform gyrus; PHG: parahippocampal gyrus; ACC: anterior cingulate cortex; Cereb: cerebellum; Cereb Lob: cerebellum lobule; Cereb Nucl Dentate: dentate nucleus of the cerebellum; Cereb Nucl Fastigial: fastigial nucleus of the cerebellum; Brainstem LL: lateral lemniscus of the brainstem; Thalamus VLN: ventral lateral nucleus of the thalamus; GPe: external globus pallidus; Cereb Crus: cerebellum crus of ansiform lobule. ‘a’ prefix: anterior part; ‘m’ prefix: mid part; ‘p’ prefix: posterior part.

As with angry voices, brain activity specific to normal happy voices (happy > neutral voices) highlighted the anterior, mid and posterior portions of the temporal cortex (aSTS, aMTG; mSTS, mSTG; pSTS, pSTG, pMTG, respectively), bilaterally (Fig. 2a,b,g). Enhanced activity was medially observed in the ACC, parahippocampal gyrus and fusiform gyrus (Fig. 2c,d). Increased activity in the basal ganglia was observed in the GPe and bilateral putamen, and in the ventral lateral nucleus of the thalamus (Fig. 2e). Multiple subparts of the cerebellum were more activated (Fig. 2g) during happy as opposed to neutral voice processing, especially the lateral Crus I area, bilaterally (Fig. 2a,b,f), lobules VI, VIIb and VIII (Fig. 2c,d,f), Vermis areas III and IV-V (Fig. 2c,d) as well as the dentate nucleus (Fig. 2f). More details are available in Supplementary Table 5.

Figure 2
figure 2

Enhanced brain measures for implicitly processing happy compared to neutral voices, corrected for multiple comparisons (wholebrain voxel-wise p < .05 FDR, k > 10 voxels). (a,b) Lateral activations rendered on a sagittal image highlighting middle, superior temporal and cerebellar regions. (c,d) Medial activations of the anterior cingulate cortex, parahippocampal cortex and cerebellum. (e) Subcortical activity in the putamen, thalamus and globus pallidus displayed on an axial slice. (f) Cerebellar activations displayed on an axial slice. (g) Percentage of signal change extracted using singular value decomposition on 9 voxels around each peak in a subset of regions with individual values (circles), mean values (bars) and standard error of the mean (error bars) for happy and neutral voices. Blue circles: happy voices; Black circles: neutral voices. L: left; R: right; IFGtri: inferior frontal gyrus triangularis part; STG: superior temporal gyrus; STS: superior temporal sulcus; MTG: middle temporal gyrus; INS: insula; SMG: supramarginal gyrus; FG: frontal gyrus; FFG: fusiform gyrus; PHG: parahippocampal gyrus; ACC: anterior cingulate cortex; Cereb: cerebellum; Cereb Lob: cerebellum lobule; Cereb Nucl Dentate: dentate nucleus of the cerebellum; Brainstem LL: lateral lemniscus of the brainstem; Thalamus VLN: ventral lateral nucleus of the thalamus; GPe: external globus pallidus; Cereb Crus: cerebellum crus of ansiform lobule. ‘a’ prefix: anterior part; ‘m’ prefix: mid part; ‘p’ prefix: posterior part.

Figure 3
figure 3

Coupled seed-to-seed, gPPI functional connectivity for the interaction between the Emotion and the Acoustic parameter factors contrasting angry > neutral voices * original > f0 & energy synthesized voices, corrected for multiple comparisons (p < .05 FDR). l and L: left; r and R: right; FO: frontal operculum; GPe: external globus pallidus; pSTG: posterior superior temporal gyrus; STT: spinothalamic tract of the brainstem; POTPT: parieto-occipito-temporo-pontine tract of the brainstem; Cereb Lob: cerebellum lobule.

Interaction effect between Emotion and Acoustic Parameters factors

The full, two-way interaction between our Emotion and Acoustic Parameters factors did not reveal significant results when contrasting angry or happy voices to neutral voices while taking into account normal compared to synthesized voices. We had, however, a specific hypothesis concerning the rhythmicity of angry voices, namely the impact of the ‘envelope’ of such voices on basal ganglia regions. We therefore used model 3 to compute a contrast dedicated to highlighting brain regions sensitive to the envelope of angry compared to neutral, synthesized energy voices [synthesized energy for angry > neutral voices]. The contrast revealed enhanced activity in the left ventral lateral and lateral posterior nuclei of the thalamus, putamen, substantia nigra, right caudate head and thalamus, as well as in the bilateral insula, left amygdala and right mid-to-anterior and posterior STG (Supplementary Table 7). Similar regions, especially large parts of the STG and STS, were also more active for the synthesized energy of happy voices, namely for the [synthesized energy for happy > neutral voices] contrast (Supplementary Table 8).

Functional connectivity results

Wholebrain analyses revealed significant results for both of our factors (Emotion, Acoustic Parameters), but their interaction did not yield any above-threshold activations. Functional/effective connectivity analyses (both seed-to-seed and seed-to-voxel), however, did reveal several coupled and anti-coupled networks underlying this two-way interaction between the Emotion and Acoustic Parameters factors. While functional connectivity results were primarily used to guide the subsequent effective connectivity analyses, we kept them in the present section because of their specificity and general relevance. These results are presented below.

Seed-to-seed functional connectivity

Seed-to-seed analyses, computed using 137 ROI (58 ‘aal’ regions within our field of view, 23 brainstem regions, 22 basal ganglia regions and 34 cerebellum regions), revealed significant results for the interaction between the Emotion and Acoustic Parameters factors, for each emotion of interest. Our contrasts of interest included angry or happy compared to neutral voices when spoken normally as opposed to synthesized f0 and energy voices. Seed-to-seed functional connectivity specific to angry original voices was therefore computed with the [angry > neutral voices * original > f0 & energy synthesized voices] contrast, revealing coupled networks. As predicted, we observed coupling between the basal ganglia and the cerebellum, more specifically between the left GPe and the right cerebellum lobule X (Fig. 3). Coupled functional connectivity was also observed between the left pSTG and the right frontal operculum, and in the brainstem between major motor (right parieto-occipito-temporo-pontine tract) and sensory tracts (bilateral spinothalamic tract). Detailed results are reported in Supplementary Table 9.

Looking at positive emotion stimuli, happy voices yielded coupled and anti-coupled seed-to-seed functional connectivity results, as seen in the [happy > neutral voices * original > f0 & energy synthesized voices] contrast (Fig. 4). Coupled functional connectivity revealed three distinct networks: (1) internal globus pallidus (GPi) and aSTG in the right hemisphere; (2) left pMTG and right central operculum cortex; (3) right corticospinal tract (major motor tract) and right lateral lemniscus (major sensory tract). Happy voices also led to two separate anti-coupled networks, one involving the right paracingulate cortex and subcalcarine cortex and the other involving posterior temporal areas, namely the left pMTG and right pSTG (Fig. 4). Detailed results are reported in Supplementary Table 10.

Figure 4
figure 4

Coupled and anti-coupled seed-to-seed, gPPI functional connectivity for the interaction between the Emotion and the Acoustic parameter factors contrasting happy > neutral voices * original > f0 & energy synthesized voices, corrected for multiple comparisons (p < .05 FDR). l and L: left; r and R: right; PaCC: paracingulate cortex; SubCC: subcalcarine cortex; GPi: internal globus pallidus; COC: central operculum cortex; aSTG: anterior superior temporal gyrus; pSTG: posterior superior temporal gyrus; pMTG: posterior middle temporal gyrus; CST: corticospinal tract of the brainstem; LL: lateral lemniscus of the brainstem.

Figure 5
figure 5

Coupled and anti-coupled seed-to-voxel, gPPI effective connectivity for the interaction between the Emotion and the Acoustic parameter factors contrasting angry > neutral voices * original > f0 & energy synthesized voices, corrected for multiple comparisons (p < .05 FDR). (a) Inferior view showing direct coupling between the left STN (seed) and the ipsilateral Cerebellum Crus II. (b) Sagittal view showing direct anti-coupling between the left GPe (seed) and the left MFG and pMTG. (c) Sagittal view showing direct coupling between the left caudate nucleus (seed) and the right primary auditory cortex (Heschl’s gyrus) and planum temporale. L: left; R: right; STN: subthalamic nucleus; GPe: external globus pallidus; Caud: caudate nucleus; Cereb Crus II: cerebellum crus II of ansiform lobule; pMTG: posterior middle temporal gyrus; MFG: middle frontal gyrus; HG: Heschl’s gyrus; PT: planum temporale.

Seed-to-voxel effective connectivity with the basal ganglia as seeds

In order to determine the direct relations between BG regions and the rest of the brain at the voxel level, we computed seed-to-voxel analyses using multivariate regressions, taking only the BG as seeds (N = 22 ROI; Fig. 6). We only observed significant effective connectivity specific to angry—but not happy—voices through the interaction with the Acoustic Parameters factor [angry > neutral voices * original > f0 & energy synthesized voices]. This multivariate analysis revealed direct coupling between the left STN (seed) and the ipsilateral cerebellum crus II of the ansiform lobule (MNI xyz −4 −86 −42; t14 = 4.14, k = 26 voxels; p = 0.031 FDR corrected, two-tailed; Fig. 5a). We also observed anti-coupling between the left GPe (seed) and the left temporo-occipital MTG (MNI xyz −60 −50 −2) and MFG (MNI xyz −44 34 20; for both clusters, t14 = 4.14, k = 29 and 20 voxels, respectively; p = 0.018 and 0.048 FDR corrected, two-tailed, respectively; Fig. 5b). Finally, direct coupling was observed between the left caudate nucleus (seed) and voxels covering part of the right primary auditory cortex and planum temporale (MNI xyz 54 −12 0; t14 = 4.14, k = 64 voxels, p = 0.00009 FDR corrected, two-tailed; Fig. 5c).

Seed-to-seed effective connectivity within the basal ganglia

We were ultimately interested in the effective connectivity within the basal ganglia when processing emotional (angry, happy) voices, independently of low-level acoustic parameters (synthesized f0, energy). We therefore used multiple regression analyses within the BG for our interaction contrasts to highlight direct relations between BG regions. The anger-specific contrast [angry > neutral voices * original > f0 & energy synthesized voices] did not reveal any effective connectivity between BG regions, whereas the happiness-specific contrast [happy > neutral voices * original > f0 & energy synthesized voices] revealed coupling between the left putamen and GPi (t14 = 3.78, p = 0.030 FDR corrected, two-tailed) as well as anti-coupling between the left GPi and the ipsilateral nucleus accumbens (t14 = −3.65, p = 0.039 FDR corrected, two-tailed).

Discussion

The present study aimed to determine the functional role of both the basal ganglia and the cerebellum within an integrative neural model of vocal emotion perception, decoding and integration, using focal, high-resolution fMRI. We assumed that connectivity (functional and/or effective) between the BG and the cerebellum would underlie the differential processing of emotion, namely angry and/or happy compared to neutral voices, especially when constraining our data by the use of low-level acoustic parameters of no interest (synthesized f0 and synthesized energy voices). Our results confirmed the hypothesized involvement of subparts of the BG and cerebellum in processing vocal emotions. The interaction between the Emotion and Acoustic Parameters factors yielded significant results only for connectivity analyses. Functional connectivity data revealed coupled and anti-coupled networks involving the BG and cerebellum, while effective connectivity within the BG, and with the BG as seeds, shed new light on the involvement of the internal and external globus pallidus, putamen, left STN and caudate nucleus in vocal emotion processing.

The involvement of subcortical structures other than the amygdala in emotion processing was only recently emphasized21,38, and deep brain stimulation of the STN as a neurosurgical treatment for Parkinson’s disease and obsessive–compulsive disorder opened a new research window11. According to Péron and colleagues’ model (2013)11, and in line with the existing literature and our results, the processing of emotion would rely on both direct (‘hyperdirect pathway’) and indirect coupling between STN subterritories (motor, associative and limbic) and the neocortex, especially the orbitofrontal cortex (OFC) and modality-specific primary and secondary cortices. Indirect coupling would transit from the STN to the OFC through the BG, especially the GPi and GPe, thalamus, substantia nigra and ventral tegmental area, and/or through the amygdala, which also exhibits some direct connections with the BG11. The STN could synchronize oscillations in relevant areas across the brain, including the cerebellum, to shape cortical learning and facilitate habitual, overlearned processing of familiar stimulus types7. Our results fit well with such a model and constrain it by adding nuance to the expected synchronized regions across the brain. In fact, we observed enhanced activity in several subparts of the BG and in different territories of the cerebellum. More specifically, for angry (and similarly for happy) voice processing we observed the involvement of the GPe and thalamus, as well as of several lobules (IV, V, VI), nuclei (fastigial, dentate) and areas (Vermis area VI) of the cerebellum, and of posterior, mid and anterior temporal regions within the voice-sensitive areas. GP activity fits with the more accurate recognition of vocal emotion in healthy participants compared to BG-lesioned patients39, and with a general role of the more dorsal BG in the sequencing and anticipation of acoustic temporal variations18. The BG would therefore be crucial to detect and classify auditory patterns, subsequently synchronizing activity in other regions for selecting the appropriate response.

The involvement of the limbic cerebellum (predominantly the vermis) and of associative regions of the cerebellum (including posterior hemispheric lobules24,40,41,42) in our wholebrain and connectivity results is in line with the general role of the cerebellum in auditory perception43 and, more specifically, in emotion recognition and perception44. These areas of the cerebellum could then modulate cortical oscillations based on prediction-error feedback relative to the given context45,46. By continuously monitoring incoming stimuli for deviations from the expected emotional structure (e.g., an angry voice) and through fine-tuned interval timing47, the limbic and associative territories of the cerebellum (in our results, Vermis IV and VI and hemispheric lobules IV–VI and VIII, respectively) could signal the need for greater attentional control of sensory cortical responses. Cerebellar activity in our results would also fit well with response adaptation and motor control48, preparing a response following vocal emotion decoding and processing49, especially when the voice or sound is perceived as aversive50. Input to the limbic cerebellum (Vermis and fastigial nucleus) from the OFC or the BG regarding the salience of emotional stimuli would shape internal models of how an emotional response would affect the individual in their current state and, thus, of how the cerebellum modifies limbic responses, especially in the temporal domain27.

The idea of temporal pattern analysis in the associative territory of the cerebellum has been proposed, especially when patterns are irregular and not rhythmic26, which includes the temporal patterns of vocal emotion and emotional prosody. Specifically, a double dissociation between patients with BG or cerebellar lesions confirmed that cerebellar lesions alter non-rhythmic–but not rhythmic–temporal prediction, whereas BG lesions showed the opposite pattern27. Additionally, misattributions in emotion recognition between surprise and fear correlated with lesions in lobules VIIb, VIII and X of the cerebellum12, regions that overlap with our results for angry and happy voices in both the wholebrain activation and connectivity analyses and are in line with previous evidence of emotional processing within these specific regions22,31,32. These cerebellar lobules may therefore have a crucial function in the recognition of emotion in voices, notably through temporal pattern analysis and the integration of critical low-level acoustics such as f0 or pitch.

The importance of BG-cerebellum connections in vocal emotion processing, especially for anger, was further emphasized by our functional connectivity data for angry, but not happy, original voice processing (removing the variance explained by synthesized f0 and energy), which revealed coupling of the GPi and putamen with lobule X of the cerebellum. These results are consistent with a coupling of BG and cerebellum activity in time for autonomic emotional reaction and prediction generation51, interval timing47 and motor prediction48, but cerebellar lobule X is more rarely observed in emotion-related tasks. This cerebellar lobule, however, was recently integrated into the ‘triple nonmotor representation’, and evidence shows its limbic ties with the neocortex52. It is also important to note here that many cerebellar sub-regions often labelled as ‘motor’ (for example, linked to hand or eye movements) are also significantly involved in cognitive or emotional tasks53,54, such as lobules V, VI and VIII24. Our results therefore converge toward a critical role of the cerebellum, in coordination with the BG, for both the decoding of vocal emotion—in the temporal, voice-sensitive areas—and the conversion to a motor response48 as an output behaviour following a subjective feeling of emotion7,49.

Furthermore, our effective connectivity results strongly emphasized within-BG direct relations between the putamen and GPi (coupling) and between the GPi and nucleus accumbens (anti-coupling), as well as between BG seeds and frontal and superior temporal regions. Additionally, effective seed-to-voxel connectivity revealed direct coupling between the left STN and the ipsilateral cerebellum crus II of the ansiform lobule. While the role of the STN in emotion processing20,55,56,57,58 and vocal emotion recognition11,19,49,59,60 has gathered strong interest in recent years, the crus II area of the cerebellum also subserves cognition and emotion processes29,44,61. Direct coupling was also observed between the left caudate nucleus and the primary auditory cortex and planum temporale, again fitting well with the direct coupling between the BG and modality-specific sensory cortex11, with the caudate playing a critical role in voice arousal62 and emotion processing63.

Interestingly, we also observed direct anti-coupling between the left GPe, involved in the explicit recognition of emotional prosody39, and the ipsilateral posterior MTG and MFG, superior to and slightly overlapping with the pars triangularis of the IFG. Activity modulations in these lateral brain areas have repeatedly been observed in voice processing in general64 and in vocal emotion processing65,66, especially when contrasting happy to angry voices67. The fact that posterior MTG activity was previously linked to happy vs. angry voice processing could therefore explain the coupling we observed specific to happy voices, especially since GP functioning relates to explicit vs. implicit emotion recognition39.

While our data depict a relatively clear picture of the importance of the BG and cerebellum for vocal emotion processing and the ensuing output response, some limitations should be mentioned. First, the sample size was limited and, even though we were strict with the correction of p values in our statistical analyses, a sample size closer to 25 participants would have been preferable for reliable generalization and reproducibility. Second, p values for wholebrain data analyses were corrected for multiple comparisons using a voxel-wise False Discovery Rate (FDR) procedure, which controls the expected proportion of false positives among suprathreshold voxels rather than correcting for the total number of voxels in the brain, as Family-Wise Error (FWE) correction does. While FDR is widely used in the functional MRI literature, we cannot exclude a higher number of false-positive voxels compared to FWE correction. Third, and as often observed in the literature, we included happy, angry and neutral emotions as vocal stimuli, but other critical emotions such as fear, surprise, sadness and several others were not included, therefore restricting our conclusions. Fourth, although we did include low-level acoustic parameters to control for emotion-specific activity, other meaningful parameters should be used in the future, for instance in the spectral domain related to voice quality perception, which is also thought to be important for emotional voice recognition. Fifth, we used high-resolution fMRI, greatly improving spatial resolution at the cost, however, of a truncated field of view. We therefore cannot exclude that frontal and parietal regions excluded at data acquisition would play a role in vocal emotion processing, in terms of both activation and connectivity, using the same task. It is, however, worth mentioning that the focus of the present study was on cerebellar and basal ganglia contributions to vocal emotion processing. Sixth, we did not divide the STN and other BG or cerebellar regions into their known associative, motor and limbic subparts. A more precise understanding of the specific role of each subpart of the BG nuclei is therefore unfortunately not possible at this stage. This concern should be addressed in the future through subject-level delineation of BG sub-territories and/or even higher fMRI resolution, such as with a 7-T scanner. Finally, while our functional connectivity results were consistent with the existing literature, we cannot rule out that other regions mediate the correlations between ROI, so these results should be taken with more caution than the effective connectivity results, which relied on more direct mathematical association measures (multiple regressions). In addition to these limitations, future studies should try to highlight emotional substrates within the BG and cerebellum pertaining to sub-components of emotion44, such as perception and/or decoding, subjective feeling, response output and behavioural response to emotion, as well as giving more importance to task designs allowing for a clearer topography and parcellation of the affective BG and cerebellum. Future studies should also include patients with known alterations and/or lesions of basal ganglia and cerebellar regions, such as in Parkinson’s disease—or any relevant lesion within these regions of interest48—or with biases in emotion recognition and processing44, such as in depression or schizophrenia, and compare them to healthy, matched controls.

In conclusion, the present study aimed at a better understanding of the involvement of the basal ganglia and cerebellum in vocal emotion processing. Through the combination of wholebrain analyses, functional and effective connectivity analyses and the partial exclusion of low-level acoustics of interest (voice f0, energy), our data depict a clearer role of the STN, GP and putamen in vocal emotion processing, especially for auditory pattern detection and synchronization across cortical and subcortical limbic networks. The current results add weight to the assertion that both direct and indirect coupling between these BG regions and the cortex is modulated by BG–cerebellum connections. Our results also favour a framework in which the brain uses temporal regularities (‘patterns’) to analyse and anticipate the timing of future events, and to constrain attention and action accordingly. Further work could use a dedicated task and focus on BG and cerebellum subterritories, since their specific role(s) are of the highest interest for affective and social neuroscience research.

Material and methods

Participants

We initially included 19 healthy participants but excluded four of them from the analyses because of MRI signal artifacts (N = 2) or psychiatric disorder (N = 2). The remaining sample consisted of seven males and eight females (N = 15), with a mean age of 30.5 years (SD = 3.48, range 27–37 years; mean age (SD) for female participants was 30.25 (3.24) and for male participants 30.85 (3.98)). All included participants were right-handed, native French speakers, and had normal or corrected-to-normal vision and normal hearing. None of them had a history of neurological disease or psychiatric disorder.

Ethics declarations

Participants gave written informed consent for their participation in accordance with the ethical and data security guidelines of the University of Geneva. The study was approved by the local ethics committee and conducted according to the Declaration of Helsinki.

Experimental setup

One-back task

Stimuli

The vocal (prosodic) stimuli consisted of two pseudosentences spoken with different emotional prosodies (“ne kali bam sud molen!” and “kun se mina lod belam?”; mean duration = 1642 ms, range = 854–2788 ms) extracted from a previously validated database, the GEneva Multimodal Emotion Portrayals (GEMEP) corpus68. Alongside these prosodic stimuli (anger, happiness and neutral), we played synthesized stimuli, built from the original emotional and neutral sounds, in order to control for the temporal dynamics of energy and f0. These two basic acoustic features are known to be the most correlated with emotional prosody judgments69,70. The first type of synthetic stimulus (synthesized intensity) consisted of a section of white/pink noise, to which the intensity contour of the original stimulus was applied. The second type of synthetic stimulus (synthesized f0) was a series of pure sine waves (with constant amplitude), the frequency of which corresponded to the f0 of the original vocal stimulus, allowing us to maintain the temporal dynamics of the f0. Both synthetic stimuli had the same duration as the original recordings. All sounds were matched for mean energy to avoid overly strong loudness effects. Two runs were constructed, featuring the different kinds of stimuli in pseudorandom order (no more than three successive stimuli from the same experimental condition). Each run contained 20 trials featuring anger stimuli, 20 trials featuring happiness stimuli, and 20 trials featuring neutral stimuli, as well as 15 synthesized intensity stimuli, 15 synthesized f0 stimuli, and one section of white noise at the beginning (first stimulus) with a gradual onset to accustom the participants to the auditory material. Each run contained a different list of stimuli. In each prosodic condition, we controlled for the pseudo-sentence being pronounced and the sex of the actor who pronounced the utterances: a female actor pronounced half the stimuli, half of them consisting of the pseudo-sentence "ne kali bam sud molen!”. The total duration of each run was ~ 10 min, and there was a short break between them. Each run contained pairs of identical subsequent stimuli, representing 10% of the total stimuli (pseudorandom order), to allow a one-back task to be performed by the participants, therefore forcing them to attend carefully to each stimulus.
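To make the construction of these control stimuli concrete, the sketch below shows one way to generate a constant-amplitude sine wave following an f0 contour, an intensity-shaped noise stimulus, and a common-energy scaling step. It is an illustrative Python sketch only: the sampling rate, frame hop and amplitude values are assumptions, not the settings used to build the GEMEP-derived stimuli.

```python
import numpy as np

SR = 44100  # assumed sampling rate (Hz); the original synthesis settings are not reported here

def synth_f0(f0_track, hop_s=0.01, amp=0.1):
    """Constant-amplitude sine wave following a voice's f0 contour.
    f0_track: one f0 value (Hz) per hop_s-second analysis frame."""
    f0 = np.repeat(np.asarray(f0_track, float), int(hop_s * SR))  # sample-rate f0 curve
    phase = 2 * np.pi * np.cumsum(f0) / SR                        # integrate instantaneous frequency
    return amp * np.sin(phase)

def synth_intensity(intensity_track, hop_s=0.01):
    """White noise shaped by a voice's intensity (energy) contour."""
    env = np.repeat(np.asarray(intensity_track, float), int(hop_s * SR))
    return env / env.max() * np.random.randn(env.size)

def match_mean_energy(signal, target_rms=0.05):
    """Scale a stimulus to a common RMS so that mean energy is roughly matched."""
    return signal * target_rms / np.sqrt(np.mean(signal ** 2))
```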

Experimental procedure, paradigm

In order to avoid expectancy effects, we varied, in each trial, the duration of the interval between the onset of the fixation cross and the onset of the auditory stimulus. In other words, the presentation of each auditory stimulus was preceded by a silent portion of pseudorandom duration, ranging from 50 to 250 ms, the so-called jitter (Fig. 6). After the offset of the sound, we also included a silent portion ranging from 3000 to 3500 ms. To avoid the offset of the sound and the offset of the fixation cross being synchronous, we varied the duration of the interval between these two offsets. Finally, to minimize any retinal afterimage, we ensured that the colour of the fixation cross did not contrast too greatly with the colour of the desktop background.
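The timing scheme can be summarized with a short, purely hypothetical sketch: uniform sampling of the 50–250 ms jitter and of the 3000–3500 ms post-stimulus silence is assumed here, as the exact pseudorandomization procedure is not detailed above.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seed chosen arbitrarily for the example

def build_trial_onsets(stimulus_durations_s):
    """Return stimulus onset times (s), inserting a 50-250 ms pre-stimulus jitter
    and a 3000-3500 ms post-stimulus silence around each sound."""
    t, onsets = 0.0, []
    for duration in stimulus_durations_s:
        t += rng.uniform(0.050, 0.250)          # jitter between fixation onset and sound onset
        onsets.append(round(t, 3))
        t += duration + rng.uniform(3.0, 3.5)   # sound plus post-stimulus silence
    return onsets

print(build_trial_onsets([1.6, 1.8, 1.5]))      # three placeholder stimulus durations
```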

Figure 6
figure 6

Experimental timeline and details of stimuli for the one-back task. (a) Following technical scans (localizer and field map), the first run lasted 10 min, during which participants performed a one-back task on the voices presented auditorily to them, responding with an MRI-compatible button box. The second run followed similarly for another 10 min, and the session ended with the acquisition of an anatomical image (5 min). Throughout the session, the participant lay in the scanner and had to pay attention to the auditorily presented vocal stimuli and perform the one-back task (10% of all trials). All stimuli had a duration of 1.3–2.2 s and an inter-trial interval of 3–3.5 s. (b) Voice stimuli consisted of pseudowords arranged in sentences with either the original vocal signal, a synthesized dynamic f0 manipulation or synthesized energy.

For each trial, participants were asked to keep their eyes open and relaxed. They were told they would hear meaningless speech uttered by male and female actors, as well as synthesized sounds. The binaurally recorded auditory stimuli were played through MR-compatible headphones (MR Confon GmbH, Magdeburg, Germany). Loudness was adjusted for each participant according to her/his hearing threshold at the beginning of the experiment. Participants were asked to focus on these auditory stimuli and to press a button whenever they heard two identical stimuli in a row. These one-back trials represented only 10% of all trials and were excluded from the analyses. The one-back task71 was administered to ensure that participants were paying attention to the stimuli. Prior to the task, an MR-compatible response box (Current Designs Inc., Philadelphia, PA, USA) was placed beneath the participant’s fingers. A similar task, greatly overlapping with the one used here, was previously used by Péron60.

Image acquisition

Imaging was conducted at the Brain and Behaviour Laboratory (BBL) of the University of Geneva. For the main task, high-resolution imaging data were acquired on a 3 T Siemens Trio System (Siemens, Erlangen, Germany) using a T2*-weighted gradient echo planar imaging sequence with 440 volumes per run (EPI; 1.5 × 1.5 × 2.2 mm voxels, slice thickness = 2 mm, gap = 0.2 mm, 31 slices, TR = 2320 ms, TE = 33 ms, flip angle = 90°, matrix = 128 × 128, field of view = 192 mm). The acquired volumes, representing a truncated field of view compared to standard wholebrain acquisition, were oriented almost perpendicular to the anterior commissure–posterior commissure (AC–PC) line to cover all regions of interest, especially the basal ganglia, cerebellum and the temporal lobe (see Fig. S2). The term ‘wholebrain’ in this manuscript therefore refers exclusively to our truncated field of view, not to volumes covering the whole brain. The total number of volumes across our fifteen participants was 13,200, for a total of 409,200 slices. A T1-weighted, magnetization-prepared, rapid-acquisition gradient echo anatomical scan (slice thickness = 1 mm, 176 slices, TR = 2530 ms, TE = 3.31 ms, flip angle = 7°, matrix = 256 × 256, FOV = 256 mm) was also acquired.

Image analysis

Wholebrain analyses

Functional image analysis was carried out using Statistical Parametric Mapping software 12 (SPM12, Wellcome Trust Centre for Neuroimaging, London, UK). Preprocessing steps included realignment to the first volume of the time series, slice timing correction, iterative normalization into Montreal Neurological Institute space72 using the DARTEL toolbox73 and spatial smoothing with an isotropic Gaussian kernel of 6 mm full width at half maximum. To remove low-frequency components, we used a high-pass filter with a cutoff period of 128 s. Anatomical locations were defined using the Automated Anatomical Labelling atlas74 incorporated in the xjView toolbox (http://www.alivelearn.net/xjview), together with atlases of the brainstem75, basal ganglia76 and cerebellum77,78 displayed in FMRIB Software Library v6.0 (FSL)79 through FSLeyes.
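The high-pass filtering step can be illustrated with a minimal numerical sketch that regresses out a discrete cosine basis of slow drifts, in the spirit of SPM's high-pass filter; the basis construction below is a simplified approximation, not SPM's exact implementation.

```python
import numpy as np

def dct_highpass(data, tr, cutoff=128.0):
    """Remove slow drifts by regressing out low-frequency cosine regressors
    (periods longer than `cutoff` seconds), SPM-style.
    data: (n_scans, n_voxels) BOLD time series; tr: repetition time (s)."""
    n = data.shape[0]
    k = int(np.floor(2.0 * n * tr / cutoff)) + 1   # number of low-frequency regressors
    t = np.arange(n)
    basis = np.column_stack(
        [np.cos(np.pi * (2 * t + 1) * j / (2 * n)) for j in range(k)]
    )
    beta, *_ = np.linalg.lstsq(basis, data, rcond=None)
    return data - basis @ beta                      # drift-removed time series

filtered = dct_highpass(np.random.randn(440, 1000), tr=2.32, cutoff=128.0)
```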

A general linear model was used to compute first-level statistics, in which each run was modelled as a distinct session and each trial was convolved with the hemodynamic response function, time-locked to the onset of each stimulus. Separate regressors were created for each condition, namely for each level of the Emotion and Acoustic Parameters factors (design matrix columns for each run, N = 9: anger original, anger f0, anger energy, happy original, happy f0, happy energy, neutral original, neutral f0, neutral energy). Regressors of no interest included the repetition trials of the one-back task, concatenated across conditions and added as an additional regressor, together with six motion parameters for each run to account for movement. Regressors of interest were used to compute nine simple contrasts (one per column of the design matrix, across runs) for each participant, yielding a main effect of each condition cited above at the first level of analysis. Simple contrasts were then entered into three distinct flexible factorial, second-level analyses. In model 1, the effect of the Emotion factor (angry, happy, neutral voices, acoustically untouched or ‘original’) was modelled with one Participant factor and one Emotion factor. In model 2, the factors Participant, Emotion (angry, happy, neutral voices) and Acoustic Parameters (original, f0 synthesized, energy synthesized) were included to model the two-way interaction between our main factors (Emotion*Acoustic Parameters). Model 3 included the main effect of the Acoustic Parameters factor (original, f0 synthesized, energy synthesized), modelled with one Participant factor and one Acoustic Parameters factor. For each model, independence for the Participant factor was set to ‘true’ with variance ‘unequal’, and the Emotion, Acoustic Parameters and Emotion*Acoustic Parameters factors were set with independence ‘false’ and variance ‘unequal’.
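As a rough illustration of the resulting per-run design (not the SPM batch actually used), the nilearn sketch below builds a first-level design matrix with one regressor per condition convolved with a canonical HRF; the event onsets and durations are placeholders.

```python
import numpy as np
import pandas as pd
from nilearn.glm.first_level import make_first_level_design_matrix

TR, N_SCANS = 2.32, 440                      # repetition time (s) and volumes per run
frame_times = np.arange(N_SCANS) * TR

# Placeholder events: the real design has one row per trial, with trial_type
# drawn from the nine condition labels listed above.
events = pd.DataFrame({
    "onset":      [12.4, 18.9, 25.1, 31.6],  # illustrative onsets (s)
    "duration":   [1.6, 1.8, 1.5, 2.0],      # stimulus durations (s)
    "trial_type": ["anger_original", "happy_f0", "neutral_energy", "anger_energy"],
})

design = make_first_level_design_matrix(
    frame_times, events,
    hrf_model="spm",                            # canonical HRF, as in SPM
    drift_model="cosine", high_pass=1.0 / 128,  # matches the 128-s cutoff above
)
print(design.columns.tolist())                  # condition regressors + drift terms + constant
```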

All neuroimaging activations were thresholded in SPM12 using a wholebrain voxel-wise false discovery rate (FDR) correction at p < 0.05, with an arbitrary cluster extent of k > 10 voxels.
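For readers unfamiliar with FDR thresholding, the generic Benjamini–Hochberg sketch below conveys the idea; it is not SPM's exact voxel-wise FDR implementation, and the k > 10 cluster-extent filter would be applied afterwards to the thresholded map.

```python
import numpy as np

def bh_fdr_threshold(p_values, q=0.05):
    """Benjamini-Hochberg procedure: return the largest p value that survives
    FDR control at level q, or None if no voxel survives."""
    p = np.sort(np.ravel(p_values))
    criterion = q * np.arange(1, p.size + 1) / p.size
    passing = p[p <= criterion]
    return passing.max() if passing.size else None

p_map = np.random.uniform(size=50000)          # placeholder voxel-wise p values
threshold = bh_fdr_threshold(p_map, q=0.05)
suprathreshold = p_map <= threshold if threshold is not None else np.zeros(p_map.shape, bool)
```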

Functional and effective connectivity analysis

Functional and effective connectivity analyses were performed using the CONN toolbox80 version 18.b implemented in Matlab 9.0 (The MathWorks, Inc., Natick, MA, USA) for the two-way interaction between our factors, namely Emotion and Acoustic Parameters (design matrix identical to the wholebrain analyses). As in the wholebrain data analysis, repetition trials of the one-back task were modelled as a single column including a concatenation of all their onset times across conditions (regressor of no interest). Functional connectivity analyses were mainly carried out to orient further effective connectivity analyses, and we decided to report both types of connectivity for a clear overview of the results. Functional connectivity analyses were computed using as seeds each region of interest (ROI) of the following atlases: the Automated Anatomical Labelling (‘aal’) atlas74 (N = 58 ROI), an atlas of the brainstem75 (N = 23 ROI), the basal ganglia76 (N = 22 ROI) and the cerebellum77,78 (N = 34 ROI). All ROI (N = 137; Supplementary Table 11) were within the bounds of our truncated field of view. Frontal, parietal and occipital areas outside the bounds of our field of view, specifically from the ‘aal’ atlas, were identified through CONN time-course visualization and removed from the analyses when a region had a flat time-course. For effective connectivity analyses, and according to our hypotheses, seed regions were limited to the basal ganglia76 (N = 22 ROI). Spurious sources of noise were estimated and removed using the automated toolbox preprocessing algorithm, and the residual BOLD time series was band-pass filtered using a low-frequency window (0.008 < f < 0.09 Hz). Correlation maps were then created for each condition of interest by taking the residual BOLD time course of each condition in each atlas ROI and computing bivariate Pearson correlation coefficients between the time courses of each voxel of each ROI, averaged by ROI (‘functional connectivity’ analyses). ‘Effective connectivity’ was approached using multivariate regressions between each seed ROI and all other ROI—or all brain voxels for the seed-to-voxel analysis—and a model was generated and used to characterize the direct connectivity between pairs. For both types of connectivity, we used generalized psychophysiological interaction (gPPI) measures, representing the level of task-modulated (often labelled ‘effective’) connectivity between ROI or between ROI and voxels. gPPI is computed using a separate multiple regression model for each target (ROI/voxel). Each model includes three predictors: (1) task effects convolved with a canonical hemodynamic response function (psychological factor); (2) the seed ROI BOLD time series (physiological factor); and (3) the interaction term between the psychological and physiological factors, the output of which is the regression coefficient associated with this interaction term. Finally, group-level analyses were performed on these regression coefficients to assess main effects within-group for the contrasts of interest in seed-to-seed and seed-to-voxel analyses. Therefore, ‘functional connectivity’ is defined in the present study as gPPI analysis using bivariate correlations between ROI, while ‘effective connectivity’ denotes gPPI analysis using multivariate regressions between ROI/voxels. Connectivity analyses were computed using methods in line with the most recent best practices81.
For both analyses, type I error was controlled using seed-level (seed-to-seed analyses) and cluster-level (seed-to-voxel analyses) false discovery rate correction at p < 0.05 to correct for multiple comparisons.
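The form of the gPPI model can be sketched for a single seed–target pair as below. This is a simplified Python illustration of the three-predictor regression described above, with the interaction term formed directly on the BOLD signal; the CONN toolbox's internal handling (e.g., any neural-level deconvolution) may differ, and all variable names are assumptions.

```python
import numpy as np

def gppi_betas(seed_ts, target_ts, task_regressors):
    """Estimate gPPI interaction coefficients for one seed/target pair.
    seed_ts:         (n_scans,) residual BOLD time series of the seed ROI
    target_ts:       (n_scans,) residual BOLD time series of the target ROI or voxel
    task_regressors: (n_scans, n_conditions) HRF-convolved condition regressors
    Returns one interaction (PPI) coefficient per condition."""
    n_scans, n_cond = task_regressors.shape
    psych = task_regressors                    # psychological factor(s)
    physio = seed_ts.reshape(-1, 1)            # physiological factor
    ppi = psych * physio                       # one interaction term per condition
    X = np.column_stack([np.ones(n_scans), psych, physio, ppi])
    beta, *_ = np.linalg.lstsq(X, target_ts, rcond=None)
    return beta[-n_cond:]                      # coefficients of the interaction terms
```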