Stimulus arousal drives amygdalar responses to emotional expressions across sensory modalities

The factors that drive amygdalar responses to emotionally significant stimuli are still a matter of debate; in particular, the proneness of the amygdala to respond to negatively valenced stimuli remains controversial. Furthermore, it is uncertain whether the amygdala responds in a modality-general fashion or whether modality-specific idiosyncrasies exist. Therefore, the present functional magnetic resonance imaging (fMRI) study systematically investigated amygdalar responding to stimulus valence and arousal of emotional expressions across visual and auditory modalities. During scanning, participants performed a gender judgment task while prosodic and facial emotional expressions were presented. The stimuli varied in stimulus valence and arousal by including neutral, happy and angry expressions of high and low emotional intensity. Results demonstrate amygdalar activation as a function of stimulus arousal and the associated emotional intensity, regardless of stimulus valence. Furthermore, arousal-driven amygdalar responding did not depend on the visual and auditory modalities of emotional expressions. Thus, the current results are consistent with the notion that the amygdala codes general stimulus relevance across visual and auditory modalities irrespective of valence. In addition, whole-brain analyses revealed that effects in visual and auditory areas were driven mainly by high-intensity emotional facial and vocal stimuli, respectively, suggesting modality-specific representations of emotional expressions in auditory and visual cortices.

It has been suggested that the amygdala classifies sensory input according to its emotional and motivational relevance 1,2 and modulates ongoing sensory processing, leading to enhanced representations of emotionally relevant stimuli 3,4. Social signals, such as emotional vocal and facial expressions, typically represent environmental aspects of high social and personal relevance (e.g., indicating other persons' intentions or pointing towards relevant environmental changes), and high-intensity expressions are associated with higher arousal ratings than low-intensity expressions 5. It has been shown that the amygdala responds to both emotional vocal [6][7][8][9] and facial expressions 10. However, despite a large body of imaging studies on this issue, previous research does not provide an unequivocal answer regarding the factors that drive amygdalar responses to emotionally expressive voices and faces. In particular, the specificity of amygdalar responding, that is, the proneness to respond to negative, threat-related emotional information, has been a matter of debate [11][12][13][14]. Furthermore, it is uncertain whether emotional signals from different sensory domains are processed in an analogous fashion or whether modality-specific idiosyncrasies exist 15,16.
Regarding the mentioned specificity of amygdalar activation to negative as compared to positive stimuli, findings have been mixed. Several studies employing emotional facial expressions suggest a heightened sensitivity for negative stimuli, threat-related stimuli in particular [17][18][19][20][21][22][23]. Unfortunately, most of these studies do not clarify whether this 'threat-sensitivity' reflects effects of stimulus valence and/or stimulus arousal 19. Several studies indicate that the amygdala is sensitive to positive and negative stimuli [24][25][26][27] and might code general effects of motivational relevance and, therefore, general arousal, irrespective of valence 11,12,14,28,29. With regard to facial expressions, enhanced amygdalar activation has been observed for various types of facial expressions, including happy and surprised faces 23,[30][31][32]. Previous research from our own lab provides evidence for amygdalar modulation as a […]

Procedure. Auditory stimuli were presented binaurally via headphones that were specifically adapted for use in the fMRI environment (Commander XG MRI audio system, Resonance Technology, Northridge, USA). When presenting auditory stimuli, a blank screen was presented simultaneously. Visual stimuli were shown via a back-projection screen onto an overhead mirror. Scanning was conducted in two runs (run duration 12 min). Overall, we had 10 conditions (2 modalities [faces vs. voices] × 5 expressions [angry high, angry low, neutral, happy low, happy high]). Each condition was presented in one block (see Fig. 1 for a schematic presentation of the procedure), consisting of ten trials (i.e., each facial/vocal identity [5 females, 5 males, see also the Stimuli section] was presented once in a block). The presentation sequence of each identity was randomized across blocks and participants.
Each block was presented twice, resulting in 20 blocks per run and 400 trials overall (5 expressions × 2 modalities × 10 identities × 2 repetitions × 2 runs). Blocks were separated by an 18-second pause. Visual stimuli were presented for 658 ms, and acoustic stimuli were presented for 658 ms on average (see stimulus description), with a stimulus onset asynchrony of 2000 ms. The sequence of blocks was counterbalanced between runs and across participants. Participants were instructed to perform a gender judgment task to ensure that they paid attention to the presented voices and faces. The instructions emphasized both speed and accuracy. Responses were given via button press with the index and middle fingers of the right hand, using a fiber optic response box (LUMItouch; Photon Control). Response assignments to index and middle finger were counterbalanced across participants. Only key presses during stimulus presentation were considered valid responses. Stimulus presentation and response recording were controlled by Presentation software (Neurobehavioral Systems, Inc., Albany, California).

Table 1. Mean rating data on intensity (1 to 7), arousal (1 to 9) and valence (1 to 9) for the facial and vocal stimuli employed in the present study. Note: Values in parentheses represent standard deviations (SD).

Figure 1. Each condition was presented in one block consisting of ten trials. Visual stimuli were presented for 658 ms, and acoustic stimuli were presented for 658 ms on average, with a stimulus onset asynchrony of 2000 ms. When presenting auditory stimuli, a blank screen was presented simultaneously. Each block was presented twice, resulting in 20 blocks per run. The sequence of blocks was counterbalanced between runs and across participants. Participants were instructed to perform a gender judgment task to ensure that they paid attention to the presented voices and faces.
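The design counts given in the Procedure can be checked with a few lines of arithmetic. The following is a minimal sketch using only the numbers stated above; the variable names are illustrative, and the block duration is derived under the assumption that a block spans exactly ten stimulus onset asynchronies.

```python
# Design parameters as stated in the Procedure section.
N_MODALITIES = 2    # faces, voices
N_EXPRESSIONS = 5   # angry high, angry low, neutral, happy low, happy high
N_IDENTITIES = 10   # one trial per identity in each block
N_REPETITIONS = 2   # each block is presented twice per run
N_RUNS = 2
SOA_MS = 2000       # stimulus onset asynchrony

conditions = N_MODALITIES * N_EXPRESSIONS
blocks_per_run = conditions * N_REPETITIONS
total_trials = blocks_per_run * N_IDENTITIES * N_RUNS

# Assumption: a block lasts ten SOAs (not stated explicitly in the text).
block_duration_s = N_IDENTITIES * SOA_MS / 1000

print(conditions, blocks_per_run, total_trials, block_duration_s)
# prints: 10 20 400 20.0
```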
Behavioral data recording and analysis. Accuracy and reaction times were analyzed with within-subject repeated measures analyses of variance (ANOVA) with the factors Modality (face and voice) and Expression (angry high, angry low, neutral, happy low, and happy high) using IBM SPSS 22 software (SPSS Inc., Chicago, Illinois). Greenhouse-Geisser and Bonferroni corrections were used, if appropriate. Results were regarded as statistically significant for p < 0.05.
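As an illustration of the Bonferroni correction mentioned above, the sketch below adjusts p-values for the number of comparisons. It assumes that the correction is applied to pairwise comparisons between the five expressions, which the text does not state, and the raw p-values are placeholders rather than results of the study.

```python
from itertools import combinations

EXPRESSIONS = ["angry high", "angry low", "neutral", "happy low", "happy high"]

def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# All pairwise comparisons between the five expressions (assumed test family).
pairs = list(combinations(EXPRESSIONS, 2))

# Placeholder p-values for illustration only; not data from the study.
raw_p = [0.004, 0.02, 0.5]
adjusted_p = bonferroni(raw_p)
```

With five expressions there are ten pairwise comparisons, so an uncorrected p-value would need to fall below 0.005 to remain significant at the 0.05 level.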
FMRI data acquisition and analysis. Scanning was performed in a 1.5-Tesla magnetic resonance scanner (Magnetom Vision Plus; Siemens Medical Systems). Following the acquisition of a T1-weighted anatomical scan, two runs of 245 volumes were obtained for each participant using T2*-weighted echo-planar images (TE = 50 ms, flip angle = 90°, matrix = 512 × 512, field of view = 200 mm, TR = 2973 ms). Each volume comprised 30 axial slices (thickness = 3 mm, gap = 1 mm, in-plane resolution = 3 × 3 mm). The slices were acquired parallel to the line between anterior and posterior commissure with a tilted orientation to reduce susceptibility artifacts in inferior parts of the anterior brain 55 . Before imaging, a shimming procedure was performed to improve field homogeneity. The first four volumes of each run were discarded from analysis to ensure steady-state tissue magnetization. Preprocessing and analyses were performed using Brain Voyager QX (Brain Innovation, Maastricht, the Netherlands). The volumes were realigned to the first volume to minimize effects of head movements. Further preprocessing comprised spatial (8 mm full-width half-maximum isotropic Gaussian) and temporal (high-pass filter: three cycles per run, linear trend removal) filtering. The anatomical and functional images were co-registered and normalized to the Talairach space. The expected BOLD signal change for each predictor was modelled with a canonical double γ haemodynamic response function. The GLM was calculated with predictors of interest being the factors Modality (face and voice) and Expression (angry high, angry low, neutral, happy low, and happy high).
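The modelling step can be sketched as a block boxcar convolved with a double-gamma haemodynamic response function. This is a simplified illustration, not the Brain Voyager implementation: the shape parameters (response peak near 5 s, undershoot near 15 s, ratio 6) are commonly used defaults assumed here, and the 20 s block length is derived from ten trials at a 2000 ms onset asynchrony. Only the TR of 2973 ms comes from the text.

```python
import math

def gamma_pdf(t, shape):
    """Gamma density with unit scale; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)

def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=6.0):
    """Response gamma minus a scaled undershoot gamma (assumed parameters)."""
    return gamma_pdf(t, peak_shape) - gamma_pdf(t, undershoot_shape) / ratio

def block_predictor(onsets_s, block_len_s, n_volumes, tr_s, dt=0.1):
    """Convolve a block boxcar with the HRF and sample it once per volume."""
    n_steps = int(round(n_volumes * tr_s / dt))
    boxcar = [0.0] * n_steps
    for onset in onsets_s:
        start = int(round(onset / dt))
        stop = min(n_steps, int(round((onset + block_len_s) / dt)))
        for i in range(start, stop):
            boxcar[i] = 1.0
    # HRF kernel sampled on the fine time grid, covering 32 s.
    kernel = [double_gamma_hrf(k * dt) * dt for k in range(int(32 / dt))]
    predictor = []
    for v in range(n_volumes):
        i = int(round(v * tr_s / dt))
        n_terms = min(len(kernel), i + 1)
        predictor.append(sum(kernel[k] * boxcar[i - k] for k in range(n_terms)))
    return predictor

# One hypothetical 20 s block starting at scan onset, sampled at TR = 2.973 s.
example = block_predictor([0.0], 20.0, 20, 2.973)
```

One such predictor per condition (Modality × Expression) would then enter the GLM as a regressor.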
Valence and arousal effects were investigated using a parametric approach involving balanced contrast weights, which were derived from normative valence and arousal ratings reported in Table 1. Analysis was conducted for two main contrasts (valence and arousal) and their interaction with modality. For the first main contrast, 'arousal', the arousal rating data for faces and voices were used as contrast weights, displaying a U-shaped function with higher values for high-intensity than for low-intensity expressions and with neutral expressions at the lowest point of the U-shape. Contrast weights were zero-centered. The second main contrast modeled valence effects by using normative valence ratings for faces and voices (see Table 1). This contrast modeled a linear function across expression predictors with positive values for positive valence. The two interaction contrasts of visual and auditory modalities with stimulus arousal or valence, respectively, were modeled using inverted contrast weights for voices. Interactions of arousal and valence were investigated with the mean-centered product of the mean-centered valence and arousal ratings. This parametric approach was chosen because rating data reflecting stimulus valence/arousal were regarded as the most accurate predictors of expected effects on amygdalar responses. Since contrast weights modeled brain activation separately for both modalities, we also controlled for potential differences across modality conditions.
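The construction of the contrast vectors can be sketched as follows. The ratings below are placeholders chosen only to show the U-shaped (arousal) and linear (valence) patterns; they are not the Table 1 values. Centering within each modality is an assumption made for this illustration, and all function names are hypothetical.

```python
from statistics import mean

EXPRESSIONS = ["angry high", "angry low", "neutral", "happy low", "happy high"]

# Placeholder ratings for illustration only; NOT the values from Table 1.
AROUSAL = {"face": [6.0, 4.5, 2.0, 4.0, 5.5], "voice": [6.5, 4.0, 2.5, 4.0, 6.0]}
VALENCE = {"face": [2.0, 3.5, 5.0, 6.5, 8.0], "voice": [2.5, 4.0, 5.0, 6.0, 7.5]}

def center(values):
    """Subtract the mean so that the weights sum to zero."""
    m = mean(values)
    return [v - m for v in values]

def main_contrast(ratings):
    """Zero-centered weights for the five face and five voice predictors."""
    return center(ratings["face"]) + center(ratings["voice"])

def modality_interaction(ratings):
    """Same weights as the main contrast, with the voice weights inverted."""
    return center(ratings["face"]) + [-w for w in center(ratings["voice"])]

def valence_by_arousal(valence, arousal):
    """Mean-centered product of the mean-centered valence and arousal ratings."""
    weights = []
    for modality in ("face", "voice"):
        product = [v * a for v, a in zip(center(valence[modality]), center(arousal[modality]))]
        weights += center(product)
    return weights

arousal_weights = main_contrast(AROUSAL)
valence_weights = main_contrast(VALENCE)
arousal_x_modality = modality_interaction(AROUSAL)
valence_x_arousal = valence_by_arousal(VALENCE, AROUSAL)
```

Each vector has ten entries, ordered as five face predictors followed by five voice predictors, and sums to zero, so it is insensitive to the overall activation level.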
Since the present study focuses on amygdalar response properties, data analysis was conducted as a region-of-interest (ROI) analysis for the amygdala. Additionally, to make the study more comprehensive, a whole-brain analysis was performed without a priori defined ROIs. The amygdala ROI was defined according to probabilistic cytoarchitectonic maps 56,57 and contained the superficial group, the basolateral group, and the centromedial group as subregions 58. Anatomical maps were created using the Anatomy Toolbox in Matlab (MATLAB 2014, The MathWorks, Inc., Natick, Massachusetts, USA) and transformed into Talairach space using CBM2TAL 59,60. Significant clusters were obtained through cluster-based permutation (CBP) with 1000 permutations. The non-parametric CBP framework was chosen in order to gain precise false discovery rates without the need for assumptions regarding test-statistic distributions 61. The voxel-level threshold was set to p < 0.005. For each permutation, individual beta maps representing activation patterns in a single experimental condition were randomly assigned without replacement to one of the tested experimental conditions. For example, to test the parametric arousal effect, the five beta maps corresponding to the five expressions were randomly assigned to these five conditions, separately for each subject. This approach is based on the assumption formulated by the null hypothesis, which states that the activation is equal across the five expressions within a given subject. Cluster mass was assessed by summing all t-values in neighboring significant voxels, where voxels are defined as neighbors if they share a face (i.e., each voxel has six neighbors). Cluster masses larger than the 95th percentile of the permutation distribution were considered statistically significant.
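The permutation scheme described above can be sketched in plain Python. This is a simplified illustration under several assumptions that are not taken from the text: the voxel-level criterion is expressed as a t threshold rather than p < 0.005, only positive effects are clustered, the null distribution is built from the largest cluster mass of each permutation, and all function names are hypothetical.

```python
import math
import random
from statistics import mean, stdev

NEIGHBOR_OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def one_sample_t(values):
    """t statistic of the mean against zero (0.0 if the values do not vary)."""
    sd = stdev(values)
    return 0.0 if sd == 0 else mean(values) / (sd / math.sqrt(len(values)))

def contrast_tmap(betas, weights):
    """betas[subject][condition][voxel] -> one t-value per voxel for the contrast."""
    tmap = []
    for v in range(len(betas[0][0])):
        per_subject = [sum(w * cond[v] for w, cond in zip(weights, subj)) for subj in betas]
        tmap.append(one_sample_t(per_subject))
    return tmap

def cluster_masses(tmap, shape, t_thresh):
    """Sum t-values over clusters of supra-threshold voxels that share a face."""
    nx, ny, nz = shape

    def t_at(x, y, z):
        return tmap[(x * ny + y) * nz + z]

    seen, masses = set(), []
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if (x, y, z) in seen or t_at(x, y, z) <= t_thresh:
                    continue
                seen.add((x, y, z))
                stack, mass = [(x, y, z)], 0.0
                while stack:  # flood fill over the six face neighbors
                    cx, cy, cz = stack.pop()
                    mass += t_at(cx, cy, cz)
                    for dx, dy, dz in NEIGHBOR_OFFSETS:
                        n = (cx + dx, cy + dy, cz + dz)
                        inside = 0 <= n[0] < nx and 0 <= n[1] < ny and 0 <= n[2] < nz
                        if inside and n not in seen and t_at(*n) > t_thresh:
                            seen.add(n)
                            stack.append(n)
                masses.append(mass)
    return masses

def critical_cluster_mass(betas, weights, shape, t_thresh, n_perm=1000, seed=0):
    """95th percentile of the largest cluster mass under within-subject shuffling."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        # Reassign each subject's condition maps to the conditions at random.
        shuffled = [rng.sample(subj, len(subj)) for subj in betas]
        masses = cluster_masses(contrast_tmap(shuffled, weights), shape, t_thresh)
        null.append(max(masses, default=0.0))
    null.sort()
    return null[int(0.95 * len(null))]
```

An observed cluster would be reported as significant when its mass, computed with `cluster_masses` on the unshuffled data, exceeds the value returned by `critical_cluster_mass`.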
FMRI results. ROI analysis. For the arousal contrast vector, a significant activation cluster was revealed within the right amygdala, showing responses as a function of stimulus arousal (peak voxel coordinates: x = 25, y = −4, z = −10; t_max = 3.39, cluster mass = 18.68, p < 0.001, CBP corrected, cluster size = 6 voxels or 162 mm³, see Fig. 2). Importantly, there was no significant interaction between stimulus arousal and modality (p > 0.05). Furthermore, there were no significant clusters for the main contrast of valence or for its interaction with stimulus modality (all ps > 0.05).
To additionally analyze whether there was an overall interaction between stimulus valence and stimulus arousal independent of modality, we used the mean-centered product of the mean-centered valence and arousal ratings as a contrast vector. No voxel reached the initially set voxel-level threshold. Finally, we also investigated potentially bimodal responses to valence 62 by comparing all negative with all other stimuli and all positive with all other stimuli. No voxels survived the voxel threshold.
Whole brain analysis. Several brain regions responded as a function of stimulus arousal, most importantly the mid superior temporal sulcus (STS, including the transversal gyrus), postcentral gyrus, posterior occipital cortex, insula, cingulate gyrus, and parts of the lateral frontal cortex (see Table 3 for a complete listing and Fig. 3 for main clusters).
Clusters in the mid STS (x = 54, y = −16, z = 6) reflected modulation by vocal expression, while effects in fusiform gyrus (x = −39, y = −40, z = −8) reflected modulation by facial expression (see Fig. 3). Congruently, significant arousal × modality interactions were observed for these and several other brain regions, including supramarginal gyrus and anterior cingulate, indicating either preferred responses to voices or to faces (see Table 4 for a complete listing).
With regard to stimulus valence, significant clusters were mainly revealed in multi- and supramodal regions (e.g., insula, posterior STS, supramarginal gyrus, middle frontal gyrus), in visual areas (e.g., fusiform gyrus), and in somatosensory areas (e.g., postcentral gyrus; see Table 5 for a complete listing of brain regions and Fig. 3 for main clusters). There were several significant valence × modality interactions, which reflected dominance of visually driven valence effects (see Table 6 for a complete listing).

Discussion
The present study investigated whether amygdalar responses to affective vocal and facial expressions reflect modulation by stimulus valence and/or stimulus arousal. Furthermore, it was of interest whether potential modulation of the amygdala by valence and/or arousal relies on analogous mechanisms for vocal and facial stimuli. We used voices and faces of varying emotional intensity across stimulus valence categories to examine this question. BOLD responses were modeled based on normative rating data on stimulus valence and arousal. Our results revealed amygdalar responses as a function of stimulus arousal and emotional intensity and, crucially, irrespective of stimulus valence. In addition, arousal-driven effects in the amygdala were independent of the visual and auditory modalities of incoming emotional information and reflected common response patterns across visual and auditory domains.
The proneness of the amygdala to respond to negative, threatening stimuli has been controversially debated 12,13. Although enhanced amygdalar activation to negative, threat-related stimuli has been frequently observed 17,18,20,23,48, there are few studies which provide convincing evidence in favor of valence-driven amygdalar responding (but see e.g., Kim et al. 19). On the other hand, there is strong empirical support for the notion that positive, negative, and ambiguous stimuli can elicit amygdalar responding, indicating that the amygdala shows general responsiveness to any salient emotional information 1,12,30 and stimuli related to personal goals 2,25-27. The present study adds to this observation, indicating that amygdalar responses might code general stimulus relevance irrespective of stimulus valence and threat-relation.
There is also accumulating evidence that emotional intensity impacts amygdalar responding for several categories of emotional stimuli (e.g., scenes 34,63,64 and odors 65,66). In line with these studies, we find a significant positive relationship between amygdalar activation and stimulus arousal, and thus also a positive relationship between amygdalar activation and emotional intensity of facial expressions. Regarding facial expressions, several other studies found effects of emotional intensity on amygdalar responding 5,29,35, which however varied. Interestingly, Gerber and colleagues 35 […] for weak, possibly ambiguous expressions. It is possible that the amygdala is sensitive to both stimulus intensity (signaling a need for prioritized processing) and stimulus ambiguity (signaling a need for gathering more sensory information), resulting in combined intensity and ambiguity effects 29.
Even though there are many studies investigating whether amygdalar responses to vocal and facial expressions reflect modulation by stimulus valence or stimulus arousal, findings have been inconsistent so far [11][12][13][14] . Unfortunately, the majority of affective face and voice processing studies neither provide orthogonal manipulations of the two factors, nor include rating data on stimulus valence and arousal (but see e.g., Kim et al. 19 ; Lin et al. 5 , for exceptions). In contrast to previous research, the present study provided highly arousing negative and positive expressions and systematically varied stimulus arousal and emotional intensity across emotional valence categories. Furthermore, statistical models were directly inferred from rating data on stimulus valence and arousal. Thus, our findings provide strong evidence that amygdalar responses to vocal and facial expressions reflect effects of emotional intensity and associated stimulus arousal and do not depend solely on stimulus valence.
Importantly, the present study also investigated whether amygdalar responses to stimulus arousal and expression intensity depend on the visual and auditory modalities of incoming information. The results of the present study provide evidence that the amygdala responds in an analogous fashion to social signals from visual and auditory modalities. These results are in line with earlier findings by Aubé and colleagues 49 , which suggest that the amygdala processes emotional information from different modalities in an analogous fashion. Our findings are also partly in line with the findings of Phillips and colleagues 36 , who found analogous amygdalar responses to fearful voices and faces (with respect to disgusted expressions, however, amygdalar enhancements were only observed for facial expressions). Interestingly, recent reviews proposed asymmetries in affective voice and face processing 15,16 . It is still uncertain, however, whether these asymmetries reflect minor relevance of subcortical structures in affective voice processing (as suggested by the authors) or methodological differences between the two research fields (e.g., less arousing vocal stimuli, smaller sample sizes, less sensitive statistical approaches in auditory studies). The present study experimentally manipulated stimulus modality as a within-subject factor and provided stimuli of comparable emotional properties across modalities. Controlling for methodological differences, we found parallel amygdalar response patterns for emotionally salient voices and faces. Thus, our results indicate that the amygdala responds in a domain-general fashion to emotional signals across visual and auditory domains with no modality-specific idiosyncrasies.
Besides the amygdala, our results provide evidence for domain-general, arousal-driven effects in several multimodal brain regions including the posterior STS, possibly indicating that these regions play an important role in the processing of stimulus arousal across visual and auditory modalities. A recent study by Lin and colleagues (2016) 5 showed that stimulus arousal strongly impacts activation of the posterior STS in response to facial expressions. Several researchers proposed that the posterior STS is involved in the representation of facial information, particularly the representation of emotional expressions 67,68 , and demonstrated coupling with other face processing areas such as the fusiform gyrus 69,70 . Moreover, parts of the STS have been suggested to be the vocal analogue of the fusiform face processing area 9,71,72 , representing vocal features of varying complexity dependent on their emotional significance 8,9,71,73 . In addition, the posterior STS and supramarginal gyrus have been reported to be involved in the integration of audio-visual information and to respond to multiple types of social signals 74,75 . The results of the present study extend the findings of Lin and colleagues 5 and indicate arousal-driven modulation of the posterior STS by facial and vocal expressions.
In addition, modality-specific arousal effects were observed in unimodal primary and secondary cortices, such as the lateral occipital cortex and the medial STS (mSTS), which showed enhanced activation in response to highly arousing faces and voices, respectively. In addition, modality-specific valence effects were also observed in some regions (see Table 5), which were primarily driven by visual stimulation, and reflected stronger activation to angry as compared to happy expressions. It is possible that advantages for the visual domain reflect a higher degree of specialization for representations of visual stimuli, in line with the dominance of visual representations in human perception. Mostly, modulation by stimulus valence did not reflect valence effects in isolation, but reflected mixed effects of stimulus valence and stimulus arousal, indicating limited empirical support for the valence model (see also Lindquist et al. 12 for a recent meta-analysis on the plausibility of valence-driven brain responses).
There are several limitations of the present study. Since fMRI results were based on a 1.5 Tesla scanner, future work should investigate these issues with 3 or even 7 Tesla scanners and their potentially increased sensitivity for more nuanced effects [76][77][78]. We do not suggest that the amygdala does not also code valence. However, the resolution of most fMRI studies makes it difficult to investigate this question in sufficient detail. Single unit studies also provide evidence for highly overlapping units with valence and arousal responses 79 […] small voxels due to valence, arousal, but also modality and other factors in more detail. Furthermore, the fact that the utilized auditory stimuli have no emotional meaning beyond prosody might be regarded as detrimental for the comparative validity of the employed stimuli. Importantly, there are several studies demonstrating that it is prosody rather than meaning that causes an emotional reaction [80][81][82][83]. In addition, it should be noted that both stimulus categories provide affective and, to a large extent, non-affective information such as basic visual/auditory features related to gender, age, and identity. Considering these aspects, we regard the parallelism between the employed voices and faces as relatively far-reaching 15,16. The present study used one specific negative emotion (i.e., anger) and a specific class of socially relevant stimuli. Thus, in order to ensure the generalizability of our findings to other types of negative expressions and emotional stimuli, the inclusion of a broader range of expressions 30 and further emotional stimuli (e.g., biological emotional stimuli 84) would be highly desirable. Finally, the present study found a valence-independent and modality-independent effect of arousal on amygdalar responding by using an implicit emotion task (i.e., a gender judgment task).
However, an explicit emotion task (e.g., an emotion discrimination task) is often used in studies on emotion processing. Furthermore, several studies have manipulated both explicit and implicit tasks to investigate the effect of task on the processing of emotional facial and vocal expressions 81,85,86 . Future studies might use both explicit and implicit tasks to investigate whether these tasks will show differential effects on arousal and valence dependent amygdala activations.

Conclusion
Based on normative rating data on stimulus valence and arousal, the present fMRI study suggests enhanced amygdalar activation as a function of stimulus arousal, which does not depend on stimulus valence. Furthermore, the present findings support the hypothesis of the amygdala as a common neural substrate in affective voice and face processing, which evaluates emotional relevance irrespective of visual and auditory modalities. Finally, whole-brain data provided evidence for modality-specific representations of emotional expressions in auditory and visual cortices, which again mainly reflected the impact of emotional intensity and associated stimulus arousal. Future high-resolution studies, however, should further investigate potentially overlapping and distinct activations in the amygdala depending on arousal, valence, stimulus modality and specific task contexts.

Table 6. Significant activations modelled by the parametric interaction of valence and modality. Note. Significant activation clusters as identified by valence × modality contrast weights (p < 0.05, CBP corrected). Negative t-values represent patterns with increased activity to faces compared to voices. The coordinates refer to the peak voxel in each cluster.