Modelling individual and cross-cultural variation in the mapping of emotions to speech prosody

The existence of a mapping between emotions and speech prosody is commonly assumed. We propose a Bayesian modelling framework to analyse this mapping. Our models are fitted to a large collection of intended emotional prosody comprising more than 3,000 minutes of recordings. Our descriptive study reveals that the mapping is relatively constant within corpora but varies across corpora. To account for this heterogeneity, we fit a series of increasingly complex models. Model comparison reveals that models taking into account mapping differences across countries, languages, sexes and individuals outperform models that only assume a global mapping. Further analysis shows that differences across individuals, cultures and sexes contribute more to the model prediction than a shared global mapping. Our models, which can be explored in an online interactive visualization, offer a description of the mapping between acoustic features and emotions in prosody.


Corpus selection procedure
To identify all available corpora capturing emotional prosody, we combined three search strategies with a final manual selection procedure (see Fig. 1 for an illustration). Each corpus included in the list needed to: (i) contain an annotation of the intended emotion per fragment, and (ii) consist of recordings of sentences (i.e., not syllables, nonverbal vocalizations, phonemes, or single words).
In the first approach (steps 1-4 in Fig. 1), we queried existing databases with a fixed set of keywords. We adapted an existing search query¹, (speech OR voice OR vocal OR prosody) AND (emotion* OR affect*), to query three databases: PubMed, Web of Science, and IEEE Xplore. Since PubMed did not support wildcards, we omitted the asterisk (*) for this database. We did not include PsycArticles, as it does not allow automatic querying, or Google Scholar, because its search results are individualized to location and user. Since some corpora might only be announced at conferences, and conference proceedings are not always listed in databases, we also scanned all conferences organized by the ISCA, the society behind the largest speech conference (INTERSPEECH). The results were not filtered by publication date. All database queries were performed on April 1st, 2020. This yielded a large list of potential corpora that was further reduced. To avoid duplicate entries, we only included papers with a valid identifier (e.g., ISBN, DOI, ISCA-URL, or an identifier in the database). The long list (∼50,000 entries) was then filtered with the following criterion (step 5 in Fig. 1): either the title or one of the keywords had to contain the word "database" or "corpus" (case insensitive). Keywords were provided by the authors, the journal, or the database. This reduced the number of publications to 969, which were all checked by hand. In a second step, we scanned existing reviews of corpora containing emotional prosody²⁻⁸ (step 6 in Fig. 1). As we might have missed some corpora in literature search engines, we also queried the data repositories Kaggle and Google Dataset Search (steps 7-8 in Fig. 1). All potential corpora were manually checked against the criteria described in the first paragraph (step 9 in Fig. 1). The 24 remaining corpora to which we obtained access are listed in Tab. 2.
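For illustration, a minimal sketch of the title/keyword filtering step in Python, assuming the deduplicated query results were exported to a CSV file with `title` and `keywords` columns (file and column names are hypothetical, not from the original pipeline):

```python
import csv

def is_candidate(record):
    """Keep a record if its title or one of its keywords contains
    "database" or "corpus" (case-insensitive)."""
    fields = [record.get("title", "")] + record.get("keywords", "").split(";")
    return any(term in field.lower()
               for field in fields
               for term in ("database", "corpus"))

def screen(path="query_results.csv"):
    """Reduce the long list of query results to candidates for manual checking."""
    with open(path, newline="", encoding="utf-8") as f:
        records = list(csv.DictReader(f))
    kept = [r for r in records if is_candidate(r)]
    print(f"{len(kept)} of {len(records)} records kept for manual checking")
    return kept
```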

Corpus annotation
As depicted in Fig. 2, all annotations are centered around a single fragment. A fragment is spoken by a single speaker, who has a biological sex. In a given fragment, the speaker speaks a sentence in a specific language in a specific country. For each fragment, it is clear which emotion was intended. If the same sentence is repeated by the same speaker for the same intended emotion, we call this a repetition. Each fragment is part of a speech corpus. For each corpus, we annotate whether the speaker was also recorded on video and whether the corpus is fully-crossed. Fully-crossed means that the same sentence is produced in all emotions, which implies that sentences in fully-crossed corpora are either semantically neutral or pseudo-speech. Pseudo-speech means that the fragment contains a Jabberwocky sentence, created with the phonological and syntactic characteristics of the reference language.
Then there are a series of annotations that are likely to be imperfect (marked by the warning sign in Fig. 2). The emotion induction procedure indicates how the emotion was elicited: by the meaning of the sentence, by an emotion label (with description), by a scenario, or by a dyadic interaction. The exact procedure was not always described in the respective papers. Furthermore, emotion intensity indicates whether the emotion intensity was experimentally manipulated in the fragment (yes or no). This annotation is imperfect because, in most corpora, the emotion intensity was not explicitly controlled and is thus undefined. The moderator "speaker type" describes whether the speaker is a professional actor, an actor, or a speaker. The issue here is that each professional actor is also an actor, and each actor is also a speaker. This nested structure makes the moderator suboptimal. We also annotate the year of publication or the release year of the corpus as a proxy for the year in which the stimuli were recorded. Obviously, the delay between recording and publication is not always equally long, which limits the precision of this proxy.
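To make the annotation scheme concrete, one fragment's annotations could be represented as a record like the following sketch; the field names are ours and merely mirror the annotations described above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fragment:
    # Fragment-level annotations
    speaker_id: str          # a single speaker per fragment
    sex: str                 # biological sex of the speaker
    sentence: str            # the sentence spoken in the fragment
    language: str
    country: str
    intended_emotion: str
    repetition: int          # nth repetition of sentence x speaker x emotion
    # Corpus-level annotations
    corpus: str
    video_recorded: bool
    fully_crossed: bool      # every sentence produced in all emotions
    pseudo_speech: bool      # Jabberwocky sentence in the reference language
    # Annotations that are likely imperfect
    induction: Optional[str]         # e.g., "label", "scenario", "dyadic interaction"
    intensity_manipulated: Optional[bool]
    speaker_type: Optional[str]      # "professional actor", "actor", or "speaker"
    year: Optional[int]              # publication/release year of the corpus
```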

Factor analysis and robustness of the solution
To interpret coefficients in a regression, the number of predictors should be limited to a set of relevant features, and the correlations among predictor variables should be low. Strong correlations among predictors, also called multicollinearity, inflate the variance of the estimated coefficients.
One approach to reduce the number of features is to select relevant predictors based on previous findings. However, acoustic features are correlated with each other (e.g., see Fig. 3a). Using bootstrapping, we randomly selected seven features from the eGeMAPS feature set (n = 10,000 draws). As depicted in Fig. 3b, there is a large probability that at least one pair of predictors is strongly correlated (notice the peaks in the distribution of the correlations at .45, .75, and .95). Moreover, only for a small proportion of random draws is the maximum pairwise correlation low (r < .30). Feature selection is thus not a solution, as the correlation among at least one pair of the selected features would be too high and would impair the interpretation of the model. An alternative to feature selection is dimension reduction, which yields orthogonal (i.e., uncorrelated) dimensions. However, dimension reduction can come at the price of interpretability: the resulting dimensions often contain blends of variables, whereas selected features usually capture one aspect of the signal (e.g., loudness). An intermediate position is taken by factor analysis with varimax rotation. Factor analysis reduces a set of observed, correlated variables to a potentially much smaller number of latent variables called "factors". The varimax rotation ensures that the factors have very low to no correlation with each other (see Fig. 3c).
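A minimal sketch of this bootstrap, assuming a fragments-by-features matrix of eGeMAPS values (the function and variable names are ours):

```python
import numpy as np

def max_pairwise_correlations(X, n_select=7, n_draws=10_000, seed=0):
    """X: (n_fragments, n_features) eGeMAPS matrix. For each draw, pick
    `n_select` features at random and record the largest absolute
    pairwise correlation among them (cf. Fig. 3b)."""
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-by-feature correlations
    off_diag = ~np.eye(n_select, dtype=bool)      # mask out self-correlations of 1
    maxima = np.empty(n_draws)
    for i in range(n_draws):
        idx = rng.choice(corr.shape[0], size=n_select, replace=False)
        maxima[i] = corr[np.ix_(idx, idx)][off_diag].max()
    return maxima

# Proportion of draws in which all pairs are only weakly correlated:
# (max_pairwise_correlations(features) < .30).mean()
```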
We initially computed the factor analysis on a balanced subset of the corpora, in which each emotion occurred equally often and thus contributed equally to the factor solution. The objective was to identify a minimal number of factors and to keep the model as simple as possible while still achieving decent predictive accuracy. To determine this number, we ran a series of SVMs that predict the emotions in each corpus separately for an increasing number of factors (four-fold leave-speaker-out cross-validation where possible, with hyperparameters identical to those in the main paper). Since the corpora included in the analysis contain different numbers of stimuli, we worried that large corpora might dominate the factor analysis. We therefore computed the factor solutions not only on all corpora at once but also on each corpus separately. We obtained the mean unweighted average recall (UAR) across corpora for an increasing number of factors. As depicted in Fig. 4a, the UAR increases with the number of factors, both for the dimension reduction across all corpora and for the dimension reduction within a corpus. For factor solutions with 3-6 factors, the UAR is larger (note the non-overlapping CIs) for the dimension reduction per corpus than for the dimension reduction on all corpora at once. However, for factor solutions with seven or more factors, there is no systematic advantage to applying the dimension reduction at the corpus level. In a second analysis, we fitted a series of Bayesian multinomial logistic regressions using 1 to 10 factors. As depicted in Fig. 4b, the WAIC improvement stagnates for factor solutions with more than seven factors.
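A sketch of how the leave-speaker-out evaluation can be set up for one corpus; the hyperparameters shown are placeholders, not those of the main paper:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

def corpus_uar(X, y, speaker_ids, n_splits=4):
    """Four-fold cross-validation in which all fragments of a speaker
    end up in the same fold; returns the mean UAR across folds."""
    cv = GroupKFold(n_splits=n_splits)
    uars = []
    for train, test in cv.split(X, y, groups=speaker_ids):
        clf = SVC(kernel="rbf", C=1.0)  # placeholder hyperparameters
        clf.fit(X[train], y[train])
        # UAR = recall averaged over emotion classes = balanced accuracy.
        uars.append(balanced_accuracy_score(y[test], clf.predict(X[test])))
    return float(np.mean(uars))
```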

Table 1: Description and perceived correlates of the eGeMAPS features. Cells marked "-" could not be recovered.

Group | Feature | Description | Perceived correlate
Frequency | F0 | Rate of vibration of the vocal folds, described with summary statistics (e.g., arithmetic mean); the change of F0 over time (the pitch contour) is solely described with a slope. | Pitch; F0 tends to be higher in aroused states³³
Frequency | Jitter | Small perturbations in F0 from one cycle to another, caused by irregular fluctuations in the time it takes to open and close the vocal folds. | Pitch perturbations; "roughness" in the voice³⁴
Frequency | First three formants (F1 to F3) | Caused by resonance in, and speaker modulations of, the vocal tract. | Voice quality³⁵
Amplitude | Intensity | Sum of amplitudes across all frequency bands; reflects the effort of the speaker to produce the utterance. Another amplitude measure is the equivalent sound level, which expresses the amplitude in decibels. | Loudness of speech
Amplitude | Shimmer | Variations in amplitude from cycle to cycle, caused by irregular fluctuations in amplitude. | "Roughness" in the voice³⁶
Amplitude | Harmonics-to-noise ratio (HNR) | Proportion between harmonic components (e.g., in vowels) and noise components (e.g., in unvoiced speech) in the voice. | -
Spectral balance | Alpha ratio | Ratio between the summed amplitude in the 50-1000 Hz and 1-5 kHz frequency bands. | Voice quality³⁸
Spectral balance | Hammarberg index | Ratio of the strongest peak amplitude in the 0-2 kHz and the 2-5 kHz frequency bands. | Voice quality⁴⁰
Spectral balance | Energy proportion | Energy below and above 500 Hz and 1000 Hz, respectively. | Voice quality (related to spectral slope)⁴⁰
Spectral balance | Harmonic difference | Difference between H1 and H2, and between H1 and A3, where H1 is the first F0 harmonic, H2 the second harmonic, and A3 the highest harmonic in the third formant range. | Voice quality (also related to spectral slope)⁴⁰
Spectral balance | Relative energy in formants | Amplitude of the formants relative to F0. | Voice quality³⁵
Spectral balance | Spectral flux | Speed at which the energy distribution across frequency bands changes over time. | Rhythm and timbre⁴¹
Spectral balance | MFCCs | - | -

Notes to Tab. 2: binary columns are coded as yes (1) or no (0). * eNTERFACE was recorded during a conference; the countries of the participants could not be obtained. ** 50% of the corpus is pseudo-speech; the other 50% consists of regular sentences. † The authors report 10 speakers, but the data received from the authors contain only 7 speakers.
A factor analysis was then computed on all 24 corpora. Prior to the analysis, all data were standardized to zero mean and unit variance. The means are reported in Tables 3 and 4. There is a strong average correlation (.86) between the factor solutions on the balanced data and on all data. We therefore use the factor analysis on all data. While it remains debatable why exactly seven factors, and not more or fewer, were used (compare the scree plot in Fig. 4c), the loading plot (Fig. 4d) indicates that the seven factors load on different prosodic dimensions that are perceptually relevant for the communication of emotion (see Tab. 1 for the perceived correlates of the eGeMAPS features).
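A minimal sketch of this dimension reduction step; scikit-learn's FactorAnalysis supports the varimax rotation directly (function and variable names are ours):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

def seven_factor_solution(X):
    """X: (n_fragments, n_features) matrix of eGeMAPS features.
    Returns factor scores and loadings of a varimax-rotated
    seven-factor solution."""
    X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance
    fa = FactorAnalysis(n_components=7, rotation="varimax", random_state=0)
    scores = fa.fit_transform(X_std)   # (n_fragments, 7) factor scores
    loadings = fa.components_.T        # (n_features, 7), cf. the loading plot in Fig. 4d
    return scores, loadings
```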
To further assess the robustness of the factor analysis, we computed a seven-factor solution for each of the four largest countries and the four largest languages, covering 87% and 89% of the data, respectively. We then projected all data onto the factor solution of the respective country or language. For each pair of countries and each pair of languages, we computed the optimal alignment by maximizing the correlation between the dimensions, and then averaged the correlations of the seven aligned factors. Some country and language pairs align better with each other than others, but on average we find correlations of r = .67 and r = .65, indicating a fair overlap in factor solutions across languages and countries (see Fig. 5).
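The alignment step can be treated as an assignment problem and solved exactly; a sketch, assuming factor scores from two solutions predicted on the same fragments (the exact procedure of the original analysis may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_factors(scores_a, scores_b):
    """scores_a, scores_b: (n_fragments, k) factor scores from two
    solutions on the same data. Returns the optimal one-to-one pairing
    of factors and the mean absolute correlation of the aligned pairs."""
    k = scores_a.shape[1]
    # Correlation of every factor in solution A with every factor in B.
    corr = np.corrcoef(scores_a.T, scores_b.T)[:k, k:]
    # Hungarian algorithm: maximize total |r| over one-to-one pairings.
    rows, cols = linear_sum_assignment(-np.abs(corr))
    return cols, float(np.abs(corr[rows, cols]).mean())
```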

Model comparison
The Widely Applicable Information Criterion (WAIC)⁴⁴ is an information criterion that, in contrast to the Akaike Information Criterion (AIC) and the Deviance Information Criterion (DIC), does not make an assumption about the shape of the posterior. As depicted in Equation 1, it consists of two parts: a log-pointwise-predictive-density estimate (lppd) and a penalty term. The difference between the two is multiplied by -2, following the same scaling convention as the AIC.
The lppd gives us the log probability score for each specific observation; larger values indicate higher average accuracy. Following Equation 2, the lppd takes the original data y and the posterior distribution Θ as input. Here, i is the index of the current observation (in our case, a single speech recording) and S is the total number of posterior samples (in our models always 4,000), with s as the index of the current sample. In other words: we compute the probability of each recording i for each posterior sample s, then take the average and the logarithm.
The lppd keeps improving for increasingly complex models. This is unwanted behavior, since we want to identify which models are plausible given the data, not which models overfit most. To avoid this issue, a penalty term is introduced (see Equation 3).
The penalty term computes the variance in log-probabilities for each recording i and sums these variances. The larger the variance for a recording, the more the model tends to overfit. Since we obtain a pointwise log-predictive-density estimate for every recording, we can also compute the standard error, which is defined in Equation 4, where N is the total number of recordings.
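Equations 1-4 are referenced above but not shown in this excerpt; reconstructed in LaTeX from the prose definitions (in the standard notation of McElreath⁴⁵), they read:

```latex
\mathrm{WAIC} = -2\,\big(\mathrm{lppd} - p_{\mathrm{WAIC}}\big) \quad (1)

\mathrm{lppd}(y, \Theta) = \sum_{i=1}^{N} \log \frac{1}{S} \sum_{s=1}^{S} p(y_i \mid \Theta_s) \quad (2)

p_{\mathrm{WAIC}} = \sum_{i=1}^{N} \operatorname{Var}_{s}\big[\log p(y_i \mid \Theta_s)\big] \quad (3)

\mathrm{SE}_{\mathrm{WAIC}} = \sqrt{N \cdot \operatorname{Var}\big[-2\,(\mathrm{lppd}_i - p_{\mathrm{WAIC},i})\big]} \quad (4)
```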
Due to the pointwise nature of the WAIC, it is important to ascertain that the WAIC is not driven by a few extreme observations. The influence of single observations can be estimated using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO). For each recording, a k value is estimated that provides information about the reliability of the approximation; recordings with larger k values are more influential in the WAIC. Model comparisons using the WAIC yielded results identical to those using PSIS-LOO. For more details, we refer to McElreath⁴⁵ and the Stan Development Team⁴⁶.
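A sketch of this diagnostic with ArviZ, assuming the fitted model is available as an InferenceData object with a pointwise log-likelihood group (variable names are ours):

```python
import arviz as az

def compare_information_criteria(idata):
    """idata: ArviZ InferenceData with a log_likelihood group
    (one pointwise value per recording)."""
    waic = az.waic(idata, pointwise=True)  # WAIC with pointwise contributions
    loo = az.loo(idata, pointwise=True)    # PSIS-LOO with Pareto k diagnostics
    # Recordings with large k dominate the estimate; k > 0.7 is the
    # usual warning threshold for an unreliable approximation.
    n_influential = int((loo.pareto_k > 0.7).sum())
    return waic, loo, n_influential
```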
Based on the results of the main analysis, one might get the impression that the number of levels of a group-level effect is more important than the kind of grouping. In a supplementary analysis, however, we show that the WAIC does not necessarily improve for a group-level effect with a larger number of levels. As depicted in Fig. 6, group-level effects with more levels do not per se obtain a better WAIC score than group-level effects with fewer levels: the moderator "sentence" has by far the most levels (2,963), yet it yields the worst model compared to all competing models, each of which has one group-level effect with far fewer levels.
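For concreteness, one way such competing models could be specified in Python with Bambi (a formula interface to PyMC); the formula, column names, and settings are illustrative, not the authors' code:

```python
import bambi as bmb

def build_models(df):
    """df: one row per fragment, with the seven factor scores (f1..f7),
    the intended emotion, and grouping columns such as 'sentence'."""
    fixed = "emotion ~ f1 + f2 + f3 + f4 + f5 + f6 + f7"
    base = bmb.Model(fixed, df, family="categorical")
    # Same fixed effects plus one group-level intercept per sentence.
    grouped = bmb.Model(fixed + " + (1|sentence)", df, family="categorical")
    return base, grouped

# base, grouped = build_models(df)
# idata_base, idata_grouped = base.fit(draws=1000), grouped.fit(draws=1000)
# Comparing the two fits via WAIC then mirrors the comparison in Fig. 6.
```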

Acted vs. non-acted speech corpora
Corpora of emotional prosody are often divided into "acted" and "spontaneous" corpora. However, the boundary between the two groups is often not clear-cut: spontaneous emotional corpora are rarely truly "spontaneous". For example, the improvisation fragments in the popular IEMOCAP corpus⁴⁷ are often referred to as spontaneous recordings⁴⁸; however, the responses are recorded within an acting game and might differ from expression in daily life. Generally, in both groups, the speakers are aware that they are being recorded, which might affect their responses. Furthermore, not all "acted" corpora rely on actors, and many "spontaneous" corpora use professional actors⁴⁷,⁴⁹,⁵⁰. Nonetheless, the two groups differ in two key aspects. First, "acted" corpora, in contrast to "spontaneous" corpora, have "ground-truth" labels, since it is known which emotion should be depicted. To get an estimate of the expressed emotion in spontaneous corpora, each fragment instead needs to be annotated manually. Here, a label with high agreement could serve as a label for classification, but the agreement across annotators is often low⁵¹. In Supplementary Discussion 3, we outline why validation is not straightforward.
The second key difference between the two groups is that speakers in acted corpora are given only one emotion label per utterance, whereas spontaneous utterances tend to contain blends of emotions⁵¹,⁵². While blends are certainly a better approximation of human emotion communication (often a mix of emotions rather than a single emotion is expressed), they are problematic for two reasons. First, to the best of our knowledge, no available corpus has annotated enough responses from different participants for the same stimulus to allow an analysis of blends of emotions. Second, emotion recognition is often framed as a classification problem in which only one emotion needs to be selected, ignoring the possibility that the recordings might contain blends of emotions.

The production of speech prosody
There are multiple definitions of speech prosody. We define speech prosody as variations in pitch, loudness, timing, and voice quality. Speech prosody is the product of human speech production, which involves four subsystems, all illustrated in Fig. 7a. The respiratory system is responsible for the inhalation and exhalation of air, which is needed to produce pressure; this subsystem underlies the loudness-related features. During phonation, the vocal folds rapidly open and close (see Fig. 7b for a depiction of the vocal folds in an open and a closed state). This subsystem is responsible for pitch production: the rate at which the vocal folds vibrate is called the fundamental frequency. During resonation, formants are created, and they can be shifted during articulation. The last two subsystems are thus mainly responsible for the spectral content of the sound.

Validation of corpora of emotional prosody
Estimating the validity of emotional speech usually involves external raters who identify or rate the intended emotion. However, the validation of corpora is not straightforward, as there is no consensus on cutoff values to call a corpus "valid" -such as the minimal number of ratings per stimulus. Furthermore, the concept of validation builds upon the controversial assumption that emotions can be recognized at above chance level from the voice. Thus, corpus validation might not lead to the selection of valid depictions of an emotion, but instead, it might lead to the selection of prototypical depictions. Heedful of these considerations, we will not evaluate the validity of the emotional expression across datasets.