Artificial sounds following biological rules: A novel approach for non-verbal communication in HRI

Emotionally expressive non-verbal vocalizations can play a major role in human-robot interactions. Humans can assess the intensity and emotional valence of animal vocalizations based on simple acoustic features such as call length and fundamental frequency. These simple encoding rules are suggested to be general across terrestrial vertebrates. To test the degree of this generalizability, our aim was to synthesize a set of artificial sounds by systematically changing the call length and fundamental frequency, and examine how emotional valence and intensity are attributed to them by humans. Based on sine wave sounds, we generated sound samples in seven categories by increasing complexity via incorporating different characteristics of animal vocalizations. We used an online questionnaire to measure the perceived emotional valence and intensity of the sounds in a two-dimensional model of emotions. The results show that sounds with low fundamental frequency and shorter call lengths were considered to have a more positive valence, and samples with high fundamental frequency were rated as more intense across all categories, regardless of the sound complexity. We conclude that applying the basic rules of vocal emotion encoding can be a good starting point for the development of novel non-verbal vocalizations for artificial agents.

in multiple categories. We used sine-wave sounds for the simplest sound category, as these are single-frequency sounds that rarely occur naturally 43 , but which are frequently used in artificial signals of machines. Then, starting with the simple sine-wave sounds we added new acoustic features (pitch contour changes, harmonics, variations of call properties within a sound sample, formants) that are characteristic of animal vocalizations to make more complex and biologically more congruent samples. In each category, we systematically changed the fundamental frequency and call length of the sounds to cover the relevant acoustic ranges of these parameters (for more details see Fig. 1 and Table 1).

Questions and Hypotheses
Our main question was whether the simple coding rules of fundamental frequency and call length of vocalizations are also applied to artificially generated sounds.
Our hypotheses were:
H0: Simple coding rules do not exist; the direction of the effects of the acoustic parameters on the emotion ratings differs across the distinct complexity levels.
H1: Simple coding rules apply to artificial sounds as well; the direction of the effects of the acoustic parameters on the emotion ratings is the same across the complexity levels.
In this latter case we expect that human listeners perceive artificial sounds with higher fundamental frequency as more intense and sounds with longer calls as having more negative valence, just as in the case of human and animal sounds, as we can already find the simple coding rules in complex biological sounds evolved to communicate inner states. In parallel, neural systems are present to process these basic acoustic features. Moreover, if the presence of the features that are inherent consequences of the voice production system is necessary for accepting a sound as biological, and thus as a communicative signal encoding emotional states, the simple coding rules could have a stronger effect (i.e., a stronger association between acoustic features and emotional scales) in more complex sounds.

Method
Subjects. All subjects were unpaid volunteers of various nationalities recruited via online advertisements. The number of participants in the final analysis was 237, of whom 95 chose to fill in the questionnaire in Hungarian (60 female, 35 male, mean age = 36.3 ± SD 11.8 years) and 142 in English (122 female, 20 male, mean age = 39.9 ± SD 11.7 years). Questionnaire answers were discarded if the participant was under the age of 18. Part of our sample abandoned the survey before finishing (95 individuals), but as the sample presentation was random, these unfinished responses are unlikely to cause any bias, and thus they were included in the analysis. Subjects gave their informed consent to participate in the study, which was carried out in accordance with the relevant guidelines and with the approval of the Institutional Review Board of the Institute of Biology, Eötvös Loránd University, Budapest, Hungary (reference number of the ethical permission: 2019/49).
Stimuli. The artificial sounds were generated using a custom Praat (version 6.0.19) script (developed by TF and BK, see Supplementary Methods). The sound samples consisted of calls separated by mute intercall periods forming bouts. We varied both the lengths of the calls (cl) and the fundamental frequency (f0) in all cases. The range of most parameters was set in accordance with the non-verbal human and dog vocalizations used in 25.
The fundamental frequency ranged from 65 Hz to 1365 Hz in 100 Hz steps. There were multiple samples at each frequency step, with differing call lengths; sound samples were generated at every 0.03 s call length step. Following this, sounds with specific call lengths were selected (call lengths fell between 0.07 and 1.96 s; for details see Table 1). The number of calls in a sound sample depended on the length of the calls, as the complete sound samples were consistently 3 s long and contained only complete calls, meaning that calls starting after 2 s, or calls that would have ended after 3 s, were muted using Adobe Audition. Therefore, all sound samples consisted of a ~2 s part containing calls and ended with a ~1 s silent part. The intercall interval length was varied in all sound samples. The generated sounds showed variation in loudness, which we included in our analysis as a further acoustic parameter (mean loudness 79.4 ± SD 4.8 dB).
Figure 1. The categories of the artificial sounds across three levels of complexity. In each category the basis of the sound (sine wave or pulse train) is followed by the changed parameters in parenthesis.
www.nature.com/scientificreports
We created seven categories of artificial sounds with three levels of complexity. Figure 1 presents the characteristics of each category, while Table 1 shows a summary of the acoustic parameters of the sound samples.
Level 1 sounds (category 1) are based on sine waves in which only the call length and the fundamental frequency were varied. In Level 2 sounds (categories 2, 3, 4, 5) we systematically changed one aspect of the original simple sounds in each category. In category 2 we used pulse train sounds instead of sine waves, in which the consecutive non-sinusoid waves model the vibrations of the vocal folds, creating harmonics 19,44,45 . In categories 3 and 4 we implemented pitch contour changes with either decreasing or increasing pitch, while in the category 5 sounds we included variances in call length, in intercall interval length and in fundamental frequency (see Table 1). Level 3 sounds contained all the previously varied parameters (call length, fundamental frequency, harmonics, variances and pitch contour changes), as well as formants based on vocal tract modelling. The physical parameters of this model were defined as a hypothetical vocal tract for a ~70 cm tall social robot. The total number of created stimuli consisted of 588 sound samples, 84 in each category.
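The stimulus construction described above (a grid of fundamental frequencies, complete calls inside a 3 s sample, and a sine wave or pulse train basis) can be illustrated with a minimal Python sketch. It is not the authors' Praat script: the sample rate, the fixed 0.1 s intercall interval, the one-impulse-per-period pulse train and all function names are assumptions made purely for illustration.

```python
import math

SAMPLE_RATE = 44100  # assumption: the sample rate is not given in the text


def f0_steps(start_hz=65, stop_hz=1365, step_hz=100):
    """Fundamental frequencies from 65 Hz to 1365 Hz in 100 Hz steps."""
    return list(range(start_hz, stop_hz + 1, step_hz))


def call_onsets(call_len, intercall=0.1, last_onset=2.0, total=3.0):
    """Onset times (s) of the calls kept in one 3 s sample.

    Calls starting after 2 s, or ending after 3 s, are dropped.
    `intercall` is a hypothetical fixed gap; the study varied this interval.
    """
    onsets = []
    i = 0
    while True:
        t = round(i * (call_len + intercall), 3)
        if t > last_onset or t + call_len > total:
            break
        onsets.append(t)
        i += 1
    return onsets


def sine_call(f0, call_len, sample_rate=SAMPLE_RATE):
    """Samples of a single-frequency call (category 1 basis)."""
    n = int(round(call_len * sample_rate))
    return [math.sin(2 * math.pi * f0 * k / sample_rate) for k in range(n)]


def pulse_train_call(f0, call_len, sample_rate=SAMPLE_RATE):
    """Crude pulse train: one unit impulse per fundamental period.

    This is a simplification of a vocal fold model, not Praat's algorithm.
    """
    n = int(round(call_len * sample_rate))
    samples = [0.0] * n
    period = sample_rate / f0  # samples per fundamental period
    k = 0
    while True:
        idx = int(round(k * period))
        if idx >= n:
            break
        samples[idx] = 1.0
        k += 1
    return samples
```

Under these assumptions, `f0_steps()` yields 14 frequency steps, and `call_onsets(0.5)` places four calls of 0.5 s in a sample, leaving the remainder silent.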
Online questionnaire. The final questionnaire used in the study is accessible online at http://soundratingtwo.elte.hu. First, the participants were asked to provide demographic data on their nationality, gender and age, and were asked to answer the question whether they currently owned a dog at the time of the test or owned one in the past. The online page also provided the instructions for the questionnaire, explaining how to indicate the perceived valence and emotional intensity. The participants were asked to use headphones instead of loudspeakers to minimise the differences in the quality and the frequency range of sound production of built-in loudspeakers (e.g., laptops). Participants also had the opportunity to check if their headphones worked correctly and at an optimal volume by playing a non-relevant sound.
The questionnaire used a modified version of the two-dimensional model of emotions by Russell 46 , which had already been successfully used for measuring the perceived emotions associated with dog and human vocalizations 25 . The questionnaire measured the values the participants gave for the sounds on the valence and intensity axes. We used the same questionnaire design in this study. After a sound was played, the participants had to indicate the valence on a horizontal axis and the intensity on the vertical one with one click (Fig. 2). Due to the high number of sound stimuli, each participant received only 11.9% of the samples (70 sound stimuli) after the 4 demo sounds, and received an equal number of samples (10) from all categories. The samples and their listening order were determined randomly.
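The random allocation of stimuli described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the (category, index) representation of a sound sample and the optional seed are hypothetical, and the questionnaire's actual implementation is not described in the text.

```python
import random


def draw_stimuli(n_categories=7, pool_size=84, per_category=10, seed=None):
    """Return a shuffled list of (category, sample_index) pairs for one participant.

    Ten samples are drawn without replacement from each category's pool of 84,
    and the listening order is then randomised across categories.
    """
    rng = random.Random(seed)
    chosen = []
    for category in range(1, n_categories + 1):
        for index in rng.sample(range(pool_size), per_category):
            chosen.append((category, index))
    rng.shuffle(chosen)
    return chosen
```

Each call returns 70 of the 588 samples, which corresponds to the 11.9% share per participant reported above.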

Data analysis. Statistical analysis was conducted in the R statistical environment.
We excluded responses slower than 20 seconds to avoid artefacts caused by network errors and possible lags in the stimuli presentation. Long response times might also indicate high uncertainty in the answer. We used Linear Mixed Modeling (lmer function from the lme4 package, version 1.1-21 47) fit with backward elimination (drop1 function) to find the best model. The fixed effects were the fundamental frequency, call length, sound category, gender, age, query language (Hungarian or English), the participants' dog ownership status and the loudness of the sound samples, as well as the two-way interactions of category with the acoustic parameters, and of language with the acoustic parameters and the complexity category. The participant's age, gender and dog ownership status were included as background variables, as these have been found to influence the perception of emotions in vocalizations in some cases 48-50. The targets were the intensity and the valence values (respectively), and the random effects were the subjects and the ID of the sounds (see also Table 2). We used a normal probability distribution with an identity link function, and all covariates (fundamental frequency, call length, age, loudness) were scaled and centered. Loudness, and the interaction of loudness and category, were included in the model after backward elimination. To compare the effects of call length and fundamental frequency in different complexity categories, we created a Linear Mixed Effects Model of category 1 (Simple sine wave), in which the fundamental frequency and the call length were fixed effects, subjects and sound ID were random effects, and the target was the valence or the intensity ratings. We used the models of category 1 to predict the valence and intensity ratings of the other categories. We compared the predicted and actual valence and intensity ratings with Pearson's correlation.
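The preprocessing and comparison steps named above (the 20 s response time cutoff, scaling and centering of covariates, and Pearson's correlation between predicted and actual ratings) can be sketched in plain Python. The mixed models themselves were fitted in R with lme4 and are not reproduced here; the function names and the 'rt' key are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import math


def keep_fast_responses(responses, max_seconds=20.0):
    """Drop responses slower than the cutoff.

    Each response is a dict with a hypothetical 'rt' key (seconds).
    """
    return [r for r in responses if r["rt"] <= max_seconds]


def scale_and_center(values):
    """Subtract the mean and divide by the sample standard deviation (n - 1)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

In this sketch, `pearson_r(predicted, actual)` would be computed per category on the ratings predicted by the category 1 model and the ratings actually given.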

Results
Intensity. We found that the simple Linear Mixed Effects Model fitted on the sine wave category sounds predicted the intensity ratings of the other categories quite well, based on the correlation between the real and predicted values (R = 0.49-0.60). The comparison of the predicted valence and intensity ratings of the categories is in Table 3.
In the Linear Mixed Model, both the fundamental frequency and call length were in interaction with the sound category and the language. According to the post-hoc tests, the fundamental frequency had a similar positive effect on intensity ratings in all categories: the sounds with higher fundamental frequency were rated as more intense. However, this effect was stronger in categories 4 (Sine wave up), 3 (Sine wave down), 5 (Variable sine wave) and 1 (Simple sine wave), and weaker in categories 2 (Pulse train), 6 and 7 (Complex pulse train down and up) (Fig. 3a). We see a similar pattern within both the English and the Hungarian responses, although it is stronger in the former. Call length had a negative effect in categories 1 (Simple sine wave) and 3 (Sine wave down): shorter calls were rated as more intense. In contrast, samples with longer calls were rated as more intense in categories 2 (Pulse train), 4 (Sine wave up), 6 and 7 (Complex pulse train down and up) (Fig. 3b). In the Hungarian responses the post-hoc test showed a negative trend (short calls are more intense), whereas in the English responses the long calls were rated as more intense. The sound category was also in interaction with the language and loudness. In general, English-speaking respondents rated the samples as more intense than the Hungarian respondents in most categories, with the exception of categories 3 and 4 (Sine wave down and up), where we found no difference. In both languages categories 1 (Simple sine wave), 3 (Sine wave down) and 5 (Variable sine wave) got the lowest ratings, while category 2 (Pulse train) got the highest. Louder sounds were rated as more intense in categories 2 (Pulse train), 7 (Complex pulse train up), 4 (Sine wave up), and 6 (Complex pulse train down). Age had a main effect, as older participants rated sounds as less intense (Fig. 3c). The participants' gender and dog-owner status had no effect on the intensity ratings, and thus were excluded from the final model.
The results of the Linear Mixed Model are summarized in Table 4, and the post-hoc tests are summarized in Supplementary Tables S1 and S2.
Valence. We found that the simple Linear Mixed Effects Model fitted on the sine wave category sounds predicted the valence ratings of the other categories quite well, based on the correlation between the real and predicted values (R = 0.46-0.58). The comparison of the predicted valence and intensity ratings of the categories is in Table 3.
The fundamental frequency had a significant main effect in the Linear Mixed Model: samples with lower fundamental frequency were rated to be more positive (Fig. 4a). The post hoc test showed that in the sound category and call length interaction the sound samples that consist of longer calls were rated as having a more negative valence in all categories (Fig. 4b). The interaction of sound category and language showed a significant language effect only within the 2nd (Pulse train) and 3rd (Sine wave down) category: Hungarian responses tended to be more positive in the former and more negative in the latter than English ratings. In both languages category 2 (Pulse train), 6 and 7 (Complex pulse train down and up) were the most negatively rated, while category 4 (Sine wave up) was the most positive. Louder sounds were generally rated as more negative, which effect was steepest in category 2 (Pulse train) and less so in categories 4 and 3 (Sine wave up and down). Age had a main effect: older participants rated the sounds as less negative regardless of the complexity category (Fig. 4c). The gender of the participants and their dog-owner status had no effect on the valence ratings, and neither did the interaction of language and call length, the interaction of language and fundamental frequency, and the interaction of complexity category and fundamental frequency. The results of the Linear Mixed Model are summarized in Table 5, and the post-hoc tests are summarized in Supplementary Tables S3 and S4.

Discussion
The results show that our artificially generated sounds are able to mimic some of the basic coding rules that are present in animal (mammalian) vocalizations. The predictive models based on sine wave sound samples explain both the valence and the intensity ratings quite well in all other complexity categories, suggesting the presence of the simple rules. The fundamental frequency of the sounds affects the perceived intensity, that is, sounds with higher fundamental frequency were perceived as more intense, while sounds containing longer calls were rated as more negative across all categories. These results align with the findings of previous research on animal and human vocalizations 14,25,26.
An interesting result was the effect of fundamental frequency on valence: sounds with a higher fundamental frequency were rated as more negative in all categories. Although the fundamental frequency-valence effect was not found by Faragó et al. 25 in dog or human vocalizations, the spectral centre of gravity showed a similar pattern in the case of human vocalizations. Multiple other studies also found that higher pitch was associated with negative valence, in e.g., dogs 50 , pigs 26 and wild boars 52 , horses (Equus caballus) 53 and bonobos (Pan paniscus) 54 . However, high frequency vocalizations in positive contexts can also be found (for a review see 31 ), suggesting that the effect of pitch on valence might be non-linear, or can be influenced by other acoustic parameters.
Emotionally expressive vocalizations of terrestrial tetrapods are assumed to have evolved from involuntary sounds emitted due to breathing during aroused emotional states 55 . However, due to the morphological structures and processes of sound production, even simple emotionally expressive vocalizations are acoustically complex, e.g., phonation already appears in frog vocalizations with the appearance of vocal cords, and continues to be present in terrestrial mammals as a result of vocal fold or membrane vibration 56 . As the basic coding rules related to fundamental frequency and call length were also present in the artificial sounds with no added biological features, we can infer that these effects might originate from a more fundamental component of sound processing.
Communicational signals are frequently the result of ritualization, in which a behaviour that carries only involuntary information goes through an evolutionary process in which it becomes specialized and gains a signalling function 57,58 . Ritualization also increases signal complexity, leading e.g., to decreased signal ambiguity or to reproductive isolation via better species recognition 59 . Systematic investigations using generated sounds akin to ours could be used to find common aspects in the ritualized vocal signals of multiple species, aiding in the understanding of how evolutionary pressures affect specific acoustic parameters.
The results also underscore the compatibility of our approach with other SFU methods of emotion expression by showing that the added acoustic parameters did not interfere with the coding rules based on the acoustic cues derived from the call length and fundamental frequency. We found some overall differences in categories with pulse train sounds (categories 2, 6 and 7) as these were generally rated as more intense and more negative than the sounds in sine wave categories (categories 1, 3, 4 and 5). Pulse train sounds can be perceived to be noisier compared to sine wave sounds, which could have resulted in the higher intensity and more negative valence ratings. Furthermore, as pulse train sounds were used to approximate harmonics (category 2) and formants (categories 6 and 7) of animal and human vocalizations, these might have caused an unintended eeriness, which could have resulted in an uncanny effect (as described in HRI, e.g., 8,9) near fundamental frequencies that approximate human speech, leading to more intensive and negative ratings.
The call length of the sounds affected the intensity ratings differently in some of the categories, indicating that it does not represent a general coding rule. The effect of call length on intensity was not found in human vocalizations in 25 , only in dogs, which indicates that this association might be species-specific. By including other acoustic parameters in the artificial sounds, further systematic investigations could specify if some rules are species or taxon specific (e.g., in 25 the tonality (harmonic-to-noise ratio, HNR) affected the intensity ratings of only dog vocalizations, as sounds with high HNR were rated as less intense) or if there are other general coding rules based on the added parameters. It could also clarify which parameters can be added to implement further rules with the potential to enrich or refine the range of expressible emotions.
Loudness influenced both the intensity and valence ratings in interaction with the categories: louder sounds were rated as more negative in all categories to varying degrees, while in the case of intensity ratings the direction of the effect differed among the categories. As the loudness of biological sounds is notoriously hard to measure reliably, especially in field recordings (recording distance and direction strongly affect the measurements), this parameter cannot be compared between species, and its role in emotion encoding is uncertain. Nevertheless, based on the physiology and neural control of vocalization, we can hypothesise that it is linked with both higher arousal and negative inner states 31. Our results partially support this, but it seems that fundamental frequency and call length play a more crucial role in emotion encoding.
A limitation of the current set of sounds is the low number of sound samples that were rated notably positive. The majority of sounds had a mean rating on the valence axis lower than 0, and only a small number of sounds had a mean rating higher than 20. This presents a problem in the framework of human-robot interactions, as social robots are to exhibit behaviours also associated with positive emotions. However, considering the basis of these sounds, the scarcity of positive valence sounds is not surprising. In animal vocalisations, the expression of positive inner states is less frequent, and their functionality is limited to very specific behaviours or situations, e.g., grooming 60 , greeting 61 , play 62 . Vocalizations of dogs show a similar pattern in their perceived valence in contrast to human non-verbal vocalizations which cover the whole scale 25 . An acoustic parameter which is associated with positive inner states in humans is a steeper spectral slope 63 , which can be incorporated in the next iteration of the artificial sounds.
In some cases, the language of the questionnaire influenced the strength of the effects, and in the call length-intensity connection, its direction. As the effect of the call length on the intensity ratings was only present in interaction with the categories and not as a general rule independent of added acoustic parameters, it can be assumed that a slight difference of interpretation of the word 'intensity' by the Hungarian or English speaking participants could have caused this discrepancy. However, this seems to have no major confounding effect in the case of our main questions about the simple encoding rules.
We found that the age of the participant had a significant effect on both the valence and intensity ratings of the sounds, as older participants considered the sound samples to be more positive and less intense than did younger adults. This could be explained by the neural changes that occur during ageing, which lead to a bias towards positive stimuli found in the elderly (positivity effect), causing increased attention towards 64 and memories of 65 positive stimuli. Elderly people are faster to recognize positive facial expressions than negative ones 66, while studies have contradictory results on intensity ratings (increased intensity 66; decreased intensity 67). Age-related hearing loss could have also influenced the answers of elderly participants, as hearing impairment is more prevalent for sounds with higher frequencies, starting from 1000 Hz 68,69, which could somewhat reduce the effects of higher fundamental frequency on the intensity and valence ratings found in younger adults. However, the associations between the acoustic cues and the intensity and valence ratings persisted despite the effects of age and the noise caused by possible sound differences due to the headphone devices of the participants.
As the participants only rated the sounds on their intensity and valence, some functionally important aspects were not investigated. Based on the current results it is not possible to differentiate between sounds with high intensity and negative valence, as they may be perceived as 'angry' or 'fearful'. However, vocalizations perceived as angry/aggressive or fearful/distressed usually elicit opposing behavioural responses from others, as the first may prompt behaviours to avoid the source of the sound, while fearful or distressed vocalizations may elicit approach 70. This difference in the behavioural response to sounds is instrumental in HRI, and therefore should be investigated as an added dimension to the valence and intensity.
Outlook. In the current study, we have established that humans assess the intensity and emotional valence of artificial sounds according to simple coding rules that are based on acoustic cues of animal vocalizations: sounds with higher fundamental frequency are perceived as more intense, while sounds with shorter call lengths are perceived as being more positive. As these coding rules are considered to be shared at least among mammals, the artificial sounds presumably elicit similar responses in non-human mammalian species that live in the human social environment. In our future work, we are planning on investigating the responses of humans and companion animals to the artificially generated sounds, with comparative fMRI studies on humans and dogs and with behavioural tests on humans, dogs and cats. We are also investigating the approach-avoidance responses of humans to the artificial sounds with a follow up questionnaire study.
Defining basic rules of emotion encoding using a comparative approach can be the key to understanding the evolutionary processes of animal vocalizations. We suggest that the presented systematic method of assessing the effects of artificial sounds provides a novel opportunity to investigate the evolution of both the production and perception mechanisms underlying vocal emotion expression.

Data availability
The dataset generated during the current study is available as a supplementary file (Dataset.csv).