Novel vocalizations are understood across cultures

Linguistic communication requires speakers to mutually agree on the meanings of words, but how does such a system first get off the ground? One solution is to rely on iconic gestures: visual signs whose form directly resembles or otherwise cues their meaning without any previously established correspondence. However, it is debated whether vocalizations could have played a similar role. We report the first extensive cross-cultural study investigating whether people from diverse linguistic backgrounds can understand novel vocalizations for a range of meanings. In two comprehension experiments, we tested whether vocalizations produced by English speakers could be understood by listeners from 28 languages from 12 language families. Listeners from each language were more accurate than chance at guessing the intended referent of the vocalizations for each of the meanings tested. Our findings challenge the often-cited idea that vocalizations have limited potential for iconic representation, demonstrating that in the absence of words people can use vocalizations to communicate a variety of meanings.

www.nature.com/scientificreports/

An overview of both the online and the field experiments is shown in Fig. 1. Altogether, these two experiments included speakers of 28 different languages spanning 12 different language families.

Results
Online experiment. First, we analyzed guessing accuracy by listeners of each language group with respect to the 30 different meanings. The overall average accuracy of responses across the 25 languages was 64.6%, much higher than chance level (1/6 = 16.7%). Performance for each of the languages was above chance level, with means ranging from 52.1% for Thai to 74.1% for English. Across the board, the descriptive averages show that all meanings were guessed correctly above chance level, ranging from 34.5% (for the demonstrative that) to 98.6% (for the verb sleep) (Fig. 2b). For 20 out of the 25 languages, performance was above chance for all meanings; for 4 out of 25 languages, it was above chance for all but one meaning; and for one language, it was above chance for all but two meanings. Thus, all languages exhibited above-chance performance for at least 28 out of the 30 meanings. A breakdown by meaning category shows that actions were guessed most accurately (70.9%), followed by entities (67.7%), properties (58.5%), and demonstratives (44.7%). Looking at the entity meanings in more detail, vocalizations for animals were guessed most accurately (75.6%), followed by humans (69.9%) and inanimate entities (62.6%).

We fitted an intercept-only Bayesian mixed-effects logistic regression to the accuracy data (at the single-trial level). The intercept of this model estimates the overall accuracy across languages. We controlled for by-listener, by-meaning, by-creator (i.e., the creator of the vocalization), and by-language-family variation by fitting random intercept terms for each of these factors. In addition, we fitted a random intercept term for unique vocalizations (i.e., an identifier variable creating a unique label for each meaning × creator combination).
This random intercept accounts for the fact that not just meanings and creators, but also particular vocalizations are repeated across subjects, and hence are an important source of variation that needs to be accounted for. Controlling for all of these factors, the average accuracy was estimated to be 65.8% (posterior mean), with a 95% Bayesian credible interval spanning from 54.0 to 75.9%. The posterior probability of the average accuracy being below chance level is p = 0.0 (i.e., not a single posterior sample is below chance). This strongly supports the hypothesis that across languages, people are able to reliably detect the meaning of vocalizations. Figure 2 displays the posterior accuracy means from the mixed model, demonstrating that performance was above chance level for all languages and for all meanings.
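Because the model is logistic, its intercept is estimated on the log-odds scale and must be passed through the inverse logit to be read as an accuracy. A minimal sketch of that conversion, using only the reported 65.8% posterior mean and the 1/6 chance level (all other details of the actual brms fit are omitted):

```python
import math

def logit(p):
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# The reported 65.8% posterior-mean accuracy corresponds to an
# intercept of about 0.65 on the log-odds scale.
intercept = logit(0.658)
chance = 1 / 6  # six-alternative forced choice

# Converting back recovers the accuracy on the probability scale.
estimated_accuracy = inv_logit(intercept)
margin_over_chance = estimated_accuracy - chance  # roughly 0.49
```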
The mixed model also quantifies the variation in accuracy levels across the different random factors (see Table 1). This highlights the remarkable cross-linguistic similarity in the behavioral responses to this task: while there is variation across languages and language families, it is dwarfed by differences between listeners and differences between meanings.

Because the vocalization stimuli were created by English speakers (either as a first or second language), we wanted to determine whether familiarity with English and related languages provided listeners with any advantage in understanding the vocalizations. To do this, we first computed descriptive averages for accuracy as a function of listeners' knowledge of English as either a first or second language. Our survey asked participants to note any second languages, and so we coded the data into four categories: English native speakers ('English L1', N = 82), participants who reported English as a second language ('English L2', N = 648), participants who reported a second language that was not English ('non-English L2', N = 22), and participants who did not report knowing any second language ('no L2', N = 91). All four groups performed on average well above chance level, with the English L1 speakers performing best (70.4%, descriptive average), followed by English L2 speakers (64.7%), non-English L2 speakers (60.1%), and no-L2 speakers (59.7%). These descriptive averages suggest a moderate advantage for knowing English, but listeners were still far above chance even without any knowledge of English.
We also analyzed whether the genealogical similarity of listeners' first language to English conferred any guessing advantage. Indeed, speakers of Germanic languages (other than English) showed, on average, the highest accuracy (70.3%, descriptive average), compared with speakers of non-Germanic Indo-European languages (63.2%) and speakers of non-Indo-European languages (61.7%). This suggests a rough decline in accuracy with the genealogical distance between the speaker's language and the language of the stimuli creators, although there is little difference between speakers of non-Germanic Indo-European languages and speakers of languages from other families.

Field experiment. First, we analyzed guessing accuracy for the seven language groups for the 12 meanings.
Note that "language group" is defined in the field experiment as a unique socio-political population speaking a given language; in our data, the groups map uniquely onto the languages, with the exception of American English and British English speakers, who form two separate groups. On average, responses across the language groups were correct 55.0% of the time, well above chance level (1/12 = 8.3%, as there were 12 possible answers, of which one was correct). Listeners of all languages performed above chance, with accuracy ranging from 33.8% (Portuguese speakers in Brazilian Amazonia) to 63.1% (German speakers). All meanings were guessed above chance level, ranging from 27.5% (fruit) to 85.5% (child) (Fig. 3b). Five of the language groups exhibited above-chance performance for all 12 meanings, and the remaining two groups, Brazilian Portuguese and Palikúr speakers, exhibited above-chance performance for 10 out of the 12 meanings.
As before, we used an intercept-only Bayesian logistic regression model to analyze these data, with random effects for vocalization (unique meaning-creator combination), meaning, creator of the vocalization, listener, and language group. While controlling for all these factors, the model estimated the overall accuracy to be 51.9%, with the corresponding 95% Bayesian credible interval spanning from 26.3 to 75.9%, thus not including chance-level performance (1/12 = 8.3%). The posterior probability of the accuracy mean being below chance for the field sample was very low (p < 0.0001), indicating strong evidence for above-chance performance across languages. Figure 3 displays the posterior means for languages and meanings from this model.
A look at the random effects (see Table 2) shows that, similarly to the online experiment, variation in accuracy was largest for meaning (SD = 1.11), followed by vocalization (unique meaning-creator combination; SD = 0.94). In contrast to the online experiment, variation across languages (SD = 0.87) was higher than variation across listeners (SD = 0.57). This may reflect the greater cultural and linguistic heterogeneity of this sample compared to the online experiment.
Finally, we assessed whether vocalizations for animate entities were guessed more accurately than those for inanimate entities. As the animate entities (three human and three animal) produce characteristic vocal sounds, these referents may be more widely recognizable across language groups. On average, responses were much more likely to be correct for animate entities (72%) than for inanimate entities (39%). To assess this inferentially, we extended the above-mentioned Bayesian logistic regression analysis with an additional fixed effect for animacy. The model indicates strong support for vocalizations of animate entities being guessed more accurately across languages: the odds in favor of this effect were 5.10 to 1 (log-odds coefficient: 1.63, SE = 0.49), with the Bayesian 95% credible interval for this coefficient ranging from odds of 1.8 to 13.2. The posterior probability of the animacy coefficient being below zero is p = 0.0 (not a single posterior sample was below zero), which supports a cross-linguistic animacy effect.
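The reported odds of 5.10 to 1 follow directly from exponentiating the log-odds coefficient; a quick sketch of that conversion, using only the values stated above:

```python
import math

# Animacy coefficient on the log-odds scale, as reported.
log_odds_coef = 1.63

# Exponentiating gives the odds ratio, ~5.10 as reported.
odds_ratio = math.exp(log_odds_coef)

# The credible interval, stated as odds of 1.8 to 13.2, corresponds
# to roughly 0.59 to 2.58 on the log-odds scale.
ci_low_logodds = math.log(1.8)
ci_high_logodds = math.log(13.2)
```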

Discussion
Can people from different cultures use vocalizations to communicate, independent of the use of spoken words? We examined whether listeners from diverse linguistic and cultural backgrounds were able to understand novel, non-linguistic vocalizations that were created to express 30 different meanings (spanning actions, humans, animals, inanimate entities, properties, quantifiers, and demonstratives). Our two comprehension experiments, one conducted online and one in the field, reached a total of 986 participants, who were speakers of 28 different languages from 12 different families. The online experiment allowed us to test whether a large number of diverse participants around the world were able to understand the vocalizations. With the field experiment, by using the 12 easy-to-picture meanings, we were able to test whether participants living in predominantly oral societies were also able to understand the vocalizations. Notably, in this task, participants made their response without the use of written language, by pointing to matching pictures. In both experiments, with just a few exceptions, listeners from each of the language groups were better than chance at guessing the intended referent of the vocalizations for all of the meanings tested.
The stimuli in our experiments were created by English speakers (either as a first or second language), and so to assess their effectiveness for communicating across diverse cultures, it was important to determine specifically whether the vocalizations were understandable to listeners who spoke languages most distant from English and other Indo-European languages. In the online experiment, we did find that native speakers of Germanic languages were somewhat more accurate in their guessing compared to speakers of both non-Indo-European and non-Germanic Indo-European languages. We also found that those who knew English as a second language were more accurate than those who did not know any English. However, accuracy remained well above chance even for participants who spoke languages that are not at all, or only distantly, related to English. In the field experiment, while guessing exceeded chance across the language groups for the great majority of meanings, we did observe, broadly, two levels of accuracy among the seven language groups. This difference primarily broke along the lines of participant groups from predominantly oral societies versus those from literate ones. The Tashlhiyt Berber, United States and British English, and German speakers all displayed a similarly high level of accuracy, between 57 and 63% (chance = 1/12 = 8.3%). In comparison, speakers of Brazilian Portuguese, Palikúr, and Daakie were less accurate, between 34 and 43%. Notably, the Berber group, like the English and German groups, consisted of university students, whereas listeners in the Portuguese, Palikúr, and Daakie groups, living in largely oral societies, had not received education beyond primary or secondary school.
This suggests that an important factor in people's ability to infer the meanings of the vocalizations is education 35, which is known to improve performance on formal tests 36. A related possibility is that the difference between the oral and literate groups is the result of differential experience with wider, more culturally diverse social networks, such as through the Internet.
Not surprisingly, we found, even across our diverse participants, that some meanings were consistently guessed more accurately than others. In the online experiment, collapsing across language groups, accuracy ranged from 98.6% for the action sleep to 34.5% for the demonstrative that (chance = 1/6 = 16.7%). Participants were best with the meanings sleep, eat, child, tiger, and water, and worst with that, gather, dull, sharp, and knife, for which accuracy was statistically only marginally above chance. In the field experiment, accuracy ranged from 85.5% for child to 27.5% for fruit. Across language groups, animate entities were guessed more accurately than inanimate entities.
There is not a clear benchmark for evaluating the accuracy with which novel vocalizations are guessed across cultures. However, an informative point of comparison is the results of experiments investigating the cross-cultural recognition of emotional vocalizations, an ability that has been hypothesized to be a psychological universal across human populations. For example, Sauter and colleagues 2 showed that Himba speakers living in Namibian villages, when presented with vocalizations for nine different emotions produced by English speakers, identified the correct emotion from two alternatives with 62.5% accuracy, with identification reliably above chance for six of the nine emotions tested (i.e., for the basic emotions). More generally, a recent meta-analysis found that vocalizations for 25 different emotions were identifiable across cultures with above-chance accuracy, ranging from about 67% to 89% when scaled relative to a 50% chance rate, although the number of response alternatives varied across the included studies 37. This analysis found an in-group advantage for recognizing each of the emotions, with accuracy negatively correlated with the distance between the expresser and perceiver cultures (see also 38). Considering these studies, the recognition of emotional vocalizations appears roughly comparable to the current results. But critically, in comparison to the single domain of emotions, iconic vocalizations can enable people from distant cultures to share information about a far wider range of meanings. Although the use of discrete response alternatives in a forced-choice task presents a greatly simplified context for communication, our use of six alternatives in the online experiment and 12 in the field experiment is considerably expanded from the typical two-alternative tasks used in many cross-cultural emotion recognition experiments.
Thus, our study points to the possibility that iconic vocalizations could have supported the emergence of an open-ended symbolic system. While this undercuts a principal argument in favor of a primary role for gestures in the origins of language, it raises the question of how vocalizations would compare to gestures in a cross-cultural test. Laboratory experiments in which English-speaking participants played charades-like communication games have shown better performance with gestures than with non-linguistic vocalizations, although vocalizations are still effective to a significant extent 39,40. In this light, we emphasize that while our findings provide evidence for the potential of iconic vocalizations to figure in the creation of the original spoken words, they do not detract from the hypothesis that iconic, as well as indexical, gestures also played a critical role in the evolution of human communication, as they are known to play in the modern emergence of signed languages 11. The totality of evidence may be most consistent with a multimodal origin of language, grounded in iconicity and indexical gestures 21,41. Both modalities, visual and auditory, could work in tandem, each better suited to communicating under different environmental conditions (e.g., daylight vs. darkness), social contexts (e.g., whether the audience is positioned to better see or hear the signal), and meanings to be expressed (e.g., animals, actions, emotions, abstractions).
An additional argument put forward in favor of gestures over vocalizations in language origins is the long-held view that great apes (our closest living relatives, including chimpanzees, gorillas, and orangutans) have far more flexible control over their gestures than over their vocalizations, which are said to be limited to a species-typical repertoire of involuntary emotional reflexes (e.g., 6,15,42-44). Yet recent evidence shows that apes do in fact have considerable control over their vocalizations, which they produce intentionally according to the same criteria used to assess the intentional production of gestures 45. Studies of captive apes, especially those cross-fostered with humans, show that apes can flexibly control pulmonary airflow in coordination with articulatory movements of the tongue, lips, and jaw, and that they are also able to exercise some control over their vocal folds 46,47. Given these precursors for vocal dexterity in great apes, it appears that early in human evolution, an improved capacity to communicate with iconic vocalizations would have been adaptive alongside the use of iconic and indexical gestures. Thus, there would have been an immediate benefit to the evolution of increased vocal control (e.g., increased connections between the motor cortex and the primary motor neurons controlling laryngeal musculature 48), a benefit much more widely applicable than the modulation of vocalizations for the expression of size and sex (e.g., 49).

Altogether, our experiments present striking evidence of the human ability to produce and understand novel vocalizations for a range of meanings, which can serve effectively for cross-cultural communication when people lack a common language. The findings challenge the often-cited idea that vocalizations have limited potential for iconic representation 13-19.
Thus, our study fills in a crucial piece of the puzzle of language evolution, suggesting the possibility that all languages-spoken as well as signed-may have iconic origins. The ability to use iconicity to create universally understandable vocalizations may underpin the vast semantic breadth of spoken languages, playing a role similar to representational gestures in the formation of signed languages.

Methods
Online experiment. Participants. The online experiment comprises a sample of 843 listeners (594 women, 193 men; ages 18-84, mean age 32) speaking 25 different languages. This total excludes participants who did not respond accurately on at least 80% of the control trials (10 repetitions of the clapping sound in total) or who did not complete at least 80% of the experiment. For 38 US speakers, we failed to ascertain gender and age information. The languages span nine different language families, including Indo-European, Uralic, Niger-Congo, Kartvelian, Sino-Tibetan, Tai-Kadai, and the language isolates Korean and Japanese. On average, there were about 33 listeners per language, ranging from seven for Albanian to 77 for German. Table 3 shows the number of speakers per language in the online sample. The sample was an opportunity sample assembled via snowballing: we used our own contacts and social media, asking native speakers of the respective languages to share the survey.
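The two exclusion criteria amount to a simple filter over participant records. A sketch with hypothetical records (the field names and values are illustrative, not the study's actual data format):

```python
# Hypothetical participant records: number of correct control
# ("clapping") trials out of 10, and fraction of the experiment completed.
participants = [
    {"id": "p1", "control_correct": 10, "completed": 1.00},
    {"id": "p2", "control_correct": 7,  "completed": 0.95},  # fails control check
    {"id": "p3", "control_correct": 9,  "completed": 0.70},  # fails completion check
    {"id": "p4", "control_correct": 8,  "completed": 0.85},
]

def passes_criteria(p, n_control=10, min_control=0.8, min_completed=0.8):
    """Keep only listeners who got >= 80% of control trials right
    and completed >= 80% of the experiment."""
    return (p["control_correct"] / n_control >= min_control
            and p["completed"] >= min_completed)

included = [p["id"] for p in participants if passes_criteria(p)]
print(included)  # ['p1', 'p4']
```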
All participants declared that they took part in the experiment voluntarily. Informed consent was obtained from all participants and/or their legal guardians. The experiment was part of a project approved by the ethics board of the German Linguistic Society and the data protection officer at the Leibniz-Centre General Linguistics. The experiment was performed in accordance with the guidelines and regulations provided by the review board.
Stimuli. The stimuli were collected as part of a contest in a previous study 34. Participating contestants submitted a set of vocalizations to communicate 30 different meanings spanning actions (sleep, eat, hunt, pound, hide, cook, gather, cut), humans (child, man, woman), animals (snake, tiger, deer), inanimate entities (fire, fruit, water, meat, knife, rock), properties (big, small, sharp, dull), quantifiers (many, one), and demonstratives (this, that). To choose the winner of the contest, Perlman and Lupyan (2018) 34 used Amazon Mechanical Turk to present the submitted vocalizations to native speakers of American English, who were asked to guess the intended meaning of each vocalization. In the current study, a subset of the vocalizations initially submitted to the contest was used: for each meaning, the three most accurately guessed vocalizations from Perlman and Lupyan's (2018) 34 evaluation were chosen, regardless of the submitting contestant. The recordings can be found in the Open Science Framework repository: https://osf.io/4na58/.

Table 3. Number of listeners and average accuracy (descriptive averages, chance = 16.7%) for each language in the sample for the online experiment, in alphabetical order by language family and genus. Within a genus, the languages are sorted by accuracy (see Fig. 2 for complementary posterior estimates and 95% credible intervals from the main analysis).
Procedure. Participants listened to each of the 90 vocalizations and, for each one, guessed its meaning from six alternatives (the correct meaning plus five randomly generated alternatives).
The surveys were translated from English into each language by native speakers of the respective languages. The translation sheet consisted of an English version to be translated and additional remarks for possibly ambiguous cases. In such cases, the literal meaning was the preferred one. The online surveys were hosted with Percy, an independent online experiment engine 50 . All surveys posted online were checked by the first author and by the respective translators and/or native speakers prior to distribution.
The surveys included a background questionnaire in which the participants were asked to report the following information: their sex, age, country of residence, native language(s), foreign language(s), if they spoke a dialect of their native language, and the place where they entered primary school. In addition, they were asked about their hearing ability, the environment they were in at the time of taking the survey, the audio output device, and the input device. After completing the background questionnaire, the participants proceeded directly to the experimental task.
In each of the language versions, the task was the same. The participants listened to a vocalization and were asked to guess which meaning it expressed from among six alternatives, five of which were generated randomly from a pool of other, non-matching meanings. To check whether the participants were paying attention, ten control trials were added in which the sound of clapping was played; the participants were instructed to choose an additional option, "clapping", for these trials. The presentation order was randomized for each participant. Each sound could be played as many times as a participant wanted; on average, participants played a sound 1.64 times (played once on 61.6% of all trials, repeated once on 25.6%).
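The construction of a single forced-choice trial, with five foils drawn at random from the pool of non-matching meanings, can be sketched as follows (the abbreviated meaning list and the function name are illustrative, not taken from the study's materials):

```python
import random

# A subset of the 30 meanings, for illustration only.
meanings = ["sleep", "eat", "child", "tiger", "water", "that",
            "gather", "dull", "sharp", "knife", "fire", "rock"]

def make_trial(correct, pool, n_alternatives=6, rng=random):
    """Build one trial: the correct meaning plus randomly drawn
    non-matching foils, shuffled into a response list."""
    foils = rng.sample([m for m in pool if m != correct], n_alternatives - 1)
    options = foils + [correct]
    rng.shuffle(options)
    return options

random.seed(0)
trial = make_trial("tiger", meanings)
print(trial)  # six distinct options, always containing "tiger"
```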
Statistical analysis. All analyses were conducted with R 51. We used the tidyverse package for data processing 52 and the ggplot2 package for Figs. 2 and 3 53. All analyses can be accessed via the Open Science Framework repository (https://osf.io/4na58/).
We fitted Bayesian mixed logistic regression models with the brms package 54 for the online and field experiment data separately. These models were intercept-only, estimating the overall accuracy level. The online-experiment model included random intercepts for listener (N = 843), creator of the vocalization (11 creators), meaning (30 meanings), unique vocalization (30 meanings × 3 creators, i.e., unique meaning-creator combinations), language (25 languages), language family (nine families, classifications from the Autotyp database 55), and language area (seven macro-areas from Autotyp).
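The chance-level comparisons reported for these models amount to asking what fraction of the posterior draws of the intercept, mapped back to the accuracy scale, fall below chance. A sketch with simulated draws standing in for the actual Stan samples (the Gaussian stand-in and its parameters are assumptions for illustration, not the fitted posterior):

```python
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Mock posterior draws for the intercept (log-odds scale); in the
# actual analysis these would come from the brms/Stan fit.
random.seed(1)
draws = [random.gauss(0.65, 0.25) for _ in range(4000)]

accuracies = [inv_logit(d) for d in draws]
chance = 1 / 6  # online experiment: six alternatives

posterior_mean = sum(accuracies) / len(accuracies)
p_below_chance = sum(a < chance for a in accuracies) / len(accuracies)
# With a posterior concentrated far above logit(1/6), no draw falls
# below chance, mirroring the reported p = 0.0.
```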
MCMC sampling was performed with Stan via the brms interface.

Field experiment. Participants. We analyzed data from 143 listeners (99 female, 44 male; average age 28; age range 19-75) who were speakers of six different languages: Tashlhiyt Berber, Daakie, Palikúr, Brazilian Portuguese, English (in two dialects, British and American), and German. This set thus comprised four groups speaking Indo-European languages (Brazilian Portuguese, British English, American English, and German) and three groups speaking languages from other families: Afro-Asiatic (Tashlhiyt Berber), Arawakan (Palikúr), and Austronesian (Daakie). Notably, many of the participants in these groups spoke multiple languages. For example, the Tashlhiyt Berber group consisted of university students who studied in French, and the listeners from predominantly oral cultures (speakers of Daakie, Palikúr, and Portuguese) also had some knowledge of other languages spoken in the respective regions. Table 4 shows the number of listeners per language group.

Stimuli. For this experiment, we used vocalizations for the 12 animate and inanimate entities only, as these could be well depicted in pictures: child, deer, fire, fruit, knife, man, meat, rock, snake, tiger, water, and woman. The pictures depicted the meanings as simply as possible, preferably with no other objects in sight; they were chosen from a flickr.com database under a Creative Commons license. For each meaning, the stimuli included the same vocalizations that were part of the online experiment, so that there were 36 auditory stimuli in total (12 meanings × 3 creators).
Procedure. The pictures of the 12 referents were placed on the table in front of the participant, and their order was pre-randomized in four versions. The experimenter, who was naïve to the intended meaning of each sound file, played the vocalizations in a pre-randomized order, also in four versions. For each sound, the listener was asked to choose the picture that best depicted the sound from among the pictures lying in front of her/him. The exact instructions, given in the native language of the participant, were: "You will hear sounds. For each sound, choose a picture that it could depict. More than one sound can be matched to one picture".
Participants made their response by pointing at one of the twelve photos laid out in a grid before them. The experimenter noted the answers on an anonymized response sheet. During the experiment, the experimenter was instructed to look away from the pictures in order to minimize the risk of giving the listener gaze cues. The sound could be played as many times as the participant wanted.
The background questionnaire for the field version was filled out by the experimenter and covered sex, age, native language(s), and other language(s).
The data were collected in various locations (cf. "Participants" above) and therefore under somewhat different conditions. The American, British, and Berber participants were recorded in a university room, and the German speakers were recorded partly in a laboratory room in Berlin and partly in a quiet indoor area on the Baltic coast. Conditions differed for the remote populations: the Daakie participants were recorded in a small concrete building of the Presbyterian Church, seated on a bench in front of a table, with care taken that they were not disturbed by curious onlookers; Brazilian Portuguese listeners in Cametá were recorded in their own houses; and Palikúr speakers were interviewed individually in a separate room of a community building in Saint-Georges-de-l'Oyapock.
Statistical analysis. The field-experiment mixed model included random intercepts for meaning (12 different meanings), creator of the vocalization (7 creators), unique vocalization (36 unique meaning-creator combinations), listener (143 listeners), and language (6 languages). We used a uniform prior over (−10, 10) on the intercept for both the field and online experiments. To assess the effect of animacy, an additional model was fitted for the field experiment with animacy as a fixed effect and a regularizing normal prior (µ = 0, σ = 2), as well as by-language and by-speaker varying animacy slopes.

Table 4. Number of listeners and average accuracy (descriptive averages, chance = 8.3%) for each language in the sample for the field experiment, in alphabetical order by language family. Within a language family, the languages are sorted by accuracy (see Fig. 3 for complementary posterior estimates and 95% credible intervals from the main analysis).
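On the probability scale, the uniform (−10, 10) prior on the intercept is effectively uninformative, covering accuracies from near 0 to near 1. A quick check of the bounds:

```python
import math

def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# The prior's endpoints on the logit scale correspond to accuracies
# of about 0.0000454 and 0.9999546, so the prior is essentially flat
# over the whole plausible range.
prior_low = inv_logit(-10)
prior_high = inv_logit(10)
```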