Emotion Perception in Hadza Hunter-Gatherers

It has long been claimed that certain configurations of facial movements are universally recognized as emotional expressions because they evolved to signal emotional information in situations that posed fitness challenges for our hunting and gathering hominin ancestors. Experiments from the last decade have called this particular evolutionary hypothesis into doubt by studying emotion perception in a wider sample of small-scale societies with discovery-based research methods. We replicate these newer findings in the Hadza of Northern Tanzania; the Hadza are semi-nomadic hunters and gatherers who live in tight-knit social units and collect wild foods for a large portion of their diet, making them a particularly relevant population for testing evolutionary hypotheses about emotion. Across two studies, we found little evidence of universal emotion perception. Rather, our findings are consistent with the hypothesis that people infer emotional meaning in facial movements using emotion knowledge embrained by cultural learning.

It has long been claimed that certain configurations of facial movements, such as smiles, scowls, and frowns, are universally recognized as emotional expressions because they evolved to signal emotional information in situations that posed fitness challenges for our hunting and gathering hominin ancestors. This hypothesis can be traced back, in part, to Charles Darwin's 1872 publication of The Expression of the Emotions in Man and Animals 1 , in which he stipulated that emotions are "expressed" across the animal kingdom via patterns of muscular discharge, such as coordinated sets of facial muscle contractions. Darwin's hypothesis was later modified and elaborated on by evolutionary psychologists, who proposed that the facial configurations in question evolved as emotion-specific expressions to signal information 2 in the situations our hominin ancestors faced on the African savannah during the Pleistocene 3,4 . For example, the wide-eyed gasping facial configuration, thought to universally express fear, purportedly evolved to enhance sensory sampling that supports efficient threat detection 5 , including the detection of dangerous predators. Similarly, the nose-wrinkle configuration, thought to universally express disgust, purportedly evolved to limit exposure to noxious stimuli 6 , such as food that is contaminated or that has spoiled in the heat. These hypothesized facial expressions, along with scowls (for expressing anger), smiles (for expressing happiness), and others, now over twenty in total, are thought to be universally observed in people around the world, although slightly modified by culture 7,8 .
To test whether the facial configurations in question evolved to express certain emotion categories in a universal manner, as proposed, scientists have largely studied how people infer the emotional meaning of those configurations, the logic being that the production and perception of emotional expressions co-evolved as an integrated signaling system 2 (for discussion, see 9 ). This experimental approach can also be traced back to Darwin, who conducted research with two different methods to test his hypotheses 1 . Darwin first asked informants to provide their own emotion labels for photographs of the facial configurations in question. This free-labeling response method produced substantial variation, providing little support for Darwin's hypotheses (see 1 , p. 12). Darwin also surveyed well-traveled colleagues and missionaries from the "old and new worlds" to learn about the facial movements of people who lived in remote, non-urbanized cultural contexts. He constrained informants' responses by providing them with verbal descriptions of the facial configurations. Each description contained an emotion word corresponding to the category he believed was being expressed (e.g., "Does shame excite a blush when the colour of the skin allows it to be visible? and especially how low down the body does the blush extend?" p. 12).

Table notes. Following Barrett (2008), participants were tested in a second language (Spanish) in which they received training. A subset of choice-from-array studies did not control whether foils and target facial configurations could be distinguished by valence and/or arousal; the exceptions are Gendron et al. (2014a), Study 2, which controlled for both valence and arousal, and Sauter et al. (2015; a re-analysis of 2010 data) and Cordaro et al. (2015), which controlled for valence only. N = sample size. Unsupported = consistency and specificity at chance, or any level of consistency above chance combined with evidence of no specificity. Weak support = consistency between 20% and 40% (weak) for at least a single emotion category (other than happiness) combined with above-chance specificity for that category, or consistency between 41% and 70% (moderate) for at least a single category (other than happiness) with unknown specificity. Moderate support = consistency between 41% and 70% (moderate) combined with any evidence of above-chance specificity for those categories, or consistency above 70% (strong) for at least a single category (other than happiness) with unknown specificity. Strong support = strong evidence of consistency (above 70%) and strong evidence of specificity for at least a single emotion category (other than happiness). Superscript a: specificity levels were not reported. Superscript a1: specificity inferred from reported results. Superscript a2: traditional specificity and consistency tests are inappropriate for this method, but the results are placed here based on the original authors' interpretation of multidimensional scaling and clustering results. Superscript b: the sample size, marginal means, and exact pattern of errors reported for the Sadong samples are identical in Sorenson (1975), Sample 3 and Ekman et al. (1969); Sorenson described using a free-labeling method, whereas Ekman et al. (1969) described using a choice-from-array method in which participants were shown photographs and asked to choose a label from a small list of emotion words; Ekman (1994) indicated, however, that he did not use a free-labeling method, implying that the samples are distinct. Superscript c: Sorenson (1975), Sample 2 included three groups of Fore participants (those with little, moderate, and most other-group contact). The pattern of findings is nearly identical for the subgroup with the most contact and the data reported for the Fore in Ekman et al. (1969); again, Sorenson described using a free-labeling method and Ekman et al. (1969) described using a choice-from-array method. It is questionable whether the Sadong and the Fore subgroup should be considered small-scale societies (see Sorenson, 1975, pp. 362-363), but we include them here to avoid falsely dichotomizing cultures as "isolated from" versus "exposed to" one another (Fridlund, 1994; Gewald, 2010). Superscript d: these are likely the same sample because the sample sizes and pattern of data are identical for all emotion categories except for the fear category, which is extremely similar, and for the disgust category, which includes responses for contempt in Ekman and Friesen (1971) but was kept separate in Sorenson (1975). Superscript e: participants were children. Superscript f: participants were adolescents. Superscript g: the Dani sample reported in Ekman (1972) is likely a subset of the data from Ekman, Heider, Friesen, and Heider (unpublished manuscript).
In Study 1, both Hadza and US participants freely labeled a set of posed facial configurations like those used in prior studies of emotion perception. These configurations are the proposed universal expressions for anger, disgust, fear, happiness, sadness, and surprise categories. We examined the extent to which participants perceived the facial configurations of interest as conveying emotional information, and to which emotion category each was assigned. This allowed us to discover whether Hadza and US participants were similar in the consistency and specificity with which they inferred emotional meaning in these configurations (supporting the universality hypothesis), or whether the Hadza were more variable in the labels they offered (i.e., an observation of cross-cultural diversity). In Study 2, participants registered their inferences using a choice-from-array response method. On a given trial, participants heard a brief story about an emotional event, including an emotion word, and then were shown two posed facial configurations and asked to choose which was the best match. We presented participants with only two facial poses to choose from, in keeping with Ekman and Friesen's original method 19 (where participants received 2 or 3 choices), on which our task was modeled, and consistent with more recently published studies of emotion perception in small-scale societal contexts 22,23 . In both studies, instructions and materials were presented to Hadza participants in their first language, Hadzane. In Study 1, participants responded on a given trial in Hadzane or in Swahili, their primary or second language, according to their preference. Responses were translated online, at the time of testing, into English by author SM, who is fluent in English, Swahili, and Hadzane, and entered by an experimenter. All original responses were also audio recorded so that online translations could be checked for accuracy (see Supplementary Information for details).

Results
Study 1: Free-labeling of facial configurations. Both Hadza (N = 43) and US (N = 45) participants were presented with six posed facial configurations in randomized order (the hypothesized expressions for anger, disgust, fear, happiness, sadness, and surprise) and were asked to freely label them. We hypothesized that when Hadza participants used mental state words to label the facial configurations, they would do so with less consistency and specificity than the US participants. Facial movements are not always understood as conveying meaning about internal, emotional states, however. People in a number of small-scale societies reportedly refrain from explicit mentalizing and, in some publications, describe an inability to infer the mental states of others because they experience other people's minds as opaque. This phenomenon is referred to as opacity of mind in cultural anthropology 36 . Accordingly, we hypothesized a graded continuum of social inference, reminiscent of 37 , with descriptions of action (called action identification) anchoring one end, and internal states (called mental state inference or mentalizing) anchoring the other 38 . Action identification involves an inference of an agent and the behavior that the agent performed, whereas mentalizing involves the additional inference of an internal thought, feeling, or state to the agent. Action identifications involve a representation of what a person is doing (e.g., crying) and how she is doing it (e.g., shedding tears and vocalizing), whereas mentalizing also involves a representation of why the action is occurring in the first place (i.e., assigning a mental cause for the action; e.g., sadness). Prior research indicates that when Hadza participants are asked to assign punishment for a transgression, they are less likely to use available information about mental causes for behavior (intent) 34 , suggesting that they are less likely to mentalize. 
This was also true of participants from a small-scale agro-pastoralist society, the Himba of Namibia 34 . Correspondingly, Himba participants showed reduced mentalizing and increased action identification of facial configurations during an emotion perception task 28 ; they freely labeled facial configurations as "crying", "laughing", "looking", and so on (a finding replicated in yet another small-scale society, Trobriand Islanders 25 ). Based on these findings, we hypothesized that Hadza participants would be more likely to label the facial configurations with action words rather than mental state words, when compared to US participants.
Mentalizing. We coded participants' translated responses for whether their labels referred to mental states, including emotions and affective states 39,40 , as well as volitional (e.g., "intend"), cognitive (e.g., "remember"), and moral (e.g., "forgiving") states 41 .

Emotion perception using emotion words. We then coded responses for whether mental states corresponded to the emotion labels (or synonyms) associated with the universality hypothesis (anger, disgust, fear, happiness, sadness, and surprise), defined by empirically derived semantic clusters identified for US participants 42 (Cohen's kappa for inter-coder reliability: US data, κ = 0.89; Hadza data, κ = 0.92). The results are presented in Fig. 2 and Table 1. US participants were equally consistent in freely labeling each facial configuration with the expected emotion word, Cochran's Q(5) = 3.14, p = 0.68 (see diagonals of Fig. 2 and Table 1), and their labels showed a high degree of specificity (see off-diagonals in Fig. 2 and χ2 goodness-of-fit tests in Table 1). Hadza participants, by contrast, labeled certain facial configurations more consistently than others, Cochran's Q(5) = 67.99, p < 0.001, with variable specificity (see Fig. 2 and Table 1). Consistency was low and did not exceed chance-level responding for four of the six facial configurations tested.

Figure 2. Upper panel depicts verbal responses produced by Hadza (left) and US (right) samples that were coded as "mental states". The proportion of labels produced by a given sample is plotted, with higher-intensity (yellow) values indicating a higher proportion and lower-intensity (blue) values indicating a lower proportion; the numerical proportion is also presented in each cell. Responses are plotted by the coded label types produced (y-axis) for each facial configuration of interest (x-axis). Other mental = other mental labels offered that did not conform to otherwise coded categories. Lower panel depicts verbal responses produced by Hadza (left) and US (right) samples that were coded as consistent with a set of "functional" descriptions derived from the prior literature. Functional descriptions are clustered according to their theoretically proposed links to specific emotions. Other action = other action labels offered that did not conform to otherwise coded categories.
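Inter-coder reliability values like those reported above (e.g., κ = 0.89 and κ = 0.92) can be computed with Cohen's kappa. The following is a minimal sketch, not the authors' analysis code; the toy label lists are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' categorical labels on the same items."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: proportion of items both coders labeled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: product of each coder's marginal label proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codes ("m" = mental state, "a" = action identification).
kappa = cohens_kappa(["m", "m", "a", "a"], ["m", "m", "a", "a"])
```

With the perfectly agreeing toy codes above, kappa is 1.0; disagreements pull the value toward 0.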
Sixty-five percent of Hadza participants (N = 28) consistently labeled the scowling facial configuration as anger (i.e., "ofa-"), Prop Hadza = 0.65, SE = 0.07, p < 0.001, 95% CI [0.51, 0.79], which was statistically significant using a binomial test against an expected proportion of 0.16 (based on the number of available alternative facial configurations). All subsequent reported tests for above-chance consistency use this same approach. "Ofa-" was consistently applied to the scowling facial configuration at proportions well above chance, but standardized residuals of the χ2 tests indicated that the scowling configuration was not specifically labeled as "ofa-": this label was also the most characteristic for the nose-wrinkle facial configuration (offered by 7 participants), the wide-eyed gasping configuration (offered by 9 participants), and the pouting configuration (offered by 11 participants). Moreover, Hadza participants labeled scowling faces with other terms, including the general affective description "upset" (16.70%), action words such as "to grumble/sulk" (23.80%), or other idiosyncratic labels (see Supplementary Information, Table 3).
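The binomial testing approach described above can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: an exact one-sided binomial test against a chance rate of 1/6 (≈ 0.16), plus a normal-approximation (Wald) interval, which reproduces the reported bounds for the 28-of-43 "ofa-" result here, though the authors' exact interval procedure may differ.

```python
import math

def binomial_p_above_chance(successes, n, chance):
    """One-sided exact binomial p-value: probability of observing at least
    `successes` hits out of `n` trials when the true hit rate is `chance`."""
    return sum(math.comb(n, k) * chance**k * (1 - chance)**(n - k)
               for k in range(successes, n + 1))

def wald_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# 28 of 43 Hadza participants labeled the scowl "ofa-"; chance = 1/6.
p_value = binomial_p_above_chance(28, 43, 1 / 6)
lo, hi = wald_ci(28, 43)
```

The resulting p-value is far below 0.001, and the interval rounds to the reported [0.51, 0.79].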
Low specificity in Hadza use of the term for anger ("ofa-") may be due to over-reliance on anger as one of the few lexicalized emotion/mental state categories 44 . When we examined the content of a dictionary compiled for the Hadza language, we counted only 21 terms that appeared to be clear references to mental states, compared to hundreds offered in the English language for the specific domain of emotion 45 . Another possibility, of course, is that Hadza participants frequently offered anger-related words because anger is actually expressed with a variety of facial configurations in Hadza culture. Instances of anger are also expressed in the US with a diversity of facial movements and low specificity of scowls to anger 12,46 , yet US participants appear to have a more narrow stereotype that they rely on (for discussion, see 12 ) when compared to Hadza participants.
Forty-four percent of Hadza participants (N = 19) labeled the smiling facial configuration as an expression of happiness ("cheta" in Hadzane or "furahi" in Swahili), Prop Hadza = 0.44, SE = 0.08, p < 0.001, 95% CI [0.30, 0.59], revealing moderate consistency, even though 24 Hadza participants labeled smiling faces with other terms, including the general affective description "good" (20.90%), action words such as "smiling" (56.00%), or other idiosyncratic labels (see Supplementary Information, Table 3). The smiling facial configuration was labeled as happiness ("cheta/furahi") with a statistically significant level of specificity: although "cheta/furahi" was applied to other facial configurations, it was not characteristic of any of them (see Table 1). The interpretation of these findings is complicated by the fact that the smiling facial configuration was the only depiction of pleasant valence, in contrast to all the other facial configurations included in the study. As a consequence, it is unclear whether these free-labeling data support a hypothesis of universality for the emotion category of happiness or for the affective property of valence, distinguishable by zygomaticus facial muscle activation; we return to this observation when discussing a similar finding in Study 2.
Action identification. Responses were coded for whether they described actions such as "crying" or "seeing something" (Cohen's kappa for inter-coder reliability: US data, κ = 0.84; Hadza data, κ = 0.87). (This code was not mutually exclusive with the mental state codes reported above because full participant responses sometimes included both mental content and an action identification.) As predicted, Hadza participants labeled the facial configurations with a higher proportion of action-related labels when compared to US participants. The actions offered by Hadza participants were relatively more descriptive of actual physical movements, in that they referred to how an agent was moving (e.g., "looking") rather than the situational circumstances in which the actions occurred. In some cases, these action labels were situated, accompanied by details about the possible eliciting circumstances or context in which the actions occurred, but these more complex action labels were relatively less frequent. To further examine this distinction, we coded for specific physical movements, such as lashing out, crying, smelling, seeing, or laughing (Cohen's kappa range for inter-coder reliability: US data, κ = 0.90-0.92; Hadza data, κ = 0.79-1.00) and social communications, such as signaling dominance, alerting about a threat, or warning about aversive foods, using the descriptions available in 2 (Cohen's kappa for inter-coder reliability: Hadza data, κ = 1.00; it was not possible to compute a kappa for US participants because they did not offer sufficient social communication responses). A full list of the codes is provided in Supplementary Information, Table 3. The results are presented in Fig. 2 and Table 1; see also Supplementary Information, Table 4. Responses that reflected social functions were extremely sparse, such that no statistical analyses of these codes could be performed.
Scientists who study emotion have a priori assigned certain actions and physiological changes to specific emotion categories 5,6,[47][48][49] . Existing meta-analyses call these stipulations into question, however, suggesting that actions and physiological changes are weakly consistent for, and not specific to, individual emotion categories 50,51 . Furthermore, there is no evidence that, when participants label a facial configuration with an action-related word such as "smiling" or "looking", they are making an inference that the action is occurring in conjunction with an instance of a specific emotion category, or even during an emotional instance, per se [24][25][26]52 . Nonetheless, we classified the action words offered by our Hadza participants according to the cultural beliefs of western scientists and found some evidence of consistency, but only for a subset of the facial configurations, Cochran's Q(5) = 64.33, p < 0.001. We observed that 18 participants labeled the pouting face as "crying", which exceeded what would be expected for chance-level (0.16) consistency, Prop Hadza = 0.42, p < 0.001, SE = 0.08, 95% CI [0.28, 0.57], but this label was also characteristic for the nose-wrinkle facial configuration, indicating poor specificity. Twenty-five participants labeled the smiling facial configuration as "laughing" or "smiling", which exceeded what would be expected for chance-level (0.16) consistency, Prop Hadza = 0.58, p < 0.001, SE = 0.08, 95% CI [0.43, 0.72], but this behavior was also characteristic for wide-eyed gasping (posed fear) and wide-eyed (posed surprise) targets, indicating poor specificity. US participants, by comparison, produced very few action responses and did not differ in consistency based on target facial configuration, Cochran's Q(5) = 4.00, p = 0.549.
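Cochran's Q, used above to compare labeling consistency across the six facial configurations, can be computed as in the sketch below. This is an illustrative implementation with hypothetical toy data, not the study's analysis script.

```python
def cochrans_q(responses):
    """Cochran's Q for k related binary measures.
    `responses` is a list of rows, one per participant; each row holds k
    0/1 values (e.g., whether the expected label was produced for each of
    the k facial configurations). Q is compared to a chi-square
    distribution with k - 1 degrees of freedom."""
    k = len(responses[0])
    col_totals = [sum(row[j] for row in responses) for j in range(k)]
    row_totals = [sum(row) for row in responses]
    n = sum(row_totals)  # grand total of 1s
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    denominator = k * n - sum(r * r for r in row_totals)
    return numerator / denominator

# Toy data: 4 participants x 3 configurations (hypothetical).
q = cochrans_q([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
```

For the toy data, Q = 3.0 on 2 degrees of freedom, which would not reach conventional significance; the large Q values reported above (e.g., 67.99 on 5 df) indicate strongly uneven consistency across configurations.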
Note that many of the responses did not conform to these categories and were more idiosyncratic in nature (see Supplementary Information, Table 5), suggesting that there were many instances in which Hadza participants did not converge on a systematic description. This pattern of results implies that some Hadza participants are unfamiliar with these facial configurations.
We also examined references to three proposed physiological functions in both Hadza and US participant responses: widening eyes to enhance vigilance, widening eyes to enhance sensory processing, and the closing of nostrils to reduce exposure to contaminants. References to these functions were sparse and lacked consistency for the proposed target facial configurations (wide-eyed gasp, wide-eyed, and nose-wrinkle; see Table 1). We also examined partial references to functions such as vision, vomit, and olfaction, even when participants did not describe the consequences of the actions (such as seeing more clearly or reduction of exposure to contaminants). Six Hadza participants made reference to vision in response to the wide-eyed gasping configuration (the proposed expression of fear), and four participants made references to vision in response to the wide-eyed configuration (the proposed expression of surprise), but neither exceeded chance-level (0.16) consistency. Descriptions of vision were characteristic for both the wide-eyed gasping and wide-eyed configurations, suggesting no specificity for a single facial configuration (see Table 1). That faces with widened eyes are described as "looking" is consistent with the hypothesis that Hadza participants may have been literally describing the facial morphology of the configurations that they viewed. This finding may also suggest that Hadza participants understood the physiological function associated with a facial movement (e.g., people see more when their eyes are widened), but it does not itself imply an inference of a causal state of fear or surprise 53 .

Study 2: Labeling facial configurations with a choice-from-array. In Study 2, we employed a choice-from-array method because it has provided the strongest evidence to date 19 in support of universal perceptions of emotion from the face (see 10,13 for discussion).
This method only required that participants match a facial pose to an emotion word or phrase, rather than having to produce verbal labels for emotions. In addition, using this task with only two face stimuli, a target and a foil, allowed us to separately examine affect perception and emotion perception. In prior studies employing a choice-from-array method, it is possible that perceivers who appear to be distinguishing between facial poses for emotion (emotion perception) are merely using the different affective meanings depicted by the facial configurations. For example, participants may distinguish smiling from pouting facial configurations not because smiling is perceived as "happiness" and the other configurations are perceived as "anger", "sadness", and so on, but because smiling is usually perceived as pleasant and pouting as unpleasant (i.e., they differ in valence). Prior studies in small-scale societies have documented that perceivers are able to distinguish between facial configurations that differ in the degree to which they portray pleasant vs. unpleasant states (i.e., their valence features), even as they do not consistently distinguish between the proposed facial configurations for emotion categories that are thought to be universal [24][25][26]28 , consonant with the hypothesis that valence perception is universal 54 . We designed Study 2 to distinguish valence perception from emotion perception by varying the foils that were presented to participants on each choice-from-array trial, as outlined in Fig. 3. If Hadza participants chose the expected facial configuration for a given emotion scenario on affect-controlled trials, then they must be using features other than valence and arousal to do so, providing stronger evidence for universal emotion perception.
If, however, Hadza participants were less able to consistently choose the expected facial configuration for a given scenario on these affect-controlled trials when compared to trials where foils differed in valence features (arousal-controlled trials), arousal features (valence-controlled trials), or both features (affect-uncontrolled trials), then this would suggest that they are using affective features to support their performance in the task.
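The logic of the four trial types can be illustrated with a small sketch. The valence and arousal assignments below are inferred from the examples given in the text and figure caption (e.g., scowling as unpleasant/high arousal, pouting as unpleasant/low arousal); they are illustrative rather than the study's actual stimulus coding.

```python
# Hypothetical affective features for each facial configuration:
# (valence, arousal), inferred from the article's examples.
AFFECT = {
    "smiling":        ("pleasant",   "high"),
    "scowling":       ("unpleasant", "high"),
    "pouting":        ("unpleasant", "low"),
    "wide-eyed gasp": ("unpleasant", "high"),
}

def trial_type(target, foil):
    """Classify a target/foil pair by which affective features differ."""
    tv, ta = AFFECT[target]
    fv, fa = AFFECT[foil]
    if tv != fv and ta != fa:
        return "affect-uncontrolled"  # both valence and arousal differ
    if tv != fv:
        return "arousal-controlled"   # only valence differs (arousal held equal)
    if ta != fa:
        return "valence-controlled"   # only arousal differs (valence held equal)
    return "affect-controlled"        # neither feature distinguishes the pair
```

Under these assignments, smiling vs. scowling is an arousal-controlled trial, scowling vs. pouting is valence-controlled, smiling vs. pouting is affect-uncontrolled, and scowling vs. wide-eyed gasp is affect-controlled, matching the examples in the Figure 3 caption.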
Note that data from a choice-from-array task, even one that strictly controls for affect perception, are still open to alternative interpretation. For example, people can use a process-of-elimination strategy when performing a forced-choice task, in which unused options from prior trials are selected 55,56 . Forced choice can also produce convergence on a label merely because it represents the best available alternative, rather than because it faithfully reflects the inference an individual is making 57 . Finally, when participants are asked to match a posed configuration of facial muscles to a brief vignette that describes a situation, they may select a target face based on contextually appropriate behavior (e.g., widening eyes when confronted with something that requires further visual attention), without drawing on emotion knowledge or engaging any process related to emotion perception.
We analyzed the choice-from-array responses using a series of nonlinear (Bernoulli) hierarchical generalized linear models in HLM7 (SSI Inc., Lincolnwood, IL) with a logit link function to estimate the log-odds that participants' performance was above chance-level responding (i.e., selecting the hypothesized facial configuration on a given trial). We observed that both US and Hadza individuals, on average, chose the target facial configurations more often than would be expected by chance (0.5) across all four trial types (see Fig. 3b,c and Table 2). The society from which participants were sampled significantly moderated performance on all trial types (see Supplementary Information, Table 7). US and Hadza participants performed more similarly on trials in which valence features could be used to distinguish between targets and foils, consistent with the hypothesis that the perception of valence is highly replicable across societies. Hadza participants performed significantly better on trials in which valence features were available to distinguish target from foil. On affect-controlled trials, in which neither valence nor arousal could be used to distinguish target from foil, only 58% of Hadza participants selected the target facial configuration at above-chance levels (28 of 48 participants), compared to 90% who performed above chance on the arousal-controlled trials where valence features were available (43 of 48 participants). Hadza participants who had minimal other-culture exposure (based on proxy variables of formal schooling and second language fluency) had a similar pattern of performance across the four experimental conditions, although probabilities were lower (see Table 3). 
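As a simplified, non-hierarchical sketch of the logic of these models: an intercept-only Bernoulli model's estimated log-odds is the logit of the observed hit rate, which can be tested against chance (a probability of 0.5, i.e., log-odds of 0) with a Wald z statistic. The paper's hierarchical models, fit in HLM7, additionally modeled participant-level variation; the counts below are hypothetical.

```python
import math

def logit_test_against_chance(hits, trials):
    """Wald z-test of an intercept-only logistic model against chance (0.5).
    The maximum-likelihood intercept is the log-odds of the observed hit
    rate; its standard error is 1 / sqrt(n * p * (1 - p))."""
    p = hits / trials
    log_odds = math.log(p / (1 - p))
    se = 1 / math.sqrt(trials * p * (1 - p))
    z = (log_odds - 0.0) / se  # chance 0.5 corresponds to log-odds 0
    return log_odds, z

# Hypothetical pooled counts: 180 target choices across 240 trials.
log_odds, z = logit_test_against_chance(180, 240)
```

With a 75% hit rate over 240 trials, the estimated log-odds is ln(3) ≈ 1.10 and z ≈ 7.4, clearly above chance; a hierarchical fit would typically yield a similar point estimate with a wider interval.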
Most US participants (94% of the US sample), by contrast, selected the target facial configuration on the affect-controlled trials with high probability, even when valence and arousal features did not distinguish the target and foil (e.g., the scowling configuration hypothesized to be the universal expression of anger, the wide-eyed gasping expression hypothesized to be the universal expression of fear, and the nose-wrinkle configuration hypothesized to be the universal expression of disgust), suggesting that their task performance reflected inferences about emotional meaning.
Performance on affect-controlled trials: Emotion perception. US participants, with a probability between 0.86 and 0.89 of correctly choosing the hypothesized facial configurations for anger, fear, and disgust, outperformed Hadza participants on the affect-controlled trials, which are the most specific in assessing emotion perception (see Fig. 3e,f, Table 4). Hadza participants performed significantly above chance when choosing the hypothesized facial configurations for fear and anger, but not disgust (the probability of correctly identifying a target on a given trial was 0.72, 0.61, and 0.59, respectively). The society from which participants were sampled significantly moderated performance for all targets (see Supplementary Information, Table 8). Controlling for other-culture exposure reduced these probabilities, however, to 0.65, 0.58, and 0.60, respectively (Table 5). Of the 27 Hadza participants who spoke minimal Swahili and reported no formal schooling, 12 chose the wide-eyed gasping face for the fear category, which was significantly different from chance (see Fig. 3g, Table 5). Across all Hadza participants, level of formal schooling specifically moderated performance for the fear category; individuals with some formal schooling (involving greater exposure to cultural knowledge and norms other than their own, as well as the expectation to follow those norms) chose wide-eyed gasping facial configurations more frequently than did those with no formal schooling (see Table 5). In contrast, of the Hadza participants with minimal other-culture exposure, only nine chose the scowling facial configuration above chance for the anger category and only eight chose the nose-wrinkle facial configuration above chance for the disgust category, with the overall probabilities across participants not statistically different from chance (see Table 5).
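The schooling moderation reported above can be illustrated, in simplified form, as a comparison of target-choice rates between subgroups. The paper tested moderation within hierarchical models; the sketch below uses a pooled two-proportion z-test instead, and the group counts are hypothetical.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-proportion z-test with a pooled variance estimate, e.g., for
    comparing target-choice rates between participants with vs. without
    formal schooling."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts: schooled group 40/50 target choices, unschooled 25/50.
z = two_proportion_z(40, 50, 25, 50)
```

For these hypothetical counts, z ≈ 3.1 (p < 0.01), the direction of effect reported for the fear category; a hierarchical moderation term plays the analogous role while accounting for repeated trials within participants.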
Figure 3. Targets and foils for the four trial types. Facial configurations are examples because the stimulus sets restrict publication of the actual photographs. Arousal-controlled trials: the foil face differed from the target only in valence, i.e., in depicting positivity or negativity (e.g., a smiling facial configuration hypothesized to be the universal expression of happiness vs. a scowling facial configuration hypothesized to be the universal expression of anger). Valence is a descriptive feature of affect, along with a second feature, level of arousal. For example, some evidence suggests that perceivers may be able to distinguish scowling from pouting not because scowling is perceived as "anger" and pouting is perceived as "sadness" but because scowling is typically perceived as high arousal and pouting as low arousal. Valence-controlled trials: the foil face differed from the target only in depicting level of arousal (e.g., a scowling vs. a pouting configuration, hypothesized to be the universal expressions of anger and sadness, respectively). Affect-uncontrolled trials: the foil face differed from the target in depicting both valence and level of arousal (e.g., a smiling vs. a pouting configuration). Affect-controlled trials: the foil face matched the target in depicting valence and arousal (e.g., a scowling vs. a wide-eyed gasping facial configuration, hypothesized to be the universal expressions of anger and fear, respectively). Performance for each of the 4 experimental conditions (x-axis) is plotted for US participants (b), Hadza participants (c) and Hadza-M participants (d). Performance within the affect-controlled condition, for each of the 3 target facial configurations (x-axis), is plotted for US participants (e), Hadza participants (f) and Hadza-M participants (g). Individual data points represent mean proportion agreement (i.e., selecting a target matching the presumed universal model) for a given participant within a given condition. Contours of violin plots represent density of data points at a given agreement level. The horizontal red bar represents chance-level performance, and significance against chance-level responding is noted at the top of each violin plot: ***p < 0.001; **p < 0.01; *p < 0.05; †p < 0.10. Means combined with brackets represent conditions that do not statistically differ in χ2 tests (ps > 0.25). Statistically significant differences between conditions based on follow-up χ2 tests are notated using the same conventions, with the following exception: **(*) indicates statistical significance for individual tests ranged between p < 0.01 and p < 0.001.

Scientific Reports (2020) 10:3867 | https://doi.org/10.1038/s41598-020-60257-2

Performance in free-labeling vs. choice-from-array methods. We also examined the average proportion of agreement for the subset of participants who completed both the free-labeling and choice-from-array tasks, averaged across participants for Study 1 and averaged across trials and then participants for Study 2 (depicted in Table 6 and Supplementary Information Fig. 1). We adjusted the scores for guessing using a standard correction formula, (proportion correct − (1/number of choices))/(1 − (1/number of choices)), following 58 . As predicted, the free-labeling method yielded lower agreement levels than did the choice-from-array method, consistent with the general pattern observed in other published studies [10][11][12] . These findings do not appear to be due to practice effects from Study 1 (Supplementary Information Table 9). Notably, there were no statistical differences on trials in which target and foil could not be distinguished by valence and arousal (the affect-controlled trials, which most specifically assessed emotion perception), meaning that previously free-labeling the scowling, wide-eyed gasping, and nose-wrinkle facial configurations did not help Hadza participants to choose them as target facial configurations.
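The standard guessing correction can be expressed as a one-line function; this sketch simply restates the formula above (the function name is ours):

```python
def chance_corrected(proportion_correct: float, n_choices: int) -> float:
    """Correct a raw agreement proportion for guessing:
    (p - 1/k) / (1 - 1/k), where k is the number of response options.
    Returns 0.0 at chance-level responding and 1.0 at perfect agreement."""
    chance = 1.0 / n_choices
    return (proportion_correct - chance) / (1.0 - chance)

# With two response options, a raw proportion of 0.75 corrects to 0.50,
# and 0.50 (exactly chance) corrects to 0.0.
```

Note that below-chance raw proportions yield negative corrected scores under this formula.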

Discussion
We conducted two studies that provided little evidence of universal emotion perception among the Hadza, a small-scale, non-industrial population of hunter-gatherers residing in Tanzania, when compared with samples drawn from the United States, a post-industrialized nation in the cultural west. Observations from our Hadza participants represent an important test of uniformity versus diversity in emotion perception, given that the window of opportunity to work among the Hadza while they still live a predominantly foraging lifestyle is closing. Life in the Lake Eyasi basin has not remained static for the Hadza over the past century 59 and, increasingly, environmental change is impacting foraging behaviors and mobility 60,61 . Nonetheless, research with such a community provides a rare opportunity to investigate how emotion perception among hunter-gatherers (who are semi-nomadic and reside in small groups) adds to our cross-cultural understanding of emotional phenomena. Our findings are inconsistent with hypotheses that certain facial configurations were selected as universal expressions of emotion because they may have enhanced reproductive fitness [2][3][4] . Instead, our findings replicate the growing number of experiments 10 that reveal diversity, rather than uniformity, in how perceivers make sense of facial movements. Only one facial configuration, the wide-eyed gasping configuration, was chosen with any above-chance cross-cultural consistency in Study 2. This pattern might indicate universal fear perception, were it not for the fact that it was replicated neither in Study 1 nor in findings from other small-scale societies [24][25][26] .
The present results replicate prior published findings 24,25,27 in suggesting that people infer the valenced meaning of facial configurations similarly across societies, supporting the hypothesis that valence perception is universal 62 . Our findings also provide a possible context for reconsidering any choice-from-array studies that did not control the availability of affective features distinguishing a target face from its foil, including the landmark studies conducted by Ekman and colleagues in the late 1960s and early 1970s that have been interpreted as providing moderate to strong support for cross-cultural consistency in emotion perception 18,19,21 . Those studies may overestimate evidence in support of universality.
Our results also suggest that subtle variation in the cultural exposure of the participants sampled in the prior literature (Fig. 1) may further account for some of the cross-cultural consistency observed in prior published findings. For example, Diola participants in Burkina Faso 21 lived within walking distance of a town and were tested there. The Fore, a mixed-subsistence population in Papua New Guinea 18,19 , resided in a protectorate administered by the British, Germans, and Australians (between 1888 and 1975) and had sustained interactions with western settlers and missionaries; for a more detailed discussion, see 13 . Subtle variation was also evident in the present results, with formal schooling and second-language fluency affecting the extent to which individuals conformed to the proposed universal pattern. Similar variation in emotion perception task performance based on formal schooling has been observed in the United States (e.g., 63 ). This observed impact of formal schooling on emotion perception is consistent with (although not exclusively predicted by) a constructionist hypothesis that emotion perception is enculturated (i.e., guided by learned emotion concepts that are bootstrapped into the brain during early development 64,65 ). With respect to the current findings, some Hadza individuals were educated in a formal system with historical roots in German and British colonialism in Tanzania. Individuals who attended a regional primary school received instruction in the Swahili language and were potentially also exposed to English (secondary education is conducted in English). In addition, individuals who attended school likely had greater exposure to individuals from other ethnic groups. As a consequence, Hadza individuals with more formal schooling also likely had more opportunity to learn about psychological concepts, including emotion concepts, that would not have been socialized within the Hadza community.
When taken together with other published research on emotion perception in small-scale societies 10 , our findings are also consistent with the hypothesis that mentalizing is a culturally-reinforced mode of perception 66 anchoring one end of a social inference continuum 37 . In Study 1, Hadza participants more often engaged in action identification than US participants when freely labeling the facial configurations. This pattern was observed under stringent test conditions, because in Study 1 we asked participants to freely label the facial configurations in terms of what the person was feeling, a prompt that generates robust mentalizing (and minimal action perception) in US participants. The hypothesis of a culturally-sensitive social inference continuum is consistent with prior research with members of the Himba society, who also understood facial configurations in terms of situated actions rather than in terms of inner mental states 39,40 .
Mentalizing has been assigned a number of privileged functions in social life, such as allowing humans to "predict, explain, mold, and manipulate each other's behavior in ways that go well beyond the capabilities of other animals" 67 , p. 131. Further, in US and European psychology, mentalizing is thought to facilitate social connections and intimacy 68 , such that individuals who have difficulty with mentalizing are predicted to have deficits in social functioning. Yet even within societies that appear to reinforce a high degree of mental inference, such as the United States, there is normal variation in the extent to which people infer mental states as the cause of observable actions 38 , and this variation may be additionally driven by the individual's goals in a given situation and relative social status 69,70 .
The present studies are not without limitations. Hadza and US participants judged only static facial configurations posed by individuals living in a western cultural context. Prior research investigating how participants from small-scale societies perceive emotion in dynamic versus static faces did not, however, yield substantially distinct effects 24 , and Hadza participants labeled static poses with a variety of dynamic behaviors. In addition, we are unable to isolate the specific cultural features that drove differences in emotion perception between Hadza and US participants, consistent with well-documented limitations of the two-culture approach we adopted 16 . Finally, we might have observed differences in emotion perception across cultures because Hadza participants were less familiar with experimental tasks more generally. Points that mitigate this concern (in addition to our manipulation checks) include the facts that 1) the same experimental tasks have been used in published studies in other small-scale societies 19,22,27 and 2) the participants enrolled in our experiments were not naïve to testing (both Hadza camps that we sampled from are active fieldwork sites for anthropologists and psychologists, although participants had not performed emotion perception tasks prior to our testing 31,59 ).
Finally, our findings are best understood as consistent with theoretical frameworks, including our constructionist account 65,71 , that hypothesize more substantial intrinsic sources of variation in both emotional expression and perception than is true for classical or prototype emotion accounts 8,72-74 and evolutionary accounts of discrete emotions [2][3][4] . Recent meta-analyses and reviews indicate that instances of an emotion category such as anger, like the instances of other emotion categories, vary considerably in their associated physiological changes 75 , facial movements 76 , and even in their neural correlates, whether measured at the level of individual neurons 77,78 , as activity in specific brain regions 79 , or as distributed patterns of activity 80 . Instances of an emotion category can also vary in their affective features (e.g., some instances of fear can feel pleasant, and some instances of happiness can feel unpleasant; 81,82 ). As a consequence, we propose that instances of emotion are the result of evolution, but not because they issue from innate, modular systems that promote a cascade of prepared responses, including the generation of diagnostic facial expressions. Instead, we hypothesize that instances of emotion are emergent products of multiple biologically-evolved mechanisms that depend on cultural learning 83 .

Note to Table 5. The HGLM includes both formal schooling and Swahili language skill as Level-2 predictors. Self-reported formal schooling was entered based on the number of years completed. Self-reported Swahili language skill was dichotomized as 0 = poor, 1 = good. The intercept for each condition, what we call "minimal cultural exposure", tests performance against chance-level responding for participants who self-reported poor Swahili language skill and had no years of formal schooling. For models separately examining Schooling and Swahili as Level-2 predictors, see Supplementary Information Table 10.
We specifically propose that the developing brain bootstraps embodied concepts into its wiring, creating an internal model for how to best regulate the body across a range of situations within the constraints of a culturally-shaped world 84,85 . Accumulating evidence suggests that the human brain evolved to require an extended period of brain development, wiring itself to its physical and cultural surroundings, thereby allowing it to build a model of the world that is tailored to particular social and environmental contexts 86 . This embraining of culture may allow people to survive and thrive as a social species in a wide variety of contexts 87 . This perspective is rooted in the Darwinian concept of population thinking 65 , in which variation provides the necessary flexibility for locally-adaptive responding.
In the constructionist tradition, we hypothesize that the human brain constructs emotions, as needed, in a way that is tailored to the requirements of the immediate situation. In cognitive science, these are referred to as ad hoc categories 73 . An ad hoc category is a situated, abstract category: the instances are variable in their physical and perceptual features but similar in function, with the specific function changing from situation to situation. Consider, for example, the category for anger within western cultures: in situations involving a competition or negotiation, the anger category might be constructed such that instances share the functional goal 'to win' 88 ; in situations of threat, the anger category might cohere around the functional goal 'to be effective' 89 or even 'to appear powerful' 90 ; and in situations involving coordinated action, the anger category might include instances that share the functional goal 'to be part of a group' 91 . We hypothesize that individuals learn to construct these situation-specific categories based on what is considered most functional in their immediate cultural context 15,71,92 . Correspondingly, emotional expressions may be highly variable, tailored to the demands of a situation, and may have functional forms that align with goal-directed behavior. As a result, both culturally-divergent and -convergent pathways for emotional instances, including their expressions, are predicted based on the unique and recurrent demands placed on humans across societies. Future work using a constructionist framework to guide hypothesis generation and experimental design may lead to more mechanism-based investigations of these pathways.

Method
Both experiments were approved by Northeastern University's Institutional Review Board, and the research was performed in accordance with its guidelines and regulations to ensure the ethical treatment of human subjects. All participants provided informed consent before beginning the experiment. All individuals whose photos were used in the experiments consented to having their photos taken, used in scientific research, and published in scientific reports. All data collection and consent procedures were approved by the Tanzanian Commission for Science and Technology (COSTECH).

Faces were presented on a computer screen at central fixation and remained on screen until a participant provided a response. Participants were tested in a seated position, at a comfortable but non-standardized distance from the screen, and were allowed to move closer to the screen to inspect targets as needed. Face images were presented as rectangular photographs with an onscreen width and height of 4 × 6 inches; the faces occupied approximately 75% of the image height. Viewing distance was not measured during the experiment but can be estimated to vary between a lower bound of 20 inches (resulting in 11.42 × 17.06 degrees of visual angle for the photograph) and an upper bound of 40 inches (resulting in 5.72 × 8.57 degrees of visual angle for the photograph). Visual angle may thus be a nuisance variable in these data. In the field study, instructions and materials were pre-recorded in Hadzane and presented over headphones. Hadza participants responded via an interpreter either in their native language (Hadzane) or in their second language (Swahili), which is commonly spoken; responses were translated into English prior to coding. A response was coded as in agreement if a participant offered a label that was semantically consistent with the expected English-language emotion category (see Supplementary Information for more details).
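The visual-angle estimates quoted above follow from the standard geometry of a flat stimulus viewed head-on; a minimal sketch (the function name is ours):

```python
import math

def visual_angle_deg(size_inches: float, distance_inches: float) -> float:
    """Full visual angle (in degrees) subtended by a stimulus of a given
    extent viewed head-on from a given distance: 2 * atan(size / (2 * d))."""
    return math.degrees(2.0 * math.atan(size_inches / (2.0 * distance_inches)))

# A 4 x 6 inch photograph at the 20-inch lower bound subtends roughly
# 11.42 x 17.06 degrees; at the 40-inch upper bound, roughly 5.72 x 8.58.
```

Note that halving the viewing distance slightly more than doubles the visual angle, since the relationship is arctangent rather than linear.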

Stimuli.
Choice-from-array. In Study 2, 54 Hadza individuals (25 women, 29 men; ages 18-75) from the same two camps in the Great Rift Valley were tested individually. Data from six individuals were removed prior to analysis due to non-compliance. Forty-eight US individuals (31 women, 17 men), ranging in age from 24 to 67 (median age 42.50), were recruited on Mechanical Turk. Participants were required to be native English speakers, over 18 years of age, and to have normal or corrected-to-normal vision. The sample included 39 individuals who identified as White, 3 individuals who identified as Black or of African descent, 1 individual who identified as Native American, and 5 individuals who reported being of mixed race or another category. Participants were recruited to have high-school level schooling or less (three of the 48 participants did not have a high school degree, and one participant had completed some college). This targeted sampling strategy was possible because our participants were recruited online (in contrast to Study 1, which was conducted in a public location on the Northeastern University campus).
During the experiment, a participant was presented with 24 trials in a fully randomized order. On a given trial, participants listened to a short emotional vignette that included an emotion category label (see Fig. 3a). Vignettes were adapted from the prior literature and evaluated for cultural appropriateness by two individuals (first by ANC, a co-author and anthropologist who has worked among the Hadza population for 15 years, and then by co-author SM, who has long-term experience as a research assistant and is a member of the Hadza community).
In the Hadza study, task instructions and materials were presented in Hadzane. Swahili was used when no direct translation was available in the Hadzane language for specific emotion terms (i.e., "surprise" and "sadness" were translated to "shangaa" and "huzunika", respectively). The necessity of Swahili was based on the judgment of a native speaker of Hadzane and verified via consultation of a Hadzane lexicon compiled by linguists in collaboration with native Hadzane speakers 44 . In the online US study, all task instructions were provided in written English and all vignettes were presented in audio recordings that could be played via computer speakers or headphones. In the field study, vignettes were presented over noise-cancelling headphones so that the experimenter and translator were blind to the vignette presented on a given trial.
Following the vignette, participants were asked to choose one of two pictures of posed facial expressions presented side-by-side on the computer screen. Participants indicated which face matched the emotion that the person in the vignette was feeling. Hadza participants rendered a response by pressing the face on the computer's touch screen; US participants used their mouse to click on the target face. The face images remained on screen until a response was rendered. In both the field study in Tanzania and in the online comparison study, the visual angle of the targets was not controlled, and participants were allowed to move closer to the screen to inspect targets as needed. In the field study, face images were presented as described above (4 × 6 inch photographs, with an unmeasured viewing distance estimated to fall between 20 and 40 inches). Participants in the United States, who completed the experiment online, may have had even more variable visual angles because the size of the monitor on which the faces were presented was not standardized. Visual angle may thus be a nuisance variable in these data.
Each trial contained a target (based on the a priori model for universal emotions and on norming conducted in a US, English speaking sample; see Supplementary Information for details). Each trial also contained a foil that either matched the target in both valence and arousal (affect controlled), matched the target in valence but not arousal (valence controlled), matched the target in arousal but not valence (arousal controlled), or did not match the target in either valence or arousal (affect uncontrolled) (Fig. 3a).
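The four foil conditions form a 2 × 2 crossing of whether the foil matches the target in valence and in arousal; a minimal sketch of the design (key names are ours, following the labels above):

```python
# Affective similarity of foil to target for the four trial types:
# each entry is (foil matches target valence, foil matches target arousal).
TRIAL_TYPES = {
    "affect_controlled":   (True, True),    # same valence and arousal
    "valence_controlled":  (True, False),   # differs only in arousal
    "arousal_controlled":  (False, True),   # differs only in valence
    "affect_uncontrolled": (False, False),  # differs in both
}
```

Laying the design out this way makes the logic of the affect-controlled condition explicit: when both affective features match, only inferred emotional meaning can distinguish target from foil.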
These methods closely followed those used in Ekman and Friesen's classic study in Papua New Guinea 19 , with two exceptions: 1) we administered the experiment on a computer with headphones, and 2) we systematically controlled the affective similarity of the foil to the target to examine cross-cultural perception of affect, as in our prior work 27 .

Data availability
The de-identified datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request. Following publication, data will be posted to the OSF by the first author.