Introduction

The phylogenetic origin of human language is a current hot topic in science.1,2,3 The possibility of finding precursors in nonhuman primates' vocal communication in particular remains a subject of lively discussion among scientists and a challenging task, given that, despite their phylogenetic and anatomical proximity to humans, nonhuman primates' calls are strongly determined genetically, unlike human language. However, moving the debate from production and acoustic plasticity to usage reveals human-nonhuman primate parallels, notably regarding adult-young differences in sensitivity to communication rules.3 For instance, although vervet monkeys are able to produce their species' eagle alarm call as soon as they are born, immature individuals have to learn the appropriate context of emission: they first give the alarm after perceiving any flying object and, later on, only in response to their real bird predators.4 Specific vocal interactions between infant and adult tamarins in food transfer contexts create an opportunity for infants to learn not only what food is appropriate but also what vocalisations are appropriate in feeding contexts.5 As they grow older, the proportion of adult-form food calls emitted by infants in this context increases while the proportion of non-food-associated calls decreases.

Conversations, during which interlocutors have to respect turn-taking rules, represent the social core of human language usage.6 Appropriate turn-taking requires that interlocutors A and B respect three basic rules:6,7,8 1) A and B synchronise their speech temporally and avoid talking simultaneously (i.e. with a minimum and maximum inter-speech delay); 2) at any given moment during a conversation, interlocutors have coordinated roles: one of them sends a signal and the other one responds (i.e. following the pattern AB and not AA or BB); 3) during longer-term conversations (dialogue), the two interlocutors alternate (e.g. going from AB to ABA, ABAB…). The only interlocutors who break these rules are children at early stages of language development, children who have been poorly tutored by adults or children with developmental disorders.9,10 The fact that turn-taking appears to be universal in humans raises the question of its potential biological basis.11 Some authors have claimed that some nonhuman primates' vocal exchanges can be seen as primitive forms of conversation as they appear to respect the fundamental aforementioned turn-taking rules,12,13,14 but no study has assessed the pertinence of these rules at different developmental stages in nonhuman primates and no experimental design has ever investigated the functional relevance of these rules.

Campbell's monkeys (Cercopithecus campbelli campbelli) are known for their complex vocal abilities, and several studies have demonstrated parallels with human language at different production, perception and usage levels (e.g. semantic alarm calls, sound combinatorial rules: proto-affixation, proto-syntax and prosody).15,16,17 The social core of Campbell's monkey intra-harem communication relies on female contact call utterances. Adult females rarely call alone; rather, they call during temporally and socially structured vocal exchanges generally involving only 2 (43%) or 3 (31%) individuals.18 These interactions have been described as primitive forms of conversation. First, vocal exchanges are characterised by inter-call intervals that last about one call duration (preventing call overlap) but not longer than one second (threshold criterion defining vocal response). Second, during a vocal exchange, consecutive calling from the same individual is rare (less than 1% of the occurrences). Third, although most vocal exchanges follow the simple signal-response pattern (AB), extension of the interaction respecting alternation of interlocutors (e.g. ABA) is fairly common (21% of vocal exchanges). Moreover, several studies confirm the social relevance of these vocal interactions. The choice of interlocutor is not random: elders are preferred targets of vocal responses,18 a characteristic described as universal in traditional human oral societies,19 and vocal exchanges are said to facilitate social integration.20 Also, Campbell's monkeys are able to modify the acoustic structure of their contact call to match the structure of their preferred partners independently of kin relatedness,21 a trait described in humans as vocal accommodation.22 To assess the pertinence of the turn-taking ability at different developmental stages in this species, we recorded spontaneous vocal utterances of young and adult Campbell's monkeys in a naturalistic context.
We evaluated the propensity of young and adults to violate the signal-response rule. Then, using playbacks of vocal exchanges that either respected or violated the rule of alternation between interlocutors, we investigated the perception and cognitive relevance of the turn-taking rule for young and adult monkeys.

Results

Campbell's monkeys' contact calls are produced in temporally structured bouts with a maximum inter-call interval of 1 second.18 Within bouts, an individual can either call several times consecutively on its own (inappropriate repetition, IR) or call following the call of another individual (appropriate response, AR). We calculated the proportion of inappropriate vocal utterances for each caller in relation to all calls (IR/(IR+AR)). Young individuals produced significantly more inappropriate vocal utterances than did adults (Mann-Whitney, Nadult = 7, Nyoung = 5, W = 28, P = 0.003) (Figure 1). Repeated calling generally consisted of 2 (84%), sometimes 3 (14%), and rarely more consecutive calls from the same caller. Our observations thus revealed that juveniles, who are less socially experienced, broke the turn-taking rule significantly more often than adults did, by repeating their call once or several times before another interlocutor responded.
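The group comparison can be illustrated with a small exact rank-sum test. The following is a minimal sketch, not the analysis script used in the study: the function name and the per-individual proportions are hypothetical placeholders (individual values are not given in the text), and the ranking assumes no ties. It illustrates that, with 7 adults and 5 young, complete separation of the two groups gives an adult rank sum of 28 and an exact two-sided P of 2/792, which would be consistent with the reported W = 28 and P = 0.003 if W is read as the adult rank sum (an assumption on our part).

```python
from itertools import combinations
from math import comb

def exact_rank_sum_test(group_a, group_b):
    """Exact two-sided Wilcoxon rank-sum (Mann-Whitney) test, assuming no ties.

    Returns (rank sum of group_a, exact two-sided p-value).
    """
    pooled = sorted(group_a + group_b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # valid only without ties
    n_a, n_total = len(group_a), len(pooled)
    w_obs = sum(rank[v] for v in group_a)
    expected = n_a * (n_total + 1) / 2  # mean rank sum under the null hypothesis
    # Enumerate every possible set of ranks that group_a could have received.
    extreme = sum(
        1
        for ranks in combinations(range(1, n_total + 1), n_a)
        if abs(sum(ranks) - expected) >= abs(w_obs - expected)
    )
    return w_obs, extreme / comb(n_total, n_a)

# Hypothetical IR/(IR+AR) proportions, chosen only so that every adult value
# lies below every young value; these are NOT the study's data.
adults = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
young = [0.10, 0.15, 0.20, 0.25, 0.30]

w, p = exact_rank_sum_test(adults, young)
print(w, p)
```

With only 792 possible rank assignments, full enumeration is cheap and avoids relying on a normal approximation for such small samples.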

Figure 1: Observations.

The proportion of spontaneous inappropriate contact calls uttered in relation to Campbell's monkeys' age. **P≤0.01

During playback experiments a subject could hear either an appropriate (the two interlocutors A and B call in turn) or an inappropriate vocal exchange (A breaks turn-taking). Analysis of durations of gazes at the loudspeaker showed that, on the one hand, adults were significantly more interested in appropriate than in inappropriate vocal exchanges as their visual attention increased when interlocutors respected turn-taking (Wilcoxon, N = 7, Z = 2.028, P = 0.043) and that, on the other hand, the levels of visual attention of young monkeys did not differ significantly between stimuli (N = 6, Z = 0.374, P = 0.7532) (Figure 2).
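For readers who wish to relate a signed-rank statistic to the Z values reported here, the normal approximation can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name is hypothetical, no tie or continuity correction is applied, and the value T = 2 is a hypothetical input (the source does not report T). Under these assumptions, T = 2 with N = 7 yields Z close to 2.028 and a two-sided P close to 0.043.

```python
from math import erfc, sqrt

def signed_rank_z(t_stat, n):
    """Normal approximation for the Wilcoxon signed-rank statistic.

    t_stat: sum of the ranks carrying one sign.
    n: number of non-zero paired differences.
    No tie or continuity correction is applied (an assumption of this sketch).
    """
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = abs(t_stat - mean) / sd
    p_two_sided = erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))
    return z, p_two_sided

# Hypothetical statistic: T = 2 for 7 paired subjects.
z, p = signed_rank_z(t_stat=2, n=7)
print(round(z, 3), p)
```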

Figure 2: Playback experiments.

Visual attention (i.e. duration of looking towards the speaker in the 30 seconds ‘after’ minus 30 seconds ‘before’ playback) paid to appropriate and inappropriate vocal exchanges (respecting or not the turn-taking rule) in relation to listener's age. *P≤0.05, NS: P>0.05.

Discussion

Here, we show, to our knowledge for the first time in a nonhuman primate species, that during a conversation-like interaction the appropriate way of calling is age-dependent and that the turn-taking rule is cognitively relevant for adults, whereas it does not seem to be meaningful for inexperienced young monkeys. The ability to converse with other group members was probably a key step in the evolution from vocal communication to language and would have emerged before the abilities to articulate sounds and learn new acoustic structures, with roots deep in the primate lineage.

The developmental processes underlying these adult-young differences in nonhuman primates remain open to debate. Social learning very probably occurs here, as in vervets and tamarins,4,5 although we cannot exclude some kind of cognitive maturation. Turn-taking is a key component of human social exchanges and occurs early during pre-verbal interactions between infants and their caregivers.9 Turn-taking triggers infants' production of speech-like sounds.23 Pragmatic challenges, namely how to use language appropriately, may also improve children's syntactic and lexical skills during later development. Nevertheless, some children with autistic syndromes can express preserved syntactic and lexical skills relatively well despite severe impairments of their abilities to use language appropriately.24 Therefore, relationships between different facets of communication remain a major unresolved issue that needs to be explored further in order to understand fully both the development and the evolution of animal vocal communication and human language.

The psychological mechanisms underlying the behavioural response of adults in our playback experiments raise an interesting question. Although some studies show that human and nonhuman primates tend to pay more attention to the stimulus simulating the rule violation than to the appropriate stimulus,4,25 we found the opposite pattern here. However, gaze duration is also commonly used in human and nonhuman primate studies as a measure of a subject's preferences.26,27 Accordingly, we believe that our subjects evaluated the opportunity to participate in the ongoing simulated vocal exchange and lost interest more quickly when they realised that the exchange was inappropriate.

The last question concerns whether we can consider this characteristic to be a precursor of human conversation. Another possibility is that this characteristic emerged through convergent (or parallel) adaptation to a common environmental factor. Both hypotheses are interesting because they predict something about the communication behaviour of the shared ancestor of the two species. The convergent evolution hypothesis is supported by the fact that chimpanzees, although closely related to humans, do not seem to respect turn-taking.28 Most of their calls are uttered independently and vocal responses are relatively rare. Conversely, they have been observed emitting socially relevant chorusing supporting call matching,29 and temporally structured calling coordination has been reported in other apes.30 Firm statements can only be made once a reasonable number of species with known phylogenetic links to modern humans have been investigated, something that the current literature does not yet allow.

Methods

Subjects

Subjects belonged to a social group including seven adult females (4.5 to 17 years old), their five juvenile (2 males, 3 females, 2.5 to 3.5 years old) and two infant (1 male, 1 female, 6 months old) offspring. They were housed at Rennes 1 University (Station biologique de Paimpont) and lived in an indoor (9.60 m × 1.65 m × 3.25 m) and outdoor (29 m × 9.80 m × 4.20 m) enclosure enriched with wood perches, natural grass and straw litter. Water was available ad libitum. Animals were fed twice a day with fruit, vegetables and food pellets. Observations and experiments were conducted while the group was confined to the outdoor enclosure and complied with the current French laws governing animal research.

Observations: Spontaneous contact call production

Contact calls (N = 3487) from all group members (apart from the two infants, who produced hardly audible calls) were simultaneously recorded in February 2009 (all-occurrence sampling: 20 recording hours randomly distributed between 10am and 5pm) using a Sennheiser MKH70-1 microphone connected to a Marantz PMD660 recorder (sample rate = 44.1 kHz / resolution = 16 bits). Analysis of the total number of calls recorded (mean ± s.e.: Nadult = 75 ± 18.6, Nyoung = 54 ± 20.8) gave the number of times a given individual called several times in succession on its own (inappropriate repetition, IR) or responded to another individual (appropriate response, AR).
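The scoring of calls as IR or AR can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function name, the record layout and the example calls are hypothetical, the inter-call interval is taken as the time from the end of one call to the start of the next, and a call that opens a bout (more than 1 second after the previous call) is left unscored.

```python
from collections import Counter

MAX_GAP_S = 1.0  # maximum inter-call interval within a bout, from the text

def inappropriate_proportions(calls, max_gap=MAX_GAP_S):
    """calls: list of (onset_s, offset_s, caller_id), sorted by onset.

    Returns {caller_id: IR / (IR + AR)} for callers with at least one scored call.
    Assumption of this sketch: a call is scored only if it follows the previous
    call by at most max_gap seconds; bout-initial calls are not scored.
    """
    ir, ar = Counter(), Counter()
    for (_, prev_off, prev_id), (onset, _, caller) in zip(calls, calls[1:]):
        if onset - prev_off > max_gap:
            continue  # a new bout starts here; this call is not scored
        if caller == prev_id:
            ir[caller] += 1  # same caller again: inappropriate repetition
        else:
            ar[caller] += 1  # different caller: appropriate response
    return {c: ir[c] / (ir[c] + ar[c]) for c in set(ir) | set(ar)}

# Hypothetical recording: one bout A-B-A, then one bout J-J-A.
calls = [
    (0.0, 0.3, "A"), (0.7, 1.0, "B"), (1.4, 1.7, "A"),
    (5.0, 5.3, "J"), (5.6, 5.9, "J"), (6.2, 6.5, "A"),
]
props = inappropriate_proportions(calls)
print(props)
```

In this toy recording only the hypothetical caller J repeats its own call, so J is the only individual with a non-zero proportion.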

Experiments: Playback experiments

Stimuli

In April and May 2010, 39 contact calls from the seven adult females were recorded using the above-mentioned equipment. Based on these calls, 13 sets of acoustic stimuli were formed (using Avisoft) so that the same call exemplar was never used twice. A set included one appropriate vocal exchange (following the pattern A1BA2, with A and B being two different callers taking turns and 1 and 2 two different call exemplars) and one inappropriate vocal exchange (after artificially rearranging the same three calls following the new pattern BA1A2, in which A called twice in succession). Two consecutive calls were separated by a 400 ms silent pause (common response delay).18 The appropriate and inappropriate stimuli had similar total durations (respectively, 1.63 ± 0.24 and 1.62 ± 0.24 seconds). Each subject (7 adults, 6 juveniles: the older son had been removed after the observations to simulate natural emigration) was assigned one set of calls. Two A/B individuals were randomly allotted to each subject under one condition: each adult/juvenile heard a set containing the calls of a sister or an aunt and of an unrelated female.
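The construction of the two exchanges in a set can be sketched as a simple timeline. This is an illustrative sketch only: the function name and the call durations are hypothetical (they were chosen so that the total matches the reported mean of about 1.63 seconds), and the actual stimuli were edited in Avisoft rather than generated by code.

```python
PAUSE_S = 0.4  # silent pause between consecutive calls, from the text

def build_exchange(order, durations, pause=PAUSE_S):
    """Return ([(label, onset_s, offset_s), ...], total_duration_s)."""
    timeline, t = [], 0.0
    for i, label in enumerate(order):
        if i > 0:
            t += pause  # insert the silent pause before every call but the first
        timeline.append((label, t, t + durations[label]))
        t += durations[label]
    return timeline, t

# Hypothetical call durations in seconds (not measured values from the study).
durations = {"A1": 0.28, "B": 0.27, "A2": 0.28}

appropriate, dur_app = build_exchange(["A1", "B", "A2"], durations)
inappropriate, dur_inapp = build_exchange(["B", "A1", "A2"], durations)
print(appropriate, dur_app)
print(inappropriate, dur_inapp)
```

Because both orders reuse the same three calls and the same two pauses, the two totals coincide in this simplified sketch; the reported means differ very slightly (1.63 versus 1.62 seconds).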

Conditions and measures

Sounds were played back using a Kuldeski loudspeaker connected to the Marantz player at 70 dB SPL (matching the species' natural loudness for communication at comparable distances). Two experimenters, a loudspeaker and an opaque board were inside the enclosure in order to conduct the experiment. The loudspeaker was hidden, inside the outdoor enclosure where the whole group was free to move, behind a 90 × 150 cm mobile wooden board held by an observer placed on one side who also played the recording. Another observer on the other side video-recorded the subject's gazes 30 seconds before and 30 seconds after the playback. The loudspeaker and the board were moved to a different place for each trial. To habituate the subjects and to prevent conditioning, mock experiments (without sound) were conducted twice a day one week prior to the study and continued randomly between playbacks (once a day every three days on average). During a playback, the board was placed on average 3 m (min = 2, max = 4) from a subject (similar for all situations: appropriate vs inappropriate and adult vs juvenile) when the latter was sitting (head orientation: at least 90 degrees from the speaker) and resting. The body position in relation to the board location was matched for the two tests for each subject. The loudspeaker was systematically placed 50 cm behind the board. To make the situation as plausible as possible, the loudspeaker was placed so that both A and B were 2 to 4 m behind the board from the point of view of the subject. As far as we could see, A and B never reacted in any particular way when hearing their own voices. At least two days separated a given subject's two tests and half of the subjects heard the appropriate exchange first. Never more than four subjects were tested in one day, randomly between 9am and 5pm.
To prevent differential habituation of subjects we also randomised the tests of adults and of juveniles, resulting in a balanced number of trials per day (mean ± s.e.): Youngappropriate = Younginappropriate = 0.46 ± 0.14, Adultappropriate = Adultinappropriate = 0.54 ± 0.18. Sounds were played only when the whole group was calm, outside feeding periods and when the subject had no neighbour within 1.5 m. All videos were then watched in slow motion using Dartfish™ software for coding. The response measured was the total duration of gazes towards the board in the 30 seconds ‘after’ minus the 30 seconds ‘before’ the experimenter pressed the play button.26 A second, naive and independent observer re-coded a randomly selected 25% of the videos. We could thus confirm the consistency of the video coding (Spearman rank-order correlation test: rs = 0.99, P<0.001).
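The response measure and the inter-observer check can be sketched as follows. This is a minimal illustration, not the software used in the study: the function names and the gaze durations are hypothetical, and the Spearman coefficient is computed as the Pearson correlation of average ranks.

```python
def gaze_response(before_s, after_s):
    """Gaze duration in the 30 s after playback minus the 30 s before."""
    return after_s - before_s

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical (before, after) gaze durations in seconds for four videos,
# as coded by two observers; these are NOT the study's data.
coder_1 = [gaze_response(b, a) for b, a in [(1.0, 4.5), (2.0, 2.5), (0.5, 0.0), (3.0, 6.0)]]
coder_2 = [gaze_response(b, a) for b, a in [(1.2, 4.4), (2.1, 2.4), (0.4, 0.1), (2.8, 6.1)]]
rs = spearman_rs(coder_1, coder_2)
print(rs)
```

A negative response simply means that the subject looked towards the board for less time after the playback than before it.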