Say what I mean – Expectancy effects in the EEG during joint and spontaneous word-by-word sentence production

Our aim in the present study is to measure neural correlates during spontaneous interactive sentence production. We present a novel approach using the word-by-word technique from improvisational theatre, in which two speakers jointly produce one sentence. This paradigm allows the assessment of behavioural aspects, such as turn-times, and electrophysiological responses, such as event-related-potentials (ERPs). Twenty-five participants constructed a cued but spontaneous four-word German sentence together with a confederate, taking turns for each word of the sentence. In 30% of the trials, an unexpected gender-marked article was uttered by the confederate. To complete the sentence in a meaningful way, the participant had to detect the violation, (possibly) inhibit a prepared response, and retrieve and utter a new fitting response. We found significant increases in response times after unexpected words and – despite allowing unscripted language production and naturally varying speech material – successfully detected significant N400 and P600 ERP effects for the unexpected word. The N400 EEG activity further significantly predicted the response time of the subsequent turn. Our results show that combining behavioural and neuroscientific measures of verbal interactions while retaining sufficient experimental control is possible, and that this combination provides promising insights into the mechanisms of spontaneous spoken dialogue.

The exchange between two persons conversing is stunningly fast: interlocutors take turns speaking and listening at a rapid rate, requiring them to produce and process language simultaneously. The time needed for producing an utterance is commonly longer than the response times observed during natural conversations [1]. This speed can be accomplished by forming expectations, for example of the length of a turn and the meaning of the utterance [2,3], and then preparing one's own response based on these expectations [4]. When expectations are not met, however, language processing is often slowed down [5,6]. Aside from the time needed to process the unexpected event, interlocutors may need additional time to adapt a prepared response to the new unexpected context.

To understand the effect of expectations and expectation violations during interactions, not only behavioural but also neural underpinnings might provide a valuable frame. There is, indeed, a major interest in moving towards a neuroscience of social interaction [7-10]. Interactions, however, are characterized by their openness, while neuroscientific devices impose major constraints on the setting if sensible brain data are to be measured. It is challenging to bring these two together [11].

One setting in which interactive patterns can be observed in an experimentally controllable environment is the word-by-word exercise from improvisational theatre [12]. In this game, two persons construct a story together by taking turns for each word. A high degree of coordination is necessary, as the interacting partners have to adapt to each other turn-by-turn in order to produce a meaningful sentence. We believe this level of coordination is achieved by forming expectations about the partner's next utterance. Similar to natural interactions, one can observe that when expectations are not met the player hesitates to produce the next turn.
The word-by-word setting further keeps the principal turn-taking structure of natural interactions intact while giving the possibility to manipulate systematically whether the preceding turn prompts an unexpected sentence completion.

Electroencephalography (EEG) has a high temporal resolution, which can capture the fine temporal structure of verbal interactions. The clear structure of the word-by-word paradigm lends itself to EEG recordings, as it ensures valid segmentation of the time periods of interest. EEG studies including semantic expectancy violations have mainly reported modulations of the N400 event-related potential (ERP), linked to semantic processing, and the P600 ERP, linked to syntactic processing [5,13-15]. The N400 effect is characterized by an amplitude modulation in the averaged EEG approximately 400 ms after an unexpected compared to an expected word, indexing sensitivity to semantic expectancy [16-23]. The N400 is often followed by a so-called late positive complex or P600 that has been associated not only with syntactic analysis but also with overall re-analysis [15,24] and even semantics [25,26].

In the present word-by-word study, we want to allow unscripted (though controlled) language production of the participant while producing a sentence interactively with a confederate. We measure the underlying electrophysiological activity while systematically manipulating whether the participants' expectations are met or not. For this purpose, we make use of pictures showing objects that have more than one naming option, i.e., synonyms, with different gender-marked articles in German. To control utterances of participants without making them read aloud, we cued each sentence with a written verb and a picture of an object, both of which had to be included in the sentence.
The confederate inserted unexpected sentence continuations (i.e., articles of unexpected gender), where the participant not only had to process the unexpected event but also needed to retrieve and produce a new response deviating from his or her preferred object name.

Behavioural studies on speech production suggest that final word selection among lexical competitors takes place rather late, around 300 ms after picture onset, meaning that multiple candidates are activated at first [27-29]. Neurophysiological studies of overt language production have found that this lexical access manifests as a positive deflection around 200 ms after event onset [30-33]. During speech comprehension, discourse (i.e., sentential context) can aid in pre-activating appropriate lexical representations, leading to behavioural costs and differing neurophysiological responses when expectations are violated [5]. Accordingly, we hypothesize that during interactive sentence production, such as in the word-by-word paradigm, a specific lexical candidate (e.g., the preferred naming of the object on the picture) will be pre-activated to produce the next turn as fast as possible [1]. Based on this pre-activation, the player will make predictions about the co-player's preceding turn. When the predictions are violated (i.e., when the confederate utters an unfitting gender-marked article with respect to the pre-activated noun), the player will need to recover by activating one of the less preferred lexical alternatives in order to produce a meaningful sentence.

The successful completion of the word-by-word task entails language comprehension, language production, and, for instances of expectancy violations, their detection along with the inhibition of a possibly pre-activated response.
Our objectives here can be summarized as: (1) testing the feasibility of measuring sensible neural correlates along with behavioural markers in an interactive setup that allows unscripted language production, and (2) measuring some of the neural processes related to successful interactive language use and repair during spontaneous sentence production. For unexpected continuations (i.e., an article whose gender marking differs from that of the preferred object noun), we predicted an N400 and a P600 ERP effect. Further, we predicted increased turn-times for the next response after encountering an unexpected article. To our knowledge, this is the first study to target interactive language use during EEG measurement with a paradigm that allows such dynamic sentence production.

[...] questionnaire on demographic and physiological information). Participants were briefed about the following tasks and that their interacting partner was a confederate.

Methods
For the measurement, participant and confederate sat next to each other, each in front of a computer screen (see Fig. 1 [...] were instructed to utter only the words of the experiment and to avoid filler utterances (e.g., 'ehm', 'eh') and other vocal noises (e.g., laughs, throat clearing). The picture-naming task was programmed in Matlab R2017b. The word-by-word experiment was programmed with the psychophysics toolbox [36,37] in Matlab R2017b.

First, a picture-naming task with two naming instances per picture was conducted. Participants were asked to name each picture with its respective definite article. Three practice trials were used to familiarize the participant with the task. The 144 pictures (cropped image on black screen) were shown once in a randomized order and then again in a different randomized order (i.e., 288 trials in [...] and was not informed about the task of the confederate.

Thereafter, the EEG cap was fitted and impedances were checked. A second picture-naming task with one naming instance per picture (i.e., 144 trials) was conducted (the same cropped images on black background). The confederate again saved the respective namings; the namings from this second run were used as target words for the expectancy manipulation in the word-by-word experiment.

Subsequently, the word-by-word experiment (see Paradigm and Fig. 1 C) was conducted. Here, the task was to construct a correct four-word sentence, taking turns for each word. For the participant, each trial started with a fixation cross (0.5 s), followed by the simultaneous presentation of the written verb in the infinitive (white letters) in the upper middle centre and the cropped object picture in the lower middle centre of the screen (4 s). During the interactive production part, the participant's screen displayed a steady fixation cross and the confederate's screen displayed the scripted words of the sentence.
The confederate uttered the first word, 'Tina' (sentence subject); the participant then uttered the second word, which had to be conjugated, 'sieht' (sees, sentence verb); the confederate then uttered the third word, 'die' (the, gender-marked article); and the participant uttered the fourth word, 'Couch' (sentence object). The trial was terminated with a button press by the confederate. A blank screen was presented for 1.5 s between trials. The article uttered by the confederate could either match the preferred naming of the participant (70% of trials, 'expected') or fit an alternative naming with a different grammatical gender (30% of trials, 'unexpected'). We emphasized the importance of producing a correct sentence with the confederate using the verb and picture shown prior to the trial, without revealing that participants would encounter unexpected sentence continuations. Three practice trials (without expectation violations) were conducted to clarify the task. Participants were instructed to keep movement minimal and to use the time of the blank screen between trials for necessary movements. Every 12 trials there was a pause and participants could decide when to continue.

After the experiment, the EEG cap was removed and participants were asked to fill in an evaluation questionnaire. Participants were, for example, asked to rate how natural the interaction seemed, [...]

Word onsets and offsets were first roughly estimated to create epochs around each single word. Next, the audio signal of these epochs was high-pass FIR filtered at 35 Hz and downsampled to 1470 Hz. The envelope was computed (filter length 300) and a low-pass Butterworth filter at 730 Hz was applied. Thereafter, the root mean square (RMS) and cepstrum (using the Voicebox toolbox [39]) were calculated, and the first and second fundamental frequencies were extracted from the cepstrum (low-pass filtered at 600 Hz).
To find the real speech onset, we applied the function 'findchangepts' (Matlab Signal Processing Toolbox) to the RMS, which gives a series of markers showing changes in the RMS audio signal. The onset marker was accepted as valid if changes were apparent in all calculated signals (RMS, envelope, first and second fundamental frequencies). For speech offset detection, the same procedure was applied to the time-reversed audio signal. Each word segment (from onset to offset) was inspected by ear and adjusted if necessary.
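The onset and offset logic described above can be illustrated with a small sketch. It is not the Matlab pipeline used in the study: it skips the filtering, envelope and cepstrum steps, replaces 'findchangepts' with a hypothetical single mean-shift search (`single_change_point`), and, as an additional assumption, restricts the search to the part of the signal before the RMS peak. All function names and the toy signal are illustrative.

```python
import math

def windowed_rms(signal, win):
    """Root mean square of consecutive, non-overlapping windows of `win` samples."""
    return [math.sqrt(sum(x * x for x in signal[i:i + win]) / win)
            for i in range(0, len(signal) - win + 1, win)]

def single_change_point(values):
    """Index k that best splits `values` into two segments with different means.

    Minimises the summed squared deviation of each segment from its own mean.
    Returns 0 if `values` is too short to be split.
    """
    best_k, best_cost = 0, float("inf")
    for k in range(1, len(values)):
        left, right = values[:k], values[k:]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        cost = (sum((v - mean_l) ** 2 for v in left)
                + sum((v - mean_r) ** 2 for v in right))
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

def onset_sample(signal, win):
    """Estimated speech onset in samples (assumption: search only up to the RMS peak)."""
    rms = windowed_rms(signal, win)
    peak = max(range(len(rms)), key=rms.__getitem__)
    return single_change_point(rms[:peak + 1]) * win

def offset_sample(signal, win):
    """Estimated speech offset: the same search on the time-reversed signal."""
    return len(signal) - onset_sample(signal[::-1], win)

# Toy epoch: 300 samples of silence, a 400-sample sine "word", 300 samples of silence.
fs = 1470  # Hz, the down-sampled rate reported in the text
silence = [0.0] * 300
word = [math.sin(2 * math.pi * 120 * n / fs) for n in range(400)]
audio = silence + word + silence

onset = onset_sample(audio, win=50)
offset = offset_sample(audio, win=50)
duration_ms = 1000 * (offset - onset) / fs
```

On real recordings the markers would still need the cross-check against the other signals and the manual inspection by ear that the text describes; a single change-point search on clean toy data says little about robustness to noise.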

Behavioural Analysis
The results of the picture-naming task, where each picture had to be named three times, were assessed. When the same name was used for a picture all three times it received an intra-individual frequency rating of 2, when the same name was used twice it received a rating of 1, and when it was [...] and expectancy, an interaction term between these two factors was included in the model. A random slope congruency for the random intercept subject was added to model possible inter-individual differences in the size of the effect between the expected and unexpected conditions. Furthermore, we included the random intercept factor length of word (i.e., the number of letters) in the model.
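The intra-individual frequency rating introduced at the start of this section can be sketched as a small function. This is one possible reading, not the authors' code: the text states the ratings for a name used three times (2) and twice (1); the rating of 0 for a name used only once is an assumption, because the source sentence is cut off. The function name and the example namings are hypothetical.

```python
def intra_individual_frequency(namings, target):
    """Rating of `target` given one participant's three namings of a picture.

    Used three times -> 2, used twice -> 1 (both stated in the text);
    used once -> 0 (assumed, the source sentence is incomplete).
    """
    if len(namings) != 3:
        raise ValueError("expected exactly three namings per picture")
    normalised = [name.strip().lower() for name in namings]
    uses = normalised.count(target.strip().lower())
    if uses == 0:
        raise ValueError("target was never used for this picture")
    return uses - 1

# Illustrative namings of one picture across the three naming instances.
rating_stable = intra_individual_frequency(["Couch", "Couch", "Couch"], "Couch")
rating_mixed = intra_individual_frequency(["Sofa", "Couch", "Couch"], "Couch")
rating_single = intra_individual_frequency(["Sofa", "Kanapee", "Couch"], "Couch")
```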

Brain-behaviour interaction analysis
To test the interaction between brain and behaviour (compare for example [48]), a GLMM was calculated in R [41] with the lme4 package [42]. Individual turn-times from offset of word 3 to onset of word 4 were the response variable, and fixed factors were the respective mean N400 EEG activity between 250 and 450 ms over the specified ROIs (see Fig. 3) and expectancy (expected vs. unexpected). An interaction term between both factors was added to the model. Similar to the GLMM calculated for the behavioural analysis, random factors included word length (intercept) and expectancy (slope) in participant (intercept).

[...] sentence. Three participants were reinstructed after the first block to ensure compliance with the task. Erroneous trials (e.g., sentences where the verb was forgotten, an ungrammatical fourth word was uttered or no fourth word was uttered) were excluded from further analysis. On average, the word duration, i.e., the time spent uttering a word, was 559 ± 12 ms for word 2 and 540 ± 30 ms for word 4 (543 ± 38 ms after expected articles and 548 ± 54 ms after unexpected articles).

Participants needed on average 351 ± 59 ms to produce word 2 (turn-time from offset of word 1 to onset of word 2) and 554 ± 193 ms to produce word 4 (turn-time from offset of word 3 to onset of word 4) over all conditions (filler, congruent, incongruent). Split by critical conditions, participants needed on average 405 ± 168 ms after an expected article and 958 ± 273 ms after an unexpected article to produce word 4 (see Fig. 2 a).

A GLMM with fixed factors expectancy, intra-individual frequency, inter-individual frequency, and random slope congruency nested in intercept participant, as well as random intercept word length, showed that intra-individual frequency of naming (i.e., intra-individual naming for the same picture) had no significant effect on turn-time (χ²(2) = 1.88, p = .391).
Therefore, it was dropped from the model for better model fit (see supplementary table 1). Results of the final GLMM (see Fig. 2 [...] showed that expectancy (expected vs. unexpected) had a significant effect on turn-times (χ²(3) = 83.88, p < .001), with increased turn-times for unexpected events (see supplementary table 1). Further, a significant interaction between expectancy and frequency of the preferred naming (i.e., inter-individual naming distribution) was present (χ²(1) = 40.98, p < .001), where higher frequencies of a naming led to decreased turn-times for expected events and increased turn-times for unexpected events. Excluding the nested random slope for congruency in participant significantly decreased model fit (χ²(2) = 62.90, p < .001), indicating considerable variation between participants in the effect of the expectancy violation (see Fig. 2 b).
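As a plausibility check on the reported statistics, the p-values can be recomputed from the χ² values and their degrees of freedom. The sketch below assumes the statistics are referred to a standard χ² distribution (as in a likelihood-ratio model comparison, which the text does not state explicitly) and uses closed forms that hold only for 1 and 2 degrees of freedom; the helper name `chi2_sf` is hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df = 1 or df = 2 only."""
    if df == 1:
        # P(Z^2 > x) for a standard normal Z
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        # chi-square with 2 df is exponential with mean 2
        return math.exp(-x / 2.0)
    raise ValueError("closed forms used here cover df = 1 and df = 2 only")

p_intra = chi2_sf(1.88, 2)          # intra-individual frequency
p_interaction = chi2_sf(40.98, 1)   # expectancy x inter-individual frequency
p_slope = chi2_sf(62.90, 2)         # random slope in participant
```

Under these assumptions the first value rounds to the reported p = .391, and the other two fall far below .001. The expectancy effect with 3 degrees of freedom is not covered by this shortcut.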

EEG results
The grand average ERP of word 3 for expected and unexpected conditions shows an N400 effect between 250 and 450 ms with a centro-posterior scalp distribution (see Fig. 3). Further, a P600 is apparent between 500 and 700 ms after word onset with a posterior topography (see Fig. 3).

Statistical analysis confirmed that expectancy had a significant effect on the N400 amplitude for word 3 over five of the seven specified ROIs (see Fig. 3 for electrode locations and supplementary table 2 for an overview of all results; e.g., posterior midline: χ²(1) = 15.48, p < .001, left posterior quadrant: χ²(1) = 35.14, p < .001, and right posterior quadrant: χ²(1) = 34.02, p < .001). The N400 amplitude was significantly more negative for the unexpected compared to the expected condition. The topographical distribution and statistical results indicate a widespread N400 effect with a posterior maximum (see Fig. 3 and supplementary table 2).

Expectancy also significantly modulated the P600 amplitude for word 3 (see supplementary table 3) [...]

In this study, two persons jointly produced a sentence, taking turns for each word. The word-by-word paradigm is inspired by a technique used in improvisational theatre, which models various aspects of natural interactions. The paradigm's structure allows for high experimental control, along with the ability to induce expectation violations during an interaction. These two pillars make the paradigm an effective tool to study neurophysiological (e.g., N400 and P600 ERP) and behavioural effects (e.g., turn-time) during interaction. In the present study, we successfully induced expectation violations by having a confederate utter a gender-marked article that did not fit the gender of the participant's preferred object name, which had to be produced in the next turn.
The behavioural findings within this paradigm are what we predicted based on previous research [4,6,49] and our own observations from improvisational theatre, as well as everyday experiences of natural interactions. When expectations were not met, the time to produce the next response increased significantly compared to when expectations were met. This finding suggests that participants pre-activated their preferred object naming in order to produce the next turn as fast as possible. However, when they encountered an unexpected (unfitting gender-marked) article, they had to discard the pre-activated lexical entry of their preferred object naming and produce the fitting word. Behaviourally, we can capture the consequences that follow from this repair of a violated expectation. This behavioural effect might still reflect numerous underlying processes, which cannot be disentangled easily.

To further our understanding of the underlying mechanisms during word-by-word interactions, neurophysiological underpinnings can help in disentangling some of the crucial aspects of successful interactions. On the neural level, we predicted two main effects, the N400 and P600, to be modulated significantly by expectancy. This was indeed the case: unexpected articles led to a more negative amplitude of the N400 and a more positive amplitude of the P600. We will discuss in the following paragraphs what their presence in this particular setup can tell us about the interplay of language comprehension and language production during verbal interaction.

The first process observed in the EEG, the N400 effect, is known to index processing of expectation violations in various domains (for an overview see [22]). Seeing the N400 effect here is consistent with the idea that the participant pre-activates a specific lexical entry and accompanying grammatical gender during the interaction.
The N400 effect is the response to encountering a gender-marked article that does not fit this pre-activated entry. Similar to grounding in conceptual pact studies [50], i.e., where interacting players agree on a specific term for a specific object, the participant named the pictures pertaining to the objects in the co-constructed sentences prior to the experiment in the presence of the interacting confederate. We deduce that participants expected the confederate to name the objects the same way they had named them and therefore to utter a fitting gender-marked article. The confederate in fact uttered fitting gender-marked articles to the participants' expected object names in the majority of the trials (70%), rendering the remaining trials unexpected. Similar N400 effects on the article level (i.e., when the article renders a noun with high cloze probability grammatically incorrect) were reported in a language comprehension task in 2005 by DeLong and colleagues (see also [5]). These N400 effects on the article level have been interpreted by the scientific community as strong indicators of prediction during language comprehension [22], since the article itself constrains the probability of following nouns without defining context in itself. However, DeLong et al.'s findings failed to replicate in a large-scale replication analysis, suggesting that (phonological forms of) words are not necessarily pre-activated during language comprehension [51]. Our word-by-word setup combines language comprehension (of the article) with instant language production (of the following noun). We show that in this context a pre-activation of an object form is indeed present, which shows up as an N400 effect on the article when violated. The N400 was even predictive of the resulting turn-time needed to utter the next word.
We conclude that pre-activation of a specific lexical entry aids in accomplishing the present word-by-word task in a rapid manner, common to the timely turn-taking structure of natural interactions [52].
It is an open question whether the pre-activated entry leads to a prepared word (i.e., in the speech production loop) or whether it relates to pre-activation that aids in speech preparation after listening to the turn of the partner. In other words, it is unclear whether speech production is planned during the turn of the partner or after the turn has finished. The later, positive-going ERP we observed in the EEG for unexpected conditions could provide information to answer this question.

The classical account of this positivity would be that of a P600 ERP, which has been linked to syntactic analysis [24] and discussed as an index of structural reanalysis, for example regarding semantics [53]. The P600 or late positive complex often follows an N400 effect (e.g., [15]). In the present study, the P600 would then reflect the parsing of the unexpected article with a transfer to new retrieval. Sassenhagen and colleagues [25], for example, found the P600 to be response-aligned to the reaction time of a button press. In this line of argument, the P600 reflects the point when the event has been fully integrated in the sense-making system, opening the transfer to the most suitable response (be it a button press or speech preparation). [...] average > 400 ms long). In the unexpected condition, lexical access might be disrupted, marked by a larger N400 amplitude, which reflects the processing of the unexpected article. Lexical retrieval is then delayed (or re-activated) at a later stage, for example around 500 to 700 ms or even later, which can overlap with the interpretation of a P600. Given the considerable differences in task demands between earlier EEG studies on speech production and the present study (e.g., picture naming requiring immediate response vs. delayed response), this interpretation is rather speculative.
We would encourage future studies to test this interpretation, for example by adding a control condition where participants listen to the unexpected article without having to produce a response thereafter.

With regard to language comprehension, too, the question remains whether the noun is pre-activated due to the required speech production in the next turn or whether the N400 effect on an article can also be found in pure language comprehension scenarios. Assessing the preferred object names of participants prior to a language comprehension task, in which they listen to sentences with their preferred and dis-preferred object names, could provide a scenario to study this question.

Future studies could moreover target the role of interindividual differences during interactions. We have seen that during the current word-by-word construction participants were affected to different degrees by the expectation violations (see Fig. 2). Such interindividual differences are also visible during joint story building in improvisational theatre. For example, one can observe differences in the response to expectation violations and their repair. Naïve players are often in situations where they cannot come up with a response, while proficient players manage smooth interactions even without knowing the partner beforehand. Grasping these differences with implicit and neurophysiological measures can pave the way to assessing the role of learning in coping with unexpected events and its possible transfer to other social situations. The interaction of brain and behaviour in the present study further shows that neural correlates can be predictive of the behavioural outcome.

Conclusion
Social interactions are complex and marked by multiple levels of processing. Here, we successfully measured neural activity related to linguistic processing during verbal interaction. To our knowledge, this is the first study to measure EEG during expectation violation where the participant is required not only to comprehend and detect the violation in freely produced speech, but also to inhibit a pre-activated response and retrieve a new response to complete an interactively produced sentence. Our EEG findings revealed two underlying processes in the handling of these expectation violations, with one significant outcome on the behavioural level, i.e., in turn-time. A link between these two measures could be established via a brain-behaviour model, i.e., the N400 effect on the article level predicted the turn-time to produce the following object noun. We conclude that there is added value in combining both measures, behavioural and neural, to understand the mechanisms of social interactions. This joint assessment was possible with the word-by-word paradigm, which combines verbal interaction with the necessary experimental control.