The word order of languages predicts native speakers’ working memory

The relationship between language and thought is controversial. One hypothesis is that language fosters habits of processing information that are retained even in non-linguistic domains. In left-branching (LB) languages, modifiers usually precede the head, and real-time sentence comprehension may more heavily rely on retaining initial information in working memory. Here we presented a battery of working memory and short-term memory tasks to adult native speakers of four LB and four right-branching (RB) languages from Africa, Asia and Europe. In working memory tasks, LB speakers were better than RB speakers at recalling initial stimuli, but worse at recalling final stimuli. Our results show that the practice of parsing sentences in specific directions due to the syntax and word order of our native language not only predicts the way we remember words, but also other non-linguistic stimuli.

Accordingly, languages tend to be consistently RB or LB 55 , because consistently sticking to just one parsing strategy may reduce the processing difficulties associated with a mixture of RB and LB structures 9,55,57 . This sensitivity to the branching direction of a language may be so cognitively relevant that it also affects the way in which humans remember and/or process sequences of stimuli. Therefore, speakers of languages that differ in their branching direction may differ in the way they process and/or remember not only words, but also non-linguistic stimuli. More specifically, we expected LB speakers to recall initial stimuli better than RB speakers, as real-time sentence comprehension relies more heavily on retaining initial information in LB languages.
In order to test this hypothesis, we selected four RB languages (Ndonga, Khmer, Thai, Italian) and four LB languages (Sidaama, Khoekhoe, Korean, Japanese), using the World Atlas of Language Structures (WALS 79 ).
To determine the degree of branching in each language, we used the following word order criteria: order of object-verb, genitive-noun, relative clause-noun, and clause-subordinate. All languages were consistently RB or LB according to all these criteria (except for Sidaama, for which the clause-subordinate order is not classified as either consistently RB or LB by the WALS). In comparison, English is consistently RB for three out of four of these criteria. For each language, we tested 24-30 adult native speakers of both sexes, in three widely used working memory (WM) and three widely used short-term memory (STM) tasks, containing sets of 2-9 numerical, spatial or word stimuli (see Methods). These tasks are well-established span tasks which have been implemented in the Attention & Working Memory Lab by Engle's research group and validated across a variety of studies see 80,81 . The relationship between STM and WM, two distinct but highly correlated constructs, remains ambiguous, but most cognitive psychologists would agree that while STM is a storage component for information that is no longer externally available, WM also contains an attention component aimed at maintaining memory representations in the face of concurrent processing, distraction and attention shifts e.g. [82][83][84][85] , and has an active role in language e.g. 86,87 . Indeed, several studies have demonstrated the influence of WM on sentence processing 88,89 ; see 90 , with WM tasks correlating much better with sentence comprehension than STM tasks do e.g. 91,92 ; see 87 . Therefore, we expected branching to predict individuals' ability to recall stimuli in WM but not in STM tasks.
In our study, subjects had to sequentially recall the stimuli right after each presentation. To explore whether branching predicted individuals' ability to better recall initial (primacy) or final (recency) stimuli, for each participant we coded the number of correct items recalled in the first half and in the last half of each set of stimuli (the middle stimulus was not coded in lists with an odd number of stimuli). The stimuli position (initial, final) was then included as a test predictor, together with stimuli kind (spatial, numerical, word) and branching direction (left, right), in two different models, one for STM tasks and the other for WM tasks, while controlling for repeated observations, multiple components of socio-economic status and individual demographic variables (see Methods for a detailed description). This ensured that differences in performance across linguistic groups depended on the position of the recalled stimuli, while controlling for several other factors.

Participants.
For each linguistic group, we recruited 30 native speakers (with the exception of South Korea, where only 25 participants were tested due to logistic problems). Participants were of both sexes, aged between 14 and 43. They resided either in a village/town (i.e. <100,000 inhabitants) or in a city (i.e. >100,000 inhabitants), and varied in their number of siblings (from 0 to 16). Participants differed in their education level, occupation and monthly income. Participants further varied in the second languages they spoke and in their level of proficiency. English was the most common second language spoken in all linguistic groups, with the exception of Khoekhoe speakers (who mostly spoke Afrikaans as a second language) and Sidaama speakers (who mostly spoke Amharic as a second language). For more details, see Table 1 and Supplementary Information.
All experimental procedures were approved by the ethical committee at the University of Bern, Switzerland (2016-06-00006), all experiments were performed in accordance with European guidelines and regulations, and informed consent was obtained from all participants.
Experimental protocol. Testing took place in surroundings that were familiar to the participants, such as schools, community centers and private homes. Individuals were generally tested alone, unless they felt uncomfortable and asked for other people to be present, in which case these people were seated at a certain distance behind the computer screen and instructed not to interfere in any way with the testing procedure. For each population, one research assistant collected the data together with a local research assistant who translated the procedure when needed (i.e. in Cambodia, Ethiopia, Japan, Korea and Namibia). In Italy and Thailand no local research assistant was needed, as the research assistant collecting the data was a native speaker of the language tested. Overall, a native speaker of the local language conducted recruiting, consenting and testing for all populations tested. Written consent was obtained before testing, while biographical information was obtained at the end of the tasks, by noting participants' name, sex and age, residence, number of siblings, main occupation, approximate monthly income, educational level, native language and proficiency in other languages.
Each participant was tested in 6 different memory tasks, administered one after the other on a laptop, with approximately one-minute breaks in-between. The six tasks were three short-term memory (STM) tasks with words as stimuli (WS = word span), with numbers as stimuli (DS = digit span), or with spatial stimuli (MS = matrix span); and three working memory (WM) tasks with words as stimuli (OS = operation span), with numbers as stimuli (CS = counting span), or with spatial stimuli (SS = symmetry span). For these tasks, we adapted the classic automated span tasks programmed with E-prime and implemented in the Attention & Working Memory Lab by Engle's research group 80,81 . All tasks have been validated across a variety of studies and basically test STM and WM by requiring individuals to observe a series of stimuli and recall them immediately afterwards, in the same order they were presented. Before each task started, participants were instructed about the procedure and provided with two examples containing two stimuli. Moreover, they were also reminded that stimuli had to be sequentially recalled, in the same order as they were presented. In case the procedure was not clear, it was explained again until the participant understood it. Throughout the tasks, the experimenter made no suggestions, but could motivate participants regardless of their performance by reassuring them that they were doing fine. The order of tasks was pseudo-randomized and counterbalanced across subjects, but the order of stimuli and trials within each task was the same for all participants (see Supplementary Information for more details).
STM tasks. In the STM-WS task, participants were presented with 18 test trials, each one containing 2-7 stimuli. The stimuli consisted of 600 px × 800 px pictures with images of common animals and objects (e.g. a cat, a hen, a leaf, an ant, a cloth), being visible for 2000 ms in the middle of the screen. Before the task started, individuals were instructed to observe the series of pictures on the screen, name each of them aloud as soon as it appeared, and recall them aloud in the same order they had appeared, as soon as question marks appeared on the screen. The experimenter audio-recorded all trials.
In the STM-DS task, participants were presented with 21 test trials containing 3-9 stimuli. The stimuli consisted of numbers from 1 to 9 (presented as 100 px × 150 px images with a black number on a white background), which were visible for 2000 ms in the middle of the screen. Before the task started, individuals were instructed to observe the series of numbers on the screen and then recall them in the same order they had appeared, as in the previous task. Participants provided their response on coding sheets with series of 9 squares, so that each square could contain one number.
In the STM-MS task, participants were presented with 18 test trials containing 2-7 stimuli. The stimuli consisted of 4 × 4 squared matrixes (presented as 400 px × 300 px images) with a black grid on a white background, and one of the 16 squares inside being colored red in each stimulus (the position of this red square was different depending on the stimulus). Each stimulus was visible for 2000 ms in the middle of the screen. Before the task started, individuals were instructed to observe the series of matrixes on the screen and then recall the position of each red square in the same order they had appeared, by writing them down in a coding sheet as soon as question marks appeared on the screen.
WM tasks. In the WM-OS task, participants were presented with 12 test trials containing 2-5 stimuli. The stimuli consisted of 600 px × 800 px pictures with images of common animals and objects (as in the STM-WS task), and three little squares with a variable number of red dots inside, which served as stimuli for the distracting task. Before the task started, individuals were instructed to observe the series of pictures on the screen, name each of them aloud as soon as it appeared, solve the distracting task (by subtracting the red dots in one box from the red dots in another, and saying aloud whether the result corresponded to the number of red dots in the third box), and then recall the names of the pictures aloud in the same order they had appeared, as soon as question marks appeared on the screen. In this task, each stimulus remained in the middle of the screen until it was named and the mathematical operation was solved. The experimenter audio-recorded all trials.
In the WM-CS task, participants were presented with 15 test trials containing 2-6 stimuli. The stimuli consisted of 600 px × 800 px pictures with a grey background and a varying number of blue circles, blue squares and green circles (with the number of blue circles in each image varying from 3 to 9). Before the task started, individuals were instructed to observe the series of images on the screen, count aloud the number of blue circles among other figures in each image (i.e. distracting task), repeat this number aloud and then recall aloud the series of final numbers in the same order they had appeared, as soon as question marks appeared on the screen. Each stimulus remained in the middle of the screen until the blue circles had been counted. The experimenter audio-recorded all trials.
In the WM-SS task, participants were presented with 12 test trials containing 2-5 stimuli. The stimuli consisted of 4 × 4 squared matrixes (presented as 400 px × 300 px images) with a black grid on a white background (as in the STM-MS task), and one of the 16 squares inside being colored red in each stimulus. These matrixes were alternated with 8 × 8 squared matrixes of the same size, serving as stimuli for the distracting task: some of the 64 squares were colored black, forming a pattern that could either be symmetrical or asymmetrical along the vertical axis. Before the task started, individuals were instructed to observe the series of 4 × 4 matrixes on the screen, assess aloud whether the 8 × 8 symmetry matrixes were symmetrical or not (i.e. distracting task), and then recall the position of each red square in the 4 × 4 matrixes in the same order they had appeared, by writing them down in a coding sheet as soon as the question marks appeared on the screen. All matrixes were visible for 2 seconds in the middle of the screen, but 4 × 4 matrixes were only visible after the previous symmetry judgment had been made. On a piece of paper, the experimenter further noted the participants' responses to the distracting task.
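The six task designs described above can be summarized in a small data structure. The sketch below only restates the trial counts and set sizes given in the text; the class name `SpanTask` and its field names are hypothetical, introduced here for illustration. It also checks an observation that the text does not state explicitly: each trial count equals three times the number of set sizes, which would be consistent with three trials per set size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanTask:
    """One span task as described in the text (field names are illustrative)."""
    code: str         # task abbreviation used in the text
    memory_type: str  # "STM" or "WM"
    stimuli: str      # "word", "numerical" or "spatial"
    n_trials: int     # number of test trials
    min_set: int      # smallest number of stimuli in a trial
    max_set: int      # largest number of stimuli in a trial

    @property
    def n_set_sizes(self) -> int:
        return self.max_set - self.min_set + 1


TASKS = [
    SpanTask("WS", "STM", "word", 18, 2, 7),
    SpanTask("DS", "STM", "numerical", 21, 3, 9),
    SpanTask("MS", "STM", "spatial", 18, 2, 7),
    SpanTask("OS", "WM", "word", 12, 2, 5),
    SpanTask("CS", "WM", "numerical", 15, 2, 6),
    SpanTask("SS", "WM", "spatial", 12, 2, 5),
]


def trials_by_memory_type(tasks):
    """Sum the number of test trials separately for STM and WM tasks."""
    totals = {}
    for task in tasks:
        totals[task.memory_type] = totals.get(task.memory_type, 0) + task.n_trials
    return totals


totals = trials_by_memory_type(TASKS)
print(totals)
```

Because each trial is later scored twice (once for the first half, once for the second half), the number of observations per participant in each model is twice the corresponding trial total, before any exclusions.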
Scoring. We transcribed all participants' responses from the audio recordings and coding sheets. We then compared the recalled stimuli to the stimuli as named during the stimuli presentation. For each trial, we divided the list of stimuli presented into two halves and separately coded the number of correct responses for the first half (i.e. initial stimuli) and for the second half (i.e. final stimuli). For the first half, we coded whether the first stimulus recalled corresponded to the first stimulus presented, whether the second stimulus recalled corresponded to the second stimulus presented, and so on. For the second half, we coded whether the last stimulus recalled corresponded to the last stimulus presented, whether the second to last stimulus recalled corresponded to the second to last stimulus presented, and so on. Crucially, coding the final stimuli starting from the end ensured that mistakes in recalling initial stimuli did not affect the response for the final stimuli, as a correct response required that both identity and order of stimuli were recalled correctly.
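The scoring rule can be sketched as follows. This is a minimal illustration rather than the authors' actual coding script: the function name `score_trial` is hypothetical, and the handling of responses shorter than the presented list (unmatched positions count as incorrect) is an assumption. The middle item of odd-length lists is left uncoded, as stated earlier in the text.

```python
def score_trial(presented, recalled):
    """Return (initial_correct, final_correct) for one trial.

    The first half is matched position by position from the start of the
    list, the second half position by position from the end.
    """
    half = len(presented) // 2  # middle item of odd-length lists is not coded

    # First half: i-th recalled item must equal i-th presented item.
    initial = sum(
        1
        for i in range(half)
        if i < len(recalled) and recalled[i] == presented[i]
    )

    # Second half: i-th item from the end must match in both lists.
    final = sum(
        1
        for i in range(1, half + 1)
        if i <= len(recalled) and recalled[-i] == presented[-i]
    )
    return initial, final


# Illustrative trial: the participant omits the second item ("hen").
presented = ["cat", "hen", "leaf", "ant", "cloth"]
recalled = ["cat", "leaf", "ant", "cloth"]
print(score_trial(presented, recalled))
```

In this example the omission costs one point in the first half, while both items of the second half are still scored as correct because they are aligned from the end of the list.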
Inter-observer reliability. A second observer recoded 11.6% of all the trials and inter-observer reliability was excellent (for the sum of correct initial stimuli in each trial: Cohen's κ = 0.955, N = 2592, p < 0.001; for the sum of correct final stimuli in each trial: Cohen's κ = 0.940, N = 2592, p < 0.001).
Statistical analyses. Before conducting the analyses, we excluded some participants from the sample. In particular, although all participants claimed to be native speakers of the language in which they were to be tested, based on the interactions with the participants we inferred that one Korean speaker and one Khoekhoe speaker were not native speakers of those languages and we therefore dropped them from the analyses. We further excluded from the analyses one Sidaama speaker who failed to count the blue circles aloud in the distracting task of the WM-CS task (as the distracting task was therefore not implemented, which changed the nature of the WM task). Finally, we excluded 68 trials (i.e. 0.3% of the remaining trials), due to problems with the audio-recordings, participant's failure to understand the procedure, participant's distraction or others' interference in the task.
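For readers unfamiliar with the agreement statistic reported under inter-observer reliability above, the following sketch computes an unweighted Cohen's kappa from two coders' scores. Whether the authors used the unweighted or a weighted variant is not stated, so the unweighted form is an assumption, and the example scores are invented purely for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long sequences of labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must code the same, non-empty set of trials")
    n = len(rater_a)

    # Observed proportion of trials on which the two coders agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each coder's marginal frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical per-trial sums of correct initial stimuli from two coders.
coder_1 = [2, 1, 0, 2, 1, 2, 0, 1]
coder_2 = [2, 1, 0, 2, 1, 1, 0, 1]
print(round(cohens_kappa(coder_1, coder_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the raw proportion of matching trials in this example.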
All analyses were conducted using generalized linear mixed models (GLMM) 93 and were run using R statistics (version 3.2.3) with the lme4 package 94 . We ran one model for the WM tasks, and one for the STM tasks, both with a Poisson structure. In the models, we included participants' performance for initial and final stimuli in each trial of the WM numerical, spatial and word tasks (N = 18050), and in each trial of the STM numerical, spatial and word tasks (N = 26470), respectively. All numerical variables were z-transformed, to obtain comparable and more easily interpretable coefficients 95 . To analyze the effect of test predictors (i.e. the predictors of interest) on the response, we compared each full model (including both control and test predictors) to a corresponding null model (only including control predictors). When test predictors have a significant effect on the response, the full-null model comparison is significant. To obtain the p values for the individual fixed-effects we conducted likelihood-ratio tests 96 . In order to rule out collinearity, we checked variance inflation factors (VIF) 97 and overall VIF values were generally close to one (maximum VIF = 3.26). All models were stable.
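Two of the steps described above, the z-transformation of numerical predictors and the full-null model comparison, can be illustrated in a few lines. This is a sketch under stated assumptions, not the authors' R code: the sample standard deviation (n - 1 denominator) is assumed for the z-transformation, the function names are hypothetical, and the ages and log-likelihoods are made-up values.

```python
import statistics


def z_transform(values):
    """Center values on their mean and scale by the sample SD (n - 1)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot z-transform a constant variable")
    return [(v - mean) / sd for v in values]


def lrt_statistic(loglik_full, loglik_null):
    """Likelihood-ratio statistic comparing a full model to a nested null model.

    Under the null hypothesis this statistic is approximately chi-square
    distributed, with degrees of freedom equal to the number of extra
    parameters in the full model.
    """
    return 2.0 * (loglik_full - loglik_null)


# Hypothetical participant ages within the range reported in the text.
ages = [14, 20, 25, 31, 43]
z_ages = z_transform(ages)
print([round(z, 2) for z in z_ages])

# Hypothetical log-likelihoods of a full and a null model.
print(lrt_statistic(-100.0, -105.0))
```

After the transformation every numerical predictor has mean 0 and standard deviation 1, so coefficients are expressed per standard deviation and can be compared across predictors.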
In both models, the dependent variable was the number of correct stimuli identified (initial and final). Moreover, in both models, we included three test predictors: branching direction (right or left), kind of stimuli (numerical, spatial and word), and stimuli position (initial or final), as well as their 2- and 3-way interactions. Main branching direction was based on (i) the SVO/SOV order, (ii) the presence of head nouns preceding/following (iii) genitive and (iv) relative clauses, and (v) separate adverbial subordinators at the beginning/end of subordinate clauses 79 . See Supplementary Information for more details.
As control predictors we included (i) fixed effects known to potentially affect WM and/or STM, crucially including all possible random slopes, and (ii) random effects. In this way, we could (i) assess the effect of our test predictors after controlling for the effect of other potentially confounding variables, and (ii) account for the non-independence of data points. As fixed effect variables we included: participant's sex (2 levels e.g. 98-100 ), participant's age (from 14 to 43 years old e.g. [101][102][103]), number of siblings (from 0 to 16 see 104 ), residence (village/town or city, with threshold set at 100,000 inhabitants; as living in cities may favour enhanced spatial memory), level of education (depending on the years spent at school/university e.g. 105,106 ), occupation (unemployed, working in the primary sector, in the secondary sector, in commerce or tourism, in other areas of the tertiary sector, students e.g. [107][108][109]), centered income (as the deviation of each participant's monthly income from the average national income e.g. 110 Department of State; as this could reduce the effect of the native branching), number of stimuli in each trial (from 2 to 9 see e.g. 80 ), trial number within each task (from 1 to 21), and (only in the WM tasks) the percentage of correct choices in the distracting trials. Note that the inclusion of all these fixed effects makes our results especially robust, as they assess the effect of test predictors (which are a priori defined), independently of other potential confounding factors, also defined a priori. As random effect variables, we included language, participant's identity and trial identity (given that each trial was coded twice: the first half starting from the beginning, and the second half starting from the end), to account for the non-independence of data points.

Results
The effects of branching on WM. The comparison between the full model and the null model was significant (GLMM: p < 0.001, N = 18050, $\chi^2_{11} = 55.81$). After dropping the non-significant three-way interaction from the model (branching*kind of stimuli*position of stimuli) (GLMM: p = 0.65, N = 18050, $\chi^2_{2} = 0.86$), we found two significant two-way interactions. A first interaction between kind of stimuli and stimuli position revealed that participants were better at recalling number and word stimuli in the final position, but worse at recalling spatial stimuli in the final position (GLMM: p < 0.001, N = 18050, $\chi^2_{2} = 23.45$; Fig. 1). Crucially, a second interaction between branching and stimuli position revealed that LB participants were better than RB participants at recalling initial stimuli, but worse at recalling final stimuli (GLMM: p = 0.01, N = 18050, $\chi^2_{1} = 6.48$; Fig. 2). Consistent with the aggregate data, Fig. 3 shows that in all RB languages, participants were better at recalling final as compared to initial stimuli, and in all LB languages (with the exception of Sidaama) participants were better at recalling initial as compared to final stimuli.
The effects of branching on STM. The comparison between the full model and the null model was significant (GLMM: p < 0.001, N = 26470, $\chi^2_{11} = 78.79$). After dropping the non-significant three-way interaction (branching*kind of stimuli*position of stimuli) (GLMM: p = 0.19, N = 26470, $\chi^2_{2} = 3.23$), we found a significant two-way interaction between kind of stimuli and position of stimuli. In particular, participants were overall better at recalling initial stimuli as compared to final stimuli in all tasks, and this effect was steeper in number stimuli compared to spatial and word stimuli (GLMM: p < 0.001, N = 26470, $\chi^2_{2} = 27.42$; Fig. 4). In contrast, no effect of branching was found, indicating that LB and RB speakers did not differ in their performance in STM tasks (GLMM: p = 0.78, N = 26470, $\chi^2_{1} = 0.076$).
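The relation between the chi-square statistics and the p values reported in the two Results paragraphs above can be examined with a short calculation. The sketch below evaluates the upper tail of the chi-square distribution for integer degrees of freedom using only the Python standard library; the function name `chi2_sf` is hypothetical, and this is an independent check rather than the procedure the authors ran in R.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    z = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) times a finite sum of z**j / j! for j < df / 2.
        term = math.exp(-z)
        total = term
        for j in range(1, df // 2):
            term *= z / j
            total += term
        return total
    # Odd df: erfc(sqrt(z)) plus a finite sum over half-integer gamma values.
    total = math.erfc(math.sqrt(z))
    for j in range((df - 1) // 2):
        total += math.exp(-z) * z ** (j + 0.5) / math.gamma(j + 1.5)
    return total


# (chi-square statistic, degrees of freedom) pairs taken from the Results text.
reported = {
    "WM full vs null": (55.81, 11),
    "WM branching x position": (6.48, 1),
    "WM three-way interaction": (0.86, 2),
    "STM branching": (0.076, 1),
}
for label, (stat, df) in reported.items():
    print(f"{label}: p = {chi2_sf(stat, df):.4f}")
```

For one degree of freedom the tail probability reduces to erfc(sqrt(x / 2)), and for two degrees of freedom to exp(-x / 2), which makes the smaller reported tests easy to verify by hand.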

Figure 1.
Box-plot representing the data distribution for the number of correct initial and final stimuli in WM tasks with numeric, spatial and word stimuli from a generalized linear mixed model (GLMM). The horizontal ends of the box represent the 75th and 25th percentiles, and the ends of the whiskers represent the 97.5th and 2.5th percentiles, respectively. The dotted line represents the model estimates.

Discussion
As predicted, LB and RB speakers were significantly different in their ability to recall initial and final stimuli, showing a clear link between branching direction and working memory (WM). In WM tasks, LB participants were better than RB participants at recalling initial stimuli (and RB were better at recalling final stimuli), and this pattern held for each language separately (with the exception of Sidaama). These results confirm our hypothesis and suggest that sensitivity to branching direction predicts the way in which humans remember and/or process sequences of stimuli, as real-time sentence comprehension relies more heavily on retaining initial information in LB languages but not in RB languages. Sidaama was the only language failing to follow this pattern, but there are at least two reasons why this might be the case. Firstly, all languages were consistently RB or LB according to four word order criteria (see Methods), with the exception of Sidaama, for which the clause-subordinate order follows no consistent branching direction 79 . Secondly, the Sidaama participants that we tested were the most secluded of all the populations tested, and in contrast to the other tested groups they had had little to no previous contact with technologies (including laptops and audio-recorders). This resulted in WM trials lasting significantly longer than in the other groups, with earlier stimuli becoming comparatively less accessible, and this likely explains the difference between performance by the Sidaama and the other LB participants (mean session length ± SE in minutes for Sidaama: 9.5 ± 3.2; Khoekhoe: 5 ± 1; Korean: 6.3 ± 1.0; Japanese: 6.4 ± 1.0).
Japanese-and Korean-speakers' performance was impressive for both initial and final stimuli, although initial stimuli were better recalled than final ones (Fig. 3). These results are not surprising, as Japanese-and Korean-speakers were mostly students, with a much higher familiarity with being tested on computers than most other participants. Such a higher familiarity likely resulted in overall better performance see e.g. 112 , although there is no reason to assume that it provided them with a special advantage to remember initial versus final stimuli. Moreover, in our models we explicitly controlled for participants' occupation (and for several other factors differing across participants and groups, which might have affected their performance, see Methods), suggesting that these differences cannot explain the results obtained.
As predicted, the effect of branching direction was confined to performance in WM tasks, while RB and LB speakers did not differ in their STM performance. One plausible explanation is that only WM has an active role in language and sentence processing e.g. 86,87,90,113,114 . Moreover, while WM tasks largely reflect a domain-general factor, STM tasks tend to be much more domain specific 115 . Therefore, the effect of language on non-linguistic cognition might be more limited in STM tasks. Finally, it is also possible that the effect of branching direction on performance in WM (but not STM) tasks depends on output interference (i.e. degradation of later-recalled items in the list due to the interference of initially recalled items) being stronger in STM than WM tasks 116 , and thus wiping out the branching effect in STM tasks, where initial stimuli were recalled much better than final ones (Fig. 4).
In contrast, the link between branching and performance in WM tasks held regardless of the stimuli used (i.e. word, numerical or spatial stimuli). This may be surprising, because branching direction may be expected to more likely predict performance in verbal rather than spatial WM tasks, as only the former selectively tap capacities which are essential for sentence processing. However, although spatial and verbal memory are usually considered two different WM components, it is to date unclear how easily transfer takes place between these different components. Transfer from WM training in the lab, for instance, is generally limited see 112 , but there is evidence that interventions improving verbal WM may also have benefits that transfer to spatial WM e.g. 117 . Moreover, it is interesting to note that several participants across different linguistic groups (both RB and LB) spontaneously reported, at the end of the tasks, that they had coded spatial information on the grid as numerical information: instead of visualizing and later recalling the spatial position of the red square in WM spatial tasks, they reported having assigned sequential numbers to the squares on the grid, so that the number corresponding to the red square was kept in memory and later recalled. This approach may have transformed a classic spatial task into a more verbal one, which may be more likely subject to branching effects.
Taken together, our results suggest that the link between language and thought might not be just confined to conceptual representations and semantic biases, but rather extend to syntactic structures and the very sequential processing of information. Specific characteristics of a language appear to predict not only the way we perceive and conceptualize the world see 9 , but also the way we process, store and retrieve information.
This is especially relevant, as the ability to maintain sequential information in working memory is crucial for a wide range of higher cognitive functions, including reading, problem-solving, decision-making and planning 2,80,81,85,110,112,118 . Therefore, the need to parse sentences in a specific direction, day by day, might affect our way to remember words and other stimuli also in a non-linguistic context. This is in line with previous findings, showing that extensive experience, like biologically relevant behaviors engaging higher cognitive functions (e.g. extensive learning, playing music), can drastically affect our memory and even cause long-term structural changes to our brain, well into adulthood [119][120][121] .
In future work, the inclusion of languages with mixed branching and free word order, while controlling for the frequency of non-canonical word order in each language, would likely provide valuable further insights into the exact link between branching and memory. Free word order languages, in particular, seem to provide an especially interesting test for the linguistic relativity hypotheses: sentences containing the same words in a different order, for instance, appear to be considered repetitions by speakers of free word order languages 122 . The fact that branching and word order may be linked to such a fundamental cognitive process like memory opens up new exciting avenues for psycholinguistic research towards expanding the pool of languages and populations investigated. With more than 7000 languages in the world, we have a uniquely rich pool to study the relation between language and cognition. Preserving and investigating the wealth of this diversity is not only ethical, but also scientifically crucial to ultimately address the age-old question concerning the relation between language and thought.

Data Availability
Data are available as Supplementary Material.