Variability in competitive decision-making speed and quality against exploiting and exploitative opponents

A presumption in previous work has been that sub-optimality in competitive performance following loss is the result of a reduction in decision-making time (i.e., post-error speeding). The main goal of this paper is to test the relationship between decision-making speed and quality, with the hypothesis that slowing down decision-making should increase the likelihood of successful performance in cases where a model of opponent domination can be implemented. Across Experiments 1–3, the speed and quality of competitive decision-making were examined in a zero-sum game as a function of the nature of the opponent (unexploitable, exploiting, exploitable). Performance was also examined against the nature of a credit (or token) system used as a within-experimental manipulation (no credit, fixed credit, variable credit). To complement reaction time variation as a function of outcome, both the fixed credit and variable credit conditions were designed to slow down decision-making, relative to a no credit condition where the game could be played in quick succession and without interruption. The data confirmed that (a) self-imposed reductions in processing time following losses (post-error speeding) were causal factors in determining poorer-quality behaviour, (b) the expression of lose-shift was less flexible than the expression of win-stay, and (c) the use of a variable credit system may enhance the perceived control participants have against exploitable opponents. Future work should seek to disentangle temporal delay and response interruption as determinants of decision-making quality against numerous styles of opponency.

degree of win-stay behavior whereas manipulating the value of losses does not change the degree of lose-shift behavior 5,35 . Once again, the main message here though is that while such reinforcement learning principles are contingent on both environment and species [38][39][40] , natural predictability in behaviour expressed via win-stay and/or lose-shift runs the risk of exploitation in competitive environments.

Experiment 1
As an initial test of competitive decision-making in Experiment 1, participants interacted with a computer opponent playing according to a mixed-strategy (MS) in the zero-sum game of Rock, Paper, Scissors (RPS; see 41 for a review). In terms of defining optimal and sub-optimal performance, the Nash 42 equilibrium for RPS against an opponent playing mixed-strategy is for the participant to also play mixed-strategy. In this respect, the no credit condition served as an attempted replication of the data from 3,5 (baseline), and 6 , where each trial consisted of a single response only. Here, performance should approximate optimal MS following wins where the single stay response and the two shift responses are played roughly 33.3% of the time. Conversely, performance after negative outcomes (both losses and draws) should be characterized by an increase in shift behaviour over the 66.6% predicted by optimal performance. Given the unexploitable nature of the opponent, performance should also be characterized by post-error speeding 5,10,43 .
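The stay and shift proportions described above can be illustrated with a minimal sketch. It assumes moves are coded as the letters "R", "P" and "S" and collapses the two possible shift responses into a single "shift" category; the function names are hypothetical and are not taken from the original analysis scripts.

```python
from collections import Counter

# Key beats value: Rock beats Scissors, Paper beats Rock, Scissors beats Paper.
BEATS = {"R": "S", "P": "R", "S": "P"}


def outcome(player, opponent):
    """Return 'win', 'lose' or 'draw' from the player's point of view."""
    if player == opponent:
        return "draw"
    return "win" if BEATS[player] == opponent else "lose"


def stay_shift_proportions(player_moves, opponent_moves):
    """Proportion of stay/shift responses at trial n + 1, split by trial n outcome."""
    counts = {o: Counter() for o in ("win", "lose", "draw")}
    for n in range(len(player_moves) - 1):
        previous = outcome(player_moves[n], opponent_moves[n])
        action = "stay" if player_moves[n + 1] == player_moves[n] else "shift"
        counts[previous][action] += 1
    proportions = {}
    for o, c in counts.items():
        total = c["stay"] + c["shift"]
        if total:  # skip outcomes that never occurred
            proportions[o] = {"stay": c["stay"] / total, "shift": c["shift"] / total}
    return proportions
```

Under mixed-strategy play, each outcome's "stay" value would be expected to sit near 1/3 and its "shift" value near 2/3; a lose-shift value well above 2/3 is the pattern the paper treats as sub-optimal.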
In Experiment 1, variations in a credit system were used to establish different temporal delay conditions (see Supplementary Materials A and B). In the no credit condition, participants simply made a single response selection on each of the 90 trials. For the fixed credit condition, participants had to 'insert one credit' on each of the 90 trials before they could make their game decision (c.f. 43,44 ). This condition should slow down the cycle of play by providing mandatory response interruptions (and hence, regular temporal delays). If slowing down decision-making time increases the quality of decision-making, then there should be a reduction in shift behaviour exhibited following negative outcomes 45 . In the variable credit condition, participants had the same 90 credits as in the fixed credit condition, but when and how many credits to insert was the participant's decision. The same constraint existed in that participants could not play the trial unless they had at least 1 credit stored on the computer. Thus, the variable credit condition should also slow down the cycle of play by providing voluntary response interruptions (and hence, intermittent temporal delays). Since multiple credits could be inserted at any point during the condition, the degree of interruption should be intermediate, somewhere between the no credit and fixed credit condition. Therefore, the reduction in shift behaviour following negative outcomes should be more than that shown in the no credit condition, but less than that shown in the fixed credit condition. Finally, if pausing serves as a way to maintain better rather than worse quality decision-making, then participants should input more credits following positive relative to negative outcomes. All manipulations of trial lag expressed via variations in the credit system were within-participants to reduce the noise traditionally associated with between-participant designs (e.g. [46][47][48]).
Method. Participants.
36 individuals were analysed in the study: 25 were female, 3 were left-handed, with mean age = 20.11 (sd = 3.27). One individual was replaced due to the recording of only 89 credits in the variable credit condition, and a second individual was replaced as a result of playing Scissors 100% and 99% of the time during the fixed and no credit conditions. Replaced participants undertook the experiment using the same counterbalanced order as the removed participants. All studies reported in this paper were approved by Research Ethics Board 2 at the University of Alberta under the protocol PRO00086116. All experiments were performed in accordance with relevant guidelines and regulations, including obtaining written informed consent. All participants completed the studies for course credit and no participant took part in multiple experiments.
Stimuli and apparatus. Pictures of two gloved hands representing the 9 interactions between participant and opponent during Rock, Paper, Scissors were used from 5 (approximate on-screen size 10.5 cm × 4 cm). Stimulus presentation and response monitoring were conducted by Presentation (version 18.3, build 07.18.16).
Design. Participants completed 270 rounds of RPS split across three counterbalanced blocks of 90 trials each. In the no credit condition, participants made one response per round involving the selection of Rock, Paper or Scissors. In the fixed credit condition, participants had to make two responses per round: a response to insert one credit and a second response that allowed them to pick their response for that trial. In the variable credit condition, participants were allocated 90 credits at the start of the block and could insert as many credits as desired, but could only play a round if their current credit score was 1 or above. For both fixed and variable conditions, if the number of inserted credits fell below 1, a warning sign appeared on screen and participants could not proceed with game responses. All opponents played 30 Rock, 30 Paper and 30 Scissors responses in a randomized order across each block (i.e., unexploitable). In this and all subsequent experiments, credit manipulation was a within-participants factor and opponency was a between-participants factor split across Experiments (1 = unexploitable, 2 = exploiting, 3 = exploitable).
Procedure. On-screen instructions from the various conditions are presented in Supplementary Information A and examples of the on-screen displays are presented in Supplementary Information B. At each trial and for each block, the participant's current score was displayed for 500 ms, with a credit counter starting at 90. In the no credit condition, participants simply had to select 4, 5, or 6 on the number pad corresponding to the selection of RPS to decrease the counter by 1. In the fixed credit condition, a current credit counter was also displayed and would be red when the current number of credits was 0. Participants were always prompted with the display of 'Insert 1 Credit' at every trial and had to press 0 on the number pad before selecting their choice of RPS. The variable credit condition was identical to the fixed credit condition, apart from the prompt of 'Insert × Credits' at every trial. Here, participants could simply select RPS if their current credit count was above 0 (after which 1 was then subtracted from their current count), or before then, could press 0 on the number pad to transfer credits to the current credit counter to a maximum of 90 credits at any point(s) during the block. Following the selection of the game response, RPS selections were displayed for opponent (on the left; blue glove) and participant (on the right; white glove) for 1000 ms. This display was removed for 500 ms and then the outcome of the trial was presented for 1000 ms in the form of 'WIN' (+ 1; green font), 'LOSS' (− 1; red font), or 'DRAW' (0; yellow font) as appropriate. The outcome was removed and the player's score was updated across a 500 ms period, after which the next trial began with the prompt appropriate for that condition. After the completion of all three conditions, participants were thanked for their time and debriefed.
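The credit bookkeeping described in the procedure can be sketched as a small state object. This is an illustrative reading of the text rather than the Presentation script used in the study: the class name, method names and the use of an exception for the on-screen warning are all assumptions.

```python
class VariableCreditBlock:
    """Hypothetical model of one 90-trial block with a participant-controlled credit counter."""

    def __init__(self, bank=90):
        self.bank = bank    # credits allocated at block start, not yet inserted
        self.current = 0    # credits inserted and available for play

    def insert_credit(self, n=1):
        """Transfer up to n credits from the bank to the current counter (the '0' key press)."""
        moved = min(n, self.bank)
        self.bank -= moved
        self.current += moved
        return moved

    def can_play(self):
        """A round may only be played with at least 1 inserted credit."""
        return self.current >= 1

    def play_round(self):
        """Spend one inserted credit on a Rock, Paper, Scissors selection."""
        if not self.can_play():
            # Stands in for the warning sign shown to participants.
            raise RuntimeError("insert a credit before playing")
        self.current -= 1
```

On this reading, the fixed credit condition is the special case in which `insert_credit(1)` is called exactly once before every `play_round()`, whereas in the variable credit condition the participant chooses when to insert and how many.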
Results. Item and outcome biases. Item and outcome biases were initially analysed using a one-way repeat-  Table 1).
The three possible items and three possible outcomes were also directly compared using arc-sine transformed data at the request of a reviewer, yielding equivalent conclusions. For Experiment 1, analyses of the arc-sine transformed proportion of Rock, Paper, Scissors responses as a function of no, variable and fixed credit conditions produced a significant main effect of item: [F(2,70) = 3.60, MSE = 0.020, p = 0.033, ηp² = 0.093], in the absence of a significant interaction: [F(4,140) = 1.73, MSE = 0.005, p = 0.147, ηp² = 0.047]. The slight over-representation of Rock (35.18%) relative to Scissors (30.61%; Tukey's HSD, p < 0.05; but not Paper, 34.22%) was consistent with previous data (see Table 1). Analyses of the similarly arc-sine transformed proportion of win, lose and draw outcomes as a function of no, variable and fixed credit conditions did not produce a significant main effect
Reinforcement learning biases. Table 2 provides summary statistics for the three strategies at trial n + 1 as a function of trial n outcome. To assess traditional reinforcement learning biases, the proportion of win-stay, lose-shift and draw-shift were analysed as a function of condition using separate one-way repeated-measures ANOVAs, and, with respect to the value expected on the basis of MS behaviour (33.3% stay responses, 66.6% shift responses; see
Table 3). The single response selection RT in the no credit condition was compared to the first (credit input) response RT in the fixed credit condition, and the response selection RT in the variable credit condition. Group average data are shown in Fig
To allay concerns regarding RT outliers, and to maintain consistency with previous protocols in our laboratory, participants were rejected as a result of their average median RT being at least twice as large as the group average median RT (c.f. 5
Credit selection.
A final set of data unique to the variable credit condition was the distribution of credits as a function of outcome (win, lose, draw; see Table 4). A one-way repeated-measures ANOVA failed to show significance: F(2,70) = 0.57, MSE = 0.08, p = 0.568, ηp² = 0.016.
Discussion. The data from Experiment 1 replicated a number of key findings related to quality and speed of contiguous decision-making in a competitive environment. Specifically, high-quality, mixed-strategy (MS) behaviour was more likely following positive outcomes. In other words, following wins, participants stayed with their original response approximately 1/3 of the time and changed to one of two new responses approximately 2/3 of the time. This was in contrast to performance following negative outcomes (specifically, losses), which was characterized by increases in shift behaviour beyond that predicted by MS 3,5 . Moreover, decision-times following negative outcomes were faster than decision-times following positive outcomes, consistent with other unexploitable competitive contexts 43 . Therefore, Experiment 1 provides support for the connection between the valence of the previous competitive encounter and the speed and quality of the next encounter: a negative outcome speeds subsequent decision-making and leads to an overuse of shift behaviour. This highlights the resilience of lose-shift behaviour despite its sub-optimality 36,37 . These data were also consistent across no, fixed and variable credit conditions. That is, the addition of an extra response per trial (approximately 400 ms) in the fixed credit condition did not change the distribution of participant responding. There was also no evidence that the voluntary decision to slow down the cycle of play via variable credit changed responding relative to the no credit condition. One reason why there was no effect of the credit systems in Experiment 1 was that the participants were in no danger of being exploited. Lack of exploitation may also have been the reason why deviations from MS were observed (although this does not help to explain why there was over-use of shift behaviour following negative outcomes but not over-use of stay behaviour following wins).
Slower, improved and/or better-managed decision-making may be observed when there are clearer threats of exploitation. This is evidenced in certain primate data: when a computerized opponent played according to MS, primates were observed to overplay certain responses, but when the opponent began to exploit response selection the primate's strategy changed to reflect an appreciation of the opponent's last play 50,51 ; see also 52 . Therefore, Experiment 2 was carried out where the opponent was now designed to take advantage of any item biases expressed by the participant (i.e., exploiting). Changing from an unexploitable (Experiment 1) to an exploiting (Experiment 2) opponent should demand more regulation and control of decision-making. By highlighting competitive threat, participants may utilize the fixed credit and/or variable credit conditions more, in order to slow down their cycle of play and increase decision-making quality. Specifically, there should be a reduction in shift behaviour as a function of negative outcomes from Experiment 1, and an increased likelihood that participants would play more in accordance with MS. Given that the task in Experiment 2 was to minimize the number of losses rather than maximize the number of wins, it was predicted that, as in Experiment 1, performance should again be characterized by post-error speeding.

Experiment 2
Experiment 2 was identical to Experiment 1, apart from the change in opponency to exploiting. Item biases are a recurring observation in the RPS literature, with Rock currently enjoying a slight over-popularity in empirical studies of the game 3,6,31,53-55 (although see 56 for Scissors bias). Experiment 2 was designed to take advantage of any idiosyncratic item bias that participants might express throughout the course of the game. The potential exploitation of the participant was made possible by the computer creating a response matrix inverse to the participant's response selections every 6 trials. For example, if on the first 6 trials the participant selected 4 Rock, 1 Paper and 1 Scissors response, the computer would play for the next 6 trials 4 Paper, 1 Scissors and 1 Rock response (in a random order). Apart from the first six trials where the opponent plays according to MS (2 responses per item), the opponent could exploit any item bias(es) exhibited by the participant on the remaining 84 trials. All other parameters and all statistical analyses were identical to Experiment 1. Two individuals were replaced due to the recording of only 89 credits in the variable credit condition. Of the final sample of 36 participants, 3 declined to provide demographic information. Of the remaining sample of 33 individuals, 26 were female and 28 were right-handed (mean age = 20.52, stdev = 2.73).
Table 2).
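The exploiting rule described above (countering the participant's previous six selections item for item) can be sketched as follows. This is a minimal illustration under the assumption that moves are coded "R", "P" and "S"; the function name and the `rng` parameter are hypothetical and do not come from the original experiment code.

```python
import random
from collections import Counter

# Value beats key: Paper beats Rock, Scissors beats Paper, Rock beats Scissors.
COUNTER_MOVE = {"R": "P", "P": "S", "S": "R"}


def exploiting_block(last_six_player_moves, rng=random):
    """Build the opponent's next six plays from the participant's last six.

    Each item the participant chose is answered with the item that beats it,
    in the same quantity, and the resulting six plays are shuffled.
    """
    counts = Counter(last_six_player_moves)
    block = []
    for item, k in counts.items():
        block.extend([COUNTER_MOVE[item]] * k)
    rng.shuffle(block)
    return block
```

For the worked example in the text, `exploiting_block(["R", "R", "R", "R", "P", "S"])` yields four Paper, one Scissors and one Rock in a random order. A participant who keeps the same item bias into the next six trials would therefore lose more often, while one who plays each item equally often gives the rule nothing to exploit.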

Reaction times.
A two-way repeated-measures ANOVA on trial n + 1 median RTs using credit type (no, variable, fixed) and outcome at trial n (win, lose, draw; see Fig. 1b and Table 3
Credit selection. A one-way repeated-measures ANOVA failed to show significance in credit distribution as a function of outcome in the variable condition (see Table 4): F(2,70) = 2.53, MSE = 0.06, p = 0.087, ηp² = 0.067.
Discussion. Experiment 2 tested the idea that the failure to extend or truncate decision-making times via the use of credit systems was due to there being no negative consequences for deviation from optimal performance (lose-shift). If an opponent exploited these deviations, then behaviour should more closely align with MS, especially when given more (fixed credit, variable credit) rather than less (no credit) time to make decisions. However, at a group level, participants fared no worse against an exploiting (Experiment 2) versus unexploitable (Experiment 1) opponent as lose rates were not significantly different (
A final possibility is that any exploiting opponent designed with a static rule of course could be reconfigured to become an exploitable opponent. The idea that there is a variety of individual experiences against non-mixed-strategy opponents will be revisited. Nevertheless, Experiment 2 replicated Experiment 1 in two important ways. First, the data continued to show the increased use of shift behaviour following negative outcomes over that predicted by mixed-strategy. The idea that lose-shift behaviour reliably manifests itself against putatively unexploitable (Experiment 1) and exploiting (Experiment 2) opponents suggests something of the immutability of this particular reinforcement learning rule, relative to the flexibility observed in the expression of win-stay behaviour. This observation is consistent with previous human data where win-stay but not lose-shift behaviour modulated as a function of outcome value 5 , electrophysiological work where feedback-related negativity (FRN) to wins but not losses modulates as a function of frequency 57 , and animal work where lose-shift is seen as a more hard-wired reflex 35,36 .
These ideas also align with the principle of loss aversion 58 (although see 59 ), loss attention whereby negative outcomes decrease inertia 13 , and evolutionary accounts where avoiding the damage following losing is more important than reaping the benefits following success 34,35 . Second, the RT data suggest that part of the reason for this sub-optimal behaviour may be the self-imposed reduction in time allocated to decisions following negative outcomes (i.e., post-error speeding). In a final attempt to explore the relationship between the quality, speed and control of competitive decision-making in Experiment 3, we exposed participants to an exploitable opponent.

Experiment 3
Previous work suggests that the development of a mental model leading to the successful exploitation of an opponent can radically change competitive performance. For example, post-error speeding becomes post-error slowing during successful exploitation, with the degree of slowing predicted by the degree of exploitation 10 . Therefore, it is possible that the sense of environmental control established during successful exploitation will also translate to an increased utility for varying decision-making times via credit systems.
In terms of the specific exploitable rule used in Experiment 3 60 , if a computer opponent plays one item more often than another (e.g., Rock) then human participants will learn to play the appropriate counter-item with increased frequency (e.g., Paper; see also secondary salience 61 ). Therefore, opponents with item biases should lead to increases in both win-stay and lose-shift participant behaviours. This is because increasing the frequency of item repetition for an opponent should similarly reinforce the repetition of a participant's item following wins and also reinforce the change of a participant's item following losses. By observing the degree of change across win-stay and lose-shift proportions, exploitable opponents serve as a final test of flexibility between these reinforcement learning heuristics. Experiment 3 was identical to Experiment 2, apart from the change in opponency to exploitable. Here, opponents in each of the three conditions (no, variable and fixed) were given an item bias of 51.11%. For example, Rock was played for 46 trials whereas both Paper and Scissors were played for 22 trials each, in a random order. The assignment of item bias (R, P, S) to condition was counterbalanced, as was the order of conditions. All other parameters and all statistical analyses were identical to Experiments 1 and 2. Two individuals were replaced due to the recording of only 89 credits in the variable credit condition. Of the final sample of 36 participants, 1 declined to provide demographic information. Of the remaining sample of 35 individuals, 23 were female and 32 were right-handed (mean age = 21.46, stdev = 5.05).
Table 1). Global win rates were also significantly higher than the expected value of 33. The increase in wins relative to losses and draws expected as a result of the opponent being exploitable was only significant in the variable credit condition (see Table 1).
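The exploitable opponent's 51.11% item bias (46 of 90 trials) can be sketched as a shuffled sequence. The function and parameter names below are hypothetical; only the 46/22/22 split and the random ordering are taken from the text.

```python
import random


def exploitable_sequence(biased_item, n_biased=46, n_other=22, rng=random):
    """Return a shuffled 90-trial opponent sequence with one over-played item."""
    sequence = []
    for item in ("R", "P", "S"):
        sequence.extend([item] * (n_biased if item == biased_item else n_other))
    rng.shuffle(sequence)
    return sequence


def item_bias(sequence, item):
    """Proportion of trials on which the opponent played the given item."""
    return sequence.count(item) / len(sequence)
```

With a Rock-biased opponent, `item_bias(exploitable_sequence("R"), "R")` is 46/90, or roughly 51.11%. A participant who always answered with Paper would win on exactly those 46 trials, which gives a rough ceiling on what simple counter-item play could achieve under this rule.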

Results. Item and outcome biases.
A small but significant item bias for Rock was revealed across Experiments 1-3, with the observed value of 34.73% different from the expected value of 33.3%: t[107] = 2.07, p = 0.040. This is consistent with previous work 3,6,31,53-55 . A binomial test was also carried out for each individual under the null hypothesis that their average proportion of Rock was 33.3% and the null could be rejected (α = 0.050) for 100 out of 108 individuals, of whom 58 showed Rock selection above the value expected by mixed strategy. The most parsimonious explanation for this effect is a primacy effect 62 , akin to the over-selection of Heads in the two-response game Matching Pennies, where participants have a tendency to select the first item. This 58% is similar in magnitude to other 'majorities' reported in decision-making work (e.g., the 55% of individuals who demonstrate more environmental sampling in loss relative to gain experimental contexts 16 Table 2).
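As an illustration of the per-individual binomial test mentioned above, the sketch below computes an exact two-sided p-value for a participant's Rock count against the mixed-strategy value of 1/3. The source does not state which software or test variant was used, so the two-sided rule here (summing all outcomes no more probable than the observed one) and the assumption of 270 trials per participant are illustrative choices only.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_test_two_sided(k, n, p=1 / 3):
    """Exact two-sided p-value: total probability of outcomes no likelier than k."""
    p_observed = binomial_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        p_i = binomial_pmf(i, n, p)
        # Small relative tolerance guards against floating-point ties.
        if p_i <= p_observed * (1 + 1e-7):
            total += p_i
    return min(1.0, total)
```

A participant who played Rock on 90 of 270 trials sits exactly at the expected value, so the p-value is essentially 1; a participant who played Rock on half of their trials would produce a very small p-value.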

Reaction times.
A two-way repeated-measures ANOVA on trial n + 1 median RTs using credit type (no, variable, fixed) and outcome at trial n (win, lose, draw; see
Credit selection. A one-way repeated-measures ANOVA on credit distribution was significant: F(2,70) = 3.51, MSE = 0.05, p = 0.035, ηp² = 0.091, showing that significantly more credits were entered following wins relative to draws (see Table 4).

Cross-experiment comparison
A number of central conclusions can be drawn by summarizing the data across Experiments 1-3. First, the data reliably show that reaction times following losses were faster (or more 'impulsive' 43 ) than reaction times following wins (see also 6 , Experiment 1). Such post-error speeding has previously been conceptualized as a self-imposed reduction in time allocated to decisions following losses, with the individual aiming to exit the failure state as quickly as possible (contra 13 ). However, this raises the concern that the less time one thinks about one's next decision, the more likely it is to be sub-optimal, giving rise to cycles of poor performance. To investigate these ideas further, RT differences between losses and wins (collapsed across credit condition) were calculated on an individual basis for the two experiments in which a model of opponent performance could be learnt (Experiments 2 and 3; n = 72), and compared with the difference between win and loss rates experienced by the same participant (following 10 ). Figure 2a depicts a significant, positive correlation between the degree of success exhibited by the participants (i.e., more wins) and the degree to which decisions following losses were slower than decisions following wins (i.e., post-error slowing; r = 0.202, p = 0.036). Thus, slowing down decision-making following losses increases the likelihood of future successful performance.
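The individual-level comparison described above can be sketched as a Pearson correlation between two difference scores. The dictionary keys and function names are hypothetical, and the sketch does not reproduce the reported statistics; it only shows how the two difference scores might be formed and related.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def slowing_success_correlation(participants):
    """Correlate post-error slowing with competitive success across participants.

    Each participant is a dict with (assumed) keys:
    rt_after_loss, rt_after_win (median RTs in ms), win_rate, loss_rate.
    """
    # Positive values mean slower decisions after losses than after wins.
    slowing = [p["rt_after_loss"] - p["rt_after_win"] for p in participants]
    # Positive values mean more wins than losses.
    success = [p["win_rate"] - p["loss_rate"] for p in participants]
    return pearson_r(slowing, success)
```

A positive coefficient from `slowing_success_correlation` corresponds to the pattern in Fig. 2a: participants who slowed down more after losses also tended to win more than they lost.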
Two further correlations were examined in an attempt to pinpoint which reinforcement learning mechanism might be more sensitive to promotion following extra decision-making time. A significant, positive correlation between post-error slowing and win-stay rates (r = 0.295, p = 0.002; Fig. 2b) was contrasted with a non-significant, negative correlation between post-error slowing and lose-shift rates (r = − 0.103, p = 0.287; Fig. 2c). Therefore, reduced impulsivity exhibited by participants following loss was also linked to the ability to re-initiate successful win-stay but not lose-shift strategies. This ease with which win-stay behaviour can be initiated, relative to the inflexibility of lose-shift behaviour, was further reinforced by also looking across studies in another way. The average proportion of win-stay behaviour (36.67%, 30.15%, 43.88%, respectively) was compared to the average proportion of lose-shift behaviour (77.23%, 80.18%, 73.86%, respectively) across Experiments 1-3 in a two-way, mixed ANOVA (behaviour as a within-participants factor, and experiment as a between-participants factor). There was no main effect of experiment: F(2,105) = 1.27, MSE = 0.02, p = 0.287, ηp² = 0.024, but there was a main effect of behaviour: F(1,105) = 293.78, MSE = 0.03, p < 0.001, ηp² = 0.730, as well as an experiment × behaviour interaction: F(2,105) = 5.89, MSE = 0.03, p = 0.004, ηp² = 0.101. The only significant difference to arise from the current data set was the difference in win-stay behaviour between the exploiting opponent (Experiment 2) and the exploitable opponent (Experiment 3; Tukey's p < 0.05). This is consistent with previous data where win-stay rates modulate to a greater degree than lose-shift rates, highlighting that the former is under cognitive control whereas the latter retains more of a reflexive quality.
This is, of course, not to suggest that lose-shift behaviour could not be attenuated using substantially longer delays between decisions (c.f. the 1, 6.5 and 12 s lags used in 45 ); it is simply to say that it is easier to stay following wins than it is to shift following loss.

General discussion
The main goal of this paper was to specifically test the relationship between decision-making speed and quality, with the hypothesis that slowing down decision-making following losses increases the likelihood of future successful performance in cases where a successful model of opponent domination can be implemented. The data confirm that self-imposed reductions in processing time following losses (post-error speeding) are causal factors in determining poorer-quality behaviour (see Fig. 2). Specifically, the data also provide evidence that win-stay (rather than lose-shift) mechanisms might be more sensitive to promotion following extra decision-making time.
Second, the data reinforce the inflexibility of lose-shift as a decision-making heuristic in competitive contexts. However, it is important to address the idea that the potency of the lose-shift heuristic may in part be due to the weakness of the opponent manipulation across Experiments 1-3. Relative to the unexploitable opponent in Experiment 1, where we expected average win rates to be around 1/3 (32.93%) as a result of the use of MS, we did not see a reduction in win rate in Experiment 2 (32.59%) when the opponent was designed to take advantage of any transitory idiosyncratic item bias that participants might have (exploiting). Moreover, while significantly different from Experiment 2, the average win rate against exploitable opponents in Experiment 3 (35.30%) was not comparable to the degree of success observed in previous studies against exploitable opponents (c.f. 10 where participants achieved 18-24% differential between their win and lose rate). It is clear that more extreme expressions of exploiting and exploitable opponents should be used in future research. However, since there can be no guarantee that participants will offer themselves up to exploitation (nor take advantage of opponents who are themselves exploitable 63 ), an alternative approach may be to design exploitable and exploiting conditions with fixed rather than variable win rates 57,64-66 . For example 67 created an 80-20% differential between win and lose trials that could be used to recreate exploitable and exploiting opponents, respectively. One critical issue however with the use of fixed outcomes is that there is no consistency either within or between participants in the behaviour that will ultimately be reinforced or punished. This may have large scale consequences for how behaviour is perceived and the degree to which participants believe success and failure is under their control.
Finally, the data provide future directions for understanding how the use of a variable 'credit' (or 'token') system may enhance the perceived control participants have against exploitable opponents. Specifically, win rates against exploitable opponents (Experiment 3) were enhanced in the variable credit condition, and, participants also inserted more credits following wins in the variable credit condition. This interaction between variable credit and exploitable opponents may reflect an increased sense of control 18,20 , as a result of the successful implementation of a mental model of the competitive environment. Relative to unexploitable or (prima facie) exploiting opponents, exploitable opponents offer a clear opportunity for strategic learning, where success is clearly indexed by an increase in win rate. Similarly, performance during the variable credit condition was also characterized by an increased opportunity for environmental control: the game slows and speeds according to the distribution of credits dictated by the participant. Importantly, the observation that more credits were inserted following win trials against exploitable opponents suggests that participants were initiating their own form of post-reinforcement pause 8 : increasing the time allocated to decision-making on the next round, thereby increasing their chances of consecutive success.
In future work, it will be important to disentangle two features of any putative credit system: temporal lag and response interruption 68 . Relative to the no credit condition, both fixed and variable credit conditions extended the time between trials (temporal lag) as a result of requiring participants to switch from their RPS task to a credit entering task (see 69 , for a review on the extensive task-switching literature). Therefore, any potential costs or benefits accrued from credit systems could be due to (a) providing individuals with more time to make better decisions, (b) disrupting cyclic or poorer-quality motor patterns associated with response selection, or (c) a combination of the two. Some of our future research will be focused around using average reaction time derived from a fixed credit block as an average delay time around which participants are only exposed to temporal lags between trials. This type of manoeuvre will help to reveal any effects of temporal delay independently of the contribution of response interruption, in the larger context of dynamic decision-making against numerous styles of opponency.