Social cycling and conditional responses in the Rock-Paper-Scissors game

How humans make decisions in non-cooperative strategic interactions is a fundamental question. For the Rock-Paper-Scissors (RPS) model game system, classic Nash equilibrium (NE) theory predicts that players completely randomize their action choices to avoid being exploited, while evolutionary game theory of bounded rationality generally predicts persistent cyclic motions, especially in finite populations. However, as empirical studies have been relatively sparse, it remains controversial which theoretical framework better describes the decision-making of human subjects. Here we observe population-level persistent cyclic motions in a laboratory experiment of the discrete-time iterated RPS game under the traditional random pairwise-matching protocol. This collective behavior contradicts the NE theory but is quantitatively explained, without any adjustable parameter, by a microscopic model of win-lose-tie conditional response. Theoretical calculations suggest that if all players adopt the same optimized conditional response strategy, their accumulated payoffs will be much higher than the reference value of the NE mixed strategy. Our work demonstrates the feasibility of understanding human competition behaviors from the angle of non-equilibrium statistical physics.

The Rock-Paper-Scissors (RPS) game is a fundamental non-cooperative game. It has been widely used to study competition phenomena in society and biology, such as species diversity of ecosystems [1][2][3][4][5][6] and price dispersion of markets 7,8. This game has three candidate actions R (rock), P (paper) and S (scissors). In the simplest settings the payoff matrix is characterized by a single parameter, the payoff a of the winning action (a > 1, see Fig. 1A) 9. There are non-transitive dominance relations among the actions: R wins over S, P wins over R, yet S wins over P (Fig. 1B), so no action is absolutely better than the others.
The RPS game is also a basic model system for studying decision-making of human subjects in competitive environments and the associated social dynamics and non-equilibrium physics. Assuming ideal rationality for players who repeatedly play the RPS game within a population, classical game theory predicts that individual players will completely randomize their action choices so that their behaviors will be unpredictable and cannot be exploited by the other players 10,11. This is referred to as the mixed-strategy Nash equilibrium (NE), in which every player chooses the three actions with equal probability 1/3 at each game round (see Supplementary Notes online). When the payoff parameter a < 2 this NE is evolutionarily unstable with respect to small perturbations, but it becomes evolutionarily stable for a > 2 (see Supplementary Notes online) 12. On the other hand, evolutionary game theory drops the infinite-rationality assumption and looks at the RPS game from the angle of evolution and adaptation [13][14][15][16][17][18]. Evolutionary models based on various microscopic learning rules (such as the replicator dynamics 12,[19][20][21], the best response dynamics 22,23 and the logit dynamics 24,25) generally predict cyclic evolution patterns for the action marginal distribution (mixed strategy) of each player, especially in finite populations.
Empirical verification of non-equilibrium persistent cycling in the human-subject RPS game (and other non-cooperative games) has been rather nontrivial, as the recorded evolutionary trajectories are usually highly stochastic and not long enough to draw convincing conclusions. Two of the present authors partially overcame these difficulties by using social state velocity vectors 26 and forward and backward transition vectors 27 to visualize violation of detailed balance in game evolution trajectories, but a simple way of quantitatively measuring persistent cyclic behaviors in a highly stochastic trajectory was still lacking. The cycling frequency of directional flows in the neutral RPS game (a = 2) was later quantitatively measured in 28 using a coarse-grained counting technique. Cason and co-workers 29, using another cycle rotation index as the order parameter, also obtained evidence of persistent cycling in some evolutionarily stable RPS-like games, if players were allowed to update actions asynchronously in continuous time and were informed about the social states of the whole population by sophisticated 'heat maps'.
In this work we investigate whether cycling is a general aspect of even the simplest RPS game. We adopt an improved cycle counting method on the basis of our earlier experience 28 and study directional flows in evolutionarily stable (a > 2) and unstable (a < 2) discrete-time RPS games. We show strong evidence that the RPS game is an intrinsically non-equilibrium system, which cannot be fully described by the NE concept even in the evolutionarily stable region but rather exhibits persistent population-level cyclic motions. We then bridge the collective cycling behavior and the highly stochastic decision-making of individuals through a simple conditional response (CR) mechanism. Our empirical data confirm the plausibility of this microscopic model of bounded rationality. Our theoretical calculations also demonstrate that, if all the players adopt the same CR strategy and if the transition parameters of this strategy are chosen in an optimized way, this CR strategy will outperform the NE mixed strategy in terms of the accumulated payoffs of individual players, yet the action marginal distribution of individual players is indistinguishable from that of the NE mixed strategy. Our work, as a successful attempt at understanding competition dynamics from the perspective of non-equilibrium statistical physics, may stimulate more refined future experimental and theoretical studies on the microscopic mechanisms of decision-making and learning in basic game systems 19,30-34.

Results
Experimental system. We recruited a total of 360 students from different disciplines of Zhejiang University to form 60 disjoint populations of size N = 6. Each population then carried out one experimental session by playing the RPS game for 300 rounds (taking 90-150 minutes) with a fixed value of a. In real-world situations individuals often have to make decisions based only on partial input information. We mimic such situations by adopting the traditional random pairwise-matching experimental protocol 11: at each game round (time) t the players are randomly paired within the population and compete with their pair opponent once; after that each player gets feedback information about her own payoff as well as her and her opponent's action. As the experimental session finishes, the players are paid in real cash proportional to their accumulated payoffs (see Methods). Our experimental setting differs from those of two other recent experiments, in which every player competes against the whole population 9,29 and may change actions in continuous time 29. We set a = 1.1, 2, 4, 9 and 100, respectively, in one-fifth of the populations so as to compare the dynamical behaviors in the evolutionarily unstable, neutral, stable and deeply stable regions.
Action marginal distribution of individual players. We observe that the individual players shift their actions frequently in all the populations except one with a = 1.1 (this exceptional population is discarded from further analysis, see Supplementary Notes online). Averaged among the 354 players of these 59 populations, the probabilities that a player adopts action R, P, S at one game round are, respectively, 0.36 ± 0.08, 0.33 ± 0.07 and 0.32 ± 0.06 (mean ± s.d.). We obtain very similar results for each set of populations of the same a value (see Supplementary Table S1 online). These results are consistent with NE and suggest the NE mixed strategy is a good description of a player's marginal distribution of actions. However, a player's actions at two consecutive times are not independent but correlated. As demonstrated in Fig. 2A-2E, at each time the players are more likely to repeat their last action than to shift action either counter-clockwise (i.e., R → P, P → S, S → R, see Fig. 1B) or clockwise (R → S, S → P, P → R). This inertial effect is especially strong at a = 1.1 and it diminishes as a increases.
We notice that at a ≥ 2, an individual player's probability of making a clockwise action shift is equal to or only slightly different from that of making a counter-clockwise action shift (Fig. 2A-2E). There is no, or only very weak, cycling behavior at the level of individual players in the evolutionarily neutral (a = 2) and stable (a > 2) RPS games, in accordance with the NE theory. As shown in Fig. 2F-2J, the action shift statistics of individual players can be well explained by the conditional response model introduced later.
Collective behaviors of the whole population. The social state of the population at any time t is denoted as s(t) ≡ (n_R(t), n_P(t), n_S(t)), with n_q being the number of players adopting action q ∈ {R, P, S}. Since n_R + n_P + n_S ≡ N there are (N + 1)(N + 2)/2 such social states, all lying on a three-dimensional plane bounded by an equilateral triangle (Fig. 1C). Each population leaves a trajectory on this plane as the RPS game proceeds. To detect rotational flows, we assign to every social state transition s(t) → s(t + 1) a rotation angle θ(t), which measures the angle this transition rotates with respect to the centroid c_0 of the social-state plane (see Methods). The net number of cycles around c_0 during the time interval [t_0, t_1] is computed as

C_{t_0,t_1} ≡ Σ_{t=t_0}^{t_1−1} θ(t)/(2π).

As shown in Fig. 3A-3E, C_{1,t} has an increasing trend in most of the 59 populations, indicating persistent counter-clockwise cycling. The cycling frequency of each trajectory in [t_0, t_1] is evaluated as

f_{t_0,t_1} ≡ C_{t_0,t_1}/(t_1 − t_0).

The values of f_{1,300} for all the 59 populations are listed in Table 1, from which we obtain the mean frequency to be 0.031 ± 0.006 (a = 1.1, mean ± SEM), 0.027 ± 0.008 (a = 2), 0.031 ± 0.008 (a = 4), 0.022 ± 0.008 (a = 9) and 0.018 ± 0.007 (a = 100). These mean frequencies are all positive irrespective of the particular value of a, indicating that behind the seemingly highly irregular social-state evolution process there is a deterministic pattern of social-state cycling, from slightly rich in action R, to slightly rich in P, then to slightly rich in S, and then back to slightly rich in R again. Statistical analysis confirms that f_{1,300} > 0 is significant for all the five sets of populations (Wilcoxon signed-rank test, p < 0.05). The correlation between the mean cycling frequency f_{1,300} and the payoff parameter a is not statistically significant (Spearman's correlation test: r = −0.82, p = 0.19, for n = 5 mean frequencies; and r = −0.16, p = 0.24, for n = 59 frequencies).
We also notice that the mean cycling frequency in the second half of the game (f_{151,300}) is slightly higher than that in the first half (f_{1,150}) for all the five sets of populations (Supplementary Table S2 online), suggesting that cycling does not die out with time.
A recent experimental work 35 also observed cycling behaviors in an RPS-like game with more than three actions. Evidence of persistent cycling in some complete-information and continuous-time RPS-like games was reported in another experimental study 29. However, no (or only very weak) evidence of population-level cycling was detected in 29 if action updating was performed in discrete time.
Here and in Ref. 28 we find that even discrete-time updating of actions will lead to collective cyclic motions in the RPS game, and such a population-level behavior is not affected by the particular value of a.
Empirical conditional response patterns. Under the assumption of mixed-strategy NE (i.e., each player chooses the three actions with equal probability at every game round, independently of the others and of the payoffs of previous plays), the social state transitions should obey the detailed balance condition. Therefore the observed persistent cycling behavior cannot be understood within the NE framework. Persistent cycling also cannot be explained by the independent decision model, which assumes the action choice of a player at one time is influenced only by her action at the previous time (see Supplementary Notes online). Using the empirically determined action shift probabilities of Fig. 2A-2E as inputs, we find that this independent decision model predicts the cycling frequency to be 0.0050 (for a = 1.1), −0.0005 (a = 2), −0.0024 (a = 4), −0.0075 (a = 9) and −0.0081 (a = 100), which are all very close to zero and significantly different from the empirical values.
The action choices of different players must be mutually influenced. Our empirical data shown in Fig. 3F-3J confirm the existence of such mutual influences. Let us denote by O the performance (output) of a player at a given game round, with O ∈ {W (win), T (tie), L (lose)}. Conditional on the output O, the probabilities that this player will shift action clockwise, shift action counter-clockwise, or keep the same action in the next play are denoted as O_−, O_+ and O_0 (≡ 1 − O_− − O_+), respectively. Most interestingly, we see from Fig. 3F-3J that if a player wins over her opponent in one play, her probability (W_0) of repeating the same action in the next play is considerably higher than her probabilities (W_− and W_+) of shifting actions. Furthermore, for payoff parameter a ≥ 2, if a player loses to her opponent in one play, she is more likely to shift action clockwise (probability L_−) than either to keep the old action (L_0) or to shift action counter-clockwise (L_+).
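In code, a single decision of this win-lose-tie conditional response rule can be sketched as follows. This is a minimal Python sketch; the function name and the numerical parameter values are illustrative placeholders, not the empirical probabilities reported in Fig. 3F-3J:

```python
import random

# Illustrative CR parameters (O_-, O_0, O_+) for each outcome O in {W, T, L};
# these numbers are NOT the empirical values, only placeholders.
CR_PARAMS = {
    "W": (0.10, 0.70, 0.20),  # (W_-, W_0, W_+): mostly repeat after a win
    "T": (0.30, 0.40, 0.30),  # (T_-, T_0, T_+): symmetric after a tie
    "L": (0.45, 0.25, 0.30),  # (L_-, L_0, L_+): shift clockwise after a loss
}

# Counter-clockwise order R -> P -> S -> R (a "+" shift);
# clockwise is the reverse direction (a "-" shift).
ACTIONS = ["R", "P", "S"]

def next_action(prev_action, outcome, rng=random):
    """Pick the next action given the previous action and the outcome
    (W/T/L) of the previous round, following the win-lose-tie CR rule."""
    p_minus, p_zero, _p_plus = CR_PARAMS[outcome]
    u = rng.random()
    i = ACTIONS.index(prev_action)
    if u < p_minus:                 # clockwise shift: R -> S, S -> P, P -> R
        return ACTIONS[(i - 1) % 3]
    elif u < p_minus + p_zero:      # repeat the same action
        return prev_action
    else:                           # counter-clockwise: R -> P, P -> S, S -> R
        return ACTIONS[(i + 1) % 3]
```

Passing a deterministic random source makes the three branches easy to exercise one by one.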
The conditional response model. Inspired by these empirical observations, we develop a minimal nontrivial model by assuming the following conditional response strategy: at each game round, every player reviews her previous performance O ∈ {W, T, L} and makes an action choice according to the corresponding three conditional response probabilities (O_−, O_0, O_+). We can solve this win-lose-tie CR model analytically and numerically (see Supplementary Notes online). Let us denote by n_rr, n_pp, n_ss, n_rp, n_ps and n_sr, respectively, the number of pairs in which the competition is R-R, P-P, S-S, R-P, P-S and S-R in one game round t. Given the social state s = (n_R, n_P, n_S) at time t, the conditional joint probability distribution of these six integers is

P_s[n_rr, n_pp, n_ss, n_rp, n_ps, n_sr] = [n_R! n_P! n_S! / ((N − 1)!! 2^{n_rr} n_rr! 2^{n_pp} n_pp! 2^{n_ss} n_ss! n_rp! n_ps! n_sr!)] δ_{2n_rr + n_sr + n_rp}^{n_R} δ_{2n_pp + n_rp + n_ps}^{n_P} δ_{2n_ss + n_ps + n_sr}^{n_S},

where (N − 1)!! ≡ 1 × 3 × … × (N − 3) × (N − 1) and δ_m^n is the Kronecker symbol (δ_m^n = 1 if m = n, and 0 otherwise). With the help of this expression, we can then obtain an explicit formula for the social state transition probability M_cr[s′|s] from s to any other social state s′ (see Methods). We then compute numerically the steady-state social state distribution P*_cr(s) of this Markov matrix 36 and other average quantities of interest. For example, the mean steady-state cycling frequency f_cr of this model is computed as

f_cr ≡ Σ_s Σ_{s′} P*_cr(s) M_cr[s′|s] θ_{s→s′}/(2π),

where θ_{s→s′} is the rotation angle associated with the social state transition s → s′, see Eq. (7). Using the empirically determined response parameters as inputs, the CR model predicts the mean cycling frequencies for the five sets of populations to be f_cr = 0.035 (a = 1.1), 0.026 (a = 2), 0.030 (a = 4), 0.018 (a = 9) and 0.017 (a = 100), agreeing well with the empirical measurements. Such good agreement between model and experiment is achieved also for the 59 individual populations (Fig. 3K-3O).
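Besides the exact Markov-chain computation described above, the cycling frequency of the CR model can also be estimated by direct Monte Carlo simulation. The sketch below (our own code, with illustrative rather than empirical CR parameters) simulates a population of N = 6 under random pairwise matching and accumulates the rotation angles defined in the Methods section:

```python
import math
import random

N = 6                      # population size
ACTIONS = ["R", "P", "S"]  # counter-clockwise order R -> P -> S -> R
BEATS = {"R": "S", "P": "R", "S": "P"}  # key beats value

# Illustrative CR parameters (O_-, O_0, O_+); NOT the empirical values.
CR = {"W": (0.10, 0.70, 0.20), "T": (0.30, 0.40, 0.30), "L": (0.45, 0.25, 0.30)}

def shift(action, probs, rng):
    """Apply one conditional-response move to a single player's action."""
    p_minus, p_zero, _ = probs
    u = rng.random()
    i = ACTIONS.index(action)
    if u < p_minus:
        return ACTIONS[(i - 1) % 3]   # clockwise shift
    if u < p_minus + p_zero:
        return action                 # repeat
    return ACTIONS[(i + 1) % 3]       # counter-clockwise shift

def rotation_angle(s, t):
    """Rotation angle of the social-state transition s -> t about the
    centroid of the triangular state plane (formula in Methods)."""
    nR, nP, nS = s
    mR, mP, mS = t
    d1 = 3 * (nR * nR + nP * nP + nS * nS) - N * N
    d2 = 3 * (mR * mR + mP * mP + mS * mS) - N * N
    if d1 <= 0 or d2 <= 0:            # a state coincides with the centroid
        return 0.0
    sign_expr = 3 * (nR * mP - nP * mR) + N * (nP - nR + mR - mP)
    if sign_expr == 0:                # s, t and the centroid are collinear
        return 0.0
    num = 3 * (nR * mR + nP * mP + nS * mS) - N * N
    cosv = max(-1.0, min(1.0, num / math.sqrt(d1 * d2)))
    return math.copysign(math.acos(cosv), sign_expr)

def simulate(rounds=100000, seed=0):
    """Estimate the mean cycling frequency (net cycles per round)."""
    rng = random.Random(seed)
    actions = [rng.choice(ACTIONS) for _ in range(N)]
    total_angle = 0.0
    for _ in range(rounds):
        state = (actions.count("R"), actions.count("P"), actions.count("S"))
        order = list(range(N))
        rng.shuffle(order)                    # random pairwise matching
        new = actions[:]
        for k in range(0, N, 2):
            i, j = order[k], order[k + 1]
            ai, aj = actions[i], actions[j]
            if ai == aj:
                oi = oj = "T"
            elif BEATS[ai] == aj:
                oi, oj = "W", "L"
            else:
                oi, oj = "L", "W"
            new[i] = shift(ai, CR[oi], rng)
            new[j] = shift(aj, CR[oj], rng)
        actions = new
        nstate = (actions.count("R"), actions.count("P"), actions.count("S"))
        total_angle += rotation_angle(state, nstate)
    return total_angle / (2 * math.pi) / rounds
```

With a win-stay, lose-shift-clockwise parameter set such as the one above, the estimated frequency comes out positive, i.e., counter-clockwise social cycling, in line with the mechanism described in the text.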
In addition, we find the empirically observed inertial effect of Fig. 2A-2E is quantitatively reproduced by the CR model without any fitting parameter (see Fig. 2F-2J).
Because of the rotational symmetry of the conditional response parameters, the CR model predicts that each player's action marginal distribution is uniform, identical to the NE mixed strategy (Supplementary Notes online). On the other hand, according to this model, the expected payoff g_cr per game round of each player is

g_cr = g_0 + (a − 2)(1/3 − τ_cr)/2,

where g_0 ≡ (1 + a)/3 is the expected payoff of the NE mixed strategy, and τ_cr is the average fraction of ties among the N/2 pairs at each game round, with the expression

τ_cr = Σ_s P*_cr(s) Σ_{n_rr,…,n_sr} [(n_rr + n_pp + n_ss)/(N/2)] P_s[n_rr, …, n_sr].

The value of g_cr depends on the CR parameters. By uniformly sampling 2.4 × 10^9 instances of the CR parameters from the three-dimensional probability simplex, we find that for a > 2, g_cr has a high chance of being lower than g_0 (Fig. 4), with the mean value of (g_cr − g_0) being −0.0085(a − 2). (Qualitatively the same conclusion is obtained for larger N values, e.g., see Supplementary Fig. S1 online for N = 12.) This is consistent with the mixed-strategy NE being evolutionarily stable 12. On the other hand, the four g_cr values (for the four cases of a ≠ 2) determined by the empirical CR parameters and the corresponding four mean payoffs of the empirical data sets all weakly exceed g_0, indicating that individual players are adjusting their responses to achieve higher accumulated payoffs (Supplementary Notes online). The positive gap between g_cr and g_0 may further enlarge if the individual players were given more learning time to optimize their response parameters (e.g., through increasing the number of game repeats). As shown in Fig. 4 and Supplementary Fig. S1 online, the CR parameters have to be highly optimized to achieve a large value of g_cr. For population size N = 6 we give three examples of the sampled best CR strategies for a > 2, whose payoff gain (g_cr − g_0) is about 0.033(a − 2). For large a these CR strategies outperform the NE mixed strategy in payoff by about 10%. The first of these strategies indicates that population-level cycling is not a necessary condition for achieving high payoff values. On the other hand, the third implies W_0 ≈ 1 and L_0 ≈ 0, therefore this CR strategy can be regarded as an extension of the win-stay lose-shift (also called Pavlov) strategy, which has been shown by computer simulations to facilitate cooperation in the prisoner's dilemma game [37][38][39][40]. We should also emphasize that the empirically observed CR transition parameters (Fig. 3F-3J) still differ considerably from those of this win-stay lose-shift strategy.
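The uniform sampling of CR parameters from the probability simplex mentioned above can be sketched as follows. This is our own sketch (function names are ours); it draws each conditional-response vector uniformly from the simplex via normalized exponential variates, which is equivalent to a flat Dirichlet(1, 1, 1) draw:

```python
import random

def sample_simplex3(rng=random):
    """Draw (p_minus, p_zero, p_plus) uniformly from the two-dimensional
    probability simplex, via normalized exponential variates."""
    x = [rng.expovariate(1.0) for _ in range(3)]
    s = sum(x)
    return tuple(v / s for v in x)

def sample_cr_strategy(rng=random):
    """One candidate CR strategy: independent simplex draws for the
    win, tie and lose conditional-response vectors."""
    return {outcome: sample_simplex3(rng) for outcome in ("W", "T", "L")}
```

In the text, 2.4 × 10^9 such draws were scored by their expected payoff g_cr; only the sampling step is sketched here.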

Discussion
In the game-theory literature it is common to equate individual players' action marginal distributions with their actual strategies 11,18. In reality, however, decision-making and learning are very complicated neural processes [41][42][43][44][45]. The action marginal distributions are only a consequence of such complex dynamical processes; their coarse-grained nature makes them unsuitable for describing dynamical properties 17. Our work on the finite-population RPS game clearly demonstrates this point. This game exhibits persistent cyclic motions at the population level (but not at the individual player level) which cannot be understood by the NE concept but are successfully explained by the empirical data-inspired CR mechanism. As far as the action marginal distributions of individual players are concerned, the CR strategy is indistinguishable from the NE mixed strategy, yet it is capable of bringing higher payoffs to the players if its parameters are optimized and all players adopt the same CR strategy. This simple conditional response strategy, with the win-stay lose-shift strategy being a special case, appears to be psychologically plausible for human subjects with bounded rationality 46,47. For more complicated game payoff matrices, we can generalize the conditional response model accordingly by introducing a larger set of CR parameters (see Supplementary Notes online). It should be very interesting to re-analyze many existing laboratory experimental data sets 9,29,35,48-51 using this extended model. Figure 3 also reveals that the empirical CR parameters and the social-state cycling frequency change with the payoff parameter a. In a follow-up paper we will study the effect of the payoff parameter a on the individual- and population-level behaviors in more detail 52.
The CR model, as a simple model of decision-making under uncertainty, deserves to be fully explored. For example, different players may have different CR transition parameters, and these transition parameters may change constantly with time as a result of learning. We find the cycling frequency is not sensitive to population size N at given CR parameters (see Supplementary Fig. S2 online); and the cycling frequency is nonzero even for symmetric CR parameters (i.e., W_+/W_− = T_+/T_− = L_+/L_− = 1), as long as W_0 ≠ L_0 (see Supplementary Fig. S3 online). The optimization of CR parameters is not addressed in this work. We will investigate whether an optimal CR strategy is achievable through simple stochastic learning rules 42,43,45. The effects of memory length 53 and population size on the optimal CR strategies also need to be thoroughly studied. On the more biological side, whether conditional response is a basic decision-making mechanism of the human brain or just a consequence of more fundamental neural mechanisms is a challenging question for future studies.

Methods
Experiment. The experiment was approved by the Experimental Social Science Laboratory of Zhejiang University and performed at Zhejiang University from December 2010 to March 2014. The corresponding author confirms that this experiment was performed in accordance with the approved social-experiment guidelines and regulations. A total of 360 undergraduate and graduate students of Zhejiang University volunteered to serve as the human subjects of this experiment. These students were openly recruited through a web registration system. Female students were slightly more enthusiastic than male students in registering as candidate human subjects of our experiment. Since we sampled students uniformly at random from the candidate list, more female students were recruited than male students (among the 360 students, the female to male ratio is 217:143). Informed consent was obtained from all the participating human subjects.
The 360 human subjects (referred to as players in this work) were distributed into 60 populations of equal size N = 6. The six players of each population carried out one experimental session by playing the RPS game for 300 rounds with fixed payoff parameter a, whose value was chosen from {1.1, 2, 4, 9, 100}. During the game the players sat separately in a classroom, each facing a computer screen. They were not allowed to communicate with each other during the whole experimental session. Written instructions were handed out to each player and the rules of the experiment were also orally explained by an experimental instructor. The rules of the experimental session are as follows: (i) Each player plays the RPS game repeatedly with the same other five players. (ii) Each player earns virtual points during the experimental session according to the payoff matrix shown in the written instruction. These virtual points are then exchanged into RMB as a reward to the player, plus an additional 5 RMB as a show-up fee. (iii) In each game round, the six players of each group are randomly matched by a computer program to form three pairs, and each player competes only with the pair opponent. (iv) Each player has at most 40 seconds in one game round to make a choice among the three candidate actions ''Rock'', ''Paper'' and ''Scissors''. If this time runs out, the player has to make a choice immediately (the experimental instructor will loudly urge these players to do so). After a choice has been made it cannot be changed.
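Rule (iii), the random matching of the six players into three pairs, can be sketched as follows (a hypothetical helper of ours, not the actual experimental software):

```python
import random

def random_pairs(players, rng=random):
    """Randomly partition an even-sized list of players into disjoint
    pairs, as in the random pairwise-matching protocol (rule iii)."""
    if len(players) % 2 != 0:
        raise ValueError("need an even number of players")
    order = list(players)
    rng.shuffle(order)  # uniform random matching
    return [(order[k], order[k + 1]) for k in range(0, len(order), 2)]
```

Each call returns three pairs covering all six players exactly once.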
Before the start of the actual experimental session, the players were asked to answer four questions to ensure that they completely understood the rules of the experimental session. These four questions are: (1) If you choose ''Rock'' and your opponent chooses ''Scissors'', how many virtual points will you earn? (2) If you choose ''Rock'' and your opponent also chooses ''Rock'', how many virtual points will you earn? (3) If you choose ''Scissors'' and your opponent chooses ''Rock'', how many virtual points will you earn? (4) Do you understand that at each game round you will play with a randomly chosen opponent from your group (yes/no)?
During the experimental session, the computer screen of each player will show an information window and a decision window. The window on the left of the computer screen is the information window. The upper panel of this information window shows the current game round, the time limit (40 seconds) of making a choice, and the time left to make a choice. The color of this upper panel turns to green at the start of each game round. The color will change to yellow if the player does not make a choice within 20 seconds. The color will change to red if the decision time runs out (and then the experimental instructor will loudly urge the players to make a choice immediately). The color will change to blue if a choice has been made by the player. After all the players of the group have made their decisions, the lower panel of the information window will show the player's own choice, the opponent's choice, and the player's own payoff in this game round. The player's own accumulated payoff is also shown. The players are asked to record their choices of each round on the record sheet (Rock as R, Paper as P, and Scissors as S).
The window on the right of the computer screen is the decision window. It is activated only after all the players of the group have made their choices. The upper panel of this decision window lists the current game round, while the lower panel lists the three candidate actions ''Rock'', ''Scissors'', ''Paper'' horizontally from left to right. The player can make a choice by clicking on the corresponding action names. After a choice has been made by the player, the decision window becomes inactive until the next game round starts.
The reward in RMB for each player is determined by the following formula. Suppose a player i earns x_i virtual points in the whole experimental session; the total reward y_i in RMB for this player is then given by

y_i = r x_i + 5,

where r is the exchange rate between virtual points and RMB and 5 RMB is the show-up fee. According to the mixed-strategy Nash equilibrium, the expected payoff of each player in one game round is (1 + a)/3. Therefore we set the exchange rate to r = 0.45/(1 + a) to ensure that, under the mixed-strategy NE assumption, the expected total earning in RMB for a player will be 50 RMB irrespective of the particular experimental session. The value of the payoff parameter a, the numerical value of r, and the above-mentioned reward formula were listed in the written instruction and also orally mentioned by the experimental instructor at the instruction phase of the experiment.
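The reward computation can be expressed compactly as follows (a sketch; `reward_rmb` is our name, not part of the original experimental software):

```python
def reward_rmb(points, a, show_up_fee=5.0):
    """Convert a player's accumulated virtual points into RMB:
    y_i = r * x_i + show-up fee, with exchange rate r = 0.45 / (1 + a)."""
    r = 0.45 / (1 + a)
    return r * points + show_up_fee

# Under the mixed-strategy NE a player expects (1 + a)/3 points per round,
# i.e. 100 * (1 + a) points over 300 rounds, hence 45 + 5 = 50 RMB,
# independent of a.
```

This makes the calibration explicit: the a-dependence of the expected points cancels against the a-dependence of the exchange rate.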
Rotation angle computation. Consider a transition from one social state s = (n_R, n_P, n_S) at game round t to another social state s̃ = (ñ_R, ñ_P, ñ_S) at game round (t + 1). If at least one of the two social states coincides with the centroid c_0 of the social state plane, or the three points s, s̃ and c_0 lie on a straight line, then the transition s → s̃ is not regarded as a rotation around c_0, and the rotation angle θ = 0. In all other cases, the transition s → s̃ is regarded as a rotation around c_0, and the rotation angle is computed as

θ = sgn(s → s̃) × acos( [3(n_R ñ_R + n_P ñ_P + n_S ñ_S) − N²] / [√(3(n_R² + n_P² + n_S²) − N²) √(3(ñ_R² + ñ_P² + ñ_S²) − N²)] ),

where acos(x) ∈ [0, π) is the inverse cosine function, and sgn(s → s̃) = 1 if [3(n_R ñ_P − n_P ñ_R) + N(n_P − n_R + ñ_R − ñ_P)] > 0 (counter-clockwise rotation around c_0) and sgn(s → s̃) = −1 otherwise (clockwise rotation around c_0).

Statistical analysis. Statistical analyses, including the Wilcoxon signed-rank test and Spearman's rank correlation test, were performed using Stata 12.0 (Stata, College Station, TX).
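A direct implementation of this rotation-angle definition can be sketched in Python as follows (our own code; the sign convention follows the formula above, with positive angles for counter-clockwise rotations):

```python
import math

def rotation_angle(s, t, N):
    """Rotation angle of the social-state transition s -> t with respect
    to the centroid c0 = (N/3, N/3, N/3) of the social-state plane.
    Positive = counter-clockwise, negative = clockwise, 0 = no rotation."""
    nR, nP, nS = s
    mR, mP, mS = t
    d1 = 3 * (nR * nR + nP * nP + nS * nS) - N * N
    d2 = 3 * (mR * mR + mP * mP + mS * mS) - N * N
    if d1 <= 0 or d2 <= 0:
        return 0.0                 # a state coincides with the centroid
    sign_expr = 3 * (nR * mP - nP * mR) + N * (nP - nR + mR - mP)
    if sign_expr == 0:
        return 0.0                 # s, t and c0 lie on a straight line
    num = 3 * (nR * mR + nP * mP + nS * mS) - N * N
    cos_theta = max(-1.0, min(1.0, num / math.sqrt(d1 * d2)))
    return math.copysign(math.acos(cos_theta), sign_expr)
```

For example, with N = 6 the transition from the R-rich state (4, 1, 1) to the P-rich state (1, 4, 1) is a counter-clockwise rotation by 2π/3, while the reverse transition gives the same angle with opposite sign.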
Transition matrix of the conditional response model. For the conditional response model, the transition probability M_cr[s′|s] from the social state s ≡ (n_R, n_P, n_S) at time t to the social state s′ ≡ (n′_R, n′_P, n′_S) at time (t + 1) is expressed as: