Multi-AI competing and winning against humans in iterated Rock-Paper-Scissors game

Predicting and modeling human behavior and finding trends within human decision-making processes is a major problem in social science. Rock-Paper-Scissors (RPS) is a fundamental strategic question in many game theory problems and real-world competitions. Finding the right approach to beat a particular human opponent is challenging. Here we use an AI (artificial intelligence) algorithm based on Markov models of one fixed memory length (abbreviated as “single AI”) to compete against humans in an iterated RPS game. We model and predict human competition behavior by combining many Markov models with different fixed memory lengths (abbreviated as “multi-AI”), and develop a multi-AI architecture with changeable parameters to adapt to different competition strategies. We introduce a parameter called “focus length” (a positive integer such as 5 or 10) to control the speed and sensitivity with which our multi-AI adapts to changes in the opponent’s strategy. The focus length is the number of previous rounds that the multi-AI looks at when determining which single AI has performed best and should be chosen to play the next round. We experimented with 52 different people, each playing 300 rounds continuously against one specific multi-AI model, and demonstrated that our strategy could win against more than 95% of human opponents.


Introduction
The Rock-Paper-Scissors (RPS) game has been widely used to study competitive phenomena in society and biology, such as ecological interactions, the maintenance of biodiversity in ecological systems 1,2,3,4,5,6,7,8,9 and price dispersion of markets 10,11 .
There are two general approaches to RPS play: Nash-equilibrium randomness and evolving circular exploitation. As a typical example, the payoff parameter a, defined as the monetary incentive of the winning action divided by the monetary incentive of the drawing action, is set to 2 to form a neutral RPS game 12 .
Previous research has found that there is a social circle in human competitive strategy when playing iterated RPS games 13 . In this article we propose a multi-AI algorithm that can exploit human strategy and win against humans in the same iterated RPS games, and we conduct experiments with human players to confirm the results.
Our work may stimulate more refined future experimental and theoretical studies on the microscopic mechanisms of decision-making and learning in basic game systems 14,15,16,17,18 .
Markov chain models 19 are the single models of which our multi-AI is composed. The assumption of a Markov chain is that the current state depends only on a finite number of previous states. Here the iterated RPS game is treated as a Markov process, and the Markov chains are built up over the course of the 300-round competition. The models are built to exploit attempted circular exploitation patterns.
The simplest discrete-time Markov chain is the first-order Markov chain, where the probability of moving to the next state depends only on the present state and not on the previous states. When the number of rounds played so far is less than m, every m-th order single Markov chain model selects Rock, Paper or Scissors randomly, with 1/3 probability each.
We conducted experiments on combinations of several selected single models. In every round we compute the outputs of all the selected single models and choose the one with the highest score over the last 5 or 10 rounds as the dominant AI, whose output is played in the next round.
Here for simplicity we use AI-m to denote our single Markov chain model of order m.
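To make the idea of a single model concrete, the following is a minimal Python sketch of an AI-m predictor. It is an illustration under stated assumptions, not the implementation used in the experiments: the class name `SingleAI`, the methods `update` and `play`, the count-based transition table, and the random tie-breaking among equally likely predictions are all hypothetical choices. The model counts which opponent move followed each length-m context, predicts the most frequent follower, and plays the move that beats it.

```python
import random
from collections import defaultdict

MOVES = "RPS"
BEATS = {"R": "P", "P": "S", "S": "R"}  # BEATS[x] is the move that beats x


class SingleAI:
    """Order-m Markov predictor of the opponent's next move (a sketch of AI-m)."""

    def __init__(self, m, rng=None):
        self.m = m
        self.rng = rng or random.Random()
        # counts[context][move]: how often `move` followed the m-move `context`
        self.counts = defaultdict(lambda: {mv: 0 for mv in MOVES})

    def update(self, history):
        """Record the newest transition; `history` is the opponent's moves so far."""
        if len(history) > self.m:
            context = history[-self.m - 1:-1]
            self.counts[context][history[-1]] += 1

    def play(self, history):
        """Return the move that beats the predicted next opponent move."""
        if len(history) < self.m:
            # Not enough history to fill the memory: play uniformly at random.
            return self.rng.choice(MOVES)
        row = self.counts[history[-self.m:]]
        best = max(row.values())
        # Assumption: ties between equally likely predictions are broken at random.
        candidates = [mv for mv in MOVES if row[mv] == best]
        return BEATS[self.rng.choice(candidates)]


# Usage: an opponent who strictly alternates R and P.
ai = SingleAI(1, random.Random(0))
history = ""
for move in "RPRPRP":
    history += move
    ai.update(history)
print(ai.play(history))  # P (after "P" the opponent has always played "R")
```

Keeping raw counts rather than normalized probabilities is a design choice of this sketch; the most frequent follower is the same either way.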
Through the experiments we found that different models work best against different human opponents' competition strategies and that the prediction results vary greatly, so we build the 1st- to 5th-order Markov chain models (i.e. AI-1 to AI-5), with different memory lengths, to exploit different human competition strategies. To make a multi-AI model that can differentiate and adapt to different human opponents, we combine these single models and control the adaptation speed and sensitivity, forming a multi-AI that can adapt to different human strategies and win against most of its opponents. Table 1 illustrates how this multi-AI model competes against a specific player, using focus length F = 5 as an example.

Table 1. An example of our multi-AI algorithm competing against a player when F = 5.
For the first round, the multi-AI will use the result from AI-1 and it rolls Rock, Paper or Scissors randomly with 1/3 probability each.
The focus length parameter F controls the speed and sensitivity with which our multi-AI model adapts to changes in the opponent's strategy. The multi-AI model looks at the most recent F rounds of history to decide which single model is currently performing best and should produce the next output. For the first 4 rounds, when fewer rounds have been played than the focus length F (set to 5 as before), the multi-AI simply considers all rounds played so far to determine the dominant single AI model for the next round. Table 1 shows, as an example, how our multi-AI algorithm competes against a specific player when F = 5. After seeing the player play "PS" in the past 2 rounds, the transition matrix of AI-2 in the last round leads AI-2 to play Paper, Scissors and Rock with probability 1/3 each in the next round. Table 2 shows an example of the selection between AI-1 and AI-4 when the focus length is F = 5. Although AI-1 has the higher total score globally (counted from the first round, AI-1 has a total score of 2, but its score over the most recent 5 rounds is 0), AI-4 has the higher score locally (3 in this case) over the most recent 5 rounds. Thus, for the next round, the multi-AI will pick AI-4's result as its output.
The transition matrix for AI-2 after 20 rounds of competition is shown in Table 3.
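The selection rule described above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function names `round_score` and `pick_dominant`, the scoring of +1 for a win, 0 for a tie and -1 for a loss, and the tie-breaking in favor of the first-listed model are assumptions, and the per-round scores below are hypothetical values chosen only to reproduce the situation described for AI-1 and AI-4.

```python
BEATS = {"R": "P", "P": "S", "S": "R"}  # BEATS[x] is the move that beats x


def round_score(ai_move, human_move):
    """Assumed scoring: +1 if the AI move wins, 0 for a tie, -1 for a loss."""
    if ai_move == human_move:
        return 0
    return 1 if BEATS[human_move] == ai_move else -1


def pick_dominant(score_history, focus_length):
    """Return the model with the highest score over the last `focus_length` rounds.

    `score_history` maps a model name to its list of per-round scores.
    If fewer than `focus_length` rounds exist, the slice covers all rounds.
    Ties go to the first-listed model (an assumption of this sketch).
    """
    recent = {name: sum(scores[-focus_length:])
              for name, scores in score_history.items()}
    return max(recent, key=recent.get)


# Hypothetical scores: AI-1 totals 2 overall but 0 over the last 5 rounds,
# while AI-4 totals 1 overall but 3 over the last 5 rounds.
scores = {
    "AI-1": [1, 1, 0, 1, -1, 0, 0],
    "AI-4": [-1, -1, 1, 1, 0, 1, 0],
}
print(pick_dominant(scores, focus_length=5))  # AI-4
```

With a focus length covering the whole history the same data would select AI-1, which is the trade-off F controls: a small F reacts quickly to a change in the opponent's strategy, while a large F is less sensitive to short streaks.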

Results
All experiments were conducted with a monetary incentive. We ran the experiment with 52 human subjects recruited at Zhejiang University and used a multi-AI model composed of 5 or 10 single Markov chain models of different memory lengths. Figure 1 shows 4 typical results of our multi-AI strategy, with a combination of the 1st- to 5th-order Markov chain models (here the focus length F is also set to 5, but it can be any other integer), competing against 4 typical players for 300 rounds. 42 players played against the multi-AI with focus length F = 5 and AI-1 to AI-5 (see blue bars in Fig. 2), and 11 players played against the multi-AI with focus length F = 10 and AI-1 to AI-10 (see orange bars in Fig. 2). From the overall results in Table 5, we see that our multi-AI algorithm with F = 10 gives similar scores to that with F = 5 but has a lower standard deviation.
For simplicity, we let multi-5AI denote our multi-AI model with a combination of the 1st- to 5th-order Markov chain models (here the focus length F is also set to 5, but it can be any other integer), and multi-10AI denote our multi-AI model with a combination of the 1st- to 10th-order Markov chain models (here the focus length F is also set to 10). [...] is consistent with some previous human vs human studies. We looked at the performance of the individual models within a multi-AI 300-game set.
The AI that performs best against a particular individual varies greatly, but overall AI-2 to AI-6 perform better than AI-1 or the higher-order Markov chains with longer memory lengths.
This general trend is consistent with human short-term memory holding around 7 items 21 .
Table 5: Game results (total scores) of our multi-10AI competing with human in 300 rounds.
Fig. 3 and Fig. 4 show the performance of AIs with different memory lengths against specific human players. It is hard to build one single model that can exploit every human's behavior, and thus we decided to combine the single models so that the multi-AI can differentiate and adapt to more human competition strategies and win against most of its opponents.

Discussion and conclusions
In this paper, we have introduced a multi-AI model that wins against humans in iterated RPS games, and we have confirmed our results experimentally. We found that a Markov model of a single memory length could beat most, but not all, human players.
In the single-model experiments, we found that the model with the best performance varies greatly across people, which indicates that different people have different patterns.
Although different humans have different patterns, and the patterns as a whole may be very hard to observe and exploit, human competition behavior does have patterns, and these patterns are exploitable with suitable simple models (single models that successfully predict a given human's behavior). We have captured and exploited different human behaviors by building and combining single Markov chain models of different memory lengths; during the competition the multi-AI learns and switches to the best prediction model according to its focus length. We have introduced one possible architecture for human-AI RPS competition, and this model could be further improved by, e.g., optimizing the voting weights of the single Markov chain models, or using the first part of the competition data to pre-train the multi-AI model and then switching to only two or three dominant single models after the pre-training process. The focus length is a hyperparameter and can be further optimized through more human experiments. After rearranging the single models and adjusting the focus length, our model can potentially be improved further.
The competition behavior patterns and their successful exploitation may lead our future work toward better modeling of, prediction of, and adaptation to the competition behavior patterns of specific humans.

Experiment
Our experimental methods mostly follow the RPS social experiments conducted from December 2010 to March 2014 by Zhijian Wang et al. 13 . The experiment was approved by the Experimental Social Science Laboratory of Zhejiang University and performed at Zhejiang University from July 2019 to September 2019.
The first author confirms that this experiment was performed in accordance with the approved social experiments guidelines and regulations. A total of 52 undergraduate and graduate students of Zhejiang University volunteered to serve as the human subjects of this experiment. These students were openly recruited at the university library through onsite volunteer recruitment. More female students than male students registered, owing to the humanities subjects involved in our experiment (there are more female students in the related humanities subjects at our university). Since we sampled students uniformly at random from the candidate list, more female students were recruited than male students. Informed consent was obtained from all the participating human subjects.
The 52 human subjects (referred to as players in this work) each carried out one experimental session by playing the RPS game for 300 rounds with the payoff parameter fixed at a = 2.
During the game the players were seated separately in a classroom, each facing a computer screen. They were not allowed to communicate with each other during the whole experimental session. Written instructions were handed out to each player, and the rules of the experiment were also explained orally by an experimental instructor. The rules of the experimental session are as follows:
i. Each player plays the RPS game repeatedly with our computer program.
ii. Each player earns virtual points during the experimental session according to the payoff matrix shown in the written instructions. These virtual points are then exchanged into RMB as a reward to the player, plus an additional 5 RMB as a show-up fee.
iii. After a choice has been made it cannot be changed.
Before the start of the actual experimental session, the players were asked four questions to ensure that they completely understood the rules of the experimental session.

During the experimental session, the computer screen in front of each player shows an information window and a decision window. The window on the left of the computer screen is the information window. The upper panel of this information window shows the current game round, the time limit (40 seconds) for making a choice, and the time left to make a choice. The color of this upper panel turns green at the start of each game round. It changes to yellow if the player does not make a choice within 20 seconds, and to red if the decision time runs out (the experimental instructor then loudly urges the players to make a choice immediately). It changes to blue once the player has made a choice. After all the players of the group have made their decisions, the lower panel of the information window shows the player's own choice, the opponent's choice, and the player's own payoff in this game round. The player's own accumulated payoff is also shown. The players are asked to record their choice in each round on the record sheet (Rock as R, Paper as P, and Scissors as S).
The window on the right of the computer screen is the decision window. It is activated only after all the players of the group have made their choices. The upper panel of this decision window lists the current game round, while the lower panel lists the three candidate actions "Rock", "Scissors", "Paper" horizontally from left to right. The player makes a choice by clicking on the corresponding action name. After the player has made a choice, the decision window becomes inactive until the next game round starts.
The reward in RMB for each player is determined by the following formula. Suppose a player i earns $x_i$ virtual points in the whole experimental session; the total reward $y_i$ in RMB for this player is then given by

$$y_i = r\,x_i + 5,$$

where r is the exchange rate between virtual points and RMB and 5 RMB is the show-up fee. According to the mixed-strategy Nash equilibrium, the expected payoff of each player in one game round is (1 + a)/3.
Therefore we set the exchange rate to be r = 0.45/(1 + a) to ensure that, under the mixed-strategy NE assumption, the expected total earning in RMB for a player will be 50 RMB irrespective of the particular experimental session.The value of the payoff parameter a, the numerical value of r, and the above-mentioned reward formula were listed in the written instruction and also orally mentioned by the experimental instructor at the instruction phase of the experiment.
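The arithmetic behind the 50 RMB figure can be checked with a short Python sketch. It is an illustration under assumptions, not part of the experimental software: the function names are hypothetical, and the reward is assumed to be the exchanged virtual points plus the 5 RMB show-up fee mentioned in the rules. Over 300 rounds the expected number of virtual points under the mixed-strategy NE is 300(1 + a)/3 = 100(1 + a), so the exchanged amount is 0.45/(1 + a) × 100(1 + a) = 45 RMB for any a.

```python
def exchange_rate(a):
    """Exchange rate r between virtual points and RMB, r = 0.45 / (1 + a)."""
    return 0.45 / (1 + a)


def expected_points_ne(a, rounds=300):
    """Expected virtual points under the mixed-strategy NE: (1 + a) / 3 per round."""
    return rounds * (1 + a) / 3


def reward_rmb(points, a, show_up_fee=5.0):
    """Assumed reward: exchanged virtual points plus the show-up fee, in RMB."""
    return points * exchange_rate(a) + show_up_fee


# With a = 2 the expected total comes to about 50 RMB (up to floating-point
# rounding), and the same holds for other values of a.
for a in (2, 1.1, 4):
    print(a, round(reward_rmb(expected_points_ne(a), a), 6))
```

Because the factor (1 + a) cancels, the expected earning does not depend on the payoff parameter, which matches the statement that it is the same irrespective of the experimental session.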
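A first-order Markov chain satisfies

$$P(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n),$$

where $X_1, X_2, X_3, \ldots$ is a sequence of random variables, here taking the values Rock, Paper and Scissors. What you will play in the next round depends only on what you played this round, like a short-memory pattern sequence. Markov chains can be generalized to cases of short-term dependency by taking into account recent past states in the chain. The m-th order Markov chain 20 considers the current state to depend on the m previous states, where m is finite, and is a process satisfying

$$P(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = P(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_{n-m} = x_{n-m}) \quad \text{for } n > m.$$

Here the m-th order Markov chain is like a model with memory length m, which 'remembers' the previous m states.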

Figure 2: Total scores for multi-AI competing against different players in 300 rounds.

Figure 3: Game results (total scores) of multi-10AI competing with humans in 300 rounds.

In the specific case of Table 1, the dominant AI for all of the first 4 rounds is AI-1. In round 5, AI-2 has the best cumulative score and thus is the dominant AI.