
# Nash Equilibrium of Social-Learning Agents in a Restless Multiarmed Bandit Game

• Scientific Reports 7, Article number: 1937 (2017)
• doi:10.1038/s41598-017-01750-z

## Abstract

We study a simple model of social-learning agents in a restless multiarmed bandit (rMAB). The bandit has one good arm that changes to a bad one with a certain probability. Each agent stochastically selects one of two methods, random search (individual learning) or copying information from other agents (social learning), with which he/she seeks the good arm. The fitness of an agent is the probability that he/she knows the good arm in the steady state of the agent system. In this model, we explicitly construct the unique Nash equilibrium state and show that the corresponding strategy for each agent is an evolutionarily stable strategy (ESS) in the sense of Thomas. The fitness of an agent with the ESS is superior to that of an asocial learner when the success probability of social learning exceeds a threshold determined by the success probability of individual learning, the probability of change of state of the rMAB, and the number of agents. The ESS Nash equilibrium is thus a solution to Rogers’ paradox.

## Introduction

One of the differences between human beings and other animals is that the former transfer their predecessors’ experience and wisdom in the form of knowledge1. Social learning—learning from the experience of others—is advantageous compared to individual learning2,3,4. Without social learning, everybody would have to learn everything for themselves2. In other words, individual learning costs more than social learning does2,3,4. Therefore, Rogers’ finding that social learning is not necessarily more advantageous than individual learning is counterintuitive5. This is now called Rogers’ paradox.

Rogers’ conclusion seems very strange in light of our experience4. Several attempts have been made to solve Rogers’ paradox in social learning. Boyd and Richerson2 pointed out that Rogers’ paradox is not a paradox when the only benefit of social learning is to avoid learning costs. Further, on analysing two models where social learning reduces individual-learning costs and improves the information obtained through the latter, they concluded that social learning can be adaptive. Enquist et al.3 advocated a learning form called critical social learning, which is social learning supplemented by individual learning. They discussed using rate equations and succeeded in solving the paradox. Rendell et al.4 studied the relative merits of several learning strategies by using a spatially explicit stochastic model.

The concept of adaptive information filtering3, 6 has been proposed as key to the effective working of social learning. It indicates that each member effectively learns good-quality information provided by other members. For example, in a famous tournament by Rendell et al.6, the strategy discountmachine, which performed social learning most effectively, won over the other strategies that combined individual learning and social learning.

In this study, we propose a stochastic model to solve Rogers’ paradox in the framework of a restless multiarmed bandit (rMAB) used in that tournament. The objective of this study is to analyse equilibrium social learning in an rMAB. An rMAB is analogous to the “one-armed bandit” slot machine but with multiple “arms”, each with a distinct payoff. We call an arm with a high payoff a good arm. The term “restless” means that the payoffs change randomly. Agents maximise their payoffs by exploiting an arm, searching for a good arm at random (individual learning), or copying an arm exploited by other agents (social learning). Because of the rMAB’s simple structure and generality, we believe that it is an appropriate framework for considering Rogers’ paradox.

As a model for social-learning collectives, Bolton and Harris studied an agent system in a multi-armed bandit7. They assumed that each agent could access all information about the other agents and obtained a socially optimal experimentation (learning) strategy. In the present study, we consider the bounded rationality of agents, who can access the results of their own choices only. In addition, we assume that the environment (i.e., the rMAB) changes randomly. We obtain the socially optimal and equilibrium learning strategies.

## Model

We make the model as simple as possible and incorporate the property of adaptive filtering of information into it. A mathematical overview of the model is given in the Methods section.

The rMAB has only one good arm and infinitely many bad arms. There are $N$ agents, labeled $n = 1, \cdots, N$. In each turn, an agent (say, agent $n$) is chosen at random. He/she exploits his/her arm and obtains payoff 1 if he/she knows a good arm. If he/she does not know a good arm, he/she searches for it at random (individual learning) with probability $1 - r_n$, or copies the information of other agents’ good arms (social learning) with probability $r_n$. In the random search, the probability that he/she successfully finds a good arm is denoted by $q_I$. On the other hand, we assume that the copy process succeeds with probability $q_O$ (ref. 8) if there is at least one agent who knows a good arm, and fails if no agent knows a good arm. Then, with probability $q_C/N$, the good arm changes to a bad one and another good arm appears. If a good arm changes to a bad one, the agents who knew the arm are forced to forget it and to know a bad one. See Fig. 1. The difference from our previous model8 is that the previous model has $M$ good arms, whereas the present model has only one.
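The dynamics just described can be reproduced with a short Monte Carlo sketch. This is only an illustration under our own choice of parameter values; the function name `simulate` and its defaults are ours, not from the paper.

```python
import random

def simulate(N=5, q_C=0.1, q_I=0.2, q_O=0.8, r=0.5, turns=200_000, seed=0):
    """Monte Carlo sketch of the rMAB model: sigma[n] = 1 iff agent n knows the good arm."""
    rng = random.Random(seed)
    sigma = [0] * N
    payoff = [0] * N
    for _ in range(turns):
        n = rng.randrange(N)                 # a randomly chosen agent acts
        if sigma[n] == 1:
            payoff[n] += 1                   # exploit the good arm, payoff 1
        elif rng.random() < r:               # social learning: copy succeeds with q_O
            if any(sigma) and rng.random() < q_O:  # ...only if somebody knows the good arm
                sigma[n] = 1
        else:                                # individual learning: random search
            if rng.random() < q_I:
                sigma[n] = 1
        if rng.random() < q_C / N:           # the good arm turns bad; everyone forgets it
            sigma = [0] * N
    # each agent acts about turns/N times; payoff per own turn estimates the fitness
    return [p / (turns / N) for p in payoff]
```

The long-run payoff per own turn estimates the steady-state probability of knowing the good arm, i.e. the fitness defined in equation (2).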

Let σ n be a random variable defined by

$\sigma_n = \begin{cases} 1, & \text{if agent } n \text{ knows a good arm}, \\ 0, & \text{if agent } n \text{ does not know a good arm}. \end{cases}$
(1)

This is simply the payoff for agent $n$. For each turn $t$, we have a joint probability function $P(\sigma_1, \cdots, \sigma_N | t)$, which evolves in $t$ according to the aforementioned rule. To exclude trivial results, we assume that $q_C$, $q_I$, and $q_O$ are positive and that the $r_n$s are less than 1 (ref. 8). Then, in the long run, we have the unique steady probability function $P(\sigma_1, \cdots, \sigma_N) = \lim_{t \to \infty} P(\sigma_1, \cdots, \sigma_N | t)$. Now, we shall introduce the expected payoff for each agent in the steady state,

$w_n = E[\sigma_n] = \sum_{\sigma_1 = 0,1} \cdots \sum_{\sigma_N = 0,1} P(\sigma_1, \cdots, \sigma_N)\,\sigma_n, \qquad n = 1, \cdots, N.$
(2)

This quantity depends on the parameters $N$, $q_C$, $q_I$, $q_O$, and the $r_k$s. We regard $w_n$ mainly as a function of the $r_k$s and denote this function by $w(r_n, \bar r_n)$, where

$w(r, \bar r) = \frac{1}{a + q_I + (q_O - q_I) r} \left\{ q_I + (q_O - q_I) r - \frac{a q_O r}{a + (N-1)\, q_I (1 - \bar r) + q_I (1 - r)} \right\},$
(3)
$a = \frac{q_C}{1 - q_C / N},$
(4)
$\bar r_n = \frac{1}{N-1} \sum_{k \ne n} r_k = \frac{1}{N-1} \left\{ \sum_{k=1}^{N} r_k - r_n \right\}.$
(5)

Thus, we have $w_n = w(r_n, \bar r_n)$ for each $n = 1, \cdots, N$.
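Equation (3) is straightforward to evaluate numerically. The following sketch (helper names `a_param` and `fitness` and the parameter values are ours) computes $w(r, \bar r)$ and checks the consistency of the special case $w(0, 0) = q_I/(a + q_I)$, the fitness of a population of pure individual learners.

```python
def a_param(q_C, N):
    """Equation (4): a = q_C / (1 - q_C/N)."""
    return q_C / (1 - q_C / N)

def fitness(r, rbar, N, q_C, q_I, q_O):
    """Equation (3): steady-state fitness of an agent using social-learning
    probability r while the other N-1 agents have mean probability rbar."""
    a = a_param(q_C, N)
    A = a + q_I + (q_O - q_I) * r
    B = a + (N - 1) * q_I * (1 - rbar) + q_I * (1 - r)
    return (q_I + (q_O - q_I) * r - a * q_O * r / B) / A
```

For example, with all agents learning individually ($r = \bar r = 0$) the bracket reduces to $q_I$ and the fitness to $q_I/(a + q_I)$, as expected.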

In this study, we treat $w_n$ as the fitness of agent $n$.

## Results and Discussion

### Pure Strategies and Rogers’ Paradox

In the present study, the strategy of agent $n$ refers to his/her social-learning probability, $r_n$. We call $r_n = 0$ and $r_n = 1$ pure strategies and $0 < r_n < 1$ mixed strategies.

First, we confirm that Rogers’ paradox occurs when agents adopt pure strategies. We divide the $N$ agents into two groups. The first group consists of $N_I$ individual learners ($r_k = 0$, $k = 1, \cdots, N_I$). The second group consists of $N_S = N - N_I$ social learners ($r_k = 1$, $k = N_I + 1, \cdots, N_I + N_S$). The corresponding fitness values per agent, which we denote by $w_I$ and $w_S$ respectively, are given by

$w_I = \frac{q_I}{a + q_I}, \qquad w_S = \frac{N_I\, q_I\, q_O}{(a + N_I q_I)(a + q_O)},$
(6)

where a is defined in equation (4).

When $q_O \le q_I$, we have $w_I > w_S$. Therefore, in this case, individual learning is always favourable over social learning.

Now, we consider the case $q_O > q_I$. Figure 2 is the plot of $w_I$ and $w_S$ for sufficiently large $N$.

When the proportion of social learners is small, social learning is effective. However, as the proportion of social learners increases, $w_S$ monotonically decreases and tends to zero. Thus, Rogers’ paradox occurs.

It is important to note that $w_I < w_S$ holds for sufficiently large $N$ when $N_I/N$ remains finite. This is because, as $N \to \infty$, we have $w_I \to q_I/(q_C + q_I)$ and $w_S \to q_O/(q_C + q_O)$, and $q_O > q_I$.
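The paradox is easy to reproduce numerically from equation (6). The sketch below (the helper name `pure_fitness` and the parameter values are ours) shows that with all agents social ($N_I = 0$) one gets $w_S = 0 < w_I$, whereas a mixed population with large $N$ and $q_O > q_I$ gives $w_S > w_I$.

```python
def pure_fitness(N_I, N, q_C, q_I, q_O):
    """Equation (6): fitness of individual learners (w_I) and social learners (w_S)
    when N_I agents always search individually and N - N_I always copy."""
    a = q_C / (1 - q_C / N)
    w_I = q_I / (a + q_I)
    w_S = N_I * q_I * q_O / ((a + N_I * q_I) * (a + q_O))
    return w_I, w_S
```

As $N_I$ decreases, $w_S$ decreases monotonically to zero: social learners prosper only as long as enough individual learners keep finding the good arm for them to copy.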

### Nash Equilibrium and Rogers’ Paradox

Let us assume that each agent adopts a mixed strategy, that is, for each $n = 1, \cdots, N$, the social-learning probability $r_n$ is an arbitrary number between 0 and 1. This means that agent $n$ performs social learning with probability $r_n$ and individual learning with probability $1 - r_n$; the learning mode is chosen stochastically in each turn.

We consider the $N$-tuple, $(r_1, \cdots, r_N)$, of the social-learning probabilities. This is a point in the $N$-dimensional unit cube $J = [0, 1] \times \cdots \times [0, 1]$, which we regard as the space of $N$-tuples of mixed strategies. Each point in $J$ determines a joint probability function $P(\sigma_1, \cdots, \sigma_N)$ and hence an $N$-tuple, $(w_1, \cdots, w_N)$, of the fitness values of the agents.

Now, imagine that agent $n$ maximises $w_n$ by adjusting $r_n$ for fixed $r_k$s ($k \ne n$). It is not difficult to show that the maximum point is unique (Fig. 3) and is expressed as

$r_n = f(\bar r_n),$
(7)

where

$f(r) = \begin{cases} 0, & q_O \le q_I, \\ \min\!\left(1, \max\!\left(0, \bar f(r)\right)\right), & q_O > q_I, \end{cases}$
(8)
$\bar f(r) = \frac{-\zeta + \sqrt{q_O (N-1)(1-r)\,\zeta}}{q_O - q_I},$
(9)
$\zeta \equiv a + q_I + q_I (N-1)(1-r).$
(10)

We note that $f(r) \to 0$ as $q_O \to q_I + 0$. Next, we introduce the function,

$F(r_1, \cdots, r_N) = (f(\bar r_1), \cdots, f(\bar r_N)).$
(11)

This is a continuous function mapping from the N-dimensional unit cube J into itself. As shown in the Methods section, the fixed point of F is unique and is on the diagonal line of J,

$r_1 = \cdots = r_N = r_{\mathrm{Nash}},$
(12)

where $r_{\mathrm{Nash}}$ is a function of $q_C$, $q_I$, $q_O$, and $N$. The value of $r_{\mathrm{Nash}}$ is explicitly given by

$r_{\mathrm{Nash}} = \begin{cases} 1 - \eta, & (q_O - q_I) N > a + q_O, \\ 0, & (q_O - q_I) N \le a + q_O, \end{cases}$
(13)

where

$\eta \equiv \frac{2 (a + q_O)^2}{(q_O - q_I N)(a + q_O) + (a N + q_O)(q_O - q_I) + \sqrt{D_1}},$
(14)
$D_1 = (N-1)\, q_O \left\{ -(4 a N + 3 q_O N + q_O)\, q_I^2 + 2 (3 a q_O N + 2 q_O^2 N - 2 a^2 - 3 a q_O)\, q_I + a q_O (a N + 3 a + 4 q_O) \right\}.$
(15)

The quantity $r_{\mathrm{Nash}}$ has the following properties (see the Methods section): (i) $0 \le r_{\mathrm{Nash}} < 1$; (ii) $r_{\mathrm{Nash}} \to 0$ as $(q_O - q_I)N - (a + q_O) \to 0$; (iii) the fixed point $(r_{\mathrm{Nash}}, \cdots, r_{\mathrm{Nash}})$ is the unique Nash equilibrium point in $J$. Figure 4 is a schematic explanation of the Nash equilibrium point.
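Since $h(r) = r - f(r)$ is strictly increasing (see the Methods section), $r_{\mathrm{Nash}}$ can also be computed by bisection rather than through the closed form (13)-(15). A minimal sketch, with our own function names and illustrative parameter values:

```python
from math import sqrt

def f_bar(r, N, q_C, q_I, q_O):
    """Best response of equation (9): optimal social-learning probability when
    the other agents' mean probability is r (assumes q_O > q_I)."""
    a = q_C / (1 - q_C / N)
    zeta = a + q_I + q_I * (N - 1) * (1 - r)
    return (-zeta + sqrt(q_O * (N - 1) * (1 - r) * zeta)) / (q_O - q_I)

def r_nash(N, q_C, q_I, q_O, tol=1e-12):
    """Unique symmetric Nash equilibrium: the zero of h(r) = r - f_bar(r),
    found by bisection since h is strictly increasing on [0, 1]."""
    a = q_C / (1 - q_C / N)
    if (q_O - q_I) * N <= a + q_O:
        return 0.0                      # equation (13): the equilibrium is asocial
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - f_bar(mid, N, q_C, q_I, q_O) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

The returned value satisfies $r = \bar f(r)$ to high precision, and it vanishes exactly when $(q_O - q_I)N \le a + q_O$, in agreement with property (ii).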

Moreover, the corresponding mixed strategy is an evolutionarily stable strategy (ESS)9 because the fixed point is a Nash equilibrium point in the strong sense,

$w(r_{\mathrm{Nash}}, r_{\mathrm{Nash}}) > w(r, r_{\mathrm{Nash}}), \quad \text{for all } r \ne r_{\mathrm{Nash}}.$
(16)

Further, it is an ESS in the sense of Thomas10, because the inequality,

$w(r_{\mathrm{Nash}}, r) > w(r, r), \quad \text{for all } r \ne r_{\mathrm{Nash}},$
(17)

is true.

Now, we consider the two fitness values, $w_I$ and $w_N = w(r_{\mathrm{Nash}}, r_{\mathrm{Nash}})$. As shown in the Methods section, the inequality $w_N > w_I$ holds if and only if $(q_O - q_I)N > a + q_O$. See also Fig. 5. The Nash equilibrium point is usually regarded as stable in the sense that no agent has an incentive to change his/her strategy. Therefore, this inequality shows that the mixed strategy $r_n = r_{\mathrm{Nash}}$ ($n = 1, \cdots, N$) can outperform the pure strategy of individual learning. This solves Rogers’ paradox. We note that the Nash equilibrium point is realised as a mixed strategy of social learning and individual learning.

### Pareto Optimality

Pareto optimality is an important concept alongside Nash equilibrium. Thus, we consider Pareto optimality in our model. We adopt a natural definition of the Pareto-optimal point in $J$ as the maximum point of the total fitness, $\sum_{k=1}^{N} w_k$. We can show that the maximum point is unique and lies on the diagonal line of $J$,

$r_1 = \cdots = r_N = r_{\mathrm{Pareto}},$
(18)

where $r_{\mathrm{Pareto}}$ is a function of $q_C$, $q_I$, $q_O$, and $N$. The value of $r_{\mathrm{Pareto}}$ is explicitly given by

$r_{\mathrm{Pareto}} = \begin{cases} \dfrac{(a + q_I N)\, X - (a + q_I)\, Y}{q_I N\, X + (q_O - q_I)\, Y}, & (q_O - q_I) N > a + q_O, \\ 0, & (q_O - q_I) N \le a + q_O, \end{cases}$
(19)

where

$X = \sqrt{(N-1)(a + q_O)(q_O - q_I)},$
(20)
$Y = \sqrt{(a + N q_I)\, N q_O}.$
(21)

Further, $r_{\mathrm{Pareto}}$ has the following properties: (i) $0 \le r_{\mathrm{Pareto}} < 1$; (ii) $r_{\mathrm{Pareto}} \to 0$ as $(q_O - q_I)N - (a + q_O) \to 0$; (iii) $r_{\mathrm{Pareto}} < r_{\mathrm{Nash}}$ if and only if $(q_O - q_I)N > a + q_O$ (see the Methods section); and (iv) the point $(r_{\mathrm{Pareto}}, \cdots, r_{\mathrm{Pareto}})$ is the Pareto-optimal point in $J$. Here, by Pareto optimality we mean that the following statement is true: if an agent succeeds in increasing his/her fitness by changing his/her social-learning probability from $r_{\mathrm{Pareto}}$ to $r_{\mathrm{Pareto}} + \delta r$ with $\delta r \ne 0$, then another agent’s fitness certainly decreases. Such a $\delta r$ exists when $r_{\mathrm{Pareto}} > 0$, and no such $\delta r$ exists when $r_{\mathrm{Pareto}} = 0$; the statement is correct in both cases.

We define the Pareto fitness $w_P = w(r_{\mathrm{Pareto}}, r_{\mathrm{Pareto}})$. Then, the inequality $w_P > w_N$ holds if and only if $(q_O - q_I)N > a + q_O$ (see Fig. 5 and the Methods section); this follows directly from the definition of the Pareto-optimal point. Thus, we have established the relations among the fitness values,

$\begin{cases} w_P > w_N > w_I, & (q_O - q_I) N > a + q_O, \\ w_P = w_N = w_I, & (q_O - q_I) N \le a + q_O. \end{cases}$
(22)
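The closed form (19)-(21) can be checked against a brute-force maximisation of the symmetric fitness $w(r, r)$. A sketch, with our own function names and illustrative parameter values:

```python
from math import sqrt

def w(r, rbar, N, q_C, q_I, q_O):
    """Equation (3): steady-state fitness for own probability r, others' mean rbar."""
    a = q_C / (1 - q_C / N)
    A = a + q_I + (q_O - q_I) * r
    B = a + (N - 1) * q_I * (1 - rbar) + q_I * (1 - r)
    return (q_I + (q_O - q_I) * r - a * q_O * r / B) / A

def r_pareto(N, q_C, q_I, q_O):
    """Closed form (19)-(21) for the symmetric Pareto-optimal probability."""
    a = q_C / (1 - q_C / N)
    if (q_O - q_I) * N <= a + q_O:
        return 0.0
    X = sqrt((N - 1) * (a + q_O) * (q_O - q_I))   # equation (20)
    Y = sqrt((a + N * q_I) * N * q_O)             # equation (21)
    return ((a + q_I * N) * X - (a + q_I) * Y) / (q_I * N * X + (q_O - q_I) * Y)
```

On a grid of symmetric profiles $r_1 = \cdots = r_N = r$, the value returned by `r_pareto` indeed maximises $w(r, r)$.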

## Concluding Remarks

We have proposed a stochastic model of $N$ agents and an rMAB. The unique Nash equilibrium point in the mixed-strategy space $J$ has been constructed explicitly and shown to be an ESS in the sense of Thomas10. The corresponding fitness $w_N$ per agent is greater than the fitness $w_I$ of an individual learner when $(q_O - q_I)N > a + q_O$. This solves Rogers’ paradox.

In this study, we concentrated on steady states. This is valid if the system relaxes quickly to the steady state (see the Methods section). However, if the $r_n$s change faster than the relaxation to the steady state, non-trivial dynamics arise. It may be possible that our system has dynamics for which the Nash equilibrium point is stable.

As a future research subject, we propose an experimental study of human collectives in an rMAB. There have been several attempts in this direction11,12,13, whose target has been the improvement of performance by social learning, that is, the collective intelligence effect. Since we have shown that there is an ESS Nash equilibrium in the social-learning agent system in an rMAB, it is interesting to examine experimentally whether this prediction is realised. As a first step, an interactive rMAB game in which one human competes with many mixed-strategy agents using $r = r_{\mathrm{Nash}}$ might be a suitable environment; one can then check whether the social-learning rate of people coincides with $r_{\mathrm{Nash}}$. Second, when many people compete, the Nash equilibrium should emerge as the model parameter $q_I$ changes, and we might be able to detect some phase-transition behaviour8.

As for theoretical research, our analysis is far from mature. In the present work, we have studied the rMAB game in the steady state of the system. However, when the relaxation time discussed in the Methods section is not small enough, the assumption of steadiness is unrealistic in a laboratory experiment. Thus, we need to develop a $t$-dependent theory. This might be a difficult problem, but we believe that this research direction is fruitful.

## Methods

### Mathematical Overview of the Model

For simplicity we use the following notation,

$\vec\sigma = (\sigma_1, \cdots, \sigma_N), \quad \vec 0 = (0, \cdots, 0), \quad \vec e_n = (0, \cdots, 0, \overset{n}{1}, 0, \cdots, 0), \quad \delta_{\vec\sigma', \vec\sigma} = \prod_{n=1}^{N} \delta_{\sigma'_n, \sigma_n}.$
(23)

Our model evolves in $t$ according to an agent’s action and the subsequent state change of the rMAB. This is a Markov process14. The probability of the change $\vec\sigma \to \vec\sigma'$ is described by the transition probability matrix14,

$T(\vec\sigma'\,|\,\vec\sigma) = \left(1 - \frac{q_C}{N}\right) \left\{ \left(1 - \sum_{n=1}^{N} p_n(\vec\sigma)\right) \delta_{\vec\sigma', \vec\sigma} + \sum_{n=1}^{N} p_n(\vec\sigma)\, \delta_{\vec\sigma', \vec\sigma + \vec e_n} \right\} + \frac{q_C}{N}\, \delta_{\vec\sigma', \vec 0},$
(24)

where

$p_n(\vec\sigma) = \frac{\delta_{\sigma_n, 0}}{N} \left\{ r_n (1 - \delta_{N_1, 0})\, q_O + (1 - r_n)\, q_I \right\}, \qquad N_1 = \sum_{n=1}^{N} \sigma_n.$
(25)

The joint probability function $P ( σ ⃗ | t ) =P( σ 1 ,…, σ N |t)$ satisfies the Chapman-Kolmogorov equation14,

$P(\vec\sigma\,|\,t+1) = \sum_{\vec\sigma'} T(\vec\sigma\,|\,\vec\sigma')\, P(\vec\sigma'\,|\,t).$
(26)

Our assumption is that $q_C, q_I, q_O > 0$ and $r_n < 1$ ($n = 1, \cdots, N$). In this case, the matrix $T$ is shown to be irreducible and primitive15. Then, the Perron-Frobenius theory15 ensures that (i) $\lambda_1 = 1$ is an eigenvalue of $T$ of multiplicity 1 and the steady probability function $P(\vec\sigma)$ is a corresponding eigenvector, and (ii) the set $\{|\lambda_i|\}_{i \ge 2}$ of absolute values of the other eigenvalues of $T$ has an upper bound $\rho < 1$. When the $r_n$s are fixed, the Markov process is time-homogeneous14, that is, the matrix $T$ does not depend on $t$. Therefore, for any initial probability function $P(\vec\sigma\,|\,0)$, we have the unique limit $P(\vec\sigma) = \lim_{t \to \infty} P(\vec\sigma\,|\,t)$. Then, it is not difficult to derive equation (3) using $P(\vec\sigma)$.

The convergence $P(\vec\sigma\,|\,t) \to P(\vec\sigma)$ is exponential, $|P(\vec\sigma\,|\,t) - P(\vec\sigma)| \sim \rho^t$. This means that the relaxation time is $\tau = (-\log \rho)^{-1}$. Thus, when no agent changes his/her social-learning probability over a period much longer than $\tau$, the fitness per agent per turn is almost exactly equal to the value of the function $w$ in equation (3).
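For small $N$, the transition probabilities (24)-(25) can be iterated explicitly over all $2^N$ states, and the resulting steady-state mean $E[\sigma_n]$ can be compared with equation (3). A sketch (the function name and parameter values are ours):

```python
from itertools import product

def steady_state(N, q_C, q_I, q_O, r, iters=2000):
    """Power-iterate the Chapman-Kolmogorov equation (26) with the transition
    probabilities (24)-(25); returns E[sigma_n] in the steady state."""
    states = list(product((0, 1), repeat=N))
    P = {s: 1.0 / len(states) for s in states}       # arbitrary initial distribution
    for _ in range(iters):
        Q = {s: 0.0 for s in states}
        for s, ps in P.items():
            N1 = sum(s)
            # p_n of equation (25): only agents with sigma_n = 0 can learn
            p = [(1 if s[n] == 0 else 0) / N *
                 (r[n] * (0 if N1 == 0 else 1) * q_O + (1 - r[n]) * q_I)
                 for n in range(N)]
            Q[s] += (1 - q_C / N) * (1 - sum(p)) * ps      # nobody learns
            for n in range(N):
                if p[n] > 0:                               # agent n learns the good arm
                    s2 = s[:n] + (1,) + s[n + 1:]
                    Q[s2] += (1 - q_C / N) * p[n] * ps
            Q[(0,) * N] += (q_C / N) * ps                  # environment change: reset
        P = Q
    return [sum(ps * s[n] for s, ps in P.items()) for n in range(N)]
```

Because of the reset to $\vec 0$ with probability $q_C/N$ per step, the chain satisfies a Doeblin condition and the iteration converges quickly; the output matches equation (3) to numerical precision.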

### Existence of a Fixed Point of F

Since the N-dimensional cube J = [0, 1] × … × [0, 1] is a compact, convex set and F is a continuous function mapping from J into itself, Brouwer’s fixed-point theorem16 guarantees that there exists a fixed point of F in J.

### A Fixed Point of F is a Nash Equilibrium Point, and Vice Versa

Let $(r_1, \cdots, r_N)$ be a fixed point of $F$, that is, $r_n = f(\bar r_n)$ for each $n = 1, \cdots, N$. Since $r = f(\bar r_n)$ is the unique maximal point of $w(r, \bar r_n)$, we have $w(r_n + \delta r, \bar r_n) < w(r_n, \bar r_n)$ for each $n = 1, \cdots, N$ when $\delta r \ne 0$. Thus, $(r_1, \cdots, r_N)$ is a Nash equilibrium point. Conversely, let $(r_1, \cdots, r_N)$ be a Nash equilibrium point, that is, let $r = r_n$ be a maximal point of $w(r, \bar r_n)$ for each $n = 1, \cdots, N$. Since $r = f(\bar r_n)$ is the unique maximal point of $w(r, \bar r_n)$ (see Fig. 3), we have $r_n = f(\bar r_n)$. Thus, $(r_1, \cdots, r_N)$ is a fixed point of $F$.

### Uniqueness of the Fixed Point of F

When $q_O \le q_I$, we have the unique fixed point $(0, \cdots, 0)$.

Next, we consider the case $q_O > q_I$.

Let $(r_1, \cdots, r_N)$ be a fixed point of $F$. Since $\bar r_n = (s - r_n)/(N - 1)$ with $s = \sum_{k=1}^{N} r_k$, all the $r_n$s satisfy the common relation,

$r = g(r) \equiv f\!\left(\frac{s - r}{N - 1}\right).$
(27)

Figure 6(b) is a plot of the function g(r).

This is a strictly increasing concave function for $s - (N-1)\, r^{*} \le r \le s - (N-1)\, r_{*}$, where

$r^{*} = 1 - \frac{a + q_I}{(N-1)(q_O - q_I)},$
(28)
$r_{*} = 1 - \frac{-\left(q_O (a - q_I) - 2 a q_I\right) + \sqrt{D_2}}{2 (N-1)\, q_I (q_O - q_I)},$
(29)
$D_2 = \left(q_O (a - q_I) - 2 a q_I\right)^2 + 4 q_I (q_O - q_I)(a + q_O)^2.$
(30)

It is not difficult to show that $r_{*} < r^{*} < 1$. The maximum value of the derivative $g'(r)$ is $1/2$, which is attained at $r = s - (N-1)\, r^{*}$. Thus, $\tilde g(r) = r - g(r)$ is a strictly increasing function such that $\tilde g(0) \le 0 \le \tilde g(1)$. Therefore, the function $\tilde g(r)$ has exactly one zero, $r_0$, in the interval $0 \le r \le 1$. Then, we conclude that $r_1 = \cdots = r_N = r_0$.

Now we have $s = N r_0$. Therefore, $r_0$ is a solution of the equation,

$r = f(r).$
(31)

Figure 6(a) is a plot of the function $f(r)$. The function $f(r)$ is decreasing, so $h(r) = r - f(r)$ is strictly increasing and satisfies $h(0) \le 0 \le h(1)$. Therefore, the function $h(r)$ has exactly one zero, $r_{\mathrm{Nash}}$, with $0 \le r_{\mathrm{Nash}} < 1$. Thus, we have $r_0 = r_{\mathrm{Nash}}$. This proves the uniqueness of the Nash equilibrium point.

### Inequality $w_P > w_N > w_I$

It is sufficient to consider the case $(q_O - q_I)N > a + q_O$. Then, $r_{\mathrm{Nash}}$ satisfies $r = \bar f(r)$. We introduce the following function,

$k(u) = (a + q_O)^2 u^2 - \left\{ (q_O - q_I N)(a + q_O) + (a N + q_O)(q_O - q_I) \right\} u + (q_O - q_I)(q_O - q_I N^2).$
(32)

It is not difficult to check that $1/(1 - r_{\mathrm{Nash}})$ is the larger root of $k(u)$. We note that $k(1) < 0$.

Next, we define $r_I$ as

$r_I = 1 - \frac{a + q_O}{(q_O - q_I)\, N}.$
(33)

This quantity has the following properties: (i) $0 < r_I < 1$; (ii) $w(r_I, r_I) = w_I$; and (iii) $k(1/(1 - r_I)) > 0$. On the other hand, it is elementary to show that $k(1/(1 - r_{\mathrm{Pareto}})) < 0$. Thus, we conclude that $r_{\mathrm{Pareto}} < r_{\mathrm{Nash}} < r_I$.

Now, $r_{\mathrm{Pareto}}$ is the maximal point of $w(r, r)$. Therefore, we have the inequalities $w(r_{\mathrm{Pareto}}, r_{\mathrm{Pareto}}) > w(r_{\mathrm{Nash}}, r_{\mathrm{Nash}}) > w(r_I, r_I)$, that is, $w_P > w_N > w_I$.
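Property (ii), $w(r_I, r_I) = w_I$, is an algebraic identity in the model parameters and can be confirmed numerically from equations (3), (4), and (33). A sketch (the helper name `check_r_I` and the parameter values are ours):

```python
def check_r_I(N, q_C, q_I, q_O):
    """Verify property (ii): the symmetric profile r_I of equation (33)
    reproduces the individual learner's fitness, w(r_I, r_I) = q_I/(a + q_I)."""
    a = q_C / (1 - q_C / N)
    r_I = 1 - (a + q_O) / ((q_O - q_I) * N)       # equation (33)
    A = a + q_I + (q_O - q_I) * r_I               # denominator of equation (3)
    B = a + N * q_I * (1 - r_I)                   # equation (3) with rbar = r = r_I
    w = (q_I + (q_O - q_I) * r_I - a * q_O * r_I / B) / A
    return w, q_I / (a + q_I)                     # w(r_I, r_I) and w_I
```

The two returned values agree to machine precision whenever $(q_O - q_I)N > a + q_O$, so that $r_I \in (0, 1)$.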

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

1. Boyd, R. & Richerson, P. J. Culture and the Evolutionary Process (University of Chicago Press, Chicago, 1985).
2. Boyd, R. & Richerson, P. J. Why does culture increase human adaptability? Ethol. Sociobiol. 16, 125–143, doi:10.1016/0162-3095(94)00073-G (1995).
3. Enquist, M., Eriksson, K. & Ghirlanda, S. Critical social learning: A solution to Rogers’s paradox of nonadaptive culture. Am. Anthropol. 109, 727–734, doi:10.1525/aa.2007.109.4.727 (2007).
4. Rendell, L., Fogarty, L. & Laland, K. N. Rogers’ paradox recast and resolved: Population structure and the evolution of social learning strategies. Evolution 64, 534–548, doi:10.1111/j.1558-5646.2009.00817.x (2010).
5. Rogers, A. R. Does biology constrain culture? Am. Anthropol. 90, 819–831, doi:10.1525/aa.1988.90.4.02a00030 (1988).
6. Rendell, L. et al. Why copy others? Insights from the social learning strategies tournament. Science 328, 208–213, doi:10.1126/science.1184719 (2010).
7. Bolton, P. & Harris, C. Strategic experimentation. Econometrica 67, 349–374, doi:10.1111/1468-0262.00022 (1999).
8. Mori, S., Nakayama, K. & Hisakado, M. Phase transition of social learning collectives and the echo chamber. Phys. Rev. E 94, 052301, doi:10.1103/PhysRevE.94.052301 (2016).
9. Maynard-Smith, J. Evolution and the Theory of Games (Cambridge University Press, Cambridge, 1982).
10. Thomas, B. Evolutionary stability: states and strategies. Theor. Popul. Biol. 26, 49–67 (1984).
11. Toyokawa, W., Kim, H. & Kameda, T. Human collective intelligence under dual exploration-exploitation dilemmas. PloS One 9, e95789, doi:10.1371/journal.pone.0095789 (2014).
12. Kameda, T. & Nakanishi, D. Cost-benefit analysis of social/cultural learning in a nonstationary uncertain environment: An evolutionary simulation and an experiment with human subjects. Evol. Hum. Behav. 23, 373–393, doi:10.1016/S1090-5138(02)00101-0 (2002).
13. Yoshida, S., Hisakado, M. & Mori, S. Interactive restless multi-armed bandit game and swarm intelligence effect. New Generat. Comput. 34, 291–306, doi:10.1007/s00354-016-0306-y (2016).
14. Stroock, D. W. An Introduction to Markov Processes (Springer-Verlag, Heidelberg, 2014).
15. Meyer, C. D. Matrix Analysis and Linear Algebra (SIAM, 2000).
16. Granas, A. & Dugundji, J. Fixed Point Theory (Springer-Verlag, New York, 2003).

## Acknowledgements

We would like to thank Editage (www.editage.jp) for English language editing. This work was supported by JSPS KAKENHI Grant Number 17K00347.

## Author information

### Affiliations

1. #### Department of Mathematics, Faculty of Science, Shinshu University, Asahi 3-1-1, Matsumoto, Nagano, 390-8621, Japan

• Kazuaki Nakayama

3. #### Department of Physics, Faculty of Science, Kitasato University, Kitasato 1-15-1, Sagamihara, Kanagawa, 252-0373, Japan

• Shintaro Mori

### Contributions

S.M. and M.H. conceived the model. K.N. performed a theoretical analysis. All authors contributed to analysing and interpreting the results and to writing the manuscript.

### Competing Interests

The authors declare that they have no competing interests.

### Corresponding author

Correspondence to Kazuaki Nakayama.