Stochastic evolution in populations of ideas

It is known that the learning of players who interact in a repeated game can be interpreted as an evolutionary process in a population of ideas. These analogies have so far mostly been established in deterministic models, where memory loss in learning has been seen to act similarly to mutation in evolution. Here we propose a representation of reinforcement learning as a stochastic process in finite 'populations of ideas'. The resulting birth-death dynamics has absorbing states and allows for the extinction or fixation of ideas, marking a key difference to mutation-selection processes in finite populations. We characterize the outcome of evolution in populations of ideas for several classes of symmetric and asymmetric games.

A Positive transition rates
For fixed Γ the rates will only remain non-negative if λ ≤ λc. One can compute a lower bound for λc as follows. All transition rates are positive if and only if the constraint is met for all 0 < n < N. Since the quantity Δπ = π1 − π2 varies linearly with n, it is bounded by Δπ(1) and Δπ(N − 1). Applying the triangle inequality to (35) gives

|Δπ − λ ln[(N − n)/n]| ≤ max(|Δπ(1)|, |Δπ(N − 1)|) + λ ln(N − 1).

As a consequence, all transition rates are positive as long as max(|Δπ(1)|, |Δπ(N − 1)|) + λ ln(N − 1) ≤ 1/Γ. This translates into

λ ≤ [1/Γ − max(|Δπ(1)|, |Δπ(N − 1)|)] / ln(N − 1).

The right-hand side therefore provides a lower bound on the critical value λc. This bound is plotted as a function of N in Fig. S1. While the bound goes to zero for N → ∞, the inverse logarithmic dependence means it does so extremely slowly: the restriction on the allowed range of λ is therefore mild even for very large population sizes (N ∼ 10^8 and beyond).
Figure S1. Lower bound on λc for a coexistence game as defined in section 3.4, for Γ = 0.1. The bound (black line) is inversely proportional to the logarithm of the population size N.
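The bound is straightforward to evaluate numerically. Below is a minimal sketch; the payoff differences Δπ(1) and Δπ(N − 1) used here are hypothetical placeholder values, not those of any particular game in the main text.

```python
import numpy as np

def lambda_bound(N, Gamma, dpi_1, dpi_Nm1):
    """Lower bound on lambda_c: all transition rates remain positive
    for lambda below this value. dpi_1 and dpi_Nm1 are the payoff
    differences at n = 1 and n = N - 1 (hypothetical values here)."""
    m = max(abs(dpi_1), abs(dpi_Nm1))
    return (1.0 / Gamma - m) / np.log(N - 1)

# Example: Gamma = 0.1 and order-one payoff differences
for N in [10, 100, 10**4, 10**8]:
    print(N, lambda_bound(N, 0.1, 0.5, -0.8))
```

Since the bound decays only as 1/ln(N), it remains of order one even for N ∼ 10^8, in line with Fig. S1.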

B Non-monotonicity of the fixation time in a coordination game
In section 3.4 we studied fixation in a coordination game and observed that, for small N, the fixation time is non-monotonic in λ close to the bifurcation threshold λc. We now provide an explanation for this phenomenon by decomposing the dynamics leading to fixation into a sequence of elementary events. When N is small enough for activation times to be only moderate, two additional effects come into play beyond the bifurcation threshold, in addition to the relaxation and activation processes observed for large N: (i) direct activation: when starting near x = 0, a fluctuation (activation event) can drive the system straight to fixation at x = 0, even though the deterministic relaxation would take it in the other direction; (ii) trapping in regions near deterministic fixed points, where the net (deterministic) flow is low; deterministic relaxation times can then become comparable to activation times (precisely at such a fixed point the deterministic relaxation time is in fact infinite, as the flow vanishes). Finite populations will stay trapped in these regions of low deterministic flow for a long (but finite) time, which grows logarithmically with N as explained in Sec. D of this Supplementary Material. Such regions exist at and near the bifurcation at λc, both for λ below and above λc.
The curve in Fig. 6(b) for λ = 0.475 shows the first effect: for small initial values of x, fixation times are rather low, as direct activation towards x = 0 is the dominant fixation mechanism. To the right of the maximum in the curve, on the other hand, we have fixation predominantly at x = 1. The fixation time here is, to a good approximation, given by the deterministic relaxation time to the stable fixed point close to x = 1, with the final activation to x = 1 being sufficiently fast to be sub-leading.
Accordingly, the sample trajectories in Fig. 7 show that the system moves to the stable fixed point in a close-to-deterministic fashion, with fixation at x = 1 occurring shortly afterwards.
The second effect above contributes to the initial condition-dependence of the fixation time in Fig. 6. Here we are close enough to the bifurcation to have an extended region of low flow, causing a significant peak in the transition time curve. The low flow also makes fluctuation effects significant as explained above, and these cause deviations from the times predicted for purely deterministic relaxation. In Fig. 7, the sample trajectories that start from n = 200 (x = 0.2) illustrate this effect.
Finally, the low flow also makes direct activation to x = 0 fast, giving a larger region of initial x where this is the main fixation mechanism. As is clear from Fig. 6(b), the resulting movement of the peak in the fixation time is what causes the non-monotonic λ-dependence at fixed initial condition that is visible in Fig. 6(a). We refer to one of the two sample trajectories starting from n = 50 (x = 0.05) in Fig. 7 for an illustration of a direct activation event.
We note that the direct activation effects discussed above for the coordination game do occur also for coexistence and dominance games, with the same consequence that fixation times become small for initial conditions near x = 0. These other games do not have the additional features arising from the bifurcation in the coordination game, however, so do not show non-monotonic variation of the fixation time with λ.
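The mechanisms discussed in this section can be checked against exact mean fixation times, which for any one-step birth-death process follow from the standard first-step (backward master equation) relation. The sketch below uses generic rate functions as placeholders, not the actual rates (13) and (14) of the model.

```python
import numpy as np

def mean_fixation_times(T_plus, T_minus, N):
    """Mean unconditional fixation time tau(n) for a birth-death chain
    on n = 0..N with absorbing boundaries, from the first-step relation
    (T+ + T-) tau(n) = 1 + T+ tau(n+1) + T- tau(n-1)."""
    A = np.zeros((N - 1, N - 1))
    b = np.ones(N - 1)
    for i, n in enumerate(range(1, N)):
        tp, tm = T_plus(n), T_minus(n)
        A[i, i] = tp + tm
        if n + 1 < N:          # tau(N) = 0, so the n = N-1 row has no upper term
            A[i, i + 1] = -tp
        if n - 1 > 0:          # tau(0) = 0, so the n = 1 row has no lower term
            A[i, i - 1] = -tm
    tau = np.zeros(N + 1)
    tau[1:N] = np.linalg.solve(A, b)
    return tau

# Sanity check: for unbiased unit rates, tau(n) = n(N - n)/2 exactly
N = 50
tau = mean_fixation_times(lambda n: 1.0, lambda n: 1.0, N)
```

With the model's actual rates inserted for `T_plus` and `T_minus`, the same solver gives the initial-condition dependence of the fixation time directly, without stochastic simulation.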

C Activation dynamics in stochastic evolution of ideas for symmetric games
Here, we explain how to obtain the large-N behaviour of activation times in our stochastic evolution for a population of ideas, and discuss the consequences for the fixation dynamics.

C.1 Kramers-Moyal expansion and effective potential
Our starting point is the dynamics defined by the transition rates (13) and (14). We have discussed in the main text how for N → ∞ this leads to deterministic dynamics, here, by our construction, the Sato-Crutchfield equation (22). This can formally be derived from a Kramers-Moyal expansion to lowest order. In order to capture stochastic effects, one retains the first sub-leading order in the expansion. This is standard for evolutionary processes [1], and leads to an Itō stochastic differential equation of the form

ẋ = f(x) + √(σ²(x)/N) ξ(t),

where ξ(t) is Gaussian white noise of unit variance, ⟨ξ(t)ξ(t′)⟩ = δ(t − t′). For the birth-death process discussed in Sec. 3 one finds the standard expressions f(x) = T⁺(x) − T⁻(x) and σ²(x) = T⁺(x) + T⁻(x), with the transition rates expressed as functions of x = n/N. Our aim is to use Eyring-Kramers theory [2], and so we map the above dynamics with multiplicative noise to one with additive noise. This is standard for systems with one degree of freedom, and is achieved by a change of variable from x to y with x(y) = sin²(y/2). Translating the dynamics of x into one for y gives a Langevin equation with additive noise of amplitude 1/√N, plus an additional flow term with prefactor 1/N that arises from the x-dependence of the original noise variance σ²(x). We will see shortly that this term can be neglected in determining the leading (exponential in N) scaling of activation times. The y-dynamics can thus be written as gradient descent on an effective potential Vy(y), up to the O(1/N) correction, with additive noise of amplitude 1/√N. Now that we have a standard Langevin equation with additive noise, Eyring-Kramers theory tells us that the time for an activated event, say from a stable fixed point y1 to an unstable fixed point (barrier state) y2 or to a boundary, scales as exp{NΓ[Vy(y2) − Vy(y1)]}. It follows that the O(1/N) term in Vy will only contribute to the prefactor, which we are not considering here anyway; it can therefore be neglected. More importantly, if we translate back from y to x, the potential can be expressed as a function V(x) = Vy(y(x)) of the original variable, and activation times from a minimum x1 to a barrier state x2 scale as exp{NΓ[V(x2) − V(x1)]}. This will be the basis for our further analysis.
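The change of variable x(y) = sin²(y/2) is the standard variance-stabilising (arcsine) transformation: assuming a noise amplitude σ(x) ∝ √(x(1 − x)), the Jacobian dx/dy exactly cancels the x-dependence of the noise. A quick numerical check of this identity:

```python
import numpy as np

# x(y) = sin^2(y/2) maps y in (0, pi) onto x in (0, 1)
y = np.linspace(0.1, np.pi - 0.1, 1000)
x = np.sin(y / 2) ** 2

# Jacobian dx/dy = sin(y/2) cos(y/2); this equals sqrt(x(1-x)),
# so noise of amplitude sqrt(x(1-x)) in x becomes additive in y
dxdy = np.sin(y / 2) * np.cos(y / 2)
```

This is why the y-equation carries a constant noise amplitude 1/√N, with all the x-dependence absorbed into the drift.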
In particular, we will exploit that for large N these activation timescales become exponentially separated; hence, if there are competing processes, the one with the smaller activation barrier occurs first (with probability one as N → ∞).
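The 'smaller barrier wins' statement can be illustrated by two competing exponential clocks with rates proportional to exp(−NΓB): the probability that the smaller-barrier process fires first tends to one exponentially fast in N. A sketch with hypothetical barrier values:

```python
import numpy as np

def p_smaller_first(B_small, B_large, N, Gamma=0.1):
    """Probability that the activation with barrier B_small happens before
    the one with B_large, for two independent Poisson clocks with rates
    proportional to exp(-N*Gamma*B)."""
    # Work with the rate ratio to avoid underflow at large N
    ratio = np.exp(-N * Gamma * (B_large - B_small))  # r_large / r_small
    return 1.0 / (1.0 + ratio)

for N in [10, 100, 1000]:
    print(N, p_smaller_first(0.5, 1.0, N))
```

The barrier difference enters only through the product NΓ(B_large − B_small), which is why the outcome becomes deterministic as N → ∞.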
We add finally as a note of caution that the above Langevin analysis is valid for small Γ, where the rates for a transition n → n + 1 and its reverse are close to each other. Otherwise a more general approach is needed to determine activation timescales [3].

C.2 Generic symmetric two-strategy games
We can write down the potential V(x) quite generically for a symmetric game where there are two actions to choose from. Inserting the explicit form of the payoffs (see Eqs. (4) and (5)) into (45), one obtains, up to an unimportant additive constant, a potential that depends on the payoffs through two parameters v and w, plus the entropic term −λs(x). Here the entropy is s(x) = −x ln x − (1 − x) ln(1 − x). For |v| < |w| and λ = 0, V(x) has an interior extremum: depending on the sign of w this is either a minimum, corresponding to a coexistence game, or a maximum, corresponding to a coordination game. In the remaining cases, where |v| > |w|, V(x) is monotonic for x ∈ [0, 1], so one has a dominance game.
To understand the effect of nonzero λ on V(x), note that the function −λs(x) is convex. Hence for a coexistence game V(x) continues to have a single minimum x⋆. A fixation trajectory will first relax to this minimum. The barrier to activation towards x = 0 is then V(0) − V(x⋆), so fixation will occur there if this is lower than the corresponding barrier V(1) − V(x⋆) for fixation at x = 1. In the opposite case, i.e. for V(0) > V(1), fixation will occur at x = 1.
For a dominance game, the inclusion of the entropic term in V(x) will create a single minimum x⋆ for any λ > 0, because the derivative −λs′(x) diverges to ±∞ at the two boundaries x = 0 and x = 1. The fixation dynamics then follows the same pattern as for a coexistence game.
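The effect of the entropic term can be made concrete with a toy potential: take a hypothetical monotonic (dominance-like) contribution g(x) = a·x and add −λs(x). The stationarity condition a = λ ln[(1 − x)/x] then has the closed-form solution x⋆ = 1/(1 + e^{a/λ}), an interior minimum for any λ > 0 since −λs(x) is convex.

```python
import numpy as np

def entropy(x):
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

def V(x, a, lam):
    """Toy dominance-game potential: a hypothetical monotonic part a*x
    plus the convex entropic term -lam*s(x)."""
    return a * x - lam * entropy(x)

a, lam = 1.0, 0.5
xs = np.linspace(1e-4, 1 - 1e-4, 100001)
x_star = xs[np.argmin(V(xs, a, lam))]
# Analytic minimum: x* = 1/(1 + exp(a/lam))
```

As λ → 0 the minimum x⋆ → 0, recovering the boundary behaviour of the bare dominance game.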

C.3 Kramers-Moyal expansion for coordination games
The remaining case of coordination games is the most interesting, as the competition between the maximum in V(x) at λ = 0 and the convex entropic term can create additional minima. We keep λ > 0 from now on and write the potential in the form V(x) = λV̂(x), with V̂(x) depending on the payoffs through the rescaled parameters v = ṽ/λ and w = w̃/λ. The shape of V(x) is determined by these parameters, while λ only affects the overall scale of the activation barriers, not their relative size for different processes. We therefore drop the prefactor λ in the following.
For large v and w, corresponding to small λ at fixed ṽ and w̃, the entropic term is mostly negligible in V(x). But its diverging derivative always dominates in V′(x) close enough to the boundaries, so it must create two minima there. We denote their positions x⋆1 and x⋆2, respectively, and that of the intermediate maximum by xs. We also introduce the barriers B0 = V(0) − V(x⋆1) and B1 = V(1) − V(x⋆2), together with the potential differences Δ0 = V(0) − V(xs) and Δ1 = V(1) − V(xs), as illustrated in Fig. S3. As v and w change, so will the values of these barrier parameters. In particular, the signs of Δ0 and Δ1 determine qualitatively the kind of fixation dynamics that the system will exhibit. The regime where Δ0 and Δ1 have different signs is subdivided further according to their relation to the barriers B0 and B1. A graphical summary is given in Fig. S4 and discussed further below. Fig. S5 shows the resulting phase diagram in the (v, w)-plane, and summarizes to what extent fixation probabilities and fixation times depend on initial conditions in each of the four regimes. Note that when w gets too close to zero, or |v|/|w| becomes too large, a maximum and a minimum of V(x) can merge in a bifurcation. In the single-minimum regime beyond this, the fixation dynamics becomes simple again and has the same features as for coexistence and dominance games. The arrow in Fig. S5 shows how the various regions of the diagram are traversed when λ is increased at fixed ṽ and w̃, i.e. for fixed payoffs. In Fig. S2 we plot over what λ-ranges V(x) has the shapes (i), (iii) and (iv), respectively, in the specific example game of section 3.4; the λ-range for shape (iii) is, however, too small to see in that figure. Fig. S4(a) shows the simplest case Δ0, Δ1 < 0. Here, depending on its initial condition, the system will first relax to one of the minima of the potential, say x⋆1. Then, because Δ0 < 0, the barrier for activation to x = 0 is smaller than for activation to the maximum xs.
For large N, which we always assume in the following discussion, the former process is then with probability one the first to happen: fixation occurs at x = 0. Similarly, if the initial relaxation goes to x⋆2 because the system started at x > xs, fixation will occur at x = 1. The fixation probability at 0 is therefore a step function of the initial condition x, dropping from one to zero at x = xs. The fixation time changes similarly with initial condition, from exp[NΓB0] for x < xs to exp[NΓB1] for x > xs.
The opposite case of Δ0, Δ1 > 0 is illustrated in Fig. S4(b). Here, once the system has landed in either of the two minima, it will be able to reach the maximum separating these minima much faster than a boundary. As a result the system will make many "trips" between the two minima and effectively equilibrates across them, forgetting its initial condition. One can show that fixation then eventually occurs as if the system only had a single potential minimum, located at the lower of the two local minima, and accordingly takes place at the boundary with the lower value of V.
Finally there is the case where Δ0 and Δ1 have opposite signs, e.g. Δ1 > 0 and Δ0 < 0, as shown in Fig. S4(c,d). If the system starts at x < xs, we have the same situation as in case (a) above: deterministic relaxation to x⋆1 followed by fixation at x = 0 on a timescale set by the barrier B0. Otherwise, the system will initially relax to x⋆2 and then traverse the maximum at xs: Δ1 > 0 ensures that activation to the maximum is exponentially faster than fixation at x = 1. After arrival at x⋆1 the earlier sequence of processes is followed. Because fixation in both cases takes place at x = 0, the fixation probability is independent of the initial condition.
Whether the fixation time has such a dependence, on the other hand, depends on timescales.
As Fig. S4(c,d) shows, the timescale for activation from x⋆2 to xs is set by the barrier B1 − Δ1, while the timescale for fixation at x = 0 from x⋆1 is set by B0. If the former is smaller than the latter, as in Fig. S4(c), then even when the system initially relaxes to x⋆2, the timescale for the overall fixation trajectory is given by B0: it is therefore independent of the initial condition.
In the converse case of Fig. S4(d), the system will take longer to reach fixation starting from x > xs because activation from x⋆2 to xs is much slower than fixation from x⋆1. A typical fixation trajectory here will see the system spend almost all of its time near x⋆2, until a fluctuation drives it across xs to x⋆1 and from there to x = 0.
Figure S4. (i) When Δ0 < 0 and Δ1 < 0, the barriers to fixation at the boundaries are smaller than the central barrier separating the two potential minima x⋆1 and x⋆2: fixation occurs by deterministic relaxation to one of these points, then activation to the nearest boundary. (ii) For Δ0 > 0 and Δ1 > 0, transitions between the two potential minima are much faster than activated fixation at either boundary. The system equilibrates between the minima, forgetting its initial condition, and fixes at the boundary with lower V(x), here x = 1. (iii, iv) When Δ0 < 0 and Δ1 > 0, fixation always occurs at x = 0 because from x⋆2 the system will cross the barrier at xs to x⋆1. (iii) If the barrier crossing is faster than the final activation towards x = 0, the fixation time is also independent of the initial condition. (iv) Otherwise, the barrier-crossing time dominates, causing a much longer fixation time when starting from x > xs.
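The classification by the signs of Δ0 and Δ1 is easy to automate for any double-well potential sampled on a grid. The sketch below uses a hypothetical tilted quartic double well, not the actual V(x) of the model; for this particular tilt both differences come out positive, i.e. the system is in case (ii).

```python
import numpy as np

def barrier_parameters(V_vals):
    """Locate the two minima and the interior maximum of a double-well
    potential on a grid, and return B0 = V(0)-V(x1*), B1 = V(1)-V(x2*),
    D0 = V(0)-V(xs), D1 = V(1)-V(xs)."""
    interior = range(1, len(V_vals) - 1)
    minima = [i for i in interior
              if V_vals[i] < V_vals[i - 1] and V_vals[i] < V_vals[i + 1]]
    maxima = [i for i in interior
              if V_vals[i] > V_vals[i - 1] and V_vals[i] > V_vals[i + 1]]
    i1, i2 = minima[0], minima[-1]                 # the two wells
    s = max(maxima, key=lambda i: V_vals[i])       # barrier between them
    B0 = V_vals[0] - V_vals[i1]
    B1 = V_vals[-1] - V_vals[i2]
    D0 = V_vals[0] - V_vals[s]
    D1 = V_vals[-1] - V_vals[s]
    return B0, B1, D0, D1

# Hypothetical double well with a small tilt
x = np.linspace(0.0, 1.0, 100001)
V_vals = (x - 0.5) ** 4 - 0.1 * (x - 0.5) ** 2 + 0.01 * x
B0, B1, D0, D1 = barrier_parameters(V_vals)
```

Applied to the model potential on a grid, the signs of the returned D0 and D1 (and their relation to B0 and B1) select directly between the regimes (i)-(iv) of Fig. S4.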
Figure S5. Phase diagram in the (v, w)-plane, indicating the single-well regime and where the different shapes (i)-(iv) of V(x) occur that are explained in Fig. S4. The dotted arrow shows how the phase diagram is traversed at fixed ṽ and w̃ when λ is increased.

D Fixation in regions of small flow
Here, we explain briefly why the noise-driven escape from the low-flow region around an unstable fixed point takes a time scaling as ln(N).
Consider the linearized dynamics of a coordination (or other) game near an unstable fixed point.
After the mapping to Langevin dynamics with additive noise, cf. (43), this can be written in the form ẏ = μ(y − y0) + (1/√N) ξ(t) with μ > 0. Assuming that y(0) = y0, a straightforward calculation then shows that the variance of y(t) is

⟨[y(t) − y0]²⟩ = [exp(2μt) − 1] / (2μN).    (53)

To have 'escape' from the unstable fixed point this needs to be of order unity; call this value c.
Neglecting the −1 in the numerator then gives an escape time of order t = ln(2cμN)/(2μ), which for large N becomes ln(N)/(2μ), establishing the promised logarithmic scaling with N.
Note that while we have estimated the time for an escape to a distance of order unity in y-space, this is equivalent to an order unity distance in x-space as the mapping from x to y is smooth.
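The logarithmic scaling is easily verified numerically by inverting the variance formula (53): solving ⟨[y(t) − y0]²⟩ = c for t gives the escape time exactly. The values of μ and c below are hypothetical.

```python
import numpy as np

def escape_time(N, mu=1.0, c=1.0):
    """Time at which the variance (exp(2*mu*t) - 1)/(2*mu*N) of the
    linearized dynamics reaches c; exact inversion of Eq. (53)."""
    return np.log(2 * c * mu * N + 1) / (2 * mu)

for N in [10**2, 10**4, 10**8]:
    t = escape_time(N)
    print(N, t, t / (np.log(N) / 2))  # the ratio approaches 1 for large N
```

The 1/(2μ) prefactor shows that the flatter the potential at the unstable fixed point (smaller μ), the longer the trapping, consistent with the trapping effect discussed in Sec. B.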