The complexity of understanding others as the evolutionary origin of empathy and emotional contagion

Contagious yawning, emotional contagion and empathy are characterized by the activation of similar neurophysiological states or responses in an observed individual and an observer. For example, it is hard to keep one’s mouth closed when imagining someone yawning, or not feeling distressed while observing other individuals perceiving pain. The evolutionary origin of these widespread phenomena is unclear, since a direct benefit is not always apparent. We explore a game theoretical model for the evolution of mind-reading strategies, used to predict and respond to others’ behavior. In particular we explore the evolutionary scenarios favoring simulative strategies, which recruit overlapping neural circuits when performing as well as when observing a specific behavior. We show that these mechanisms are advantageous in complex environments, by allowing an observer to use information about its own behavior to interpret that of others. However, without inhibition of the recruited neural circuits, the observer would perform the corresponding downstream action, rather than produce the appropriate social response. We identify evolutionary trade-offs that could hinder this inhibition, leading to emotional contagion as a by-product of mind-reading. The interaction of this model with kinship is complex. We show that empathy likely evolved in a scenario where kin- and other indirect benefits co-opt strategies originally evolved for mind-reading, and that this model explains observed patterns of emotional contagion with kin or group members.

Below we provide additional information on the model following the same notation of the main text, summarized in Table S1. First we provide additional details on the cognitive strategies and how fitness expressions are obtained. Second, we show how the results presented in the text are obtained in detail. Last, we present agent based simulations testing the model in presence of simple learning rules. In particular: • Section 1. Deterministic model, detailed description:

Section 1 General expressions for fitness
We denote the strategy an individual with u. The strategy that an individual adopts is defined by a strategic type (F , P or S), and one or more continuous traits, denoted with the letter u, i.e. u D , u B and u S . The strategy of an individual determines the probability of a given response to a stimulus: B or ∅ if the individual is an actor, D,C and 0 if an observer. We denote the probabilities of a given response in the form p(x, t, u), dropping the arguments for simplicity when not necessary. For instance for an actor, p B (x, t, u) = Pr(a +i | s +i ; x; t) defines the probability of an appropriate B response. A focal observer reacts appropriately to a social stimulus with probability p D (x, t, u) = P (a −i |s −i ; x; t; u).

Definition s
Stimulus. It can be perceived as an actor (e.g. s +i ) or as an observer (e.g. s −i ) a Action. The best response for a stimulus s j is a j x Intensity of a neural configuration (e.g. the perception of a stimulus ) t, t + , t − Absolute time, time that an individual interacts as an actor and as an observer.

B
Best response for an actor (e.g. a +i in response to s +i ).
D Best response for an observer (e.g. a −i in response to s −i ).
C Coordination. It occus if an observer performs a +i in response to s −i ).
0, ∅ Failed action (e.g. a +j in response to s +i ), for observers and actors, respectively.  Table 2. •

Focal individual/strategy
• Non-focal individuals/strategies π + , π − , π ± Payoff as actor, observer and observed actor p Y Probability of a given response Y for a specific stimulus P Y Average probability of a given response Y for a given strategy λ l , λ e , λ d Learning, death and environmental change rate p − = p, p + = (1 − p) Probabilities that an individual is either an observer or an actor.
r Assortment of relatedness coefficient D r Cooperative response, providing a benefit b + to an actor at a cost −b − for the observer.
Similarly, we can define the probability that it reacts coordinatively with p C (x, t, u) = Pr(a +i |s −i ; x; t; u), which leaves p 0 (x, t, u) = Pr(a j̸ =±i |s −i ; x; t; u) = (1 − p D (x, t, u) − p C (x, t, u)). We can now express the expectations of a given response in terms of p B , p D , p 0 and p C . For a focal individual •, the expected probability of B responses is simply We use the notation E y to indicate the expectation over the environmental variable y. For brevity the expectation over all the possible variables, for actors and observers, is indicated by E, without further subscripts. P • B can be calculated by averaging also across all the population strategies (i.e.
. For social interactions we have in the most general form: where analogous expressions can be obtained for P • D ,P • C and P • 0 . Note that t • or t • , as well as x • and x • , are not necessarily independent, and their relationships will depend on the specific structure of the environment. For example, the intensity of a stimulus perceived by an actor, x + , and by an observer, x − , might be correlated. Similarly, environmental states can vary independently for all individuals(asynchronous/spatial variation, t • independent of t • ), or at the other extreme, vary simultaneously (synchronous/temporal variation, t • = t • ).
We observed that differences in these correlations do not lead to any qualitative difference in the behavior of the model, even in the most extreme cases. Thus in agent-based simulations we adopt a synchronous variation and discrete generations setup for computational efficiency. In the deterministic description of the model we report general expression that can be applied to all cases, while for figures we assume asynchronous variation and uncorrelated intensities of actors' and observers' representations of a stimulus, i.e. independent t • and t • , and independent x • and x • . In this case eq.S2-3 become: For readability, later we also report simplified expressions obtained under the assumption that all stimuli intensities are identically distributed and equal to 1, i.e. x + = x − = 1.

Cognitive strategies
Cognitive strategies can be represented using the functions defined in the main text and reported in Table 2. We also obtain simplified expressions by using the step linear functions: We distinguish parameters in (i) potentially evolvable traits, denoted by the letter u and part of the trait vector u; (ii) other non-evolvable parameters involved in the cognitive functions or the individuals' life cycle, for which the evolutionary dynamics would be constrained or trivial, are denoted by λ, if affecting the time component of learning, or ρ, if affecting the intensity component.

As-actor circuit
The as-actor response is common to all individuals independently of their strategy. The probability of an appropriate response B is: = Pr(a +i |â +i ; x + , t + )Pr(â +i |ŝ +i ; x + , t + )Pr(ŝ +i , s +i ) where t + , the learning time for as-actor stimuli, would depend on the specific learning algorithm considered. Here, we want to provide a general treatment, and since we only care about the rate of learning, we only distinguish two classes of learning algorithms on the basis of their dependence on time. In the first class, each exposure to a stimulus improves future responses as t + = p + t, whether or not the focal individual performs an action. We define this mechanism as 'observational learning'.
In the second class, that could be exemplified by any form of trial-and-error learning process, an individual improves future reactions on the basis of the outcome of previous responses. In this case, a learning event occurs only when the individual actually attempts a response. Hence, t + depends on an individual's own strategy, affecting the probability of an actual response: where Pr(x + ) indicates the distribution of direct stimuli intensities x + .
Finally, we have an explicit expression for P B , the main component of the as-actor fitness:

As-observer circuit and strategy types
When an individual acts as an observer, the highest fitness payoff is attained by construction for response D. Hence all mind-reading strategies are expected to evolve maximizing P D , the expectation of an appropriate social response.

Fixed Response strategies -F
The mapping providing the highest average payoff is the one matching the most common environmental state e max . Therefore the average payoffs is proportional to the fraction of time spent in e max , Pr(e max ): where For environments changing at a constant rate we have: Associative strategies -P Schematically we can represent in a similar way the as-actor and P-strategies as-observer circuits: as-actor circuit By assuming null payoffs for 0 responses, the fitness component for a P observer is simply π P − = P P D d − . For a focal P individual, P P D is equal to Note that whereas for a focal actor the learning time t + depends only on its own strategy, for an observer it is also necessary that the observed actor's behavior is informative. Hence, t − depends on In the cases of observational and trial-and-error learning we have respectively: We develop here the former case as:

Simulative strategies -S
In order to represent S we introduce a new class of cognitive functions, simulative functions. These functions allow an observer to map a social cue into an as-actor stimulus representation. Biologically, it has been suggested that mirror neurons, possibly developed through simple associative learning, might serve this purpose. For more complex or emotionally relevant stimuli, structures like the anterior insula and anterior medial cingulate cortex have been suggested [1]. Here, we only focus on the efficiency of this process, i.e. the probability of its success, and its possible role in the modulation of empathy and the responses of simulative strategies, i.e. the effect on the intensity of neural representations.
To this aim, we represent simulation as a two step process: in the first, social stimuli are mapped to neural representation of the corresponding stimuli as-if perceived by an actor; in the second, the inferred as-actor response is used by the observer to choose a social response. Formally, we define these simulation functions as any function mapping an actions or stimuli representations within the same space, eitherŜ orÂ, but different subspaces. In particular we refer to the subspaces of as-actor (i.e. S + ,Â + ) and as-observer representations (i.e.Ŝ − ,Â − ). Thus, we can distinguish the two simulation functions as: • γ s :Ŝ − →Ŝ + , maps a social stimulus representationŝ = (s −i , x) ∈Ŝ − to an as-actor onê s = (s +i , y) ∈Ŝ + , allowing for the representation of a social stimulus "as if" it was directly perceived as an actor. Thus: • γ a :Â + →Â − , operates similarly but in a reverse fashion, so thatâ = (a +i , x) ∈Â + is mapped toâ = (a −i , y) ∈Â − . Therefore, the representation of an inferred as-actor action can be used to select the appropriate social response and produce the corresponding neural configuration: The possibility that during the simulation process, as-actor representations are modulated in intensity, is implemented as an evolvable trait u D , described later in details. First, we describe the flow of a simulative circuit: 1. An observer perceives a social cue s −i as a representationŝ −i ∈Ŝ − (Fig.1c,d, first continuous arrow).
2. Instead of using social experience, a simulative strategy maps the social cue representation to its correspondent as-actor representation,ŝ +i ∈Ŝ + , via γ s (Fig.1c-d, first dashed arrow).
3. The observer uses its own as-actor circuit in order to infer an appropriate response toŝ +i , namelŷ a +i (Fig.1c-d, gray arrow ). This representation corresponds to an inference about the actor's behavior. The core of the as-actor circuit is the learning function l, mappingŝ +i toâ +i on the basis of the as-actor experience t + .
4. Finally, the inferenceâ +i can be mapped to an appropriate social response,â −i , via γ a (Fig.1c-d, second dashed arrow). Hence, γ s and γ a describe in the model the efficiency of the simulation process, and the possible loss of information associated to it.
We already suggested that a potential advantage of S is to take advantage of private information acquired as an actor. In order to gain access to this information however, a simulative strategy has to activate, at least partially, the corresponding as-actor circuits (ŝ +i ,â +i , Fig.1). This represents an inherent major constraint of such strategies. In fact, simulated action representations are part of the response machinery to direct stimuli. Therefore, an activation of these simulated neural configurations, if comparable to cases of a direct perception of the stimulus, might trigger the downstream responses.
For an observer, this leads to the activation of as-actor responses. This process, also called facilitation, might interfere with an appropriate social response, and in our model it can be associated with a cost. Conservatively, we assumed that whenever an action representation activates the downstream action, the latter hinders a further response. Therefore, a C response occurs when a simulated action representationâ +i ∈Â + results in the activation of a +i . The probability Pr(a +i |â +i is simply given by the activation function. Because of the risk of coordination, inhibitory processes of C are likely subjected to evolutionary constraints. Thus, in order to study their evolution, we investigated the evolutionary dynamics of key traits involved in this modulation. In our model, inhibition of coordination can be achieved in different ways: • By adopting non-coordinating strategies like P or F . A variation of this mechanism is temporal inhibition, using social information whenever this is sufficient to infer actors' responses without adopting a simulative strategy. This is explored in Section 2.4 as a continuous trait u S .
• By evolving the as-actor circuit in order to increase its specificity and avoid activation in response to simulated action representations. This mechanism corresponds to structural changes of the as-actor circuit, thus influencing the behavior of an organism even as an actor. Since B responses as an actor are affected, we represent this trait as an evolvable trait denoted as u B .
• By modulating the way the as-actor network is recruited during simulation. We model this with a continuous trait u D .
Using the last two mechanisms, an organism can inhibit coordination while still adopting a simulative strategy. We model these mechanisms by exploring the evolution of traits influencing the shape of the activation function, α(u D x − , u B ), which determines the probability of C responses. In this framework, these two alternative ways to avoid activation can be visualized as an increase of the threshold of α, u B , or a decrease of the input intensity of the simulated neural configurations by reducing u D .
For a focal S individual, the probabilities of C and D responses and the as-observer component of the fitness are: Note that S strategies rely on as-actor information rather than on as-observer information, differently from P strategies. Thus, in this model, S strategies are entirely independent of t − , and acquire information as t + . We relax this assumption in Section 4, exploring a model in which both as-actor and as-observer information are used in simulation functions.

Evolution of strategy types and u D
In the frequency-independent case the evolutionary equilibrium value u * D is the global maximum of π S − (u D ). Thus u * D is also an evolutionarily stable strategy (ESS), and satisfies: and The first condition indicates that u * D is a singular point, where the selection gradient The second derivative condition guarantees the non-invasibility by small mutants and the presence of a fitness maximum. We denote the equilibrium strategy S characterized by u * D as S * . Since π S − depends only on the focal u D , for a singular strategy the ESS second derivative condition coincides with the requirements of convergence stability (CS) [3,4], namely S * invades and dominates over all other S strategies. Therefore, we consider S * for all analyses contrasting different types of mind-reading strategies.

Equation 26
shows that the evolution of simulative strategies is subjected to a trade-off, regardless of the specific mathematical functions used to represent the cognitive processes: the as-actor network will be recruited more (higher u D ) if this increases the accuracy of social inferences (higher P S D ); however, if this results in an increased risk of coordination, the as-actor network will be recruited less (lower u D , and in turn lower P S C ). An at least partial inhibition occurs whenever the intensities of simulated stimuli representations E[u D x − ] are on average lower than the intensities of stimuli perceived as an actor E[x + ]. When u * D → 0 simulative strategies completely inhibit coordination. We are interested in determining under which conditions the non-trival case in which an internal equilibrium exists and the as-actor network is partially recruited, while coordination still occurs, i.e. u * D > 0, P S C > 0. We start by exploring the case of cheap coordination (c − = 0), for which only P S D affects the fitness of an S observer, and eq.26-27 simplify into: Recalling eq.23-24, when considering the simplified model in which x + = x − = 1: The selection gradient is positive and u D evolves to higher values when: This equation exemplifies several points. First, a benefit in terms of more efficient mind-reading is necessary for u D to evolve to non-zero values and for the existence of a singular point with non-zero coordination, i.e. P S D must be an increasing function of u D . Second, u D increases as long as the relative improvement in social inferences is larger than the relative loss of appropriate responses due to coordination. For values of u D ≈ 0 coordination is mostly inhibited, thus u D evolves to higher values. However, as u D increases, Thus, u D can evolve to a singular point in which inhibition occurs even in absence of explicit costs for coordination, just because coordination will interfere with the appropriate social response D.
Note that typically, activation/inhibitory functions are highly convex: on a single variable, the discriminatory ability of such functions is determined by their slope, and their steep change allows for the classification of non-relevant and relevant inputs/stimuli. In biologically-based models of neurons and brain activation, as well as in computational modeling, the typical activation function is a sigmoid function: a steep threshold ensures the discrimination of low-intensity/below-threshould stimuli, that are inhibited, from high-intensity relevant ones (see [5,6,7,8,9,10] or [11,12,13] for comprehensive reviews on the topic, and [14] for an application in mirror systems). Thus, the evolution of u D is strongly constrained by the threshold of the activation function: inhibition becomes inefficient if u D increases beyond the activation threshold and generally cannot evolve to higher values. Thus a singular point u * D generally occurs for any typical activation function/inhibitory mechanism. Note that also the second ESS condition is easily satisfied for any smooth activation/inhibition function. In terms of the u D parameter space, this ensures the presence of a low u D region where D prevails over C, and the presence of a high u D region, where coordination strongly hinder appropriate responses. When α is smooth enough (i.e. differentiable), u D always partially "climbs" the activation function to an internal equilibrium value u * D that is independent of the payoffs, and depends only on the shape of the cognitive functions. Note from eq.28 that in the most general formulation of the model, what is required for an internal equilibrium is that the expectation of P S D is continuous. This condition can be in principle fulfilled even for non-continuous activation functions, representing extremely discriminative forms of inhibition: for example, any stochasticity in the stimuli intensity (i.e. their intensity is not always identical) would lead to continuous expectations even for non-continuous activation functions.
An example is a step activation function, with distributed normally stimuli intensities, as shown in where equality holds for the singular strategy u * D . In terms of the cognitive functions, eq.26-27 can be rewritten as: where we remind that . Whereas in the cheap coordination case, u * D is independent of the payoffs, for costly coordination the explicit cost c − adds up to the intrinsic cost of C responses. Expectedly, a higher c − reduces the equilibrium value u * D . For readability, we recall our previous simplifying assumptions. Hence, we consider again stimuli intensities equal to 1 and in turn simulated stimuli with intensities u D . We also assume that non simulated representations are not inhibited i.e. α(1, u B ) = 1. In this case eq.33-34 equal: Again, both conditions are satisfied when l and α are increasing functions of u D , and α is strongly convex.
We explore numerically the evolution of u D for simulative strategies in Fig S1. As we already mentioned, when P C increases smoothly with u D , u * D is a Continuously Stable Strategy (CSS), as shown in the pairwise invasibility plot (PIP) (Fig.S2a) [4]. S * invades all other simulative strategies regardless of the level of environmental complexity, since the sign of the invasion fitness remains the same regardless of the parameters describing environmental variation. Thus, the same PIP and the same value u * D hold for any value of λ e or p − . As a consequence, α(u * D , u B ), and in turn the ratio between P C and P D , remain constant, despite the fact that both expectedly decrease with λ e and with a reduction in learning time (Fig.S2b-c).

Competition between strategy types
We now consider the evolution of more radical forms of inhibition of C events, acting through a structural modification of the as-actor network itself, and thus affecting also the behavior of an individual as an actor. We model this by exploring the evolution of a continuous trait u B . For P and F , the evolutionary dynamics are trivial: since coordination never occurs, inhibition is always deleterious, and u B converges to the minimum possible value. We define this equilibrium value as u F P B . Regarding S strategies, u B affects in opposite direction two processes: higher u B values determine a more efficient inhibition of coordination, increasing the ratio between P D and P C ; however higher u B values determine a higher intensity threshold also when the focal individual is an actor, reducing P B .
We perform a general stability analysis of the competition between the three different strategy types, investigating cases for which u S B > u F P B . We can rewrite the replicator systems as: where we expressed all fitness functions in terms of the evolvable traits. For simplicity, we focus first on the competition between different types of strategy, neglecting the evolution of continuous traits. This corresponds to assuming a separation in time scale [4,2] between the evolution of strategic types, and the evolutionary fine-tuning of continuous traits shaping the different cognitive functions. In section 2.4 we relax this assumption by introducing a different framework. When the continuous traits are fixed, different values of u B for simulative and non simulative strategies lead to frequency-dependent dynamics. The overall pattern in the competition between the different strategies is similar to the frequency independent case: S dominates for more variable environments, whereas F dominates for more stable ones (Fig.S3). However, bistability can appear at the transition between the domain of attractions of S and P : the initial frequency of the two strategies determines which one dominates.
This occurs when S strategies have higher u B than P or F strategies. When actors have lower u B less B responses occur, mimicking the effects of increased environmental variation on learning: the effective learning times t + and t − are reduced. Hence, higher frequencies of S strategies stabilize the S-only equilibrium, and vice versa.

Evolution of u B
We investigate here which specific values of u B the population would evolve to, when simulative strategies are present. We adopt the adaptive dynamics assumptions of small and rare mutations and We can develop the selection gradient for the as-actor (including the as-lone actor and as-observed actor terms) and observer components in terms of the cognitive functions, respectively: ] .
The as-lone actor selection gradient (eq.39) is always negative for non-negative values of u B , since higher u B determines lower values of p B . Note that the as-observed actor selection gradient can be either positive or negative, when it is deleterious to be observed and actors are very efficient, i.e.
for the resident population P D d + > P C c + and P B b < P D d − − P C c − . In this case any action is disadvantageous for an actor, because of the excessively costly interaction with a potential observer.
Thus, a trivial equilibrium can exist if the as-actor selection gradient is negative, in which no action is performed by actors.
We are however interested in cases in which actors perform B responses that can be exploited by observers, and the as-actor selection gradient is negative. We already observed how for S observers, u D Therefore u B generally evolves towards a singular point allowing the discrimination of simulated and direct stimuli.
We can obtain explicit expressions for u * B by assuming the same cognitive functions used in the simplified step-linear model of the main text, with 0 ≤ u D ≤ 1. In this case the selection gradient is: As we can see in eq.41, as long as inhibition does not hinder as-actor responses (u B << 1), the as-actor fitness component is negligible and the selection gradient is non-negative. In particular, for steep activation functions, when u D ≈ u B , the as-observer selection gradient strongly increases. Vice versa, when u B + 1/ρ a approaches 1 and the as-actor behavior is compromised, the selection gradient changes sign.
Here we have to distinguish two different cases. When ρ a is very high, hence α very discriminative, a full inhibition of coordination can be achieved, by evolving u B to any values between u D and reflecting the tradeoff between the as-actor and as-observer payoffs. A singular strategy u * B is Convergence Stable (CS) or an ESS respectively when: In this case, both conditions are always verified since: Regarding coordination, a similar behavior occurs when u D or u B evolve, as a smooth α leads to the evolution of occasional coordination once an S is adopted. We investigate the stability conditions numerically, for a sigmoidal activation function (Pairwise Invasibility Plot in Fig.S5a). Again, we identify a CSS, with non-zero probability of coordination. However, smooth α also determine the presence of an internal repellor, hindering the evolution of u B to higher values when the initial resident trait is lower than the repellor. For a wide range of parameter, larger mutational steps still guarantee the convergence to the CSS. This can be seen in the PIP (Fig.S5a) for the horizontal line passing through the CSS (every mutant with that u B = u * B invades and converge to the CSS). The appearance of this repellor is due the fact that for cases different than the α-step-linear, nothing guarantees that the as-actor component of the fitness is necessarily smaller than the as-observer component when u B << u D . For a sigmoidal α, both the as-observer and as-actor component of the selection gradient approach 0, but the latter could be larger. When the benefit of as-observer responses becomes small compared to the as-actor component of the fitness (very low p − , a >> d − + c − ), a repellor appears, as shown in Fig.S5a. Furthermore, in some instances the as-actor fitness component always exceeds the as-observer one, the CSS disappears and only a repellor exists: at that point u B converges to the minimum possible value, approaching u F P B (Fig.S5b). This occurs for extremely low value of p − (in Fig.S5b p − = 0.01) and high environmental complexity, leading to an abrupt decrease in u B and P D , and a local increase in coordination events (Fig.S5c-f). These results reveal a non-monotonic effect of environmental complexity and p − on the evolution of simulative strategies: even though low p − and high complexity favor S against other strategies, simulation collapses when p − and λ e increase excessively. The effects of decreasing p − can be also observed on P D and P C , both increasing to finally decrease abruptly for extreme combinations of λ e and low p − .

Simultaneous evolution of u B and u D
These aspects can be further investigated by looking at the simultaneous evolution of u D and u B . In this case, a singular strategy is given by the combination of trait u * D and u * B such that the selection gradient for both trait is zero, hence: The system is characterized by a global attractor S * with u * D and u * B , as shown in Fig.S6 (representative of all the parameters tested). We have already shown that a CSS always exists for u D , given any u B .
The same is not true for u B , since for high u D an internal repellor might lead u B to evolve towards small values. However, the internal repellor, which is present when only u B evolves, generally disappears when u D and u B evolve simultaneously, except for extreme values of b. This is due to the fact that u D quickly evolves to lower values when u D > u B . Hence, even for low u B , eventually u B ≃ u D , and the as-observer selection gradient for u B sharply increases.
For the evolution of multiple traits, convergence stability depends on the Jacobian of selection gradient J, in our case: Strong convergence stability [15] guarantees convergence for the canonical deterministic adaptive dynamics, under the assumption that the mutational matrix varies gradually. This criterion is satisfied when J is negative definite. However it is possible to require a stronger criterion, guaranteeing that a singular point is robust to any gradualistic mutational path. When this criterion, defined as absolute convergence stability, is fullfilled, one does not need to take into account correlations between traits.
Absolute convergence stability [16] is determined by the matrix J ′ = J V , where J is the Jacobian of the selection gradient and V is the mutational variance-covariance matrix: When J ′ is negative definite, absolute convergence stability is satisfied. In order to ensure this, either V or J must be symmetric, if J is negative definite. However, absolute convergence stability is a very restrictive criterion, generally holding only for models with simple structure, since there are no reasons to expect J to be symmetric. Therefore, it is possible to adopt strong convergence stability and the ESS criterion to define a CSS in the multiple trait context [15].
Regarding the ESS condition, H is negative definite when its trace is negative and the determinant positive: Similarly for strong convergence stability, the Jacobian has to be negative definite. This is verified when: Due to the complex expression for the invasion fitness, it is not straightforward to obtain simple, readable stability conditions. However it is possible to draw some informal reasoning about the behavior of the system. First we note that with respect to the Hessian all terms but H 22 can be simplified considering only the as-observer fitness π S − , since the as-actor component is independent of The same reasoning can be applied to D u D , and hence to J 11 and J 12 . We also know that both H 11 and J 11 are always negative. We know as well that two possible singular points might exist. In general a singular point S * exists, such that J 22 and H 22 are negative. For S * , the traces of both H and J are negative. Therefore the singular point is either stable or a saddle. Given the tradeoffs described for the single traits, a saddle is unlikely. This is also intuitively clear, by noticing that the mixed derivative terms are usually quite small: in our model an increase in u D above u * D increases the proportion of C responses; however, a parallel increase in u B buffers the risk C, partially inhibiting coordination. We verified this informal reasoning numerically. In Fig.S7 we show a representative part of the parameter space, showing that the largest eigenvalues for H and J are always negative, as long as environmental complexity is not so high that learning cannot occur anymore.

Evolution of u D , u B and temporal inhibition
Besides innate inhibitory mechanisms, organisms adjust their social behavior according to previous experience. As we have seen simulative strategies provide a useful tool to predict other's behavior when other sources of information are scarce, but as more social information is available, more direct strategies become advantageous. Empirically, several neurophysiological studies suggest that simulative circuits, relying on as-actor experience, and mechanisms relying on acquired as observers, are combined in mind-reading [17,1]. For example, empirical studies have shown that empathy is strongly inhibited after interacting again with a defector stooge.
We investigate here the possibility that as-actor and as-observer informations are combined in a single mixed strategy. Specifically, we explore temporally mixed strategies, denoted as SP , initially adopting an S phenotype and later shifting to P , when enough social information learned as an observer becomes available. Such strategies employ a learned inhibitory mechanism, allowing individuals to stop using simulation and shift to a coordination-free strategy when the latter is sufficiently efficient.
We model this by exploring the evolution of the trait u S , defined as shifting time, the expected time at which a focal individual changes strategy type, from S to P : when u S → 0 the strategy converges to a pure P phenotype, while the higher the values of u S the longer a simulative type is used before adopting a P phenotype.
We explore the evolution of such strategies, considering both evolvable traits u D and u B . We assume that an individual adopting SP adopts a P type whenever the payoff of adopting P is larger than by adopting S. Thus, for a focal individual, u S depends on both u B and u D . u S can be found by equating π S and π P : The social fitness component of the switching type SP is simply given by the social fitness of S and P , respectively up to and after t = u S : The selection gradient for SP is: Therefore, the selection gradient for u D is identical to what was analyzed so far, except that when u S is 0 it evolves under random drift. For u B , the as-actor component is also identical to the other strategies. The only difference is in the relative importance of the simulative strategy, now used only in the first part of an individual's lifetime. Therefore u B evolves to slightly lower values than for a pure S strategy type (Fig.S9). Because of the similarity with the previous case, we perform again a sequential analysis of the evolvable traits , first by considering the evolution of a single trait, when the other is fixed, and later looking at the simultaneous evolution of both. However, we only show the quantitative differences, and focus on the different aspects between traits. Fig.S8 and Fig.S9 show respectively the PIP and a numerical exploration of the u D -evolvable and u B evolvable cases, presenting similar patterns to the ones observed for S types. However temporal inhibition allows us to investigate the dynamics of u D and u B without taking into account the frequencies of the different strategy types. We can track the equilibrium values of u S : the higher the value of u S the longer an S phenotype is adopted by SP . Fig.S8a and Fig.S9b show how simulative phenotypes are extensively used over a wide range in parameter space. S phenotypes are always adopted, with the only exception of extreme values of p − and λ e . In fact when the probability of interactions is high and environmental variability are very low, P phenotypes are always more advantageous. At the opposite extreme, when p − is low and λ e very high, a focal individual is essentially a lone actor, almost never playing as an observer. Hence, u B evolves towards minimum values, only optimizing the as-actor behavior. We finally explore the case when u D and u B coevolve (Fig.S10). Also in this case, SP behaves similarly to S, regarding the evolution of u D and u B , for moderate values of p − and λ e . Despite the fact that inhibition could be achieved in multiple ways, small probabilities of coordination generally evolve.
Interestingly, the presence of temporal inhibition determines a total inhibition of S when simulative strategies become suboptimal. In this case, u D evolves neutrally, while the selection gradient leads u B to converge to minimal values. This leads to an evolutionary trap, and better combination of traits cannot evolve for simulative strategies unless mutations with big effects are possible. Therefore the system is bistable, since in this case an optimal combination of u B and u D for simulation cannot be achieved anymore.
The ratio between C and D responses increases for higher environmental variability and lower p − , when simulative phenotypes are adopted for longer time. Interestingly the non-linear relation between p − and λ e and simulation is revealed in the equilibrium values of the traits u D and u B : although simulative strategies are used more frequently u * D and u * B decrease. This is due to the asactor component of the fitness, becoming increasingly more important, and attempting to minimize the loss in B responses.
Regarding stability, the reasoning is analogous to the case of pure S strategies.

Section 3 Kin selection and indirect benefits of coordination
With assortment, the social component of the fitness is obtained similarly to eq. 56: Note that the optimal action D can vary depending on the probability of interacting with individuals with the same genotype, a property defined as assortment and captured by the assortment coefficient r. For example, a well known effect of assortment is to promote cooperation, that can become advantageous because of indirect benefits [18,19]. Thus, it will be useful necessary to describe cases in which the appropriate response is to cooperate. Here we keep the notation used so far to describe the case without assortment, for which we assumed payoffs +d − , −d + , −c − , +c + with all coefficients ≥ 0. Note however that the signs of the payoffs can be changed without loss of generality. Thus later, we will represent cases in which the appropriate response is to cooperate by defining the appropriate response as D r , with payoffs −d − r and d + r . We first explore the evolution of full coordination and describe how the model is affected by different responses D and D r , a case relevant when assortment is high and promotes cooperation; second, we explore the effects of r on the probability of coordination, for cases in which inhibition occurs.

Full coordination
C gives higher fitness than D when: This can be written in a form that is reminiscent of conditions for the evolution of cooperation [18,19]: The numerator represents the costs of coordination for the observer: this depends both on the direct cost of coordination (c − ) and on the missed benefits from appropriate social responses (γ a d − ). The denominator is given by the benefit for the recipient actor: this is the sum of the direct benefit of coordination (c + ) and the potential cost of being predicted and outcompeted socially by the observer (γ a d + ). It can be useful also to think in terms of a cooperative response D r , which has positive fitness when r > d − r /d + r . In this case, C provides a higher fitness than D r when: Full coordination evolves more easily when γ a is low, i.e. when a correct inference of the actor's state does not lead often to response D. This can occur when it is hard to learn the appropriate social response (section 4) or map the internal representation of an actor's action to an actual response. In these cases coordination provides a quick, although suboptimal, solution.

Effects of assortment on the probability of coordination
We The fitness gradient is: We note again that coordination also implies an intrinsic cost, in term of missed D responses (−γ a d − + r d + ). For the singular strategy, we can rearrange eq.63 into: For r = 0 we recover eq.35, where the right hand term is just d − d − +c − . The derivative du * D dr can be calculated by implicit differentiation of the invasion fitness [3]. For clarity we express the invasion , since in our model F does not depend on the resident trait. For a singular point, F (u * D , r) = 0. Hence: Rearranging we obtain: Notice that is a maximum of π S − and a CSS. Therefore in this case: and du * D dr > 0 holds when Substituting equation 64 for u * D , we find that u * D increase with r when The denominator implies that a different effect of r occurs when eq.60 is satisfied, i.e. when r is high and C is the advantageous response. Since in this case an internal singular strategy does not exist we can neglect the denominator, considering only cases in which it is positive. Hence, eq.69 becomes This condition can be easily interpreted in our standard case, when D is beneficial for the observer (d − > 0) and detrimental for the actor (d + < 0): Increasing relatedness leads to an increase in coordination when the relative benefit of an appropriate social response (for the observer versus the actor) is larger than the relative cost of coordination. In nature this condition is likely fulfilled in most cases, since it is for the observer itself that a social response is more relevant, i.e. its payoff has the largest absolute value for the observer. In fact, in many cases a cost for the observed actor might not exist. When the appropriate social response is D r , this condition is reversed and it reads as implying that increasing relatedness leads to an increase in coordination when a proper cooperative response provides a larger benefit to the actor than that determined by coordination. Therefore even when simulative strategies are employed for cooperation, as in empathy, an increase in coordination with r is observed. Note that the probability coordination, although to a minor extent, increases with assortment even if the best response is still defection. Remarkably, relatedness can even decrease the probability of coordination (Fig.S11). We also notice that at high u D values, the sign of the fitness gradient is only determined by the C payoff: if C is rewarding (r > c − /c + ), full coordination is evolutionarily stable. This condition is not mutually exclusive with the existence of an internal singular strategy u * D (eq.60). Hence, the system is possibly bistable (Fig.4). As r or c + /c − increase the internal singular strategy increases its value, finally vanishing: in this case full coordination is the only evolutionarily stable strategy. The stability of full coordination when an internal singular strategy exist is due to the fact that when u D is very high and beyond the activation threshold, small decreases in u D do not reduce significantly the probability of coordination. Nevertheless, they might lead to a reduction in the probability of a correct inference about the actor's state. Clearly, this equilibrium is unstable for larger mutational steps, since C provides a lower fitness than the internal singular strategy.

Section 4 Simulation with as-observer learning
We investigated so far the case of simulative strategies employing only as-actor information, by assuming that an individual has a fixed probability of performing an appropriate social response given an accurate inference about the actor's state i.e. we treated the cognitive function γ a as a fixed parameter. Here we extend our analyses to investigate the case in which a simulative observer learns dr > 0, gray region) and for full coordination to evolve (yellow-red shaded regions) under different r (continous or dashed lines) and different γ a (shades from yellow to red). In the gray regions, when r is low and C is not advantageous overD (region not shaded in red-yellow), an increase in r leads to an increase in u * D and P C . The opposite occurs in white unshaded regions. Above the colored lines indicating combinations or r and γ a value, full coordination evolves and a singular strategy disappears. in time what the appropriate social response is to an actor's action (a +i ). Hence, for simulative strategies, learning occurs in two phases: first an individual learns as an actor, and applies its as-actor network to infer an actor's action (a +i ) on the basis of social cue (s −i ) (Fig.1d, full gray arrow); secondly, when a correct inference is made (â +i ), an observer learns at every social interaction what is the best social response (a −i ) (Fig.1d, second dashed arrow). Hence, the accuracy of social responses increase both with experience obtained as an actor (t + ) and as an observer (t − ). The former allows to map efficientlyŝ −i toâ +i . The latter allows to infer the appropriate social response,â +i toâ −i .
In this model, the advantage of a correct inference about the actor's action is to facilitate the learning of the appropriate social response. This process can be seen as a reduction in the search space of possible social responses, since it is known what the actor's action is. Hence, we treat the cognitive function γ a a learning function l, that to avoid confusion we denote as l a . We reserve instead the symbol γ a for a fixed parameter, describing the reduction of the search space for the appropriate social response, proportional to (1 − γ a ). The parameter γ a has a similar effect to the constant case: when γ a ≃ 1 knowing what the actor will do is perfectly informative about the appropriate social response, when γ a ≃ 0 learning and social responses does not benefit of this information. The learning time for appropriate social responses, given a correct inference about the actor's state is then t τ γ , where: We show here the behavior of this model, in the case where u D evolves (Fig.S12). Simulative strategies still invade in highly variable environment, more easily the lower is p − . However, since now the second step of social inferences depends on social learning, the range of p − values for which they evolve is smaller. In particular, when γ a decreases, fixed strategies can evolve even in highly variable environments because no learning either as-observer or as-actor learning limit the fitness of either strategies relying on learning, P and S (Fig.S12). When including relatedness, the same pattern of the constant γ a is observed, with higher environmental complexity exerting the same effects of lower γ a values, i.e. higher u D evolve in response to relatedness.

Section 5 Agent Based Evolutionary Simulations
We tested the predictions of the deterministic model with an agent-based model, exploring the stochastic effects of a real learning algorithm and noise in perceived stimuli. We considered interactions, payoff structure and cognitive schemes analogous to the one presented along with the deterministic model ( Fig.1 and payoff Table 1; section 1 of the Appendix).

Agent structure
Individuals perceive stimuli as actors or observers (s + or s − ) with probability p + or p − , respectively.
We considered n different stimuli for actors, and a set of n corresponding stimuli for observers. In each turn each agent perceives a signal, represented as a vector of n stimuli intensities, one for each stimulus of the same class, either as-actor or as-observer. One of these stimuli is the leading one, determining the appropriate response. All the others are just noise. We assume the intensities of both leading and noise stimuli to be normally distributed, with mean µ s and variance σ 2 s for the formers, and µ n and σ n for the latters. For the simulations shown in the following figures, we assumed µ s = 0.5, σ s = 0.1, µ n = µ s /10, σ n = σ s /10.
Individuals select a response to a given stimulus representation, using a n × n association matrix M of weights q ij . Specifically, each element q ij is a weight determining the strength of the association between a stimulus representationŝ = i and action representationâ = j. Therefore, after a signal is perceived, an action representation is selected probabilistically according to a softmax action selection function l, that take into account the intensity of the signal and the weights associated to the corresponding actions. In particular, action j is chosen with probability P r(â = j|x 1 q 1j , . . . , x i q ij , . . . , x n q nj ) = e where the weights of a response to a given stimulus i are weighted for the intensity of the relative signal x i . T defines the temperature of the softmax function. As T increases, the more exploratory is the Figure S 12: Competition among strategies in the simulation with as-observer learning case. Invasion domains for S (white), P (grey) and F (black) for different values of γ a . From left to right, γ a equals 0.98, 0.9 and 0.8. On the y-axis p − varies, while on the x-axist changes as a function of λ e (t = 1/(λ d + λ e )). We considered the same model as in Fig.1, with sigmoid activation function and a learning curve as in Table 2. All the other parameters are the same as in Fig.S2.
action selection process, selecting occasionally responses with lower weights, and allowing the algorithm to be flexible and robust to initial errors. When T decreases, the algorithm is more conservative, choosing almost always the action with the highest weight.
Weights are updated accordingly to previous experience and payoffs, so that an agent's responses converge to the appropriate ones in time as learning occurs. When an individual performs an appropriate response to the perceived stimulus, it updates the corresponding weights with the learning rule q i,j (t + 1) = (1 − α i )q i,j (t) + α i r, where α i = α x i ∑ n j x j and α is a constant. The reward r at time t is 1 if the response performed is correct (thus providing a positive payoff), and r = 0 if incorrect. Thus, when an agent perceives a stimulus s = ±i, it chooses probabilistically a response on the base of its previous history of rewards. All the as-actor stimuli s ∈ S + ,e.g. an as-actor stimulus s +i , are associated to different possible actions a ∈ A + , e.g. a +j , according to the weights of a non-social matrix M + (M + |Ŝ + →Â + ). Social stimuli s ∈ S − are instead processed differently by the different mind-reading strategies, organized as • An S-strategy takes advantage of the as-actor experience, in the form of matrix M + , in order to interpret social stimuli. Therefore, it first maps a social stimulus representation to a non-social stimulus representation by rescaling the stimulus intensity by an evolvable scalar factor u D . The simulated as-actor stimulus, taking advantage of M + , is mapped to an as-actor representation.
An activation function α is implemented analogously to the deterministic case. In this case we consider a step function with an evolvable activation threshold u B . We notice that analogously to the deterministic case, P and S strategies differ as they use as-observer or as-actor experience, i.e. M − and M + , respectively. Regarding F strategies we assumed, conservatively against the invasion of S and P , that they are perfectly adapted to one of the experienced environmental states, by fixing the weights of the M − . For small n we let the single weights to evolve independently, obtaining comparable results.
Each individual experiences 500 stimuli, distributed across a number n e of identically distributed environmental states.

Selection scheme
Each simulation occurs in two stages. In a first stage the strategies interact and compete within the same strategy type for 10 generations. In this phase the traits u D and u B evolve to an average value for each strategy, since u D and u B can mutate (independent mutation rates µ(u D ) = µ(u α ) = 0.2).
When a mutation occurs the current value of the trait is added a random number between −0.2 and 0.2. The strategy types are mutually exclusive, so that an individual is either S, F or P . In a second stage, 2000 agents for each strategy start interacting and competing for 25 generations. Selection occurs through a softmax function with temperature T s = 0.1, weighing individuals'total payoff.

Results
Evolutionary simulations with populations characterized by different strategic types, i.e. Vice versa, u B converges to minimal value in F and P strategies. u D is expressed only in simulative strategies, therefore it evolves neutrally in F and P . When S dominates small levels of coordination are sustained (Fig.S15). For the parameters here explored, agents coordinate about 1% of the social interactions.