Nash equilibria in human sensorimotor interactions explained by Q-learning with intrinsic costs

The Nash equilibrium concept has previously been shown to be an important tool to understand human sensorimotor interactions, where different actors vie for minimizing their respective effort while engaging in a multi-agent motor task. However, it is not clear how such equilibria are reached. Here, we compare different reinforcement learning models to human behavior engaged in sensorimotor interactions with haptic feedback based on three classic games, including the prisoner’s dilemma, and the symmetric and asymmetric matching pennies games. We find that a discrete analysis that reduces the continuous sensorimotor interaction to binary choices as in classical matrix games does not allow to distinguish between the different learning algorithms, but that a more detailed continuous analysis with continuous formulations of the learning algorithms and the game-theoretic solutions affords different predictions. In particular, we find that Q-learning with intrinsic costs that disfavor deviations from average behavior explains the observed data best, even though all learning algorithms equally converge to admissible Nash equilibrium solutions. We therefore conclude that it is important to study different learning algorithms for understanding sensorimotor interactions, as such behavior cannot be inferred from a game-theoretic analysis alone, that simply focuses on the Nash equilibrium concept, as different learning algorithms impose preferences on the set of possible equilibrium solutions due to the inherent learning dynamics.

multiple equilibria, or signaling games with Bayesian Nash equilibria to model optimal sensorimotor communication. In general, it was found in these studies that subjects' sensorimotor behavior during haptic interactions was in agreement with the game theoretic predictions of the Nash equilibrium, even though the pen-and-paper versions of some games systematically violate these predictions. This raises the question of how such equilibria are attained, especially since the Nash equilibrium concept itself provides no explanation of how it is reached, especially when there are multiple equivalent equilibria.
In this study we investigate what kind of learning models could explain how humans that are interacting through a haptic sensorimotor coupling reach a Nash equilibrium. The problem of learning in games can be approached within different frameworks, including learning of simple fixed response models like partial best response dynamics for reaching pure Nash equilibria 31,32 , or fictitious play with smoothed best response dynamics for mixed equilibria 33,34 , as well as more sophisticated reinforcement learning models like Q-learning 35,36 , policy gradients 37,38 , minimax Q-learning 39 or Nash Q-learning 40,41 , together with learning models in evolutionary game theory for reaching Nash equilibria (evolutionary stable strategies) through population dynamics 42 . Here we focus on model-free reinforcement learning models to explain subjects' sensorimotor interactions, as in many of the previous experiments [25][26][27] subjects interact only through haptic feedback and cannot otherwise "see" the choices of the other player. In particular, we consider model-free reinforcement learning models like Q-learning and direct policy search methods like policy gradients that exclusively rely on force feedback during learning. We compare Q-learning and policy gradient learning in games with pure and mixed Nash equilibria, including the prisoner's dilemma and two versions of the matching pennies game with symmetric and asymmetric payoffs respectively. While the prisoner's dilemma game has a single pure equilibrium, the sensorimotor version of the matching pennies games has infinitely many mixed Nash equilibria that are theoretically equivalent, and we investigate whether the different learning algorithms introduce additional preferences between these equilibrium strategies based on the inherent learning dynamics and we check how these match up with human learning behavior.

Results
To investigate learning during sensorimotor interactions with pure and mixed Nash equilibria, we designed three continuous sensorimotor games that are variants of the traditional two-player matrix games of the prisoner's dilemma, the asymmetric matching pennies game, and the symmetric matching pennies game (see Fig. 1 and Methods for details). During the experiment two players were sitting next to each other and interacted through a virtual reality system with the handles of two robotic interfaces that were free to move in the horizontal plane (see Fig. 1A). In each trial, players were requested to move their handle to cross a target bar ahead of them, where the lateral position of both handles determined the individual magnitude of a resistive force opposing the forward motion of the handle. Thus, the movement of each player directly impacted on the forces experienced by both players in a continuous fashion. This way, our sensorimotor game differs in three essential aspects from the traditional cognitive versions, in that first, we have an implicit effort cost through haptic coupling, second, subjects could choose from a continuum of actions defining a continuous payoff landscape, where the payoffs in the four corners correspond to the payoffs in the traditional payoff matrices, and third, that subjects are unaware of the structure of the haptic coupling, i.e., they do not have complete information about the payoffs, as they only have access to their own payoffs through force feedback.
Since the payoffs can only be learned from experience, we devised different reinforcement learning models that can emulate subjects' behavior by adapting their choice distributions in a way that avoids punitive forces. As subjects were only communicated their own payoff when making their choices, we compare only model-free reinforcement learning schemes, in particular Q-learning and policy gradient models. We consider two reward conditions, with and without intrinsic costs added to the extrinsically imposed reward defined by the haptic forces, where the intrinsic costs make it more expensive for the learner to change actions abruptly between trials-see Methods for details. We focus on trial-by-trial learning across the 40 trial blocks where games were repeated and neglect within-trial adaptation, as we found that initial and final positions of movement trajectories were close together most of the time. In particular, we found that in approximately 65% of the trials for the prisoner's dilemma, and in 90% and 95% of the trials for the asymmetric and symmetric matching pennies respectively, the subjects' final decision laid within a 3.6cm neighborhood ( 20% of the entire workspace) of their initial position in each trial, and there was no systematic change across the block of trials. We found that the distribution of distances between the initial and final position in each trial did not change significantly when comparing the first 20 trials of each block with the last 20 trials of each block (Kolmogorov-Smirnov test, p > 0.05 for all games). Therefore we concentrate only on the final position of each trial, where the force pay-off was strongest.
Categorical analysis. Previously, we have studied subjects' choices in the sensorimotor game in a binary fashion 25,26 to allow for a direct comparison with the corresponding pen-and-paper 2-by-2 matrix game. To this end, we divide each subject's action space into two halves and categorize actions accordingly-compare Fig. 2. The four quadrants of the combined action space then reflect all possible combinations, for example in the prisoner's dilemma (defect, defect), (defect, cooperate), (cooperate, defect) and (cooperate, cooperate). When comparing the histogram over subjects' choices over the first 10 trials compared to the last 10 trials of each block of 40 trials, it can be clearly seen that behavior adapts and becomes more consistent with the Nash equilibrium solution of the 2-by-2 matrix games for the prisoner's dilemma and the asymmetric matching pennies game. In particular, in the prisoner's dilemma game, subjects increasingly concentrate on the defect/defect solution, even though the Nash solution is never perfectly reached, as other action combinations also occur late in learning. In the asymmetric matching pennies game, player 1 displays the expected asymmetric action distribution, whereas player 2 shows the expected 50:50 behavior. In both games, the entropy of the joint action distribution decreases In the case of the symmetric matching pennies game, the histograms do not reveal any learning or entropy decrease, as expected, because the random Nash strategy coincides with the random initial strategy. We compared the binary Q-learning described in the methods as model 1, and the policy gradient described as model 2, to the human data by exposing both models to the same kind of game sequence that was experienced by subjects. In the simulations, we always applied the same learning model for both players, albeit with playerspecific parameters that were fitted to match subjects' action distributions. In Fig. 2 it can be seen that both kinds of learning models explain subjects' behavior equally well. Also when repeating the same analysis including the intrinsic cost function that discourages deviations from previous positions, the result remains essentially the same. This leaves open the possibility that the categorical analysis may be too coarse to distinguish between these kinds of models. In the following, we therefore study the continuous games in terms of their continuous responses and not just their binary discretization. We compare subjects' behavior in the three different games to two continuous reinforcement learning models, where one is a policy gradient model with a continuous action distribution, and the other one a continuous Q-learning model.

Continuous analysis.
Prisoner's dilemma: Endpoint distributions. In Fig. 3A subjects' final decisions in the continuous prisoner's dilemma game are shown as a scatter plot within the x 1 x 2 -space, where the single pure Nash equilibrium is located in the top-right corner at position (1,1). Over the course of 40 trials subjects' responses increasingly cluster around the Nash equilibrium, even though a considerable spread remains. The difference in endpoint distribution between the first 10 trials in a block and the last 10 trials was highly significant for player 1 and player 2 (KS-test, p < 0.01 for each player). A two-dimensional histogram binning of the experimental scatter plots can be found in Fig. 3B. Figure 3E shows the performance of two coupled continuous policy gradient learners on the same task. The model learners also converge to the Nash solution, but the variability is spread mainly along the boundaries and corners, which is in stark contrast to subjects' behavior. The same is also true if the policy gradient learners are parameterized slightly differently with the logit-normal-see Supplementary Figure 2. Figure 3M shows two coupled continuous Q-learning agents performing the prisoner's dilemma sensorimotor task. The model learners slowly start concentrating probability mass in the quadrant of the Nash equilibrium, but their responses are spread out more uniformly than the subjects' responses, giving in comparison a distribution that is too flat due to the broad exploration. One way to enforce more specific exploration is through the consideration of intrinsic costs, for example by assuming that selecting actions that significantly deviate from the action prior, given by averaging over all trials, is costly 44,45 . Figure 3I and Q show the performance of the two learning models considering such an intrinsic cost. While the behavior of the gradient learner still shows a sharp peak in the corner, the Q-learning model concentrates actions in a way that is more similar to human behavior in the quadrant of the Nash equilibrium, and it even shows the exponential decline in probability that is observed in the human data when moving away from the Nash corner. Accordingly, the Q-learning model with intrinsic costs achieves the most similar two-dimensional histogram to the data, as quantified by the Euclidean distance between the subjects' and the models' histograms-compare Supplementary Table 1.
Prisoner's dilemma: Response frequency adaptation. A statistical measure to assess the effects of learning is the change of the mean endpoint (averaged for both players) for each trial across the block of 40 trials. Figure 3C shows a slow increase from 0.5 to approximately 0.7. The Q-learning model without intrinsic costs shows a flat distribution resulting from the broad exploration (see Fig. 3O). In contrast, both the gradient learning models (see Fig. 3G and K) and the local Q-learning model with intrinsic costs (see Fig. 3S), manage to mimic the experimental responses, with the latter model achieving the lowest mean-squared error compared to the data-compare Supplementary Table 1.
To study the direction of adaptation in the endpoint space, we look at the difference vectors resulting from subtracting the mean endpoints of each mini-block of 10 trials from the preceding mini-block of 10 trials. The  www.nature.com/scientificreports/ arrow plot in Fig. 3D shows that subjects move towards the Nash equilibrium on a straight path, where the step size of learning becomes smaller over time. As can be seen in Fig. 3H and L this adaptation pattern is reproduced by both the gradient learning models as well as the Q-learning model with intrinsic costs (see Fig. 3T). Without the intrinsic costs, the extensive search of the Q-learning agents generates an almost uniform adaptation (see Fig. 3P). In summary, our analysis demonstrates that it is not enough to look at the mean behavior to distinguish between the different models. In fact, when looking at the entire two-dimensional histogram over the continuous action space, it becomes apparent that the Q-learning model with intrinsic costs is the only one that can capture subjects' behavior well, as it is the only model that concentrates actions in the Nash quadrant where the concentration increases gradually with proximity to the Nash solution (compare Fig. 3B,F,J,N,R). Asymmetric matching pennies: endpoint distribution. Figure 4A shows subjects' final decisions in the continuous asymmetric matching pennies game as a scatter plot in the x 1 x 2 -plane, where the set of mixed Nash equilibria corresponds to all joint distributions p(x 1 , x 2 ) where the marginal distribution of player 1 has the expected location E p [x 1 ] = 0.8 and the marginal distribution of player 2 has the expected location E p [x 2 ] = 0.5 . Throughout the course of 40 trials, subjects' responses evolve from a symmetric distribution centered approximately at (0.5, 0.5) to an asymmetric distribution that is tilted for player 1 as expected. The difference in endpoint distribution between the first 10 trials in a block and the last 10 trials was highly significant (KS-test, p < 0.01 ). A two-dimensional histogram binning of the scatter plots can be seen in Fig. 4B. Figure 4E shows the endpoint distribution of two coupled policy gradient learners performing in the continuous asymmetric matching pennies game. As expected, player 1 shifts to an asymmetric behavior, while player 2 remains random, but the model actions are mainly spread along the two opposing boundaries, which singles out a possible Nash solution, but is in stark contrast to subjects' behavior that concentrates in the upper quadrants of the workspace. This preference for extreme responses of the gradient model is even more extreme when including intrinsic costs, which eliminates most of the asymmetry (see Fig. 4I) and also holds for a logit-normal parameterization of the gradient learners (compare Supplementary Figure 3). Figure 4M shows two coupled continuous Q-learning agents performing the same task. In contrast to the gradient learning models, the best-fitting Q-learning models manage to preserve substantial probability mass in the center of the workspace, but this happens at the expense of such a slow learning progress that the resulting action distribution is close to uniform, and therefore also does not capture subjects' behavior well. This changes, however, when including intrinsic costs. Figure 4Q shows two fitted continuous Q-learning agents with intrinsic costs that manage to qualitatively capture the experimentally observed distribution in that player 1 concentrates its probability mass in the top quadrants, whereas player 2 shows more uniform behavior, as expected.
Asymmetric matching pennies: response frequency adaptation. Figure 4C shows the change of the mean endpoint (independently for each player) in each trial across the block of 40 trials. We can observe an increase from 0.5 to approximately 0.8 for player 1, whereas for player 2 the mean value remains constant along all trials around 0.5. This behavior is also observed in Fig. 4D, which shows the average direction of subjects' adaptation. These mean adaptation patterns are reproduced by the gradient learning model (see Fig. 4G,H), however, when including the intrinsic costs, the symmetry between extreme responses causes more random behavior in both players (see Fig. 4K,L), although this is not so much the case in the logit-normal parameterization (compare Supplementary Figure 3). We see also that the Q-learning model without intrinsic costs depicted in Fig. 4O is very slow in learning, which can also be seen from the mean direction of adaptation depicted in Fig. 4P. However, when considering intrinsic costs in the Q-learning model, the change of the mean endpoint (see Fig. 4S) and the average direction of adaptation (see Fig. 4T) match subjects behavior quite well. Even though the Q-learning model with intrinsic costs is initially slightly slower than the human subjects in picking up the equilibrium behavior, taken together with the two-dimensional histogram binning of the scatter plots in Fig. 4B,F,J,N,R the Q-learning model with intrinsic costs provides the best fit to subjects' behavior, confirmed by the lowest Euclidean distance between model and subject histograms-compare Supplementary Table 1.
Symmetric matching pennies: endpoint distribution. While the asymmetric matching pennies game requires adaptation in action frequency, learning in the symmetric game is not expected to show changes in the action frequency, and thus we finally need to look for other signatures of learning. In Fig. 5A subjects final decisions in the continuous matching pennies game are shown as a scatter plot in the x 1 x 2 -plane, where the set of mixed Nash equilibria corresponds to all joint distributions p(x 1 , x 2 ) where the marginal distribution of player 1 and player 2 has the expected location Subjects' responses generally cluster around the center position with mean (0.5, 0.5) in the first block of 10 trials and remain that way through the entire block of 40 trials. A two-dimensional histogram binning of the scatter plots can be seen in Fig. 5B. While subjects' initial behavior can be explained as a result of ignorance, the same behavior is also compatible with the Nash solution, and so the response frequencies remain stable throughout the block of 40 trials. This can be seen from the mean endpoint (averaged for both players) across the block of 40 trials shown in Fig. 5C, and from the average direction of subjects' adaptation depicted in Fig. 5D.
The endpoint distribution of two coupled policy gradient learners performing in the continuous symmetric matching pennies game is shown in Fig. 5E, and the corresponding version including intrinsic costs is shown in Fig. 5I. In both cases the distribution of responses for both players is rather uniform. Figure 5M shows two coupled continuous Q-learning agents performing the same task, and Fig. 5Q the same model including intrinsic costs. Contrarily to the gradient learning models, the Q-learning models manage to preserve substantial probability mass in the center of the workspace.
While all four models capture the effect of stable mean response frequencies-see Fig. 5G,H,K,L,O,P, and S,T for comparison-, the gradient learning models predict a more uniform distribution of responses, including in particular a high amount of extreme responses at the borders of the workspace that were mostly absent in  www.nature.com/scientificreports/ the experiment. In contrast, the Q-learning models cluster in the center of the work space. Moreover, the inclusion of intrinsic costs that punish large deviations from behavior in the previous trial lead to a slightly stronger concentration of responses in the center of the work space in agreement with subject data. Again, based on the two-dimensional histogram binning of the scatter plots in Fig. 5B Symmetric matching pennies: response frequency adaptation. As there is no change in response frequencies, assessing learning in this game is more challenging, even though the difference in endpoint distribution between the first 10 trials in a block and the last 10 trials was significant (KS-test, p > 0.01 ). To study the trial-by-trial adaptation process in more detail, we correlated subjects' positional response from the current trial either to their positional response in the previous trial or the positional response of the other player in the previous trial. The correlations across four batches of ten consecutive trials that make up the blocks of 40 trials are shown in Fig. 6. The positive correlations displayed by the solid lines indicate that subjects had a strong tendency to give the same kind of response in consecutive trials. In contrast, the close-to-zero correlations displayed by the dashed lines indicate that subjects' behavior cannot be predicted from the other player's response in the previous trial. These correlations were stable across the block of 40 trials. The close-to-zero correlations represented by the dashed lines are reproduced by all learning models. However, the solid lines representing the first-order autocorrelation behave quite differently in the models. The gradient learners start out with zero correlation between successive actions and increase this correlation slightly over trials, but always far below the high autocorrelation displayed by subjects. In contrast, the Q-learning models exhibit an autocorrelation comparable in magnitude to the subject data. Although the modeled autocorrelation decreases slightly over time, the continuous Q-learning models again fit the data best.

Discussion
In this study we investigate human multi-agent learning in three different kinds of continuous sensorimotor games based on the three 2 × 2-matrix games of matching pennies, asymmetric matching pennies and the prisoner's dilemma. In these games, subjects were haptically coupled and learned to move in a way that would reduce the experienced haptic coupling force resisting their forward motion, which served as a negative reward during the interaction. To explain subjects' learning behavior across multiple repetitions, we compare their movements to different reinforcement learning models with or without internal costs arising when deviating from the previous action or average behavior. In previous studies 25 , the attainment of Nash equilibria in sensorimotor interactions has been reported, but without investigating the mechanisms of learning. In the current study we found, when simulating different learning models, that it is not enough to study binary discretizations of the continuous games as previously, as this coarse analysis does not reveal fine differences between learning models. However, clear differences between learning models arise when studying the games in their native continuous sensorimotor space. In particular, we found that a continuous Q-learning model with intrinsic action costs best explains subjects' co-adaptation. While we considered binary and continuous models in the paper to highlight the advantage the latter may have compared to the former, there is a gradual spectrum from discrete to continuous models. Naively, discrete models with more than two actions can be used to approximate the continuous case, but then learning is slower, because there are many more independent actions to learn. In the reinforcement learning literature 46 , this problem is remedied with multiple tilings, where each tiling contains a fixed number of tiles that cover the workspace. Having multiple tilings that are shifted with respect to each other introduces neighbourhood relationships where the number of tiles per tiling reflects the width of the neighbourhood. When formulating the Q-learning model in our examples as a discrete tiled agent, we get qualitatively similar results to the continuous agents, however, sometimes there are border effects. Therefore, we decided to use continuous models that more closely reflect the experimental setup, and that can be discretized for visualization.
In all of the games, the continuous Q-learning model trained without additional intrinsic costs tends to spread its probability fairly evenly across the workspace, and shows rather slow learning. The reason is not that Q-learning per se would be too slow to learn, but is rather a consequence of the fitting procedure that determines that the slow learning in that case provides the best fit, as faster learning would produce behavior that deviates www.nature.com/scientificreports/ from subjects' behavior even more. However, we found that introducing an intrinsic cost that punishes large deviations from the previous action could alleviate this problem, allowing for more concentrated learning and making the Q-learning model qualitatively the best fit to human behavior. The intrinsic cost can be interpreted as a simple intrinsic cost for motor effort, that is not considered in the payoff function of the game, or it could be the consequence of an intrinsic motor planning cost that favors similar actions. As all our models were initialized with a uniform distribution, such an intrinsic cost could also be thought of as a non-uniform prior that one could use to initialize the models. An alternative interpretation would be to conceptualize the minimization of the trial-by-trial deviation as minimizing the information difference between each action and the marginal distribution over actions given by the average behavior. Such an implicit information penalization of action policies can be regarded as a form of bounded rational decision-making, where decision-makers trade off between utility and information costs that are required to achieve a certain level of precision [47][48][49] . Such information costs have been previously suggested to model costs of motor planning and abstraction 50,51 . Another form of limited information-processing capability is subjects' limited sensory precision regarding the perception of force, that we have modeled by assuming Gaussian sensory noise with a magnitude roughly corresponding to ten percent of the maximum force 52,53 . The assumption of this sensory noise is vital for all the models, as otherwise the simulations predict very fast convergence and steep learning transients that are not observed in human subjects.
Compared to the Q-learning model, the policy gradient learners preferentially converged to strategies involving actions at the extremes, even though average behavior of these models fitted with game-theoretic predictions. This preference for extremes occurs not only for the particular parameterization that we chose (Kumaraswamy or beta-distribution), but also occurs for other parameterizations like the logit-normal model-compare Supplementary Figures 2-4. While this can only happen when the policy gradient models are initialized uniformly, a non-uniform initialization of these models must assign zero probability to the extremes of the workspace, and therefore trivially avoids this problem. When initialized uniformly, the gradient learners prefer the extreme solutions, as the reward gradient is steepest in this direction. This demonstrates that the choice of learning algorithm imposes a preference on the set of possible equilibrium solutions, that a priori would all be equally acceptable, due to the inherent learning dynamics. We conclude that it is important to study different learning algorithms for understanding sensorimotor interactions, as such behavior cannot be inferred from a game-theoretic analysis alone, that simply focuses on the Nash equilibrium concept. While our study can of course not exclude the possibility that there could be many other learning algorithms that could explain the data equally well or better, it shows also that it is not trivial that the Q-learning with intrinsic costs manages to capture the adaptive behavior of human interactions.
Unlike the abstract concept of the Nash equilibrium solution, reinforcement learning models are deeply rooted in the psychology literature of classical and operant conditioning 46 and thereby open up a mechanistic approach to understanding sensorimotor interactions. In the neurosciences there have been numerous attempts to relate these psychological concepts to neuronal mechanisms like spike-timing dependent plasticity and dopaminergic modulation 54,55 . In human strategic interactions, previous reinforcement learning models have been used to explain, for example, the emergence of dominant strategies [56][57][58] , conditional cooperation 59,60 , learning in extensive form games [61][62][63] , and transfer of reward-predictive representations 64,65 . The interactions in these studies were mostly based on cognitive strategies that participants could develop, as they were made explicitly aware of the meaning of different choice options. Instead, we have focused on haptic interactions with forces that require implicit learning that can be construed as the minimization of reward prediction error in the context of reinforcement learning 66,67 .
The role of reinforcement learning in the context of implicit learning has been previously examined in a number of studies on motor learning [68][69][70] , including physical robot-human interactions 71,72 and group coordination 73 , however, not in the context of game theory and haptic interactions. While our task was simple enough to allow for model-free reinforcement learning, a challenge for the future remains to study reinforcement learning in more complex environments with multiple agents. This may include, for example, studying model-based learning strategies 74 or learning strategies that succeed in the presence of multiple equilibria 26 . Although parallel singleagent Q-learning over multiple trials can lead to successful action coordination in some strategic interactions, game-theoretic learning algorithms like Nash Q-learning make it more likely to reach a joint optimal path in more general games 41 . Whether human players would devise such strategies during sensorimotor interactions remains an open question. In summary, our study adds to a growing body of research harnessing the power of reinforcement learning models to understand human interactions and highlights that a more detailed understanding of learning mechanisms can also contribute to a better understanding of equilibrium behavior.

Methods
Experimental methods. Participants. Sixteen naïve subjects participated in this study and provided written informed consent for participation. The study was approved by the ethics committee of Ulm University. All sixteen participants were undergraduate students that were compensated for their time with an hourly payment of 10 euros, while they played two sensorimotor versions of the matching pennies game. Before the experiment, participants were instructed that they should move a cursor to a target bar while trying to minimize the resistive forces experienced during the movement. Participants were told that they are not allowed to communicate with each other during the experiment. The data of another sixteen subjects was reanalyzed from a previous study 25 , where they performed a sensorimotor version of the prisoner's dilemma under the same kind of experimental conditions presented here in the current study. All methods were carried out in accordance with relevant guidelines and regulations. www.nature.com/scientificreports/ Setup. The experiment was run on two vBOT haptic devices 75 that were connected to a virtual reality environment in which two subjects could control one cursor each by moving a handle in the horizontal plane. This setup provides both subjects separately with a visual feedback of their own cursor position, without showing a direct view of their hand or the hand of the other player-see Fig. 1A. The subjects were haptically coupled by applying individual forces on the two handles representing a negative payoff. Both the forces and the cursors' positions were updated and recorded with a sampling frequency of 1 kHz throughout each trial.
Experimental design. To begin each trial, subjects had to simultaneously locate their cursor on the 15-cm-wide starting bar placed at each bottom side of their respective workspace. Each participant's workspace was constrained by the vBOT simulating solid walls at the extremes left ( −8cm ) and right ( +8cm ) for both players. A beep sound informed players that a valid starting position was chosen and a 15-cm-wide target bar was displayed in front of each player with the same distance that was randomly drawn from the uniform distribution between 5 and 20 cm-see Fig. 1A. Subjects were instructed to move as soon as they heard the beep. Each participant had to hit their respective target bar with their cursor, for which they had to move forward (y-direction) and touch the bar within a maximum time period of 1500 ms. Each trial was finished as soon as both players had hit their respective target bar, which they could touch anywhere along its length. For the analysis, we interpreted this horizontal position of the target crossing as subjects' final choice.
To determine the payoffs of the interaction, each player's cursor was attached to a simulated one-dimensional spring whose equilibrium point was located on the starting bar. This generated a force F i on their hand in the negative y-direction that opposed their forward movement with F i = −K i y i , where y i corresponds to the y-distance of the cursor of player i ∈ {1, 2} from the starting bar. Crucially, the spring constants K i were not constant but variable functions K i (x 1 , Prisoner's dilemma motor game. In the pen-and-paper version of the prisoner's dilemma the 2-by-2 payoff matrices are given by and represent a scenario where two delinquents are caught separately by the police and face the choice of either admitting their crime (cooperating) or denying it (defecting), while being individually interrogated with no communication allowed. For each player it is always best to defect, no matter what the other player is doing, even though cooperation would lead to a more favorable outcome if both players choose to do so. Hence, there is only one pure Nash equilibrium given by defect/defect. In our continuous sensorimotor version of the game, the Nash equilibrium corresponds to a corner of the x 1 x 2 -space-see Fig. 1C.
Matching pennies motor games. In the classic version of the symmetric matching pennies game the 2-by-2 payoff matrices are given by and can be motivated from the scenario of 2 football players during a penalty kick with the two actions left and right. In this case player 2 (e.g. the goalkeeper) will tryto select the matching action (e.g. x 1 = right/x 2 = right ) to stop the ball, while player 1 (e.g. the striker) will try to unmatch by aiming the shot away from player 1 (e.g. x 1 = left/x 2 = right ). Choosing an action deterministically in this game is a bad strategy, since the opponent can exploit predictable behavior. The mixed Nash equilibrium is therefore given by a pair of distributions (p 1 ; p 2 ) with p 1 (left) = p 1 (right) = 0.5 and p 2 (left) = p 2 (right) = 0.5 (see Fig. 1E). In the continuous sensorimotor version of the experiment there are infinitely many mixed Nash equilibria corresponding to the set of all distributions In the 2-by-2 version of the asymmetric matching pennies game the payoff matrices are given by www.nature.com/scientificreports/ which is similar to the symmetric scenario, only that one player now has a higher incentive to choose a particular action (e.g. the striker gets awarded more points when choosing x 2 = left ). The mixed Nash equilibrium is then given by a pair of distributions (p 1 ; p 2 ) with p 1 (left) = 0.8, p 1 (right) = 0.2 and p 2 (left) = p 2 (right) = 0.5 (see Fig. 1D).

Theoretical methods
Computational models. Every model consists of two artificial agents whose parameters are updated over the course of 40 trials. Like in the actual experiment, every agent decides independently from the other agent about an action, observes the reward signal in the form of a punishing force and adapts their strategy. The models can be categorized in binary action models with action space A = {0, 1} , and continuous action models that operate on the action space A = [0, 1] . All models have the same structure: • At the beginning of every trial t every agent i samples an action a t i ∈ A from the distribution P t i . • After both agents have taken their respective action a t i simultaneously, they receive a reward R t i = 1 − F t i , where F t i is calculated based on the spring constant K i from Equation (1), such that F t i = 1 10α K i (a t 1 , a t 2 ) + ε , where ε is a Gaussian random variable with mean zero and standard deviation 0.1 to model the effect of sensory noise.
• When considering intrinsic costs, the reward signal is augmented to . . , 40 in order to punish high trial-by-trial deviations. For all simulations we use κ = 1 2 . • Depending on the action and reward, the agents adapt their strategy P t i → P t+1 i to optimize future rewards.
Binary action models. Binary action models are limited to the two actions 0 and 1. These models can be used for learning in the classic 2 × 2 matrix versions of the different games. For the distribution P t i we chose a softmax-distribution with parameter β i : and the updates for the Q-values are of the following form: where α t i is a learning rate that decreases exponentially over the course of the trials. All Q-values are initialized as 0, such that the agents start the game without any initial knowledge.
Model 2: Gradient Decent. The gradient decent model learns a parameterized policy that can select actions without consulting a value function. The policy used by the binary gradient decent learner is the Bernoulli probabilty distribution with outcomes {0, 1} . An action a is determined by sampling from the policy: where p t i is a parameter that is adapted whenever a player receives the reward R t i for their current action a t i .
with an exponentially decaying learning rate parameter γ that is fitted to the behavior of the human participants. The parameter is initialized as p = 1 2 such that both actions are equally probable for the initial policy. www.nature.com/scientificreports/ Model 3: Gradient Descent. The gradient decent model directly learns a parameterized policy that can select actions without consulting a value function. The parameterized policy used in this study is given by the Kumaraswamy probability distribution on the interval (0, 1). An action a is determined by sampling from the policy:

Continuous action models.
where α t i and β t i are shape-parameters that are adapted whenever a player receives reward or punishment in form of a force for their current action. For an action a t i and reward R t i , the parameters are updated according to the following update equations: with an exponentially decaying learning rate parameter γ that is fitted to the behavior of the human participants. The shape-parameters are initialized as α = β = 1 such that the initial policy for all players is the uniform distribution on (0, 1).

Model 4: Q-Learning with Gaussian basis functions.
This model is an extension of the discrete Q-Learning model to the continuous action space [0, 1]. Its core consists of an action value function with radial basis functions such that the real action value function Q * i (·) is well reflected by a set of weights Q t i that are adapted throughout the trials t = 1, 2, . . . and a set of parameters (c k , σ k ) with k = 1, 2, . . . that are held constant. For the simulation we chose Gaussian basis functions N (c k , σ k ) centered around c k with standard deviation σ k such that action samples can be drawn from a Gaussian mixture model. Accordingly, we can generate an action a t i at time t, by sampling an index j from a categorical distribution with softmax-parameters σ (Q t j ) and softmax function σ(.), and then draw a sample a from the j-th basis function in the Gaussian mixture. Samples that lie outside of the interval [0, 1] are rejected and resampled. Unlike the discrete case we update all weights Q in every trial according to the following rule. If a t and R t are the action taken and the reward at time t, then where α is a learning rate parameter, γ is a discount factor and ϕ c k ,σ k is the clipped Gaussian density function min{1, N (c k , σ k )} . All weights were initialized as 0.
Optimizing parameters. All parameters were optimized such that the simulated agents' behavior fitted the behavior of the average player in a specific player slot (i.e. player 1 or player 2). For the optimization we minimized the absolute error: that occurred over 40 trials between the participants data d and the simulation's results d . The histogram function h maps the data points onto an 8 × 8 matrix where each entry of this matrix contains the normalized amount of data points that lie in the corresponding bin of an 8 × 8 binned histogram. The time index t of d t indicates that all data points from the first trial up until trial t are used to generate the matrix.