Introduction

Nash equilibria are the central solution concept for understanding strategic interactions between different agents1. Crucially, unlike other maximum expected utility decision-making models2,3,4, the Nash equilibrium concept cannot assume a static environment that can be exploited to find the optimal action in a single sweep; rather, it defines a fixed point representing a combination of strategies, typically found by iteration, such that in the end no agent has anything to gain by deviating from their equilibrium behavior. Here, a strategy is conceived as a probability distribution over actions, so that Nash equilibria are in general determined by combinations of probability distributions over actions (mixed Nash equilibria), and only in special cases by combinations of single actions (pure equilibria)5. An example of the first kind is the popular rock-paper-scissors game, which can be simplified to the matching pennies game with two action choices; in either case the mixed Nash equilibrium requires players to randomize their choices uniformly. An example of the second kind is the prisoner’s dilemma6, where both players choose between cooperating and defecting without communication. The pure Nash equilibrium in this game requires both players to defect, because the payoffs are designed in a way that allows for a dominant strategy from the perspective of a single player: it is always better to defect, no matter what the other player is doing.

The Nash equilibrium concept has not only been broadly applied in economic modeling of interacting rational agents, companies and markets7,8,9,10, but also to explain the dynamics of animal conflict11,12, population dynamics including microbial growth13,14, foraging behavior15,16,17,18, the emergence of theory of mind19,20,21,22, and even monkeys playing rock-paper-scissors23,24. Recently, Nash equilibria have also been proposed as a concept to understand human sensorimotor interactions25,26,27,28,29. In these studies human dyads are typically coupled haptically30 and experience physical forces that can be regarded as payoffs in a sensorimotor game. By designing the force payoffs appropriately as a function of subjects’ actions, these sensorimotor games can be made to correspond to classic pen-and-paper games like the prisoner’s dilemma with a single pure equilibrium, coordination games with multiple equilibria, or signaling games with Bayesian Nash equilibria to model optimal sensorimotor communication. In general, it was found in these studies that subjects’ sensorimotor behavior during haptic interactions was in agreement with the game theoretic predictions of the Nash equilibrium, even though behavior in the pen-and-paper versions of some games systematically violates these predictions. This raises the question of how such equilibria are attained, since the Nash equilibrium concept itself provides no explanation of how an equilibrium is reached, especially when there are multiple equivalent equilibria.

In this study we investigate what kind of learning models could explain how humans interacting through a haptic sensorimotor coupling reach a Nash equilibrium. The problem of learning in games can be approached within different frameworks, including learning of simple fixed response models like partial best response dynamics for reaching pure Nash equilibria31,32, or fictitious play with smoothed best response dynamics for mixed equilibria33,34, as well as more sophisticated reinforcement learning models like Q-learning35,36, policy gradients37,38, minimax Q-learning39 or Nash Q-learning40,41, together with learning models in evolutionary game theory for reaching Nash equilibria (evolutionary stable strategies) through population dynamics42. Here we focus on model-free reinforcement learning models to explain subjects’ sensorimotor interactions, as in many of the previous experiments25,26,27 subjects interact only through haptic feedback and cannot otherwise “see” the choices of the other player. In particular, we consider model-free reinforcement learning models like Q-learning and direct policy search methods like policy gradients that exclusively rely on force feedback during learning. We compare Q-learning and policy gradient learning in games with pure and mixed Nash equilibria, including the prisoner's dilemma and two versions of the matching pennies game with symmetric and asymmetric payoffs, respectively. While the prisoner's dilemma game has a single pure equilibrium, the sensorimotor version of the matching pennies games has infinitely many mixed Nash equilibria that are theoretically equivalent. We therefore investigate whether the different learning algorithms introduce, through their inherent learning dynamics, additional preferences between these equilibrium strategies, and we check how these preferences match up with human learning behavior.

Figure 1

Experimental setup. (A) Sensorimotor game. Subjects were sitting next to each other without communication, and interacted through two handles that generated a force on their hand opposing their forward movement. (B) Payoff matrices for each one of the 3 games. The 2-by-2 payoff matrices determine the boundary values of the force payoffs at the extremes of the \(x_1 x_2\)-space for both players (left). Interpolation of the payoffs defining a continuous payoff landscape for a continuum of actions (right). (C) Subjects’ results in the continuous prisoner's dilemma game. Final decisions shown as a scatter plot in the \(x_1x_2\)-space (left), where the single pure Nash equilibrium is located in the top-right corner at (1,1). The same responses binned into an 8-by-8 histogram (middle), and the corresponding categorical representation (right). (D) Subjects’ results in the continuous asymmetric matching pennies game. Final decisions (left), where the mixed Nash equilibrium strategies have expectation (0.8, 0.5). The same responses binned into an 8-by-8 histogram (middle), and the corresponding categorical representation (right). (E) Subjects’ results in the continuous symmetric matching pennies game. Final decisions (left), where the mixed Nash equilibrium strategies have expectation (0.5, 0.5). The same responses binned into an 8-by-8 histogram (middle), and the corresponding categorical representation (right). Created using MATLAB R2021a (https://www.mathworks.com)43.

Results

To investigate learning during sensorimotor interactions with pure and mixed Nash equilibria, we designed three continuous sensorimotor games that are variants of the traditional two-player matrix games of the prisoner’s dilemma, the asymmetric matching pennies game, and the symmetric matching pennies game (see Fig. 1 and Methods for details). During the experiment two players were sitting next to each other and interacted through a virtual reality system with the handles of two robotic interfaces that were free to move in the horizontal plane (see Fig. 1A). In each trial, players were requested to move their handle to cross a target bar ahead of them, where the lateral position of both handles determined the individual magnitude of a resistive force opposing the forward motion of the handle. Thus, the movement of each player directly impacted the forces experienced by both players in a continuous fashion. Our sensorimotor game therefore differs in three essential aspects from the traditional cognitive versions: first, there is an implicit effort cost through the haptic coupling; second, subjects could choose from a continuum of actions defining a continuous payoff landscape, where the payoffs in the four corners correspond to the payoffs of the traditional payoff matrices; and third, subjects are unaware of the structure of the haptic coupling, i.e., they do not have complete information about the payoffs, as they only have access to their own payoffs through force feedback.

Since the payoffs can only be learned from experience, we devised different reinforcement learning models that can emulate subjects’ behavior by adapting their choice distributions in a way that avoids punitive forces. As subjects only received their own payoff when making their choices, we compare only model-free reinforcement learning schemes, in particular Q-learning and policy gradient models. We consider two reward conditions, with and without intrinsic costs added to the extrinsically imposed reward defined by the haptic forces, where the intrinsic costs make it more expensive for the learner to change actions abruptly between trials—see Methods for details. We focus on trial-by-trial learning across the blocks of 40 trials in which games were repeated and neglect within-trial adaptation, as we found that initial and final positions of movement trajectories were close together most of the time. In particular, we found that in approximately \(65\%\) of the trials for the prisoner’s dilemma, and in \(90\%\) and \(95\%\) of the trials for the asymmetric and symmetric matching pennies respectively, subjects’ final decisions lay within a 3.6 cm neighborhood (\(20\%\) of the entire workspace) of their initial position in each trial, and there was no systematic change across the block of trials. We found that the distribution of distances between the initial and final position in each trial did not change significantly when comparing the first 20 trials of each block with the last 20 trials of each block (Kolmogorov-Smirnov test, \(p>0.05\) for all games). Therefore we concentrate only on the final position of each trial, where the force payoff was strongest.
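For illustration, the within-trial movement check described above can be sketched in Python as follows (this is not the authors' analysis code; the 3.6 cm threshold and the two-sample Kolmogorov-Smirnov comparison of block halves follow the description in the text, while array and function names are assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

def within_trial_drift_check(initial_x, final_x, threshold_cm=3.6):
    """initial_x, final_x: lateral positions (cm) at movement start and end, one entry per trial."""
    dist = np.abs(np.asarray(final_x) - np.asarray(initial_x))
    frac_near = np.mean(dist <= threshold_cm)        # fraction of trials ending close to where they started
    half = len(dist) // 2
    stat, p = ks_2samp(dist[:half], dist[half:])     # compare first vs. second half of the block
    return frac_near, stat, p
```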

Figure 2

Categorical analysis. Final decisions in the \(x_1 x_2\)-space categorized into two halves for the experimental data (left) and, from left to right, the results of the binary gradient descent, binary gradient descent with prior cost, binary Q-learning, and binary Q-learning with prior cost. We show the first 10 trials compared to the last 10 trials of each block of 40 trials. (A) Prisoner’s dilemma. (B) Asymmetric matching pennies. (C) Matching pennies. Created using MATLAB R2021a (https://www.mathworks.com)43.

Categorical analysis

Previously, we have studied subjects’ choices in the sensorimotor game in a binary fashion25,26 to allow for a direct comparison with the corresponding pen-and-paper 2-by-2 matrix game. To this end, we divide each subject’s action space into two halves and categorize actions accordingly—compare Fig. 2. The four quadrants of the combined action space then reflect all possible combinations, for example in the prisoner’s dilemma (defect/defect), (defect/cooperate), (cooperate/defect) and (cooperate/cooperate). When comparing the histogram of subjects’ choices in the first 10 trials with that in the last 10 trials of each block of 40 trials, it can clearly be seen that behavior adapts and becomes more consistent with the Nash equilibrium solution of the 2-by-2 matrix games for the prisoner’s dilemma and the asymmetric matching pennies game. In particular, in the prisoner’s dilemma game, subjects increasingly concentrate on the defect/defect solution, even though the Nash solution is never perfectly reached, as other action combinations also occur late in learning. In the asymmetric matching pennies game, player 1 displays the expected asymmetric action distribution, whereas player 2 shows the expected 50:50 behavior. In both games, the entropy of the joint action distribution decreases significantly over the block of 40 trials—see Supplementary Fig. 1. In the case of the symmetric matching pennies game, the histograms do not reveal any learning or entropy decrease, as expected, because the random Nash strategy coincides with the random initial strategy.
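As a minimal illustration of this binarization, the following Python sketch (not the authors' code; the midpoint boundary and the use of Shannon entropy in bits are assumptions consistent with the description) builds the 2-by-2 joint histogram of both players' categorized choices and computes its entropy:

```python
import numpy as np

def joint_quadrant_entropy(x1, x2, boundary=0.5):
    """Binarize both players' normalized actions and return the 2x2 joint
    histogram together with its Shannon entropy (in bits)."""
    a1 = (np.asarray(x1) >= boundary).astype(int)
    a2 = (np.asarray(x2) >= boundary).astype(int)
    counts = np.zeros((2, 2))
    for i, j in zip(a1, a2):
        counts[i, j] += 1
    p = counts / counts.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return counts, entropy
```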

We compared the binary Q-learning model (Model 1 in the Methods) and the binary policy gradient model (Model 2) to the human data by exposing both models to the same kind of game sequence that was experienced by subjects. In the simulations, we always applied the same learning model for both players, albeit with player-specific parameters that were fitted to match subjects’ action distributions. In Fig. 2 it can be seen that both kinds of learning models explain subjects’ behavior equally well. When repeating the same analysis with the intrinsic cost function that discourages deviations from previous positions, the result remains essentially the same. This leaves open the possibility that the categorical analysis may be too coarse to distinguish between these kinds of models. In the following, we therefore study the continuous games in terms of their continuous responses and not just their binary discretization. We compare subjects’ behavior in the three different games to two continuous reinforcement learning models, where one is a policy gradient model with a continuous action distribution, and the other one a continuous Q-learning model.

Continuous analysis

Prisoner’s dilemma: Endpoint distributions. In Fig. 3A subjects’ final decisions in the continuous prisoner’s dilemma game are shown as a scatter plot within the \(x_1x_2\)-space, where the single pure Nash equilibrium is located in the top-right corner at position (1, 1). Over the course of 40 trials subjects’ responses increasingly cluster around the Nash equilibrium, even though a considerable spread remains. The difference in endpoint distribution between the first 10 trials in a block and the last 10 trials was highly significant for player 1 and player 2 (KS-test, \(p<0.01\) for each player). A two-dimensional histogram binning of the experimental scatter plots can be found in Fig. 3B.

Figure 3E shows the performance of two coupled continuous policy gradient learners on the same task. The model learners also converge to the Nash solution, but the variability is spread mainly along the boundaries and corners, which is in stark contrast to subjects’ behavior. The same is also true if the policy gradient learners are parameterized slightly differently with the logit-normal—see Supplementary Figure 2. Figure 3M shows two coupled continuous Q-learning agents performing the prisoner’s dilemma sensorimotor task. The model learners slowly start concentrating probability mass in the quadrant of the Nash equilibrium, but their responses are spread out more uniformly than the subjects’ responses, giving in comparison a distribution that is too flat due to the broad exploration. One way to enforce more specific exploration is through the consideration of intrinsic costs, for example by assuming that selecting actions that significantly deviate from the action prior, given by averaging over all trials, is costly44,45. Figure 3I and Q show the performance of the two learning models considering such an intrinsic cost. While the behavior of the gradient learner still shows a sharp peak in the corner, the Q-learning model concentrates actions in a way that is more similar to human behavior in the quadrant of the Nash equilibrium, and it even shows the exponential decline in probability that is observed in the human data when moving away from the Nash corner. Accordingly, the Q-learning model with intrinsic costs achieves the most similar two-dimensional histogram to the data, as quantified by the Euclidean distance between the subjects’ and the models’ histograms—compare Supplementary Table 1.

Prisoner’s dilemma: Response frequency adaptation. A statistical measure to assess the effects of learning is the change of the mean endpoint (averaged for both players) for each trial across the block of 40 trials. Figure 3C shows a slow increase from 0.5 to approximately 0.7. The Q-learning model without intrinsic costs shows a flat distribution resulting from the broad exploration (see Fig. 3O). In contrast, both the gradient learning models (see Fig. 3G and K) and the local Q-learning model with intrinsic costs (see Fig. 3S), manage to mimic the experimental responses, with the latter model achieving the lowest mean-squared error compared to the data—compare Supplementary Table 1.

To study the direction of adaptation in the endpoint space, we look at the difference vectors obtained by subtracting the mean endpoint of each mini-block of 10 trials from that of the following mini-block of 10 trials. The arrow plot in Fig. 3D shows that subjects move towards the Nash equilibrium on a straight path, where the step size of learning becomes smaller over time. As can be seen in Fig. 3H and L, this adaptation pattern is reproduced by both gradient learning models, as well as by the Q-learning model with intrinsic costs (see Fig. 3T). Without the intrinsic costs, the extensive search of the Q-learning agents generates an almost uniform adaptation (see Fig. 3P). In summary, our analysis demonstrates that it is not enough to look at the mean behavior to distinguish between the different models. In fact, when looking at the entire two-dimensional histogram over the continuous action space, it becomes apparent that the Q-learning model with intrinsic costs is the only one that can capture subjects’ behavior well, as it is the only model that concentrates actions in the Nash quadrant where the concentration increases gradually with proximity to the Nash solution (compare Fig. 3B,F,J,N,R).
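These adaptation vectors can be computed with a few lines of Python (a sketch under the assumption that each player's endpoints over the 40 trials of a block are stored as arrays; names are illustrative):

```python
import numpy as np

def adaptation_vectors(x1, x2, miniblock=10):
    """Mean endpoint per mini-block of trials and the difference vectors
    between consecutive mini-blocks (the arrows in the adaptation plots)."""
    pts = np.stack([np.asarray(x1), np.asarray(x2)], axis=1)   # shape (n_trials, 2)
    n_blocks = len(pts) // miniblock
    means = np.array([pts[k * miniblock:(k + 1) * miniblock].mean(axis=0)
                      for k in range(n_blocks)])
    arrows = np.diff(means, axis=0)                            # one arrow per mini-block transition
    return means, arrows
```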

Figure 3

Prisoner’s dilemma. (A),(E),(I),(M), and (Q) show scatter plots of final decisions in the \(x_1 x_2\)-plane, where subjects’ actions are expected to cluster around the single pure Nash equilibrium located in the top-right corner at position (1,1). (B),(F),(J),(N), and (R) show two-dimensional histograms binning the experimental scatter plots. (C),(G),(K),(O), and (S) represent the change of the mean endpoints (averaged for both players) for each trial across the block of 40 trials. (D),(H),(L),(P), and (T) show the direction of adaptation in the endpoint space. The experimental data is shown at the top, the four continuous models are shown below. Created using MATLAB R2021a (https://www.mathworks.com)43.

Asymmetric matching pennies: endpoint distribution. Figure 4A shows subjects’ final decisions in the continuous asymmetric matching pennies game as a scatter plot in the \(x_1x_2\)-plane, where the set of mixed Nash equilibria corresponds to all joint distributions \(p(x_1,x_2)\) where the marginal distribution of player 1 has the expected location \(\mathbb {E}_{p}[x_1]=0.8\) and the marginal distribution of player 2 has the expected location \(\mathbb {E}_{p}[x_2]=0.5\). Throughout the course of 40 trials, subjects’ responses evolve from a symmetric distribution centered approximately at (0.5, 0.5) to an asymmetric distribution that is tilted for player 1 as expected. The difference in endpoint distribution between the first 10 trials in a block and the last 10 trials was highly significant (KS-test, \(p<0.01\)). A two-dimensional histogram binning of the scatter plots can be seen in Fig. 4B.

Figure 4E shows the endpoint distribution of two coupled policy gradient learners performing in the continuous asymmetric matching pennies game. As expected, player 1 shifts to an asymmetric behavior, while player 2 remains random, but the model actions are mainly spread along the two opposing boundaries, which singles out a possible Nash solution but is in stark contrast to subjects’ behavior, which concentrates in the upper quadrants of the workspace. This preference for extreme responses of the gradient model is even more pronounced when including intrinsic costs, which eliminates most of the asymmetry (see Fig. 4I), and it also holds for a logit-normal parameterization of the gradient learners (compare Supplementary Figure 3).

Figure 4M shows two coupled continuous Q-learning agents performing the same task. In contrast to the gradient learning models, the best-fitting Q-learning models manage to preserve substantial probability mass in the center of the workspace, but this comes at the expense of such slow learning progress that the resulting action distribution is close to uniform, and therefore it also does not capture subjects’ behavior well. This changes, however, when including intrinsic costs. Figure 4Q shows two fitted continuous Q-learning agents with intrinsic costs that manage to qualitatively capture the experimentally observed distribution in that player 1 concentrates its probability mass in the top quadrants, whereas player 2 shows more uniform behavior, as expected.

Asymmetric matching pennies: response frequency adaptation. Figure 4C shows the change of the mean endpoint (independently for each player) in each trial across the block of 40 trials. We can observe an increase from 0.5 to approximately 0.8 for player 1, whereas for player 2 the mean value remains constant around 0.5 across all trials. This behavior is also observed in Fig. 4D, which shows the average direction of subjects’ adaptation. These mean adaptation patterns are reproduced by the gradient learning model (see Fig. 4G,H); however, when including the intrinsic costs, the symmetry between extreme responses causes more random behavior in both players (see Fig. 4K,L), although this is not so much the case in the logit-normal parameterization (compare Supplementary Figure 3). We also see that the Q-learning model without intrinsic costs depicted in Fig. 4O is very slow in learning, which can also be seen from the mean direction of adaptation depicted in Fig. 4P. However, when considering intrinsic costs in the Q-learning model, the change of the mean endpoint (see Fig. 4S) and the average direction of adaptation (see Fig. 4T) match subjects’ behavior quite well. Even though the Q-learning model with intrinsic costs is initially slightly slower than the human subjects in picking up the equilibrium behavior, taken together with the two-dimensional histogram binning of the scatter plots in Fig. 4B,F,J,N,R, the Q-learning model with intrinsic costs provides the best fit to subjects’ behavior, confirmed by the lowest Euclidean distance between model and subject histograms—compare Supplementary Table 1.

Figure 4

Asymmetric matching pennies. (A),(E),(I),(M), and (Q) show final decisions as a scatter plot in the \(x_1 x_2\)-plane, where subjects’ actions are expected to cluster in the top quadrants. (B),(F),(J),(N), and (R) show a two-dimensional histogram binning of the experimental scatter plots. (C),(G),(K),(O), and (S) present the change of the mean endpoint (independently for each player) for each trial across the block of 40 trials. (D),(H),(L),(P), and (T) show the direction of adaptation in the endpoint space. The experimental data is shown at the top, the four continuous models are below. Created using MATLAB R2021a (https://www.mathworks.com)43.

Symmetric matching pennies: endpoint distribution

While the asymmetric matching pennies game requires adaptation in action frequency, learning in the symmetric game is not expected to show changes in the action frequency, and thus we finally need to look for other signatures of learning. In Fig. 5A subjects’ final decisions in the continuous matching pennies game are shown as a scatter plot in the \(x_1x_2\)-plane, where the set of mixed Nash equilibria corresponds to all joint distributions \(p(x_1,x_2)\) in which the marginal distributions of player 1 and player 2 have the expected locations \(\mathbb {E}_{p}[x_1]=\mathbb {E}_{p}[x_2]=0.5\). Subjects’ responses generally cluster around the center position with mean (0.5, 0.5) in the first 10 trials and remain that way through the entire block of 40 trials. A two-dimensional histogram binning of the scatter plots can be seen in Fig. 5B. While subjects’ initial behavior can be explained as a result of ignorance, the same behavior is also compatible with the Nash solution, and so the response frequencies remain stable throughout the block of 40 trials. This can be seen from the mean endpoint (averaged for both players) across the block of 40 trials shown in Fig. 5C, and from the average direction of subjects’ adaptation depicted in Fig. 5D.

The endpoint distribution of two coupled policy gradient learners performing in the continuous symmetric matching pennies game is shown in Fig. 5E, and the corresponding version including intrinsic costs is shown in Fig. 5I. In both cases the distribution of responses for both players is rather uniform. Figure 5M shows two coupled continuous Q-learning agents performing the same task, and Fig. 5Q the same model including intrinsic costs. In contrast to the gradient learning models, the Q-learning models manage to preserve substantial probability mass in the center of the workspace.

While all four models capture the effect of stable mean response frequencies—see Fig. 5G,H,K,L,O,P, and S,T for comparison—the gradient learning models predict a more uniform distribution of responses, including in particular a high amount of extreme responses at the borders of the workspace that were mostly absent in the experiment. In contrast, the Q-learning models cluster in the center of the workspace. Moreover, the inclusion of intrinsic costs that punish large deviations from behavior in the previous trial leads to a slightly stronger concentration of responses in the center of the workspace, in agreement with the subject data. Again, based on the two-dimensional histogram binning of the scatter plots in Fig. 5B,F,J,N,R, the Q-learning model with intrinsic costs achieves the best fit to subjects’ behavior, confirmed by the lowest Euclidean distance between the binned model and subject histograms—compare Supplementary Table 1.

Symmetric matching pennies: response frequency adaptation

As there is no change in response frequencies, assessing learning in this game is more challenging, even though the difference in endpoint distribution between the first 10 trials in a block and the last 10 trials was significant (KS-test, \(p>0.01\)). To study the trial-by-trial adaptation process in more detail, we correlated subjects’ positional response from the current trial either with their own positional response in the previous trial or with the positional response of the other player in the previous trial. The correlations across the four batches of ten consecutive trials that make up the blocks of 40 trials are shown in Fig. 6. The positive correlations displayed by the solid lines indicate that subjects had a strong tendency to give the same kind of response in consecutive trials. In contrast, the close-to-zero correlations displayed by the dashed lines indicate that subjects’ behavior cannot be predicted from the other player’s response in the previous trial. These correlations were stable across the block of 40 trials. The close-to-zero correlations represented by the dashed lines are reproduced by all learning models. However, the solid lines representing the first-order autocorrelation behave quite differently across the models. The gradient learners start out with zero correlation between successive actions and increase this correlation slightly over trials, but it always remains far below the high autocorrelation displayed by subjects. In contrast, the Q-learning models exhibit an autocorrelation comparable in magnitude to the subject data. Although the modeled autocorrelation decreases slightly over time, the continuous Q-learning models again fit the data best.
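These lag-1 correlations can be computed per batch of trials with the following Python sketch (not the authors' code; the batch size of ten trials follows the description above, while the use of Pearson correlation and the array names are assumptions):

```python
import numpy as np

def lagged_correlations(x_self, x_other, batch=10):
    """Per batch of trials: correlation of a player's response with their own
    previous response (auto) and with the other player's previous response (cross)."""
    x_self, x_other = np.asarray(x_self), np.asarray(x_other)
    auto, cross = [], []
    for start in range(0, len(x_self) - batch + 1, batch):
        cur = x_self[start + 1:start + batch]
        prev_self = x_self[start:start + batch - 1]
        prev_other = x_other[start:start + batch - 1]
        auto.append(np.corrcoef(cur, prev_self)[0, 1])
        cross.append(np.corrcoef(cur, prev_other)[0, 1])
    return np.array(auto), np.array(cross)
```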

Figure 5

Symmetric matching pennies. (A),(E),(I),(M), and (Q) show final decisions as a scatter plot in the \(x_1 x_2\)-plane, where subjects’ actions are expected to cluster around the center of the workspace. (B),(F),(J),(N), and (R) show a histogram binning of the experimental scatter plots. (C),(G),(K),(O), and (S) present the change of the mean endpoint (averaged for both players) for each trial across the block of 40 trials. (D),(H),(L),(P), and (T) show the direction of adaptation in the endpoint space. The experimental data is shown at the top, the four continuous models are below. Created using MATLAB R2021a (https://www.mathworks.com)43.

Figure 6

Autocorrelation in the symmetric matching pennies. The solid lines show the autocorrelation between actions from the current and the previous trial for both players. The dashed lines indicate cross-correlations between one player’s action in the current trial and the other player’s action in the previous trial. From left to right: experimental data, policy gradient, policy gradient with intrinsic costs, Q-learning, Q-learning with intrinsic costs. Created using MATLAB R2021a (https://www.mathworks.com)43.

Discussion

In this study we investigate human multi-agent learning in three different kinds of continuous sensorimotor games based on the three \(2\times 2\)-matrix games of matching pennies, asymmetric matching pennies and the prisoner’s dilemma. In these games, subjects were haptically coupled and learned to move in a way that would reduce the experienced haptic coupling force resisting their forward motion, which served as a negative reward during the interaction. To explain subjects’ learning behavior across multiple repetitions, we compare their movements to different reinforcement learning models with or without intrinsic costs that arise when deviating from the previous action or from average behavior. In previous studies25, the attainment of Nash equilibria in sensorimotor interactions has been reported, but without investigating the mechanisms of learning. In the current study we found, when simulating different learning models, that it is not enough to study binary discretizations of the continuous games as done previously, because this coarse analysis does not reveal fine differences between learning models. However, clear differences between learning models arise when studying the games in their native continuous sensorimotor space. In particular, we found that a continuous Q-learning model with intrinsic action costs best explains subjects’ co-adaptation.

While we considered binary and continuous models in the paper to highlight the advantage the latter may have compared to the former, there is a gradual spectrum from discrete to continuous models. Naively, discrete models with more than two actions can be used to approximate the continuous case, but then learning is slower, because there are many more independent actions to learn. In the reinforcement learning literature46, this problem is remedied with multiple tilings, where each tiling contains a fixed number of tiles that cover the workspace. Having multiple tilings that are shifted with respect to each other introduces neighborhood relationships, where the number of tiles per tiling reflects the width of the neighborhood. When formulating the Q-learning model in our examples as a discrete tiled agent, we get qualitatively similar results to the continuous agents; however, sometimes there are border effects. Therefore, we decided to use continuous models that more closely reflect the experimental setup, and that can be discretized for visualization.
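For readers unfamiliar with tile coding, the following Python sketch illustrates the general technique from the reinforcement learning literature (not the tiled agent used in our simulations; the number of tilings and tiles are arbitrary choices): several shifted tilings turn a one-dimensional action into a coarse-coded feature vector in which nearby actions share active tiles.

```python
import numpy as np

def tile_features(a, n_tilings=4, tiles_per_tiling=8):
    """Indicator features for an action a in [0, 1] under several tilings that
    are shifted against each other (coarse coding of the workspace)."""
    features = np.zeros((n_tilings, tiles_per_tiling))
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_tiling)   # shift each tiling by a fraction of a tile
        idx = int(np.clip((a + offset) * tiles_per_tiling, 0, tiles_per_tiling - 1))
        features[t, idx] = 1.0
    return features.ravel()
```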

In all of the games, the continuous Q-learning model trained without additional intrinsic costs tends to spread its probability fairly evenly across the workspace, and shows rather slow learning. The reason is not that Q-learning per se would be too slow to learn, but rather a consequence of the fitting procedure, which determines that slow learning in this case provides the best fit, as faster learning would produce behavior that deviates from subjects’ behavior even more. However, we found that introducing an intrinsic cost that punishes large deviations from the previous action could alleviate this problem, allowing for more concentrated learning and making the Q-learning model qualitatively the best fit to human behavior. The intrinsic cost can be interpreted as a simple cost for motor effort that is not considered in the payoff function of the game, or it could be the consequence of an intrinsic motor planning cost that favors similar actions. As all our models were initialized with a uniform distribution, such an intrinsic cost could also be thought of as a non-uniform prior that one could use to initialize the models.

An alternative interpretation would be to conceptualize the minimization of the trial-by-trial deviation as minimizing the information difference between each action and the marginal distribution over actions given by the average behavior. Such an implicit information penalization of action policies can be regarded as a form of bounded-rational decision-making, where decision-makers trade off utility against the information costs that are required to achieve a certain level of precision47,48,49. Such information costs have been previously suggested to model costs of motor planning and abstraction50,51. Another form of limited information-processing capability is subjects’ limited sensory precision regarding the perception of force, which we have modeled by assuming Gaussian sensory noise with a magnitude roughly corresponding to ten percent of the maximum force52,53. The assumption of this sensory noise is vital for all the models, as otherwise the simulations predict very fast convergence and steep learning transients that are not observed in human subjects.

Compared to the Q-learning model, the policy gradient learners preferentially converged to strategies involving actions at the extremes, even though the average behavior of these models was consistent with game-theoretic predictions. This preference for extremes occurs not only for the particular parameterization that we chose (Kumaraswamy or beta-distribution), but also for other parameterizations like the logit-normal model—compare Supplementary Figures 2-4. This can only happen when the policy gradient models are initialized uniformly, as a non-uniform initialization of these models must assign zero probability to the extremes of the workspace and therefore trivially avoids this problem. When initialized uniformly, the gradient learners prefer the extreme solutions, as the reward gradient is steepest in this direction. This demonstrates that the choice of learning algorithm, through its inherent learning dynamics, imposes a preference on the set of possible equilibrium solutions that would a priori all be equally acceptable. We conclude that it is important to study different learning algorithms for understanding sensorimotor interactions, as such behavior cannot be inferred from a game-theoretic analysis alone, which simply focuses on the Nash equilibrium concept. While our study can of course not exclude the possibility that there could be many other learning algorithms that could explain the data equally well or better, it also shows that it is not trivial that Q-learning with intrinsic costs manages to capture the adaptive behavior of human interactions.

Unlike the abstract concept of the Nash equilibrium solution, reinforcement learning models are deeply rooted in the psychology literature of classical and operant conditioning46 and thereby open up a mechanistic approach to understanding sensorimotor interactions. In the neurosciences there have been numerous attempts to relate these psychological concepts to neuronal mechanisms like spike-timing dependent plasticity and dopaminergic modulation54,55. In human strategic interactions, previous reinforcement learning models have been used to explain, for example, the emergence of dominant strategies56,57,58, conditional cooperation59,60, learning in extensive form games61,62,63, and transfer of reward-predictive representations64,65. The interactions in these studies were mostly based on cognitive strategies that participants could develop, as they were made explicitly aware of the meaning of different choice options. Instead, we have focused on haptic interactions with forces that require implicit learning that can be construed as the minimization of reward prediction error in the context of reinforcement learning66,67.

The role of reinforcement learning in the context of implicit learning has been previously examined in a number of studies on motor learning68,69,70, including physical robot-human interactions71,72 and group coordination73, but not in the context of game theory and haptic interactions. While our task was simple enough to allow for model-free reinforcement learning, a challenge for the future remains to study reinforcement learning in more complex environments with multiple agents. This may include, for example, studying model-based learning strategies74 or learning strategies that succeed in the presence of multiple equilibria26. Although parallel single-agent Q-learning over multiple trials can lead to successful action coordination in some strategic interactions, game-theoretic learning algorithms like Nash Q-learning make it more likely to reach a jointly optimal path in more general games41. Whether human players would devise such strategies during sensorimotor interactions remains an open question. In summary, our study adds to a growing body of research harnessing the power of reinforcement learning models to understand human interactions and highlights that a more detailed understanding of learning mechanisms can also contribute to a better understanding of equilibrium behavior.

Methods

Experimental methods

Participants

Sixteen naïve subjects participated in this study and provided written informed consent for participation. The study was approved by the ethics committee of Ulm University. All sixteen participants were undergraduate students who were compensated for their time with an hourly payment of 10 euros while they played two sensorimotor versions of the matching pennies game. Before the experiment, participants were instructed that they should move a cursor to a target bar while trying to minimize the resistive forces experienced during the movement. Participants were told that they were not allowed to communicate with each other during the experiment. The data of another sixteen subjects was reanalyzed from a previous study25, where they performed a sensorimotor version of the prisoner’s dilemma under the same experimental conditions as in the current study. All methods were carried out in accordance with relevant guidelines and regulations.

Setup

The experiment was run on two vBOT haptic devices75 that were connected to a virtual reality environment in which two subjects could control one cursor each by moving a handle in the horizontal plane. This setup provides both subjects separately with visual feedback of their own cursor position, without showing a direct view of their hand or the hand of the other player—see Fig. 1A. The subjects were haptically coupled by applying individual forces on the two handles representing a negative payoff. Both the forces and the cursors’ positions were updated and recorded with a sampling frequency of 1 kHz throughout each trial.

Experimental design

To begin each trial, subjects had to simultaneously locate their cursor on the 15-cm-wide starting bar placed at the bottom of their respective workspace. Each participant’s workspace was constrained by the vBOT simulating solid walls at the extreme left (\(-8\) cm) and right (\(+8\) cm) for both players. A beep sound informed players that a valid starting position had been chosen, and a 15-cm-wide target bar was displayed in front of each player at a distance that was the same for both players and randomly drawn from a uniform distribution between 5 and 20 cm—see Fig. 1A. Subjects were instructed to move as soon as they heard the beep. Each participant had to hit their respective target bar with their cursor, for which they had to move forward (y-direction) and touch the bar within a maximum time period of 1500 ms. Each trial was finished as soon as both players had hit their respective target bar, which they could touch anywhere along its length. For the analysis, we interpreted the horizontal position of this target crossing as a subject’s final choice.

To determine the payoffs of the interaction, each player’s cursor was attached to a simulated one-dimensional spring whose equilibrium point was located on the starting bar. This generated a force \(F_i\) on their hand in the negative y-direction that opposed their forward movement with \(F_i=-K_i y_i\), where \(y_i\) corresponds to the y-distance of the cursor of player \(i\in \{1,2\}\) from the starting bar. Crucially, the spring constants \(K_i\) were not constant but variable functions \(K_i(x_1,x_2)\) depending on the normalized horizontal positions \(x_1\) and \(x_2\) of both participants, where \(x_i=0\) corresponds to extreme left and \(x_i=1\) represents extreme right of the workspace of player i. In particular, the spring functions \(K_i(x_1,x_2)\) were determined by the bilinear interpolation

$$\begin{aligned} K_i(x_1,x_2) = \alpha \begin{pmatrix} 1-x_1\\ x_1 \end{pmatrix}^{T} \begin{bmatrix} M^{(i)}_{0,0} & M^{(i)}_{1,0}\\ M^{(i)}_{0,1} & M^{(i)}_{1,1} \end{bmatrix} \begin{pmatrix} x_2\\ 1-x_2 \end{pmatrix}, \end{aligned}$$
(1)

where \(\alpha\) was a constant scaling parameter set to \(0.019 \frac{N}{cm}\) throughout the experiment, and the 2-by-2 matrix \(M^{(i)}\) corresponds to the 2-by-2 payoff matrix of player i and determines the boundary values of the force payoffs at the extremes of the \(x_1x_2\)-space. Consequently, \(M^{(i)}_{j,k}\) is the payoff assigned to player i when player 1 takes action \(x_1=k\) and player 2 takes action \(x_2=1-j\), where \(j,k\in \{0,1\}\). Each one of the three games analyzed in this study was defined by a different set of payoff matrices corresponding to the 2-by-2 games of the prisoner’s dilemma, asymmetric matching pennies, and symmetric matching pennies—see Fig. 1B. Players’ actions are taken to be the final positions in the \(x_1 x_2\)-space, where the continuous responses can be categorized by using different binnings (see Fig. 1C,D,E). To encourage learning, the payoff matrices were permuted every 40 trials, so that the mapping between action and payoff had to be relearned every 40 trials. There were four possible permutations corresponding to the possible reassignments \(x_1 \rightarrow (1-x_1)\) and \(x_2 \rightarrow (1-x_2)\). Subject pairs repeated 20 blocks of 40 trials each, totaling 800 trials over the whole experiment.
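Equation (1) can be implemented in a few lines; the following Python sketch (illustrative, not the experiment code) evaluates the interpolated spring constant for a payoff matrix arranged as in Eq. (1):

```python
import numpy as np

def spring_constant(x1, x2, M, alpha=0.019):
    """Bilinear interpolation of Eq. (1). M is the player's 2x2 payoff matrix
    arranged as [[M00, M10], [M01, M11]]; x1, x2 are normalized lateral
    positions in [0, 1]; the result is the spring constant in N/cm."""
    left = np.array([1.0 - x1, x1])
    right = np.array([x2, 1.0 - x2])
    return alpha * left @ np.asarray(M, dtype=float) @ right
```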

Prisoner’s dilemma motor game

In the pen-and-paper version of the prisoner’s dilemma the 2-by-2 payoff matrices are given by

$$\begin{aligned} M^{(1)} = \left( \begin{array}{cc} 0 & 3 \\ 10 & 7 \end{array} \right) \qquad M^{(2)} = \left( \begin{array}{cc} 10 & 7\\ 3 & 0 \end{array} \right) \end{aligned}$$

and represent a scenario where two delinquents are caught separately by the police and face the choice of either denying their crime (cooperating with each other) or admitting it and betraying their partner (defecting), while being individually interrogated with no communication allowed. For each player it is always best to defect, no matter what the other player is doing, even though cooperation would lead to a more favorable outcome if both players chose to do so. Hence, there is only one pure Nash equilibrium, given by defect/defect. In our continuous sensorimotor version of the game, the Nash equilibrium corresponds to a corner of the \(x_1x_2\)-space—see Fig. 1C.

Matching pennies motor games

In the classic version of the symmetric matching pennies game the 2-by-2 payoff matrices are given by

$$\begin{aligned} M^{(1)} = \left( \begin{array}{cc} 10 & 0 \\ 0 & 10 \end{array} \right) \qquad M^{(2)} = \left( \begin{array}{cc} 0 & 10\\ 10 & 0 \end{array} \right) \end{aligned}$$

and can be motivated from the scenario of two football players during a penalty kick with the two actions left and right. In this case player 2 (e.g. the goalkeeper) will try to select the matching action (e.g. \(x_1=right/x_2=right\)) to stop the ball, while player 1 (e.g. the striker) will try to unmatch by aiming the shot away from the goalkeeper (e.g. \(x_1=left/x_2=right\)). Choosing an action deterministically in this game is a bad strategy, since the opponent can exploit predictable behavior. The mixed Nash equilibrium is therefore given by a pair of distributions \((p_1;p_2)\) with \(p_1(left)=p_1(right)=0.5\) and \(p_2(left)=p_2(right)=0.5\) (see Fig. 1E). In the continuous sensorimotor version of the experiment there are infinitely many mixed Nash equilibria corresponding to the set of all distributions \(\{p_1,p_2\}\) where \(\mathbb {E}_{p_1}[x_1]=0.5\) and \(\mathbb {E}_{p_2}[x_2]=0.5\) assuming \(x_i \in [0,1]\)—see Supplementary Material.

In the 2-by-2 version of the asymmetric matching pennies game the payoff matrices are given by

$$\begin{aligned} M^{(1)} = \left( \begin{array}{cc} 10 & 0 \\ 2.5 & 10 \end{array} \right) \qquad M^{(2)} = \left( \begin{array}{cc} 2.5 & 10\\ 10 & 2.5 \end{array} \right) \end{aligned}$$

which is similar to the symmetric scenario, only that one player now has a higher incentive to choose a particular action (e.g. the striker gets awarded more points when choosing \(x_2=left\)). The mixed Nash equilibrium is then given by a pair of distributions \((p_1;p_2)\) with \(p_1(left)=0.8, p_1(right)=0.2\) and \(p_2(left)=p_2(right)=0.5\) (see Fig. 1D). In the continuous sensorimotor version of the experiment there are infinitely many mixed Nash equilibria corresponding to the set of all distributions \(\{p_1,p_2\}\) where \(\mathbb {E}_{p_1}[x_1]=0.8\) and \(\mathbb {E}_{p_2}[x_2]=0.5\) assuming \(x_i \in [0,1]\)—see Supplementary Material.

Theoretical methods

Computational models

Every model consists of two artificial agents whose parameters are updated over the course of 40 trials. As in the actual experiment, every agent decides on an action independently of the other agent, observes the reward signal in the form of a punishing force and adapts their strategy. The models can be categorized into binary action models with action space \(\mathcal {A} = \{0,1\}\), and continuous action models that operate on the action space \(\mathcal {A} = [0,1]\). All models have the same structure:

  • At the beginning of every trial t every agent i samples an action \(a_i^t \in \mathcal {A}\) from the distribution \(P_i^t\).

  • After both agents have taken their respective action \(a_i^t\) simultaneously, they receive a reward \(R_i^t = 1-F_i^t\), where \(F_i^t\) is calculated based on the spring constant \(K_i\) from Equation (1), such that \(F_i^t=\frac{1}{10\alpha } K_i(a_1^t,a_2^t)+ \varepsilon\), where \(\varepsilon\) is a Gaussian random variable with mean zero and standard deviation 0.1 to model the effect of sensory noise.

  • When considering intrinsic costs, the reward signal is augmented to \(R_i^t = \left( 1-F_i^t\right) - \kappa |a_i^t - a_i^{t-1} |\) for all \(t = 2,\dots ,40\) in order to punish high trial-by-trial deviations. For all simulations we use \(\kappa = \frac{1}{2}\) (a sketch of this reward computation is given after this list).

  • Depending on the action and reward, the agents adapt their strategy \(P_i^t \rightarrow P_i^{t+1}\) to optimize future rewards.
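For illustration, the reward computation described in the list above can be sketched in Python as follows (a minimal sketch under the stated assumptions; function and argument names are illustrative):

```python
import numpy as np

def reward(K_value, action=None, prev_action=None, alpha=0.019,
           kappa=0.5, noise_sd=0.1, rng=None):
    """Reward signal for a simulated agent: R = 1 - F with
    F = K/(10*alpha) + Gaussian sensory noise; from the second trial on,
    the intrinsic cost kappa*|a_t - a_{t-1}| can be subtracted."""
    rng = rng or np.random.default_rng()
    F = K_value / (10.0 * alpha) + rng.normal(0.0, noise_sd)
    R = 1.0 - F
    if action is not None and prev_action is not None:
        R -= kappa * abs(action - prev_action)
    return R
```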

Binary action models

Binary action models are limited to the two actions 0 and 1. These models can be used for learning in the classic \(2\times 2\) matrix versions of the different games.

Model 1: Q-learning

As in standard Q-Learning, agents using the first model represent their knowledge in the form of a Q-value lookup table. Since there are only two actions the lookup table for player i has two values, namely:

$$\begin{aligned} Q_i^t(a) = {\left\{ \begin{array}{ll} Q_{0,i}^t, & \text {if } a = 0 \\ Q_{1,i}^t, & \text {if } a = 1 \end{array}\right. } \end{aligned}$$

For the distribution \(P_i^t\) we chose a softmax-distribution with parameter \(\beta _i\):

$$\begin{aligned} P_i^t(a = k) = \frac{e^{\beta _i Q_i^t(k)}}{\sum _{\tilde{k} \in \{0,1\}} e^{\beta _i Q_i^t(\tilde{k})}} \end{aligned}$$

and the updates for the Q-values are of the following form:

$$\begin{aligned} Q_i^{t+1}(a) = Q_i^t(a) + \alpha _i^t \left( R_i^t + \gamma _i \max _{\tilde{a} \in \{0,1\}} Q_i^t(\tilde{a}) - Q_i^t(a)\right) , \end{aligned}$$

where \(\alpha _i^t\) is a learning rate that decreases exponentially over the course of the trials. All Q-values are initialized as 0, such that the agents start the game without any initial knowledge.
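A minimal Python sketch of this binary Q-learning agent is given below (not the fitted model; the specific parameter values and the exponential decay factor are illustrative assumptions):

```python
import numpy as np

class BinaryQAgent:
    """Model 1 sketch: binary Q-learning with softmax action selection."""
    def __init__(self, beta=5.0, alpha0=0.5, decay=0.95, gamma=0.9, rng=None):
        self.Q = np.zeros(2)                       # Q-values initialized to 0 (no prior knowledge)
        self.beta, self.alpha, self.decay, self.gamma = beta, alpha0, decay, gamma
        self.rng = rng or np.random.default_rng()

    def act(self):
        p = np.exp(self.beta * self.Q)
        p /= p.sum()                               # softmax distribution over the two actions
        return self.rng.choice(2, p=p)

    def update(self, a, R):
        target = R + self.gamma * self.Q.max()     # bootstrapped target as in the update rule above
        self.Q[a] += self.alpha * (target - self.Q[a])
        self.alpha *= self.decay                   # exponentially decaying learning rate
```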

Model 2: Gradient Descent

The gradient descent model learns a parameterized policy that can select actions without consulting a value function. The policy used by the binary gradient descent learner is the Bernoulli probability distribution with outcomes \(\{0,1\}\). An action a is determined by sampling from the policy:

$$\begin{aligned} P_i^t(a) = (p_i^t)^a(1-p_i^t)^{1-a}, \end{aligned}$$

where \(p_i^t\) is a parameter that is adapted whenever a player receives the reward \(R_i^t\) for their current action \(a_i^t\).

$$\begin{aligned} p_i^{t+1}&= p_i^t + \gamma _i^t R_i^t \frac{p_i^t - a}{(p_i^t-1)p_i^t} \end{aligned}$$

with an exponentially decaying learning rate parameter \(\gamma\) that is fitted to the behavior of the human participants. The parameter is initialized as \(p = \frac{1}{2}\) such that both actions are equally probable for the initial policy.
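The corresponding agent can be sketched as follows (illustrative Python, not the fitted model; the update uses the score function of the Bernoulli policy, which is algebraically identical to the expression above, and the clipping of p is a numerical safeguard added here):

```python
import numpy as np

class BinaryGradientAgent:
    """Model 2 sketch: Bernoulli policy gradient with decaying learning rate."""
    def __init__(self, gamma0=0.05, decay=0.95, rng=None):
        self.p = 0.5                                # both actions equally probable initially
        self.gamma, self.decay = gamma0, decay
        self.rng = rng or np.random.default_rng()

    def act(self):
        return int(self.rng.random() < self.p)     # a = 1 with probability p

    def update(self, a, R):
        grad = (a - self.p) / (self.p * (1.0 - self.p))   # d/dp log P(a)
        self.p = float(np.clip(self.p + self.gamma * R * grad, 1e-3, 1.0 - 1e-3))
        self.gamma *= self.decay                    # exponentially decaying learning rate
```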

Continuous action models

Continuous models give a more faithful representation of the sensorimotor version of the game in that they are able to sample actions from the full interval [0, 1].

Model 3: Gradient Descent

The gradient descent model directly learns a parameterized policy that can select actions without consulting a value function. The parameterized policy used in this study is given by the Kumaraswamy probability distribution on the interval (0, 1). An action a is determined by sampling from the policy:

$$\begin{aligned} P_i^t(a) = P(a \mid \alpha _i^t, \beta _i^t) = \alpha _i^t\beta _i^t a^{\alpha _i^t-1} \left( 1-a^{\alpha _i^t} \right) ^{\beta _i^t-1}, \end{aligned}$$

where \(\alpha _i^t\) and \(\beta _i^t\) are shape-parameters that are adapted whenever a player receives reward or punishment in form of a force for their current action. For an action \(a_i^t\) and reward \(R_i^t\), the parameters are updated according to the following update equations:

$$\begin{aligned} \alpha _i^{t+1}&= \alpha _i^t + \gamma _i^t R_i^t \left( \frac{1}{\alpha _i^t} + \log (a_i^t) - \frac{\beta _i^t-1}{1-(a_i^t)^{\alpha _i^t}} (a_i^t)^{\alpha _i^t} \log (a_i^t) \right) \\ \beta _i^{t+1}&= \beta _i^t + \gamma _i^t R_i^t \left( \frac{1}{\beta _i^t} + \log (1-(a_i^t)^{\alpha _i^t}) \right) \end{aligned}$$

with an exponentially decaying learning rate parameter \(\gamma\) that is fitted to the behavior of the human participants. The shape-parameters are initialized as \(\alpha = \beta = 1\) such that the initial policy for all players is the uniform distribution on (0, 1).
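A Python sketch of this continuous policy gradient agent is given below (illustrative, not the fitted model; sampling uses the Kumaraswamy inverse CDF, the parameter floors and the clipping of sampled actions are numerical safeguards added here, and the learning-rate values are assumptions):

```python
import numpy as np

class KumaraswamyGradientAgent:
    """Model 3 sketch: policy gradient with a Kumaraswamy policy on (0, 1)."""
    def __init__(self, gamma0=0.05, decay=0.98, rng=None):
        self.a_shape, self.b_shape = 1.0, 1.0       # alpha = beta = 1: uniform initial policy
        self.gamma, self.decay = gamma0, decay
        self.rng = rng or np.random.default_rng()

    def act(self):
        u = self.rng.random()                       # inverse-CDF sampling
        x = (1.0 - (1.0 - u) ** (1.0 / self.b_shape)) ** (1.0 / self.a_shape)
        return float(np.clip(x, 1e-6, 1.0 - 1e-6))  # keep the sample away from the borders

    def update(self, a, R):
        al, be = self.a_shape, self.b_shape
        d_alpha = 1.0 / al + np.log(a) - (be - 1.0) * a**al * np.log(a) / (1.0 - a**al)
        d_beta = 1.0 / be + np.log(1.0 - a**al)     # score functions from the update equations above
        self.a_shape = max(al + self.gamma * R * d_alpha, 1e-2)
        self.b_shape = max(be + self.gamma * R * d_beta, 1e-2)
        self.gamma *= self.decay                    # exponentially decaying learning rate
```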

Model 4: Q-Learning with Gaussian basis functions

This model is an extension of the discrete Q-learning model to the continuous action space [0, 1]. Its core is an action value function represented with radial basis functions, such that the real action value function \(Q_i^*(\cdot )\) is reflected by a set of weights \(Q_i^t\) that are adapted throughout the trials \(t = 1,2, \dots\) and a set of parameters \((c_k,\sigma _k)\) with \(k=1,2,\dots\) that are held constant. For the simulation we chose Gaussian basis functions \(\mathcal {N}(c_k,\sigma _k)\) centered around \(c_k\) with standard deviation \(\sigma _k\), such that action samples can be drawn from a Gaussian mixture model. Accordingly, we can generate an action \(a_i^t\) at time t by sampling an index j from a categorical distribution whose probabilities are given by the softmax \(\sigma (Q^t)\) of the weights, and then drawing a sample a from the j-th basis function of the Gaussian mixture. Samples that lie outside of the interval [0, 1] are rejected and resampled. Unlike in the discrete case, we update all weights Q in every trial according to the following rule. If \(a^t\) and \(R^t\) are the action taken and the reward at time t, then

$$\begin{aligned} \forall k \in \{1,\dots ,n\}: Q_k^{t+1} = Q_k^t + \alpha \varphi _{c_k,\sigma _k}\left( a^t\right) \left( R^t + \gamma \max _{k'} Q_{k'}^t - Q_k^t\right) , \end{aligned}$$

where \(\alpha\) is a learning rate parameter, \(\gamma\) is a discount factor and \(\varphi _{c_k,\sigma _k}\) is the clipped Gaussian density function \(\min \{1,\mathcal {N}(c_k,\sigma _k)\}\). All weights were initialized as 0.
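The following Python sketch illustrates this continuous Q-learning agent (not the fitted model; the number of basis functions, their centers and width, and the learning parameters are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

class RBFQAgent:
    """Model 4 sketch: Q-learning over Gaussian basis functions on [0, 1]."""
    def __init__(self, n_basis=8, sigma=0.1, beta=5.0, alpha=0.2, gamma=0.9, rng=None):
        self.c = np.linspace(0.0, 1.0, n_basis)     # basis-function centers
        self.s = sigma                              # common basis-function width
        self.Q = np.zeros(n_basis)                  # weights initialized to 0
        self.beta, self.alpha, self.gamma = beta, alpha, gamma
        self.rng = rng or np.random.default_rng()

    def act(self):
        p = np.exp(self.beta * self.Q)
        p /= p.sum()                                # softmax over basis indices
        while True:
            j = self.rng.choice(len(self.Q), p=p)
            a = self.rng.normal(self.c[j], self.s)  # sample from the chosen mixture component
            if 0.0 <= a <= 1.0:                     # reject samples outside the workspace
                return a

    def update(self, a, R):
        phi = np.minimum(1.0, norm.pdf(a, self.c, self.s))   # clipped Gaussian activations
        target = R + self.gamma * self.Q.max()
        self.Q += self.alpha * phi * (target - self.Q)
```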

Optimizing parameters

All parameters were optimized such that the simulated agents’ behavior fitted the behavior of the average player in a specific player slot (i.e. player 1 or player 2). For the optimization we minimized the absolute error:

$$\begin{aligned} E(d, \tilde{d}) = \sum _{t = 1}^{40} \Vert h(d_t) - h(\tilde{d}_t)\Vert _1 \end{aligned}$$

that occurred over 40 trials between the participants’ data d and the simulation results \(\tilde{d}\). The histogram function h maps the data points onto an \(8 \times 8\) matrix, where each entry contains the normalized number of data points that lie in the corresponding bin of an \(8 \times 8\) binned histogram. The time index t of \(d_t\) indicates that all data points from the first trial up until trial t are used to generate the matrix.
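A Python sketch of this fitting objective is given below (not the authors' optimization code; the cumulative construction of the histograms up to each trial t follows the description above, while the data layout is an assumption):

```python
import numpy as np

def histogram_error(d, d_sim, bins=8):
    """L1 distance between cumulative 8x8 histograms of data and simulation,
    summed over the 40 trials of a block. d, d_sim: arrays of shape (40, 2)
    holding the (x1, x2) endpoints per trial."""
    d, d_sim = np.asarray(d), np.asarray(d_sim)
    edges = np.linspace(0.0, 1.0, bins + 1)
    err = 0.0
    for t in range(1, d.shape[0] + 1):
        h, _, _ = np.histogram2d(d[:t, 0], d[:t, 1], bins=[edges, edges])
        h_sim, _, _ = np.histogram2d(d_sim[:t, 0], d_sim[:t, 1], bins=[edges, edges])
        err += np.abs(h / h.sum() - h_sim / h_sim.sum()).sum()
    return err
```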