Bayesian nonparametric models characterize instantaneous strategies in a competitive dynamic game

Previous studies of strategic social interaction in game theory have predominantly used games with clearly-defined turns and limited choices. Yet, most real-world social behaviors involve dynamic, coevolving decisions by interacting agents, which poses challenges for creating tractable models of behavior. Here, using a game in which humans competed against both real and artificial opponents, we show that it is possible to quantify the instantaneous dynamic coupling between agents. Adopting a reinforcement learning approach, we use Gaussian Processes to model the policy and value functions of participants as a function of both game state and opponent identity. We found that higher-scoring participants timed their final change in direction to moments when the opponent’s counter-strategy was weaker, while lower-scoring participants less precisely timed their final moves. This approach offers a natural set of metrics for facilitating analysis at multiple timescales and suggests new classes of experimental paradigms for assessing behavior.


Supplementary Methods: Computer Opponent Algorithm
The computer opponent, also called the goalie, was represented by a vertical bar on the right-hand side of the screen. The computer opponent used a fixed strategy based on a track-then-guess algorithm, in which it initially attempts to match the vertical position of the puck at the start of each trial and subsequently chooses from a predefined set of strategies at random. The complexity of the resulting algorithm (Algorithm 2) is almost entirely due to two subtleties: (1) determining at what point the opponent should stop tracking and guess a strategy and (2) how this guess is chosen.
To address the first question, we consider the collection of points at which the puck has the potential to outrun the bar to one of the corners. For simplicity, our algorithm assumes that both players start from rest at the vertical middle of the screen and move maximally at each time step. After a certain position $C_1$ late in the trial, the distance the puck can move exceeds the bar's ability to catch it. After another point, $C_2$, near the end of the trial, the situation is reversed: the longer length of the bar allows it to block the puck more successfully. The two points $C_1$ and $C_2$ define each trial's critical region. Note, however, that because we have assumed both players start from rest at $y = 0$, this critical region is the same for each trial and is calculated only once, at the start of the task. Algorithm 1 shows how this calculation is implemented. To determine when and what strategy the computer opponent chooses, our algorithm also defines a pair of inflection points surrounding the first critical point, $C_1$. The first of these, $IP_1$, is the horizontal position of the puck in the time step immediately before it reaches $C_1$, and the second, $IP_2$, is its location two time steps afterward. (Other choices are certainly possible; these were tuned to generate reasonable opponent play.) When the puck reaches $IP_1$, the computer chooses a two-part strategy $(S, S')$ by sampling with replacement from a bag of strategies accumulated over the course of play. These two-part strategies allow the computer opponent to potentially counter strategies in which the participant reverses play, as well as adding unpredictability to the opponent's own movements.
Having selected a strategy, the computer opponent then implements strategy $S$ until the puck reaches $IP_2$, after which it implements $S'$ until the end of the trial. After each trial, the bag of strategies is updated by adding to it the participant's observed strategy on the most recent trial (see UpdateStrategies). In this way, the computer opponent biases its random strategy selection to adaptively counter the participant's observed tendencies.
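To make the strategy-bag logic concrete, the following is a minimal Python sketch of the selection and update steps just described; all function names are ours, not the task code's:

import random

# Minimal sketch of the strategy bag described above (names hypothetical).
# A two-part strategy (S, S') plays S from IP_1 to IP_2 and S' thereafter.
# 'U' = move maximally up, 'D' = move maximally down, 'T' = track the puck.

def initial_strategies():
    return [('U', 'U'), ('D', 'D'), ('T', 'T')]

def choose_strategy(bag):
    # Sampling *with replacement* means the bag's composition biases the
    # draw toward strategies the participant has used most often.
    return random.choice(bag)

def update_strategies(bag, observed_strategy):
    # After each trial, add the participant's observed two-part strategy,
    # so future draws adaptively counter the participant's tendencies.
    bag.append(observed_strategy)
    return bag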

Function UpdateStrategies
Strategies ← [(U, U), (D, D), Track]
foreach trial do
    while trial not ended do
        play trial

These strategies differ only in where they set the opponent's desired bar position at the next time step. This can be thought of as the goal in a feedback control model with $u_t$ as the control signal (Equation (12)) [1]. In the case of the tracking strategy, the goal is set at the participant's projected position in the next time step, while for the up and down strategies, $u_t$ is maximal toward the top or bottom of the screen. Moreover, because the computer is capable of perfectly tracking the position of the puck, we implemented a variable lag in its behavior, comparable to a human's reaction time: that is, the lag, $L$, is a random variable drawn separately for each trial. This variable as defined is measured in seconds but is converted by the code to the integer indexing the closest time step. The resulting strategy execution is detailed in DoStrategy, while the full computer opponent algorithm is given in Algorithm 2.
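A sketch of DoStrategy-style execution under stated assumptions (positions in pixels, one puck position per time step, max_step the bar's maximal per-step displacement, lag_steps the reaction-time lag converted to time steps); the signature and names are ours:

import numpy as np

def do_strategy(strategy, t, puck_y, bar_y, y_min, y_max, max_step, lag_steps):
    # Set the goal position u_t for the feedback controller.
    if strategy == 'U':
        u_t = y_max                          # maximal toward the top
    elif strategy == 'D':
        u_t = y_min                          # maximal toward the bottom
    else:                                    # 'T': track the lagged puck position
        u_t = puck_y[max(t - lag_steps, 0)]
    # Move toward the goal, saturating at the bar's maximum speed.
    return bar_y + np.clip(u_t - bar_y, -max_step, max_step)

# The lag L is redrawn each trial in seconds and converted to the nearest
# integer time step, e.g. lag_steps = int(round(L / dt)).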
Positions were normalized by centering and uniform rescaling, with $x_i$ the coordinate (either vertical or horizontal) and $\bar{x}_i$ the midpoint along the same dimension, both in pixels. This transformation rescales the game arena uniformly, maintaining the aspect ratio between the horizontal and vertical dimensions.
Likewise, since the puck's horizontal velocity is constant, and since the bar's vertical velocity is of the same order of magnitude, we measure all velocities in units of $v_p$. Thus the horizontal velocity of the puck is 1, and the vertical speed of the puck is $\leq 1$.
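In code, this normalization amounts to centering at the midpoint and dividing by common scale factors; a minimal sketch, where the scale-factor names are our assumptions:

import numpy as np

def normalize_position(x_px, x_mid_px, scale_px):
    # Center at the arena midpoint and rescale uniformly (the same scale_px
    # for both dimensions preserves the aspect ratio).
    return (np.asarray(x_px) - x_mid_px) / scale_px

def normalize_velocity(v_px_per_step, v_p):
    # Measure all velocities in units of the puck's constant horizontal
    # speed v_p, so the puck's horizontal velocity becomes exactly 1.
    return np.asarray(v_px_per_step) / v_p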

Supplementary Note 2: Uncertainty of GP gradients
Here, we adopt the conventions of [2]: $y$ is a vector of data points observed over a domain labeled by $x \in \mathcal{X}$, $f \sim \mathcal{GP}(0, k(x, x'))$ is the latent Gaussian Process (GP), and $\mathbf{f}$ is the value of the GP at the data points. Let $X$ denote the matrix of original data points and $K$ the Gramian of these. The GP predictive distribution at new points $X_*$ is then

$$p(f_* \mid X_*, \mathbf{f}, X) = \mathcal{N}\left(k_* K^{-1} \mathbf{f},\ k_{**} - k_* K^{-1} k_*^\top\right) \qquad (4)$$

where we have additionally defined the matrix $k_* \equiv k(X_*, X)$ (this is a vector when $X_* = x_*$, a single point) and $k_{**} \equiv k(X_*, X_*)$.
Our goal is to calculate the distribution of the gradient of $f$, $\nabla f$, with respect to $x$:

$$\nabla f = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}, \qquad (5)$$

where the division is elementwise. We calculate this by considering $X_* = [x + h, x]$ (each point comprising one column), in which case the variable of interest, $\nabla_h f$, has mean $\lim_{h \to 0} A \cdot f_* / h$ with $A = (1, -1)$. That is,

$$\mathbb{E}[\nabla f] = \nabla k(x)\, K^{-1} \mathbf{f}.$$

That is, the calculation for the mean is the same as in the prediction case, with the kernel $k$ replaced by its gradient. (Note once again that $f(x)$ is our Gaussian process, while $\mathbf{f}$ is the vector of observations of this GP at the training data points. Moreover, from here on, we will only consider evaluating the GP at a single test point $x$.) For the covariance matrix, we will make use of the fact that for a bivariate normal distribution,

$$\mathrm{var}(f_1 - f_2) = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12},$$

with $\sigma_i$ the standard deviation of each variable and $\sigma_{12}$ the covariance of the two.
That is, the variance of the difference $f(x + h) - f(x)$ can be written

$$\mathrm{var}[f(x+h) - f(x)] = k(x+h, x+h) + k(x, x) - 2k(x+h, x) - [k(x+h) - k(x)]\, K^{-1}\, [k(x+h) - k(x)]^\top,$$

where we have again used $k$ as the kernel and $k(x) \equiv k(x, X)$, the vector formed by evaluating the kernel on the new datum and the original training set.
To calculate the covariance of (5), we then write $\mathrm{var}[f(x+h) - f(x)] = h^\top \Sigma_{\nabla f}\, h$ and take the $h \to 0$ limit. It is straightforward to show that all terms of $O(h)$ cancel, and

$$\Sigma_{\nabla f} = \nabla \nabla' k(x, x) - \nabla k(x)\, K^{-1}\, \nabla k(x)^\top, \qquad (13)$$

where we have again used the fact that $k$ is symmetric in its arguments, so $\partial^2 k / \partial x^2 = \partial^2 k / \partial x'^2$. Putting all this together, we then conclude that

$$p(\nabla f \mid x_*, \mathbf{f}, X) = \mathcal{N}\left(\nabla k(x)\, K^{-1} \mathbf{f},\ \Sigma_{\nabla f}\right). \qquad (14)$$

Here again, we see that this is the same as the formula (4) for a new observation at $x_*$ with the substitutions

$$k_{**} \to \nabla \nabla' k \qquad (15)$$
$$k_* \to \nabla k \qquad (16)$$

though the fact that $\nabla f$ is a vector means that $p(\nabla f \mid x_*, \mathbf{f}, X)$ is multivariate Gaussian.
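As a numerical sanity check on this limiting construction, one can build the ordinary GP predictive at $X_* = [x + h, x]$, apply $A = (1, -1)$, divide by $h$, and watch the result converge to the closed-form gradient mean as $h \to 0$. A sketch with a unit-variance RBF kernel (all names are ours):

import numpy as np

def rbf(A, B, lam):
    d = (A[:, None, :] - B[None, :, :]) / lam
    return np.exp(-0.5 * (d ** 2).sum(-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))          # training inputs
y = rng.normal(size=40)               # toy observations f
lam = np.array([1.0, 2.0])            # lengthscales
x = np.zeros(2)                       # test point
Kinv = np.linalg.inv(rbf(X, X, lam) + 1e-10 * np.eye(40))

# Closed form: E[df/dx_0] = [grad k(x, X) K^{-1} y]_0 for the RBF kernel.
k_x = rbf(x[None], X, lam)[0]
grad_mean = ((-(x[None, :] - X) / lam ** 2) * k_x[:, None]).T @ Kinv @ y

for h in [1e-1, 1e-3, 1e-5]:
    Xs = np.stack([x + np.array([h, 0.0]), x])   # X_* = [x + h, x]
    fd = np.array([1.0, -1.0]) @ (rbf(Xs, X, lam) @ Kinv @ y) / h
    print(h, fd, grad_mean[0])                   # fd -> grad_mean[0] as h -> 0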

Supplementary Note 3: Precomputing kernels
For computational efficiency, it may be preferable to calculate kernel gradients explicitly, rather than relying on automatic differentiation. (For example, TensorFlow does not easily facilitate the computation of gradients of tensors, only scalars.) Here, we provide explicit formulas in the case that $k$ is chosen to be the radial basis function (RBF) kernel:

$$k(x, x') = \sigma^2 \exp\left(-\sum_i \frac{(x_i - x'_i)^2}{2\lambda_i^2}\right),$$

in which case we have

$$\frac{\partial k(x, x')}{\partial x_i} = -\frac{x_i - x'_i}{\lambda_i^2}\, k(x, x'),$$

which implies

$$\nabla k(x)_{im} = -\frac{x_i - X_{mi}}{\lambda_i^2} \odot k(x, X)_m.$$

Similarly, we have

$$\frac{\partial^2 k(x, x')}{\partial x_i\, \partial x'_j} = \left(\frac{\delta_{ij}}{\lambda_i^2} - \frac{(x_i - x'_i)(x_j - x'_j)}{\lambda_i^2 \lambda_j^2}\right) k(x, x'),$$

where we have used $\odot$ for the Hadamard (elementwise) product and assumed broadcasting in the $i$ and $m$ indices. Note also that these gradient formulas indicate that we naturally expect the scaling depicted in Figure 4B,C: gradient means and covariances are inversely proportional to the hyperparameters $\lambda_i$, so aggregated sensitivities should also roughly scale as powers of the hyperparameters.
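These formulas translate directly into a few lines of numpy. The sketch below computes the gradient posterior mean and covariance (Eqs. 13-14) at a single test point from the explicit RBF derivatives; function and variable names are ours:

import numpy as np

def rbf(A, B, lam, sigma2=1.0):
    d = (A[:, None, :] - B[None, :, :]) / lam
    return sigma2 * np.exp(-0.5 * (d ** 2).sum(-1))

def grad_posterior(x, X, y, lam, sigma2=1.0, jitter=1e-8):
    n, D = X.shape
    Kinv = np.linalg.inv(rbf(X, X, lam, sigma2) + jitter * np.eye(n))
    k_x = rbf(x[None], X, lam, sigma2)[0]          # k(x, X), shape (n,)
    # dk[m, i] = dk(x, X_m)/dx_i = -(x_i - X_mi)/lam_i^2 * k(x, X_m)
    dk = -(x[None, :] - X) / lam ** 2 * k_x[:, None]
    # d2k/dx_i dx'_j at x = x' reduces to diag(sigma2 / lam_i^2) for the RBF;
    # note the inverse powers of lam, the source of the scaling noted above.
    ddk = np.diag(sigma2 / lam ** 2)
    mean = dk.T @ Kinv @ y
    cov = ddk - dk.T @ Kinv @ dk
    return mean, cov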
Now, if we want to consider cases where prediction at more than one point is needed, we must be careful about indices. For example, the double gradient of the kernel evaluated at the new points is

$$(\nabla \nabla' k)_{ip,\, jp'} = \frac{\partial^2 k(x_p, x_{p'})}{\partial x_i\, \partial x'_j},$$

with $i, j$ labeling coordinates and $p, p'$ new data points. Note again that in the special case that the new evaluation points are the same along the gradient directions, this reduces to

$$(\nabla \nabla' k)_{ip,\, jp'} = \frac{\delta_{ij}}{\lambda_i^2}\, k(x_p, x_{p'}).$$

Similarly,

$$(\nabla k)_{ip,\, m} = \frac{\partial k(x_p, X_m)}{\partial x_i},$$

again with $i$ labeling coordinates, $p$ new data points, and $m$ original data/inducing points. From this it follows that the gradient mean and covariance take the same forms as above. However, in inverting the covariance, we need a matrix, not a 4-tensor indexed by $ij, pp'$. So to accomplish this, we stack the datapoint and coordinate indices; in the special case above, the stacked matrix takes the Kronecker form

$$\nabla \nabla' k = K_{**} \otimes \lambda^{-2},$$

with $\otimes$ the Kronecker product and $\lambda$ the diagonal matrix with entries $\lambda_i$. As for $v$, if one ensures that $n$ is the first index, then one can simply tf.reshape(v, [-1, D]), where $D$ is the dimension of the vector $x$.
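A sketch of the stacking step for multiple evaluation points, verifying the Kronecker-product reduction in the coincident-point case (all names are ours):

import numpy as np

def rbf(A, B, lam, sigma2=1.0):
    d = (A[:, None, :] - B[None, :, :]) / lam
    return sigma2 * np.exp(-0.5 * (d ** 2).sum(-1))

def ddk_stacked(Xs, lam, sigma2=1.0):
    # 4-tensor kk[p, p', i, j] = d2 k(x_p, x_p') / dx_i dx'_j for the RBF.
    P, D = Xs.shape
    k = rbf(Xs, Xs, lam, sigma2)
    diff = (Xs[:, None, :] - Xs[None, :, :]) / lam ** 2
    kk = k[:, :, None, None] * (np.eye(D)[None, None] / lam ** 2
                                - diff[:, :, :, None] * diff[:, :, None, :])
    # Stack with the datapoint index first: (p, i, p', j) -> (P*D, P*D).
    return kk.transpose(0, 2, 1, 3).reshape(P * D, P * D)

# Coincident evaluation points: reduces to K_** (Kronecker) diag(1/lam_i^2).
lam = np.array([1.0, 2.0])
Xs = np.zeros((3, 2))
K_ss = rbf(Xs, Xs, lam)
assert np.allclose(ddk_stacked(Xs, lam), np.kron(K_ss, np.diag(1.0 / lam ** 2)))

# With the datapoint index n first, a stacked vector v of length P*D can be
# unstacked with v.reshape(-1, D), the tf.reshape(v, [-1, D]) trick above.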

Supplementary Note 4: Uncertainties for sensitivity metrics
We have defined the sensitivity of the participant to the opponent's actions as $\varsigma = \|\nabla_g f\|^2$, where $f(x) = \Phi^{-1}(\pi(x))$ is the GP defining the local likelihood of a change point and $\nabla_g$ represents the gradient of $f$ with respect to the opponent's state variables (position and velocity). If we let $X_g$ denote the opponent's state variables, then by definition

$$\varsigma = \sum_{i \in X_g} \left(\frac{\partial f}{\partial x_i}\right)^2 = \nabla f^\top A\, \nabla f,$$

with $A$ the diagonal projection onto the variables in $X_g$. That is, our sensitivity is a quadratic combination of gradients, $\nabla f \cdot A \cdot \nabla f$, with $\mathrm{cov}[\nabla f] = \Sigma_{\nabla f}$ as defined in (14). This combination has a generalized $\chi^2$ distribution [3] with metric $A$ and covariance $\Sigma_{\nabla f}$. Using $\varsigma = \|\nabla_g f\|^2$ as a measure of the sensitivity of the participant to the opponent is intuitive but also arbitrarily chooses an equal weighting of opponent position and velocity sensitivities. For a more principled solution, we adopted an alternative definition:

$$\varsigma = \|L_g^{-1} \nabla_g f\|^2,$$

with $g$ subscripts once again restricting to variables in $X_g$ and $\Sigma_g = L_g L_g^\top$ (i.e., $L_g$ is the Cholesky factor of $\Sigma_g$). In what follows, we will drop the $g$ subscripts for brevity. We can motivate this choice by beginning with the observation (14) that $\nabla f$ is multivariate normal with mean $\mu$ and covariance $\Sigma$. It follows that $L^{-1}(\nabla f - \mu)$ is multivariate normal with mean 0 and unit covariance matrix, or equivalently, $L^{-1} \nabla f$ is multivariate normal with mean $L^{-1}\mu$ and unit covariance. More importantly, what follows is that the sum of squares $\|L^{-1} \nabla f\|^2$ is a sum of squares of independent unit normals (with nonzero means) and thus follows a noncentral $\chi^2$ distribution:

$$\|L^{-1} \nabla f\|^2 \sim \chi'^2_d\left(\|L^{-1}\mu\|^2\right),$$

with $d$ the number of variables in $X_g$.
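In code, this noncentral $\chi^2$ distribution is available directly from scipy; a minimal sketch under the notation above (variable names are ours):

import numpy as np
from scipy.stats import ncx2

def sensitivity_distribution(mu_g, Sigma_g):
    # mu_g, Sigma_g: posterior mean and covariance of the gradient restricted
    # to the opponent's state variables X_g.
    L = np.linalg.cholesky(Sigma_g)
    z = np.linalg.solve(L, mu_g)                 # L^{-1} mu
    return ncx2(df=len(mu_g), nc=float(z @ z))   # ||L^{-1} grad f||^2

# Example: mean and 95% credible interval for the sensitivity.
dist = sensitivity_distribution(np.array([0.5, -1.0]),
                                np.array([[1.0, 0.3], [0.3, 2.0]]))
print(dist.mean(), dist.ppf([0.025, 0.975]))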
While the choice of Mahalanobis norm may be less intuitive than the Euclidean norm in measuring sensitivity, a few comments are in order:
• The original metric does not take into account uncertainty in the individual gradient terms. This metric normalizes each term by its uncertainty, using a sort of signal-to-noise measure. This is also appropriate from a Bayesian standpoint, since it downweights uncertain information.
• The original metric does not account for correlation among the gradient terms. However, $\varsigma$ rotates the original gradient to a basis in which the covariance matrix is diagonal (the PCA basis).
Finally, the sum of independent noncentral $\chi^2$ random variables is itself noncentral $\chi^2$, so the average of a series of (noncentral) $\chi^2$ random variables, $\bar{X} = \frac{1}{N}\sum_{t=1}^{N} X_t$, is given by

$$N\bar{X} \sim \chi'^2_{Nd}\left(\sum_{t=1}^{N} \lambda_t\right).$$

So while $\bar{X}$ does not have a conventional distributional form (a rescaled $\chi^2$ is gamma-distributed, but a rescaled noncentral $\chi^2$ is not), its moments and cdf can be obtained by working with $N\bar{X}$ and rescaling appropriately. It is this formula that we use in the main text to calculate credible intervals at each time point when averaging across trials.
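This rescaling is straightforward to implement; a sketch using scipy (names are ours):

import numpy as np
from scipy.stats import ncx2

def averaged_credible_interval(d, noncentralities, level=0.95):
    # The sum of N independent noncentral chi^2(d, lambda_t) variables is
    # noncentral chi^2(N*d, sum_t lambda_t); quantiles of the average
    # Xbar = S/N come from quantiles of N*Xbar = S, rescaled by 1/N.
    N = len(noncentralities)
    S = ncx2(df=N * d, nc=float(np.sum(noncentralities)))
    lo, hi = S.ppf([(1 - level) / 2, (1 + level) / 2])
    return lo / N, hi / N

print(averaged_credible_interval(d=2, noncentralities=[0.3, 1.1, 0.7]))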
Supplementary Note 5: Disentangling identity and context effects in play
When attempting to quantify how player strategy (or indeed, any task-related variable) differs based on the identity of the opponent (human or computer), simply taking the observed difference between the variable of interest during the human trials and likewise for the computer trials elides an important distinction between what might be termed opponent identity effects and opponent context effects. That is, we might ask whether observed differences in switch probability between the two opponents are due to intrinsic differences in the way participants perceive each opponent or to the fact that each opponent simply plays a different strategy. In typical social games, these effects are all but impossible to disentangle, but because we model the joint distribution of both states and opponent identity, $f(s, \omega)$, we can perform the following counterfactual experiment: for every state $s$ visited in play against the computer ($\omega = 0$), we can ask how $f(s, 0)$ compares to $f(s, 1)$. This is equivalent to freezing game play at a single moment, switching the identity of the opponent while holding all other variables fixed, and asking how play in the next instant differs.
Such a pure identity effect quantifies how much participants' strategies would differ between human and computer opponents who used the same strategy.
More formally, define

$$X_{\omega'\omega} \equiv \mathbb{E}_{s \sim p(s \mid \omega')}\left[X(s, \omega)\right]$$

to be an expectation of some random variable $X$ (for instance, a probability of switching or sensitivity). Here again, $s$ represents the game state and $\omega$ the opponent identity (0 = computer, 1 = human), but we decouple the opponent specified in the random variable from the opponent that generated the states over which we average. More concretely, $X_{00}$ denotes the value of $X$ against the computer, averaged over states actually played against the computer, while $X_{10}$ again denotes the value of $X$ against the computer, only this time averaged over states played against the human.
In this notation, Figure 2D plots $X_{00}$ and $X_{11}$ with $X = \Phi^{-1}(p)$, while Figure 3 shows the same two variables with $X$ equal to our opponent sensitivity metric.
What is most important, however, is that the observed contrast plotted in purple in Figure 1B-C can be decomposed as a weighted sum of the identity effect $C_{\text{identity}}$ and the context effect $C_{\text{context}}$, as follows:

$$C_{\text{obs}} = X_{11} - X_{00},$$
$$C_{\text{identity}} = \frac{n_0 (X_{01} - X_{00}) + n_1 (X_{11} - X_{10})}{N},$$
$$C_{\text{context}} = \frac{n_0 (X_{10} - X_{00}) + n_1 (X_{11} - X_{01})}{N},$$
$$C_{\text{obs}} \approx C_{\text{identity}} + C_{\text{context}}, \qquad (38)$$

with $n_0$ and $n_1$ the number of trials played against the computer and human opponents, respectively, $N = n_0 + n_1$, and approximate equality holds in Eq. 38 because $n_0 \approx n_1$ in our data (see the sketch below). In fact, the observed contrast between the two curves in Figure 2D can be fully decomposed into an effect due to opponent identity and an effect due to differences in the distributions of visited states (see Methods). As indicated in Figure 1A, the observed contrast plotted in Figure 2D corresponds to the difference along the diagonal, while the identity and context effects correspond to differences taken along the vertical and horizontal directions, respectively. Figures 1B and C illustrate this decomposition for two representative participants. These figures show both the observed contrast (the difference between the two curves in Figure 2D) and its constituent pieces due to opponent identity and context. While the latter are typically larger, indicating a predominance of game-state effects on switch probability, there is considerable heterogeneity across both participants and time in trial. Figure 1D illustrates this by considering the average identity effects for each participant during the early and late stages of each trial. There, a positive value indicates a higher switch probability against the human opponent, while a negative value indicates a higher switch probability against the computer. While some participants consistently exhibit higher switch probabilities against the human opponent (upper right) or against the computer (lower left), others switch more against one early and the other late (upper left, lower right). Thus, players can be distinguished not only by which opponent elicits more switching behavior, but also by the periods of the trial in which these tendencies occur.
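A sketch of this counterfactual decomposition in code, assuming a fitted model f_eval(states, omega) that evaluates the GP at the given states with opponent identity omega (all names are ours):

import numpy as np

def decompose(f_eval, states_0, states_1):
    # states_0 / states_1: states actually visited against the computer (0)
    # and the human (1). X[(w_states, w_model)] averages f over states from
    # opponent w_states, evaluated with opponent identity w_model.
    n0, n1 = len(states_0), len(states_1)
    N = n0 + n1
    X = {(ws, wm): float(np.mean(f_eval(s, wm)))
         for ws, s in ((0, states_0), (1, states_1))
         for wm in (0, 1)}
    observed = X[1, 1] - X[0, 0]                     # diagonal contrast
    identity = (n0 * (X[0, 1] - X[0, 0]) + n1 * (X[1, 1] - X[1, 0])) / N
    context = (n0 * (X[1, 0] - X[0, 0]) + n1 * (X[1, 1] - X[0, 1])) / N
    return observed, identity, context   # observed ~ identity + context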

Supplementary Note 6: Empirical Action Value Model
We have shown that we can use nonparametric methods to estimate the policy participants use when playing a dynamic, strategic game. Yet this analysis says nothing about how effective these policies are. So how do participants' choices at each moment translate to wins and losses? To answer this, we separately modeled each participant's action value $Q^\pi(a \mid s, \omega)$: the expected value of taking action $a$ in state $s$ against opponent $\omega$ and playing according to policy $\pi$ thereafter. As indicated by the notation, this value is policy-dependent; that is, each policy $\pi$ uniquely determines a value function $Q^\pi$. In typical reinforcement learning models, policies are likewise dependent on action values: given action values $Q$, policies choose actions based on a softmax function or other rule [4]. Thus, there is a mapping in the reverse direction from action values to policies.
The Bellman Equation stipulates that for optimal learners, the optimal policy and action values determine one another [4], but this need not hold for nonoptimal learners. Figure 2A illustrates these concepts. While the optimal policy $\pi^*$ and $Q^*$ are mapped onto each other by the processes of value calculation and action selection, respectively, for non-optimal learners, the observed policy $\pi_{\text{obs}}$ leads to a value function $Q_{\text{obs}}$, but softmax action selection based on $Q_{\text{obs}}$ may not be equivalent to the original policy: $\pi_Q \neq \pi_{\text{obs}}$, so the mappings in Figure 2A are not inverses except for optimal policies. In other words, learners may not necessarily be choosing based on the expected values of their actions. As a result, we took an approach in which the action value function $Q(a \mid s, \omega)$ was modeled independently of $\pi$: this model took as inputs the instantaneous state, opponent, and observed action at that time and attempted to predict from those data whether the participant subsequently won the trial. We used the same Gaussian Process classification approach as before, only this time predicting the trial outcome and using the participant's observed action as an additional input.
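A minimal sketch of this action-value model, using scikit-learn's GaussianProcessClassifier as a stand-in for the sparse variational GP classifier used in the paper; the feature layout and toy data below are assumptions of ours:

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
n = 400
states = rng.normal(size=(n, 4))             # e.g. positions and velocities
opponent = rng.integers(0, 2, size=(n, 1))   # 0 = computer, 1 = human
action = rng.integers(0, 2, size=(n, 1))     # participant's observed action
X = np.hstack([states, opponent, action])
won = rng.integers(0, 2, size=n)             # toy trial outcomes

# One lengthscale per input dimension (ARD), so lengthscales can later be
# inspected for variable relevance, as in the analysis below.
q_model = GaussianProcessClassifier(kernel=RBF(length_scale=np.ones(X.shape[1])))
q_model.fit(X, won)
q_hat = q_model.predict_proba(X)[:, 1]       # Q(a | s, omega) as P(win)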
The results of this model are shown in Figure 2. As Figure 2B illustrates, there are fluctuations in expected value even within a single trial as players move and countermove. Here, we have plotted the predicted expected value, which is equal to the value function of reinforcement learning,

$$V^\pi(s, \omega) = \sum_a \pi(a \mid s, \omega)\, Q^\pi(a \mid s, \omega),$$

an undiscounted, weighted sum of action values according to their probability under the current policy. Quantifying expected value at the time-point level, rather than the trial level, allows us to see how fluctuations in game state impact the likelihood of winning. For reference, we fit a logistic regression using the same set of input features and targets.
Once again, the GP model outperforms logistic regression for each participant in our cohort (see Supplementary Figure 10).
Examining the lengthscale hyperparameter for each input variable indicates whether that variable is relevant for predicting the target variable in Gaussian processes. Large lengthscale values mean a given variable is irrelevant, while smaller values (up to approximately 1) mean a variable is predictively relevant. We found that within the empirical expected value GP, 68 out of 82 subjects had an opponent identity lengthscale hyperparameter less than or equal to 1, suggesting that opponent identity impacted expected value for most of the subjects in our task.

Figure 2A (caption): Relationship between policies and action values in reinforcement learning. Each policy determines an action value (rightward arrow). Conversely, a set of action values, coupled with an action selection mechanism like softmax or greedy methods, determines a policy (leftward arrow). For optimal learners, the connection between the optimal policy $\pi^*$ and its resulting action values $Q^*$ is given by the Bellman Equation, which states that the leftward and rightward arrows are inverses of one another. For non-optimal agents, however, the observed policy $\pi_{\text{obs}}$ determines $Q_{\text{obs}}$, but action selection based on $Q_{\text{obs}}$ may not be the same as $\pi_{\text{obs}}$.

In fact, this trend can also be visualized in terms of the density of value as a function of time in trial (Figure 3). Against the human opponent (Figure 3A), values start out concentrated around a player's mean win rate and evolve gradually over the course of the trial toward the 0 and 1 outcomes.
By contrast, against the computer, values hold around 0.5 until abruptly diverging at the critical point. And indeed, this pattern holds in the average across all participants (Figure 3C,D). Note that here, in the case of a computer opponent defined by a simple heuristic, our model is easily able to recover strong indications of that heuristic in an unbiased way. This indicates that our approach is powerful enough to characterize a wide range of behavior. Circumstantially, it also suggests that our participants are unlikely to have relied on simple heuristics alone in constructing their strategies. Finally, to investigate how well expected value predicts whether a given trial will result in a win or loss, we conducted a series of univariate logistic regressions. Given an opponent and an average expected value in the early, middle, or late period of each trial, we attempted to predict the trial's result. We found that regression coefficients for the human opponent condition were higher than those for the computer opponent ($t = 4.53$, $p < 0.0001$), suggesting that (unrealized) expected values better predict trial outcome in the human opponent condition. Second, we found that regression coefficients increase as the trial progresses, such that the late coefficients were significantly higher than the early coefficients ($t = 30.69$, $p < 0.0001$). This matches our intuition that trial outcomes are better predicted by expected values later in the trial (see Supplementary Figure 11).

Supplementary Note 7: Regularized Logistic Regression
We compared the predictive performance of our classification model to a regularized logistic regression using LASSO ($L_1$) regularization as implemented in the scikit-learn LogisticRegression class. We fit models with the regularization parameter $C$ taking 20 evenly spaced values from $10^{-3}$ to $10^{2}$ on a $\log_{10}$ scale.
For each subject, we selected the model with the highest log likelihood on a test set consisting of 20% held-out data.
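A sketch of this model-selection loop under those settings; the data arrays here are placeholders:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(1000, 6))      # placeholder features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)            # placeholder labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_C, best_ll = None, -np.inf
for C in np.logspace(-3, 2, 20):                         # 20 values, 1e-3 to 1e2
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X_tr, y_tr)
    p = np.clip(model.predict_proba(X_te)[:, 1], 1e-12, 1 - 1e-12)
    ll = np.sum(y_te * np.log(p) + (1 - y_te) * np.log(1 - p))   # held-out log lik.
    if ll > best_ll:
        best_C, best_ll = C, ll
print(best_C, best_ll)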

Supplementary Figures 4-19: Comparison across participants
Here, we reproduce our analyses from the main text for all participants used in any of the figures.

Figure (AUC comparison) caption fragment: for every participant, the Gaussian Process AUC was higher than his/her logistic regression AUC. This is encouraging, though not particularly surprising, since a Gaussian process is a non-parametric model and a logistic regression is a linear parametric model.

[Figure 16 plot: "Gradient Sensitivity Metric with Different Random Seeds, Subject 3"; y-axis: Average Goalie Sensitivity Metric (a.u.); legend: numpy random seeds 1, 5, 10.]

Figure 16: Gaussian Process policy model gradients are robust to different random seeds. One participant's behavioral data was fit with three separate Gaussian Process models with all hyperparameters held constant (participant 3, 500 inducing points, 200,000 iterations, 256 minibatch size), except that three different random seeds were used (with the numpy package). This was conducted to determine how sensitive the gradient sensitivity to opponent actions metric was to changes in random seed. We find that the qualitative form of the gradient curve remains the same and that the conclusions drawn regarding the timing and amplitude of the gradient sensitivity curve are not significantly impacted by the random seeds used.