Entropy-based metrics for predicting choice behavior based on local response to reward

For decades, behavioral scientists have used the matching law to quantify how animals distribute their choices between multiple options in response to reinforcement they receive. More recently, many reinforcement learning (RL) models have been developed to explain choice by integrating reward feedback over time. Despite reasonable success of RL models in capturing choice on a trial-by-trial basis, these models cannot capture variability in matching behavior. To address this, we developed metrics based on information theory and applied them to choice data from dynamic learning tasks in mice and monkeys. We found that a single entropy-based metric can explain 50% and 41% of variance in matching in mice and monkeys, respectively. We then used limitations of existing RL models in capturing entropy-based metrics to construct more accurate models of choice. Together, our entropy-based metrics provide a model-free tool to predict adaptive choice behavior and reveal underlying neural mechanisms.

How do we distribute our time and choices between the many options or actions available to us? Around 60 years ago, Richard Herrnstein sought to answer this question based on one of the key ideas of behaviorism; that is, the history of reinforcement is the most important determinant of behavior. He proposed a simple rule called the matching law, stating that the proportion of time or responses that an animal allocates to an option or action matches the proportion of reinforcement it receives from that option or action 1. The matching law has been shown to explain global choice behavior across many species 2 including pigeons 3-6, mice [7][8][9], rats [10][11][12], monkeys [13][14][15][16][17][18], and humans [19][20][21][22][23][24], in a wide range of tasks, including concurrent variable interval, concurrent variable ratio, probabilistic reversal learning, and so forth. A common finding in most studies, however, has been that animals undermatch, that is, they select the better stimulus or action less often than is prescribed by the matching law. Such deviation from matching often corresponds to suboptimal behavior in terms of total harvested rewards, pointing to adaptive mechanisms beyond reward maximization.
The matching law is a global rule but ultimately should emerge from an interaction between choice and learning strategies, resulting in local (in time) adjustments of choice behavior to reinforcement obtained in each trial. Accordingly, many studies have tried to explain matching based on different learning mechanisms. Reinforcement-learning (RL) models are particularly useful because they can simulate changes in behavior due to reward feedback. Consequently, many studies on matching have focused on developing RL models that can generate global matching behavior based on local learning rules 14,15,[25][26][27][28][29][30][31][32]. These RL models are often augmented with some components in addition to stimulus- or action-value functions to improve fit of choice behavior on a trial-by-trial basis. For example, the models could include learning the reward-independent rate of choosing each option 15, adopting win-stay lose-switch (WSLS) policies 27,28, or learning on multiple timescales 31. Although these models all provide compelling explanations of the emergence of matching behavior, it remains unclear how they compare in terms of fitting local choice behavior and the extent to which they replicate observed variability in matching behavior. This could lead to misinterpreting or missing important neural mechanisms underlying matching behavior in particular and adaptive behavior more generally 33,34. Therefore, after decades of research on matching behavior, it is still not fully understood how such a fundamental law of behavior emerges as a result of local response to reward feedback.
In this work, we propose a set of metrics based on information theory that can summarize trial-by-trial response to reward feedback and predict global matching behavior. To test the utility of our metrics, we apply them to large sets of behavioral data in mice and monkeys during two very different dynamic learning tasks. We find that in both mice and monkeys, our entropy-based metrics can predict deviation from matching better than existing measures. Specifically, we find the strongest link between undermatching and the consistency of choice strategy (stay or switch) in response to receiving no reward after selection of the worse option in both species. Finally, we use shortcomings of purely RL models in capturing the pattern of entropy-based metrics in our data to construct multicomponent models that integrate reward- and option-dependent strategies with standard RL models. We show that these models can capture both trial-by-trial choice data and global choice behavior better than the existing models, thus revealing additional mechanisms involved in adaptive learning and decision making.

Results
Mice and monkeys dynamically adjust their behavior to changes in reward probabilities. To study learning and decision making in dynamic reward environments, we examined choice behavior of mice and monkeys during two different probabilistic reversal learning tasks. Mice selected between two actions (licking left and right) that provided reward with different probabilities, and these probabilities changed between blocks of trials without any signal to the animals 9 (Fig. 1a; see Methods for more details). Block lengths were drawn from a uniform distribution that spanned a range of 40 to 80 trials. Here, we focused on the majority (469 out of 528) of sessions in which two sets of reward probabilities (equal to 0.4 and 0.1, and 0.4 and 0.05) were used. We refer to these reward schedules as 40/10 and 40/5 reward schedules (1786 and 1533 blocks with 40/5 and 40/10 reward schedules, respectively). Rewards were baited such that if reward was assigned on a given side and that side was not selected, reward would remain on that side until the next time that side was selected. Due to baiting, the probability of obtaining reward on the unchosen side increased over time, as during foraging in a natural environment. As a result, occasionally selecting the worse side (the side with the lower base reward rate) can improve the total harvested reward. In total, 16 mice performed 469 sessions of the two-probability version of the task for a total of 3319 blocks and 189,199 trials.
In a different experiment, monkeys selected between pairs of stimuli (a circle or square with variable colors) via saccades and received a juice reward probabilistically 35 (Fig. 1b; see Methods for more details). In superblocks of 80 trials, the reward probabilities assigned to each stimulus reversed randomly between trials 30 and 50, such that the more-rewarding stimulus became the less-rewarding stimulus. We refer to trials before and after a reversal as a block. Monkeys completed multiple superblocks per session of the experiment, wherein the reward probabilities assigned to the better and worse stimuli were equal to 0.8 and 0.2, 0.7 and 0.3, or 0.6 and 0.4, which we refer to as 80/20, 70/30, and 60/40 reward schedules. In contrast to the task used in mice, rewards were not baited. Here, we only analyze data from the 80/20 and 70/30 reward schedules as they provide two levels of reward uncertainty similar to the experiment in mice. In total, 4 monkeys performed 2212 blocks of the task with the 80/20 and 70/30 reward schedules for a total of 88,480 trials.
We found that in response to block switches, both mice and monkeys rapidly adjusted their choice behavior to select the better option (better side or stimulus in mice and monkeys, respectively) more often (Fig. 1c, d). However, the fraction of times they chose the better option fell below predictions made by the matching law, even at the end of the blocks (Fig. 1e, f). More specifically, the relative selection of the better option (i.e., choice fraction) was often lower than the ratio of reward harvested on the better option to the overall reward harvested (i.e., reward fraction), corresponding to undermatching behavior. Therefore, we next explored how undermatching depends on choice- and reward-dependent strategies.
Mice and monkeys exhibit highly variable undermatching behavior. To better examine matching behavior, we used the difference between the relative choice and reward fractions for each block of trials to define "deviation from matching" (see Eqs. 1, 2 in Methods; Fig. 2a, d). Based on our definition, negative and positive values for deviation from matching correspond to undermatching and overmatching, respectively. Undermatching occurs when the relative choice fraction is smaller than the relative reward fraction for reward fractions larger than 0.5, or the relative choice fraction is larger than the relative reward fraction when the latter is smaller than 0.5. Overmatching occurs when the relative choice fraction is larger than the relative reward fraction for reward fractions larger than 0.5, or the relative choice fraction is smaller than the relative reward fraction when the latter is smaller than 0.5. Undermatching could happen because the animal fails to detect the more-rewarding stimulus or action, because of poor credit assignment, or because of too much stochasticity in choice. In contrast, overmatching is characterized by selecting the better option more frequently than is prescribed by perfect matching. In the task used in monkeys, overmatching was not possible by design (except due to random fluctuations in reward assignment) and harvested rewards could be maximized by selecting the better stimulus all the time, corresponding to matching. In contrast, overmatching was possible in the reversal learning tasks with baited rewards (e.g., task used in mice).
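The sign convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact definition is given by Eqs. 1, 2 in Methods, which are not reproduced here, and the function name `deviation_from_matching`, its arguments, and the treatment of a reward fraction of exactly 0.5 are hypothetical choices made for this example.

```python
def deviation_from_matching(choices, rewards, better):
    """Signed deviation from matching for one block of trials.

    choices: option label chosen on each trial
    rewards: 0/1 reward outcome on each trial
    better:  label of the option treated as the better one
    Returns a negative value for undermatching, a positive value for
    overmatching, or None when no reward was harvested in the block.
    """
    n_trials = len(choices)
    total_reward = sum(rewards)
    if n_trials == 0 or total_reward == 0:
        return None  # reward fraction is undefined

    # Relative choice fraction: share of trials on the better option.
    choice_frac = sum(c == better for c in choices) / n_trials
    # Relative reward fraction: share of harvested reward from that option.
    reward_frac = sum(r for c, r in zip(choices, rewards) if c == better) / total_reward

    # Assumption: flip the sign when the reward fraction is below 0.5 so that
    # negative always means undermatching, as stated in the text.
    if reward_frac >= 0.5:
        return choice_frac - reward_frac
    return reward_frac - choice_frac
```

For example, a block with 70% of choices on the better option but 80% of harvested reward coming from it yields a deviation of about −0.1, that is, undermatching.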
Fig. 1 a, b To initiate a trial, mice received an olfactory go cue (or no-go cue in 5% of trials) (a), and monkeys fixated on a central point (b). Next, animals chose (via licks for mice and saccades for monkeys) between two options (left or right tubes for mice and circle or square for monkeys) and then received a reward (drop of water and juice for mice and monkeys, respectively) probabilistically based on their choice. c, d Average choice and reward using a sliding window with a length of 10 for a representative session in mice (c) and five superblocks of a representative session in monkeys (d). Mean selection of 1 and −1 correspond to 100% selection of or 100% reward on the right and left in mice (square and circle stimuli in monkeys), respectively, and mean selection of 0 corresponds to equal selection or reward on the two choice options. Vertical gray dashed lines indicate trials where reward probabilities reversed. Vertical gray solid lines indicate divisions between superblocks in the monkey experiment. e, f Average relative choice and reward fractions around block switches using a non-causal smoothing kernel with a length of three separately for all blocks with a given reward schedule in mice (e) and monkeys (f). The better (or worse) option is the better (or worse) option prior to the block switch. Trial zero is the first trial with the reversed reward probabilities. Average choice fractions for the better option (better side or stimulus) are lower than average reward fractions for that option throughout the block for both mice and monkeys, corresponding to undermatching behavior.

Existing behavioral metrics only partially explain variability in matching. To examine the relationship between existing behavioral metrics and undermatching, we first performed stepwise multiple regressions to predict deviation from matching for both mice and monkeys based on commonly used behavioral metrics including p(win), p(stay), p(stay|win), and p(switch|lose). The threshold for adding a predictor was set at p < 0.0001 (see "Methods" for more details and Supplementary Note 1 for regression equations). These regression models explained 31% and 34% of the variance in deviation from matching for mice and monkeys, respectively, which are significant but unsurprising amounts of overall variance (mice: Adjusted R² = 0.31; monkeys: Adjusted R² = 0.34). We next included the Repetition Index (RI) on the better (RI B) and worse (RI W) options (side or stimulus), which measure the tendency to stay beyond chance on the better and worse options 36, to predict undermatching.
To that end, we conducted additional stepwise multiple regressions that predicted deviation from matching using RI B, RI W, p(win), p(stay), p(stay|win), and p(switch|lose) as predictors. These models explained 48% and 49% of the variance in deviation from matching for mice and monkeys, respectively (mice: Adjusted R² = 0.48; monkeys: Adjusted R² = 0.49). Thus, including RI B and RI W enabled us to account for an additional 17% and 15% of variance, suggesting that staying beyond chance on both the better and worse choice options is a significant contributor to undermatching behavior.
Together, these results illustrate that undermatching is correlated with the tendency to stay beyond chance (measured by RI) and response to reward feedback in terms of stay or switch (measured by win-stay and lose-switch). However, win-stay and lose-switch are not strong predictors of undermatching because their relative importance depends on the overall probability of winning. For example, if p(win) is high, lose-switch is less useful for predicting behavior because response to loss represents strategy in a small subset of trials. Although win-stay, lose-switch, p(win), and p(stay) contain all the information necessary to compute the dependence of strategy on reward, this requires interpretation of all four metrics in conjunction and may depend on nonlinear relationships that are challenging to intuit or capture with regression. To overcome these issues, we propose metrics to quantify changes in strategy due to reward outcome using information theory.
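For concreteness, the four existing metrics discussed above can be computed from a block of choices and outcomes roughly as follows. This is a minimal sketch, not the analysis code used in the study: the function name `stay_switch_metrics` and the dictionary keys are hypothetical, and it assumes that stay/switch is scored on consecutive trial pairs within a block while p(win) is taken over all trials.

```python
def stay_switch_metrics(choices, rewards):
    """Return p(win), p(stay), p(stay|win) and p(switch|lose) for one block.

    Conditional probabilities are None when their condition never occurs.
    """
    if len(choices) < 2:
        raise ValueError("need at least two trials to score stay/switch")

    # Each transition pairs the previous trial's outcome with whether the
    # animal repeated its choice on the next trial.
    stays = [nxt == prev for prev, nxt in zip(choices[:-1], choices[1:])]
    wins = [bool(r) for r in rewards[:-1]]

    n_win = sum(wins)
    n_lose = len(wins) - n_win
    win_stay = sum(s for s, w in zip(stays, wins) if w)
    lose_switch = sum((not s) for s, w in zip(stays, wins) if not w)

    return {
        "p_win": sum(rewards) / len(rewards),
        "p_stay": sum(stays) / len(stays),
        "win_stay": win_stay / n_win if n_win else None,
        "lose_switch": lose_switch / n_lose if n_lose else None,
    }
```

Note that `win_stay` and `lose_switch` are each computed on a subset of transitions whose size depends on p(win), which is the limitation the text points out.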
Behavioral metrics based on information theory. To better capture the dependence of staying (or similarly switching) strategy on reward outcome, we developed a series of model-free behavioral metrics based on Shannon's information entropy 37. The Shannon information entropy of a random variable X conditioned on Y, denoted H(X|Y), captures the surprise or uncertainty of X given knowledge of the values of Y. Lower information entropies correspond with decreased uncertainty in the variable under consideration and thus consistency in utilized strategy (see below). First, we define the entropy of reward-dependent strategy (ERDS) that measures the dependence of adopting a response strategy on reward outcome. Formally, ERDS is the information entropy of response strategy conditioned on reward outcome in the previous trial, H(str|rew) (see "Methods" for more details). Therefore, ERDS quantifies the amount of information needed to explain an animal's strategy (e.g., choosing the option selected in the previous trial) given knowledge of reward outcome (e.g., whether the animal won or lost in the previous trial). Lower ERDS values indicate more consistent response to reward feedback.
In its simplest formulation for measuring the effect of reward in the previous trial on staying or switching, ERDS is a function of win-stay, lose-switch, and p(win) (see Eq. (8) in "Methods"). As win-stay and lose-switch move further from 0.5, ERDS decreases, reflecting increased consistency of reward-dependent strategy (Fig. 3a). Moreover, p(win) modulates the effects of win-stay and lose-switch on ERDS (Fig. 3b, c). As p(win) decreases, the influence of win-stay on ERDS decreases, reflecting that win-stay is less relevant to overall response to reward feedback when winning rarely occurs. Similarly, as p(lose) (= 1 − p(win)) decreases, the influence of lose-switch on ERDS decreases, reflecting that lose-switch is less relevant to response to reward feedback. Because of these properties, ERDS corrects for the limitations of win-stay and lose-switch. Also, as stay (or switch) strategy becomes more independent of reward outcome, ERDS increases because reward outcome provides no information about strategy.
ERDS can be decomposed into ERDS + and ERDS − to measure the specific effects of winning and losing in the preceding trial, respectively (Fig. 3b, c; see Methods). More specifically, ERDS + is the entropy of win-dependent strategy, and ERDS − is the entropy of loss-dependent strategy. Therefore, comparing ERDS + and ERDS − provides information about the relative contributions of win-dependent strategy and loss-dependent strategy to the overall reward-dependent strategy.
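The dependence of ERDS on win-stay, lose-switch, and p(win) can be illustrated with the standard conditional-entropy formula, H(str|rew) = −Σ p(str, rew) log2[p(str, rew)/p(rew)], split into its win and loss terms. The sketch below assumes this standard form agrees with Eq. (8) in Methods, which is not reproduced here; the function names are hypothetical, and `p_win` is taken to be the probability of winning on the preceding trial.

```python
import math


def _entropy_term(joint, marginal):
    """Return -joint * log2(joint / marginal), using the convention 0 log 0 = 0."""
    if joint <= 0.0:
        return 0.0
    return -joint * math.log2(joint / marginal)


def erds_components(p_win, win_stay, lose_switch):
    """Return (ERDS+, ERDS-, ERDS) in bits.

    p_win:       probability of reward on the previous trial
    win_stay:    p(stay | win)
    lose_switch: p(switch | lose)
    """
    p_lose = 1.0 - p_win
    # Win-dependent part: stay and switch after a rewarded trial.
    erds_plus = (_entropy_term(p_win * win_stay, p_win)
                 + _entropy_term(p_win * (1.0 - win_stay), p_win))
    # Loss-dependent part: switch and stay after an unrewarded trial.
    erds_minus = (_entropy_term(p_lose * lose_switch, p_lose)
                  + _entropy_term(p_lose * (1.0 - lose_switch), p_lose))
    return erds_plus, erds_minus, erds_plus + erds_minus
```

With `win_stay = 0.5` and `lose_switch = 1.0`, the total is 0.9 bits when `p_win = 0.9` but only 0.1 bits when `p_win = 0.1`, which reproduces the modulation by p(win) described in the text: an inconsistent response to wins matters less when wins are rare.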
In addition to conditioning stay strategy on reward outcome in the preceding trial, we can also condition stay strategy on selection of the better or worse choice option (stimulus or action) in the previous trial. The resulting entropy of option-dependent strategy (EODS), H(str|opt), captures the dependence of stay (or switch) strategy on the selection of the better or worse option in the preceding trial (Fig. 3d). EODS depends on p(choose better), p(stay|choose better), and p(switch|choose worse) and, moreover, can be decomposed into EODS B and EODS W based on selection of the better or worse option, respectively (Fig. 3e, f).
Finally, to capture the dependence of response strategy on reward outcome and the previously selected option in a single metric, we computed the entropy of reward- and option-dependent strategy (ERODS), H(str|rew, opt). ERODS depends on the probabilities of adopting a response strategy conditioned on all combinations of reward outcome and option selected in the previous trial (see "Methods" for more details). ERODS has similar properties to ERDS and EODS and can be interpreted in a similar fashion. Low ERODS values indicate that stay (or switch) strategy consistently depends on combinations of win/lose and the option selected in the previous trial; for example, winning and choosing the better side in the previous trial. ERODS can be decomposed either by the better or worse option (ERODS B and ERODS W ), by win or loss (ERODS + and ERODS − ), or by both (ERODS B+ , ERODS W+ , ERODS B− , and ERODS W− ).
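One way to see how ERODS and its four components could be estimated from raw trial data is to count stay/switch outcomes within each combination of previous option and previous reward. The sketch below uses the standard conditional-entropy decomposition and is not taken from the Methods; the function names, the labels "B"/"W" and "+"/"-", and the choice to estimate probabilities from within-block transition counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def conditional_entropy_components(strategies, conditions):
    """Split H(strategy | condition), in bits, into one term per condition value."""
    n = len(strategies)
    joint = Counter(zip(strategies, conditions))
    marginal = Counter(conditions)
    components = {cond: 0.0 for cond in marginal}
    for (_, cond), count in joint.items():
        # -p(str, cond) * log2 p(str | cond), estimated from counts.
        components[cond] -= (count / n) * math.log2(count / marginal[cond])
    return components


def erods(choices, rewards, better):
    """Return (ERODS, components) for one block.

    Component keys are (option, outcome) pairs such as ("W", "-"), which
    corresponds to losing after choosing the worse option.
    """
    strategies = ["stay" if nxt == prev else "switch"
                  for prev, nxt in zip(choices[:-1], choices[1:])]
    conditions = [("B" if c == better else "W", "+" if r else "-")
                  for c, r in zip(choices[:-1], rewards[:-1])]
    parts = conditional_entropy_components(strategies, conditions)
    return sum(parts.values()), parts
```

Summing the components over outcomes gives the option-wise split (ERODS B, ERODS W), and summing over options gives the outcome-wise split (ERODS +, ERODS −). Passing only the reward outcome or only the option as the condition to `conditional_entropy_components` yields ERDS or EODS in the same way.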
To summarize, we propose three metrics, ERDS, EODS, and ERODS that capture the dependence of response strategy on reward outcome and/or selected option in the preceding trial. Each metric can be decomposed into components that provide important information about the dependence of stay (or switch) strategy on winning or losing and/or choosing the better or worse option in the preceding trial (see Supplementary Table 1 for summary). We next show how these entropy-based metrics can predict deviation from matching behavior and further be used to construct more successful RL models.
Deviation from matching is highly correlated with entropy-based metrics. To test the relationship between the observed undermatching and our entropy-based metrics, we next computed correlations between all behavioral metrics and deviation from matching (Supplementary Fig. 2). We found that nearly all entropy-based metrics were significantly correlated with deviation from matching. Out of all behavioral metrics tested, deviation from matching showed the strongest correlation with ERODS W− based on both parametric (Pearson; mice: r = −0.71; p < 10^−300; monkeys: r = −0.64; p = 10^−231) and non-parametric (Spearman; mice: r = −0.78; p < 10^−300; monkeys: r = −0.75; p < 10^−300) tests in both mice and monkeys (Fig. 4). The size of the correlation between ERODS W− and deviation from matching is remarkable because it indicates that a single metric can capture more than 50% and 41% of the variance in deviation from matching in mice and monkeys, respectively. This finding suggests that undermatching occurs when animals lose when selecting the worse option and respond inconsistently to those losses.
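The variance figures quoted above follow from squaring the reported Pearson coefficients, since for a single linear predictor the proportion of variance explained equals r squared:

```latex
r^2_{\text{mice}} = (-0.71)^2 \approx 0.504, \qquad
r^2_{\text{monkeys}} = (-0.64)^2 \approx 0.410
```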
Out of all existing behavioral metrics tested, the probability of winning and probability of staying had the two strongest correlations with deviation from matching for mice and monkeys, respectively, but each metric individually only captured about 25% of variance in deviation from matching (Fig. 4). The correlation between the probability of winning (total harvested rewards) and deviation from matching was positive such that increased total harvested rewards corresponded with less undermatching.
Entropy-based metrics can accurately predict deviation from matching. To verify the utility of entropy-based metrics in predicting deviation from matching, we performed additional stepwise regressions to predict deviation from matching using our entropy-based metrics. In these models, we included ERDS +, ERDS −, […]. These models explained 74% and 57% of total variance in deviation from matching for mice and monkeys, respectively (mice: Adjusted R² = 0.74; monkeys: Adjusted R² = 0.57). For mice, the regression model explained 26% more variance than the model with repetition indices and other existing behavioral metrics and 43% more variance than the model with existing behavioral metrics but without repetition indices. For monkeys, the regression model explained 8% more variance than the model with repetition indices and existing behavioral metrics and 23% more variance than the model with existing behavioral metrics without repetition indices. These are significant improvements over previous models, suggesting that most variance in undermatching behavior can be explained by trial-by-trial response to reward feedback.
In terms of the predictive power of different metrics, we found that for mice, the first three predictors added to the regression models were ERODS W− (ΔR² = 0.59), ERODS W+ (ΔR² = 0.04), and ERODS B+ (ΔR² = 0.02). For monkeys, the first three predictors added were ERODS W− (ΔR² = 0.31), EODS W (ΔR² = 0.09), and ERODS B+ (ΔR² = 0.06). These results indicate that entropy-based metrics were the best predictors of deviation from matching when considering all metrics together. In addition, most entropy-based metrics included as predictors were added to the final regression equations for both mice and monkeys. This suggests that despite their overlap, each entropy-based metric captures a unique aspect of the variance in deviation from matching behavior.
Entropy-based metrics capture the relationship between undermatching and reward environment better than existing metrics. Our previous observation that entropy-based metrics can explain most variance in undermatching behavior suggests that entropy-based metrics may also capture average differences in undermatching between reward schedules in mice and the lack of differences in undermatching between reward schedules in monkeys.
To test whether this was the case, we used tenfold cross-validated linear regression to predict deviation from matching using a model without entropy-based metrics or repetition indices, a model without entropy-based metrics, and a model with all metrics (full model). Predictors chosen for inclusion in these models were the predictors that remained in the final stepwise regression equations described above.
For mice, in all three models, predicted deviation from matching was significantly lower in the 40/5 than the 40/10 reward environment (Supplementary Fig. 6; two-sided t test; model without entropy-based metrics or repetition indices: p = 6.30 × 10^−3; model without entropy-based metrics: p = 3.22 × 10^−3; full model: p = 4.76 × 10^−25). Importantly, the difference between deviation from matching in the two reward schedules was greatest for the full model (Supplementary Fig. 6; Cohen's d; model without entropy-based metrics or repetition indices: d = −0.10; model without entropy-based metrics: d = −0.10; full model: d = −0.39). The full model with entropy-based metrics was the only model that came close to replicating the magnitude of differences in deviation from matching between the 40/5 and 40/10 schedules in behavioral data (Supplementary Fig. 6c, d).
For monkeys, predicted deviation from matching from both regression models without entropy-based metrics was significantly lower in the 70/30 than the 80/20 reward environment (Supplementary Fig. 6e […] model without entropy-based metrics: p = 1.32 × 10^−29). Only the regression model with entropy-based metrics replicated the observed lack of difference in undermatching between reward schedules (Supplementary Fig. 6g, h; full model: p = 0.36; observed difference between reward schedules: p = 0.19). Therefore, entropy-based metrics are necessary and sufficient to capture the influence of reward schedule on deviation from matching.

Fig. 4 Correlation between undermatching and proposed entropy-based metrics and underlying probabilities. a Pearson correlation between proposed entropy-based metrics and existing behavioral metrics and deviation from matching in mice. Correlation coefficients are computed across all blocks, and metrics with nonsignificant correlations (two-sided, p > 0.0001 to account for multiple comparisons) are indicated with a hollow bar. The metric with the highest correlation with deviation from matching is indicated with a star (ERODS W− ; r = −0.71; p < 10^−300). b Similar to (a) but for monkeys (ERODS W− ; r = −0.64; p = 10^−231). Overall, entropy-based metrics show stronger correlation with deviation from matching than existing metrics.
Purely RL models do not capture the pattern of entropy-based metrics. To capture the observed variability in entropy-based metrics and underlying learning and choice mechanisms, we next fit choice behavior using three purely RL models. These models assumed different updating of reward values (RL1 and RL2; see Methods for more details) or learning multiple reward values across different timescales (multiple timescales model). We tested these models because previous research has suggested they can replicate matching or undermatching phenomena 14,26,31 . Out of these three models, we found that RL2, in which the estimated reward value for the unchosen option (side or stimulus) decays to zero over time, provided the best fit of choice behavior for both mice and monkeys as reflected in the lowest Akaike Information Criterion (AIC) (Fig. 5a, b, Supplementary Table 2).
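To make the key feature of RL2 concrete, the sketch below simulates a two-option learner whose chosen value follows a delta rule and whose unchosen value decays toward zero. It is a loose illustration only: the actual update equations and fitted parameters are given in Methods and are not reproduced here, so the parameter names (`alpha`, `decay`, `beta`), their default values, the logistic choice rule, and the use of unbaited rewards are all assumptions.

```python
import math
import random


def simulate_rl2(reward_probs, n_trials, alpha=0.3, decay=0.1, beta=5.0, seed=0):
    """Simulate choices and rewards from an RL2-like learner (hypothetical form).

    reward_probs: reward probability of option 0 and option 1 (no baiting)
    alpha:        learning rate for the chosen option
    decay:        per-trial decay rate of the unchosen option's value
    beta:         inverse temperature of the logistic choice rule
    """
    rng = random.Random(seed)
    values = [0.0, 0.0]
    choices, rewards = [], []
    for _ in range(n_trials):
        # Logistic (two-option softmax) rule on the value difference.
        p_first = 1.0 / (1.0 + math.exp(-beta * (values[0] - values[1])))
        choice = 0 if rng.random() < p_first else 1
        reward = 1 if rng.random() < reward_probs[choice] else 0

        # Chosen option: move the value toward the obtained outcome.
        values[choice] += alpha * (reward - values[choice])
        # Unchosen option: value decays toward zero over time.
        values[1 - choice] *= (1.0 - decay)

        choices.append(choice)
        rewards.append(reward)
    return choices, rewards
```

Simulated sequences of this kind can be passed through the same behavioral metrics as the observed data, which is the logic of the model-versus-data comparisons reported in this section.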
We next tested whether RL2 could replicate observed distributions of entropy-based metrics and undermatching by simulating the model during our experiment using parameters obtained from model fitting. Due to the large number of simulations performed (100 simulations per session, mice: n = 331,900 blocks; monkeys: n = 221,200 blocks), we were able to estimate population distributions of metrics for each model. We found that the median predicted ERODS W− was significantly higher than the median observed ERODS W− , suggesting the RL2 model underutilizes loss-dependent and option-dependent strategies when compared to mice and monkeys in our experiments (Fig. 5c, d).
To evaluate the similarity of observed and predicted distributions of entropy-based metrics and matching, we computed Kolmogorov's D statistic, which measures the maximum difference (or distance) between two empirical cumulative distribution functions. Using this method, we found that the distribution of predicted ERODS W− differed significantly from the observed distributions for both mice and monkeys (Fig. 5c, d; two-sided Kolmogorov-Smirnov test; mice: D = 0.121; p = 1.44 × 10^−41; monkeys: D = 0.072; p = 1.42 × 10^−9). Moreover, the predicted distribution of deviation from matching differed significantly from the observed distribution for both mice and monkeys (Fig. 5e, f; two-sided Kolmogorov-Smirnov test; mice: D = 0.091; p = 3.24 × 10^−24; monkeys: D = 0.101; p = 6.38 × 10^−20).
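The D statistic itself is simple to compute: evaluate both empirical cumulative distribution functions at every observed value and take the largest absolute gap. The following is a minimal standard-library sketch of the two-sample statistic only (no p-value); the function name `ks_statistic` is hypothetical and this is not the code used for the reported tests.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Both ECDFs are step functions that only change at observed values,
    # so checking every data point is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Identical samples give D = 0 and completely separated samples give D = 1, so the reported values near 0.1 indicate modest but, at these sample sizes, highly significant mismatches.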
Finally, we also computed undermatching and all behavioral metrics in simulated data using RL2 with random parameter values. We found that our entropy-based metrics were better predictors for deviation from matching than the parameters of the RL2 model (see Supplementary Fig. 7). Together, our results illustrate that purely RL models fail to replicate the observed distribution of ERODS W− and variability in matching behavior, pointing to additional mechanisms that contribute to behavior. Moreover, ERODS W− was highly correlated with undermatching in observed behavior (Fig. 4) and RL simulations (Supplementary Fig. 7), suggesting that a model that better captures ERODS W− may also better capture variability in matching behavior.
Model with additional choice memory captures entropy-based metrics in monkeys more accurately. The deviations of predicted ERODS W− from observed ERODS W− suggest RL models underutilize loss-dependent and option-dependent strategies; that is, they fail to capture the influence of option (stimulus or action) and loss in the current trial on choice in the subsequent trial. To improve capture of option-dependent strategy, we added a common choice-memory component to estimate the effects of previous choices on subsequent decisions 8,15,38. The choice-memory (CM) component encourages either staying on or switching from options that have been chosen recently. Because standard RL models typically choose the option with a higher value, the CM component can capture strategy in response to selection of the better or worse option reflected in the option-dependent entropy-based metrics. The influence of the CM component on choice is determined by fitting a weight parameter that can take either positive or negative values, which correspond to better-stay/worse-switch or better-switch/worse-stay strategies, respectively.
In monkeys, we found that the RL2 model augmented with a CM component, which we refer to as the RL2 + CM model, fit choice behavior better than RL1, RL2, and RL1 + CM as indicated by lower AIC (Fig. 5b; Supplementary Table 2). Although the improvement in fit of choice behavior for RL2 + CM over RL2 was statistically significant (paired samples t test of AICs: p = 1.04 × 10^−23; Supplementary Table 2), the RL2 + CM model was only twice as likely as RL2 to be the best model based on a comparison of Akaike weights.
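Akaike weights convert AIC differences into relative likelihoods of each candidate being the best model in the set. The sketch below uses the standard definition, w_i = exp(−Δ_i/2) / Σ_j exp(−Δ_j/2) with Δ_i = AIC_i − min AIC; how AIC values were aggregated across sessions before this comparison is not stated in this section, so the function name and any inputs passed to it are hypothetical.

```python
import math


def akaike_weights(aics):
    """Return the Akaike weight of each model given a list of AIC values."""
    best = min(aics)
    # Relative likelihood of each model compared with the best one.
    rel = [math.exp(-0.5 * (aic - best)) for aic in aics]
    total = sum(rel)
    return [r / total for r in rel]
```

Under this definition, a model is twice as likely as a competitor when its AIC is lower by 2 ln 2 ≈ 1.39, which shows how a statistically reliable difference in AIC can still correspond to a modest ratio of Akaike weights.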
Importantly, the RL2 + CM model improved capture of the observed distribution of ERODS W− in monkeys (Fig. 5d; two-sided Kolmogorov-Smirnov test; D = 0.037; p = 8.91 × 10^−3). This improvement in capturing ERODS W− corresponded with similar improvements in capturing deviation from matching. The predicted distribution of deviation from matching from the RL2 + CM model better replicated the observed distribution of deviation from matching than the predicted distribution from RL2 (Fig. 5f; two-sided Kolmogorov-Smirnov test; D = 0.065; p = 2.07 × 10^−8). This improvement was substantial: the maximum difference between CDFs was over 30% smaller for the RL2 + CM model than for the RL2 model.
To determine whether these improvements were attributable to modulations in better-switch/worse-stay or better-stay/worse-switch strategies, we examined the distribution of the estimated CM weights and fit a model with a CM component with weights restricted to positive values only (RL2 + CM+ model). The median fitted CM weight in the RL2 + CM model was negative (Supplementary Fig. 8k), and the fit of choice behavior was worse for the RL2 + CM+ model than the RL2 + CM model (Supplementary Table 2), indicating that the CM component enhanced better-switch/worse-stay strategies in monkeys.
In mice, however, the RL2 + CM model had positive weights and had fairly weak effects on fit of local choice behavior and capture of metrics and undermatching (Supplementary Figs. 8i, 5a; Supplementary Table 2). These results, in conjunction with our entropy-based findings, suggest that additional mechanisms that modulate response to loss are necessary to improve capture of variability in ERODS W− and matching behavior in mice. In mice, the RL2 + CM + LM model (RL2 augmented with a choice-memory and a loss-memory component) fit choice behavior better than all existing RL models as indicated by a lower AIC (Fig. 5a). The Akaike weight for the RL2 + CM + LM […]. Moreover, the predicted distribution of deviation from matching from the RL2 + CM + LM model better replicated the observed distribution of deviation from matching than the predicted distribution from RL2 (Fig. 5e; two-sided Kolmogorov-Smirnov test; mice: D = 0.065; p = 2.19 × 10^−12). This improvement corresponds to an over 20% reduction in the maximum difference between cumulative distribution functions (CDFs) for deviation from matching computed from observed and simulated data.
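The stated reductions can be checked directly from the D statistics for deviation from matching reported above for RL2 (monkeys: 0.101; mice: 0.091) and for the augmented models (0.065 in both species):

```latex
\text{monkeys (RL2 + CM): } \frac{0.101 - 0.065}{0.101} \approx 0.356, \qquad
\text{mice (RL2 + CM + LM): } \frac{0.091 - 0.065}{0.091} \approx 0.286
```

These correspond to reductions of roughly 36% and 29%, consistent with the "over 30%" and "over 20%" figures in the text.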
Finally, to understand how the RL2 + CM + LM model modulates specific loss- and option-dependent strategies in mice, we compared the distributions of model parameters between all models with RL2. The LM and CM components both had positive average weights in mice (Supplementary Fig. 8i-l), such that the LM component encouraged lose-switch strategies and the choice-memory component encouraged worse-switch strategies. Two additional models, RL2 + CM+ and RL2 + LM+, in which the weights of the CM and LM components were restricted to positive values, fit comparably to the RL2 + CM and RL2 + LM models, providing further evidence for the previous conclusion. Interestingly, the average weights of the CM and LM components were higher in the RL2 + CM + LM model than in the RL2 + CM and RL2 + LM models, indicating that the two components may interact to modulate behavior (Supplementary Fig. 8i, j). This dovetails with our entropy-based results that show response to the loss after selection of the worse option is uniquely important for global choice behaviors.
In monkeys, however, the LM component was not necessary to better explain choice behavior. The RL2 + CM model fit local choice behavior better than the RL2 + CM + LM model based on AIC and better captured ERODS W− and deviation from matching (Fig. 5a; Supplementary Table 2). Despite this, in monkeys, the RL2 + LM model still improved upon RL2 in capture of undermatching and ERODS W− (Supplementary Table 2), indicating that there may be overlap between the effects of the CM and LM components that renders the LM component redundant in the full model.
To summarize, the models with additional components improved fit of choice behavior and captured our metrics and undermatching more accurately in both species, thereby revealing that undermatching behavior arises from competition among multiple components incorporating choice memory and/or loss memory on a trial-by-trial basis. Importantly, we used deviations in predicted entropy-based metrics from their observed values to identify shortcomings in purely RL models and to incorporate previous mechanisms or propose new mechanisms to mitigate them.

Discussion
Undermatching is a universal behavioral phenomenon that has been observed across many species. Here, we show that proposed entropy-based metrics based on response to reward feedback can accurately predict undermatching in mice and monkeys, suggesting that inconsistencies in the use of local reward-dependent and option-dependent strategies can account for a large proportion of variance in global undermatching. Moreover, we demonstrate that these entropy-based metrics can be utilized to construct more complex RL models that are able to capture choice behavior, undermatching, and utilization of reward-dependent strategies. Together, our entropy-based metrics provide a model-free tool to develop and refine computational models of choice behavior and reveal neural mechanisms underlying adaptive behavior.
Similar to many previous studies of matching behavior 4,11,14,17,39 , we observed significant, but highly variable undermatching in both mice and monkeys. By focusing on the variability in undermatching, here, we were able to show that global undermatching can be largely explained by the degree of inconsistency in response to no reward on the worse option (ERODS W−) across species. Specifically, ERODS W− could explain about 50% and 41% of the variance in undermatching in mice and monkeys, respectively. The proposed entropy-based metrics were able to predict undermatching across two very different species despite differences in the tasks including utilized reward probabilities and schedules (40/5 and 40/10 probabilities with baiting vs. complementary 80/20 and 70/30 with no baiting), learning modality (action-based vs. stimulus-based), choice readout (licks vs. saccades), and predictability of block switches (unpredictable vs. semipredictable), suggesting the proposed metrics are generalizable.
The proposed entropy-based metrics complement and improve upon commonly used behavioral metrics such as win-stay, lose-switch, and the U-measure 40 . Although win-stay and lose-switch provide valuable information 28,41-45 , these probabilities do not solely reflect the effects of reward feedback on staying (or similarly switching) as they both depend on the probability of stay. For example, if staying behavior is independent of reward, win-stay and lose-switch values simply reflect the overall stay and switch probabilities, respectively. Consistently, we found that win-stay and lose-switch are not strong predictors of undermatching because their relative importance depends on the overall probability of winning. For example, if the overall probability of reward is high, lose-switch is less useful for predicting behavior because response to loss represents strategy in a small subset of trials. Therefore, win-stay and lose-switch cannot capture the degree to which staying and switching strategies depend on reward outcome only. The entropy-based metrics such as ERDS overcome these issues by combining win-stay and lose-switch with p(win) and p(stay). Similarly, although the U-value has been used to measure consistency or variability in choice behavior 46,47 , this metric is difficult to interpret and fails to capture sequential dependencies in choice 48 . Our proposed entropy-based metrics avoid these issues because they have clear interpretations and can capture the sequential dependence of choice on previous reward and/or selected action or option.
As shown by multiple studies, models that fit choice data best may still fail to replicate important aspects of behavior 33,34 . Therefore, model validation must involve analyzing both a model's predictive potential (fitting) and its generative power (replication of behavior in simulations). We used shortcomings of purely RL models in capturing the most predictive entropy-based metrics to detect additional mechanisms underlying adaptive behavior. This approach can be applied to other tasks in which similar or different entropy-based metrics are most predictive of global choice behavior (matching or other metrics). Our aim here was not to find the best model for capturing all aspects of behavior but instead, to provide a framework for how local response to reinforcement can be used to guide model development and explore interesting properties of local and global choice behavior.
Using this method, we constructed a model (RL2 + CM + LM model) that augments a reinforcement-learning model with a choice-memory component that captures option-dependent strategies and a loss-memory component that captures loss-dependent strategies. Previous studies have also shown that a combination of WSLS strategies with RL models could improve fit of choice behavior and capture of the average matching behavior 27,28,49 . The choice-memory component used here is similar to other choice-memory components that have been shown to improve fit of choice behavior 15,38 . Nonetheless, the proposed RL2 + CM + LM model is a novel combination of these components. Critically, the weights of the loss and choice components could be either positive or negative. This parallels how entropy-based metrics capture response to reward feedback considering that low entropy can result from strong positive or negative influences of recent rewards or choices (e.g., high win-stay and high win-switch both correspond to low entropy). Neural correlates of a similar loss-memory component weighted by recent reward prediction errors have been identified in the dorsal anterior cingulate cortex of humans 50 . Moreover, neural correlates of such choice memories have been identified in various cortical areas of monkeys including the dorsolateral prefrontal cortex, dorsal medial prefrontal cortex, lateral intraparietal area, and the anterior cingulate cortex [51][52][53] .
Despite the significant correlation between ERODS W− and deviation from matching in both species, the loss-memory component introduced here only improved fit of choice behavior and capture of metrics in the full model in mice. This finding may be related to the close correspondence between reward- and option-dependent strategies in the monkey task because winning (respectively, losing) almost always corresponds with choosing the better (respectively, worse) side. Due to this significant overlap, one component may be sufficient to capture both strategies. In the mouse task, however, these strategies were dissociated because losing was likely when choosing either the better or worse option (but more for the worse option). This could explain why for monkeys, the LM component improved capture of entropy-based metrics and deviation from matching in the RL2 + LM model relative to the RL2 model but was not useful in conjunction with the choice-memory component. Moreover, we observed a higher overall probability of switching in mice than in monkeys, indicating that mice occasionally switch from the more-rewarding side to harvest baited rewards on the less-rewarding side, whereas monkeys typically exploit the more-rewarding stimulus. Because of this, a loss-memory component that encourages switching in response to loss would be more helpful in capturing that behavior in mice than in monkeys. Although the aforementioned differences in results for these two datasets may be partially explained by differences in task structure and species, they also highlight the limitations of using entropy-based metrics to guide model development. Entropy-based metrics describe properties of choice behavior that are helpful for making educated guesses about model structure, but alone, cannot provide a generative account of behavior.
The model fits were also worse for the mouse data than the monkey data in terms of explained variance in choice behavior, likely due to differences in the overall entropy in choice behavior and task structure. More specifically, mice showed higher average entropy in their choice behavior than monkeys across different measures, suggesting that the observed difference in the quality of fit occurred because mouse choice behavior was more random and thus harder to predict. In addition, sessions in the mouse task were longer than superblocks in the monkey task, so the same number of parameters was used to account for more choices in mice than in monkeys, resulting in an overall poorer fitting quality.
The goal of our approach, to predict and develop generative models to explain undermatching, was similar to a recent study that suggested limited undermatching results in optimal performance in stochastic environments and proposed learning on multiple timescales to account for such undermatching 31 . In contrast, we identified a positive correlation between reward harvesting and deviation from matching which suggests that the degree of undermatching observed here corresponded with suboptimal choice. This difference between Iigaya et al. 31 and our study could be due to differences in how performance and undermatching are defined. More specifically, here we measure performance as the total number of harvested rewards in each block of trials and undermatching as the difference between choice and reward fractions in each block. In contrast, Iigaya et al. 31 use harvesting efficiency, equal to the number of rewards harvested divided by the maximum number of rewards that could have been collected, in each session of experiment (consisting of multiple blocks) as a measure of performance and quantify undermatching as the difference between the slope of choice vs. reward fractions and one in each session. Moreover, we found that nearly all other models described here better accounted for local and global choice behavior than the multiple timescales model proposed in Iigaya et al. 31 . Nonetheless, it is possible that more complex models based on learning on multiple timescales may fit choice behavior better.
We also observed weak, positive choice-memory effects in mice such that mice tended to choose options that they had recently chosen. A previous study using a nearly identical task (reversal learning with same reward schedules (40/10) and baited rewards, but longer blocks) observed a much stronger, negative choice-memory effect in mice 8 . The reason for this difference is unclear given the similarity of the two tasks. Consistent with prior studies of choice-history effects in monkeys 15 , we identified strong, negative choice-memory effects in monkeys such that the choice-memory component encouraged switching from recently chosen options. Thus, the incorporation of negative weights was important only for capturing behavior in the monkey task and could be task dependent. This negative weighting mechanism may be able to facilitate quick adaptation to reversals in monkeys, a behavior that has previously been described using a Bayesian approach 54 , because negative weights in either the choice-memory or the loss-memory component encourage faster response to reversals. Future studies are needed to test whether this is the case.
In summary, we show that entropy-based metrics are good predictors of global choice behavior across species and can be used to refine RL models. Results from fitting and simulating RL models augmented with additional components suggest that recent choices and rewards affect decisions in ways beyond their influence on the update of subjective values in standard RL models. Thus, entropy-based metrics have the potential to open a realm of possibilities for understanding computational and neural mechanisms underlying adaptive behavior.

Methods
Experimental paradigm in mice. Mice performed a dynamic foraging task in which, after receiving a go-cue signaled by an odor, they licked one of the two water tubes (on left and right) to harvest possible reward. In 5% of trials, a no-go cue was presented by another odor signaling that a lick would not be rewarded or punished. If a mouse licked one of the tubes after a go-cue odor, reward was delivered probabilistically. Each trial was followed by an inter-trial interval drawn from an exponential distribution with a rate parameter of 0.3. If a mouse licked a tube in the 1 s no-lick period prior to odor delivery, an additional inter-trial interval and an additional 2.5 s no-lick period were added.
The reward probabilities assigned to the left and right tubes were constant for a fixed number of trials (blocks) and changed throughout the session (block switches). Block lengths were drawn from a uniform distribution that spanned a range of 40-100 trials; however, the exact block lengths spanned smaller ranges for individual sessions, resulting in variable block lengths with most block lengths ranging between 40 and 80 trials. If mice exhibited strong side-specific biases, block lengths were occasionally shortened or lengthened. Miss trials, in which the mouse did not make a choice, and no-go trials were excluded from all analyses described here. In total, 1706 miss trials (average of 3.64 per session) and 7893 no-go trials (average of 16.83 trials per session) were excluded from our analyses.
Mice performed two versions of the task, one with 16 different sets of reward schedules and another with two sets of reward schedules. The vast majority (469 out of 528) of sessions used two sets of reward probabilities equal to 0.4 and 0.1, and 0.4 and 0.05, which we refer to as 40/10 and 40/5 reward schedules. Here, we focus on the most frequent blocks (40/5 and 40/10 reward schedules). Rewards were baited such that if reward was assigned on a given side and that side was not selected, reward would remain on that side until the next time that side was selected. Due to this baiting mechanism, the probability of obtaining reward on the unchosen side increased over time as during foraging in a natural environment. In total, 16 mice performed 469 sessions of the two-probability version of the task for a total of 3319 blocks (1786 and 1533 blocks with 40/5 and 40/10 reward schedules, respectively) and 189,199 trials. Male C57BL/6J (The Jackson Laboratory, 000664) mice were used in the experiment, and mice were housed on a 12 h dark/12 h light cycle. All surgical and experimental procedures were in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and approved by the Johns Hopkins University Animal Care and Use Committee. This experimental setup and some analyses of the data have also been described in Bari et al. 9 .
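The baiting rule described above can be sketched as follows. This is a minimal illustration, assuming that on every trial each unarmed side is armed independently with its assigned probability and that an armed side stays armed until it is chosen. The arming schedule, the function names, and the `"L"`/`"R"` labels are assumptions introduced here, not details of the experimental software.

```python
import random


def simulate_baited_block(p_left, p_right, choices, rng):
    """Return the reward (0/1) obtained on each trial under a baited schedule."""
    probs = {"L": p_left, "R": p_right}
    baited = {"L": False, "R": False}
    rewards = []
    for choice in choices:
        # Assumed arming rule: an unarmed side is armed with its probability;
        # an armed side keeps its reward until that side is selected.
        for side in ("L", "R"):
            if not baited[side] and rng.random() < probs[side]:
                baited[side] = True
        rewards.append(1 if baited[choice] else 0)
        baited[choice] = False  # any reward on the chosen side is collected
    return rewards


def p_baited_after(p, n_unchosen_trials):
    """Probability that a side is armed after n consecutive unchosen trials."""
    return 1.0 - (1.0 - p) ** n_unchosen_trials
```

Under this assumed rule, `p_baited_after` makes explicit why the probability of reward on the unchosen side grows the longer that side is ignored.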
Experimental paradigm in monkeys. In the reversal learning task in monkeys, Costa et al. 35 trained monkeys to fixate on a central point on a screen to initiate each trial (Fig. 1). After fixation, two stimuli, a square and circle, were presented on the screen to the left and right of the fixation point (6° visual angle). The side that the stimuli were presented on was assigned randomly and was not related to reward. Monkeys made saccades to a stimulus and fixated on the stimulus to indicate their choice in each trial. A 0.085 mL juice reward was delivered probabilistically via a pressurized tube based on the chosen stimulus. Each trial was followed by a fixed 1.5 s inter-trial interval. Trials in which the monkey did not make a choice or failed to fixate were immediately repeated.
Monkeys completed sessions that contained around 1300 trials on average divided into superblocks of 80 trials. Within each superblock, the reward probabilities assigned to each cue were reversed randomly between trials 30 and 50, such that the stimulus that was less rewarding at the beginning of the superblock became more rewarding and vice versa. Every 80 trials, monkeys were presented with new stimuli that varied in color but not shape. Six images of a red, green, and blue circle or square were used as stimuli, and the two choice options in a given block always differed in both color and shape, e.g., a red square could be presented with a blue circle. Superblock presentation was fully randomized without replacement such that a monkey viewed all stimulus pair/reward schedule combinations (e.g., red square, blue circle, 70/30) before any repeated.
Monkeys performed two variants of the task, a stochastic variant with three reward schedules (80/20, 70/30, and 60/40) and a deterministic variant with one reward schedule (100/0). Here, we focus our analyses on the 80/20 and 70/30 reward schedules (2212 blocks of the task performed by 4 monkeys) as they provide two levels of uncertainty similar to the experiment in mice. Male rhesus macaques were used in the experiment. Monkeys were water restricted throughout the experiment and during test days earned fluid only through the task. Stimulus presentation and behavioral monitoring were controlled by the MonkeyLogic (version 1.1) toolbox 55 . Eye movements were sampled at 1 kHz using an Arrington eye-tracking system (Arrington Research). All experimental procedures were performed in accordance with the Guide for the Care and Use of Laboratory Animals and were approved by the National Institute of Mental Health Animal Care and Use Committee. This experimental setup and some analyses of its data have also been described in Costa et al. 35,54 .

Behavioral metrics
Matching performance. To measure the overall response to reinforcement on the two choice options (e.g., left and right actions when reward is based on the location) in each block of the experiment, we defined undermatching (UM) as: […] where $\mathrm{sign}(0) = 1$ and choice and reward fractions ($\mathrm{Choice}_F$, $\mathrm{Reward}_F$) are defined as follows: […] Therefore, UM measures the difference between choice and reward fractions toward the more-rewarding side. Similarly, UM can be computed based on the color of stimuli when color is informative about reward outcome. Based on our definition, negative and positive values for UM correspond to undermatching and overmatching, respectively.
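One formulation consistent with the verbal description of UM is sketched below. It is an assumption, not a reproduction of the defining equations: choice and reward fractions are computed for a fixed reference side, and a sign term based on the reward fraction orients the difference toward the side that yielded more reward, with sign(0) = 1. The function name and the `"L"`/`"R"` labels are illustrative.

```python
def undermatching(choices, rewards, reference="L"):
    """Deviation from matching for one block (negative means undermatching).

    Assumed form: sign(reward_f - 0.5) * (choice_f - reward_f), where both
    fractions refer to the `reference` side and sign(0) is taken to be 1.
    """
    total_reward = sum(rewards)
    if not choices or total_reward == 0:
        return None  # the reward fraction is undefined without any reward
    choice_f = sum(c == reference for c in choices) / len(choices)
    reward_on_ref = sum(r for c, r in zip(choices, rewards) if c == reference)
    reward_f = reward_on_ref / total_reward
    sign = 1.0 if reward_f - 0.5 >= 0 else -1.0
    return sign * (choice_f - reward_f)
```

For a block with 70% of choices and 80% of harvested rewards on the richer side, this sketch returns about −0.1 regardless of which side is used as the reference, matching the convention that negative values indicate undermatching.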
Win-stay and lose-switch. Win-stay (WS) and lose-switch (LS) measure the tendency to repeat a rewarded choice (in terms of action or stimulus) and switch away from an unrewarded choice, respectively. These quantities are based on the conditional probabilities of stay and switch after reward and no reward, respectively, and can be calculated in a block of trials as follows: […] where $P(\mathrm{win})$ and $P(\mathrm{stay})$ are the probabilities of harvesting reward and choosing the same option (side or stimulus) in successive trials, $P(\mathrm{lose}) = 1 - P(\mathrm{win})$, and $P(\mathrm{switch}) = 1 - P(\mathrm{stay})$. When computing metrics based on action or reward in the previous trial for mice, we treated each miss trial as though the trial did not exist. For example, if a mouse chose left and was rewarded on trial $t$, did not respond on trial $t + 1$ (miss trial), then chose left on trial $t + 2$, trial $t + 2$ would be labeled as win-stay.
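As a minimal sketch of the conditional probabilities described above, the function below computes win-stay and lose-switch for one block and drops miss trials before pairing consecutive trials. Encoding a miss as `None` and the function name are assumptions made for illustration.

```python
def win_stay_lose_switch(choices, rewards):
    """Return (WS, LS) = (P(stay | win), P(switch | lose)) for one block."""
    # Miss trials (choice is None) are removed as though they never occurred.
    trials = [(c, r) for c, r in zip(choices, rewards) if c is not None]
    n_win = n_win_stay = n_lose = n_lose_switch = 0
    for (prev_choice, prev_reward), (cur_choice, _) in zip(trials, trials[1:]):
        stay = cur_choice == prev_choice
        if prev_reward:
            n_win += 1
            n_win_stay += stay
        else:
            n_lose += 1
            n_lose_switch += not stay
    ws = n_win_stay / n_win if n_win else None    # undefined without wins
    ls = n_lose_switch / n_lose if n_lose else None  # undefined without losses
    return ws, ls
```

With the example from the text (left and rewarded, then a miss, then left again), the single remaining transition is counted as win-stay.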
Repetition index. Repetition index (RI) measures the tendency to repeat a choice beyond what is expected by chance and can be computed by subtracting the probability of stay by chance from the original probability of stay 36 . RI can be computed based on the repetition of left or right choices (RI LR ) as follows: […] In general, RI reflects a combination of reward-dependent and reward-independent strategies as well as the sensitivity of choice to value differences (equal to the inverse temperature in the logit function translating value differences to choice probability; see Eq. (28)).
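The subtraction described above can be sketched as follows, assuming that the chance level of staying is the probability that two independent draws from the block's overall choice proportions land on the same side. That chance model, the function name, and the `"L"`/`"R"` labels are assumptions for illustration.

```python
def repetition_index(choices):
    """P(stay) minus the chance probability of staying, for one block."""
    n = len(choices)
    if n < 2:
        return None  # no consecutive pair of trials to evaluate
    p_stay = sum(a == b for a, b in zip(choices, choices[1:])) / (n - 1)
    p_left = choices.count("L") / n
    # Assumed chance level: both draws left, or both draws right.
    p_stay_chance = p_left ** 2 + (1.0 - p_left) ** 2
    return p_stay - p_stay_chance
```

Strict alternation gives a negative index, whereas a block of identical choices gives zero because staying is then fully explained by the choice proportions.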
Repetition index can also be measured based on other option or choice attributes that predict reward such as the color of the chosen option. For example, RI can be defined based on selection of the better or worse option (RI BW ) when such options exist in a task: […] where $t$ is the trial number. Using Eq. (5), RI BW can be decomposed into two pieces, RI B and RI W , that measure the tendency to repeat the better and worse options, respectively: […]
Entropy-based metrics. In order to quantify the influence of previous reward outcome on choice behavior in terms of stay or switch, we defined the conditional entropy of reward-dependent strategies (ERDS) that combines tendencies of win-stay and lose-switch into a single metric. More specifically, ERDS is defined as the conditional entropy of using stay or switch strategy depending on win or lose in the preceding trial: […] To better show the link between ERDS and win-stay, lose-switch, and p(win), Eq. (7) can be rewritten as follows: […] ERDS can be decomposed into two components, ERDS+ and ERDS− (ERDS = ERDS+ + ERDS−), to allow separation of animals' response to rewarded (win) and unrewarded (loss) outcomes: […] The above equations also show that ERDS+ and ERDS− are linked to win-stay and lose-switch, respectively.
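The conditional entropy underlying ERDS can be illustrated with the sketch below, which uses the standard definition H(strategy | outcome) and splits it into the contributions of trials following a win and trials following a loss. The base-2 logarithm (entropy in bits), the function names, and the input encoding are assumptions made here.

```python
from math import log2


def conditional_entropy_terms(joint_counts):
    """Split H(strategy | condition) into one additive term per condition.

    joint_counts maps (strategy, condition) to a count of transitions.
    """
    total = sum(joint_counts.values())
    cond_totals = {}
    for (_, cond), n in joint_counts.items():
        cond_totals[cond] = cond_totals.get(cond, 0) + n
    terms = {cond: 0.0 for cond in cond_totals}
    for (_, cond), n in joint_counts.items():
        if n > 0:
            # -P(strategy, cond) * log2 P(strategy | cond)
            terms[cond] -= (n / total) * log2(n / cond_totals[cond])
    return terms


def erds(choices, rewards):
    """Return (ERDS+, ERDS-); their sum is the full ERDS."""
    counts = {}
    for i in range(1, len(choices)):
        strategy = "stay" if choices[i] == choices[i - 1] else "switch"
        outcome = "win" if rewards[i - 1] else "lose"
        counts[(strategy, outcome)] = counts.get((strategy, outcome), 0) + 1
    terms = conditional_entropy_terms(counts)
    return terms.get("win", 0.0), terms.get("lose", 0.0)
```

In this form each term equals the probability of the outcome times the binary entropy of the corresponding stay probability, which is how win-stay and lose-switch enter ERDS+ and ERDS−. For instance, an animal that always stays after a win and stays half the time after a loss, with wins and losses equally frequent, yields ERDS+ = 0 and ERDS− = 0.5 bits.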
Considering that RI can be decomposed to repetition after the better or worse option (Eq. (5)), and following the same logic used to derive ERDS, one can define the conditional entropy of option-dependent strategy (EODS) based on staying on or switching from the better or worse option (EODS BW ) in two consecutive trials: […] We should note that ERDS and EODS are directly comparable and provide insight into the consistency of strategy adopted by an animal. Lower ERDS than EODS suggests that an animal's decisions are more consistently influenced by immediate reward feedback than selection of the better or worse option. A lower EODS than ERDS suggests the opposite. It is worth noting that ERDS decompositions (ERDS+ or ERDS−) cannot be directly compared to EODS decompositions (EODS B or EODS W ) because they encompass different sets of trials; that is, trials where the animal wins may not be trials where the animal chooses the better option and vice versa.
Because conditional entropies can be defined for any two discrete random variables, ERDS and EODS can be generalized to combinations or sequences of combinations of reward and option. Hence, we can define the entropy of reward- and option-dependent strategy (ERODS), a measure of the dependence of strategy on the selected option and reward outcome: […]
ERODS can be decomposed based on choosing the better or worse option in the previous trial, winning or losing in the previous trial, or combinations of the selected option and reward outcome (e.g., choose better option and win on the previous trial).
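A minimal sketch of the decomposition by option-outcome combination is given below, assuming that ERODS is the conditional entropy of stay/switch given the previous option (better or worse) and the previous outcome (win or loss), measured in bits. The component labels `"B+"`, `"B-"`, `"W+"`, `"W-"` and the function name are illustrative; a component is returned as `None` when its condition never occurs in the block.

```python
from math import log2


def erods_components(chose_better, rewards):
    """Return the four ERODS terms keyed by previous option and outcome."""
    counts = {}
    for i in range(1, len(chose_better)):
        strategy = "stay" if chose_better[i] == chose_better[i - 1] else "switch"
        cond = ("B" if chose_better[i - 1] else "W") + ("+" if rewards[i - 1] else "-")
        counts[(strategy, cond)] = counts.get((strategy, cond), 0) + 1
    total = sum(counts.values())
    components = {"B+": None, "B-": None, "W+": None, "W-": None}
    for cond in components:
        n_cond = sum(n for (_, c), n in counts.items() if c == cond)
        if n_cond == 0:
            continue  # condition never occurred: component left undefined
        h = 0.0
        for strategy in ("stay", "switch"):
            n = counts.get((strategy, cond), 0)
            if n:
                h -= (n / total) * log2(n / n_cond)
        components[cond] = h
    return components
```

Summing the defined components over both outcomes for one option, or over both options for one outcome, gives the coarser decompositions by option or by outcome described above.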
Decomposing ERODS based on reward-option combinations gives: […] Finally, ERODS can also be decomposed based on selection of the better or worse option: […] or winning or losing in the previous trial: […]
Reinforcement-learning models. We used nine generative RL models to fit choice behavior. In all models except the multiple timescales model, reward values associated with the right and left sides ($Q_{Right}$ and $Q_{Left}$) for mice or circle and square stimuli ($Q_{Circle}$ and $Q_{Square}$) for monkeys were updated differently depending on whether a given choice was rewarded or not. Some of the models incorporated additional loss- or choice-memory components that influenced choice but did not affect the update of reward values. As such, we refer to the final reward and nonreward values used for decision making as decision values, DV (e.g., $DV_{Circle}$), to distinguish them from the updated reward values. Models were defined in a nested fashion with subsequent models building on the update rules of their predecessor.
Purely RL models. In the first model, which we refer to as RL1, only the reward value associated with the chosen option (side or stimulus) ($Q^{RL1}_C$) was updated as follows:
$$Q^{RL1}_C(t+1) = Q^{RL1}_C(t) + \alpha \left( R(t) - Q^{RL1}_C(t) \right) \qquad (18)$$
where $C \in \{Left, Right\}$ for mice and $C \in \{Circle, Square\}$ for monkeys, $R(t) = 1$ or $0$ indicates reward outcome on trial $t$, and $\alpha$ corresponds to the learning rate ($\alpha_{rew}$ or $\alpha_{unrew}$) depending on whether the choice was rewarded or not rewarded. In contrast, the reward value associated with the unchosen option ($Q^{RL1}_U$) was not updated in this model:
$$Q^{RL1}_U(t+1) = Q^{RL1}_U(t)$$
where $U \in \{Left, Right\}$ for mice and $U \in \{Circle, Square\}$ for monkeys. In RL1, […]. In the second model (RL2), the reward value associated with the chosen option ($Q^{RL2}_C$) was updated as in Eq. (18), and the reward probability associated with the unchosen option (side or stimulus) was also updated as follows: […] where decay rate is the decay (or discount) rate of the value of the unchosen option. In RL2, […].
Loss-memory component. The loss-memory component influences stay/switch strategy in response to receiving no reward. In unrewarded trials, the value of the loss-memory component for the chosen option ($L_C(t+1)$) is the negative expected reward prediction error, and in rewarded trials, the value of the component is 0: […] The expected unsigned reward prediction error tracks expected uncertainty and is updated on every trial as follows: […] where $\gamma$ is the decay rate for expected reward prediction error and $R(t) - Q^{RL2}_C(t)$ is the reward prediction error on the current trial. Because the value of the loss-memory component is proportional to expected uncertainty, the no-reward outcome has a greater influence on choice during times of high uncertainty.
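The RL2 update and the loss-memory component described above can be sketched as follows. Several details are assumptions rather than the fitted model: the unchosen value is taken to decay toward zero at the decay rate, the expected unsigned reward prediction error is tracked with a delta rule at rate γ, and the loss-memory value uses the freshly updated expectation. All function and variable names are illustrative.

```python
def rl2_update(q, chosen, reward, alpha_rew, alpha_unrew, decay_rate):
    """One-trial update of reward values; q maps option -> value."""
    alpha = alpha_rew if reward else alpha_unrew
    updated = {}
    for option, value in q.items():
        if option == chosen:
            # Delta rule on the chosen option.
            updated[option] = value + alpha * (reward - value)
        else:
            # Assumed form: the unchosen value decays toward zero.
            updated[option] = value * (1.0 - decay_rate)
    return updated


def loss_memory_update(expected_rpe, q_chosen, reward, gamma):
    """Return (new expected unsigned RPE, loss-memory value for the chosen option)."""
    unsigned_rpe = abs(reward - q_chosen)
    # Assumed delta rule for the running expectation of |RPE|.
    new_expected = expected_rpe + gamma * (unsigned_rpe - expected_rpe)
    # Zero after reward; negative expected |RPE| after no reward.
    loss_value = 0.0 if reward else -new_expected
    return new_expected, loss_value
```

Setting `decay_rate` to 0 leaves the unchosen value untouched, which corresponds to the RL1 behavior described in the text.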
Choice-memory component. The choice-memory component influences stay/switch strategy in response to selection of the better/worse option and is already known to be important for explaining behavior in mice and monkeys 8,15,38 . The values of choice memory for the chosen option (side or stimulus), $C_C$, and for the unchosen option (side or stimulus), $C_U$, are updated as follows: […] where $\gamma$ represents the decay rate for the choice value.
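A common way to implement such a choice trace is sketched below. It assumes that the trace of the chosen option moves toward 1 and the trace of the unchosen option decays toward 0, both at rate γ; this specific form and the names used are assumptions for illustration only.

```python
def choice_memory_update(choice_memory, chosen, gamma):
    """One-trial update of the choice trace; choice_memory maps option -> value."""
    updated = {}
    for option, value in choice_memory.items():
        target = 1.0 if option == chosen else 0.0  # assumed targets
        updated[option] = value + gamma * (target - value)
    return updated
```

With a positive weight, a trace of this kind favors repeating recent choices, and with a negative weight it favors switching away from them.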
Models with loss- and/or choice-memory components. The loss-memory and choice-memory components are weighted with fitted parameters and summed with learned reward values to determine the decision values for different models. Below, we use the notation RL1 and RL2 to denote that the standard reward values, $Q_C(t)$ and $Q_U(t)$, are updated based on the update rules of RL1 and RL2, respectively.
In the full model (RL2 + CM + LM), the decision values related to the chosen and unchosen options in trial $t$, $DV_C(t+1)$ and $DV_U(t+1)$, are computed as follows: […] where $\omega_{LM}$ and $\omega_{CM}$ are free parameters that determine the relative weight of the loss-memory and choice-memory components, respectively. In the full model, the same $\gamma$ was used for both the choice- and loss-memory components because we found that a model with different $\gamma$ fitted for the two components fit worse (based on AIC) than a model with one $\gamma$ shared between the components for mice and monkeys.
In RL2 + LM, the decision values are computed as follows: […] In RL2 + CM, the decision values are computed as follows: […] In RL1 + CM, the decision values are computed as follows: […]
In all models except the multiple timescales model, the probability of selecting the left side (or circle stimulus) is represented as a sigmoid function of the difference in estimated reward probabilities or values for the left and right sides (respectively, circle and square stimuli). Hence, the estimated probability of choosing the left side for mice (or circle for monkeys) in trial $t$, $P_{Left(Circle)}(t)$, is equal to:
$$P_{Left(Circle)}(t) = \frac{1}{1 + \exp\left(-\beta \left( DV_{Left(Circle)}(t) - DV_{Right(Square)}(t) \right)\right)} \qquad (28)$$
where $\beta$ is the inverse temperature (or stochasticity in choice) that quantifies sensitivity of choice to the difference in decision values. Values of decay rate, $\gamma$, $\alpha_{rew}$, and $\alpha_{unrew}$ ranged from 0 to 1 for all models, and values of $\beta$ ranged from 0 to 100. For fit of mouse data, $\gamma$ was fit as a free parameter, but for fit of monkey data, $\gamma$ was fixed as $\gamma = \mathrm{mean}(\alpha_{rew}, \alpha_{unrew})$ such that learning in choice- and loss-memory components occurred at the same rate as the acquisition of reward values. This was done because models with fixed $\gamma$ had lower mean AIC than models with fitted $\gamma$ for monkey data and models with fixed $\gamma$ had higher mean AIC than models with fitted $\gamma$ for mouse data (mean AIC; mice data: fixed $\gamma$: AIC RL2 […]). This difference may be attributable to different task structure: a superblock for monkeys is only 80 trials, whereas a session for mice is much longer, making the threshold for how useful a parameter must be on a trial-by-trial basis to be added to a model more stringent for monkeys. In the above models, values of $\omega_{LM}$ and $\omega_{CM}$ varied from −1 to 1, such that the effects of recent loss and choice on future choice could increase either staying or switching behavior. To test the effects of negative choice-memory weights, we also fit two additional models, RL2 + LM+ and RL2 + CM+. In RL2 + LM+ and RL2 + CM+, the decision values are computed as in Eqs. (24) and (25), respectively; however, $\omega_{LM}$ and $\omega_{CM}$ range only from 0 to 1 instead of −1 to 1.
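The weighted sum and the sigmoid choice rule described above can be sketched as follows. The additive form follows the statement that the memory components are weighted and summed with the learned reward values; the dictionary-based interface, the function names, and the `"L"`/`"R"` labels are assumptions for illustration.

```python
from math import exp


def decision_values(q, loss_memory, choice_memory, w_lm, w_cm):
    """Combine reward values with weighted loss- and choice-memory terms."""
    return {
        option: q[option] + w_lm * loss_memory[option] + w_cm * choice_memory[option]
        for option in q
    }


def p_left(dv, beta):
    """Sigmoid of the decision-value difference with inverse temperature beta."""
    return 1.0 / (1.0 + exp(-beta * (dv["L"] - dv["R"])))
```

In this sketch, setting `w_lm` to 0 reduces the full model to a choice-memory-only variant, setting `w_cm` to 0 gives a loss-memory-only variant, and restricting a weight to be non-negative mirrors the constrained LM+ and CM+ variants.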
Multiple timescales model. We also fit and simulated one additional model based on learning across multiple timescales (Iigaya et al. 31 ). In this model, the values for options are updated across three timescales, $\tau_{fast-1} = 2$, $\tau_{fast-2} = 20$, and $\tau_{slow} = 100$ trials. The reward values for the chosen and unchosen options computed on timescale $\tau_i$ ($Q^{Time}_{C,\tau_i}(t)$ and $Q^{Time}_{U,\tau_i}(t)$) are updated as follows: […] which is equivalent to the RL2 update rule with $\alpha_{rew} = \alpha_{unrew} = \text{decay rate} = 1/\tau_i$. The decision value for the chosen (unchosen) option ($DV_{C(U)}(t)$) is then a weighted sum of the three reward values computed on different timescales: […] where $\omega_{fast-1}$, $\omega_{fast-2}$, and $\omega_{slow}$ are fitted parameters that range from 0 to 1 and determine the contribution of different timescales to decision making. $\omega_{fast-1}$, $\omega_{fast-2}$, and $\omega_{slow}$ are normalized such that they sum to 1. Finally, the probability of choosing the left side (circle stimulus) is computed as follows: […] We also tested a few modified versions of the timescale model that incorporated fitting a beta parameter, using a sigmoid decision rule, fitting instead of fixing the $\tau$ parameters, and integrating learning on multiple timescales with RL2. However, none of the modified timescale models fit or captured metrics better than RL2 + CM + LM for mice or RL2 + CM for monkeys, so we only present the original multiple timescales model.
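A minimal sketch of the multiple timescales model follows. The per-timescale update uses the stated equivalence with the RL2 rule at rate 1/τ, with the unchosen value assumed to decay toward zero, and the weights are normalized to sum to 1. The final choice rule is an assumption: a ratio of decision values is used here because the text lists a sigmoid rule and a fitted β only among the modified variants. All names are illustrative.

```python
def update_timescale_values(q_by_tau, chosen, reward):
    """q_by_tau maps tau -> {option: value}; returns the values after one trial."""
    updated = {}
    for tau, q in q_by_tau.items():
        rate = 1.0 / tau  # learning rate and decay rate both equal 1/tau
        updated[tau] = {
            option: value + rate * ((reward if option == chosen else 0.0) - value)
            for option, value in q.items()
        }
    return updated


def combined_decision_values(q_by_tau, weights):
    """Weighted sum across timescales, with weights normalized to sum to 1."""
    norm = sum(weights.values())
    options = next(iter(q_by_tau.values())).keys()
    return {
        o: sum(weights[tau] / norm * q_by_tau[tau][o] for tau in q_by_tau)
        for o in options
    }


def p_left_ratio(dv):
    """Assumed choice rule: fraction of total decision value on the left."""
    total = dv["L"] + dv["R"]
    return 0.5 if total == 0 else dv["L"] / total
```

A short timescale such as τ = 2 tracks the most recent outcomes closely, whereas τ = 100 changes slowly, so the weights determine how strongly recent versus long-run reward history drives choice.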
Model fitting and simulations. We used the standard maximum likelihood estimation method to fit and estimate the best-fit parameters for the models described above. One set of model parameters was fit to each session of mouse data and each superblock of monkey data. We then used estimated parameters across sessions (in mice) and superblocks (in monkeys) to generate the distributions of parameters for each model (Supplementary Fig. 8). When fitting and simulating RL models with mouse data, we treated miss and no-go trials as if they had not occurred.
To quantify goodness-of-fit, we computed the Akaike Information Criterion (AIC) for each session (for mouse data) or superblock (for monkey data):

$$AIC = -2\,LL + 2p$$

where $LL$ is the log-likelihood of the fit and $p$ is the number of free parameters in a given model. To test for significant differences in AIC, we conducted paired-samples t tests comparing the mean AIC of each model with the mean AIC of the best-fitting model (Supplementary Table 2). To compute the probability that a given model is the best model given the data and set of candidate models, we used AIC values to compute the Akaike weights 56,57 for the $i$th model ($M_i$) in a set of $k$ models, $\{M_1, M_2, \ldots, M_k\}$, as follows:

$$\Delta AIC(M_i) = AIC(M_i) - \min_{j} AIC(M_j)$$

$$w_i = \frac{\exp\left(-\tfrac{1}{2}\,\Delta AIC(M_i)\right)}{\sum_{j=1}^{k} \exp\left(-\tfrac{1}{2}\,\Delta AIC(M_j)\right)}$$

where $AIC(M_i)$ indicates the mean AIC for $M_i$, $\Delta AIC(M_i)$ is the difference between the mean AIC for $M_i$ and the minimum mean AIC out of the set of candidate models, and $w_i$ indicates the Akaike weight for $M_i$. To quantify an absolute measure of goodness-of-fit, we also computed the McFadden $R^2$ 58 for each model:

$$R^2_{McF} = 1 - \frac{LL}{n \ln(0.5)}$$

where $n$ is the number of trials in a given session or superblock.
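The three goodness-of-fit quantities can be computed with a few lines of Python. The sketch below uses the standard definitions of AIC and Akaike weights. For the McFadden $R^2$ it assumes that the null model chooses between two options at chance, so that the null log-likelihood is $n \ln(0.5)$. Function names are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion from a maximized log-likelihood."""
    return -2.0 * log_likelihood + 2.0 * n_params


def akaike_weights(mean_aics):
    """Akaike weights for a set of candidate models.

    mean_aics holds one mean AIC per model; the weights sum to 1.
    """
    best = min(mean_aics)
    relative = [math.exp(-0.5 * (value - best)) for value in mean_aics]
    total = sum(relative)
    return [value / total for value in relative]


def mcfadden_r2(log_likelihood, n_trials, n_options=2):
    """McFadden R^2 relative to a chance-level null model.

    ASSUMPTION: the null model picks each option with equal probability,
    so its log-likelihood is n_trials * ln(1 / n_options).
    """
    null_log_likelihood = n_trials * math.log(1.0 / n_options)
    return 1.0 - log_likelihood / null_log_likelihood


# Example with made-up numbers: three candidate models for one data set.
weights = akaike_weights([250.0, 252.0, 260.0])
fit_quality = mcfadden_r2(log_likelihood=-120.0, n_trials=200)
```

Subtracting the minimum AIC before exponentiating leaves the weights unchanged but avoids numerical underflow when AIC values are large.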
One hundred model simulations were performed per session using best-fit parameters. The large number of simulations allowed us to estimate the population distributions of all metrics. Finally, we conducted additional simulations of RL2 using random parameter values to examine the relationship between parameters and entropy-based metrics. For these simulations, $\alpha_{rew}$, $\alpha_{unrew}$, and $\beta$ varied in the range of $(0, \infty)$ and decay rate was set to 0.1.

Data analyses and stepwise regressions. Stepwise regressions were conducted using MATLAB's (R2019a) stepwiselm and stepwisefit functions. The criterion for adding or removing terms from the model was based on an F-test of the difference in sum of squared error resulting from the addition or removal of a term from the model. A predictor was added to the model if the p value of the F-test was <0.0001, and a predictor was removed from the model if the p value of the F-test was >0.00011.
We note that there were fewer blocks used in the full model stepwise regression because some of the specific entropy-based metrics were not defined for certain blocks, e.g., if a mouse or monkey never won on the worse option (worse side or stimulus) in a block, then $ERODS_{W+}$ was undefined for that block. This resulted in the exclusion of around 500 blocks for mice and 700 blocks for monkeys in the final regression.
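The F-test behind the stepwise criterion compares the sum of squared error (SSE) of a model with and without a candidate term. The Python sketch below computes that F statistic for the simplest case, adding one predictor to an intercept-only model. It is an illustration with hypothetical function names and made-up data, not a reimplementation of MATLAB's stepwise routines, and it stops at the F statistic: converting it to a p value requires the survival function of the F distribution, which is not in the Python standard library.

```python
def sse_intercept_only(y):
    """SSE of a model that predicts the mean of y for every observation."""
    mean_y = sum(y) / len(y)
    return sum((value - mean_y) ** 2 for value in y)


def sse_simple_regression(x, y):
    """SSE of an ordinary least-squares line y = intercept + slope * x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def f_statistic(sse_reduced, sse_full, df_added, df_resid_full):
    """F statistic for the drop in SSE when terms are added to a model.

    df_added is the number of added terms; df_resid_full is the number of
    observations minus the number of coefficients in the larger model.
    """
    return ((sse_reduced - sse_full) / df_added) / (sse_full / df_resid_full)


# Example with made-up data: does adding x improve on the intercept-only model?
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]
f_value = f_statistic(
    sse_intercept_only(y),
    sse_simple_regression(x, y),
    df_added=1,
    df_resid_full=len(y) - 2,
)
```

In a stepwise procedure this comparison is repeated for every candidate term, and the resulting p values are compared with the entry and removal thresholds given above.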
We also conducted tenfold cross-validated regressions to predict deviation from matching (Supplementary Fig. 6) using MATLAB's (R2019a) fitrlinear and kfoldPredict functions. More specifically, stepwise regressions were performed on a set of possible predictors to determine which predictors to include in the final regression model. Then, cross-validated regressions were computed to predict deviation from matching using the set of predictors included in the final stepwise regression model.
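The logic of tenfold cross-validated prediction can be sketched in Python as follows. This is a simplified stand-in under stated assumptions: it uses a single predictor and plain ordinary least squares, whereas the analysis above used MATLAB's fitrlinear with the predictors retained by the stepwise regression. Function names, the fold assignment, and the example data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kfold_predict(x, y, k=10, seed=0):
    """Predict each observation from a model fit on the other folds."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [None] * len(y)
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = intercept + slope * x[i]
    return predictions


def r_squared(y, predictions):
    """Fraction of variance in y explained by out-of-sample predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, predictions))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Example with synthetic data: a noisy linear relationship.
rng = random.Random(1)
metric = [i / 50.0 for i in range(50)]
deviation = [0.4 * m - 0.1 + rng.gauss(0.0, 0.05) for m in metric]
cv_r2 = r_squared(deviation, kfold_predict(metric, deviation, k=10))
```

Because every prediction comes from a model that never saw the corresponding observation, the resulting explained variance is not inflated by overfitting to the blocks being predicted.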
Reporting summary. Further information on experimental design is available in the Nature Research Reporting Summary linked to this paper.