The dynamics of explore–exploit decisions reveal a signal-to-noise mechanism for random exploration

Growing evidence suggests that behavioral variability plays a critical role in how humans manage the tradeoff between exploration and exploitation. In these decisions a little variability can help us to overcome the desire to exploit known rewards by encouraging us to randomly explore something else. Here we investigate how such ‘random exploration’ could be controlled using a drift-diffusion model of the explore–exploit choice. In this model, variability is controlled by either the signal-to-noise ratio with which reward is encoded (the ‘drift rate’), or the amount of information required before a decision is made (the ‘threshold’). By fitting this model to behavior, we find that while, statistically, both drift and threshold change when people randomly explore, numerically, the change in drift rate has by far the largest effect. This suggests that random exploration is primarily driven by changes in the signal-to-noise ratio with which reward information is represented in the brain.


Fitting the drift-diffusion model using the HDDM
In addition to fitting with the maximum likelihood approach, we also fit the model using the HDDM python toolbox [20], which is known to perform well when the number of trials is relatively low compared to the number of free parameters [39]. Our fits used 200,000 MCMC samples (discarding the first 20,000 as burn-in), and standard heuristics suggested that our Markov chains had converged (e.g. Markov chain error < 1% and visual inspection of the chains). Average parameters fit using the HDDM were almost identical to those found using the maximum likelihood approach (Figure S1). Thus, in the main text we focused on the simpler and much faster maximum likelihood fits (about 30 s for MLE vs. 5 days for MCMC on a laptop). All code and data needed to reproduce the figures and analyses are available at https://github.com/sffeng/horizon_

Figure S1: Comparison between MLE and MCMC parameter values for the full model from horizon 1 (blue) and horizon 6 (red) games.
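The "Markov chain error < 1%" heuristic above compares the Monte Carlo standard error of the chain's mean against the posterior standard deviation. A minimal sketch of that check, using a batch-means estimate of the Monte Carlo error, is below; the function names are illustrative and not part of the HDDM toolbox.

```python
# Sketch of the "Markov chain error < 1%" convergence heuristic, using batch
# means to estimate the Monte Carlo standard error (MCSE) of a chain's mean.
# Illustrative only; these helpers are not from the HDDM toolbox.
import math
import random

def mcse_batch_means(chain, n_batches=20):
    """Estimate the Monte Carlo standard error of the chain's mean."""
    batch = len(chain) // n_batches
    means = [sum(chain[i * batch:(i + 1) * batch]) / batch
             for i in range(n_batches)]
    grand = sum(means) / n_batches
    var_means = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return math.sqrt(var_means / n_batches)

def mc_error_ratio(chain):
    """MCSE expressed as a fraction of the posterior standard deviation."""
    mean = sum(chain) / len(chain)
    sd = math.sqrt(sum((x - mean) ** 2 for x in chain) / (len(chain) - 1))
    return mcse_batch_means(chain) / sd

random.seed(0)
# A well-mixed chain easily passes the <1% criterion at 200,000 samples.
chain = [random.gauss(0.0, 1.0) for _ in range(200_000)]
print(mc_error_ratio(chain) < 0.01)
```

For a well-mixed chain the ratio shrinks roughly as one over the square root of the number of samples, which is why a long run like 200,000 samples comfortably satisfies the criterion.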

Parameter recovery
Parameter recovery [40] was performed by fitting simulated data. In particular, we simulated 46 participants' worth of data using the same parameters that we found by fitting the real data. This simulated data set was fit in exactly the same way as the original data set, using the maximum likelihood approach, and we then compared the recovered parameters to the ground-truth parameters from the simulation. As shown in Figure S2, parameter recovery is excellent for this model in this task. In particular, recovery of the parameters most important for random exploration (c_0 and c_µR) is near perfect (the correlation between simulated and fit parameters is greater than 0.93 in all cases). Note that one subject, who had negative drift-rate parameters, was excluded from this analysis.
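The logic of a recovery analysis can be sketched in miniature: simulate choices from a diffusion-to-bound process with known drift rates, recover each drift from the choice fractions (choice probability in an unbiased DDM is logistic in 2·drift·threshold), and correlate recovered against true values. Everything below is a toy illustration, not the paper's actual fitting pipeline.

```python
# Toy parameter-recovery exercise: simulate diffusion-to-bound choices for
# "subjects" with known drift rates, recover each drift from the choice
# fractions, and correlate recovered against ground truth. Illustrative only.
import math
import random

def simulate_choices(drift, threshold=1.0, n_trials=500, dt=0.01):
    """Euler simulation of a drift-diffusion process; returns P(upper bound)."""
    upper = 0
    for _ in range(n_trials):
        x = 0.0
        while abs(x) < threshold:
            x += drift * dt + math.sqrt(dt) * random.gauss(0.0, 1.0)
        upper += x >= threshold
    return upper / n_trials

def recover_drift(p_upper, threshold=1.0, n_trials=500):
    """Invert P(upper) = 1 / (1 + exp(-2 * drift * threshold)) for the drift."""
    p = min(max(p_upper, 1.0 / n_trials), 1.0 - 1.0 / n_trials)  # avoid logit of 0 or 1
    return math.log(p / (1.0 - p)) / (2.0 * threshold)

random.seed(1)
true_drifts = [-1.0 + 0.25 * i for i in range(9)]            # 9 simulated subjects
recovered = [recover_drift(simulate_choices(m)) for m in true_drifts]

# Pearson correlation between ground-truth and recovered drift rates.
n = len(true_drifts)
mx, my = sum(true_drifts) / n, sum(recovered) / n
cov = sum((a - mx) * (b - my) for a, b in zip(true_drifts, recovered))
corr = cov / math.sqrt(sum((a - mx) ** 2 for a in true_drifts)
                       * sum((b - my) ** 2 for b in recovered))
print(round(corr, 2))
```

With 500 trials per simulated subject the recovered drifts track the true ones closely, mirroring the high simulated-versus-fit correlations reported for the real analysis.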

Parameter Values for Figure 5
The parameter values for Figure 5, columns B through E, were chosen by hand to visually represent the different qualitative predictions of the logistic version of the drift-diffusion model. The bias parameters were fixed at 0 (c_α0 = c_αR = c_αI = 0) and the non-decision time was fixed at T_0 = 0.05. The remaining parameters are given in Table S1.
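How drift and threshold settings translate into qualitative choice predictions can be illustrated with the standard closed form for an unbiased DDM, whose choice probability is logistic in 2·µ·a. The parameterization and values below are assumptions for illustration, not the fitted or hand-picked values from Table S1.

```python
# Illustrative sketch of how drift-rate and threshold parameters translate
# into a logistic choice curve, the kind of qualitative prediction in
# Figure 5. Parameter values here are assumptions, not the paper's.
import math

def choice_probability(dR, dI, c_mu0=0.0, c_muR=1.0, c_muI=0.5, threshold=1.0):
    """P(choose right) for a DDM with unbiased start: 1 / (1 + exp(-2*mu*a))."""
    mu = c_mu0 + c_muR * dR + c_muI * dI          # drift linear in reward/info
    return 1.0 / (1.0 + math.exp(-2.0 * mu * threshold))

# Lowering the drift sensitivity to reward (c_muR) flattens the choice curve,
# i.e. produces more behavioral variability: the signal-to-noise account.
steep = [choice_probability(dR, 0.0, c_muR=1.0) for dR in (-1.0, 0.0, 1.0)]
flat = [choice_probability(dR, 0.0, c_muR=0.3) for dR in (-1.0, 0.0, 1.0)]
print(steep[2] > flat[2] > 0.5)
```

The same flattening could instead be produced by lowering the threshold, which is exactly the ambiguity the fits in the main text are designed to resolve.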

Model with threshold dependent on absolute value of R and I
As mentioned in the main text, having the threshold be linearly dependent on R and I could be problematic both from a mathematical point of view (negative thresholds leading to undefined behavior) and psychological point of view (threshold depends on spatial location of bandits). To avoid these problems we fit a modified form of the model in which the threshold depends on the absolute value of R and I.
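The modified threshold rule can be sketched in a few lines: taking absolute values makes the threshold invariant to which side the better option is on, and a small floor guards against the undefined behavior a negative threshold would cause. Parameter names and values are illustrative assumptions.

```python
# Minimal sketch of the modified threshold rule: the threshold depends on
# |dR| and |dI|, so it does not depend on the spatial location of the
# bandits, and a floor keeps it strictly positive. Illustrative values only.
def threshold(dR, dI, c_a0=1.0, c_aR=0.2, c_aI=0.1, floor=0.05):
    """Decision threshold as a function of absolute reward/information differences."""
    return max(floor, c_a0 + c_aR * abs(dR) + c_aI * abs(dI))

# Symmetric in the sign of dR, unlike a linear dependence on dR itself.
print(threshold(2.0, 0.0) == threshold(-2.0, 0.0))
```

A linear rule without the absolute value would instead predict different thresholds depending on whether the high-reward bandit happened to be on the left or the right.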
First we fit the model to simulated data to check parameter recovery for this modified model. As shown in Figure S4, parameter recovery was similarly good for this model as for the original.

Table S1: Parameter values for Figure 5

Next we fit the model to human behavior. As shown in Figure S5, the fit parameter values share several similarities with those of the original model. In particular, we see: a decrease in c_µR with horizon, consistent with a signal-to-noise-ratio change driving random exploration; an increase in c_µI with horizon, consistent with an information bonus driving directed exploration; and a decrease in c_αR with horizon. In contrast to the original model, we see no change in the baseline threshold c_0 with horizon. Instead, we see a significant change with horizon in the effect of reward on the threshold (c_R). In particular, c_R is positive for horizon 1 and approximately zero for horizon 6. This suggests that people increase their thresholds in horizon 1 when |R| is high; that is, they make more careful decisions in horizon 1, when the consequences of making an error are largest.

While the modified model does not map directly onto the logistic model, both c_µR and c_R could affect behavioral variability and hence random exploration. To determine which of these factors contributes most to the horizon change in behavioral variability, we simulated behavior of the modified model in two conditions: first with c_R held constant across horizons (at its horizon 1 value), and second with c_µR held constant across horizons (at its horizon 1 value). We then fit the resulting behavior with the logistic choice model to estimate the effect on behavioral variability of a horizon change in only one of c_R and c_µR. As shown in Figure S6, in both the [1 3] and [2 2] uncertainty conditions the horizon change in c_µR (i.e. when c_R is constant with horizon) accounts for most of the horizon change in the decision noise.
Thus, these results with the modified model support the conclusion that random exploration is primarily driven by horizon changes in the signal-to-noise ratio, not by horizon change in threshold.
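The logic of this lesion analysis can be sketched with the logistic closed form for DDM choice (probability logistic in 2·µ·a): change only one of c_µR or c_R between horizons and compare the resulting decision noise, summarized as the inverse slope of the implied choice curve. All parameter values below are illustrative assumptions, not the fitted horizon 1 and horizon 6 values.

```python
# Toy version of the lesion analysis: vary only one of c_muR (drift
# sensitivity to reward) or c_R (threshold sensitivity to |dR|) between
# horizons, and summarize behavioral variability as the inverse logistic
# slope implied by the DDM. Parameter values are illustrative assumptions.
def noise(c_muR, c_R, dR=1.0, c_a0=1.0):
    mu = c_muR * dR
    a = c_a0 + c_R * abs(dR)
    return dR / (2.0 * mu * a)        # inverse slope of the choice curve

h1 = dict(c_muR=1.0, c_R=0.5)         # hypothetical horizon-1 values
h6 = dict(c_muR=0.4, c_R=0.0)         # hypothetical horizon-6 values

only_muR = noise(h6["c_muR"], h1["c_R"]) - noise(**h1)   # c_R held at horizon 1
only_cR = noise(h1["c_muR"], h6["c_R"]) - noise(**h1)    # c_muR held at horizon 1
print(only_muR > only_cR)             # drift change dominates in this toy setup
```

Because drift enters the slope multiplicatively just as the threshold does, which lesion dominates is an empirical question, which is why the simulation-and-refit procedure above is needed rather than inspection of the parameters alone.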
Finally, as with the original model, posterior predictive simulations show that the modified model fits both the choice and response-time data well.