Dopamine blockade impairs the exploration-exploitation trade-off in rats

In a volatile environment where rewards are uncertain, successful performance requires a delicate balance between exploitation of the best option and exploration of alternative choices. It has been proposed theoretically that dopamine contributes to the control of this exploration-exploitation trade-off, and specifically that the higher the level of tonic dopamine, the more exploitation is favored. We demonstrate here that there is a formal relationship between the rescaling of positive dopamine reward prediction errors and the exploration-exploitation trade-off in simple non-stationary multi-armed bandit tasks. We further show in rats performing such a task that systemically antagonizing dopamine receptors greatly increases the number of random choices without affecting learning capacities. Simulations and comparison of a set of different computational models (an extended Q-learning model, a directed exploration model, and a meta-learning model) fitted to each individual confirm that, independently of the model, decreasing dopaminergic activity does not affect the learning rate but is equivalent to an increase in the random exploration rate. This study shows that dopamine could adapt the exploration-exploitation trade-off in decision-making when facing changing environmental contingencies.
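To ground the task description, here is a minimal sketch of a softmax Q-learning agent in a non-stationary three-armed bandit of the kind used in the study; the block length, reward probabilities and parameter values below are illustrative placeholders, not the actual task settings or fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task settings: three levers, and the rewarded "target" lever
# moves to a different lever every 24 trials (all values illustrative).
n_trials, n_arms, block_len = 240, 3, 24
p_target, p_other = 0.9, 0.1          # example reward probabilities ("risk level")
alpha, beta = 0.3, 5.0                # learning rate and inverse temperature

Q = np.zeros(n_arms)
target = 0
for t in range(n_trials):
    if t > 0 and t % block_len == 0:
        target = (target + rng.integers(1, n_arms)) % n_arms  # switch target lever
    p = np.exp(beta * Q) / np.exp(beta * Q).sum()             # softmax choice rule
    a = rng.choice(n_arms, p=p)
    r = float(rng.random() < (p_target if a == target else p_other))
    Q[a] += alpha * (r - Q[a])        # prediction-error update of the chosen lever
```

Lowering beta in this sketch flattens the choice probabilities toward 1/3, which is the random-exploration effect discussed below.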

The reward prediction error (δt) is represented by phasic dopamine signals. In addition, however, we must consider the effects of tonic dopamine (d0). Both positive and negative RPEs are required to control learning, so the teaching signal must consist of deviations of the dopamine concentration from a threshold t0. After each rewarded step of learning, which only affects performed actions, action values are updated according to

Qt+1(at) = Qt(at) + α (δt + d0 − t0)

However, action values should remain constant in the absence of phasic dopamine activity (δt = 0). This implies that d0 − t0 must be zero, in other words that the dopamine concentration threshold for learning must coincide with and track the average level of tonic dopamine. This is consistent with the notion that RPEs adapt to the average expectation of rewards [53], [63], [64].
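As a quick illustration of this constraint, the following sketch (a hypothetical update function with an arbitrary learning rate) shows that in the absence of phasic activity the values stay constant only when t0 exactly tracks d0:

```python
def update(Q, delta, d0, t0, alpha=0.3):
    # Hypothetical update: the deviation of dopamine from the learning
    # threshold drives the value change, Q <- Q + alpha * (delta + d0 - t0).
    return Q + alpha * (delta + d0 - t0)

Q = 0.5
# No phasic activity (delta = 0): the value is constant only if t0 tracks d0.
print(update(Q, delta=0.0, d0=1.0, t0=1.0))  # 0.5  -> constant, as required
print(update(Q, delta=0.0, d0=1.0, t0=0.8))  # 0.56 -> spurious drift if t0 lags d0
```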
If we assume that the effect of dopaminergic inhibition is to reduce the original reward function, it results in a new reward function r′ = g·r, where g is the reduction factor (0 < g ≤ 1). We assume the factor g to be constant during the learning process, which is reasonable under pharmacological or genetic manipulations in which relearning is periodically required. Dopaminergic blockade also affects tonic dopaminergic effects by the same reduction factor g. However, because the blockade is long-lasting, we expect the threshold t0 to track the new level of tonic activation, in which case we can ignore the term g·(d0 − t0). In case the threshold does not have time to adapt to the new tonic dopamine level, we also compared a model that takes this effect into account and found that it had a worse BIC score (58936, with two extra parameters per dose) than both the model with only β free (58489, one extra parameter per dose) and the model with all three parameters optimized independently per dose (58892).
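For concreteness, this is the standard BIC bookkeeping behind such a comparison; the log-likelihoods and parameter counts below are placeholders chosen purely to show the mechanics, not the fitted values reported above.

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    # Standard Bayesian Information Criterion: lower is better.
    # Each extra free parameter costs ln(n_obs).
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Placeholder numbers, purely to illustrate the trade-off between fit and
# parameter count:
n_obs = 30_000                                     # e.g. total choices across rats
scores = {
    "beta free per dose":           bic(-29_200.0, 6, n_obs),
    "adapting-threshold model":     bic(-29_190.0, 12, n_obs),
    "all parameters free per dose": bic(-29_150.0, 12, n_obs),
}
print(min(scores, key=scores.get), scores)
```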
This dopamine manipulation will change the consequences of positive RPEs (when reward is present), but not of negative RPEs (in the absence of reward). Importantly, it is not equivalent to reducing the learning rate, which would affect the consequences of both positive and negative RPEs, as made explicit below. In addition, this dopaminergic manipulation is assumed not to affect the revision of the values of non-selected actions if a forgetting mechanism is at play (eq. 4).
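Spelled out with the update rules of the next section, the asymmetry is immediate (a restatement in equations of the argument above, using the same notation: g scales only the reward term, whereas a reduced learning rate g·α would scale the full RPE on both trial types):

```latex
% Rewarded trial under blockade: only the reward term is scaled by g
Q_{t+1}(a_t) = Q_t(a_t) + \alpha \left( g\,r - Q_t(a_t) \right)

% Unrewarded trial under blockade: identical to the unperturbed rule
Q_{t+1}(a_t) = Q_t(a_t) + \alpha \left( 0 - Q_t(a_t) \right)

% By contrast, a reduced learning rate g*alpha would scale the whole RPE
% on rewarded and unrewarded trials alike:
Q_{t+1}(a_t) = Q_t(a_t) + g\,\alpha \left( r - Q_t(a_t) \right)
```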
We will now show by induction that under dopaminergic inhibition, the Q-values obtained at any time during learning, Q′, are simply scaled-down versions of the original Q-values Q, that is, Q′ = g·Q.
The base case holds at the start of learning, where all Q-values are initialized at zero, so that Q′0 = g·Q0. Assume then that at trial t, Q′t = g·Qt for every action. After each rewarded step of learning, which only affects performed actions:

Q′t+1(at) = Q′t(at) + α (g·r − Q′t(at)) = g·Qt(at) + g·α (r − Qt(at)) = g·Qt+1(at)

After each non-rewarded step of learning, which only affects performed actions:

Q′t+1(at) = Q′t(at) + α (0 − Q′t(at)) = (1 − α) g·Qt(at) = g·Qt+1(at)

After each forgetting step, which only affects non-performed actions (if applicable, with forgetting rate φ as in eq. 4):

Q′t+1(a) = (1 − φ) Q′t(a) = (1 − φ) g·Qt(a) = g·Qt+1(a)

Then plugging the scaled Q-values into the softmax function we get:

P′(a) = exp(β·Q′(a)) / Σb exp(β·Q′(b)) = exp((g·β)·Q(a)) / Σb exp((g·β)·Q(b))

which is a softmax over the original Q-values with inverse temperature g·β. Therefore, scaling down the reward function by a factor g under dopaminergic inhibition is formally equivalent to reducing the inverse temperature β by the same factor g. Notably, although Q-values (including asymptotic values) are reduced by dopaminergic inhibition, this effect cannot be mimicked by a simple change in learning rate. Indeed, changes in learning rate do not affect asymptotic values for a constant reward function (δt = 0 if and only if Qt = r).
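Both claims are easy to verify numerically; a minimal sketch with illustrative parameter values:

```python
import numpy as np

def softmax(beta, Q):
    e = np.exp(beta * np.asarray(Q))
    return e / e.sum()

Q = np.array([0.8, 0.3, 0.1])
g, beta = 0.5, 5.0
# Softmax identity: scaling the Q-values by g equals scaling beta by g.
assert np.allclose(softmax(beta, g * Q), softmax(g * beta, Q))

def asymptote(r, alpha, n_steps=10_000):
    # Iterate Q <- Q + alpha * (r - Q) until numerical convergence.
    q = 0.0
    for _ in range(n_steps):
        q += alpha * (r - q)
    return q

r = 1.0
print(asymptote(r, alpha=0.3))      # -> 1.0: asymptote equals r
print(asymptote(r, alpha=0.03))     # -> 1.0: slower learning, same asymptote
print(asymptote(g * r, alpha=0.3))  # -> 0.5: reward scaling moves the asymptote
```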

Supplementary Table 1. Coefficients and adjusted R² of the linear model between predicted and experimental action probabilities. For each block, we labeled the three possible actions as either the target (action 1 in Supplementary Fig. 3), the previous target (action 2; for the very first block of the session, the target of the last block was used so as to ensure all lever-label combinations were equally represented) or the remaining lever (action 3). For each rat (n = 23), we then averaged the experimental probabilities of each action in bins of four trials per block and compared them to the corresponding average theoretical probabilities, as determined by the current Q-values plugged into the softmax function, by fitting a linear model without an intercept (experimental probabilities = b1 × softmax probabilities). This procedure was applied to separate doses, to separate risk levels and to the entire experiment, as reported in the table; in every case b1 was very close to 1 with a very good R², demonstrating a close correspondence between the two.

Relationship between predicted and observed probabilities of the three possible actions within a block. The predicted probability is given by the softmax function averaged over a given bin of four trials within a block. The observed probability is based on the number of times this specific action was chosen in this bin. Each data point represents the average over blocks for a given combination of rat, bin, risk level, dose and response type. Response types are identified by colors representing the correct action (target, red), the previously correct but now incorrect action (previous target, blue) or other actions (green). The same linear regression curve (y = 0.99x, shown in black) fits all three response types. Unlike the forgetting model (see Supplementary Fig. 5), this model was unable to reproduce this key aspect of behaviour and shifted away from correct rewarded actions far more frequently than the subjects did.
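The no-intercept fit described above reduces to a one-line least-squares estimate; here is a minimal sketch using synthetic data as a stand-in for the binned probabilities (the arrays are placeholders, not the study's data, and conventions for R² without an intercept vary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data: x = average softmax (predicted) probabilities per bin,
# y = observed choice frequencies in the same bins.
x = rng.uniform(0.0, 1.0, size=200)
y = np.clip(x + rng.normal(0.0, 0.05, size=200), 0.0, 1.0)

# Least-squares slope of the no-intercept model y = b1 * x.
b1 = np.dot(x, y) / np.dot(x, x)

# One common convention for R² (definitions differ for no-intercept models).
ss_res = np.sum((y - b1 * x) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
n, k = len(x), 1
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
print(f"b1 = {b1:.3f}, adjusted R2 = {adj_r2:.3f}")  # b1 should be close to 1
```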