Learning relative values in the striatum induces violations of normative decision making

To decide optimally between available options, organisms need to learn the values associated with these options. Reinforcement learning models offer a powerful explanation of how these values are learnt from experience. However, human choices often violate normative principles. We suggest that seemingly counterintuitive decisions may arise as a natural consequence of the learning mechanisms deployed by humans. Here, using fMRI and a novel behavioural task, we show that, when suddenly switched to novel choice contexts, participants’ choices are incongruent with values learnt by standard learning algorithms. Instead, behaviour is compatible with the decisions of an agent learning how good an option is relative to an option with which it had previously been paired. Striatal activity exhibits the characteristics of a prediction error used to update such relative option values. Our data suggest that choices can be biased by a tendency to learn option values with reference to the available alternatives.


Supplementary Figure 5: Reinforcement learner value estimates for EF(1) and GH(2) learning.
Supplementary Figure 4 (above) shows below-chance performance on the initial trials of the EF(1) and GH(2) discriminations. This results from a subtle initial advantage in local outcome histories for F over E, and for H over G. The figure shows value estimates from a reinforcement learner: the agent's estimate is higher for H than for G during the initial ~15 trials, and higher for F than for E during the first ~5 trials. The outcome sequences that participants experienced were:

A negative effect of relative value was found in both caudate and putamen in both hemispheres (top row = left, bottom row = right hemisphere), whereas no such effect was found in the ventral striatum. The left ventral striatum did show an effect (peak t(23) = 1.93), which, however, did not survive cluster-based correction.
All regions showed a pronounced positive effect of relative outcome. Furthermore, qualitatively, representations of the motor response appeared to be more pronounced in the caudate and putamen than in the ventral striatum, and to emerge later in the caudate than in the putamen. RVAL = relative value, RO = relative outcome, Resp = response made by the subject. BOLD signal is interpolated to a resolution of 300 ms.

Computational modelling
Here, we describe the implementation of the two main models (absolute and relative learner) and five alternative models that we tested.

(1) Q-Learner with update of the chosen and unchosen option
This is a simple Q-Learner that estimates the objective reward probabilities using a Rescorla-Wagner update rule:

$$Q_{t+1} = Q_t + \alpha \delta_t \qquad [1]$$

where $Q_t$ is the estimated value on trial t, $\alpha$ is the subject-specific learning rate and $\delta_t$ is the prediction error on trial t:

$$\delta_t = r_t - Q_t \qquad [2]$$

where $r_t$ is the reward (0 or 1) observed on trial t. The value estimates were then used to generate a probability for the model to select a given option (here: A vs B) using a softmax choice rule:

$$p(A) = \frac{1}{1 + \exp(-VD / \tau)} \qquad [3]$$

where VD is the value difference between options (here: A and B) and $\tau$ is the softmax temperature that accounts for the stochasticity in subjects' choices.
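As a rough illustration of model (1), the following Python sketch updates both options on every trial and computes the softmax choice probability for option A. The function names, the initial value `q0 = 0.5`, and the example inputs are assumptions made for illustration; they are not taken from the original Matlab code.

```python
import math


def softmax_prob_a(q_a, q_b, tau):
    """Probability of choosing A from the value difference VD = Q_A - Q_B."""
    vd = q_a - q_b
    return 1.0 / (1.0 + math.exp(-vd / tau))


def q_update(q, reward, alpha):
    """Rescorla-Wagner update; returns the new value and the prediction error."""
    delta = reward - q
    return q + alpha * delta, delta


def run_q_learner(outcomes_a, outcomes_b, alpha, tau, q0=0.5):
    """Learn from the outcomes of both options on every trial.

    q0 (the initial value estimate) is an assumption of this sketch.
    Returns the per-trial p(A) and the final value estimates.
    """
    q_a, q_b = q0, q0
    probs = []
    for r_a, r_b in zip(outcomes_a, outcomes_b):
        # Choice probability is computed before the outcome is observed.
        probs.append(softmax_prob_a(q_a, q_b, tau))
        q_a, _ = q_update(q_a, r_a, alpha)
        q_b, _ = q_update(q_b, r_b, alpha)
    return probs, q_a, q_b


probs, q_a, q_b = run_q_learner([1, 1, 1], [0, 0, 0], alpha=0.5, tau=0.2)
```

With these inputs the value of A rises towards 1 and that of B falls towards 0, so p(A) increases from trial to trial.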

(2) Relative Value Learner
This algorithm does not track separate value estimates for the two options in each pair. Instead, it directly learns how much better one option is compared to the alternative with which it is presented. It uses the same update rule as in equation [1]:

$$Q_{t+1} = Q_t + \alpha \delta_t$$

However, here the prediction error $\delta_t$ takes the following form:

$$\delta_t = (Rc_t - Ru_t) - Q_t$$

where $Rc_t$ and $Ru_t$ are the rewards observed on the chosen and unchosen options, respectively.
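A minimal Python sketch of one relative-value update is given here. It assumes that a single relative value per pair is stored in the frame of option A (how much better A is than B) and that its sign is flipped when B is chosen; this bookkeeping, like the function name, is an assumption of the sketch rather than a detail stated in the text.

```python
def relative_value_step(q_ab, chose_a, r_a, r_b, alpha):
    """One update of the relative value of A versus B.

    q_ab is stored in A's frame (assumption of this sketch).
    Returns the new q_ab and the prediction error in the chosen option's frame.
    """
    sign = 1.0 if chose_a else -1.0
    r_chosen, r_unchosen = (r_a, r_b) if chose_a else (r_b, r_a)
    q_chosen = sign * q_ab  # relative value of the chosen option
    # Outcome difference compared with the expected outcome difference.
    delta = (r_chosen - r_unchosen) - q_chosen
    q_chosen += alpha * delta
    return sign * q_chosen, delta


q, d1 = relative_value_step(0.0, True, r_a=1, r_b=0, alpha=0.5)
q, d2 = relative_value_step(q, False, r_a=1, r_b=0, alpha=0.5)
```

Because the sign is flipped consistently, the stored relative value moves in the same direction whichever option was chosen.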
Thus, the outcome difference is compared to the expected outcome difference to update the relative value of options. Model choice probabilities were again given by a softmax function as in equation [3].

(3) Q-Learner with update of chosen option only
This agent is identical to model (1), except that it learns exclusively from direct experience: the outcomes of the non-chosen option are not used to update that option's value. It thus captures the behaviour of a subject attending exclusively to the outcomes of the chosen option.

(4) Q-Learner with separate learning rates for chosen and unchosen options
Somewhere between the extremes of an agent learning to exactly the same extent from chosen and unchosen outcomes (model (1)) and an agent learning nothing at all from unchosen outcomes (model (3)) is an agent that learns from both, but to a greater or lesser degree from unchosen than from chosen outcomes. We capture this with a Q-Learner endowed with separate learning rates for the chosen and unchosen option. Again, values are updated using a simple delta rule:

$$Q^{C}_{t+1} = Q^{C}_t + \alpha_C (Rc_t - Q^{C}_t)$$

$$Q^{U}_{t+1} = Q^{U}_t + \alpha_U (Ru_t - Q^{U}_t)$$

where $\alpha_C$ and $\alpha_U$ are the learning rates for the chosen and unchosen option, respectively.
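The following Python sketch illustrates the update with separate learning rates; the function name and example values are illustrative assumptions. Setting `alpha_u = 0` reproduces the chosen-only learner of model (3), and setting `alpha_u = alpha_c` reproduces model (1).

```python
def dual_rate_update(q_chosen, q_unchosen, r_chosen, r_unchosen, alpha_c, alpha_u):
    """Delta-rule update with separate learning rates for chosen and unchosen options."""
    q_chosen += alpha_c * (r_chosen - q_chosen)
    q_unchosen += alpha_u * (r_unchosen - q_unchosen)
    return q_chosen, q_unchosen


# Learns more from the chosen than from the unchosen outcome.
q_c, q_u = dual_rate_update(0.5, 0.5, 1, 0, alpha_c=0.4, alpha_u=0.1)

# alpha_u = 0: the unchosen value is left untouched, as in model (3).
q_c3, q_u3 = dual_rate_update(0.5, 0.5, 1, 0, alpha_c=0.4, alpha_u=0.0)
```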

(5) State-dependent relative value learning
Here, we describe the relative value learner recently used by Palminteri and colleagues¹, which we used for comparison with our model. While the algorithm we used does not learn separate option values but instead directly learns how good one option is compared to the available alternative, the state-dependent relative value learner does learn separate option values. However, these are learnt with reference to the average value of the current context, or state. Option values are updated according to the same standard delta rule as in equation [1]:

$$Q_{t+1} = Q_t + \alpha \delta_t$$

with $\alpha = \alpha_C$ for the chosen option and $\alpha = \alpha_U$ for the unchosen option. However, here the prediction error $\delta_t$ on trial t for the chosen and non-chosen option is:

$$\delta^{C}_t = Rc_t - V(s)_t - Q^{C}_t$$

$$\delta^{U}_t = Ru_t - V(s)_t - Q^{U}_t$$

where $V(s)_t$ is the state value on trial t, which is likewise updated on each trial:

$$V(s)_{t+1} = V(s)_t + \alpha_{State}\, \delta(s)_t$$

where $\alpha_{State}$ is the state learning rate and $\delta(s)_t$ is the state prediction error on trial t:

$$\delta(s)_t = SO_t - V(s)_t$$

where the state-level average outcome SO is represented by the average reward on the chosen and unchosen option:

$$SO_t = \frac{Rc_t + Ru_t}{2}$$

This algorithm is strongly related to Baird's advantage updating, from which it differs in the inclusion of counterfactual learning and in comparing the selected action with the average outcome, rather than the best outcome².
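One trial of the state-dependent learner could be sketched in Python as follows. The sketch assumes that all prediction errors on a trial are computed from the state value held before that trial's update; the order of updates, the function name and the example values are assumptions, not details given in the text.

```python
def state_dependent_step(q_c, q_u, v_s, r_c, r_u, alpha_c, alpha_u, alpha_state):
    """One trial of state-dependent relative value learning.

    Assumption of this sketch: option prediction errors use the state value
    v_s as it was before this trial's state update.
    """
    delta_c = r_c - v_s - q_c   # chosen-option prediction error
    delta_u = r_u - v_s - q_u   # unchosen-option prediction error
    so = 0.5 * (r_c + r_u)      # state-level average outcome
    delta_s = so - v_s          # state prediction error
    return (q_c + alpha_c * delta_c,
            q_u + alpha_u * delta_u,
            v_s + alpha_state * delta_s)


q_c, q_u, v_s = state_dependent_step(0.0, 0.0, 0.0, r_c=1, r_u=0,
                                     alpha_c=0.5, alpha_u=0.5, alpha_state=0.5)
```

As the state value grows, the same reward produces a smaller option-level prediction error, which is what makes the learnt option values context-relative.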

(6) Actor-Critic learning
Actor-Critic learning has in common with state-dependent relative value learning (model (5)) the learning of a state value function and prediction errors that are based on this state value estimate.
The actor selects an action based on a policy $\pi$ and this action is evaluated by the critic. Unlike in other forms of learning, there is a separate representation of the policy, independent of the value function. On each trial, the action selected by the subject generates a prediction error $\delta_t$:

$$\delta_t = r_t - V(state)_t$$

where $V(state)_t$ is the value of a particular state, and pairs of stimuli presented together represent one state. The resulting prediction error is then used to update both the state value (the critic) and a separate policy (the actor):

$$V(state)_{t+1} = V(state)_t + \alpha_C \delta_t$$

$$\pi(s, a)_{t+1} = \pi(s, a)_t + \alpha_A \delta_t$$

where the policy $\pi(s, a)$ is the strength of the connection between the chosen stimulus and the action of selecting it when in state s, and $\alpha_A$ and $\alpha_C$ are the learning rates in the actor and the critic, respectively. Policy weights are then again used for action selection using a softmax rule:

$$p(a) = \frac{\exp(\pi(s, a) / \tau)}{\sum_{a'} \exp(\pi(s, a') / \tau)}$$

The free parameters in all models were estimated using custom-written model fitting procedures in Matlab. The parameter space was set up as n-dimensional grids in log space (where n is the number of parameters in the respective model). Negative log likelihoods were computed for each parameter combination in the grid:

$$-LL = -\sum_t \log(p_t)$$

where $p_t$ is the model's choice probability on trial t. The grid optimum was then used to initialise further optimisation using the Nelder-Mead simplex algorithm implemented in Matlab's fminsearch function.
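The grid stage of this fitting procedure can be sketched in Python as below, using the Q-Learner of model (1) as the example model. The grid bounds, grid size, initial value of 0.5, probability floor and function names are all assumptions of this sketch; the original procedures were written in Matlab and are not reproduced here. The Nelder-Mead refinement is omitted; in Python it could be started from the grid optimum with `scipy.optimize.minimize(..., method="Nelder-Mead")`.

```python
import itertools
import math


def neg_log_likelihood(params, choices_a, outcomes_a, outcomes_b):
    """Negative log likelihood of observed choices under the model (1) Q-Learner."""
    alpha, tau = params
    q_a = q_b = 0.5  # initial values: an assumption of this sketch
    nll = 0.0
    for chose_a, r_a, r_b in zip(choices_a, outcomes_a, outcomes_b):
        p_a = 1.0 / (1.0 + math.exp(-(q_a - q_b) / tau))
        p_choice = p_a if chose_a else 1.0 - p_a
        nll -= math.log(max(p_choice, 1e-12))  # floor avoids log(0)
        q_a += alpha * (r_a - q_a)
        q_b += alpha * (r_b - q_b)
    return nll


def log_grid(lo, hi, n):
    """n points between lo and hi, evenly spaced in log space."""
    step = (math.log(hi) - math.log(lo)) / (n - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(n)]


def grid_search(choices_a, outcomes_a, outcomes_b, n=20):
    """Return the (alpha, tau) grid point with the lowest negative log likelihood."""
    grid = itertools.product(log_grid(0.01, 1.0, n), log_grid(0.01, 10.0, n))
    return min(grid, key=lambda p: neg_log_likelihood(
        p, choices_a, outcomes_a, outcomes_b))


choices = [True, True, True, True, True]
best_alpha, best_tau = grid_search(choices, [1, 1, 1, 1, 1], [0, 0, 0, 0, 0])
```

Spacing the grid in log space places more candidate values at the low end of each parameter range, where small changes in learning rate or temperature can have a large effect on predicted choices.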

ROI analyses
The BOLD timeseries from regions of interest were resampled to a resolution of 300 ms using cubic spline interpolation before being cut into trials with a duration of 14.4 s. Each trial consisted of three phases: CHOICE (time between stimulus and response onset), DELAY (time between response and outcome onset), and OUTCOME (a fixed window of 10 s following outcome onset).
The durations of DELAY and OUTCOME were fixed (3.5 and 10 s, respectively), whereas the CHOICE phase was of variable duration, depending on the subject's response time (RT) on a particular trial. As the duration of the CHOICE phase, we used 0.9 s, corresponding to the mean RT across trials and subjects. Thus, the BOLD signal was cut into epochs of 14.4 s on each trial (0.9 s CHOICE, 3.5 s DELAY, and 10 s OUTCOME), where the start of each phase was defined by the exact onset of each event in each trial. The variability of the RT means that on trials with faster than average RT, the last few data points of the CHOICE phase actually belong to the first samples of the DELAY phase. Conversely, on trials with longer than average RT, the last few data points of the CHOICE phase are missing from the analysis. In the plots, subtle discontinuities at the transition between the CHOICE and DELAY phases are the result of this. Note that all of our analyses focus exclusively on the OUTCOME period, which is of constant duration and is thus not affected by this procedure. The resulting data matrix is of size m x n, where m = number of trials and n = number of timepoints. We then regressed this data matrix on a design matrix X at each time point using ordinary least squares regression. The design matrix X is of size m x p, where p = number of regressors. This results in a p x n matrix, which contains the timecourse (n time points) of regression coefficients for each of the p regressors.
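The time-resolved regression can be sketched in Python with NumPy as follows. The simulated data, the regressor layout (an intercept plus two random regressors) and the function name are assumptions made for illustration only; n = 48 corresponds to 14.4 s sampled at 300 ms.

```python
import numpy as np


def regress_timecourse(data, design):
    """Ordinary least squares at every time point.

    data:   (m trials, n timepoints) epoched BOLD signal
    design: (m trials, p regressors) design matrix X
    Returns a (p, n) matrix of regression coefficients over time.
    """
    # lstsq solves all n time points (columns of data) in one call.
    betas, *_ = np.linalg.lstsq(design, data, rcond=None)
    return betas


# Simulated example: m = 60 trials, p = 3 regressors, n = 48 timepoints.
rng = np.random.default_rng(0)
m, p, n = 60, 3, 48
design = np.column_stack([np.ones(m), rng.standard_normal((m, p - 1))])
true_betas = rng.standard_normal((p, n))
data = design @ true_betas + 0.01 * rng.standard_normal((m, n))
betas = regress_timecourse(data, design)
```

Each row of `betas` is the timecourse of one regressor's coefficient, which is the quantity plotted across the trial epoch.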