Contextual modulation of value signals in reward and punishment learning

Compared with reward seeking, punishment avoidance learning is less clearly understood at both the computational and neurobiological levels. Here we demonstrate, using computational modelling and fMRI in humans, that learning option values on a relative—context-dependent—scale offers a simple computational solution for avoidance learning. The context (or state) value sets the reference point to which an outcome should be compared before updating the option value. Consequently, in contexts with an overall negative expected value, successful punishment avoidance acquires a positive value, thus reinforcing the response. As revealed by post-learning assessment of option values, contextual influences are enhanced when subjects are informed about the result of the forgone alternative (counterfactual information). This is mirrored at the neural level by a shift in negative outcome encoding from the anterior insula to the ventral striatum, suggesting that value contextualization also limits the need to mobilize an opponent punishment learning system.

The mask includes all voxels classified as striatum, pallidum and insula in the Automated Anatomical Labeling (AAL) atlas. The mask is superimposed on axial slices of the between-subjects averaged anatomical T1.

I) Reaction times
Reaction time analysis provides evidence of relative value encoding. We also analyzed reaction times with the same statistical model used for the correct choice rate, and we observed a significant effect of outcome valence (F=78.0, P<0.001), a marginally significant effect of feedback information (F=3.2, P=0.09) and a marginal interaction between the two (F=3.8, P=0.06) (Supplementary Figure 1). Post-hoc tests revealed that subjects were slower in the punishment avoidance contexts compared to the reward ones (partial and complete contexts: T>4.0, P<0.001), whereas the effect of feedback information reached statistical significance only in the punishment contexts (T=2.5, P<0.05), but not in the reward ones (T=0.3, P>0.5). Conditioning (Pavlovian-to-instrumental transfer, or PIT) as well as decision field theories established a link between the chosen option value and reaction times. More precisely, they predict that subjects take more time when choices are likely to result in negative outcomes 1-3. We indeed observed this effect (the main effect of valence), since subjects were slower when choices potentially led to negative outcomes. However, the trend toward a significant valence x information interaction (driven by faster responses in the punishment/complete context compared to the punishment/partial context) suggests that the reaction time pattern cannot be fully explained by considering option values on an absolute scale. Importantly, the absence of a difference in reaction times between the two reward contexts further indicates that the observed pattern cannot be explained by assuming that reaction times are a simple function of the correct response rate, which is much higher in the reward/complete context compared to the reward/partial one.
Thus, learning and post-learning results suggest that the observed interaction may derive from relative value encoding: the punishment-induced reaction time slowing is smaller in the punishment/complete context than in the punishment/partial context, as if the option values were less negative as a result of the value contextualization process.

II) Post-learning test detailed analysis
Value inversion in the post-learning test is robust across all possible binary comparisons and confirms relative value encoding. In the main text we reported the post-learning choice rate in an aggregate manner, i.e. reporting the probability of choosing an option taking into account all possible comparisons. The advantage of this aggregate measure is that it is directly proportional to the underlying option value, to which it can therefore easily be compared (see Figure 2B and Figure … P>0.05). Similarly, subjects did not correctly report partial and complete feedback as choice-context-specific (correct responses: 47.1%; P>0.2). Thus, as far as explicit knowledge of the task structure can be inferred from the post-scanning structured interview, whereas the existence of discrete choice contexts (states) and their number seemed explicitly grasped by the subjects, the separation between the reward and punishment conditions, as well as between the partial and complete feedback conditions, remained implicit. These two features are taken into account by our computational models, which i) assume the perception of discrete states (s) but ii) treat option and context values as continuous ("model-free") variables instead of categorical ("rule- or model-based") ones.
The first computational question concerned the update rule for the context value V(s). RELATIVE models 1 and 2 implement a "frequentist" (running average) update:

V(s) ← V(s) + (1/t) × (R_V,t − V(s)),

where t is the number of trials and R_V,t is the context-level outcome at trial t: a global measure that encompasses both the chosen and unchosen options. Frequentist inference is appropriate for environments with no volatility and instantiates a progressive reduction of the learning rate, since new experiences have less weight as the number of trials increases. RELATIVE models 1 and 2, with frequentist update of the context value, could be advantaged by the fact that they do not require additional free parameters compared to the ABSOLUTE model. However, for the same reason, they cannot account for inter-individual variability. We also included RELATIVE models 3 and 4 implementing the delta rule, which, by analogy with the frequentist update, can be written as:

V(s) ← V(s) + α3 × (R_V,t − V(s)),

where α3 is the context value learning rate. The delta rule is appropriate for environments with unknown volatility.
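As a minimal Python sketch (our own illustration, not the study's code), the two context value update rules differ only in whether the learning rate decays with the trial count or stays fixed:

```python
# Illustrative sketch of the two context value update rules.
# `v` is the context value V(s), `r_v` the context-level outcome R_V,t,
# `t` the trial number, `alpha3` the context value learning rate.

def frequentist_update(v, r_v, t):
    """Running average: the effective learning rate 1/t shrinks over trials."""
    return v + (1.0 / t) * (r_v - v)

def delta_rule_update(v, r_v, alpha3):
    """Constant learning rate, suited to environments of unknown volatility."""
    return v + alpha3 * (r_v - v)
```

Note that on the first trial (t = 1) the frequentist update simply sets V(s) to the observed context-level outcome, whereas the delta rule only moves a fraction α3 of the way toward it.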
RELATIVE models 3 and 4, with delta rule update of the context value, could be disadvantaged by the fact that they require an additional free parameter (α3) compared to the ABSOLUTE model. However, for the same reason, they can account for inter-individual variability. The second computational question concerned the definition of R_V. Whereas the context-level outcome can be straightforwardly calculated in the complete feedback contexts as the average of the factual and counterfactual outcomes:

R_V,t = (R_C,t + R_U,t) / 2,

the question arises in the partial feedback contexts, where R_U is not explicitly provided. One possibility, implemented in RELATIVE models 1 and 3, is to replace R_U with R_M (the central, median, task reward: 0.0€) in the partial feedback contexts:

R_V,t = (R_C,t + R_M) / 2 = R_C,t / 2,

which we define as a "context-aspecific heuristic". However, given that R_V is meant to be a context-level measure, a possibility to incorporate unchosen option information in R_V also in the partial feedback contexts, implemented in RELATIVE models 2 and 4, is to consider Q_t(s,u) a proxy of R_U,t and calculate R_V,t as follows:

R_V,t = (R_C,t + Q_t(s,u)) / 2,

which we define as a "context-specific heuristic".
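The two heuristics for the partial feedback contexts can be sketched as follows (our own function names, for illustration only):

```python
# Sketch of the context-level outcome R_V,t under the different heuristics.
R_M = 0.0  # central (median) task reward

def r_v_complete(r_c, r_u):
    # Complete feedback: average of factual and counterfactual outcomes
    return (r_c + r_u) / 2.0

def r_v_partial_aspecific(r_c):
    # "Context-aspecific" heuristic: replace R_U with R_M = 0, so R_V = R_C / 2
    return (r_c + R_M) / 2.0

def r_v_partial_specific(r_c, q_unchosen):
    # "Context-specific" heuristic: use Q_t(s,u) as a proxy of the unseen R_U
    return (r_c + q_unchosen) / 2.0
```

The context-specific heuristic thus converges to the complete-feedback computation to the extent that Q_t(s,u) approximates the average counterfactual outcome.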
To sum up, this model space included 5 models: the ABSOLUTE model (Q-learning) and four RELATIVE models, which differed in 1) the context value update rule ("frequentist" versus "delta rule") and 2) the way R_V was calculated in the partial feedback contexts ("context-aspecific" or "context-specific" heuristic). We submitted these new models to the same parameter optimization procedure and model comparison analyses presented in the main text, involving the Bayesian information criterion (BIC), the Akaike information criterion (AIC) and the Laplace approximation 12 of the model evidence, used to calculate the model posterior probability and exceedance probability 5,6.
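For reference, the penalized-fit criteria mentioned here follow the standard textbook formulas (sketched below in Python; `log_lik` is the maximum log-likelihood, `k` the number of free parameters and `n` the number of trials — lower values indicate a better fit):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion; penalizes parameters by log(n)."""
    return -2.0 * log_lik + k * math.log(n)
```

With typical trial counts (n > 7, so log(n) > 2), the BIC penalizes the extra context value learning rate of RELATIVE models 3 and 4 more heavily than the AIC does.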

Complexity-penalizing model comparison criteria concordantly indicated that the RELATIVE model 4 better accounted for the data (see Supplementary Table 3). Note that priors-independent model comparison criteria (LLmax, AIC and BIC) were smaller (indicating a better fit) in all RELATIVE models compared to the ABSOLUTE model, indicating that the finding that relative value learning better accounts for the data was robust across algorithmic variations of the context value update rule. Thus, subsequent analyses in the main text and in the supplementary materials focused on the RELATIVE model 4 only, to which we refer simply as "RELATIVE", to stress the main feature of the model rather than its less relevant algorithmic specifications.

Position of the RELATIVE models within the family of reinforcement learning algorithms: similarities and differences with previous formulations.
The RELATIVE family of models in general, and the RELATIVE 4 model in particular (the best fitting model), computationally embody the ideas behind the two-factor theory, which, in simple terms, states that instrumental punishment avoidance (cessation of fear, in the original formulation) should acquire a positive reinforcement value in order to sustain instrumental responding in the absence of further negative reinforcement (i.e. successful avoidance) 7. The RELATIVE models capture this basic intuition of the two-factor theory by assuming that, in the punishment conditions, neutral outcomes are computed relative to the negative context values (or state values, as they are more frequently called in the reinforcement learning literature). The idea of computationally capturing elements of the two-factor theory by assuming some form of relative value learning has also been proposed in previous computational studies 8,9. These studies were based on actor-critic or advantage learning models 10,11, and the models proved useful to account for classical avoidance learning results, such as the conditioned avoidance response (CAR) induced via the discriminated avoidance procedure.
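To make this intuition concrete, the following toy computation (our own illustration, with arbitrary numbers) shows how a neutral outcome generates a positive prediction error once it is referenced to a negative context value:

```python
# Toy illustration of the two-factor intuition: in a punishment context the
# context value V(s) is negative, so a neutral outcome (successful avoidance)
# yields a positive relative prediction error.
v_context = -0.25                     # arbitrary negative context value
r_neutral = 0.0                       # neutral outcome: punishment avoided
relative_pe = r_neutral - v_context   # outcome referenced to the context value
assert relative_pe > 0                # avoidance acts as a positive reinforcer
```

On an absolute scale the same neutral outcome would produce a null or negative prediction error and could not, by itself, reinforce the avoidance response.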

I) Comparison with two variants of the actor-critic model
We compared the RELATIVE 4 model with two variants of the actor-critic model. At each trial t the model calculates a chosen policy prediction error defined as:

δC = R_C − V(s),

where V(s) is the value of the current choice context s and R_C is the outcome of the chosen policy (factual outcome).
This prediction error is then used to update the chosen policy value P(s,c) using a delta rule:

P(s,c) ← P(s,c) + α1 × δC,

where α1 is the learning rate for the chosen option. We extended the actor-critic model in order to integrate counterfactual learning, as we did for the other models. Thus, in the complete feedback contexts, the model also calculates an unchosen policy prediction error:

δU = R_U − V(s),

where R_U is the outcome of the unchosen policy (counterfactual outcome). This prediction error is then used to update the unchosen policy value P(s,u) using a delta rule:

P(s,u) ← P(s,u) + α2 × δU,

where α2 is the learning rate for the unchosen option. The two variants of the actor-critic model differ in the way the context value V(s) is then updated. In the first, more "classical", variant (ACTOR-CRITIC 1) the chosen policy prediction error is also used to update the context value in all choice contexts:

V(s) ← V(s) + α3 × δC,

where α3 is the learning rate for the context value. In the second variant (ACTOR-CRITIC 2) the context value update also takes into account the unchosen policy prediction error. We submitted these new models to the same parameter optimization procedure and model comparison analyses described above.
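The two variants can be sketched as follows (our own Python rendering; the exact form of the ACTOR-CRITIC 2 critic update, here the average of the two prediction errors, is an assumption consistent with how the RELATIVE models average factual and counterfactual outcomes):

```python
# Schematic sketch of one learning step of the two actor-critic variants.
# P maps (state, option) pairs to policy values; V maps states to context values.
# r_u is None in the partial feedback contexts (no counterfactual outcome).
def actor_critic_step(P, V, s, c, u, r_c, r_u, alpha1, alpha2, alpha3, variant=1):
    delta_c = r_c - V[s]                    # chosen policy prediction error
    P[(s, c)] += alpha1 * delta_c           # actor: update chosen policy value
    delta_u = None
    if r_u is not None:                     # complete feedback: counterfactual update
        delta_u = r_u - V[s]                # unchosen policy prediction error
        P[(s, u)] += alpha2 * delta_u       # actor: update unchosen policy value
    if variant == 1 or delta_u is None:
        V[s] += alpha3 * delta_c            # ACTOR-CRITIC 1: critic uses delta_c only
    else:
        V[s] += alpha3 * (delta_c + delta_u) / 2.0  # ACTOR-CRITIC 2 (assumed form)
    return P, V
```

Unlike the RELATIVE models, the critic here learns only from the chosen (and, in variant 2, unchosen) policy's prediction errors, so V(s) tracks the ongoing policy rather than a choice-independent context average.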

II) Comparison with different ways to calculate the context value
We also devised two additional variants of the RELATIVE models. These variants assume that the context value is calculated based on the current (RELATIVE 5) or the best (RELATIVE 6) policy. More specifically, these models essentially differ from the RELATIVE 4 in the way they calculate R_V,t: the context-level outcome at trial t, which is used to update the context value V(s). In the RELATIVE 4 model, R_V was calculated based on R_C and Q(s,u) in the partial feedback contexts, and based on R_C and R_U in the complete feedback contexts (i.e. "random-policy", since independent of the subjects' choice). This choice was motivated by conceiving V(s) as a reference point as neutral as possible with respect to the currently obtained outcomes, supposing that subjects take all feedback into account (thus being random-policy) to estimate the context value (see "Conclusions on supplementary computational analyses"). However, this choice is not frequent in the current panorama of reinforcement learning algorithms. In the RELATIVE 5 model, for all choice contexts the context-level outcome is defined as:

R_V,t = R_C,t.

The context value V(s) is therefore calculated considering the ongoing policy ("on-policy"). This is the most frequent way to calculate the context value in the reinforcement learning literature. Note that the RELATIVE 5 is analogous to the advantage learning algorithm, once extended to include the counterfactual learning module 10. Another tempting possibility, particularly relevant in the presence of complete feedback information, is to calculate the context value based on the best policy.
The RELATIVE 6 model implements this possibility: in the partial information choice contexts the context-level outcome is defined as:

R_V,t = max(R_C,t, Q_t(s,u)),

whereas in the complete information choice contexts it is defined as:

R_V,t = max(R_C,t, R_U,t).

We submitted these new models to the same parameter optimization procedure and model comparison analyses described above.
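The alternative context-level outcomes can be sketched as follows (our own function names; the best-policy variant is rendered as taking the best available outcome, factual, counterfactual, or estimated):

```python
# Sketch of the alternative definitions of the context-level outcome R_V,t.
def r_v_on_policy(r_c):
    # RELATIVE 5: the context value tracks the ongoing policy's outcome
    return r_c

def r_v_best_policy_partial(r_c, q_unchosen):
    # RELATIVE 6, partial feedback: best of the factual outcome and Q_t(s,u)
    return max(r_c, q_unchosen)

def r_v_best_policy_complete(r_c, r_u):
    # RELATIVE 6, complete feedback: best of factual and counterfactual outcomes
    return max(r_c, r_u)
```

Contrast these with the random-policy definition of RELATIVE 4, which averages the two outcomes instead of selecting among them.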

III) Conclusions on supplementary computational analyses
Whereas previous computational studies suggested that the actor-critic architecture could provide a good explanation of the conditioned avoidance response 8,9, we found that in our task the RELATIVE 4 outperformed the actor-critic models. One important difference compared to the actor-critic model is that the RELATIVE 4 model can be reduced to Q-learning by assuming a null contextual learning rate (α3 = 0), whereas the actor-critic cannot. This lack of flexibility may at least partly explain its overall poor group-level performance. We also note important differences between the discriminated avoidance procedure and our paradigm. In the former, the contingencies are deterministic, avoidance learning is studied in isolation, and the "avoidance learning paradox" consists in the long-lasting insensitivity to extinction of the conditioned responses, despite the absence of further reinforcement. In our paradigm, the contingencies are probabilistic (thus with overlapping outcomes from the correct and incorrect choices), avoidance learning is not studied in isolation but in opposition to reward seeking behavior, and the "avoidance learning paradox" consists in similar performance in the reward and punishment domains, despite the fact that the performance-induced sampling bias would predict enhanced performance in the reward domain. These important differences should also be taken into account when interpreting the relatively poor performance of the actor-critic models in our task. On the other hand, the good potential of the actor-critic model to explain the post-test results (Supplementary Figures 3E and 3F) further illustrates the conceptual proximity between this influential algorithm and the RELATIVE model 4.
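The nesting claim can be illustrated with a minimal sketch (our own, with arbitrary parameter values): when α3 = 0 the context value never moves from its initial value of zero, and the RELATIVE update coincides with the ABSOLUTE (Q-learning) update.

```python
# Minimal check that the RELATIVE update with alpha3 = 0 (context value frozen
# at its initial value of 0) reproduces the ABSOLUTE (Q-learning) update.
def q_update(q, r, alpha):
    return q + alpha * (r - q)             # absolute prediction error

def relative_update(q, r, v, alpha):
    return q + alpha * ((r - v) - q)       # outcome referenced to V(s)

v = 0.0                                    # alpha3 = 0: V(s) stays at 0
q_abs = q_rel = 0.0
for r in [0.5, 0.0, -0.5, 0.5]:            # arbitrary outcome sequence
    q_abs = q_update(q_abs, r, 0.3)
    q_rel = relative_update(q_rel, r, v, 0.3)
assert q_abs == q_rel                      # the two trajectories coincide
```

No such parameter setting collapses the actor-critic onto Q-learning, since the actor's policy values are never referenced to an outcome-scale Q-value.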
We also found that the "random-policy" definition of the context value better accounted for the data than the on-policy one. Note that standard reinforcement learning tasks typically involve only one type of choice context (and no complete feedback information) 11,12. Thus, in the presence of only one type of choice context, the model predictions obtained using on-policy or random-policy context values V(s) can hardly diverge. In such mono-dimensional tasks, on- and random-policy context values would display a similar trend across trials, and eventual differences in their magnitude can easily be neutralized by rescaling parameters, such as coefficients or learning rates (see Supplementary Figure 4A-C). We believe that we were able to rule out on-policy (and best-policy) context values precisely thanks to the presence of multiple, different choice contexts in our design. In particular, both the simultaneous contrast between reward and punishment and that between partial and complete feedback information contributed to highlight this feature of the best fitting model. In fact, only the random-policy context values i) were symmetrical with respect to the valence, thus permitting similar performance in the reward and punishment domains, and ii) were enhanced in magnitude in the complete feedback contexts, thus permitting the value inversion of the intermediate value cues in the complete feedback conditions (see Supplementary Figure 3A-F). Furthermore, the importance of being on-policy has mainly been stressed in problems with a risk of substantial or lethal punishments, such as the cliff simulation, where random-policy algorithms such as Q-learning cannot avoid sometimes falling off the cliff due to occasional exploratory decisions 11,12. In our case, it is reasonable to consider that human subjects do not fear being harmed when interacting with the screen. In fact, we believe that this algorithmic difference between the standard view of the context value V(s) (on-policy) and ours (random-policy) betrays a more profound difference concerning the psychological intuitions behind these quantities.
Whereas in most reinforcement learning models V(s) is conceived as a "Pavlovian" anticipation of the reward (or punishment) to come, aimed at eliciting automatic motor effects 2,3, in our model it represents a more abstract signal, subserving value contextualization for efficient encoding purposes 13-15. In the light of this interpretation, it is easy to understand why, in the framework of "motor preparation", the context value needs to be calculated in an on-policy manner (preparation for an outcome), whereas in the framework of "efficient coding", the context value has to be calculated in a random-policy manner. In principle, both quantities (on- and random-policy context values) could exist in the brain and express their effects in different behavioral measures.
Further work, probably implicating a deeper analysis of reaction times (a good candidate for Pavlovian effects), could shed light on this topic. Finally, we acknowledge that a random-policy calculation of the context value could rapidly become computationally challenging in learning situations involving more than two options.
Further studies are needed to uncover the learning heuristics implemented in such cases.

Supplementary Note 3: written task instructions
The subjects read the learning test instructions before the training session, outside the scanner. The experimenter read the post-learning instructions to the subject while he/she was in the scanner, after the last (fourth) functional acquisition and before starting the anatomical T1 acquisition.

Learning test instructions
The experiment is divided into four sessions of about 12 minutes each. There will be two training sessions (a longer one outside and a shorter one inside the scanner) before the start of the fMRI experiment.
In each round you are asked to choose one of two abstract symbols. The symbols will appear on the screen to the left and right of a fixation cross. To choose one of the two symbols, press the left or right button. After a few seconds a cursor will appear under the chosen symbol, confirming your choice. If you do not press any button, the cursor will appear at the center of the screen, and your result will be disadvantageous.
As an outcome of your choice you may:
- gain 50 cents (+0.5€)
- get nothing (0€)
- lose 50 cents (-0.5€)
The outcome of your choice will appear on top of the chosen symbol and will always be indicated by the position of the cursor. The two symbols are not equivalent (identical): one of the two symbols is on average more advantageous or less disadvantageous, in the sense that it makes you win more often or lose less often than the other. The goal of the experiment is to gain as much as you can.
In some trials, information about the outcome of the unchosen option will also be provided. Note that your earnings will correspond only to the chosen option. At the end of each session the experimenter will communicate your earnings for that session. Your final earnings will correspond to the sum of the earnings of the four sessions.

Post-learning test instructions
The test will last 5 minutes with no training.
The goal of the next test is to indicate the symbol with the higher value from the last (fourth) session. On each trial, you are asked to choose between two symbols by pressing the corresponding button. Your choice will be immediately recorded and confirmed by a cursor that will appear under the chosen stimulus.
The symbols shown will not always have been presented together in the previous session. Please try to give an answer even if you are not completely sure.