Confidence and psychosis: a neuro-computational account of contingency learning disruption by NMDA blockade

A state of pathological uncertainty about environmental regularities might represent a key step in the pathway to psychotic illness. Early psychosis can be investigated in healthy volunteers under ketamine, an NMDA receptor antagonist. Here, we explored the effects of ketamine on contingency learning using a placebo-controlled, double-blind, crossover design. During functional magnetic resonance imaging, participants performed an instrumental learning task, in which cue-outcome contingencies were probabilistic and reversed between blocks. Bayesian model comparison indicated that in such an unstable environment, reinforcement learning parameters are downregulated depending on confidence level, an adaptive mechanism that was specifically disrupted by ketamine administration. Drug effects were underpinned by altered neural activity in a fronto-parietal network, which reflected the confidence-based shift to exploitation of learned contingencies. Our findings suggest that an early characteristic of psychosis lies in a persistent doubt that undermines the stabilization of behavioral policy resulting in a failure to exploit regularities in the environment.

The learning rate α and the choice temperature β are free parameters, with the constraints 0≤α≤1 and β>0. The learning rate adjusts the weight assigned to prediction error in value updating, and the choice temperature the degree of exploration (as opposed to exploitation of the learned value).
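As a minimal sketch of how the two free parameters act, the snippet below implements a delta-rule value update and a choice rule with a temperature. The logistic form of the choice rule and the function names are assumptions made for illustration; this section does not state the exact choice rule that was fitted.

```python
import math


def update_value(q, reinforcer, alpha):
    """Delta rule: move q toward the reinforcer by a fraction alpha of the prediction error."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    prediction_error = reinforcer - q
    return q + alpha * prediction_error


def choice_probability(q, beta):
    """ASSUMED logistic rule: probability of choosing the option favoured by a positive q.

    A small beta gives near-deterministic exploitation of q;
    a large beta gives near-random (exploratory) choices.
    """
    if beta <= 0.0:
        raise ValueError("beta must be > 0")
    return 1.0 / (1.0 + math.exp(-q / beta))
```

With the same learned value, lowering `beta` pushes the choice probability away from 0.5, which is the sense in which the temperature sets the balance between exploration and exploitation.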
We devised three variants of this reinforcement learning level, following step-by-step increments from model-free to model-based strategy, i.e. adding pieces of information about task structure.
In a first variant, the reinforcer RQ was the monetary value of the outcome (1, 0.1, -0.1 or -1). This variant can be considered as a model-free strategy, in line with the law of effect, meaning that outcomes increased the probability of repeating the same choice, depending on their sign and magnitude.
In a second variant, the reinforcer RQ was defined according to outcome valence (Val) and not its magnitude (i.e. 1 when winning £1 or 10p; -1 when losing £1 or 10p). This variant implies that subjects understood that cues determined the outcome valence (positive or negative) and not its magnitude, which depended on the choice.

In a third variant, the reinforcer RQ was defined according to outcome valence (Val) and the update of the current cue (say cue A) was transferred to the alternative cue (cue B).
This variant implies that subjects understood that there were only two cues, with opposite valence. In other words, the two cue values summed up to zero.
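The three variants can be sketched as follows. The function names, the dictionary representation of cue values, and the way the transfer is implemented in the third variant (setting the alternative cue to the negative of the updated cue, so that the two values sum to zero) are assumptions made for illustration.

```python
def reinforcer(outcome, variant):
    """Return RQ for a monetary outcome in pounds (1, 0.1, -0.1 or -1)."""
    if variant == 1:
        return outcome  # variant 1: monetary value (sign and magnitude)
    return 1.0 if outcome > 0 else -1.0  # variants 2 and 3: valence only


def update_cues(q, cue, outcome, alpha, variant):
    """Delta-rule update of cue values; q is a dict with keys 'A' and 'B'."""
    other = "B" if cue == "A" else "A"
    new_q = dict(q)
    new_q[cue] = q[cue] + alpha * (reinforcer(outcome, variant) - q[cue])
    if variant == 3:
        # ASSUMPTION: transfer implemented by mirroring, so Q(A) + Q(B) = 0.
        new_q[other] = -new_q[cue]
    return new_q


start = {"A": 0.0, "B": 0.0}
after_small_win = {v: update_cues(start, "A", 0.1, 0.5, v) for v in (1, 2, 3)}
```

After a 10p win on cue A, the first variant moves Q(A) only slightly, the second treats the win like any other positive outcome, and the third additionally lowers Q(B).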
-Meta-learning level
Reinforcement learning models have constant parameters (learning rate and choice stochasticity). This limits the capacity to optimize the behavioral policy toward the end of learning blocks, once subjects believe they have a reasonably good estimate of the contingencies. At this point, prediction errors should be tempered, and choices tuned to a more deterministic exploitation of learned contingencies. (4-6) Conversely, when contingencies suddenly change after reversals, prediction errors should be given more weight, and choices should be more exploratory. One way to optimize behavior is to subordinate the reinforcement learning parameters to a higher level of control that monitors performance. A second series of models therefore included a meta-cognitive level that updated confidence so as to down-regulate contingency learning and choice stochasticity. We compared two ways to monitor confidence and four ways to use it.

--Confidence monitoring level
In both variants, confidence was monitored using a delta rule. The confidence learning rate γ was a free parameter, with 0≤γ≤1. The initial value of confidence, C(0), was also fitted as a free parameter, with 0≤C(0)≤1.
In a first variant, we used the absolute value of the prediction error computed at the reinforcement learning level to update confidence (7).
In a second variant, we used outcome optimality (Op) to update confidence (i.e. 1 for winning £1 or losing 10p, -1 otherwise).
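A generic delta rule for confidence can be sketched as below. The update equations themselves are not reproduced in this section, so the two functions that map the prediction error or the optimality signal onto a target in [0, 1] are hypothetical and serve only to make the example runnable; they should not be read as the fitted model.

```python
def update_confidence(confidence, target, gamma):
    """Delta rule: move confidence toward `target` at rate gamma (0 <= gamma <= 1)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return confidence + gamma * (target - confidence)


def target_from_prediction_error(prediction_error, max_error=2.0):
    """HYPOTHETICAL mapping: small absolute prediction errors give a target near 1."""
    return 1.0 - abs(prediction_error) / max_error


def target_from_optimality(op):
    """HYPOTHETICAL mapping of Op in {-1, 1} onto {0, 1}."""
    return (op + 1.0) / 2.0


# One update from C(0) = 0.5 after an optimal outcome, with gamma = 0.2.
c_next = update_confidence(0.5, target_from_optimality(1), 0.2)
```

Whatever the exact target, the delta rule keeps confidence within [0, 1] as long as C(0) and the target both lie in that interval.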

--Modulation of low-level free parameters
Confidence was used to modulate the free parameters in the reinforcement learning models. This was done after each outcome, which brought information about how accurate the reinforcement learning model was, in terms of value estimates or behavioral policy. We considered four possibilities: modulation of learning rate or choice temperature, or both with the same weight, or both with a different weight.
The learning rate was modulated on the basis of not only confidence but also the outcome category. The idea is that to stabilize a representation of learned contingencies, subjects should increase their sensitivity to confirmation and decrease their sensitivity to contradiction. The impact of confidence on the learning rate therefore depended on whether the outcome was confirmatory (outcome and cue value have the same sign; Val(t) = sign(Q(t))) or not.
For confirmatory outcomes, α was modulated as a function of confidence, where α0 and kα are free parameters, and a separate rule applied to contradictory outcomes. Therefore, when confidence increased, the modified learning rate αm got closer to 1 for confirmatory outcomes and closer to zero for contradictory outcomes.
The choice temperature β was modulated such that exploration was reduced when confidence increased:
βm(t) = β0/(1+kβ*C(t))
where β0 and kβ are free parameters. This modulation enables increasing exploitation above matching behavior, i.e. choosing the risky option more than 80% of the time following a cue that was associated with a reward 80% of the time.
To test whether these modulations improved the fit of observed choices, we compared models that did or did not include the free parameters kα and kβ, which were either constrained to be identical or allowed to differ.
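The modulation rules described above can be sketched as follows. The temperature rule is the one given in the text. The two learning-rate rules are assumptions: their equations are not reproduced in this section, and the forms below were chosen only because they satisfy the stated limits (toward 1 for confirmatory outcomes, toward zero for contradictory ones) and mirror the temperature rule. Function and argument names are illustrative.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def is_confirmatory(valence, q):
    """Outcome is confirmatory when Val(t) = sign(Q(t))."""
    return sign(valence) == sign(q)


def modulated_temperature(beta0, k_beta, confidence):
    """From the text: beta_m(t) = beta0 / (1 + k_beta * C(t))."""
    return beta0 / (1.0 + k_beta * confidence)


def modulated_learning_rate(alpha0, k_alpha, confidence, confirmatory):
    """ASSUMED forms, consistent with the limits stated in the text."""
    denom = 1.0 + k_alpha * confidence
    if confirmatory:
        return (alpha0 + k_alpha * confidence) / denom  # tends to 1 as k*C grows
    return alpha0 / denom  # tends to 0 as k*C grows


# Illustrative values only (not fitted estimates).
low = modulated_learning_rate(0.5, 4.0, 0.0, True)       # no confidence: alpha0
confirm = modulated_learning_rate(0.5, 4.0, 1.0, True)   # full confidence
contra = modulated_learning_rate(0.5, 4.0, 1.0, False)   # full confidence
```

Setting `k_alpha` or `k_beta` to zero recovers the constant-parameter model, which is what makes the nested comparison between models with and without modulation possible.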
Other hierarchical models
Other hierarchical models have been developed to implement a form of second-level confidence that modulates first-level estimates. For instance, one hierarchical Bayesian architecture models the behavior in a probabilistic reversal learning task, with a second-level inference that tracks the occurrence of contingency reversals.(8) However, reversals were numerous in this task and participants were extensively trained, so they had built an internal model of the task (including the possibility of a reversal) before entering the scanner. In our paradigm, participants were not informed about the presence of reversals and they encountered only three of them, which was not enough to build and integrate an explicit notion of reversal at the meta-cognitive level.
Unsurprisingly, we found no evidence that behavior was reversed more easily the last time than the first: if anything, performance in the last block was worse.
For similar reasons, we did not include the possibility of re-using contingencies that were learned in previous blocks, as was done in another hierarchical model with task-set monitoring on top of Q-learning.(9) Indeed, there was no evidence that the second and third reversals, after which subjects could have returned to previous contingency sets, were learned faster than the first one. In addition, none of the existing models implemented the differential impact of the meta-cognitive level on the first-level learning rule, which enables participants to specifically ignore contradictory outcomes (probabilistic errors), a key way to stabilize behavior at the end of blocks. It is important to keep in mind that our paradigm was not designed to investigate reversal processes per se, but to examine how behavior is optimized between reversals.

Family analyses
Family model comparison (10) was used to test whether each level of complexity added to the basic reinforcement learning model was necessary for explaining choice data. In a first comparison, the model space was divided into three families, depending on the RL variant (i.e. whether the monetary value or the outcome valence was integrated in the delta rule, and whether the two cues or only the current cue were updated). Results confirmed that participants integrated the two aspects of task structure (xp = 0.97 and 0.93 for placebo and ketamine sessions, respectively): first, that only the outcome valence, and not the monetary amount, was informative about cue value, and second, that the two cues always had opposite valence, such that they could both be updated after every outcome.
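To illustrate what a family-level exceedance probability (xp) measures, the sketch below assumes that a random-effects model comparison has already produced Dirichlet parameters over model frequencies. It samples model frequencies, sums them within each family, and counts how often each family is the most frequent. The function name, the sampling approach and the numbers in the usage lines are assumptions for illustration, not the analysis pipeline or results of the study.

```python
import random


def family_exceedance(dirichlet_alpha, families, n_samples=20000, seed=0):
    """Monte Carlo estimate of family exceedance probabilities.

    dirichlet_alpha: posterior Dirichlet parameter of each model (assumed given).
    families: family index (0, 1, ...) of each model, in the same order.
    """
    rng = random.Random(seed)
    n_fam = max(families) + 1
    wins = [0] * n_fam
    for _ in range(n_samples):
        # Independent gamma draws are an unnormalised Dirichlet sample;
        # normalising is unnecessary because only the largest family matters.
        draws = [rng.gammavariate(a, 1.0) for a in dirichlet_alpha]
        totals = [0.0] * n_fam
        for g, fam in zip(draws, families):
            totals[fam] += g
        wins[totals.index(max(totals))] += 1
    return [w / n_samples for w in wins]


# Made-up example: six models grouped into three families of two.
xp = family_exceedance([10, 8, 1, 1, 1, 1], [0, 0, 1, 1, 2, 2],
                       n_samples=2000, seed=1)
```

Because the family frequency is the sum of its members' frequencies, a family can win even when no single model inside it dominates, which is the motivation for comparing families rather than individual models.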
In a second comparison, we divided the model space into three families, according to the way confidence was updated (see Supplementary tables).

Table S1 Exceedance probabilities table for Placebo sessions.
Table S2 Exceedance probabilities table for Ketamine sessions.
Table S3 Parameter estimates for the best computational model