Abstract
Dopamine is implicated in adaptive behavior through reward prediction error (RPE) signals that update value estimates. There is also accumulating evidence that animals in structured environments can use inference processes to facilitate behavioral flexibility. However, it is unclear how these two accounts of reward-guided decision-making should be integrated. Using a two-step task for mice, we show that dopamine reports RPEs using value information inferred from task structure knowledge, alongside information about reward rate and movement. Nonetheless, although rewards strongly influenced choices and dopamine activity, neither activating nor inhibiting dopamine neurons at trial outcome affected future choice. These data were recapitulated by a neural network model where cortex learned to track hidden task states by predicting observations, while basal ganglia learned values and actions via RPEs. This shows that the influence of rewards on choices can stem from dopamine-independent information they convey about the world’s state, not the dopaminergic RPEs they produce.
Main
Adaptive behavior requires learning which actions lead to desired outcomes and updating this knowledge when the world changes. Reinforcement learning (RL) has provided an influential account of how this works in the brain, with RPEs updating estimates of the values of states and/or actions, in turn driving choices. In support of this framework, dopamine activity resembles RPEs in many behaviors^{1,2,3,4}, and causal manipulations can reinforce or suppress behaviors consistent with dopamine acting functionally as an RPE^{5,6,7,8}.
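The RPE update at the core of this account can be sketched as a temporal-difference rule (a minimal illustration, not any specific implementation from this study; the learning rate and discount values are arbitrary):

```python
# Minimal temporal-difference sketch of RPE-driven value learning.
# V maps states to value estimates; delta is the dopamine-like RPE.

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Update V[s] in place using the reward prediction error."""
    delta = r + gamma * V[s_next] - V[s]   # RPE: better or worse than expected?
    V[s] = V[s] + alpha * delta
    return delta

V = {"cue": 0.0, "outcome": 0.0}
delta = td_update(V, "cue", 1.0, "outcome")  # unexpected reward -> positive RPE (delta = 1.0)
```

In this scheme, a fully unexpected reward produces a large positive RPE, while an expected reward produces none; it is this error signal, rather than reward per se, that dopamine activity is proposed to broadcast.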
However, value learning is not the only way we adapt to changes in the environment. For example, we behave differently on weekdays and weekends, but this is clearly not because we relearn the value of going to work versus spending time with family each Saturday morning. Rather, although the world looks the same when we wake up, we understand that the state of the world is in fact different, and this calls for different behavior. Formally, the decision problem we face is partially observable—our current sensory observations only partially constrain the true state of the world. In such environments, it is typically possible to estimate the current state better using the history of observations than using just the current sensory input^{9,10}.
It is increasingly clear that this ability to infer hidden (that is, not directly observable) states of the world plays an important role even in simple laboratory reward-guided decision-making^{10,11,12,13,14,15,16,17,18}. For example, in probabilistic reversal learning tasks where reward probabilities of two options are anticorrelated, both behavior and brain activity indicate that subjects understand this statistical relationship^{10,12,13,19}. This is not predicted by standard RL models in which RPEs update the value of preceding actions, but is predicted by models which assume subjects understand there is a hidden state that controls both reward probabilities. Intriguingly, brain recordings have shown that not only prefrontal cortex (PFC) but also the dopamine system can reflect knowledge of such hidden states^{13,19,20,21,22,23}.
Integrating these two accounts of behavioral flexibility raises several pressing questions. If state inference, not RL, mediates flexible reward-guided behavior, why does dopamine look and act like an RPE? Conversely, if value updates driven by dopaminergic RPEs are responsible, how does this generate the signatures of hidden-state inference seen in the data?
To address these questions, we measured and manipulated dopamine activity in highly trained mice performing a two-step decision task. The task had two important features. First, reward probabilities were anticorrelated and reversed periodically, constituting a hidden state that could be inferred by observing where rewards were obtained. Second, inference- and RL-based strategies could be differentiated by measuring how prior rewards affected dopamine activity. Behavior and dopamine signaling were consistent with mice tracking a single hidden state of the reward probabilities. However, while dopamine signals closely resembled RPEs, neither activating nor inhibiting dopamine neurons at trial outcome had any effect on subsequent choice. We show that these apparently paradoxical data can be reproduced by a neural network model in which cortex infers hidden states by predicting observations and basal ganglia uses RL mediated by dopaminergic RPEs to learn appropriate actions.
Results
Mouse behavior respects task causal structure
We trained dopamine transporter (DAT)-Cre mice (n = 18) on a sequential decision task, which required them to choose between two options—left and right—to gain access to one of two reward ports—up or down (Fig. 1a). Each first-step choice led commonly (80% of trials) to one second-step state and rarely (20% of trials) to the other (Fig. 1b). Reward probabilities in each port changed in blocks between 0.8/0.2, 0.5/0.5 and 0.2/0.8 on the up/down ports, respectively (Fig. 1c), and were therefore anticorrelated at the up and down ports, while transition probabilities remained fixed. This facilitates hidden-state inference strategies (also referred to as ‘latent-state’ inference^{24}) because the state of the reward probabilities—whether up or down is rewarded with higher probability—fully determines which first-step action is best^{24,25}.
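The task contingencies described above can be made concrete with a small simulator (a hypothetical sketch; the transition and block reward probabilities come from the text, while the choice-to-state mapping and everything else is illustrative):

```python
import random

# Two-step task sketch: first-step choices 'left'/'right' lead to second-step
# states 'up'/'down' with fixed 80/20 transition probabilities; reward
# probabilities at the two ports are anticorrelated and change in blocks.

COMMON = {"left": "up", "right": "down"}        # illustrative common transitions
BLOCKS = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.8)]   # (P(reward|up), P(reward|down))

def run_trial(choice, block, rng=random):
    """Sample the second-step state reached and the trial outcome."""
    if rng.random() < 0.8:
        state = COMMON[choice]                   # common transition
    else:
        state = "down" if COMMON[choice] == "up" else "up"  # rare transition
    p_up, p_down = block
    rewarded = rng.random() < (p_up if state == "up" else p_down)
    return state, rewarded

state, rewarded = run_trial("left", BLOCKS[0])
```

Because the two ports' reward probabilities are yoked to a single block state, an agent that knows this structure only needs to track one hidden variable to behave optimally.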
Subjects tracked which option was currently best, performing 348.70 ± 93.90 trials and completing 8.32 ± 3.24 blocks per session (mean ± s.d. across subjects; Fig. 1c). Reward following a common transition promoted repeating the same first-step choice on the next trial, while reward following a rare transition promoted switching to the other first-step choice (Fig. 1d,e; mixed-effects logistic regression—transition × outcome: β = 0.507, s.e. = 0.044, z = 11.468, P < 0.001), and these effects persisted over multiple trials (Fig. 1f). This pattern is adaptive because it corresponds to rewards promoting choice of first-step actions that commonly lead to the second-step states where they were obtained. However, the probability of repeating the same choice following a non-rewarded outcome was similar irrespective of whether a common or rare transition occurred (Fig. 1d,f).
To assess what strategy the animals used, we fitted a set of models to their choices and simulated data from the fitted models (Fig. 1g–j). Neither model-free nor model-based RL, commonly used to model two-step task behavior^{26}, resembled subjects’ choices (Fig. 1g,h). We also considered a strategy that used Bayesian inference to track the hidden state of the reward probabilities, combined with a fixed habit-like mapping from this state estimate to the corresponding high-value first-step action^{24}, but again this did not resemble the experimental data. These models failed because both predict a symmetric influence of reward and omission on choices, contrary to our experimental data; therefore, we modified each model to incorporate this asymmetry. For model-based RL, this was done using different learning rates for positive and negative RPEs. We also incorporated forgetting about the value of states that were not visited, as this was supported by model comparison (Extended Data Fig. 1). This approach was not possible for the inference model, as Bayesian updates do not have a learning rate parameter that can be different for reward and omission. We therefore implemented the asymmetry by modifying the observations received by the model, so that it treated reward obtained in the two second-step states as different observations, but treated reward omission as the same observation irrespective of the state where it occurred. Simulated on the task, both asymmetric models generated a pattern of stay probabilities that closely matched subjects’ data (Fig. 1i,j).
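The asymmetric model-based update described above can be sketched as follows (an illustrative sketch only; the learning-rate, forgetting and default-value parameters are arbitrary, not the fitted values):

```python
# Sketch of an asymmetric second-step value update: separate learning rates
# for positive and negative RPEs, plus forgetting of the unvisited state
# toward a default value. Parameter values are illustrative.

def update_state_values(V, visited, other, reward,
                        alpha_pos=0.5, alpha_neg=0.1,
                        forget=0.05, default=0.5):
    delta = reward - V[visited]                  # outcome prediction error
    alpha = alpha_pos if delta > 0 else alpha_neg
    V[visited] += alpha * delta                  # asymmetric learning
    V[other] += forget * (default - V[other])    # forgetting, not value transfer
    return V

V = {"up": 0.5, "down": 0.5}
update_state_values(V, "up", "down", reward=1.0)  # reward at the up port
```

Note that under this scheme the unvisited state merely drifts toward a default; it is not pushed down by reward at the other port, which is the signature that later distinguishes this account from state inference.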
We adopted two different approaches to try to differentiate between these strategies: (i) likelihood-based model comparison (Supplementary Tables 1 and 2), and (ii) fitting a mixture-of-strategies model incorporating both components to assess which explained most variance in subjects’ choices (Supplementary Table 3). Both analyses gave a consistent picture that it was not possible to arbitrate between the strategies using behavior alone (Extended Data Fig. 1). Critically, however, the two strategies make different predictions for how rewards update the estimated value of each second-step state (discussed below), and hence for dopaminergic RPE signaling. We therefore looked for evidence of inference-based value updates in dopamine activity.
Inferred values drive dopamine signals
We used fiber photometry to record calcium signals from GCaMP6f-expressing dopamine neuron cell bodies in the ventral tegmental area (VTA) and axons in the nucleus accumbens (NAc) and the dorsomedial striatum (DMS; Fig. 2a), and dopamine release using dLight1.1 expressed pan-neuronally in the NAc and DMS (Fig. 2b; see Extended Data Fig. 2 for placements). Dopamine activity fluctuated dynamically across the trial, as mice made their initial choice and received information about the second-step state reached and the trial outcome (Fig. 2d–f). Reward responses were prominent in all signals, although relatively weaker in DMS calcium. However, average DMS calcium activity masked a strong mediolateral gradient in reward response, with larger responses more laterally in the DMS (Extended Data Fig. 3). For the following analyses, we excluded the DMS site in two animals where the fiber was most medial, and we observed a negative reward response (Extended Data Fig. 3).
A key feature of the state inference strategy is that it assumes that a single hidden variable controls both reward probabilities. Therefore, reward obtained in one second-step state not only increases the value of that state but also decreases the value of the other second-step state, unlike in standard model-based RL where the state values are learned independently (Fig. 3c). We can therefore leverage our photometry data to discriminate between these strategies by examining how the previous trial’s outcome influences dopamine activity and release when the second-step state reached on the current trial is revealed. Specifically, we can ask whether a reward obtained in one second-step state (for example, up-active) decreases the dopamine response to the other second-step state (down-active) if it is reached on the next trial.
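This coupling between the two state values falls directly out of a Bayesian belief update over the single hidden block state (a sketch under illustrative reward probabilities, not the fitted inference model):

```python
# Under state inference, a single belief b = P(up is the good port) sets
# BOTH state values, so reward at one port lowers the expected value of
# the other. Reward probabilities here are illustrative.

P_GOOD, P_BAD = 0.8, 0.2   # reward probability at the good vs bad port

def belief_update(b, state, rewarded):
    """Bayes update of b = P(up-port is currently the good one)."""
    if state == "up":
        lg, lb = (P_GOOD, P_BAD) if rewarded else (1 - P_GOOD, 1 - P_BAD)
    else:
        lg, lb = (P_BAD, P_GOOD) if rewarded else (1 - P_BAD, 1 - P_GOOD)
    return lg * b / (lg * b + lb * (1 - b))

def state_values(b):
    # Both state values are coupled through the shared belief b.
    return {"up": b * P_GOOD + (1 - b) * P_BAD,
            "down": b * P_BAD + (1 - b) * P_GOOD}

b = belief_update(0.5, "up", rewarded=True)  # reward observed at the up port
V = state_values(b)                          # V["up"] rises while V["down"] falls
```

A single rewarded visit to the up port therefore simultaneously raises the expected value of up-active and lowers that of down-active, the non-local update that the photometry analysis tests for.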
We aligned activity across trials and used linear regression to model dopamine fluctuations across trials and timepoints. We ran separate regression analyses for each timepoint, using predictors that varied from trial to trial but using the same value for all timepoints in each trial. The time courses of predictor loadings across the trial, therefore, reflect when, and with what sign, each predictor explained variance in the activity (Fig. 3a). The key predictors for differentiating between strategies included one coding for the previous trial’s outcome on trials where the second-step state is the same as on the previous trial, and another coding for the previous trial’s outcome when the second-step state reached on the current trial was different (Fig. 3b). We also included regressors modeling the current trial outcome and other possible sources of variance (Methods and Extended Data Fig. 4). We focus on the GCaMP data in the main figures, but results from dLight were closely comparable, except where noted in the text (Extended Data Fig. 4).
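The timepoint-by-timepoint regression can be sketched as follows (synthetic data only; predictor names, sizes and the injected effect are all illustrative, not the study's regressors):

```python
import numpy as np

# Each trial contributes one row of predictors (constant across timepoints
# within a trial); a separate least-squares fit is run at every timepoint,
# yielding a time course of loadings per predictor.

rng = np.random.default_rng(0)
n_trials, n_timepoints = 200, 50

X = rng.standard_normal((n_trials, 3))        # trial-level predictors
X = np.hstack([X, np.ones((n_trials, 1))])    # intercept column

# Synthetic photometry traces; inject a known effect of predictor 0 at
# timepoint 10 so the recovered loading time course shows a peak there.
signal = rng.standard_normal((n_trials, n_timepoints))
signal[:, 10] += 2.0 * X[:, 0]

loadings = np.empty((n_timepoints, X.shape[1]))
for t in range(n_timepoints):
    beta, *_ = np.linalg.lstsq(X, signal[:, t], rcond=None)
    loadings[t] = beta
```

Plotting each column of `loadings` against time then shows when, and with what sign, the corresponding predictor explained variance in the signal.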
When the second-step state was the same as on the previous trial, the previous trial’s outcome positively influenced dopamine when the second-step state was revealed, consistent with both model-based RL and inference (Fig. 3d,e and Extended Data Fig. 4). However, when the second-step state was different to the previous trial, the previous outcome negatively influenced dopamine when the second-step state was revealed. This is at odds with the predictions of standard model-based RL but, crucially, is consistent with inference (Fig. 3d,e and Extended Data Fig. 4). In NAc, loading on these regressors reversed sign at outcome time. This biphasic response is exactly as predicted for an RPE; RPEs are computed from value differences between successive time steps, so if dopamine reports RPE, the value of the second-step state reached on the trial will drive a positive response when the state is revealed, followed by a negative response at the time of trial outcome^{27,28}. Unexpectedly, this reversal was not observed in either the VTA or the DMS.
To ensure that this pattern of modulation by inferred value was not an artifact of confounding effects of trial history, we performed a lagged regression predicting the dopamine response to the second-step state cue. This confirmed that the dopamine response was driven by inferred state values, and that these integrated outcomes over multiple previous trials (less clearly in the DMS axonal calcium activity, but prominently in DMS dopamine release; Extended Data Fig. 5b).
To test whether the asymmetric influence of rewards and omissions on subjects’ choices was also reflected in dopamine activity, we ran a modified regression analysis using separate regressors for trials following rewarded and non-rewarded outcomes. In the VTA and NAc, the differential dopamine response to reaching the same versus different second-step state was much stronger following rewarded than non-rewarded trials, consistent with rewards updating second-step state values more strongly than omissions (Extended Data Fig. 5c). Dopamine activity at choice time was also higher when subjects chose an action that commonly led to the state where reward was obtained on the previous trial (Extended Data Fig. 4, inferred action value update), consistent with subjects inferring the value of the first-step action using knowledge of the task structure. Again, these effects were primarily driven by rewarded rather than omission trials (Extended Data Fig. 5d).
Together, these findings indicate that the mice understood that a single hidden variable controlled the reward probabilities in both ports and inferred its current state by observing where rewards were obtained. Reward predictions based on this then shaped dopamine responses to both task states.
Dissociable influence of RPE, reward rate and movement on dopamine activity
Recent work has argued that dopamine fluctuations more closely reflect value than RPE^{29}. To examine whether this was the case in our data, we used the same linear regression framework but with value estimates from the inference model for the chosen first-step action and second-step state as predictors (Fig. 4 and Extended Data Figs. 6 and 7). As lateralized movements^{6} and average reward rate^{5,29} have also been reported to influence dopamine, we additionally included regressors for these variables.
In line with our above results, second-step state values drove a biphasic response in NAc GCaMP and dLight signals, with a positive influence when the second-step state was revealed followed by a negative influence at outcome, consistent with RPE rather than direct value coding (Fig. 4b). VTA GCaMP also showed this biphasic pattern but with a smaller negative response at outcome time relative to the positive response to the second-step state. The time course was more complex in the DMS, with peaks following both the second-step cue and second-step port entry, although the former only survived multiple-comparison correction across timepoints in the dLight data (Extended Data Fig. 7). The chosen action value also had a strong positive influence on activity in all three regions around the time of the choice, which then reversed in all the regions when the second-step value was revealed, again consistent with RPE (Fig. 4c).
In addition to these RPE-like signals, dopamine was also transiently modulated by lateralized movement (Fig. 4d and Extended Data Figs. 4, 6 and 7). Consistent with previous reports^{6}, activity in the DMS but not the VTA or NAc was modulated after initiating a contralateral choice. Unlike previous studies, here the task necessitated a second lateralized movement (in the opposite direction) from the choice port back to the centrally located second-step reward port. This did not evoke a response in DMS activity, but did in VTA and NAc activity (note the negative predictor loadings for the VTA and NAc in Fig. 4d following the choice indicate increased activity for contralateral movements from the choice port back to the second-step port).
Reward rate had a strong positive influence on dopamine in all three regions (Fig. 4e and Extended Data Figs. 4, 6 and 7). Unlike the influence of action/state values and rewards, which were tightly time-locked to trial events (Fig. 4a–c), reward rate positively influenced activity at all timepoints, with little modulation by specific trial events (Fig. 4e,f). This reward rate signal was also present in NAc dopamine concentrations, but negligible in DMS concentrations (Extended Data Fig. 7).
In sum, these data demonstrate that dopamine carries information about (i) action and state values in a manner consistent with RPE signaling, (ii) lateralized movement and (iii) recent reward rate. While these signals exist in parallel, they can nonetheless be dissociated based on their timescale and their lateralization.
Dopamine does not mediate the reinforcing effect of task rewards
To assess the causal influence of dopamine on choices, we manipulated dopamine activity in a new cohort of mice expressing either channelrhodopsin (ChR2; N = 7) or a control fluorophore (EYFP, N = 5; Fig. 5a and Extended Data Fig. 8a) in VTA dopamine neurons. We verified that our stimulation parameters (five pulses, 25 Hz, ∼8–10 mW power) were sufficient to promote and maintain intracranial self-stimulation in the ChR2 group compared to YFP controls (t(10) = 3.107, P = 0.011, 95% confidence interval (CI; 68.18, 414.07), Cohen’s d = 1.819) using an assay where optical stimulation was delivered contingent on nose-poking in a different context from the two-step task (Fig. 5b).
We then examined the effect on two-step task behavior of optogenetic stimulation at two different timepoints: (i) after the first-step choice, at the time of second-step cue onset, or (ii) at outcome-cue presentation (Fig. 5d,g; 25% stimulated trials, stimulation timepoint fixed for each session, counterbalanced across sessions and mice; N = 6–8 sessions per mouse and stimulation type). We again used a mixed-effects logistic regression approach (Fig. 1e), here adding stimulation and its interaction with transition and outcome as regressors. Note that if the effects of rewards are mediated by dopamine, we would expect stimulation to act like a reward, causing positive loading on the transition × stimulation regressor, or, if stimulation interacts with outcome (reward/omission), on the transition × outcome × stimulation regressor; if, however, optogenetic activation simply reinforces the previous choice, this would be evident as positive loading on the stimulation regressor (Fig. 5c).
Data from both groups (YFP and ChR2) and each stimulation type (non-stimulated trials, stimulation after first-step choice and stimulation at outcome) were included in a mixed-effects logistic regression. This revealed a significant stimulation type 1 × group effect (β = −0.053, s.e. = 0.024, z = −2.230, P = 0.026; Extended Data Fig. 8b). To explore this effect, we performed separate logistic mixed-effects regressions for each group and each stimulation type (stimulation after first-step choice and stimulation at outcome).
In the ChR2 group, stimulating dopamine neurons after the first-step choice was reinforcing; it significantly increased the probability of repeating that choice on the next trial (β = 0.091, s.e. = 0.035, z = 2.645, P = 0.008, mixed-effects logistic regression; Fig. 5f). This stimulation did not significantly interact with either the transition or the outcome in its effect on next trial choice (all P > 0.069). This is in line with the intracranial self-stimulation result (Fig. 5b) and previous reports^{5,7,8,30} showing that dopamine activation promotes repeating recent actions.
Strikingly, stimulating dopamine neurons at the time of trial outcome—where we observed large increases or decreases in dopamine following reward or omission, respectively—had no significant influence on the subsequent choice (Fig. 5i); it did not reinforce the preceding first-step choice (effect of stimulation: β = 0.041, s.e. = 0.030, z = 1.378, P = 0.168), nor act like a reward by interacting with the state transition (β = −0.011, s.e. = 0.030, z = −0.355, P = 0.723), nor modify the effect of outcomes on choices (stimulation × transition × outcome interaction: β = −0.014, s.e. = 0.030, z = −0.464, P = 0.643). No effect of stimulation was found in the YFP group for either stimulation type (all P > 0.456). To evaluate the strength of this null result in the ChR2 group, we computed a Bayes factor (B = 0.048) for whether dopamine stimulation acted like a task reward or had no effect. This indicated the manipulation result provides ‘strong evidence’ (using the classification in ref. ^{31}) against dopamine stimulation recapitulating the behavioral consequences of rewards in this task.
By contrast, while stimulation after the firststep choice had no effect on the latency to initiate the next trial (Fig. 5e, t(6) = 0.347, P = 0.740, 95% CI (−30.25, 40.25), Cohen’s d = 0.034), stimulation at outcome significantly reduced the latency to initiate the next trial (Fig. 5h, t(6) = 4.228, P = 0.0055, 95% CI (20.98, 78.6), Cohen’s d = 0.369). Again, there was no effect on this latency in the YFP group (Fig. 5e, stimulation after choice: t(4) = 0.816, P = 0.460, 95% CI (−62.7, 114.9), Cohen’s d = 0.302; Fig. 5h, stimulation at outcome time: t(4) = 1.713, P = 0.162, 95% CI (−9.87, 41.67), Cohen’s d = 0.141).
To further corroborate that the reward effects observed in the behavior are independent of dopamine, we repeated the previous experiment, but now using either a soma-targeted anion-conducting ChR2 to inhibit VTA dopamine neurons (GtACR2, N = 7) or a control fluorophore (tdTomato, N = 5; Extended Data Fig. 8a). Our stimulation parameters (1 s continuous, 5 mW) were effective at negatively reinforcing an immediately preceding action in a two-alternative forced-choice control task (Extended Data Fig. 8c). Nonetheless, inhibiting dopamine neurons during the two-step task had no effect on performance (no effect of stimulation by opsin group either in isolation or interacting with other trial events, all P > 0.283). To corroborate this null result, we again calculated the Bayes factor for the GtACR2 group at outcome-time stimulation (B = 0.062), which indicated ‘strong evidence’ against dopamine inhibition modulating the effects of outcome on subsequent choices.
Therefore, while optogenetic activation or inhibition of VTA dopamine neurons was sufficient to promote or reduce the likelihood of repeating an immediately preceding action, respectively, it completely failed to recapitulate the behavioral consequences of natural rewards at outcome, despite reward delivery and omission driving the largest dopamine signals observed in the task.
Neural network model reproduces experimental data
Our behavioral and dopamine analyses demonstrate subjects inferred the hidden state of the reward probabilities by observing where rewards were obtained, while our optogenetic manipulations indicate these belief updates (changes in estimates of the hidden state) were not caused by dopamine. This raises several questions: How do subjects learn there is a hidden state that controls the reward probability at both ports? Where and how are beliefs about the hidden state represented and updated? How does state inference interact with the prominent RPE signals we observe in dopamine activity?
One possibility is that recurrent networks in cortex learn to infer the hidden states by predicting observations, while RL mechanisms in basal ganglia learn the corresponding value and appropriate action. To test this hypothesis, we implemented a simple neural network model of cortex-basal ganglia circuits (Fig. 6).
The task was modeled as having five actions corresponding to the five nose-poke ports, and five observable states corresponding to trial events (for example, choice state or up-active), such that completing each trial required a sequence of three actions (Fig. 6a). PFC was modeled as a recurrent neural network that received at each time step an observation O_{t} (the observable state) and the preceding action A_{t−1}. PFC activity and observations provided input to a feedforward network representing basal ganglia, comprising a single layer of rectified linear units with two output layers: a scalar estimate of the current value V_{t} (that is, expected discounted long-run reward) and a vector of action probabilities that determined the next action A_{t}. The PFC network was trained using gradient descent to predict the next observation given the history of observations and actions. The basal ganglia network was trained using actor-critic RL to estimate the value and select actions given its current input. Network weights were updated gradually across training, and held constant for the simulations analyzed in Fig. 6, such that changes in network activity and behavior from trial to trial were mediated only by the changing input and recurrent activity it induced in PFC.
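The forward pass of this architecture can be sketched as follows (an illustrative sketch only: layer sizes and random weights are arbitrary, and the training updates—gradient descent on the observation-prediction loss for PFC, actor-critic RL for basal ganglia—are omitted):

```python
import numpy as np

# Architecture sketch: a recurrent 'PFC' network takes the current
# observation and previous action; a feedforward 'basal ganglia' ReLU
# layer reads PFC activity plus the observation and outputs a scalar
# value (critic) and action probabilities (actor).

rng = np.random.default_rng(0)
N_OBS, N_ACT, N_PFC, N_BG = 5, 5, 16, 32   # illustrative sizes

W_in = rng.standard_normal((N_PFC, N_OBS + N_ACT)) * 0.1
W_rec = rng.standard_normal((N_PFC, N_PFC)) * 0.1
W_bg = rng.standard_normal((N_BG, N_PFC + N_OBS)) * 0.1
w_value = rng.standard_normal(N_BG) * 0.1
W_policy = rng.standard_normal((N_ACT, N_BG)) * 0.1

def step(h, obs_onehot, prev_act_onehot):
    x = np.concatenate([obs_onehot, prev_act_onehot])
    h = np.tanh(W_in @ x + W_rec @ h)               # PFC recurrent update
    bg = np.maximum(0.0, W_bg @ np.concatenate([h, obs_onehot]))
    value = w_value @ bg                            # critic output V_t
    logits = W_policy @ bg
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # actor output over actions
    return h, value, probs

h = np.zeros(N_PFC)
obs, prev_act = np.eye(N_OBS)[0], np.eye(N_ACT)[0]
h, value, probs = step(h, obs, prev_act)
```

Because the recurrent state `h` carries information across time steps, trial-to-trial behavioral adaptation in the trained model can arise purely from activity dynamics, with all weights held fixed.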
PFC activity tracked the hidden state of the reward probabilities across trials. Notably, this was true in the choice state (Fig. 6d) even though the next observation in this state does not depend on the reward probabilities (as it is either up-active or down-active). However, to accurately predict trial outcomes, the network must carry information provided by previous outcomes forward through time in its recurrent activity, causing it to be present throughout the trial. The model’s choices tracked the high reward probability option (Fig. 6d), demonstrating that the basal ganglia network was able to use the reward probability information present in its PFC input to select appropriate actions.
Stay probability showed a strong interaction between transition and outcome; that is, the model repeated choices following rewarded common transitions and non-rewarded rare transitions. While this is the pattern expected for an agent that infers the hidden state of the reward probabilities and has a fixed mapping from this to the first-step choice^{24} (Fig. 1g), it can also be generated by model-based RL prospectively evaluating actions by predicting the states they will lead to^{24,26}. However, that is not what is happening here: the PFC network only predicts the next observation after an action has been selected, and this prediction is used only for updating PFC network weights.
The model did not exhibit the asymmetric learning from reward and omission observed in the mice. We showed in Fig. 1i,j that models which use Bayesian inference to track the reward probabilities exhibit this asymmetry if they treat rewarded outcomes as distinct observations based on where they occur (that is, top/bottom port), but non-rewarded outcomes as the same observation irrespective of where they occur. We therefore asked if this mechanism could generate such asymmetry in the network model. We modified the input provided to the PFC network such that on each time step PFC received the observation gated by whether reward was received, such that on non-rewarded time steps the input was a zero vector (Fig. 6e). PFC activity and choices still tracked the task reward probabilities (Fig. 6g), but now the stay probabilities recapitulated the asymmetry between rewarded and non-rewarded outcomes seen in the mice (Fig. 6f). As with the explicitly Bayesian models (Fig. 3c), the network model reproduced the non-local value updates we observed for the dopamine signal (Fig. 6h): that is, reward in one second-step state increased the value of that state (two-sided t-test, t(11) = 9.28, P < 0.001) but also decreased the value of the other second-step state (two-sided t-test, t(11) = −3.23, P = 0.008).
Finally, we asked how optogenetic stimulation of dopamine on individual trials would affect the model’s choice behavior, assuming it acted by updating weights in the basal ganglia network as if it were a positive RPE (Fig. 6i). Stimulation following first-step choice increased the probability of repeating the choice on the next trial (two-sided t-test, t(11) = 7.41, P < 0.001) as observed experimentally. Crucially, stimulation at trial outcome had no effect on next trial choice (two-sided t-test, all t(11) < 0.30, P > 0.77), again recapitulating the data. This is because an RPE following an action updates the network weights to increase the probability of selecting the same action in the same state in future. Therefore, stimulation following a choice affects next trial choice, but stimulation following, for example, an up-poke in the up-active state has no effect on choosing left versus right in the next trial’s choice state.
In sum, the network model recapitulates our key experimental findings that both behavior and the value information that drives dopaminergic RPEs are consistent with state inference, but dopamine does not mediate the effect of rewards on subsequent choices.
Discussion
By recording and manipulating dopamine activity in a two-step decision task, we obtained results that support an integrated framework for understanding reward-guided decision-making. Rewards did not simply reinforce preceding choices, but rather promoted choosing the action that commonly led to the state where the reward was obtained, consistent with previous work with similar tasks^{25,32}. Dopamine carried rich information about value, action and recent reward history, responding strongly and transiently to both rewards and states that predicted reward. The influence of state values on mesolimbic dopamine exhibited a key signature of an RPE^{27,28,33}: a positive response when the state was entered followed by a negative response at trial outcome. Additionally, rewards obtained in one second-step state negatively influenced the dopamine response upon reaching the other second-step state on the subsequent trial, consistent with an inferred value update. Strikingly, however, neither optical activation nor inhibition of dopamine neurons at the time of trial outcome—when dopamine responses were maximal—influenced next trial choice, despite positive controls in the same subjects verifying the manipulations were effective.
These findings are not consistent with value updates driven by dopaminergic RPEs changing subjects’ choice preference from trial to trial. The observed non-local value updates are consistent with the animals understanding that a single hidden variable controls both reward probabilities. However, if animals solve the task by state inference, how do they learn to track the hidden state, and what function do the observed dopaminergic RPEs serve^{34}?
Our computational model suggests a possible answer. A recurrent network representing frontal cortex learned to track the hidden state of the reward probabilities by predicting observations. A feedforward network representing basal ganglia learned values and appropriate actions (‘policies’) over the observable and inferred state features using RL, generating choices that closely resembled those of the subjects. Crucially, the short-timescale effect of rewards on subsequent choices was driven by changes in recurrent activity in the PFC, not synaptic weight changes in either network. Consistent with this, two recent studies found that medial frontal and retrosplenial cortex activity tracks the reward probabilities during probabilistic reversal learning, not only at decision or outcome time, but also throughout the intertrial interval^{35,36}. This is necessary if recurrent activity is responsible for carrying forward information about the recent reward history to guide choices. This simple model also reproduced our key findings of non-local value updates, and the sensitivity of choices to optogenetic stimulation at different timepoints.
This two-process account of reward-guided decision-making can help reconcile the burgeoning evidence for state inference regulating both behavior^{10,12,13,14,15,16} and neural activity in cortex^{13,37,38,39,40,41,42} and the dopamine system^{14,19,20,21,43,44}, with the longstanding literature supporting dopamine activity resembling and acting like an RPE signal^{1,2,3,4}. It also helps explain previous findings that although stimulating/inhibiting dopamine following an action can bidirectionally modulate the probability of repeating the action in future^{5,6,7,8}, inhibiting dopamine at outcome time can fail to block the effect of rewards on subsequent choices^{5,6}, and pharmacological manipulation of dopamine signaling in reward-guided decision tasks often has limited or no effect on learning^{10,45,46,47}.
Our network model is related to a recent proposal that frontal cortex acts as a meta-RL system^{48}. In both models, synaptic plasticity acts on a slow timescale over task acquisition to sculpt recurrent network dynamics that generate adaptive behavior on a fast timescale. Unlike this previous work, our model differentiates between cortex and basal ganglia, both with respect to network architecture (recurrent versus feedforward) and type of learning (unsupervised versus reinforcement). This builds on longstanding ideas that the cortex implements a hierarchical predictive model of its sensory inputs^{49,50}, while the basal ganglia implement temporal-difference RL^{2,51}. It is also motivated by work in machine learning in which Markovian state representations (which integrate the observation history to track hidden states) are learned by predicting observations^{52,53,54,55}, enabling RL to solve tasks where the current observation is insufficient to determine the correct action. There are also commonalities with connectionist models of learning phenomena where one stimulus changes the meaning of another, including occasion setting, configural and contextual learning^{56,57}. These use hidden units between an input and output to allow modulatory interactions between stimuli^{58,59}, just as in our model the hidden units in basal ganglia allow for nonlinear combination of the current observation and PFC activity to determine value and action.
Recent work has questioned the relative influence of value and RPE on dopamine activity and striatal concentrations^{4,5,29}. We observed the biphasic influences of the values of first-step actions and second-step states, a key signature of RPE, in VTA, NAc and DMS calcium activity (Fig. 3 and Extended Data Figs. 4 and 6), and NAc and DMS dopamine concentrations (Extended Data Figs. 4 and 7). This biphasic pattern was most prominent in the NAc. Intriguingly, when evaluating only the influence of the most recent outcome on second-step state value, rather than the extended history, the biphasic pattern was prominent in the NAc but absent in the VTA (Fig. 3e). Differences between the VTA and striatal signals could reflect local modulation of terminals, for example, by cholinergic neurons^{60}, or alternatively the VTA signal may be influenced by calcium activity in dendrites that is partially decoupled from spiking due to somatic inhibition.
In parallel, we observed two other important modulators associated with dopamine function—recent reward rate and movement—both of which accounted for separate variance from the value of trial events. Reward rate had positive sustained effects across the trial from before initiation to after outcome, unlike the influence of state and action values, which were tightly time-locked to the corresponding behavioral event. This appears broadly consistent with theoretical proposals that tonic dopamine represents average reward rate, which acts as the opportunity cost of time in average reward RL^{61}. Reward rate signaling may mediate the effect of dopamine manipulations on motivation and task engagement observed in other studies^{46}. The correlation between reward rate and NAc dopamine concentrations is consistent with recent reports^{5,29}, but that with VTA calcium is unexpected given previous reports of no correlation with VTA spikes^{29}.
By contrast, the influence of movement was transient, lateralized and exhibited distinct dynamics in the DMS and the VTA/NAc. Specifically, DMS, but not NAc or VTA, dopamine was selectively influenced at the time of the initial choice, with increased activity in the hemisphere contralateral to the movement direction, consistent with previous studies^{6}. Intriguingly, at the time of the second movement from the lateralized choice port to the reward port, significant modulations were instead observed in the NAc (with a similar pattern in the VTA), but not in the DMS, again with increased dopamine activity contralateral to the movement direction. This suggests that an interplay of dopamine dynamics across the striatum might shape movement direction as animals proceed through a sequence to reward.
Conclusion
Our findings emphasize that flexible behavior involves two processes operating in parallel over different timescales: inference about the current state of the world, and evaluation of those states and actions taken in them. The involvement of dopamine in updating values has rightly been a major focus of accounts of flexible decision-making. However, in the structured environments common both in the laboratory and the real world, it is only half the picture. Our data show that during reward-guided decision-making by experienced subjects, the effect of rewards on choices is due to the information they provide about the state of the world, not the dopaminergic RPEs they generate.
Methods
Subjects
All procedures were performed in line with the UK Animals (Scientific Procedures) Act 1986 and in accordance with the University of Oxford animal use guidelines. They were approved by the local ethical review panel at the Department of Experimental Psychology, University of Oxford, and performed under UK Home Office Project Licence P6F11BC25. Twelve DAT-Cre heterozygous mice (DAT-Cre^{+/−}, 7 females and 5 males) were used for the GCaMP photometry recordings, 6 wild-type C57BL/6 mice (DAT-Cre^{−/−}, 3 females and 3 males) for the dLight recordings, 12 DAT-Cre mice (DAT-Cre^{+/−}, YFP: 2 females and 3 males; ChR2: 4 females and 3 males) for the optogenetic activation experiment, and 12 DAT-Cre mice (DAT-Cre^{+/−}, tdTomato: 3 females and 2 males; GtACR2: 4 females and 3 males) for the optogenetic inhibition experiment. All animals were bred by crossing DAT-Cre male with C57BL/6 female mice (Charles River, UK). Mice were aged 8–16 weeks at the start of behavioral training. Animals were typically housed in groups of 2–4 throughout training and testing. Temperature was kept at 21 ± 2 °C under 55% ± 10% humidity on a 12-h light–dark cycle. Animals were tested during the light phase.
Behavioral setup
The task was run in custom-built 12 × 12 cm operant boxes (design files at https://github.com/pyControl/hardware/tree/master/Behaviour_box_small/) controlled using pyControl^{62}. Five nose-poke ports were located on the back wall of the boxes—a central initiation port flanked by two choice ports 4 cm to the left and right, and two second-step ports located 1.6 cm above and below the central poke. The second-step ports each had a solenoid to deliver water rewards. A speaker located above the ports was used to deliver auditory stimuli. Video data were acquired from an FLIR Chameleon 3 camera positioned above each setup using a Bonsai-based workflow (https://github.com/ThomasAkam/Point_Grey_Bonsai_multi_camera_acquisition/)^{63}.
Behavioral task and training
The behavioral task was adapted from the human two-step task^{26}. Each trial started with the central initiation port lighting up. Subjects initiated the trial by poking the illuminated port, which caused the choice ports to illuminate. On free-choice trials (75% of trials), both the left and right ports lit up, allowing subjects to choose, while on forced-choice trials only one randomly selected choice port lit up, forcing animals to select that specific port. Poking a choice port was followed, after a 200-ms delay, by the second-step port lighting up and 1-s presentation of one of two auditory cues (the ‘second-step cue’, either a 5-kHz or a 12-kHz tone depending on whether the top or bottom second-step port became active, counterbalanced across animals). Each choice port was commonly (80% of trials) associated with transitioning to one second-step state (up or down) and rarely (20% of trials) to the other. The transition structure was fixed for each animal across all sessions and was counterbalanced across animals; that is, the task had two possible transition structures: transition type A, in which a left choice commonly led to the up second-step port and a right choice commonly led to the down second-step port, and the opposite for transition type B. The second-step port only became responsive to pokes after cue offset. Poking the second-step port triggered a 200-ms delay, after which a 500-ms auditory cue signaled whether the trial was rewarded or not (the same 5-kHz or 12-kHz tone as the second-step cue, counterbalanced across animals, with pulses delivered at 10 Hz on rewarded trials, and white noise on unrewarded trials). Reward was delivered at the offset of this cue. To ensure mice knew when they had made a nose poke, a click sound was presented whenever the subject poked a port that was active (for example, the initiation port during the initiation state).
Reward probabilities for the two second-step ports changed in blocks. In non-neutral blocks, one second-step port had an 80% reward probability and the other a 20% probability, while in neutral blocks both second-step ports were rewarded with 50% probability. Block transitions from non-neutral blocks were triggered 5 to 15 trials after mice crossed a threshold of 75% ‘correct’ choices (that is, choosing the port with the higher reward probability), computed as an exponential moving average with a time constant of 8 free-choice trials. Transitions from neutral blocks were triggered after 20–30 trials. An inter-trial interval of 2–4 s in duration started once the subject had remained out of the second-step port for 250 ms after the trial outcome.
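The moving-average threshold that arms block transitions can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not pyControl's.

```python
def update_ema(ema: float, correct: bool, tau: float = 8.0) -> float:
    """Exponential moving average of 'correct' free choices, time constant tau trials."""
    alpha = 1.0 / tau
    return ema + alpha * (float(correct) - ema)

# Simulate a run of correct choices from a 50/50 starting point, and count
# how many trials it takes to cross the 75% threshold that arms a transition.
ema = 0.5
trials_to_threshold = 0
while ema <= 0.75:
    ema = update_ema(ema, correct=True)
    trials_to_threshold += 1
```

After crossing the threshold, the actual block transition would then be scheduled 5–15 trials later, per the text above.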
Training
Behavioral training took 4–6 weeks. Animals were put on water restriction 48 h before starting training, and received 1 h of ad-lib water access in their home cage 24 h before starting training. On training days (typically 6 d per week) animals usually received all their water from the task, but were topped up outside the task as necessary to maintain a body weight of >85% of their pre-restriction baseline. On days off from training, mice received 1 h of ad-lib water access in their home cage. The water reward size was decreased from 15 µl to 4 µl across training to increase the number of trials performed and blocks experienced in each session.
Training consisted of multiple stages of increasing complexity (Supplementary Table 4). At the first training stage (1.1), only the second-step ports were exposed, with all other ports covered. Second-step ports were illuminated in a pseudorandom order with an inter-trial interval of 2–4 s. Poking an illuminated port delivered reward with 100% probability, with no auditory cues. When animals obtained >50 rewards at this stage, they transitioned to stage 1.2, where the auditory cues for second-step state and reward were introduced. Once animals obtained >70 rewards in a session, they were switched to stage 2 on the next session. At stage 2, the choice ports were introduced, but all trials were forced choice, such that only one choice port lit up on each trial. Mice were switched to stage 3 when they obtained >70 rewards in a single session. At stage 3, the initiation poke was introduced, and when mice obtained >70 rewards in a single session, they transitioned to stage 4 on the next training session. Stage 4 consisted of multiple substages in which progressively more free-choice trials were introduced, and the reward probabilities gradually changed until reaching the final task parameters. Mice were transitioned to the next substage after two training sessions of 45 min or a single 90-min session, until they reached substage 4.6. Subjects were only transitioned to the final stage (full task) when they completed at least 5 blocks in a single session.
Surgery
Mice were anesthetized with isoflurane (3% induction, 0.5–1% maintenance), and injected with buprenorphine (0.08 mg per kg body weight), meloxicam (5 mg per kg body weight) and glucose saline (0.5 ml). Marcaine (maximum of 2 mg per kg body weight) was injected into the scalp before placing mice into the stereotaxic frame. Mice were maintained at ∼37 °C using a rectal probe and heating blanket (Harvard Apparatus). Surgery proceeded as described below for the different experiments. Mice were given additional doses of meloxicam each day for 3 d after surgery, and were monitored carefully for 7 d after surgery.
GCaMP photometry
Mice were intracranially injected with 1 µl of saline containing a 1:10 dilution of AAV1.Syn.Flex.GCaMP6f.WPRE.SV40 (titer of 6.22 × 10^{12} viral genomes per ml (vg/ml), Penn Vector Core) and a 1:20 dilution of AAV1.CAG.Flex.tdTomato.WPRE.bGH (AllenInstitute864; titer of 1.535 × 10^{12} vg/ml, Penn Vector Core) at 2 nl s^{−1} in the VTA (anteroposterior (AP): −3.3, mediolateral (ML): ±0.4, dorsoventral (DV): −4.3 from bregma) in one hemisphere for mesolimbic dopamine recordings in the VTA and NAc, and in the VTA/substantia nigra pars compacta (AP: −3.1, ML: ±0.9, DV: −4.2 from bregma) in the other hemisphere for DMS recordings. Three 200-µm-diameter ceramic optical fibers were implanted chronically in each animal: in the VTA (AP: −3.3, ML: ±0.4, DV: −4.3 from bregma) and the NAc (AP: +1.4, ML: ±0.8, DV: −4.1 from bregma) in the same hemisphere, and the DMS (AP: +0.5, ML: ±1.5–1.7, DV: −2.6 from bregma) in the contralateral hemisphere.
dLight photometry
Mice were intracranially injected with 500 nl of saline containing a 1:5 dilution of pAAV5CAGdLight1.1 (titer of 1.4 × 10^{12} vg/ml; Addgene) and a 1:5 dilution of pssAAV2/5hSyn1chItdTomatoWPRESV40p(A) (titer of 4.9 × 10^{11} vg/ml; ETH Zurich) at 2 nl s^{−1} in the NAc (AP: +1.4, ML: ±0.8, DV: −4.1 from bregma) and DMS (AP: +0.5, ML: ±1.5/1.7, DV: −2.6 from bregma) in opposite hemispheres. Two 200-µm-diameter ceramic optical fibers were implanted chronically at the injection sites.
Optogenetic manipulation
For optical activation experiments, mice were injected bilaterally with 500 nl per hemisphere of saline containing either AAV2EF1aDIOEYFP (titer of >1 × 10^{12} vg/ml, UNC Vector Core; YFP group) or rAAV2/Ef1aDIOhchR2(E123t/T159C)EYFP (titer of 5.2 × 10^{12} vg/ml, UNC Vector Core; ChR2 group) at 2 nl s^{−1} in the VTA (AP: −3.3, ML: ±0.4, DV: −4.3 from bregma). For optical inhibition experiments, mice were injected bilaterally with 500 nl per hemisphere of saline containing a 1:10 dilution of either ssAAV1/2CAGdloxtdTomato(rev)dloxWPREbGHp(A) (titer of 7.9 × 10^{12} vg/ml; ETH Zurich; tdTomato group) or AAV1hSyn1SIOstGtACR2FusionRed (titer of 1.9 × 10^{13} vg/ml; Addgene; GtACR2 group) at 2 nl s^{−1} in the VTA (AP: −3.3, ML: ±0.4, DV: −4.3 from bregma). For both sets of experiments, two 200-µm-diameter ceramic optical fibers were implanted chronically targeting the injection sites at a 10° angle.
Histology
Mice were terminally anesthetized with sodium pentobarbital and transcardially perfused with saline and then 4% paraformaldehyde solution. Then, 50-µm coronal brain slices were cut, covering the striatum and VTA, and immunostained with anti-GFP and anti-TH primary antibodies, and Alexa Fluor 488 and Cy5 secondary antibodies. For the animals used in the optogenetic inhibition experiment, only the anti-TH primary and Cy5 secondary antibodies were used. An Olympus FV3000 microscope was used to image the slices.
Photometry recordings
Dopamine calcium activity (GCaMP) and release (dLight) were recorded at a sampling rate of 130 Hz using pyPhotometry^{64}. The optical system comprised a 465-nm and a 560-nm LED, a five-port mini-cube and fiber-optic rotary joint (Doric Lenses) and two Newport 2151 photoreceivers. Time-division illumination with background subtraction was used to prevent crosstalk between fluorophores due to the overlap of their emission spectra, and to prevent changes in ambient light from affecting the signal. Synchronization pulses from pyControl onto a digital input of the pyPhotometry board were used to synchronize the photometry signal with behavior^{64}.
Photometry signals were preprocessed using custom Python code. A median filter (width of five samples) was first used to remove any spikes due to electrical noise picked up by the photodetectors. A 5-Hz low-pass filter was then used to denoise the signal. To motion-correct the signals, we band-passed the denoised signals between 0.001 Hz and 5 Hz, and used linear regression to predict the GCaMP or dLight signal from the control fluorophore (tdTomato) signal. The predicted signal due to motion was subtracted from the denoised signal. To correct for bleaching of the fiber and fluorophores, signals were detrended using a double-exponential fit to capture the temporal dynamics of bleaching: a first fast decay and a second slower one (Extended Data Fig. 10). Finally, the preprocessed dopamine signal was z-scored for each session to allow comparison across sessions and animals with different signal intensities.
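A minimal sketch of this pipeline, assuming NumPy; the names are ours, a simple first-order filter stands in for the 5-Hz low-pass, and the band-pass and double-exponential detrending steps are omitted for brevity.

```python
import numpy as np

fs = 130  # sampling rate (Hz)

def median_filter(x, k=5):
    """Width-k running median (edges padded by reflection) to remove noise spikes."""
    pad = k // 2
    xp = np.pad(x, pad, mode="reflect")
    return np.array([np.median(xp[i:i + k]) for i in range(len(x))])

def lowpass(x, cutoff=5.0):
    """First-order exponential low-pass filter (stand-in for the 5-Hz filter)."""
    alpha = 1.0 - np.exp(-2 * np.pi * cutoff / fs)
    y = np.empty_like(x)
    acc = x[0]
    for i, v in enumerate(x):
        acc = acc + alpha * (v - acc)
        y[i] = acc
    return y

def preprocess(signal, control):
    sig = lowpass(median_filter(signal))
    ctl = lowpass(median_filter(control))
    # Motion correction: regress signal on the control fluorophore, subtract fit.
    slope, intercept = np.polyfit(ctl, sig, 1)
    sig = sig - (slope * ctl + intercept)
    # z-score per session.
    return (sig - sig.mean()) / sig.std()

# Synthetic demo: a signal contaminated by a shared motion artifact.
rng = np.random.default_rng(0)
motion = rng.standard_normal(650)
z = preprocess(2.0 * motion + 0.3 * rng.standard_normal(650), motion)
```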
For GCaMP, we recorded data from 12, 11 and 12 mice in the VTA, NAc and DMS, respectively (in one animal, the fiber targeting the NAc did not exhibit any GCaMP modulation, which later histological analysis confirmed was caused by the fiber being in the anterior commissure). In the DMS, two animals—the two with the most medial fiber coordinates—were excluded from the main analyses of the effects of reward on subsequent dopamine activity, as they presented a negative modulation to reward (Extended Data Fig. 3).
For dLight, we recorded data from 5 and 6 mice in the NAc and DMS, respectively (one mouse in the NAc did not present any dLight modulation; subsequent histological analysis confirmed that the fiber was misplaced into the ventricle).
Sessions in which there were large artifacts (large step changes in the recorded signals), introduced through a malfunctioning rotary joint or disconnection of the patch cord from the fiber, or in which there was a complete loss of signal on one of the channels due to a discharged battery during recording, were excluded. A total of 46 sessions (∼9% of the total) were removed from the analyses.
Optogenetic activation
VTA dopamine neurons were stimulated bilaterally using two 465-nm LEDs (Plexon PlexBright) connected to 200-µm, 0.66-NA optical fiber patch cords. All stimulation experiments used five pulses at 25 Hz, 25% duty cycle and ∼8–10 mW optical power at the fiber tip.
First, as a positive control, we performed an intracranial self-stimulation assay in which mice were presented with either four or two nose-poke ports, one of which triggered optical stimulation when poked. A minimum 1-s delay was imposed between stimulations. Mice were tested on intracranial self-stimulation in 40–60-min sessions for 4 d.
We then tested the effect of optogenetic activation during the two-step task. In each stimulation session, we stimulated either (i) 200 ms after the first-step choice, at the time of second-step cue onset, or (ii) at outcome-cue onset. Stimulation occurred on 25% of trials, under the constraints that (i) the trial after stimulation was always a free-choice trial, and (ii) there were always at least two non-stimulated trials after each stimulation. The stimulation sessions were interspersed with baseline no-stimulation sessions (data not shown). The timing of stimulation was fixed within a session, with the session order counterbalanced across animals (for example, no-stimulation session → second-step cue stimulation session → outcome stimulation session → no-stimulation session → outcome stimulation session → second-step cue stimulation session).
Optogenetic inhibition
VTA dopamine neurons were inhibited bilaterally using two 465-nm LEDs (Plexon PlexBright) connected to 200-µm, 0.66-NA optical fiber patch cords. All inhibition experiments used 1 s of continuous light at ∼5 mW optical power at the fiber tip.
We first tested the effect of optogenetic inhibition in the two-step task. As in the stimulation experiment, in each inhibition session we inhibited either (i) 200 ms after the first-step choice, at the time of second-step cue onset, or (ii) at outcome-cue onset; baseline sessions with no stimulation were interspersed with the stimulation sessions (data not shown). Inhibition occurred on 25% of trials under the same constraints as in the optogenetic activation experiment (see above).
As a positive control, we then performed a two-alternative forced-choice bias assay. Mice were presented with three ports: a central initiation port, and left and right choice ports. Mice initiated each trial by poking the illuminated center port. This triggered either both the left and right ports to light up (free-choice trials, 50% of the total) or just one choice port to illuminate. Poking an illuminated choice port led, after a 200-ms delay, to a 500-ms presentation of an outcome cue (a 5-kHz or 12-kHz tone—left or right frequency counterbalanced across animals—pulsed at 10 Hz on rewarded trials, white noise on unrewarded trials), after which, on rewarded trials, reward was delivered. The reward probability associated with choosing either the left or right port was fixed at 50% throughout. After 3 d of training without optical stimulation, and once animals showed a consistent bias toward one of the choice ports for at least two consecutive days (termed the animal's preferred choice), stimulation sessions commenced. In these, 1 s of continuous light was delivered coincident with the outcome cue on any trial in which the preferred choice port was selected (on both free-choice and forced-choice trials). After 4 d, the light stimulation was switched to be paired with selection of the other choice port for four more days. Each day, animals were tested on a single 60-min session.
Analysis
All behavioral and photometry analysis was performed using custom Python and R code.
Behavioral logistic regression model
The logistic regression analysis shown in Fig. 1e predicted repeating choices (or ‘staying’) as a function of the previous trial's events, considering only free-choice trials, implemented as a mixed-effects model using the afex package^{65} in the R programming language. The model formula was:
For the analysis of the two-step optogenetic manipulation (Fig. 5f,i and Extended Data Fig. 8), we added stimulation (stim) and group, and their interactions with trial events, as additional predictors, giving the formula:
We used orthogonal sum-to-zero contrasts, and likelihood-ratio tests to calculate P values. The maximal random-effects structure^{66,67} with subject as a grouping factor was used.
The predictors were coded as:

Correct: Three-level categorical variable indicating whether the previous choice was correct, incorrect or made in a neutral block.

Bias: Binary categorical variable indicating whether the previous choice was left or right.

Outcome: Binary categorical variable indicating whether the previous trial was rewarded or not.

Transition: Binary categorical variable indicating whether the previous trial had a common or a rare transition.

Stimulation: Three-level categorical variable indicating whether the previous trial was non-stimulated, stimulated after the choice or stimulated at outcome time.

Group: Binary categorical variable indicating whether the data are from the control or manipulation subject. YFP and ChR2 were used for the stimulation experiment, and tdTomato and GtACR2 were used for the inhibition experiment.
For the optogenetic manipulation analysis, we performed a single regression for the activation experiment including both experimental groups (YFP and ChR2) and both stimulation times (after choice and at outcome). We did the same for the inhibition experiment (tdTomato and GtACR2). To ensure stimulated and non-stimulated trials had matching histories, we only included trials where stimulation could potentially have been delivered, that is, excluding the two trials following each stimulation, where stimulation was never delivered. As a follow-up analysis, we performed separate regressions per group (YFP, ChR2, tdTomato and GtACR2) and stimulation time (after choice and at outcome).
To test the strength of evidence for our null results, we calculated Bayes factors using R as:
We defined the data likelihood as a normal distribution with the mean and standard deviation of the transition × stimulation regression coefficient. For the optogenetic activation experiment, the alternative hypothesis, H_{1}, was that dopamine activation acted like a natural reward, defined as a uniform distribution between 0 and the transition × outcome regression coefficient. For the optogenetic inhibition experiment, H_{1} was that dopamine inhibition reduced the effects of natural rewards, defined as a uniform distribution between 0 and minus the transition × outcome regression coefficient. Finally, the null hypothesis, H_{0}, was set to 0. We used the classification in ref. ^{31} to assess the strength of evidence for the alternative or null hypothesis.
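This calculation can be sketched as follows: a Python analogue of the R computation, with our own names; the normal data likelihood and uniform alternative hypothesis follow the text above.

```python
import math

def normal_pdf(x, mu, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bayes_factor(beta, se, h1_lo, h1_hi, n=10000):
    """BF10: uniform H1 on [h1_lo, h1_hi] versus point H0 at 0.

    beta, se: mean and standard deviation of the regression coefficient,
    treated as a normal data likelihood.
    """
    # Marginal likelihood under H1: average the likelihood over the uniform
    # prior on the effect size (midpoint rule).
    step = (h1_hi - h1_lo) / n
    like_h1 = sum(normal_pdf(beta, h1_lo + (i + 0.5) * step, se)
                  for i in range(n)) / n
    like_h0 = normal_pdf(beta, 0.0, se)
    return like_h1 / like_h0

# Example: a coefficient of exactly 0 (se = 1) against a wide positive H1
# yields a BF well below 1, favoring the null.
bf = bayes_factor(0.0, 1.0, 0.0, 10.0)
```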
The lagged logistic regression analysis (Fig. 1f,h) assessed how subjects’ choices were affected by the history of trial events over the last 12 trials. The regression predicted subjects’ probability of choosing left, using the following predictors at lags 1, 2, 3–4, 5–8, 9–12 (where lag 3–4, for example, means the sum of the individual trial predictors over the specified range of lags).

Common transition: rewarded at lag n: +0.5/−0.5 if the nth previous trial was a left/right choice followed by a common transition and reward, and 0 otherwise.

Rare transition: rewarded at lag n: +0.5/−0.5 if the nth previous trial was a left/right choice followed by a rare transition and reward, and 0 otherwise.

Common transition: unrewarded at lag n: +0.5/−0.5 if the nth previous trial was a left/right choice followed by a common transition and no reward, and 0 otherwise.

Rare transition: unrewarded at lag n: +0.5/−0.5 if the nth previous trial was a left/right choice followed by a rare transition and no reward, and 0 otherwise.
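The predictor coding listed above can be sketched as follows (the function and key names are ours):

```python
def lagged_predictor(choice, transition, reward):
    """Return the four +/-0.5 predictors for one past trial.

    choice: 'left' or 'right'; transition: 'common' or 'rare'; reward: bool.
    Exactly one predictor is set to +0.5 (left choice) or -0.5 (right choice).
    """
    sign = 0.5 if choice == 'left' else -0.5
    preds = {'common_rewarded': 0.0, 'rare_rewarded': 0.0,
             'common_unrewarded': 0.0, 'rare_unrewarded': 0.0}
    key = f"{transition}_{'rewarded' if reward else 'unrewarded'}"
    preds[key] = sign
    return preds

# Example: a rewarded left choice with a common transition.
p = lagged_predictor('left', 'common', True)
```

For lag ranges such as 3–4, these per-trial predictors would then be summed over the trials in the range, per the text above.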
The lagged regression was fitted separately for each subject as a fixed-effects model. The cross-subject mean and s.e. for each predictor coefficient were plotted. Significance of coefficients was assessed using a two-sided t-test comparing the distribution of the individual subjects' coefficients against zero, with Bonferroni multiple-comparison correction.
Single-strategy models
We evaluated the goodness of fit to subjects’ choices for a set of different RL agents created by combining one or more of the following learning strategies.
Model-free
The model-free strategy updated the value Q_{MF}(c) of the chosen action and the value V(s) of the second-step state as:
Where α is the learning rate, λ is the eligibility trace parameter and r is the outcome.
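This update can be sketched as follows, assuming the standard two-step formulation of a model-free update with an eligibility trace; the paper's exact equations are elided above, so treat the details as our reconstruction.

```python
def model_free_update(Q_mf, V, choice, second_step, reward, alpha, lam):
    """Update chosen-action value Q_mf[choice] and state value V[second_step]."""
    delta1 = V[second_step] - Q_mf[choice]  # first-step prediction error
    delta2 = reward - V[second_step]        # outcome prediction error
    # Chosen action moves toward the second-step state value, plus a
    # lambda-weighted trace of the outcome prediction error.
    Q_mf[choice] += alpha * delta1 + alpha * lam * delta2
    # Second-step state value moves toward the outcome.
    V[second_step] += alpha * delta2
    return Q_mf, V

# Example: a rewarded left choice that reached the up state.
Q_mf = {'left': 0.5, 'right': 0.5}
V = {'up': 0.5, 'down': 0.5}
Q_mf, V = model_free_update(Q_mf, V, 'left', 'up', 1.0, alpha=0.5, lam=1.0)
```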
A variation of this model was the asymmetric model-free strategy, which included different learning rates for positive and negative outcomes, and included forgetting of the non-experienced states, so that both the non-experienced action and second-step state decayed toward a neutral value (0.5).
Model-based
The model-based strategy updated the value V(s) of the second-step state reached, and both first-step action values Q_{MB}(a), as:
Where α is the learning rate, r is the outcome, and P(s|a) is the transition probability of reaching second-step state s after taking action a (that is, 0.8 or 0.2).
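A sketch of this update, assuming transition type A (left commonly leads to the up state); names and the dict representation of P(s|a) are ours.

```python
# Fixed transition model P(s|a): probability of each second-step state
# given the first-step action (transition type A assumed).
P_trans = {'left':  {'up': 0.8, 'down': 0.2},
           'right': {'up': 0.2, 'down': 0.8}}

def model_based_update(V, second_step, reward, alpha):
    """Learn second-step state values; derive first-step values from them."""
    # Update the value of the second-step state reached this trial.
    V[second_step] += alpha * (reward - V[second_step])
    # First-step action values are the transition-weighted state values.
    Q_mb = {a: sum(P_trans[a][s] * V[s] for s in V) for a in P_trans}
    return V, Q_mb

# Example: a rewarded trial that reached the up state.
V = {'up': 0.5, 'down': 0.5}
V, Q_mb = model_based_update(V, 'up', 1.0, alpha=0.5)
```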
A variation of this model was the asymmetric model-based strategy, which included different learning rates for positive and negative outcomes, and included forgetting of the non-experienced second-step state, which decayed toward a neutral value (0.5). In addition, we tested another model in which forgetting decayed values to 0 rather than to the neutral value (asymmetric model-based + forget to 0).
Bayesian inference
The Bayesian inference strategy treated the task as having a binary hidden state h ∈ {up_good, down_good}, which determined the reward probabilities given the second-step state reached on the trial as:
P(r | s, up_good):

  Second-step state s    Rewarded    Unrewarded
  Up                     0.8         0.2
  Down                   0.2         0.8

P(r | s, down_good):

  Second-step state s    Rewarded    Unrewarded
  Up                     0.2         0.8
  Down                   0.8         0.2
The strategy maintained an estimate P(up_good), tracking the probability that the task was in the up_good hidden state, updated following each trial's outcome using Bayes' rule as:
Where
P(up_good) was also updated on each trial based on the probability that a reversal had occurred, as:
Where P(reversal) is the probability that a block reversal occurred.
The values of second-step states and first-step actions were determined by P(up_good) as:
Note that although the equation relating first-step action values Q(a) to second-step state values V(s) is the same for the inference and model-based RL strategies, the mechanistic interpretation is different: for the inference strategy, the action values are assumed to have been learned gradually over task acquisition using temporal-difference RL operating over a state representation combining the observable state s and the belief state P(up_good). This learning process was modeled explicitly in the network model (Fig. 6), which generated choice behavior similar to that of the inference model (Fig. 1). For the model-based RL strategy, the first-step action values are assumed to be computed online by predicting the states the actions will lead to.
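The inference strategy's belief update and value computation can be sketched as follows. This is our reconstruction of the elided equations, assuming transition type A and the 0.8/0.2 probabilities above; names are ours.

```python
def bayes_update(p_up_good, second_step, reward, p_reversal):
    """One trial's Bayesian update of the belief that the task is in up_good."""
    # Likelihood of the observed outcome in the state reached, under each
    # hidden state (0.8/0.2 reward probabilities from the tables above).
    p_r_up = 0.8 if second_step == 'up' else 0.2   # P(outcome | up_good)
    p_r_down = 1.0 - p_r_up                        # P(outcome | down_good)
    if not reward:
        p_r_up, p_r_down = 1.0 - p_r_up, 1.0 - p_r_down
    # Bayes' rule.
    post = (p_r_up * p_up_good
            / (p_r_up * p_up_good + p_r_down * (1.0 - p_up_good)))
    # Account for a possible block reversal between trials.
    return post * (1.0 - p_reversal) + (1.0 - post) * p_reversal

def values(p_up_good):
    """State values from the belief; action values via the transition model."""
    V = {'up': 0.8 * p_up_good + 0.2 * (1.0 - p_up_good),
         'down': 0.2 * p_up_good + 0.8 * (1.0 - p_up_good)}
    Q = {'left': 0.8 * V['up'] + 0.2 * V['down'],   # transition type A assumed
         'right': 0.2 * V['up'] + 0.8 * V['down']}
    return V, Q

# Example: a reward in the up state moves an uncertain belief toward up_good.
p_post = bayes_update(0.5, 'up', True, p_reversal=0.0)
V, Q = values(1.0)
```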
We also used a variant of the inference strategy designed to capture the asymmetric influence of reward and omission on subjects' choices. This treated rewards in the two second-step states as different observations, but treated reward omission at the up and down second-step states as the same observation; it was implemented as a generative model specifying the joint probability of reward and second-step state conditioned on the hidden state:
p(r, s | up_good):

  Second-step state s    Rewarded    Unrewarded
  Up                     0.4         0.5
  Down                   0.1

p(r, s | down_good):

  Second-step state s    Rewarded    Unrewarded
  Up                     0.1         0.5
  Down                   0.4

(Because reward omission is treated as the same observation in both second-step states, the unrewarded probability of 0.5 spans both rows, and each table sums to 1.)
The corresponding Bayesian update to P(up_good) given each trial’s outcome was given by:
Where
State and action values (V_{inf}(s) and Q_{inf}(a)) for the asymmetric inference strategy were computed as for the standard inference strategy.
Combined action values
A set of different candidate models was created by combining one or more of the above strategies in a weighted sum, with (optionally) bias and perseveration parameters, to give net action values:
Where w_{i} is the weight assigned to strategy i, whose first-step action values are given by Q_{i}(a), and K(a) is the modifier to the value of first-step action a due to any bias or perseveration terms included in the model. In models that included bias, this increased the value of the left action on all trials by an amount determined by a bias strength parameter. In models that included perseveration, this increased the value of the first-step action chosen on the previous trial by an amount determined by a perseveration strength parameter.
The combined action values determined choice probabilities via a softmax decision rule as:
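The weighted-sum combination and softmax rule can be sketched as follows; the names are ours and the elided equations are reconstructed from the surrounding text.

```python
import math

def choice_probs(Q_by_strategy, weights, K):
    """Softmax choice probabilities over net action values.

    Q_by_strategy: {strategy: {action: value}}; weights: {strategy: w_i};
    K: {action: bias/perseveration modifier}.
    """
    actions = list(K)
    # Net value: weighted sum of each strategy's action values, plus K(a).
    Q_net = {a: sum(weights[i] * Q_by_strategy[i][a] for i in weights) + K[a]
             for a in actions}
    # Softmax decision rule.
    z = sum(math.exp(Q_net[a]) for a in actions)
    return {a: math.exp(Q_net[a]) / z for a in actions}

# Example: a single strategy with weight 1 and no bias/perseveration.
probs = choice_probs({'mb': {'left': 1.0, 'right': 0.0}},
                     {'mb': 1.0},
                     {'left': 0.0, 'right': 0.0})
```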
Model fitting and comparison
We generated a total of 59 different individual models from the following classes:
Model-free (MF): These models used only the model-free strategy, and varied with respect to whether they had asymmetric learning rates for positive and negative outcomes (together with forgetting toward neutral), and whether they included perseveration or multi-trial perseveration and/or bias.
Model-based (MB): These models used only the model-based strategy, and varied with respect to whether they had asymmetric learning rates for positive and negative outcomes (together with forgetting to either neutral or zero), and whether they included perseveration or multi-trial perseveration and/or bias.
Hybrid (MF + MB): These models used both the model-based and model-free strategies, and varied with respect to whether they had asymmetric learning rates for positive and negative outcomes (together with forgetting toward neutral), and whether they included perseveration or multi-trial perseveration and/or bias.
Bayesian inference: These models used the Bayesian inference strategy, and varied with respect to whether they included asymmetric updating based on the outcome received, and whether they included perseveration or multi-trial perseveration and/or bias.
Bias increased the value of the left action by an amount determined by a bias strength parameter. Perseveration increased the value of the first-step action chosen on the previous trial by an amount determined by a perseveration strength parameter; in the case of multi-trial perseveration, an exponential moving average of previous choices was used rather than just the previous choice, with a time constant determined by the multi-trial perseveration alpha parameter.
Each model was fit separately to data from each subject using maximum likelihood. The optimization was repeated 30 times starting with randomized initial parameter values drawn from a beta distribution (α = 2, β = 2) for unit range parameters, gamma distribution (α = 2, β = 0.4) for positive range parameters and normal distribution (σ = 5) for unconstrained parameters, and the best of these fits was used. Model comparison was done using both Bayesian information criterion (Extended Data Fig. 1a and Supplementary Tables 1 and 2) and crossvalidated log likelihood using 10 folds (Extended Data Fig. 1b).
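The random-restart initialization and the BIC used for model comparison can be sketched as follows (illustrative only; the `draw_initial_params` helper and the scale parameterization of the gamma draw are our assumptions):

```python
import math
import random

def draw_initial_params(param_kinds, rng):
    """Draw randomized starting values: beta(2, 2) for unit-range parameters,
    gamma(2, 0.4) for positive-range parameters (scale parameterization
    assumed) and normal(0, 5) for unconstrained parameters."""
    draws = {}
    for name, kind in param_kinds.items():
        if kind == "unit":
            draws[name] = rng.betavariate(2, 2)
        elif kind == "positive":
            draws[name] = rng.gammavariate(2, 0.4)
        else:
            draws[name] = rng.gauss(0, 5)
    return draws

def bic(log_lik, n_params, n_trials):
    """Bayesian information criterion; lower values favor the model."""
    return n_params * math.log(n_trials) - 2.0 * log_lik
```

In the fitting procedure, the optimizer would be run once from each of the 30 drawn starting points and the fit with the highest likelihood retained.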
To compare data simulated from the single-strategy models with real data (Extended Data Fig. 1g–j), we simulated, for each animal, the same number of sessions as in the real data (28.9 ± 1.6 sessions, mean ± s.d.), each with the animal's average number of trials per session (351.6 ± 93.0 trials, mean ± s.d.), using parameter values from each animal's fits.
Mixture-of-strategies model
We created a mixture-of-strategies model, which contained the different behavioral strategies from the single-strategy models (model-free RL, model-based RL and Bayesian inference) as components (Extended Data Fig. 1c–f and Supplementary Table 3). All three components included asymmetric updating from rewards and omissions.
This mixture-of-strategies model combined the action values of the three strategies in a weighted sum with bias and multi-trial perseveration to give net action values as:

\({Q}_{net}(a)={\sum }_{i}{w}_{i}{Q}_{i}(a)+K(a)\)

Where \({w}_{i}\) is the weight assigned to strategy i whose first-step action values are given by \({Q}_{i}(a)\), and K(a) is the modifier to the value of first-step action a due to bias and multi-trial perseveration. The combined action values determined choice probabilities via a softmax decision rule, as in the single-strategy models.
This model was fit separately to data from each subject using maximum a posteriori probability, with priors: beta distribution (α = 2, β = 2) for unit range parameters, gamma distribution (α = 2, β = 0.4) for positive range parameters and normal distribution (σ = 5) for unconstrained parameters. The optimization was repeated 50 times starting with randomized initial parameter values drawn from the prior distributions.
To test whether behavior generated by the model-based and inference strategies could be differentiated, we fitted the mixture-of-strategies model to data simulated from each single-strategy model using parameters fit to subjects' data (Extended Data Fig. 1c,d).
Photometry analysis
Photometry signals were aligned across trials by linearly time-warping the signal in the two periods of the trial whose timings were determined by subject behavior: between initiation and choice, and between the second-step port illuminating and being poked (Fig. 2b). Activity at other time periods was not warped.
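A simplified, hypothetical version of this alignment for a single behavior-dependent segment can be sketched as:

```python
import numpy as np

def warp_segment(signal, times, t_start, t_end, n_points):
    """Linearly resample a photometry trace between two behavioral events
    (for example, initiation and choice) onto a fixed number of points, so
    that segments of different duration align across trials."""
    target_times = np.linspace(t_start, t_end, n_points)
    return np.interp(target_times, times, signal)
```

Each trial's segment, whatever its duration, is mapped onto the same number of timepoints, so the warped traces can be stacked and averaged across trials.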
For the analyses presented in Figs. 3 and 4 and Extended Data Figs. 4–7, we used Lasso linear regression to predict trial-by-trial dopamine activity at each timepoint in a trial as:

\(y(i,t)={\sum }_{p}{\beta }_{p}(t){X}_{p}(i)+{\beta }_{0}(t)+\varepsilon (i,t)\)

where y(i, t) is the z-scored calcium activity on trial i at timepoint \(t\), \({\beta }_{p}(t)\) is the weight for predictor \(p\) at timepoint \(t\), \({X}_{p}(i)\) is the value of predictor \(p\) on trial i, \({\beta }_{0}(t)\) is the intercept at timepoint \(t\), and \(\varepsilon (i,t)\) is the residual unexplained variance.
The linear regression was fit separately for each subject to obtain the coefficient time courses β_{p}(t). The penalty used for the Lasso regularization was found for each individual regression through cross-validation. When regularization was used, predictors were standardized by centering the mean at 0 and scaling to a variance of 1. For each predictor, we plotted the mean and s.e. across subjects. The statistical significance of each predictor at each timepoint was assessed using a t-test comparing the distribution of coefficients across subjects with zero, with Benjamini–Hochberg correction for comparison of multiple timepoints. Effect sizes were computed using Cohen's d at each timepoint as:

\(d(t)=\frac{{\bar{\beta }}_{p}(t)}{{\rm{s.d.}}({\beta }_{p}(t))}\)

where the mean \({\bar{\beta }}_{p}(t)\) and standard deviation are taken across subjects.
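The predictor standardization and cross-subject effect size can be sketched as follows (a minimal version; the use of the sample standard deviation, ddof=1, is an assumption):

```python
import numpy as np

def standardize(X):
    """Center each predictor (column) to mean 0 and scale to unit variance,
    as done before fitting the regularized regression."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def cohens_d(coefs):
    """Effect size for one predictor at one timepoint: the mean coefficient
    across subjects divided by its cross-subject standard deviation."""
    coefs = np.asarray(coefs, dtype=float)
    return coefs.mean() / coefs.std(ddof=1)
```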
The linear regressions in Fig. 3 and Extended Data Fig. 4 used the following predictors:

Reward: +0.5 if current trial is rewarded, and −0.5 otherwise.

Previous reward: +0.5 if previous trial is rewarded, and −0.5 otherwise.

Good second step: +0.5/−0.5 if the second step reached on the current trial has high/low reward probability, and 0 if neutral block.

Previous good second step: +0.5/−0.5 if the second step reached on the previous trial has high/low reward probability, and 0 if neutral block.

Correct choice: +0.5/−0.5 if the subject's current choice commonly leads to the high/low reward probability second-step port, and 0 if neutral block.

Repeat choice: +0.5 if same choice as previous choice, and −0.5 if different choice to previous trial.

Direct reinforcement action value update: +0.5/−0.5 if current choice is the same as the previous choice and the previous trial was rewarded/not rewarded, and 0 if different choice from previous trial.

Inferred action value update: +0.5/−0.5 if current choice commonly leads to the previous second step when it was rewarded/not rewarded, and −0.5/+0.5 if current choice rarely leads to the previous second step when it was rewarded/not rewarded.

Previous reward, same second step: +0.5/−0.5 if current second step is the same as in the previous trial and previous trial was rewarded/unrewarded, and 0 if current second step is different from the second step on the previous trial.

Previous reward, different second step: +0.5/−0.5 if current second step is different from the previous second step and previous trial was rewarded/unrewarded, and 0 if current second step is the same as the second step on the previous trial.

Common transition: +0.5 if a common transition occurs on the current trial, and −0.5 if a rare transition occurs.

Forced choice: +0.5 if the current trial is a forced-choice trial, and −0.5 if it is a free-choice trial.

Reward rate: exponential moving average of the recent reward rate (tau = 10 trials).

Contralateral choice: +0.5 if the current choice is on the side contralateral to the recording site, and −0.5 if it is on the ipsilateral side.

Up second step: +0.5 if the current second step is up, and −0.5 if current second step is down.
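The 'reward rate' predictor above is an exponential moving average; a minimal sketch follows (the initialization at 0 and the exact smoothing constant, alpha = 1 − e^(−1/tau), are our assumptions):

```python
import math

def reward_rate_trace(rewards, tau=10.0):
    """Exponential moving average of trial outcomes (1 = reward, 0 = omission)
    with a time constant of tau trials."""
    alpha = 1.0 - math.exp(-1.0 / tau)
    rate, trace = 0.0, []
    for r in rewards:
        rate += alpha * (r - rate)  # move the running rate toward the outcome
        trace.append(rate)
    return trace
```

With tau = 10 trials, outcomes more than a few tens of trials in the past contribute negligibly to the current estimate.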
The linear regression in Extended Data Fig. 5c,d included the same predictors as above (Fig. 3 and Extended Data Fig. 4), except that the 'previous reward, same second step', 'previous reward, different second step' and 'inferred action value update' predictors were replaced with predictors split by outcome as:

Same versus different second step, previously rewarded: +0.5/−0.5 if current second step is the same/different as in the previous trial and the previous trial was rewarded, and 0 if the previous trial was not rewarded.

Same versus different second step, previously nonrewarded: +0.5/−0.5 if current second step is the same/different as in the previous trial and the previous trial was not rewarded, and 0 if the previous trial was rewarded.

Inferred action value update from rewarded trials: +0.5/−0.5 if current choice commonly/rarely leads to the previous second step and the previous trial was rewarded, and 0 if the previous trial was not rewarded.

Inferred action value update from unrewarded trials: −0.5/+0.5 if current choice commonly/rarely leads to the previous second step and the previous trial was not rewarded, and 0 if the previous trial was rewarded.
The linear regressions used in Fig. 4 and Extended Data Figs. 6 and 7 used the abovedescribed reward, previous reward, reward rate, contralateral choice, up second step, common transition and forcedchoice regressors with the following additional regressors:

Secondstep value: the value \({V}_{{inf}}(s)\) of the second step reached on the current trial from the asymmetric Bayesian inference model.

Chosen action value: the value \({Q}_{{inf}}(c)\) of the firststep action chosen on the current trial from the asymmetric Bayesian inference model.
Finally, the lagged photometry regression in Extended Data Fig. 5b predicted the dopamine response to the second-step cue (500 ms at the end of the second-step cue, baseline subtracted using the 500 ms before choice) as a function of the extended history of trial events over the previous 12 trials. No regularization was used in this linear regression. The analysis included the above-described 'previous reward, same second step', 'previous reward, different second step', 'direct reinforcement action value update' and 'inferred action value update' regressors at each lag \(n\), with the following additional regressors to correct for correlations in the signal:

Same versus different second step: +0.5/−0.5 if the current second step is the same as/different from that of the \(n\)th previous trial.

Reward on trial −1 (not lagged): +0.5 if previous trial is rewarded, and −0.5 otherwise.
Neural network model
For the neural network modeling (Fig. 6), the task was represented as having six observable states—choice state, up-active, down-active, reward-at-up, reward-at-down and no reward—and five actions corresponding to the five ports—poke-left, poke-right, poke-up, poke-down and poke-center. Completing each trial therefore required a sequence of at least three states and actions (for example, choice state, poke-left → up-active, poke-up → reward-at-up and poke-center), but could take more steps if the agent chose actions that were inactive in the current state (for example, poke-left in the up-active state).
The neural network model consisted of a recurrent network representing PFC and a feedforward network representing basal ganglia, implemented using the Keras TensorFlow API (https://keras.io/). The PFC network was a single fully connected layer of 16 gated recurrent units^{68}. In the version of the model shown in Fig. 6b–d, the PFC network received as input on each time step an observation \({{\boldsymbol{O}}}_{t}\) (the observable task state) and the preceding action \({{\boldsymbol{A}}}_{t-1}\), both coded as one-hot vectors. In the version of the model shown in Fig. 6e–i, the PFC network received as input a vector \({{\boldsymbol{O}}}_{t}^{g}\), which was the observation \({{\boldsymbol{O}}}_{t}\) gated by whether reward was received on that time step: on rewarded time steps, the input was a one-hot vector indicating the observation (\({{\boldsymbol{O}}}_{t}^{g}={{\boldsymbol{O}}}_{t}\)), while on non-rewarded time steps, the input was a vector of zeros (\({{\boldsymbol{O}}}_{t}^{g}={\boldsymbol{0}}\)).
The basal ganglia network received as input the observation \({{\boldsymbol{O}}}_{t}\) and the activity of the PFC network units. It comprised a layer of ten rectified linear units with two outputs: a scalar-valued linear output for the estimated value \({V}_{t}\) (that is, the expected discounted future reward from the current time step) and a vector-valued softmax output for the policy (that is, the probability of choosing each of the five actions on the next time step).
The model was trained using episodes which terminated after 100 trials or 600 time steps (whichever occurred first), with network weights updated between episodes. For the version of the model used in Fig. 6b–d, the PFC network was trained to predict the observation \({{\boldsymbol{O}}}_{t}\) given the preceding observations and actions. For the version of the model used in Fig. 6e–i, the PFC network was trained to predict the reward-gated observation \({{\boldsymbol{O}}}_{t}^{g}\), which it received as input, given this input on preceding time steps. In both cases, PFC network weights were updated using gradient descent with backpropagation through time, with a mean-squared-error cost function, using the Adam optimizer^{69} with learning rate = 0.01. The basal ganglia network was trained using the advantage actor-critic RL algorithm^{70}. Hyperparameters for training the basal ganglia network were: learning rate = 0.05, discount factor = 0.9 and entropy loss weight = 0.05.
For each version of the model, we performed 12 simulation runs, each of 500 episodes, using different random seeds. We used these runs as the experimental unit for statistical analysis (that is, the equivalent of subjects in the animal experiments). Data from the last 10 episodes of each run were used for analyses. We excluded any runs that did not obtain reward above chance level in the last 10 episodes; this excluded two runs of the model version shown in Fig. 6b–d and no runs of the version shown in Fig. 6e–i.
To visualize how the activity of PFC units tracked the reward probability blocks, we took the activity of the PFC units in the task’s ‘choice state’ on each trial of an episode, yielding an activity matrix of shape (n_units, n_trials). We used principal component analysis to find the first principal component of the activities’ variation across trials (a vector of weights over units), then projected the activity matrix onto this, giving the time series across trials (Fig. 6d,g).
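This projection can be sketched with a singular value decomposition (a minimal numpy version under our reading of the procedure; the function name is illustrative):

```python
import numpy as np

def first_pc_projection(activity):
    """Project a (n_units, n_trials) activity matrix onto the first principal
    component of its variation across trials, giving one value per trial."""
    centered = activity - activity.mean(axis=1, keepdims=True)  # center each unit
    # The first left singular vector is the first PC: one weight per unit.
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, 0] @ centered
```

The resulting trial-by-trial time series is what is plotted against the reward probability blocks (up to an arbitrary sign, as with any PCA projection).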
To evaluate how rewards modified the value of the second-step states (Fig. 6i), we used the model to evaluate both the value of the second-step state that was actually reached on each trial and the value the other second-step state would have had if it had been reached. We then computed how the trial outcome (reward versus omission) changed the value of the second-step state where it was received, and that of the other second-step state.
To simulate the effects of optogenetic stimulation of dopamine neurons (Fig. 6i), we randomly selected 25% of trials and, for each of these trials, computed the update to the basal ganglia network weights that would be induced by a positive RPE occurring after either the choice action (choice-time stimulation) or the second-step action (outcome-time stimulation). We evaluated how these weight updates affected behavior using linear regression to model the probability of repeating the choice on the next trial (stay probability) as a function of the transition, outcome and whether the trial was stimulated or not.
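The stay-probability summary underlying this analysis can be sketched as follows (a simplified tabulation rather than the regression model itself; names are illustrative):

```python
from collections import defaultdict

def stay_probabilities(choices, transitions, outcomes):
    """Probability of repeating the previous trial's choice, split by the
    previous trial's transition ('common'/'rare') and outcome (1/0)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [n_stays, n_trials]
    for i in range(1, len(choices)):
        key = (transitions[i - 1], outcomes[i - 1])
        counts[key][1] += 1
        counts[key][0] += int(choices[i] == choices[i - 1])
    return {k: stays / total for k, (stays, total) in counts.items()}
```

In the full analysis, a stimulation indicator would be added to the conditioning variables (or entered as a regressor) to test whether stimulated trials shift these probabilities.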
Statistics and reproducibility
Sample size was determined using power analyses with significance = 0.05 and power = 0.8, using effect sizes based on our own preliminary data.
As described in ‘Photometry recordings’ and ‘Neural network model’, data were excluded as follows: (i) no data were recorded from the NAc in two animals as they did not present any GCaMP or dLight modulation, and subsequent histology confirmed the fiber targeting the NAc was misplaced in these two mice (over the anterior commissure); (ii) sessions with large artifacts or signal loss due to technical issues during recording sessions (total of 46 sessions, representing ∼9% of the total); and (iii) during simulations using the neural network model, runs that did not obtain reward above chance level in the last ten episodes were excluded (total of two runs in the model from Fig. 6b–d).
The presented GCaMP data were obtained using two different cohorts run at different times. dLight data were obtained from a cohort of mice run alongside the second GCaMP cohort. Optogenetic activation and inhibition experiments were also obtained at different times.
Auditory cues and transition probability structure were randomized and counterbalanced across animals and sexes. For the optogenetic assays, group allocation was also randomized. In both activation and inhibition optogenetics, stimulation sessions were interspersed with baseline nostimulation sessions. The order of stimulation sessions (whether stimulation happened at choice or outcome time) was counterbalanced across animals.
Data collection and analysis were not performed blind to the conditions of the experiments, but the behavioral apparatus and optogenetic stimulation were fully automated, minimizing experimenter influence.
For the statistical reporting, data distribution was assumed to be normal, but this was not formally tested. Where possible, individual data points are shown.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All raw data and preprocessed variables from the manuscript are available on OSF at https://osf.io/u6xrc/.
Code availability
Complete code used to implement the pyControl task, preprocess the data and generate Figs. 1–5 is available at https://github.com/Mblancopozo/twostep_dopamine (https://doi.org/10.5281/zenodo.10093116).
The code used to simulate the model and generate Fig. 6 is available at https://github.com/ThomasAkam/PFCBG_model (https://doi.org/10.5281/zenodo.10079814).
References
Montague, P. R., Dayan, P. & Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Eshel, N. et al. Arithmetic and local circuitry underlying dopamine prediction errors. Nature 525, 243–246 (2015).
Kim, H. R. et al. A unified framework for dopamine signals across timescales. Cell 183, 1600–1616 (2020).
Hamid, A. A. et al. Mesolimbic dopamine signals the value of work. Nat. Neurosci. 19, 117–126 (2016).
Parker, N. F. et al. Reward and choice encoding in terminals of midbrain dopamine neurons depends on striatal target. Nat. Neurosci. 19, 845–854 (2016).
Steinberg, E. E. et al. A causal link between prediction errors, dopamine neurons and learning. Nat. Neurosci. 16, 966–973 (2013).
Ilango, A. et al. Similar roles of substantia nigra and ventral tegmental dopamine neurons in reward and aversion. J. Neurosci. 34, 817–822 (2014).
Wilson, R. C., Takahashi, Y. K., Schoenbaum, G. & Niv, Y. Orbitofrontal cortex as a cognitive map of task space. Neuron 81, 267–279 (2014).
Costa, V. D., Tran, V. L., Turchi, J. & Averbeck, B. B. Reversal learning and dopamine: a bayesian perspective. J. Neurosci. 35, 2407–2416 (2015).
Bartolo, R. & Averbeck, B. B. Inference as a fundamental process in behavior. Curr. Opin. Behav. Sci. 38, 8–13 (2021).
Vertechi, P. et al. Inferencebased decisions in a hidden state foraging task: differential contributions of prefrontal cortical areas. Neuron 106, 166–176 (2020).
Hampton, A. N., Bossaerts, P. & O’Doherty, J. P. The role of the ventromedial prefrontal cortex in abstract statebased inference during decision making in humans. J. Neurosci. 26, 8360–8367 (2006).
Wimmer, G. E., Daw, N. D. & Shohamy, D. Generalization of value in reinforcement learning by humans. Eur. J. Neurosci. 35, 1092–1104 (2012).
Baram, A. B., Muller, T. H., Nili, H., Garvert, M. M. & Behrens, T. E. J. Entorhinal and ventromedial prefrontal cortices abstract and generalize the structure of reinforcement learning problems. Neuron 109, 713–723 (2021).
Samborska, V., Butler, J. L., Walton, M. E., Behrens, T. E. J. & Akam, T. Complementary task representations in hippocampus and prefrontal cortex for generalizing the structure of problems. Nat. Neurosci. 25, 1314–1326 (2022).
Gallistel, C. R., Mark, T. A., King, A. P. & Latham, P. E. The rat approximates an ideal detector of changes in rates of reward: implications for the law of effect. J. Exp. Psychol. Anim. Behav. Process. 27, 354–372 (2001).
Gershman, S. J. & Niv, Y. Learning latent structure: carving nature at its joints. Curr. Opin. Neurobiol. 20, 251–256 (2010).
BrombergMartin, E. S., Matsumoto, M., Hong, S. & Hikosaka, O. A pallidus–habenula–dopamine pathway signals inferred stimulus values. J. Neurophysiol. 104, 1068–1076 (2010).
Babayan, B. M., Uchida, N. & Gershman, S. J. Belief state representation in the dopamine system. Nat. Commun. 9, 1891 (2018).
Starkweather, C. K., Babayan, B. M., Uchida, N. & Gershman, S. J. Dopamine reward prediction errors reflect hiddenstate inference across time. Nat. Neurosci. 20, 581–589 (2017).
Nakahara, H., Itoh, H., Kawagoe, R., Takikawa, Y. & Hikosaka, O. Dopamine neurons can represent contextdependent prediction error. Neuron 41, 269–280 (2004).
Lak, A. et al. Dopaminergic and prefrontal basis of learning from sensory confidence and reward value. Neuron 105, 700–711 (2020).
Akam, T., Costa, R. & Dayan, P. Simple plans or sophisticated habits? State, transition and learning interactions in the twostep task. PLoS Comput. Biol. 11, e1004648 (2015).
Akam, T. et al. The anterior cingulate cortex predicts future states to mediate modelbased action selection. Neuron 109, 149–163 (2021).
Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P. & Dolan, R. J. Modelbased influences on humans’ choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).
Behrens, T. E. J., Hunt, L. T., Woolrich, M. W. & Rushworth, M. F. S. Associative learning of social value. Nature 456, 245–249 (2008).
Niv, Y., Edlund, J. A., Dayan, P. & O’Doherty, J. P. Neural prediction errors reveal a risksensitive reinforcementlearning process in the human brain. J. Neurosci. 32, 551–562 (2012).
Mohebi, A. et al. Dissociable dopamine dynamics for learning and motivation. Nature 570, 65–70 (2019).
Pan, W. X., Coddington, L. T. & Dudman, J. T. Dissociable contributions of phasic dopamine activity to reward and prediction. Cell Rep. 36, 109684 (2021).
Jeffreys, H. Theory of Probability (Clarendon Press, 1961).
Miller, K. J., Botvinick, M. M. & Brody, C. D. Dorsal hippocampus contributes to modelbased planning. Nat. Neurosci. 20, 1269–1276 (2017).
Rutledge, R. B., Dean, M., Caplin, A. & Glimcher, P. W. Testing the reward prediction error hypothesis with an axiomatic model. J. Neurosci. 30, 13525–13536 (2010).
Akam, T. & Walton, M. E. What is dopamine doing in modelbased reinforcement learning? Curr. Opin. Behav. Sci. 38, 74–82 (2021).
Bari, B. A. et al. Stable representations of decision variables for flexible behavior. Neuron 103, 922–933 (2019).
Hattori, R. & Komiyama, T. Contextdependent persistency as a coding mechanism for robust and widely distributed value coding. Neuron 110, 502–515 (2022).
Schuck, N. W., Cai, M. B., Wilson, R. C. & Niv, Y. Human orbitofrontal cortex represents a cognitive map of state space. Neuron 91, 1402–1412 (2016).
KleinFlügge, M. C., Wittmann, M. K., Shpektor, A., Jensen, D. E. A. & Rushworth, M. F. S. Multiple associative structures created by reinforcement and incidental statistical learning mechanisms. Nat. Commun. 10, 4835 (2019).
Bradfield, L. A., Dezfouli, A., van Holstein, M., Chieng, B. & Balleine, B. W. Medial orbitofrontal cortex mediates outcome retrieval in partially observable task situations. Neuron 88, 1268–1280 (2015).
Starkweather, C. K., Gershman, S. J. & Uchida, N. The medial prefrontal cortex shapes dopamine reward prediction errors under state uncertainty. Neuron 98, 616–629 (2018).
Bartolo, R. & Averbeck, B. B. Prefrontal cortex predicts state switches during reversal learning. Neuron 106, 1044–1054 (2020).
Jones, J. L. et al. Orbitofrontal cortex supports behavior and learning using inferred but not cached values. Science 338, 953–956 (2012).
Gershman, S. J. & Uchida, N. Believing in dopamine. Nat. Rev. Neurosci. 20, 703–714 (2019).
Sadacca, B. F., Jones, J. L. & Schoenbaum, G. Midbrain dopamine neurons compute inferred and cached value prediction errors in a common framework. Elife 5, e13665 (2016).
Grogan, J. P. et al. Effects of dopamine on reinforcement learning and consolidation in Parkinson’s disease. Elife 6, e26801 (2017).
Korn, C. et al. Distinct roles for dopamine clearance mechanisms in regulating behavioral flexibility. Mol. Psychiatry 26, 7188–7199 (2021).
Eisenegger, C. et al. Role of dopamine D2 receptors in human reinforcement learning. Neuropsychopharmacology 39, 2366–2375 (2014).
Wang, J. X. et al. Prefrontal cortex as a metareinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).
Rao, R. P. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extraclassical receptivefield effects. Nat. Neurosci. 2, 79–87 (1999).
Friston, K. A theory of cortical responses. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 360, 815–836 (2005).
Doya, K. Complementary roles of basal ganglia and cerebellum in learning and motor control. Curr. Opin. Neurobiol. 10, 732–739 (2000).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: an Introduction (MIT press, 2018).
Littman, M. & Sutton, R. S. Predictive representations of state. In Advances in Neural Information Processing Systems 14 (eds Dietterich, T. G. et al.) (MIT Press, 2001).
Lin, L. & Mitchell, T. M. Reinforcement learning with hidden states. In From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior (eds Meyer, J.A., Roitblat, H. L., Wilson, S. W.) (MIT Press, 1993).
Igl, M., Zintgraf, L. M., Le, T. A., Wood, F. & Whiteson, S. Deep variational reinforcement learning for POMDPs. In Proceedings of the 35th International Conference on Machine Learning 2117–2126 (2018).
Pearce, J. M. & Bouton, M. E. Theories of associative learning in animals. Annu. Rev. Psychol. 52, 111–139 (2001).
Fraser, K. M. & Holland, P. C. Occasion setting. Behav. Neurosci. 133, 145–175 (2019).
Delamater, A. R. On the nature of CS and US representations in Pavlovian learning. Learn. Behav. 40, 1–23 (2012).
Schmajuk, N. A., Lamoureux, J. A. & Holland, P. C. Occasion setting: a neural network approach. Psychol. Rev. 105, 3–32 (1998).
Threlfell, S. & Cragg, S. J. Dopamine signaling in dorsal versus ventral striatum: the dynamic role of cholinergic interneurons. Front. Syst. Neurosci. 5, 11 (2011).
Niv, Y., Daw, N. D., Joel, D. & Dayan, P. Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology 191, 507–520 (2007).
Akam, T. et al. Opensource, Pythonbased, hardware and software for controlling behavioural neuroscience experiments. Elife 11, e67846 (2022).
Lopes, G. et al. Bonsai: an eventbased framework for processing and controlling data streams. Front. Neuroinform. 9, 7 (2015).
Akam, T. & Walton, M. E. pyPhotometry: open source Python based hardware and software for fiber photometry data acquisition. Sci. Rep. 9, 3521 (2019).
Singmann, H., Bolker, B., Westfall, J. & Aust, F. afex: analysis of factorial experiments. R package. (2018).
Barr, D. J., Levy, R., Scheepers, C. & Tily, H. J. Random effects structure for confirmatory hypothesis testing: keep it maximal. J. Mem. Lang. 68, 255–278 (2013).
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H. & Bates, D. Balancing type I error and power in linear mixed models. J. Mem. Lang. 94, 305–315 (2017).
Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (eds Moschitti, A. et al.) 1724–1734 (ACL, 2014). https://doi.org/10.3115/v1/D14-1179
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (eds. Bengio, Y. & LeCun, Y.) (2015).
Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proceedings of the International conference on machine learning 1928–1937 (2016).
Acknowledgements
This research was funded by Wellcome (202831/Z/16/Z to M.E.W.; 214314/Z/18/Z to M.E.W. and T.A.; WT096193AIA to T.A.; 225926/Z/22/Z to T.A.; 215198/Z/19/Z to M.B.P.) and the Biotechnology and Biological Sciences Research Council (BB/S006338/1). The funder had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. For the purpose of open access, the author has applied a CCBY public copyright license to any Author Accepted Manuscript arising from this submission. We thank M. Panayi for assistance with the mixedeffects modeling, T. Behrens for discussions about the work, and P. Dayan and A. Lak for comments on the manuscript.
Author information
Authors and Affiliations
Contributions
T.A. and M.E.W. conceived the project and, with M.B.P., designed the experiments. M.B.P. and T.A. collected the data. M.B.P. and T.A. analyzed the data. T.A. performed the network modeling. All authors wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Neuroscience thanks Mihaela Iordanova and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Model comparison and mixture-of-strategies model.
a,b) Model comparison for single-strategy models; statistical significance, two-sided Wilcoxon signed-rank test (with normal approximation due to ties) with Bonferroni correction. a) BIC score model comparison of all the models tested (see Supplementary Table 1 for details of each model). Colours indicate the model's strategy. Each strategy was tested both with and without choice bias and multi-trial perseveration. Additionally, models were tested with both symmetric and asymmetric learning from reward and omission outcomes (see Methods). Model-free and model-based RL strategies were tested with and without forgetting about the values of not-visited states and not-chosen actions. Two variants of forgetting were tested: one implemented forgetting as value decay towards a neutral value of 0.5, the other as value decay towards 0. Insets show the best-fitting model of each model class. Fitted parameter values for the best model in each class are presented in Supplementary Table 2. b) Cross-validated log likelihood of the same models as in a. Inset, as in a, best-fitting model of each category. Both model comparison approaches indicate that the asymmetric inference and asymmetric model-based RL with forgetting to 0 strategies explain the data equally well, and better than the other strategies considered. c–f) Mixture-of-strategies model: choices were determined by a weighted combination of the asymmetric versions of model-free RL, model-based RL and Bayesian inference. We tested two different mixture-of-strategies models, differentiated by how forgetting was implemented for the RL components: in c), forgetting was implemented as value decay towards a neutral value of 0.5, while in d), it was implemented as value decay towards 0. c) Mixture-of-strategies model with forgetting towards neutral value. Left panel: influence of each model component on choices, evaluated using the fit of the model to each subject's data.
The influence on choices was quantified as the standard deviation across trials of the difference between the two first-step action values due to a given component, divided by the total standard deviation of the value difference due to all components. Each coloured dot represents a single mouse. Centre and right panels: influence of each model component for fits of the mixture-of-strategies model to behaviour simulated from either a model-based (centre panel) or Bayesian inference (right panel) single-strategy model with parameters fit to each subject's choices. The fit correctly assigned high weight to the strategy that generated the simulated data and low weight to the other strategies. d) As in c, but with forgetting towards 0. When forgetting was implemented as decay towards 0, the mixture-of-strategies model fit to simulated data did not correctly identify which strategy generated the simulated data. Therefore, consistent with the model comparison, we cannot arbitrate between the asymmetric inference and asymmetric model-based RL with forgetting to 0 strategies using the mixture-of-strategies model. Statistical significance, two-sided paired t-test, Bonferroni corrected. e,f) Value decay of the non-experienced second-step state over consecutive trials when forgetting decayed towards neutral (light green) or zero (dark green) in e) the asymmetric model-based model and f) the mixture-of-strategies model. Boxplots show the distribution of the data across subjects. Grey box represents the interquartile range, with horizontal lines representing first quartile, median and third quartile, from bottom to top. Whiskers represent minimum and maximum values. Rhomboids mark outliers. N = 18 simulated subjects; numbers of trials and sessions per subject matched to the corresponding mouse. n.s., non-significant; *P < 0.05, **P < 0.01, ***P < 0.001. For exact P values, see Supplementary Table 6.
Extended Data Fig. 2 Fibre placement for photometry experiments.
a) NAc, b) DMS, and c) VTA. Green, GCaMP6f; blue, dLight1.1 animals.
Extended Data Fig. 3 Mediolateral gradient in reward modulation in DMS.
a) Dopamine activity measured through GCaMP. b) Dopamine receptor binding measured through dLight. Left, mean reward coefficient (from the linear regression model) at the time of outcome cue; right, mean contralateral choice coefficient pre-choice. Top, correlation between mean coefficient and mediolateral (ML) or anteroposterior (AP) location of the optic fibre. Two-sided Wald test against 0. Bottom, mean z-scored activity split by outcome or lateralised choice for animals with medial or lateral placement in DMS. Dashed line in the correlation plots shows how animals were divided into the two groups (medial and lateral).
Extended Data Fig. 4 Regression coefficients from all the predictors used in the behavioural logistic regression model predicting VTA, NAc and DMS dopamine activity (GCaMP recordings) and NAc and DMS dopamine receptor binding (dLight recordings).
Traces reflect mean regression coefficients across subjects; shaded area indicates cross-subject standard error. Dots indicate effect size of the statistically significant timepoints, two-sided t-test comparing the cross-subject distribution against 0, after Benjamini-Hochberg multiple comparison correction.
Extended Data Fig. 5 Value update regressors.
a) Coefficients in the linear regression predicting dopamine activity, showing the influence of the previous trial outcome when the second-step state was the same (dark red) or different (green) from the previous trial, for NAc and DMS dLight signals. b) Linear regression predicting the dopamine response to the second-step cue as a function of the extended history of trial events over the previous 12 trials. Error bars, cross-subject mean ± s.e.m. Bottom left, schematic of the dopamine signal; yellow shaded area represents the predicted activity. Two-sided t-test comparing the cross-subject distribution against 0, Bonferroni corrected. For exact P values, see Supplementary Table 6. c) Second-step value update regressors, whether the current second-step state was the same or different from the previous trial, split by rewarded and unrewarded outcome on the previous trial. d) Inferred action value update split by outcome on the previous trial. Shaded area indicates cross-subject standard error. Dots indicate effect size of the statistically significant timepoints, two-sided t-test comparing the cross-subject distribution against 0, after Benjamini-Hochberg multiple comparison correction.
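A regression on the extended history of trial events, as in panel b, requires a design matrix of lagged predictors. A minimal sketch, assuming ordinary least squares and illustrative variable names (not the published analysis code):

```python
import numpy as np

def lagged_design_matrix(events, n_lags=12):
    """Design matrix whose columns are a constant plus the
    value of `events` on each of the previous 1..n_lags
    trials, for regressing the response at trial t on
    trial history. Rows cover trials n_lags..len(events)-1."""
    n = len(events)
    X = np.ones((n - n_lags, n_lags + 1))
    for lag in range(1, n_lags + 1):
        # column `lag` holds the event from `lag` trials back
        X[:, lag] = events[n_lags - lag : n - lag]
    return X

def history_regression(events, response, n_lags=12):
    """OLS of a trial-by-trial response (e.g. dopamine at the
    second-step cue) on lagged trial events; returns one
    coefficient per lag, preceded by the intercept."""
    X = lagged_design_matrix(np.asarray(events, float), n_lags)
    beta, *_ = np.linalg.lstsq(X, np.asarray(response, float)[n_lags:],
                               rcond=None)
    return beta
```

With a response driven purely by the previous trial's outcome, the fit concentrates weight on the lag-1 coefficient and leaves the other lags near zero.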
Extended Data Fig. 6 Regression model using second-step and action values derived from the asymmetric Bayesian inference model predicting dopamine activity (GCaMP) in VTA (left), NAc (centre) and DMS (right).
Traces reflect mean regression coefficients across subjects; shaded area indicates cross-subject standard error. Dots indicate effect size of the statistically significant timepoints, two-sided t-test comparing the cross-subject distribution against 0, after Benjamini-Hochberg multiple comparison correction.
Extended Data Fig. 7 Regression model and reward rate effect in dopamine concentrations.
a) Regression model using second-step and action values derived from the asymmetric Bayesian inference model predicting dopamine concentrations (dLight) in NAc (left) and DMS (right). Traces reflect mean regression coefficients across subjects; shaded area indicates cross-subject standard error. Dots indicate effect size of the statistically significant timepoints, two-sided t-test comparing the cross-subject distribution against 0, after Benjamini-Hochberg multiple comparison correction. b) Mean z-scored activity split by rewarded/unrewarded trials and recent reward rate. Shaded area indicates cross-subject standard error. Blue, high reward rate (> 0.7 rewards/trial; exponential moving average with tau = 8 trials). Green, medium reward rate (between 0.4 and 0.7). Red, low reward rate (< 0.4).
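The recent reward rate used to split trials in panel b is an exponential moving average of trial outcomes with tau = 8 trials. One plausible implementation is sketched below; the exact update rule, the 0.5 initialisation, and the handling of the bin boundaries are assumptions for illustration:

```python
import numpy as np

def exp_moving_average(rewards, tau=8.0):
    """Exponential moving average of binary trial outcomes.
    alpha = 1 - exp(-1/tau) gives a decay time constant of
    roughly tau trials; the 0.5 start value is an assumed
    neutral prior."""
    alpha = 1.0 - np.exp(-1.0 / tau)
    rate = np.zeros(len(rewards))
    avg = 0.5
    for t, r in enumerate(rewards):
        avg += alpha * (r - avg)
        rate[t] = avg
    return rate

def reward_rate_group(rate):
    """Bin trials into the figure's low/medium/high groups
    using thresholds 0.4 and 0.7 rewards/trial."""
    return np.where(rate > 0.7, 'high',
                    np.where(rate >= 0.4, 'medium', 'low'))
```

A long run of rewarded trials drives the estimate towards 1 (the "high" bin), while a run of omissions decays it towards 0 (the "low" bin).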
Extended Data Fig. 8 Optogenetic manipulation.
a) Left, fibre placement. Yellow, YFP animals from the optogenetic activation experiment; blue, ChR2 animals; orange, tdTomato animals from the optogenetic inhibition experiment; pink, GtACR2 animals. Right, photomicrograph showing injection and fibre placement. The photomicrograph comes from an example GtACR2 mouse, stained for TH (tyrosine hydroxylase, blue) and FusionRed (red fluorescent protein, red). b) Mixed-effects logistic regression predicting stay/switch behaviour in the optogenetic activation experiment. The analysis included data from both groups (YFP and ChR2) and three stimulation types (non-stimulated, stimulation after first-step choice, and stimulation at outcome). Error bars, mixed-effects model estimate ± s.d.; statistical significance, likelihood-ratio test with Type 3 sums of squares. For exact P values, see Supplementary Table 6. c–e) Optogenetic inhibition experiment. c) Two-alternative forced-choice control task showing the percentage of choices to the initially preferred side in the tdTomato (red) and GtACR2 (pink) groups. Following baseline sessions without stimulation (sessions 1–3), optical stimulation (1 s continuous, 5 mW) was delivered from sessions 4 to 7 when mice poked their initially preferred choice (preferred side in sessions 1–3). From sessions 8 to 11 the side of optical stimulation was reversed, now coincident with choice of the initially non-preferred side. Boxplots show the distribution of cross-subject regression estimates; box represents the interquartile range, with horizontal lines representing first quartile, median and third quartile, from bottom to top. Whiskers represent minimum and maximum values. Rhomboids mark outliers. **P < 0.01, ***P < 0.005, two-sided t-test with Bonferroni multiple comparison correction. For exact P values, see Supplementary Table 6. d,e) Inhibition effects on the two-step task. d) Mean latency to initiate a new trial after the centre poke illuminates, following stimulated and non-stimulated trials.
Dots indicate individual animals; error bars show standard error. e) As in b, but for the optogenetic inhibition experiment: mixed-effects logistic regression including both groups (tdTomato and GtACR2) and three stimulation types (non-stimulated, stimulation after first-step choice, and stimulation at outcome). N: tdTomato, 5 animals, 12,555 trials (choice-time stimulation sessions), 13,373 trials (outcome-time stimulation sessions); GtACR2, 7 animals, 14,789 trials (choice-time stimulation sessions), 15,491 trials (outcome-time stimulation sessions). For exact P values, see Supplementary Table 6.
Extended Data Fig. 9 Value updates and simulated dopamine stimulation for the version of the PFC-basal ganglia network model shown in Fig. 6b.
a) Effect of trial outcome (rewarded vs non-rewarded) on the value of the second-step state where the reward was received (same) and on the other state (different). ***P = 1.41e-7 (same), P = 2.00e-7 (different), respectively. b) Effect of simulated optogenetic stimulation after choice (top panel) or at outcome time (bottom panel). Stimulation was modelled as modifying weights in the basal ganglia network as a positive RPE would. N = 10 simulation runs. Two-sided t-test against 0. ***P = 1.36e-4.
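In the network model, simulated stimulation enters through the same weight update as a reward prediction error. A schematic of that update, written as a generic actor-critic-style rule assumed here for illustration (not the model's actual architecture):

```python
import numpy as np

def rpe_update(weights, features, rpe, lr=0.05):
    """Update weights in proportion to the RPE and the active
    input features. Simulated optogenetic stimulation then
    corresponds to injecting a fixed positive RPE through
    this same rule, without any actual reward."""
    return weights + lr * rpe * np.asarray(features, float)
```

Under this rule a positive RPE strengthens the weights of whatever features were active at stimulation time, and an equal negative RPE exactly reverses that change.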
Extended Data Fig. 10 Photometry signal preprocessing.
a,b) Preprocessing steps in an example session. a) Whole recording session. b) Zoomed signal from the dashed area in a. Top, raw GCaMP and tdTomato signals. Middle, signal after motion correction; the black line shows the double-exponential fit used for bleaching correction. Bottom, motion- and bleaching-corrected signal.
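The double-exponential bleaching correction in the middle panel can be sketched as a curve fit followed by subtraction. This is a minimal sketch of the general approach; the parameterisation and initial guesses are illustrative assumptions, not the exact preprocessing code:

```python
import numpy as np
from scipy.optimize import curve_fit

def double_exponential(t, a1, tau1, a2, tau2, c):
    """Two exponential decays plus an offset, modelling slow
    photobleaching of the fluorescence signal."""
    return a1 * np.exp(-t / tau1) + a2 * np.exp(-t / tau2) + c

def bleaching_correct(signal, fs):
    """Fit the double exponential to the (motion-corrected)
    signal and subtract it, leaving the fast fluctuations.
    Initial guesses p0 are heuristic assumptions."""
    t = np.arange(len(signal)) / fs
    amp = signal[0] - signal.min()
    p0 = (amp, 60.0, amp / 2.0, 600.0, signal.min())
    popt, _ = curve_fit(double_exponential, t, signal, p0=p0, maxfev=10000)
    return signal - double_exponential(t, *popt)
```

Subtracting the fitted slow component removes the drifting baseline, so transient dopamine events can then be expressed as deviations around zero.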
Supplementary information
Supplementary Information
Supplementary Tables 1–4.
Supplementary Table 5
Exact P values for statistical tests in Fig. 1.
Supplementary Table 6
Exact P values for statistical tests in Extended Data Figs. 1, 5 and 8.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Blanco-Pozo, M., Akam, T. & Walton, M. E. Dopamine-independent effect of rewards on choices through hidden-state inference. Nat. Neurosci. 27, 286–297 (2024). https://doi.org/10.1038/s41593-023-01542-x