Midbrain dopamine neurons signal reward prediction error (RPE), or actual minus expected reward. The temporal difference (TD) learning model has been a cornerstone in understanding how dopamine RPEs could drive associative learning. Classically, TD learning imparts value to features that serially track elapsed time relative to observable stimuli. In the real world, however, sensory stimuli provide ambiguous information about the hidden state of the environment, leading to the proposal that TD learning might instead compute a value signal based on an inferred distribution of hidden states (a 'belief state'). Here we asked whether dopaminergic signaling supports a TD learning framework that operates over hidden states. We found that dopamine signaling showed a notable difference between two tasks that differed only with respect to whether reward was delivered in a deterministic manner. Our results favor an associative learning rule that combines cached values with hidden-state inference.
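The TD error over a belief state can be illustrated with a minimal sketch. It assumes a linear value function (value = weights · belief); the function names, the discount factor `gamma` and the learning rate `alpha` are illustrative choices, not parameters reported here.

```python
def value(weights, belief):
    """Value estimate as a weighted sum over the belief state (assumed linear form)."""
    return sum(w * b for w, b in zip(weights, belief))


def td_update(weights, belief, next_belief, reward, gamma=0.98, alpha=0.1):
    """Return the TD error (RPE) and the updated weights.

    gamma and alpha are hypothetical values chosen for illustration only.
    """
    # RPE: reward plus discounted next value, minus current value.
    delta = reward + gamma * value(weights, next_belief) - value(weights, belief)
    # Credit is assigned to each hidden state in proportion to its belief.
    new_weights = [w + alpha * delta * b for w, b in zip(weights, belief)]
    return delta, new_weights


# Toy case: two hidden states, no learned value yet, reward of 1 delivered.
delta, new_weights = td_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], reward=1.0)
print(delta, new_weights)
```

With all weights at zero, the error equals the reward itself, which corresponds to the "actual minus expected reward" definition when nothing is yet expected.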
Schultz, W., Dayan, P. & Montague, P.R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Bayer, H.M. & Glimcher, P.W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47, 129–141 (2005).
Cohen, J.Y., Haesler, S., Vong, L., Lowell, B.B. & Uchida, N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482, 85–88 (2012).
Eshel, N. et al. Arithmetic and local circuitry underlying dopamine prediction errors. Nature 525, 243–246 (2015).
Sutton, R.S. & Barto, A.G. in Learning and Computational Neuroscience: Foundations of Adaptive Networks (eds. Gabriel, M. and Moore, J.) 497–537 (MIT Press, 1991).
Gershman, S.J., Blei, D.M. & Niv, Y. Context, learning and extinction. Psychol. Rev. 117, 197–209 (2010).
Gershman, S.J., Norman, K.A. & Niv, Y. Discovering latent causes in reinforcement learning. Curr. Opin. Behav. Sci. 5, 43–50 (2015).
Rao, R.P.N. Decision making under uncertainty: a neural model based on partially observable Markov decision processes. Front. Comput. Neurosci. 4, 146 (2010).
Daw, N.D., Courville, A.C. & Touretzky, D.S. Representation and timing in theories of the dopamine system. Neural Comput. 18, 1637–1677 (2006).
Fiorillo, C.D., Newsome, W.T. & Schultz, W. The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci. 11, 966–973 (2008).
Pasquereau, B. & Turner, R.S. Dopamine neurons encode errors in predicting movement trigger occurrence. J. Neurophysiol. 113, 1110–1123 (2015).
Nomoto, K., Schultz, W., Watanabe, T. & Sakagami, M. Temporally extended dopamine responses to perceptually demanding reward-predictive stimuli. J. Neurosci. 30, 10692–10702 (2010).
Tian, J. & Uchida, N. Habenula lesions reveal that multiple mechanisms underlie dopamine prediction errors and prediction-error-based learning. Neuron 87, 1304–1316 (2015).
Hamid, A.A. et al. Mesolimbic dopamine signals the value of work. Nat. Neurosci. 19, 117–126 (2016).
Kobayashi, S. & Schultz, W. Influence of reward delays on responses of dopamine neurons. J. Neurosci. 28, 7837–7846 (2008).
Jo, Y.S. & Mizumori, S.J.Y. Prefrontal regulation of neuronal activity in the ventral tegmental area. Cereb. Cortex 26, 4057–4068 (2016).
Sutton, R.S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
Suri, R.E. & Schultz, W. A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience 91, 871–890 (1999).
Suri, R.E. & Schultz, W. Learning of sequential movements by neural network model with dopamine-like reinforcement signal. Exp. Brain Res. 121, 350–354 (1998).
Hollerman, J.R. & Schultz, W. Dopamine neurons report an error in the temporal prediction of reward during learning. Nat. Neurosci. 1, 304–309 (1998).
Brown, J., Bullock, D. & Grossberg, S. How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J. Neurosci. 19, 10502–10511 (1999).
Oswal, A., Ogden, M. & Carpenter, R.H.S. The time course of stimulus expectation in a saccadic decision task. J. Neurophysiol. 97, 2722–2730 (2007).
Janssen, P. & Shadlen, M.N. A representation of the hazard rate of elapsed time in macaque area LIP. Nat. Neurosci. 8, 234–241 (2005).
Tsunoda, Y. & Kakei, S. Reaction-time changes with the hazard rate for a behaviorally relevant event when monkeys perform a delayed wrist-movement task. Neurosci. Lett. 433, 152–157 (2008).
Ghose, G.M. & Maunsell, J.H. Attentional modulation in visual cortex depends on task timing. Nature 419, 616–620 (2002).
Klein-Flügge, M.C., Hunt, L.T., Bach, D.R., Dolan, R.J. & Behrens, T.E. Dissociable reward and timing signals in human midbrain and ventral striatum. Neuron 72, 654–664 (2011).
Friston, K. A theory of cortical responses. Phil. Trans. R. Soc. Lond. B 360, 815–836 (2005).
Lee, T.S. & Mumford, D. Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 20, 1434–1448 (2003).
Kakade, S. & Dayan, P. Acquisition and extinction in autoshaping. Psychol. Rev. 109, 533–544 (2002).
Stalnaker, T.A., Berg, B., Aujla, N. & Schoenbaum, G. Cholinergic interneurons use orbitofrontal input to track beliefs about current state. J. Neurosci. 36, 6242–6257 (2016).
Takahashi, Y.K., Langdon, A.J., Niv, Y. & Schoenbaum, G. Temporal specificity of reward-prediction errors signaled by putative dopamine neurons in rat VTA depends on ventral striatum. Neuron 91, 182–193 (2016).
Ludvig, E.A., Sutton, R.S. & Kehoe, E.J. Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Comput. 20, 3034–3054 (2008).
Gershman, S.J., Moustafa, A.A. & Ludvig, E.A. Time representation in reinforcement-learning models of the basal ganglia. Front. Comput. Neurosci. 7, 194 (2014).
Mello, G.B.M., Soares, S. & Paton, J.J. A scalable population code for time in the striatum. Curr. Biol. 25, 1113–1122 (2015).
Backman, C. et al. Characterization of a mouse strain expressing Cre recombinase from the 3′ untranslated region of the dopamine transporter locus. Genesis 44, 383–390 (2006).
National Research Council. Guide for the Care and Use of Laboratory Animals 8th edn. (The National Academies Press, 2011).
Atasoy, D., Aponte, Y., Su, H.H. & Sternson, S.M. A FLEX switch targets channelrhodopsin-2 to multiple cell types for imaging and long-range circuit mapping. J. Neurosci. 28, 7025–7030 (2008).
Uchida, N. & Mainen, Z.F. Speed and accuracy of olfactory discrimination in the rat. Nat. Neurosci. 6, 1224–1229 (2003).
Schmitzer-Torbert, N., Jackson, J., Henze, D., Harris, K. & Redish, A.D. Quantitative measures of cluster quality for use in extracellular recordings. Neuroscience 131, 1–11 (2005).
Lima, S.Q., Hromádka, T., Znamenskiy, P. & Zador, A.M. PINP: a new method of tagging neuronal populations for identification during in vivo electrophysiological recording. PLoS One 4, e6099 (2009).
Kvitsiani, D. et al. Distinct behavioral and network correlates of two interneuron types in prefrontal cortex. Nature 498, 363–366 (2013).
We thank R. Born, J. Assad and C. Harvey for discussions, members of the Uchida lab for discussions and C. Dulac for sharing resources. This work was supported by National Science Foundation grant CRCNS 1207833 (S.J.G.), US National Institutes of Health grants R01MH095953 (N.U.), R01MH101207 (N.U.), T32 MH020017 (C.K.S.) and T32GM007753 (C.K.S.), a Harvard Brain Science Initiative Seed grant (N.U.), a Harvard Mind Brain and Behavior faculty grant (S.J.G. and N.U.) and Fondation pour la Recherche Medicale grant SPE20150331860 (B.M.B.).
The authors declare no competing financial interests.
Integrated supplementary information
a, Task 2b was identical to Task 2, except for the constant ISI trial type. Odor B trials had a constant ISI of 2.0 s. b, The trough of the dopamine omission ‘dip’ was slightly later for 2 s variable ISI omission trials than for 2 s constant ISI omission trials, as indicated by the arrows. c, This slightly later omission dip is consistent with our belief state TD model. If our belief state TD model was adjusted to incorporate timing uncertainty (see Supplementary Fig. 12), the ‘dip’ for reward omission would be more blurred in time, particularly for the 2 s constant ISI omission response.
a-c, Anticipatory lick rate increased above baseline for all rewarded trial types (n = 26 sessions, F(1,50) > 150, P < 6 × 10⁻¹⁷ for odors A-C in Task 1, 1-way ANOVA; n = 9 sessions, F(1,16) > 60, P < 1 × 10⁻⁶ for odors A-C in Task 2, 1-way ANOVA; n = 22 sessions, F(1,42) > 100, P < 5 × 10⁻¹³ for odors A-C in Task 2b, 1-way ANOVA). Anticipatory lick rate did not increase above baseline for odor D trials (P > 0.10 for all Tasks, odor D, 1-way ANOVA). d, Average licking PSTH for all sessions across all animals trained on Task 1. e, Average licking PSTH for all sessions across all animals trained on Task 2 (including Task 2b).
Supplementary Figure 3 Are any aspects of licking behavior correlated with temporal modulation of dopamine RPEs?
Because the trends of licking behavior across time appeared to differ between Tasks 1 and 2 (see Supplementary Figure 2d,e), we asked whether licking patterns during each session were predictive of temporal modulation of dopamine RPEs recorded during that session. a,c, Example fits of a logistic function to lick data during individual sessions (see Methods). b,d, Correlations between various parameters of these logistic functions and the slope of phasic dopamine RPEs across time in Task 1 (b) and Task 2 (d). e, Correlations between the slope k of the logistic function and the slope of phasic dopamine RPEs across time as an aggregate in both Task 1 and Task 2 cohorts. f, Correlations between the slope of licking behavior across time just prior to reward delivery (200–0 ms) and the slope of phasic dopamine RPEs across time. Our analysis revealed a relationship between the steepness, k, of the logistic licking function and negative modulation over time for post-reward firing in Task 1 (b), and in an aggregate analysis of both Task 1 and Task 2 cohorts (e). However, the steepness of the logistic function in Task 1 is determined in the time prior to reward delivery (see a). None of these analyses revealed a behavioral correlate of shifting temporal expectation during the interval in which rewards could be delivered.
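To make the role of the steepness parameter k concrete, the following is a minimal sketch of a logistic lick-rate curve. The three-parameter form, the function name and the parameters `max_rate` and `t0` are assumptions for illustration; the exact parameterization used for the fits is given in Methods and is not reproduced here.

```python
import math


def logistic_lick_rate(t, max_rate, k, t0):
    """Lick rate at time t for an assumed logistic form.

    max_rate: asymptotic lick rate (hypothetical parameter)
    k:        steepness of the rise
    t0:       time of the half-maximal lick rate (hypothetical parameter)
    """
    return max_rate / (1.0 + math.exp(-k * (t - t0)))


# Compare a shallow and a steep rise around the same midpoint (values are made up).
for k in (2.0, 10.0):
    rates = [round(logistic_lick_rate(t, 8.0, k, 1.0), 2) for t in (0.5, 1.0, 1.5)]
    print(k, rates)
```

A larger k concentrates the rise in licking around t0, so a steep curve whose midpoint falls before the reward window is already near its plateau by the time rewards can be delivered.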
a,b, Schematic of recording sites for animals used in Task 1 (a) and Task 2 (b).
a, Raw trace for a single example neuron, demonstrating laser pulses and laser-evoked spikes. b, Near-identical match between the waveform of the average laser-evoked spike (blue) and the average spontaneous spike (black) for the same example neuron. c, Histogram of latencies to spike for the same example neuron. d, Light-evoked spikes for 10 Hz and 50 Hz pulses, for the same example neuron. e,f, Mean latency and standard deviation of light-evoked spikes, for all identified dopamine neurons. g, Waveform correlation between laser-evoked and spontaneous spikes, for all identified dopamine neurons.
Non-normalized PSTH for Tasks 1 (a) and 2 (b) aligned to the time of reward, with only a 20 ms box used for smoothing. Although the water valve opens at 0 ms, the phasic dopaminergic response does not begin until 50 ms after the water valve opens.
Supplementary Figure 7 Visualization of belief-state model parameters (see Computational Methods for details).
a,b, Schematic of the difference between modeling Task 1 and 2. Whereas cue can never lead back to the ITI sub-state in Task 1, this ‘hidden’ transition occurs in 10% of trials (omission trials) in Task 2. c,d, Observation matrices for Task 1 and 2. In a tiny percentage of cases in Task 2 (see Computational Methods), the animal observes ‘cue’ when in fact it transitions from the ITI back to the ITI during an omission trial. This occurs in such a small proportion of sub-state 15→15 transitions that it cannot be visualized in this map, so we have indicated values that differ between Tasks directly below the matrices. See Methods for how these values were computed. e,f, Transition matrices for Task 1 and 2. The probability of transitioning from the sub-state 15 back to the sub-state 15 is slightly higher in Task 2 (see Computational Methods), due to the presence of omission trials. The baseline probability of transitioning out of the ITI in either task is so small that it cannot be visualized here, so we have indicated values that differ between Tasks directly below the matrices. See Methods for how these values were computed.
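The way a transition matrix and an observation matrix combine into a belief state can be sketched as a single Bayesian update step. This is a toy illustration only: it uses two sub-states instead of the model's full set, observation probabilities attached to transitions, and made-up numbers; none of the values below are the fitted parameters from Computational Methods.

```python
def update_belief(belief, T, O, obs):
    """One belief-state update.

    belief: list, belief[i] = probability of currently being in sub-state i
    T:      T[i][j] = probability of moving from sub-state i to sub-state j
    O:      O[i][j][obs] = probability of observation `obs` on the i -> j transition
    """
    n = len(belief)
    unnorm = [
        sum(belief[i] * T[i][j] * O[i][j][obs] for i in range(n))
        for j in range(n)
    ]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnorm]


# Hypothetical toy model: sub-state 0 = ISI, sub-state 1 = ITI.
# Observations: 0 = nothing, 1 = cue.
T = [[0.0, 1.0],
     [0.2, 0.8]]
# Task 1-like case: a cue is only ever observed on the ITI -> ISI transition.
O_task1 = [[[1.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [1.0, 0.0]]]
# Task 2-like case: a cue is sometimes observed on a hidden ITI -> ITI transition.
O_task2 = [[[1.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [0.9, 0.1]]]

print(update_belief([0.0, 1.0], T, O_task1, obs=1))  # all belief moves to the ISI
print(update_belief([0.0, 1.0], T, O_task2, obs=1))  # some belief stays in the ITI
```

In the Task 2-like case, observing the cue no longer guarantees that the ISI has started, so the belief is split between the ISI and the ITI. This is the ambiguity that the hidden ITI-to-ITI transition introduces.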
Supplementary Figure 8 Belief-state model reproduces modulation of pre-reward and post-reward RPEs over variable ISI interval.
a,c, Pre-reward (a) and post-reward (c) RPEs generated by belief state model for different ISIs in Task 1. b,d, Pre-reward (b) and post-reward (d) RPEs generated by belief state model across different ISIs in Task 2.
In order to compare the magnitude of dopamine RPEs across different populations of neurons, post-reward RPEs must be normalized to the free water response (see Eshel et al., 2016). a, Post-reward RPEs in Task 1 are suppressed to about 60% of the free water response at the beginning of the interval (1.2 s), and are suppressed to about 40% of the free water response at the end of the interval (2.8 s). RPEs plotted are from the 29 neurons for which we included 2-5 free water deliveries during recording. b, Post-reward RPEs in Task 2 are suppressed to about 60% of the free water response at the beginning of the interval (1.2 s), and are suppressed to about 90% of the free water response at the end of the interval (2.8 s). RPEs plotted are from the 34 neurons for which we included 2-5 free water deliveries during recording. c, Schematic of data in (a) and (b), highlighting that the opposing trends of temporal modulation in Tasks 1 and 2 arise mostly from diverging patterns towards the end of the interval.
Citation for Eshel et al., 2016: Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share a common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).
Supplementary Figure 10 Belief-state model result is agnostic to precise shape of trained distribution.
When we trained our belief state TD model on a uniform distribution, the model produced negative modulation of post-reward RPEs over time in the 100% rewarded condition, and positive modulation of post-reward RPEs over time in the 90% rewarded condition. Therefore, while an assumption of our model is that mice perfectly learned the Gaussian distribution of reward timings, relaxing this assumption produces the same qualitative result.
Supplementary Figure 11 Testing belief-state TD model and TD with CSC and Reset model on other studies’ data.
a, Other studies trained animals to anticipate rewarding or reward-predicting events during a variable ISI interval in which the ISI was drawn from a flat probability distribution (refs. 10–12). In these studies, reward was delivered in 100% of trials. b,c, Both the belief state model (b) and TD with CSC and Reset model (c) produce negative modulation of RPEs over time. d, Exponential distribution of ISIs, similar to Fiorillo et al., 2008, Figure 5d, on which we trained our belief state TD model. e, Post-reward RPEs decreased over time for the belief state model trained on an exponential distribution of ISIs, similar to data in Fiorillo et al., 2008, Figure 5e.
a, Belief state at time t was blurred by a Gaussian distribution with standard deviation proportional to t, in order to capture temporal uncertainty. b,c, Belief state model with temporal uncertainty captures our data well. Post-reward RPEs are less suppressed for Odor C than for Odor B in both tasks, due to temporal uncertainty that grows with elapsed time.
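The blurring step in (a) can be sketched as follows. The sketch assumes a discrete belief vector indexed by sub-state, elapsed time measured in sub-state steps, and a proportionality constant `weber` between the standard deviation and t. The constant, its value and the function name are hypothetical and are not taken from the model described here.

```python
import math


def blur_belief(belief, t, weber=0.1):
    """Blur a belief vector with a Gaussian whose s.d. is proportional to t.

    weber is a hypothetical proportionality constant (sigma = weber * t).
    Each kernel is renormalized over the available sub-states, so the
    total probability mass is preserved even near the edges.
    """
    sigma = weber * t
    n = len(belief)
    if sigma <= 0.0:
        return list(belief)
    blurred = [0.0] * n
    for i, mass in enumerate(belief):
        if mass == 0.0:
            continue
        kernel = [math.exp(-0.5 * ((j - i) / sigma) ** 2) for j in range(n)]
        norm = sum(kernel)
        for j in range(n):
            blurred[j] += mass * kernel[j] / norm
    return blurred


# A belief fully concentrated on one sub-state, blurred early and late in the interval.
belief = [0.0] * 11
belief[5] = 1.0
early = blur_belief(belief, t=5)
late = blur_belief(belief, t=20)
print(round(max(early), 3), round(max(late), 3))
```

Because the standard deviation grows with t, the same belief is spread over more sub-states later in the interval, which is the sense in which temporal uncertainty grows with elapsed time.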
About this article
Cite this article
Starkweather, C., Babayan, B., Uchida, N. et al. Dopamine reward prediction errors reflect hidden-state inference across time. Nat Neurosci 20, 581–589 (2017). https://doi.org/10.1038/nn.4520