Dopamine reward prediction errors reflect hidden-state inference across time

Starkweather, Clara Kwon; Babayan, Benedicte M; Uchida, Naoshige; Gershman, Samuel J

doi:10.1038/nn.4520

Article
Published: 06 March 2017

Nature Neuroscience volume 20, pages 581–589 (2017)

Abstract

Midbrain dopamine neurons signal reward prediction error (RPE), or actual minus expected reward. The temporal difference (TD) learning model has been a cornerstone in understanding how dopamine RPEs could drive associative learning. Classically, TD learning imparts value to features that serially track elapsed time relative to observable stimuli. In the real world, however, sensory stimuli provide ambiguous information about the hidden state of the environment, leading to the proposal that TD learning might instead compute a value signal based on an inferred distribution of hidden states (a 'belief state'). Here we asked whether dopaminergic signaling supports a TD learning framework that operates over hidden states. We found that dopamine signaling showed a notable difference between two tasks that differed only with respect to whether reward was delivered in a deterministic manner. Our results favor an associative learning rule that combines cached values with hidden-state inference.
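The two quantities named in the abstract can be made concrete with a minimal Python sketch. This is not the authors' code: it assumes the standard one-step TD error and a value that is linear in the belief state, and the discount factor `gamma` and the function names are illustrative choices.

```python
def td_error(reward, value_now, value_next, gamma=0.98):
    """One-step TD reward prediction error: r + gamma * V(next) - V(now).

    gamma is an assumed discount factor, not a value taken from the paper.
    """
    return reward + gamma * value_next - value_now


def belief_value(weights, belief):
    """Value of a belief state, assumed linear: sum_i w_i * b_i.

    `belief` is a probability distribution over hidden states and
    `weights` holds one cached value per hidden state.
    """
    return sum(w * b for w, b in zip(weights, belief))


# A fully expected reward of size 1 yields no error; an omitted one yields a dip.
expected = td_error(reward=1.0, value_now=1.0, value_next=0.0)
omitted = td_error(reward=0.0, value_now=1.0, value_next=0.0)
```

With a classical complete serial compound representation the value is read from a single active time feature, whereas in the belief-state variant the same cached weights are averaged under the inferred distribution over hidden states.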

**Figure 2: Averaged dopamine activity in tasks 1 and 2 shows different patterns of modulation over a range of ISIs.**

**Figure 3: Individual dopamine neurons show opposing patterns of post-reward firing in tasks 1 and 2.**

**Figure 4: TD with CSC model, with or without reset, is inconsistent with our data.**

**Figure 5: Belief-state model is consistent with our data.**

**Figure 6: Value signals in the belief-state model of tasks 1 and 2.**

**Figure 7: Hazard and subjective hazard functions for tasks 1 and 2.**

References

1. Schultz, W., Dayan, P. & Montague, P.R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
2. Bayer, H.M. & Glimcher, P.W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47, 129–141 (2005).
3. Cohen, J.Y., Haesler, S., Vong, L., Lowell, B.B. & Uchida, N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482, 85–88 (2012).
4. Eshel, N. et al. Arithmetic and local circuitry underlying dopamine prediction errors. Nature 525, 243–246 (2015).
5. Sutton, R.S. & Barto, A.G. in Learning and Computational Neuroscience: Foundations of Adaptive Networks (eds. Gabriel, M. and Moore, J.) 497–537 (MIT Press, 1991).
6. Gershman, S.J., Blei, D.M. & Niv, Y. Context, learning and extinction. Psychol. Rev. 117, 197–209 (2010).
7. Gershman, S.J., Norman, K.A. & Niv, Y. Discovering latent causes in reinforcement learning. Curr. Opin. Behav. Sci. 5, 43–50 (2015).
8. Rao, R.P.N. Decision making under uncertainty: a neural model based on partially observable Markov decision processes. Front. Comput. Neurosci. 4, 146 (2010).
9. Daw, N.D., Courville, A.C. & Touretzky, D.S. Representation and timing in theories of the dopamine system. Neural Comput. 18, 1637–1677 (2006).
10. Fiorillo, C.D., Newsome, W.T. & Schultz, W. The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci. 11, 966–973 (2008).
11. Pasquereau, B. & Turner, R.S. Dopamine neurons encode errors in predicting movement trigger occurrence. J. Neurophysiol. 113, 1110–1123 (2015).
12. Nomoto, K., Schultz, W., Watanabe, T. & Sakagami, M. Temporally extended dopamine responses to perceptually demanding reward-predictive stimuli. J. Neurosci. 30, 10692–10702 (2010).
13. Tian, J. & Uchida, N. Habenula lesions reveal that multiple mechanisms underlie dopamine prediction errors and prediction-error-based learning. Neuron 87, 1304–1316 (2015).
14. Hamid, A.A. et al. Mesolimbic dopamine signals the value of work. Nat. Neurosci. 19, 117–126 (2016).
15. Kobayashi, S. & Schultz, W. Influence of reward delays on responses of dopamine neurons. J. Neurosci. 28, 7837–7846 (2008).
16. Jo, Y.S. & Mizumori, S.J.Y. Prefrontal regulation of neuronal activity in the ventral tegmental area. Cereb. Cortex 26, 4057–4068 (2016).
17. Sutton, R.S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
18. Suri, R.E. & Schultz, W. A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience 91, 871–890 (1999).
19. Suri, R.E. & Schultz, W. Learning of sequential movements by neural network model with dopamine-like reinforcement signal. Exp. Brain Res. 121, 350–354 (1998).
20. Hollerman, J.R. & Schultz, W. Dopamine neurons report an error in the temporal prediction of reward during learning. Nat. Neurosci. 1, 304–309 (1998).
21. Brown, J., Bullock, D. & Grossberg, S. How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J. Neurosci. 19, 10502–10511 (1999).
22. Oswal, A., Ogden, M. & Carpenter, R.H.S. The time course of stimulus expectation in a saccadic decision task. J. Neurophysiol. 97, 2722–2730 (2007).
23. Janssen, P. & Shadlen, M.N. A representation of the hazard rate of elapsed time in macaque area LIP. Nat. Neurosci. 8, 234–241 (2005).
24. Tsunoda, Y. & Kakei, S. Reaction-time changes with the hazard rate for a behaviorally relevant event when monkeys perform a delayed wrist-movement task. Neurosci. Lett. 433, 152–157 (2008).
25. Ghose, G.M. & Maunsell, J.H. Attentional modulation in visual cortex depends on task timing. Nature 419, 616–620 (2002).
26. Klein-Flügge, M.C., Hunt, L.T., Bach, D.R., Dolan, R.J. & Behrens, T.E. Dissociable reward and timing signals in human midbrain and ventral striatum. Neuron 72, 654–664 (2011).
27. Friston, K. A theory of cortical responses. Phil. Trans. R. Soc. Lond. B 360, 815–836 (2005).
28. Lee, T.S. & Mumford, D. Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 20, 1434–1448 (2003).
29. Kakade, S. & Dayan, P. Acquisition and extinction in autoshaping. Psychol. Rev. 109, 533–544 (2002).
30. Stalnaker, T.A., Berg, B., Aujla, N. & Schoenbaum, G. Cholinergic interneurons use orbitofrontal input to track beliefs about current state. J. Neurosci. 36, 6242–6257 (2016).
31. Takahashi, Y.K., Langdon, A.J., Niv, Y. & Schoenbaum, G. Temporal specificity of reward-prediction errors signaled by putative dopamine neurons in rat VTA depends on ventral striatum. Neuron 91, 182–193 (2016).
32. Ludvig, E.A., Sutton, R.S. & Kehoe, E.J. Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Comput. 20, 3034–3054 (2008).
33. Gershman, S.J., Moustafa, A.A. & Ludvig, E.A. Time representation in reinforcement-learning models of the basal ganglia. Front. Comput. Neurosci. 7, 194 (2014).
34. Mello, G.B.M., Soares, S. & Paton, J.J. A scalable population code for time in the striatum. Curr. Biol. 25, 1113–1122 (2015).
35. Backman, C. et al. Characterization of a mouse strain expressing Cre recombinase from the 3′ untranslated region of the dopamine transporter locus. Genesis 44, 383–390 (2006).
36. National Research Council. Guide for the Care and Use of Laboratory Animals 8th edn. (The National Academies Press, 2011).
37. Atasoy, D., Aponte, Y., Su, H.H. & Sternson, S.M. A FLEX switch targets channelrhodopsin-2 to multiple cell types for imaging and long-range circuit mapping. J. Neurosci. 28, 7025–7030 (2008).
38. Uchida, N. & Mainen, Z.F. Speed and accuracy of olfactory discrimination in the rat. Nat. Neurosci. 6, 1224–1229 (2003).
39. Schmitzer-Torbert, N., Jackson, J., Henze, D., Harris, K. & Redish, A.D. Quantitative measures of cluster quality for use in extracellular recordings. Neuroscience 131, 1–11 (2005).
40. Lima, S.Q., Hromádka, T., Znamenskiy, P. & Zador, A.M. PINP: a new method of tagging neuronal populations for identification during in vivo electrophysiological recording. PLoS One 4, e6099 (2009).
41. Kvitsiani, D. et al. Distinct behavioral and network correlates of two interneuron types in prefrontal cortex. Nature 498, 363–366 (2013).

Acknowledgements

We thank R. Born, J. Assad and C. Harvey for discussions, members of the Uchida lab for discussions and C. Dulac for sharing resources. This work was supported by National Science Foundation grant CRCNS 1207833 (S.J.G.), US National Institutes of Health grants R01MH095953 (N.U.), R01MH101207 (N.U.), T32 MH020017 (C.K.S.) and T32GM007753 (C.K.S.), a Harvard Brain Science Initiative Seed grant (N.U.), a Harvard Mind Brain and Behavior faculty grant (S.J.G. and N.U.) and Fondation pour la Recherche Medicale grant SPE20150331860 (B.M.B.).

Author information

Authors and Affiliations

Department of Molecular and Cellular Biology, Center for Brain Science, Harvard University, Cambridge, Massachusetts, USA
Clara Kwon Starkweather, Benedicte M Babayan & Naoshige Uchida
Department of Psychology, Center for Brain Science, Harvard University, Cambridge, Massachusetts, USA
Benedicte M Babayan & Samuel J Gershman

Authors

Clara Kwon Starkweather
Benedicte M Babayan
Naoshige Uchida
Samuel J Gershman

Contributions

C.K.S. and N.U. designed the recording experiments and behavioral task; C.K.S. collected and analyzed data; C.K.S. and S.J.G. constructed the computational models; and C.K.S., B.M.B., S.J.G. and N.U. wrote the manuscript.

Corresponding authors

Correspondence to Naoshige Uchida or Samuel J Gershman.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Task 2b.

a, Task 2b was identical to Task 2, except for the constant ISI trial type. Odor B trials had a constant ISI of 2.0 s. b, The trough of the dopamine omission ‘dip’ was slightly later for 2 s variable ISI omission trials than for 2 s constant ISI omission trials, as indicated by the arrows. c, This slightly later omission dip is consistent with our belief state TD model. If our belief state TD model were adjusted to incorporate timing uncertainty (see Supplementary Fig. 12), the ‘dip’ for reward omission would be more blurred in time, particularly for the 2 s constant ISI omission response.

Supplementary Figure 2 Mice learn to lick in anticipation of rewarded odors.

a-c, Anticipatory lick rate increased above baseline for all rewarded trial types (n = 26 sessions, F_1,50 > 150, P < 6 × 10^-17 for odors A-C in Task 1, 1-way ANOVA; n = 9 sessions, F_1,16 > 60, P < 1 × 10^-6 for odors A-C in Task 2, 1-way ANOVA; n = 22 sessions, F_1,42 > 100, P < 5 × 10^-13 for odors A-C in Task 2b, 1-way ANOVA). Anticipatory lick rate did not increase above baseline for odor D trials (P > 0.10 for all Tasks, odor D, 1-way ANOVA). d, Average licking PSTH for all sessions across all animals trained on Task 1. e, Average licking PSTH for all sessions across all animals trained on Task 2 (including Task 2b).

Supplementary Figure 3 Are any aspects of licking behavior correlated with temporal modulation of dopamine RPEs?

Because the trends of licking behavior across time appeared to differ between Tasks 1 and 2 (see Supplementary Figure 2d,e), we asked whether licking patterns during each session were at all predictive of temporal modulation of dopamine RPEs recorded during that session. a,c, Example fits of a logistic function to lick data during individual sessions (see Methods). b,d, Correlations between various parameters of these logistic functions and the slope of phasic dopamine RPEs across time in Task 1 (b) and Task 2 (d). e, Correlations between the slope k of the logistic function and the slope of phasic dopamine RPEs across time as an aggregate in both Task 1 and Task 2 cohorts. f, Correlations between the slope of licking behavior across time just prior to reward delivery (200 ms to 0 ms before reward) and the slope of phasic dopamine RPEs across time. Our analysis revealed a relationship between the steepness, k, of the logistic licking function and negative modulation over time for post-reward firing in Task 1 (b), and in an aggregate analysis of both Task 1 and Task 2 cohorts (e). However, the steepness of the logistic function in Task 1 is determined in the time prior to reward delivery (see a). None of these analyses revealed a behavioral correlate of shifting temporal expectation during the interval in which rewards could be delivered.

Supplementary Figure 4 Recording sites.

a,b, Schematic of recording sites for animals used in Task 1 (a) and Task 2 (b).

Supplementary Figure 5 Optogenetic identification of dopamine neurons.

a, Raw trace for single example neuron, demonstrating laser pulses and laser-evoked spikes. b, Near-identical match between waveform of average laser-evoked spike (blue) and average spontaneous spike (black) for same example neuron. c, Histogram of latencies to spike for same example neuron. d, Light-evoked spikes for 10Hz and 50Hz pulses, for same example neuron. e,f, Mean latency and standard deviation of light-evoked spikes, for all identified dopamine neurons. g, Waveform correlation between laser-evoked and spontaneous spikes, for all identified dopamine neurons.

Supplementary Figure 6 Phasic post-reward RPEs begin 50 ms following water-valve opening.

Non-normalized PSTHs for Tasks 1 (a) and 2 (b), aligned to the time of reward, with only a 20 ms box used for smoothing. Although the water valve opens at 0 ms, the phasic dopaminergic response does not begin until 50 ms after the valve opens.
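As a rough illustration of what a reward-aligned PSTH with a 20 ms box filter involves, the following Python sketch bins spikes at 1 ms and applies a moving average. The 1 ms bin size, the edge handling and both function names are assumptions made for this example; the source does not describe the implementation.

```python
import math


def psth(spike_times_ms, n_trials, start_ms, stop_ms):
    """Trial-averaged firing rate (spikes/s) in 1 ms bins.

    spike_times_ms: spike times pooled over trials, relative to reward (ms).
    The 1 ms bin size is an assumption of this sketch.
    """
    counts = [0] * (stop_ms - start_ms)
    for t in spike_times_ms:
        if start_ms <= t < stop_ms:
            counts[math.floor(t - start_ms)] += 1
    # spikes per bin per trial, converted from per-ms to per-second
    return [c / n_trials * 1000.0 for c in counts]


def boxcar_smooth(rate, width):
    """Moving average over `width` bins; edge bins average fewer samples."""
    out = []
    for i in range(len(rate)):
        lo = max(0, i - width // 2)
        hi = min(len(rate), i - width // 2 + width)
        window = rate[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Hypothetical spike times from two trials, in ms after valve opening.
raw = psth([0.5, 0.7, 3.2], n_trials=2, start_ms=0, stop_ms=5)
smooth = boxcar_smooth(raw, width=2)  # width=20 would match a 20 ms box
```

A narrow box keeps the response onset sharp, which matters when the quantity of interest is a latency of a few tens of milliseconds.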

Supplementary Figure 7 Visualization of belief-state model parameters (see Computational Methods for details).

a,b, Schematic of the difference between modeling Task 1 and 2. Whereas cue can never lead back to the ITI sub-state in Task 1, this ‘hidden’ transition occurs in 10% of trials (omission trials) in Task 2. c,d, Observation matrices for Task 1 and 2. In a tiny percentage of cases in Task 2 (see Computational Methods), the animal observes ‘cue’ when in fact it transitions from the ITI back to the ITI during an omission trial. This occurs in such a small proportion of sub-state 15→15 transitions that it cannot be visualized in this map, so we have indicated values that differ between Tasks directly below the matrices. See Methods for how these values were computed. e,f, Transition matrices for Task 1 and 2. The probability of transitioning from the sub-state 15 back to the sub-state 15 is slightly higher in Task 2 (see Computational Methods), due to the presence of omission trials. The baseline probability of transitioning out of the ITI in either task is so small that it cannot be visualized here, so we have indicated values that differ between Tasks directly below the matrices. See Methods for how these values were computed.
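The role of the transition and observation matrices can be sketched with a two-state toy version of the belief update. The legend describes 15 sub-states; the example below collapses them to a single ISI state and a single ITI state, and every probability in it is an illustrative number chosen for this sketch, not a parameter from the paper. It assumes, as the legend implies, that the observation probability depends on the transition taken.

```python
ISI, ITI = 0, 1              # hidden states (toy collapse of the sub-states)
NULL, CUE, REWARD = 0, 1, 2  # observations


def update_belief(belief, T, O, obs):
    """Bayesian belief update: b'(j) ∝ sum_i b(i) * T[i][j] * O[i][j][obs]."""
    n = len(belief)
    new = [sum(belief[i] * T[i][j] * O[i][j][obs] for i in range(n))
           for j in range(n)]
    total = sum(new)
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return [x / total for x in new]


# Toy transition matrix: T[i][j] = P(next state j | current state i).
T = [[0.5, 0.5],
     [0.1, 0.9]]

# O[i][j] = distribution over (NULL, CUE, REWARD) for the transition i -> j.
O_task1 = [[[1, 0, 0], [0, 0, 1]],   # ISI->ISI: nothing; ISI->ITI: reward
           [[0, 1, 0], [1, 0, 0]]]   # ITI->ISI: cue;     ITI->ITI: nothing
# Task 2 toy variant: a cue can occasionally appear while staying in the ITI
# (an omission trial). 1/81 is chosen so the posterior below comes out at 0.9.
O_task2 = [[[1, 0, 0], [0, 0, 1]],
           [[0, 1, 0], [80 / 81, 1 / 81, 0]]]

start = [0.0, 1.0]  # certain of being in the ITI before the cue
b1 = update_belief(start, T, O_task1, CUE)  # cue guarantees the ISI
b2 = update_belief(start, T, O_task2, CUE)  # cue leaves some belief in the ITI
```

In the toy Task 1 the cue moves all belief into the ISI, whereas in the toy Task 2 a fraction of belief stays in the ITI because the same observation is also produced on omission trials.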

Supplementary Figure 8 Belief-state model reproduces modulation of pre-reward and post-reward RPEs over variable ISI interval.

a,c, Pre-reward (a) and post-reward (c) RPEs generated by belief state model for different ISIs in Task 1. b,d, Pre-reward (b) and post-reward (d) RPEs generated by belief state model across different ISIs in Task 2.

Supplementary Figure 9 Post-reward RPEs normalized to free-water response.

In order to compare the magnitude of dopamine RPEs across different populations of neurons, post-reward RPEs must be normalized to the free water response (see Eshel et al., 2016). a, Post-reward RPEs in Task 1 are suppressed to about 60% of the free water response at the beginning of the interval (1.2s), and are suppressed to about 40% of the free water response at the end of the interval (2.8s). RPEs plotted are from the 29 neurons for which we included 2-5 free water deliveries during recording. b, Post-reward RPEs in Task 2 are suppressed to about 60% of the free water response at the beginning of the interval (1.2s), and are suppressed to about 90% of the free water response at the end of the interval (2.8s). RPEs plotted are from the 34 neurons for which we included 2-5 free water deliveries during recording. c, Schematic of data in (a) and (b), highlighting that the opposing trends of temporal modulation in Tasks 1 and 2 arise mostly from diverging patterns towards the end of the interval.

Citation for Eshel et al., 2016: Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share a common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).

Supplementary Figure 10 Belief-state model result is agnostic to precise shape of trained distribution.

When we trained our belief state TD model on a uniform distribution, the model produced negative modulation of post-reward RPEs over time in the 100% rewarded condition, and positive modulation of post-reward RPEs over time in the 90% rewarded condition. Therefore, while an assumption of our model is that mice perfectly learned the Gaussian distribution of reward timings, relaxing this assumption produces the same qualitative result.
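One ingredient of this result, the inference step alone, can be worked through by hand for a uniform timing distribution. The Python sketch below applies Bayes' rule to the belief that the animal is still in the ISI given that no reward has arrived yet. It is not the full belief-state TD model and does not reproduce the figure; the choice of nine equally likely reward times is an assumption of the example.

```python
def belief_in_isi(p_reward, n_times):
    """P(still in ISI | no reward before the k-th possible reward time).

    Reward times are assumed uniform over `n_times` discrete slots.
    p_reward is the fraction of cued trials that are rewarded.
    """
    beliefs = []
    for k in range(n_times):
        # P(reward falls on slot k or later | trial is rewarded)
        survive = (n_times - k) / n_times
        rewarded = p_reward * survive
        beliefs.append(rewarded / (rewarded + (1.0 - p_reward)))
    return beliefs


task1 = belief_in_isi(p_reward=1.0, n_times=9)  # 100% rewarded
task2 = belief_in_isi(p_reward=0.9, n_times=9)  # 90% rewarded
```

Under these assumptions the belief stays at 1 throughout the interval in the 100% rewarded condition, but falls from about 0.9 to about 0.5 in the 90% rewarded condition, because each moment without reward makes an omission trial more plausible.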

Supplementary Figure 11 Testing belief-state TD model and TD with CSC and Reset model on other studies’ data.

a, Other studies trained animals to anticipate rewarding or reward-predicting events during a variable ISI drawn from a flat probability distribution (refs. 10–12). In these studies, reward was delivered in 100% of trials. b,c, Both the belief state model (b) and the TD with CSC and Reset model (c) produce negative modulation of RPEs over time. d, Exponential distribution of ISIs, similar to Fiorillo et al., 2008, Figure 5d, on which we trained our belief state TD model. e, Post-reward RPEs decreased over time for the belief state model trained on the exponential distribution of ISIs, similar to the data in Fiorillo et al., 2008, Figure 5e.

Supplementary Figure 12 Belief-state model can be adapted to incorporate temporal uncertainty.

a, Belief state at time t was blurred by a Gaussian distribution with standard deviation proportional to t, in order to capture temporal uncertainty. b,c, Belief state model with temporal uncertainty captures our data well. Post-reward RPEs are less suppressed for Odor C than for Odor B in both tasks, due to temporal uncertainty that grows with elapsed time.
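The blurring step in panel a can be sketched as a convolution of the belief vector with a Gaussian whose width grows with elapsed time. In the Python example below, the proportionality constant `scale`, the use of sub-state index as the time axis and the renormalisation of the kernel at the edges are all assumptions of this sketch; the legend states only that the standard deviation is proportional to t.

```python
import math


def blur_belief(belief, t, scale=0.1):
    """Blur a belief vector with a Gaussian of standard deviation scale * t.

    belief: probabilities over sub-states ordered by elapsed time.
    t: elapsed time, in the same units as the sub-state index.
    scale: assumed constant of proportionality (hypothetical value).
    """
    sigma = scale * t
    if sigma <= 0:
        return list(belief)
    n = len(belief)
    blurred = [0.0] * n
    for i, mass in enumerate(belief):
        if mass == 0:
            continue
        kernel = [math.exp(-0.5 * ((j - i) / sigma) ** 2) for j in range(n)]
        norm = sum(kernel)  # renormalise so probability mass is conserved
        for j in range(n):
            blurred[j] += mass * kernel[j] / norm
    return blurred


point = [0.0] * 21
point[10] = 1.0                    # certain about the current sub-state
early = blur_belief(point, t=5)    # sigma = 0.5 sub-states
late = blur_belief(point, t=20)    # sigma = 2.0 sub-states
```

The same point belief is spread over more sub-states when it is blurred later in the interval, which is the sense in which temporal uncertainty grows with elapsed time.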

About this article

Cite this article

Starkweather, C., Babayan, B., Uchida, N. et al. Dopamine reward prediction errors reflect hidden-state inference across time. Nat Neurosci 20, 581–589 (2017). https://doi.org/10.1038/nn.4520

Received: 17 July 2016
Accepted: 25 January 2017
Published: 06 March 2017
Issue Date: April 2017
DOI: https://doi.org/10.1038/nn.4520
