Abstract
Dopamine neuron activity is tied to the prediction error in temporal difference reinforcement learning models. These models make significant simplifying assumptions, particularly with regard to the structure of the predictions fed into the dopamine neurons, which consist of a single chain of timepoint states. Although this predictive structure can explain error signals observed in many studies, it cannot cope with settings where subjects might infer multiple independent events and outcomes. In the present study, we recorded dopamine neurons in the ventral tegmental area in such a setting to test the validity of the single-stream assumption. Rats were trained in an odor-based choice task, in which the timing and identity of one of several rewards delivered in each trial changed across trial blocks. This design revealed an error signaling pattern that requires the dopamine neurons to access and update multiple independent predictive streams reflecting the subject’s belief about timing and potentially unique identities of expected rewards.
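For readers unfamiliar with the single-stream assumption the abstract refers to, the following is a minimal, hypothetical sketch (not the authors' implementation) of temporal difference learning over a single chain of timepoint states, in which the per-timepoint TD error stands in for the dopamine-like signal:

```python
# Illustrative sketch only: the classic TDRL assumption represents the
# cue-reward interval as one chain of timepoint states, each with its
# own learned value, updated by the TD prediction error at that step.
def learn_chain(n_steps=10, reward_step=6, n_trials=200,
                alpha=0.1, gamma=0.98):
    """Run TD(0) over a single chain of timepoint states and return
    the learned per-timepoint values (all parameters are arbitrary
    illustration choices, not values from the paper)."""
    V = [0.0] * (n_steps + 1)          # value of each timepoint state
    for _ in range(n_trials):
        for t in range(n_steps):
            r = 1.0 if t == reward_step else 0.0
            delta = r + gamma * V[t + 1] - V[t]   # TD prediction error
            V[t] += alpha * delta
    return V

V = learn_chain()
```

Once the reward becomes predicted, the error at the reward timepoint shrinks toward zero while value backs up toward the cue; the question posed here is what such a single chain can and cannot represent when multiple independently timed rewards are expected.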
Data availability
The dataset and all scripts used in the present study are available at https://github.com/YouGTakahashi/ultra_delay_analysis for the unit analyses and at https://github.com/ajlangdon/multithreadTD for the modeling.
Code availability
All analysis scripts used in the present study are available at https://github.com/YouGTakahashi/ultra_delay_analysis for the unit analyses and at https://github.com/ajlangdon/multithreadTD for the modeling.
Acknowledgements
This work was supported by grant no. Z1A-DA000587 (to G.S.) at the Intramural Research Program of the National Institute on Drug Abuse, the Intramural Research Program of the National Institute of Mental Health (to A.J.L.) and by grant no. U01DA050647 (to A.J.L.) from the National Institute on Drug Abuse. The opinions expressed in this article are the authors’ own and do not reflect the view of the National Institutes of Health/Department of Health and Human Services.
Author information
Contributions
Y.K.T., T.A.S. and G.S. designed the experiments. Y.K.T. and L.E.M. conducted the behavioral training and single-unit recording. S.K.H. and A.J.L. conducted the modeling. Y.K.T., A.J.L. and G.S. interpreted the data and wrote the manuscript with input from the other authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Neuroscience thanks Okihide Hikosaka, Kevin Miller and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Activity evoked by the odor cues changes with value and free-choice behavior across trial blocks, and does so similarly for the two block types.
(a) Average firing in the reward-responsive dopamine neurons during presentation of the high and low value cues. Average firing is plotted for the first and last 5 trials of the Delay-Only and Delay-Flavor blocks. Activity increased to the high value cue, paired with the early reward, across each block (three-way ANOVA, trial × value, F9,1071 = 9.46, p < 0.01), and there was no difference in the pattern across switch types (F’s < 1.86, p’s > 0.06). Error bars represent SEM. n = 120 cells collected from 8 independent rats. (b) Relationship between the change in firing to the high and low value cues and the change in free-choice behavior. The difference in firing to the high and low value cues in the first (blue) and last (red) 5 trials of all the blocks is plotted against the percentage of choices of the short reward during these same trials. The two measures were strongly correlated (scatter plot), reflecting the shift in both measures from early to late (blue to red in the distribution plots) within each block (two-sided Wilcoxon rank-sum test, cue, p < 0.01; choice, p < 0.01). n = 120 cells collected from 8 independent rats. Note that all models implemented in the main text also produced changes in signal to the cues that differed by value (see Extended Data Fig. 5).
Extended Data Fig. 2 Changes in the activity of reward-responsive dopamine neurons to omission of a delayed reward in delay-only and delay-flavor blocks do not depend on the order of switches.
Displays as in Fig. 5a of the main text, except that (a) shows data from blocks 2 and 3, in which the delay-flavor block preceded the delay-only block, and (b) shows data from blocks 4 and 5, in which the delay-only block preceded the delay-flavor block. Statistics in each panel indicate the results of the Wilcoxon signed-rank test (p) and the average difference score (u). Comparisons of the distributions in panels a and b showed that they were not different (two-sided Wilcoxon rank-sum test) within either delay-only (p = 0.48) or delay-flavor switches (p = 0.60). n = 120 cells collected from 8 independent rats.
Extended Data Fig. 3 Changes in the activity of reward-responsive dopamine neurons to omission of a delayed reward in delay-only and delay-flavor blocks do not depend on non-significant numerical differences in subjects’ consumption of the two flavors (Fig. 1c).
Displays as in Fig. 5a of the main text, except that (a) shows data involving omission of the numerically higher reward and (b) shows data involving omission of the numerically lower reward. Statistics in each panel indicate the results of the Wilcoxon signed-rank test (p) and the average difference score (u). Comparisons of the distributions in panels a and b showed that they were not different (two-sided Wilcoxon rank-sum test) within either delay-only (p = 0.29) or delay-flavor switches (p = 0.88). n = 120 cells collected from 8 independent rats.
Extended Data Fig. 4 Licking behavior during omission of the long reward.
(a) Average lick rate in the 2 s after omission of the delayed reward, on the first trial and averaged over the last 5 trials, in the Delay-Only (dark blue) and Delay-Flavor (light blue) switches. A two-way ANOVA (early/late × switch type) revealed a significant main effect of early/late (F1,51 = 9.47, p < 0.01) and a significant interaction between early/late and switch type (F1,51 = 5.18, p < 0.05). A step-down comparison revealed that lick rates on the first trial were significantly higher than those on the last trials after a Delay-Flavor switch (F1,51 = 9.90, p < 0.01, light blue), but not after a Delay-Only switch (F1,51 = 0.54, p > 0.10, dark blue). Lick rates on the first trial after a Delay-Flavor switch were significantly higher than those on the first trial after a Delay-Only switch (two-way ANOVA, F1,51 = 4.11, p = 0.04), but not during the last 5 trials (F1,51 = 0.85, p > 0.10). Error bars represent SEM. n = 53 sessions collected from 8 independent rats. (b) Distributions of difference scores comparing lick rates on the first and last trials after Delay-Only (left) and Delay-Flavor (right) switches. The numbers in each panel indicate the results of the two-sided Wilcoxon signed-rank test (p) and the average difference score (u). n = 53 sessions collected from 8 independent rats.
Extended Data Fig. 5 Simulated prediction error response to the cue in high and low value blocks for each TDRL model.
Simulated reward prediction error responses to the cue in the (a) single-thread TDRL model without reset, (b) single-thread TDRL model with reset, (c) single-thread TDRL model with sequential reset, (d) single-thread TDRL model with delay-specific sequential reset and (e) multithread TDRL model, for each of the delay and delay-flavor block switches. All models predicted qualitatively similar changes in activity to the high value (blue) and low value (red) cues across the first and last 5 trials of delay-only and delay-flavor block switches.
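To make the "reset" distinction among these model variants concrete, here is a toy sketch (an illustrative assumption, not the authors' code; it omits the sequential and delay-specific variants) of how clearing the learned values of a single-thread chain at a block switch changes the first post-switch prediction errors:

```python
def td_trial(V, reward_step, alpha=0.1, gamma=0.98):
    """One TD(0) sweep over a single chain of timepoint states.
    Returns the prediction error at each timepoint (parameters are
    arbitrary illustration choices, not values from the paper)."""
    deltas = []
    for t in range(len(V) - 1):
        r = 1.0 if t == reward_step else 0.0
        d = r + gamma * V[t + 1] - V[t]   # TD error at timepoint t
        deltas.append(d)
        V[t] += alpha * d
    return deltas

# Train to asymptote with reward delivered at timepoint 6.
V = [0.0] * 11
for _ in range(300):
    td_trial(V, reward_step=6)

# Block switch: the reward now arrives at timepoint 3.
no_reset = td_trial(list(V), reward_step=3)       # values carried over
with_reset = td_trial([0.0] * 11, reward_step=3)  # values cleared
```

Without a reset, the first post-switch trial produces a positive error at the new reward time and a large negative (omission) error at the old one; with a reset, the stale prediction is discarded, so the omission error at the old reward time is absent.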
About this article
Cite this article
Takahashi, Y.K., Stalnaker, T.A., Mueller, L.E. et al. Dopaminergic prediction errors in the ventral tegmental area reflect a multithreaded predictive model. Nat Neurosci 26, 830–839 (2023). https://doi.org/10.1038/s41593-023-01310-x