Abstract
The hypothesis that midbrain dopamine (DA) neurons broadcast a reward prediction error (RPE) is among the great successes of computational neuroscience. However, recent results contradict a core aspect of this theory: specifically, that the neurons convey a scalar, homogeneous signal. While the predominant family of extensions to the RPE model replicates the classic model in multiple parallel circuits, we argue that these models are ill-suited to explain reports of heterogeneity in task variable encoding across DA neurons. Instead, we introduce a complementary ‘feature-specific RPE’ model, positing that individual ventral tegmental area DA neurons report RPEs for different aspects of an animal’s moment-to-moment situation. Further, we show how our framework can be extended to explain patterns of heterogeneity in action responses reported among substantia nigra pars compacta DA neurons. This theory reconciles new observations of DA heterogeneity with classic ideas about RPE coding while also providing a new perspective on how the brain performs reinforcement learning in high-dimensional environments.
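To make the core idea concrete, the following minimal sketch (ours, not the paper's implementation; the feature activations, weights and the even split of reward across features are illustrative assumptions) shows how a vector of feature-specific RPEs relates to the classic scalar RPE:

```python
import numpy as np

# Illustrative sketch of feature-specific RPEs (hypothetical numbers; the
# paper's model is a trained deep RL agent, not this hand-written example).
gamma = 0.9                              # discount factor
w = np.array([0.5, 1.2, -0.3])           # per-feature value weights
phi = np.array([1.0, 0.0, 2.0])          # feature activations, current state
phi_next = np.array([0.0, 1.0, 1.0])     # feature activations, next state
r = 1.0                                  # reward at the transition

# One TD error per feature's value channel; the reward is split evenly
# across features purely for illustration.
vector_rpe = r / len(w) + gamma * w * phi_next - w * phi

# Summing the components recovers the classic scalar RPE,
# delta = r + gamma * V(s') - V(s), with V(s) = w . phi(s).
scalar_rpe = r + gamma * w @ phi_next - w @ phi
assert np.isclose(vector_rpe.sum(), scalar_rpe)
print(vector_rpe, scalar_rpe)
```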
Data availability
Both the model and neural data that support the findings of this study are available on Figshare at https://doi.org/10.6084/m9.figshare.25752450 (ref. 83). A description of the data can be found with the code at https://github.com/ndawlab/vectorRPE/. Source data are provided with this paper.
Code availability
Code used for the deep RL model, VR environment and analysis of the data to reproduce the figures can be found at https://github.com/ndawlab/vectorRPE/.
References
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Montague, P. R., Dayan, P. & Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).
Arbuthnott, G. W. & Wickens, J. Space, time and dopamine. Trends Neurosci. 30, 62–69 (2007).
Matsuda, W. et al. Single nigrostriatal dopaminergic neurons form widely spread and highly dense axonal arborizations in the neostriatum. J. Neurosci. 29, 444–453 (2009).
Schultz, W. Predictive reward signal of dopamine neurons. J. Neurophysiol. 80, 1–27 (1998).
Parker, N. F. et al. Reward and choice encoding in terminals of midbrain dopamine neurons depends on striatal target. Nat. Neurosci. 19, 845–854 (2016).
Lee, R. S., Mattar, M. G., Parker, N. F., Witten, I. B. & Daw, N. D. Reward prediction error does not explain movement selectivity in DMS-projecting dopamine neurons. eLife 8, e42992 (2019).
Choi, J. Y. et al. A comparison of dopaminergic and cholinergic populations reveals unique contributions of VTA dopamine neurons to short-term memory. Cell Rep. 33, 108492 (2020).
Engelhard, B. et al. Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons. Nature 570, 509–513 (2019).
Lerner, T. N. et al. Intact-brain analyses reveal distinct information carried by SNc dopamine subcircuits. Cell 162, 635–647 (2015).
Collins, A. L. & Saunders, B. T. Heterogeneity in striatal dopamine circuits: form and function in dynamic reward seeking. J. Neurosci. Res. 98, 1046–1069 (2020).
Verharen, J. P. H., Zhu, Y. & Lammel, S. Aversion hot spots in the dopamine system. Curr. Opin. Neurobiol. 64, 46–52 (2020).
Hassan, A. & Benarroch, E. E. Heterogeneity of the midbrain dopamine system. Neurology 85, 1795–1805 (2015).
Marinelli, M. & McCutcheon, J. E. Heterogeneity of dopamine neuron activity across traits and states. Neuroscience 282, 176–197 (2014).
Kremer, Y., Flakowski, J., Rohner, C. & Lüscher, C. Context-dependent multiplexing by individual VTA dopamine neurons. J. Neurosci. 40, 7489–7509 (2020).
Howe, M. W. & Dombeck, D. A. Rapid signalling in distinct dopaminergic axons during locomotion and reward. Nature 535, 505–510 (2016).
Anderegg, A., Poulin, J.-F. & Awatramani, R. Molecular heterogeneity of midbrain dopaminergic neurons—moving toward single cell resolution. FEBS Lett. 589, 3714–3726 (2015).
Barter, J. W. et al. Beyond reward prediction errors: the role of dopamine in movement kinematics. Front. Integr. Neurosci. 9, 39 (2015).
Cai, L. X. et al. Distinct signals in medial and lateral VTA dopamine neurons modulate fear extinction at different times. eLife 9, e54936 (2020).
Hamid, A. A., Frank, M. J. & Moore, C. I. Wave-like dopamine dynamics as a mechanism for spatiotemporal credit assignment. Cell 184, 2733–2749.e16 (2021).
Mohebi, A., Wei, W., Pelattini, L., Kim, K. & Berke, J. D. Dopamine transients follow a striatal gradient of reward time horizons. Nat. Neurosci. 27, 737–746 (2024).
Zolin, A. et al. Context-dependent representations of movement in Drosophila dopaminergic reinforcement pathways. Nat. Neurosci. 24, 1555–1566 (2021).
Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).
Cohen, J. Y., Haesler, S., Vong, L., Lowell, B. B. & Uchida, N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482, 85–88 (2012).
Menegas, W., Akiti, K., Amo, R., Uchida, N. & Watabe-Uchida, M. Dopamine neurons projecting to the posterior striatum reinforce avoidance of threatening stimuli. Nat. Neurosci. 21, 1421–1430 (2018).
Jin, X. & Costa, R. M. Start/stop signals emerge in nigrostriatal circuits during sequence learning. Nature 466, 457–462 (2010).
Dabney, W. et al. A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020).
Greenstreet, F. et al. Action prediction error: a value-free dopaminergic teaching signal that drives stable learning. Preprint at bioRxiv https://doi.org/10.1101/2022.09.12.507572 (2024).
Bogacz, R. Dopamine role in learning and action inference. eLife 9, e53262 (2020).
Lindsey, J. & Litwin-Kumar, A. Action-modulated midbrain dopamine activity arises from distributed control policies. In Proc. 36th International Conference on Neural Information Processing Systems (eds. Koyejo, S. et al.) 5535–5548 (2022).
Gardner, M. P. H., Schoenbaum, G. & Gershman, S. J. Rethinking dopamine as generalized prediction error. Proc. Biol. Sci. 285, 20181645 (2018).
Alexander, G. E., DeLong, M. R. & Strick, P. L. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci. 9, 357–381 (1986).
Lau, B., Monteiro, T. & Paton, J. J. The many worlds hypothesis of dopamine prediction error: implications of a parallel circuit architecture in the basal ganglia. Curr. Opin. Neurobiol. 46, 241–247 (2017).
Haber, S. N., Fudge, J. L. & McFarland, N. R. Striatonigrostriatal pathways in primates form an ascending spiral from the shell to the dorsolateral striatum. J. Neurosci. 20, 2369–2382 (2000).
Hintiryan, H. et al. The mouse cortico–striatal projectome. Nat. Neurosci. 19, 1100–1114 (2016).
Hunnicutt, B. J. et al. A comprehensive excitatory input map of the striatum reveals novel functional organization. eLife 5, e19103 (2016).
Pan, W. X., Mao, T. & Dudman, J. T. Inputs to the dorsal striatum of the mouse reflect the parallel circuit architecture of the forebrain. Front. Neuroanat. 4, 147 (2010).
Cox, J. & Witten, I. B. Striatal circuits for reward learning and decision-making. Nat. Rev. Neurosci. 20, 482–494 (2019).
Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proc. 33rd International Conference on Machine Learning (eds. Balcan, M. F. & Weinberger, K. Q.) 1928–1937 (jmlr.org, 2016).
Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
Daw, N. D., Kakade, S. & Dayan, P. Opponent interactions between serotonin and dopamine. Neural Netw. 15, 603–616 (2002).
Lloyd, K. & Dayan, P. Safety out of control: dopamine and defence. Behav. Brain Funct. 12, 15 (2016).
Lak, A., Nomoto, K., Keramati, M., Sakagami, M. & Kepecs, A. Midbrain dopamine neurons signal belief in choice accuracy during a perceptual decision. Curr. Biol. 27, 821–832 (2017).
Daw, N. D., Courville, A. C. & Touretzky, D. S. Timing and partial observability in the dopamine system. In Proc. 15th International Conference on Neural Information Processing Systems (eds. Becker, S. et al.) 99–106 (MIT Press, 2003).
Kurth-Nelson, Z. & Redish, A. D. Temporal-difference reinforcement learning with distributed representations. PLoS ONE 4, e7362 (2009).
Gershman, S. J., Pesaran, B. & Daw, N. D. Human reinforcement learning subdivides structured action spaces by learning effector-specific values. J. Neurosci. 29, 13524–13531 (2009).
Voorn, P., Vanderschuren, L. J. M. J., Groenewegen, H. J., Robbins, T. W. & Pennartz, C. M. A. Putting a spin on the dorsal–ventral divide of the striatum. Trends Neurosci. 27, 468–474 (2004).
Rueda-Orozco, P. E. & Robbe, D. The striatum multiplexes contextual and kinematic information to constrain motor habits execution. Nat. Neurosci. 18, 453–460 (2015).
Parker, N. F. et al. Choice-selective sequences dominate in cortical relative to thalamic inputs to NAc to support reinforcement learning. Cell Rep. 39, 110756 (2022).
Matsumoto, N., Minamimoto, T., Graybiel, A. M. & Kimura, M. Neurons in the thalamic CM–Pf complex supply striatal neurons with information about behaviorally significant sensory events. J. Neurophysiol. 85, 960–976 (2001).
Choi, K. et al. Distributed processing for value-based choice by prelimbic circuits targeting anterior–posterior dorsal striatal subregions in male mice. Nat. Commun. 14, 1920 (2023).
da Silva, J. A., Tecuapetla, F., Paixão, V. & Costa, R. M. Dopamine neuron activity before action initiation gates and invigorates future movements. Nature 554, 244–248 (2018).
Dodson, P. D. et al. Representation of spontaneous movement by dopaminergic neurons is cell-type selective and disrupted in parkinsonism. Proc. Natl Acad. Sci. USA 113, E2180–E2188 (2016).
Coddington, L. T. & Dudman, J. T. The timing of action determines reward prediction signals in identified midbrain dopamine neurons. Nat. Neurosci. 21, 1563–1573 (2018).
Jog, M. S., Kubota, Y., Connolly, C. I., Hillegaart, V. & Graybiel, A. M. Building neural representations of habits. Science 286, 1745–1749 (1999).
Ribas-Fernandes, J. J. F. et al. A neural signature of hierarchical reinforcement learning. Neuron 71, 370–379 (2011).
Jiang, L. & Litwin-Kumar, A. Models of heterogeneous dopamine signaling in an insect learning and memory center. PLoS Comput. Biol. 17, e1009205 (2021).
Matsumoto, H., Tian, J., Uchida, N. & Watabe-Uchida, M. Midbrain dopamine neurons signal aversion in a reward-context-dependent manner. eLife 5, e17328 (2016).
de Jong, J. W. et al. A neural circuit mechanism for encoding aversive stimuli in the mesolimbic dopamine system. Neuron 101, 133–151 (2019).
Matsumoto, M. & Hikosaka, O. Two types of dopamine neuron distinctly convey positive and negative motivational signals. Nature 459, 837–841 (2009).
Lammel, S. et al. Input-specific control of reward and aversion in the ventral tegmental area. Nature 491, 212–217 (2012).
Syed, E. C. J. et al. Action initiation shapes mesolimbic dopamine encoding of future rewards. Nat. Neurosci. 19, 34–36 (2016).
O’Doherty, J. et al. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454 (2004).
Moss, M. M., Zatka-Haas, P., Harris, K. D., Carandini, M. & Lak, A. Dopamine axons in dorsal striatum encode contralateral visual stimuli and choices. J. Neurosci. 41, 7197–7205 (2021).
Saunders, B. T., Richard, J. M., Margolis, E. B. & Janak, P. H. Dopamine neurons create Pavlovian conditioned stimuli with circuit-defined motivational properties. Nat. Neurosci. 21, 1072–1083 (2018).
Mikhael, J. G., Kim, H. R., Uchida, N. & Gershman, S. J. The role of state uncertainty in the dynamics of dopamine. Curr. Biol. 32, 1077–1087.e9 (2022).
Tsutsui-Kimura, I. et al. Distinct temporal difference error signals in dopamine axons in three regions of the striatum in a decision-making task. eLife 9, e62390 (2020).
Avvisati, R. et al. Distributional coding of associative learning in discrete populations of midbrain dopamine neurons. Cell Rep. 43, 114080 (2024).
Gonon, F. et al. Geometry and kinetics of dopaminergic transmission in the rat striatum and in mice lacking the dopamine transporter. Prog. Brain Res. 125, 291–302 (2000).
Akiti, K. et al. Striatal dopamine explains novelty-induced behavioral dynamics and individual variability in threat prediction. Neuron 110, 3789–3804.e9 (2022).
Rescorla, R. A. & Wagner, A. R. A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In Classical Conditioning II: Current Research and Theory (eds. Black, A. H. & Prokasy, W. F.) 64–99 (Appleton-Century-Crofts, 1972).
Kamin, L. J. ‘Attention-like’ processes in classical conditioning. In Miami Symposium on the Prediction of Behavior: Aversive Stimulation (ed. Jones, M. R.) 9–31 (Univ. Miami Press, 1968).
Gershman, S. J., Norman, K. A. & Niv, Y. Discovering latent causes in reinforcement learning. Curr. Opin. Behav. Sci. 5, 43–50 (2015).
Russek, E. M., Momennejad, I., Botvinick, M. M., Gershman, S. J. & Daw, N. D. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Comput. Biol. 13, e1005768 (2017).
Stachenfeld, K. L., Botvinick, M. M. & Gershman, S. J. The hippocampus as a predictive map. Nat. Neurosci. 20, 1643–1653 (2017).
Niv, Y. Learning task-state representations. Nat. Neurosci. 22, 1544–1553 (2019).
Pinto, L. et al. An accumulation-of-evidence task using visual pulses for mice navigating in virtual reality. Front. Behav. Neurosci. 12, 36 (2018).
Aronov, D. & Tank, D. W. Engagement of neural circuits underlying 2D spatial navigation in a rodent virtual reality system. Neuron 84, 442–456 (2014).
Brockman, G. et al. OpenAI Gym. Preprint at https://arxiv.org/abs/1606.01540 (2016).
Hill, A. et al. Stable baselines. GitHub https://github.com/hill-a/stable-baselines (2018).
Barreto, A. et al. Successor features for transfer in reinforcement learning. In Proc. 31st Conference on Neural Information Processing Systems (eds. Guyon, I. et al.) 4055–4065 (Curran Associates, Inc., 2017).
Rowland, M. et al. Statistics and samples in distributional reinforcement learning. In Proc. 36th International Conference on Machine Learning, Vol. 97 (eds. Chaudhuri, K. & Salakhutdinov, R.) 5528–5536 (PMLR, 2019).
Lee, R. S., Sagiv, Y., Engelhard, B., Witten, I. B. & Daw, N. D. A feature-specific prediction error model explains dopaminergic heterogeneity. Figshare https://doi.org/10.6084/m9.figshare.25752450 (2024).
Acknowledgements
We thank A. Luna and J. Lopez for their help with the VR software system; W. Dabney for discussion on the distributional RL model and providing the imputation function for the distributional RL model; M. Lee and E. Grant for help with training the deep RL network; P. Dayan, A. Kahn and L. Brown for comments on this work; and the BRAIN CoGS team and the laboratories of N.D.D. and I.B.W. for their help. This work was supported by an NSF GRFP (R.S.L.), 1K99MH122657 (B.E.), National Institutes of Health R01 DA047869 (I.B.W.), U19 NS104648-01 (I.B.W.), ARO W911NF-16-1-0474 (N.D.D.), ARO W911NF1710554 (I.B.W.), Brain Research Foundation (I.B.W.), Simons Collaboration on the Global Brain (I.B.W.) and the New York Stem Cell Foundation (I.B.W.).
Author information
Authors and Affiliations
Contributions
R.S.L., I.B.W. and N.D.D. conceived the project. B.E. and I.B.W. conducted the original neural recording experiments, and B.E. provided software, interpretation and analysis. R.S.L. and N.D.D. developed the model, and Y.S. extended it. R.S.L. wrote software, trained the deep network and conducted data analyses. R.S.L. and Y.S. conducted model simulations. R.S.L., Y.S., I.B.W. and N.D.D. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Neuroscience thanks Talia Lerner and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Deep RL network retrained with the same task and a different seed.
(a) Psychometric curve showing the retrained agent’s performance, with error bars showing ±1 s.e.m. (N = 1000 trials). (b) Retrained agent’s scalar value during the cue period decreased as a function of trial difficulty (defined as the absolute tower difference, blue gradient). (c) Retrained agent’s feature-specific RPE units’ responses to confirmatory (purple) and disconfirmatory (gray) cues. Responses are averaged across cue-sensitive units only (N = 47/64), with error bars indicating ±1 s.e.m. (d) Retrained agent’s feature-specific RPE units averaged across trials and plotted with respect to view angle. Each row represents one unit’s peak-normalized response to view angle. (e-f) Same as (d) but for the agent’s (e) position and (f) left (red) and right (blue) cues. (g) Retrained agent’s scalar RPE time-locked to reward time for rewarded (magenta) and unrewarded (gray) trials. (h) Same as (g) but for rewarded trials of different difficulty, with hard trials (light blue) and easy trials (dark blue) defined as in Fig. 5d. (i) Histogram of the retrained agent’s feature-specific RPE units’ reward-minus-omission responses at reward time (P = 4.06e-12, two-sided Wilcoxon signed-rank test, N = 64). Yellow line indicates the median. (j) Same as (i) but for rewarded trials plotted against trial difficulty (P = 0.028, two-sided Wilcoxon signed-rank test, N = 64). In (j), an outlier at 0.16 corresponds to a feature-specific RPE unit showing strong reward expectation modulation.
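For panels (i) and (j), a minimal sketch of the kind of per-unit comparison described, assuming per-unit mean responses are available as arrays (the array names and values are hypothetical; SciPy is used here for the two-sided Wilcoxon signed-rank test):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-unit mean responses at reward time (64 units).
rng = np.random.default_rng(0)
reward_resp = rng.normal(0.2, 0.1, size=64)    # rewarded trials
omission_resp = rng.normal(0.0, 0.1, size=64)  # unrewarded trials

# Reward-minus-omission difference per unit, tested against zero with a
# two-sided Wilcoxon signed-rank test, as in the caption.
diff = reward_resp - omission_resp
stat, p = wilcoxon(diff, alternative='two-sided')
print(f'median difference = {np.median(diff):.3f}, P = {p:.2e}')
```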
Extended Data Fig. 2 Tuning of 64 LSTM feature units to position and evidence.
(a) Each panel shows an individual feature unit and how it is tuned to the agent’s position in the maze and the cumulative tower difference at that position.
Extended Data Fig. 3 Feature-specific RPEs clustered according to behavioral features of the task, to match the clustering analysis of Engelhard et al.4.
(a) The optimal number of clusters was 3, selected by minimizing the AIC across models with different numbers of clusters. (b) Relative contribution of the behavioral features (cues, position, choice and reward response) for the 64 units, sorted by their highest cluster-membership probability, with colored lines on top indicating cluster identity. Relative contribution is defined as the percentage of explained variance of the partial model omitting that variable relative to the full model.
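A sketch of AIC-based selection of the number of clusters, assuming the units’ tuning profiles are summarized in a feature matrix and fit with a Gaussian mixture (the matrix contents and the use of scikit-learn’s GaussianMixture are assumptions, not necessarily the original fitting procedure):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical tuning-profile matrix: 64 units x 4 behavioral features.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 4))

# Fit mixtures with 1-8 components and keep the AIC-minimizing model,
# as described in panel (a).
aics = [GaussianMixture(n_components=k, random_state=0).fit(X).aic(X)
        for k in range(1, 9)]
best_k = int(np.argmin(aics)) + 1
print('AIC-optimal number of clusters:', best_k)
```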
Extended Data Fig. 4 Scalar RPE signal does not reflect the incidental high-dimensional features.
(a) The scalar RPE signal autocorrelated across position (similarity defined as the cosine similarity of position-lagged scalar RPE responses) does not show a peak at a position lag of 43 cm, the repetition interval of the wall pattern.
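One way to read the caption’s similarity measure, sketched under the assumption that the scalar RPE has been averaged into 1 cm position bins (bin size, array contents and the helper function are hypothetical):

```python
import numpy as np

# Hypothetical scalar RPE averaged into 1 cm position bins along the maze.
rng = np.random.default_rng(2)
rpe = rng.normal(size=300)

def lagged_cosine(x, lag):
    """Cosine similarity between the signal and a position-shifted copy."""
    a, b = x[:-lag], x[lag:]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity as a function of position lag; a peak near a 43 cm lag would
# indicate sensitivity to the repeating wall pattern.
lags = np.arange(1, 100)
similarity = np.array([lagged_cosine(rpe, lag) for lag in lags])
print(similarity[lags == 43][0])
```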
Extended Data Fig. 5 Cue responses in model and DAergic neurons represent a vector code of lateralized response and RPE.
(a) Average responses of contralateral-cue-responsive DA neurons9 recorded from the left hemisphere only (N = 54/62 neurons, a subset of the contralateral-cue-responsive neurons from Fig. 4c) to confirmatory (red) and disconfirmatory (gray) contralateral cues. Colored fringes represent ±1 s.e.m. of kernel amplitudes. (b) Same as (a) but for DA neurons recorded in the right hemisphere (N = 8/62) responding to confirmatory (blue) and disconfirmatory (gray) contralateral cues. (c-d) Same as (a-b), but for feature-specific RPE model units responding specifically to (c) left cues (N = 27/44, the subset of cue-responsive units from Fig. 4b modulated by left cues only) and (d) right cues (N = 17/44). Error bars indicate ±1 s.e.m.
Extended Data Fig. 6 Scalar RPE shows fine-grained reward expectation modulation.
(a) The scalar RPE’s response is modulated by reward expectation, as given by trial difficulty, defined as the absolute value of the trial’s final tower difference (blue gradient).
Extended Data Fig. 7 State diagrams for simulations of the Parker et al.6 and Jin and Costa26 tasks in Fig. 8.
(a) State diagram for the Parker et al.6 task simulation. (b) State diagram for the Jin and Costa26 task simulation. In both panels, arrows indicate probabilistic transitions between states, with probabilities given by the arrow labels. Unlabeled arrows denote transitions occurring with the remaining probability, such that the outgoing probabilities of each state sum to 1. π(x) denotes the probability of executing action x under the agent’s behavioral policy π.
Supplementary information
Supplementary Information
Supplementary Table 1: Hyperparameters for the A2C algorithm.
Source data
Source Data Fig. 2
Statistical source data.
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data.
Source Data Fig. 5
Statistical source data.
Source Data Fig. 6
Statistical source data.
Source Data Fig. 7
Statistical source data.
Source Data Fig. 8
Statistical source data.
Source Data Extended Data Fig. 1
Statistical source data.
Source Data Extended Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Source Data Extended Data Fig. 4
Statistical source data.
Source Data Extended Data Fig. 5
Statistical source data.
Source Data Extended Data Fig. 6
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lee, R.S., Sagiv, Y., Engelhard, B. et al. A feature-specific prediction error model explains dopaminergic heterogeneity. Nat Neurosci 27, 1574–1586 (2024). https://doi.org/10.1038/s41593-024-01689-1