Abstract
Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain^{1,2,3}. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning^{4,5,6}. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
Data availability
The neuronal data analysed in this work are available at https://doi.org/10.17605/OSF.IO/UX5RG.
Code availability
The analysis code from our value-distribution decoding and code used to generate model predictions for distributional TD are available at https://doi.org/10.17605/OSF.IO/UX5RG.
References
Schultz, W., Stauffer, W. R. & Lak, A. The phasic dopamine signal maturing: from reward via behavioural activation to formal economic utility. Curr. Opin. Neurobiol. 43, 139–148 (2017).
Glimcher, P. W. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proc. Natl Acad. Sci. USA 108, 15647–15654 (2011).
Watabe-Uchida, M., Eshel, N. & Uchida, N. Neural circuitry of reward prediction error. Annu. Rev. Neurosci. 40, 373–394 (2017).
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H. & Tanaka, T. Parametric return density estimation for reinforcement learning. In Proc. 26th Conference on Uncertainty in Artificial Intelligence (eds Grunwald, P. & Spirtes, P.) http://dl.acm.org/citation.cfm?id=3023549.3023592 (2010).
Bellemare, M. G., Dabney, W. & Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 449–458 (2017).
Dabney, W., Rowland, M., Bellemare, M. G. & Munos, R. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence (2018).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: an Introduction Vol. 1 (MIT Press, 1998).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Hessel, M. et al. Rainbow: combining improvements in deep reinforcement learning. In 32nd AAAI Conference on Artificial Intelligence (2018).
Botvinick, M. M., Niv, Y. & Barto, A. G. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113, 262–280 (2009).
Wang, J. X. et al. Prefrontal cortex as a meta-reinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).
Song, H. F., Yang, G. R. & Wang, X. J. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife 6, e21492 (2017).
Barth-Maron, G. et al. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations https://openreview.net/forum?id=SyZipzbCb (2018).
Dabney, W., Ostrovski, G., Silver, D. & Munos, R. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning (2018).
Pouget, A., Beck, J. M., Ma, W. J. & Latham, P. E. Probabilistic brains: knowns and unknowns. Nat. Neurosci. 16, 1170–1178 (2013).
Lammel, S., Lim, B. K. & Malenka, R. C. Reward and aversion in a heterogeneous midbrain dopamine system. Neuropharmacology 76, 351–359 (2014).
Fiorillo, C. D., Tobler, P. N. & Schultz, W. Discrete coding of reward probability and uncertainty by dopamine neurons. Science 299, 1898–1902 (2003).
Eshel, N. et al. Arithmetic and local circuitry underlying dopamine prediction errors. Nature 525, 243–246 (2015).
Rowland, M. et al. Statistics and samples in distributional reinforcement learning. In International Conference on Machine Learning (2019).
Frank, M. J., Seeberger, L. C. & O’Reilly, R. C. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306, 1940–1943 (2004).
Hirvonen, J. et al. Striatal dopamine D1 and D2 receptor balance in twins at increased genetic risk for schizophrenia. Psychiatry Res. Neuroimaging 146, 13–20 (2006).
Piggott, M. A. et al. Dopaminergic activities in the human striatum: rostrocaudal gradients of uptake sites and of D1 and D2 but not of D3 receptor binding or dopamine. Neuroscience 90, 433–445 (1999).
Rosa-Neto, P., Doudet, D. J. & Cumming, P. Gradients of dopamine D1- and D2/3-binding sites in the basal ganglia of pig and monkey measured by PET. Neuroimage 22, 1076–1083 (2004).
Mikhael, J. G. & Bogacz, R. Learning reward uncertainty in the basal ganglia. PLOS Comput. Biol. 12, e1005062 (2016).
Robb, B. et al. A computational and neural model of momentary subjective well-being. Proc. Natl Acad. Sci. USA 111, 12252–12257 (2014).
Huys, Q. J., Daw, N. D. & Dayan, P. Depression: a decision-theoretic analysis. Annu. Rev. Neurosci. 38, 1–23 (2015).
Bennett, D. & Niv, Y. Opening Burton’s clock: psychiatric insights from computational cognitive models. Preprint at https://doi.org/10.31234/osf.io/y2vzu (2018).
Tian, J. & Uchida, N. Habenula lesions reveal that multiple mechanisms underlie dopamine prediction errors. Neuron 87, 1304–1316 (2015).
Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).
Newey, W. K. & Powell, J. L. Asymmetric least squares estimation and testing. Econometrica 55, 819–847 (1987).
Jones, M. C. Expectiles and M-quantiles are quantiles. Stat. Probab. Lett. 20, 149–153 (1994).
Ziegel, J. F. Coherence and elicitability. Math. Finance 26, 901–918 (2016).
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
Heess, N. et al. Emergence of locomotion behaviours in rich environments. Preprint at https://arxiv.org/abs/1707.02286 (2017).
Bäckman, C. M. et al. Characterization of a mouse strain expressing cre recombinase from the 3′ untranslated region of the dopamine transporter locus. Genesis 44, 383–390 (2006).
Cohen, J. Y. et al. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482, 85–88 (2012).
Stauffer, W. R., Lak, A. & Schultz, W. Dopamine reward prediction error responses reflect marginal utility. Curr. Biol. 24, 2491–2500 (2014).
Fiorillo, C. D., Song, M. R. & Yun, S. R. Multiphasic temporal dynamics in responses of midbrain dopamine neurons to appetitive and aversive stimuli. J. Neurosci. 33, 4710–4725 (2013).
Schaul, T., Quan, J., Antonoglou, I. & Silver, D. Prioritized experience replay. In International Conference on Learning Representations (2016).
Van Hasselt, H., Guez, A. & Silver, D. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence (2016).
Krizhevsky, A. & Hinton, G. Learning Multiple Layers of Features from Tiny Images (Univ. of Toronto, 2009).
Acknowledgements
We thank K. Miller, P. Dayan, T. Stepleton, J. Paton, M. Frank, C. Clopath, T. Behrens and the members of the Uchida laboratory for comments on the manuscript; and N. Eshel, J. Tian, M. Bukwich and M. Watabe-Uchida for providing data.
Author information
Contributions
W.D. conceived the project. W.D., Z.K.N. and M.B. contributed ideas for experiments and analysis. W.D. and Z.K.N. performed simulation experiments and analysis. N.U. and C.K.S. provided neuronal data for analysis. W.D., Z.K.N. and M.B. managed the project. M.B., N.U., R.M. and D.H. advised on the project. M.B., W.D. and Z.K.N. wrote the paper. W.D., Z.K.N., M.B., N.U., C.K.S., D.H. and R.M. provided revisions to the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature thanks Rui Costa, Michael Littman and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Mechanism of distributional TD.
a, The degree of asymmetry in positive to negative scale determines the equilibrium where positive and negative errors balance. Equal scaling equilibrates at the mean, whereas a larger positive (negative) scaling produces an equilibrium above (below) the mean. b, Distributional prediction emerges through experience. Quantile (sign function) version is displayed here for clarity. Model is trained on arbitrary task with trimodal reward distribution. c, Same as b, viewed in terms of cumulative distribution (left) or learned value for each predictor (quantile function) (right).
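The equilibrium described in panel a can be illustrated with a minimal sketch. It is not the authors' simulation code: the three-mode reward mixture, the learning rates, and the names `sample_trimodal` and `learn_value` are assumptions made for illustration. A single value predictor scales positive errors by `tau` and negative errors by `1 - tau`; setting `use_sign=True` switches to the quantile (sign function) variant mentioned in panel b.

```python
import numpy as np

def sample_trimodal(rng, n):
    """Draw n rewards from an illustrative three-mode Gaussian mixture (assumed task)."""
    modes = rng.choice([1.0, 5.0, 9.0], size=n)
    return modes + 0.5 * rng.standard_normal(n)

def learn_value(rewards, tau, base_lr=0.02, use_sign=False):
    """Train one value predictor with asymmetric scaling of prediction errors.

    tau in (0, 1) sets the relative scaling: positive errors are scaled by
    base_lr * tau and negative errors by base_lr * (1 - tau).
    Returns the value estimate averaged over the last quarter of training.
    """
    alpha_pos = base_lr * tau
    alpha_neg = base_lr * (1.0 - tau)
    v = 0.0
    trace = np.empty(len(rewards))
    for i, r in enumerate(rewards):
        delta = r - v                  # prediction error
        if use_sign:
            delta = np.sign(delta)     # quantile (sign function) variant
        v += (alpha_pos if delta > 0 else alpha_neg) * delta
        trace[i] = v
    tail = len(rewards) // 4
    return trace[-tail:].mean()

rng = np.random.default_rng(0)
rewards = sample_trimodal(rng, 20000)

estimates = {
    "pessimistic": learn_value(rewards, tau=0.2),  # larger negative scaling
    "balanced": learn_value(rewards, tau=0.5),     # equal scaling
    "optimistic": learn_value(rewards, tau=0.8),   # larger positive scaling
}

print("sample mean:", round(float(rewards.mean()), 2))
for name, value in estimates.items():
    print(name, round(float(value), 2))
```

Under these assumptions the balanced predictor should settle near the sample mean, while the optimistic and pessimistic predictors should settle above and below it, respectively. Running many predictors with different `tau` values in parallel is what yields a set of estimates that jointly describes the reward distribution.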
Extended Data Fig. 2 Learning the distribution of returns improves performance of deep RL agents across multiple domains.
a, DQN and distributional TD share identical nonlinear network structures. b, c, After training classical or distributional DQN on MsPacman, we freeze the agent and then train a separate linear decoder to reconstruct frames from the agent’s final layer representation. For each agent, reconstructions are shown. The distributional model’s representation allows substantially better reconstruction. d, At a single frame of MsPacman (not shown), the agent’s value predictions together represent a probability distribution over future rewards. Reward predictions of individual RPE channels shown as tick marks ranging from pessimistic (blue) to optimistic (red), and kernel density estimate shown in black. e, Atari57 experiments with single runs of prioritized experience replay^{40} and double DQN^{41} agents for reference. Benefits of distributional learning exceed other popular innovations. f, g, The performance payoff of distributional RL can be seen across a wide diversity of tasks. Here we give another example, a humanoid motor-control task in the MuJoCo physics simulator. Prioritized experience replay agent is shown for reference^{14}. Traces show individual runs; averages are in bold.
Extended Data Fig. 3 Simulation experiment to examine the role of representation learning in distributional RL.
a, Illustration of tasks 1 and 2. b, Example images for each class used in our experiment^{42}. c, Experimental results, where each of ten random seeds yields an individual run shown with traces; average over seeds is shown in bold. d, Same as c, but for control experiment. e, Bird–dog t-SNE visualization of final hidden layer of network, given different input images (blue, bird; red, dog). Left, classical TD; right, distributional TD; top row, representation after training on task 1; bottom row, representation after training on task 2.
Extended Data Fig. 4 Null models.
a, Classical TD plus noise does not give rise to the pattern of results observed in real dopamine data in the variable-magnitude task. When reversal points were estimated in two independent partitions there was no correlation between the two (P = 0.32 by linear regression). b, We then estimated asymmetric scaling of responses and found no correlation between this and reversal point (P = 0.78 by linear regression). c, Model comparison between ‘same’, a single reversal point, and ‘diverse’, separate reversal points. In both, the model is used to predict whether a held-out trial has a positive or negative response. d, Simulated baseline-subtracted RPEs, colour-coded according to the ground-truth value of bias added to that cell’s RPEs. e, Across all simulated cells, there was a strong positive relationship between pre-stimulus baseline firing and the estimated reversal point. f, Two independent measurements of the reversal point were strongly correlated. g, The proportion of simulated cells that have significantly positive (blue) or negative (red) responses showed no magnitudes with both positive and negative responses. h, In the simulation, there was a significant negative relationship between the estimated asymmetry of each cell and its estimated reversal point (opposite that observed in neural data). i, Diagram illustrating a Gaussian-weighted topological mapping between RPEs and value predictors. j, Varying the standard deviation of this Gaussian modulates the degree of coupling. k, In a task with equal chance of a reward 1.0 or 0.0, distributional TD with different levels of coupling shows robustness to the degree of coupling. l, When there is no coupling, a distributional code is not learned, but asymmetric scaling can cause spurious detection of diverse reversal points. m, Even though every cell has the same reward prediction, they appear to have different reversal points. n, With this model, some cells may have significantly positive responses, and others significantly negative responses, in response to the same reward. o, But this model is unable to explain a positive correlation between asymmetric scaling and reversal points. p, Simulation of ‘synaptic’ distributional RL, in which learning rates but not firing rates are asymmetrically scaled. This model predicts diversity in reversal points between dopamine neurons. q, The model predicts no correlation between asymmetric scaling of firing rates and reversal point.
Extended Data Fig. 5 Asymmetry and reversal.
a, Left, all data points (trials) from an example cell. The solid lines are linear fits to the positive and negative domains, and the shaded areas show 95% confidence intervals calculated with Bayesian regression. Right, the same cell plotted in the format of Fig. 4b. b, Cross-validated model comparison on the dopamine data favours allowing each cell to have its own asymmetric scaling (P = 1.4 × 10^{−11} by paired t-test). The standard error of the mean appears large relative to the P value because the P value is computed using a paired test. c, Although the difference between single-asymmetry and diverse-asymmetry models was small in firing-rate space, such small differences correspond to large differences in decoded distribution space (more details in Supplementary Information). Each point is a TD simulation; colour indicates the degree of diversity in asymmetric scaling within that simulation. d, We were interested in whether an apparent correlation between reversal point and asymmetry could arise as an artefact, owing to a mismatch between the shape of the actual dopamine response function and the function used to fit it. Here we simulate the variable-magnitude task using a TD model without a true correlation between asymmetric scaling and reversal point. We then apply the same analysis pipeline as in the main paper, to measure the correlation (colour axis) between asymmetric scaling and reversal point. We repeat this procedure 20 times with different dopamine response functions in the simulation, and different functions used to fit the positive and negative domains of the simulated data. The functions are sorted in increasing order of concavity. An artefact can emerge if the response function used to fit the data is less concave than the response function used to generate the data. For example, when generating data with a Hill function but fitting with a linear function, a positive correlation can be spuriously measured. e, When simulating data from the distributional TD model, where a true correlation exists between asymmetric scaling and reversal point, it is always possible to detect this positive correlation, even if the fitting response function is more concave than the generating response function. The black rectangle highlights the function used to fit real neural data in c. f, Here we analyse the real dopamine cell data identically to Fig. 4d, but using Hill functions instead of linear functions to fit the positive and negative domains. Because the correlation between asymmetric scaling and reversal point still appears under these adversarial conditions, we can be confident it is not driven by this artefact. g, Same as Fig. 4d, but using linear response function and linear utility function (instead of empirical utility).
Extended Data Fig. 6 Cue responses versus outcome responses, and more evidence for diversity.
a, In the variable-probability task: firing at cue, versus firing at reward (left) or omission (right). Colour brightness denotes asymmetry. b, Same as a, but showing RPEs from distributional TD simulation. c, Data from ref. ^{30} also included unpredicted rewards and unpredicted airpuffs. Top two panels show responses for all the cells recorded in one animal and bottom two panels show responses for all the cells of another animal. Left, the x axis is the baseline-subtracted response to free reward and the y axis is the baseline-subtracted response to airpuff. Dots with black outlines are per-cell means, and unoutlined dots are means of disjoint subsets of trials indicating consistency of asymmetry. Right, the same data plotted in a different way, with cells sorted along the x axis by response to airpuff. Response to reward is shown in greyscale dots. Asterisks indicate significant difference in firing rates from one or both neighbouring cells. d, Simulations for distributional but not classical TD produce diversity in relative response.
Extended Data Fig. 7 More details of data in variable-probability task.
a, Details of analysis method. Of the four possible outcomes of the two Mann–Whitney tests (Methods), two outcomes correspond to interpolation (middle) and one each to the pessimistic (left) and optimistic (right) groups. b, Simulation results for the classical TD and distributional TD models. y axis shows the average firing-rate change, normalized to mean zero and unit variance, in response to each of the three cues. Each curve is one cell. The cells are split into panels according to a statistical test for type of probability coding (see Methods for details). Colour indicates the degree of optimism or pessimism. Distributional TD predicts simultaneous optimistic and pessimistic coding of probability, whereas classical TD predicts all cells have the same coding. c, Same as b, but using data from real dopamine neurons. The pattern of results closely matches the predictions from the distributional TD model. d, Same as b, using data from putative VTA GABAergic interneurons.
Extended Data Fig. 8 Further distribution decoding analysis.
This figure pertains to the variable-magnitude experiment. a–c, In the decoding shown in the main text, we constrained the support of the distribution to the range of the rewards in the task. Here, we applied the decoding analysis without constraining the output values. We find similar results, although with increased variance. d, We compare the quality of the decoded distribution against several controls. The real decoding is shown as black dots. In coloured lines are reference distributions (uniform and Gaussian with the same mean and variance as the ground truth; and the ground truth mirrored). Black traces shift or scale the ground-truth distribution by varying amounts. e, Nonlinear functions used to shift asymmetries, to measure degradation of decoded distribution. The normal cumulative distribution function ϕ is used to transform asymmetry τ. This is shifted by some value s and transformed back through the normal quantile function ϕ^{−1}. Positive values s increase the value of τ and negative values decrease the value of τ. f, Decoded distributions under different shifts, s. g, Plot of shifted asymmetries for values of s used. h, Quantification of match between decoded and ground-truth distribution, for each s. i, j, Same as Fig. 5d, e, but for putative GABAergic cells rather than dopamine cells.
Extended Data Fig. 9 Simultaneous diversity.
a, b, Variable-probability task. Mean spiking (a) and licking (b) activity in response to each of the three cues (indicating 10%, 50% or 90% probability of reward) at time 0, and in response to the outcome (reward or no reward) at time 2,000 ms. c, Trial-to-trial variations in lick rates were strongly correlated with trial-to-trial variations in dopamine firing rates. Mean of each cell is subtracted from each axis, and the x axis is binned for ease of visualization. d, Dopaminergic coding of the 50% cue relative to the 10% and 90% cues (as shown in b) was not correlated with the same measure computed on lick rates. Therefore, between-session differences in cue preference, measured by anticipatory licking, cannot explain between-cell differences in optimism. e, Four simultaneously recorded dopamine neurons. These are the same four cells whose time courses are shown in Fig. 3c. f, Variable-magnitude task. Across cells, there was no relationship between asymmetric scaling of positive versus negative prediction errors, and baseline firing rates (R = 0.18, P = 0.29). Each point is a cell. These data are from dopamine neurons at reward delivery time. g, t-statistics of response to 5 μl reward compared with baseline firing rate, for all 16 cells from animal D. Some cells respond significantly above baseline and others significantly below. Cells are sorted by t-statistic. h, Spike rasters showing all trials in which the 5 μl reward was delivered. The two panels are two example cells from the same animal with rasters shown in Fig. 2.
Extended Data Fig. 10 Relationship of results to original analysis.
Here we reproduce results for the variable-magnitude task in ref. ^{30} with two different time windows. a, Change in firing rate in response to cued reward delivery averaged over all cells. b, Comparing Hill-function fit and response averaged over all cells for expected (cued) and unexpected reward delivery. c, Correlation between response predicted by scaled common response function and actual response to expected reward delivery. d, Zooming in on c shows correlation driven primarily by larger reward magnitudes. e–h, Repeating the above analysis for a window of 200–600 ms.
Supplementary information
Supplementary Information
This material contains six sections: Section 1 covers mechanisms underlying distributional RL; Section 2 considers alternative models; Section 3 tests robustness to modelling assumptions; Section 4 presents supplementary results; Section 5 discusses relations to previous work; and Section 6 gives further predictions of the theory.
About this article
Cite this article
Dabney, W., Kurth-Nelson, Z., Uchida, N. et al. A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020). https://doi.org/10.1038/s41586-019-1924-6
DOI: https://doi.org/10.1038/s41586-019-1924-6