Reward-evoked dopamine transients are well established as prediction errors. However, the central tenet of temporal difference accounts—that similar transients evoked by reward-predictive cues also function as errors—remains untested. In the present communication we addressed this by showing that optogenetically shunting dopamine activity at the start of a reward-predicting cue prevents second-order conditioning without affecting blocking. These results indicate that cue-evoked transients function as temporal-difference prediction errors rather than reward predictions.
Behavioral data will be made available upon reasonable request.
Simulations were performed using custom-written functions in MATLAB (MathWorks), which are posted on GitHub (https://github.com/mphgardner/Basic_Pavlovian_TDRL/tree/Maes_2018).
This work was supported by the Intramural Research Program at the NIDA; the Canada Research Chair’s program (to M.D.I.); a Natural Sciences and Engineering Research Council of Canada Discovery Grant (to M.D.I.); a Natural Sciences and Engineering Research Council of Canada Undergraduate Student Research Award (to E.J.P.M.); and a Concordia University Undergraduate Research Award (to A.A.U.). The opinions expressed in this article are our own and do not reflect the view of the NIH/DHHS.
The authors declare no competing interests.
Peer review information Nature Neuroscience thanks S. Floresco and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Experimental design for within-subjects blocking and second-order conditioning as used in our study, along with graphs modeling the predicted results of shunting the dopamine transient at the start of the reward-predictive cue, A, in each procedure. In Model 1 the VTA DA signal encodes a prediction error and in Model 2 it encodes a reward prediction. Bar graphs are reproduced from Fig. 1 in the main text; other panels model results of training in the other phases. Note that the output of the classic TDRL model was converted from V to conditioned responding (CR) to better reflect the behavioral output actually measured in our experiments. The major impact of the neural manipulation was on responding to X in Model 2. Elimination of the prediction on AX trials in this model causes a positive prediction error on reward delivery in the blocking phase, which results in unblocking of X.
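The contrast between the two models can be written out with the standard temporal-difference error. The lines below are a sketch under stated assumptions, not the authors' implementation: the symbol η is borrowed from the inhibition-strength parameter described in Extended Data Fig. 4, and treating the shunt as a multiplicative scaling of the cue-evoked signal is an assumption made here for illustration.

```latex
% Standard TD error at time t (reward r_t, discount \gamma, value estimate V)
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

% Model 1 (signal = error), assumed form: the shunt scales the error at onset of A,
% while the error at reward delivery on AX trials is untouched and stays near zero
\delta_{\mathrm{onset}} \;\to\; (1-\eta)\,\delta_{\mathrm{onset}}, \qquad
\delta_{\mathrm{reward}} = r - V(AX) \approx 0

% Model 2 (signal = prediction), assumed form: the shunt scales the prediction itself,
% leaving a positive error at reward delivery once A is well trained (V(AX) \approx r)
\delta_{\mathrm{reward}} = r - (1-\eta)\,V(AX) \approx \eta\, r > 0 \quad \text{for } \eta > 0
```

Under these assumptions, only Model 2 produces a positive error at reward delivery on AX trials, which is the condition for X to acquire value (unblocking).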
Extended Data Fig. 2 Experimental designs for within-subjects blocking and second-order conditioning as used in our study with NpHR and eYFP rats.
During Conditioning, responding to A but not B increased across days, and responding was higher to A than to B. During Blocking, responding to the control compound (DZ) was lower than to the blocking compounds (AX, AY) at the start but equivalent by the end of training, with no difference between the blocking compounds. Responding during the first trial of the Probe Test showed evidence of blocking (X and Y vs. Z) and no difference between the blocking cues (X vs. Y; see Fig. 3 legend for statistics). Differences disappeared on subsequent trials. Responding to the retrained cue A increased across reminder (Rmdr) trials, while responding to C (i.e., C→A trials) and D (i.e., D→A trials) did not differ across second-order training. On Probe Test, responding to C was lower than to D (see Fig. 2 legend for statistics) on the first trial as well as across the entire test. eYFP: The pattern of data obtained for the NpHR rats was similar to that obtained for the eYFP rats with one critical exception: there was no difference between C and D on Probe Test in the eYFP rats. Some data are reproduced from Figs. 2 and 3 in the main text. CR, or conditioned responding, is percent time spent in the magazine during the last 5 s of the cue. Drawings to the left illustrate the extent of expression of NpHR and eYFP and location of fiber tips within VTA.
Extended Data Fig. 3 The cue-evoked dopamine transient is necessary for second-order conditioning in naïve rats.
Drawings to the left illustrate the extent of expression of NpHR and location of fiber tips within VTA. The three panels of behavioral responding show behavioral data across the three phases of the second-order conditioning experiment represented using three different CRs (top: percent time spent in the magazine; middle: cumulative head entries during the CS across a single day of training; bottom: percent trials containing a head entry). Behavioral responding during A increased during Conditioning (see methods for statistics). Responding to C (i.e., C→A trials) and D (i.e., D→A trials) did not differ (see methods) during second-order training when shunting of VTA transients took place at the start of the reward-predictive cue, A. On Test, responding to C was lower than to D (see methods for statistics for each of the CRs), showing that inhibition of the VTA DA signal at the start of A prevented A from supporting second-order conditioning to C, whereas identical inhibition during the ITI left learning to D intact.
Extended Data Fig. 4 Modeling data for the Blocking and Second-order experiments with different strengths of neuronal inhibition.
The modeling data show how different inhibition strengths (i.e., η = 0, 0.5, 1 as used in the models, see also Figure S1) affect the predicted conditioned responding on Probe Test across the different models. Model Control represents eYFP controls in which inhibition is not effective. Model Error represents the dopamine signal acting as a prediction error, in which increases in inhibition strength do not affect conditioned responding to X in blocking but lead to reduced conditioned responding to C in second-order conditioning. Model V represents the dopamine signal as a prediction, in which increases in inhibition strength lead to greater conditioned responding to X in blocking (i.e., unblocking) and reduced conditioned responding to C in second-order conditioning.
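The qualitative pattern described for Model Error can be illustrated with a short simulation. This is a minimal Python sketch, not the authors' MATLAB code: it assumes a two-step trial (cue onset, then outcome), one weight per cue, and a shunt that scales the error at onset of A by (1 − η). The function names and the values of `alpha`, `gamma`, and the trial counts are hypothetical choices made for illustration.

```python
def blocking(eta, n_cond=200, n_block=50, alpha=0.1, gamma=0.95):
    """Return the learned value of X after A->reward, then AX->reward training."""
    w = {"A": 0.0, "X": 0.0}
    for _ in range(n_cond):                      # Conditioning: A -> reward
        w["A"] += alpha * (1.0 - w["A"])
    for _ in range(n_block):                     # Blocking: AX -> reward
        v_ax = w["A"] + w["X"]
        # Error at compound onset, shunted by eta (assumed multiplicative).
        # No cue precedes AX, so this error has no cue weight to update.
        _delta_onset = (1.0 - eta) * gamma * v_ax
        delta_reward = 1.0 - v_ax                # error at reward is not manipulated
        w["A"] += alpha * delta_reward
        w["X"] += alpha * delta_reward
    return w["X"]


def second_order(eta, n_cond=200, n_so=10, alpha=0.1, gamma=0.95):
    """Return the learned value of C after A->reward, then C->A (no reward) training."""
    w = {"A": 0.0, "C": 0.0}
    for _ in range(n_cond):                      # Conditioning: A -> reward
        w["A"] += alpha * (1.0 - w["A"])
    for _ in range(n_so):                        # Second-order: C -> A, no reward
        # Error at onset of A, shunted by eta; this is what trains C.
        delta_onset = (1.0 - eta) * (gamma * w["A"] - w["C"])
        delta_end = 0.0 - w["A"]                 # A is not rewarded, so it extinguishes
        w["C"] += alpha * delta_onset
        w["A"] += alpha * delta_end
    return w["C"]


if __name__ == "__main__":
    for eta in (0.0, 0.5, 1.0):
        print(f"eta={eta}: X={blocking(eta):.4f}, C={second_order(eta):.4f}")
```

In this sketch the value of X stays near zero for every η, because the only error that trains X occurs at reward delivery and is left intact, whereas the value of C shrinks as η grows, because C is trained solely by the shunted error at onset of A. Modeling the signal as a prediction (Model V) would require a different update rule and is not covered here.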
Maes, E.J.P., Sharpe, M.J., Usypchuk, A.A. et al. Causal evidence supporting the proposal that dopamine transients function as temporal difference prediction errors. Nat Neurosci 23, 176–178 (2020). https://doi.org/10.1038/s41593-019-0574-1