Abstract
Reward expectations based on internal knowledge of the external environment are a core component of adaptive behavior. However, internal knowledge may be inaccurate or incomplete due to errors in sensory measurements. Some features of the environment may also be encoded inaccurately to minimize representational costs associated with their processing. In this study, we investigated how reward expectations are affected by features of internal representations by studying behavior and dopaminergic activity while mice make time-based decisions. We show that several possible representations allow a reinforcement learning agent to model animals’ overall performance during the task. However, only a small subset of highly compressed representations simultaneously reproduced the covariability in animals’ choice behavior and dopaminergic activity. Strikingly, these representations predict an unusual distribution of response times that closely match animals’ behavior. These results inform how constraints of representational efficiency may be expressed in encoding representations of dynamic cognitive variables used for reward-based computations.
Data availability
The experimental data shown in this study are available through the original study (ref. 31). The procedures of that study were approved by the Champalimaud Foundation Animal Welfare Body (protocol number: 2017/013) and the Portuguese Veterinary General Board (Direcção-Geral de Veterinária, project approval number: 0421/000/000/2018) and were in accordance with the European Union Directive 2010/63/EEC.
Code availability
All code for simulations reported in this study and some saved variables used to show the model’s results are available at https://github.com/asmamotiwala/ITDSNcDATDE.
References
Dayan, P. & Sejnowski, T. J. TD(λ) converges with probability 1. Mach. Learn. 14, 295–301 (1994).
Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
Bayer, H. M. & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47, 129–141 (2005).
Fiorillo, C. D., Tobler, P. N. & Schultz, W. Discrete coding of reward probability and uncertainty by dopamine neurons. Science 299, 1898–1902 (2003).
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Reynolds, J. N. J., Hyland, B. I. & Wickens, J. R. A cellular mechanism of reward-related learning. Nature 413, 67–70 (2001).
Stauffer, W. R., Lak, A. & Schultz, W. Dopamine reward prediction error responses reflect marginal utility. Curr. Biol. 24, 2491–2500 (2014).
Steinberg, E. E. et al. A causal link between prediction errors, dopamine neurons and learning. Nat. Neurosci. 16, 966–973 (2013).
Niv, Y. & Langdon, A. Reinforcement learning with Marr. Curr. Opin. Behav. Sci. 11, 67–73 (2016).
Watabe-Uchida, M., Eshel, N. & Uchida, N. Neural circuitry of reward prediction error. Annu. Rev. Neurosci. 40, 373–394 (2017).
Daw, N. D., Courville, A. C. & Touretzky, D. S. Representation and timing in theories of the dopamine system. Neural Comput. 18, 1637–1677 (2006).
Ludvig, E. A., Sutton, R. S. & Kehoe, E. J. Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Comput. 20, 3034–3054 (2008).
Suri, R. E. & Schultz, W. A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience 91, 871–890 (1999).
Botvinick, M. M., Niv, Y. & Barto, A. G. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113, 262–280 (2009).
Starkweather, C. K., Babayan, B. M., Uchida, N. & Gershman, S. J. Dopamine reward prediction errors reflect hidden-state inference across time. Nat. Neurosci. 20, 581–589 (2017).
Russek, E. M., Momennejad, I., Botvinick, M. M., Gershman, S. J. & Daw, N. D. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Comput. Biol. 13, e1005768 (2017).
Gouvêa, T. S. et al. Striatal dynamics explain duration judgments. eLife 4, e11386 (2015).
Mello, G. B. M., Soares, S. & Paton, J. J. A scalable population code for time in the striatum. Curr. Biol. 25, 1113–1122 (2015).
Remington, E. D., Narain, D., Hosseini, E. A. & Jazayeri, M. Flexible sensorimotor computations through rapid reconfiguration of cortical dynamics. Neuron 98, 1005–1019 (2018).
Wang, J., Narain, D., Hosseini, E. A. & Jazayeri, M. Flexible timing by temporal scaling of cortical responses. Nat. Neurosci. 21, 102–110 (2018).
Atick, J. J. & Redlich, A. N. What does the retina know about natural scenes? Neural Comput. 4, 196–210 (1992).
Lewicki, M. S. Efficient coding of natural sounds. Nat. Neurosci. 5, 356–363 (2002).
Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
Rieke, F., Bodnar, D. A. & Bialek, W. Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proc. Biol. Sci. 262, 259–265 (1995).
Machens, C. K., Gollisch, T., Kolesnikova, O. & Herz, A. V. M. Testing the efficiency of sensory coding with optimal stimulus ensembles. Neuron 47, 447–456 (2005).
Reinagel, P. & Zador, A. M. Natural scene statistics at the centre of gaze. Network 10, 341–350 (1999).
Salinas, E. How behavioral constraints may determine optimal sensory representations. PLoS Biol. 4, e387 (2006).
Botvinick, M., Weinstein, A., Solway, A. & Barto, A. Reinforcement learning, efficient coding, and the statistics of natural tasks. Curr. Opin. Behav. Sci. 5, 71–77 (2015).
Summerfield, C. & Tsetsos, K. Building bridges between perceptual and economic decision-making: neural and computational mechanisms. Front. Neurosci. 6, 70 (2012).
Louie, K. & Glimcher, P. W. Efficient coding and the neural representation of value. Ann. N Y Acad. Sci. 1251, 13–32 (2012).
Soares, S., Atallah, B. V. & Paton, J. J. Midbrain dopamine neurons control judgment of time. Science 354, 1273–1277 (2016).
Kepecs, A., Uchida, N., Zariwala, H. A. & Mainen, Z. F. Neural correlates, computation and behavioural impact of decision confidence. Nature 455, 227–231 (2008).
Kiani, R. & Shadlen, M. N. Representation of confidence associated with a decision by neurons in the parietal cortex. Science 324, 759–764 (2009).
Lak, A., Nomoto, K., Keramati, M., Sakagami, M. & Kepecs, A. Midbrain dopamine neurons signal belief in choice accuracy during a perceptual decision. Curr. Biol. 27, 821–832 (2017).
Janssen, P. & Shadlen, M. N. A representation of the hazard rate of elapsed time in macaque area LIP. Nat. Neurosci. 8, 234–241 (2005).
Fiorillo, C. D., Newsome, W. T. & Schultz, W. The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci. 11, 966–973 (2008).
Pasquereau, B. & Turner, R. S. Dopamine neurons encode errors in predicting movement trigger occurrence. J. Neurophysiol. 113, 1110–1123 (2015).
Gibbon, J. & Church, R. M. Representation of time. Cognition 37, 23–54 (1990).
Joel, D., Niv, Y. & Ruppin, E. Actor–critic models of the basal ganglia: new anatomical and computational perspectives. Neural Netw. 15, 535–547 (2002).
Khamassi, M., Lachèze, L., Girard, B., Berthoz, A. & Guillot, A. Actor–critic models of reinforcement learning in the basal ganglia: from natural to artificial rats. Adaptive Behavior 13, 131–148 (2005).
Roitman, J. D. & Shadlen, M. N. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci. 22, 9475–9489 (2002).
Wimmer, G. E., Daw, N. D. & Shohamy, D. Generalization of value in reinforcement learning by humans. Eur. J. Neurosci. 35, 1092–1104 (2012).
Foster, D. & Dayan, P. Structure in the space of value functions. Mach. Learn. 49, 325–346 (2002).
Solway, A. et al. Optimal behavioral hierarchy. PLoS Comput. Biol. 10, e1003779 (2014).
Mastrogiuseppe, F. & Ostojic, S. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron 99, 609–623 (2018).
Kao, T.-C., Sadabadi, M. S. & Hennequin, G. Optimal anticipatory control as a theory of motor preparation: a thalamocortical circuit model. Neuron 109, 1567–1581 (2021).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Song, H. F., Yang, G. R. & Wang, X.-J. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife 6, e21492 (2017).
Wang, J. X. et al. Prefrontal cortex as a meta-reinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
Baird, L. C. Advantage updating. Technical Report WL-TR-93-1146. Wright Laboratory, Wright-Patterson Air Force Base. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9434&rep=rep1&type=pdf (1993).
Acknowledgements
We thank J. Semedo and A. Renart for several helpful discussions and feedback on the manuscript. We thank members of the Paton laboratory for several helpful discussions. We also thank S. Berger for feedback on the manuscript. This work was developed with support from the research infrastructure Congento, co-financed by Lisboa Regional Operational Programme (Lisboa2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund and Fundação para a Ciência e Tecnologia (Portugal), under project LISBOA-01-0145-FEDER-022170. The work was funded by a Howard Hughes Medical Institute International Research Scholar Award to J.J.P. (55008745), a European Research Council Consolidator grant (DYCOCIRC REP7723391) to J.J.P., a BIAL bursary for scientific research to J.J.P. (193/2016) and internal support from the Champalimaud Foundation. The work was also supported by a National Institutes of Health U01 grant (NS094288) to C.K.M. and the Fundação para Ciência e Tecnologia (SFRH/BD/52214/2013) to A.M. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
A.M., J.J.P. and C.K.M. designed the study. A.M. performed all analyses and simulations. S.S., B.V.A. and J.J.P. designed and S.S. and B.V.A. performed the behavioral and photometry experiments. A.M., J.J.P. and C.K.M. wrote the manuscript. J.J.P. and C.K.M. contributed equally to this work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Neuroscience thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Schematic to illustrate how various task events influence reward expectations.
Let us assume that the durations of the presented intervals are uniformly distributed between 0 and 3 s and that choices are reported only at interval offset. In this case, the agent needs to estimate value functions over two sets of states, one during the interval and the other at interval offset. Let us also assume that the reward amount is 1; in this case, the value at interval offset, C(z), will be equal to the probability of reporting the correct choice for that interval estimate. During the interval, the value function is a weighted sum of the value of interval offset at that time and the value of being in the interval at later times. The weighting factor is given by the probability of transitioning to each of these states, that is, \(V(z)=p({z}^{\prime}={\rm{interval}}\,{\rm{offset}}\mid z)* C(z)+\gamma * p({z}^{\prime}\ne {\rm{interval}}\,{\rm{offset}}\mid z)* V({z}^{\prime})\), where \({z}^{\prime}\) is the successor state. (a–d) If we assume that the agent does not explicitly encode the distribution of interval durations, but does encode an estimate of choice accuracy at interval offset, the hazard rate encoded by the agent is zero for all internal estimates of time (shown in panel a), and its estimate of value during the interval will always be zero (shown in panel c). Its estimate of value at interval offset is equal to the probability with which it will correctly report interval duration (shown in panel b) and the resulting RPEs at interval offset will reflect only choice accuracy (shown in panel d). (e–h) Let us now assume that the agent encodes the distribution of interval durations, but does not encode choice accuracy. The probability of detecting interval offset at any estimated time z given that no interval offset was detected for all \({z}^{\prime} < z\) is given by the hazard rate of interval offset, H(z). In the case of uniformly distributed interval durations, H(z) is shown in panel e.
In this case, the value function during the interval is monotonically increasing. If there were no time discounting (that is, γ = 1), this value function would be 1 for the entire interval. For time-discounted rewards (that is, 0 < γ < 1), the value function simply reflects the fact that early in the interval the agent expects rewards to be, on average, further in the future and hence more time-discounted than later in the interval (as shown in panel g). Since we assumed here that the agent does not encode choice accuracy, reward expectations from interval offset states are constant (panel f). Consequently, RPEs at interval offset will be monotonically decreasing with elapsed time (as shown in panel h). (i–l) Finally, if we assume that the agent encodes both choice accuracy and the distribution of interval durations, we see that the estimated value at interval offset is the same as when the agent only encodes choice accuracy (shown in panel j). However, the value function during the interval now reflects a combination of choice accuracy and the hazard rate of interval offsets, that is, \(V(z)=H(z)* C(z)+\gamma * (1-H(z))* V({z}^{\prime})\) (shown in panel k). Consequently, the RPEs will also reflect both choice accuracy and the hazard rate of interval offset (as shown in panel l).
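The three cases described in this caption can be sketched numerically with a backward recursion over a discretized time axis. This is only an illustrative sketch, not the simulation code of the study: the grid of 30 time steps, the discount factor, the shape of the accuracy curve C(z) (with a boundary assumed at 1.5 s) and the convention that the RPE at interval offset is C(z) minus V(z) are all assumptions made here for illustration.

```python
import numpy as np

def interval_values(hazard, offset_value, gamma):
    """Backward recursion V[k] = H[k]*C[k] + gamma*(1 - H[k])*V[k+1]."""
    n = len(hazard)
    values = np.zeros(n)
    next_value = 0.0  # no state follows the last time step
    for k in range(n - 1, -1, -1):
        values[k] = hazard[k] * offset_value[k] + gamma * (1.0 - hazard[k]) * next_value
        next_value = values[k]
    return values

# Hypothetical discretization: 30 steps of 0.1 s covering 0.1 to 3.0 s.
n_steps = 30
z = (np.arange(n_steps) + 1) * 0.1
# Hazard rate of a uniform distribution over the 30 possible offset times.
hazard_uniform = 1.0 / (n_steps - np.arange(n_steps))
# Hypothetical accuracy curve: 0.5 at an assumed 1.5 s boundary, near 1 far from it.
accuracy = 0.5 + 0.5 * np.abs(np.tanh((z - 1.5) / 0.5))
gamma = 0.95  # hypothetical discount factor

# Case 1 (panels a-d): accuracy encoded, hazard rate assumed to be zero.
v_accuracy_only = interval_values(np.zeros(n_steps), accuracy, gamma)
# Case 2 (panels e-h): hazard rate encoded, offset value constant at 1.
v_hazard_only = interval_values(hazard_uniform, np.ones(n_steps), gamma)
# Case 3 (panels i-l): both encoded.
v_both = interval_values(hazard_uniform, accuracy, gamma)

# Assumed RPE convention at interval offset: outcome value minus prediction.
rpe_accuracy_only = accuracy - v_accuracy_only
rpe_hazard_only = 1.0 - v_hazard_only
rpe_both = accuracy - v_both
```

Under these assumptions, the first case gives a value of zero throughout the interval, the second gives a value that rises with elapsed time (and equals 1 everywhere when gamma is set to 1), and the third mixes the two profiles.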
Extended Data Fig. 2 Schematic to illustrate how the overall profile of RPEs at interval offset determines differences between psychometric functions for low- and high-RPE trials.
Each row shows how over- or underestimating elapsed time (shown in column 1) changes RPEs at any possible interval duration (column 2), the probability of reporting choice ‘long’ (column 3) and how the magnitude of RPEs relates to the probability of choice ‘long’ (column 4). The top row shows the case when reward expectations are estimated only based on choice accuracy (as shown in Extended Data Fig. 1a–d), the middle row shows the case when reward expectations only reflect the hazard rate of interval offset (as shown in Extended Data Fig. 1e–h) and the bottom row shows the case when reward expectations are estimated based on choice accuracy as well as the hazard rate of interval offset (as shown in Extended Data Fig. 1i–l). (Column 1) Let us consider two example types of trials (shown in panels a,e,i), one in which elapsed time is underestimated (z < t, shown by the blue line) and the other in which elapsed time is overestimated (z > t, shown by the red line). (Column 2) For these two types of trials, RPEs at interval offsets for all possible interval durations are shown in the second column (each row corresponds to one of the three possibilities for how the agent might estimate reward expectations as a function of task events, shown in Extended Data Fig. 1d,h,l). In each panel, for every time step, the trial on which the RPE is lower than the other is highlighted in yellow. In panel (b), when RPEs result from reward expectations that only take into account choice accuracy, for all time points before the decision boundary, trials on which elapsed time is overestimated will have lower RPEs than trials on which elapsed time is underestimated. On the other hand, for all time points after the decision boundary, trials on which elapsed time is underestimated will have lower RPEs. In panel (f), when RPEs are driven only by the hazard rate of interval offset, for all time points, trials on which elapsed time is overestimated will have lower RPEs.
Finally, in panel (j), when RPEs reflect both choice accuracy and hazard rate of interval offset, RPEs are lower on trials on which elapsed time is overestimated on the short side of the boundary. On the long side of the boundary, close to the boundary, trials on which elapsed time is underestimated have lower RPEs. However, for estimates much longer than the boundary, we see that trials on which elapsed time is overestimated have lower RPEs. (Column 3) If the agent’s choices change as a function of its internal estimates of elapsed time, for the two example trial types shown here, the psychometric function of the agent will also be different. When the agent underestimates elapsed time (blue curve), the psychometric curve will be biased towards ‘short’ choices (that is, it will show a rightward shift). Similarly, if the agent overestimates time (red curve), the psychometric curve will be biased towards ‘long’ choices (that is, it will be shifted left). To establish the relationship between the magnitude of RPE at any given estimated time of interval offset and the probability of choices the agent will report, for each of the trial types, all time points at which RPE was lower on that trial type (shown by the yellow highlights in the second column) are also highlighted in yellow. In panel (c), we see that for all time points before the boundary, the probability of choosing long is higher for most segments highlighted in yellow. For all time points after the decision boundary, we see that the probability of choosing short is higher for all segments highlighted in yellow. In panel (g), we see that for all interval durations, trials that had lower RPEs have a higher probability of reporting the interval as ‘long’. Finally, in panel (k), we see that for all interval durations before the boundary, low RPE trials have a higher probability of reporting choice ‘long’ and the opposite is true after the decision boundary.
(Column 4) For each of the panels c, g and k, for all time points on either side of the boundary, we ask: what is the average of the psychometric curves highlighted in yellow? (d) In panel c, we see that on the short side of the boundary, for most time points the red psychometric curve is highlighted, and the average of the highlighted segments of the curve is shown by the solid yellow marker in panel d. For time points longer than the boundary, we see in panel c that for most time points the blue curve is highlighted, and the average of that segment of the psychometric curve is shown by the yellow marker on the long side of the boundary in panel d. The green points in panel d show the averages of the curves in panel c on either side of the boundary that are not highlighted and correspond to time points at which the RPEs (shown in panel b) are higher. In other words, when RPEs are driven only by choice accuracy, lower RPEs are in general associated with low choice variability and hence we would predict the psychometric curves for low and high RPE trials, in this case, to show a difference in slope. (h) Following the same steps for the middle row (panels e–h), we find that when RPEs are driven only by the hazard rate of interval offset, low RPEs correlate with a higher probability of reporting long, irrespective of which side of the boundary the interval offset lies on. Thus, in this case we would predict a change in bias in psychometric functions for low and high RPE trials. (l) Finally, when RPEs reflect both choice accuracy and the hazard rate of interval offset, following the same steps we find that the psychometric curves for low and high RPE trials show a change in slope.
Extended Data Fig. 3 Distributions of internal estimates of elapsed time at interval offset.
Each curve shows the distribution of the agent’s internal estimates of elapsed time since interval onset at interval offsets for each of the interval durations presented in the task. The color of the curve indicates which interval duration the distribution corresponds to. We see that these distributions of internal estimates get wider with the length of the interval duration at which the estimates are recorded. This is due to scalar variability implemented in the dynamics with which the latent variable and hence internal estimates of elapsed time evolve in the model.
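As a rough illustration of scalar variability, the sketch below draws internal estimates whose standard deviation grows in proportion to the presented duration. It is a stand-in under stated assumptions, not the latent-variable dynamics of the model: the Gaussian form, the Weber fraction of 0.15 and the four durations are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

weber_fraction = 0.15  # hypothetical ratio of standard deviation to duration
durations = np.array([0.6, 1.2, 1.8, 2.4])  # hypothetical interval durations (s)
n_trials = 20000

# One row per duration, one column per simulated trial.
# The spread of each row is proportional to the duration it belongs to.
estimates = rng.normal(
    loc=durations[:, None],
    scale=weber_fraction * durations[:, None],
    size=(len(durations), n_trials),
)

spread = estimates.std(axis=1)                 # widens with duration
coefficient_of_variation = spread / estimates.mean(axis=1)  # roughly constant
```

With these assumptions, the distributions widen with interval duration while the ratio of spread to mean stays close to the chosen Weber fraction, which is the signature of scalar variability.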
Extended Data Fig. 4 Average reward prediction errors at interval offset and difference in psychometric functions of trials grouped based on the magnitude of RPEs, for varying degrees of compression in the representation used.
(Top) We see that the profile of average RPEs does not vary considerably for the different degrees of compression in the basis functions used to estimate value functions. (Bottom) We see that for the efficient representation (λ = 1), the difference in the psychometric curve has the same sign for all the stimuli presented. This corresponds to a change in bias and an overall shift along the stimulus axis between the two psychometric functions (as seen in the data, shown in Fig. 2b). On the other hand, for the unambiguous representation (λ = 0), the difference in the psychometric function changes sign for stimuli on different sides of the boundary. This corresponds to a change in slope between the two psychometric functions. This is consistent with the results from the simple model shown in Extended Data Fig. 2l, where the psychometric curves cross around the decision boundary. We see that only for representations that are very close to the most efficient version do the psychometric functions show a difference in bias, as is observed in the data.
Extended Data Fig. 5 Quantification of average RPE at interval offset for a range of model parameters shows that the main features of the average RPE profile can be reproduced by representations of all degrees of compression.
(a–c) All of the panels in the top row show the average RPE at interval offset using the efficient representation. Each panel shows which feature of the RPE is being quantified in the corresponding column. (a) The dashed lines show average RPE for all short intervals (in purple) and all long intervals (in brown). (b) The dark and light purple dots show the average RPE magnitude for short easy and short near-boundary stimuli, respectively. (c) The dark and light brown points show the average RPE magnitude for long easy and long near-boundary stimuli, respectively. (d,g,j) All panels show the difference between average RPE at interval offset for long and short interval durations. This difference is plotted as a function of model parameters that control stochasticity of the policy (α) and temporal variability (σ) for the efficient representation, an intermediate representation and the unambiguous representation. In (d) we see quantifications of RPEs when using the efficient representation (λ = 1), in (g) using an intermediate representation (λ = 0.5) and in (j) using the unambiguous representation (λ = 0). For all model parameters and all degrees of compression of the representation, we see that average RPEs are higher for short intervals than long intervals. This is consistent with the observed magnitude of DA activity at interval offset. The difference between average z-scored DA magnitude for short vs long interval durations was 0.52. The red star in the heat maps indicates the parameter combination with which the plots in the top row were generated. (e,h,k) All heat maps in this column show the difference in the magnitude of RPE between easy and near-boundary short interval durations. In (e) we see quantifications of RPEs when using the efficient representation (λ = 1), in (h) using an intermediate representation (λ = 0.5) and in (k) using the unambiguous representation (λ = 0).
For all model parameters and all degrees of compression of the representation, we see that average RPEs are higher for easy short intervals than near-boundary short intervals. This is consistent with observed DA magnitude in experiments. The difference between average z-scored DA magnitude for easy vs near-boundary short intervals was 0.50. (f,i,l) All heat maps in this column show the difference in the magnitude of RPE between easy and near-boundary long interval durations. We see that average RPEs are only slightly higher for long easy intervals than long near-boundary intervals. Over a large range of parameter combinations, the difference in RPE magnitude between easy and near-boundary stimuli is higher for short intervals than long intervals. This is again consistent with observed DA magnitude in experiments. The difference between average z-scored DA magnitude for easy vs near-boundary long intervals was 0.29. Overall, these results show that average DA responses are well captured by the model irrespective of the efficiency of the representation used.
Extended Data Fig. 6 Quantification of psychometric curves split by magnitude of RPE at interval offset for a range of model parameters shows that the change in bias observed in data is best reproduced when using the efficient representation.
(a,b) Panels in the top row show psychometric curves of trials split by high and low RPE at interval offset. Each panel shows which feature of these psychometric curves is quantified and plotted in the heat maps in the corresponding columns. (a) The large dots show, for each psychometric function, the interval duration at which the probability of reporting a long choice would be 0.5. We use this estimate to quantify bias in each psychometric function. (b) For each psychometric function, we characterize the interval durations that would correspond to 0.25 and 0.75 probability of reporting a long choice. We use the range between these two values to quantify the sensitivity of each of the two psychometric functions. (c,e,g) In each of these panels, the heat maps show the difference in bias between the psychometric curves of high and low RPE trials. In (c) we see the difference in bias over a range of model parameters using the efficient representation (λ = 1), in (e) using an intermediate representation (λ = 0.5) and in (g) using the unambiguous representation (λ = 0). For all model parameters we see that the difference in bias is largest when using the efficient representation as compared to that of the intermediate and unambiguous representations. The red star in the heat maps indicates the parameter combination with which the plots in the top row were generated. (d,f,h) In each of these panels, the heat maps show the difference in the sensitivity between the psychometric curves for high vs low RPE trials. In (d) we see the difference in sensitivity over a range of model parameters using the efficient representation (λ = 1), in (f) using an intermediate representation (λ = 0.5) and in (h) using the unambiguous representation (λ = 0). For all model parameters we see that the difference in sensitivity is smallest when using the efficient representation as compared to that of the intermediate and unambiguous representations.
Overall, these quantifications show that for a large range of parameters, the efficient representation yields the greatest similarity to experimental data, that is, a much larger difference in bias than in sensitivity between the psychometric curves for trials split by high and low DA magnitude at interval offset. When we split trials based on the observed magnitude of DA activity at interval offset into low and high DA, we find that the difference in bias was 0.28 s and the difference in sensitivity was 0.05 s.
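The two measures defined in this caption, the duration at which the probability of a long choice is 0.5 (bias) and the range between the 0.25 and 0.75 points (sensitivity), can be read off any monotonically increasing psychometric curve. The sketch below is a minimal illustration that assumes a logistic curve shape and linear interpolation; the function names, the midpoints and the scale are hypothetical and are not taken from the study.

```python
import numpy as np

def bias_and_sensitivity(durations, p_long):
    """Return (bias, sensitivity) for a monotonically increasing curve.

    bias: duration at which p(long) = 0.5.
    sensitivity: distance between the durations at p(long) = 0.25 and 0.75.
    """
    t25, t50, t75 = np.interp([0.25, 0.5, 0.75], p_long, durations)
    return t50, t75 - t25

def logistic_curve(durations, midpoint, scale):
    """Hypothetical psychometric function used only for this example."""
    return 1.0 / (1.0 + np.exp(-(durations - midpoint) / scale))

grid = np.linspace(0.0, 3.0, 3001)  # durations in seconds

# Two hypothetical curves that differ only by a horizontal shift.
high_rpe_curve = logistic_curve(grid, midpoint=1.6, scale=0.2)
low_rpe_curve = logistic_curve(grid, midpoint=1.4, scale=0.2)

bias_high, sensitivity_high = bias_and_sensitivity(grid, high_rpe_curve)
bias_low, sensitivity_low = bias_and_sensitivity(grid, low_rpe_curve)

bias_difference = bias_high - bias_low
sensitivity_difference = sensitivity_high - sensitivity_low
```

In this constructed example the two curves differ in bias but not in sensitivity, which is the qualitative pattern the caption attributes to the efficient representation and to the data.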
Extended Data Fig. 7 Distributions of RPEs at interval offset show why compression in the representation causes psychometric curves split by RPE magnitude to be different.
To understand why the asymmetry in the value function at interval offset using the efficient representation (shown in Fig. 7e) reproduces the change in bias observed in the psychometric curves of low and high DA trials in animals, let us consider how internal estimates of time, RPEs and choices covary for each of the two near-boundary intervals. Let us first consider the results obtained using the unambiguous representation. For each of these intervals, the agent’s estimate of elapsed time will vary from trial to trial as described by the two distributions shown in (a) and (d). We recall that the agent’s decisions are based entirely on its internal estimates of elapsed time: when the estimate is shorter (or longer) than the boundary, the agent will report choice ‘short’ (or ‘long’) with higher probability. We can therefore split each of the distributions based on the choice of the agent ((a) and (d), red and blue areas). This procedure creates four groups of trials, given by the two near-boundary intervals and the two choices of the agent. Within each group, the variability in time estimates gives rise to associated variability in RPEs. The four resulting distributions of RPEs are shown in (b) and (e). Here, the two panels group trials according to the presented interval, and the red and blue RPE distributions in each panel correspond to trials grouped according to the agent’s choice. Using these groupings, we can now study how high or low RPE trials relate to behavior. To do so, we first define high (or low) RPE trials for a given interval as all trials with RPEs greater (or smaller) than the median RPE for that interval (indicated by the dashed line). The fractions of long choices falling into the high (or low) RPE trials, shown in (c) and (f), correspond to points at the near-boundary intervals in the psychometric functions shown in Fig. 4c (which are for all interval durations).
For the RL agent using the unambiguous representation, RPEs on incorrect trials are on average lower than RPEs on correct trials (b) and (e). In turn, if we split all trials based on the magnitude of RPEs, irrespective of the interval, we find that the agent made more mistakes on low RPE trials compared to high RPE trials. Thus, the psychometric curves for these two groups of trials show a larger difference in slope and cross each other around the decision boundary (c) and (f). For the RL agent using the efficient representation, the picture is very different (g–l). Here we find that RPEs are on average lower when the agent reports choices as long, irrespective of the interval (h) and (k). Consequently, the psychometric functions for high and low RPEs show a larger change in bias and do not cross each other near the boundary (i) and (l).
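The median split described in this caption can be sketched for the trials of a single interval duration as follows. The function name and the four-trial toy data are hypothetical and serve only to show the bookkeeping; trials whose RPE equals the median fall into neither group, following the "greater (or smaller) than the median" definition above.

```python
import numpy as np

def long_fraction_by_rpe(rpes, chose_long):
    """Fraction of 'long' choices among high- and low-RPE trials.

    Trials are split at the median RPE of the given interval.
    Returns (fraction_long_high_rpe, fraction_long_low_rpe).
    """
    rpes = np.asarray(rpes, dtype=float)
    chose_long = np.asarray(chose_long, dtype=bool)
    median_rpe = np.median(rpes)
    high = rpes > median_rpe
    low = rpes < median_rpe
    return chose_long[high].mean(), chose_long[low].mean()

# Hypothetical toy data for one near-boundary interval:
# the two trials with the lowest RPEs are the ones reported as 'long'.
example_rpes = [0.1, 0.2, 0.3, 0.4]
example_chose_long = [True, True, False, False]

frac_long_high, frac_long_low = long_fraction_by_rpe(example_rpes, example_chose_long)
```

In this toy case all 'long' reports fall in the low-RPE half, mimicking the pattern the caption describes for the efficient representation, where RPEs are on average lower when the agent reports 'long'.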
Extended Data Fig. 8 The efficient representation incentivises procrastination of choices for long interval estimates close to the decision boundary.
(a) The heat map shows the value function learnt using the efficient representation when the agent is required to report choices immediately after interval offset. The purple and mustard yellow lines show example trajectories the agent would take through the state space if it withheld choice for the entire trial for an example ‘short’ (purple) and ‘long’ (yellow) interval. (b,c) The sequence of state values that the agent will encounter when it follows the purple and yellow trajectories, respectively, shown in (a). The triangle markers indicate the time steps at which the agent encountered interval offset during these trials. Let us denote the interval offset state as z1 and the subsequent state that the agent transitions to as z2. The red arrows show the transition between z1 and z2 for the example trajectories shown in (b,c). (b) We see that after an estimated ‘short’ interval, withholding choice would cause the value function to decrease and hence incur negative RPEs. However, the average reward obtained from reporting a choice immediately would be equal to the value of the interval offset state z1 and will, on average, incur zero RPEs. Hence the agent is discouraged from withholding choice when estimating an interval to be ‘short’. (c) On the other hand, for intervals estimated as near-boundary ‘long’, withholding choices results in incurring positive RPEs. If the agent reported a choice immediately, it would on average receive rewards equal to the value function at z1 and incur zero RPEs in doing so. Hence the agent is incentivised to withhold choices for long near-boundary interval estimates. On these trials, when the agent transitions from z1 to z2 by withholding choice actions, the value of z1 will be updated closer to the value of state z2 based on the TD update equation V(z1) ← V(z1) + α(γV(z2) − V(z1)).
Moreover, when the agent reports a choice action at z2 after transitioning from z1 to z2, the average reward it will receive from z2 will be lower than if the interval offset had been presented at z2. trialtotrial variability in the latent variable z will lead to incorrect estimates of the category (‘short’ vs ‘long’) of intervals presented closer to the decision boundary (such as z1) than those further away from it (such as z2). Consequently, trialtotrial variability in choices at z2 will be lower if those choices are from trials where the agent’s estimate at interval offset is z2 than if choices at z2 are reported on trials in which agent’s estimate at interval offset is either z2 or z1. Thus, TDupdates on the long side of the decision boundary lead to a flattening of the value function in the efficient model due to procrastination of choices.
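The sign of the RPE incurred by withholding choice, and the resulting flattening of the value function, can be illustrated with a minimal sketch of the TD update quoted above. The state values, the learning rate α = 0.1 and the discount factor γ = 1 are hypothetical numbers chosen for illustration, and the `reward` argument (zero while choice is withheld) is an assumption added so that the function is general.

```python
def td_update(values, z1, z2, alpha=0.1, gamma=1.0, reward=0.0):
    """Apply one TD(0) update to V(z1) after a transition z1 -> z2.

    With reward = 0 this is V(z1) <- V(z1) + alpha * (gamma * V(z2) - V(z1)).
    Returns the reward prediction error (RPE) for the transition.
    """
    rpe = reward + gamma * values[z2] - values[z1]
    values[z1] += alpha * rpe
    return rpe


# Hypothetical state values, not taken from the fitted model.
short_values = {"z1": 0.8, "z2": 0.6}  # value falls after a 'short' estimate
long_values = {"z1": 0.6, "z2": 0.8}   # value rises after a near-boundary 'long' estimate

rpe_short = td_update(short_values, "z1", "z2")  # negative: withholding is discouraged
rpe_long = td_update(long_values, "z1", "z2")    # positive: withholding is incentivised

# Repeatedly withholding choice pulls V(z1) towards V(z2), flattening the value function.
flat_values = {"z1": 0.6, "z2": 0.8}
for _ in range(200):
    td_update(flat_values, "z1", "z2")
```

In this toy setting V(z2) is held fixed; in the full model it would itself change as choices made at z2 are rewarded less often, as described in the caption.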
Extended Data Fig. 9 Average response times according to the agent’s choice for varying degrees of compression in the representation.
The agent has short response times for all stimuli when choosing short, and this profile is very similar for all degrees of compression in the representation. However, the profile of average response times when the agent chooses long changes greatly with the degree of compression. Only with the efficient representation (λ = 1) do response times on trials with ‘long’ choices vary strongly with interval duration and decrease as the interval gets longer, as seen in the data (shown in Fig. 8c).
Extended Data Fig. 10 Quantification of average response times for a range of model parameters shows that the response time profile observed in animals is best reproduced by the model when using the efficient representation.
(a–c) All of the panels in the top row show the average response times of the model when using the efficient representation. The solid red line shows average response times for all interval durations when the agent makes long choices and the solid blue line shows average response times when the agent makes short choices. Each panel also shows which feature of the response time profile is being quantified in the corresponding column. (a) The dashed lines show average RTs for all short choices (in blue) and all long choices (in red). (b) The two red dashed line segments show the average RTs for long choices for short and long interval durations presented. (c) The two blue dashed line segments show the average RTs for short choices for short and long interval durations presented. (d,g,j) All heat maps show the difference in average response times for all long vs short choices. This difference is plotted for a range of parameters that control temporal variability and stochasticity of the agent’s policy. The difference is plotted in (d) for the efficient representation (λ = 1), in (g) for an intermediate representation (λ = 0.5) and in (j) for the unambiguous representation (λ = 0). For all model parameters we see that the difference in average RTs between long and short choices is largest using the efficient representation. (e,h,k) All heat maps show the difference in average RTs for short vs long interval durations when the agent reported long choices. The difference is plotted in (e) for the efficient representation (λ = 1), in (h) for an intermediate representation (λ = 0.5) and in (k) for the unambiguous representation (λ = 0). We see, over a range of parameters, that the difference in average RTs for long vs short intervals, when the agent reports long choices, is largest for the efficient representation. The red star in the heat maps indicates the parameter combination with which the plots in the top row were generated.
(f,i,l) All heat maps show the difference in average RTs for short vs long interval durations when the agent reported short choices. The difference is plotted in (f) for the efficient representation (λ = 1), in (i) for an intermediate representation (λ = 0.5) and in (l) for the unambiguous representation (λ = 0). We see, over a range of parameters, that the difference in average RTs for long vs short intervals, when the agent reports short choices, is very close to zero and the overall response profile is very similar across all degrees of compression in the representation. We also see that the difference in RTs for long vs short intervals is larger when the agent makes long choices than when it makes short choices (compare second and third columns). The most surprising feature of animals’ behaviour during this task is that their response times do not have the ‘x-shaped profile’ previously reported in other two-alternative forced-choice tasks. Instead we find that animals have longer response times for long choices for all interval durations (the difference between average RTs for long vs short choices is 446 ms), that response times for long choices negatively correlate with interval duration (the difference in average RTs for long choices for long vs short interval durations was 651 ms) and that response times for short choices were more similar for long and short interval durations (the difference in average RTs for short choices for long vs short interval durations was 96 ms). Overall, we see that the model response times when using the efficient representation show the main features observed in animals’ RTs and that this is not true when the model uses the unambiguous representation or representations of intermediate efficiency.
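The three features of the response time profile quantified in the columns of this figure can be sketched as below. This is an illustrative reading of the caption rather than the authors' analysis code: the function name `rt_profile_features`, the trial tuple layout, the sign convention (short-interval minus long-interval averages) and the toy numbers are all assumptions.

```python
from statistics import mean


def rt_profile_features(trials, boundary):
    """Summarise a response time (RT) profile with three differences.

    trials: iterable of (interval, choice, rt), with choice 'short' or 'long'
            (hypothetical layout).
    boundary: decision boundary; intervals below it count as short, all
              others as long.
    Raises statistics.StatisticsError if a required subset of trials is empty.
    """
    trials = list(trials)

    def avg_rt(choice, side=None):
        # Average RT for one choice, optionally restricted to one side of the boundary.
        return mean(rt for interval, c, rt in trials
                    if c == choice
                    and (side is None or (side == "short") == (interval < boundary)))

    return {
        # First column: all long choices vs all short choices.
        "long_minus_short_choice": avg_rt("long") - avg_rt("short"),
        # Second column: long choices, short vs long interval durations.
        "long_choice_by_interval": avg_rt("long", "short") - avg_rt("long", "long"),
        # Third column: short choices, short vs long interval durations.
        "short_choice_by_interval": avg_rt("short", "short") - avg_rt("short", "long"),
    }


# Toy trials with made-up RTs in seconds, for illustration only.
toy_trials = [
    (0.6, "short", 0.30), (1.2, "short", 0.32), (1.8, "short", 0.34),
    (1.2, "long", 1.20), (1.8, "long", 0.80), (2.4, "long", 0.50),
]
features = rt_profile_features(toy_trials, boundary=1.5)
```

Evaluating such a function over a grid of model parameters would give one heat map per feature and per representation, in the spirit of panels (d–l).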
Supplementary information
Supplementary Figs. 1 and 2
Cite this article
Motiwala, A., Soares, S., Atallah, B.V. et al. Efficient coding of cognitive variables underlies dopamine response and choice behavior. Nat Neurosci 25, 738–748 (2022). https://doi.org/10.1038/s41593-022-01085-7