

Efficient coding of cognitive variables underlies dopamine response and choice behavior

Abstract

Reward expectations based on internal knowledge of the external environment are a core component of adaptive behavior. However, internal knowledge may be inaccurate or incomplete due to errors in sensory measurements. Some features of the environment may also be encoded inaccurately to minimize the representational costs associated with their processing. Here we investigated how reward expectations are affected by the features of internal representations, studying behavior and dopaminergic activity while mice made time-based decisions. We show that several possible representations allow a reinforcement learning agent to model animals’ overall performance during the task. However, only a small subset of highly compressed representations simultaneously reproduced the covariability in animals’ choice behavior and dopaminergic activity. Strikingly, these representations predict an unusual distribution of response times that closely matches animals’ behavior. These results show how constraints on representational efficiency may shape the encoding of dynamic cognitive variables used for reward-based computations.


Fig. 1: Rodents were trained to classify interval durations.
Fig. 2: Reward prediction errors at interval offset can be modulated by choice accuracy and hazard rate of interval offset.
Fig. 3: Task variables can be represented unambiguously using a two-dimensional state space.
Fig. 4: Unambiguous value function approximation cannot simultaneously reproduce average DA at interval offset and trial-to-trial relationship between DA magnitude and choice.
Fig. 5: Reconstruction accuracy and representational efficiency.
Fig. 6: Efficient value function approximation can simultaneously reproduce average DA at interval offset and trial-to-trial relationship between DA magnitude and choice.
Fig. 7: Reward expectations learned by the RL agent using the unambiguous and efficient representations differ most right after the decision boundary.
Fig. 8: The efficient representation predicts an unusual profile of response times that closely matches animals’ behavior.


Data availability

The experimental data analyzed in this study are available through the original study (ref. 31). The experimental procedures of that study were approved by the Champalimaud Foundation Animal Welfare Body (protocol number 2017/013) and the Portuguese Veterinary General Board (Direcção-Geral de Veterinária, project approval number 0421/000/000/2018) and were conducted in accordance with the European Union Directive 2010/63/EEC.

Code availability

All code for the simulations reported in this study, together with saved variables used to reproduce the model’s results, is available at https://github.com/asma-motiwala/ITD-SNcDA-TDE.

References

  1. Dayan, P. & Sejnowski, T. J. TD(λ) converges with probability 1. Mach. Learn. 14, 295–301 (1994).
  2. Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
  3. Bayer, H. M. & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47, 129–141 (2005).
  4. Fiorillo, C. D., Tobler, P. N. & Schultz, W. Discrete coding of reward probability and uncertainty by dopamine neurons. Science 299, 1898–1902 (2003).
  5. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
  6. Reynolds, J. N. J., Hyland, B. I. & Wickens, J. R. A cellular mechanism of reward-related learning. Nature 413, 67–70 (2001).
  7. Stauffer, W. R., Lak, A. & Schultz, W. Dopamine reward prediction error responses reflect marginal utility. Curr. Biol. 24, 2491–2500 (2014).
  8. Steinberg, E. E. et al. A causal link between prediction errors, dopamine neurons and learning. Nat. Neurosci. 16, 966–973 (2013).
  9. Niv, Y. & Langdon, A. Reinforcement learning with Marr. Curr. Opin. Behav. Sci. 11, 67–73 (2016).
  10. Watabe-Uchida, M., Eshel, N. & Uchida, N. Neural circuitry of reward prediction error. Annu. Rev. Neurosci. 40, 373–394 (2017).
  11. Daw, N. D., Courville, A. C. & Touretzky, D. S. Representation and timing in theories of the dopamine system. Neural Comput. 18, 1637–1677 (2006).
  12. Ludvig, E. A., Sutton, R. S. & Kehoe, E. J. Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Comput. 20, 3034–3054 (2008).
  13. Suri, R. E. & Schultz, W. A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience 91, 871–890 (1999).
  14. Botvinick, M. M., Niv, Y. & Barto, A. G. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113, 262–280 (2009).
  15. Starkweather, C. K., Babayan, B. M., Uchida, N. & Gershman, S. J. Dopamine reward prediction errors reflect hidden-state inference across time. Nat. Neurosci. 20, 581–589 (2017).
  16. Russek, E. M., Momennejad, I., Botvinick, M. M., Gershman, S. J. & Daw, N. D. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Comput. Biol. 13, e1005768 (2017).
  17. Gouvêa, T. S. et al. Striatal dynamics explain duration judgments. eLife 4, e11386 (2015).
  18. Mello, G. B. M., Soares, S. & Paton, J. J. A scalable population code for time in the striatum. Curr. Biol. 25, 1113–1122 (2015).
  19. Remington, E. D., Narain, D., Hosseini, E. A. & Jazayeri, M. Flexible sensorimotor computations through rapid reconfiguration of cortical dynamics. Neuron 98, 1005–1019 (2018).
  20. Wang, J., Narain, D., Hosseini, E. A. & Jazayeri, M. Flexible timing by temporal scaling of cortical responses. Nat. Neurosci. 21, 102–110 (2018).
  21. Atick, J. J. & Redlich, A. N. What does the retina know about natural scenes? Neural Comput. 4, 196–210 (1992).
  22. Lewicki, M. S. Efficient coding of natural sounds. Nat. Neurosci. 5, 356–363 (2002).
  23. Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
  24. Rieke, F., Bodnar, D. A. & Bialek, W. Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proc. Biol. Sci. 262, 259–265 (1995).
  25. Machens, C. K., Gollisch, T., Kolesnikova, O. & Herz, A. V. M. Testing the efficiency of sensory coding with optimal stimulus ensembles. Neuron 47, 447–456 (2005).
  26. Reinagel, P. & Zador, A. M. Natural scene statistics at the centre of gaze. Network 10, 341–350 (1999).
  27. Salinas, E. How behavioral constraints may determine optimal sensory representations. PLoS Biol. 4, e387 (2006).
  28. Botvinick, M., Weinstein, A., Solway, A. & Barto, A. Reinforcement learning, efficient coding, and the statistics of natural tasks. Curr. Opin. Behav. Sci. 5, 71–77 (2015).
  29. Summerfield, C. & Tsetsos, K. Building bridges between perceptual and economic decision-making: neural and computational mechanisms. Front. Neurosci. 6, 70 (2012).
  30. Louie, K. & Glimcher, P. W. Efficient coding and the neural representation of value. Ann. N. Y. Acad. Sci. 1251, 13–32 (2012).
  31. Soares, S., Atallah, B. V. & Paton, J. J. Midbrain dopamine neurons control judgment of time. Science 354, 1273–1277 (2016).
  32. Kepecs, A., Uchida, N., Zariwala, H. A. & Mainen, Z. F. Neural correlates, computation and behavioural impact of decision confidence. Nature 455, 227–231 (2008).
  33. Kiani, R. & Shadlen, M. N. Representation of confidence associated with a decision by neurons in the parietal cortex. Science 324, 759–764 (2009).
  34. Lak, A., Nomoto, K., Keramati, M., Sakagami, M. & Kepecs, A. Midbrain dopamine neurons signal belief in choice accuracy during a perceptual decision. Curr. Biol. 27, 821–832 (2017).
  35. Janssen, P. & Shadlen, M. N. A representation of the hazard rate of elapsed time in macaque area LIP. Nat. Neurosci. 8, 234–241 (2005).
  36. Fiorillo, C. D., Newsome, W. T. & Schultz, W. The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci. 11, 966–973 (2008).
  37. Pasquereau, B. & Turner, R. S. Dopamine neurons encode errors in predicting movement trigger occurrence. J. Neurophysiol. 113, 1110–1123 (2015).
  38. Gibbon, J. & Church, R. M. Representation of time. Cognition 37, 23–54 (1990).
  39. Joel, D., Niv, Y. & Ruppin, E. Actor–critic models of the basal ganglia: new anatomical and computational perspectives. Neural Netw. 15, 535–547 (2002).
  40. Khamassi, M., Lachèze, L., Girard, B., Berthoz, A. & Guillot, A. Actor–critic models of reinforcement learning in the basal ganglia: from natural to artificial rats. Adapt. Behav. 13, 131–148 (2005).
  41. Roitman, J. D. & Shadlen, M. N. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci. 22, 9475–9489 (2002).
  42. Wimmer, G. E., Daw, N. D. & Shohamy, D. Generalization of value in reinforcement learning by humans. Eur. J. Neurosci. 35, 1092–1104 (2012).
  43. Foster, D. & Dayan, P. Structure in the space of value functions. Mach. Learn. 49, 325–346 (2002).
  44. Solway, A. et al. Optimal behavioral hierarchy. PLoS Comput. Biol. 10, e1003779 (2014).
  45. Mastrogiuseppe, F. & Ostojic, S. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron 99, 609–623 (2018).
  46. Kao, T.-C., Sadabadi, M. S. & Hennequin, G. Optimal anticipatory control as a theory of motor preparation: a thalamo-cortical circuit model. Neuron 109, 1567–1581 (2021).
  47. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
  48. Song, H. F., Yang, G. R. & Wang, X.-J. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife 6, e21492 (2017).
  49. Wang, J. X. et al. Prefrontal cortex as a meta-reinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).
  50. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
  51. Baird, L. C. Advantage Updating. Technical Report WL-TR-93-1146 (Wright Laboratory, Wright-Patterson Air Force Base, 1993). https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9434&rep=rep1&type=pdf


Acknowledgements

We thank J. Semedo and A. Renart for several helpful discussions and feedback on the manuscript. We thank members of the Paton laboratory for several helpful discussions. We also thank S. Berger for feedback on the manuscript. This work was developed with support from the research infrastructure Congento, co-financed by the Lisboa Regional Operational Programme (Lisboa2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund and the Fundação para a Ciência e Tecnologia (Portugal), under project LISBOA-01-0145-FEDER-022170. The work was funded by a Howard Hughes Medical Institute International Research Scholar Award to J.J.P. (55008745), a European Research Council Consolidator grant (DYCOCIRC - REP-772339-1) to J.J.P., a BIAL bursary for scientific research to J.J.P. (193/2016) and internal support from the Champalimaud Foundation. The work was also supported by a National Institutes of Health U01 grant (NS094288) to C.K.M. and the Fundação para a Ciência e Tecnologia (SFRH/BD/52214/2013) to A.M. The funders had no role in study design, data collection and analysis, the decision to publish or the preparation of the manuscript.

Author information


Contributions

A.M., J.J.P. and C.K.M. designed the study. A.M. performed all analyses and simulations. S.S., B.V.A. and J.J.P. designed, and S.S. and B.V.A. performed, the behavioral and photometry experiments. A.M., J.J.P. and C.K.M. wrote the manuscript. J.J.P. and C.K.M. contributed equally to this work.

Corresponding authors

Correspondence to Asma Motiwala, Joseph J. Paton or Christian K. Machens.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Neuroscience thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Schematic to illustrate how various task events influence reward expectations.

Let us assume that the durations of the intervals presented are uniformly distributed between 0 and 3 s and that choices are reported only at interval offset. In this case, the agent needs to estimate value functions over two sets of states, one during the interval and the other at interval offset. Let us also assume that the reward amount is 1; the value at interval offset, C(z), is then equal to the probability of reporting the correct choice for that interval estimate. During the interval, the value function is a weighted sum of the value of interval offset at that time and the value of being in the interval at later times, where the weights are the probabilities of transitioning to each of these states, that is, \(V(z)=p(z^{\prime}=\text{interval offset}\mid z)\,C(z)+\gamma\,p(z^{\prime}\neq \text{interval offset}\mid z)\,V(z^{\prime})\), where \(z^{\prime}\) is the successor state. (a-d) If we assume that the agent does not explicitly encode the distribution of interval durations, but does encode an estimate of choice accuracy at interval offset, the hazard rate encoded by the agent is zero for all internal estimates of time (panel a), and its estimate of value during the interval will always be zero (panel c). Its estimate of value at interval offset equals the probability with which it will correctly report the interval duration (panel b), and the resulting RPEs at interval offset reflect only choice accuracy (panel d). (e-h) Let us now assume that the agent encodes the distribution of interval durations but does not encode choice accuracy. The probability of detecting interval offset at an estimated time z, given that no interval offset was detected for all \(z^{\prime} < z\), is the hazard rate of interval offset, H(z). For uniformly distributed interval durations, H(z) is shown in panel e. In this case, the value function during the interval is monotonically increasing. Without time discounting (that is, γ = 1), this value function would be 1 for the entire interval. For time-discounted rewards (that is, 0 < γ < 1), the value function simply reflects the fact that, early in the interval, the agent expects rewards to be, on average, further in the future and hence more heavily discounted than later in the interval (panel g). Since we assumed here that the agent does not encode choice accuracy, reward expectations at interval offset states are constant (panel f). Consequently, RPEs at interval offset decrease monotonically with elapsed time (panel h). (i-l) Finally, if we assume that the agent encodes both choice accuracy and the distribution of interval durations, the estimated value at interval offset is the same as when the agent encodes only choice accuracy (panel j). However, the value function during the interval now reflects a combination of choice accuracy and the hazard rate of interval offset, that is, \(V(z)=H(z)\,C(z)+\gamma\,(1-H(z))\,V(z^{\prime})\) (panel k). Consequently, the RPEs also reflect both choice accuracy and the hazard rate of interval offset (panel l).
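To make the three scenarios concrete, the following minimal sketch computes value functions and offset RPEs on a discrete time grid for uniformly distributed interval durations. It is an illustration under the stated assumptions, not the authors’ implementation; the discount factor, grid resolution and the choice-accuracy curve C(z) are hypothetical.

```python
import numpy as np

dt, T = 0.1, 3.0
t = np.arange(0.0, T, dt)        # time grid during the interval
n = len(t)
gamma = 0.98                     # discount factor per time step (assumed)

# Hazard rate of interval offset for uniformly distributed durations on [0, T):
# H(z) = p(offset at z | no offset before z).
H = dt / (T - t)

# Hypothetical choice-accuracy curve C(z): lowest at the 1.5-s boundary,
# rising towards the easy extremes.
C = 0.5 + 0.5 * np.tanh(np.abs(t - 1.5) / 0.5)

def interval_value(hazard, offset_value, gamma):
    """Backward recursion V(z) = H(z)*C(z) + gamma*(1 - H(z))*V(z')."""
    V = np.zeros(n + 1)          # V[n] = 0 terminates the recursion
    for i in range(n - 1, -1, -1):
        V[i] = hazard[i] * offset_value[i] + gamma * (1 - hazard[i]) * V[i + 1]
    return V[:n]

# (a-d) accuracy encoded, hazard not: value during the interval stays zero.
V_acc = interval_value(np.zeros(n), C, gamma)
rpe_acc = C - V_acc              # offset RPE reflects accuracy only

# (e-h) hazard encoded, accuracy not: offset value is constant.
V_haz = interval_value(H, np.ones(n), gamma)
rpe_haz = 1.0 - V_haz            # decreases monotonically with elapsed time

# (i-l) both encoded.
V_both = interval_value(H, C, gamma)
rpe_both = C - V_both            # reflects accuracy and hazard rate together
```

With γ = 1, V_haz evaluates to 1 at every time step, recovering the undiscounted limit described above; with γ < 1, it increases monotonically towards 1.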

Extended Data Fig. 2 Schematic to illustrate how the overall profile of RPEs at interval offset determines differences between psychometric functions for low and high RPE trials.

Each row shows how over- or underestimating elapsed time (column 1) changes RPEs at each possible interval duration (column 2) and the probability of reporting the choice ‘long’ (column 3), and how the magnitude of RPEs relates to the probability of choosing ‘long’ (column 4). The top row shows the case in which reward expectations are estimated only from choice accuracy (as in Extended Data Fig. 1a-d), the middle row the case in which reward expectations reflect only the hazard rate of interval offset (as in Extended Data Fig. 1e-h) and the bottom row the case in which reward expectations are estimated from choice accuracy as well as the hazard rate of interval offset (as in Extended Data Fig. 1i-l). (Column 1) Consider two example types of trials (panels a,e,i): one in which elapsed time is underestimated (z < t, blue line) and one in which it is overestimated (z > t, red line). (Column 2) For these two trial types, RPEs at interval offset for all possible interval durations are shown in the second column (each row corresponds to one of the three possibilities for how the agent might estimate reward expectations, as in Extended Data Fig. 1d,h,l). In each panel, for every time step, the trial type with the lower RPE is highlighted in yellow. In panel (b), when RPEs result from reward expectations that take into account only choice accuracy, trials on which elapsed time is overestimated have lower RPEs than trials on which it is underestimated for all time points before the decision boundary; for all time points after the decision boundary, trials on which elapsed time is underestimated have lower RPEs. In panel (f), when RPEs are driven only by the hazard rate of interval offset, trials on which elapsed time is overestimated have lower RPEs at all time points. Finally, in panel (j), when RPEs reflect both choice accuracy and the hazard rate of interval offset, RPEs are lower on trials on which elapsed time is overestimated on the short side of the boundary. On the long side of the boundary, close to the boundary, trials on which elapsed time is underestimated have lower RPEs; however, for estimates much longer than the boundary, trials on which elapsed time is overestimated have lower RPEs. (Column 3) If the agent’s choices change as a function of its internal estimates of elapsed time, the psychometric functions for the two example trial types will also differ. When the agent underestimates elapsed time (blue curve), the psychometric curve is biased towards ‘short’ choices (that is, it is shifted rightward). Similarly, when the agent overestimates time (red curve), the psychometric curve is biased towards ‘long’ choices (that is, it is shifted leftward). To establish the relationship between the magnitude of the RPE at any given estimated time of interval offset and the probability of the choices the agent will report, for each trial type, all time points at which the RPE was lower on that trial type (yellow highlights in the second column) are also highlighted in yellow. In panel (c), we see that for all time points before the boundary, the probability of choosing ‘long’ is higher for most segments highlighted in yellow, whereas for all time points after the decision boundary, the probability of choosing ‘short’ is higher for all segments highlighted in yellow.
In panel (g), we see that for all interval durations, trials with lower RPEs have a higher probability of the interval being reported as ‘long’. Finally, in panel (k), we see that for all interval durations before the boundary, low-RPE trials have a higher probability of the choice ‘long’, and the opposite holds after the decision boundary. (Column 4) For each of panels c, g and k, for all time points on either side of the boundary, we ask: what is the average of the psychometric curves highlighted in yellow? (d) In panel c, on the short side of the boundary, the red psychometric curve is highlighted for most time points, and the average of the highlighted segments of the curve is shown by the solid yellow marker in the top panel of column d. For time points longer than the boundary, the blue curve is highlighted for most time points, and the average of that segment of the psychometric curve is shown by the yellow marker on the long side of the boundary in panel d. The green points in panel d show the averages of the curves in panel c on either side of the boundary that are not highlighted and that correspond to time points at which the RPEs (panel b) are higher. In other words, when RPEs are driven only by choice accuracy, lower RPEs are in general associated with low choice variability, and hence we would predict the psychometric curves for low and high RPE trials to differ in slope. (h) Following the same steps for the middle row (panels e-h), we find that when RPEs are driven only by the hazard rate of interval offset, low RPEs correlate with a higher probability of reporting ‘long’ irrespective of which side of the boundary the interval offset lies on. In this case, we would therefore predict a change in bias between the psychometric functions for low and high RPE trials. (l) Finally, when RPEs reflect both choice accuracy and the hazard rate of interval offset, following the same steps we find that the psychometric curves for low and high RPE trials show a change in slope.
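The effect described in column 3 can be reproduced with a few lines of simulation: draw noisy internal estimates of elapsed time, let choices follow the estimate and condition the psychometric curve on the sign of the timing error. The stimulus set and noise level below are illustrative, not the task’s actual values.

```python
import numpy as np

rng = np.random.default_rng(0)
boundary, cv = 1.5, 0.2
stimuli = np.array([0.6, 1.05, 1.26, 1.74, 1.95, 2.4])   # illustrative stimulus set

n_trials = 50_000
t = rng.choice(stimuli, size=n_trials)                    # presented interval
z = t * (1 + cv * rng.standard_normal(n_trials))          # noisy internal estimate
choice_long = z > boundary                                # choice follows the estimate

# Psychometric curves conditioned on the sign of the timing error.
for label, mask in [("underestimated (z < t)", z < t),
                    ("overestimated (z > t)", z > t)]:
    p_long = [choice_long[mask & (t == s)].mean() for s in stimuli]
    print(f"{label}: {np.round(p_long, 2)}")
# Underestimation biases choices towards 'short' (rightward-shifted curve);
# overestimation biases them towards 'long' (leftward-shifted curve).
```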

Extended Data Fig. 3 Distributions of internal estimates of elapsed time at interval offset.

Each curve shows the distribution of the agent’s internal estimates of elapsed time since interval onset, recorded at interval offset, for each of the interval durations presented in the task. The color of each curve indicates the corresponding interval duration. The distributions of internal estimates become wider as the interval duration increases. This is due to the scalar variability built into the dynamics of the latent variable, and hence into the internal estimates of elapsed time, in the model.
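A minimal sketch of one simple way to generate scalar variability, assuming a trial-specific clock speed; in the model itself, this property arises from the dynamics of the latent variable, so the mechanism below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def time_estimates(duration, cv=0.15, n_trials=10_000):
    """Draw a trial-specific clock speed g ~ N(1, cv), so the internal
    estimate z = g * duration has a standard deviation proportional to
    the true duration (scalar variability / Weber's law)."""
    g = 1.0 + cv * rng.standard_normal(n_trials)
    return g * duration

for d in (0.6, 1.2, 2.4):
    z = time_estimates(d)
    print(f"true {d:.1f} s: mean {z.mean():.2f} s, sd {z.std():.2f} s, "
          f"cv {z.std() / z.mean():.2f}")
# The sd grows with duration while the coefficient of variation stays
# roughly constant, so the distributions widen with interval length.
```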

Extended Data Fig. 4 Average reward prediction errors at interval offset and difference in psychometric functions of trials grouped based on the magnitude of RPEs, for varying degrees of compression in the representation used.

(Top) The profile of average RPEs does not vary considerably across the different degrees of compression in the basis functions used to estimate value functions. (Bottom) For the efficient representation (λ = 1), the difference between the psychometric curves has the same sign for all the stimuli presented. This corresponds to a change in bias, that is, an overall shift of one psychometric function relative to the other along the stimulus axis (as seen in the data, shown in Fig. 2b). On the other hand, for the unambiguous representation (λ = 0), the difference between the psychometric functions changes sign for stimuli on different sides of the boundary. This corresponds to a change in slope between the two psychometric functions and is consistent with the results from the simple model shown in Extended Data Fig. 2l, where the psychometric curves cross around the decision boundary. Only for representations very close to the most efficient version do the psychometric functions show a difference in bias, as observed in the data.
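The sign pattern of the difference between two psychometric curves is a compact signature of bias versus slope changes. The sketch below makes this check explicit; the curves and the helper shift_signature are hypothetical, not taken from the published code.

```python
import numpy as np

def shift_signature(p_low, p_high):
    """Classify the difference between two psychometric curves: a constant
    sign across stimuli indicates a bias (horizontal shift), whereas a sign
    flip across the boundary indicates a slope change (curves cross)."""
    diff = np.asarray(p_high) - np.asarray(p_low)
    signs = np.sign(diff[diff != 0])
    return "bias shift" if np.all(signs == signs[0]) else "slope change"

# Hypothetical curves for trials split by RPE magnitude.
p_low  = [0.05, 0.15, 0.35, 0.75, 0.92, 0.98]
p_high = [0.02, 0.08, 0.22, 0.62, 0.88, 0.97]   # shifted the same way everywhere
print(shift_signature(p_low, p_high))            # -> "bias shift"
print(shift_signature([0.10, 0.30, 0.55, 0.70, 0.90],
                      [0.05, 0.20, 0.55, 0.80, 0.95]))  # crosses -> "slope change"
```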

Extended Data Fig. 5 Quantification of average RPE at interval offset for a range of model parameters shows that the main features of the average RPE profile can be reproduced by representations of all degrees of compression.

(a-c) All of the panels in the top row show the average RPE at interval offset using the efficient representation. Each panel indicates which feature of the RPE is quantified in the corresponding column. (a) The dashed lines show the average RPE for all short intervals (purple) and all long intervals (brown). (b) The dark and light purple dots show the average RPE magnitude for short easy and short near-boundary stimuli, respectively. (c) The dark and light brown dots show the average RPE magnitude for long easy and long near-boundary stimuli, respectively. (d,g,j) All panels show the difference between the average RPE at interval offset for long and short interval durations, plotted as a function of the model parameters controlling the stochasticity of the policy (α) and temporal variability (σ): in (d) using the efficient representation (λ = 1), in (g) using an intermediate representation (λ = 0.5) and in (j) using the unambiguous representation (λ = 0). For all model parameters and all degrees of compression of the representation, average RPEs are higher for short intervals than for long intervals. This is consistent with the observed magnitude of DA activity at interval offset: the difference between the average z-scored DA magnitude for short versus long interval durations was 0.52. The red star in the heat maps indicates the parameter combination with which the plots in the top row were generated. (e,h,k) All heat maps in this column show the difference in RPE magnitude between easy and near-boundary short interval durations: in (e) using the efficient representation (λ = 1), in (h) using an intermediate representation (λ = 0.5) and in (k) using the unambiguous representation (λ = 0). For all model parameters and all degrees of compression of the representation, average RPEs are higher for easy short intervals than for near-boundary short intervals. This is consistent with the DA magnitude observed in experiments: the difference between the average z-scored DA magnitude for easy versus near-boundary short intervals was 0.50. (f,i,l) All heat maps in this column show the difference in RPE magnitude between easy and near-boundary long interval durations. Average RPEs are only slightly higher for long easy intervals than for long near-boundary intervals. Over a large range of parameter combinations, the difference in RPE magnitude between easy and near-boundary stimuli is higher for short intervals than for long intervals. This is again consistent with the DA magnitude observed in experiments: the difference between the average z-scored DA magnitude for easy versus near-boundary long intervals was 0.29. Overall, these results show that average DA responses are well captured by the model irrespective of the efficiency of the representation used.

Extended Data Fig. 6 Quantification of psychometric curves split by magnitude of RPE at interval offset for a range of model parameters shows that the change in bias observed in data is best reproduced when using the efficient representation.

(a,b) Panels in the top row show psychometric curves of trials split by high and low RPE at interval offset. Each panel indicates which feature of these psychometric curves is quantified in the heat maps in the corresponding columns. (a) The large dots show, for each psychometric function, the interval duration at which the probability of reporting a long choice would be 0.5; we use this estimate to quantify the bias of each psychometric function. (b) For each psychometric function, we determine the interval durations corresponding to probabilities of 0.25 and 0.75 of reporting a long choice; we use the range between these two values to quantify the sensitivity of each psychometric function. (c,e,g) In each of these panels, the heat maps show the difference in bias between the psychometric curves of high- and low-RPE trials: in (c) over a range of model parameters using the efficient representation (λ = 1), in (e) using an intermediate representation (λ = 0.5) and in (g) using the unambiguous representation (λ = 0). For all model parameters, the difference in bias is largest when using the efficient representation compared with the intermediate and unambiguous representations. The red star in the heat maps indicates the parameter combination with which the plots in the top row were generated. (d,f,h) In each of these panels, the heat maps show the difference in sensitivity between the psychometric curves for high- versus low-RPE trials: in (d) using the efficient representation (λ = 1), in (f) using an intermediate representation (λ = 0.5) and in (h) using the unambiguous representation (λ = 0). For all model parameters, the difference in sensitivity is smallest when using the efficient representation compared with the intermediate and unambiguous representations. Overall, these quantifications show that, for a large range of parameters, the efficient representation yields the greatest similarity to the experimental data, that is, a much larger difference in bias than in sensitivity between the psychometric curves for trials split by high and low DA magnitude at interval offset. When we split trials into low- and high-DA groups based on the observed magnitude of DA activity at interval offset, the difference in bias was 0.28 s and the difference in sensitivity was 0.05 s.
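As a concrete illustration of these two quantifications, the sketch below extracts the bias (the 0.5 crossing) and the sensitivity (the 0.25-0.75 range) from tabulated psychometric curves by linear interpolation. The curves and function names are hypothetical, not taken from the published analysis code.

```python
import numpy as np

def threshold_crossing(stimuli, p_long, p):
    """Linearly interpolate the stimulus at which the psychometric
    function crosses probability p (assumes p_long is increasing)."""
    return float(np.interp(p, p_long, stimuli))

def bias_and_sensitivity(stimuli, p_long):
    pse = threshold_crossing(stimuli, p_long, 0.50)          # point of subjective equality
    spread = (threshold_crossing(stimuli, p_long, 0.75)
              - threshold_crossing(stimuli, p_long, 0.25))   # 0.25-0.75 range
    return pse, spread

# Hypothetical psychometric curves for low- and high-RPE trials.
stimuli  = np.array([0.6, 1.05, 1.26, 1.74, 1.95, 2.4])
low_rpe  = np.array([0.05, 0.15, 0.35, 0.75, 0.92, 0.98])
high_rpe = np.array([0.02, 0.08, 0.22, 0.62, 0.88, 0.97])

pse_lo, sens_lo = bias_and_sensitivity(stimuli, low_rpe)
pse_hi, sens_hi = bias_and_sensitivity(stimuli, high_rpe)
print(f"difference in bias: {pse_hi - pse_lo:.2f} s, "
      f"difference in sensitivity: {sens_hi - sens_lo:.2f} s")
```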

Extended Data Fig. 7 Distributions of RPEs at interval offset show why compression in the representation causes psychometric curves split by RPE magnitude to be different.

To understand why the asymmetry in the value function at interval offset under the efficient representation (shown in Fig. 7e) reproduces the change in bias observed in the psychometric curves of low- and high-DA trials in animals, consider how internal estimates of time, RPEs and choices covary for each of the two near-boundary intervals. First, consider the results obtained using the unambiguous representation. For each of these intervals, the agent’s estimate of elapsed time varies from trial to trial as described by the two distributions shown in (a) and (d). Recall that the agent’s decisions are based entirely on its internal estimates of elapsed time: when the estimate is shorter (or longer) than the boundary, the agent reports the choice ‘short’ (or ‘long’) with higher probability. We can therefore split each distribution according to the choice of the agent (red and blue areas in a and d). This procedure creates four groups of trials, given by the two near-boundary intervals and the two choices of the agent. Within each group, the variability in time estimates gives rise to associated variability in RPEs. The four resulting distributions of RPEs are shown in (b) and (e); the two panels group trials according to the presented interval, and the red and blue RPE distributions in each panel correspond to trials grouped according to the agent’s choice. Using these groupings, we can now study how high- or low-RPE trials relate to behavior. To do so, we first define high (or low) RPE trials for a given interval as all trials with RPEs greater (or smaller) than the median RPE for that interval (dashed line). The fractions of long choices falling into the high (or low) RPE trials, shown in (c) and (f), correspond to the points at the near-boundary intervals in the psychometric functions shown in Fig. 4c (which are for all interval durations). For the RL agent using the unambiguous representation, RPEs on incorrect trials are on average lower than RPEs on correct trials (b and e). In turn, if we split all trials based on the magnitude of RPEs, irrespective of the interval, we find that the agent made more mistakes on low-RPE trials than on high-RPE trials. Thus, the psychometric curves for these two groups of trials show a larger difference in slope and cross each other around the decision boundary (c and f). For the RL agent using the efficient representation, the picture is very different (g-l). Here, RPEs are on average lower when the agent reports the choice ‘long’, irrespective of the interval (h and k). Consequently, the psychometric functions for high and low RPEs show a larger change in bias and do not cross each other near the boundary (i and l).
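A minimal sketch of the median-split procedure described above, assuming per-trial arrays of presented intervals, RPEs and choices; the function name and data layout are hypothetical.

```python
import numpy as np

def split_by_median_rpe(intervals, rpes, choice_long):
    """For each interval, split trials at the median RPE for that interval
    and return the fraction of 'long' choices in the low- and high-RPE groups."""
    result = {}
    for s in np.unique(intervals):
        sel = intervals == s
        m = np.median(rpes[sel])
        low, high = sel & (rpes < m), sel & (rpes >= m)
        result[float(s)] = (choice_long[low].mean(), choice_long[high].mean())
    return result

# Usage with simulated trial data (equal-length 1D arrays):
# fractions = split_by_median_rpe(trial_intervals, trial_rpes, trial_choices)
# fractions[s] -> (p_long | low RPE, p_long | high RPE) at interval s
```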

Extended Data Fig. 8 The efficient representation incentivizes procrastination of choices for long interval estimates close to the decision boundary.

(a) The heat map shows the value function learned using the efficient representation when the agent is required to report choices immediately after interval offset. The purple and mustard yellow lines show example trajectories the agent would take through the state space if it withheld its choice for the entire trial, for an example ‘short’ (purple) and ‘long’ (yellow) interval. (b,c) The sequence of state values that the agent encounters when it follows the purple and yellow trajectories, respectively, shown in (a). The triangle markers indicate the time steps at which the agent encountered interval offset during these trials. Let us denote the interval offset state as \(z_1\) and the subsequent state that the agent transitions to as \(z_2\). The red arrows show the transition between \(z_1\) and \(z_2\) for the example trajectories shown in (b,c). (b) After an estimated ‘short’ interval, withholding the choice would cause the value function to decrease and hence incur negative RPEs. In contrast, the average reward obtained from reporting a choice immediately equals the value of the interval offset state \(z_1\) and will, on average, incur zero RPEs. Hence the agent is discouraged from withholding its choice when it estimates an interval to be ‘short’. (c) On the other hand, for intervals estimated as near-boundary ‘long’, withholding the choice incurs positive RPEs. If the agent reported a choice immediately, it would on average receive rewards equal to the value function at \(z_1\) and incur zero RPEs in doing so. Hence the agent is incentivized to withhold choices for long near-boundary interval estimates. On these trials, when the agent transitions from \(z_1\) to \(z_2\) by withholding choice actions, the value of \(z_1\) is updated towards the value of state \(z_2\) according to the TD update \(V(z_1)\leftarrow V(z_1)+\alpha(\gamma V(z_2)-V(z_1))\). Moreover, when the agent reports a choice action at \(z_2\) after transitioning from \(z_1\), the average reward it receives from \(z_2\) will be lower than if the interval offset had been presented at \(z_2\): trial-to-trial variability in the latent variable z leads to incorrect estimates of the category (‘short’ versus ‘long’) more often for intervals presented closer to the decision boundary (such as \(z_1\)) than for those further away from it (such as \(z_2\)). Consequently, trial-to-trial variability in choices at \(z_2\) will be lower if those choices come from trials where the agent’s estimate at interval offset is \(z_2\) than if choices at \(z_2\) are reported on trials in which the agent’s estimate at interval offset is either \(z_2\) or \(z_1\). Thus, TD updates on the long side of the decision boundary lead to a flattening of the value function in the efficient model, due to the procrastination of choices.
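To make the procrastination incentive concrete, the sketch below applies the TD(0) update from the caption to two hypothetical state values; the numbers are illustrative, not fitted model values.

```python
alpha, gamma = 0.1, 0.98

# Hypothetical values of the two offset states discussed above: z1 just beyond
# the boundary, z2 further along the 'long' side of the state space.
V = {"z1": 0.55, "z2": 0.80}

# Withholding the choice at z1 moves the agent to z2; the TD(0) update then
# pulls V(z1) towards the discounted value of z2:
td_error = gamma * V["z2"] - V["z1"]     # positive here, so withholding is reinforced
V["z1"] += alpha * td_error
print(f"TD error {td_error:+.3f}, updated V(z1) = {V['z1']:.3f}")
# Repeating this across trials flattens the value function on the long side
# of the boundary, producing the procrastination effect described above.
```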

Extended Data Fig. 9 Average response times according to the agent’s choice for varying degrees of compression in the representation.

The agent has short response times for all stimuli when choosing ‘short’, and this profile is very similar for all degrees of compression in the representation used. However, the profile of average response times when the agent chooses ‘long’ changes greatly with the degree of compression in the representation. Only when using the efficient representation (λ = 1) do response times vary strongly as a function of interval duration and decrease with the length of the interval duration on trials with ‘long’ choices, as seen in the data (shown in Fig. 8c).

Extended Data Fig. 10 Quantification of average response times for a range of model parameters shows that the response time profile observed in animals is best reproduced when the model uses the efficient representation.

(a-c) All of the panels in the top row show the average response times of the model when using the efficient representation. The solid red line shows the average response times across interval durations when the agent makes long choices, and the solid blue line shows the average response times when the agent makes short choices. Each panel also indicates which feature of the response time profile is quantified in the corresponding column. (a) The dashed lines show the average RTs for all short choices (blue) and all long choices (red). (b) The two red dashed line segments show the average RTs for long choices at short and long interval durations. (c) The two blue dashed line segments show the average RTs for short choices at short and long interval durations. (d,g,j) All heat maps show the difference in average response times for all long versus short choices, plotted over a range of parameters controlling temporal variability and the stochasticity of the agent’s policy: in (d) using the efficient representation (λ = 1), in (g) using an intermediate representation (λ = 0.5) and in (j) using the unambiguous representation (λ = 0). For all model parameters, the difference in average RTs between long and short choices is largest using the efficient representation. (e,h,k) All heat maps show the difference in average RTs for short versus long interval durations when the agent reported long choices: in (e) using the efficient representation (λ = 1), in (h) using an intermediate representation (λ = 0.5) and in (k) using the unambiguous representation (λ = 0). Over a range of parameters, the difference in average RTs for short versus long intervals, when the agent reports long choices, is largest for the efficient representation. The red star in the heat maps indicates the parameter combination with which the plots in the top row were generated. (f,i,l) All heat maps show the difference in average RTs for short versus long interval durations when the agent reported short choices: in (f) using the efficient representation (λ = 1), in (i) using an intermediate representation (λ = 0.5) and in (l) using the unambiguous representation (λ = 0). Over a range of parameters, the difference in average RTs for short versus long intervals, when the agent reports short choices, is very close to zero, and the overall response profile is very similar across all degrees of compression in the representation. We also see that the difference in RTs for short versus long intervals is larger when the agent makes long choices than when it makes short choices (compare the second and third columns). The most surprising feature of animals’ behavior during this task is that their response times do not have the ‘x-shaped’ profile previously reported in other two-alternative forced-choice tasks. Instead, animals have longer response times for long choices at all interval durations (the difference between average RTs for long versus short choices is 446 ms), response times for long choices correlate negatively with interval duration (the difference in average RTs for long choices between short and long interval durations was 651 ms) and response times for short choices are more similar across short and long interval durations (the difference in average RTs for short choices between short and long interval durations was −96 ms).
Overall, the model’s response times when using the efficient representation show the main features observed in animals’ RTs, and this is not the case when the model uses the unambiguous or intermediate-efficiency representations.


About this article


Cite this article

Motiwala, A., Soares, S., Atallah, B.V. et al. Efficient coding of cognitive variables underlies dopamine response and choice behavior. Nat Neurosci 25, 738–748 (2022). https://doi.org/10.1038/s41593-022-01085-7
