Computational noise in reward-guided learning drives behavioral variability in volatile environments

Findling, Charles; Skvortsova, Vasilisa; Dromnelle, Rémi; Palminteri, Stefano; Wyart, Valentin

doi:10.1038/s41593-019-0518-9

Article
Published: 28 October 2019

Nature Neuroscience volume 22, pages 2066–2077 (2019)

Abstract

When learning the value of actions in volatile environments, humans often make seemingly irrational decisions that fail to maximize expected value. We reasoned that these ‘non-greedy’ decisions, instead of reflecting information seeking during choice, may be caused by computational noise in the learning of action values. Here using reinforcement learning models of behavior and multimodal neurophysiological data, we show that the majority of non-greedy decisions stem from this learning noise. The trial-to-trial variability of sequential learning steps and their impact on behavior could be predicted both by blood oxygen level-dependent responses to obtained rewards in the dorsal anterior cingulate cortex and by phasic pupillary dilation, suggestive of neuromodulatory fluctuations driven by the locus coeruleus–norepinephrine system. Together, these findings indicate that most behavioral variability, rather than reflecting human exploration, is due to the limited computational precision of reward-guided learning.
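The distinction the abstract draws between learning noise and choice stochasticity can be made concrete with a minimal sketch. It is not the authors' implementation. It assumes a Rescorla-Wagner-style update with learning rate `alpha`, a learning-noise term whose standard deviation is `zeta` times the absolute prediction error, and a softmax policy with inverse temperature `beta`. All function names and the form of the noise scaling are assumptions made for illustration.

```python
import numpy as np

def noisy_update(q, reward, alpha, zeta, rng):
    """One value update corrupted by learning noise.

    Assumption: the noise s.d. scales with the size of the prediction
    error (zeta * |delta|); zeta = 0 recovers exact, noise-free learning.
    """
    delta = reward - q                       # reward prediction error
    noise = rng.standard_normal() * zeta * abs(delta)
    return q + alpha * delta + noise

def softmax_choice(q_values, beta, rng):
    """Sample an action from a softmax policy (beta = inverse temperature)."""
    z = beta * (q_values - np.max(q_values))  # subtract max for stability
    p = np.exp(z) / np.sum(np.exp(z))
    return int(rng.choice(len(q_values), p=p))
```

In this sketch a non-greedy decision can arise in two ways: the softmax samples the lower-valued action (choice stochasticity), or noise accumulated in `noisy_update` has changed which action currently has the higher value (learning noise). With `zeta = 0` and a very large `beta`, the agent is an exact, greedy learner.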

**Fig. 1: Experimental paradigm and noisy RL model.**

**Fig. 2: Contributions of learning noise and choice stochasticity to non-greedy decisions.**

**Fig. 3: Decomposition of learning noise into ultimately predictable and unpredictable terms.**

**Fig. 4: Characterization of decision effects predicted by learning noise.**

**Fig. 5: Neural correlates of learning noise in the human brain.**

**Fig. 6: Neural correlates of learning noise in choice-free, cued trials.**

**Fig. 7: Brain–behavior and pupillometric analyses.**

**Fig. 8: Proposed payoff–cost trade-off on learning precision.**

Data availability

The data (behavioral, neuroimaging and pupillometric) that support these findings are available from the corresponding author upon request.

Code availability

Python and C++ code for fitting all computational models described in the article is available at https://github.com/csmfindling/learning_variability. The algorithmic backbone of the Monte Carlo procedures used to fit the models can be found in the Supplementary Modeling Note.
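For readers unfamiliar with Monte Carlo model fitting, the sketch below shows one generic way to estimate the likelihood of a choice sequence when learning is noisy: a bootstrap particle filter. It is not taken from the repository above. The two-action task, the initial values of 0.5, the noise scaling (`zeta` times the absolute prediction error) and the function name `particle_loglik` are all assumptions made for illustration. The authors' actual procedures are described in their Supplementary Modeling Note.

```python
import numpy as np

def particle_loglik(choices, rewards, alpha, zeta, beta,
                    n_particles=1000, seed=0):
    """Bootstrap particle filter estimate of log p(choices | parameters).

    Each particle carries one noisy trajectory of the two action values.
    Assumptions: two actions coded 0/1, feedback for the chosen action
    only, initial values of 0.5.
    """
    rng = np.random.default_rng(seed)
    q = np.full((n_particles, 2), 0.5)
    loglik = 0.0
    for c, r in zip(choices, rewards):
        # Weight: softmax probability of the observed choice per particle
        diff = q[:, c] - q[:, 1 - c]
        w = 1.0 / (1.0 + np.exp(-beta * diff))
        loglik += np.log(np.mean(w))
        # Resample trajectories in proportion to their weights
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        q = q[idx]
        # Propagate: noisy update of the chosen action's value
        delta = r - q[:, c]
        q[:, c] += alpha * delta \
            + zeta * np.abs(delta) * rng.standard_normal(n_particles)
    return loglik
```

With `zeta = 0` every particle follows the same trajectory, so the estimate reduces to the exact log-likelihood of a noise-free learner with a softmax policy. This gives a convenient sanity check before fitting.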


Acknowledgements

We thank C. Summerfield (University of Oxford; Google DeepMind) for comments on an earlier version of the manuscript. This work was supported by a starting grant from the European Research Council awarded to V.W. (ERC-StG-759341), a junior researcher grant from the Agence Nationale de la Recherche awarded to V.W. (ANR-14-CE13-0028) and two department-wide grants from the Agence Nationale de la Recherche (ANR-10-LABX-0087 and ANR-10-IDEX-0001-02 PSL). C.F. was supported by a graduate research fellowship from the Direction Générale de l’Armement (2015-60-0041). S.P. was supported by a CNRS-Inserm ATIP-Avenir grant (R16069JS) and a research grant from the Programme Emergence(s) of the City of Paris.

Author information

These authors contributed equally: Charles Findling, Vasilisa Skvortsova.

Authors and Affiliations

Laboratoire de Neurosciences Cognitives et Computationnelles, Inserm U960, Département d’Études Cognitives, École Normale Supérieure, PSL University, Paris, France
Charles Findling, Vasilisa Skvortsova, Rémi Dromnelle, Stefano Palminteri & Valentin Wyart
ENSAE ParisTech, Paris-Saclay University, Palaiseau, France
Charles Findling
Institut des Systèmes Intelligents et de Robotique, CNRS UMR7222, Sorbonne University, Paris, France
Rémi Dromnelle

Contributions

S.P. and V.W. were responsible for conceptualization. C.F., V.W. and S.P. were responsible for the methodology. C.F., V.S. and V.W. performed the formal analysis. V.S. and R.D. carried out the investigations. C.F., V.S. and V.W. wrote the original draft. C.F., V.S., S.P. and V.W. reviewed and edited the report. V.W. supervised the study and acquired funding.

Corresponding author

Correspondence to Valentin Wyart.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Neuroscience thanks Samuel Gershman, Yonatan Loewenstein, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Additional model comparisons across experiments 1 and 2 (N = 59 participants).

(a) Knock-out procedure. The top panel shows models that differ in the presence or absence of learning noise (ζ) in addition to the softmax choice policy (β). The bottom panel shows models that differ in the presence or absence of a softmax choice policy (β) on top of learning noise (ζ). (b) Results of the model comparison in the partial and complete feedback conditions for the models described in panel a, pooled across experiment 1 (N = 29) and experiment 2 (N = 30). As in the main behavioral results, these comparisons revealed that participants’ behavior featured both learning noise (fixed-effects: BF≈10^50.3, random-effects: exceedance p=0.99) and choice stochasticity (fixed-effects: BF≈10^82.4, random-effects: exceedance p=0.999) in the partial feedback condition (left panel). In the complete feedback condition, the model with learning noise explained the data better than the exact model (fixed-effects: BF≈10^100.3, random-effects: exceedance p=0.999). Furthermore, a model with learning noise and an argmax action selection policy fitted the data decisively better than a model with learning noise and a softmax policy (fixed-effects: BF≈10^15.8, random-effects: exceedance p=0.999) (right panel). Error bars for model frequencies correspond to the s.d. of the estimated Dirichlet distributions.
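As a reading aid for the fixed-effects Bayes factors quoted in this caption, here is a minimal sketch of how such a group-level statistic is commonly computed: the per-participant differences in log model evidence are summed and converted to base 10. The function name and the use of natural-log evidences as input are assumptions. The random-effects exceedance probabilities require a separate group-level procedure that is not sketched here.

```python
import numpy as np

def log10_fixed_effects_bf(log_evidence_a, log_evidence_b):
    """log10 of the group Bayes factor for model A over model B.

    Inputs are per-participant log model evidences (natural log), in the
    same participant order. Fixed-effects pooling treats participants as
    independent, so their log evidences add.
    """
    a = np.asarray(log_evidence_a, dtype=float)
    b = np.asarray(log_evidence_b, dtype=float)
    return float(np.sum(a - b) / np.log(10.0))
```

Under this convention, a returned value of 50.3 corresponds to a reported BF≈10^50.3.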

Supplementary Figure 2 Results of the parameter recovery procedure.

Implementation of the parameter recovery procedure in experiment 1 (partial feedback condition). For a given set of parameter values, we simulated the model 29 times (once for each of the N = 29 different realizations of the task). The simulated actions were fitted using exactly the same procedure used to fit human data, to quantify the extent to which we could recover the simulated (ground-truth) parameter values. (a) Parameter recovery for the learning noise parameter ζ, with other parameters (softmax temperature 1/β and learning rates) fixed. (b) Parameter recovery for the softmax temperature 1/β, with other parameters (learning noise ζ and learning rates) fixed. Fixed parameter values were set to group-level mean estimates obtained using a fixed-effects approach. For the single parameter whose value was varied, we considered 11 values logarithmically distributed around the group-level mean estimate. Horizontal lines represent the subjects’ group mean with the 99% confidence interval. Each dot represents the recovered parameter averaged across simulations (N = 29), with vertical lines showing s.d.m. The results indicate that ground-truth parameter values are well recovered by our fitting procedure, and that the fitting procedure is robust to changes in the learning noise and softmax temperature parameters. Recovered parameters do not saturate within the range of parameter values considered for learning noise (a) and start to saturate only when the softmax temperature parameter is set to about three times the group-level mean value (b).
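The recovery procedure described in this caption can be outlined as follows. This is a generic sketch, not the authors' code. `simulate` and `fit` are hypothetical callables standing in for the task simulation and the model-fitting routine, and the width of the logarithmic grid (`span`) is an assumption, since the caption does not state the range that was used.

```python
import numpy as np

def log_grid(center, span=10.0, n_values=11):
    """n_values points, log-spaced from center/span to center*span.

    With an odd n_values the middle point equals `center`
    (here, the group-level mean estimate).
    """
    return np.geomspace(center / span, center * span, n_values)

def recover(grid, simulate, fit, n_sims=29, seed=0):
    """Simulate n_sims data sets per ground-truth value and refit each.

    simulate(value, rng) -> data and fit(data) -> estimate are
    user-supplied (hypothetical) callables.
    Returns the mean and s.d. of the recovered values per grid point.
    """
    rng = np.random.default_rng(seed)
    est = np.array([[fit(simulate(v, rng)) for _ in range(n_sims)]
                    for v in grid])
    return est.mean(axis=1), est.std(axis=1, ddof=1)
```

Plotting the returned means against `grid` on log-log axes gives a recovery curve of the kind the caption describes. Points on the identity line indicate good recovery, and a plateau indicates saturation.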

Supplementary information

Supplementary Figures 1 & 2, Supplementary Tables 1 & 2, Supplementary Modeling Note, and Supplementary Note

Reporting Summary

About this article

Cite this article

Findling, C., Skvortsova, V., Dromnelle, R. et al. Computational noise in reward-guided learning drives behavioral variability in volatile environments. Nat Neurosci 22, 2066–2077 (2019). https://doi.org/10.1038/s41593-019-0518-9

Received: 11 October 2018
Accepted: 17 September 2019
Published: 28 October 2019
Issue Date: December 2019
DOI: https://doi.org/10.1038/s41593-019-0518-9

This article is cited by

Prefrontal signals precede striatal signals for biased credit assignment in motivational learning biases
- Johannes Algermissen
- Jennifer C. Swart
- Hanneke E. M. den Ouden
Nature Communications (2024)

Distinct reinforcement learning profiles distinguish between language and attentional neurodevelopmental disorders
- Noyli Nissan
- Uri Hertz
- Yafit Gabay
Behavioral and Brain Functions (2023)

Knowledge generalization and the costs of multitasking
- Kelly G. Garner
- Paul E. Dux
Nature Reviews Neuroscience (2023)

Blocking D2/D3 dopamine receptors in male participants increases volatility of beliefs when learning to trust others
- Nace Mikus
- Christoph Eisenegger
- Michael Naef
Nature Communications (2023)

Value-free random exploration is linked to impulsivity
- Magda Dubois
- Tobias U. Hauser
Nature Communications (2022)