When learning the value of actions in volatile environments, humans often make seemingly irrational decisions that fail to maximize expected value. We reasoned that these ‘non-greedy’ decisions, instead of reflecting information seeking during choice, may be caused by computational noise in the learning of action values. Here, using reinforcement learning models of behavior and multimodal neurophysiological data, we show that the majority of non-greedy decisions stem from this learning noise. The trial-to-trial variability of sequential learning steps and their impact on behavior could be predicted both by blood oxygen level-dependent responses to obtained rewards in the dorsal anterior cingulate cortex and by phasic pupillary dilation, suggestive of neuromodulatory fluctuations driven by the locus coeruleus–norepinephrine system. Together, these findings indicate that most behavioral variability, rather than reflecting human exploration, is due to the limited computational precision of reward-guided learning.
The data (behavioral, neuroimaging and pupillometric) that support these findings are available from the corresponding author upon request.
Python and C++ code for fitting all computational models described in the article are available at https://github.com/csmfindling/learning_variability. The algorithmic backbone of the Monte Carlo procedures used to fit models can be found in Supplementary Modeling Note.
We thank C. Summerfield (University of Oxford; Google DeepMind) for comments on an earlier version of the manuscript. This work was supported by a starting grant from the European Research Council awarded to V.W. (ERC-StG-759341), a junior researcher grant from the Agence Nationale de la Recherche awarded to V.W. (ANR-14-CE13-0028) and two department-wide grants from the Agence Nationale de la Recherche (ANR-10-LABX-0087 and ANR-10-IDEX-0001-02 PSL). C.F. was supported by a graduate research fellowship from the Direction Générale de l’Armement (2015-60-0041). S.P. was supported by a CNRS-Inserm ATIP-Avenir grant (R16069JS) and a research grant from the Programme Emergence(s) of the City of Paris.
The authors declare no competing interests.
Peer review information Nature Neuroscience thanks Samuel Gershman, Yonatan Loewenstein, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Supplementary Figure 1 Additional model comparisons across experiments 1 and 2 (N = 59 participants).
(a) Knock-out procedure. The top panel shows models that varied based on the presence or absence of learning noise (ζ) in addition to the softmax choice policy (β). The bottom panel shows models that varied based on the presence or absence of a softmax choice policy (β) on top of learning noise (ζ). (b) Results of the model comparison in the partial and complete feedback conditions for the models described in panel a, pooled across experiment 1 (N = 29) and experiment 2 (N = 30). Similarly to the main behavioral results, these comparisons revealed that participants featured both learning noise (fixed-effects: BF ≈ 10^50.3, random-effects: exceedance p = 0.99) and choice stochasticity (fixed-effects: BF ≈ 10^82.4, random-effects: exceedance p = 0.999) in the partial feedback condition (left panel). In the complete feedback condition, the model with learning noise better explained the data than the exact model (fixed-effects: BF ≈ 10^100.3, random-effects: exceedance p = 0.999). Furthermore, a model with learning noise and an argmax action selection policy fitted the data decisively better than a model with learning noise and a softmax policy (fixed-effects: BF ≈ 10^15.8, random-effects: exceedance p = 0.999) (right panel). Error bars for model frequencies correspond to the s.d. of estimated Dirichlet distributions.
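To make the knock-out logic concrete, the following is a minimal sketch of a two-armed bandit learner that can include learning noise (ζ), a softmax policy (β), both, or neither. It is not the authors' implementation (their code is in the repository cited under code availability). The function name, the default parameter values, the initial values of 0.5, the partial-feedback update of the chosen option only, and in particular the assumption that the noise s.d. scales with the magnitude of the prediction error are illustrative choices that the captions here do not specify.

```python
import numpy as np

def simulate_noisy_rl(rewards, alpha=0.3, zeta=0.5, beta=10.0, seed=0):
    """Simulate choices of a learner with (optionally) noisy value updates.

    rewards : array of shape (n_trials, 2), reward of each option per trial.
    alpha   : learning rate.
    zeta    : learning noise scale (0 gives exact, noise-free learning).
    beta    : softmax inverse temperature (np.inf gives an argmax policy).
    All names and defaults are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    rewards = np.asarray(rewards, dtype=float)
    n_trials = rewards.shape[0]
    q = np.full(2, 0.5)                      # assumed initial action values
    choices = np.empty(n_trials, dtype=int)
    q_history = np.empty((n_trials, 2))
    for t in range(n_trials):
        q_history[t] = q
        if np.isinf(beta):
            # Argmax policy: always pick the currently higher-valued option.
            choice = int(np.argmax(q))
        else:
            # Softmax policy over two options (logistic of value difference).
            p_second = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
            choice = int(rng.random() < p_second)
        choices[t] = choice
        # Partial feedback: only the chosen option is updated.
        delta = rewards[t, choice] - q[choice]
        # Assumption: noise s.d. grows with the size of the update.
        noise = rng.normal(0.0, zeta * abs(delta))
        q[choice] += alpha * delta + noise
    return choices, q_history
```

Setting `zeta=0` with a finite `beta` gives the exact-learning softmax model, whereas `zeta>0` with `beta=np.inf` gives the learning-noise model with an argmax policy, in which every choice is greedy with respect to the learner's own noisy values even when it looks non-greedy to an observer tracking noise-free values.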
Implementation of the parameter recovery procedure in experiment 1 (in the partial feedback condition). For a given set of parameter values, we simulated the model 29 times (once for each of the N = 29 different realizations of the task). The simulated actions were fitted using exactly the same procedure used to fit human data, to quantify the extent to which we could recover the simulated (ground-truth) parameter values. (a) Parameter recovery for the learning noise parameter ζ, with other parameters (softmax temperature 1/β and learning rates) fixed. (b) Parameter recovery for the softmax temperature 1/β, with other parameters (learning noise ζ and learning rates) fixed. Fixed parameter values were set to group-level mean estimates obtained using a fixed-effects approach. For the single parameter whose value was varied, we considered 11 values logarithmically distributed around the group-level mean estimate. Horizontal lines represent the subjects’ group mean with the 99% confidence interval. Each dot represents the recovered parameter averaged across simulations (N = 29), with vertical lines showing s.d.m. The results indicate that ground-truth parameter values are well recovered by our fitting procedure. They also show that the fitting procedure is robust to changes in the learning noise and softmax temperature parameters. Recovered parameters do not saturate within the range of parameter values considered for learning noise (a) and start to saturate only when the softmax temperature parameter is set to about three times the group-level mean value (b).
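The recovery procedure described above can be sketched as a loop over a logarithmic grid of ground-truth values. This is only an outline under stated assumptions: `recovery_grid`, `run_recovery`, the `max_factor` range of the grid, and the toy `simulate` and `fit` placeholders are hypothetical and stand in for the actual task simulations and Monte Carlo fitting procedure, which are not reproduced here.

```python
import numpy as np

def recovery_grid(group_mean, n_values=11, max_factor=4.0):
    """Log-spaced values centred geometrically on the group-level mean.

    max_factor (range of the grid) is an assumed value, not from the source.
    """
    return np.geomspace(group_mean / max_factor,
                        group_mean * max_factor, n_values)

def run_recovery(simulate, fit, grid, n_sims=29, seed=0):
    """For each ground-truth value, simulate n_sims data sets and refit them.

    simulate(true_value, rng) -> data ; fit(data) -> recovered value.
    Returns the mean and s.d. of recovered values at each grid point.
    """
    rng = np.random.default_rng(seed)
    means, sds = [], []
    for true_value in grid:
        recovered = [fit(simulate(true_value, rng)) for _ in range(n_sims)]
        means.append(np.mean(recovered))
        sds.append(np.std(recovered, ddof=1))
    return np.array(means), np.array(sds)

# Toy placeholders (NOT the article's model): data are noisy samples around
# the true value and "fitting" is the sample mean.
def toy_simulate(true_value, rng):
    return rng.normal(true_value, 0.1 * true_value, size=200)

def toy_fit(data):
    return float(np.mean(data))

grid = recovery_grid(group_mean=0.5)
recovered_mean, recovered_sd = run_recovery(toy_simulate, toy_fit, grid)
```

Plotting `recovered_mean` against `grid` on log-log axes, with `recovered_sd` as error bars, gives the kind of recovery curve the caption describes; points falling away from the identity line at the extremes would indicate saturation.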
About this article
Cite this article
Findling, C., Skvortsova, V., Dromnelle, R. et al. Computational noise in reward-guided learning drives behavioral variability in volatile environments. Nat Neurosci (2019) doi:10.1038/s41593-019-0518-9