Abstract
Distinct model-free and model-based learning processes are thought to drive both typical and dysfunctional behaviours. Data from two-stage decision tasks have seemingly shown that human behaviour is driven by both processes operating in parallel. However, in this study, we show that more detailed task instructions lead participants to make primarily model-based choices that have little, if any, simple model-free influence. We also demonstrate that behaviour in the two-stage task may falsely appear to be driven by a combination of simple model-free and model-based learning if purely model-based agents form inaccurate models of the task because of misconceptions. Furthermore, we report evidence that many participants do misconceive the task in important ways. Overall, we argue that humans formulate a wide variety of learning models. Consequently, the simple dichotomy of model-free versus model-based learning is inadequate to explain behaviour in the two-stage task and connections between reward learning, habit formation and compulsivity.
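To make the contrast under discussion concrete, the sketch below simulates the kind of hybrid agent commonly fit to two-stage task data (in the spirit of the Daw et al. hybrid model), in which a weight w mixes model-free and model-based action values. This is a minimal illustration only, under assumed parameter names and values (ALPHA, BETA, W, and the transition and reward probabilities); it is not the authors' fitted model and omits the eligibility traces and perseveration terms used in full analyses.

```python
# Minimal sketch of a hybrid model-free/model-based agent on a two-stage
# task. All parameters and probabilities are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

ALPHA = 0.5      # learning rate (assumed)
BETA = 5.0       # softmax inverse temperature (assumed)
W = 0.5          # mixing weight: 0 = purely model-free, 1 = purely model-based
P_COMMON = 0.7   # probability of the common transition
N_TRIALS = 200

q_mf = np.zeros(2)               # model-free values of the two first-stage actions
v_stage2 = np.zeros(2)           # learned values of the two second-stage states
p_reward = np.array([0.6, 0.4])  # assumed (fixed) second-stage reward probabilities

for _ in range(N_TRIALS):
    # Model-based values: expectation over the known transition structure.
    q_mb = P_COMMON * v_stage2 + (1 - P_COMMON) * v_stage2[::-1]
    q_net = W * q_mb + (1 - W) * q_mf

    # Softmax choice between the two first-stage actions.
    p_choice = np.exp(BETA * q_net)
    p_choice /= p_choice.sum()
    a = rng.choice(2, p=p_choice)

    # Common transition takes action a to state a; rare, to the other state.
    s = a if rng.random() < P_COMMON else 1 - a
    r = float(rng.random() < p_reward[s])

    # Temporal-difference updates: second-stage value, then first-stage
    # model-free value (simplified one-step backup).
    v_stage2[s] += ALPHA * (r - v_stage2[s])
    q_mf[a] += ALPHA * (v_stage2[s] - q_mf[a])

print("model-free Q:", q_mf)
print("second-stage V:", v_stage2)
```

Fitting w in a model family like this to choice data is what produces the apparent mixture of processes; the paper's argument is that such a mixture can arise spuriously when purely model-based learners hold a mistaken model of the task.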
Data availability
The data obtained from human participants are available at https://github.com/carolfs/muddled_models
Code availability
All the code used to perform the simulations, run the magic carpet and spaceship tasks, and analyse the results is available at https://github.com/carolfs/muddled_models
Acknowledgements
We thank G. M. Parente for the wonderful illustrations used in the spaceship and magic carpet tasks, S. Gobbi, G. Lombardi and M. Edelson for many helpful discussions and ideas, the participants at the NYU Neuroeconomics Colloquium for useful feedback and P. Dayan, W. Kool, A. Konovalov, I. Krajbich and S. Nebe for helpful comments on early drafts of this manuscript. Our acknowledgement of their feedback does not imply that these individuals fully agree with our conclusions or opinions in this paper. We also acknowledge W. Kool, F. Cushman and S. Gershman for making the data from their 2016 paper openly available at https://github.com/wkool/tradeoffs. This work was supported by the CAPES Foundation (https://www.capes.gov.br) grant no. 88881.119317/2016-01 and the European Union’s Seventh Framework programme for research, technological development and demonstration under grant no. 607310 (Nudge-it). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
C.F.S. and T.A.H. designed the tasks and computational models. C.F.S. programmed the tasks, collected the data and performed the analyses with input from T.A.H. Both authors wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Primary Handling Editor: Marike Schiffer.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Timelines of the spaceship and magic carpet tasks.
Each box depicts an event within the spaceship or magic carpet tasks. The duration of each event is given in seconds on the left. A) In the spaceship task, the 1st screen simply indicated that a new trial had begun. The 2nd and 3rd screens represented the initial state. At the 4th screen, the participant had up to 2 s to indicate her choice. The common or rare transition was shown on the 5th screen. The second-stage state was indicated by the background colour (black, red) and the choice by the green left and right arrows on the 6th screen. The 7th and final screen in a trial revealed whether or not a reward was delivered. After feedback, the task advanced directly to the next trial. B) The magic carpet task was designed to closely mimic the original, abstract version of the two-stage task while still allowing for story-based instructions that included causes and effects for all task events. Thus, we used the same Tibetan characters as in the original task, but made them labels for magic carpets and genies rather than mere identifiers on coloured squares. In the magic carpet task, the 1st screen represented the initial state and first-stage choice. Participants had up to 2 s to make this choice. On the 2nd screen, the chosen option was highlighted for 3 s. Next, a ‘nap’ screen was shown for 1 s while the magic carpet automatically took the participant to one of the two mountains. Although participants saw the common or rare transition screens depicted in Fig. 1d during the practice trials, the transitions were not shown during the main task to make it more comparable with previous versions. The second-stage state (blue, pink) and choice were indicated by the pink or blue lamps on the right and left sides of the 4th screen. The participant had up to 2 s to make her choice. The 5th screen highlighted the chosen lamp/genie for 3 s. The 6th and final screen in a trial revealed whether or not a reward was delivered. After reward feedback, there was a blank screen for 0.7–1.3 s before the next trial began.
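As a compact restatement of the magic carpet timeline described in the caption above, the hypothetical event list below encodes the screen sequence and durations. It is a reading aid only, not the task implementation; the actual task code is available in the repository listed under Code availability.

```python
# Hypothetical summary of the magic carpet trial timeline from the caption
# above. The real task code lives at https://github.com/carolfs/muddled_models.
MAGIC_CARPET_TRIAL = [
    ("first-stage choice (carpets)",        "up to 2 s"),
    ("chosen carpet highlighted",           "3 s"),
    ("'nap' screen (transition hidden)",    "1 s"),
    ("second-stage choice (lamps/genies)",  "up to 2 s"),
    ("chosen lamp highlighted",             "3 s"),
    ("reward feedback",                     "not specified in caption"),
    ("blank intertrial screen",             "0.7-1.3 s"),
]

for event, duration in MAGIC_CARPET_TRIAL:
    print(f"{event:40s}{duration}")
```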
Supplementary information
Supplementary Information
Supplementary methods, results and discussion.
About this article
Cite this article
Feher da Silva, C., Hare, T.A. Humans primarily use model-based inference in the two-stage task. Nat Hum Behav 4, 1053–1066 (2020). https://doi.org/10.1038/s41562-020-0905-y
This article is cited by
- Memory for rewards guides retrieval. Communications Psychology (2024)
- Bridging minds and policies: supporting early career researchers in translating computational psychiatry research. Neuropsychopharmacology (2024)
- A multi-stage anticipated surprise model with dynamic expectation for economic decision-making. Scientific Reports (2024)
- Distinct reinforcement learning profiles distinguish between language and attentional neurodevelopmental disorders. Behavioral and Brain Functions (2023)
- Using smartphones to optimise and scale-up the assessment of model-based planning. Communications Psychology (2023)