Humans primarily use model-based inference in the two-stage task

Abstract

Distinct model-free and model-based learning processes are thought to drive both typical and dysfunctional behaviours. Data from two-stage decision tasks have seemingly shown that human behaviour is driven by both processes operating in parallel. However, in this study, we show that more detailed task instructions lead participants to make primarily model-based choices that have little, if any, simple model-free influence. We also demonstrate that behaviour in the two-stage task may falsely appear to be driven by a combination of simple model-free and model-based learning if purely model-based agents form inaccurate models of the task because of misconceptions. Furthermore, we report evidence that many participants do misconceive the task in important ways. Overall, we argue that humans formulate a wide variety of learning models. Consequently, the simple dichotomy of model-free versus model-based learning is inadequate to explain behaviour in the two-stage task and connections between reward learning, habit formation and compulsivity.
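To make the logic behind the stay-probability analyses (see Figs. 2–4 below) concrete, the sketch below simulates two deliberately simplified agents on a reduced two-stage task and tabulates how often each repeats its first-stage choice as a function of the previous trial's reward and transition type. It is an illustration only, not the authors' simulation or analysis code (that code is in the repository listed under Code availability); the learning rates, softmax temperature, reward-probability drift and greedy second-stage choice rule are all assumptions made here for brevity.

```python
# Minimal sketch of the stay-probability logic in the two-stage task.
# Not the authors' code; all parameter values here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
P_COMMON = 0.7        # probability of the common first-stage transition
N_TRIALS = 10_000

def run_agent(model_based, alpha=0.5, beta=5.0):
    q_stage2 = np.full((2, 2), 0.5)   # value of each option in each second-stage state
    q_mf = np.zeros(2)                # model-free value of the two first-stage options
    reward_prob = rng.uniform(0.25, 0.75, size=(2, 2))
    stays = {}                        # (reward, transition) -> [n_stay, n_trials]
    prev = None
    for _ in range(N_TRIALS):
        if model_based:
            # plan through the known transition structure
            max_vals = q_stage2.max(axis=1)
            q1 = P_COMMON * max_vals + (1 - P_COMMON) * max_vals[::-1]
        else:
            q1 = q_mf
        p_choose_1 = 1.0 / (1.0 + np.exp(-beta * (q1[1] - q1[0])))
        choice = int(rng.random() < p_choose_1)
        common = rng.random() < P_COMMON
        state = choice if common else 1 - choice          # option i commonly leads to state i
        choice2 = int(q_stage2[state, 1] > q_stage2[state, 0])  # greedy second-stage choice
        reward = int(rng.random() < reward_prob[state, choice2])
        # learning updates (simplified): second-stage values and a direct
        # model-free update of the chosen first-stage option
        q_stage2[state, choice2] += alpha * (reward - q_stage2[state, choice2])
        q_mf[choice] += alpha * (reward - q_mf[choice])
        # slow drift of reward probabilities, as in typical task designs
        reward_prob = np.clip(
            reward_prob + rng.normal(0.0, 0.025, size=(2, 2)), 0.25, 0.75)
        if prev is not None:
            key = ("rewarded" if prev["reward"] else "unrewarded",
                   "common" if prev["common"] else "rare")
            n = stays.setdefault(key, [0, 0])
            n[0] += int(choice == prev["choice"])
            n[1] += 1
        prev = {"choice": choice, "reward": reward, "common": common}
    return {k: v[0] / v[1] for k, v in stays.items()}

for label, mb in (("model-free", False), ("model-based", True)):
    print(label, {k: round(v, 2) for k, v in sorted(run_agent(mb).items())})
```

Under these assumptions, the simplified model-free agent shows only a main effect of the previous reward on staying, whereas the model-based agent's staying depends on the reward-by-transition interaction; this is the qualitative contrast that the stay-probability figures below build on.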


Fig. 1: The stimuli used in the three versions of the two-stage task.
Fig. 2: Stay probabilities for human participants; stay probabilities and model-based weights for simulated agents.
Fig. 3: Example of a reward main effect that cannot be driven by model-free learning.
Fig. 4: Model-based weights and logistic regression coefficients for different empirical datasets.
Fig. 5: Simulated behaviour of agents and real participants can be influenced by irrelevant changes in stimulus position.
Fig. 6: Simplified diagrams representing the strategy space in the two-stage task.

Data availability

The data obtained from human participants are available at https://github.com/carolfs/muddled_models

Code availability

All the code used to perform the simulations, run the magic carpet and spaceship tasks, and analyse the results is available at https://github.com/carolfs/muddled_models


Acknowledgements

We thank G. M. Parente for the wonderful illustrations used in the spaceship and magic carpet tasks, S. Gobbi, G. Lombardi and M. Edelson for many helpful discussions and ideas, the participants at the NYU Neuroeconomics Colloquium for useful feedback and P. Dayan, W. Kool, A. Konovalov, I. Krajbich and S. Nebe for helpful comments on early drafts of this manuscript. Our acknowledgement of their feedback does not imply that these individuals fully agree with our conclusions or opinions in this paper. We also acknowledge W. Kool, F. Cushman and S. Gershman for making the data from their 2016 paper openly available at https://github.com/wkool/tradeoffs. This work was supported by the CAPES Foundation (https://www.capes.gov.br) grant no. 88881.119317/2016-01 and the European Union’s Seventh Framework programme for research, technological development and demonstration under grant no. 607310 (Nudge-it). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Affiliations

Authors

Contributions

C.F.S. and T.A.H. designed the tasks and computational models. C.F.S. programmed the tasks, collected the data and performed the analyses with input from T.A.H. Both authors wrote the manuscript.

Corresponding authors

Correspondence to Carolina Feher da Silva or Todd A. Hare.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Primary Handling Editor: Marike Schiffer.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Timelines of the spaceship and magic carpet task.

Each box depicts an event within the spaceship or magic carpet tasks. The duration of each event is given in seconds on the left. A) In the spaceship task, the 1st screen simply indicates that a new trial has begun. The 2nd and 3rd screens represent the initial state. At the 4th screen, the participant has up to 2 s to indicate her choice. The common or rare transition is shown on the 5th screen. The second-stage state was indicated by the background colour (black, red) and the choice by the green left and right arrows on the 6th screen. The 7th and final screen in a trial revealed whether or not a reward was delivered. After feedback, the task advanced directly to the next trial. B) The magic carpet task was designed to closely mimic the original, abstract version of the two-stage task while still allowing for story-based instructions that included causes and effects for all task events. Thus, we used the same Tibetan characters as in the original task, but made them labels for magic carpets and genies rather than simply identifiers for coloured squares. In the magic carpet task, the 1st screen represented the initial state and first-stage choice. Participants had up to 2 s to make this choice. On the 2nd screen, the chosen option was highlighted for 3 s. Next, a ‘nap’ screen was shown for 1 s while the magic carpet automatically took the participant to one of the two mountains. Although participants saw the common or rare transition screens depicted in Fig. 1d during the practice trials, the transitions were not shown during the main task to make it more comparable with previous versions. The second-stage state (blue, pink) and choice were indicated by the pink or blue lamps on the right and left side of the 4th screen. The participant had up to 2 s to make her choice. The 5th screen highlighted the chosen lamp/genie for 3 s. The 6th and final screen in a trial revealed whether or not a reward was delivered. After reward feedback, there was a blank screen for 0.7–1.3 s before the next trial began.
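For readability, the magic carpet event sequence described in this caption can be summarised as a small data structure. The sketch below is only a restatement of the caption, not the task code (which is in the linked repository); durations not stated in the caption are marked None, and the event labels are informal names chosen here.

```python
# Summary of the magic carpet trial timeline described in the caption above.
# Illustrative only; the actual task code is in the linked repository.
# "max_s" = response-window upper limit; "fixed_s" = fixed duration;
# None = duration not stated in the caption; "range_s" = jittered interval.
MAGIC_CARPET_TRIAL = [
    ("first-stage choice between carpets", {"max_s": 2.0}),
    ("chosen carpet highlighted",          {"fixed_s": 3.0}),
    ("'nap' screen during the flight",     {"fixed_s": 1.0}),
    ("second-stage choice between lamps",  {"max_s": 2.0}),
    ("chosen lamp/genie highlighted",      {"fixed_s": 3.0}),
    ("reward feedback",                    {"fixed_s": None}),
    ("blank inter-trial screen",           {"range_s": (0.7, 1.3)}),
]
```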

Supplementary information

Supplementary Information

Supplementary methods, results and discussion.

Reporting Summary

Rights and permissions

Reprints and Permissions

About this article


Cite this article

Feher da Silva, C., Hare, T.A. Humans primarily use model-based inference in the two-stage task. Nat Hum Behav (2020). https://doi.org/10.1038/s41562-020-0905-y
