Sources of suboptimality in a minimalistic explore–exploit task

Song, Mingyu; Bnaya, Zahy; Ma, Wei Ji

doi:10.1038/s41562-018-0526-x

Letter
Published: 11 February 2019

Nature Human Behaviour volume 3, pages 361–368 (2019)

A Publisher Correction to this article was published on 11 March 2019

This article has been updated

Abstract

People often choose between sticking with an available good option (exploitation) and trying out a new option that is uncertain but potentially more rewarding (exploration)^1,2. Laboratory studies on explore–exploit decisions often contain real-world complexities such as non-stationary environments, stochasticity under exploitation and unknown reward distributions^3,4,5,6,7. However, such factors might limit the researcher’s ability to understand the essence of people’s explore–exploit decisions. For this reason, we introduce a minimalistic task in which the optimal policy is to start off exploring and to switch to exploitation at most once in each sequence of decisions. The behaviour of 49 laboratory and 143 online participants deviated both qualitatively and quantitatively from the optimal policy, even when allowing for bias and decision noise. Instead, people seem to follow a suboptimal rule in which they switch from exploration to exploitation when the highest reward so far exceeds a certain threshold. Moreover, we show that this threshold decreases approximately linearly with the proportion of the sequence that remains, suggesting a temporal ratio law. Finally, we find evidence for ‘sequence-level’ variability that is shared across all decisions in the same sequence. Our results emphasize the importance of examining sequence-level strategies and their variability when studying sequential decision-making.
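The threshold rule described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' model code: the function names, the uniform reward distribution, the rule that exploiting repeats the best reward seen so far, and the numeric values of `intercept` and `slope` (including the sign of `slope`) are all placeholders.

```python
import random


def threshold(t, T, intercept, slope):
    """Threshold on the best reward so far, linear in the proportion remaining.

    t is the 0-based index of the current decision and T the sequence length.
    intercept and slope are hypothetical free parameters, not fitted values.
    """
    proportion_remaining = (T - t) / T
    return intercept + slope * proportion_remaining


def simulate_sequence(T, intercept, slope, draw_reward):
    """Simulate one sequence of T decisions under the threshold rule.

    Assumes (for illustration) that exploring yields a fresh draw from
    draw_reward() and that exploiting repeats the best reward seen so far.
    """
    choices, rewards = [], []
    best = None
    exploiting = False
    for t in range(T):
        # Switch once, when the best reward so far exceeds the threshold.
        if not exploiting and best is not None:
            exploiting = best > threshold(t, T, intercept, slope)
        if exploiting:
            choices.append("exploit")
            rewards.append(best)
        else:
            reward = draw_reward()
            choices.append("explore")
            rewards.append(reward)
            best = reward if best is None else max(best, reward)
    return choices, rewards


rng = random.Random(0)  # seeded only so the illustration is repeatable
choices, rewards = simulate_sequence(
    T=10, intercept=0.4, slope=0.5, draw_reward=rng.random
)
print(choices)
```

Because the simulated agent never returns to exploration once it has switched, each sequence contains at most one switch, which matches the qualitative structure the abstract attributes to the optimal policy. The sequence-level variability mentioned in the abstract could be sketched by drawing `intercept` afresh for each sequence, though that extension is not shown here.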

**Fig. 1: Experimental design, optimal policy and summary statistics.**

**Fig. 2: Fits of the Opt model to selected summary statistics.**

**Fig. 3: Evidence of a threshold rule depending on the proportion of days left.**

**Fig. 4: Sequence-level variability as implemented through variable-threshold models (Num-V and Prop-V).**

Code availability

All experimental and analysis codes used in this paper are available at https://github.com/mingyus/explore-exploit.

Data availability

All data that support the findings of this paper are available at https://github.com/mingyus/explore-exploit.

Change history

11 March 2019
The original and corrected figures and equations are shown in the accompanying Publisher Correction.

References

1. Cohen, J. D., McClure, S. M. & Yu, A. J. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Phil. Trans. R. Soc. Lond. B 362, 933–942 (2007).
2. Mehlhorn, K. et al. Unpacking the exploration–exploitation tradeoff: a synthesis of human and animal literatures. Decision 2, 191–215 (2015).
3. Acuna, D. & Schrater, P. Bayesian modeling of human sequential decision-making on the multi-armed bandit problem. In Proc. 30th Annual Conference of the Cognitive Science Society 2065–2070 (Cognitive Science Society, 2008).
4. Constantino, S. M. & Daw, N. D. Learning the opportunity cost of time in a patch-foraging task. Cogn. Affect. Behav. Neurosci. 15, 837–853 (2015).
5. Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).
6. Knox, W. B., Otto, A. R., Stone, P. & Love, B. The nature of belief-directed exploratory choice in human decision-making. Front. Psychol. 2, 398 (2012).
7. Steyvers, M., Lee, M. D. & Wagenmakers, E.-J. A Bayesian analysis of human decision-making on bandit problems. J. Math. Psychol. 53, 168–179 (2009).
8. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998).
9. Seale, D. A. & Rapoport, A. Optimal stopping behavior with relative ranks: the secretary problem with unknown population size. J. Behav. Decis. Mak. 13, 391–411 (2000).
10. Bellman, R. Dynamic Programming 1st edn (Princeton Univ. Press, Princeton, 1957).
11. Lee, M. D., Zhang, S., Munro, M. & Steyvers, M. Psychological models of human and optimal performance in bandit problems. Cogn. Syst. Res. 12, 164–174 (2011).
12. McFadden, D. et al. in Frontiers in Econometrics (ed. Zarembka, P.) 105–142 (Academic Press, New York, 1973).
13. Gigerenzer, G. & Gaissmaier, W. Heuristic decision making. Annu. Rev. Psychol. 62, 451–482 (2011).
14. Simon, H. A. Rational choice and the structure of the environment. Psychol. Rev. 63, 129–138 (1956).
15. Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19, 716–723 (1974).
16. Cavanaugh, J. E. et al. Unifying the derivations for the Akaike and corrected Akaike information criteria. Stat. Probabil. Lett. 33, 201–208 (1997).
17. Schwarz, G. et al. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).
18. Kello, C. T. et al. Scaling laws in cognitive sciences. Trends Cogn. Sci. 14, 223–232 (2010).
19. Rigoux, L., Stephan, K. E., Friston, K. J. & Daunizeau, J. Bayesian model selection for group studies—revisited. NeuroImage 84, 971–985 (2014).
20. Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J. & Friston, K. J. Bayesian model selection for group studies. NeuroImage 46, 1004–1017 (2009).
21. Lau, B. & Glimcher, P. W. Dynamic response-by-response models of matching behavior in rhesus monkeys. J. Exp. Anal. Behav. 84, 555–579 (2005).
22. Ito, M. & Doya, K. Validation of decision-making models and analysis of decision variables in the rat basal ganglia. J. Neurosci. 29, 9861–9874 (2009).
23. Boehner, P. Ockham: Philosophical Writings (Nelson, Canada, 1957).
24. Chater, N. & Vitányi, P. Simplicity: a unifying principle in cognitive science? Trends Cogn. Sci. 7, 19–22 (2003).
25. Buhusi, C. V. & Meck, W. H. What makes us tick? Functional and neural mechanisms of interval timing. Nat. Rev. Neurosci. 6, 755–765 (2005).
26. Gibbon, J. Scalar expectancy theory and Weber’s law in animal timing. Psychol. Rev. 84, 279–325 (1977).
27. Brown, G. D. A., Neath, I. & Chater, N. A temporal ratio model of memory. Psychol. Rev. 114, 539–576 (2007).
28. Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–535 (1952).
29. Charnov, E. Optimal foraging: the marginal value theorem. Theor. Popul. Biol. 9, 129–136 (1976).
30. Seale, D. A. & Rapoport, A. Sequential decision making with relative ranks: an experimental investigation of the “secretary problem”. Organ. Behav. Hum. Decis. Process. 69, 221–236 (1997).
31. Van Opheusden, B., Galbiati, G., Bnaya, Z., Li, Y. & Ma, W. J. A computational model for decision tree search. In Proc. 39th Annual Conference of the Cognitive Science Society 1254–1259 (Cognitive Science Society, 2017).
32. MacGregor, J. N. & Ormerod, T. Human performance on the traveling salesman problem. Percept. Psychophys. 58, 527–539 (1996).
33. Sang, K. Modeling Exploration/Exploitation Behavior and the Effect of Individual Differences. PhD thesis, Indiana Univ. (2017).
34. Sang, K., Todd, P. & Goldstone, R. Learning near-optimal search in a minimal explore/exploit task. In Proc. 33rd Annual Conference of the Cognitive Science Society 2800–2805 (Cognitive Science Society, 2011).
35. Sang, K., Todd, P. M., Goldstone, R. & Hills, T. T. Explore/exploit tradeoff strategies in a resource accumulation search task. Preprint at https://psyarxiv.com/zw3s8 (2018).
36. Hills, T. T., Todd, P. M. & Goldstone, R. L. The central executive as a search process: priming exploration and exploitation across domains. J. Exp. Psychol. Gen. 139, 590–609 (2010).
37. Navarro, D. J., Newell, B. R. & Schulze, C. Learning and choosing in an uncertain world: an investigation of the explore–exploit dilemma in static and dynamic environments. Cogn. Psychol. 85, 43–77 (2016).
38. Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A. & Cohen, J. D. Humans use directed and random exploration to solve the explore–exploit dilemma. J. Exp. Psychol. Gen. 143, 2074–2081 (2014).
39. Stoll, F. M., Fontanier, V. & Procyk, E. Specific frontal neural dynamics contribute to decisions to check. Nat. Commun. 7, 11990 (2016).
40. Kolling, N., Wittmann, M. & Rushworth, M. F. S. Multiple neural mechanisms of decision making and their competition under changing risk pressure. Neuron 81, 1190–1202 (2014).
41. Mai, J.-E. Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior (Emerald Group Publishing, UK, 2016).
42. Badre, D., Doll, B. B., Long, N. M. & Frank, M. J. Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron 73, 595–607 (2012).
43. Boorman, E. D., Behrens, T. E. J., Woolrich, M. W. & Rushworth, M. F. S. How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron 62, 733–743 (2009).
44. Barraclough, D. J., Conroy, M. L. & Lee, D. Prefrontal cortex and decision making in a mixed-strategy game. Nat. Neurosci. 7, 404–410 (2004).
45. Miller, E. K. & Cohen, J. D. An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci. 24, 167–202 (2001).
46. Wallis, J. D. & Miller, E. K. Neuronal activity in primate dorsolateral and orbital prefrontal cortex during performance of a reward preference task. Eur. J. Neurosci. 18, 2069–2081 (2003).
47. Watanabe, M. Reward expectancy in primate prefrontal neurons. Nature 382, 629–632 (1996).
48. Rich, A. S. & Gureckis, T. M. Exploratory choice reflects the future value of information. Decision 5, 177–192 (2018).
49. Gureckis, T. M. et al. psiTurk: an open-source framework for conducting replicable behavioral experiments online. Behav. Res. Methods 48, 829–842 (2016).
50. Glimcher, P. & Fehr, E. Neuroeconomics 2nd edn (Academic Press, 2014).

Acknowledgements

The authors thank R. Polonia for helpful comments on the manuscript, and people in W.J.M.’s laboratory for helpful discussions. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Mingyu Song, Zahy Bnaya.

Authors and Affiliations

Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA
Mingyu Song
Center for Neural Science, New York University, New York, NY, USA
Mingyu Song, Zahy Bnaya & Wei Ji Ma
Department of Psychology, New York University, New York, NY, USA
Mingyu Song, Zahy Bnaya & Wei Ji Ma

Authors

Mingyu Song
Zahy Bnaya
Wei Ji Ma

Contributions

All authors designed the study, developed the models, interpreted the results, and wrote the paper. M.S. and Z.B. collected the data and performed the analyses.

Corresponding author

Correspondence to Wei Ji Ma.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figures 1–16, Supplementary Tables 1–3, Supplementary Methods 1–3, Supplementary Results 1 and 2, and Supplementary References

Reporting Summary

About this article

Cite this article

Song, M., Bnaya, Z. & Ma, W.J. Sources of suboptimality in a minimalistic explore–exploit task. Nat Hum Behav 3, 361–368 (2019). https://doi.org/10.1038/s41562-018-0526-x

Received: 15 June 2018
Accepted: 19 December 2018
Published: 11 February 2019
Issue Date: April 2019
DOI: https://doi.org/10.1038/s41562-018-0526-x