Letter | Published:

Sources of suboptimality in a minimalistic explore–exploit task

Abstract

People often choose between sticking with an available good option (exploitation) and trying out a new option that is uncertain but potentially more rewarding (exploration) [1,2]. Laboratory studies on explore–exploit decisions often contain real-world complexities such as non-stationary environments, stochasticity under exploitation and unknown reward distributions [3–7]. However, such factors might limit the researcher’s ability to understand the essence of people’s explore–exploit decisions. For this reason, we introduce a minimalistic task in which the optimal policy is to start off exploring and to switch to exploitation at most once in each sequence of decisions. The behaviour of 49 laboratory and 143 online participants deviated both qualitatively and quantitatively from the optimal policy, even when allowing for bias and decision noise. Instead, people seem to follow a suboptimal rule in which they switch from exploration to exploitation when the highest reward so far exceeds a certain threshold. Moreover, we show that this threshold decreases approximately linearly with the proportion of the sequence that remains, suggesting a temporal ratio law. Finally, we find evidence for ‘sequence-level’ variability that is shared across all decisions in the same sequence. Our results emphasize the importance of examining sequence-level strategies and their variability when studying sequential decision-making.
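The threshold rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' model code (that is in the repository listed under Code availability). The function names, the reward values, the parameters `intercept` and `slope`, and the sign of the slope are all hypothetical choices made for illustration, and the sketch omits the bias, decision noise and sequence-level variability that the abstract mentions.

```python
from typing import Optional, Sequence


def threshold(prop_remaining: float, intercept: float, slope: float) -> float:
    """Hypothetical threshold that is linear in the proportion of the sequence remaining."""
    return intercept + slope * prop_remaining


def switch_trial(
    explore_rewards: Sequence[float], intercept: float, slope: float
) -> Optional[int]:
    """Return the 0-based trial on which the rule first exploits, or None if it never does.

    explore_rewards[t] is the reward that would be observed if the agent
    explored on trial t (an assumption of this sketch).
    """
    n = len(explore_rewards)
    best = float("-inf")  # highest reward seen so far; nothing seen before trial 0
    for t in range(n):
        # Proportion of the sequence still remaining, counting the current trial.
        prop_remaining = (n - t) / n
        if best > threshold(prop_remaining, intercept, slope):
            return t  # switch to exploitation and stay there
        best = max(best, explore_rewards[t])
    return None


# Illustrative values only: five trials, threshold falling from 4.0 as trials run out.
rewards = [2.0, 3.0, 4.5, 1.0, 2.0]
first_exploit = switch_trial(rewards, intercept=2.0, slope=2.0)
```

With these illustrative values the rule explores on the first three trials and switches on the fourth, because the best reward so far (4.5) then exceeds the threshold for the remaining proportion of the sequence.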

Code availability

All experimental and analysis codes used in this paper are available at https://github.com/mingyus/explore-exploit.

Data availability

All data that support the findings of this paper are available at https://github.com/mingyus/explore-exploit.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Change history

  • 11 March 2019

    The original and corrected figures and equations are shown in the accompanying Publisher Correction.

References

  1.

    Cohen, J. D., McClure, S. M. & Yu, A. J. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Phil. Trans. R. Soc. Lond. B 362, 933–942 (2007).

  2.

    Mehlhorn, K. et al. Unpacking the exploration–exploitation tradeoff: a synthesis of human and animal literatures. Decision 2, 191–215 (2015).

  3.

    Acuna, D. & Schrater, P. Bayesian modeling of human sequential decision-making on the multi-armed bandit problem. In Proc. 30th Annual Conference of the Cognitive Science Society 2065–2070 (Cognitive Science Society, 2008).

  4.

    Constantino, S. M. & Daw, N. D. Learning the opportunity cost of time in a patch-foraging task. Cogn. Affect. Behav. Neurosci. 15, 837–853 (2015).

  5.

    Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).

  6.

    Knox, W. B., Otto, A. R., Stone, P. & Love, B. The nature of belief-directed exploratory choice in human decision-making. Front. Psychol. 2, 398 (2012).

  7.

    Steyvers, M., Lee, M. D. & Wagenmakers, E.-J. A Bayesian analysis of human decision-making on bandit problems. J. Math. Psychol. 53, 168–179 (2009).

  8.

    Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998).

  9.

    Seale, D. A. & Rapoport, A. Optimal stopping behavior with relative ranks: the secretary problem with unknown population size. J. Behav. Decis. Mak. 13, 391–411 (2000).

  10.

    Bellman, R. Dynamic Programming 1st edn (Princeton Univ. Press, Princeton, 1957).

  11.

    Lee, M. D., Zhang, S., Munro, M. & Steyvers, M. Psychological models of human and optimal performance in bandit problems. Cogn. Syst. Res. 12, 164–174 (2011).

  12.

    McFadden, D. et al. in Frontiers in Econometrics (ed. Zarembka, P.) 105–142 (Academic Press, New York, 1973).

  13.

    Gigerenzer, G. & Gaissmaier, W. Heuristic decision making. Annu. Rev. Psychol. 62, 451–482 (2011).

  14.

    Simon, H. A. Rational choice and the structure of the environment. Psychol. Rev. 63, 129–138 (1956).

  15.

    Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19, 716–723 (1974).

  16.

    Cavanaugh, J. E. et al. Unifying the derivations for the Akaike and corrected Akaike information criteria. Stat. Probabil. Lett. 33, 201–208 (1997).

  17.

    Schwarz, G. et al. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).

  18.

    Kello, C. T. et al. Scaling laws in cognitive sciences. Trends Cogn. Sci. 14, 223–232 (2010).

  19.

    Rigoux, L., Stephan, K. E., Friston, K. J. & Daunizeau, J. Bayesian model selection for group studies—revisited. NeuroImage 84, 971–985 (2014).

  20.

    Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J. & Friston, K. J. Bayesian model selection for group studies. NeuroImage 46, 1004–1017 (2009).

  21.

    Lau, B. & Glimcher, P. W. Dynamic response-by-response models of matching behavior in rhesus monkeys. J. Exp. Anal. Behav. 84, 555–579 (2005).

  22.

    Ito, M. & Doya, K. Validation of decision-making models and analysis of decision variables in the rat basal ganglia. J. Neurosci. 29, 9861–9874 (2009).

  23.

    Boehner, P. Ockham: Philosophical Writings (Nelson, Canada, 1957).

  24.

    Chater, N. & Vitányi, P. Simplicity: a unifying principle in cognitive science? Trends Cogn. Sci. 7, 19–22 (2003).

  25.

    Buhusi, C. V. & Meck, W. H. What makes us tick? Functional and neural mechanisms of interval timing. Nat. Rev. Neurosci. 6, 755–765 (2005).

  26.

    Gibbon, J. Scalar expectancy theory and Weber’s law in animal timing. Psychol. Rev. 84, 279–325 (1977).

  27.

    Brown, G. D. A., Neath, I. & Chater, N. A temporal ratio model of memory. Psychol. Rev. 114, 539–576 (2007).

  28.

    Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–535 (1952).

  29.

    Charnov, E. Optimal foraging: the marginal value theorem. Theor. Popul. Biol. 9, 129–136 (1976).

  30.

    Seale, D. A. & Rapoport, A. Sequential decision making with relative ranks: an experimental investigation of the “secretary problem”. Organ. Behav. Hum. Decis. Process. 69, 221–236 (1997).

  31.

    Van Opheusden, B., Galbiati, G., Bnaya, Z., Li, Y. & Ma, W. J. A computational model for decision tree search. In Proc. 39th Annual Conference of the Cognitive Science Society 1254–1259 (Cognitive Science Society, 2017).

  32.

    MacGregor, J. N. & Ormerod, T. Human performance on the traveling salesman problem. Percept. Psychophys. 58, 527–539 (1996).

  33.

    Sang, K. Modeling Exploration/Exploitation Behavior and the Effect of Individual Differences. PhD thesis, Indiana Univ. (2017).

  34.

    Sang, K., Todd, P. & Goldstone, R. Learning near-optimal search in a minimal explore/exploit task. In Proc. 33rd Annual Conference of the Cognitive Science Society 2800–2805 (Cognitive Science Society, 2011).

  35.

    Sang, K., Todd, P. M., Goldstone, R. & Hills, T. T. Explore/exploit tradeoff strategies in a resource accumulation search task. Preprint at https://psyarxiv.com/zw3s8 (2018).

  36.

    Hills, T. T., Todd, P. M. & Goldstone, R. L. The central executive as a search process: priming exploration and exploitation across domains. J. Exp. Psychol. Gen. 139, 590–609 (2010).

  37.

    Navarro, D. J., Newell, B. R. & Schulze, C. Learning and choosing in an uncertain world: an investigation of the explore–exploit dilemma in static and dynamic environments. Cogn. Psychol. 85, 43–77 (2016).

  38.

    Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A. & Cohen, J. D. Humans use directed and random exploration to solve the explore–exploit dilemma. J. Exp. Psychol. Gen. 143, 2074–2081 (2014).

  39.

    Stoll, F. M., Fontanier, V. & Procyk, E. Specific frontal neural dynamics contribute to decisions to check. Nat. Commun. 7, 11990 (2016).

  40.

    Kolling, N., Wittmann, M. & Rushworth, M. F. S. Multiple neural mechanisms of decision making and their competition under changing risk pressure. Neuron 81, 1190–1202 (2014).

  41.

    Mai, J.-E. Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior (Emerald Group Publishing, UK, 2016).

  42.

    Badre, D., Doll, B. B., Long, N. M. & Frank, M. J. Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron 73, 595–607 (2012).

  43.

    Boorman, E. D., Behrens, T. E. J., Woolrich, M. W. & Rushworth, M. F. S. How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron 62, 733–743 (2009).

  44.

    Barraclough, D. J., Conroy, M. L. & Lee, D. Prefrontal cortex and decision making in a mixed-strategy game. Nat. Neurosci. 7, 404–410 (2004).

  45.

    Miller, E. K. & Cohen, J. D. An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci. 24, 167–202 (2001).

  46.

    Wallis, J. D. & Miller, E. K. Neuronal activity in primate dorsolateral and orbital prefrontal cortex during performance of a reward preference task. Eur. J. Neurosci. 18, 2069–2081 (2003).

  47.

    Watanabe, M. Reward expectancy in primate prefrontal neurons. Nature 382, 629–632 (1996).

  48.

    Rich, A. S. & Gureckis, T. M. Exploratory choice reflects the future value of information. Decision 5, 177–192 (2018).

  49.

    Gureckis, T. M. et al. psiTurk: an open-source framework for conducting replicable behavioral experiments online. Behav. Res. Methods 48, 829–842 (2016).

  50.

    Glimcher, P. & Fehr, E. Neuroeconomics 2nd edn (Academic Press, 2014).

Acknowledgements

The authors thank R. Polonia for helpful comments on the manuscript, and people in W.J.M.’s laboratory for helpful discussions. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

All authors designed the study, developed the models, interpreted the results, and wrote the paper. M.S. and Z.B. collected the data and performed the analyses.

Competing interests

The authors declare no competing interests.

Correspondence to Wei Ji Ma.

Supplementary information

Supplementary Information

Supplementary Figures 1–16, Supplementary Tables 1–3, Supplementary Methods 1–3, Supplementary Results 1 and 2, and Supplementary References

Reporting Summary

Fig. 1: Experimental design, optimal policy and summary statistics.
Fig. 2: Fits of the Opt model to selected summary statistics.
Fig. 3: Evidence of a threshold rule depending on the proportion of days left.
Fig. 4: Sequence-level variability as implemented through variable-threshold models (Num-V and Prop-V).