In a changing world, how do we decide our best option? How do we settle between picking something familiar and trying out a new, possibly more rewarding, choice?
You step into a music store: how do you choose which CD to buy? If you pick something by your favourite composer, Schubert say, you know you will enjoy it. But you might want to expand your repertoire by trying a less familiar piece, for example Mahler's symphony No. 10. You might not like it; on the other hand, you might discover a new favourite. So how do you decide between sticking with what you know and like, and trying something more adventurous that might prove even more rewarding? On page 876 of this issue, Daw et al.1 provide valuable insight into how people balance the two basic impulses of exploiting what they know and exploring new options.
Standard economic theories tend to assume that people always make the decision that gives them the greatest reward — in terms of its subjective value or utility to the person. In reality, however, people often do not know which option will produce the best outcome, as even familiar environments change continually, so decision-making strategies must be adjusted through experience2. In reinforcement-learning theories3, estimates of future rewards expected from alternative actions are referred to as their value functions, and these must be updated appropriately to reflect the actual rewards. Choosing the action with the maximum value function is referred to as exploitation. Because value functions are only estimates of future rewards, however, it is often desirable to try out actions that might seem to have suboptimal value functions, and this is known as exploration.
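To make the idea of updating a value function concrete, here is a minimal sketch of one common textbook form, the incremental (delta-rule) update. It is an illustration only and is not necessarily the learning model used by Daw et al.; the function name `update_value` and the `learning_rate` parameter are assumptions introduced for this example.

```python
def update_value(value, reward, learning_rate=0.1):
    """Move a value estimate a fraction of the way toward the observed reward.

    `learning_rate` is a hypothetical parameter chosen for illustration.
    """
    prediction_error = reward - value  # how wrong the current estimate was
    return value + learning_rate * prediction_error


# Start with no expectation and observe the same reward three times.
value = 0.0
for reward in [1.0, 1.0, 1.0]:
    value = update_value(value, reward)
    print(value)
```

With a small learning rate the estimate changes slowly, so it averages over many outcomes; with a large one it tracks recent rewards closely, which matters when the environment keeps changing.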
To see how people deal with making decisions in changing circumstances, Daw et al.1 asked their subjects to play a computer game where they could choose among four slot machines. Each machine was assigned a different payout level, and each time the subject played, the payout they received from their chosen machine was determined randomly on the basis of the mean and the standard deviation of the assigned payout. In addition, the mean payout of each slot machine drifted randomly over time, preventing the subjects from ever knowing accurately which slot machine would reward them the most.
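A task of this kind can be sketched in a few lines. The simulation below assumes Gaussian payouts and a plain random walk for each machine's mean; the class name `DriftingSlotMachine`, the starting means and the two standard deviations are all hypothetical values chosen for illustration, not the parameters of the actual experiment.

```python
import random


class DriftingSlotMachine:
    """A slot machine whose mean payout wanders over time (illustrative only)."""

    def __init__(self, mean, payout_sd=4.0, drift_sd=2.0, rng=None):
        self.mean = mean            # current mean payout
        self.payout_sd = payout_sd  # trial-to-trial noise around the mean
        self.drift_sd = drift_sd    # size of the random step taken by the mean
        self.rng = rng if rng is not None else random.Random()

    def play(self):
        # Draw a noisy payout around the current mean.
        return self.rng.gauss(self.mean, self.payout_sd)

    def drift(self):
        # Let the mean take one random-walk step.
        self.mean += self.rng.gauss(0.0, self.drift_sd)


rng = random.Random(0)  # fixed seed so that runs are repeatable
machines = [DriftingSlotMachine(m, rng=rng) for m in (20.0, 40.0, 60.0, 80.0)]

for trial in range(5):
    choice = rng.randrange(len(machines))  # placeholder policy: pick at random
    payout = machines[choice].play()
    print(trial, choice, round(payout, 1))
    for machine in machines:               # every mean drifts on every trial
        machine.drift()
```

Because all four means keep moving, a machine that was best a few trials ago may no longer be, which is what makes occasional exploration worthwhile.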
Daw et al. considered the ways in which people might resolve the exploration–exploitation dilemma. One possibility, known as the ɛ-greedy rule, is to select the action with the maximum value function most of the time, but to choose randomly among the remaining options with a small probability (ɛ). Alternatively, people might explore more frequently if the differences among value functions of the various options are relatively small. This behaviour is described by the ‘softmax’ rule, which states that the probability of choosing a particular action is given by the Boltzmann distribution of the value functions (analogous to the Boltzmann distribution in physics that describes the behaviour of particles according to their energies).
To illustrate the difference between these two rules, let us consider the choice of Schubert versus Mahler, and imagine that the value function is higher for Schubert than for Mahler (shown by the light-blue area in Fig. 1). If there is no exploration, one would always choose Schubert, consistent with the utility maximization posited by standard economic theories. According to the ɛ-greedy rule, Mahler will be chosen occasionally (with probability ɛ). According to the softmax rule, the probability that less-preferred Mahler will be chosen depends on the difference in the value functions of these two composers, and increases as Mahler's value function becomes more similar to Schubert's.
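The contrast between the two rules can also be sketched numerically. The code below follows the descriptions given above: the ɛ-greedy rule gives the best option probability 1 − ɛ and shares ɛ among the remaining options, and the softmax rule uses a Boltzmann distribution over the value functions. The function names, the numerical values assigned to the two composers, and the `epsilon` and `temperature` settings are assumptions made for illustration.

```python
import math


def epsilon_greedy_probs(values, epsilon=0.1):
    """Choice probabilities under the epsilon-greedy rule (needs >= 2 options)."""
    best = max(range(len(values)), key=lambda i: values[i])
    others = len(values) - 1
    # The best option gets 1 - epsilon; epsilon is shared among the rest.
    return [1.0 - epsilon if i == best else epsilon / others
            for i in range(len(values))]


def softmax_probs(values, temperature=1.0):
    """Choice probabilities under the softmax (Boltzmann) rule."""
    top = max(values)  # subtracting the maximum keeps exp() numerically stable
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


SCHUBERT, MAHLER = 0, 1
schubert_value = 8.0  # hypothetical value function for the familiar choice

# Raise Mahler's value step by step toward Schubert's.
for mahler_value in (2.0, 5.0, 7.0):
    values = [schubert_value, mahler_value]
    p_eps = epsilon_greedy_probs(values)[MAHLER]
    p_soft = softmax_probs(values, temperature=2.0)[MAHLER]
    print(mahler_value, p_eps, round(p_soft, 3))
```

Under the ɛ-greedy rule the probability of choosing Mahler stays fixed at ɛ however close the two values are, whereas under the softmax rule it rises as Mahler's value approaches Schubert's.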
The results of Daw and colleagues showed that in their slot-machine game the softmax rule accounted for the pattern of exploration better than the ɛ-greedy rule. Yet another strategy of exploration in decision-making is to increase the probability of choosing a particular action according to the uncertainty of its outcome. Daw et al. found that applying this so-called ‘exploration bonus’ did not account for the data any better than the simple softmax rule. So it seems that the subjects did indeed explore according to the softmax rule.
To understand how the brain decides to explore, Daw et al. examined the activity in the subjects' brains during the slot-machine game using functional magnetic resonance imaging. Consistent with findings from previous studies, they found that several brain areas, such as the orbitofrontal cortex, displayed changes in activity that were related to the value functions of the chosen action2. More notably, having classified a subject's choice in each trial as exploitation or exploration, the authors looked for brain areas that showed increased activity associated with each choice. They did not find any areas that became more active during exploitation, but they did find that exploration was accompanied by stronger activation in the frontopolar cortex, a region considered important for the control of cognitive functions4.
The findings of Daw et al. not only provide a unique insight into how we make decisions, but also raise many exciting questions for future research. For example, the randomness of the choice generated by the softmax rule can be specified by a parameter analogous to the temperature in the Boltzmann distribution. At a high ‘temperature’, all actions are chosen more or less randomly, whereas at a low ‘temperature’ preferred actions are chosen deterministically. Is it possible that the temperature parameter in the softmax rule decreases, and decision-makers explore less, as they get more familiar with a particular task? If so, what are the underlying neural mechanisms? Changes in the parameters of learning models, such as temperature, are known as meta-learning5,6, and this process may be controlled by neuromodulators — chemicals such as noradrenaline and acetylcholine that affect the efficacy of other neurotransmitters7,8. But the precise roles of these neuromodulators in reinforcement learning and decision-making have still to be defined. As beautifully exemplified by Daw et al.1, exploring these questions will require interdisciplinary approaches, in which the computational theories inspire and guide behavioural and neurobiological experiments.
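The role of the temperature parameter can be seen in a short sketch. The four value functions and the three temperatures below are hypothetical numbers chosen only to show the two extremes described above; nothing here is taken from the data of Daw et al.

```python
import math


def softmax_probs(values, temperature):
    """Boltzmann choice probabilities for a list of value functions."""
    top = max(values)  # subtracting the maximum keeps exp() numerically stable
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


values = [8.0, 6.0, 5.0, 3.0]  # hypothetical value functions for four options

for temperature in (0.1, 1.0, 100.0):
    probs = softmax_probs(values, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lowest temperature nearly all of the probability falls on the option with the highest value, and at the highest temperature the four options are chosen almost equally often. A decision-maker whose temperature fell with experience would therefore move gradually from exploring to exploiting.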