Exploring Feature Dimensions to Learn a New Policy in an Uninformed Reinforcement Learning Task

When making a choice with limited information, we explore new features through trial and error to learn how they are related. However, few studies have investigated exploratory behaviour when information is limited. In this study, we address, at both the behavioural and neural levels, how, when, and why humans explore new feature dimensions to learn a new policy for choosing within a state-space. We designed a novel multi-dimensional reinforcement learning task to encourage participants to explore and learn new features, and then used a reinforcement learning algorithm to model policy exploration and learning behaviour. Our results provide the first evidence that, when humans explore new feature dimensions, values are transferred from the previous policy to the new online (active) policy rather than being learned from scratch. We further demonstrate that exploration may be regulated by the level of cognitive ambiguity, and that this process might be controlled by the frontopolar cortex. These findings open up new possibilities for further understanding how humans explore new features in an open space with limited information.


Policy simulation
To verify that the best strategy (optimal policy) for the designed task requires the use of all three feature dimensions (shape, colour, and pattern), we created a decision agent that utilised a naïve reinforcement learning algorithm 17 to evaluate the performance of each policy.
Seven different policies (πi) were modelled using combinations of the three features, as follows (Fig. 2a): π1, using the shape feature (1 dim); π2, using the colour feature (1 dim); π3, using the pattern feature (1 dim); π4, using the shape and colour features (2 dim); π5, using the shape and pattern features (2 dim); π6, using the colour and pattern features (2 dim); and π7, using all three features (shape, colour, and pattern) (3 dim). Policies 1, 2, and 3 thus required the use of only one feature dimension; policies 4, 5, and 6 required combinations of two feature dimensions; and policy 7 used the combination of all three feature dimensions.
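For concreteness, the seven feature subsets can be encoded as a simple lookup table. The sketch below is a hypothetical Python encoding; the identifiers are ours, not from the original analysis code.

```python
# Hypothetical encoding of the seven candidate policies as feature subsets.
POLICIES = {
    1: {"shape"},                       # 1 dim
    2: {"colour"},                      # 1 dim
    3: {"pattern"},                     # 1 dim
    4: {"shape", "colour"},             # 2 dim
    5: {"shape", "pattern"},            # 2 dim
    6: {"colour", "pattern"},           # 2 dim
    7: {"shape", "colour", "pattern"},  # 3 dim
}
```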
The decision agent performed the task in the same manner as the participants (i.e., 256 trials presented in random order). For each policy, the agent updated the value of each stimulus using the naïve reinforcement learning algorithm (equation (1)) and made a decision using the softmax function (equation (2)); see section 3.1 (Naïve reinforcement learning) for further details. S1 Fig. 1 shows the mean final scores across 1,000 simulations. The free parameters α and β differed between the two task simulations: constant values of α = 0.1 and β = 1.5 were used for all policies in the first simulation (S1 Fig. 1a), while seven different pairs of α and β values, one per policy, were used in the second simulation (S1 Fig. 1b). These seven pairs represent the mean fitted parameter values from 29 subjects (S9 Table 1). When policies 1 through 4 were applied, performance was far below zero, indicating that these policies were not appropriate for the task. Although policies 5 and 6 exhibited high performance, policy 7 performed significantly better than both (paired t-test, p < 0.001). The difference was even more dramatic when policy-specific free parameters were adopted (S1 Fig. 1b; paired t-test, p < 0.001).
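A minimal sketch of one such simulation run, assuming a standard delta-rule update for equation (1) and a two-option softmax for equation (2); the stimulus encoding and the reward function are placeholders, since the full task specification is not reproduced in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_policy(policy_features, trials, reward_fn, alpha=0.1, beta=1.5):
    """Run one naive reinforcement learning agent that attends only to
    the feature dimensions in `policy_features`, e.g. {"shape", "colour"}.

    trials    : sequence of (left_stimulus, right_stimulus) pairs, each
                stimulus a dict mapping feature name -> feature value.
    reward_fn : callable giving the outcome of the chosen stimulus
                (placeholder for the task's actual payoff rule).
    """
    features = sorted(policy_features)
    values = {}      # value table keyed by the feature tuple this policy "sees"
    score = 0.0
    for left, right in trials:
        key_l = tuple(left[f] for f in features)
        key_r = tuple(right[f] for f in features)
        q_l, q_r = values.get(key_l, 0.0), values.get(key_r, 0.0)
        # Two-option softmax choice (equation (2)).
        p_left = 1.0 / (1.0 + np.exp(-beta * (q_l - q_r)))
        key, stim = (key_l, left) if rng.random() < p_left else (key_r, right)
        reward = reward_fn(stim)
        # Naive delta-rule value update (equation (1)).
        values[key] = values.get(key, 0.0) + alpha * (reward - values.get(key, 0.0))
        score += reward
    return score
```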

HMM based policy search model
A participant's current policy was estimated by applying an HMM-based policy search model. An HMM detects hidden states that cannot be observed directly, and has therefore been widely utilised in both cognitive neuroscience 4,48 and computer science 18. In the present study, we used a modified Baum-Welch algorithm for the HMM 18-21,49,50, which learned the model parameters (the transition probabilities between policies and the probability of observing a given action) in order to infer the hidden states of the model and thereby determine the current policy used on each of the 256 trials.
The model was developed mainly by reference to the Baum-Welch algorithm 50, with slight modifications to fit our behavioural situation. In particular, the seven policies served as the hidden states, and the observations were the subjects' behaviour (a left or right choice); thus, the probability of choosing the left or right button under each policy served as that policy's emission probability (equation (2)). The transition probability matrix A was trained using an expectation-maximisation (EM) algorithm 51, and the initial state probabilities were set to be equal (i.e., 1/7). The model details are as follows. In the expectation (E) phase, α_t(i) represents the forward probability of using state i at time t, β_t(i) the backward probability of using state i at time t, γ_t(i) the current probability of using state i at time t, and κ_t(i, j) the probability of using state i at time t and state j at time t+1.
In the maximisation (M) phase, the transition probability matrix A was updated. According to the model, the probability of using policy i at trial t was measured as γ_t(i).
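For reference, these quantities follow the textbook Baum-Welch updates. The sketch below writes them out under the assumption of a standard HMM, with b_j(o_t) denoting the emission probability of observation o_t under policy j; the original supplementary equations are not reproduced here.

```latex
\begin{align}
\gamma_t(i)   &= \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{7}\alpha_t(j)\,\beta_t(j)}, \\
\kappa_t(i,j) &= \frac{\alpha_t(i)\,A_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}
                      {\sum_{i'=1}^{7}\sum_{j'=1}^{7}\alpha_t(i')\,A_{i'j'}\,b_{j'}(o_{t+1})\,\beta_{t+1}(j')}, \\
A_{ij}        &\leftarrow \frac{\sum_{t=1}^{T-1}\kappa_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}.
\end{align}
```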

Softmax function-based policy search model
We also developed another policy search model using the softmax function. In this case, β′ represents the inverse temperature parameter, k the trial number, and P(πi) and P(ak) the prior probabilities of each policy and each action, respectively.
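One plausible reading of this model, sketched below: each policy accumulates the log-probability of the participant's actions up to trial k, and the policy probabilities are a β′-weighted softmax over that evidence. The evidence term is our assumption, since the referenced equation is not reproduced in this section.

```python
import numpy as np

def policy_probabilities(action_log_lik, beta_prime):
    """Softmax over per-policy evidence up to trial k.

    action_log_lik : array of shape (7,); assumed evidence term,
        sum over trials 1..k of log P(a_k | policy i).
    beta_prime     : inverse temperature controlling how sharply the
        model commits to the best-supported policy.
    """
    z = beta_prime * np.asarray(action_log_lik, dtype=float)
    z -= z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```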

Policy transition inference
Policy probabilities for all 256 trials were estimated using the two policy search models. To eliminate noise and transient policy shifts, we applied a fifth-order polynomial fit to each of the seven policy probability signals, rather than averaging values within a time window. The policy with the highest polynomial-fitted probability among the seven was then inferred as the current policy for each trial (Fig. 3) 24,25. In the present study, no trials had identical probabilities; had such a case occurred, the policy taking more features into account would have been selected as the current policy.
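A minimal sketch of this inference step, assuming the trial-by-trial policy probabilities are stored as a 256 × 7 array; the function and variable names are illustrative.

```python
import numpy as np

def infer_policy_sequence(policy_probs):
    """Smooth each policy's trial-wise probability with a fifth-order
    polynomial, then take the per-trial argmax as the current policy.

    policy_probs : array of shape (256, 7), one column per policy.
    Note: np.argmax breaks exact ties by taking the first (lowest-index)
    policy; the paper's tie-break rule (prefer more features) would need
    explicit handling, but no ties occurred in the data.
    """
    trials = np.arange(policy_probs.shape[0])
    smoothed = np.empty_like(policy_probs, dtype=float)
    for i in range(policy_probs.shape[1]):
        coeffs = np.polyfit(trials, policy_probs[:, i], deg=5)
        smoothed[:, i] = np.polyval(coeffs, trials)
    return smoothed.argmax(axis=1)   # inferred policy index per trial
```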

Model comparison
We estimated the fit of each of the following models to each participant's behaviour by maximising the likelihood function L (equation (3)). Here, ak represents the go or no-go choice, and ɛ represents the randomness of the choice (0 < ɛ < 1). The mean per cent correct choices, log-likelihood, AIC (p = 0.0069), and BIC (p = 0.00075) values revealed that our proposed model, the value transfer learning model, outperformed this policy 7 + ɛ model (Fig. 4g, S10 Table 2).
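The information criteria named above follow their standard definitions; a minimal sketch, taking a model's maximised log-likelihood and parameter count as inputs, with the 256-trial count from the task description.

```python
import numpy as np

def aic_bic(log_lik, n_params, n_obs=256):
    """Standard AIC and BIC from a model's maximised log-likelihood."""
    aic = 2 * n_params - 2 * log_lik
    bic = n_params * np.log(n_obs) - 2 * log_lik
    return aic, bic
```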
S1 Figure 1: Policy simulation results. Simulated score for each policy obtained after 256 trials using the naïve reinforcement learning algorithm (equation (1)). Each bar represents one policy (blue: policies using one dimension; green: policies using two dimensions; orange: the policy using three dimensions). a, Simulation with the same α and β for all policies (α = 0.1, β = 1.5). b, Simulation with fitted α and β values (averaged across participants) for each policy (S9 Table 1). Insets: simulated results across all 256 trials. Values are represented as mean ± SD. Fits with α < 10e-4 or β > 700 were discarded to allow proper estimation of the average free parameters.

S10 Table 2: The ROI analysis (Table 2) was performed with small-volume correction (SVC) using a combined mask of all predefined ROIs for each regressor; only clusters surviving a voxel-level p-value < 0.05 are reported. We set the initial voxelwise threshold at an uncorrected p-value < 0.005 with k > 10 voxels.

S12: FWE: whole-brain family-wise error correction. PCC: posterior cingulate cortex; SPL: superior parietal lobule; SI: primary somatosensory cortex; OFC: orbitofrontal cortex. * The cluster is the same as the cluster from the ROI analysis (Table 2). Whole-brain cluster-wise correction was performed at a cluster-level p-value < 0.05 with k > 10 voxels, preceded by an initial threshold of p < 0.001 (uncorrected).