Feature-based learning improves adaptability without compromising precision

Learning from reward feedback is essential for survival but can become extremely challenging with myriad choice options. Here, we propose that learning reward values of individual features can provide a heuristic for estimating reward values of choice options in dynamic, multi-dimensional environments. We hypothesize that this feature-based learning occurs not just because it can reduce dimensionality, but more importantly because it can increase adaptability without compromising precision of learning. We experimentally test this hypothesis and find that in dynamic environments, human subjects adopt feature-based learning even when this approach does not reduce dimensionality. Even in static, low-dimensional environments, subjects initially adopt feature-based learning and gradually switch to learning reward values of individual options, depending on how accurately objects’ values can be predicted by combining feature values. Our computational models reproduce these results and highlight the importance of neurons coding feature values for parallel learning of values for features and objects.
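To make the contrast between the two approaches concrete, the following minimal sketch (illustrative only, not the code used in this study; names and parameter values are assumptions) implements object-based and feature-based learning with simple Rescorla-Wagner updates for options defined by a color and a shape:

```python
import numpy as np

alpha = 0.1                      # learning rate (illustrative value)
n_color, n_shape = 3, 3          # each object is a (color, shape) pair

# Object-based learning: one value estimate per object (9 estimates).
V_obj = np.full((n_color, n_shape), 0.5)

# Feature-based learning: one estimate per feature instance (6 estimates).
V_color = np.full(n_color, 0.5)
V_shape = np.full(n_shape, 0.5)

def update(obj, reward):
    """Rescorla-Wagner update of both value systems after feedback."""
    c, s = obj
    V_obj[c, s] += alpha * (reward - V_obj[c, s])
    V_color[c] += alpha * (reward - V_color[c])
    V_shape[s] += alpha * (reward - V_shape[s])

def feature_value(obj):
    """Object value approximated by combining its feature values."""
    c, s = obj
    return 0.5 * (V_color[c] + V_shape[s])
```

Because each feature estimate is updated by the outcome of every object containing that feature, the feature-based system receives more observations per estimate and therefore adapts faster after a change, at the cost of precision whenever object values are not well predicted by their features.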

The blue and red lines show the maximum performance based on the feature-based approach in the generalizable and non-generalizable environments, respectively, assuming that the decision maker selects the more rewarding option based on this approach on every trial. The maximum performance for the object-based approach was similar in the two environments and equal to that of the feature-based approach in the generalizable environment. (e-h) Plotted is the average probability of choosing the more rewarding option on each trial during the four super-blocks of Experiments 1 and 2. Overall, performance increased over the course of each block as subjects learned, and dropped after each reversal in both experiments. There was no evidence for different performance in the early and late parts of the experiments.
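As a sketch of how such a performance ceiling can be computed (an illustration under assumed reward schedules, not the exact procedure from Methods), one can ask how often an ideal chooser that ranks options by summed marginal feature values selects the truly better option:

```python
import numpy as np
from itertools import combinations

def feature_based_ceiling(p_reward):
    """Fraction of object pairs for which ranking by summed marginal
    feature values selects the truly more rewarding object.
    p_reward: (n_color, n_shape) matrix of true reward probabilities."""
    v_color = p_reward.mean(axis=1)   # marginal value of each color
    v_shape = p_reward.mean(axis=0)   # marginal value of each shape
    objs = [(c, s) for c in range(p_reward.shape[0])
            for s in range(p_reward.shape[1])]
    pairs = list(combinations(objs, 2))
    hits = 0
    for a, b in pairs:
        pick = a if (v_color[a[0]] + v_shape[a[1]]
                     >= v_color[b[0]] + v_shape[b[1]]) else b
        better = a if p_reward[a] >= p_reward[b] else b
        hits += pick == better
    return hits / len(pairs)
```

In a fully generalizable schedule this ceiling coincides with the object-based ceiling; in a non-generalizable schedule it falls below it, which is the gap between the blue and red lines described above.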
Supplementary Figure 4. Comparison of the goodness-of-fit for the simulated data using various models in Experiments 1 to 4. Each column shows results generated with a given model (numbered 1 to 6), and rows a to d correspond to Experiments 1 to 4, respectively. Plotted is the average AIC (Akaike information criterion) over all sets of parameters (mean ± s.e.m.) for data generated with one of the six models in Experiments 1 to 4 and fit with each of the six models. The results for the model used to generate the data in a given experiment and its object-based or feature-based counterpart are highlighted in cyan and orange, respectively. The model used to generate the data provided the best fit, with a few exceptions: the coupled feature-based and object-based models in Experiments 3 and 4. Even for those models, fits based on models with the same learning approach (object-based or feature-based) as the generating model were better than fits based on models with the alternative approach, indicating that the learning approach was identifiable in all cases.

Supplementary Figure 5. Comparison of the estimated versus actual model parameters using the same models used to generate the data in Experiments 1 to 4. Each column shows results generated and fit with a given model (numbered 1 to 6), and rows a to d correspond to Experiments 1 to 4, respectively. Plotted is the average estimated ratio of the learning rate to the stochasticity in choice versus the actual value of this ratio, for data generated with one of the six models in Experiments 1 to 4 and fit with the same model. We used the ratio of the learning rate to the stochasticity in choice as the measure because these two parameters influence choice similarly (i.e., a scaled version of the two parameters results in very similar choice behavior). Overall, our fitting procedure allowed accurate estimation of the actual model parameters.

Supplementary Figure 6. (a-d) show the results for super-blocks 1 to 4, respectively. Plotted are the BIC values based on the best feature-based and object-based models for each individual, separately for each environment. The insets show histograms of the difference in BIC between the two best models for the generalizable (blue) and non-generalizable (red) environments. The dashed lines show the medians, and a star indicates a median significantly different from zero (one-sided sign-rank test, P < 0.05). In Experiment 1, the feature-based models provided better fits than the object-based models (one-sided sign-rank test; first super-block: P = 0.018, d = 0.39; second super-block: P = 0.0036, d = 0.45; third super-block: P = 0.005, d = 0.50; fourth super-block: P = 0.040, d = 0.46; N = 43). In Experiment 2, the object-based models provided better fits than the feature-based models in all super-blocks except the third (first super-block: P = 0.029, d = 0.50; second super-block: P = 0.038, d = 0.27; third super-block: P = 0.25, d = 0.15; fourth super-block: P = 0.036, d = 0.37; N = 21). Overall, we did not find any evidence for changes in the learning approach during the course of the experiments. These results show that subjects were more likely to adopt the feature-based approach in the generalizable environment and the object-based approach in the non-generalizable environment, and that our results were not driven by two distinct types of behavior during the early and late parts of the experiments. (e-h) The same as in a-d but for the excluded subjects. Overall, there was no evidence that the excluded subjects changed their strategy during the experiments.
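The model-comparison machinery behind Supplementary Figures 4 to 6 reduces to fitting each candidate model by maximum likelihood and comparing information criteria. A minimal sketch of that scaffolding (generic code, assuming each model exposes a negative log-likelihood function of its parameters; the scipy usage is illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import minimize

def aic(nll, k):
    """Akaike information criterion: 2k + 2 * negative log-likelihood."""
    return 2 * k + 2 * nll

def bic(nll, k, n_trials):
    """Bayesian information criterion: k * ln(n) + 2 * NLL."""
    return k * np.log(n_trials) + 2 * nll

def fit_model(nll_fn, x0, bounds):
    """Maximum-likelihood fit of a model's free parameters.
    nll_fn(params) must return the negative log-likelihood of the choices."""
    res = minimize(nll_fn, x0, bounds=bounds, method="L-BFGS-B")
    return res.x, res.fun

# Model recovery as in Supplementary Figure 4: generate choices with
# model i, fit all six models, and check that model i (or a model using
# the same learning approach) attains the lowest average AIC.
```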
To compare the HDML and PDML models, we tested the overall performance and the ability of these models to adopt the feature-based vs. object-based approach in a large set of environments, and examined how interactions between generalizability, the frequency of changes in reward probabilities (volatility), and dimensionality affect the behavior of these models.
First, we used the two network models to simulate various environments with different levels of generalizability and volatility. These environments were constructed by varying the relationship between the reward value of each object and the reward values of its features, and by changing the block length, i.e., the number of trials over which reward probabilities were fixed (see Methods). The maximum and minimum levels of generalizability in these simulations correspond to the environments used in Experiments 1 and 2, respectively (Supplementary Figure 1). Both models were able to perform the task in various environments with different levels of volatility and generalizability, but the performance of the HDML model was slightly higher in all environments (∆performance = 0.001 ± 0.0042 (mean ± std); two-sided sign-rank test, P = 0.0023, d = 0.23; Supplementary Fig. 9a, d, g). More importantly, the difference in the strength of connections from feature-value-encoding (FVE) and object-value-encoding (OVE) neurons to the next stage of processing (C_F − C_O) was more strongly modulated by generalizability and volatility in the HDML than in the PDML model, indicating that HDML was better able to adjust the strength of connections from value-encoding neurons (Supplementary Fig. 9b, e, h). As generalizability or volatility increased, connections between FVE neurons and the signal-selection circuit became stronger than connections between OVE neurons and the signal-selection circuit. Therefore, only the HDML model assigned larger weights to feature-based than to object-based reward values (larger W_F − W_O) as the environment became more generalizable or volatile (Supplementary Fig. 9c, f, i).
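For intuition, environments along these two axes can be generated as in the following sketch (a simplified stand-in for the procedure in Methods; the interpolation scheme and all names are assumptions): generalizability interpolates between a reward matrix that is exactly the sum of feature values and a shuffled version of it, while block length sets volatility.

```python
import numpy as np

def make_environment(rng, n_feat, generalizability, block_len, n_blocks):
    """Yield one reward-probability matrix over objects per block.

    generalizability = 1: object values are fully predicted by summing
    feature values (as in Experiment 1); generalizability = 0: object
    and feature values are unrelated (as in Experiment 2).
    Shorter block_len means more frequent changes, i.e. higher volatility."""
    for _ in range(n_blocks):
        v_color = rng.uniform(0, 1, n_feat)
        v_shape = rng.uniform(0, 1, n_feat)
        separable = 0.5 * (v_color[:, None] + v_shape[None, :])
        shuffled = rng.permutation(separable.ravel()).reshape(n_feat, n_feat)
        p = generalizability * separable + (1 - generalizability) * shuffled
        yield p, block_len

rng = np.random.default_rng(0)
blocks = list(make_environment(rng, n_feat=3, generalizability=0.5,
                               block_len=48, n_blocks=10))
```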
Overall, these results demonstrated that, although both models were able to perform the task, the HDML model exhibited higher performance and stronger adjustment of the connections from value-encoding neurons to the next level of computation. The HDML model was therefore more successful in assigning appropriately graded weights to the two learning approaches according to the reward statistics of the environment.
Second, we examined the interaction between dimensionality reduction and generalizability in adopting a model of the environment by simulating variants of the environments in Experiments 3 and 4 using the two models. Because dimensionality is a discrete quantity, we considered two environments with different numbers of feature instances (three or four), resulting in dimensionality D = 3² = 9 or D = 4² = 16. We also varied the level of generalizability across environments (see Methods). Consistent with the simulation results for Experiments 1 and 2 presented in Supplementary Figure 9, an increase in generalizability caused both models to assign higher weights to feature-based than to object-based reward values, but this effect was much stronger for the HDML model (larger positive slopes in Supplementary Fig. 10e-f compared with Supplementary Fig. 10b-c). An increase in dimensionality further biased both models toward assigning more weight to feature-based than to object-based reward values.
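The weight adjustment summarized here can be illustrated with a minimal sketch (a crude stand-in for the plasticity of C_F and C_O in the network models; the update rule is an assumption for illustration): the decision variable mixes the two value signals, and the mixture shifts toward whichever system produced the smaller prediction error.

```python
def mixed_value(v_feat, v_obj, w_f, w_o):
    """Decision variable as a weighted mixture of the two value signals."""
    return w_f * v_feat + w_o * v_obj

def update_weights(w_f, w_o, v_feat, v_obj, reward, eta=0.05):
    """Shift weight toward the system with the smaller prediction error.
    In generalizable or volatile environments the feature-based system
    errs less on average, so w_f - w_o grows (cf. Supplementary Fig. 9)."""
    if abs(reward - v_feat) < abs(reward - v_obj):
        w_f, w_o = w_f + eta * (1 - w_f), w_o - eta * w_o
    else:
        w_f, w_o = w_f - eta * w_f, w_o + eta * (1 - w_o)
    return w_f, w_o
```

Under this kind of rule, higher dimensionality slows object-based learning (each object is sampled less often), which biases the weights toward the feature-based signal, consistent with the simulation results described above.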
Overall, the simulation results for the two alternative network models illustrate that the HDML model exhibits higher performance and stronger adjustment to task parameters and reward statistics in the environment. These results indicate that hierarchical decision-making and learning might be more advantageous for adopting a model of the environment for learning in dynamic, multi-dimensional environments.