Belief state representation in the dopamine system

Learning to predict future outcomes is critical for driving appropriate behaviors. Reinforcement learning (RL) models have successfully accounted for such learning, relying on reward prediction errors (RPEs) signaled by midbrain dopamine neurons. It has been proposed that when sensory data provide only ambiguous information about which state an animal is in, the animal can predict reward based on a set of probabilities assigned to hypothetical states (called the belief state). Here we examine how dopamine RPEs and subsequent learning are regulated under state uncertainty. Mice are first trained in a task with two potential states defined by different reward amounts. During testing, intermediate-sized rewards are given in rare trials. Dopamine activity is a non-monotonic function of reward size, consistent with RL models operating on belief states. Furthermore, the magnitude of dopamine responses quantitatively predicts changes in behavior. These results establish the critical role of state inference in RL.

state) to compute the RPE (δ). The 4th column shows the theoretical RPE, which is centered around 0. The main distinction between the standard RL models (a, b) and the belief state models (c-f) is the state representation: the standard RL model has a single state because the odor is ambiguous. g-l The last two columns show the theoretical value and RPE on trial 2, obtained by fitting each model's RPE to the GCaMP responses (see parameters in Supplementary Table 1) using linear regression. This regression accounted for the fact that in our task most reward responses were positive, likely due to temporal uncertainty (refs. 1,2). Only the belief state models reproduce the non-monotonic pattern of dopamine RPEs observed on trial 2. a Standard RL with the initial value of the state fixed at 0.5 (V, averaged between the trained states s1 and s2), leading to a monotonically increasing RPE. b Variant of the standard RL model with free initial values for the state depending on the previous block, following s1 (value V1) or s2 (value V2). The averaged value is indicated by a black dotted line. This also leads to a monotonically increasing RPE; only the intercept is affected. c RL with belief state, with the initial prior over the states fixed at 0.5 (p). The value of the belief state depends on the reward size: smaller rewards are more likely to have come from s1, yielding a low value, while larger rewards are more likely to have come from s2, yielding a high value. This expectation function predicts a non-monotonic pattern of RPEs as a function of the delivered reward. d Variant of the belief state RL model with a free initial prior following s1 (prior p1), constraining both priors to sum to 1; note that the averaged prior (black dotted line) is identical to the prior in c. e Variant of the belief state RL model with two free initial priors, following s1 (prior p1) and following s2 (prior p2). Notice how this allows the averaged prior (black dotted line) to be biased towards s2 in this example, leading to an asymmetric non-monotonic pattern of prediction errors. f Variant of the belief state RL model with three states, including an additional state for intermediate rewards (s3), and three free initial priors: following s1 (prior p1), following s2 (prior p2), and for intermediate rewards (prior p3).
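The contrast between the standard RL models (a, b) and the belief state models (c-f) can be sketched numerically. The snippet below is a minimal illustration, not the fitted model: the reward amounts R1 and R2, the Gaussian sensory-noise width SIGMA, and the prior p1 are all assumed values chosen only to reproduce the qualitative monotonic vs. non-monotonic RPE patterns.

```python
import numpy as np

# Assumed reward amounts defining the two trained states (arbitrary units).
R1, R2 = 1.0, 10.0
SIGMA = 1.5  # assumed sensory noise over the perceived reward size

def standard_rl_rpe(reward, v0=(R1 + R2) / 2):
    """Standard RL (panel a): a single ambiguous state with a fixed initial
    value averaged between s1 and s2; the RPE grows monotonically with reward."""
    return reward - v0

def belief_state_rpe(reward, p1=0.5):
    """RL with belief state (panel c): the posterior over s1/s2 given the
    delivered reward sets the expected value, yielding a non-monotonic RPE."""
    # Likelihood of the observed reward under each state (Gaussian assumption).
    lik1 = np.exp(-(reward - R1) ** 2 / (2 * SIGMA ** 2))
    lik2 = np.exp(-(reward - R2) ** 2 / (2 * SIGMA ** 2))
    # Posterior belief in s1 combines the prior p1 with the likelihoods.
    b1 = p1 * lik1 / (p1 * lik1 + (1 - p1) * lik2)
    value = b1 * R1 + (1 - b1) * R2  # expected reward under the belief state
    return reward - value

# Small rewards are attributed to s1 (low value, RPE near zero), large rewards
# to s2; intermediate rewards produce the non-monotonic RPE pattern.
for r in [1, 3, 5, 7, 10]:
    print(r, round(standard_rl_rpe(r), 2), round(belief_state_rpe(r), 2))
```

Under these assumptions the standard-RL RPE rises linearly with reward size, whereas the belief-state RPE is positive for rewards somewhat above R1, then flips negative as the posterior shifts to s2, matching the qualitative trial-2 pattern described above.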

Supplementary Figure 2.
Behavior and fibre photometry recordings of VTA dopamine neurons in classical conditioning. We presented 3 odors, which predicted the delivery of water one second later with either 90% (red), 50% (green) or 0% (black) probability. Unpredicted water was delivered on 10% of trials. a On unpredicted water delivery trials, the mouse licked upon water delivery. b, c For odors predicting reward with 90% or 50% probability, the mouse showed anticipatory licking after odor presentation, proportional to the probability of reward delivery. d-f The activity of dopamine neurons at reward delivery showed a canonical RPE pattern: strongest response to fully unpredicted reward (d), decreased responses to predicted rewards (e) and a dip at reward omission (f). Dopamine neuron activity at cue onset was proportional to the value of the cue (e, f). Data represent mean ± s.e.m.; n indicates number of trials. The upper row (a-c) shows the average across mice, while the lower row (d-f) shows the same average after normalizing each mouse's signal by min-max normalization. This normalization corrects for the different amplitudes of GCaMP signals across the recording conditions, but preserves the features observed in each condition. Note that the monotonicity and non-monotonicity of the responses on trials 1 and 2, respectively, are observed in each recording condition (a-c). g Normalized dopamine responses for all mice on trials 1 to 5. h Best fits by the standard and belief state reinforcement learning models. Data represent mean ± s.e.m.

Supplementary Figure 7. Polynomial fits to dopamine response and behavior for trials 1 and 2.
Polynomials of degree 1 (left), 2 (middle) or 3 (right) were fit to the data, and the corresponding r² and adjusted r² (corrected for the degree of the polynomial) were computed. The highest adjusted r² is highlighted in bold. a Dopamine reward responses on trial 1 were best fit by a linear function. b Dopamine reward responses on trial 2 were best fit by a cubic function. c The change in anticipatory licking from trial 1 to trial 2 was best fit by a cubic function, although the linear function also provided a good fit (r² = 0.94). d The change in anticipatory licking from trial 2 to trial 3 was best fit by the cubic function.
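The polynomial comparison above can be reproduced in outline as follows. This is a sketch on synthetic toy data (a noisy cubic standing in for a non-monotonic trial-2 response curve), not the paper's data; the adjusted r² uses the standard correction for the number of polynomial coefficients.

```python
import numpy as np

def adjusted_r2(y, y_hat, n_params):
    """r² and adjusted r²; the adjustment penalizes extra coefficients:
    adj r² = 1 - (1 - r²) * (n - 1) / (n - n_params - 1)."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1 - ss_res / ss_tot
    return r2, 1 - (1 - r2) * (n - 1) / (n - n_params - 1)

# Toy data with a cubic, non-monotonic shape plus noise (assumed, illustrative).
x = np.linspace(-1, 1, 20)
y = x ** 3 - 0.5 * x + 0.05 * np.random.default_rng(0).standard_normal(20)

# Fit polynomials of degree 1, 2 and 3 and compare adjusted r².
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    r2, adj = adjusted_r2(y, y_hat, degree)
    print(degree, round(r2, 3), round(adj, 3))
```

On data of this shape the cubic fit attains the highest adjusted r² even after the degree penalty, which is the criterion used to declare the trial-2 responses "best fit by a cubic function".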

Supplementary Figure 8. Dopamine reward responses and model fits across trials.
a Standard and belief state reinforcement learning models were simulated using the average parameters across mice (Supplementary Table 1). b Sum of squared errors between simulations from the two models; trial 2 shows the strongest difference. c Normalized dopamine responses to rewards, and fits of the RL models without or with belief states, each with two free initial values or priors.

Supplementary Tables
Supplementary Table 1. Best-fitting parameter estimates, shown as the mean across mice, and model comparison. Bayesian information criterion (BIC) and exceedance probabilities (refs. 3,4) both favoured the RL model with belief states and two free initial priors over the other models. The best values are highlighted in bold.
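A BIC comparison of this kind can be sketched as below. The SSE values and parameter counts are illustrative placeholders, not the paper's fitted numbers; the formula is the standard BIC for least-squares fits with Gaussian errors, where lower BIC is better.

```python
import numpy as np

def bic(sse, n_obs, n_params):
    """BIC for a model fit by least squares with Gaussian errors:
    BIC = n * ln(SSE / n) + k * ln(n); lower values are favoured."""
    return n_obs * np.log(sse / n_obs) + n_params * np.log(n_obs)

# Hypothetical fit results: sum of squared errors and free-parameter counts
# for a standard RL model vs. the belief state model with two free priors.
n_obs = 40  # assumed number of fitted data points
models = {
    "standard RL": {"sse": 12.0, "k": 2},
    "belief state, 2 free priors": {"sse": 4.0, "k": 4},
}
for name, m in models.items():
    print(name, round(bic(m["sse"], n_obs, m["k"]), 2))
```

With these illustrative numbers the belief state model's better fit outweighs its two extra parameters, so it attains the lower BIC, mirroring the comparison reported in the table.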