Learning with reinforcement prediction errors in a model of the Drosophila mushroom body

Effective decision making in a changing environment demands that accurate predictions about decision outcomes are learned. In Drosophila, such learning is orchestrated in part by the mushroom body, where dopamine neurons signal reinforcing stimuli to modulate plasticity presynaptic to mushroom body output neurons. Building on previous mushroom body models, in which dopamine neurons signal absolute reinforcement, we propose instead that dopamine neurons signal reinforcement prediction errors by utilising feedback reinforcement predictions from output neurons. We formulate plasticity rules that minimise prediction errors, verify that output neurons learn accurate reinforcement predictions in simulations, and postulate connectivity that explains more physiological observations than an experimentally constrained model. The constrained and augmented models reproduce a broad range of conditioning and blocking experiments, and we demonstrate that the absence of blocking does not imply the absence of prediction-error-dependent learning. Our results provide five predictions that can be tested using established experimental methods.
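The core circuit motif described above can be summarised in a few lines of code. The following is a minimal sketch, assuming a rectified-linear DAN that sums reinforcement, tonic KC input, and negative feedback of the MBON's reinforcement prediction; all names, rate equations, and the learning rule are illustrative stand-ins rather than the model's exact equations.

```python
import numpy as np

# Minimal sketch: a DAN computes a prediction error by combining reinforcement
# with fed-back MBON output; plasticity cancels the error over trials.
# Variable names, rate equations, and the rule are illustrative assumptions.
rng = np.random.default_rng(0)

k = np.zeros(10)
k[:3] = 1.0                          # KC activity pattern evoked by the CS+
w = rng.uniform(0.0, 0.01, 10)       # KC->MBON weights encoding the prediction
baseline = 1.0                       # tonic KC->DAN drive (cf. w_K^T k)
lr, r = 0.05, 1.0                    # learning rate; reward paired with the CS+

for trial in range(50):
    m = w @ k                        # MBON rate = reinforcement prediction (RP)
    d = max(0.0, r + baseline - m)   # DAN rate = reinforcement minus fed-back RP
    w += lr * k * (d - baseline)     # plasticity tracks deviations from baseline

print(f"learned RP = {w @ k:.2f} (target r = {r})")
```

As the RP approaches r, the DAN returns to its baseline and the weights stop changing, which is the sense in which DANs here signal prediction errors rather than absolute reinforcement.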

in the MV model, its preference for the CS+ remains strong, and even appears to increase by a small amount, resulting in only a small, positive ∆f.

To explain the effect of the inhibitory block on the magnitude of m_CS+ − m_CS−, we analyse the synaptic weight dynamics, ẇ_±, given the reinforcement provided (r_+ = 1, r_− = 0) and the inhibition of M_+. In the VSλ model, the weight dynamics, ẇ_±^VSλ, are given by […]; for the MV model, we have […], where the factor of 1/2 compensates for the fact that each lobe in the MV model […], and where c = λ − w_K^T k is a constant. As such, the difference between the CS+ and CS− RPs is […]. That is, the net preference for the CS+ over the CS− does not change, as the RP for both is increased by the same amount due to the inhibition of M_+.
where we have substituted in the stabilised MBON firing rates, as calculated earlier, and used the following parameters from the simulations: c = 2 (λ = 12, w_K^T k = 10) and r_+ = 1 (from the training phase).
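As a quick numerical check of the quoted constants (the variable names below are ours):

```python
lam, wKk, r_plus = 12.0, 10.0, 1.0   # λ, w_K^T k, and r_+ from the simulations
c = lam - wKk                        # c = λ − w_K^T k
assert c == 2.0                      # matches the value c = 2 quoted above
```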

The decay rate of the CS+ memory in the MV model is given by […], where we have assumed that, before training, w_±^MV = ε, with ε a small number.

Consider the case in which our models undergo a phase of aversive conditioning (with r_− = 1, r_+ = 0), followed by a test phase in which the CS+ is presented without a US (r_− = 0). The plasticity rules we present in the main text specify that, during aversive conditioning, w_+ will undergo synaptic depression (and w_− will undergo potentiation in the MV model). As the RPE is minimised, the D_+ and D_− responses will tend towards a baseline firing rate that is determined by w_K^T k, the direct input to DANs from KCs. Thus, after the initial excitatory response to the punishment, the D_− response will decrease over the following trials (in the MV model, the D_+ firing rate will increase after its initial inhibited response). As such, when the punishment is later removed and the CS+ is presented alone, D_− will exhibit a reduced response compared to baseline (that is, an excitatory response that is weaker than w_K^T k); a toy simulation of this argument is sketched below. This appears to contradict observations by Dylla et al. […]

In the main text, we describe two plasticity rules for the MV model that help to minimise the full cost function, C_RPE. Here, we outline some of the issues with the first plasticity rule, which was described by Eq. 7. Eq. 7 specifies that each synapse requires information from only a single DAN, whereby the synapse and the DAN have opposite valences (e.g. w_+ and D_−).
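To make the extinction argument above concrete, the following toy simulation models D_− as a rectified sum of punishment, tonic KC drive (standing in for w_K^T k), and inhibitory feedback of the aversive RP, with a simple delta rule standing in for the plasticity rules; none of this is the paper's exact formulation.

```python
import numpy as np

# Toy simulation: after aversive conditioning, the D- response to the CS+
# alone falls below its baseline. Circuit and rule are illustrative stand-ins.
k = np.ones(5) / 5                      # KC pattern for the CS+
w_minus = np.zeros(5)                   # KC->M- weights (aversive RP)
baseline, lr = 1.0, 0.1                 # tonic KC->DAN input; learning rate

def d_minus(r_minus, m_minus):
    return max(0.0, r_minus + baseline - m_minus)

# Aversive conditioning: punishment (r- = 1) paired with the CS+.
for _ in range(300):
    d = d_minus(1.0, w_minus @ k)
    w_minus += lr * k * (d - baseline)  # potentiation while d exceeds baseline

# Test phase: CS+ alone (r- = 0). The D- response is weaker than baseline.
print(f"baseline = {baseline:.1f}, D- response to CS+ alone = "
      f"{d_minus(0.0, w_minus @ k):.2f}")
```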

The final issue that arises with Eq. 7, and described in the main text, is that it does not reproduce experimental results in which artificial stimulation of either D_+ (or D_−) can act as a proxy for reward (or punishment) to induce a memory. The explanation for this is provided in Supplementary Fig. 4.

Supplementary Figure 4: […] Fig. 3b in the main text). This model requires the plasticity rule, P_±^VS, to induce potentiation with elevated DAN firing rates. As such, the factor (w^T b) […] Both MV model plasticity rules depress the M_− firing rate after D_+ stimulation. The brief, weak association that is induced with Eq. 7 disappears because feedback inhibition from M_− to D_− is reduced. Consequently, the D_− firing rate increases and the M_+ firing rate is depressed as well. In contrast, when D_+ modulates plasticity at both w_+ and w_−, as required by Eq. 8, D_+ stimulation depresses the firing rate of M_− and potentiates the firing rate of M_+.
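The contrast between the two rules can be illustrated with a toy simulation. Below, a "single-DAN" rule (a stand-in for Eq. 7, in which each synapse reads only the opposite-valence DAN) is compared with a "dual-DAN" rule (a stand-in for Eq. 8, in which both DANs modulate both synapses) during sustained D_+ stimulation. The feedback circuit, parameter values, and rule forms are assumptions chosen only to reproduce the qualitative behaviour described above.

```python
import numpy as np

# Toy contrast between stand-ins for Eq. 7 (single-DAN) and Eq. 8 (dual-DAN)
# during sustained D+ stimulation. Circuit and parameters are assumptions.
k, b, lr, stim = 1.0, 1.0, 0.05, 1.0    # KC drive; baseline; rate; D+ input

def dans(m_plus, m_minus):
    # Each DAN is inhibited by the same-valence MBON and excited by the
    # opposite-valence MBON (cf. the M- -> D- feedback inhibition above).
    d_plus = max(0.0, b + stim + m_minus - m_plus)
    d_minus = max(0.0, b + m_plus - m_minus)
    return d_plus, d_minus

def run(rule, n_trials=120):
    w_plus = w_minus = 0.5              # matched weights: zero net RP at start
    pref = []                           # approach preference, m+ - m-
    for _ in range(n_trials):
        m_plus, m_minus = w_plus * k, w_minus * k
        d_plus, d_minus = dans(m_plus, m_minus)
        if rule == "single-DAN":
            w_plus -= lr * k * (d_minus - b)        # w+ reads D- only
            w_minus -= lr * k * (d_plus - b)        # w- reads D+ only
        else:
            w_plus += lr * k * (d_plus - d_minus)   # both DANs modulate
            w_minus -= lr * k * (d_plus - d_minus)  # both synapses
        w_plus, w_minus = max(w_plus, 0.0), max(w_minus, 0.0)
        pref.append(m_plus - m_minus)
    return pref

for rule in ("single-DAN", "dual-DAN"):
    p = run(rule)
    print(f"{rule}: peak preference = {max(p):.2f}, final = {p[-1]:.2f}")
```

Under these assumptions, the single-DAN rule yields a transient preference that collapses back to zero during stimulation, whereas the dual-DAN rule builds and maintains it, mirroring the behaviour described for Supplementary Fig. 4.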
Supplementary Figure 5: Learning reinforcement predictions in the mixed-valence model. Each column corresponds to the behaviour of the model with different KC→DAN synaptic weights, as specified by the γ value above each column. a-c When γ is chosen such that w_K^T k is less than the RPE magnitude, the dynamic range of DANs is restricted (b). As such, the RPE reported by the difference in firing rates of D_+ and D_− is reduced (c), resulting in slower learning (a) and RPEs that decay more slowly. d-i Above a critical value, changing the KC→DAN synaptic weights has no effect on model behaviour. j Because KC→MBON weights, w_±, are initialised with small values, they must initially become larger on average in order to learn accurate RPs. Consequently, the mean MBON firing rates also increase initially (the trend of increasing firing rates up to trial 200), but then stabilise (after trial 200). Abbrev.: RP (reinforcement prediction); MBON (mushroom body output neuron); DAN (dopamine neuron); RPE (reinforcement prediction error); m_+, m_− (approach/avoidance MBON firing rates); d_+, d_− (appetitive/aversive DAN firing rates).
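The dynamic-range argument in panels a-c can be sketched directly: if DAN rates rectify at zero, then whenever the tonic drive w_K^T k is smaller than the RPE magnitude, one DAN clips and the reported difference d_+ − d_− is compressed relative to its unclipped value. The rate model below is an illustrative assumption, not the paper's equations.

```python
# DANs rectify at zero; small tonic drive compresses the reported RPE.
def reported_rpe(rpe, wKk):
    d_plus = max(0.0, wKk + rpe)     # appetitive DAN
    d_minus = max(0.0, wKk - rpe)    # aversive DAN
    return d_plus - d_minus          # unclipped value would be 2 * rpe

for wKk in (0.2, 0.5, 2.0):
    print(f"w_K^T k = {wKk}: RPE of 1.0 reported as "
          f"{reported_rpe(1.0, wKk):.1f} (unclipped: 2.0)")
```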
Supplementary Figure 6: Model performance, compared with different idealised agents, for a multiple-alternative forced-choice task involving 2 or more cues. a The performance of the mixed-valence (MV) model (blue line) and the valence-specific (VSλ) model (green line), as measured by the trial-averaged reinforcement (TAR) obtained, and as a function of the number of cues. As a comparison, we also provide the TAR for a perfect agent that always chooses the most positively reinforcing cue (black, solid curve; this emulates an agent that predicts reinforcements for all cues with 100% accuracy and is fully deterministic, β = ∞), and the mean TAR for an agent that chooses cues at random with equal probability for each cue (black, long-dashed curve). The TAR for the VSλ and MV models peaks when there are six cues to choose from. This is likely due to the trivial fact that there are more positive reinforcements from which to choose when there are more cues, as demonstrated by the monotonic increase in TAR for the perfect agent. A fairer comparison against the model (black, dotted curve) is an agent that predicts reinforcements for all cues with 100% accuracy, yet makes decisions probabilistically using the same softmax function used by the model (β = 5). This results in only a small reduction in the TAR compared to the perfect agent. An even more realistic comparison (purple curve) is an agent that can only update reinforcement predictions (RPs) for the chosen cue, but otherwise exhibits perfect synaptic plasticity, i.e. the RP is set equal to the reinforcement just obtained for a given cue. This substantially diminishes performance, which becomes comparable to the model when the number of cues is fewer than twenty. We conclude that the reduced performance of the model, as compared with the 100% accurate agent, is due to the unavoidable fact that the model's ability to maintain accurate predictions for all cues decreases with the number of cues (panel c). This is exacerbated by the slowness of synaptic plasticity, as set by the learning rate. Finally, we tested a version of the model in which KC responses to different cues may overlap (red dashed line). This model comprised 2000 KCs, similar to estimates for a single hemisphere in Drosophila [13], of which ∼5% (totalling 100 KCs) were randomly selected to respond to each cue [14]. This means that, when there are 100 cues from which to choose, for example, each KC responds to 5 cues on average. Strikingly, this had a negligible effect on the mean TAR obtained by the model. This may be surprising, as synapses from each KC end up learning the mean reinforcement across those 5 cues (panel e). However, because the reinforcement predicted for any one cue is encoded in the synapses from 100 KCs, the weighted input from all KCs to M_+ and M_− yields accurate RPs. We analyse this feature in greater detail in panels e-g. All lines: mean TAR across 100 simulation runs; shaded regions: standard deviation in the TAR.
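The effect of probabilistic choice in panel a can be sketched as follows: the probability that a β = 5 softmax over perfectly known reinforcements picks the best of N cues. Drawing reinforcements uniformly from [−1, 1] is our assumption for the sketch; the paper's reinforcement schedules differ.

```python
import numpy as np

# Probability that a beta = 5 softmax over perfectly known reinforcements
# selects the best of N cues (reinforcement distribution is an assumption).
rng = np.random.default_rng(1)
beta = 5.0

for n_cues in (2, 6, 20, 100):
    p_best = []
    for _ in range(2000):
        r = rng.uniform(-1.0, 1.0, n_cues)   # candidate reinforcements
        p = np.exp(beta * r)
        p /= p.sum()                         # softmax choice probabilities
        p_best.append(p[np.argmax(r)])
    print(f"N = {n_cues:3d}: P(choose best cue) ~ {np.mean(p_best):.2f}")
```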

b Probabilities of choosing the most rewarding cue for two of the ideal agents discussed in a. Black line: the probability as computed using the softmax function with β = 5, for an agent that has full knowledge of all available reinforcements for the current trial (100% accurate RPs). Purple line: fraction of trials in which the perfect-plasticity agent actually chose the most rewarding cue. Lines: mean across 100 simulation runs; shaded regions: standard deviation. c Root-mean-squared error between RPs and actual reinforcements, where the mean is taken across all cues and all trials. Line: mean across 100 simulation runs; shaded region: standard deviation. d A slow learning rate is advantageous when the reinforcements have a stochastic element. We added zero-mean Gaussian white noise, with SD σ_ξ, to the same low-pass-filtered reinforcement schedules. The performance of both the MV model and the agent with perfect plasticity decreased with increasing noise. However, the addition of stochastic noise had a much greater impact on the perfect-plasticity agent. Any advantage it had over the MV model was lost for σ_ξ ≳ 1.0. Moreover, the performance of the agent was marginally, though consistently, lower than that of the model in this regime. e-g When cues elicit overlapping responses in KCs, accurate RPs may be obtained from the ensemble of KC inputs to the MBONs, despite the fact that each KC responds to multiple cues. Here, the MV model has 2000 KCs and is trained on 200 cues. RPs are updated on each trial for all cues, rather than only for the chosen cue. All reinforcements and RPs are taken from the final trial in the simulation, after learning has converged.
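Panel d's point can be sketched with a simple comparison: a one-shot update (RP set to the last observed reinforcement, i.e. "perfect plasticity") tracks the noise, whereas a slow delta rule averages it out. The signal, noise level, and learning rate below are illustrative choices, not the paper's values.

```python
import numpy as np

# One-shot RP updates track the noise; a slow delta rule averages it out.
rng = np.random.default_rng(2)
T, lr, sigma = 2000, 0.1, 1.0
t = np.arange(T)
r_true = np.sin(2 * np.pi * t / 500)             # slowly varying reinforcement
r_obs = r_true + sigma * rng.standard_normal(T)  # noisy observed reinforcement

rp_fast = np.empty(T)   # "perfect plasticity": RP jumps to the last observation
rp_slow = np.empty(T)   # slow delta rule: RP moves a fraction lr towards it
fast = slow = 0.0
for i in range(T):
    fast = r_obs[i]
    slow += lr * (r_obs[i] - slow)
    rp_fast[i], rp_slow[i] = fast, slow

rmse = lambda rp: np.sqrt(np.mean((rp - r_true) ** 2))
print(f"RMSE: one-shot = {rmse(rp_fast):.2f}, "
      f"slow delta rule = {rmse(rp_slow):.2f}")
```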
e Each data point corresponds to the reinforcement for a single cue, and a single KC's contribution (the difference between the excitatory currents it elicits in M_+ and M_−) to the RP for that cue. The contribution from individual KCs to the overall RP is very noisy (variability in the vertical axis). This is because the weight from each KC can only learn the mean reinforcement over the cues to which that KC responds. RP contributions from five example KCs are shown in the coloured data points (each colour corresponds to a single KC, and each KC responds to multiple cues). The RP contribution from each KC is approximately the same for all cues to which it responds. Note that, because each cue elicits a response in 100 KCs on average, the RP contribution from each KC is approximately 1% of the total RP. The red line plots y = x/100 to show this correspondence between reinforcements and KC-specific RPs.
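The population-coding argument of panels e-g can be sketched as follows: each KC's net weight can only learn a (scaled) average of the reinforcements over its cues, yet the RP summed over the ~100 KCs responding to a cue remains accurate. The sizes follow the caption; the plain delta rule is an illustrative stand-in for the model's plasticity.

```python
import numpy as np

# Overlapping KC responses: noisy per-KC contributions, accurate summed RP.
rng = np.random.default_rng(3)
n_kcs, n_cues, lr = 2000, 200, 0.5
K = (rng.random((n_cues, n_kcs)) < 0.05).astype(float)  # cue x KC responses
r = rng.uniform(-1.0, 1.0, n_cues)                      # reinforcement per cue
w = np.zeros(n_kcs)                                     # net KC->MBON weights

for epoch in range(100):
    for c in rng.permutation(n_cues):
        err = r[c] - K[c] @ w               # per-cue prediction error
        w += lr * K[c] * err / K[c].sum()   # spread correction over active KCs

rp = K @ w
print(f"population RP error (RMSE) = {np.sqrt(np.mean((rp - r) ** 2)):.3f}")
active = K.sum(axis=0) > 0                  # KCs responding to at least one cue
mean_r = np.array([r[K[:, i] > 0].mean() for i in np.flatnonzero(active)])
print(f"corr(KC weight, mean reinforcement over its cues) = "
      f"{np.corrcoef(w[active], mean_r)[0, 1]:.2f}")
```

The printed correlation reflects panel f (each weight tracks the mean reinforcement over that KC's cues), while the low RMSE reflects panel g (the summed, population-level RP is accurate).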
f A single KC's contribution to the RP, averaged across all cues to which it responds, corresponds well with approximately 1% of the actual reinforcement averaged across those cues. KCs in these simulations respond to 5 cues, on average. The red line shows the expected correspondence, and plots y = 5x/100. g Although the contribution from each KC toward the total RP is noisy, the total RP, which is the sum of the contributions over all responding KCs, is relatively precise. Red line: line of equality.

Experimental data are provided in Supplementary Table 2. a As with Fig. 5e in the main text, comparing ∆f scores from experiments with those from the MV model, operating with the plasticity rule in Eq. 7. The dark green highlighted data correspond to the dark green highlighted data in c. The solid grey line is a weighted least-squares linear fit with correlation coefficient R = 0.55 (0.51, 0.58) (p < 10^−4 using a permutation test; 95% confidence interval in parentheses, using bootstrapping; n = 92). Each data point corresponds to a single ∆f computed for a batch of 50 simulation runs, and for one pool of experiments using the same intervention from a single study. b Examples of the simulated interventions, as in Fig. 5c in the main text. c Comparison of ∆f scores for the MV model operating with a plasticity rule according to either Eq. 8 (horizontal axis) or Eq. 7 (vertical axis). Both models correspond well for most experimental interventions, especially when interventions are applied to MBONs. Highlighted are four interventions (shown in d) that help distinguish the two plasticity rules. d Description of the four interventions highlighted in c. Dark green: stimulating D_+ during the CS+ training only, in the absence of other reinforcement. Dark purple: stimulating D_− during the CS+ training only, in the absence of other reinforcement. Light green: stimulating D_+ throughout an aversive conditioning experiment. Light purple: stimulating D_− throughout an appetitive conditioning experiment. e Comparison of ∆f scores for the VSλ model and the MV model using the plasticity rule in Eq. 8. Both models correspond well for most experimental interventions. Highlighted are three interventions (described in f) that help distinguish the two models. The dark blue highlighted data correspond to the dark blue highlighted data in a and in Fig. 5d […].

Supplementary Table 1: Criteria for unbounded, stable learning of reinforcement predictions. Here, we tabulate the properties of DANs, MBONs, and synaptic plasticity that modulate learning. Columns A-F describe the different properties, each of which takes a value of either +1 or −1, and each row provides a unique combination of those properties. Note that column A is determined by the product of the values in columns C and D. Columns G-I determine whether or not each of three criteria, all of which are required for learning, is satisfied by the particular combination of DAN, MBON, and plasticity properties. These criteria are: i) that learning induces the expected change in MBON firing rate, and thus the expected change in behaviour; ii) that learning is stable, so that, for example, the addition of excitatory reinforcement information is offset after learning by the depression of feedback excitation or the potentiation of feedback inhibition; iii) […]. The condition under which each criterion is satisfied is determined by the expression in the second row, which states how the property values in columns B-F must be combined. Column J determines whether or not learning can occur.
Only four combinations of properties (highlighted in blue) enable learning, and each one contributes to the MV model.
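The table's combinatorics can be enumerated mechanically. The sketch below uses only the structural facts stated in the caption: properties B-F each take the value +1 or −1, column A is the product of columns C and D, and learning requires all three criteria to hold. The criterion expressions themselves are hypothetical placeholders; the actual expressions must be substituted from the second row of the table.

```python
from itertools import product

# Enumerate property combinations and filter by the learning criteria.
def criteria(B, C, D, E, F):
    crit_i = B * C * D == 1    # PLACEHOLDER for criterion i (not from the table)
    crit_ii = D * E == 1       # PLACEHOLDER for criterion ii
    crit_iii = B * F == 1      # PLACEHOLDER for criterion iii
    return crit_i and crit_ii and crit_iii

passing = []
for B, C, D, E, F in product((+1, -1), repeat=5):
    A = C * D                  # column A is the product of columns C and D
    if criteria(B, C, D, E, F):
        passing.append((A, B, C, D, E, F))

print(f"{len(passing)} of 32 combinations satisfy all (placeholder) criteria")
```

With these placeholders the sketch happens to select four of the 32 rows, as the table does, but the selection is only meaningful once the table's actual second-row expressions are substituted.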