Abstract
Effective decision making in a changing environment demands that accurate predictions are learned about decision outcomes. In Drosophila, such learning is orchestrated in part by the mushroom body, where dopamine neurons signal reinforcing stimuli to modulate plasticity presynaptic to mushroom body output neurons. Building on previous mushroom body models, in which dopamine neurons signal absolute reinforcement, we propose instead that dopamine neurons signal reinforcement prediction errors by utilising feedback reinforcement predictions from output neurons. We formulate plasticity rules that minimise prediction errors, verify that output neurons learn accurate reinforcement predictions in simulations, and postulate connectivity that explains more physiological observations than an experimentally constrained model. The constrained and augmented models reproduce a broad range of conditioning and blocking experiments, and we demonstrate that the absence of blocking does not imply the absence of prediction-error-dependent learning. Our results provide five predictions that can be tested using established experimental methods.
Introduction
Effective decision making benefits from an organism’s ability to accurately predict the rewarding and punishing outcomes of each decision, so that it can meaningfully compare the available options and act to bring about the greatest reward. In many scenarios, an organism must learn to associate the valence of each outcome with the sensory cues predicting it. A broadly successful theory of reinforcement learning is the delta rule^{1,2}, whereby reinforcement predictions (RPs) are updated in proportion to reinforcement prediction errors (RPEs): the difference between predicted and received reinforcements. RPEs are more effective as a learning signal than absolute reinforcement signals because RPEs diminish as the prediction becomes more accurate, adding stability to the learning process. In mammals, RPEs related to rewards are signalled by dopamine neurons (DANs) in the ventral tegmental area and substantia nigra, enabling the brain to implement approximations to the delta rule^{3,4}. In Drosophila melanogaster, DANs that project to the mushroom body (MB) (Fig. 1a) provide both reward and punishment modulated signals that are required for associative learning^{5}. However, to date, MB DAN activity is typically interpreted as signalling absolute reinforcements (either positive or negative) for two reasons: (i) a lack of direct evidence for RPE signals in DANs, and (ii) limited evidence in insects for the blocking phenomenon, in which conditioning of one stimulus can be impaired if it is presented alongside a previously conditioned stimulus, an effect that is indicative of RPE-dependent learning^{2,6,7}. Here, we incorporate anatomical and functional data from recent experiments into a computational model of the MB, in which MB DANs do compute RPEs. The model provides a circuit-level description for delta rule learning in the MB, which we use to demonstrate why the absence of blocking does not necessarily imply the absence of RPEs.
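The stabilising property of the delta rule described above can be illustrated with a minimal sketch. The function name `delta_rule`, the learning rate `eta = 0.1` and the constant reinforcement of 1.0 are illustrative assumptions, not values taken from the model.

```python
def delta_rule(reinforcements, eta=0.1, prediction=0.0):
    """Track a reinforcement with the delta rule; return the final RP and per-trial RPEs."""
    rpes = []
    for r in reinforcements:
        rpe = r - prediction        # RPE: received minus predicted reinforcement
        prediction += eta * rpe     # the RP moves a fraction eta towards the outcome
        rpes.append(rpe)
    return prediction, rpes


# Hypothetical schedule: the same reinforcement on 50 consecutive trials.
final_rp, rpes = delta_rule([1.0] * 50)
print(f"final RP = {final_rp:.3f}, first RPE = {rpes[0]:.3f}, last RPE = {rpes[-1]:.3f}")
```

Because the update is proportional to the RPE, the learning signal shrinks as the prediction improves, whereas an update driven by the absolute reinforcement would keep changing the prediction on every reinforced trial.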
The MB is organised into lateral and medial lobes of neuropil in which sensory encoding Kenyon cells (KCs) innervate the dendrites of MB output neurons (MBONs), which modulate behaviour (Fig. 1b). Consistent with its role in associative learning, DAN signals modulate MBON activity via synaptic plasticity at KC → MBON synapses^{8,9,10}. Current models of MB function posit that the MB lobes encode either positive or negative valences of reinforcement signals and actions^{10,11,12,13,14,15,16}. Most DANs in the protocerebral anterior medial (PAM) cluster (called D_{+} in the model presented here, Fig. 1c) are activated by rewards, or positive reinforcement (R_{+}), and their activation results in depression at synapses between coactive KCs (K) and MBONs that are thought to induce avoidance behaviours (M_{−}). DANs in the protocerebral posterior lateral 1 (PPL1) cluster (D_{−}) are activated by punishments, i.e. negative reinforcement (R_{−}), and their activation results in depression at synapses between coactive KCs and MBONs that induce approach behaviours (M_{+}). A fly can therefore learn to approach rewarding cues or avoid punishing cues as a result of synaptic depression at KC inputs to avoidance or approach MBONs, respectively.
To date, there is only indirect evidence for RPE signals in MB DANs. DAN activity is modulated by feedforward reinforcement signals, but some DANs also receive excitatory feedback from MBONs^{17,18,19,20}, and it is likely this extends to all MBONs whose axons are proximal to DAN dendrites^{21}. We interpret the difference between approach and avoidance MBON firing rates as a RP that motivates behaviour, consistent with the observation that behavioural valence scales with the difference between approach and avoidance MBON firing rates^{15}. As such, DANs that integrate feedforward reinforcement signals and feedback RPs from MBONs are primed to signal RPEs for learning. To the best of our knowledge, these latter two features have yet to be incorporated in computational models of the MB^{22,23,24}.
Here, we incorporate the experimental data described above to formulate a reduced computational model of the MB circuitry, demonstrate how DANs may compute RPEs, derive a plasticity rule for KC → MBON synapses that minimises RPEs, and verify in simulations that our MB model learns accurate RPs. We identify a limitation to the model that imposes an upper bound on RP magnitudes, and demonstrate how putative connections between DANs, KCs and MBONs^{25,26} help circumvent this limitation. Introducing these additional connections yields testable predictions for future experiments as well as explaining a broader range of existing experimental observations that connect DAN and MBON stimulus responses to learning. Lastly, we show that both incarnations of the model—with and without additional connections—capture a wide range of observations from classical conditioning and blocking experiments in Drosophila. Different behavioural outcomes in the two models for specific experiments provide further strong experimental predictions.
Results
A model of the mushroom body that minimises reinforcement prediction errors
The MB lobes comprise multiple compartments, each innervated by a different set of MBONs and DANs (Fig. 1b), and each encoding memories for different forms of reinforcement^{27}, with different longevities^{28}, and for different stages of memory formation^{29}. Nevertheless, compartments appear to contribute to learning by similar mechanisms^{9,10,30}, and it is reasonable to assume that the process of learning RPs is similar for different forms of reinforcement. We therefore reduce the multicompartmental MB into two compartments, and assign a single, rate-based unit to each class of MBON and DAN (colour-coded in Fig. 1b, c). KCs, however, are modelled as a population, in which each sensory cue selectively activates a unique subset of ten cells. Given that activity in approach and avoidance MBONs (denoted M_{+} and M_{−} in our model) respectively biases flies to approach or avoid a cue, i, we interpret the difference in their firing rates, \({\hat{m}}^{i}\,=\,{m}_{+}^{i}\,-\,{m}_{-}^{i}\), as the fly’s RP for that cue.
For the purpose of this work, we assume that the MB has only a single objective: to form RPs that are as accurate as possible, i.e. that minimise the RPE. We do this within a multiple-alternative forced choice (MAFC) paradigm (Fig. 1d; also known as a multi-armed bandit) in which a fly is exposed to one or more sensory cues in a given trial, and is forced to choose one. The fly then receives a reinforcement signal, \({\hat{r}}^{i}\,=\,{r}_{+}^{i}\,-\,{r}_{-}^{i}\), which has both rewarding and punishing components (coming from sources R_{+} and R_{−}, respectively), and which is specific to the chosen cue. Over several trials, the fly must learn to predict the reinforcements for each cue, and use these predictions to reliably choose the most rewarding cue. We formalise this objective with a cost function that penalises differences between RPs and reinforcements
\[{C}^{{\rm{RPE}}}\,=\,\frac{1}{2}\sum _{i}{\left({\hat{r}}^{i}\,-\,{\hat{m}}^{i}\right)}^{2},\qquad (1)\]
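where the sum is over all cues, i. To minimise C^{RPE} through learning, we derived a plasticity rule, \({{\mathcal{P}}}^{{\rm{RPE}}}\) (full derivation in Methods: Synaptic plasticity):
\[{{\mathcal{P}}}_{\pm }^{{\rm{RPE}}}\,=\,\eta \,{\bf{k}}\left({d}_{\pm }\,-\,{d}_{\mp }\right),\qquad (2)\]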
whereby synaptic weights are updated according to \({{\bf{w}}}_{\pm }\left(t\,+\,1\right)\,=\,{{\bf{w}}}_{\pm }\left(t\right)\,+\,{{\mathcal{P}}}_{\pm }^{{\rm{RPE}}}\). Here, k is a vector of KC firing rates, and we use subscripts ‘±’ to denote the valence of the neuron: if + (−) is considered in ±, then ∓ refers to − (+), and vice versa. As such, d_{±} refers to the firing rate of either D_{+} or D_{−}. The learning rate, η, must be small (see Methods: Synaptic plasticity) to allow the plasticity rule to average over multiple stimuli as well as stochasticity in the reinforcement schedule (see Methods: Reinforcement schedule). Note that a single DAN, D_{±}, only has access to half of the reinforcement and RP information, and by itself does not compute the full RPE. However, the difference between D_{+} and D_{−} firing rates does yield the full RPE (see Methods: DAN firing rates):
\[{d}_{+}\,-\,{d}_{-}\,=\,\hat{r}\,-\,\hat{m}.\qquad (3)\]
Three features of Eq. (2) are worth highlighting here. First, elevations in d_{±} increase the net amount of synaptic depression at active synapses that impinge on M_{∓}, which encodes the opposite valence to D_{±}, in agreement with experimental data^{9,10,30}. Second, the postsynaptic MBON firing rate is not a factor in the plasticity rule, unlike in reinforcement-modulated Hebbian rules^{31}, yet nevertheless in accordance with experiments^{9}. Third, and most problematic, is that Eq. (2) requires synapses to receive dopamine signals from both D_{+} and D_{−}, conflicting with current experimental findings in which appetitive DANs only modulate plasticity at avoidance MBONs, and similarly for aversive DANs and approach MBONs^{8,9,10,27,32,33}. In what follows, we consider two solutions to this problem. First, we formulate a different cost function to satisfy the valence specificity of the MB anatomy. Second, to avoid shortcomings that arise in the valence-specific model, we propose the existence of additional connectivity in the MB circuit.
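As a rough illustration of the RPE plasticity rule, the sketch below iterates a weight update of the form w_± ← w_± + η k (d_± − d_∓) for a single cue. It assumes that each DAN sums a KC-driven baseline, its feedforward reinforcement and excitatory feedback from the MBON of opposite valence, that active KCs fire at a rate of 1, and that weights are clipped at zero. The function names, initial weights, `baseline` and `eta` are hypothetical choices rather than the parameters used in the simulations.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def learn_rp(r_plus, r_minus, n_trials=300, eta=0.01, n_active=10, baseline=1.0):
    """Iterate w_pm <- w_pm + eta * k * (d_pm - d_mp) for one cue; return the learned RP."""
    k = [1.0] * n_active                 # KC firing rates for the cue (assumed binary)
    w_plus = [0.1] * n_active            # KC -> M+ weights (arbitrary initial values)
    w_minus = [0.1] * n_active           # KC -> M- weights
    for _ in range(n_trials):
        m_plus, m_minus = dot(w_plus, k), dot(w_minus, k)
        d_plus = baseline + r_plus + m_minus    # D+: reward plus feedback from M-
        d_minus = baseline + r_minus + m_plus   # D-: punishment plus feedback from M+
        rpe = d_plus - d_minus                  # equals (r+ - r-) - (m+ - m-)
        # Weights are clipped at zero because they cannot become negative.
        w_plus = [max(0.0, w + eta * ki * rpe) for w, ki in zip(w_plus, k)]
        w_minus = [max(0.0, w - eta * ki * rpe) for w, ki in zip(w_minus, k)]
    return dot(w_plus, k) - dot(w_minus, k)


print(learn_rp(r_plus=2.0, r_minus=0.5))   # net reinforcement of 1.5
print(learn_rp(r_plus=0.0, r_minus=3.0))   # net reinforcement of -3.0
```

In this sketch neither DAN carries the full RPE on its own; only the difference d_+ − d_− does, which is why each synapse needs access to both DAN signals.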
A valence-specific mushroom body model exhibits limited learning
To accommodate the constraints from experimental data, in which DANs and MBONs of opposite valence are paired in subcompartments of the MB^{15,21}, we consider an alternative cost function, \({C}_{\pm }^{\,{\text{VS}}\,}\), that satisfies this valence specificity:
\[{C}_{\pm }^{{\rm{VS}}}\,=\,\frac{1}{2}\sum _{i}{\left({r}_{\mp }^{i}\,+\,{m}_{\pm }^{i}\right)}^{2}.\qquad (4)\]
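We refer to model circuits that adhere to this valence specificity as valence-specific (VS) models. The VS cost function can be minimised by the corresponding VS plasticity rule (see Methods: Synaptic plasticity):
\[{{\mathcal{P}}}_{\pm }^{{\rm{VS}}}\,=\,\eta \,{\bf{k}}\left({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\,-\,{d}_{\mp }\right),\qquad (5)\]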
where \({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\) models the direct excitatory current from KCs to DANs (Methods, Eq. (13)). As required, Eq. (5) maintains the relationship between increased DAN activity and enhanced synaptic depression.
Equation (5) exposes a problem for learning according to our assumed objective in the VS model. The problem arises because D_{±} receives only excitatory inputs. Thus, whenever a cue is present, KC inputs^{34} impose on D_{±} a minimum, cue-specific firing rate, \({d}_{\pm }^{i}\,=\,{{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{{\bf{k}}}^{i}\,+\,{r}_{\pm }^{i}\,+\,{m}_{\mp }^{i}\,\ge\, {{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{{\bf{k}}}^{i}\). As such, synapses will be depressed (\({\mathcal{P}}_{\mp }^{{\text{VS}}}\,<\, 0\)) whenever \({r}_{\pm }^{i}\,+\,{m}_{\mp }^{i}\,> \,0\). Once \({{\bf{w}}}_{\pm }^{{\rm{T}}}{{\bf{k}}}^{i}\,=\,0\), the VS model can no longer learn the valence of cue i as synaptic weights cannot become negative. Eventually, RPs for all cues become equal with \({\hat{m}}^{i}\,=\,0\), such that choices become random (Supplementary Fig. 1a, b). In this case, D_{+} and D_{−} firing rates become equal to the positive and negative reinforcements, respectively, such that the RPE equals the net reinforcement (Supplementary Fig. 1c, d).
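The failure mode described above can be reproduced in a minimal sketch, assuming the VS update takes the form w_± ← w_± + η k (w_K^T k − d_∓), with d_± = w_K^T k + r_± + m_∓ and weights clipped at zero. The function name `vs_model`, the initial weights and the parameter values are illustrative assumptions.

```python
def vs_model(r_plus, r_minus=0.0, n_trials=500, eta=0.01, n_active=10, gamma=1.0):
    """Valence-specific rule: DANs receive only excitation, so weights can only depress."""
    k = [1.0] * n_active
    baseline = gamma * n_active              # w_K^T k, assuming w_K = gamma * ones
    w_plus = [0.2] * n_active                # KC -> M+ weights (arbitrary start)
    w_minus = [0.2] * n_active               # KC -> M- weights
    for _ in range(n_trials):
        m_plus = sum(w * ki for w, ki in zip(w_plus, k))
        m_minus = sum(w * ki for w, ki in zip(w_minus, k))
        d_plus = baseline + r_plus + m_minus
        d_minus = baseline + r_minus + m_plus
        # (baseline - d) is never positive, so each update is a depression or zero.
        w_plus = [max(0.0, w + eta * ki * (baseline - d_minus)) for w, ki in zip(w_plus, k)]
        w_minus = [max(0.0, w + eta * ki * (baseline - d_plus)) for w, ki in zip(w_minus, k)]
    m_plus, m_minus = sum(w_plus), sum(w_minus)   # k is all ones
    rp = m_plus - m_minus
    d_plus_above_baseline = r_plus + m_minus
    return rp, d_plus_above_baseline


rp, d_plus_excess = vs_model(r_plus=1.0)
print(rp, d_plus_excess)
```

With a purely rewarding cue, both weight vectors decay towards zero, the RP vanishes, and the D_+ response above baseline ends up equal to the reward itself, i.e. an absolute reinforcement signal rather than an RPE.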
A heuristic solution is to add a constant source of potentiation, which acts to restore synaptic weights to a constant, non-zero value. We therefore replace \({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\) in Eq. (5) with a constant, free parameter, λ:
\[{{\mathcal{P}}}_{\pm }^{{\rm{VS}}\lambda }\,=\,\eta \,{\bf{k}}\left(\lambda \,-\,{d}_{\mp }\right).\qquad (6)\]
If \(\lambda \,> \, {r}_{+}{r}_{} +{{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\), \({\mathcal{P}}^{{{\rm{VS}} \lambda }}_{\pm}\) can take both positive and negative values, preventing synaptic weights from being held at zero. This defines a new baseline firing rate for D_{±} that is greater than \({\bf{w}}^{\rm{T}}_{\rm{K}}{\bf{k}}\). Hereafter, we refer to the VS model with plasticity governed by \({\mathcal{P}}^{{\text{VS}}\, \lambda }_{\pm }\) as the VSλ model.
The VSλ model provides only a partial solution, as it is restricted by an upper bound to the magnitude of RPs that can be learned: \({\hat{m}}_{\max }\,=\,\max \left(0,\lambda \,-\,{{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\right)\). This becomes problematic when multiple choices provide reinforcements of the same valence that exceed \({\hat{m}}_{\max }\), as the MB will not be able to differentiate their relative values. In addition to increasing λ, \({\hat{m}}_{\max }\) may be increased by reducing KC → DAN synaptic transmission. In Fig. 2a, we set w_{K} = γ1, with 1 a vector of ones, and show RPs for several values of γ, with λ = 11.5 (corresponding DAN and MBON firing rates are in Supplementary Fig. 2). The upper bound is reached when w_{+} or w_{−}, and thus the corresponding MBON firing rates, go to zero (an example when γ = 1 is shown in Fig. 2b, c). These results appear to contradict recent experimental work in which learning was impaired, rather than enhanced, by blocking KC → DAN synaptic transmission^{34} (note, the block may have also affected other DAN inputs that impaired learning).
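The upper bound can be checked numerically with a short sketch, assuming the VSλ update has the form w_± ← w_± + η k (λ − d_∓), with d_± = w_K^T k + r_± + m_∓, ten active KCs firing at a rate of 1, and w_K = γ1. The function name `vs_lambda_rp`, the initial weights, `eta` and the number of trials are hypothetical.

```python
def vs_lambda_rp(r_plus, lam=11.5, gamma=1.0, n_active=10, eta=0.01, n_trials=2000):
    """Learned RP for a purely rewarding cue under the VS-lambda rule."""
    k = [1.0] * n_active
    baseline = gamma * n_active              # w_K^T k with w_K = gamma * ones
    w_plus = [0.05] * n_active               # KC -> M+ weights (arbitrary start)
    w_minus = [0.05] * n_active              # KC -> M- weights
    for _ in range(n_trials):
        m_plus = sum(w * ki for w, ki in zip(w_plus, k))
        m_minus = sum(w * ki for w, ki in zip(w_minus, k))
        d_plus = baseline + r_plus + m_minus     # reward excites D+
        d_minus = baseline + m_plus              # no punishment in this example
        # lam acts as a constant source of potentiation; weights stay non-negative.
        w_plus = [max(0.0, w + eta * ki * (lam - d_minus)) for w, ki in zip(w_plus, k)]
        w_minus = [max(0.0, w + eta * ki * (lam - d_plus)) for w, ki in zip(w_minus, k)]
    return sum(w_plus) - sum(w_minus)            # k is all ones


for r in (1.0, 3.0):
    print(r, vs_lambda_rp(r), vs_lambda_rp(r, gamma=0.5))
```

Under these assumptions the bound is λ − w_K^T k = 11.5 − 10 = 1.5 for γ = 1, so a reward of 3 is under-predicted, whereas halving γ raises the bound to 6.5 and the same reward is learned accurately.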
In the VSλ model, DAN firing rates begin to exhibit RPE signals. A sudden increase in positive reinforcements, for example at trial 20 in Fig. 2d, results in a sudden increase in d_{+}, which then decays as the excitatory feedback from M_{−} diminishes as a result of synaptic depression in w_{−} (Fig. 2c–e). Similarly, sudden decrements in positive reinforcements, for example at trial 80, are signalled by reductions in d_{+}. However, when the reinforcement magnitude exceeds the upper bound, as in trials 40–60 and 120–140 in Fig. 2, D_{±} exhibits sustained elevations in firing rate from baseline by an amount \(\max \left(0,{r}_{\pm }\,-\,{\hat{m}}_{\max }\right)\) (Fig. 2d, Supplementary Fig. 2). This constitutes a major prediction from our model.
A mushroom body circuit with unbounded learning
In the VSλ model, excitatory reinforcement signals can only be partially offset by decrements to w_{+} and w_{−}, resulting in the upper bound to RP magnitudes. To overcome this problem, DANs must receive a source of inhibition. A candidate solution is a circuit in which positive reinforcements, R_{+}, inhibit D_{−}, and similarly, R_{−} inhibits D_{+} (illustrated in Fig. 3a). Such inhibitory reinforcement signals have been observed in the γ2, γ3, γ4 and γ5 compartments of the MB^{8,35}. Using the derived plasticity rule, \({\mathcal{P}}_{\pm }^{{\text{VS}}}\) in Eq. (5), this circuit learns accurate RPs with no upper bound to the RP magnitude (Supplementary Fig. 3b). Hereafter, we refer to the VS model with unbounded learning as the VSu model. Learning is now possible because, when the synaptic weights w_{±} are weak, or when D_{∓} is inhibited, Eq. (5) specifies that \({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\,-\,{d}_{\mp }\,> \,0\), i.e. synaptic weights will potentiate until the excitatory feedback from M_{±} equals the reinforcement-induced feedforward inhibition. Similarly, synapses are depressed in the absence of reinforcement because the excitatory feedback from M_{±} to D_{∓} ensures that \({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\,-\,{d}_{\mp }\,<\,0\) (Supplementary Fig. 3c). Consequently, step changes in reinforcement yield RPE signals in D_{∓} that always decay to a baseline set by \({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\) (Supplementary Fig. 3d, e). Despite the prevalence in reports of long-term synaptic depression in the MB, there exist several lines of evidence for potentiation (or depression of inhibition) as well^{10,16,19,36}. However, when reinforcement signals are inhibitory, D_{+}, for example, is excited only by the removal of R_{−}, and not by the appearance of R_{+} (similarly for D_{−}), counter to the experimental classification of DANs as appetitive (or aversive)^{12,13,14,37}.
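A minimal sketch of the VSu circuit, under the assumption that DAN firing rates are rectified at zero and given by d_∓ = max(0, w_K^T k − r_± + m_±), with the update w_± ← w_± + η k (w_K^T k − d_∓), illustrates why inhibitory reinforcement removes the upper bound. The function name `vsu_rp` and all parameter values are hypothetical.

```python
def vsu_rp(r_plus, r_minus=0.0, n_trials=500, eta=0.01, n_active=10, gamma=1.0):
    """Learned RP when reinforcement inhibits the DAN of opposite valence."""
    k = [1.0] * n_active
    baseline = gamma * n_active              # w_K^T k with w_K = gamma * ones
    w_plus = [0.0] * n_active                # KC -> M+ weights
    w_minus = [0.0] * n_active               # KC -> M- weights
    for _ in range(n_trials):
        m_plus = sum(w * ki for w, ki in zip(w_plus, k))
        m_minus = sum(w * ki for w, ki in zip(w_minus, k))
        # R+ inhibits D-, R- inhibits D+; firing rates cannot be negative.
        d_minus = max(0.0, baseline - r_plus + m_plus)
        d_plus = max(0.0, baseline - r_minus + m_minus)
        # Inhibited DANs fall below baseline, which potentiates the paired synapses.
        w_plus = [max(0.0, w + eta * ki * (baseline - d_minus)) for w, ki in zip(w_plus, k)]
        w_minus = [max(0.0, w + eta * ki * (baseline - d_plus)) for w, ki in zip(w_minus, k)]
    return sum(w_plus) - sum(w_minus)        # k is all ones


print(vsu_rp(r_plus=20.0))               # a reward well above the VS-lambda bound
print(vsu_rp(r_plus=0.0, r_minus=5.0))   # a punishment
```

Potentiation continues until feedback excitation from M_± cancels the feedforward inhibition, so in this sketch the learned RP matches the reinforcement regardless of its magnitude.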
To ensure that D_{±} is also excited by R_{±}, we could simply add these excitatory inputs to the model. This is unsatisfactory, however, as such inputs would not contribute to learning: they would recapitulate the circuitry of the original VS model, which we have shown cannot learn. We therefore asked whether other variations of the VSu model could learn without an upper bound, and identified three criteria (tabulated in Supplementary Table 1) that must be satisfied to achieve this: (i) learning must be effective, such that positive reinforcement either potentiates excitation of approach behaviours (inhibition of avoidance), or depresses inhibition of approach behaviours (excitation of avoidance), and similarly for negative reinforcement, (ii) learning must be stable, such that excitatory reinforcement signals are offset via learning, either by synaptic depression of feedback excitation, or by potentiation of feedback inhibition, and similarly for inhibitory reinforcement signals, (iii) to be unbounded, learning must involve synaptic potentiation, whether reinforcement signals excite DANs that induce potentiation, or inhibit DANs that induce depression. By following these criteria, we identified a dual version of the VSu circuit in Fig. 3a, which is illustrated in Fig. 3b. In this circuit, R_{+} excites D_{+}, and R_{−} excites D_{−}. However, DANs induce synaptic potentiation when activated above baseline, while M_{+} and M_{−} are inhibitory, so are interpreted as inducing avoidance and approach behaviours, respectively. Despite their different configurations, RPs are identical in each of the dual MB circuits (Supplementary Fig. 3g–k).
Neither dual model, by itself, captures all of the experimentally established anatomical and physiological properties of the MB. However, by combining them into one (Fig. 3c), we obtain a model that is consistent with the circuit properties observed in experiments, but necessitates additional features that constitute major predictions. First, DANs receive both positive and negative reinforcement signals, which are either excitatory or inhibitory, depending on the valences of the reinforcement and the DAN. Second, in addition to the excitatory feedback from MBONs to DANs of the opposite valence, MBONs also provide feedback to DANs of the same valence via inhibitory interneurons, which we propose innervate areas targeted by MBON axons and DAN dendrites^{21}. We refer to this circuit as the mixed-valence (MV) model, as DANs receive a mixture of both positive and negative valences in both the feedforward reinforcement and feedback RPs, consistent with recent findings in Drosophila larvae^{26}. Importantly, each DAN in this hybrid model now has access to the full reinforcement signal, \(\hat{r}\), and the full RP, \(\hat{m}\), or \(-\hat{r}\) and \(-\hat{m}\), depending on the valence of the DAN. Deriving a plasticity rule (Methods: Synaptic plasticity) to minimise \({C}_{\pm }^{{\rm{RPE}}}\) yields
\[{{\mathcal{P}}}_{\pm }^{{\rm{MV}}}\,=\,\eta \,{\bf{k}}\left({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\,-\,{d}_{\mp }\right),\qquad (7)\]
which takes the same form as Eq. (5) (except that d_{±} depends on more synaptic inputs; see Methods: DAN firing rates), and adheres to our current understanding that plasticity at MBONs is modulated by DANs of the opposite valence. However, Eq. (7) incurs several problems (outlined in Supplementary Discussion), and fails a crucial test: stimulating D_{+} (D_{−}) as a proxy for reinforcement induces a weak appetitive (aversive) memory only briefly, which then disappears with repeated cue-stimulation pairings (Supplementary Fig. 4), contradicting experiments in which strong, lasting memories are induced by this method^{13,14,15,27,28,32,33,38,39}. One can derive an alternative plasticity rule (Methods: Synaptic plasticity) to minimise \({C}_{\pm }^{{\rm{RPE}}}\), which takes a form similar to Eq. (2):
\[{{\mathcal{P}}}_{\pm }^{{\rm{MV}}}\,=\,\eta \,{\bf{k}}\left({d}_{\pm }\,-\,{d}_{\mp }\right).\qquad (8)\]
Although Eq. (8) requires that synapses receive information from DANs of both valences, it does yield strong, lasting memories when D_{±} is stimulated as a proxy for reinforcement (Supplementary Fig. 4). We therefore use Eq. (8) for the MV model hereafter, introducing a third major prediction: plasticity at synapses impinging on either approach or avoidance MBONs may be modulated by DANs of both valences.
Figure 3d demonstrates that the MV model accurately tracks changing reinforcements, just as with the dual versions of the VSu model. However, a number of differences from the VSu models can also be seen. First, changing RPs result from changes in the firing rates of both M_{+} and M_{−} (Fig. 3e). Although MBON firing rates show an increasing trend, they eventually stabilise (Supplementary Fig. 5j). Moreover, when w_{±} reach zero, the changes in w_{∓} compensate, resulting in larger changes in the firing rate of M_{∓}, as seen between trials 40–60 in Fig. 3e. Second, DANs respond to RPEs, irrespective of the reinforcement’s valence: d_{+} and d_{−} increase with positive and negative RPEs, respectively, and decrease with negative and positive RPEs (Fig. 3f, g). Third, blocking KC → DAN synaptic transmission (by setting γ = 0) slows down learning, but does not abolish it entirely (Fig. 3d). With input from KCs blocked, the baseline firing rate of D_{±} is zero, and because any given RPE excites one DAN type and inhibits the other, only one of either D_{+} or D_{−} can signal the RPE, reducing the magnitude of d_{±} − d_{∓} in Eq. (8), and therefore the speed of learning (Supplementary Fig. 5). To avoid any slowing down to learning, \({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\) must be greater than or equal to the RPE. This may explain the 25% reduction in learning performance in experiments that blocked KC → DAN inputs^{34}, although the block may have also affected other DAN inputs.
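The slowing of learning when KC → DAN input is blocked can be illustrated with a short sketch. It assumes rectified DAN firing rates of the form d_± = max(0, w_K^T k ± (r̂ − m̂)) and a weight update of the form w_± ← w_± + η k (d_± − d_∓). The function name `trials_to_learn`, the convergence tolerance and the parameter values are hypothetical.

```python
def trials_to_learn(r_hat, baseline, eta=0.005, n_active=10, tol=0.01, max_trials=10_000):
    """Number of trials before |RPE| falls below tol in a mixed-valence sketch."""
    k = [1.0] * n_active
    w_plus = [0.5] * n_active                # KC -> M+ weights (arbitrary start)
    w_minus = [0.5] * n_active               # KC -> M- weights
    for trial in range(max_trials):
        m_hat = sum((wp - wm) * ki for wp, wm, ki in zip(w_plus, w_minus, k))
        rpe = r_hat - m_hat
        if abs(rpe) < tol:
            return trial
        # Each DAN sees the full RPE with opposite sign; rates are rectified at zero.
        d_plus = max(0.0, baseline + rpe)
        d_minus = max(0.0, baseline - rpe)
        diff = d_plus - d_minus              # 2 * rpe if baseline >= |rpe|, else smaller
        w_plus = [max(0.0, w + eta * ki * diff) for w, ki in zip(w_plus, k)]
        w_minus = [max(0.0, w - eta * ki * diff) for w, ki in zip(w_minus, k)]
    return max_trials


intact = trials_to_learn(r_hat=2.0, baseline=10.0)   # gamma = 1: w_K^T k = 10
blocked = trials_to_learn(r_hat=2.0, baseline=0.0)   # gamma = 0: KC -> DAN input blocked
print(intact, blocked)
```

With a zero baseline, the DAN that would be inhibited by the RPE is already silent, so only one DAN carries the error and d_± − d_∓ is halved; learning is slower but still converges.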
Decision making in a multiple-alternative forced choice task
We next tested the VSλ and MV models on a task with multiple cues from which to choose. Choices are made using the softmax function (Eq. (11)), such that the model more reliably chooses one cue over another when cue-specific RPs are more dissimilar. Throughout the task, the cue-specific reinforcements slowly change (see example reinforcement schedules in Fig. 4), and the model must continually update RPs (Fig. 4), according to its plasticity rule, in order to choose the most positively reinforcing cues. Specifically, we update only those synaptic weights that correspond to the chosen cue (see Methods, Eqs. (21, 22)).
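Softmax action selection over cue-specific RPs can be sketched as follows. The inverse-temperature parameter `beta` and the function names are assumptions made for illustration; the exact parameterisation used in the model is given in Methods, Eq. (11).

```python
import math
import random


def softmax(rps, beta=1.0):
    """Choice probabilities for each cue, given its reinforcement prediction."""
    top = max(rps)                                       # subtract the max for stability
    exps = [math.exp(beta * (rp - top)) for rp in rps]
    total = sum(exps)
    return [e / total for e in exps]


def choose_cue(rps, beta=1.0, rng=random):
    """Sample the index of one cue according to the softmax probabilities."""
    return rng.choices(range(len(rps)), weights=softmax(rps, beta), k=1)[0]


print(softmax([1.0, 0.0]))     # mild preference for the first cue
print(softmax([2.0, 0.0]))     # more dissimilar RPs give a more reliable choice
print(choose_cue([2.0, 0.0, -1.0], rng=random.Random(0)))
```

Because the choice is probabilistic, a cue with a lower RP is still sampled occasionally, which matters when only the weights for the chosen cue are updated.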
In a task with two alternatives, switches in cue choice almost always occur after the actual switch in the reinforcement schedule because of the slow learning rate and the probabilistic nature of decision making (Fig. 4a). The model continues to choose the more rewarding cues when there are as many as 200 (Supplementary Fig. 6a; Fig. 4b shows an example simulation with five cues). Up to ten cues, the trial averaged obtained reinforcement (TAR) becomes more positive with the number of cues (coloured lines in Supplementary Fig. 6a), consistent with the fact that increasing the number of cues increases the maximum TAR for an individual that always selects the most rewarding cue (black solid line, Supplementary Fig. 6a). Increasing the number of cues beyond ten reduces the TAR, which corresponds with choosing the maximally rewarding cue less often (Supplementary Fig. 6b), and a decreasing ability to maintain accurate RPs when synaptic weights are updated for the chosen cue only (Supplementary Fig. 6c; and see Methods: Synaptic plasticity). Despite this latter degradation in performance, the VSλ and MV models are only marginally outperformed by a model with perfect plasticity, whereby RPs for the chosen cue are set to equal the last obtained reinforcement (Supplementary Fig. 6a). Furthermore, when Gaussian white noise is added to the reinforcement schedule, the performance of the perfect plasticity model drops below that of the other models, for which slow learning helps to average over the noise (Supplementary Fig. 6d). The model suffers no noticeable decrement in performance when KC responses to different cues overlap, e.g. when a random 5% of 2000 KCs are assigned to each cue (Supplementary Fig. 6a, e–g).
Both models capture learned fly behaviours in a variety of conditioning experiments
To determine how well the VSλ and the MV models capture decision making in flies, we applied them to an experimental paradigm (illustrated in Fig. 5a) in which flies are conditioned to approach or avoid one of two odours. We set λ in the VSλ model to be large enough so as not to limit learning. In each experiment, flies undergo a training stage, during which they are exposed to a conditioned stimulus (CS+) concomitantly with an unconditioned stimulus (US), for example sugar (appetitive training) or electric shock (aversive training). Flies are next exposed to a different stimulus (CS−) without any US. Following training, flies are tested for their behavioural valence with respect to the two odours. The CS+ and CS− are released at opposite ends of a tube. Flies are free to approach or avoid the stimuli by walking towards one end of the tube or the other. In our model, we do not simulate the spatial extent of the tube, nor specific fly actions, but model choice behaviour in a simple manner by applying the softmax function to the current RPs.
In addition to these control experiments, we simulated a variety of interventions frequently used in experiments (Fig. 5a–c). These experiments are determined by four features: (1) US valence (Fig. 5a): appetitive, aversive, or neutral, (2) intervention type (Fig. 5c): inhibition of neuronal output, e.g. by expression of shibire, or activation, e.g. by expression of dTrpA1, both of which are controlled by temperature, (3) the intervention schedule (Fig. 5b): during the CS+ only, throughout CS+ and CS−, during test only, or throughout all stages, (4) the target neuron (Fig. 5c): either M_{+}, M_{−}, D_{+}, or D_{−}. Further details of these simulations are provided in Methods: Experimental data and model comparisons.
We compared the models to behavioural results from 439 experiments (including 235 controls), which tested 27 unique combinations of the above four parameters in 14 previous studies^{10,13,14,15,16,17,18,27,28,32,35,36,38,39} (the source data and experimental details for each experimental intervention used here are provided in Supplementary Data 1). In Fig. 5d, e, we plot a test statistic, Δ_{f}, that compares behavioural performance indices (PIs) between a specific intervention experiment and its corresponding control, where the PI is +1 if all flies approached the CS+, and −1 if all flies approached the CS−. When Δ_{f} > 0, more flies approached the CS+ in the intervention than in the control experiment, and when Δ_{f} < 0, fewer flies approached the CS+ in the intervention than in the control. Interventions in both models correspond well with those in the experiments: Δ_{f} from the VSλ model and experiments are correlated with R = 0.68, and Δ_{f} from the MV model and experiments are correlated with R = 0.65 (p < 10^{−4} for both models). The smaller range in Δ_{f} scores from the experimental data is likely a result of the greater difficulty in controlling extraneous variables, resulting in smaller effect sizes.
Four cases of inhibitory interventions exemplify the correspondence of both the VSλ and MV model with experiments, and are highlighted in Fig. 5d, e (light green, purple, blue and orange rings). Also highlighted are two examples of excitatory interventions, in which artificial stimulation of either D_{+} or M_{−} during CS+ exposure, without any US, was used to induce an appetitive memory and approach behaviour. The two models yield very similar Δ_{f} scores, but not always (Supplementary Fig. 7e). The example highlighted in dark blue in Fig. 5d, e, in which M_{+} was inhibited throughout appetitive training but not during the test, shows that this intervention had little effect in the MV model, in agreement with experiments^{36}, but resulted in a strong reduction in the appetitiveness of the CS+ in the VSλ model (Δ_{f} ≈ −4.5). In the Supplementary Note, we analyse the underlying synaptic weight dynamics that lead to this difference in model behaviours. The analyses show that not only does this intervention amplify the difference between CS+ and CS− RPs in the MV model, it also results in faster memory decay in the VSλ model. Hence, the preference for the CS+ is maintained in the MV model, but is diminished in the VSλ model.
The alternative plasticity rule (Eq. (7)) for the MV model yields Δ_{f} scores that correspond less well with the experiments (R = 0.55, Supplementary Fig. 7a), in part because associations cannot be induced by pairing a cue with D_{±} stimulation (Supplementary Fig. 4). This conditioning protocol, plus one other (Supplementary Fig. 7c), helps distinguish the two plasticity rules in the MV model, and can be tested experimentally. Lastly, both the VSλ and MV models provide a good fit to re-evaluation experiments^{18,19} in which the CS+ or CS− is presented a second time, without the US, before the test phase (Supplementary Fig. 8, Supplementary Data 2).
The absence of blocking does not refute the use of reinforcement prediction errors for learning
When training a subject to associate a compound stimulus, XY, with reinforcement, R, the resulting association between Y and R can be blocked if the subject were previously trained to associate X with R^{6,7}. The Rescorla–Wagner model^{2} provides an explanation: if X already predicts R during training with XY, there will be no RPE with which to learn associations between Y and R. However, numerous experiments in insects have reported only partial blocking, suggesting that insects may not utilise RPEs for learning^{40,41,42,43}. This conclusion overlooks a strong assumption in the Rescorla–Wagner model, namely, that neural responses to X and Y are independent. In the insect MB, KC responses to stimuli X and Y may overlap, and the response to the compound XY does not equal the sum of responses to X and Y^{44,45,46}. Thus, if the MB initially learns that X predicts R, but the ensemble of KCs that respond to X is different to the ensemble that responds to XY, then some of the synapses that encode the learned RP will not be recruited. Consequently, the accuracy of the prediction will be diminished, such that training with XY elicits a RPE and an association between Y and R can be learned. We tested this hypothesis, which constitutes a neural implementation of previous theories^{47,48}, by simulating the blocking paradigm using the MV model (Fig. 6a).
Two stimuli, X and Y, elicited non-overlapping responses in the KCs (Fig. 6b). When stimuli are encoded independently (that is, the KC response to XY is the sum of responses to X and Y), previously learned X–R associations block the learning of Y–R associations during the XY training phase (Fig. 6c, e), as expected.
To simulate non-independent KC responses during the XY training phase, the KC response to each stimulus was corrupted: some KCs that responded to stimulus X in isolation were silenced, and previously silent KCs were activated (similarly for Y; see Methods: blocking paradigm). This captured, in a controlled manner, nonlinear processing that may result, for example, from recurrent inhibition within and upstream of the MB. The average severity of the corruption to stimulus i was determined by \({p}_{cor}^{i}\), where \({p}_{cor}^{i}\,=\,0.0\) yields no corruption, and \({p}_{cor}^{i}\,=\,1.0\) yields full corruption. Corrupting the KC response to X allows a weak Y–R association to be learned (Fig. 6d), which translates into a behavioural preference for Y during the test (Fig. 6e). Varying the degree of corruption to stimulus X and Y results in variable degrees of blocking (Fig. 6f). The blocking effect was maximal when \({p}_{cor}^{X}\,=\,0\), and absent when \({p}_{cor}^{X}\,=\,1\). However, even in the absence of blocking, corruption to Y during compound training prevents learned associations being carried over to the test phase, giving the appearance of blocking. These results provide a unifying framework with which to understand inconsistencies between blocking experiments in insects. Importantly, the variability in blocking can be explained without refuting the RPE hypothesis.
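The dependence of blocking on KC corruption can be illustrated with a deliberately simplified sketch. It replaces the MV circuit with a single signed weight vector trained by the delta rule, and it applies the corruption deterministically (a fixed fraction of cells is swapped) rather than on average. The function names, the population of 40 KCs and all parameter values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kc_response(cells, fresh_cells, p_cor, n_kc):
    """Binary KC response in which a fraction p_cor of `cells` is swapped for `fresh_cells`."""
    n_swap = round(p_cor * len(cells))
    active = set(cells[n_swap:]) | set(fresh_cells[:n_swap])
    return [1.0 if i in active else 0.0 for i in range(n_kc)]


def train(w, k, r, eta=0.02, n_trials=200):
    """Delta-rule training of weights w on KC response k with reinforcement r."""
    for _ in range(n_trials):
        rpe = r - dot(w, k)
        w = [wi + eta * ki * rpe for wi, ki in zip(w, k)]
    return w


def rp_for_y(p_cor_x, p_cor_y=0.0, n_kc=40):
    """RP elicited by Y alone after X training followed by compound XY training."""
    x, y = range(0, 10), range(10, 20)
    fresh_x, fresh_y = range(20, 30), range(30, 40)      # previously silent KCs
    w = train([0.0] * n_kc, kc_response(x, fresh_x, 0.0, n_kc), r=1.0)   # phase 1: X -> R
    k_x = kc_response(x, fresh_x, p_cor_x, n_kc)
    k_y = kc_response(y, fresh_y, p_cor_y, n_kc)
    k_xy = [a + b for a, b in zip(k_x, k_y)]
    w = train(w, k_xy, r=1.0)                            # phase 2: corrupted XY -> R
    return dot(w, kc_response(y, fresh_y, 0.0, n_kc))    # test: Y alone, uncorrupted


for p in (0.0, 0.5, 1.0):
    print(p, rp_for_y(p))
print("X and Y both fully corrupted:", rp_for_y(1.0, p_cor_y=1.0))
```

In this sketch, an uncorrupted X response leaves no RPE during compound training, so Y acquires no association (full blocking), whereas corrupting X frees up part of the RPE for Y. Corrupting Y as well means that whatever is learned during compound training is stored on KCs that Y alone does not activate at test, which mimics blocking.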
Discussion
Overview
Successful decision making relies on the ability to accurately predict, and thus reliably compare, the outcomes of choices that are available to an agent. The delta rule, as developed by Rescorla and Wagner^{2}, updates beliefs in proportion to a prediction error, providing a method to learn accurate and stable predictions. In this work, we have investigated the hypothesis that, in Drosophila melanogaster, the MB implements the delta rule. We posit that approach and avoidance MBONs together encode RPs, and that feedback from MBONs to DANs, if subtracted from feedforward reinforcement signals, endows DANs with the ability to compute RPEs, which are used to modulate synaptic plasticity. We formulated a plasticity rule that minimises RPEs, and verified the effectiveness of the rule in simulations of MAFC tasks. We demonstrated how the established valence-specific circuitry of the MB restricted the learned RPs to within a given range, and postulated cross-compartmental connections, from MBONs to DANs, that could overcome this restriction. Such cross-compartmental connections are found in Drosophila larvae, but their functional relevance is unknown^{25,26}. We have thus presented two MB models that yield RPEs in DAN activity and that learn accurate RPs: (i) the VSλ model, in which plasticity incorporates a constant source of synaptic potentiation; (ii) the MV model, in which we propose mixed-valence connectivity between DANs, MBONs and KC → MBON synapses. Both the VSλ and the MV models receive equally good support from behavioural experiments in which different genetic interventions impaired learning, while the MV model provides a mechanistic account for a greater variety of physiological changes that occur in individual neurons after learning.
It is plausible, and can be beneficial, for both the VSλ and MV models to operate in parallel in the MB, as separately learning positive and negative aspects of decision outcomes, if they arise from independent sources, is important for context-dependent modulation of behaviour. Such learning has been proposed for the mammalian basal ganglia^{49}. We have also demonstrated why the absence of strong blocking effects in insect experiments does not necessarily imply that insects do not utilise RPEs for learning.
Predictions
The models yield predictions that can be tested using established experimental protocols. Below, we specify which model supports each prediction.
Prediction 1—both models
Responses in single DANs to the unconditioned stimulus (US), when paired with a CS+, should decay towards a baseline over successive CS+–US pairings, as a result of the learned changes in MBON firing rates. To the best of our knowledge, only one study has measured DAN responses throughout several CS–US pairings in Drosophila^{50}. Consistent with DAN responses in our model, Dylla et al.^{50} reported such decaying responses in DANs in the γ and \({\beta }^{\prime}\) lobes during paired CS+ and US stimulation. However, they reported similar decaying responses when the CS+ and US were unpaired (separated by 90 s) that were not significantly different from the paired condition. The authors concluded that DANs do not exhibit RPEs, and that the decaying DAN responses were a result of non-associative plasticity. An alternative interpretation is that a 90 s gap between CS+ and US does not induce DAN responses that are significantly different from the paired condition, and that additional processes prevent the behavioural expression of learning. Ultimately, the evidence for either effect is insufficient. Furthermore, Dylla et al. observed increased CS+ responses in DANs after training. Conversely, after training in our models (i.e. when the US was set to zero), DAN responses to the CS+ decreased. Interpreting post-training activity in DANs as a response to the CS+ alone, or alternatively as a response to an omitted US, is equally valid in our model because the CS+ and US always occurred together. Resolving time within trials in our models would allow us to better address this conflict with experiments. The Dylla et al. results are, however, consistent with the temporal difference (TD) learning rule^{51,52} (as are studies on second-order conditioning in Drosophila^{53,54}), of which the Rescorla–Wagner rule used in our work is a simplified case.
We discuss this further in the Supplementary Discussion, as well as features of the TD learning rule, and experimental factors, which may explain why the expected changes in DAN responses to the CS and US were not observed in previous studies^{12,37}.
Prediction 2—VSλ model
After repeated CS+–US pairings, a sufficiently large reinforcement will prevent the DAN firing rate from decaying back to its baseline response to the CS+ in isolation. Here, sufficiently large means that the inequality required for learning accurate RPs, \(\lambda \,> \,\left|{r}_{+}-{r}_{-}\right|+{{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\), is not satisfied. Because KC → DAN input, \({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{\bf{k}}\), may be difficult to isolate in experiments, sufficiency could be guaranteed by ensuring the reinforcement satisfies ∣r_{+} − r_{−}∣ > λ. That is, pairing a CS+ with a novel reward (punishment) that would more than double the stabilised D_{+} (D_{−}) firing rate, where the stabilised firing rate is achieved after repeated exposure to the CS+ in isolation. Note that, if λ were to adapt to the reinforcement magnitude, this would be a difficult prediction to falsify.
Prediction 3—MV model
The valence of a DAN is defined by its response to RPEs, rather than to reinforcements per se. Thus, DANs previously thought to be excited by positive (negative) reinforcement are in fact excited by positive (negative) RPEs. For example, a reduction in electric shock magnitude, after an initial period of training, would elicit an excitatory (inhibitory) response in appetitive (aversive) DANs. Felsenberg et al.^{18,19} provide indirect evidence for this. The authors trained flies on a CS+, then re-exposed the fly to the CS+ without the US. For an appetitive (aversive) US, CS+ re-exposure would have yielded a negative (positive) RPE. By blocking synaptic transmission from aversive (appetitive) DANs during CS+ re-exposure, the authors prevented the extinction of learned approach (avoidance). Such responses are consistent with those of mammalian midbrain DANs, which are excited (inhibited) by unexpected appetitive (aversive) reinforcements^{3,55,56,57}.
Prediction 4—both models
In the MV model, learning is mediated by simultaneous plasticity at both approach and avoidance MBON inputs. The converse, that plasticity at approach and avoidance MBONs is independent, would support the VSλ model. Appetitive conditioning does indeed potentiate responses in MBV3/α3 and MVP2/γ1pedc approach MBONs^{16,36}, and depress responses in M4\({\beta }^{\prime}\)/\({\beta }^{\prime}\)2mp and M6/\(\gamma 5{\beta }^{\prime}\)2a avoidance MBONs^{10}. Similarly, removal of an expected aversive stimulus, which constitutes a positive RPE, depresses M6/\(\gamma 5{\beta }^{\prime}\)2a avoidance MBONs^{19}. In addition, aversive conditioning depresses responses in MVP2/γ1pedc and MBV2/\(\alpha 2{\alpha }^{\prime}2\) approach MBONs^{9,30}, and potentiates responses in M4\({\beta }^{\prime}\)/\({\beta }^{\prime}\)2mp and M6/\(\gamma 5{\beta }^{\prime}\)2a avoidance MBONs^{10,29}. However, the potentiation of M4\({\beta }^{\prime}\) and M6 MBONs is at least partially a result of depressed feedforward inhibition from the MVP2 MBON^{16,19}. To the best of our knowledge, simultaneous changes in approach and avoidance MBON activity have not yet been observed. A consequence of this coordinated plasticity is that, if plasticity onto one MBON type is blocked (e.g. the synaptic weights cannot be depressed any further), plasticity at the other MBON type should compensate.
Prediction 5—MV model
DANs of both valences modulate plasticity at MBONs of a single valence. This is a result of using the plasticity rule specified by Eq. (8), which better explains the experimental data than Eq. (7) (Fig. 5d, e, Supplementary Fig. 7a). In contrast, anatomical and functional experimental data suggest that, in each MB compartment, the DANs and MBONs have opposite valences^{21,58}. However, the GAL4 lines used to label DANs in the PAM cluster often include as many as 20–30 cells each, and it has not yet been determined whether all labelled DANs exhibit the same valence preference. Similarly, the valence encoded by MBONs is not always obvious. In ref.^{15}, for example, it is not clear whether optogenetically activated MBONs biased flies to approach the light stimulus, or to exhibit no-go behaviour that kept them within the light. In larval Drosophila, there are several examples of cross-compartmental DANs and MBONs^{25,59}, but a full account of the valence encoded by these neurons is yet to be provided. In adult Drosophila, γ1pedc MBONs deliver cross-compartmental inhibition, such that M4/6 MBONs are effectively modulated by both aversive PPL1-γ1pedc DANs and appetitive PAM DANs^{16,19}.
Other models of learning in the mushroom body
We are not the first to present an MB model that makes effective decisions after learning about multiple reinforced cues^{22,23,24}. However, these models utilise absolute reinforcement signals, as well as bounded synapses that cannot strengthen indefinitely with continued reinforcements. Thus, given enough training, these models would not differentiate between two cues that were associated with reinforcements of the same sign, but different magnitudes. Carefully designed mechanisms are therefore required to promote stability as well as differentiability of reinforcements that share a sign but differ in magnitude. Our model builds upon these studies by incorporating feedback from MBONs to DANs, which allows KC → MBON synapses to accurately encode the reinforcement magnitude and sign with stable fixed points that are reached when the RPE signalled by DANs decays to zero. Alternative mechanisms that may promote stability and differentiability are forgetting^{60} (e.g. by synaptic weight decay), or adaptation in DAN responses^{61}. Exploring these possibilities in an MB model for comparison with the RPE hypothesis is worthwhile, but goes beyond the scope of this work.
Model limitations
Central to this work is the assumption that the MB has only a single objective: to minimise the RPE. In reality, an organism must satisfy multiple objectives that may be mutually opposed. In Drosophila, anatomically segregated DANs in the γ-lobe encode water rewards, sugar rewards, and motor activity^{8,13,14,27}, suggesting that Drosophila do indeed learn to satisfy multiple objectives. Multi-objective optimisation is a challenging problem, and goes beyond the scope of this work. Nevertheless, for many objectives, the principle that accurate predictions aid decision making, which forms the basis of this work, still applies.
For simplicity, our simulations compress all events within a trial to a single point in time, and are therefore unable to address some time-dependent features of learning. For example, activating DANs either before or after cue exposure can induce memories with opposite valences^{28,62,63}; in locusts, the relative timing of KC and MBON spikes is important^{64,65}, though not necessarily in Drosophila^{9}. Nor have we addressed the credit assignment problem: how to associate a cue with reinforcement when they do not occur simultaneously. A candidate solution is TD learning^{51,52}, whereby reinforcement information is back-propagated in time to all cues that predict it. While DAN responses in the MB hint at TD learning^{50}, it is not yet clear how the MB circuitry could implement it. An alternative solution is an eligibility trace^{52,66}, which enables synaptic weights to be updated upon reinforcement even after presynaptic activity has ceased.
Lastly, our work here addresses memory acquisition, but not memory consolidation, which is supported by distinct circuits within the MB^{67}. Incorporating memory stabilising mechanisms may help to better align our simulations of genetic interventions with fly behaviour in conditioning experiments.
Blocking experiments
By incorporating the fact that KC responses to compound stimuli are nonlinear combinations of their responses to the components^{44,45,46}, we used our model to demonstrate why the lack of evidence for blocking in insects^{40,41,42,43} cannot be taken as evidence against RPE-dependent learning in insects. Our model provides a neural circuit instantiation of similar arguments in the literature, whereby variable degrees of blocking can be explained if the brain utilises representations of stimulus configurations, or latent causes, which allow learned associations to be generalised between a compound stimulus and its individual elements by varying amounts^{47,48,68,69}. The effects of such configural representations on blocking are more likely when the component stimuli are similar, for example, if they engage the same sensory modality, as was the case in refs.^{40,41,42,43}. By using component stimuli that do engage different sensory modalities, experiments with locusts have indeed uncovered strong blocking effects^{70}.
Summary
We have developed a model of the MB that goes beyond previous models by incorporating feedback from MBONs to DANs, and shown how such an MB circuit can learn accurate RPs through DAN-mediated RPE signals. The model provides a basis for understanding a broad range of behavioural experiments, and reveals limitations to learning given the anatomical data currently available from the MB. Those limitations may be overcome with additional connectivity between DANs, MBONs and KCs. Together, the models provide five strong predictions.
Methods
Experimental paradigm
In all but the last two results sections, we apply our model to a multi-armed bandit paradigm^{52,71} comprising a sequence of trials, in which the model is forced to choose between a number of cues, each cue being associated with its own reinforcement schedule. In each trial, the reinforcement signal may have either positive valence (reward) or negative valence (punishment), and the valence can change across trials. Initially, the fly is naive to the cue-specific reinforcements. Thus, in order to reliably choose the most rewarding cue, it must learn, over successive trials, to accurately predict the reinforcements for each cue. Individual trials comprise three stages in the following order (illustrated in Fig. 1d): (i) the model is exposed to and computes RPs for all cues, (ii) a choice probability is assigned to each cue using a softmax function (described below), with the largest probability assigned to the cue that predicts the most positive reinforcement, (iii) a single cue is chosen probabilistically, according to the choice probabilities, and the model receives reinforcement with magnitude r_{+} (positive reinforcement, or reward) or r_{−} (negative reinforcement, or punishment). The fly uses this reinforcement signal to update its cue-specific RP.
Simulations
Connectivity and synaptic weights
KC → MBON: KCs (K in Fig. 1c) constitute the sensory inputs (described below) in our models. Sensory information is transmitted from the KCs, of which there are N_{K}, to two MBONs, M_{+} and M_{−}, through excitatory, feedforward synapses. For simplicity, we use a subscript '+' to label positive valence (e.g. reward or approach) and '−' to label negative valence (e.g. punishment or avoidance). K_{i} synapses onto M_{±} with a synaptic weight w_{±i}, which is initialised with w_{±i} = 0.1ξ_{±i} for each run of the model, where ξ_{±i} is a uniform random variable in the range 0–1.
KC → DAN: KCs drive excitatory responses in DANs from the PPL1 cluster^{34}. In our model, we assume that KCs also provide input to appetitive DANs in the PAM cluster. Thus, K_{i} drives D_{±} through unmodifiable, excitatory synapses with weights, w_{K} = γ1, where \({\mathbf{1}}\,=\,{\left[1,1,\ldots ,1\right]}^{{\rm{T}}}\) is a vector of ones of length N_{K}.
MBON → DAN: MBONs provide excitatory feedback to their respective DANs^{17,18,19}. In both the valence-specific (VS) and mixed-valence (MV) models, M_{±} synapses onto D_{∓} with unit synaptic weight. In the MV model, M_{±} also provides inhibitory feedback to D_{±} via an inhibitory interneuron, but we do not model the interneuron explicitly. Thus, we describe the feedback weight simply as w_{M} = 1, and specify whether the input is excitatory or inhibitory in the firing rate equation for D_{±} (Eqs. (13) and (14)).
Inputs and KC sensory representation
Projection neurons from the antennal lobe and optic lobes provide a substantial majority of inputs to KCs in the MB. These inputs carry olfactory and visual information and, together with recurrent inhibition from the anterior paired lateral neuron, drive a sparse representation of sensory information in ~5–10% of the KCs^{72,73,74}. For simplicity, we bypass the computations performed in nuclei upstream of the KCs, and assign a unique population of 10 KCs to each cue. Thus, for N_{c} cues, we simulate N_{K} = 10N_{c} KCs. Each KC is always activated by its assigned cue, and each active KC, j, is given the same firing rate, k_{j} = 1 Hz. In a subset of simulations used for Supplementary Fig. 6a, c–e, we simulate 2000 KCs, where each KC is assigned to a cue with probability p = 0.05, so that 5% of KCs, on average, are active for a given cue. In these simulations, we normalised the total KC firing rates for each cue, i, such that \({\sum}_{j}{k}_{j}^{i}\,=\,10\) Hz. This ensured that the multiplicative effect of KC firing rates on the speed of learning (Eqs. (2) and (5)) does not confound the interpretation of our results.
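The two KC encodings described above can be sketched as follows. This is a minimal illustration rather than the code used for the simulations: the function names and the NumPy implementation are assumptions, as is the choice of binary activity before normalisation in the sparse case.

```python
import numpy as np

def dense_kc_code(n_cues, kcs_per_cue=10, rate_hz=1.0):
    """One row per cue: a unique block of KCs fires at rate_hz (N_K = 10 N_c)."""
    k = np.zeros((n_cues, n_cues * kcs_per_cue))
    for cue in range(n_cues):
        k[cue, cue * kcs_per_cue:(cue + 1) * kcs_per_cue] = rate_hz
    return k

def sparse_kc_code(n_cues, n_kc=2000, p=0.05, total_hz=10.0, rng=None):
    """Each KC is assigned to each cue with probability p; rows sum to total_hz."""
    rng = np.random.default_rng() if rng is None else rng
    k = (rng.random((n_cues, n_kc)) < p).astype(float)
    # Assumes every cue activates at least one KC (near certain for 2000 KCs).
    k *= total_hz / k.sum(axis=1, keepdims=True)
    return k
```

In the dense code, responses to different cues are orthogonal, whereas the sparse code allows chance overlap between cues while keeping the summed KC drive per cue fixed at 10 Hz.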
MBON firing rates and reinforcement predictions
Neurons are modelled as linear–nonlinear (LN) units that output a firing rate, y, equal to the rectified linear sum of their inputs, x:
\[y\,=\,f\left(\sum_{i=1}^{N}{w}_{i}{x}_{i}\right), \tag{9}\]
where \(f(z)\,=\,\max (0,z)\) is the rectifying nonlinearity. Equation (9) can be written more concisely in vector notation: \(y\,=\,f\left({{\bf{w}}}^{{\rm{T}}}{\bf{x}}\right)\), where w^{T} = [w_{1}, …, w_{N}] for N presynaptic neurons, and superscript T denotes the transpose. Throughout this text, bold fonts denote vectors.
At the beginning of each trial, MBON firing rates, and thus RPs, are computed for each cue. The firing rate, m_{±}, of MBON M_{±}, signals the amount of positive (or negative) reinforcement associated with a given cue, labelled i, according to
\[{m}_{\pm }^{i}\,=\,f\left({{\bf{w}}}_{\pm }^{{\rm{T}}}{{\bf{k}}}^{i}\right), \tag{10}\]
where k^{i} is the vector of KC responses to stimulus i, and w_{±} are plastic, excitatory synaptic weights. The net reinforcement predicted by sensory cue i is then determined by \({\hat{m}}^{i}\,=\,{m}_{+}^{i}\,-\,{m}_{-}^{i}\).
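The LN unit and the net reinforcement prediction described above can be sketched in a few lines. The function names are illustrative assumptions; the weight initialisation follows the 0.1ξ rule stated for KC → MBON synapses.

```python
import numpy as np

def ln_rate(w, x):
    """Rectified linear sum of inputs: f(w^T x) with f(z) = max(0, z)."""
    return max(0.0, float(np.dot(w, x)))

def reinforcement_prediction(w_plus, w_minus, k):
    """Return (m_plus, m_minus, m_hat), where m_hat = m_plus - m_minus."""
    m_plus = ln_rate(w_plus, k)
    m_minus = ln_rate(w_minus, k)
    return m_plus, m_minus, m_plus - m_minus

rng = np.random.default_rng(0)
k = np.ones(10)                    # 10 KCs firing at 1 Hz for one cue
w_plus = 0.1 * rng.random(10)      # initial weights, 0.1 * xi with xi in [0, 1)
w_minus = 0.1 * rng.random(10)
m_plus, m_minus, m_hat = reinforcement_prediction(w_plus, w_minus, k)
```

With these initial weights each MBON rate lies between 0 and 1 Hz, so the naive net prediction is small relative to a unit reinforcement.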
Decision making
In each trial, RPs for all cues are compared, and the model is forced to decide which cue should be chosen. Decisions are made probabilistically using a softmax function, \(p\left(i\right)\), which specifies the probability of choosing cue i as a function of the differences between its RP and the RPs of every other cue:
\[p\left(i\right)\,=\,{\left(\sum_{j=1}^{M}{e}^{\beta \left({\hat{m}}^{j}-{\hat{m}}^{i}\right)}\right)}^{-1}, \tag{11}\]
where β is a constant (analogous to the inverse temperature in thermodynamics) and modulates the extent to which \(p\left(i\right)\) increases or decreases with respect to \({\hat{m}}^{j}\,-\,{\hat{m}}^{i}\). When β = 0, choices are independent of the learned valence, and each of the M available options is chosen with equal probability, \(p\left(i\right)\,=\,{M}^{-1}\). When β = ∞, decisions are made deterministically, such that the cue with the most positive RP is always chosen. For the MAFC task, the cue that is ultimately chosen on a given trial is determined by drawing a single, random sample, ξ, from a uniform distribution in the range 0–1, and selecting a cue, q, such that
\[\sum_{i=1}^{q-1}p\left(i\right)\,<\,\xi \,\le \,\sum_{i=1}^{q}p\left(i\right). \tag{12}\]
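The softmax choice rule and the selection of a cue from a single uniform sample can be sketched as below. The function names are assumptions, and the softmax is written in the difference form described in the text, which is equivalent to the usual normalised exponential.

```python
import numpy as np

def choice_probabilities(m_hat, beta):
    """p(i) = 1 / sum_j exp(beta * (m_hat[j] - m_hat[i]))."""
    m_hat = np.asarray(m_hat, dtype=float)
    diffs = m_hat[None, :] - m_hat[:, None]   # diffs[i, j] = m_hat[j] - m_hat[i]
    return 1.0 / np.exp(beta * diffs).sum(axis=1)

def choose_cue(p, xi):
    """Pick the first cue q whose cumulative probability reaches xi (0-based)."""
    cdf = np.cumsum(p)
    q = int(np.searchsorted(cdf, xi, side="left"))
    return min(q, len(cdf) - 1)               # guard against rounding in the CDF
```

For β = 0 the probabilities are uniform, and increasing β concentrates probability on the cue with the most positive prediction.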
DAN firing rates
Once a cue has been chosen, the RP specific to that cue is fed back to the DANs, where it is compared against the actual reinforcement, \({\hat{r}}^{i}\,=\,{r}_{+}^{i}\,-\,{r}_{-}^{i}\), received in that trial, where r_{±} is the magnitude of reinforcement signal R_{±}. Given the chosen cue, q, D_{±} firing rates in the VS models are given by
\[{d}_{\pm }^{q}\,=\,f\left({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{{\bf{k}}}^{q}\,+\,{r}_{\pm }^{q}\,+\,{w}_{{\rm{M}}}{m}_{\mp }^{q}\right), \tag{13}\]
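whereas, in the MV model, D_{±} is given by
\[{d}_{\pm }^{q}\,=\,f\left({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{{\bf{k}}}^{q}\,+\,{r}_{\pm }^{q}\,-\,{r}_{\mp }^{q}\,+\,{w}_{{\rm{M}}}\left({m}_{\mp }^{q}-{m}_{\pm }^{q}\right)\right). \tag{14}\]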
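We set w_{M} = 1, such that the difference in DAN firing rates yields the RPE for cue q:
\[{\hat{d}}^{q}\,=\,{d}_{+}^{q}-{d}_{-}^{q}\,=\,\begin{cases}{\hat{r}}^{q}-{\hat{m}}^{q} & \text{(VS model)}\\ 2\left({\hat{r}}^{q}-{\hat{m}}^{q}\right) & \text{(MV model)}\end{cases} \tag{15}\]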
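where \({\hat{d}}^{q}\) for the MV model is valid when \({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{{\bf{k}}}^{q}\,> \,\left|{\hat{r}}^{q}\,-\,{\hat{m}}^{q}\right|\). When the inequality is not satisfied, the precise expression for \({\hat{d}}^{q}\) in the MV model, taking into consideration the nonlinear rectification in d_{+} and d_{−}, is
\[{\hat{d}}^{q}\,=\,{\rm{sgn}}\left({\hat{r}}^{q}-{\hat{m}}^{q}\right)\left({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{{\bf{k}}}^{q}+\left|{\hat{r}}^{q}-{\hat{m}}^{q}\right|\right).\]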
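As an illustration of how the two circuits could produce an RPE in the difference of DAN rates, the sketch below encodes the connectivity described in Methods. The exact input sums are an assumption inferred from that description: in the VS case D_{±} receives KC drive, R_{±} and excitatory feedback from M_{∓}; in the MV case D_{±} is additionally inhibited by R_{∓} and M_{±}. The function names and example values are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def dan_rates_vs(kc_drive, r_plus, r_minus, m_plus, m_minus, w_m=1.0):
    """VS sketch: each DAN sums KC drive, its own reinforcement, and opposite-MBON feedback."""
    d_plus = relu(kc_drive + r_plus + w_m * m_minus)
    d_minus = relu(kc_drive + r_minus + w_m * m_plus)
    return d_plus, d_minus

def dan_rates_mv(kc_drive, r_plus, r_minus, m_plus, m_minus, w_m=1.0):
    """MV sketch: each DAN carries the full signed RPE on top of the KC drive."""
    rpe = (r_plus - r_minus) - w_m * (m_plus - m_minus)
    return relu(kc_drive + rpe), relu(kc_drive - rpe)

# Hypothetical example: reward of 1, current prediction m_hat = 0.3 - 0.1 = 0.2.
vs = dan_rates_vs(10.0, 1.0, 0.0, 0.3, 0.1)
mv = dan_rates_mv(10.0, 1.0, 0.0, 0.3, 0.1)
clipped = dan_rates_mv(0.5, 1.0, 0.0, 0.3, 0.1)   # KC drive smaller than |RPE|
```

Under these assumptions the VS difference equals the RPE, the MV difference equals twice the RPE, and when the KC drive is smaller than the RPE magnitude one MV DAN is clipped at zero so the difference is no longer proportional to the RPE.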
Synaptic plasticity
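We assume that the objective of the MB is to form accurate RPs, which minimise RPEs. This objective can be formulated as
\[{C}^{{\rm{RPE}}}\,=\,\frac{1}{2}\sum_{i}{\left[\left({r}_{+}^{i}-{r}_{-}^{i}\right)-\left({{\bf{w}}}_{+}^{{\rm{T}}}{{\bf{k}}}^{i}-{{\bf{w}}}_{-}^{{\rm{T}}}{{\bf{k}}}^{i}\right)\right]}^{2},\]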
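where the sum is over all cues, i, \({{\bf{w}}}_{+}^{{\rm{T}}}{{\bf{k}}}^{i}\) is the firing rate of M_{+}, expressed as the weighted input from the KC population response, k^{i}, through synapses with strength w_{+}, and similarly for \({{\bf{w}}}_{-}^{{\rm{T}}}{{\bf{k}}}^{i}\). Learning an accurate RP amounts to minimising C^{RPE} by modifying the synaptic weights. Assuming that inputs onto approach and avoidance MBONs are modified independently^{9}, we perform gradient descent on C^{RPE} with respect to w_{+} and w_{−} separately. The plasticity rule, \({{\mathcal{P}}}_{\pm }^{{\rm{RPE}}}\), is then defined by the negative gradient:
\[\begin{aligned}{{\mathcal{P}}}_{\pm }^{{\rm{RPE}}}\,&=\,-\eta \frac{\partial {C}^{{\rm{RPE}}}}{\partial {{\bf{w}}}_{\pm }}\\ &=\,\pm \eta \sum_{i}{{\bf{k}}}^{i}\left[\left({r}_{+}^{i}-{r}_{-}^{i}\right)-\left({{\bf{w}}}_{+}^{{\rm{T}}}{{\bf{k}}}^{i}-{{\bf{w}}}_{-}^{{\rm{T}}}{{\bf{k}}}^{i}\right)\right]\\ &=\,\eta \sum_{i}{{\bf{k}}}^{i}\left({d}_{\pm }^{i}-{d}_{\mp }^{i}\right),\end{aligned}\]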
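where η is the learning rate, and the last line is reached by substituting in the DAN firing rates, d_{±}, for the VS model. If instead d_{±} is used from the MV model, it is possible to write plasticity rules that minimise C^{RPE} in two ways (respectively Eq. (7) and Eq. (8) in Results), either:
\[{{\mathcal{P}}}_{\pm }^{{\rm{MV}}}\,=\,\eta \sum_{i}{{\bf{k}}}^{i}\left({{\bf{w}}}_{{\rm{K}}}^{{\rm{T}}}{{\bf{k}}}^{i}-{d}_{\mp }^{i}\right)\quad {\rm{or}}\quad {{\mathcal{P}}}_{\pm }^{{\rm{MV}}}\,=\,\frac{\eta }{2}\sum_{i}{{\bf{k}}}^{i}\left({d}_{\pm }^{i}-{d}_{\mp }^{i}\right),\]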
where the factor 1/2 in Eq. (8) accommodates the factor of 2 in Eq. (15). The two equations are equivalent when DAN firing rates are not clipped by rectification, but behave differently when the rates are rectified (Supplementary Fig. 4). We use Eq. (8) throughout the main text, and compare model behaviours for both Eqs. (8) and (7) in Supplementary Fig. 7.
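A minimal numerical sketch of gradient descent on the squared RPE is given below. It is not the authors' implementation: it updates on a single cue per trial, writes the RPE directly rather than through DAN rates, and clips weights at zero to keep synapses excitatory, all of which are assumptions made for illustration.

```python
import numpy as np

def rpe_gradient_step(w_plus, w_minus, k, r_hat, eta):
    """One descent step on 0.5 * (r_hat - (w_plus - w_minus)^T k)^2 for a single cue."""
    rpe = r_hat - (w_plus - w_minus) @ k
    w_plus = np.maximum(0.0, w_plus + eta * rpe * k)    # approach weights follow +RPE
    w_minus = np.maximum(0.0, w_minus - eta * rpe * k)  # avoidance weights follow -RPE
    return w_plus, w_minus, rpe

rng = np.random.default_rng(1)
k = np.ones(10)                          # one cue, 10 KCs at 1 Hz
w_plus = 0.1 * rng.random(10)
w_minus = 0.1 * rng.random(10)
for _ in range(50):                      # 50 pairings with a reward of magnitude 1
    w_plus, w_minus, rpe = rpe_gradient_step(w_plus, w_minus, k, r_hat=1.0, eta=0.025)
m_hat = float((w_plus - w_minus) @ k)    # learned net reinforcement prediction
```

After repeated pairings the net prediction approaches the delivered reinforcement and the RPE decays towards zero, which is the stable fixed point of the delta rule.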
We take a similar approach to derive the VS plasticity rule, but use a valence-specific cost function
We derive the plasticity rule, \({\mathcal{P}}_{\pm }^{\,{\text{VS}}\,}\), by gradient descent on \({C}_{\pm }^{\,{\text{VS}}\,}\):
where d_{±} are computed according to the VS model. These plasticity rules are in fact only an approximation to gradient descent, and hold true only when: (i) the DAN firing rates are not clipped by the nonlinear rectification; (ii) the learning rate, η, is sufficiently small, which allows us to dispense with the sum over cues, assuming instead that plasticity minimises a running average of the cost. Here, sufficiently small means that \(\eta \,<\,{\left(2{\sum}_{j}{k}_{j}^{i}\right)}^{-1}\) for all cues, i, which ensures that learning does not result in unstable oscillations in RPs. The plasticity rule therefore describes the mean drift in synaptic weights over several trials. This need not be at odds with rapid learning in insects, as small synaptic weight changes may yield large behavioural changes in our model, depending on the softmax parameter β in Eq. (11). For Figs. 2, 3, we use η = 2.5 × 10^{−2}; for Fig. 4, we use η = 10^{−1}; for Fig. 5, we use η = 5 × 10^{−2}. We set η → η/2 for the MV model, because each DAN in the MV model encodes the full RPE, as opposed to half the RPE in the VS model. This ensures that synaptic weight updates have the same magnitude for a given RPE in both models. In the simulations, we use Eqs. (19) and (20) to specify discrete updates to the synaptic weights at the end of each trial, t, conditioned on the chosen cue, q. Specifically, the update for the VS model is given by
and for the MV model by
where the superscript q specifies the firing rate of each neuron in the presence of cue q alone, under the assumption that this cue dominates the neural activity at the point of receiving its corresponding reinforcement signal. The update equation for the VS model with the modified plasticity rule (which we call the VSλ model) is
Note that the plasticity rule is not a function of the postsynaptic MBON firing rate (except indirectly through the DAN firing rate). This is possible because a separate plasticity rule exists for synapses impinging on each MBON, negating the need to label the postsynaptic neuron via its firing rate, as would be the case in three-factor Hebbian rules that are typically used in models of reinforcement-modulated learning^{31}.
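The learning-rate bound stated above can be illustrated with a short numerical sketch. It assumes the joint delta-rule update in which both weight vectors move by η × RPE × k in opposite directions, so that each trial multiplies the RPE by 1 − 2η Σ_j k_j² (equal to 1 − 2η Σ_j k_j for KCs firing at 1 Hz). Weights are left unconstrained here, which is a simplification, and the function name is hypothetical.

```python
import numpy as np

def rpe_trace(eta, k, r_hat=1.0, n_trials=6):
    """RPE on successive trials for one cue under the unconstrained delta rule."""
    w_plus = np.zeros_like(k)
    w_minus = np.zeros_like(k)
    errors = []
    for _ in range(n_trials):
        rpe = r_hat - (w_plus - w_minus) @ k
        errors.append(rpe)
        w_plus = w_plus + eta * rpe * k
        w_minus = w_minus - eta * rpe * k
    return np.array(errors)

k = np.ones(10)                      # 10 active KCs at 1 Hz
bound = 1.0 / (2.0 * k.sum())        # eta < 1 / (2 * sum_j k_j) = 0.05
slow = rpe_trace(0.5 * bound, k)     # below the bound: monotonic decay
fast = rpe_trace(1.5 * bound, k)     # above the bound: the RPE overshoots and changes sign
```

Below the bound the RPE shrinks without changing sign; above it the prediction overshoots the reinforcement on every trial, which is the oscillation the bound is meant to exclude.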
Reinforcement schedule
At the end of each trial, a reinforcement signal specific to sensory cue i is provided. Reinforcements, r^{i}, take continuous values, and are drawn on each trial, t, from a normal distribution, \({r}^{i}(t) \,\sim\, {\mathcal{N}}\left({\mu }_{i}(t),{\sigma }_{{\rm{R}}}\right)\), with mean μ_{i}(t), and standard deviation σ_{R}. The reinforcement signals that arrive at DANs, R_{+} and R_{−} in Fig. 1d, have amplitudes \({r}_{+}^{i}\,=\,\max \left(0,{r}^{i}\right)\) and \({r}_{-}^{i}\,=\,\left|\min \left(0,{r}^{i}\right)\right|\), respectively. Over the course of a simulation run, \({\mu }_{i}\left(t\right)\) is varied according to a predetermined schedule, and σ_{R} is fixed. Thus, at different stages throughout each experiment, the most rewarding cue may switch between the multiple alternatives. Unless otherwise stated, σ_{R} = 0.1. The reinforcement schedules were as follows. For Figs. 2, 3, and Supplementary Figs. 1–3, 5, \({\mu }_{1}\left(t\,=\,1\right)\,=\,0\), and was held fixed for 20 trials, then underwent a step change of +1 at trials 21, 41, 141, and 161, and a step change of −1 at trials 61, 81, 101, and 121. For Fig. 4 and Supplementary Fig. 6, \({\mu }_{i}\left(t\right)\,=\,Ag({\xi }_{\mu }(t))\,+\,{\sigma }_{{\rm{R}}}{\xi }_{\sigma }\left(t\right)\), where \({\xi }_{\mu }\left(t\right)\) and \({\xi }_{\sigma }\left(t\right)\) are Gaussian white noise processes with zero mean and unit variance, such that ξ_{μ} determines the mean reinforcement, and \({\xi }_{\sigma }\left(t\right)\) determines the additive noise on trial t. A low pass filter, \(g({\xi }_{\mu })\,=\,{F}^{-1}\left\{F\{{\xi }_{\mu }\}\,F\{G\left(0,\tau \right)\}\right\}\), is applied to ξ_{μ}, where \(G\left(0,\tau \right)\) is a Gaussian function with unit area, centred on 0, and with standard deviation τ = 10 trials, F{⋅} is the Fourier transform, and F^{−1}{⋅} is the inverse Fourier transform.
Because the Fourier transform method of filtering assumes \({\xi }_{\mu }\left(1\right)\,=\,{\xi }_{\mu }\left({N}_{t}\,+\,1\right)\), where N_{t} is the number of trials, we generate ξ_{μ} for 250 trials, then delete the first 50 trials after filtering. Finally, the reinforcement amplitude is determined by \(A\,=\,2/\mathop{\max }\limits_{t}(\, g({\xi }_{\mu }(t)))\).
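Both reinforcement schedules can be sketched with NumPy as follows. This is an illustrative reconstruction under stated assumptions: the run length of the step schedule (180 trials) is hypothetical, the Gaussian kernel is discretised and wrapped circularly so that it is centred on lag 0, and the amplitude is taken literally as A = 2/max_t g.

```python
import numpy as np

def step_schedule(n_trials=180):
    """Mean reinforcement with +1 steps at trials 21, 41, 141, 161 and -1 steps at 61, 81, 101, 121."""
    ups, downs = {21, 41, 141, 161}, {61, 81, 101, 121}
    mu, level = np.zeros(n_trials), 0.0
    for t in range(1, n_trials + 1):          # trials are numbered from 1
        level += (t in ups) - (t in downs)
        mu[t - 1] = level
    return mu

def filtered_schedule(n_trials=200, burn_in=50, tau=10.0, rng=None):
    """Low-pass filtered Gaussian noise, scaled by A = 2 / max(g)."""
    rng = np.random.default_rng() if rng is None else rng
    n = n_trials + burn_in                    # e.g. 250 trials generated in total
    xi = rng.standard_normal(n)
    lags = np.arange(n)
    lags = np.minimum(lags, n - lags)         # circular distance from lag 0
    kernel = np.exp(-0.5 * (lags / tau) ** 2)
    kernel /= kernel.sum()                    # unit area
    g = np.fft.ifft(np.fft.fft(xi) * np.fft.fft(kernel)).real
    g = g[burn_in:]                           # discard the first 50 filtered trials
    return 2.0 * g / g.max()

mu_steps = step_schedule()
mu_smooth = filtered_schedule(rng=np.random.default_rng(0))
```

Per-trial reinforcements would then be drawn around these means with standard deviation σ_{R}.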
Experimental data and model comparisons
The VSλ and MV models were compared to experimental data by simulating a commonly used conditioning protocol. To align with experiments, each simulation utilised the following procedure (Fig. 5a): (i) in the first stage of training, the model is exposed to a single cue by itself, the CS+, for ten trials, with reinforcements drawn from a normal distribution, \({\mathcal{N}}\left(\mu ,0.1\right)\), where μ was chosen according to whether appetitive (μ = 1), aversive (μ = −1), or neutral (μ = 0) conditioning was simulated, (ii) during the next 10 trials, the model is exposed to a second cue by itself, the CS−, with reinforcements drawn from a distribution with μ = 0 and the same variance as for the CS+, (iii) the final two trials comprise the test stage, in which the model is exposed to both cue 1 (the CS+) and cue 2 (the CS−), as in the MAFC task with two alternatives, with μ = 0 for both cues. On each test trial, the model is forced to choose either cue 1 or cue 2, using Eq. (12). We used 10 trials per training stage as, given the parameters for η (learning rate) and β (inverse temperature), it took this many trials for the mean performance (see below for how performance is measured) across multiple runs of the simulation to plateau at, or near, the maximum possible value. The test was run for only two trials as synaptic plasticity was allowed to continue during the test stage, under the assumption that the formation of new CS+ related short-term memories^{18,19} might alter the behaviour of flies in the test stage of experiments.
For each simulation, we applied one of many possible additional protocol features, in which neuronal activity was manipulated. We therefore define a protocol as a unique combination of four features:

1. US valence (Fig. 5a): (i) appetitive (μ = 1), (ii) aversive (μ = −1), (iii) neutral (μ = 0). To ensure the VSλ model was not limited in learning RPs as large as ±1, we set λ = 12.

2. Intervention type (Fig. 5c), which modified the target neuron’s output firing rate from \({y}_{{\rm{targ}}}\) to \({\tilde{y}}_{{\rm{targ}}}\): (i) block of neuronal output (e.g. by shibire), which was simulated by multiplicatively scaling the manipulated neuron’s firing rate, such that \({\tilde{y}}_{{\rm{targ}}}\,=\,0.1{y}_{{\rm{targ}}}\), (ii) neuronal activation (e.g. by dTrpA1), which was simulated by adding a constant current, such that \({\tilde{y}}_{{\rm{targ}}}\,=\,{y}_{{\rm{targ}}}\,+\,5\).

3. Activation schedule (Fig. 5b), which determined when the intervention was applied: (i) during the CS+ only, (ii) throughout training (CS+ and CS−), (iii) during test only, (iv) throughout all stages.

4. Target neuron (Fig. 5c), to which the intervention was applied: (i) M_{+}, (ii) M_{−}, (iii) D_{+}, or (iv) D_{−}.
We compared behavioural data from experiments with that of our model for 27 of the 96 possible variations of these four features. These data were obtained from 14 published studies^{10,13,14,15,16,17,18,27,28,32,35,36,38,39}, comprising 439 experiments that followed conditioning protocols similar to that used in our simulations (235 controls with no intervention, 204 experiments with one of the 27 interventions).
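The four protocol features and the two simulated interventions can be enumerated with a short sketch. The labels and the function name are illustrative assumptions; the scaling factor 0.1 and the additive constant 5 are the values stated above.

```python
from itertools import product

US_VALENCE = {"appetitive": 1.0, "aversive": -1.0, "neutral": 0.0}   # mean reinforcement
INTERVENTIONS = ("block", "activate")
SCHEDULES = ("cs_plus_only", "training", "test_only", "all_stages")
TARGETS = ("M+", "M-", "D+", "D-")

def apply_intervention(rate, kind):
    """Modify a target neuron's output firing rate."""
    if kind == "block":        # e.g. shibire: output scaled to 10%
        return 0.1 * rate
    if kind == "activate":     # e.g. dTrpA1: constant added to the rate
        return rate + 5.0
    raise ValueError(f"unknown intervention: {kind}")

# Every unique combination of the four features is one protocol.
protocols = list(product(US_VALENCE, INTERVENTIONS, SCHEDULES, TARGETS))
```

The product of 3 valences, 2 intervention types, 4 schedules and 4 targets gives the 96 possible protocols, of which 27 had matching experimental data.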
Simulations were run in batches of 50, each batch yielding 100 choices from the two test trials. From these choices, we computed a performance index (\({{\rm{PI}}}_{{\rm{mod}}}\)) given by
\[{{\rm{PI}}}_{{\rm{mod}}}\,=\,\frac{{n}_{+}-{n}_{-}}{{n}_{+}+{n}_{-}},\]
where n_{+} is the number of choices for the CS+ and n_{−} for the CS−. A distribution of PIs for each protocol was obtained by running 20 such batches. PIs from the experimental data were extracted by eye from the 14 published papers. These PIs are computed in a similar way as for the model, but where n_{+} and n_{−} correspond to the number of flies that approached the CS+ or CS−, respectively. We averaged across PIs from experiments that used the same intervention in the same study, reducing the number of intervention samples from 204 to 92, against which PIs from the simulations were compared.
To measure the effect strength of each intervention in both the model and the experiments, we converted PIs into fractions of flies (or model runs) that chose the CS+, \(f\,=\,\left({\rm{PI}}\,+\,1\right)/2\), then computed a test statistic, Δ_{f}, which compares f_{c} from control to f_{i} from intervention experiments, given that the underlying data is binomially distributed, as follows:
where N_{fly} is the number of flies used in that experiment. The binomial distribution adjustment to f_{i} − f_{c} accounts for the bounded nature of f between 0 and 1. As such, for a given absolute difference, f_{i} − f_{c}, Δ_{f} is larger when f_{c} is near to 1 than when it is near to 0.5. That is, small changes to excellent memory performance imply a stronger effect than small changes to mediocre performance. Because N_{fly} was rarely stated in the studies we assessed, we set N_{fly} = 50, which is typical for experiments of this nature, and corresponds to the number of runs in each batch of simulations from which a single PI was computed from the model.
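The performance index and its conversion to the fraction of choices for the CS+ can be sketched as below. The normalised-difference form of the index is the standard definition and is consistent with f = (PI + 1)/2; the function names are assumptions.

```python
def performance_index(n_plus, n_minus):
    """Normalised difference between CS+ and CS- choices, in the range -1 to 1."""
    return (n_plus - n_minus) / (n_plus + n_minus)

def fraction_cs_plus(pi):
    """Fraction of flies (or model runs) choosing the CS+, f = (PI + 1) / 2."""
    return (pi + 1.0) / 2.0

# One batch of 50 runs yields 100 test choices, for example 75 for the CS+.
pi_batch = performance_index(75, 25)
f_batch = fraction_cs_plus(pi_batch)
```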
To examine the correspondence between PIs from the model and experiments, we fit a weighted linear regression to the experimental versus model Δ_{f} data using the MATLAB R2012a function robustfit, which computes iteratively reweighted least-squares fits with a bisquare weighting function. We then computed the Pearson correlation coefficient, R, of the weighted data using the weights, w_{r}, provided by robustfit, according to
where \({\sigma }_{{\rm{mod}}}\) and \({\sigma }_{\exp }\) are the standard deviations of \({{\bf{w}}}_{{\rm{r}}}{{\boldsymbol{\Delta }}}_{f}^{{\rm{mod}}}\) and \({{\bf{w}}}_{{\rm{r}}}{{\boldsymbol{\Delta }}}_{f}^{\exp }\), respectively, and bold fonts denote vectors for all data points in either the model or experimental data sets. We estimated the probability that R was drawn from a null distribution with zero mean by reshuffling the weighted data.
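A minimal Python sketch of the weighted correlation and the reshuffling test follows. It rests on several assumptions that are not stated in the text: R is taken to be the ordinary Pearson coefficient of the element-wise products of the weights with each Δ_{f} vector, the weights are passed in as a plain list (standing in for the output of robustfit, which is not reimplemented here), and the shuffle test is two-sided with an add-one correction. All function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Ordinary Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def weighted_r(delta_mod, delta_exp, weights):
    """Correlate the weighted vectors w_r * Delta_f (assumed form of R)."""
    weighted_mod = [w * d for w, d in zip(weights, delta_mod)]
    weighted_exp = [w * d for w, d in zip(weights, delta_exp)]
    return pearson(weighted_mod, weighted_exp), weighted_mod, weighted_exp

def shuffle_p_value(weighted_mod, weighted_exp, n_shuffles=10000, seed=0):
    """Fraction of shuffles whose |R| is at least the observed |R|.

    Shuffling one vector breaks the pairing, giving a null distribution
    of R centred on zero. The add-one correction is an assumption.
    """
    rng = random.Random(seed)
    observed = abs(pearson(weighted_mod, weighted_exp))
    shuffled = list(weighted_exp)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(weighted_mod, shuffled)) >= observed:
            count += 1
    return (count + 1) / (n_shuffles + 1)
```

Calling `weighted_r` first and passing its second and third return values to `shuffle_p_value` keeps the weighting and the significance test operating on the same weighted data, as the text describes.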
Blocking paradigm
Blocking experiments were simulated by pairing a CS, X, with rewards drawn from a Gaussian distribution, \({\mathcal{N}}\left(1,0.1\right)\) for 10 trials, followed by 10 trials in which a compound stimulus, XY, was paired with rewards drawn from the same distribution. After conditioning, a test phase comprised two trials in which the two available options were Y or null, whereby the null option elicited an RP equal to zero. Rewards drawn from \({\mathcal{N}}\left(0,0.1\right)\) were provided in each test trial. Performance indices (PIs) were computed in the same way as for the comparison between models and experimental data, using 20 batches of 50 simulation runs, yielding 20 PIs. Here, however, n_{+} denotes the number of choices for cue Y and n_{−} for the null option. The two stimuli, X and Y, were represented by responses in two, nonoverlapping subsets of 20 KCs each. When either stimulus was presented alone (X during the first conditioning phase, Y during the test phase), 10 KCs in the corresponding subset were activated. During the compound training phase, each stimulus, i, was independently corrupted by silencing each active KC with a probability \({p}_{{\rm{cor}}}^{i}\). For each KC silenced, a previously silent KC was activated, but only within the subpopulation corresponding to that stimulus, thus ensuring that both stimuli remained nonoverlapping. The KC responses to each individual stimulus were then added for the compound XY stimulus.
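The construction of the KC representations for the blocking paradigm can be sketched in Python as follows. The subset size (20 KCs) and the number of active KCs (10) come from the text; the binary coding of KC responses, the index layout (X in cells 0 to 19, Y in cells 20 to 39), and all function names are assumptions made for illustration and may differ from the published code.

```python
import random

N_PER_SUBSET = 20  # KCs in each stimulus-specific subset (from the text)
N_ACTIVE = 10      # active KCs per stimulus (from the text)

def make_stimulus(offset, rng):
    """Binary KC pattern with N_ACTIVE active cells inside one subset."""
    pattern = [0] * (2 * N_PER_SUBSET)
    for k in rng.sample(range(offset, offset + N_PER_SUBSET), N_ACTIVE):
        pattern[k] = 1
    return pattern

def corrupt(pattern, offset, p_cor, rng):
    """Silence each active KC with probability p_cor and, for each one
    silenced, activate a previously silent KC from the same subset."""
    subset = range(offset, offset + N_PER_SUBSET)
    active = [k for k in subset if pattern[k] == 1]
    silent = [k for k in subset if pattern[k] == 0]
    silenced = [k for k in active if rng.random() < p_cor]
    recruited = rng.sample(silent, len(silenced))
    corrupted = list(pattern)
    for k in silenced:
        corrupted[k] = 0
    for k in recruited:
        corrupted[k] = 1
    return corrupted

def compound(x, y):
    """Add the KC responses to the two stimuli to form the XY compound."""
    return [a + b for a, b in zip(x, y)]

rng = random.Random(1)
x = make_stimulus(0, rng)              # X occupies KCs 0..19
y = make_stimulus(N_PER_SUBSET, rng)   # Y occupies KCs 20..39
xy = compound(corrupt(x, 0, 0.5, rng), corrupt(y, N_PER_SUBSET, 0.5, rng))
```

Because replacements are drawn only from cells that were silent before corruption, each corrupted stimulus keeps exactly 10 active KCs and never spills into the other stimulus's subset.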
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All experimental data in Fig. 5 and Supplementary Figs. 7, 8 were lifted from figures in the cited publications. No additional experimental data were generated in this work^{75}. Source data are provided with this paper.
Code availability
All of the code that was used for running simulations and analysing data is available in the archived GitHub repository https://doi.org/10.5281/zenodo.4531420^{75}. The most recent version of this code can be found at: https://github.com/BrainsOnBoard/paper_RPEs_in_drosophila_mb.
References
Bush, R. R. & Mosteller, F. A mathematical model for simple learning. Psychol. Rev. 58, 313–323 (1951).
Rescorla, R. A. & Wagner, A. R. A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In Classical conditioning II: current research and theory, 64–99 (eds Black, A. H. & Prokasy, W. F.) (Appleton-Century-Crofts, 1972).
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Schultz, W. Neuronal reward and decision signals: from theories to data. Physiol. Rev. 95, 853–951 (2015).
Waddell, S. Reinforcement signalling in Drosophila; dopamine does it all after all. Curr. Opin. Neurobiol. 23, 324–329 (2013).
Kamin, L. ‘Attention-like’ processes in classical conditioning. In Miami symposium on the Prediction of Behaviour: Aversive stimulation, 9–33 (ed Jones, M.) (Miami University Press, 1968).
Kamin, L. Predictability, surprise, attention, and conditioning. In Punishment and aversive behavior (eds Church, R. & Campbell, B.) (Appleton-Century-Crofts, 1969).
Cohn, R., Morantte, I. & Ruta, V. Coordinated and compartmentalized neuromodulation shapes sensory processing in Drosophila. Cell 163, 1742–1755 (2015).
Hige, T. et al. Heterosynaptic plasticity underlies aversive olfactory learning in Drosophila. Neuron 88, 985–998 (2015).
Owald, D. et al. Activity of defined mushroom body output neurons underlies learned olfactory behavior in Drosophila. Neuron 86, 417–427 (2015).
Schwaerzel, M. et al. Dopamine and octopamine differentiate between aversive and appetitive olfactory memories in Drosophila. J. Neurosci. 23, 10495–10502 (2003).
Mao, Z. & Davis, R. L. Eight different types of dopaminergic neurons innervate the Drosophila mushroom body neuropil: anatomical and physiological heterogeneity. Front. Neural Circuits 3, 1–17 (2009).
Burke, C. J. et al. Layered reward signalling through octopamine and dopamine in Drosophila. Nature 492, 433–437 (2012).
Liu, C. et al. A subset of dopamine neurons signals reward for odour memory in Drosophila. Nature 488, 512–516 (2012).
Aso, Y. et al. Mushroom body output neurons encode valence and guide memory-based action selection in Drosophila. eLife 3, e04580 (2014).
Perisse, E. et al. Aversive learning and appetitive motivation toggle feedforward inhibition in the Drosophila mushroom body. Neuron 90, 1086–1099 (2016).
Ichinose, T. et al. Reward signal in a recurrent circuit drives appetitive long-term memory formation. eLife 4, e10719 (2015).
Felsenberg, J., Barnstedt, O., Cognigni, P., Lin, S. & Waddell, S. Re-evaluation of learned information in Drosophila. Nature 544, 240–244 (2017).
Felsenberg, J. et al. Integration of parallel opposing memories underlies memory extinction. Cell 175, 709–722.e15 (2018).
Zhao, X., Lenek, D., Dag, U., Dickson, B. J. & Keleman, K. Persistent activity in a recurrent circuit underlies courtship memory in Drosophila. eLife 7, e31425 (2018).
Aso, Y. et al. The neuronal architecture of the mushroom body provides a logic for associative learning. eLife 3, e04577 (2014).
Bazhenov, M., Huerta, R. & Smith, B. H. A computational framework for understanding decision making through integration of basic learning rules. J. Neurosci. 33, 5686–5697 (2013).
Peng, F. & Chittka, L. A simple computational model of the bee mushroom body can explain seemingly complex forms of olfactory learning and memory. Curr. Biol. 27, 224–230 (2017).
Cope, A. J. et al. Abstract concept learning in a simple neural network inspired by the insect brain. PLOS Comput. Biol. 14, e1006435 (2018).
Saumweber, T. et al. Functional architecture of reward learning in mushroom body extrinsic neurons of larval Drosophila. Nat. Commun. 9, 1–19 (2018).
Eschbach, C. et al. Recurrent architecture for adaptive regulation of learning in the insect brain. Nat. Neurosci. 23, 544–555 (2020).
Lin, S. et al. Neural correlates of water reward in thirsty Drosophila. Nat. Neurosci. 17, 1536–1542 (2014).
Aso, Y. & Rubin, G. M. Dopaminergic neurons write and update memories with cell-type-specific rules. eLife 5, e16135 (2016).
Bouzaiane, E., Trannoy, S., Scheunemann, L., Plaçais, P.-Y. & Preat, T. Two independent mushroom body output circuits retrieve the six discrete components of Drosophila aversive memory. Cell Rep. 11, 1280–1292 (2015).
Séjourné, J. et al. Mushroom body efferent neurons responsible for aversive olfactory memory retrieval in Drosophila. Nat. Neurosci. 14, 903–910 (2011).
Frémaux, N. & Gerstner, W. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circuits 9, 85 (2015).
Huetteroth, W. et al. Sweet taste and nutrient value subdivide rewarding dopaminergic neurons in Drosophila. Curr. Biol. 25, 751–758 (2015).
Yamagata, N. et al. Distinct dopamine neurons mediate reward signals for short and long-term memories. Proc. Natl Acad. Sci. U.S.A. 112, 578–583 (2015).
Cervantes-Sandoval, I., Phan, A., Chakraborty, M. & Davis, R. L. Reciprocal synapses between mushroom body and dopamine neurons form a positive feedback loop required for learning. eLife 6, e23789 (2017).
Yamagata, N., Hiroi, M., Kondo, S., Abe, A. & Tanimoto, H. Suppression of dopamine neurons mediates reward. PLOS Biol. 14, e1002586 (2016).
Plaçais, P.-Y., Trannoy, S., Friedrich, A. B., Tanimoto, H. & Preat, T. Two pairs of mushroom body efferent neurons are required for appetitive long-term memory retrieval in Drosophila. Cell Rep. 5, 769–780 (2013).
Riemensperger, T., Völler, T., Stock, P., Buchner, E. & Fiala, A. Punishment prediction by dopaminergic neurons in Drosophila. Curr. Biol. 15, 1953–1960 (2005).
Claridge-Chang, A. et al. Writing memories with light-addressable reinforcement circuitry. Cell 139, 405–415 (2009).
Aso, Y. et al. Specific dopaminergic neurons for the formation of labile aversive memory. Curr. Biol. 20, 1445–1451 (2010).
Smith, B. H. An analysis of blocking in odorant mixtures: An increase but not a decrease in intensity of reinforcement produces unblocking. Behav. Neurosci. 111, 57–69 (1997).
Gerber, B. & Ullrich, J. No evidence for olfactory blocking in honeybee classical conditioning. J. Exp. Biol. 202, 1839–1854 (1999).
Brembs, B. & Heisenberg, M. Conditioning with compound stimuli in Drosophila melanogaster in the flight simulator. J. Exp. Biol. 204, 2849–2859 (2001).
Guerrieri, F., Lachnit, H., Gerber, B. & Giurfa, M. Olfactory blocking and odorant similarity in the honeybee. Learn. Mem. 12, 86–95 (2005).
Broome, B. M., Jayaraman, V. & Laurent, G. Encoding and decoding of overlapping odor sequences. Neuron 51, 467–482 (2006).
Honegger, K. S., Campbell, R. A. & Turner, G. C. Cellular-resolution population imaging reveals robust sparse coding in the Drosophila mushroom body. J. Neurosci. 31, 11772–11785 (2011).
Shen, K., Tootoonian, S. & Laurent, G. Encoding of mixtures in a simple olfactory system. Neuron 80, 1246–1262 (2013).
Pearce, J. M. A model for stimulus generalization in Pavlovian conditioning. Psychol. Rev. 94, 61–73 (1987).
Wagner, A. R. Context-sensitive elemental theory. Q. J. Exp. Psychol. B 56 B, 7–29 (2003).
Möller, M. & Bogacz, R. Learning the payoffs and costs of actions. PLOS Comput. Biol. 15, e1006285 (2019).
Dylla, K. V., Raiser, G., Galizia, C. G. & Szyszka, P. Trace conditioning in Drosophila induces associative plasticity in mushroom body Kenyon cells and dopaminergic neurons. Front. Neural Circuits 11, 42 (2017).
Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
Sutton, R. S. & Barto, A. G. Reinforcement learning: an introduction (MIT Press, Cambridge, MA, 2018), 2nd edn.
Tabone, C. J. & De Belle, J. S. Second-order conditioning in Drosophila. Learn. Mem. 18, 250–253 (2011).
König, C., Khalili, A., Niewalda, T., Gao, S. & Gerber, B. An optogenetic analogue of second-order reinforcement in Drosophila. Biol. Lett. 15, 9–13 (2019).
Schultz, W. & Romo, R. Responses of nigrostriatal dopamine neurons to high-intensity somatosensory stimulation in the anesthetized monkey. J. Neurophysiol. 57, 201–217 (1987).
Ungless, M. A., Magill, P. J. & Bolam, J. P. Uniform inhibition of dopamine neurons in the ventral tegmental area by aversive stimuli. Science 303, 2040–2042 (2004).
Matsumoto, M. & Hikosaka, O. Two types of dopamine neuron distinctly convey positive and negative motivational signals. Nature 459, 837–841 (2009).
Takemura, S.-Y. et al. A connectome of a learning and memory center in the adult Drosophila brain. eLife 6, e26975 (2017).
Eichler, K. et al. The complete connectome of a learning and memory centre in an insect brain. Nature 548, 175–182 (2017).
Davis, R. L. & Zhong, Y. The biology of forgetting—a perspective. Neuron 95, 490–503 (2017).
Tobler, P. N., Fiorillo, C. D. & Schultz, W. Adaptive coding of reward value by dopamine neurons. Science 307, 1642–1645 (2005).
Tanimoto, H., Heisenberg, M. & Gerber, B. Event timing turns punishment to reward. Nature 430, 983 (2004).
Handler, A. et al. Distinct dopamine receptor pathways underlie the temporal sensitivity of associative learning. Cell 178, 60–75.e19 (2019).
Cassenaer, S. & Laurent, G. Hebbian STDP in mushroom bodies facilitates the synchronous flow of olfactory information in locusts. Nature 448, 709–713 (2007).
Cassenaer, S. & Laurent, G. Conditional modulation of spike-timing-dependent plasticity for olfactory learning. Nature 482, 47–51 (2012).
Klopf, A. H. Brain Function and Adaptive Systems: A Heterostatic Theory. Technical Report AFCRL-72-0164 (Air Force Cambridge Research Laboratories, 1972).
Cognigni, P., Felsenberg, J. & Waddell, S. Do the right thing: neural network mechanisms of memory formation, expression and update in Drosophila. Curr. Opin. Neurobiol. 49, 51–58 (2017).
Soto, F. A., Gershman, S. J. & Niv, Y. Explaining compound generalization in associative and causal learning through rational principles of dimensional generalization. Psychol. Rev. 121, 526–558 (2014).
Soto, F. A. Contemporary associative learning theory predicts failures to obtain blocking: Comment on Maes et al. (2016). J. Exp. Psychol. Gen. 147, 597–602 (2018).
Terao, K., Matsumoto, Y. & Mizunami, M. Critical evidence for the prediction error theory in associative learning. Sci. Rep. 5, 1–8 (2015).
Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–536 (1952).
Wang, Y. et al. Stereotyped odor-evoked activity in the mushroom body of Drosophila revealed by green fluorescent protein-based Ca2+ imaging. J. Neurosci. 24, 6507–6514 (2004).
Turner, G. C., Bazhenov, M. & Laurent, G. Olfactory representations by Drosophila mushroom body neurons. J. Neurophysiol. 99, 734–746 (2008).
Lin, A. C., Bygrave, A. M., de Calignon, A., Lee, T. & Miesenböck, G. Sparse, decorrelated odor coding in the mushroom body enhances learned odor discrimination. Nat. Neurosci. 17, 559–568 (2014).
Bennett, J.E.M., Philippides, A. & Nowotny, T. Learning with reinforcement prediction errors in a model of the Drosophila mushroom body. https://github.com/BrainsOnBoard/paper_RPEs_in_drosophila_mb (2021).
Acknowledgements
Special thanks to Eleni Vasilaki for helpful discussions and feedback on the mathematical formulations, James Marshal for feedback on the paper, and the Waddell and Vogels labs for fruitful discussions on learning in Drosophila. Thanks also to the members of the Brains on Board team for their critical feedback at various points throughout this project. This work was funded by the EPSRC (Brains on Board project, grant number EP/P006094/1).
Author information
Contributions
J.E.M.B. conceived the model, wrote the code, generated and analysed the data. J.E.M.B. and T.N. conceived the reinforcement schedules and ideal agents to test the models. J.E.M.B., A.P. and T.N. wrote and revised the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Communications thanks Maxim Bazhenov and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bennett, J.E.M., Philippides, A. & Nowotny, T. Learning with reinforcement prediction errors in a model of the Drosophila mushroom body. Nat Commun 12, 2569 (2021). https://doi.org/10.1038/s41467-021-22592-4