Acetylcholine-modulated plasticity in reward-driven navigation: a computational study

Neuromodulation plays a fundamental role in the acquisition of new behaviours. In previous experimental work, we showed that acetylcholine biases hippocampal synaptic plasticity towards depression, and the subsequent application of dopamine can retroactively convert depression into potentiation. We also demonstrated that incorporating this sequentially neuromodulated Spike-Timing-Dependent Plasticity (STDP) rule in a network model of navigation yields effective learning of changing reward locations. Here, we employ computational modelling to further characterize the effects of cholinergic depression on behaviour. We find that acetylcholine, by allowing learning from negative outcomes, enhances exploration over the action space. We show that this results in a variety of effects, depending on the structure of the model, the environment and the task. Interestingly, sequentially neuromodulated STDP also yields flexible learning, surpassing the performance of other reward-modulated plasticity rules.

in the third trial and so forth. The first rewarded trial is distributed as a discrete uniform random variable over the interval [1,8] (Fig. 1B.iv, filled circles). Numerical simulations of agents with cholinergic depression (+ACh) appear to closely match the theoretical distribution ( Fig. 1B.iv, histogram) which means that the agent never takes more than 8 trials to find the reward. Systematic exploration leads to the reward faster, thus enhancing the overall performance ( Fig. 1B.v). Systematic exploration is further shown in a different test experiment, where we use an identical but completely unrewarded task. In this case, all the +ACh agents simulated manage to fully explore the environment by trial 8, whereas approximately half of the −ACh agents need more than 20 trials to visit all arms (M = 10000 simulations, Fig. 1B.vi).
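The two theoretical distributions used for comparison in the radial maze can be written down directly. The sketch below assumes that a −ACh agent picks one of the eight arms independently and uniformly on every trial (geometric distribution with p = 1/8), whereas a +ACh agent never revisits an unrewarded arm (discrete uniform distribution on [1,8]). The function names are illustrative and are not taken from the model code.

```python
def geometric_cdf(k, n_arms=8):
    """P(first rewarded trial <= k) when every trial is an independent uniform pick."""
    p = 1.0 / n_arms
    return 1.0 - (1.0 - p) ** k


def uniform_cdf(k, n_arms=8):
    """P(first rewarded trial <= k) when unrewarded arms are never revisited."""
    return min(k, n_arms) / n_arms


if __name__ == "__main__":
    # Compare the two cumulative distributions at a few trial numbers.
    for k in (1, 4, 8, 20):
        print(k, round(geometric_cdf(k), 3), round(uniform_cdf(k), 3))
```

Under these assumptions the uniform case reaches 100% at trial 8, while the geometric case is only at about 66% by the same trial.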
T-maze: continuous model. The radial arm maze example is a very simple model of navigation that could be practically reduced to a single decision-making problem. We therefore move to a more detailed, but similar, model (Methods). The basic structure of the network was kept unchanged, but new features were introduced, in particular: i) infinite possible positions for the agent (inside the maze), ii) infinite possible actions, and iii) online decision-making at each timestep.
In this task, the maze has the shape of a T (Fig. 1C.i-ii). There are N = 441 place cells distributed as a grid along the stem and the arms. The position of the agent is represented by a vector of its Cartesian coordinates and can be anywhere in the maze (a continuous state space). Each action neuron represents a different direction, and the action neurons are arranged in a winner-take-all fashion, as before. Decisions are taken at every timestep. The direction and speed of each move are taken to be the average of the action neurons' directions, weighted by their firing rates. This means that any arbitrary direction can be chosen (continuous action space). The recurrent connectivity ensures that only neurons with similar orientations are active at the same time. This coherent bump of activity creates smooth and consistent trajectories. The agent is confined to the maze: if it tries to cross the boundaries, it instantly bounces back in the opposite direction (it turns by 180 degrees).

Figure 1. (A) (i) Sequentially neuromodulated spike-timing-dependent plasticity rule. When acetylcholine is present at the synapse, the plasticity window (black) is negative and symmetric (i.e. the weight changes proportionally to the lag between the spikes, irrespective of their order). If dopamine is added, the plasticity window converts to positive (red). (ii) Schematic of a neural network model of a reward-driven navigation task. A place cell is connected to action neurons, which inhibit each other following a winner-take-all scheme. (B) Radial maze: (i-ii) Example trajectories. The maze consists of eight arms; the reward is located in the upper-central arm (with the star). (i) The agent without cholinergic depression (−ACh, green) visits the same unrewarded arm more than once. (ii) The agent with cholinergic depression (+ACh, brown) explores the maze in a systematic way: it excludes unrewarded arms and finds the rewarded arm sooner. (iii) Percent cumulative distribution of the first rewarded trial (histogram) and the corresponding theoretical distribution (geometric distribution with p = 1/8; filled circles) for simulations without cholinergic depression. (iv) Percent cumulative distribution of the first rewarded trial (histogram) and the corresponding theoretical distribution (discrete uniform distribution on [1,8]; filled circles) for simulations with cholinergic depression. (v) Percentage of successful simulations over consecutive trials for −ACh and +ACh agents (solid lines). Theoretical learning curves, assuming one-shot learning (cumulative distribution of the first successful trial; dashed lines with dots). (vi) Agents navigating the environment without any reward. Percent cumulative distribution of the trial when the maze is fully explored (green: −ACh, brown: +ACh). (C) T-maze: (i-ii) Example trajectories. The maze has two arms; the reward is located in the right arm (with the star). (i) Without cholinergic interaction (−ACh), the agent consistently goes to the same unrewarded arm. (ii) With cholinergic interaction (+ACh), the agent finds the rewarded arm sooner. (iii) Percent cumulative distribution of the first rewarded trial (histogram) and the corresponding theoretical distribution (geometric distribution with p = 1/2; filled circles) for simulations without cholinergic depression. (iv) Percent cumulative distribution of the first rewarded trial (histogram) and the corresponding theoretical distribution (discrete uniform distribution on [1,2]; filled circles) for agents with cholinergic depression. The empirical distribution approximates the theoretical one, but does not match it exactly. (v) Percentage of successful simulations over consecutive trials for −ACh and +ACh agents (solid lines). The theoretical learning curves (assuming one-shot learning) are the cumulative distribution of the first successful trial (dashed lines with dots). (vi) Agents navigating the environment without any reward. The graph shows the percent cumulative distribution of the trial when the maze is fully explored (green: −ACh, brown: +ACh).
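The action readout described above (a firing-rate-weighted average of the action neurons' directions) can be illustrated with a population-vector sketch. It assumes that each action neuron has a preferred angle and contributes a unit vector in that direction; the function name, the number of neurons and the zero-activity convention are hypothetical and only serve the illustration.

```python
import numpy as np


def decode_move(rates, preferred_angles):
    """Firing-rate-weighted average of the action neurons' unit direction vectors.

    Returns (angle, speed): the angle of the averaged vector and its length.
    The length is 1 when a single direction dominates and shrinks as the
    activity spreads over dissimilar directions.
    """
    rates = np.asarray(rates, dtype=float)
    angles = np.asarray(preferred_angles, dtype=float)
    total = rates.sum()
    if total == 0.0:
        return 0.0, 0.0  # no activity: no movement (assumed convention)
    units = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    move = (rates[:, None] * units).sum(axis=0) / total
    return float(np.arctan2(move[1], move[0])), float(np.hypot(move[0], move[1]))
```

With four neurons tuned to 0, π/2, π and 3π/2, equal activity in the first two yields a move at angle π/4, a direction that no single neuron encodes.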
In order to test the effects of acetylcholine on systematic exploration in this model, we use a task similar to the radial maze. Each trial starts with the agent in the stem and ends when the agent enters one of the arms (or when a time limit has passed). A reward is placed in one of the arms (e.g. right arm in Fig. 1C.i-ii). The agent has to discover it and learn how to reach it.
While cholinergic depression still makes the discovery of the reward faster (Fig. 1C.i-iv) and achieves a better performance on average (Fig. 1C.v; M = 1000 simulations), the empirical distribution does not match the theoretical distribution perfectly (~Uniform[1,2]; Fig. 1C.iv, +ACh). When the reward is removed, full exploration of the environment is still aided by cholinergic depression: the agent fully explores the maze in just two trials in about 85% of the simulations (Fig. 1C.vi). Exploration of the environment is therefore more systematic with acetylcholine, but not perfect. A clear-cut exclusion of wrong choices was easier to obtain in the radial maze, where there was a one-to-one correspondence between the arm and the synapse. Here in the T-maze, which has more decision points and continuous actions, more complex dynamics come into play. It is still possible to suppress wrong actions (mainly because the geometry of the maze translates into a sort of discretization of the action space), but it is difficult to achieve the same level of precision (Table 1). This concept will become clearer in the following sections, where we investigate this mechanism further by changing the maze to an open field. In an open field, no discretization is possible.

Cholinergic depression enhances exploration of the action space.
Learning in an open field. The influence of cholinergic depression on systematic exploration seems to be more complex in the continuous model. In order to study this in more detail, we choose an environment where the agent can move more freely: the open field. We model the field as a square, with place cells evenly distributed over the entire area. The task is analogous to the previous ones: the agent starts each trial in the centre of the square, and has to find and learn how to reach the reward location (circle in the top right corner of the field; Fig. 2A.i-ii).
Again, due to the retroactive effect of dopamine, the vast majority of the agents are able to learn the task and navigate to the reward ( Fig. 2A.iv; M = 1000 simulations under each condition) increasingly faster ( Fig. 2A.v). Since dopamine affects synapses through an eligibility trace that decays over time, actions that are closer in time to reward delivery are reinforced more. Even though agents do not always pick the optimal path ( Fig. 2A.i-ii), they do develop a preference for shorter paths by an amount that depends on the time constant of the eligibility trace. Thanks to cholinergic depression, +ACh agents learn to avoid unrewarding paths; this leads to increased precision in navigation and a marginal improvement in performance ( Fig. 2A.iv). Unlike previous tasks, cholinergic depression does not provide any advantage in reward discovery ( Fig. 2A.iii). Thus, in this particular environment, cholinergic depression does not affect systematic exploration (Table 1).
Exploration in an open field. We decide to further investigate the exploration patterns of our models. We remove the reward, and let the agent explore. As a proxy measure of the patterns of exploration over the open field, we take the place cells' mean firing rates (average across time and simulations; Fig. 2B.i-iv). Once normalized to 1, place cells' activity can be thought of as a probability distribution over the open field. This provides us with a proxy for establishing where in the field the average −ACh and +ACh agents spend the longest time. The patterns of exploration are indeed altered by cholinergic interaction: whilst +ACh agents spend more time around the centre of the field (starting position; Fig. 2B.ii) on average, −ACh agents tend to stay closer to the boundaries ( Fig. 2B.i). In order to quantify the amount of exploration, we want to calculate how closely the distribution under the two conditions (−ACh and +ACh) approximates a benchmark distribution for a uniformly random exploration of the environment. The benchmark distribution was calculated by sampling (M = 1000) random locations inside the open field for the duration of a trial, and provides a benchmark measure for random exploration of the environment (Benchmark Exploration over the Environment, BEE; Fig. 2B.iii). We use the Kullback-Leibler divergence (KL) as a metric to quantify the difference between distributions (the more different, the higher the KL divergence). Then, we calculate the KL divergence between the distributions under either condition (−ACh and +ACh) and the benchmark (BEE). The average agent explores the environment more evenly without cholinergic interaction (KL(−ACh||BEE) = 0.03; KL(+ACh||BEE) = 0.07). 
However, averaging over all simulations provides only limited information about the behaviour of a single realization (as an extreme example, it could happen that each agent explores only one of the corners for the entire duration of the trial, but that it chooses one of the four corners with equal probability). We therefore also calculate the KL divergence between the output of each simulation and the benchmark (Fig. 2B.v). According to this analysis, acetylcholine seems to modestly enhance exploration. The reason for this discrepancy is that, without cholinergic depression, there is no way of suppressing unrewarding choices. Nothing prevents −ACh agents from spending a long time in the same area at the boundaries, whereas cholinergic depression encourages +ACh agents to change direction. As a consequence, −ACh agents bounce against the walls of the maze significantly more often than +ACh agents (Fig. 2B.vi).  Because of the winner-take-all connectivity of the action neurons, agents tend to keep their action choice constant and thus follow straight lines. However, acetylcholine depresses active synapses, which correspond to the currently winning action. +ACh agents are thus encouraged to pick a different action in consecutive timesteps and change direction more often (Fig. 2B.vii). This translates into more circular trajectories that tend to be focused around the centre of the maze rather than the boundaries (Fig. 2B.ii).
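The difference between the two analyses (KL divergence of the averaged distribution versus KL divergence of each simulation) can be made concrete with a short sketch. It assumes that the place cells' mean firing rates are stored as an array with one row per simulation, and that the natural logarithm is used; the function names and the array layout are illustrative choices, not the paper's implementation.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two non-negative arrays, each normalized to sum to 1."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


def compare_to_benchmark(rates, benchmark):
    """rates: (simulations, place cells). Returns (KL of the average, KL per simulation)."""
    rates = np.asarray(rates, dtype=float)
    kl_of_average = kl_divergence(rates.mean(axis=0), benchmark)
    kl_per_simulation = np.array([kl_divergence(r, benchmark) for r in rates])
    return kl_of_average, kl_per_simulation
```

In the extreme example mentioned in the text, where each agent stays in a single corner and the four corners are chosen equally often, the averaged distribution matches a uniform benchmark (KL = 0) even though every single simulation is far from it (KL = ln 4 in a four-cell toy field).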

We can understand these findings by considering that cholinergic depression makes the postsynaptic activity (and therefore the winning action neurons) more variable, thereby enhancing exploration over the action space. This, however, does not translate into increased exploration over the open field. In order to confirm this, we use another benchmark distribution, this time as a proxy for exploration over the action space (Benchmark Exploration over the Action space, BEA). In each benchmark simulation (M = 1000 simulations), the position of the agent is initialized at the centre of the field. From there, every action is taken completely at random: the angle of the direction is chosen from a uniform distribution over [0, 2π], while the speed is kept fixed (random walk; Methods). Place cells' activity shows a high peak around the initial position (Fig. 2B.iv), meaning that the average benchmark agent does not move very far. As expected, the distribution of place cells' activity in the +ACh simulations (Fig. 2B.ii) is more similar to this benchmark distribution than that of the −ACh simulations (KL(−ACh||BEA) = 13.12, Fig. 2B.i; KL(+ACh||BEA) = 9.7, Fig. 2B.ii). In conclusion, acetylcholine enhances exploration over the action space but not necessarily over the environment (Table 1).

Cholinergic depression improves performance in dynamic environments.
Relearning in an open field. We have shown that cholinergic depression allows the agent to learn from negative outcomes and increases exploration over the action space. These characteristics suggest that cholinergic depression might be especially advantageous in dynamic environments. We consider a task in which, after 20 initial trials where the agent learns how to navigate to the reward (Fig. 2A), the reward is moved to a new location. In our case, it is moved to the opposite corner (Fig. 3A.i-ii). +ACh agents discover the new reward location in fewer trials, while as many as one out of four −ACh agents cannot find it before the end of the experiment (Fig. 3A.iii) 22. In addition, the +ACh agents show better task performance than the −ACh agents (96.8% correct versus 63% correct; Fig. 3A.v). −ACh agents mostly just extend the previously learned path (Fig. 3A.i), whereas +ACh agents stop visiting the old reward location altogether (Fig. 3A.ii,iv). This results in a difference in the time to navigate to the new reward location (Fig. 3A.vi). Even with the addition of noise in the neural activity (Supplementary Fig. 1) and in the weights (Supplementary Fig. 2), −ACh agents cannot achieve the same degree of behavioural flexibility (Table 2).
There are two main mechanisms underlying the behavioural flexibility exhibited by +ACh agents (Fig. 3B). On the one hand, cholinergic depression decreases the strength of synapses associated with those actions that are no longer rewarding (Trial 26, lower weights in the upper-right corner). On the other hand, retroactive dopaminergic potentiation allows the agent to learn new sequences of actions that lead to the reward (Trial 40, higher weights in the bottom-left corner). Behaviourally, this results in the extinction of the previously learned path and the acquisition of a newly rewarding path. Thanks to dopaminergic potentiation, −ACh agents can also learn the path to the new reward location (Trials 26 and 40, weights in the bottom-left corner are higher than average), but they cannot unlearn the old reward location, which remains the agents' most followed path (highest weights in the top-right corner; Fig. 3A.iv). It is worth noting here that acetylcholine affects all synapses, not only the ones that had been previously potentiated. For example, in the first part of the experiment, +ACh agents learn to navigate to the reward (Trial 21, higher weights in the top-right corner) but also to avoid unrewarding paths (Trial 21, lower weights in the bottom-left corner). This increases the precision of +ACh agents and improves performance (Fig. 2A.iv). For this reason, we use here the generic term "unlearning" to indicate the effect of synaptic depression on any sequence of actions that becomes less likely to be chosen again.
Learning and relearning in an open field with obstacles. As mentioned earlier, the specifics of the task strongly affect the outcome. To explore this point further, we repeat the same experiment using a slightly different maze geometry. We insert two vertical obstacles in the open field, and move the reward location on the x axis (y = 0), to the right side of the obstacles, for the first part of the experiment (Fig. 4A.i-ii; obstacles = white vertical bars, reward location = black solid circle). In this case, −ACh agents initially perform better at finding the reward: 40% find the reward in the first trial, in contrast to just 30% of +ACh agents (Fig. 4A.iii). It is much easier to discover the reward when following straight lines in this particular maze geometry (even more so than in a simple open field; Fig. 4A.i-ii). Later in the experiment (Trial 20), however, agents equipped with cholinergic depression achieve a slightly higher success rate (Fig. 4A.iv), and are faster to navigate to the reward (because they do not get stuck against the walls or the obstacles; Fig. 4A.i-ii,v). The results for the second part of the experiment, when the reward is moved horizontally to the left side of the obstacles, are qualitatively similar to the open field but even more pronounced (Fig. 4B). Almost 40% of −ACh agents (39.7%) do not find the new reward before the end of the experiment (Fig. 4B.iii), and 89.2% of them still visit the old reward location in the last trial (Fig. 4B.iv). With this maze geometry, it is more difficult to extend the old path to the new reward location. More prominently than in the open field, agents with cholinergic depression are twice as successful as −ACh agents (Fig. 4B.v) and can navigate to the reward twice as fast (Fig. 4B.vi).

In the boxplots, the rectangle spans the q1 = 25th and q3 = 75th percentiles of the distribution. The line inside the rectangle is the median, and the whiskers indicate the minimum and the maximum points not considered outliers (minimum = q1 − 1.5(q3 − q1), maximum = q3 + 1.5(q3 − q1)). Points that are larger than the maximum or smaller than the minimum are outliers.

Comparison with other learning rules.
Reward-modulated STDP. Until now, we have focused on the functional role of cholinergic depression, comparing the same learning rule with and without cholinergic depression (+ACh and −ACh). We next investigate how sn-Plast compares to other reward-modulated learning rules.
To this end, we change the plasticity rule in our model to standard reward-modulated STDP (r-STDP; Fig. 5). In r-STDP, synapses follow a classical STDP rule with different amplitudes for the pre-post (A pre-post ) and post-pre (A post-pre ) windows (e.g. Fig. 5B.i). However, all synaptic changes are gated by dopamine and become effective only retroactively through an eligibility trace. If no reward is found, weights are left unchanged. Notably, if the amplitudes of the r-STDP learning window are set to A pre-post = A post-pre = +1, reward-modulated STDP is equivalent to the plasticity rule used in our control simulations (sn-Plast without acetylcholine; −ACh).
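As a rough illustration of the r-STDP rule described above, the sketch below accumulates STDP contributions in an eligibility trace and applies them only if a reward arrives. The exponential window shape, the time constants, the learning rate and the function names are all assumptions made for this example; the actual model is specified in the Methods.

```python
import numpy as np


def stdp_window(lag, a_pre_post, a_post_pre, tau=20.0):
    """Assumed exponential STDP window; lag = t_post - t_pre (ms)."""
    lag = np.atleast_1d(np.asarray(lag, dtype=float))
    amplitude = np.where(lag >= 0, a_pre_post, a_post_pre)
    return amplitude * np.exp(-np.abs(lag) / tau)


def r_stdp_update(pair_times, pair_lags, reward_time, a_pre_post, a_post_pre,
                  tau_e=2000.0, lr=0.1):
    """Weight change of one synapse in one trial.

    Each spike pair adds its STDP contribution to an eligibility trace that
    decays with time constant tau_e. The trace is converted into a weight
    change only when a reward arrives; without reward the weight is unchanged.
    """
    if reward_time is None:
        return 0.0
    times = np.atleast_1d(np.asarray(pair_times, dtype=float))
    contrib = stdp_window(pair_lags, a_pre_post, a_post_pre)
    valid = times <= reward_time  # only pairs that precede the reward count
    trace = np.sum(contrib[valid] * np.exp(-(reward_time - times[valid]) / tau_e))
    return lr * float(trace)
```

Setting both amplitudes to +1 gives a symmetric positive window, which corresponds to the −ACh control case mentioned in the text; with amplitudes of opposite sign and equal size, pre-post and post-pre pairs at the same lag cancel in this sketch.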
We then investigate how the agent performs when equipped with r-STDP. We start by testing learning in a static environment. As before, the agent moves in an open field and has 20 trials to learn to navigate to the reward location (Fig. 2A). We then run a parameter sweep, and examine how the agent's performance [...] is 37%. Using this value as baseline, we can determine whether agents learn or unlearn. The agents' performance varies as a function of the integral of the learning window: it rises above baseline for positive-integral windows (A pre-post + A post-pre > 0; the part above the diagonal, Fig. 5A.i) and falls below baseline for negative-integral windows (A pre-post + A post-pre < 0; the part under the diagonal, Fig. 5A.i). When the integral of the plasticity window is zero (A pre-post + A post-pre = 0; diagonal of the matrix, Fig. 5A.i), there is little variation from baseline. However, the performance clearly increases with the amplitude of the pre-post learning window, A pre-post (Fig. 5A.ii). This is because, in a spiking neural network, presynaptic spikes contribute to eliciting postsynaptic spikes (spike-spike correlation 35). As such, the amplitude of the pre-post window, relative to the post-pre window, brings an extra contribution to learning. We can conclude that the order of the spikes matters, although only marginally so. In our model, what really determines whether the agent learns or unlearns is the integral of the STDP window 31. We next compare four agents, equipped with different STDP windows having: i) positive integral (red), ii) negative integral (yellow), iii) zero integral and A pre-post > A post-pre (dark orange) and iv) zero integral and A pre-post < A post-pre (inverse STDP window; light orange). As expected, the best learner is the agent with the positive learning window, whereas the agent with a negative learning window effectively unlearns (Fig. 5B.ii).
There is generally very little change in performance when the integral of the STDP window is zero: if A pre-post > A post-pre , we can observe some slow learning; if A pre-post < A post-pre there is very slow unlearning instead (Fig. 5B.ii). As mentioned earlier ( Fig. 5A.ii), the spike order is still relevant, although only marginally so. This is due to spike-spike correlation 35 . These patterns remain consistent in the second part of the experiment, when the reward is moved to a different corner of the field. The agent with a positive integral learns how to navigate to the new reward location (Fig. 5B.iv) but does not really unlearn the path to the first reward (the visits to the previously rewarded location are still as high as 62.4% at trial 40; Fig. 5B.iii). The agent with a negative integral completely unlearns the path to the second reward too (Fig. 5B.iv). Agents with vanishing integrals show very little change in both learning of the new reward location and unlearning of the old one ( Fig. 5B.iii-iv).
Thus, r-STDP allows the agent to either learn or unlearn the path to the reward, depending on the integral of the learning window. This learning rule, however, appears to be quite rigid. It lacks the flexibility of sn-Plast which, thanks to the modulation of both acetylcholine and dopamine, can switch between these modalities in response to environmental changes ( Table 2). For this reason, sn-Plast is more suited to dynamic tasks that require a degree of adaptation. This analysis also shows how the spike order is only relatively important to learning. This characteristic is intrinsic to the model 31,32 and in striking agreement with the experimental data from which we derive our plasticity rule (both dopaminergic and cholinergic modulated STDP windows are symmetric and therefore invariant to spike order 21,22 ).

Dynamic reward signal.
In sn-Plast, the reward signal is binary: it is either present or not. Alternatively, we could conceive signals with more complex temporal dynamics. In particular, we want to focus here on a signal that keeps track of the history of the reward delivery. This is particularly useful in a changing environment and worth comparing to our sequentially neuromodulated plasticity rule. We employ a dynamic reward signal, ρ(tr), given by the difference between the raw reward, R(tr), and a moving average of the past rewards, $\bar{R}(tr)$, that is, $\rho(tr) = R(tr) - \bar{R}(tr)$. Synapses modulated by the dynamic reward signal ρ(tr) are updated only if the outcome of trial tr is somewhat surprising, that is, if it differs from the outcomes of the most recent trials. The effect is twofold: if the reward is reached consistently and continuously, the average reward becomes very close to the actual reward value, ρ(tr) ≈ 0, and synapses stop being potentiated; if the agent suddenly stops receiving a reward ($R(tr) < \bar{R}(tr)$, second half of the experiment), the dynamic reward signal becomes negative and synapses are depressed.
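The twofold effect of the dynamic reward signal can be reproduced with a few lines of code. The sketch below assumes that the moving average of past rewards is an exponential moving average with a timescale `tau` measured in trials, and that it starts at zero; both are assumptions for illustration, since the exact form is given in the Methods.

```python
def dynamic_reward_signals(rewards, tau=5.0, r_bar0=0.0):
    """Return rho(tr) = R(tr) - R_bar(tr) for a sequence of trial outcomes.

    R_bar is an (assumed) exponential moving average of the rewards obtained
    in earlier trials, with timescale tau expressed in trials.
    """
    r_bar = r_bar0
    rhos = []
    for r in rewards:
        rhos.append(r - r_bar)      # surprise relative to past trials only
        r_bar += (r - r_bar) / tau  # update the running average after the trial
    return rhos


if __name__ == "__main__":
    # 20 rewarded trials followed by 5 unrewarded trials at the old location.
    for trial, rho in enumerate(dynamic_reward_signals([1] * 20 + [0] * 5), start=1):
        print(trial, round(rho, 3))
```

In this toy run the signal starts at 1, decays towards 0 while the reward keeps being delivered, and turns negative as soon as the reward is omitted.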
In order to test the performance of this different learning rule, we use a task similar to the previous one. In the first half of the experiment, the agent moves in an open field and has to learn how to navigate to the reward (as in the previous task, Fig. 2A). Agents receiving a dynamic reward signal are disadvantaged in this initial learning phase. Even though they are equally fast to discover the reward (Fig. 6A.i), they are slower to learn how to reach the reward location, and they are less successful (Fig. 6A.ii). Even when they learn the path to the reward, they take longer to reach it (Fig. 6A.iii). In general, learning is less efficient when using a dynamic reward signal. In the second part of the experiment, from trial 21 to trial 40, the reward is moved to the opposite corner of the environment. Unlike before, however, a trial ends if the agent enters either one of the rewarding areas (old or new), or if a time limit is reached. The agent has to discover the novel reward location and learn a new path. Agents equipped with the dynamic reward signal plasticity rule outperform −ACh agents, but their performance is still inferior to that of +ACh agents. Only 14% of the dynamic reward signal agents do not manage to discover the reward by the end of the experiment; this is significantly better than our control simulations (−ACh, 71.9%) but worse than sn-Plast agents (+ACh, 1.6%; Fig. 6B.i). Notably, the dynamic signal allows for some unlearning of the previous reward location (Fig. 6B.ii). Approximately half of the agents can reach the novel reward location by the end of the experiment (Fig. 6B.iii), and can do so fairly quickly (Fig. 6B.iv), suggesting that they do indeed learn a completely new path. However, at least 50% of the agents still visit the old reward location by trial 40, as opposed to almost zero +ACh agents (Fig. 6B.iii). Overall, agents equipped with sn-Plast outperform agents receiving a dynamic reward signal.
The dynamic reward signal can be both positive and negative in sign, which allows for both learning and unlearning of the appropriate actions. Even though this rule provides more flexibility than classic reward-modulated STDP, the mechanisms for learning and unlearning are still tightly coupled. Weight changes are regulated by the timescale of integration of the moving average reward (Methods). In the first half of the experiment, a longer timescale leads to improved performance: the average reward $\bar{R}$ requires more successful trials to converge to R, so synapses get potentiated more (Supplementary Fig. 3A). However, a longer timescale also implies that if the reward is moved, it will take longer to depress the appropriate synapses and unlearn the unrewarding path (Supplementary Fig. 3B). The dynamic reward signal compares poorly with our sequentially neuromodulated rule: sn-Plast uses separate mechanisms to learn and unlearn, increasing flexibility and improving performance (Table 2).
Negative feedback. In sn-Plast, synapses are biased towards depression unless a reward is delivered. This kind of depression allows suppression of a previously learned sequence of actions, but it is indiscriminately persistent throughout exploration and is not specific to reward omission. Alternatively, we could imagine that a negative feedback is delivered to the synapse when the expected reward is omitted. This negative signal would retroactively depress synapses through the use of an eligibility trace, similarly to dopamine but opposite in sign. The synaptic change would then be positive if the reward is delivered (A feedback = 1) and negative if it is omitted (A feedback = −1). This feedback signal is reminiscent of a prediction error 1 , but different in that the expectations are not updated during the experiment. We thus compare sn-Plast to this model with targeted negative feedback. As before, the agent explores the open field for the first 20 trials and has to learn how to navigate to the reward. For the remaining 20 trials the reward is moved to the opposite corner of the field, and the agent has to discover it and learn the new path. Unlike before, however, a trial ends if the agent enters either one of the rewarding areas (old or new), or if a time limit is reached. Whenever the agent enters the old reward location, a negative feedback signal induces synaptic depression. Since no negative feedback is present in the first half of the task, agents with negative feedback signal perform identically to −ACh agents (Fig. 7A). The continuous updating of cholinergic depression increases the success rate (Fig. 7A.ii) and diminishes the average time to reward (Fig. 7A.iii), but it slightly deteriorates the initial exploration (it takes longer to discover the reward; Fig. 7A.i). It is in the second part of the experiment, after reward displacement, that we see the effect of the negative feedback. As expected, −ACh agents show the poorest performance (Fig. 7B). 
In contrast, agents with negative feedback signal are able to partially unlearn the previously rewarded location (Fig. 7B.ii). They can therefore also find (Fig. 7B.i) and reach (Fig. 7B.iii) the newly rewarded location more often. Nevertheless, +ACh agents still show the best results. They unlearn the old reward location completely (Fig. 7B.ii). They also find and learn the new reward location, reaching an almost perfect performance (Fig. 7B.iii). The time to the reward is also shorter for +ACh agents, whereas almost no difference can be found between the other two sets of simulations (Fig. 7B.iv).
Targeted negative feedback acts retroactively through an eligibility trace. Consequently, synapses active more recently get depressed more. This mechanism allows suppression of previously rewarded actions to some extent, but performs quite poorly when compared to cholinergic depression, at least in the current model (Table 2). Cholinergic depression acts equally on all synapses throughout exploration, and therefore offers a more powerful and direct way of unlearning.
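The contrast drawn above, between depression that reaches synapses retroactively through an eligibility trace and depression that acts equally on all active synapses, can be sketched as follows. The exponential trace, the time constant, the learning rate and both function names are hypothetical simplifications introduced for this example only.

```python
import numpy as np


def feedback_update(activation_times, outcome_time, a_feedback, tau_e=2000.0, lr=0.1):
    """Retroactive update through an (assumed exponential) eligibility trace.

    a_feedback is +1 if the reward is delivered and -1 if it is omitted.
    Synapses activated closer to the outcome change more.
    """
    t = np.asarray(activation_times, dtype=float)
    return lr * a_feedback * np.exp(-(outcome_time - t) / tau_e)


def cholinergic_update(activation_times, lr=0.1):
    """Depression applied at the time of activity: same size for every activation."""
    t = np.asarray(activation_times, dtype=float)
    return -lr * np.ones_like(t)
```

For synapses activated 2000, 1000 and 0 ms before an omitted reward, the feedback rule in this sketch depresses the earliest one by only about a third as much as the most recent one, whereas the cholinergic rule depresses all three equally.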

Discussion
In this paper we investigated the possible functional consequences of neuromodulated hippocampal STDP, based on our recent experimental findings 22 . In particular, we analyzed this plasticity rule in a network model of reward-driven navigation. Consistent with previous models, dopamine makes it possible to learn the path to the reward 23,24,27 . Acetylcholine, instead, allows learning from negative outcomes. This yields behavioural flexibility and is particularly useful in dynamic environments, where it is necessary to both learn and unlearn in a task-relevant manner. In a simple model with discrete state and action space, cholinergic depression allows suppression of unrewarding choices and systematic exploration of the maze. In more complex continuous models, it enhances exploration over the action space, but this does not necessarily translate into increased exploration over the entire maze.
Dopamine. Dopamine is thought to signal reward delivery and reinforce behaviour 1-3 . The behavioural [36][37][38] and algorithmic 39 mechanisms of reward-modulated learning have been thoroughly investigated and characterized. Recently, its neural substrates have been explored as well. Dopamine has been reported to shift STDP towards potentiation [18][19][20] . In this study, we build on our previous experimental findings on the retroactive effect of dopamine on hippocampal synaptic plasticity 21 . Taking inspiration from both reinforcement learning 39 and biology 40 , dopamine was theorized to act on synaptic eligibility traces. These traces keep track of and reinforce neural activity associated with distal rewards 23,24,27 . Given the similarity of sn-Plast to other reward-modulated plasticity rules 23 , our network unsurprisingly succeeds in learning rewarded patterns of activity. Nevertheless, our learning rule does differentiate itself from other reward-modulated plasticity rules because of the symmetrical positive shape of its STDP window. Consistently, in our model the integral of the learning window has greater importance than the exact spike timing 35 . In a different framework, however, spike timing could have functional roles, for instance when precise spike sequences are learned 23,24,31 .
One limitation of our model is that we assume that dopamine signals exclusively reward. As such, it would only update synaptic weights during reward delivery. However, dopamine has also been associated with spatial novelty in the hippocampus and has been shown to correlate with exploratory behaviour [41][42][43][44][45][46] . In fact, the role of dopamine as a reward signal in the hippocampus has been challenged because of the sparsity of the projections from VTA 44,47 . Nevertheless, more recent research points towards a role for dopamine in reinforcing spatial representations in the hippocampus 41 and goal-directed navigation 47,48 .
We approximate dopamine as a stable reward signal that is available globally at the synapses with every reward delivery. However, dopaminergic neurons exhibit different modes of firing, with phasic firing coding for reward prediction error. As such, dopamine is released only when the reward is unexpected 1-3 . If the animal was able to predict the reward delivery correctly, then VTA dopaminergic neurons would not increase their firing rate.
In our work, we compared sn-Plast to other plasticity rules. In particular, we explored: i) a dynamic reward signal, which carries information about the history of the past trials, and ii) a negative feedback signal, which is released selectively when the agent enters a previously rewarded location. Although clearly different, these signals are reminiscent of a prediction error. The dynamic reward signal favours synaptic updates that are "surprising" and it can be both positive and negative in sign. The negative feedback signal is completely symmetric and opposite in sign to the reward signal; as such, it constitutes a limit case of the reward prediction error (the prediction error can only be larger than or equal to the negative feedback signal). We showed that sn-Plast outperforms both plasticity rules. This is because cholinergic depression acts as a more general and separate mechanism from dopaminergic potentiation (i.e. it acts on synapses immediately, not through an eligibility trace, and acts indiscriminately on all unrewarding synapses). We could speculate that, for analogous reasons, sn-Plast would similarly outperform a prediction error signal.
Frémaux et al. proposed a spiking implementation of the prediction error in a similar reward-driven network model 32 . The agents were able to perform very complex tasks but, unfortunately, the authors did not investigate changing environments. Although we imagine that sn-Plast would outperform a prediction error-based learning rule, it could be interesting to substitute the reward signal in sn-Plast with a prediction error. Cholinergic depression might interact with the prediction error dynamics, probably affecting both exploration and performance. Interestingly, in Frémaux et al. 32 no backpropagation of the prediction error was observed. This suggests that the dynamics of the prediction error in this specific framework might be different or unexpected.
Acetylcholine. Acetylcholine is known to play an important role in learning and memory [49][50][51] . In the hippocampus, acetylcholine has been reported to facilitate both long-term potentiation 52-62 and long-term depression [63][64][65][66][67] , depending on a number of variables, such as plasticity induction protocols, acetylcholine concentrations and type of cholinergic receptors 28 . In our previous work 22 , we found that cholinergic modulation of hippocampal STDP resulted in a symmetrical negative learning window and used these data as a starting point for our investigation.
Acetylcholine has been studied in relation to behavioural tasks 50,68,69 . Microdialysis studies have reported an increase in cholinergic release in the hippocampus during engagement in spatial learning tasks 6,8,10-14 and a reduction during consummatory behaviour 4,34 . Adding this dynamic to our neuromodulated network allowed us to study the possible effects of acetylcholine on navigation and decision-making 22 . Acetylcholine has been postulated to signal novelty and saliency 6,7,9 , and was reported to enhance exploratory behaviours like rearing 8,70 . For this reason, we largely focused on characterizing the effect of acetylcholine on exploration. In our model, acetylcholine indeed increases exploration over the action space, but this does not necessarily translate into increased exploration over the entire physical environment. In addition to exploration, the effect of acetylcholine on spatial learning has also been connected to paradigm shifts and reversal learning [71][72][73] , which in turn has been shown to depend on long-term depression in hippocampal synapses 74 . This is in agreement with our observation that acetylcholine is useful in dynamic scenarios where unlearning previously learned actions is advantageous. Acetylcholine has been hypothesized to modulate learning in other computational theories before [75][76][77] . It was put forward as a signal for uncertainty in probabilistic environments 76 and a switch signal for the encoding of new information, as opposed to the consolidation of memories 75,77 . Finding a clear correspondence between these theories and the model we present here is not trivial. However, our results are consistent with previous work, in that they suggest a functional role for acetylcholine in learning which is: i) complementary to dopamine, and ii) relevant to dynamical, changing environments.

Conclusion
In conclusion, we model here a role for dopamine as a behavioural reinforcer, and propose a new role for cholinergic depression in learning from negative outcomes. Despite its simplicity, our feed-forward network captures the key characteristics of sequentially neuromodulated plasticity, allowing us to examine its potential role in reward-based navigation 22 . In addition, by allowing us to clearly examine its dynamics, it provides us with a useful tool to further investigate the relationship between synaptic and behavioural learning. The continuously updated cholinergic depression allows learning from unsuccessful trials and unlearning of previously rewarded locations, and it enhances exploration over the appropriate action space. As such, sn-Plast is an effective reward-modulated learning rule for navigation tasks.

Methods
The navigation model is based on a one-layer network 32 . The place cells in the input layer code for the position of the agent in the environment. They project to the output layer of action neurons. Each one of the action neurons represents a different direction. Lateral connectivity in this layer ensures that action neurons compete with each other in a winner-take-all scheme. Their activity is then used to determine the action (i.e. direction and velocity) to take at every instant.
Place cells. Discrete model. In the case of the radial maze, the state space is discrete and contains only one location: the centre of the maze. From there, the agent chooses to which of the eight possible arms to move. The network is therefore composed of a single place cell, active for the whole duration of the trial, simulated as a Poisson process with rate $\lambda^{pc} = 4000$ Hz.
Continuous model. The position of the agent at time t is described by the two-dimensional vector of its Cartesian coordinates, $x(t)$. There are N place cells, spread over the entire environment at a horizontal and vertical distance of σ from one another. The spiking activity of place cell i is modelled as an inhomogeneous Poisson process with position-dependent rate $\lambda_i^{pc}(x(t))$. The firing rate $\lambda_i^{pc}$ is a function of the distance of the agent from the place cell centre $x_i$. It is at its maximum, $\lambda^{pc} = 400$ Hz, when the agent is located exactly at $x_i$, and it decreases as the agent moves away. This mechanism simulates a place field in a 2D environment, which allows for an accurate representation of the position of the agent in the environment. In both models the firing rates of the place cells are taken to be very high; this is only to reduce computation time while preserving navigation accuracy.
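As an illustration of the place-cell input, the sketch below computes a distance-dependent rate and draws spikes from it. The Gaussian fall-off, the use of the grid spacing σ as the field width, and the function names are assumptions made for this example; only the peak rate of 400 Hz and the Poisson spiking are taken from the description above.

```python
import math
import random

LAMBDA_PC = 400.0  # peak place-cell rate in Hz (value from the text)
SIGMA = 0.2        # ASSUMED field width, set equal to the T-maze grid spacing


def place_cell_rate(agent_xy, centre_xy, peak=LAMBDA_PC, sigma=SIGMA):
    """Firing rate in Hz; the Gaussian tuning shape is an ASSUMPTION."""
    dx = agent_xy[0] - centre_xy[0]
    dy = agent_xy[1] - centre_xy[1]
    return peak * math.exp(-(dx * dx + dy * dy) / (sigma * sigma))


def poisson_spike(rate_hz, dt, rng=random):
    """True if a Poisson process with this rate fires within a step of dt seconds."""
    return rng.random() < 1.0 - math.exp(-rate_hz * dt)
```

The rate is maximal when the agent sits on the place-cell centre and decays with distance, so neighbouring cells on the grid have overlapping fields.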
Open T-maze -continuous model: The T-maze is cropped out from the open field plane. It is composed of a stem of length $l_{stem}$ = 3.2 a.u. and width $wd_{stem}$ = 0.6 a.u., and two arms, each having length $l_{arm}$ = 1.7 a.u. and width $wd_{arm}$ = 0.8 a.u. The agent starts every trial from the bottom of the stem: $(x_{start}, y_{start}) = (0, -2)$ a.u. In the T-maze, there are N = 441 place cells at distance σ = 0.2 from one another.
Action neurons. Neuron model. Place cells constitute the input to the network, and they all project to all action neurons with weights $w^{feed}$. These feed-forward weights are initialized to $w_{in}$ and bounded between $w_{min}$ and $w_{max}$ (see Table 3 for specific values). The feed-forward weights should be initialized roughly halfway between the minimum and the maximum value, so that both cholinergic depression and dopaminergic potentiation can have an effect on the action choice. Action neurons are also connected with each other through synaptic weights $w^{lat}$. The neurons are modelled as SRM$_0$ 33 ; the membrane potential $u_j(t)$ of neuron j is therefore determined by the spikes arriving through the feed-forward and lateral connections, where $F_i^{pc}$ and $F_k^{a}$ are sets containing respectively $t_i$ and $t_k$, the arrival times of all spikes fired by place cell i and action neuron k. Spiking behaviour is stochastic and follows an inhomogeneous Poisson process with parameter $\lambda_j(u_j(t))$, which depends on the membrane potential at time t. In this dependence, $\lambda_0$ is the maximum firing rate, Δu regulates the randomness of the spiking behaviour and θ = 16 mV is the spiking threshold. For simplicity, the resting potential is set to 0. The biologically realistic value of the membrane potential can be retrieved through a translation and does not affect the dynamics of the network 33 .
Discrete model. In the radial maze, there are only eight possible actions to take from the initial position. There are N = 8 neurons, each coding for a different arm. These neurons are connected through inhibitory synapses: $w^{lat} = -250$. This connectivity scheme ensures that, given enough time, one neuron will inhibit all others and be substantially more active. Other parameters were set to: $\lambda_0$ = 100 Hz, Δu = 0.5 mV.
Continuous model. Action neurons represent different directions in the Cartesian plane. Specifically, each action neuron j represents direction $a_j$, where $a_j = a_0 (\sin(\theta_j), \cos(\theta_j))$, with $\theta_j = \frac{2\pi j}{N}$, N = 40 and $a_0 = 0.08$. The lateral connectivity $w_{jk}^{lat}$ between action neuron k and action neuron j is defined in terms of a normalizing factor Z, the weights $w_- = -300$ and $w_+ = 100$, and a lateral connectivity function f, which is symmetric, positive and increases monotonically with the similarity of the actions. Neurons therefore excite each other when they have a similar tuning, and inhibit each other otherwise. This ensures that only a few similarly tuned action neurons are active at any given time, making the trajectory of the agent smooth and consistent. Other parameters were set to: $\lambda_0$ = 60 Hz, Δu = 2 mV.
Action selection. The action selection process determines the decision to take, based on the firing rates of the action neurons. The activity of action neuron j is approximated by filtering its spike train $Y_j$ with a kernel γ that has time constants $\tau_\gamma$ = 50 ms and $\nu_\gamma$ = 20 ms.
Discrete model. Decisions in the discrete case are taken only at the end of the trial. When the time limit $T_{max}$ = 5 s has been reached, the action neuron with the maximum firing rate is selected. In the unlikely case that two neurons exhibit exactly the same firing rate at the end of the trial, the winning neuron is chosen at random. The agent then enters the arm associated with the winning neuron. All activity is reset before the onset of the next trial.
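The end-of-trial decision in the discrete model can be sketched as follows. The function name `select_arm` and the list-based representation of the firing rates are hypothetical; the logic (maximum firing rate, random tie-breaking) follows the description above.

```python
import random


def select_arm(firing_rates, rng=random):
    """Index of the most active action neuron; exact ties are broken at random."""
    top = max(firing_rates)
    winners = [j for j, rate in enumerate(firing_rates) if rate == top]
    return rng.choice(winners)
```

For eight action neurons, `select_arm` returns an index between 0 and 7, which identifies the arm the agent enters.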
Continuous model. In the continuous case, actions are taken continuously, at every timestep t. The action selection process thus determines a(t), the action to take at time t. If each action neuron j represents direction $a_j$ and has an estimated firing rate $\rho_j(t)$, then the action a(t) is the average of all the directions encoded, weighted by their respective firing rates:

$$a(t) = \frac{1}{N} \sum_{j=1}^{N} \rho_j(t)\, a_j$$

where N = 40 is the total number of action neurons. This decision-making mechanism allows the agent to move in any direction, making the action space effectively continuous. A large number of action neurons allows for higher accuracy of navigation and action selection.
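The rate-weighted average of directions described above can be sketched as follows. The function names, the indexing of neurons from 1 to N, and the 1/N normalisation are assumptions for this example; N = 40 and a_0 = 0.08 are the values given for the continuous model.

```python
import math

N = 40      # number of action neurons (from the text)
A0 = 0.08   # step amplitude a_0 (from the text)


def action_directions(n=N, a0=A0):
    """Directions a_j = a0 * (sin(theta_j), cos(theta_j)), theta_j = 2*pi*j/n, j = 1..n."""
    return [(a0 * math.sin(2 * math.pi * j / n), a0 * math.cos(2 * math.pi * j / n))
            for j in range(1, n + 1)]


def population_vector(rates, directions):
    """Average of the encoded directions weighted by firing rates (1/N normalisation ASSUMED)."""
    n = len(directions)
    ax = sum(r * d[0] for r, d in zip(rates, directions)) / n
    ay = sum(r * d[1] for r, d in zip(rates, directions)) / n
    return ax, ay
```

When all neurons fire at the same rate the encoded directions cancel and the agent barely moves, whereas a localized bump of activity produces a step along the preferred direction of the active neurons.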
Navigation details. Continuous model. Once action a(t) has been determined, the update for the position of the agent is

$$x(t+1) = \begin{cases} x(t) + a(t) & \text{if } x(t+1) \text{ is within the boundaries} \\ x(t) + d\, u(x(t)) & \text{otherwise} \end{cases} \qquad (7)$$

The agent therefore normally moves with instantaneous velocity a(t). When the agent tries to surpass the limits of the field, it is instantly bounced back by a distance d = 0.01. The unit vector u(x(t)) points in the direction opposite to the boundary. To avoid large boundary effects, the feed-forward weights between place cells on the boundaries and action neurons that code for a direction $a_j$ outside of the field are set to zero.
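A minimal sketch of this position update is given below. It assumes a square field spanning [−2, 2] in both coordinates (`half_width = 2.0`), which is not stated explicitly here, and the function name `step` is hypothetical; the bounce distance d = 0.01 is taken from the text.

```python
def step(position, action, half_width=2.0, d=0.01):
    """One position update in a square field [-half_width, half_width]^2 (shape ASSUMED)."""
    x, y = position[0] + action[0], position[1] + action[1]
    if abs(x) <= half_width and abs(y) <= half_width:
        return (x, y)  # normal move with instantaneous velocity a(t)
    # Unit vector pointing away from the violated wall(s), back into the field.
    ux = -1.0 if x > half_width else (1.0 if x < -half_width else 0.0)
    uy = -1.0 if y > half_width else (1.0 if y < -half_width else 0.0)
    norm = (ux * ux + uy * uy) ** 0.5
    return (position[0] + d * ux / norm, position[1] + d * uy / norm)
```

In this sketch a rejected move leaves the agent a distance d further from the wall it tried to cross, rather than at the attempted position.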
The agent is free to explore the environment for a maximum duration of $T_{max}$. If it finds the reward at a time $t_{rew} < T_{max}$, the trial is terminated earlier, precisely at time $t = t_{rew} + 300$ ms. The extra time mimics consummatory behaviour; navigation is thus paused during this interval (i.e. place cell activity is set to zero). The effect of the inter-trial interval is modelled by resetting all activity.
T-maze -continuous model: When used in the task, the reward is located in the right arm of the maze. Specifically, we consider the reward to be found whenever the agent crosses the vertical line $x_r$ = 1 a.u. The maximum duration of a trial is $T_{max}$ = 5 s, but the trial ends whenever the agent enters one of the arms (that is, whenever the agent crosses either the vertical line $x_r$ = 1 or the vertical line $x_l$ = −1). When in the stem, the available actions are restricted to upward movements. When in the top part of the maze, only horizontal movements are allowed.
Open field -continuous model: For the first 20 trials, the reward can be found in the circular goal area centred at $c_1$ = (1.5, 1.5) with radius $r_1$ = 0.3. In trials 21 to 40, the goal area moves to centre $c_2$ = (−1.5, −1.5), but maintains the same shape and size. If the open field has obstacles, the agent is not allowed to cross them and is therefore pushed back, similarly to what happens with the walls. In this case, the goal area is initially centred at $c_1$ = (0, 1.5), and then moved to $c_2$ = (0, −1.5). The maximum duration of a trial is $T_{max}$ = 15 s. This value was chosen so that the agent could discover the reward in the first few trials (Fig. 2A); it is not intended to have behavioural or biological meaning.
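The open-field reward schedule without obstacles can be expressed as a short check. The function names are hypothetical; the centres, radius and trial numbers are those given above.

```python
import math


def goal_centre(trial):
    """Goal centre for trials numbered from 1: c1 for trials 1-20, c2 for trials 21-40."""
    return (1.5, 1.5) if trial <= 20 else (-1.5, -1.5)


def in_goal(position, centre, radius=0.3):
    """True if the agent lies inside the circular goal area."""
    return math.hypot(position[0] - centre[0], position[1] - centre[1]) <= radius
```

A position that is rewarded in trial 20 is therefore no longer rewarded in trial 21, which is what makes the second half of the task a test of unlearning.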
Sequentially neuromodulated plasticity (sn-Plast). The synaptic weights between place cells and action neurons play a fundamental role in defining a policy for the agent. Plasticity is essential for the agent to learn to navigate the open field and is implemented in a way that follows the experimental results presented in Brzosko et al. 2015 and 2017. The synaptic changes combine the modified STDP rule (Fig. 3) and an eligibility trace that allows for delayed updates.
In the total weight update, η is the learning rate, A emulates the effect of the different neuromodulators, W is the STDP window and ψ is the eligibility trace. $F_i^{pc}$ and $F_j^{a}$ are sets containing respectively $t_i$ and $t_j$, the arrival times of all spikes fired by place cell i and action neuron j.
The basic STDP window W has time constant τ = 10 ms. This function is always symmetric and positive, but the sign of the final weight change is determined by the neuromodulators at the synapse through A (A = −1 with acetylcholine and A = 1 with dopamine). Dopamine is assumed to be released simultaneously in all synapses whenever a reward is delivered. All weight changes are gated by neuromodulation (A = 0 when all neuromodulators are absent). The learning rate η also depends on the neuromodulator present, taking one value with acetylcholine ($\eta_{ACh}$) and another with dopamine ($\eta_{DA}$; see Table 3 for specific values). The weight change due to STDP is convolved with an eligibility trace ψ, modelled as an exponential decay. The eligibility trace keeps track of the active synapses and allows for a delayed update of the synaptic strength. The timescale of the eligibility trace $\tau_e$ determines the length of the rewarding path learned: a shorter timescale favours shorter paths. Variable α in the exponent acts as a flag and ensures that the eligibility trace is active with dopamine only (α = 1).
When no interaction with acetylcholine was assumed (−ACh), the weights were potentiated only at the end of the trial, in the case that the agent found the reward (A = 1, α = 1). They were left unchanged otherwise (A = 0). If acetylcholine was present throughout the task (+ACh), the weights were updated online (A = −1, α = 0). When no reward was found before the end of the trial, weights were depressed. Otherwise, they were potentiated retroactively (A = 1, α = 1).
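The neuromodulator schedule just described can be summarized in a short sketch. The function names, the exponential shape of the symmetric window, and the learning rate `eta = 0.01` are assumptions for illustration (the actual learning rates are listed in Table 3); the values of A and α follow the text.

```python
import math

TAU = 0.010  # STDP time constant in seconds (10 ms, from the text)


def modulation(ach_present, reward_delivered):
    """Return (A, alpha) for the current neuromodulatory state."""
    if reward_delivered:
        return (1, 1)    # dopamine: retroactive potentiation via the eligibility trace
    if ach_present:
        return (-1, 0)   # acetylcholine: online depression, no eligibility trace
    return (0, 0)        # no neuromodulator: plasticity is gated off (alpha irrelevant)


def stdp_window(dt, tau=TAU):
    """Symmetric, positive window; the exponential shape is an ASSUMPTION."""
    return math.exp(-abs(dt) / tau)


def pair_update(dt, ach_present, reward_delivered, eta=0.01):
    """Weight change for one pre/post spike pair separated by dt seconds (eta HYPOTHETICAL)."""
    a, _alpha = modulation(ach_present, reward_delivered)
    return eta * a * stdp_window(dt)
```

Because the window is symmetric, the sign of the update depends only on the neuromodulator present and not on the order of the pre- and postsynaptic spikes.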
Scientific Reports | (2018) 8:9486 | DOI:10.1038/s41598-018-27393-2
Dopamine-modulated standard asymmetric STDP curve. We also compared our symmetric learning windows to standard asymmetric STDP curves. In the total weight update for this rule, η = 0.01 is the learning rate, $W_2$ is the STDP window (equation 14) and ψ is the eligibility trace (equation 12). B gates all synaptic changes until the end of the trial:

$$B = \begin{cases} 1 & \text{at the end of the trial} \\ 0 & \text{during exploration} \end{cases}$$

$F_i^{pc}$ and $F_j^{a}$ are sets containing $t_i$ and $t_j$ respectively, the arrival times of all spikes fired by place cell i and action neuron j. The window $W_2$ is defined by the amplitudes $A_{pre-post}$ and $A_{post-pre}$ of its two sides. The integral of the learning window determines if the agent learns, unlearns or does not learn. We therefore considered four different parameter sets: (i) positive integral ($A_{pre-post}$ = 1, $A_{post-pre}$ = −0.5); (ii) negative integral ($A_{pre-post}$ = 0.5, $A_{post-pre}$ = −1); zero integral with either (iii) positive $A_{pre-post}$ (standard STDP window; $A_{pre-post}$ = 0.5, $A_{post-pre}$ = −0.5) or (iv) negative $A_{pre-post}$ (inverted STDP window; $A_{pre-post}$ = −0.5, $A_{post-pre}$ = 0.5). The time constant was identical for the two sides of the window and was taken to be τ = 10 ms. We ran 1000 simulations for each parameter set.
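The sign of the window integral for the four parameter sets can be checked with a short sketch. It assumes the usual exponential shape on each side of the asymmetric window, and the function and dictionary names are hypothetical; the amplitudes and τ = 10 ms are those listed above.

```python
import math

TAU = 0.010  # seconds (10 ms, identical for both sides of the window)


def asymmetric_window(dt, a_pre_post, a_post_pre, tau=TAU):
    """Window value for dt = t_post - t_pre in seconds (exponential shape ASSUMED)."""
    if dt > 0:
        return a_pre_post * math.exp(-dt / tau)
    if dt < 0:
        return a_post_pre * math.exp(dt / tau)
    return 0.0


def window_integral(a_pre_post, a_post_pre, tau=TAU):
    """Closed-form integral over all dt: tau * (a_pre_post + a_post_pre)."""
    return tau * (a_pre_post + a_post_pre)


# (A_pre-post, A_post-pre) for parameter sets (i) to (iv)
PARAMETER_SETS = {"i": (1.0, -0.5), "ii": (0.5, -1.0),
                  "iii": (0.5, -0.5), "iv": (-0.5, 0.5)}
```

With equal time constants on both sides, the integral is proportional to the sum of the two amplitudes, which is positive for (i), negative for (ii) and zero for (iii) and (iv).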
Dynamic reward signal. We compared sn-Plast to a learning rule gated by a dynamic reward signal. This learning rule is similar to the one used in the control simulations (−ACh), but the weight change here is scaled by the dynamic reward signal $\rho(tr) = R(tr) - \bar{R}(tr)$. Here, R(tr) is the value of the reward received during trial tr and $\bar{R}(tr)$ is the moving average reward. In our simulations, we assumed that R(tr) = 1 if the agent reaches the rewarding area before the end of the trial, and R(tr) = 0 otherwise. The moving average reward is updated from trial to trial with a weighting factor β, which regulates the timescale of integration of the average reward: the higher β, the shorter the timescale. In Fig. 6, we used β = 0.75. As in the control simulations, the weight update for simulations with dynamic reward signal is applied only at the end of the trial (the gate equals 1 at the end of the trial and 0 during exploration; equation 16).
Negative feedback signal. We also compared our neuromodulated learning rule to a dopamine-modulated rule with negative feedback. In this set of simulations, we assumed that whenever the agent reaches the location of an omitted reward it receives a negative feedback that inverts the sign of the learning window induced by dopamine. In the weight update for simulations with negative feedback, η = 0.01, $A_{feedback}$ = 1 when the new reward is found, $A_{feedback}$ = −1 if the agent navigates to the old reward location and $A_{feedback}$ = 0 otherwise. The eligibility trace ψ (equation 12) is active only when the feedback signal is delivered at the end of the trial (equation 16).
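The two comparison signals can be sketched as follows. The exact form of the moving-average update, the order in which the average is updated, and the initial value `r_bar0 = 0.0` are assumptions chosen so that a higher β gives a shorter memory; the function names are hypothetical. β = 0.75 and the three values of the feedback amplitude are taken from the text.

```python
BETA = 0.75  # value used for Fig. 6 in the text


def run_dynamic_reward(rewards, beta=BETA, r_bar0=0.0):
    """Return rho(tr) = R(tr) - R_bar for each trial.

    ASSUMED update, applied after each trial: R_bar <- (1 - beta) * R_bar + beta * R(tr).
    """
    r_bar, signals = r_bar0, []
    for r in rewards:
        signals.append(r - r_bar)
        r_bar = (1.0 - beta) * r_bar + beta * r  # higher beta -> shorter memory
    return signals


def feedback_amplitude(at_new_reward, at_old_reward):
    """A_feedback: +1 at the new reward, -1 at the old reward location, 0 otherwise."""
    if at_new_reward:
        return 1
    if at_old_reward:
        return -1
    return 0
```

Under these assumptions a repeated reward produces a shrinking positive signal, and the first omission after a run of rewards produces a negative one, which is the sense in which the signal favours "surprising" outcomes.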