The Long and the Short of Serotonergic Stimulation: Optogenetic activation of dorsal raphe serotonergic neurons changes the learning rate for rewards

Serotonin plays an influential, but computationally obscure, modulatory role in many aspects of normal and dysfunctional learning and cognition. Here, we studied the impact of optogenetic stimulation of dorsal raphe serotonin neurons in mice performing a non-stationary, reward-driven, foraging task. We report that activation of serotonin neurons significantly boosted learning rates for choices following long inter-trial-intervals, choices that were driven by the recent history of reinforcement.

* kiigaya@gatsby.ucl.ac.uk

1 Introduction

Learning from the outcomes of past actions is crucial for effective decision-making and thus ultimately for survival. In the case of important outcomes, such as rewards, ascending neuromodulatory systems have been implicated in aspects of this learning due to their pervasive effects on processing and plasticity. Of these systems, perhaps best understood is the involvement of phasically-fluctuating levels of dopamine activity and release in signalling temporal difference [55] prediction errors for appetitive outcomes [46,50]. Since prediction errors are a key component of reinforcement learning (RL) algorithms, signalling mismatches between outcomes and predictions, this research has underpinned and inspired a large body of theory on the neural implementation of RL.

From the early days of investigations into aversive processing in Aplysia [27], serotonin (5-HT) has also been implicated in plasticity. This is broadly evident in the mammalian brain, from the restoration of the critical period for the visual system of rodents occasioned by local infusion of 5-HT [58] to the impairment of particular aspects of associative learning arising from 5-HT depletion in monkeys [7,59]. Despite theoretical suggestions for an association with aversive learning [19,53,13,6,15], direct experimental tests of serotonin's role in RL tasks have led to a complex pattern of results [12,52,42,45,22,11]. For instance, recent optogenetic studies reporting that stimulating 5-HT neurons could lead to positive reinforcement [42] do not appear to be consistent with other optogenetic findings, which instead suggest other forms of involvement. Here, we focus on the timescale (e.g. how many trials) over which reward histories are integrated to assess the value of taken actions.
5-HT can readily influence learning rates through its interaction with dopamine [21]; and indeed, there is evidence that animals adapt the timescales of plasticity to the prevailing circumstances [16,4,47,62,30], and also consider more than one timescale simultaneously [10,38,23,31]. 5-HT could be involved in some, but not other, timescales. It could also be associated with some, but not others, of the many decision-making systems [14,25,9,40] that are known to be involved in RL.

We therefore reanalyzed experiments in which mice performed a partially self-paced, dynamic foraging task for water rewards [22]. In this task, 5-HT neurons in the dorsal raphe nucleus (DRN) were optogenetically activated during reward delivery in a trial-selective manner. The precise control of the timing and location of stimulation offered the potential of studying in detail the way in which 5-HT affects reward valuation and choice. We used methods of computational model comparison to examine these various possible influences. We first noted a substantial difference in the control of actions that followed short and long intertrial intervals: only the latter were influenced by extended reward histories, as expected for choices driven by conventional RL. We then found that the learning rate associated with these (latter) choices was significantly increased by 5-HT stimulation.

2 Results

2.1 Animals showed a wide distribution of inter-trial-intervals (ITIs)

We reanalyzed data from a dynamic foraging or probabilistic choice task in which subjects faced a two-armed bandit [22]. Full experimental methods are given in that publication. Briefly, the subjects were four adult transgenic mice expressing CRE recombinase under the serotonin transporter promoter (SERT-Cre) and four wild-type littermates (WT) [22]. In this task (Figure 1a), mice were required to poke the center port to initiate a trial.
They were then free to choose between two side ports, where reward was delivered probabilistically at both ports on each trial (on a concurrent variable-ratio-with-hold schedule [39]). On a subset of trials, when mice entered a side port, one second of photo-stimulation was provided to DRN 5-HT neurons via an implanted optical fiber (Figure 1b). ChR2-YFP expression was histologically confirmed to be localized to the DRN in SERT-Cre mice (Figure 1c) [22].

Following previous experiments in macaque monkeys [54,10,39], the probability that a reward was associated with a side port per trial was fixed in a given block of trials (Left vs Right probabilities: 0.4 vs 0.1, or 0.1 vs 0.4). Once a reward had been associated with a side port, the reward remained available until collection (although multiple rewards did not accumulate). Photostimulation was always delivered at one of the side ports in a given block (Left vs Right probabilities: 1 vs 0, or 0 vs 1). Block changes occurred every 50-150 trials and were not signaled, meaning that animals needed to track the history of rewards in order to maximize rewards.

As previously reported [22], subjects' choices tended to follow changes in reward contingencies (Figure 1d), exhibiting a form of matching behavior [54,10,39]. A deterministic form of matching behavior can maximize average rewards in this task [49,43,32,31] because the probability of getting a reward increases on a side as the other side is exploited (due to the holding of rewards). For more behaviorally realizable policies, slow learning of reward contingencies has been shown to be beneficial for increasing the chance of obtaining rewards [31].

We confirmed the results of previous analyses [22] showing that the optogenetic stimulation of DRN neurons did not appear to change the average preference for the side ports (Figure 1e).
The animals' preference for the side port that was associated with a higher water probability was not affected by the side which was photo-stimulated. However, these analyses do not fully take advantage of the experimental design, in which photo-stimulation was delivered on a trial-by-trial basis. The latter should allow us to examine whether the effect of stimulation is more prominent on a specific subset of trials.

2.2 Fast or slow: ITI duration determined decision policy

The task contained a free operant component in that the subjects were free to initiate each trial. This resulted in a wide distribution of inter-trial-intervals (ITIs). It was notable that some ITIs were substantially larger than others (Figure 1f).

Figure 1: (a). Schematic diagram of trial events in the task. On each trial, a mouse was required to enter the center port (Trial initiation) and then move to one of the side ports (Choice). A reward might be delivered at the side port according to a variable-ratio-with-hold schedule. The next trial started when the mouse entered the center port. The inter-trial-interval (ITI) is defined as the time from when the mouse left the side port until it entered the center port to initiate the next trial. In a given block of trials, one side port was associated with a higher reward probability per trial (0.4) than the other (0.1); although following delivery, rewards were held (but not accumulated) until collected. Furthermore, during a block, photo-stimulation (12.5 Hz, 5 mW for 1 s) was always delivered as soon as the mouse entered just one of the side ports. (b). A schematic of the optogenetic stimulation. DRN neurons were infected with the viral vector AAV2/1-Dio-ChR2-EYFP. In SERT-Cre mice, 5-HT neurons expressed ChR2-YFP (green) and could be photoactivated with blue light that was delivered by an optical fiber implant.
The probability of choosing the higher water probability side is shown for the blocks in which the photo-stimulation was assigned to the opposite side from the higher water probability side (Opp.), and for the blocks in which it was assigned to the same side (Same). The differences within WT mice, within SERT-Cre mice, and between WT and SERT-Cre mice for either condition were not significant. The error bars indicate the mean ± SEM over sessions. (f). Inter-trial-intervals (ITIs) in the same session as c. The red circle indicates trials with long ITIs (> 7 sec). (g). The average predictive accuracy of the existing reward- and choice-kernel model [39,22] when fitted to all trials. This model captures a form of win-stay, lose-shift rule. Choices following short ITIs (≤ 7 s) were well predicted by the model, while choices following long ITIs (> 7 s) were not. The difference between short and long ITIs was significant for both WT and SERT mice (permutation test; p < 0.001, indicated by three stars). Data and images from [22].

To examine this effect, we separated short from long ITI trials using a threshold of 7 seconds (we consider other thresholds below; values greater than 4 seconds led to equivalent results).
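Concretely, splitting trials at the ITI threshold amounts to a simple mask over the recorded intervals. The sketch below uses made-up ITI values; the variable names and numbers are ours, purely illustrative:

```python
import numpy as np

# Hypothetical inter-trial intervals (seconds) for a few trials of one session;
# the real values would come from the task logs.
itis = np.array([1.2, 0.8, 9.5, 2.1, 14.0, 3.3, 7.5, 0.9])

THRESHOLD = 7.0  # seconds, as in the paper; values above 4 s behave equivalently

long_mask = itis > THRESHOLD   # choices following long ITIs
short_mask = ~long_mask        # choices following short ITIs
# Here: 3 long-ITI trials (9.5, 14.0, 7.5) and 5 short-ITI trials.
```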

Figure S1 reports the mean proportions of long ITI trials in WT and SERT-Cre mice. The frequency of long ITI trials was slightly, but statistically significantly, different between WT and SERT mice; however, this appears not to be due to stimulation itself, as a control analysis showed that stimulation did not significantly change the ITI that followed (Figure S2). We also found that long ITI trials were most common in the last part of each experimental session, but were also seen in earlier parts of each session (Figure S3).

Previous studies have suggested a relationship between the duration of an ITI and the nature of the subsequent choice. For example, subjects have been reported to make more impulsive choices after shorter ITIs [48]. Other studies have shown that perceptual decisions are more strongly influenced by older prior experience when working memory is disturbed during the task [1]. Here, we hypothesized that choices following short ITIs might also be more strongly influenced by the most recent choice outcome compared to those following long ITIs, since the outcome preceding a short ITI is more likely to be kept in working memory until the time of choice.

To investigate this, we first exploited an existing model of the behavior on this task [39,22]. This is a variant of an RL model which separately integrates reward and choice history over past trials, subject to exponential decay [39]. This model captures a form of win-stay, lose-shift rule [3,61] when its time constants are small.

We found that choices following short ITIs (≤ 7 s) were well predicted by this model, whereas choices following long ITIs were not well accounted for by a short-term-memory-based win-stay, lose-switch strategy (Figure 1g).

We hypothesized that choices following long ITIs might instead reflect slow learning of reward history over many trials [36,31].
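The predictive-accuracy comparison can be sketched as the mean per-trial likelihood the fitted model assigns to the observed choices, split by the preceding ITI. A minimal sketch, with hypothetical probabilities standing in for the fitted model's outputs:

```python
import numpy as np

def predictive_accuracy(p_choice, mask):
    """Mean per-trial likelihood the model assigns to the animal's actual
    choices, restricted to the trials selected by `mask`."""
    return float(np.mean(p_choice[mask]))

# Hypothetical per-trial probabilities that a fitted reward- and choice-kernel
# model assigns to the observed choices (illustrative numbers only).
p_choice = np.array([0.9, 0.8, 0.5, 0.85, 0.45, 0.9])
followed_long_iti = np.array([False, False, True, False, True, False])

acc_short = predictive_accuracy(p_choice, ~followed_long_iti)  # 0.8625 here
acc_long = predictive_accuracy(p_choice, followed_long_iti)    # 0.475 here
```

In the data, the short-ITI accuracy is reliably higher than the long-ITI accuracy (Figure 1g); the numbers above merely illustrate the computation.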
Indeed, by complexity-adjusted model comparison (integrated BIC) [29,34], we found that choices following ITIs > 7 s were best described by a standard RL model (Figure S4). This analysis supported our hypothesis that choices following long ITIs are influenced by a relatively long period of reward history compared to choices following short ITIs.

Figure 2: (a). Schematic diagram of the model-agnostic analysis. The correlation between the choices following long ITIs (window = 5 trials) and the reward bias (window = 10 trials) was estimated using adjacent sliding windows. The reward bias was estimated on trials only with (top) or without (bottom) photo-stimulation. The windows were shifted one trial at a time. The greyed-out trials are the ones that are ignored for the assessments. Note that, due to the task design in which photo-stimulation is associated with only one side (Left or Right) in a given block, in some moving windows the reward bias had to be computed from one side only. Thus we assigned +1 (respectively −1) to a reward from Left (Right) and no-reward from Right (Left) when we computed the reward bias. We are aware that this is not a perfect measure of reward bias; but we still expect finite correlations, since reward rates from the Left choice and the Right choice are on average negatively correlated by the task design in a given block (reward probability: 0.1 vs 0.4). The correlation was estimated separately for each mouse. (b). Model-agnostic analysis suggests that the impact of reward history on choices following long ITIs was modulated by optogenetic stimulation. The x-axis indicates whether the reward bias was computed over trials with or without photo-stimulation. The stars indicate how significantly the correlation is different from zero, or the correlations are different from each other, tested by a permutation test in which the estimated reward bias was permuted within or between conditions.
Three stars indicate p < 0.001. The error bars indicate the mean ± SEM of data.

Figure 3: (a) There are two separate decision-making systems: a fast system generating a form of "win-stay, lose-switch", and a slow system following reinforcement learning (RL). After short ITIs (T_ITI < T_Threshold), choice is generated by the fast system following win-stay, lose-switch. After long ITIs (T_ITI > T_Threshold), choice is generated by the slow RL system. The ITI threshold T_Threshold is a free parameter that is fitted to data. (b) The RL system is assumed to learn the value of choice on all trials, including those with short ITIs, for whose choices it was not responsible. The learning rate of the RL system is allowed to be modulated by photo-stimulation. When photo-stimulation is (respectively, is not) delivered, choice value is updated at the rate of α_Stim (α_no-Stim). (c) Photo-stimulation increased the learning rate of SERT-Cre mice. The estimated learning rates are shown for WT mice (left), SERT-Cre mice (center), and SERT-Cre mice with shuffled stimulations (right). The differences between α_Stim and α_no-Stim in WT mice, between α_Stim in WT mice and α_Stim in SERT-Cre mice, between α_Stim and α_no-Stim in SERT-Cre mice with shuffled stimulation conditions, and between α_Stim in SERT-Cre mice and α_Stim in SERT-Cre mice with shuffled stimulation conditions were not significant. The differences between α_Stim and α_no-Stim in SERT-Cre mice, and between α_no-Stim in WT and α_no-Stim in SERT-Cre mice, were significant (permutation test, p < 0.001). The difference between α_no-Stim in SERT-Cre mice and α_no-Stim in SERT-Cre mice with shuffled stimulations was also significant (permutation test, p < 0.01). (d) Generative test of the model. The analysis of Figure 2d was applied to data generated by the model.
The correlations were all significantly different from zero, while the difference between the photo-stimulation and no-photo-stimulation conditions between WT and SERT-Cre mice was also significant.
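The sliding-window, model-agnostic analysis of Figure 2a can be sketched as follows. The simulated session and all names are ours; the ±1 reward-bias coding follows the figure caption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated session (an illustrative stand-in for real data): side of each
# choice, reward outcome, stimulation flag, and the preceding ITI.
n = 400
sides = rng.choice(np.array(['L', 'R']), size=n)
rewards = (rng.random(n) < np.where(sides == 'L', 0.4, 0.1)).astype(int)
stim = (sides == 'L')                    # stimulation tied to one side per block
itis = rng.exponential(4.0, size=n)      # seconds

def reward_bias(lo, hi, use_stim):
    """Signed reward bias over trials [lo, hi): +1 for a Left reward or a Right
    no-reward, -1 for the converse (the coding of Figure 2a), restricted to
    stimulated (use_stim=True) or unstimulated trials."""
    sel = stim[lo:hi] == use_stim
    sign = np.where(sides[lo:hi] == 'L', 1.0, -1.0)
    signed = np.where(rewards[lo:hi] == 1, sign, -sign)
    return signed[sel].mean() if sel.any() else np.nan

rb, cb = [], []
for t in range(n - 15):                  # windows shifted one trial at a time
    bias = reward_bias(t, t + 10, use_stim=True)   # 10-trial reward window
    win = slice(t + 10, t + 15)                    # adjacent 5-trial choice window
    long_choices = sides[win][itis[win] > 7.0]     # choices following long ITIs
    if long_choices.size and not np.isnan(bias):
        rb.append(bias)
        cb.append(np.mean(long_choices == 'L'))

corr = np.corrcoef(rb, cb)[0, 1]         # one such correlation per mouse
```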
In this model, choice values were updated at the rate α_Stim on photo-stimulated trials and α_no-Stim on non-stimulated ones. We found that this model fits the data more proficiently than a number of variants (see the Methods section for details) embodying a range of different potential effects of optogenetic stimulation: including acting as a direct reward itself; as a multiplicative boost to any real reward; or causing a change in the learning and/or forgetting rates (Figure S8).

This model also fits choices better than a model that learns and forgets outcome history according to wall-clock time (measured in seconds) rather than according to the number of trials. To do this, we simply adapted the previously validated two-kernel model that integrates choice and reward history over trials [39,22] such that the integration instead occurred over elapsed time.

In the best-fitting model (Figure 3a), we found that optogenetic stimulation increased the learning rate in SERT-Cre mice, but not in WT mice (Figure 3c). Consistent with the previous analyses, we also found that the time constants for the choice kernel and the reward kernel for choices following short ITIs were very short for both WT and SERT-Cre mice (Figure S5), and that the ITI thresholds were not significantly different between WT and SERT-Cre mice (Figure S6). In addition, we replicated the same results using a model with a fixed (= 7 sec) ITI threshold (Figure S7).

As a control analysis, we fitted the model to SERT-Cre data with randomly re-assigned stimulation trials. Shuffling the trials abolished the effect of photo-stimulation on the learning rate (Figure 3c).

Although the learning rate on stimulation trials in SERT-Cre mice was significantly greater than that on non-stimulated trials, it was not significantly different from the learning rate in WT mice (Figure 3c). For the model-agnostic analysis, we used the reward contingencies assigned by the experimenters rather than those observed by the subjects, to avoid any bias that is independent of the reward history (such as choice history).

Choices following long ITIs were indeed significantly influenced by a long run of reward history spanning the entire experimental session (Figure S9). The data from the generative test also confirm this correlation (Figure S9).

Note that a similar effect has also been observed in perceptual decision making: in one example, longer-lasting prior experience was more influential when working memory was disturbed during the task [1].

The fact that only a subset of trials was apparently affected by the stimulation is arguably a cautionary tale for the interpretation of optogenetics experiments. What looked like a null effect [22] had to be elucidated through computational modeling. Equally, for the short ITI trials, what seemed like behavior controlled by conventional RL might come from a different computational strategy (and potentially a different neural substrate) altogether [9]. This could prompt a reexamination of previous data (as shown by [5]). Further caution might be prompted by the observation that the learning rate in the SERT-Cre mice in the absence of stimulation was actually significantly lower than that of the WT mice in the absence of stimulation, rising to a similar magnitude as the WTs with stimulation.
This may be due to chronic effects of optogenetic stimulation of DRN neurons, as suggested in recent experiments [11], or due to baseline effects of the genetic constructs.

The learning rates that we found, even for the slow system, are a little too fast to capture fully the long-term correlation that can be found in the data. This is apparent, for example, in the correlation between choices and the reward bias from the previous session (Figure S13), which spans a longer timescale than we assumed in the model fitting.

Finally, one of the main reasons to be interested in serotonin is the prominent role that drugs affecting this neuromodulator play in treating psychiatric disorders. While our results add substantial complexity to this landscape, they also offer the prospect of richer and more finely targeted manipulations, given greater understanding.

Methods

Previous studies have shown that animals' choice behavior in a dynamic foraging task without the change-over-delay constraint [28] can be well described by a linear two-kernel model (e.g. [39,22]). In this model, the probability P^L_t of choosing Left on trial t is determined by a linear combination of values computed from reward and choice history:

P^L_t = 1 / (1 + exp(−[(a^L_t − a^R_t) + (b^L_t − b^R_t) + δ])),

where a^L_t (a^R_t) is the value computed from a reward kernel for Left (Right), b^L_t (b^R_t) is the value computed from a choice kernel for Left (Right), and δ is the bias. Assuming simple exponential kernels [39,54,10], the reward values are updated on every trial as:

a^L_{t+1} = (1 − χ) a^L_t + ρ r^L_t (and likewise for Right),

where a^L_t (a^R_t) is the reward value for the Left (Right) choice on trial t, χ is the temporal forgetting rate of the kernel, ρ is the initial height of the kernel, and r^L_t = 1 (r^R_t = 1) if a reward is obtained from Left (Right) on trial t, or 0 otherwise. Since these equations can also be written as:

a^L_{t+1} = a^L_t + χ ((ρ/χ) r^L_t − a^L_t),

this kernel is equivalent to a forgetful Q-learning rule [60,9] with a learning rate χ and reward sensitivity ρ/χ.
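The equivalence between the exponential-kernel update and forgetful Q-learning can be checked numerically; the parameter values below are arbitrary:

```python
chi, rho = 0.3, 0.6   # forgetting rate and kernel height (illustrative values)
a, r = 0.4, 1.0       # current reward value and trial outcome

kernel_form = (1 - chi) * a + rho * r        # a_{t+1} = (1 - chi) a_t + rho r_t
q_form = a + chi * ((rho / chi) * r - a)     # forgetful Q-learning form

# Both give 0.88 here: the two updates are algebraically identical.
assert abs(kernel_form - q_form) < 1e-12
```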

The value for the choice kernel is also updated on every trial as

b^L_{t+1} = (1 − χ_b) b^L_t + η c^L_t,

where χ_b is the forgetting rate of the choice kernel and c^L_t = 1 if Left (respectively Right, for b^R) is chosen on trial t, while c^L_t = 0 otherwise. We note that the initial height of the choice kernel, η, is normally negative [39,22], meaning that the choice kernel normally captures a tendency towards alternation. Such tendencies are common in tasks with reward schedules like those in the current task if a penalty for alternation (a change-over delay) is not imposed [28].

We assumed that the update takes place on every trial, even on those associated with long ITIs. We allowed T_Threshold to be a free parameter that is determined by the data. We also tested the fixed value T_Threshold = 7 seconds based on our preliminary analyses and found results consistent with the variable ITI-threshold model (Fig. S7).

The fast system generates decisions based on the two-kernel model described in M.2.1. The slow system performs simple Q-learning. Specifically, the probability P^L_t of choosing Left on trial t after a long ITI (> T_Threshold) is given by

P^L_t = 1 / (1 + exp(−(v^L_t − v^R_t + κ)/T)),

where v^L_t (v^R_t) is the value for Left (Right), κ is the bias term, and T is the decision noise.

The agent updates the value of the chosen action according to the Rescorla-Wagner rule, but at different learning rates for photo-stimulation (α_Stim) and no-stimulation (α_No-Stim) trials. For example, if Left was chosen and photo-stimulation was applied, the value of the Left choice is updated as

v^L_{t+1} = v^L_t + α_Stim (r^L_t − v^L_t).

If no stimulation was applied, on the other hand,

v^L_{t+1} = v^L_t + α_No-Stim (r^L_t − v^L_t).

In order to explore other possibilities for optogenetic stimulation effects, we constructed three other models.

Asymmetric learning rate model

We allowed the model to have different learning rates for reward and no-reward trials when photo-stimulation was applied. Specifically, we modified Equation 9 of the main model so that the update uses one learning rate when r^L_t = 1 and a different one when r^L_t = 0. The same applies for the Right choice.

Multiplicative value model

Here we assumed that photo-stimulation changed the sensitivity to reward.
Specifically, we modified the learning rule of the slow system so that, when photo-stimulation is applied, the obtained reward is scaled by a multiplicative sensitivity factor ω:

v^L_{t+1} = v^L_t + α (ω r^L_t − v^L_t),

and otherwise the update is unchanged:

v^L_{t+1} = v^L_t + α (r^L_t − v^L_t).

Additive value model

Here we assumed that photo-stimulation carried an independent rewarding value φ. Specifically, we modified the learning rule of the slow system so that, when photo-stimulation is applied,

v^L_{t+1} = v^L_t + α (r^L_t + φ − v^L_t),

and otherwise the update is unchanged. The same applies for the Right choice.
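Putting the pieces together, the main model (a fast two-kernel system plus a slow Q-learner with stimulation-dependent learning rates, arbitrated by the ITI threshold) can be sketched as a generative simulation. Everything below, including parameter values, the ITI distribution, and the block structure, is a simplified illustration rather than the fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters (names and values are ours, not fitted estimates).
chi_r, rho = 0.5, 1.0        # reward kernel: forgetting rate, height
chi_b, eta = 0.5, -0.3       # choice kernel: forgetting rate, height (negative -> alternation)
delta = 0.0                  # fast-system bias
alpha_stim, alpha_nostim = 0.4, 0.1
kappa, T = 0.0, 0.3          # slow-system bias and decision noise
T_threshold = 7.0            # seconds

a = np.zeros(2)              # reward-kernel values, index 0 = Left, 1 = Right
b = np.zeros(2)              # choice-kernel values
v = np.zeros(2)              # slow-system Q-values
p_reward = np.array([0.4, 0.1])

for t in range(200):
    iti = rng.exponential(3.0)
    if iti > T_threshold:    # slow RL system decides after long ITIs
        p_left = sigmoid((v[0] - v[1] + kappa) / T)
    else:                    # fast two-kernel system decides after short ITIs
        p_left = sigmoid((a[0] - a[1]) + (b[0] - b[1]) + delta)
    c = 0 if rng.random() < p_left else 1
    r = float(rng.random() < p_reward[c])
    stim = (c == 0)          # photo-stimulation tied to one side per block

    # Fast-system kernels decay every trial; the chosen side gets the kernel height.
    a *= (1 - chi_r); a[c] += rho * r
    b *= (1 - chi_b); b[c] += eta
    # Slow system learns on every trial, at a higher rate when stimulated.
    alpha = alpha_stim if stim else alpha_nostim
    v[c] += alpha * (r - v[c])
```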

In order to determine the distribution of model parameters h, we conducted a hierarchical Bayesian, random-effects analysis [29,34,33] for each subject. In this, the (suitably transformed) parameters h_i of experimental session i are treated as a random sample from a Gaussian distribution with mean and variance θ = {µ_θ, Σ_θ}.
The prior distribution parameters θ can be set to their maximum likelihood estimate:

θ^ML = argmax_θ p(D | θ),

where D denotes the data across sessions.

We optimized θ using an approximate Expectation-Maximization procedure. For the E-step of the k-th iteration, a Laplace approximation gives

p(h_i | D_i, θ^k) ≈ N(m^k_i, Σ^k_i),

where N(m^k_i, Σ^k_i) is the Normal distribution with mean m^k_i, the maximum a posteriori estimate for session i, and covariance Σ^k_i, obtained from the inverse Hessian around m^k_i. For the M-step, the group-level parameters are re-estimated from these session-wise posteriors:

µ^{k+1}_θ = (1/N) Σ_i m^k_i,   Σ^{k+1}_θ = (1/N) Σ_i (m^k_i (m^k_i)^T + Σ^k_i) − µ^{k+1}_θ (µ^{k+1}_θ)^T.

For simplicity, we assumed that the covariance Σ^k_θ had zero off-diagonal terms, assuming that the effects were independent.
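The approximate EM procedure can be illustrated on a toy one-parameter problem; the data-generating setup and all numerical choices here are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: each "session" has one (logit) parameter h_i drawn from a group
# Gaussian; its data are 200 Bernoulli outcomes with probability sigmoid(h_i).
true_h = rng.normal(0.5, 0.3, size=8)
data = [rng.random(200) < 1.0 / (1.0 + np.exp(-h)) for h in true_h]

mu, var = 0.0, 1.0                           # group prior (mu_theta, Sigma_theta)
for _ in range(20):                          # approximate EM iterations
    means, variances = [], []
    for d in data:
        k, n = int(d.sum()), len(d)
        m = mu                               # E-step: Newton ascent to the MAP m_i
        for _ in range(50):
            p = 1.0 / (1.0 + np.exp(-m))
            grad = -(k - n * p) + (m - mu) / var   # gradient of neg. log posterior
            hess = n * p * (1.0 - p) + 1.0 / var   # its Hessian (positive here)
            m -= grad / hess
        means.append(m)
        variances.append(1.0 / hess)         # Laplace: covariance = inverse Hessian
    # M-step: update the group mean and variance, including posterior uncertainty.
    mu = float(np.mean(means))
    var = float(np.mean([s + m * m for m, s in zip(means, variances)]) - mu ** 2)
```

The recovered group mean lands near the generating value (0.5), and the group variance combines the spread of the session MAPs with their individual posterior uncertainties.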

Model comparison

We compared models according to their integrated Bayesian Information Criterion (iBIC) scores [29,34,33]. We analysed the model log likelihood log p(D|M):

log p(D|M) ≈ Σ_i log [ (1/K) Σ_j p(D_i | h_j) ],

where we approximated the integral over session parameters as the average over K samples h_j generated from the fitted prior; the iBIC then penalizes this likelihood according to the number of prior parameters and the number of data points.

Model's average predictive accuracy

We defined the model's average predictive accuracy as the arithmetic mean of the likelihood per trial, using each session's MAP parameter estimate. That is,

accuracy = (1 / Σ_i N^i_trial) Σ_i Σ_t p(d^i_t | m_i),

where N^i_trial is the number of trials in session i, d^i_t is the datapoint on trial t in session i, and m_i is the MAP parameter estimate for that session.
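The iBIC computation reduces to a Monte-Carlo estimate of each session's marginal likelihood plus a complexity penalty; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def ibic(session_liks, n_prior_params, n_datapoints):
    """Approximate integrated BIC. `session_liks[i]` holds the data likelihood
    p(D_i | h_j) of session i under each of K parameter samples h_j drawn from
    the fitted prior; the marginal is approximated by their average."""
    log_p = sum(np.log(np.mean(liks)) for liks in session_liks)
    return -2.0 * log_p + n_prior_params * np.log(n_datapoints)

# Hypothetical numbers: two sessions, K = 4 prior samples each.
liks = [np.array([0.5, 0.5, 0.5, 0.5]), np.array([0.2, 0.4, 0.2, 0.2])]
score = ibic(liks, n_prior_params=6, n_datapoints=1000)
```

Lower scores indicate a better complexity-adjusted fit when comparing models on the same data.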

In our generative simulations, we used the same reward/photo-stimulation schedule as in the actual data.

Figure S1: The distribution of ITIs. The proportions of short (≤ 7 sec) and long (> 7 sec) ITI trials were significantly different for both WT (left) and SERT-Cre (right) mice. The difference between WT and SERT-Cre mice was also significant, though the optogenetic stimulation itself did not change the subsequent ITIs (see Figure S2). The error bars indicate the mean ± SEM of sessions.

Figure S2: Probability that the ITI is longer than 7 sec, following a photo-stimulation or a no-photo-stimulation trial. Stimulation does not significantly increase the chance of a long ITI event. The difference between the WT mice and SERT-Cre mice was significant (permutation test, p < 0.001).

The simple Q-learning model was assumed to learn values on all trials but was responsible for decisions only on trials following long ITIs (> 7 sec).

[Axis label: ΔiBIC score, in log scale (total over SERT-Cre mice).]

Figure S9: Choices following long ITIs were predicted by reward history over many trials. The correlation between the choices following long ITIs in the last part of each session (5th quintile) and the reward bias estimated in different quintiles. On the x-axis, '1-5' indicates the overall reward bias computed over the 1st through 5th quintiles together, while '5' means the bias from the 5th quintile only. The top left (top right) shows the results of WT (SERT-Cre) mice, while the bottom left (bottom right) shows the results for the model's generated data for WT (SERT-Cre) mice.
The stars indicate how significantly the correlation is different from zero, tested by a permutation test. The test statistic was constructed as the mean of the correlation coefficients of the four animals in each quintile condition, where each correlation coefficient was computed from randomly permuted data in each condition in each animal. One star indicates p < 0.05, two stars indicate p < 0.01, and three stars indicate p < 0.001. The error bars indicate the mean ± SEM of data.
Figure S10: The impact of reward history on choices following short ITIs did not show effects of optogenetic stimulation. The y-axis shows the correlation between the reward bias (Stim or no Stim) and the choice bias (all ITIs); the x-axis indicates whether the reward bias was computed over trials with or without photo-stimulation. Due to the experimental coupling of stimulation and reward probability, the correlation appears larger when stimulation is on for both groups; however, the difference between WT and SERT-Cre was not significant. The error bars indicate the mean ± SEM of sessions.
Figure S11: The impact of reward history on choice was seen more strongly in choices following long ITIs than in those following short ITIs in SERT-Cre mice, whereas this was not the case in WT mice. The x-axis indicates whether the correlation was computed for choices following short or long ITIs. The y-axis indicates the ratio of the correlation between reward bias and choice bias computed over trials with photo-stimulation to the correlation computed over trials without photo-stimulation. The difference between the short and the long ITI conditions in SERT-Cre mice was significant (permutation test; p < 0.05), while the difference in WT mice was not significant. The error bars indicate the mean ± SEM of data.
Figure S13: Choices following long ITIs in the first quintile of a session were correlated with the reward bias in the fifth quintile of the previous session. The star indicates how significantly the correlation is different from zero, tested by a permutation test. The test statistic was constructed as the mean of the correlation coefficients of the four animals, where each correlation coefficient was computed from randomly permuted data in each animal. One star indicates p < 0.05. The error bars indicate the mean ± SEM of data.