A behavioural correlate of the synaptic eligibility trace in the nucleus accumbens

Reward reinforces the association between a preceding sensorimotor event and its outcome. Reinforcement learning (RL) theory and recent brain slice studies explain the delayed reward action such that synaptic activities triggered by sensorimotor events leave a synaptic eligibility trace for 1 s. The trace produces a sensitive period for reward-related dopamine to induce synaptic plasticity in the nucleus accumbens (NAc). However, the contribution of the synaptic eligibility trace to behaviour remains unclear. Here we examined a reward-sensitive period to brief pure tones with an accurate measurement of an effective timing of water reward in head-fixed Pavlovian conditioning, which depended on the plasticity-related signaling in the NAc. We found that the reward-sensitive period was within 1 s after the pure tone presentation and optogenetically-induced presynaptic activities at the NAc, showing that the short reward-sensitive period was in conformity with the synaptic eligibility trace in the NAc. These findings support the application of the synaptic eligibility trace to construct biologically plausible RL models.

Animal behaviours are effectively reinforced when a reward follows a preceding sensorimotor event typically ranging 1-60 s in the conditioning tasks. The time window varies depending on several factors, including type of reinforced behaviour; for example, appetitive licking or lever press typically allow reward delays of 1-3 s 1,2 , whereas approaching behaviour allows delays of 10-60 s [3][4][5][6] . To enable such learning, mechanisms are required to associate two temporally separated sensorimotor and reward events flexibly. Reinforcement learning (RL) theory explains that each sensorimotor event evokes an eligibility trace during which a reward can effectively reinforce preceding events [7][8][9][10] . Theoretically, the trace can be built up by sequential sensorimotor events occurring during reward learning to yield an accumulating eligibility trace 11 , allowing animals to learn from rewards with diverse delays. Although recent studies have attempted to address neuronal substrates for eligibility traces during reward learning guided by complex sequential sensorimotor events [12][13][14] , the reward-sensitive periods to a simple sensory input that can closely reflect an eligibility trace before building up remains elusive.
Neuronal substrates for an eligibility trace of reward have been studied as dopamine actions on glutamatergic synapses. Upon unexpected rewards, dopamine neurons in the ventral tegmental area (VTA) show phasic burst firing (~ 0.3 s) 15,16 , which is regarded as a reward prediction error signal in RL theory. Subsequent optogenetic studies supported this idea by showing that the phasic dopamine activity is sufficient and indispensable for establishing reward learning 2,[17][18][19] . VTA dopamine neurons send dense projections to the nucleus accumbens (NAc), which also receives glutamatergic inputs from several brain regions such as the amygdala. The amygdala conveys sensory information about the conditioned stimulus (CS) 20 , and the amygdala-to-NAc pathway is required for auditory cue-reward association 21,22 . The dopaminergic and glutamatergic inputs signal through dopamine D1 receptors (D1Rs) and N-methyl-d-aspartate type glutamate receptors (NMDARs) in the NAc for reward conditioning 23,24 . In slice preparations, D1R, NMDAR, and Ca²⁺/calmodulin-dependent protein kinase II (CaMKII) regulate the enlargement of the dendritic spine, a structural basis for long-term potentiation of the D1R-expressing spiny projection neurons (D1-SPNs) 25 . Of note, pairing of glutamatergic inputs and postsynaptic action potentials shaped a dopamine-sensitive period for plasticity of only about 1 s [25][26][27][28][29] .
These lines of evidence suggest that synaptic activities triggered by sensorimotor events leave synaptic eligibility traces for 1 s in the NAc, a time window during which reward-related dopamine could induce plasticity for behavioural learning. This cellular mechanism corresponds to the theoretical model of neo-Hebbian three-factor learning rules, which require a third factor, such as dopaminergic inputs, in addition to Hebbian concurrent presynaptic and postsynaptic activities to update the weights of neuronal connections 8 . However, several different neuronal mechanisms may exist in the brain for different types of eligibility traces. For example, outside the NAc, synaptic eligibility traces have been found to have longer time scales of 5 s in the neocortex 30 and 10 min in the hippocampus 31 . In addition to synaptic eligibility traces, persistent activities that store eligible events in working memory can also associate temporally separated events 32 .
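The three-factor rule described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the eligibility trace decays exponentially; the function names, the time constant `tau`, and the learning rate `lr` are hypothetical and are not taken from the study.

```python
import math

def eligibility(t_since_event, tau=0.5):
    """Illustrative eligibility trace (assumed exponential decay, tau in seconds).

    Returns 0 when dopamine precedes the synaptic event (negative delay).
    """
    if t_since_event < 0:
        return 0.0
    return math.exp(-t_since_event / tau)

def weight_change(pre, post, dopamine_delay, lr=0.1, tau=0.5):
    """Three-factor update: Hebbian coincidence (pre * post) tags the synapse,
    and dopamine arriving `dopamine_delay` seconds later converts the tag
    into a weight change scaled by the remaining trace."""
    return lr * pre * post * eligibility(dopamine_delay, tau)

# Dopamine shortly after the paired activity produces a larger update
# than dopamine arriving 2 s later; dopamine before the event produces none.
early = weight_change(pre=1.0, post=1.0, dopamine_delay=0.5)
late = weight_change(pre=1.0, post=1.0, dopamine_delay=2.0)
before = weight_change(pre=1.0, post=1.0, dopamine_delay=-0.5)
```

The exponential shape is only one convenient choice; any kernel that vanishes within about 1 s would express the same constraint.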
To clarify the contribution of the synaptic eligibility trace in the NAc in vivo, we sought to examine the reward-sensitive period around a short auditory input in a Pavlovian conditioning task with head-restrained mice. The water reward was delivered directly to the mouth of the mice to present the unconditioned stimulus (US) accurately, without any delay before consumption. This tone-water-licking task enabled the rapid establishment of conditioning within an hour, in contrast to tasks where licking is reinforced by water (antecedent-licking-water operant conditioning), which require several days to acquire 33 and involve brain regions such as the prefrontal cortex (PFC) 12,34,35 . We examined the reward-sensitive periods of the conditioned stimuli (CSs) and tested the dependence of the conditioning on the NAc. We further applied optogenetic stimulation of synaptic inputs to the NAc to eliminate the possible delay of the sensory stimulus to the NAc.

Results
Rapid Pavlovian conditioning with a short CS in head-restrained mice. We used a head-restrained device to deliver a US of water at an arbitrary timing for Pavlovian conditioning. The position of the licking port was set close to the mouth of the mice (Fig. 1a) so that a drop of water would immediately touch the mouse to signify delivery of the US. Thus, licking responses (UCR) were induced just after the presentation of the US (Fig. 1b). Before conditioning, we measured baseline responses to a short, pure tone (8 kHz, 0.5 s) (Fig. 1c), which was subsequently used as the CS, and confirmed that the tone itself did not evoke a licking response (Fig. 1d). For the tone-water-licking conditioning, we presented a CS followed by a US at the CS offset (0.5 s) for 180 trials (Fig. 1e,f). To monitor the formation of the association during conditioning, 20 CS-only trials were pseudo-randomly inserted among the 180 trials with CS-US presentation so that 2 CS-only trials were included in every 20 trials. The learning curve of the conditioning was obtained by plotting the lick scores, calculated as the averaged licking frequency for 2 s from the onset of the CS minus the averaged licking frequency for 2 s before the CS (Fig. 1g). The results showed that mice started to predict US arrival at the presentation of the CS after 40 trials of pairing, and learning was saturated after 120 trials (Fig. 1g, Kruskal-Wallis test, χ²(10) = 39.8, P = 1.
Reward-sensitive period to brief CS in NAc-dependent Pavlovian conditioning. We then determined the reward-sensitive period to a CS of 0.5 s by presenting the US with various delays (Fig. 2a-f). When the US preceded the CS, the CS did not induce licking responses after conditioning (Fig. 2a,b). The mice rapidly predicted the US when the CS preceded the US by no more than 1 s (Fig. 2c-e). However, a CS-US interval of 2 s did not allow the formation of the association (Fig. 2f). The difference in peak frequency between + 0.5 s (Fig. 2d) and + 1 s (Fig. 2e) was consistent with evidence from prior studies showing that the frequency of responses to CSs decreases as the CS-US interval gets longer 33 . The lick scores were calculated as the averaged licking frequency for 2 s after CS presentation minus that for 2 s before CS presentation to plot a learning curve (Fig. 2g) and time window (Fig. 2h). We found that the reward-sensitive period was only within 1 s after the short tone (
NAc-dependence of the conditioning. We tested whether the molecular signaling required for plasticity in the NAc is indispensable for the rapidly forming conditioning. We first examined CaMKII signaling by an autocamtide 2-related inhibitory peptide (AIP), a peptide that inhibits CaMKII activity 36 , with which we previously showed that AIP expression in the SPNs prevented plasticity and learning 37 . Then, an adeno-associated virus (AAV) vector with a PPTA promoter for D1-SPNs 25 (Fig. 3a) was injected bilaterally into the NAc, and the extent of the expression was monitored by a green fluorescent protein that was co-expressed with AIP using a P2A cleavage site (Fig. 3b,c). We tested the behavioural effects of AIP expression in the NAc and found that the AIP expression in the NAc abolished learning (Fig. 3d-g) (two-sided Mann-Whitney U test, U = 3, P = 0.01). In contrast, expression of AIP in the prefrontal cortex (PFC) under a CaMKII promoter did not affect conditioning (Fig. 3h, Supplementary Fig. S2  www.nature.com/scientificreports/ indicated that the current rapid conditioning task preferentially relied on the NAc molecular signaling related to plasticity, unlike other reward conditioning that involves the PFC 12,34,35 , which may have a longer eligibility trace 30 . Next, we injected a dopamine D1R antagonist (SCH23390) into the bilateral NAc during conditioning (Fig. 3i). The D1R antagonist blocked the conditioning when the CRs were measured at the end of conditioning (Fig. 3j-m) (two-sided Mann-Whitney U test, U = 3, P = 0.044). The D1R antagonist also partially inhibited US responses, suggesting that D1R inhibition also affected motor components. Furthermore, CRs on the following day, when no drug was present, were also inhibited in mice that had received the D1R antagonist (Fig. 3n) (two-sided Mann-Whitney U test, U = 3, P = 0.047), supporting that the D1R antagonist blocked conditioning.
Reward-sensitive period to optogenetic stimulation of the synaptic input to the NAc. Although we found a reward-sensitive period of 1 s in the NAc-dependent conditioning task, it is still possible that the observed window was formed upstream of the NAc and the NAc mechanism was far shorter. To exclude this possibility, we applied optogenetics to stimulate glutamatergic inputs to the NAc directly. Previous studies showed that the basolateral amygdala (BLA) to NAc pathway represents CS information [20][21][22] , and also reinforces behaviours 22 . We hypothesized that weak optogenetic stimulation of this pathway acts as a CS while strong stimulation acts as a reinforcer. The ChR2-expressing AAV vector was injected into the left amygdala, and an optical fibre was placed in the ipsilateral NAc (Fig. 4a,b). First, we replicated reinforcement effects of the BLA to NAc pathway (Supplementary Fig. S3 online) by stimulating axonal fibres (457 nm, 5 ms, 20 Hz, ten times) at high (> 5 mW) laser power (Supplementary Fig. S3 online) (Kruskal-Wallis test, χ²(3) = 19.1, P = 0.0003; post hoc Steel's test: laser on at low power vs. laser off, P = 0.87, laser on at high power vs. laser off, P = 0.0036, laser on at low power vs. laser on at high power, P = 0.0019). In contrast, subthreshold low laser powers (< 3 mW) did not reinforce this behaviour (laser on at low power vs. laser off at low power, P = 0.87) (Supplementary Fig. S3 online). We then tested whether this weak stimulation of synaptic inputs (optogenetic conditioned stimulus, CSopto) could be associated with the US. In head-fixed mice, blue light stimulation (20 Hz, 0.5 s, 5 ms pulse) of CSopto alone in the NAc did not cause the licking response (Fig. 4c,d). When CSopto was paired with a US of water (Fig. 4e,f), the mice started to show anticipatory licking to CSopto within 40 trials (Fig. 4e,f, In contrast, mice injected with a Venus vector without ChR2 did not form an association (Fig. 4g,h) (Kruskal-Wallis test, χ²(10) = 6.52, P = 0.76), indicating that mice did not respond to optical stimulation itself as a CS but that the conditioning relied on optically induced synaptic activation. Moreover, CSopto conditioning was dependent on the D1R, which was tested using a within-subject design to functionally confirm virus injection and fibre placement for ChR2 excitation (Supplementary Fig. S4 online, two-sided Mann-Whitney U test, U = 3, P = 0.018).
Finally, we examined reward-sensitive periods for the CSopto (20 Hz, 0.5 s) (Fig. 5). The time window of conditioning by the CSopto was within 1 s after the onset of CSopto (Fig. 5h) (Fig. 2h). For the negative conditions (− 1 s, − 0.5 s, and 2 s), we confirmed successful conditioning with a 1 s delay on the next day (Supplementary Fig. S5 online), indicating that the negative results were not due to inappropriate virus injection or optical fibre placement.

Discussion
We demonstrated that the reward-sensitive period was 1 s after the brief CS, which was similar even with the optogenetic stimulation of glutamatergic inputs in the NAc with a Pavlovian conditioning task in head-restrained mice. The period was in good agreement with the temporal profile of synaptic eligibility trace in the NAc. Thus, our data provide a behavioural line of evidence to apply the timing of the synaptic eligibility traces to construct RL models.
At the molecular level, the time window of 1 s suggests that the temporal scale is mainly determined by a signaling pathway involving D1R, Ca²⁺ priming of adenylate cyclase (AC), protein kinase A (PKA), and CaMKII 25,28,29 [...] increase in cAMP concentration even in the presence of reward-related phasic dopamine input, which activates the cAMP production pathway of D1R-G s/olf -AC 25,28 . When postsynaptic action potentials cause Ca²⁺ influx, Ca²⁺-sensitive AC is primed for 1 s so that dopamine can outcompete phosphodiesterase activity to allow cAMP to increase, which in turn activates PKA. PKA then disinhibits CaMKII specifically at the spine, which receives presynaptic glutamatergic inputs concurrently with postsynaptic activity 25,28 . This time window of 1 s is longer than another major time window determined by NMDA receptors that detect concurrent presynaptic and postsynaptic activities for plasticity at ~ 50 ms 38 . This indicates that the synaptic eligibility trace mechanism effectively prolongs the duration of reward detection but compromises precision in the detection of temporal contiguity. Interestingly, similar molecular timing mechanisms associated with Ca²⁺-sensitive AC have been found in Aplysia 39,40 and in insects [41][42][43] , suggesting that the neuronal mechanism involving Ca²⁺-sensitive AC may resolve the tradeoff between sensitivity and precision.
The short NAc eligibility trace predicts that NAc plasticity becomes predominant when reward immediately follows preceding sensory events. For example, the visual and olfactory cues of foods are usually present immediately before tasting. The palatable reward of foods can thus strongly reinforce sensory cues through the synaptic eligibility trace in the NAc so that the sensory cue alone can subsequently activate the NAc. The NAc strongly reacts to sensory cues of foods in both humans 44,45 and rodents 46 . 
Rapid action of addictive substances taken by inhalation or injections would explain the NAc reactions to predictive cues 47 . Thus, the short synaptic eligibility trace may explain why the NAc activities react to the sensory information of reward itself.
The three factors of the presynaptic input, postsynaptic action potentials of SPNs, and dopamine may contain specific information for learning, assuming the involvement of synaptic eligibility trace. Several lines of behavioural evidence support the idea that the presynaptic input represents the CS 20-22 and dopamine activity represents a reward prediction error [15][16][17][18][19] . In contrast, the exact information represented by postsynaptic action potentials has not been well clarified. We argue two possible models here. One model is that the postsynaptic action potentials cause licking behaviours by activating downstream brainstem nuclei 48,49 . Consistent with this idea, we showed that CSopto induced a transient, rhythmic licking movement, supporting the existence of a licking pathway downstream of the NAc. Spontaneous licking occurred even before establishment of learning (baseline licking in Fig. 1f) once after water presentation (baseline licking in Fig. 1b vs. d), suggesting that licking-related postsynaptic activities during the CS period may fire together with CS-related presynaptic inputs to generate a synaptic eligibility trace so that subsequent dopamine inputs can cause plasticity for autoshaping of conditioning. Instead, a Pavlovian association model requires licking-related postsynaptic activities during US periods to be associated with preceding CS-related presynaptic activities. In this scenario, CS-induced presynaptic activities and US-induced postsynaptic activities are separated by intervals up to 1 s which cannot cause plasticity given the known synaptic mechanisms in the NAc but can do so in the hippocampus 50 . The other model is that CS-related presynaptic inputs cause dendritic spikes instead of action potentials to induce plasticity 51 when subsequent dopamine inputs arrive; once synaptic weights have been enhanced by this plasticity, CS-related presynaptic activity can trigger action potentials. 
A limitation of this model is that it cannot explain why particular behaviours, licking responses in our study, are selectively reinforced during conditioning. The actual circuit model needs to be clarified in future studies by visualization of learning-related circuits and timing-specific manipulation of relevant neural circuits.
Even without eligibility traces, a temporal-difference (TD) algorithm provides a model for explaining associations between two temporally separated events. In the TD model, time is represented as discrete states and the reward value is initially associated with the state at the timing of reward. Then, as learning proceeds through multiple trials, the value gradually shifts back to the onset of the CS 15 . This model can explain associations between two temporally separated events at any interval given a sufficient number of trials, which is inconsistent with our observation of the time window. It is still possible, however, that a gradual backward shift of licking occurred in our study, a pattern which is predicted by TD learning theory. Although we observed no apparent shifting of licking responses using a short auditory CS (Fig. 1), a definitive analysis was difficult because the onset of licking was ambiguous owing to baseline responses measured during the early period of conditioning. As shown in a human study, development of one-shot learning is needed to exclude involvement of the TD learning pattern 52 . In one previous study with rats, it was found that CS-induced dopamine responses did not follow the TD learning pattern but instead exhibited a CS-induced response at the onset of the CS, a pattern consistent with learning models involving eligibility traces in conditioning with a CS-US interval of 1 s 53 . 
Interestingly, in a recent study with mice in which an olfactory CS and CS-US intervals of 3 s were used, the investigators observed gradual shifts toward the onset of CSs over multiple trials 54 , suggesting that TD mechanisms also play a role in learning but with longer intervals than the synaptic eligibility trace.
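The backward shift of value predicted by TD learning can be sketched with a tabular TD(0) simulation. This is an illustrative toy under stated assumptions, not a model fitted to the present data: the number of time steps, the learning rate `alpha`, the discount `gamma`, and the function name `td_values` are all hypothetical.

```python
def td_values(n_steps=5, n_trials=50, alpha=0.2, gamma=1.0, reward=1.0):
    """Tabular TD(0) over one CS-US interval split into discrete time steps.

    State 0 is CS onset; the reward arrives on leaving the last state.
    Returns the value of every state after each trial.
    """
    values = [0.0] * (n_steps + 1)  # extra terminal state, value fixed at 0
    history = []
    for _ in range(n_trials):
        for s in range(n_steps):
            r = reward if s == n_steps - 1 else 0.0
            delta = r + gamma * values[s + 1] - values[s]  # TD error
            values[s] += alpha * delta
        history.append(values[:n_steps])
    return history

history = td_values()
first_trial = history[0]   # value only at the state next to the reward
last_trial = history[-1]   # value has propagated back to CS onset
```

In this toy, value reaches CS onset for any interval if enough trials are given, which is the property contrasted above with a fixed reward-sensitive window.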
Ethologically relevant behaviours require longer reward time windows than the synaptic eligibility traces. Working memory-like mechanisms may send persistent inputs to the NAc 32 , which may activate the synaptic eligibility trace even after the cessation of external sensory inputs. Second-order conditioning, where reward predicting CS becomes a reinforcer for other preceding events, also allows learning from longer reward delays 15,54,55 . Synaptic mechanisms with more prolonged eligibility traces outside the NAc 30,31 can play direct roles in complex reward learning 12,34,56 . How the NAc and additional brain mechanisms interplay during complex reward learning will be a future research focus.
In conclusion, we identified that the reward-sensitive period was 1 s in the NAc-dependent rapid conditioning task, which is in close agreement with the dopamine-sensitive period for synaptic plasticity in the NAc. Such biologically defined temporal constraints may help to understand and construct biologically plausible RL models.
3)-mCherry. The PPTA promoter, a D1-SPN specific promoter, was cloned from the mouse as described previously 25,57 . Autocamtide 2-related inhibitory peptide (AIP), a CaMKII inhibitory peptide, and self-cleaving 2A peptide of porcine teschovirus-1 (P2A) were fused with clover and cloned in a sCre dependent double inverted ORF expression vector designed using sloxP and sloxP (M1). The original plasmid containing hChR2(H134R) was a kind gift from Dr. Deisseroth, and sCre was purchased from Kazusa DNA Research Institute (Japan) 58 . AAV vectors were produced, and their titers were measured as described previously 59 . Briefly, plasmids for the AAV vector, pHelper (Stratagene), and RepCap5 (Applied Viromics) were transfected to HEK293 cells (AAV293, Stratagene). After 3 days of incubation, the cells were collected and purified twice using iodixanol. The titers for AAV were estimated using a quantitative polymerase chain reaction.

Animals and surgery.
Wild-type or DAT-IRES-Cre (B6.SJL-Slc6a3tm1.1(cre)Bkmn/J, The Jackson Laboratory) male B6J mice aged 2-4 months were used. These mice were housed on a 12-h light/12-h dark cycle. A custom-made titanium plate was attached to the head using dental cement. For AIP experiments in the NAc, a total of 1.5 μl of the AAV mixture of PPTA-sCre (5 × 10¹¹ GC/ml) with either EF1-sDIO(M1)-Clover-P2A-AIP (2 × 10¹³ GC/ml) or EF1-sDIO(M1)-Clover (1 × 10¹³ GC/ml) was bilaterally injected (AP + 1.
Behavioural experiments. Mice were allowed 4 days for recovery after head plate installation in experiments without virus injections and 3 weeks for recovery in experiments with virus injections. Mice were then habituated for 3 days to the experimental setup without head fixation, and water restricted such that body weight was maintained at no less than 80% of the baseline weight. On the day of the experiment, the mice were head-fixed, and the licking responses to the tone presentation (8 kHz, 70 dB) used as the CS were monitored for five trials (day 1, baseline session). For the US, a drop of 5% glucose water (2 μl) was presented through the tip of a lick port controlled by a syringe pump. The position of the lick port was set such that the drop of water contacted the mouth of the mice to induce licking without any training. The conditioning session consisted of 180 trials with the presentation of CS-US pairs and 20 trials with the presentation of the CS only. For the time window experiment, each mouse was assigned to one of the CS-US delays of − 1 s, − 0.5 s, 0 s, + 0.5 s, 1 s, or 2 s with a CS duration of 0.5 s. For the CS duration experiment, each mouse was assigned to one of the CS durations of 0.2 s, 1 s, 2 s, 3 s, or 4 s. The data from the mice assigned to the CS-US delay of + 0.5 s were also used as the data for the CS duration of 0.5 s. The intervals between the trials were randomized with a uniform distribution between 15 and 21 s, with a mean of 18 s. 
To monitor learning during conditioning, CS-only trials were pseudo-randomly inserted so that two CS-only trials were included for every 18 CS-US trials to record conditioned reflexes (CRs) without the US. The licking responses were electrically measured. The control of the stimulus presentations and the recording of the licking responses were performed with custom software written in LabView (National Instruments). For experiments with ChR2 stimulation, a fibre cannula was connected to a blue laser (473 nm, Thorlabs). For the operant conditioning session 22 shown in Supplementary Fig. S3, conditions with laser on and off were alternately repeated twice. In the laser-on condition, axonal fibres were stimulated (5 ms pulse, ten times at 20 Hz) 100 ms after the detection of a licking event, while no stimulation was made in the laser-off condition. After the stimulation, we inserted a 500-ms refractory period during which no stimulation was delivered even if the sensor detected licking. The number of licking responses was counted for 190 s. To initiate licking, the lick port delivered a drop of water once 10 s before recording. The session was repeated with increasing laser power from 1, 2, 3, 5, 7.5 to 15 mW (200 μm core fibre) or until the lick counts of the mice during the laser-on period were 20 times greater than those during the laser-off period. For Pavlovian conditioning with ChR2, 20-Hz laser stimulation (5 ms pulse, 1 or 2 mW) given 10 times (CSopto) was substituted for the CS tone.
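The session structure described above (180 CS-US trials, 20 CS-only trials with two in every block of 20, and inter-trial intervals drawn uniformly from 15 to 21 s) can be sketched as follows. This is a hypothetical reconstruction for illustration; the original control software was written in LabView, and the function name, labels, and seed are assumptions.

```python
import random

def make_session(n_blocks=10, block_size=20, cs_only_per_block=2,
                 iti_range=(15.0, 21.0), seed=0):
    """Build a pseudo-random trial order and inter-trial intervals (seconds).

    Each block of `block_size` trials contains exactly `cs_only_per_block`
    CS-only probe trials at shuffled positions; the rest are CS-US pairings.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_blocks):
        block = (["CS-US"] * (block_size - cs_only_per_block)
                 + ["CS-only"] * cs_only_per_block)
        rng.shuffle(block)
        trials.extend(block)
    itis = [rng.uniform(*iti_range) for _ in trials]
    return trials, itis

trials, itis = make_session()
```

Shuffling within blocks, rather than across the whole session, keeps the probe trials evenly spread so that the learning curve can be sampled every 20 trials.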
For the drug infusion experiment, SCH23390 (400 μM, Abcam) dissolved in ACSF (125 mM NaCl, 2.5 mM KCl, 2 mM CaCl 2 , 1 mM MgCl 2 , 1.25 mM NaH 2 PO 4 , 26 mM NaHCO 3 , and 20 mM glucose) or ACSF for controls was infused at the rate of 16.66 nl/min by a syringe pump (Legato111, KD scientific) 30 min before the experiments. The infusion was continued during the conditioning at the rate of 14.9 nl/min. For pharmacological experiments during CSopto conditioning, SCH23390 or saline were intraperitoneally injected 30 min before the conditioning experiments. Doses of 0.25 and 0.5 mg/kg were tested. As the results were similar between the doses, the data were pooled in the analysis.
Histological analysis. For the AIP experiments, the mice were subjected to histological analysis to confirm AIP expression in the NAc. After the behavioural experiments, the mice were transcardially perfused with 4% paraformaldehyde and decapitated. Coronal slices of 50-μm thickness were obtained. Clover fluorescence was observed using stereoscopic microscopy (Leica M165-FC), and images were captured with a CMOS camera (Hamamatsu photonics ORCA R2). AIP expression was considered sufficient if it was expressed bilaterally, including more than 3/4 of the anterior part of the anterior commissure, a NAc surrounding structure. Out of the 18 NAc-injected mice, five failed to satisfy this criterion (one did not exhibit expression at all, three exhibited unilateral expression only, and one exhibited expression only in the medial half of the NAc) and were therefore excluded from behavioural analyses. For some slices, detailed fluorescence images were obtained using confocal microscopy (Leica, SP5) of the preparations, which were counter-stained using DAPI.

Data analysis.
For the analysis of the CS-induced licking responses (CRs), we calculated the lick score in the CS-only trials as [average licking frequency (Hz) during 2 s after CS presentation] − [average licking frequency (Hz) during 2 s before CS presentation]. The Kruskal-Wallis test followed by Steel's test, the t test, and the Wilcoxon rank-sum (Mann-Whitney U) test were used for statistical comparisons, with a significance threshold of P < 0.05. Data analyses were performed using Excel (Microsoft) and Excel Statistics (SSRI). Data are presented as mean ± SEM.
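The lick score defined above can be computed as in the following minimal sketch. It assumes lick events are available as timestamps in seconds; the function names and the example timestamps are hypothetical and are not data from the study.

```python
def lick_score(lick_times, cs_onset, window=2.0):
    """Lick score for one CS-only trial, in Hz.

    Mean lick frequency in the `window` seconds after CS onset minus the
    mean lick frequency in the `window` seconds before CS onset.
    """
    after = sum(cs_onset <= t < cs_onset + window for t in lick_times)
    before = sum(cs_onset - window <= t < cs_onset for t in lick_times)
    return (after - before) / window

def mean_lick_score(trials, window=2.0):
    """Average the score over (lick_times, cs_onset) pairs."""
    scores = [lick_score(licks, onset, window) for licks, onset in trials]
    return sum(scores) / len(scores)

# Made-up example: one lick before the CS at 10 s and four licks after it.
example = lick_score([9.0, 10.2, 10.5, 11.0, 11.9], cs_onset=10.0)  # 1.5 Hz
```

A positive score indicates more licking after the CS than at baseline, and a score near zero indicates no conditioned response.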

Data availability
All data are available from the corresponding author upon reasonable request.