The serial blocking effect: a testbed for the neural mechanisms of temporal-difference learning

Temporal-difference (TD) learning models afford the neuroscientist a theory-driven roadmap in the quest for the neural mechanisms of reinforcement learning. The application of these models to understanding the role of phasic midbrain dopaminergic responses in reward prediction learning constitutes one of the greatest success stories in behavioural and cognitive neuroscience. Critically, the classic learning paradigms associated with TD are poorly suited to cast light on its neural implementation, thus hampering progress. Here, we present a serial blocking paradigm in rodents that overcomes these limitations and allows for the simultaneous investigation of two cardinal TD tenets; namely, that learning depends on the computation of a prediction error, and that reinforcing value, whether intrinsic or acquired, propagates back to the onset of the earliest reliable predictor. The implications of this paradigm for the neural exploration of TD mechanisms are highlighted.

Freezing to the clicker was lower in the blocking groups compared to the control groups and the effect of blocking was similar between the compound and serial groups. Freezing to the pre-trained light was greater in Group Block Simultaneous compared to Control Simultaneous, but the direction of this difference was reversed for the serial groups at least on the first trial (inset).
the target, to-be-blocked cue is introduced during the trace interval (blocking cue → blocked cue → reinforcer). We found that serially presenting the cues yielded an amount of blocking equivalent to that observed in the standard blocking design. Moreover, in the serial control condition we observed an additive effect of second-order conditioning of the earliest cue in the sequence, superimposed on first-order conditioning. As will be seen, these findings have important implications for the neural exploration of reinforcement learning mechanisms.

Serial cue presentation provides an effective blocking examination
During Conditioning, rats in Group Block Simultaneous received delay pairings between a light and a shock, whereby cue offset coincided with shock onset, whereas rats in Group Block Serial received trace conditioning, whereby a 30 s interval was interpolated between cue offset and shock onset. As expected (see Fig. 1a [...] clicker (Fig. 1c: Clicker) were examined separately in the serial groups. Freezing to the light in Groups Block Serial and Control Serial differed on the first trial of Day 1 in Phase 2 training, but this difference disappeared by the second trial, as revealed by no effect of group (F 1 [...]). Thus, blocking could be observed immediately following learning on the first trial of serial compound conditioning. This affords a trial-based theoretical examination of the mechanisms that drive blocking 1,13,14 and shows that, in the present case, blocking results from a downregulation of outcome processing 1 as opposed to cue/attentional processing 13,14. This blocking effect was maintained on the following day (see Fig. 1 legend for statistics). The serial blocking design offers an online confirmation of the effectiveness of blocking, which eliminates the disruptive effects of testing under conditions different from those of acquisition, including but not limited to any perceptual or behavioural masking of the novel cue by the pre-trained cue in the simultaneous compound. Further, this effect provides evidence against a role for local contextual cues in blocking 15, as we would otherwise have seen high levels of fear to the novel cue in Phase 2.
The groups received non-reinforced Tests to the individual stimuli (Fig. 1d). Freezing to the clicker was higher in the Control groups than in the Block groups (see Fig. 1d, data not shown). The higher level of freezing to the light in Group Control Serial compared to Group Block Serial seems counterintuitive given the extra conditioning trials the latter group received during Phase 1. However, conditioning to the pre-trained cue is relatively weak due to the trace period between cue offset and shock onset. Furthermore, fear to the first cue (light) in the serial compound is determined not only by the direct association of that cue with the shock, but also by the backpropagation of the association of the second cue (clicker) in the serial compound with the shock, i.e., second-order conditioning. As noted earlier, conditioning to the second cue (clicker) is greater in Group Control Serial than in Group Block Serial (i.e., the blocking effect), which results in higher levels of second-order conditioning in the former than in the latter group (see also 2,16).

Second-order conditioning in a serial compound procedure
In this second experiment, we sought to confirm the elevated conditioning seen to the first cue of a serial compound and to explore the role of the second cue in this learning. Three groups of rats were conditioned: Group Serial received two sequential cues (conditioned stimuli, CSs) of different modalities (visual and auditory, counterbalanced), where CS1 offset coincided with CS2 onset and CS2 offset coincided with shock onset (CS1 → CS2 → shock); Group Single received trace conditioning with CS1 only (CS1 → trace → shock); and Group Compound received trace conditioning with CS1 and CS2 presented in compound (CS1&CS2 → trace → shock). Stimuli were presented and analyzed in accordance with their temporal relationship with shock. During Conditioning on Day 1, fear to CS1 (Groups Serial and Single) and CS1 + CS2 (Group Compound) increased across trials (Fig. 2a; F 1 [...]; see Fig. 2 legend for statistics).
CS2 was presented following CS1 in Group Serial; therefore, the best comparison for CS2 was the equivalent temporal interval following CS1 presentation in Group Single. Fear to CS2 in Group Serial (Fig. 2b) was similar to fear during the same temporal interval following CS1 offset in Group Single (F < 1, CI [−0.32, 0.87]). This fear increased across trials (F 1,18 = 40.89, CI [1.05, 2.06]) but did not do so differentially between the two groups (F < 1, CI [0.86, 1.16]). No statistical differences were detected between the two conditions on subsequent days (see Fig. 2 legend).
Rats were tested for fear to CS1 and CS2 (where applicable) during non-reinforced sessions. Freezing to CS1 (i.e., the Primacy cue, Fig. 2c: CS1) was greater in Group Serial than in Groups Single and Compound (F 1,27 = 14.42, CI [0.60, 1.99]), while the latter two groups did not differ from one another (F < 1, CI [−1.00, 0.63]). These data provide evidence for a primacy effect when cues are presented serially. There was no effect of training with a single or a compound CS on learning about CS1 in trace conditioning. Freezing to CS2 (Fig. 2c: CS2) was greater in Group Serial than in Group Compound (t 19 = 3.64, p = 0.001), due to the closer temporal position of CS2 to footshock in Group Serial. Interestingly, fear to CS2 during conditioning (Days 1-3) predicted fear to CS1 on test for Group Serial (r = 0.674, p = 0.012), but fear during the same temporal interval in Group Single did not predict fear to CS1 on test (r = 0.273, p = 0.366; Fig. 2d). These data show that training in a serial compound results in stronger conditioning to the first reliable predictor of the compound; that is, fear propagated back to CS1. Furthermore, our data show that this is dependent on the presence of a 'bridging' stimulus (CS2) between CS1 offset and US onset, and that the amount of conditioning acquired by this bridging stimulus across training predicts the amount of fear that is expressed to the temporal primacy stimulus (CS1) on test. In other words, the associative strength acquired by CS2 transferred to CS1, akin to second-order conditioning.

Discussion
In this article, we presented a serial blocking paradigm that is specifically designed to explore the neural circuits underpinning temporal-difference reinforcement learning (TDRL). This paradigm is ideally suited to investigating the neural bases of TDRL's fundamental assumptions that (1) learning will not occur in the absence of a prediction error and that (2) the value of the reinforcer propagates back to the onset of the earliest reliable predictor, the latter being captured by the second-order conditioning effect observed in Group Control Serial. Particularly noteworthy is the fact that, unlike in conventional second-order conditioning procedures, the effect observed in the serial control group does not compete with the development of conditioned inhibition 12. This is a critical advantage in single-unit recording studies, where a high number of training trials is desirable.
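The two assumptions can be illustrated with a minimal TD(0) sketch. This is not the authors' model: it assumes time-tagged features for each cue (so that a cue leaves a memory trace spanning the trace interval), six arbitrary time steps per trial, and hypothetical values for the learning rate, discount factor, and number of trials (Phase 1 is run to asymptote for clarity, well beyond the nine pairings used here). All function and variable names are illustrative.

```python
def run_trials(weights, n_trials, onsets, n_steps=6, alpha=0.1, gamma=1.0):
    """Train with TD(0); the shock (reward = 1) arrives at the end of the last step.

    onsets maps each cue to its onset step. Each cue contributes one
    time-tagged feature (cue, steps since onset) from its onset until the end
    of the trial, i.e. a cue is assumed to leave a trace after its offset.
    """
    def value(t):
        return sum(weights.get((cue, t - on), 0.0)
                   for cue, on in onsets.items() if t >= on)

    shock_errors = []  # prediction error at the time of shock, one per trial
    for _ in range(n_trials):
        for t in range(n_steps):
            last = t == n_steps - 1
            reward = 1.0 if last else 0.0
            v_next = 0.0 if last else value(t + 1)
            delta = reward + gamma * v_next - value(t)  # TD prediction error
            for cue, on in onsets.items():
                if t >= on:
                    key = (cue, t - on)
                    weights[key] = weights.get(key, 0.0) + alpha * delta
            if last:
                shock_errors.append(delta)
    return weights, shock_errors


def clicker_strength(weights):
    """Summed weight of all clicker features (a crude index of conditioning)."""
    return sum(v for (cue, _), v in weights.items() if cue == "clicker")


serial = {"light": 0, "clicker": 3}  # light -> clicker -> shock

# Block Serial: light -> trace -> shock (Phase 1), then the serial compound.
w_block, _ = run_trials({}, 200, {"light": 0})
w_block, err_block = run_trials(w_block, 9, serial)

# Control Serial: serial compound only.
w_ctrl, err_ctrl = run_trials({}, 9, serial)

print("error at shock, first compound trial:", err_block[0], err_ctrl[0])
print("clicker strength:", clicker_strength(w_block), clicker_strength(w_ctrl))
print("value at light onset after Phase 1:", w_block[("light", 0)])
```

Under these assumptions, the pre-trained light leaves almost no prediction error at the time of shock, so the clicker acquires essentially no strength in the blocking condition, whereas in the control condition the error is maximal on the first trial and the clicker conditions. The value at light onset after Phase 1 shows how the shock's value is bootstrapped back, step by step, to the onset of the earliest predictor.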
In addition to being able to test both assumptions at once, the current paradigm offers the neuroscientist the advantage of temporally uncoupling the presentation of the blocking and blocked stimuli. This allows for a dissociable examination of the contribution of specific neural circuits to cognitive processes related to each of these cues. For instance, one could optogenetically target neural structures implicated in the generation of reinforcement expectancies by the blocking cue (e.g., basolateral amygdala or prelimbic cortex in fear 17-19, orbitofrontal cortex in reward 20,21) without affecting redundancy-driven decrements in the salience of the blocked cue 9. Similarly, one could separately examine the contribution of mesolimbic dopamine to (1) temporally specific predictions set up by the blocking cue, (2) prediction error at the time of reinforcement 22, and (3) novelty-related salience when the blocked cue is first introduced (e.g., 23-26). While the present design focuses on fear, staggered but still overlapping presentations of the pre-trained cue and the blocked cue have been effective in producing blocking with a rewarding outcome 27,28, thus leaving no reason to suppose that the current serial design would be ineffective in the appetitive setting. Thus, in combination with techniques with high temporal resolution such as single-unit recording and optogenetics, the serial blocking paradigm offers an unprecedented opportunity to dissect the reinforcement learning circuit.
Notably, the above advantages over the simultaneous paradigm come at no cost in terms of the strength of the blocking effect. The equivalent size of the blocking effect in the serial and simultaneous blocking groups presumably reflects a comparable expectation of reinforcement at the time of its delivery, despite the lower level of responding to the blocking cue observed in the serial group. Thus, the serial blocking paradigm allows the neuroscientist to dissociate a predictor's ability to evoke conditioned behavior (e.g., freezing) from its ability to generate temporally precise reinforcement expectancies and produce blocking. This is consistent with a dissociation between the acquisition of value and that of temporally precise reinforcement expectancies, as specified by 29 as well as the predicted-time-of-arrival hypothesis 30. Finally, our paper, together with the existing body of literature, provides procedural guidance for obtaining blocking. Specifically, blocking will be observed when cue arrangements maintain a consistent temporal relationship between the pre-trained cue and the outcome across phases 31-33 irrespective of cue length 27, when the blocking cue precedes the outcome 32, and when the novel cue does not precede the pre-trained cue 16,27,31,32.

Materials and Methods
Subjects. Forty-eight Long-Evans rats (Charles River; St. Constant, Quebec, Canada) were used (12 rats per group, equal number of males and females) in Experiment 1. Thirty-seven rats (21 males and 16 females) of Long-Evans background (M = 347.5 ± 10.16 g) were used in Experiment 2. One rat (Group Single) was excluded from the analyses of Experiment 2 because it was deemed to be an outlier according to Grubbs' outlier test (Zc = 2.46, Z = 2.57; https://www.graphpad.com/quickcalcs/Grubbs1.cfm). The weights of the rats ranged between 275 and 325 g at the beginning of the experiments. All rats had ad libitum access to food and water and were housed in pairs in standard clear shoebox cages in a humidity- and temperature-controlled environment under reverse light-dark conditions (12:12 h light-dark cycle; lights off at 8:00 a.m.), with experimental sessions occurring about 3-4 hours after the onset of the dark cycle. Rats were handled once a week during the acclimation period in the colony and then daily for 3 days prior to the experiments. All rats were treated in accordance with the approval granted by the Canadian Council on Animal Care and the Concordia University Animal Care Committee.
Habituation. On Day 0, rats were habituated to the auditory and visual stimuli in a single session. Beginning 5 min after placement in the experimental chambers, each cue (clicker or flashing light) was presented twice for 30 s with an intertrial interval (ITI) of 2 min; the session lasted a total of 16 min.
Experiment 1: Phase 1 Conditioning. On each of Days 1-3, rats in the Blocking groups received three pairings between the flashing light and shock, for a total of nine such pairings across Phase 1. The first light-shock pairing took place 5 min after placement in the conditioning chamber, and successive pairings were separated by an average ITI of 5 min (range: 240-360 s). The last light-shock pairing occurred 4.5 min prior to the end of the training session. Rats were returned to the operant chambers 3.5 hours after the training sessions to receive exposure to the context (no cues or other stimuli presented) in order to reduce freezing to the background cues. Rats in the Control groups did not receive Phase 1 conditioning and were merely handled outside the laboratory.
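The outlier criterion reported under Subjects (a Grubbs' statistic Z compared against a critical value Zc) can be sketched as follows. The authors used the GraphPad online calculator; this standalone version only computes the statistic and takes the critical value as an input (2.46 is the Zc reported above). The function names and the example scores are hypothetical.

```python
from statistics import mean, stdev


def grubbs_z(values):
    """Largest absolute deviation from the mean, in sample-SD units."""
    m = mean(values)
    s = stdev(values)
    return max(abs(v - m) for v in values) / s


def has_outlier(values, z_crit):
    """True if the most extreme value exceeds the supplied critical Z."""
    return grubbs_z(values) > z_crit


# Hypothetical scores for 13 animals; the last value is far from the rest.
scores = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 30]
print(grubbs_z(scores), has_outlier(scores, z_crit=2.46))
```

The critical value depends on the sample size and the chosen significance level, so it should be looked up or computed for each data set rather than reused.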

Experiment 1: CS-US relation. For rats in the Simultaneous groups (Block Simultaneous and Control Simultaneous), the CS (or CSs) were trained in a delay procedure such that shock (unconditioned stimulus, US) onset coincided with CS offset, i.e., the cues were presented for 30 s, at the end of which a shock was delivered. For rats in the Serial groups (Block Serial and Control Serial), the light CS was trained in a serial fashion such that light offset was followed by a 30 s trace period at the end of which the shock was delivered (Phase 1), or light offset coincided with clicker onset and clicker offset coincided with shock onset (Phase 2).
Experiment 2: Conditioning. Phase 1 lasted 3 days. Rats in Group Serial received conditioning trials in a serial delay procedure such that CS1 offset coincided with CS2 onset and CS2 offset coincided with shock onset. For rats in Group Single, a single CS, i.e., CS1, was paired with the shock with an interval of 30 s between CS offset and shock onset. For rats in Group Compound, a simultaneous compound presentation of CS1 and CS2 was paired with the shock with an interval of 30 s between compound offset and footshock onset. All groups received three pairings per day, for a total of nine pairings across Phase 1. The first pairing took place 5 min after placement in the conditioning chamber, and successive pairings were separated by an average ITI of 5 min (range: 240-360 s). The last CS-shock pairing occurred 4 min prior to the end of the training session. Rats were returned to the operant chambers 3.5 hours after the training sessions to receive exposure to the context (no cues or other stimuli presented) in order to reduce freezing to the background cues.
Experiment 1: Non-reinforced Tests. Rats were tested for fear to the clicker on Days 6 and 7 and to the flashing light on Day 8. Data for fear to the clicker were pooled across the two tests. Test sessions consisted of eight 30 s non-reinforced presentations of the conditioned cue (light or clicker) 1 min apart. Each test session began with a 5 min acclimation period prior to the first presentation of a cue. Rats were removed from the conditioning chambers 1 min following the last (eighth) presentation of the cue.
Experiment 2: Non-reinforced Tests. Rats were tested for fear to CS1 and CS2 on Days 6 and 7, respectively (i.e., rats in Group Single were only tested to the conditioned cue on Day 6). Procedurally, the Test sessions were identical to those described in Experiment 1.
Scoring and Statistics. All sessions were videotaped and scored offline. Freezing behavior was scored on a second-by-second basis with a timestamp procedure in which each rat was observed for the entire session and scored as either freezing or moving. Freezing was defined as the absence of all movement except that related to breathing (Blanchard & Blanchard, 1969) 34. The percentage of time spent freezing over the total observation time was calculated for each rat. A second scorer, blind to the subjects' group assignment, scored a random subset of the data. The correlation between the scorers (AM and PP) was 0.99. Experiment 1 was based on a classic blocking design and Experiment 2 was based on data obtained in Experiment 1; the hypotheses with regard to the directionality of the differences were therefore pre-determined. Accordingly, our data were analyzed using planned orthogonal contrasts (version 21, PSY2000). Significance was set at the 0.05 level, and confidence intervals were standardized and presented in standard deviation units.
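As a rough illustration of the scoring measures described above, the sketch below computes percent freezing from second-by-second scores and the Pearson correlation between two scorers. It assumes each second is coded 1 (freezing) or 0 (moving); the function names and all numbers are hypothetical and are not the study's data.

```python
from math import sqrt


def percent_freezing(seconds):
    """Percentage of scored seconds coded as freezing (1 = freezing, 0 = moving)."""
    return 100.0 * sum(seconds) / len(seconds)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# One hypothetical 4 s observation window: frozen for 3 of the 4 seconds.
print(percent_freezing([1, 1, 0, 1]))  # 75.0

# Hypothetical percent-freezing scores from two scorers for the same rats.
scorer_a = [10.0, 50.0, 90.0]
scorer_b = [12.0, 48.0, 91.0]
print(pearson_r(scorer_a, scorer_b))
```

In practice the inter-scorer correlation would be computed over the randomly selected subset of sessions that both scorers rated.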

Data Availability
All data will be made available upon request.