Meta-reinforcement learning via orbitofrontal cortex

The meta-reinforcement learning (meta-RL) framework, which involves RL over multiple timescales, has been successful in training deep RL models that generalize to new environments. It has been hypothesized that the prefrontal cortex may mediate meta-RL in the brain, but the evidence is scarce. Here we show that the orbitofrontal cortex (OFC) mediates meta-RL. We trained mice and deep RL models on a probabilistic reversal learning task across sessions during which they improved their trial-by-trial RL policy through meta-learning. Ca2+/calmodulin-dependent protein kinase II-dependent synaptic plasticity in OFC was necessary for this meta-learning but not for the within-session trial-by-trial RL in experts. After meta-learning, OFC activity robustly encoded value signals, and OFC inactivation impaired the RL behaviors. Longitudinal tracking of OFC activity revealed that meta-learning gradually shapes population value coding to guide the ongoing behavioral policy. Our results indicate that two distinct RL algorithms with distinct neural mechanisms and timescales coexist in OFC to support adaptive decision-making.


Article
https://doi.org/10.1038/s41593-023-01485-3 that process memory signals by themselves unlike biological neurons, but in our implementation with regular recurrent units, our networks only process and maintain history signals through recurrent activity dynamics.However, we also note that the deep RL model is not intended to be an accurate network model for the brain.Instead, our goal here is to compare and contrast between the brain and the deep RL model.These deep RL models were previously used to propose the theory that a single network can perform multiple layers of RLs using synaptic plasticity-and activity dynamics-based computations [3][4][5][6] , but it has not been determined whether the brain uses similar processes to use distinct RL mechanisms in a single area to mediate meta-RL.
We trained the deep RL models in the same probabilistic reversal learning task using the meta-RL framework.As with the trained mice, the trained deep RL models dynamically adjusted their choice preference based on their choice and reward outcomes on each trial (Fig. 1d-f and Extended Data Fig. 2a-c).Notably, because the synaptic weights were fixed within each session, the results indicate that across-session plasticity established recurrent connectivity that can implement trial-by-trial RL using recurrent activity dynamics (Fig. 1g), similar to previous reports [3][4][5] .
Having established that both mice and deep RL models improve their task performance over training, we next examined their action policies of trial-by-trial RL during training.We quantified their history-based action policies using a logistic regression model fit to the behavior in each session.We found that both mice and deep RL models learned to choose the side that was more frequently rewarded (positive weights for reward history) in the recent trials (Fig. 1h,i).Additionally, deep RL models developed reward-independent choice alternation (negative weights for choice history) during training.This tendency for choice alternation by deep RL models is beneficial in this task because the probability of a lickport loaded with a reward cumulatively increases if the side has not been selected in recent trials 11 .Mice do not appear to make use of this feature of the task in our particular experimental condition as indicated by the nonnegative weights for choice history.As a result, mice primarily learned to choose the side with the higher predetermined reward assignment probability for the trial block (0.6 or 0.525), while deep RL models learned to exploit the cumulative nature of the reward probability and choose the side that is more likely to give reward in individual trials (Fig. 1e,f and Extended Data Fig. 2a-f).Despite this difference in the reward-independent choice effects, the overall behavior of both mice and deep RL models could be well fit by RL models (Extended Data Fig. 2g), which allowed us to estimate their subjective action values on each trial (Fig. 1b,d).
Although the summed history weights of the regression model revealed a gradual increase in the magnitude of history dependence (Fig. 1i), the analysis does not distinguish whether they changed the way they integrate history events (shape of history kernels) during training.To quantify the stability of their action policy, we defined the action policy axis for each type of history using the history regression coefficients and measured the angle between the policy axes from different sessions (Fig. 1j).We found that the reward history angle between adjacent sessions was initially large and gradually decreased, (Fig. 1k,l).Thus, both mice and deep RL models dynamically updated their reward-based action policies in early sessions and gradually stabilized their RL policies during training.Additionally, deep RL models that learned to use choice history stabilized their choice-based action policies.
These results demonstrate that both mice and deep RL models meta-learned to perform RL (that is, meta-RL) and equip us with a behavioral model that allows us to estimate the subjective action values on a trial-by-trial basis.

OFC plasticity is required for across-session meta-learning of RL
In the deep RL model with the meta-RL framework, slow synaptic plasticity across sessions during training establishes a network that optimizes their behavioral action policies for a fast-timescale RL.This slow RL was mediated by Ca 2+ /calmodulin-dependent protein kinase II (CaMKII)-dependent synaptic plasticity in the orbitofrontal cortex (OFC) in the mouse brain.Following the slow RL, the neural activity in both OFC and deep RL models robustly encoded action value signals that are necessary for the fast RL behaviors.Longitudinal imaging of neural activity revealed the dynamics and stabilization of the activity-based fast RL through the plasticity-based slow RL in OFC.Although the precise learning algorithms for synaptic plasticity may differ between the mouse OFC and the deep RL models (for example, nonbiological backpropagation algorithm), both exploit a shared neural network that performs synaptic plasticity-based computations and activity dynamics-based computations for two layers of RLs.These results highlight OFC as a critical prefrontal area that mediates multiple layers of RLs in parallel to mediate meta-RL.

Meta-learning of RL
Following pretraining in which mice were familiarized with the task apparatus and licking to receive rewards (Methods), we trained mice to learn to perform RL on a probabilistic reversal learning task [7][8][9][10][11] (Fig. 1a).Head-fixed mice reported their choices with directional licking (left or right) after a ready period (2-2.5 s).On each trial, each lickport was loaded to release a water reward upon licking according to its current reward assignment probability, and mice received a reward only when the chosen lickport was loaded with a reward on the trial.Once a lickport was loaded, it remained loaded in subsequent trials until the reward was collected (baiting).The reward assignment probabilities (A L and A R ) on the lickports (0.6 versus 0.1 or 0.525 versus 0.175) changed every 60-80 trials without cue, encouraging mice to dynamically update their subjective action values for left and right based on their choice and reward outcome on each trial.Across the training sessions, mice improved their task performance by learning to adjust their choice preference dynamically toward the side that was more frequently rewarded in the recent trials (Fig. 1b and Extended Data Figs. 1 and 2a-c).We quantified their task performance by the following two measures: the probability of choosing the side with higher reward assignment probability (P(choosing A High )), and the average probability of reward availability on the chosen side in our task with the baiting rule (optimality score; Methods).Learning resulted in an improvement of both performance measures (Fig. 1e,f and Extended Data Fig. 2a-c).Thus, with training over sessions, mice improved their trial-by-trial RL action policy through across-session RL (that is, meta-learning of RL).
As a conceptual framework for such meta-RL, we adopted an in silico network model under the meta-RL framework [3][4][5] .In this framework, we trained deep RL models consisting of recurrent networks with advantage actor-critic (A2C) method 12 (Fig. 1c).The recurrent layer receives inputs of the choice and outcome information only at the single time step immediately after a choice and incorporates them in the activity through their recurrent connectivity.Therefore, the ongoing network activity reflects the cumulative history from past trials.The action probability on each trial is computed from the network activity.The synaptic weights within the deep RL model are fixed during each training session (500 trials).However, at the end of each training session, the recurrent network updates its synaptic weights using the outer-loop RL of the meta-RL framework that evaluates the performance in each training session as a critic by calculating temporal difference (TD) errors 13 .Because of this infrequent weight updating, the networks cannot use weight updates to mediate trial-by-trial RL.In these situations, recurrent networks are encouraged to learn to use activity dynamics for the trial-by-trial RL [3][4][5][6] .While adopting this framework from previous publications [3][4][5] , we used recurrent networks with regular recurrent units instead of long short-term memory (LSTM) units used in other studies.LSTM units have internal gate functions implements trial-by-trial RL using recurrent activity dynamics (Fig. 1g).We examined whether a similar mechanism is involved in mice.We focused on OFC as a candidate area that may mediate meta-RL in the mouse brain.Previous studies showed that OFC neurons encode value signals [14][15][16][17][18][19] and undergo structural synaptic plasticity in reward-based learning [20][21][22] .Furthermore, OFC forms reciprocal connections with ventral tegmental area (VTA) neurons [23][24][25][26][27][28] , a source of TD error in the brain 29 .Based on these previous findings, we hypothesized that synaptic plasticity in OFC is involved in the slow across-session meta-learning of meta-RL in the mouse brain.
To test this hypothesis, we sought to block the plasticity induction by targeting CaMKII, a master regulator of synaptic plasticity in the brain 30 .We used paAIP2, a light-inducible inhibitor of CaMKII kinase activity [31][32][33] that can block synaptic plasticity and impair learning 31,33 (Fig. 2a).Notably, photoactivated paAIP2 selectively blocks the induction of long-term potentiation (LTP) without affecting the CaMKII function of LTP maintenance 31,32 , leaving the connectivity established before the photoactivation intact.We performed experiments in OFC slices and confirmed that photoactivation of paAIP2 potently blocks structural LTP in dendritic spines (Fig. 2b-e).Furthermore, we validated its in vivo efficacy in the mouse primary motor cortex (M1) during a lever-press motor learning task, a well-established paradigm that induces synaptic plasticity [34][35][36] .We found that the increase of spine volume and the formation of new spines over days of learning were suppressed by photoactivation of paAIP2 in M1 neurons, and the neurons maintained their health with normal dendritic structures and spine density after 2 weeks of daily photoactivations (Extended Data Fig. 3).Although we did not measure the effect of paAIP2 on functional plasticity, structural plasticity has been repeatedly shown as an accurate proxy for functional plasticity 37,38 .
To examine the involvement of OFC plasticity in across-session meta-learning of RL, we virally expressed paAIP2 locally in OFC neurons and photoactivated paAIP2 for 3 s after every choice throughout consecutive 30 training sessions (Fig. 2f).We compared their across-session learning of history-based action policy against the control group from the same litters with EGFP expression without paAIP2.We found that the task learning, acquisition of history dependence and stabilization of reward-based action policy were delayed in https://doi.org/10.1038/s41593-023-01485-3 the paAIP2 group (Fig. 2g-j and Extended Data Fig. 4a,b).Plasticity suppression by paAIP2 did not affect task engagement or general task performance such as the choice bias (Extended Data Fig. 4d-h).We confirmed that paAIP2 photoactivation did not alter the firing properties of OFC neurons in organotypic slices (Extended Data Fig. 5a-e), and the baseline firing rates of OFC neurons were not affected by ~1 h of photoactivation during the RL task in vivo (Extended Data Fig. 5f).To examine potential long-term effects on OFC neurons, we also recorded OFC neural activity after consecutive 30 training sessions with paAIP2 photoactivations.We found that this long-term paAIP2 photoactivation did not alter firing rates at the baseline, firing rates during the RL task or population value coding (Extended Data Fig. 5g,h).Therefore, OFC neurons maintained healthy firing properties with our experimental protocol, and the selective paAIP2 effects on learning indicate that local synaptic plasticity in OFC is necessary for the efficient, across-session meta-learning of the RL action policy in mice.

Trial-by-trial RL is independent of CaMKII-dependent synaptic plasticity in OFC
The abovementioned results are consistent with the notion that OFC plasticity during learning establishes a circuit that implements within-session, trial-by-trial RL using activity dynamics, similar to the deep RL model with the meta-RL framework.However, it is possible that OFC plasticity might also contribute to within-session RL by reflecting value updates with updates of synaptic weights.Therefore, we next tested whether OFC plasticity is necessary for within-session RL in expert mice.A cohort of paAIP2-expressing mice was first trained without photoactivation until their performance reached the expert level.We then performed paAIP2 photoactivation on alternating sessions using the same light protocol as in the learning experiments described above.To minimize nonspecific effects of light such as distraction in photoactivation sessions, another pair of light-emitting diodes (LEDs) was used to shine a masking blue light over the mouse head in every trial in both photoactivation and control sessions.
Unlike the photoactivation experiments during the learning phase, blocking OFC plasticity after the acquisition of an expert action policy did not cause any detectable changes in the task performance and history dependence of mouse behavior (Fig. 3 and Extended Data Fig. 4c,i-m).Although we cannot exclude potential contributions of CaMKII-independent forms of plasticity (for example, short-term plasticity on presynaptic neurotransmitter release) that are not blocked by paAIP2, our results indicate that CaMKII-dependent plasticity in OFC is selectively required for the meta-learning of RL but not for the within-session RL of expert mice.

OFC activity robustly encodes value signals
After the across-session RL, deep RL models perform trial-by-trial RL using their recurrent activity.Thus, we next examined whether trial-by-trial RL in expert mice is mediated by OFC activity.We first examined the encoding of action value-related signals, which are necessary for the trial-by-trial RL.We analyzed how the network activity encoded the following three value-related signals: ΔQ (value difference between left and right), which is the policy value that directly drives decision-making; ∑Q (sum of two action values), which reflects state value and motivation and Q ch (value of the side chosen in the previous trial), which is the value that was updated by the preceding action.As expected, in the trained deep RL models, these value signals https://doi.org/10.1038/s41593-023-01485-3could be reliably decoded from the population activity in the recurrent layer on each trial at the time steps before the decision (Fig. 4a).
To examine value coding in OFC, we performed in vivo two-photon calcium imaging of layer 5/6 OFC neurons.This was done through an implanted gradient-index (GRIN) lens in CaMKIIa-tTA::tetO-GCaMP6s double transgenic mice that express GCaMP6s in cortical excitatory neurons 7,39,40 (Fig. 4b).Extracted fluorescence signals from individual neurons were deconvolved to estimate the underlying spiking activity before analyses (Supplementary Fig. 1) 41,42 .OFC neurons were heterogeneous with different neurons exhibiting activity peaks at different trial periods, collectively tiling the entire trial period (Fig. 4c).We examined population coding of value-related signals by decoding analyses.Because both action values and neural activity can slowly change throughout a session, they may show spurious correlations [43][44][45][46] .To minimize spurious correlations that derive from the slow-timescale autocorrelations of individual variables, we designed a decoder to decode the difference in the value-related signals from the difference in the population activity between adjacent trials.To confirm that this approach of decoding differences between adjacent trials reduces spurious correlations, we calculated the chance accuracy by shuffling variables across sessions.We found that the chance accuracy with the trial-difference decoder was close to zero, closer to zero than the chance accuracy of a standard decoder that decodes signals in individual trials (Extended Data Fig. 6a).We found that ΔQ, ∑Q and Q ch were significantly encoded in the population activity throughout the entire trial period until the next choice (Fig. 4d).These signals could be reliably decoded even when the decoding analysis was performed using exclusively either left, right, rewarded or unrewarded trials (Extended Data Fig. 6b), indicating that the value decoding is not merely reflecting those binary signals.

OFC activity is necessary for trial-by-trial RL
Next, we investigated the involvement of the population activity in the trained deep RL model and mouse OFC in the expert task performance.We first simulated transient inactivation of trained deep RL model by silencing the prechoice activity of a subset of neurons in the recurrent layer in ~13% of randomly interleaved trials.Figure 5a shows inactivation of 30% of neurons, while Extended Data Fig. 7 shows additional proportions of inactivated neurons.The action policies on inactivation and control trials were examined by the history regression model as above.Deep RL inactivation significantly decreased the dependence on history (Fig. 5a and Extended Data Fig. 7a).
To evaluate the effect of OFC inactivation on expert mouse performance, we performed optogenetic inactivation (Fig. 5b).We injected adeno-associated virus (AAV) encoding Cre-dependent ChrimsonR 47 , a red-light-gated cation channel, in the OFC of PV (Parvalbumin) -Cre transgenic mice.Red light was delivered through fiber-optic cannulas to inactivate OFC bilaterally by activating local PV-expressing inhibitory neurons in ~13% of randomly interleaved trials.Another pair of LEDs was used to shine a red light over the mouse head in every trial as the masking light.When OFC was bilaterally inactivated throughout the ITI and ready period, behavioral dependence on reward history was largely abolished in inactivation trials (Fig. 5c).Furthermore, inactivation only during ITI or ready period also significantly reduced dependence on reward history (Extended Data Fig. 8a-c).The effects of inactivation were restricted to the decision immediately following inactivation, and the history dependence largely recovered in the following trials (Extended Data Fig. 8d).In contrast, control mice with only tdTomato expression in PV-expressing inhibitory neurons without ChrimsonR did not show any significant changes to their action policies after the same red-light delivery into the OFC.
The decreased behavioral dependence on reward history may result in either more stochastic decision-making, an increased dependence on choice history, or an increased dependence on history-independent action bias that is static within each session.We found that the predictability of their choice patterns with the regression model was not affected by inactivation in mice (Extended Data Fig. 8e,f), suggesting that inactivation did not increase the stochasticity of decisions.Instead, the history-independent idiosyncratic action biases substantially increased by inactivation in both deep RL models and mice (Fig. 5d and Extended Data Figs.7b and 8a-c).
Previous studies have found that unilateral inactivation of several brain areas such as premotor cortex and striatum can lead to a lateralized choice bias toward the direction ipsilateral to the inactivated hemisphere 48,49 .Therefore, we tested whether the direction of the increased history-independent action bias depends on the side of the inactivated hemisphere.We performed unilateral OFC inactivation (ITI + ready) of expert mice with the inactivation side flipped on alternating days (Fig. 6a).Overall, the effects of unilateral inactivation were similar to bilateral inactivation, with decreased dependence on history and increased dependence on history-independent biases (Fig. 6b,c).Notably, the direction of the increased bias did not depend on the hemisphere that was inactivated (Fig. 6d).Instead, the bias direction slowly drifted across sessions irrespective of inactivated hemispheres (Fig. 6e).Thus, the OFC does not appear to be directly controlling history-independent decision biases.Rather, our results suggest that the OFC is selectively involved in history-dependent value-based decision-making (Fig. 6f).When OFC is inactivated, the action selection circuit relies mostly on the history-independent bias that is independent of OFC and slowly drifts over days.

Dynamics and stabilization of value coding during meta-learning
The results so far indicate that CaMKII-dependent plasticity in OFC is required for efficient across-session meta-learning of RL action policy but not for the trial-by-trial RL performance by expert mice, while OFC activity is required for expert trial-by-trial RL performance.These results resemble the deep RL model in which the plasticity-based outer-loop RL slowly establishes the inner-loop algorithm that performs trial-by-trial RL using recurrent activity dynamics (Fig. 1g).To investigate how the inner-loop activity-based RL is modified by the outer-loop plasticity-based RL during training, we longitudinally tracked identical neural populations across training sessions in both deep RL models and mice (Fig. 7a).We found that the strength of the three value-related signals increased in the recurrent activity of deep RL models during training (Fig. 7b), and the behavioral dependence on reward history closely tracked the strength of the value-related signals (Fig. 7c).OFC similarly increased Q ch and ∑Q signals in the population activity during training (Fig. 7b and Extended Data Fig. 9a), and the behavioral history dependence tracked the signal strength (Fig. 7c and Extended Data Fig. 9b).ΔQ signal strength in OFC did not correlate with the behavioral history dependence in mice unlike in deep RL models, suggesting that ΔQ signal strength in OFC is not the limiting factor to determine behavioral dependence on this signal.
To evaluate the evolution of action value representations during across-session meta-learning, we longitudinally examined the coding axes for value-related signals in neural population activity.For each mouse and deep RL model, we identified the coding axes for the three value-related signals and defined the similarity of the coding axes between nearby sessions as the cosine similarity between the two axes.In both deep RL model and mouse OFC, we found that the stability of coding axis was initially low but gradually increased across training sessions (Fig. 7d and Extended Data Fig. 9c), indicating a highly dynamic reorganization at the early phase of training and a gradual stabilization of value coding over training sessions.
The stabilization of value coding may lead to a stabilization of the action policy to compute value from history information, thus resulting in a more stable use of history to drive decision-making.A prediction of this idea is that behavioral action policies also stabilized during training, tracking the stabilization of value coding.We tested this prediction by examining whether the similarity of value coding axes between pairs of sessions correlates with the similarity of behavioral action policies for reward history.In both deep RL model and OFC, the coding axis similarity of paired sessions positively correlated with the policy axis similarity (Fig. 7e and Extended Data Fig. 9d).This relationship remained even when we used only later sessions (≥10 d of training) or only sessions with high decoding accuracy (r ≥ 0.2; Extended Data Fig. 10).
The close relationship between OFC value coding and behavioral action policies supports a critical role of OFC in guiding meta-RL.Taken together, these results indicate that the meta-RL mechanism in the mouse brain resembles the deep RL model with a meta-RL framework, with the OFC having a central role in both fast and slow timescales of RLs.

Discussion
Recurrent network activity and long-term synaptic plasticity are two distinct mechanisms by which neural networks can process and store information.Because of the difference in the underlying mechanisms, they can run in parallel in a shared neural network at distinct timescales.Deep RL models with meta-RL frameworks take advantage of this flexibility and implement slow and fast RL [3][4][5][6]50 . Her we showed that CaMKII-dependent plasticity in OFC is required for slow but not fast RL, while OFC activity is required for fast RL, resembling the deep RL models.Put it another way, OFC uses two distinct mechanisms at two timescales to mediate slow and fast RL.The multiple layers of learning with distinct mechanisms and timescales would confer an extra level of flexibility and stability in cognition and learning 2,6 .The slow plasticity-based learning allows the accumulation of experiences over long periods of time, leading to the storage of generalized knowledge as synaptic weights.Animals can then exploit this stable and generalized knowledge to quickly adapt to new environments, using computations by the recurrent networks.
Although our work demonstrates the critical involvement of OFC in both fast and slow RL, many other brain regions are also involved in RL 7,8,10,16,29,[51][52][53][54][55] .OFC likely works together with these other regions that mediate different aspects of fast trial-by-trial RL, such as stable maintenance 7,8,53 and updating 29,54,55  prefrontal areas, striatum and ventral tegmental area.Furthermore, plasticity-based RL may happen at different speeds in different brain areas, possibly governed by differences of neuromodulatory inputs and their receptor expression 5,23,25,50 .Even a single area may possibly run multiple timescales of plasticity-based RL in parallel using distinct mechanisms (for example, long-term plasticity versus short-term plasticity).Different timescales of plasticity within and across areas may confer extra flexibility on the brain by providing more than two timescales for RL.Future studies will unravel how different regions work together to regulate the meta-RL process in the brain, which would inspire further modifications of meta-RL frameworks for deep RL models in the field of AI.Our study provides an important clue toward the neurobiological understanding of 'learning to reinforcement learn'.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41593-023-01485-3.

Animals
All procedures were in accordance with the Institutional Animal Care and Use Committee at University of California San Diego.Wild-type (WT) C57BL/6 mice were obtained from Charles River.Transgenic mice were obtained from the Jackson Laboratory (CaMKIIa-tTA: B6;CBA-Tg(Camk2a-tTA)1Mmay/J ( JAX 003010); tetO-GCaMP6s: B6;DBA-Tg(tetO-GCaMP6s)2Niell/J ( JAX 024742); PV-Cre: B6;129P2-Pvalb tm1(cre)Arbr /J ( JAX 008069)).All surgeries and in vivo experiments were carried out in adult mice (6 weeks or older).Mice were housed in a cage (68-72 °F temperature and 0-100% humidity) and had free access to food.Mouse cages were kept in a room with a reversed 12-h light/12-h dark cycle, and all experiments were performed during the dark period.All postsurgery mice were singly housed with a running wheel.Both male and female healthy adult mice were used for most experiments except for the paAIP2 experiments where only male C57BL/6 adult mice were used.

Surgery for two-photon imaging and optogenetics
Mice were injected with dexamethasone (2 mg kg −1 ) subcutaneously at the beginning of surgery and continuously anesthetized with 1-2% isoflurane during surgery.All surgeries were performed while mice were placed on heating pads, and their eyes were protected with Vaseline.After cleaning the surface of dorsal skull with a razor blade and 75% ethanol, we performed craniotomy at the target lateral OFC coordinate (~1.45 mm lateral and ~2.6 mm anterior from bregma) where we implanted either a GRIN lens or a fiber-optic cannula.The details of the procedures up to the craniotomy step can be found in ref. 56.
For two-photon imaging experiments, we first aspirated the cortex above the target coordinate up to 1.0 mm depth using a blunt end 30G needle (0.312 mm O.D.; SAI Infusion Technologies).Then, we unilaterally implanted a GRIN lens (Inscopix, GLP-0561; 500 µm diameter) above the deep layer of lateral OFC (1.5 mm depth) in either left or right hemisphere.Note that the layers are inverted in the ventral cortex, so we targeted the deep layer to keep all layers of OFC intact.The implanted GRIN lens was fixed at the target coordinates using 3M Vetbond (WPI) 57 on the skull, followed by cyanoacrylate glue and black dental acrylic cement (Lang Dental).We glued the upper part of 1.5 mm screw cap tube (Thermo Fisher Scientific) along with its screw cap on the head using cyanoacrylate glue and black cement to protect the implanted GRIN lens.
Additionally, a custom-built metal head bar was secured on the skull above the cerebellum with cyanoacrylate glue and dental acrylic cement.Buprenorphine (0.1 mg kg −1 of body weight) and Baytril (10 mg kg −1 of body weight) were subcutaneously injected after surgery, and mice were monitored until they recovered from anesthesia.

Surgery for extracellular spike recording
For extracellular spike recording experiments, we covered the dorsal skull with cyanoacrylate glue before craniotomy so that dental cement attaches well on the skull at a later step.Then, we made a small hole in the skull above cerebellum and implanted a ground wire (stainless steel).The implanted ground wire and a head bar were fixed on the cerebellum skull using cyanoacrylate glue and black dental acrylic cement (Lang Dental).Next, we performed another craniotomy above lateral OFC (lOFC: ~1.45 mm lateral and ~2.6 mm anterior from bregma) and unilaterally injected either ~350 nl of AAVDJ-CaMKIIP-mEGFP-P2A-paAIP2 or AAV2/1-CB7-EGFP in the right hemisphere of WT C57BL/6 mice.After virus injections, we made a small hole in the dura and slowly inserted a chronic 64-channel silicone probe (ASSY-236 H6 probe and mini-amp-64 from Cambridge NeuroTech: 2 shanks with 32 channels per shank, recording sites are tiled along 400 µm of each shank) along with a lambda optic fiber (0.39 NA, 200 µm fiber diameter, 700 µm active length; Optogenix) up to 1400 µm depth (the probe was slowly moved further down to the OFC during pretraining tasks).The distance between the recording sites and the active part of the fiber is ~220 µm.Both the probe and the fiber were mounted on a Nano-Drive V2 (Cambridge NeuroTech) before the implantation so that we could change the probe depth during training.After the implantation, the dura was sealed by Dura-Gel (Cambridge NeuroTech).Then, we covered the electrode and the sliding part of the Nano-Drive with Vaseline using low-temperature cautery (FIAB, F7255).After mounting sufficient amount of Vaseline, the implants were fixed on the skull using black dental acrylic cement (Lang Dental).Buprenorphine (0.1 mg kg −1 of body weight) and Baytril (10 mg kg −1 of body weight) were subcutaneously injected after surgery, and mice were monitored until they recovered from anesthesia.

RL task and training for mice
Mice were water-restricted at 1-2 ml d −1 .After at least a week of water restriction, we started behavioral training.Behavioral control was automated with a real-time system.We used BControl system (C Brody, Princeton University) running on Linux communicating with MATLAB (MathWorks) for imaging experiments, and Bpod system (v0.5,J. I. Sanders and A. Kepecs, Washington University in St. Louis) running on Arduino DUE communicating with MATLAB for optogenetics experiments.For the Bpod system, another dedicated Arduino UNO with a sound card (Adafruit, ADA1788) was also used to generate sounds from a speaker.We wrote custom behavior scripts on respective systems for our behavior task.We previously reported this behavior task and training 7 .Mice were head-fixed on a custom-built behavior stage.Two lickports were placed on the left and right sides of the head-fixed mouse, and licking was monitored by infrared radiation (IR) beams.An amber LED for the ready cue was placed ~5 cm away from the nose, and a speaker was placed under the mouse stage.Each trial has a ready period, an answer period and ITI.At the beginning of each trial, the amber LED turned on to signal the beginning of the ready period.The ready period lasted for either 2 or 2.5 s (randomly assigned every trial).After the ready period, the speaker generated 10 kHz tone for the go cue.Both the 10 kHz tone and amber LED cues were terminated when mice made a choice (the first lick to one of the lickports during the answer period) or when mice did not lick for the maximum answer period of 2 s.Each choice was accompanied by a 50 ms feedback tone (left, 5 kHz; right, 15 kHz).If a reward was assigned to the chosen side of the lickport, ~2 µl of water was released immediately after the choice.The answer period and choice were followed by ITI.
Before training mice in the probabilistic reversal learning task, we trained mice in three different pretraining tasks (pretask I: 2-3 d, pretask II: 2-5 d and pretask III: ~2 weeks).In the pretask I, all the choices mice made during the answer period were rewarded with 100% probability.Licking during the ready period was not punished in this pretraining phase, and the mean ITI was gradually increased from 1 to 6 s (±1 s jitter in the duration of every trial).Mice learn that they can collect rewards by licking lickports during the answer period in the pretask I.In the pretask II, reward was delivered alternately from left and right https://doi.org/10.1038/s41593-023-01485-3lickports following either choice during the answer period.From this pretask II, licking during the ready period was punished by 500 ms white noise alarm sound and trial abort with an extra 2 s ITI.Mice learn that they can collect rewards from either lickport and need to withhold licking during the ready period in the pretask II.In the pretask III, a choice was rewarded only when the choice was opposite to the choice in the immediately preceding trial.Mice were encouraged to choose from both lickports in this pretask III.Training in the pretask III was terminated when their correct choice rates reached 70%.Through these three pretraining tasks, mice learned the general task structure, including that only their first lick during the answer period is associated with outcome, rewards are available from both lickports and, they need to withhold licking during the ready period.
After the pretraining, we trained mice in the probabilistic reversal learning task.We assigned reward to each lickport on every choice trial according to a specific reward assignment probability of each lickport.In each trial, one of the lickports had a higher reward assignment probability.The combinations of reward assignment probabilities were either (60%, 10%) or (52.5%, 17.5%), and the probability changed randomly every 60-80 trials in the order of (left, right) = …, (60%, 10%), (10%, 60%), (52.5%, 17.5%), (17.5%, 52.5%), (60%, 10%), ….We postponed the probability block transition if the fraction of choosing the lickport with a higher reward assignment probability was below 50% in recent 60 trials until the fraction reached at least 50% to ensure that mice switch choice preference in each probability block.Once a reward was assigned to a lickport on a trial, the reward remained assigned to the lickport until the lickport was chosen in the future trial (concurrent variable-interval schedules 9,11,58 ).ITI varied randomly between 5 and 7 s.Both alarm (trials with licking during the ready period) and miss (trials without licking during the answer period) trials were not rewarded.We did not include alarm and miss trials in neural activity analyses to ensure that the ready periods we analyzed were free of licking behaviors and that mice were engaged in the task in the trials.

Artificial meta-RL network
We trained artificial meta-RL networks (deep RL models) with firing rate units in this study (five independently trained deep RL models).Learning rate of 0.0005 was used for the network training.The network architecture and training method of our artificial neural networks are based on previous papers on meta-RL [3][4][5] .Unlike the previous publications, we used simple recurrent units with tanh activation functions instead of gated recurrent units (LSTM cell 59 ) because biological neurons do not have such sophisticated gated functions.Therefore, maintenance and computation of signals were performed only through recurrent connectivity.We trained the meta-RL network using the A2C method 12 with a single worker.The number of time steps per trial was randomly assigned to 4 or 5 on each trial to reflect the variable ITI in the mouse task.The meta-RL network had three input neurons that each received the information of either reward outcome (1 for reward, 0 otherwise), left action (1 for left, 0 otherwise) or right action (1 for right, 0 otherwise) from the immediately preceding time step.These input neurons are inactive except for the single time step immediately after previous choice in each trial.These input neurons are connected to the next recurrent layer (50 units for actor and 50 units for critic), and the history signals are maintained through the recurrent connectivity in the recurrent layer.The output of either actor or critic is given by where tanh(•) is a hyperbolic tangent activation function of the form e z +e −z , x (t) is a 3 × 1 vector representing the inputs from the presynaptic three input neurons at a time step t, y (t−1) is a 50 × 1 vector representing the layer's outputs from a previous time step (t − 1), W x is a 50 × 3 matrix containing the connection weights for the inputs from three presynaptic input neurons, W y is a 50 × 50 matrix containing the connection weights for the recurrent connections and b is a 50 × 1 vector containing each neuron's bias term.The recurrent neurons in the actor network project to two output neurons, and the recurrent neurons in the critic network project to one output neuron.The two output neurons from the actor represent logits for left and right actions, and the actions were sampled from the softmax distribution defined by these action logits outputs.Because the input neurons send history signals to the recurrent layer only at the first time step immediately after previous choice, the action selection of the network needs to rely on the history signals that were maintained in the recurrent layer across time steps through its recurrent connectivity.The selected action accompanies a reward if the reward was loaded on the selected side in the trial.On the other hand, an output neuron from the critic represents the state value for the next trial.Following the A2C method, we defined the policy loss (L π ) and value loss (L v ) as follows: where a t is the action, s t is the state, π (a t , |, s t ) is the action policy, A (s t ) is the advantage function, H (π) is the entropy of the policy, R t is the discounted n-step bootstrapped return that represents the expected future rewards and V (s t ) is the state value.β e and β v are the hyperparameters that determine the relative contributions of the entropy term and the value loss to the total loss function.We used TD error δ (s t ) as the estimator of the advantage A (s t ) function as follows: The n-step return R t is given by The entropy of policy H (π) is given by The parameters of the networks were updated following the gradients of the total loss function (L tot = L π + L v ) as follows: ∂θ c (7)   where θ a and θ c represent the parameters for the actor and the critic networks, respectively.The networks were trained using RMSProp and backpropagation through time.The training hyperparameters were fixed as follows: learning rate = 0.0005, γ = 0.5, β e = 0.5, β v = 0.01 and unroll length = 50 time steps.The networks performed 500 trials per session (episode) in the task environment where the reward assignment probabilities and its baiting rule are identical to the mouse task, and the network parameters were fixed within each session (no synaptic plasticity).Network parameters were updated after each training session to approximate slow learning.If weights are updated every trial, the networks learn to perform trial-by-trial RL using the weight updating mechanism as in standard deep RL models (for example, Deep Q-Network (DQN)) 12,60 .Instead, we updated the weights of the networks infrequently to encourage the networks to learn activity dynamics-based trial-by-trial RL.Each network ran 300 training sessions and an additional 301st session of the network from the last training session.
For inactivation of a trained network, we randomly sampled specified fractions of neurons (0-100%) from the recurrent layer of the https://doi.org/10.1038/s41593-023-01485-3trained network and inactivated those neurons at the time step immediately before action selection to mimic the prechoice period inactivation experiments for mice.For each inactivation fraction condition, we repeated the simulations 50 times with different random subsampling of recurrent neurons for inactivation and averaged the results.Following the inactivation condition for OFC, we set the frequency of inactivation trials per session to ~13% with the constraint that each inactivation trial must be followed by at least four consecutive control trials.

Two-photon calcium imaging of OFC neurons
In vivo neural calcium signals were recorded using two-photon microscopes (B-Scope; Thorlabs) with a ×16, 0.8 NA water immersion objective lens (Nikon) and 925 nm lasers (Ti-Sapphire laser; Newport).ScanImage (Vidrio Technologies) running on MATLAB (MathWorks) was used for image acquisitions.Images (512 × 512 pixels) were continuously recorded at ~30 Hz during task performance.We imaged from seven CaMKIIa-tTA::tetO-GCaMP6s transgenic mice expressing GCaMP6s in CaMKII-positive neurons.Signals were collected from deep layers of lateral OFC using unilaterally implanted GRIN lens (Inscopix, GLP-056; 1500 µm diameter).For each animal, we imaged two nonoverlapping neural populations from two different depths by alternately imaging at the two depths across training sessions.Therefore, we longitudinally imaged 14 distinct OFC neural populations (7 mice × 2 planes) in total.The positions of focal planes were adjusted before each imaging session such that the vasculature patterns and cells within the field-of view (FOV) were aligned to the template images from previous imaging sessions.Before starting each imaging session, we unscrewed the protective cap (Surgery for two-photon imaging and optogenetics) and attached a custom-built water chamber on the head to keep enough water between the GRIN lens and the objective lens during imaging.Slow drifts in the imaging field were manually corrected during imaging.Residual motions and image distortions were corrected by PatchWarp 61 .We used Suite2p 62 to draw regions of interests (ROIs) corresponding to individual neurons and extract their fluorescence.The cellular ROIs were first classified by a user-trained classifier, and the classifications were further manually refined by humans.The pixels where multiple neurons overlap were excluded at the signal extraction step.Contaminations of neuropil activity in each cellular ROI were also estimated and removed from the extracted fluorescence following the algorithm on Suite2p.Slow linear trend was removed from each extracted fluorescence, and the detrended signal was deconvolved using a nonnegative deconvolution algorithm 41,42 to obtain the estimate of the underlying spiking activity.To identify the same neurons between paired imaging sessions, we aligned their mean-intensity images along with cellular ROIs using affine transformations and manually identified ROIs corresponding to identical neurons between the paired sessions.
In this study, we analyzed the activity of a total of 47,311 OFC neurons (for this reported number, longitudinally imaged neurons were counted multiple times) from 390 imaging sessions in total.The average number of imaging sessions per population was 27.86 ± 7.73 (mean ± s.d.).The average number of analyzed cells per population for all imaging sessions was 171.76

Optogenetic suppression of OFC activity
We performed OFC inactivation by optically activating PV-positive inhibitory neurons in OFC.We virally expressed ChrimsonR-tdTomato in Cre-dependent manner in PV-positive inhibitory neurons (Surgery for two-photon imaging and optogenetics).As the control mice, we virally expressed tdTomato without ChrimsonR in PV-positive inhibitory neurons.After head fixation, implanted fiber-optic cannulas were connected to 625 nm fiber-coupled LEDs (Thorlabs) using a bifurcated optic fiber.We also attached a red LED on the side of each of the fiber-optic cannulas to use it as the masking light by illuminating the mouse head every trial.We controlled the three LEDs (one for inactivation and two for masking light) to generate sequences of square pulses (40 Hz) using Arduino UNO.At the end of each stimulation period, we attenuated the intensity of all three LEDs with a linear attenuation over the last 100 ms.The intensity after the fiber-optic cannulas for OFC inactivation light was ~1.5 mW per fiber.We set the frequency of inactivation trials per session to ~13% with the constraint that each inactivation trial must be followed by at least four consecutive control trials to avoid excessive perturbation.In contrast to the LED for inactivation, the masking light LEDs were turned on every trial.The timing and duration of masking light LEDs on each trial were matched to those of the inactivation LED for each inactivation condition.Our inactivation covered both ITI and ready period (ITI + ready: from 0.5 s after the beginning of ITI until the end of the ready period), or briefly at either ITI (2 s ITI: from 1 s after the beginning of ITI until 3 s after the beginning of ITI, 5 s ITI: from the beginning of ITI until 5 s after the beginning of ITI) or ready period (ready: the entire ready period of 2 or 2.5 s).The numbers of mice and sessions collected for each condition were as follows: (ChrimsonR-tdTomato, ITI + ready, bilateral (6 mice, 43 sessions)), (ChrimsonR-tdTomato, 2 s ITI, bilateral (9 mice, 60 sessions)), (ChrimsonR-tdTomato, 5 s ITI, bilateral (8 mice, 47 sessions)), (ChrimsonR-tdTomato, ready, bilateral (10 mice, 62 sessions)), (tdTomato, ITI + ready, bilateral (5 mice, 30 sessions)), (tdTomato, 2 s ITI, bilateral (8 mice, 49 sessions)), (tdTomato, 5 s ITI, bilateral (8 mice, 46 sessions)), (tdTomato, ready, bilateral (8 mice, 49 sessions)) and (ChrimsonR-tdTomato, ITI + ready, unilateral (4 mice, 177 sessions)).

Optogenetic suppression of OFC plasticity during behavior
We suppressed synaptic plasticity in OFC by optically activating paAIP2 (refs.31-33), a photoactivatable inhibitor of CaMKII kinase activity, expressed in OFC neurons.We virally expressed EGFP-P2A-paAIP2 in CaMKII-positive neurons (Surgery for two-photon imaging and optogenetics).For control experiments, we virally expressed EGFP without paAIP2.After head fixation, implanted fiber-optic cannulas were connected to 473 nm blue lasers (Shanghai Laser & Optics Century Co.) via fiber-optic patch cords.We also attached a blue LED on the side of each of the fiber-optic cannulas to use it as the masking light by illuminating the mouse head every trial.Both the lasers and masking light LEDs generated continuous light.The laser intensity after the fiber-optic cannulas was ~25 mW per fiber.We bilaterally illuminated OFC in both EGFP-P2A-paAIP2 mice and EGFP control mice for the first 3 s during ITI on every trial.For the experiments where we blocked plasticity across training sessions, we split each WT male litter into half for paAIP2 (five mice) and control (five mice) group.The type of virus (EGFP-paAIP2 or EGFP) injected in each mouse was kept blind to the mouse trainer.
For the experiments where we started plasticity blocking at the expert stage, we trained a separate cohort of five paAIP2-expressing mice until they reached the expert stage using only masking blue LED light during the task performance.For each of the two types of history (reward and choice), we calculated the s.d. of the sum of history regression weights (five weights from Eq. ( 12)) during the recent 7 d of training sessions, and we judged the mouse as an expert with stable performance when the s.d. for all two types of summed history weights were <0.025 in the recent 7 d and the mouse had been trained for at least 17 d.We started the plasticity-blocking experiments after individual mice passed this criterion.During the experimental sessions, we bilaterally illuminated OFC through cannulas on alternating days (10 d of paAIP2 photoactivation sessions and 10 d of control sessions).
Although we blocked CaMKII activity only during behavior sessions, some task-related plasticity that normally occurs after a behavior session may be also suppressed.We previously found that synaptogenesis in M1 during motor learning, such as formation of new dendritic spines, tends to occur between days of learning, rather than during https://doi.org/10.1038/s41593-023-01485-3behavioral sessions 35 .The potent paAIP2 effects on M1 spine dynamics in the same lever-press task (Extended Data Fig. 3) suggest that the CaM-KII blocking during behavioral sessions may also suppress task-related plasticity that occurs with some delays after each behavior session.

Extracellular spike recording of OFC neurons
In vivo neural spikes were recorded using 64-channel chronic silicon probes (ASSY-236 H6, mini-amp-64; Cambridge NeuroTech) and a data acquisition board (Open Ephys) from a mouse with EGFP-P2A-paAIP2 expression and a mouse with only EGFP expression.Open Ephys GUI was used to monitor and save the recorded data at 30 kHz.The implanted probe and optic fiber were slowly moved down into OFC using Nano-Drive V2 (Cambridge NeuroTech) during the pretraining task.Blue laser light was delivered into OFC through the implanted fiber using the protocol we used for the behavior experiments (~25 mW, 3 s during ITI on every trial).After 30 d of illumination sessions, we recorded several distinct OFC populations by moving the probe depth.Our recorded neurons are from both deep and superficial layers (depth range: 1,500-2,100 µm).We used Kilosort3.0(ref.63) for automatic spike sorting, and the results were further manually refined using phy 64 .

Two-photon glutamate uncaging and light stimulation
Two-photon imaging was performed using a custom-built two-photon microscope.mEGFP was excited with a Ti:Sapphire laser (Coherent Ultra II) tuned at the wavelength of 920 nm.The fluorescence was collected by an objective lens (×60, 1.0 NA; Olympus) and detected with photoelectron multiplier tubes (Hamamatsu, H7422-40p) placed after wavelength filters (Chroma, HQ520/60 m-2p for green).The signal was acquired using a photon counting board (PicoQuant TH260) and custom software.A second Ti:Sapphire laser (InSight, Spectra-Physics) tuned at the wavelength of 720 nm was used to uncage 4-methoxy-7 -nitroindolinyl-caged-l-glutamate (MNI-caged glutamate; Tocris) in artificial cerebral spinal fluid (ACSF) with a train of 6 ms, ~2.5-3 mW pulses (30 times at 0.5 Hz) near the spine of interest.In light stimulation experiments, slices were continuously illuminated with a blue LED (Thorlabs, M470L5) at a wavelength of 473 nm (160 mW cm −2 ) from the bottom of the sample.Experiments were performed at room temperature (∼25 °C), and slices were perfused with Mg 2+ free ACSF (127 mM NaCl, 2.5 mM KCl, 4 mM CaCl 2 , 25 mM NaHCO 3 , 1.25 mM NaH 2 PO 4 and 25 mM glucose) containing 1 µM tetrodotoxin and 4 mM MNI-caged l-glutamate aerated with 95% O 2 and 5% CO 2 at 25 °C.

Patch-clamp electrophysiology
Infected OFC pyramidal neurons were visualized using epifluorescence illumination.Whole-cell current-clamp recordings were obtained using a Multiclamp 700B amplifier.Patch pipettes (3-5 ΩM) were filled with a potassium gluconate solution (130 mM K gluconate, 10 mM Na phosphocreatine, 4 mM MgCl 2 , 4 mM NaATP, 0.3 mM MgGTP, 3 mM l-ascorbic acid, 10 mM HEPES (pH 7.2) and 320 mOsm).These experiments were performed at room temperature (~25 °C) in ACSF containing 127 mM NaCl, 2.5 mM KCl, 2 mM CaCl 2 , 1 mM MgCl 2 , 25 mM NaHCO 3 , 1.25 mM NaH 2 PO 4 and 25 mM glucose and oxygenated.Recordings were digitized at 10 kHz and filtered at 2 kHz.To test the effect of long-term blue light stimulation, whole-cell current-clamp recordings were performed in a different group of neurons before and after blue light stimulation for 40 min (1 s ON and 3 s OFF).Depolarizing current injections were given in 100 pA increments up to 800 pA.Threshold, action potential (AP) width and AP amplitude were analyzed on the current step where the first AP was observed.All data were acquired and analyzed with custom software written in C# and MATLAB.

Two-photon imaging of spines over motor learning
WT (C57BL/6) mice between 3 and 6 months of age were prepared for craniotomy and cranial window placement, as previously described.Dendritic spines were visualized via sparse viral expression of either EGFP (AAV2/1.pCAG.FLEX.EGFP.WPRE.bGH;Allen Institute) or paAIP2-EGFP (AAV8-CBA-DIO-mEGFP-P2A-paAIP2) + diluted Cre recombinase (AAV9.pCaMKII.Cre, Penn Vector Core; 1:2,500-5,000× dilution in saline) in the forelimb region of the primary motor cortex, M1 (coordinates: +1,500 µm lateral, +300 µm anterior).Mice of different conditions were paired 1:1 with cage-matched siblings to minimize batch effects.Mice were allowed to recover with postoperative care for approximately 2 weeks, after which they were progressively water restricted-removing ad libitum access to water and reducing water delivered from 2 ml to 1 ml over 6 d-to achieve a maximum 30% reduction in body weight to motivate them to engage in the motor learning task.Water-restricted mice were acclimated to the behavioral imaging rig for 2 d before the experiment started and then subjected to the lever-press task.The apical dendrites of labeled layer 2/3 neurons were imaged via two-photon excitation of EGFP using a Ti:Sa pulsed laser tuned to 925 nm (MaiTai; Newport) coupled to a commercial two-photon microscope (B-Scope; Thorlabs) equipped with a ×16/0.8-NAobjective (Nikon).Laser power was controlled with a Pockel's cell and ranged from 7 to 40 mW.Imaging was always performed in awake animals.Images (512 × 512 pixels at zoom values ranging from 7 to 12.1×, corresponding to interpixel distances of 0.5 pixel per µm at 1× and scaling linearly for the zoom values used) were recorded at 30.05 Hz.Z-stacks of the apical dendritic arbor were taken by acquiring 100 frames per slice over a variable number of slices ranging from 20 to 60 with a step size of 1 µm, depending on the morphology and optical accessibility of targeted dendrites.This imaging process was repeated daily, before behavioral training, for each of the 14 sessions.Only data from the sessions indicated in Extended Data Fig. 3b are presented in this study.

Lever-press task
Water-restricted mice were required to press a lever, comprising a piezoelectric flexible force transducer (LCL-113G; Omega Engineering) attached to a 1/14-mm-thick brass rod, past two thresholds (a ~1.5 mm lower threshold to prevent holding below baseline and a target threshold of ~3 mm) within 200 ms during the cue period (a 6 kHz tone) to receive a water reward (~10 µl).The lever position was continuously monitored using a data acquisition device (LabJack) and software (Ephus, MATLAB, MathWorks) working with custom software running on LabVIEW (National Instruments), which monitored threshold crossing.The behavioral setup was controlled with MATLAB software (Dispatcher, Z. Mainen and C. Brody).Rewarded trials were paired with a 500 ms, 10-kHz tone, while failed trials were presented with a white noise punishment signal and the start of the next intertrial interval (ITIs).ITIs were 8-12 s.

Optical stimulation during motor learning
During training, light from a blue LED (465 nm; Doric) was directed through a 200 µm diameter patch chord into the cranial window by clamping from above (Extended Data Fig. 3a).Photostimulation occurred during the cue period of every trial in both paAIP2 and control mice.The average power measured at the fiber tip was ~2.5 ± 0.3 mW.https://doi.org/10.1038/s41593-023-01485-3

Analysis of changes in dendritic spine size
Image stacks for each session were motion-corrected using a custom algorithm (MATLAB, MathWorks).This approach is similar to those reported previously [34][35][36] but was co-opted to also register slices across z planes.This was achieved by first iteratively aligning all imaging frames within a given slice to the resulting frame average until performance saturated, then registering across slices from the central z slice moving to the top and bottom of the volume.To accurately assess spine volume without contamination from other structures (which is still possible despite sparse expression conditions), targeted dendrites were extracted via semi-automated tracing using the Simple Neurite Tracer plugin 66 in ImageJ.This tool uses intensity-based path-finding to trace continuous paths along dendrites, after which approximately tubular volumes surrounding the dendrite can be considered, and intensity-based extractions of associated structures are achieved.Using this approach, large z-stacks of complicated dendritic arbor could be disentangled and analyzed individually.After extraction of target dendritic branches, segments of individual dendrites were selected for consideration based only on their consistent health and optical quality across sessions.It should be noted that optical obstructions could arise that did not reflect poor health; a portion of dendrite could be optically distorted on one imaging session, then appear in healthy and of high quality on a later session (Extended Data Fig. 3h, below red box).Dendritic segments that were selected for analysis based on these criteria were typically several 10 s of µm in length and displayed spine densities consistent with previous reports (Analysis of spine density and spine turnover).Maximum projection images of extracted dendrites were then used as the base images for subsequent measurements.Spine volume was measured by considering the integrated fluorescence intensity within an elliptical ROI drawn around the spine head in maximum projection images.To account for changing expression levels of the fluorescence indicator over days, integrated spine head intensities were normalized by the average dendritic intensity.Dendritic intensity was measured by drawing a series of points along the dendrite (which also allows the tracking of dendritic distance) and then expanding to an elliptical area around each point corresponding to the approximate diameter of the dendritic segment being analyzed.Unique individual pixels contained within these areas were then used as the ROI corresponding to a particular dendritic branch.To prevent over-generalization of dendritic normalization, only pixels within 20 µm (in dendritic distance) of a given spine were considered.These steps were repeated for both 'early' (session 1) and 'late' (sessions 11-14) imaging sessions.Late session images were selected based on optical quality of the imaged field, as described above.All analyses on a given imaging field used the same late session (that is, spines were counted only once and the late session was not decided on a 'per-spine' basis).The late sessions considered for EGFP-expressing (by mouse: 12, 13, 14, 14, mean = 13.25) and paAIP2-expressing (13, 14, 14, 11, mean = 13) animals were similar.Spine size estimates acquired in this way were largely stable over sessions (Extended Data Fig. 3e).The 'enlargement' threshold of 1.5× was chosen based on previous literature investigating controlled induction of spine enlargement in single spines.

Analysis of spine density and spine turnover
In a separate set of analyses, spine density was measured along individual dendritic segments that were comparable in quality across sessions.These analyses used many of the same dendritic segments considered in the spine size measurements but were performed without specifically constraining the segments used in the size analysis.Spines were aligned based on the overall structural similarity of the surrounding region and were considered the same spine if (1) they presented similar structural appearance and (2) the attachment point of their neck to the dendrite occurred in a similar position relative to other spines and dendritic morphology.Spines were considered for analysis if and only if they could be easily discerned from other structures, which could lead to an underestimate of true spine density.Spine formation events were scored as those structures that did not have any structural correlates in session 1, while spine elimination events were those structures that were present in session 1 but had no obvious structural correlates in late sessions.'Stable' spines correspond to all other spines; namely, those that were scored as 'present' throughout the experiment.While we acknowledge that very small-to-invisible spines may be inaccurately labeled as absent, such error is likely consistent across the groups being compared.Furthermore, our previous work tracking new spines in this way overlapped with orthogonally derived metrics that supported their identities as newly formed synapses 34 .To calculate spine densities, the linear distance along dendritic branches was extracted as the sum of linear distances between dendritic poly points (see description in Analysis of changes in dendritic spine size above).To measure the dendritic length in three-dimensional space, each dendritic poly point was assigned a slice address based on the maximum intensity of the extracted dendrite within the associated elliptical ROI.The distance between adjacent ROIs was thus calculated as either the linear distance between ROI centers or the hypotenuse between the linear distance and the z-step size (always 1 µm in the presented data), depending on whether they were addressed to the same z plane.Dendritic distances were measured between the locations of the 'first' and 'last' spines on a dendrite (that is, the flanking edges of spine counting for a given dendritic segment).We assumed that the dendritic length between registered spines is constant over sessions and used the dendritic length measured in session 1 for all sessions.These distances were used to calculate overall spine density, new spine density and elimination density.

Assessment of dendritic health
Imaged dendrites were assessed for health based on standard parameters, such as spine density and dendritic morphology.'Blebbing,' or large varicosities present along the dendrite accompanied by constriction of previously uniform regions, was considered reflective of a dead-or-dying cell and would instantiate the exclusion of such a dendrite from consideration in any session.Nonetheless, care was taken not to confuse blebs with spines oriented in the z axis, which can appear varicose along the dendrite.Such spines could be differentiated from blebs based on their size geometries in the z axis, as well as their commonplace appearance in early session images that were not yet exposed to risk of photodamage.These structures have previously been confirmed as spines with correlated electron microscopy of targeted dendrites 34 .Based on blebbing criteria alone, only one dendritic segment across the eight mice presented in this study was considered potentially sick, and this segment was excluded from all analyses.Given the potential effect of paAIP2 expression on spine density, we did not further filter dendrites based on spine density alone.The lack of significant difference between spine densities between EGFP-and paAIP2-expressing neurons (Extended Data Fig. 3i) is thus not a result of selection bias or unequal filtering of the data.

Quantification of task performance in RL task
We used two types of performance metrics to quantify the task performance in each session.The first metric is the optimality score which quantifies how optimal the behavior was in each session.Once a reward was assigned to a lickport, the reward was maintained on the side until it was collected by a participant in our task environment.As a result, the probability that a reward is available on the lickport gradually increases if the lickport has not been selected in the recent trials 8 .Therefore, the probability that a reward is available on each side is given by https://doi.org/10.1038/s41593-023-01485-3 where A c (x) is the reward assignment probability of choice c on trial x (0.6, 0.525, 0.175 or 0.1), n c (t) is the number of successive c choices before trial t (for example, n R (t) = 3 when the choice on (t − 4) was left and the choices on (t − 3), (t − 2) and (t − 1) were right).Therefore, the optimal action policy that maximizes expected reward outcomes in our task environment is the policy that maximizes the following optimality score: where n is the number of trials and P rewC (t) is the probability of reward availability on the chosen side in trial t (P rewL (t) in left choice trial and P rewR (t) in right choice trial).The action policy with higher optimality score achieves higher reward rate in the task.Optimality score is a less noisy measure of behavioral optimality than a simple reward outcome rate score because the optimality score is not affected by the randomness of reward assignment in each trial.
The second performance metric is the frequency of choosing the side with higher reward assignment probability on each trial, which is given by where the numerator is the number of choices with higher reward assignment probability A on each trial ( A High is either 0.6 or 0.525).
Although an action policy that maximizes this second metric is a suboptimal action policy in our task environment because of the cumulative nature of reward baiting, our mice primarily learned to increase this metric over the optimality score, which suggests that the objective function of the mouse learning may be closer to this second metric than the optimality score unlike deep RL models.

Quantification of behavioral history dependence
We used a logistic regression model to quantify the behavioral history dependence of individual animals and deep RL models on each session.
The model predicts an action on each trial based on reward and choice history from the past five trials 9 .The model is given by ln ( P L (t) where P L (t) is the probability of choosing left on trial t, r x (t − i) is the reward history for left (L) or right (R) side on trial t − i (1 for reward and 0 for no-reward), c x (t − i) is the choice history for left (L) or right (R) side on trial t − i (1 for chosen and 0 for unchosen), β r(t−i) and β c(t−i) are the raw regression weights for each type of history and β bias is the history-independent constant action bias term.We fit the model to behaviors using the L-BFGS solver without regularization (LogisticRegression function in scikit-learn 67 ).Although we previously parametrized mouse behaviors in the same behavior task using three different types of history terms 7 , we now do not recommend the model because of the collinearity between the parameters.

Action policy axis and magnitude of history dependence
To summarize the action policy in each behavior session, we defined action policy axes for the kth session as follows; where ⃗ p p p k k k r r r and ⃗ p p p k k k c c c are the policy axes for each type of history.These axes are defined using the regression weights from Eq. (12).To quantify the stability of the action policy for each type of history across training sessions, we calculated the cosine similarity of the coding axis vectors for type-x history between kth session and (k + m)th session as follows: where θ is the angle between the paired axis vectors in degrees and ||•|| denotes L2-norm of a vector.

Quantification of inactivation effects on behavioral history dependence
To quantify the inactivation effects on the behavioral history dependence, we fit the following logistic regression model: where P L (t) is the probability of choosing left on trial t, r x (t − i) is the reward history for left (L) or right (R) side on trial t − i (1 for reward and 0 for no-reward), c x (t − i) is the choice history for left (L) or right (R) side on trial t − i (1 for chosen and 0 for unchosen).Ctrl (t) is 1 on control trials and 0 on inactivation trials, while Opto (t) is 0 on control trials and 1 on inactivation trials.The model contains separate regression weights for control ( β ctrl x ) and inactivation ( β opto x ) trials.The model was fit to behaviors with L2 regularization (LogisticRegressionCV function in scikit-learn) to prevent overfitting to the data because the number of trials per fit was limited for these inactivation datasets.For the regularization, we selected the inverse of regularization strength from a logarithmic scale between 10 −4 and 10 4 (100 grids) by fivefold cross-validation.We used L-BFGS solvers for the L2 regularizations.
Because the frequency of inactivation trials is only ~13%, the number of control trials is much larger than the number of inactivation trials.To make the history dependence estimations robust against the difference in the trial numbers between control and inactivation trials, we matched the number of control trials to the number of inactivation trials for each model fitting by randomly subsampling the control trials.The subsampling and fitting were repeated with the smallest number of iterations to include every control trial at least once.We took the mean of the regression weights from all the iterations.For |Bias|, we took the absolute value of β ctrl bias for each iteration before averaging across iterations.

RL model
A class of RL models that originated from the Rescorla-Wagner (RW) model 13,68 is widely used to estimate action values of animals and humans.Previously we optimized the RW RL model to describe the behavioral patterns of mice in the current task 7 .We used this model to estimate action values on each trial.In this RL model, the value of chosen action (Q ch ) is updated according to its reward outcome on every trial as follows: https://doi.org/10.1038/s41593-023-01485-3 where R (t) is reward outcome on trial t (1 for rewarded and 0 for unrewarded trials), α rew is the learning rate for rewarded trials and α unr is the learning rate for unrewarded trials.Because the action value (Q) takes a value between 0 and 1, the reward prediction error R (t) − Q ch (t) is positive on rewarded trials and negative on unrewarded trials.The value of unchosen action (Q unch ) was also updated to reflect the time-dependent forgetting of unchosen action value 7,69,70 as follows: where Q unch is discounted every trial by the forgetting rate ω.We used the above value updating rule for both mice and deep RL models, but we used slightly different choice probability estimation to reflect the outcome-independent choice alternation that was unique in deep RL models (Fig. 1h).The probabilities of choosing left action (P L ) for mice (Eq.19) and deep RL models (Eq.20) on trial t are given by where Q L and Q R are the action values for left and right, respectively, β ΔQ reflects the sensitivity of a mouse to the action value difference, β 0 is the value-independent action bias that is constant in each session, 1) is the choice history on previous trial (1 for left choice and −1 for right choice) and β c is the weight for C (t − 1).Negative value on β c accounts for the tendency for choice alternation deep RL models.
The cost function J (θ) was defined using the model likelihood L (θ) and L2-penalty as follows: where θ j represents the model parameters.L2-penalty was included to obtain a model with better generalization.The regularization parameter λ was selected by tenfold cross-validation (minimum cross-validation error).

Decoding of value-related signals
We decoded three different value-related signals, ΔQ (Q L − Q R , value difference between left and right), ∑Q (Q L + Q R , sum of the two action values) and Q ch (value of the side chosen in the previous trial), from OFC neurons or the recurrent neurons in deep RL models.One important issue that needs to be considered for the decoding of value-related signals from neural activity is that action values are serially correlated across trials (autocorrelation).Because neural activity is also serially correlated, a simple decoder that decodes value on trial t from neural activity on trial t may overestimate the relationships between the value and neural activity.This is because two independent variables with slow serial correlations can appear correlated by chance [43][44][45][46] .Therefore, we devised a decoder that minimizes the contribution of the spurious correlations between slowly evolving variables.The spurious correlations originate from the autocorrelation of each variable across time.Therefore, we built the following decoders where majority of the slow autocorrelations of the value-related signals and the neural activity are ignored: where a i (t) is the activity of the ith neuron on trial t, β x i is the regression weight for the activity difference between adjacent choice trials and β x 0 is the constant term.In each model, the difference in the value-related variable between adjacent choice trials is decoded using the neural activity difference between adjacent choice trials.This decoder focuses on the changes on each trial (trial derivatives).By focusing on the trial derivatives, we can suppress the potential spurious correlation between the value-related signal and neural population activity because trial derivatives have much less slow-timescale autocorrelations.
We used all recurrent neurons (n = 100) for decoding from deep RL models.The mean activity of the three time steps immediately before choice was used for the decoding from deep RL models.For decoding from OFC neurons (calcium imaging), we used only 55 neurons that were randomly subsampled from each population to match the number of neurons in the decoders across different sessions and mice.Fifty-five was the minimum number of simultaneously imaged neurons in our dataset.For each OFC neural population, we subsampled 55 neurons in each iteration without replacement until the iteration number reached the smallest number to include every cell for decoding.A small number of randomly selected neurons were sampled twice for the last iteration.The decoding accuracy from all iterations was averaged.Each iteration of decoding was performed using tenfold cross-validation without shuffling.For the decoding with intracellularly recorded spikes, we subsampled 18 neurons instead because the number of neurons that could be simultaneously recorded with a silicone probe was limited.We calculated chance decoding accuracy with two methods (within-session and cross-session).To obtain the within-session chance decoding accuracy, the value differences between adjacent trials were shuffled 100 times across trials.Decoding results from the shuffled data were averaged for the within-session chance decoding accuracy.The cross-session chance decoding accuracy was obtained by decoding the value differences using the neural activity from a different behavior session as suggested previously 45 .Using the neural activity of each session, we decoded the value differences from randomly selected 100 behavior sessions of different mice.The value differences of each randomly selected session were circularly permuted at a random trial number before decoding to randomize the first trial position in the reward probability blocks.Decoding results from the 100 sessions were averaged for the cross-session chance decoding accuracy.

Coding axis similarity
We quantified the similarity of coding axes between paired sessions for OFC neurons and recurrent neurons in deep RL models.Our two-photon calcium imaging was performed at two different focus planes on alternating days for each mouse.Therefore, we quantified the similarity of the coding axis vectors from paired sessions that are 2 d apart.For each session pair, we first registered which neurons correspond to which in the paired sessions.Based on the cellular identity, we gave a unique ID to each shared neuron.Coding axis vector for each value-related signal on kth day was defined using the decoder weights from Eqs. (22-24) as follows: where the numbers on the lower right (1, 2, …, n) indicate the unique IDs given to n neurons shared between paired imaging sessions.Cosine similarity of the coding axis vectors between kth day and (k + 2)th day is given by where θ is the angle between the paired axis vectors in degrees and ||•|| denotes L2-norm of a vector.For deep RL models, we similarly defined coding axis vectors using decoder weights from all 100 recurrent neurons.Cosine similarity was calculated for immediately adjacent sessions (kth session and (k + 1) th session).

Relationships between value coding in neural activity and behavioral action policy
We analyzed the relationships between value coding in neural activity and behavioral action policy.We examined both their magnitude relationships and stability relationships.Because action values are updated based on reward history, we focused on the reward-based action policy axis (Eq.13).The magnitude of behavioral reward history dependence was defined as follows using the from Eq. ( 13): where |•| denotes absolute value of the regression weight.The magnitude relationships between value coding and action policy were analyzed using this behavioral history dependence and the value decoding accuracy from neural population activity.
To analyze the across-session stability relationships between value coding and action policy, we calculated the angles of a coding axis pair and a policy axis pair between kth and (k + 2)th sessions as follows: We used these angles instead of their cosines because cosine function introduces nonlinearity, which can skew the data distribution on a scatterplot.

Mixed-effects models for statistics
We used mixed-effects models for the statistical analyses of nested data.lmer function in the lme4 package 71 for parametric test or aligned rank transform (ART) 72 for nonparametric test was used in R. ART was used for all statistical tests for the comparisons of regression weight sizes (including action bias size) because the weight distributions were severely skewed.lmer was used for the other statistical tests with mixed-effects models.The models used in this manuscript are as follows: where the fixed effect is the training session number and a random intercept is for the subject.This model was used to assess the action policy changes during training in Fig. 1i,l.
where the fixed effect opt is 1 on inactivation trials and 0 on control trials, a random slope is for subject (mouse or deep RL) and a random intercept is for session.This model was used for the paired comparisons in Figs.3b, 5 and 6 and Extended Data Figs.4i-m and 8.
where the fixed effect virus is 1 for EGFP-paAIP2 mice and 0 on control EGFP mice and a random intercept is for training day (session numbers).This model was used for Fig. 2g,h,j and Extended Data Fig. 4d-h.
where the fixed effect virus is 1 for EGFP-paAIP2 mice and 0 for control EGFP mice and a random intercept is for the trial number from the probability block transition.This model was used for Extended Data Fig. 4a-c.
where the random intercept is for simultaneously imaged neural population, and the fixed effect x and the observation y are the shuffling (0, not shuffled; 1, shuffled) and the decoding accuracy (Fig. 4d and Extended Data Fig. 6), the session number and the decoding accuracy (Fig. 7b and Extended Data Fig. 9a), the decoding accuracy of a value-related signal from OFC population activity and the behavioral history dependence (Fig. 7c and Extended Data Fig. 9b), the session pair number and the cosine similarity between value coding axes from the paired session (Fig. 7d and Extended Data Fig. 9c), the angle between value coding axes and the angle between policy axes (Fig. 7e and Extended Data Figs.9d and 10), respectively.

Statistics and reproducibility
Sample size.No statistics were used to predetermine the sample size.
Data exclusions.We excluded animals that did not learn the task either due to loss of motivation or sickness.For two-photon calcium imaging, we excluded neurons that were not consistently within the field-of-view during each imaging session.
Randomization.We allocated male mice from the same littermates randomly to paAIP2 group and control group for in vivo OFC plasticity suppression experiments.Selections of animals in the other experiments were completely random, and mice from different litters were mixed.

Blinding.
For the experiments where we assessed the effects of OFC plasticity suppression on the learning curves, the type of virus injected (EGFP-P2ApaAIP2 or EGFP) was blinded to the trainer of mice.Data collection and analysis were not performed blind to the conditions of the other experiments.

Data analysis software and library
We used Python3, MATLAB and R for data processing and analyses.All statistical tests are two-sided unless otherwise noted.Confidence interval (CI) and s.e.m were obtained by bootstrapping 1,000 times unless otherwise noted.lme4 package 71 and ARTool 72 were used in R for mixed effects models.We used TensorFlow2 (ref.73) for training artificial neural networks and scikit-learn 67 for training the other machine learning models.SciPy 74 and Numpy 75 were also used for numerical computations.Matplotlib 76 and seaborn 77 were used for data visualizations.Extended Data Fig. 2 | Choice adaptability after probability block transition, the performance comparison between mice and deep RL models.a, Mean probability of choosing the side with a higher reward assignment probability after block transition (Deep RL: ≤100 th session for early and 230 th −301 st sessions for late; Mouse: ≤ 5 th session for early and ≥15 th session for late).Shadings indicate s.e.m. b, Mean optimality score after block transition.c, Mean reward rate after block transition.This reward rate is mere a noisier measure of (b) due to the probabilistic nature of the reward assignment.Therefore, we used the quantities of (a) and (b) for most task performance quantifications.d-f, Performance comparisons between deep RL (230 th -301 st sessions) and mice (≥15 th session).Two-sided Wilcoxon rank-sum test.The box shows the quartiles, and the whiskers extend to 5 th and 95 th percentiles.Mice developed an action policy to preferentially select the side with higher reward assignment probability, while deep RL outperformed mice by utilizing the cumulative nature of the reward availability on the unchosen side as reflected on the optimality score.This action policy difference is reflected on their choice history dependence (see Fig.         Sum of history weights between -2 and -5 trials was used for the +2 trial effect comparison, and the sum of history weights between -3 and -5 trials was used for the +3 trial effect comparison.Significant inactivation effect was restricted to the immediately following trial (Fig. 5c).Coding of Q ch and ∑Q consistently increased during training across all trial periods.ΔQ signal during later trial periods decreased in early training sessions, possibly reflecting that OFC is not the primary site for within-trial maintenance of this information.Areas such as the retrosplenial cortex may be responsible for the within-trial maintenance 7,8 .OFC neurons were subsampled (55 cells/population) for decoding.Statistics are from mixed effects models with session as the fixed effect and neural population as the random intercept.Post-choice (ΔQ: p = 0.35, Q ch : p = 2.07 × 10 −20 , ∑Q: p = 3.05 × 10 −12 ), ready (ΔQ: p = 0.33, Q ch : p = 4.05 × 10 −18 , ∑Q: p = 5.65 × 10 −7 ), pre-choice (ΔQ: p = 5.81 × 10 −7 , Q ch : p = 4.63 × 10 −14 , ∑Q: p = 3.84 × 10 −8 ).b, Relationships between the decoding accuracy and the strength of behavioral dependence on reward history (Sum of unsigned regression weights).Scatterplots with different colors for 14 different OFC populations.Statistics are from mixed effects models with accuracy as the fixed effect and neural population as the random intercept.Post-choice (ΔQ: p = 0.68, Q ch : p = 2.07 × 10 −17 , ∑Q: p = 3.23 × 10 −6 ), ready (ΔQ: p = 0.80, Q ch : p = 2.33 × 10 −11 , ∑Q: p = 0.081), pre-choice (ΔQ: p = 3.13 × 10 −5 , Q ch : p = 6.93 × 10 −10 , ∑Q: p = 0.019).c, Angle between coding axes for shared neurons from adjacent sessions (2 days apart for OFC) was measured to quantify the similarity of population coding for value-related signals.Cosine similarity of the coding axes increases during training except for the ΔQ at ready and pre-choice periods (likely due to weak ΔQ signal during these trial periods as shown in (a)).Statistics are from mixed effects models with session pair as the fixed effect and neural population as the random intercept.Post-choice (ΔQ: p = 4.35 × 10 −4 , Q ch : p = 1.09 × 10 −12 , ∑Q: p = 3.99 × 10 −7 ), ready (ΔQ: p = 0.37, Q ch : p = 1.20 × 10 −7 , ∑Q: p = 5.94 × 10 −3 ), pre-choice (ΔQ: p = 1.89 × 10 −3 , Q ch : p = 8.63 × 10 −9 , ∑Q: p = 7.92 × 10 −3 ).d, Relationships between the angle of coding axes for values and the angle of action policy axes for reward history in pairs of sessions.The similarity in coding axes correlates with the similarity in behavioral action policy except for ΔQ at ready and pre-choice periods (likely due to weak ΔQ signal at these trial periods).Statistics are from mixed effects models with coding axis angle as the fixed effect and neural population as the random intercept.Post-choice (ΔQ: p = 7.31 × 10 −4 , Q ch : p = 1.34 × 10 −8 , ∑Q: p = 3.15 × 10 −7 ), ready (ΔQ: p = 0.36, Q ch : p = 8.14 × 10 −8 , ∑Q: p = 1.57× 10 −5 ), pre-choice (ΔQ: p = 0.54, Q ch : p = 5.39 × 10 −8 , ∑Q: p = 2.77 × 10 −5 ).All shadings indicate s.e.m.All regression lines and statistics are from mixed effects models (Methods).n.s.P > 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.All tests are two-sided.

3 ALFig. 1 |
Fig. 1 | Meta-learning of RL. a, Schematic of the behavior task for mice.b, Example mouse behavior in an expert session (top) and the estimated left and right action values from an RL model in each trial (bottom).Choice frequency was calculated using nine-trial sliding windows.c, Schematic of the deep RL that implements meta-RL.d, Example behavior of a trained deep RL. e, Mean probability of choosing the side with a higher reward assignment probability.Note that the reward assignment probability is not equal to the reward probability in individual trials because a reward, once assigned, remains available until consumed.f, Mean optimality score that measures the optimality of action policy in this task considering the cumulative nature of reward availability.g, Schematic illustrating the meta-RL mechanism in the deep RL.Deep RL updates action values on each trial using recurrent activity, and the action policy (that is, the way they compute the values) is gradually updated by synaptic plasticity across sessions based on the performance evaluation on each session.h, Mean history regression weights in early (deep RL, ≤100th; OFC, day 1-14) and late (deep RL, ≥230th; OFC, ≥day 15) sessions.Mean weight was calculated using the early or late sessions for each individual, and the mean ± 95% CI of the means across models/mice is shown.i, Sum of the history weights of the five past trials (median ± s.e.).Both mice and deep RL models learned to use reward history for decision-making.Weights are plotted along a symmetric log scale where only the range between the minor ticks closest to 0 is linear scale ('symlog' option in matplotlib in Python).Deep RL (reward, P < 1 × 10 −100 ; choice, P < 1 × 10 −100 ), mouse (reward, P = 5.01 × 10 −45 ; choice, P = 1.00 × 10 −7 ).j, Angle between policy axes from different sessions was measured to quantify the similarity of action policies.k, Cosine similarity of policy axes between different pairs of training sessions.l, Cosine similarity between the policy axis on the nth session and the mean policy axis of the following 5 d (n + 1 − n + 5).Deep RL (reward, P < 1 × 10 −100 ; choice, P < 1 × 10 −100 ), mouse (reward, P = 4.1 × 10 −21 ; choice, P = 2.58 × 10 −5 ).Shadings and error bars indicate s.e. and 95% CI, respectively.Statistics in i and l are from mixed-effects models (session number as the fixed effect, subjects as the random intercept, two-sided test).NS P > 0.05, ****P < 0.0001.Five independently trained deep RL models and seven mice used for OFC imaging are included in e, f, h, i, k, and l.NS, not significant.

Fig. 2 |
Fig. 2 | OFC plasticity is required for across-session meta-learning of RL. a, Schematics of optogenetic suppression of synaptic plasticity with paAIP2.b, Virally transfected neurons expressing mEGFP and paAIP2 in a cortical organotypic slice.Right, a field-of-view from lateral orbitofrontal cortex showing transfected pyramidal neurons.c, Top, representative control paAIP2-labeled dendritic shaft of the OFC neuron in which LTP induction using two-photon uncaging without paAIP2 stimulation showed an increase in the spine volume.Bottom, a representative dendritic shaft of the OFC neuron expressing mEGFP and paAIP2 in which LTP induction during blue light stimulation did not show any structural change.Fluorescence intensity of mEGFP was used to measure the spine volume change.For structural long-term potentiation (sLTP) experiments, we transfected slices nine independent times from which we recorded 16 cells in each condition.We obtained similar results as represented in b and c for these nine independent slices.d, Average (mean ± s.e.m.) time course summary of all spines from paAIP2-labeled OFC neurons where LTP was induced successfully without light (gray, 16 spines from 8 neurons) but failed when stimulated with light (blue, 16 spines from 8 neurons).e, Bar graphs showing mean transient volume change (volume change averaged over 0-2 min (mean ± s.e.m.), unpaired t test, t(30) = 4.17, P = 0.0002) and sustained volume change (volume change averaged over 12-14 min (mean ± s.e.m.), unpaired t test, t(30) = 3.252, P = 0.0028).Asterisk denote statistical significance.f, Histology image showing paAIP2 expression and fiber-optic cannula targeting the lateral OFC (LO).Yellow dotted line indicates the location of cannula.We confirmed that all mice in paAIP2 groups (5 mice in Fig.2and 5 mice in Fig.3) in this study show similar expression patterns as in this example.g, Mean probability of choosing the side with a higher reward assignment probability (early, P = 0.95; middle, P = 1.89 × 10 −5 ; late, P = 5.36 × 10 −6 ), and the optimality score (early, P = 0.66; middle, P = 1.38 × 10 −2 ; late, P = 2.74 × 10 −4 ).Mice with EGFP (black, five mice) or EGFP-P2A-paAIP2 (blue, five mice) virus injections.h, Summed history weights (medians) across training sessions.Compared separately for days 1-5, 6-20 and 21-30.Suppression of OFC plasticity during training impairs the learning of reward-based action policy.Reward (early, P = 0.41; middle, P = 1.58 × 10 −8 ; late, P = 2.37 × 10 −3 ), choice (early, P = 0.87; middle, P = 0.65; late, P = 0.27).i, Mean cosine similarity of policy axes between pairs of training sessions and its difference between control and paAIP2 mice.j, Mean cosine similarity between the policy axis on the nth session and the mean policy axis of the following 5 d.Reward (early, P = 0.43; middle, P = 1.08 × 10 −9 ; late, P = 2.50 × 10 −3 ), choice (early, P = 0.80; middle, P = 0.20; late, P = 1.51 × 10 −3 ).Shadings and error bars indicate s.e. and 95% CI, respectively.Statistics in g, h and j are from mixed-effects models (session number as the fixed effect, subject as the random intercept).Aligned rank transform for h.All tests are two-sided.NS P > 0.05, *P < 0.05 , **P < 0.01, ***P < 0.001, ****P < 0.0001.DLO, dorsolateral OFC; VO, ventral OFC.

Fig. 3 |ChanceFig. 4 |
Fig. 3 | Trial-by-trial RL is independent of CaMKII-dependent synaptic plasticity in OFC.Photoactivation of paAIP2 on every other session in expert mice (five mice, blue shadings indicate photoactivation sessions).a, Summed history weights in individual expert sessions.b, Pairwise comparisons of the photoactivation effects.Each line indicates the mean per mouse.Suppression of OFC plasticity in expert mice does not affect history-based action policy and task performance.Shadings and error bars indicate s.e.m. and 95% CI, respectively.All statistics are from mixed-effects models (virus as the fixed effect, session as the random intercept, subject as the random slope, two-sided).NS P > 0.05.

1 Fig. 5 |
Fig. 5 | OFC activity is necessary for trial-by-trial RL. a, Inactivation of the recurrent activity of deep RL models at the prechoice time step impairs behavioral dependence on history (30% of cells were inactivated, P < 1 × 10 −10 for both).Mean regression weights of 50 sessions (top), and the sum of each type of history weights from the past five trials (bottom).b, Schematics and a histology image for bilateral OFC inactivation.Inactivation was performed in ~13% of trials throughout the duration of ITI (0.5 s delay) and ready period.Yellow dotted line indicates the location of the cannula.c, Bilateral optogenetic inactivation of OFC impairs reward history dependence.Mean regression weights for mice with ChrimsonR-tdTomato (top), and the sum of each type of history weights from the past five trials for mice with ChrimsonR-tdTomato or only tdTomato

Fig. 6 |
Fig.6| History-independent action bias is independent of OFC.a, Unilateral OFC inactivation, alternating the side of inactivation every session.The impact on the history-independent bias direction in an example mouse (ΔBias = (bias in inactivated trials) − (bias in control trials)) is shown.b, Mean effects of unilateral OFC inactivation on history dependence (4 mice, 177 sessions).Black, control trials; red, inactivation trials.Similarly to bilateral inactivation, behavioral dependence on reward history was impaired by unilateral inactivation (P = 0.018).c, Unilateral inactivation increased the size of the mean unsigned bias (P = 0.030).d, The direction of the bias did not depend on the side of unilateral

. 1 |
Task engagement was consistent in mice across training sessions.a, Mean number of choice trials per session.b, Mean frequency of alarm trials (licking during ready period).c, Mean frequency of miss trials (trials where mice did not make a choice during the 2 sec answer period).d, Reaction time in choice trials (across-animal averaging of median reaction time).All error bars indicate 95% CI.Only the 7 mice used for OFC imaging were included.

. 3 |
1h for choice alternation only in deep RL).g, Choice prediction accuracy by RL model and history regression model for expert sessions (Deep RL: ≥ 230 th session, mice: ≥ 15 th session).The RL model predicts choices as well as the regression model despite fewer parameters for both mice and deep RL models.The box shows the quartiles, and the whiskers extend to 5 th and 95 th percentiles.Data from 5 independently trained deep RL models and 7 mice used for OFC imaging are used for a-g.The ≥ 15 th session group of the 7 mice consisted of 292 sessions in total.****P < 0.0001.See next page for caption.Extended Data Fig.

3 | CaMKII inhibition by paAIP2 blocks dendritic spine plasticity in M1 during motor learning in vivo without affecting the spine density and dendritic structure. a,
Schematic of behavior and imaging setup.Left, mice expressing either EGFP or both EGFP and paAIP2 were subjected to 2-photon imaging prior to behavioral training.Right, during training, blue light was directed into the cranial window.b,Top, behavioral paradigm.An auditory cue is presented, after which the lever must be pressed past both the smaller (red dotted line) and larger (green dotted line) thresholds in order to receive a water reward.Blue light is on during all cue periods.Bottom, longitudinal experimental schedule.Imaging was performed prior to behavior on sessions 1, 11, 12, 13, and 14.For each field of view, of the 11 th -14 th session, one with the best image quality was chosen and used as the late session.Blue light was presented during every session.c,Selected examples of spine enlargement in EGFP control animals.Unfilled arrowheads demarcate the spines of interest prior to enlargement.Filled arrowheads indicate the spines after enlargement in late imaging sessions.d,Example images illustrating the prevalence of spine enlargement along dendrites in EGFP controls and paAIP2-expressing mice.Demarcated spines are those showing >= 1.5× volume relative to session 1. Filled and unfilled arrows demarcate spines before and after enlargement, respectively.The probability of spine enlargement shown here is comparable to the average values reported in g. e, Spine volume measurements from late sessions of training relative to the first session of training.Data points correspond to individual spines.Only spines present in both early and late sessions ('stable spines') are shown.Colors represent individual animals.The median value of each animal (color-coded horizontal bars) as well as the median of these values for each group (black bars with centripetal arrowheads) are shown.Black dotted line corresponds to a relative spine volume of 1, indicating stable spine size over the experiment.Red dotted line indicates the spine enlargement threshold (1.5× session 1 size) used in subsequent analyses.n= 449 stable spines / 25 dendritic segments / 5 neurons / 4 mice for EGFP controls, n = 308 stable spines, 18 dendritic segments / 5 neurons / 4 mice for paAIP2.f,Histograms of changes in spine size over motor learning for EGFP-(gray) and paAIP2 (light blue)-expressing mice.Both groups show a primary peak at 1, indicating that a majority of spines are relatively stable in their size.The median relative spine size in EGFP controls The means of animals for each group are plotted as color-coded horizontal bars.Error bars correspond to mean ± SEM across animals.Right, the probability of spine enlargement (> 1.5×) is significantly lower in paAIP2expressing animals (p = 5e-05, chi-square test of proportions).Mean ± SEM.Note that the nonzero enlargement probability in paAIP2 animals indicates that plasticity is still occurring, albeit at a lowered level.Statistical tests are twosided.h, Example in vivo images illustrating the viability of paAIP2-expressing neurons across multiple imaging sessions.Left, example in vivo images of a dendrite in early and late imaging sessions.Zoomed-in versions of the selected dendritic segment (red box) on both sessions are shown at bottom.Right, an extracted portion of the dendrite demarcated at left.The majority of spines are stable, and there is no apparent sign of diminishing dendritic health.Images are shown as color-coded by depth to illustrate out-of-plane structures.i, Overall spine density is comparable between EGFP-and paAIP2-expressing mice, and is stable over time.Individual dendritic segments used in this analysis are shown as partially transparent points/lines for both early and late sessions, color-coded by animal.The median spine density for each animal is plotted as color-matched opaque lines.The medians across animals are plotted as black lines.There is no main effect of training session (that is early vs. late, p = 0.47; 2-way ANOVA) nor of transgene (that is EGFP vs. paAIP2, p = 0.38; 2-way ANOVA) on spine density.Further, there is no significant interaction between training session and transgene.Together, these data illustrate that spine density is stable over training, irrespective of the transgene being expressed.n = 628 spines / 20 dendritic segments / 4 mice for EGFP controls; n = 614 spines / 23 dendritic segments / 4 mice for paAIP2.The mean dendritic segment length was 50 ± 7µm for EGFP and 50 ± 10µm for paAIP2.j, Example in vivo images illustrating spine turnover in EGFP control mice.Both spine formation (cyan arrows) and spine elimination (red arrows) are apparent on the dendritic segment shown.Unfilled arrows indicate the location of future formation or elimination; filled arrows indicate the corresponding state in the late learning session.k, Summary of spine turnover in EGFP-and paAIP2-expressing mice.Left, new spine density measured along dendritic segments (each data point represents 1 dendritic segment) from late imaging sessions for each animal (color-coded data points).
Mean number of choice trials per session.f, Mean frequency of alarm trials (licking during ready period).g, Mean frequency of miss trials (trials where mice did not make a choice during the 2 sec answer period).h, Reaction time in choice trials (mean of median reaction time).Statistics are from mixed effects models (virus as the fixed effect and session as the random intercept).
trial-by-trial RL of expert mice.Shadings indicate s.e.m. d-h, Suppression of OFC plasticity during training did not affect history-independent action bias and task engagement.d,Median size of history-independent action bias for mice with both EGFP and paAIP2 expression (blue) or with only EGFP expression (black).e,from mixed-effects model (suppression as the fixed effect, session as the random intercept).Statistics in i-m are from mixed effects models (suppression as the fixed effect, subject as the random slope, session as the random intercept).All tests are two-sided.n.s.P > 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.