Context-dependent extinction learning emerging from raw sensory inputs: a reinforcement learning approach

The context-dependence of extinction learning has been well studied and requires the hippocampus. However, the underlying neural mechanisms are still poorly understood. Using memory-driven reinforcement learning and deep neural networks, we developed a model that learns to navigate autonomously in biologically realistic virtual reality environments based on raw camera inputs alone. Context is neither represented explicitly in our model, nor is a context change signaled. We find that memory-intact agents learn distinct context representations and develop ABA renewal, whereas memory-impaired agents do not. These findings reproduce the behavior of control and hippocampal animals, respectively. We therefore propose that the role of the hippocampus in the context-dependence of extinction learning might stem from its function in episodic-like memory and not in context representation per se. We conclude that context-dependence can emerge from raw visual inputs.

Treatment of anxiety disorders by exposure therapy is often followed by unwanted renewal of the seemingly extinguished fear. A better understanding of the cognitive and neural mechanisms governing extinction learning and fear renewal is therefore needed to develop novel therapies. The study of extinction learning goes back to Ivan Pavlov's research on classical conditioning 1 . Pavlov and his colleagues discovered that the conditioned response (CR) to a neutral conditioned stimulus (CS) diminishes over time if the CS is presented repeatedly without reinforcement by an unconditioned stimulus (US). This phenomenon is called extinction learning.
The arguably best understood extinction learning paradigm is fear extinction, in which an aversive stimulus, e.g. an electric foot shock, is used as US 2 . The conditioned fear response diminishes during extinction, only to return later under certain circumstances 3 . While such return of fear can occur spontaneously (spontaneous recovery), or can be induced by unsignaled presentation of the US (reinstatement), it can also be controlled by context changes (ABA renewal) as follows: Subjects acquire a fear response to a CS, e.g. a neutral tone, by repeatedly coupling it with a US, e.g. a foot shock, in context A. Then the fear response is extinguished in a different context B, and eventually tested again in context A in the absence of the US. Renewal means that the fear response returns in the test phase despite the absence of the US 4 . ABA renewal demonstrates that extinction learning does not delete the previously learned association, although it might weaken the association 3 . Furthermore, ABA renewal underscores that extinction learning is strongly context-dependent [4][5][6][7][8][9][10] .
Fear extinction is thought to depend mainly on three cerebral regions: The amygdala, responsible for initial fear learning, the ventral medial prefrontal cortex, guiding the extinction process, and the hippocampus, controlling the context-specific retrieval of extinction 3 and/or encoding contextual information 11 . Here, we focus on the role of the hippocampus in extinction learning tasks. Lesions or pharmacological manipulations of the hippocampus have been reported to induce deficits in context encoding 3,11 , which in turn impede context disambiguation during extinction learning [5][6][7][8] . In particular, André and Manahan-Vaughan found that intracerebral administration of a dopamine receptor agonist, which alters synaptic plasticity in the hippocampus 12 , reduced the context-dependent expression of renewal in an appetitive ABA renewal task in a T-maze 9 . This task depends on the specific involvement of certain subfields of the hippocampus 13 .
However, it remains unclear how behavior becomes context-dependent and what neural mechanisms underlie this learning. The majority of computational models target exclusively extinction tasks that involve classical conditioning, like the earliest model of conditioning, the Rescorla-Wagner model 14 (for an overview, see Ref. 3 ). However, in its original form this model treats extinction learning as unlearning of previously acquired associations between CS and US, and so cannot account for the phenomenology of extinction learning, renewal in particular. Nevertheless, the Rescorla-Wagner model inspired the development of reinforcement learning (RL) and many recent models of extinction learning. For instance, Ludvig et al. added experience replay 15 to the Rescorla-Wagner model to account for a range of phenomena associated with extinction in classical conditioning: spontaneous recovery, latent inhibition, retrospective revaluation, and trial spacing effects 16 . However, their approach required explicit signals for cues and context, and did not learn from realistic visual information. Moreover, Ludvig et al. acknowledged that their method would require the integration of more advanced RL techniques in order to handle operant conditioning tasks.
Computational models of extinction learning based on operant/instrumental conditioning are rather rare 3 . One of the few models was developed by Redish et al. and was based on RL driven by temporal differences. Their model emphasized the importance of a basic memory system to successfully model renewal in operant extinction tasks 17 . However, their model relied on the existence of dedicated signals for representing cues and contexts, and did not learn from realistic visual input.
Overcoming the limitations of existing approaches, we propose a novel model of operant extinction learning that combines deep neural networks with experience replay memory and reinforcement learning, the deep Q-network (DQN) approach 15 . Our model does not require its architecture to be manually tailored to each new scenario and does not rely on the existence of explicit signals representing cues and contexts. It can learn complex spatial navigation tasks based on raw visual information, and can successfully account for instrumental ABA renewal without receiving a context signal from external sources. Our model also shows better performance than the model by Redish et al.: faster acquisition in simple mazes, and more biologically plausible extinction learning in a context different from the acquisition context. In addition, the deep neural network in our model allows for studying the internal representations that are learned by the model. Distinct representations for contexts A and B emerge in our model during extinction learning from raw visual inputs obtained in biologically plausible virtual reality (VR) scenarios and reward contingencies.

Materials and methods
Modeling framework. All simulations in this paper were performed in a VR testbed designed to study biomimetic models of rodent behavior in spatial navigation and operant extinction learning tasks (Fig. 1). The framework consists of multiple modules, including a VR component based on the Blender rendering and simulation engine (The Blender Foundation, https://www.blender.org/), in which we built moderately complex and biologically plausible virtual environments. In the VR, the RL agent is represented by a small robot that is equipped with a panoramic camera to simulate the wide field of view of rodents. The topology module defines allowed positions and paths for the robot and provides pose data together with environmental rewards. In conjunction with image observations from the VR module, these data make up a single memory sample that is stored in the experience replay module. The DQN component relies on this memory module to learn its network weights.

Visually driven reinforcement learning with deep Q-networks. The agent learns autonomously to navigate in an environment using RL 19 : Time is discretized in time steps t; the state of the environment is represented by S t ∈ S , where S is the set of all possible states. The agent observes the state S t and selects an appropriate action, A t ∈ A(S t ) , where A(S t ) is the set of all possible actions in the current state. In the next time step, the environment evolves to a new state S t+1 , and the agent receives a reward R t+1 .
In our modeling framework, the states correspond to spatial positions of the agent within the VR. The agent observes these states through visual images captured by the camera from the agent's position (e.g. Fig. 2). The set of possible states, S , is defined by the discrete nodes in a manually constructed topology graph that spans the virtual environment (Fig. 3). State space discretization accelerates the convergence of RL. The possible actions in each state, A(S t ) , are defined by the set of outgoing connections from the node in the topology graph. Once an action is chosen, the agent moves to the connected node. In the current implementation, the agent does not rotate.
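The state-action structure described above can be sketched as follows (a minimal illustration with hypothetical names, not the authors' implementation): states are nodes of the topology graph, actions are outgoing edges, and taking an action moves the agent to the connected node.

```python
# Minimal sketch of the topology-graph state space (hypothetical names, not
# the authors' implementation): states are graph nodes, actions are outgoing
# edges, and choosing an action moves the agent to the connected node.

class TopologyGraph:
    def __init__(self, edges):
        # edges: dict mapping node -> list of reachable neighbour nodes
        self.edges = edges

    def actions(self, state):
        """A(S_t): indices of the outgoing connections of the current node."""
        return list(range(len(self.edges[state])))

    def step(self, state, action):
        """Executing an action moves the agent to the connected node."""
        return self.edges[state][action]

# A linear track with four nodes: 0 - 1 - 2 - 3
track = TopologyGraph({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})
```

Because the agent does not rotate, the action set reduces to the plain adjacency of the current node.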
For the learning algorithm we adopt a specific variant of RL called Q-learning, where the agent learns a state-action value function Q(S, A) , which represents the value to the agent of performing action A in state S. The agent learns from interactions with the environment by updating the state-action value function according to the following rule 19 :

Q(S t , A t ) ← Q(S t , A t ) + α δ t , (1a)
δ t = R t+1 + γ max A Q(S t+1 , A) − Q(S t , A t ), (1b)

where α is the learning rate and γ is the discount factor that devalues future rewards relative to immediate ones. The Q-learning update rule has been shown mathematically to converge to the optimal state-action values under appropriate conditions. Since the state-action value function is likely very complex and its analytical model class is unknown, we use a deep neural network to approximate Q(S, A) 15 . In our simulations, the DQN consists of fully connected, dense layers between the input and output layers, and the weights are initialized with random values. We use the tanh function as the activation function.
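The update rule can be illustrated with a tabular sketch (our own minimal example with illustrative names; the model itself approximates Q with a deep network rather than a table):

```python
# Hypothetical tabular sketch of the Q-learning update: the TD error delta
# is the difference between the reward plus discounted best future value and
# the current estimate, and Q is nudged toward it by the learning rate alpha.

def q_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    """One Q-learning step on a dict-based table Q[(state, action)]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions_next)
    delta = r + gamma * best_next - Q.get((s, a), 0.0)   # TD error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
    return delta

Q = {}
delta = q_update(Q, s=0, a=0, r=1.0, s_next=1, actions_next=[0])
```

With an empty table, the TD error equals the received reward, and the estimate moves a fraction alpha toward it.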
The weights in the DQN are updated using the backpropagation algorithm 20 . This algorithm was developed for training multi-layered networks in supervised learning tasks, where a network learns to associate inputs with target outputs. For a given input, the difference between the network's output and the target output is calculated; this is called the error. The network weights are then updated so as to reduce the error. Updating begins in the output layer and is propagated successively backwards through the preceding layers. However, in reinforcement learning, which we use here, the goal is different and there are no target input-output associations to learn. Nevertheless, the backpropagation algorithm can still be applied because the DQN represents the Q function and Eq. (1b) represents the error that should be used to update the Q function after an experience. In other words, in Q-learning the target function is not known but the error is, and that is all that is needed to train the DQN. Furthermore, to stabilize the convergence, we use two copies of the DQN, which are updated at different rates: The online network is updated at every step, while the target network is updated every N steps by blending it with the online network.
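The two-network scheme can be sketched as follows (the exact blending rule and mixing factor are our assumption; names are illustrative): the online weights are trained every step, while the target weights are periodically moved a fraction toward them.

```python
# Sketch of the two-network stabilization (assumed blending scheme, not the
# authors' exact rule): the target weights are periodically blended toward
# the online weights with a mixing factor tau.

def blend(target_weights, online_weights, tau=0.1):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_weights, online_weights)]

target = [0.0, 0.0]
online = [1.0, 2.0]
target = blend(target, online, tau=0.5)
```

Keeping the target network slowly moving prevents the bootstrapped TD targets from chasing their own updates.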
The input to the network consists of 30×1-pixel RGB images, which are captured in the virtual environment. The network architecture comprises an input layer (with 90 neurons for 30 pixels × 3 channels), four hidden layers (64 neurons each) that enable the DQN to learn complex state-action value functions, and an output layer that contains a single node for each action the agent can perform. We use the Keras 21 and KerasRL 22 implementations for the DQN in our model.
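The resulting layer shapes can be verified with a short sketch (the parameter counting is ours, and the output size of 2 actions is an assumed example; the actual output size varies with the maze):

```python
# Layer-size sketch of the DQN architecture described above (a reconstruction
# of the shapes from the text, not the authors' Keras code): 30x1 RGB pixels
# give 90 inputs, four hidden layers of 64 tanh units, one output per action.

def dqn_layer_sizes(pixels=30, channels=3, hidden=64, depth=4, n_actions=2):
    return [pixels * channels] + [hidden] * depth + [n_actions]

def dqn_param_count(sizes):
    """Fully connected layers: weights (fan_in * fan_out) plus biases."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

sizes = dqn_layer_sizes()
n_params = dqn_param_count(sizes)
```

For a two-action maze this gives a network of roughly 18k trainable parameters, small enough to train on single images per step.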
To balance between exploration and exploitation, our agent uses an ε-greedy strategy 19 : In each state S t , the agent either selects the action that yields the highest Q(S t , A t ) with a probability of 1 − ε (exploitation), or chooses a random element from the set of available actions A(S t ) with a probability of ε (exploration). To ensure sufficient exploration, we set ε = 0.3 in all simulations, unless otherwise noted.
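A minimal sketch of ε-greedy selection (illustrative names, not the KerasRL implementation):

```python
# Sketch of epsilon-greedy action selection: with probability epsilon pick a
# random available action (explore), otherwise pick the action with the
# highest estimated Q-value (exploit).
import random

def epsilon_greedy(q_values, epsilon=0.3, rng=random):
    """q_values: list of Q(S_t, a) for the available actions A(S_t)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

greedy_choice = epsilon_greedy([0.1, 0.9, 0.4], epsilon=0.0)
```

With ε = 0 the choice is purely greedy; with ε = 1 every action is drawn uniformly from the available set.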
In our model, experiences are stored in memory to be replayed later. An experience contains information about the state observed, the action taken, and the reward received, i.e. (S t , A t , S t+1 , R t+1 ) . The memory module uses first-in-first-out storage, i.e. if the memory's capacity c is exceeded, the oldest experiences are deleted first to make room for novel ones. In each time step, random batches of memory entries are sampled from the replay memory and used to train the DQN, i.e. to update its weights 15 . Experience replay in RL was inspired by hippocampal replay of sequential neural activity encountered during experience, which is hypothesized to play a critical role in memory consolidation and learning 23,24 . We therefore suggest that lesions or drug-induced impairments of the hippocampal formation can be simulated by reducing the agent's replay memory capacity. The interleaved use of new and replayed experiences in our model resembles awake replay in animals 25 and makes a separate replay/consolidation phase unnecessary, since it would not add any new information. Finally, the agent receives rewards from the environment 19 . These rewards are task-specific and will be defined separately for each simulation setup below.
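The FIFO memory can be sketched as follows (a hypothetical class, not the authors' code): the oldest experience is evicted when the capacity c is exceeded, and training samples random batches.

```python
# Sketch of a FIFO experience replay memory with capacity c (hypothetical
# class): a bounded deque evicts the oldest experience automatically, and
# random batches are drawn for training.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # deque drops oldest items

    def store(self, experience):
        """experience: tuple (S_t, A_t, S_t1, R_t1)."""
        self.buffer.append(experience)

    def sample(self, batch_size):
        """Random batch without replacement for experience replay."""
        return random.sample(list(self.buffer),
                             min(batch_size, len(self.buffer)))

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.store((t, 0, t + 1, -1.0))
```

Reducing `capacity` is how the memory-impaired (hippocampus-lesioned) agents are modeled later in the paper.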
Operant learning tasks simulated in our model. General setup. All trials used in our simulations had a common structure. Each trial began with the agent located at one of the defined starting nodes. In each following time step, the agent could choose to execute one action that was allowed in the current state as defined by the topology graph. Each time step incurred a negative reward of − 1 to encourage the agent to find the shortest path possible from the start to the goal node. Trials ended when either the agent reached the goal node, or the number of simulated time steps exceeded 100 (time-out). The experience replay memory capacity was set to c = 5000 . Different contexts were defined by the illumination color of the house lights. Context A was illuminated with red house lights and context B with white house lights.

Linear track. The starting node was always the first node, while the target node was node N. The agent received a reward of + 1.0 for reaching the target node.
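The common trial structure can be sketched as follows (function names and the fixed policy are illustrative assumptions; only the reward scheme and termination rules come from the text):

```python
# Hedged sketch of the common trial structure: every step costs -1, the trial
# ends on reaching the goal or after 100 steps (time-out), and reaching the
# goal adds +1.0. The always-take-action-0 policy is purely illustrative.

def run_trial(edges, start, goal, policy, max_steps=100):
    state, total_reward = start, 0.0
    for _ in range(max_steps):
        state = edges[state][policy(state)]
        total_reward -= 1.0                   # cost per time step
        if state == goal:
            return total_reward + 1.0, True   # goal reward, trial correct
    return total_reward, False                # time-out

# Linear track 0 -> 1 -> 2 with the goal at node 2
reward, correct = run_trial({0: [1], 1: [2], 2: []}, start=0, goal=2,
                            policy=lambda s: 0)
```

The per-step cost makes shorter paths strictly more valuable, which is what drives the agent toward shortest-path behavior.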
Morris watermaze. Next, we simulated spatial navigation in the more challenging Morris watermaze 26 to show that our model can learn operant spatial tasks in 2-d from raw visual input. The watermaze was circular with a diameter of 2 m. It was placed in the center of a typical lab environment to provide distal cues. The topology graph for this maze consisted of a rectangular grid that covered the center part of the watermaze (Fig. 3a). Four nodes {N(orth),E(ast),S(outh),W(est)} were used as potential starting nodes. They were selected in analogy to a protocol introduced by Anisman et al. 27 : Trials were arranged into blocks of 4 trials. Within each block, the sequence of starting locations was a random permutation of the sequence (N, E, S, W). The goal node was always centered in the NE quadrant and yielded an additional reward of + 1.0.
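The block protocol for starting locations can be sketched as (our own helper, illustrating the protocol after Anisman et al.):

```python
# Sketch of the blocked start-location protocol: trials come in blocks of 4,
# and within each block the starts N, E, S, W each appear exactly once in
# random order.
import random

def starting_sequence(n_blocks, rng=random):
    starts = []
    for _ in range(n_blocks):
        block = ["N", "E", "S", "W"]
        rng.shuffle(block)            # random permutation within each block
        starts.extend(block)
    return starts

seq = starting_sequence(n_blocks=3)
```

Blocked randomization guarantees that every start location is visited equally often while still preventing the agent from predicting the next start.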

Extinction learning in T-maze.
To study the emergence of context-dependent behavior, we simulated the following ABA renewal experiment 9 : In a simple T-Maze, rats were rewarded for running to the end of one of the side arms (target arm). The acquisition context A was defined by a combination of distal and proximal visual cues, as well as odors. Once the animals performed to criterion, the context was changed to B, which was distinct from A, and the extinction phase began. In this phase, animals were no longer rewarded for entering the former target arm. Once the animals reverted to randomly choosing between the two side arms, the context was switched back to A and the test phase started. Control animals showed renewal and chose the former target arm more frequently than the other arm, even though it was no longer rewarded. In our modeling framework, we maintained the spatial layout of the experimental setup and the richness of visual inputs in a laboratory setting, which is usually oversimplified in other computational models. A skybox showed a standard lab environment with distal cues for global orientation, and maze walls were textured using a standard brick texture ( Fig. 2a-d). We made a few minor adjustments to accommodate the physical differences between rats and the simulated robot. For instance, we had to adjust the size of the T-maze and use different illumination to distinguish contexts A and B. In the rat experiments, the two contexts were demarcated by odor cues in addition to visual cues 9 . The simplifications in our model make the distinction between contexts harder, as odor was a powerful context disambiguation cue in the original experiment.
The topology graph in the T-Maze consisted of seven nodes (Fig. 3b). Reaching a node at the end of the side arms yielded a reward of + 1.0 to encourage goal-directed behavior in the agent. The left endpoint was the target node and yielded an additional reward of + 10.0 only in the acquisition phase, not in the extinction and test phases. As in the experiments, a trial was scored as correct, if the agent entered the target arm, and as incorrect, if the other arm was selected or the trial timed out. Finally, to compare our model account of ABA renewal to that of Redish et al. 17 , we used the same topology graph, but provided their model with inputs that represented states and contexts as binary encoded indices.
A simulated session consisted of 300 trials in each of the experimental phases: acquisition, extinction and test. So, altogether a session consisted of 900 trials.
Extinction learning in inhibitory avoidance. To model learning in a task that requires complex spatial behavior in addition to dynamically changing reinforcements, we used an inhibitory avoidance task in an ABA renewal design. The agent had to navigate from a variable starting location to a fixed goal location in the shock zone maze (Fig. 2e-h), while avoiding a certain region of the maze, the shock zone. Distal cues were provided in this environment using the same skybox as in the watermaze simulation. The topology graph was built using Delaunay triangulation (Fig. 3c). The start node was randomly chosen from four nodes (green dots). Reaching the goal node (red dot) earned the agent a reward of + 1.0. If the agent landed on a node in the shock zone (red area, Figs. 2e-f, 3c), it received a punishment (reward = − 20.0). In both the extinction and test phases, the shock zone was inactive and no negative reward was added for landing on a node within the shock zone.

Neural representations of context in ABA renewal tasks.
To study the influence of experience replay and model the effect of hippocampal lesions/inactivations, we compared agents with intact memory ( c = 5000 ) to memory-impaired agents with c = 150.
To study the representations in the DQN layers, we used dimensionality reduction as follows: We defined I l (z, x) to represent the activation of layer l ∈ {0, 1, . . . , 5} when the DQN is fed with an input image captured at location x in context z ∈ {A, B} , where l = 0 is the input layer and l = 5 the output layer. Let X A and X B be the sets of 1000 randomly sampled image locations in contexts A and B, respectively. We expected that, by the end of extinction, the representations I l (A, X A ) and I l (B, X B ) in the DQN's higher layers would form compact, well-separated clusters in activation space. This would enable agents to disambiguate the contexts and to behave differently in the subsequent test phase in context A from how they behaved in context B at the end of extinction, which is exactly renewal. By contrast, for memory-impaired agents we expected no separate clusters for the representations of the two contexts, impairing the agents' ability to distinguish between contexts A and B and, in turn, impairing renewal. We applied Linear Discriminant Analysis (LDA) to find a clustering that maximally separates I l (A, X A ) and I l (B, X B ) . The cluster distances d l quantify the context discrimination ability of network layer l.
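The separation analysis can be illustrated with a simplified stand-in (the study uses LDA; our sketch instead computes a Fisher-style separation score by hand on toy activation vectors, with names of our own choosing):

```python
# Simplified stand-in for the cluster-distance analysis (the paper uses LDA;
# this hand-rolled score is only illustrative): the distance between the two
# context centroids, scaled by the mean within-cluster spread.

def separation(acts_a, acts_b):
    """acts_*: lists of activation vectors for contexts A and B."""
    def centroid(vectors):
        n = len(vectors)
        return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

    def mean_spread(vectors, c):
        return sum(sum((x - ci) ** 2 for x, ci in zip(v, c)) ** 0.5
                   for v in vectors) / len(vectors)

    ca, cb = centroid(acts_a), centroid(acts_b)
    between = sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5
    within = mean_spread(acts_a, ca) + mean_spread(acts_b, cb)
    return between / within if within > 0 else float("inf")

# Well-separated toy clusters score higher than overlapping ones
d_separated = separation([[0.0, 0.0], [0.1, 0.0]], [[5.0, 5.0], [5.1, 5.0]])
d_overlap = separation([[0.0, 0.0], [1.0, 0.0]], [[0.5, 0.0], [1.5, 0.0]])
```

A large score means compact, well-separated context clusters, i.e. the property the higher DQN layers are expected to develop during extinction.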

Results
We developed a computational framework based on reinforcement learning, driven by experience replay memory and a deep neural network, to model complex operant learning tasks in realistic environments. Unlike previous computational models, our framework learns from raw visual inputs and does not receive pre-processed inputs.
Proof-of-principle studies of spatial learning. As a proof-of-principle, and to facilitate a direct comparison to an earlier model, we first studied spatial learning on a simple linear track with simplified inputs, i.e. the positions were represented by binary indices. To vary the difficulty of the task, the track was represented by a 1-d topology graph with variable node count. The first node was always the starting node and the agent was rewarded for reaching the last node. The larger the number of nodes in the graph, the more difficult it was for the agent to solve the task. Hence, the average number of trials in which the agent reached the goal node before the trial timed out decreased with the number of nodes N (Fig. 4a, blue line). The reason for this result is simple: the agent had to find the goal node by chance at least once before it could learn goal-directed behavior, and this was less likely to occur in longer tracks with a larger number of intermediate nodes.
Comparing our model to existing approaches turned out to be difficult, since operant extinction learning has not received much attention in the past. The model proposed by Redish et al. 17 was eventually chosen as a reference. Even though this model was developed for different kinds of tasks, it could solve the spatial learning task on the linear track if the inputs provided to the model were simplified to represent states as indices (Fig. 4b). Next, we studied more complex spatial learning in a 2-d watermaze 26 with raw camera inputs. Since the agent was rewarded for finding the location of the escape platform as quickly as possible, we quantified learning performance by the escape latency (Fig. 4a). The learning curve shows clearly that the system's performance improved rapidly over trials, i.e. the latency decreased. Our results compared well to those reported in the literature on rodents 28 .

Extinction and renewal of operant behavior.
Having shown that our model agent can learn simple spatial navigation behaviors based on raw camera inputs, we turned our attention to more complex learning experiments such as operant extinction and renewal. Initial tests were run in the T-maze. Performance was measured with cumulative response curves (CRCs). If the agent entered the target arm, the CRC was increased by one; if the other arm was entered or the trial timed out, the CRC was not changed. Hence, consistently correct performance yielded a slope of 1, (random) alternation of correct and incorrect responses yielded a slope of 0.5, and consistently incorrect responses yielded a slope of 0. The CRC for our model looked just as expected for an ABA renewal experiment (Fig. 5a). After a very brief initial learning period, the slope increased to about 1 and remained there until the end of the acquisition phase, indicating nearly perfect performance. At the transition from the acquisition to the extinction phase, the slope of the CRC stayed constant for the first 50 extinction trials, indicating that the agent generalized the learned behavior from context A to context B, even though it no longer received a larger reward for the left turn. After 50 extinction trials, the slope decreased gradually (Fig. 5a, red arrow), indicating extinction learning. Finally, the change from context B back to context A at the onset of the test phase triggered an abrupt rise of the CRC slope, indicating a renewal of the initially learned behavior. The CRC for the model by Redish et al. looked similar to that of our model (Fig. 5b), but exhibited two key differences that were not biologically plausible. First, the CRCs split into one group that showed no extinction, and hence no renewal either, and another group that showed extinction and renewal. Second, the latter group showed a sharp drop of the CRC slope in the very first extinction trial (Fig. 5b, red arrow), indicating an immediate shift in behavior upon the first unrewarded trial.
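The CRC construction can be sketched as (our own helper, matching the increment rule described in the text):

```python
# Sketch of a cumulative response curve (CRC): the curve increases by one for
# each correct trial and stays flat otherwise, so its local slope reflects
# the response rate (1 = always correct, 0.5 = chance, 0 = never correct).

def crc(outcomes):
    """outcomes: iterable of booleans, True = correct trial."""
    curve, total = [], 0
    for correct in outcomes:
        total += int(correct)
        curve.append(total)
    return curve

curve = crc([True, True, False, True, False, False])
```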
To test our model in an even more challenging task, we used an inhibitory avoidance learning paradigm in the shock zone maze, which combined complex spatial navigation with positive and negative rewards (see "Methods") in an ABA renewal design. In this task, agents had to navigate to a goal location in a 2-d environment along the shortest path possible, while at the same time avoiding a shock zone with an irregular shape (Fig. 2e,f). A trial was considered correct if the agent reached the goal without entering the shock zone on the way. This definition allowed us to use the CRC to quantify the trial-by-trial performance of simulated agents. The curve increased by one if a trial was correct and remained constant otherwise. That means a slope of 1 in the CRC indicated perfect avoidance of the shock zone, and zero slope indicated no avoidance at all.
Even in this complex learning task, the memory-intact agent in our model showed biologically realistic behavior. The mean CRC started with a slope of about 0.0 in the acquisition phase (Fig. 6a). After ≈ 100 trials, the CRC slope increased to about 0.8, as the agent learned to circumnavigate the shock zone. At the beginning of the extinction phase, the agent continued to avoid the shock zone, even though shocks were no longer applied and even though the context had changed from A to B. However, by the end of the extinction phase, the agent had learned to reach the goal faster by traversing the shock zone, as indicated by a CRC slope of about 0.0. When the context was switched back to context A at the onset of the test phase, the CRC slope rose sharply, indicating renewal (Fig. 6a). Over the course of the test phase, the agent learned that the shock zone was no longer active in context A and that it could be safely traversed, causing a smooth drop of the CRC slope to about 0.0 in the final trials. These results showed that the agents in our model could learn context-dependent behavior even in a highly complex operant learning task in a two-dimensional environment.
To confirm that renewal in our model is indeed triggered by the context change, we tested for ABC renewal in the T-Maze, i.e., acquisition in context A (red house lights), extinction in context B (white house lights), and renewal in one of two neutral contexts, C 1 or C 2 . In the C 1 context, we set the house lights to a mixture of 30% red and 70% blue; in C 2 , house lights were set to 70% red and 30% blue. Based on the literature, renewal strength in the ABC paradigm was expected to be diminished compared to ABA renewal. We also expected that renewal would become weaker when context C was more dissimilar to context A, i.e., C 1 should yield a lower renewal strength than C 2 . Our model showed both expected effects (Fig. 6d), confirming that renewal in our model is indeed driven by the context change.
The role of experience replay in learning context-dependent behavior. We next explored the role of experience replay in the emergence of context-dependent behavior in our model agent. To this end, we repeated the above simulation with a memory-impaired agent. The memory-impaired agent learned in much the same way as the memory-intact agent did during acquisition (Fig. 6b), but exhibited markedly different behavior thereafter. First, extinction learning was faster in the memory-impaired agent, consistent with experimental observations. This result might seem unexpected at first glance, but it can be readily explained. With a low memory capacity, experiences from the preceding acquisition phase were not available during extinction and the learning of new reward contingencies during extinction led to catastrophic forgetting in the DQN 29 . As a result, the previously acquired behavior was suppressed faster during extinction. Second, no renewal occurred in the test phase, i.e., the CRC slope remained about zero at the onset of the test phase. This is the result of catastrophic forgetting during extinction. Our results suggest that ABA renewal critically depends on storing experiences from the acquisition phase in experience memory and making them available during extinction learning. This result is consistent with experimental observations that hippocampal animals do not exhibit renewal and suggests that the role of the hippocampus in memory replay might account for its role in renewal.
The agent's learning behavior in our model crucially depended on two hyperparameters, namely the capacity of the experience replay memory c, and the exploration parameter ε . To analyze the impact of these hyperparameters, we performed and averaged 20 runs in the shock zone maze for different parameter combinations of c and ε . Let S − (c, ε) and S + (c, ε) be the slopes of the mean CRCs in the last 50 trials of extinction and the first 50 trials of the test phase, respectively. In the ideal case of extinction followed by renewal: S − (c, ε) = 0.0 and S + (c, ε) = 1.0 . We therefore defined the renewal strength for each combination of hyperparameters as

F R (c, ε) = S + (c, ε) − S − (c, ε),

which is near 1 for ideal renewal and near 0 for no renewal. As expected from our results on memory-impaired agents above, the renewal strength was near zero for low memory capacity (Fig. 6c). It increased smoothly with increasing memory capacity c, so that a memory capacity c > 4000 allowed the agents to exhibit renewal reliably. The renewal strength displayed a stable ridge for c > 4000 and 0.25 < ε < 0.35 . This not only justified setting ε = 0.3 in our other simulations, but also showed that our model is robust against small deviations of ε from that value.
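The renewal-strength computation can be sketched as follows (the slope-difference form is our reconstruction from the stated ideal values, i.e. near 1 for ideal renewal and near 0 for none; helper names are ours):

```python
# Sketch of the renewal-strength measure (our reconstruction from the text):
# the slope difference between the CRC early in the test phase (S_plus) and
# late in extinction (S_minus), which is 1 for ideal renewal.

def slope(curve):
    """Mean per-trial slope of a CRC segment."""
    return (curve[-1] - curve[0]) / (len(curve) - 1)

def renewal_strength(ext_tail, test_head):
    return slope(test_head) - slope(ext_tail)

# Ideal case: flat CRC at the end of extinction, slope 1 early in test
f_r = renewal_strength(ext_tail=[10, 10, 10, 10], test_head=[10, 11, 12, 13])
```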
The renewal strength was very low for small values of ε regardless of memory capacity. This points to an important role of exploration in renewal. With a low rate of exploration in the extinction phase, the model agent persevered and followed the path learned during acquisition. This led to S − (c, ε) ≈ S + (c, ε) , and hence F R (c, ε) ≈ 0 , and explains why the exploration parameter has to take relatively large values of 0.25 < ε < 0.35 to generate robust renewal.

Emergence of distinct neural representations for contexts.
Having shown that our model learned context-dependent behavior in complex tasks, we turned our attention to the mechanism underlying this behavior in a specific experiment. We simulated an experiment on appetitive, operant extinction learning, which found that a dopamine receptor agonist administered to the hippocampus had a profound effect on ABA renewal in a simple T-maze 9 (Fig. 7a). The model agent received rewards for reaching the end of either side arm, but the reward was larger in the left (target) arm. As in the experimental study 9 , we quantified the performance of the simulated agents by the number of trials, out of 10, in which the agents chose the target arm at different points of the simulated experiment: {early acq., late acq., early ext., late ext., early test, late test}. In the acquisition phase in context A, memory-intact agents initially chose randomly between the left and right arm, but by the end of acquisition they preferentially chose the target arm (Fig. 7b, blue filled bars). In the extinction phase in context B, agents initially continued to prefer the target arm, i.e. they generalized the association formed in context A to context B, but this preference attenuated by the end of extinction. Critically, when the context was then switched back to the acquisition context A in the test phase, agents preferred the target arm again, even though it was no longer associated with a larger reward. This is the hallmark of ABA renewal. The renewal score S R = (early test/late ext.) − 1.0 indicates the relative strength of the renewal effect in the T-maze experiments; S R ≈ 0.59 for the memory-intact agent. Since there was no special reward for preferentially choosing the target arm, the agent stopped doing so by the end of the test phase.
The memory-impaired agent showed the same acquisition learning as the memory-intact agent, but exhibited faster extinction learning and no renewal (S_R ≈ −0.14, Fig. 7b, unfilled bars). These results are consistent with our modeling results in the inhibitory avoidance task discussed above and with the experimental results from rats 9.

Figure 7. ABA renewal arises because experience replay drives the emergence of two distinct context representations. (a) André and Manahan-Vaughan showed that renewal was reduced by a pharmacological manipulation that interfered with synaptic plasticity in the hippocampus (adapted from Fig. 1 in Ref. 9). Renewal score in experimental animals: S_R ≈ 0.45; renewal score in control animals: S_R ≈ 1.08. (b) In our model, agents with an intact experience replay memory showed ABA renewal in the T-maze task, i.e. an increase in response rate in early test despite the continued absence of reward (S_R ≈ 0.59). By contrast, memory-impaired agents did not exhibit renewal and had a lower renewal score (S_R ≈ −0.14). (c) Cluster distances between the network representations of contexts A and B in the T-maze for memory-intact agents. Results were averaged over 1000 runs of our model. The representations of contexts A and B in higher layers of the deep neural network (l ≥ 3) formed distinct, well-separated clusters. They were learned in the extinction phase, since the distinction was not present at the end of the acquisition phase. (d) By contrast, memory-impaired agents did not show distinct context representations in any learning phase.

We hypothesized that context-dependent ABA renewal emerged in memory-intact agents because the higher layers of the DQN developed distinct representations of contexts A and B, which would allow the DQN to associate different reward contingencies with similar inputs. We therefore studied the evolution of context representations in the DQN based on the cluster distances between the representations of the two contexts (see "Methods"). It is important to keep in mind that, in the DQN, different layers have different functions: lower layers are more closely related to sensory processing, while higher layers (closer to the output layer) fulfill evaluative tasks. Hence, lower layers reflect the properties of the visual inputs to the network. Since images from context A form a cluster that is distinct from the cluster of images from context B (the contexts were distinguished by different illumination colors), image representations in lower layers formed distinct clusters. Hence, the cluster distance in the input layer, d_0, was nonzero and remained so throughout the simulation (Fig. 7c,d). At the end of acquisition (red curve), the agents had been exposed only to visual inputs from context A, so there was no benefit for the DQN to represent contexts A and B differently. Therefore, the difference in the context representations in the input layer was gradually washed out in higher layers, i.e., d_1, …, d_4 < d_0. As a result, distinct image representations were mapped onto the same evaluation. In other words, the network generalized the evaluation it had learned in context A to context B, just as subjects do. However, after extinction (blue curve), the replay memory contained observations from both contexts A and B, which were associated with different reward contingencies. This required that the higher layers of the DQN discriminate between the contexts, and d_3, d_4 > d_0. The network's ability to represent the two contexts differently, and thus maintain different associations during the extinction phase, gave rise to renewal in the subsequent test phase (green curve), where the experience replay memory encoded novel observations from context A, yet still contained residual experiences from context B. Reward contingencies were now identical in both contexts, and there was no longer a benefit to maintaining distinct context representations; hence, d_3, d_4 < d_0. Note that d_3 and d_4 were also much smaller at the end of the test phase than they were at the end of extinction, and would eventually vanish if the renewal test phase were extended. As expected, cluster distances for the DQN in memory-impaired agents (Fig. 7d) did not change between the phases of the experiment and remained at low values throughout. This was a consequence of the reduced capacity of the experience replay memory and accounts for the lack of renewal in memory-impaired agents. Since the agents had no memory of the experiences during acquisition in context A, the DQN suffered from catastrophic forgetting 29 when exposed to a different reward contingency in context B, and therefore could not develop separate representations of the two contexts during extinction.
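A layer-wise cluster distance of this kind can be sketched as follows. The paper's exact definition of d_l is given in its Methods; the centroid-distance definition below is only one plausible, illustrative variant:

```python
import numpy as np

def cluster_distance(acts_a, acts_b):
    """Distance between the centroids of a layer's activation vectors
    recorded in context A and context B (rows = samples, cols = units).
    Illustrative definition; the paper's exact metric is in its Methods."""
    return float(np.linalg.norm(acts_a.mean(axis=0) - acts_b.mean(axis=0)))
```

Applying such a function to the activations of each layer, collected separately in the two contexts, yields per-layer distances analogous to the curves d_0, …, d_4 discussed above.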

Discussion
We have developed a computational model to study the emergence of context-dependent behavior in complex learning tasks within biologically realistic environments based on raw sensory inputs. Learning is modeled as deep Q-learning with experience replay 15. Here, we apply our model to a sparsely researched topic: the modeling of operant extinction learning 3. We found that our model accounts for many features observed in experiments, in particular context-dependent ABA renewal and the importance of hippocampal replay in this process. Our model has several advantages relative to previous models. Compared to Redish et al.'s model of operant extinction learning 17, our model learned faster on the linear track and showed more plausible extinction learning in that it generalized from the acquisition to the extinction context. Further, our model allowed for the investigation of internal representations of the inputs and, hence, a comparison to neural representations. Finally, our model makes two testable predictions: context-dependent behavior arises through the learning of clustered representations of different contexts, and context representations are learned based on memory replay. We discuss these predictions in further detail below.
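The memory manipulation at the heart of the model can be sketched as a capacity-limited FIFO buffer; the class and method names below are illustrative (the actual implementation follows the experience replay scheme of Ref. 15):

```python
import random
from collections import deque

class ReplayMemory:
    """Experience replay buffer; a small capacity models the
    memory-impaired agent, a large one the memory-intact agent."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def store(self, transition):
        # transition = (state, action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform random minibatch for the deep Q-network update
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

With a small capacity, transitions from the acquisition phase are evicted before extinction ends, so the network trains only on the new contingency and overwrites the old association, i.e. catastrophic forgetting.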
Learning context representations. One common suggestion in the study of learning is that stimuli can be distinguished a priori into discrete cues and contextual stimuli based on some inherent properties 30. For instance, cues are discrete and rapidly changing, whereas the context is diffuse and slowly changing. Prior models of extinction learning, e.g. Refs. 16,17, rely on external signals that isolate sensory stimuli and the current context. They maintain different associations for different contexts and switch between associations based on the explicit context signal. Since this explicit context switching is biologically unrealistic, Gershman et al. suggested a latent cause model in which contexts act as latent causes that predict different reward contingencies 31. When expected rewards do not occur, or unexpected rewards are delivered, the model infers that the latent cause has changed. However, the model still depends on explicit stimulus signals, and requires complex Bayesian inference techniques to create and maintain the latent causes.
An alternative view is that there is no categorical difference between discrete cues and contextual information during associative learning, i.e., all sensory inputs drive learning to some extent 14. Two lines of experimental evidence support this learning hypothesis. First, the learning of context can be modulated by attention and by the informational value of contexts 32. This finding suggests that contextual information is not fixed and that what counts as contextual information depends on task demands 33. Second, the context is associated with the US to some degree 34, instead of merely gating the association between CS and US. There are suggestions that the association strength of the context is higher in the early stage of conditioning than later, when it becomes more apparent that the context does not predict the US 34. However, when the context is predictive of the CS-US pairing, the CR remains strongly context-dependent after multiple learning trials 35.
In line with the learning view, the internal deep Q-network in our model received only raw camera inputs and no explicit stimulus or context signaling. The model autonomously adapted the network weights to the given learning tasks based on reward information collected during exploration and from the experience replay memory. By analyzing the model's internal deep Q-network, we found that distinct context representations emerged spontaneously in the activations of the network's higher layers over the course of learning, if the agent was exposed to two contexts with different reward contingencies. Distinct contextual representations emerged rather slowly in our study because the relevance of the context in an ABA renewal paradigm emerges slowly. If the task had been different, e.g. the agent had been exposed to the US in context A and no US in context B in interleaved trials from the outset, distinct contextual representations in the higher layers of the network would have emerged faster. Nevertheless, they would have been learned in our model and not assumed to exist a priori, as in almost all previous models.
Finally, we emphasize that, in addition to this task-dependent contextual representation in higher layers, the network's lower layers exhibit task-independent differences in neural activity in different contexts. We do not call the latter "contextual representations" since they are mere reflections of the sensory differences between inputs from different contexts.
The neural representation of context. The hippocampus has been suggested to play a role in the encoding, memorization, and signaling of contexts, especially in extinction learning 3,11. However, there is no conclusive evidence to rule out that the hippocampus is involved in the renewal phenomenon primarily due to its other well-established role in general-purpose memory storage and retrieval 36-38. In our model, the emergence of distinct contextual representations critically depended on experience replay, suggesting that experience replay facilitates the formation of stable memory representations, as predicted by neuroscientific studies 23,39.
In line with previous suggestions 6,10, we therefore propose that contexts might be represented in brain regions outside the hippocampus, and that the learning of these representations is facilitated by a memory of experiences (inputs, actions and rewards), which are stored with the help of the hippocampus. However, we note that our model does not preclude that the hippocampus generates contextual representations of its own and that these representations drive behavior. Our model only indicates that hippocampal replay of low-level inputs suffices to generate context representations in other regions, such as the prefrontal cortex (PFC). Unlike previous computational models of extinction learning, our model separates the simulated fast hippocampal replay (experience replay memory) from other cortical regions (deep Q-network) that learn more slowly. As the network learns abstract representations of the visual inputs, the lower layers most likely correspond to visual areas in the cortex, while the higher layers serve functions similar to those of the PFC.
We propose that the hypothesis above could be tested using high-resolution calcium imaging 40 of the PFC in behaving rodents. Our model predicts that different contexts are encoded by PFC population activity patterns that form well-separated clusters by the end of extinction learning, allowing the animals to disambiguate contexts and leading to renewal. In hippocampal animals, however, we expect that novel contexts are not encoded in separate PFC clusters; instead, their representations in the PFC population will overlap with those of previously acquired contexts. Furthermore, our model suggests that context representations in the brain are highly dynamic during renewal experiments, adapting to the requirements of each experimental phase. Distinct context representations would therefore arise in control animals over the course of extinction learning, and need not be present from the start, cf. Ref. 41.

The role of the hippocampus in renewal.
Our hypotheses about contextual representations discussed above naturally raise the question: what role does the hippocampus play in renewal? The experimental evidence leaves no doubt that the hippocampus plays a critical role in renewal 3,7,11,42. In our model, we focused on the experience replay observed in the hippocampus, i.e. we propose that the hippocampus is involved in renewal because of its ability to store and replay basic experiences. In our simulations, replay of experiences from the acquisition phase was critical for learning distinct context representations in the extinction phase. This is consistent with observations in humans 43: only subjects whose activation in the bilateral hippocampus was higher during extinction in a novel context than in the acquisition context showed renewal later. With a limited experience memory, learning during extinction in our model overwrote previously acquired representations, a phenomenon known as catastrophic forgetting; this in turn prevented renewal from occurring when the agent was returned to the acquisition context. If lesioning the hippocampus in animals has an effect similar to limiting the experience memory in our simulation, as we propose, our model predicts that impairing the hippocampus pre-extinction weakens renewal, while impairing the hippocampus post-extinction should have no significant influence on the expression of renewal.
Both of these predictions are supported by some experimental studies and challenged by others. Regarding the first prediction, some studies have confirmed that hippocampal manipulations pre-extinction have a negative impact on ABA renewal. For instance, ABA renewal was disrupted by pharmacological activation of dopamine receptors prior to extinction learning 9. By contrast, other studies suggest that disabling the hippocampus pre-extinction has no impact on later renewal. For instance, inactivation of the dorsal hippocampus with muscimol, an agonist of GABA_A receptors 44, or scopolamine, a muscarinic acetylcholine receptor antagonist 8, prior to extinction learning did not weaken ABA renewal during testing. The apparent contradiction could be resolved if the experimental manipulations in Refs. 8,44 did not interfere with memories encoded during acquisition, as other manipulations of the hippocampus would, but instead interfered with the encoding of extinction learning. As a result, the acquired behavior would persist during the subsequent renewal test. Indeed, activation of GABA_A receptors, or inactivation of muscarinic receptors, reduces the excitation of principal cells, thereby changing signal-to-noise ratios in the hippocampus. We hypothesize that, as a result, animals exhibited reduced extinction learning following scopolamine 8 or muscimol 44, consistent with impaired encoding of extinction. By contrast, dopamine receptor activation during extinction learning 9 did not reduce extinction learning, but impaired renewal. The results of the latter study are more consistent with the results of our model when memory capacity is reduced pre-extinction: extinction learning was slightly faster and renewal was absent.
In support of our second prediction, one study showed that post-extinction antagonism of β-adrenergic receptors, which are essential for long-term hippocampus-dependent memory and synaptic plasticity 45, had no effect on ABA renewal 46. By contrast, several studies suggested that lesioning the hippocampus post-extinction reduces renewal of a context-dependent aversive experience. For instance, ABA renewal in the test phase was abolished by excitotoxic 42 and electrolytic 7 lesions of the dorsal hippocampus after the extinction phase. More targeted lesions of dorsal CA1, but not CA3, post-extinction impaired AAB renewal 6. The authors of these studies generally interpreted their results to indicate that the hippocampus, or more specifically CA1, is involved in retrieving the contextual information necessary for renewal. However, damage to these subregions can be expected to have generalized effects on hippocampal information processing of any kind. Another possibility is that the post-extinction manipulations, which were made within a day of extinction learning, interfered with systems consolidation. This is very plausible in the case of dopamine receptor activation 9, as this can be expected to promote the consolidation of the extinction learning event. To model this account in our framework, one would have to split the learning and replay during the extinction phase, which occur in parallel in our current implementation, into two separate stages. Only new learning would occur during the extinction phase, while experience replay (of the acquisition phase) would occur during sleep or wakeful quiescence after the extinction phase, as suggested by the two-stage model of systems consolidation. In this case, the associations formed during acquisition would initially be overwritten during extinction learning, as they were in our simulations without experience memory, and then relearned during the replay/consolidation period 23,24.
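The proposed two-stage variant of our framework could be sketched as follows; the callables and the memory structure are placeholders for the model's update rules and stored experiences, not part of the current implementation:

```python
import random

def run_two_stage_extinction(learn_online, replay_offline, memory,
                             n_trials, n_replay_steps):
    """Stage 1: new learning only, with no concurrent replay.
    Stage 2: offline replay of stored (acquisition) experiences,
    standing in for consolidation during sleep or quiescence."""
    for trial in range(n_trials):      # extinction phase proper
        learn_online(trial)
    for _ in range(n_replay_steps):    # post-extinction consolidation
        replay_offline(random.choice(memory))
```

Interrupting the second stage (e.g. by a post-extinction manipulation) would leave the newly learned extinction intact while the acquisition association is never relearned, matching the account given above.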
If systems consolidation were altered during this period, extinction learning would still be retained, but the association learned during acquisition would be overwritten and hence there would be no renewal. In line with this interpretation, recent findings suggest that extinction learning and renewal engage similar hippocampal subregions and neuronal populations 13, suggesting that manipulating the hippocampus can affect previously learned experience and thus subsequent renewal.
In conclusion, we have shown that, in an ABA renewal paradigm, contextual representations can emerge spontaneously in a neural network, driven only by raw visual inputs and reward contingencies. There is no need for external signals that inform the agent about cues and contexts. Since contextual representations might be learned, they might also be much more dynamic than previously thought. Memory is critical in facilitating the emergence of distinct context representations during learning, even if the memory itself does not code for context representations per se.