Optimizing agent behavior over long time scales by transporting value

Humans prolifically engage in mental time travel. We dwell on past actions and experience satisfaction or regret. More than storytelling, these recollections change how we act in the future and endow us with a computationally important ability to link actions and consequences across spans of time, which helps address the problem of long-term credit assignment: the question of how to evaluate the utility of actions within a long-duration behavioral sequence. Existing approaches to credit assignment in AI cannot solve tasks with long delays between actions and consequences. Here, we introduce a paradigm where agents use recall of specific memories to credit past actions, allowing them to solve problems that are intractable for existing algorithms. This paradigm broadens the scope of problems that can be investigated in AI and offers a mechanistic account of behaviors that may inspire models in neuroscience, psychology, and behavioral economics.


Supplementary Methods: Signal-to-Noise Ratio Analysis
As in the article text, we refer to phases 1-3 as P1-P3. We define the signal as the squared norm of the expected policy change in P1 induced by the policy gradient. To be precise, let
\[
\Delta\theta := \sum_{t \in P1} \nabla_\theta \log \pi(a_t \mid h_t)\, R_t.
\]
Further, in the following assume that the returns are baseline-subtracted, i.e., each $R_t$ denotes the return with a state-dependent baseline $b(h_t)$ subtracted.
We define the signal as
\[
\mathrm{Signal} := \left\lVert \mathbb{E}_\pi[\Delta\theta] \right\rVert^2
\]
and the noise as the trace of the variance of the policy gradient,
\[
\mathrm{Noise} := \operatorname{Tr}\left(\operatorname{Var}_\pi[\Delta\theta]\right).
\]
Recall that $r_t \equiv 0$ for $t \in P1$. Further, P1 and P2 are approximately independent, as P2 is a distractor phase whose initial state is unmodified by activity in P1. The only dependence is through the agent's internal state and parameters, but we assume that for these problems it is a weak dependence, which we ignore in the present calculations. In this case,
\[
\Delta\theta = \sum_{t \in P1} \nabla_\theta \log \pi(a_t \mid h_t) \Big( \sum_{t' \in P2} r_{t'} + \sum_{t' \in P3} r_{t'} \Big).
\]
Based on these considerations, the signal term is easy to calculate. Define $g_\theta := \sum_{t \in P1} \nabla_\theta \log \pi(a_t \mid h_t)$. Because the P2 rewards are approximately independent of P1 activity and $\mathbb{E}_\pi[g_\theta] = 0$,
\[
\mathrm{Signal} = \Big\lVert \mathbb{E}_\pi\Big[ g_\theta \sum_{t' \in P3} r_{t'} \Big] \Big\rVert^2.
\]
With this, the noise term becomes
\[
\mathrm{Noise} \approx \operatorname{Tr}\operatorname{Var}_\pi\Big[ g_\theta \sum_{t' \in P2} r_{t'} \Big] + \operatorname{Tr}\operatorname{Var}_\pi[\Delta\theta \mid \text{no P2}],
\]
where $\operatorname{Tr}\operatorname{Var}_\pi[\Delta\theta \mid \text{no P2}]$ is the variance in the policy gradient due to P1 and P3 without a P2 distractor phase. (The approximate equality reflects that the memory state of the system is altered by the P2 experience, but we neglect this dependence.) We make the assumption that performance in P2 is independent of activity in P1, which is approximately the case in the distractor task we present in the main text. With this assumption, and using $\mathbb{E}_\pi[g_\theta] = 0$, the first term above becomes
\[
\mathbb{E}_\pi\Big[ \Big( \sum_{t' \in P2} r_{t'} \Big)^{2} \Big] \times \mathbb{E}_\pi\left[ g_\theta^\top g_\theta \right].
\]
In the limit of large P2 reward variance, we have
\[
\mathrm{Noise} \approx \operatorname{Var}_\pi\Big[ \sum_{t' \in P2} r_{t'} \Big] \times \operatorname{Tr}\operatorname{Var}_\pi\Big[ \sum_{t \in P1} \nabla_\theta \log \pi(a_t \mid h_t) \Big].
\]
The reward variance in P2, $\operatorname{Var}_\pi[\sum_{t' \in P2} r_{t'}]$, reduces the policy gradient SNR, and low SNR is known to negatively impact the convergence of stochastic gradient optimization [1]. Of course, averaging $S$ independent episodes increases the SNR correspondingly to $S \times \mathrm{SNR}$, but averaging over an increasing number of samples is not universally possible and only defers the difficulty: there is always a level of reward variance in the distractor phase that matches or overwhelms the variance reduction achieved by averaging.
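For concreteness, this scaling follows directly from the definitions above (a restatement in display form, not an additional result):
\[
\mathrm{SNR} := \frac{\mathrm{Signal}}{\mathrm{Noise}}
             = \frac{\left\lVert \mathbb{E}_{\pi}[\Delta\theta] \right\rVert^{2}}
                    {\operatorname{Tr}\operatorname{Var}_{\pi}[\Delta\theta]}.
\]
Averaging the gradient over $S$ independent episodes, $\overline{\Delta\theta} = \tfrac{1}{S}\sum_{i=1}^{S} \Delta\theta^{(i)}$, leaves the signal unchanged while dividing the noise by $S$:
\[
\mathbb{E}_{\pi}[\overline{\Delta\theta}] = \mathbb{E}_{\pi}[\Delta\theta],
\qquad
\operatorname{Tr}\operatorname{Var}_{\pi}[\overline{\Delta\theta}]
  = \frac{1}{S}\operatorname{Tr}\operatorname{Var}_{\pi}[\Delta\theta],
\qquad
\text{hence } \mathrm{SNR}_{S} = S \times \mathrm{SNR}.
\]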

Supplementary Methods: Algorithm Pseudocode for RMA and TVT

Algorithm 1 RMA Worker Pseudocode

  // Assume global shared model parameter vector θ and counter T := 0
  // Assume thread-specific parameter vector θ'
  // Assume discount factor γ ∈ (0, 1] and bootstrapping parameter λ ∈ [0, 1]
  Initialize thread step counter t := 1
  repeat
    Synchronize thread-specific parameters θ' := θ
    Zero model's memory & recurrent state if new episode begins
    t_start := t
    repeat
      …
      Apply a_t to environment and receive reward r_t and observation o_{t+1}
      t := t + 1; T := T + 1
    until environment termination or t − t_start == τ_window
    If not terminated, run an additional step to compute V̂(z_{t+1}, log π_{t+1}) and set R_{t+1} := V̂(z_{t+1}, log π_{t+1})  // (but don't increment counters)
    (Optional) Apply Temporal Value Transport (Alg. 3)
    Reset performance accumulators A := 0; L := 0; H := 0
    for k from t down to t_start do
      …
    end for
    Asynchronously update θ via gradient ascent using dθ'
  until T > T_max

Algorithm 2 Temporal Value Transport for One Read

  input: rewards {r_t}_{t∈[1,T]}, value predictions {V̂_t}_{t∈[1,T]}, read strengths {β_t}_{t∈[1,T]}, read weights {w_t}_{t∈[1,T]}
  splices := []
  for each crossing of read strength β_t above β_threshold do
    t_max := argmax_t {β_t | t ∈ crossing window}
    Append t_max to splices
  end for
  for t' in 1 to T do
    for t in splices do
      if … then
        …  {The read based on w_t influences the value prediction at the next step, hence V̂_{t+1}.}
      end if
    end for
  end for
  return modified rewards {r'_t}_{t∈[1,T]}

Algorithm 3 Temporal Value Transport for Multiple Reads

  input: rewards {r_t}_{t∈[1,T]}, value predictions {V̂_t}_{t∈[1,T]}, and, for each of the k reads, read strengths {β_t}_{t∈[1,k]} and read weights {w_t}
  …
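The splice-detection step of Algorithm 2 (finding each crossing of the read strength above β_threshold and taking the argmax within the crossing window) is fully specified above. The following is a minimal NumPy sketch of that step alone; the function name and array conventions are ours, not part of any released implementation.

  import numpy as np

  def detect_splices(read_strengths: np.ndarray, beta_threshold: float) -> list:
      """For each contiguous window where the read strength exceeds
      beta_threshold, return the time step of maximal read strength."""
      above = read_strengths > beta_threshold
      splices = []
      t, T = 0, len(read_strengths)
      while t < T:
          if above[t]:
              start = t
              while t < T and above[t]:  # extend the crossing window
                  t += 1
              # t_max := argmax of the read strength over this crossing window
              splices.append(start + int(np.argmax(read_strengths[start:t])))
          else:
              t += 1
      return splices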

Supplementary Methods: Task Definitions
All tasks were implemented in DeepMind Lab (DM Lab) [2]. DM Lab has a standardized environment map unit length: all sizes given below are in these units. For all DM Lab experiments, agents processed 15 frames per second. The environment itself produced 60 frames per second, but we propagated only the first observation of each packet of four to the agents. Rewards accumulated over each packet were summed together and associated with the first, undropped frame. Similarly, the agents chose one action at the beginning of this packet of four frames: this action was applied four times in a row. We define the number of "Agent Steps" as the number of independent actions sampled by the agent: that means one agent step per packet of four frames. We used a consistent action set for all experiments except for the Arbitrary Visuomotor Mapping task. For all other tasks, we used a set of six actions: move forward, move backward, rotate left with rotation rate of 30 (mapping to an angular acceleration parameter in DM Lab), rotate right with rotation rate of 30, move forward and turn left, move forward and turn right. For the Arbitrary Visuomotor Mapping task, the agent did not need to move relative to the screen but instead needed to move its viewing angle. We thus used four actions: look up, look down, look left, look right (with rotation rate parameter of 10).
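As an illustration of this frame-packing scheme, here is a minimal sketch of an action-repeat wrapper, assuming a Gym-style step/reset environment interface; the wrapper is our own illustration and not part of the DM Lab release.

  class ActionRepeatWrapper:
      """Apply each agent action for 4 consecutive environment frames,
      summing rewards over the packet (60 fps environment -> 15 agent
      steps per second)."""

      def __init__(self, env, num_repeats: int = 4):
          self._env = env
          self._num_repeats = num_repeats

      def reset(self):
          return self._env.reset()

      def step(self, action):
          total_reward = 0.0
          done = False
          for _ in range(self._num_repeats):
              obs, reward, done, info = self._env.step(action)
              total_reward += reward  # rewards over the packet are summed
              if done:
                  break
          # `obs` is now the first frame of the next packet: the only
          # observation propagated to the agent, which chooses its next
          # action based on it.
          return obs, total_reward, done, info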
DM Lab maps use texture sets to determine the floor and wall textures. We use a combination of four different texture sets in our tasks: Pacman, Tetris, Tron, and Minesweeper. DM Lab texture sets can take on various colors, but we use the default colors for each set, which are: Pacman: blue floors and red walls. Tetris: blue floor and yellow walls. Tron: yellow floor and green walls. Minesweeper: blue floor and green walls. Examples of how these sets appear can be seen in various figures in the main text.
Episodes for the tasks with delay intervals are broken up into multiple phases. Phases do not repeat within an episode. Generally, the tasks contain three phases (P1-P3), with a middle distractor phase.
We used a standardized P2 distractor phase task: the map is an 11 × 11 open square (Figure 1b second column). The agent spawns (appears) adjacent to the middle of one side of the square, facing the middle. An apple is randomly spawned independently at each unit of the map with probability 0.3, except for the square in which the agent spawns. Each apple gives a reward $r_{\text{apple}}$ of 5 when collected and disappears after collection. The agent remains in this phase for 30 seconds. (This length was varied in some experiments.) The map uses the Tetris texture set unless mentioned otherwise.
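A minimal sketch of the apple-spawn distribution just described (the grid size, exclusion of the spawn cell, and the 0.3 probability come from the text; the function name is ours):

  import numpy as np

  def spawn_apples(rng: np.random.Generator, agent_spawn: tuple,
                   size: int = 11, p_apple: float = 0.3) -> np.ndarray:
      """Return a boolean size x size grid; each cell independently
      contains an apple with probability p_apple, except the agent's
      spawn cell."""
      grid = rng.random((size, size)) < p_apple
      grid[agent_spawn] = False  # no apple where the agent spawns
      return grid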
In several tasks, we use cue images to provide visual feedback to the agent, e.g. indicating that an object has been picked up. These cue images are colored rectangles that overlay the input image, covering the majority of the top half of the image. An example of a red cue image is shown in Supplementary Figure 10a, third panel. These cues are shown for 1 second once activated, regardless of a transition to a new phase that may occur during display.
In each episode of Passive Visual Match, four distinct colors are randomly chosen from a fixed set of 16 colors. One of these is selected as the target color and the remaining three are distractor colors. Four squares are generated with these colors, each the size of one wall unit. The three phases in each episode are: 1. The map is a 1 × 3 corridor with the target color square covering the wall unit at one end. The agent spawns facing the square from the other end of the corridor (Figure 1b first column). There are no rewards in this phase. The agent remains in this phase for 5 seconds. The map uses the Pacman texture set.

2. The standard distractor phase described above.
3. The map is a 4 × 7 rectangle with the four color squares (the target color and three distractor colors) on one of the longer sides, with a unit gap between each square. The ordering of the four colors is randomly chosen. There is an additional single unit square placed in the middle of the opposite side, in which the agent spawns, facing the color squares. In front of each color square is a groundpad (Figure 1b last two columns). When the agent touches one of these pads, a reward of 10 points is given if it is the pad in front of the target color square, and a reward of 1 is given for any other pad. The episode then ends. If the agent does not touch a pad within 5 seconds, then no reward is given for this phase and the episode ends. The map uses the Tron texture set.
Active Visual Match is the same as Passive Visual Match, except that the map in phase 1 is now larger and the position of the target image in phase 1 is randomized. The phase 1 map consists of two 3 × 3 open squares connected by a 1 × 1 corridor that joins each square in the middle of one side (Figure 2a first two columns). The agent spawns in the center of one of the two squares, facing the middle of one of the walls adjacent to the wall with the opening to the corridor. The target color square is placed randomly over any one of the wall units on the map.
The three phases of Key-to-Door are: 1. The map is identical to the map in phase 1 of Active Visual Match. The agent spawns in the corner of one of the squares that is furthest from the opening to the corridor, facing into the square but not towards the opening. A key is placed randomly within the map (not at the spawn point) and if the agent touches the key it disappears and a black cue image is shown (Figure 4a first two columns). As in the Visual Match tasks, there are no rewards in this phase, and the phase lasts for 5 seconds. The map uses the Pacman texture set.
2. The standard distractor phase described above.
3. The map is a 1 × 3 corridor with a locked door in the middle of the corridor.
The agent spawns at one end of the corridor, facing the door. At the end of the corridor on the other side of the door is a goal object (Figure 4a fourth column). If the agent touched the key in phase 1, the door can be opened by walking into it; if the agent then walks into the goal object, a reward of 10 points is given. Otherwise, no reward is given. The map uses the Tron texture set.
Key-to-Door-to-Match combines elements of Key-to-Door with Active Visual Match. One target color and three distractor colors are chosen in the same way as for the Visual Match tasks. In contrast to the standard task setup, there are five phases per episode: 1. This phase is the same as phase 1 of Key-to-Door but with a different map. The map is a 3 × 4 open rectangle with an additional 1 × 1 square attached at one corner, with the opening on the longer of the two walls. The agent spawns in the additional 1 × 1 square, facing into the rectangle (Figure 5a first column). The map uses the Minesweeper texture set.
2. The standard distractor phase except that the phase lasts for only 15 seconds instead of 30 seconds.
3. The map is the same as in phase 3 of Key-to-Door. Instead of a goal object behind the locked door, the target color square covers the wall at the far end of the corridor (Figure 5a third column). There is no reward in this phase, and it lasts for 5 seconds. The map uses the Pacman texture set.
4. The standard distractor phase except that the phase lasts for only 15 seconds instead of 30 seconds.
5. The final phase is the same as phase 3 in the Visual Match tasks.
The three phases of Two Negative Keys are: 1. The map is a 3 × 4 open rectangle. The agent spawns in the middle of one of the shorter walls, facing into the rectangle. One red key is placed in a corner opposite the agent, and one blue key is placed in the other corner opposite the agent. Which corner has the red key and which the blue key is randomized per episode. If the agent touches either of the keys, a red or blue cue image is shown according to which key the agent touched (Supplementary Figure 10 first three columns). After one key is touched, it disappears, and nothing happens if the agent goes on to touch the remaining key (i.e., no cue is displayed and the key remains in the map). The phase lasts for 5 seconds, and there are no rewards; if the agent does not touch any key during this period, at the end of the phase a black cue image is shown. The map uses the Tron texture set.
2. The standard distractor phase except with the Tetris texture set.
3. The layout is the same as in phase 3 of the Key-to-Door task. If the agent picked up either of the keys, then the door will open when touched, and the agent can collect the goal object, at which point it respawns into the phase 2 (distractor) map but with all remaining apples removed. This phase lasts for only 2 seconds in total; when it ends, a reward of -20 is given if the agent did not collect the goal object; a reward of -10 is given if the agent collected the goal object after touching the blue key; and a reward of -1 is given if the agent collected the goal object after touching the red key. The map uses the Tron texture set.
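The phase 3 payoff structure can be summarized compactly (a restatement of the rules above; the function and argument names are ours):

  def two_negative_keys_p3_reward(collected_goal: bool, key_color: str) -> int:
      """Terminal reward for phase 3 of Two Negative Keys."""
      if not collected_goal:
          return -20  # failed to collect the goal object
      return -10 if key_color == "blue" else -1  # blue key vs. red key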
In each episode of Latent Information Acquisition, three objects are randomly generated using the DM Lab object generation utilities. The color and type of each object are randomized. Each object is independently assigned at random to be a good or a bad object.
1. The map is a 3 × 5 rectangle. The agent spawns in one corner, facing outwards along one of the shorter walls. The three objects are positioned randomly among five points as displayed in Figure 6c in the main text (Figure 6a first four columns). If the agent touches one of the good objects, it disappears, and a green cue image is shown. If the agent touches one of the bad objects, it disappears, and a red cue image is shown. This phase lasts for 5 seconds, and there are no rewards. The map uses the Tron texture set. The image cues shown in this phase are displayed for only 0.25 seconds, so that the cues do not interfere with the continuation of the P1 activity (in all other tasks they are shown for 1 second).
2. The standard distractor phase except with the Tetris texture set.
3. The map, spawn point, and possible object locations are the same as in phase 1. The objects are the same, but their positions are randomly chosen again. If the agent touches a good object it disappears, and a reward of 20 is given. If the agent touches a bad object it disappears and a reward of -10 is given. This phase lasts for 5 seconds. The map uses the Tron texture set.

Supplementary Methods: Distractor Phase Modifications
In order to analyze the effect of increasing variance of distractor reward on agent learning, we created variants of the distractor phase where this reward variance could be easily controlled. Since the distractor phase is standardized, any of these modifications can be used in any of the tasks.

Zero Apple Reward: The reward given for apples in the distractor phase is zero. Even though the apples give zero reward, they still disappear when touched by the agent.

Fixed Number of Apples: The reward given for apples remains at 5. Instead of the 120 free squares of the map independently spawning an apple with probability 0.3, we fix the number of apples to be 120 × 0.3 = 36 and distribute them randomly among the 120 available map units. Under an optimal policy where all apples are collected, this has the same mean reward as the standard distractor phase but with no variance.

Variable Apple Reward: The reward $r_{\text{apple}}$ given for apples in the distractor phase can be modified (to a positive integer value), but with probability $1 - 1/r_{\text{apple}}$ each apple independently gives zero reward instead of $r_{\text{apple}}$. Any apple touched by the agent still disappears.
This implies that the optimal policy and the expected return under the optimal policy are constant, but the variance of the returns increases with $r_{\text{apple}}$. Since there are 120 possible positions for apples in the distractor phase, and apples independently appear in each of these positions with probability 0.3, each position contributes a reward of $r_{\text{apple}}$ with probability $0.3/r_{\text{apple}}$ and zero otherwise. The variance of undiscounted returns in P2, assuming all apples are collected, is therefore
\[
\operatorname{Var}_\pi\Big[\sum_{t' \in P2} r_{t'}\Big]
 = 120 \, r_{\text{apple}}^{2} \, \frac{0.3}{r_{\text{apple}}}\Big(1 - \frac{0.3}{r_{\text{apple}}}\Big)
 = 36\, r_{\text{apple}} - 10.8 .
\]
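As a sanity check of this expression, a short Monte Carlo sketch of the variable apple reward condition (the function name and structure are ours):

  import numpy as np

  def p2_return_samples(r_apple: int, n_episodes: int = 100_000,
                        n_cells: int = 120, p_apple: float = 0.3,
                        seed: int = 0) -> np.ndarray:
      """Simulate P2 returns under the variable apple reward condition,
      assuming all spawned apples are collected."""
      rng = np.random.default_rng(seed)
      # Each cell holds an apple w.p. p_apple; each apple pays r_apple
      # w.p. 1 / r_apple, else zero.
      apples = rng.random((n_episodes, n_cells)) < p_apple
      pays = rng.random((n_episodes, n_cells)) < 1.0 / r_apple
      return (apples & pays).sum(axis=1) * r_apple

  for r in (1, 3, 6, 10):
      returns = p2_return_samples(r)
      # Empirical variance should approach 36 * r - 10.8.
      print(r, returns.mean(), returns.var(), 36 * r - 10.8)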

Supplementary Methods: Control Tasks
Control tasks are taken from the DM Lab 30 task set [2]. The tasks we include each require memory access for good performance. We provide only brief descriptions here, since these tasks are part of the open source release of DM Lab available at https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/dmlab30. In Explore Goal Locations Small, agents must find the goal object as fast as possible. Within episodes, when the goal object is found, the agent respawns and the goal appears again in the same location. The goal location, level layout, and theme are randomized per episode. The agent spawn location is randomized per respawn.
In Natlab Varying Map Randomized, the agent must collect mushrooms within a naturalistic terrain environment to maximise score. The mushrooms do not regrow. The map is randomly generated and of intermediate size. The topographical variation and the number, position, orientation, and size of shrubs, cacti, and rocks are all randomized. Locations of mushrooms are randomized.
The time of day is randomized (day, dawn, night). The spawn location is randomized for each episode.
In Psychlab Arbitrary Visuomotor Mapping, a task in the Psychlab framework [3], the agent is shown images from a visual memory capacity experiment dataset [4] in an experimental protocol known as arbitrary visuomotor mapping. The agent views consecutive images, each associated with a particular cardinal direction, and is rewarded if it remembers the direction in which to move its fixation cross for each image. The images are drawn from a set of roughly 2,500 possible images, and the specific associations are randomly generated per episode.

Supplementary Methods: Task Specific Parameters
The same parameters were used across the TVT, RMA, LSTM+Mem, and LSTM agents, except for γ, which for the TVT model was always 0.96 and was varied as indicated in the figure legends for the other models. The learning rate was varied only for the learning rate analysis in Section 7. Across tasks, we used the parameters shown in Table 1, with a few exceptions:
• For all the control tasks, we used α_image = 1 instead of 20.
• For all the control tasks, we used τ window = 200 instead of using the full episode.
• For the Two Negative Keys task, we used α entropy = 0.05 instead of 0.01.
The TVT-specific parameters were not varied across tasks. In Supplementary Figure 15, we show that β_threshold could range from 1 to 5 without destroying performance on the Active Visual Match task. For the Key-to-Door task, in Supplementary Figure 14, we show that there was no performance difference across a factor-of-two variation in the threshold, but for the largest value of the threshold performance deteriorated, as the TVT mechanism was no longer triggered.

Supplementary Methods: Task Analyses
For the Active Visual Match and Key-to-Door tasks, we analyzed the effect of distractor phase reward variance on the performance of the agents. To do this, we used the same tasks but with modified distractor phases, as described in Supplementary Methods: Distractor Phase Modifications.

Active Visual Match
Supplementary Figure 13 shows learning curves for $r_{\text{apple}} = 0$ (zero apple reward) and $r_{\text{apple}} = 1$ (variable apple reward). When $r_{\text{apple}} = 1$, all apples give reward. Learning for the RMA was already significantly disrupted when $r_{\text{apple}} = 1$, so for Active Visual Match we do not report higher-variance examples. Figure 4c shows learning curves with the apple reward $r_{\text{apple}}$ set to 1, 3, 6, and 10, which give variances of the total P2 reward of 25, 100, 196, and 361, respectively (the variable apple reward condition). Note that episode scores for these tasks show that all apples are usually collected in P2 at policy convergence.

Key-to-Door
Note that the mean distractor phase return in the previous analysis is much less than the mean return in the standard distractor phase. Another way of looking at the effect of variance in the distractor phase, whilst retaining the full mean return, is shown in Supplementary Figure 11, which has three curves: one for zero apple reward, one for a fixed number of apples, and one for the full level (which has a variable number of apples per episode but the same expected return as the fixed-number-of-apples case). From the figure, it can be seen that introducing large rewards slows learning in phase 1 because of the variance present while the agent is still learning the policy of collecting all the apples, but the disruption to learning is much more significant when the number of apples remains variable even after the agent has learned the apple collection policy.

Return Prediction Saliency
To generate Figure 4e in the main text, a sequence of actions and observations for a single episode of Key-to-Door was recorded from a TVT agent trained on that level. We show two time steps at which the key was visible. We calculated gradients $\partial V_t / \partial I_t^{w,h,c}$ of the agent's value prediction with respect to the input image at each time step. We then computed the sensitivity of the value prediction to each pixel by aggregating these gradients across the channel dimension, smoothed the per-pixel sensitivities with a 2-pixel-wide Gaussian filter, and normalized the result by its 97th percentile across time and pixels. Input images were then layered over a black image with an alpha channel that increased toward 1 with the normalized sensitivity.
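The exact channel aggregation, filter parameters, and alpha mapping were lost in extraction; the following NumPy sketch shows one consistent instantiation of the pipeline described above, with the L2 aggregation over channels, the filter width, and the linear alpha ramp as our assumptions.

  import numpy as np
  from scipy.ndimage import gaussian_filter

  def saliency_alpha(grads: np.ndarray, percentile: float = 97.0,
                     sigma: float = 2.0) -> np.ndarray:
      """grads: array of shape (T, W, H, C) holding dV_t / dI_t^{w,h,c}.
      Returns per-pixel alpha values in [0, 1] of shape (T, W, H)."""
      # Aggregate channel gradients into a per-pixel sensitivity
      # (L2 norm over channels is an assumption).
      sens = np.sqrt((grads ** 2).sum(axis=-1))
      # Smooth each frame with a Gaussian filter (2-pixel width).
      sens = np.stack([gaussian_filter(s, sigma=sigma) for s in sens])
      # Normalize by the 97th percentile across time and pixels.
      scale = np.percentile(sens, percentile)
      # Alpha ramps linearly to 1 with normalized sensitivity (assumed).
      return np.clip(sens / scale, 0.0, 1.0)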

Learning Rate Analysis for High Discount Factor
To check that the learning rates used for the high discount RMA and LSTM models were reasonable, we ran the largest-variance tasks from Section 7 (for the RMA with γ = 0.998 and for the LSTM with γ = 0.998) with learning rates $3.2 \times 10^{-7}$, $8 \times 10^{-7}$, $2 \times 10^{-6}$, $5 \times 10^{-6}$, and $1.25 \times 10^{-5}$. The results are shown in Supplementary Figure 12; they show that the default learning rate of $5 \times 10^{-6}$ was the best among those tried.

Behavioral Analysis of Active Visual Match
We compared the P1 behaviors of a TVT agent and an RMA agent, as shown in Figure 3a in the main text. First, we modified the environment to fix the color square in one of three pre-selected wall locations. We then ran TVT and RMA for 10 episodes in each of these three fixed color square conditions. Finally, we plotted the agents' positional trajectories in each condition. We also visualized the TVT agent's memory retrievals by plotting a single episode trajectory with arrowheads indicating agent orientation at every second agent step. Each arrowhead is also color-coded by the maximal read weight from any time step in P3 back to the memory encoded at that time and position in P1.

Behavioral Analysis of Latent Information Acquisition
We evaluated TVT and RMA for 50 episodes in the Latent Information Acquisition task. To visualize, we scatter-plotted the agent's position as a black dot for each P1 time step (50 episodes × 75 P1 time steps = 3,750 dots in total). We also binned the agent's position on a 4 × 5 grid and counted the percentage of time the agent occupied each grid cell. We visualized this grid occupancy using a transparent heatmap overlying the top-down view. To further quantify the behaviour of TVT versus RMA, we recorded how many objects the agent acquired in the exploration phase in each of the 50 test trials and plotted the mean and standard deviation in a bar plot.
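A minimal sketch of the grid-occupancy computation (bin counts and percentage normalization as described above; the map extent and names are ours):

  import numpy as np

  def occupancy_grid(xs: np.ndarray, ys: np.ndarray,
                     extent: tuple,
                     shape: tuple = (4, 5)) -> np.ndarray:
      """Bin agent (x, y) positions into a 4 x 5 grid and return the
      percentage of time spent in each cell."""
      x_min, x_max, y_min, y_max = extent
      counts, _, _ = np.histogram2d(
          xs, ys, bins=shape,
          range=[[x_min, x_max], [y_min, y_max]])
      return 100.0 * counts / counts.sum()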

Return Variance Analysis
In Key-to-Door, over 20 trials, we computed and compared two return variances based on trajectories from the same TVT agent. The first was the variance of the undiscounted return, $R_t = \sum_{t' \geq t} r_{t'}$. The second was computed as in Algorithm 1 and Algorithm 3 using TVT ($\gamma = \lambda = 0.96$), i.e., it was bootstrapped recursively,
\[
R_t = r'_t + \gamma \left[ (1 - \lambda)\, \hat{V}_{t+1} + \lambda\, R_{t+1} \right],
\]
where $r'_t$ is the reward after modification by TVT.
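The recursion above can be computed with a single backward pass over an episode; a minimal sketch, assuming the TVT-modified rewards and the value estimates are given as arrays (names are ours):

  import numpy as np

  def lambda_returns(rewards: np.ndarray, values: np.ndarray,
                     bootstrap_value: float, gamma: float = 0.96,
                     lam: float = 0.96) -> np.ndarray:
      """Compute R_t = r_t + gamma * ((1 - lam) * V_{t+1} + lam * R_{t+1})
      backward in time. values[t] is the value estimate aligned with
      rewards[t]; bootstrap_value is V_{T+1}, used as R_{T+1}."""
      T = len(rewards)
      returns = np.empty(T)
      next_return = bootstrap_value
      for t in range(T - 1, -1, -1):
          next_value = values[t + 1] if t + 1 < T else bootstrap_value
          returns[t] = rewards[t] + gamma * (
              (1 - lam) * next_value + lam * next_return)
          next_return = returns[t]
      return returns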

Supplementary Tables
Table 1: Parameters and values used for all agents across tasks.

All models learned to retrieve the P3 reward with no P2 delay, but performance is hampered at longer delays for models with no reconstruction loss. Using CIFAR-10 images [5] instead of colored squares as the P1 and P3 images, the RMA was still able to perform the Passive Image Match task.

Although there is overhead to running the TVT memory system and algorithm, its throughput in processing environment steps is of the same order as that of a simple LSTM agent, an LSTM with an external memory, and the RMA agent.

Comparison with results from Table 4 in [6] and from the original A3C [7]: the methods that worked best on this domain used off-policy experience replay; such methods make repeated learning updates from single experiences. The policy gradient methods without replay (including TVT, A3C, and IMPALA) performed worse than the replay-based Q-learning methods (Reactor, Ape-X, and R2D2), and all were below the reported human baselines. The very low probability of successfully encountering any reward in this game, alongside deterministic transition dynamics, implies that an effective strategy is to repeat any action trajectory that led to non-zero reward. In this regime, with very low probabilities of securing reward, TVT did not improve results.