Deep neural networks have achieved impressive successes in fields ranging from object recognition to complex games such as Go1,2. Navigation, however, remains a substantial challenge for artificial agents, with deep neural networks trained by reinforcement learning3,4,5 failing to rival the proficiency of mammalian spatial behaviour, which is underpinned by grid cells in the entorhinal cortex6. Grid cells are thought to provide a multi-scale periodic representation that functions as a metric for coding space7,8 and is critical for integrating self-motion (path integration)6,7,9 and planning direct trajectories to goals (vector-based navigation)7,10,11. Here we set out to leverage the computational functions of grid cells to develop a deep reinforcement learning agent with mammal-like navigational abilities. We first trained a recurrent network to perform path integration, leading to the emergence of representations resembling grid cells, as well as other entorhinal cell types12. We then showed that this representation provided an effective basis for an agent to locate goals in challenging, unfamiliar, and changeable environments—optimizing the primary objective of navigation through deep reinforcement learning. The performance of agents endowed with grid-like representations surpassed that of an expert human and comparison agents, with the metric quantities necessary for vector-based navigation derived from grid-like units within the network. Furthermore, grid-like representations enabled agents to conduct shortcut behaviours reminiscent of those performed by mammals. Our findings show that emergent grid-like representations furnish agents with a Euclidean spatial metric and associated vector operations, providing a foundation for proficient navigation. 
As such, our results support neuroscientific theories that see grid cells as critical for vector-based navigation7,10,11, demonstrating that the latter can be combined with path-based strategies to support navigation in challenging environments.
We thank M. Jaderberg, V. Mnih, A. Santoro, T. Schaul, K. Stachenfeld and J. Yosinski for discussions, and M. Botvinick and J. Wang for comments on an earlier version of the manuscript. C.Ba. was funded by the Royal Society and Wellcome Trust.

Reviewer information
Nature thanks J. Conradt and the other anonymous reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
The recurrent layer of the grid cell network is an LSTM with 128 hidden units. The recurrent layer receives as input a vector of self-motion (velocity) signals. The initial cell state and hidden state of the LSTM are initialized by computing a linear transformation of the ground-truth place and head-direction cell activity at time 0. The output of the LSTM is followed by a linear layer to which dropout is applied. The output of the linear layer is linearly transformed and passed to two softmax functions that calculate the predicted head-direction cell activity and place cell activity. We found evidence of grid-like and head-direction-like units in the linear layer activations.
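As a concrete illustration, the forward pass described above can be sketched in numpy for a single time step. The 128-unit LSTM and 512-unit linear layer follow the caption, but the input dimensionality, place/head-direction cell counts, weight scales, and dropout rate are assumptions, and the untrained random weights merely stand in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (input, forget, cell, output gates)."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i, f = sigmoid(z[:n]), sigmoid(z[n:2*n])
    g, o = np.tanh(z[2*n:3*n]), sigmoid(z[3*n:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

# Sizes: 128-unit LSTM and 512-unit linear layer as in the caption; the input
# dimensionality and the place/head-direction cell counts are assumptions.
N_IN, N_H, N_LIN, N_PLACE, N_HD = 3, 128, 512, 256, 12

W = rng.normal(0, 0.1, (4*N_H, N_IN))
U = rng.normal(0, 0.1, (4*N_H, N_H))
b = np.zeros(4*N_H)
W_lin = rng.normal(0, 0.1, (N_LIN, N_H))
W_place = rng.normal(0, 0.1, (N_PLACE, N_LIN))
W_hd = rng.normal(0, 0.1, (N_HD, N_LIN))

h, c = np.zeros(N_H), np.zeros(N_H)   # would be set from ground-truth codes at t = 0
x = rng.normal(size=N_IN)             # one time step of self-motion input

h, c = lstm_step(x, h, c, W, U, b)
g = W_lin @ h                                  # linear-layer ("grid code") activations
keep = (rng.random(N_LIN) > 0.5) / 0.5         # inverted dropout, rate 0.5 (training)
y_place = softmax(W_place @ (g * keep))        # predicted place cell activity
z_hd = softmax(W_hd @ (g * keep))              # predicted head-direction cell activity
```

Inverted dropout (scaling kept units by the keep probability at training time) is used here so that no rescaling is needed at evaluation time; the original implementation may differ.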
Spatial activity plots for all 512 units in the linear layer. Units exhibit spatial activity patterns resembling grid cells, border cells, and place cells. Head-direction tuning was also present but is not shown.
Extended Data Fig. 3 Characterization of grid-like units in the square and circular environments.
a, The scale (assessed from the spatial autocorrelogram of the ratemaps) of grid-like units exhibited a tendency to cluster at specific values. The number of distinct scale clusters was assessed by sequentially fitting Gaussian mixture models with one to eight components. In each case, the efficiency of the fit (likelihood versus number of parameters) was assessed using the Bayesian information criterion (BIC). The BIC was minimized with three Gaussian components, indicating the presence of three distinct scale clusters. b, Spatial stability of units in the linear layer of the supervised network was assessed using spatial correlations: the bin-wise Pearson product-moment correlation between spatial activity maps (32 spatial bins in each map) generated at two different points in training, t = 2 × 10^5 and t′ = 3 × 10^5 training steps (two-thirds of the way through training and at the end of training, respectively). This separation was imposed to minimize the effect of temporal correlations and to provide a conservative test of stability. Grid-like units (gridness > 0.37) are shown in blue; directionally modulated units (resultant vector length > 0.47) in green. Grid-like units exhibit high spatial stability, while directionally modulated units do not. c, Robustness of the grid representation to starting conditions. The network was retrained 100 times with the same hyperparameters but different random seeds controlling the initialization of the network weights. Populations of grid-like units (gridness > 0.37) appeared in all cases, with the average proportion of grid-like units being 23% (s.d. 2.8%). d, The supervised network was also trained in a circular environment (diameter 2.2 m). As before, units in the linear layer exhibited spatially tuned responses resembling grid, border, and head-direction cells. Eight units are shown. Top, ratemap displaying activity binned over location. Middle, spatial autocorrelogram of the ratemap; gridness20 is indicated above.
Bottom, polar plot of activity binned over head direction. e, Spatial scale of grid-like units (n = 56 (21.9%)) is clustered. The distribution is best fit by a mixture of two Gaussians (centres 0.58 and 0.96 m, ratio 1.66). f, Distribution of directional tuning for the 31 most directionally active units; a single line for each unit indicates the length and orientation of the resultant vector47. g, Distribution of gridness and directional tuning. Dashed lines indicate the 95% confidence interval derived from a shuffling procedure (500 permutations); five grid units (9%) exhibit significant directional modulation.
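The model-selection procedure in panel a can be sketched as follows: fit one- to eight-component Gaussian mixtures to the distribution of grid scales and pick the component count that minimizes the BIC. The EM fit below is a minimal hand-rolled version kept dependency-free, and the scale data are synthetic, not the recorded values:

```python
import numpy as np

def gmm_loglik(x, mu, var, w):
    """Log-likelihood of 1-D data under a Gaussian mixture (log-sum-exp for stability)."""
    d = x[:, None] - mu[None, :]
    logp = -0.5 * (d**2 / var + np.log(2 * np.pi * var)) + np.log(w)
    m = logp.max(axis=1)
    return float(np.sum(m + np.log(np.exp(logp - m[:, None]).sum(axis=1))))

def fit_gmm_1d(x, k, iters=300, restarts=5):
    """EM for a 1-D, k-component Gaussian mixture; best log-likelihood over restarts."""
    best = -np.inf
    for seed in range(restarts):
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=k, replace=False)   # init means on data points
        var = np.full(k, x.var())                   # broad initial variances
        w = np.full(k, 1.0 / k)
        for _ in range(iters):
            # E-step: responsibilities
            d = x[:, None] - mu[None, :]
            logp = -0.5 * (d**2 / var + np.log(2 * np.pi * var)) + np.log(w)
            m = logp.max(axis=1, keepdims=True)
            r = np.exp(logp - m)
            r /= r.sum(axis=1, keepdims=True)
            # M-step
            nk = r.sum(axis=0) + 1e-12
            w = nk / len(x)
            mu = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk + 1e-9
        best = max(best, gmm_loglik(x, mu, var, w))
    return best

def bic(x, k):
    # a 1-D, k-component mixture has 3k - 1 free parameters
    return (3 * k - 1) * np.log(len(x)) - 2 * fit_gmm_1d(x, k)

# Synthetic grid scales clustered at three values (illustrative only)
rng = np.random.default_rng(42)
scales = np.concatenate([rng.normal(c, 0.02, 120) for c in (0.3, 0.55, 0.95)])
bics = {k: bic(scales, k) for k in range(1, 9)}
best_k = min(bics, key=bics.get)
```

The BIC trades fit quality against parameter count, so extra components beyond the true cluster count are penalized more than they improve the likelihood.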
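The resultant-vector measure of directional tuning used in panels b, f and g can be sketched as follows; the bin count and the tuning curves below are illustrative, not taken from the data:

```python
import numpy as np

def resultant_vector(activity):
    """Length (0-1) and angle of the resultant of a binned head-direction tuning curve."""
    n = len(activity)
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
    v = np.sum(activity * np.exp(1j * angles))   # activity-weighted sum of unit vectors
    return np.abs(v) / activity.sum(), np.angle(v) % (2 * np.pi)

angles = np.linspace(0, 2 * np.pi, 36, endpoint=False)
tuned = np.exp(3 * np.cos(angles - np.pi / 2))   # von Mises-like bump at 90 degrees
flat = np.ones(36)                               # untuned unit
r_tuned, preferred = resultant_vector(tuned)     # long resultant, preferred direction
r_flat, _ = resultant_vector(flat)               # resultant near zero
```

A sharply tuned unit exceeds the 0.47 threshold used in the legend, while a uniformly active unit has a resultant length near zero.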
Extended Data Fig. 4 Grid-like units did not emerge in the linear layer when dropout was not applied.
Linear layer spatial activity maps (n = 512) generated from a supervised network trained without dropout. The maps do not exhibit the regular periodic structure diagnostic of grid cells.
The architecture of the supervised network (grid network, light blue dashed) was incorporated into a larger deep reinforcement learning network, including a visual module (green dashed) and an actor–critic learner (based on A3C41; dark blue dashed). In this case the supervised learner does not receive the ground-truth place and head-direction activity to signal its initial position, but uses input from the visual module to self-localize after placement at a random position within the environment. Visual module: since experimental evidence suggests that place cell input to grid cells functions to correct for drift and anchor grids to environmental cues21,27, visual input was processed by a convolutional network to produce place cell (and head-direction cell) activity patterns, which were used as input to the grid network. The output of the vision module was provided to the grid network only 5% of the time (see Methods for implementational details), akin to occasional observations of salient environmental cues made by behaving animals27. The output of the vision module was concatenated with the velocity input to form the input to the grid LSTM, which is the same network as in the supervised case (see Methods and Extended Data Fig. 1). The actor–critic learner receives as input the concatenation of the features produced by a convolutional network with the reward r_t, the previous action a_t−1, the linear layer activations of the grid cell network (current grid code), and the linear layer activations observed the last time the goal was reached (goal grid code), which is set to zero if the goal has not been reached in the episode. The fully connected layer was followed by an LSTM with 256 units. The LSTM has two outputs. The first output, the actor, is a linear layer with six units followed by a softmax activation function, which represents a categorical distribution over the agent's next action. The second output, the critic, is a single linear unit that estimates the value function v_t.
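The input concatenation and the two output heads described above can be sketched as follows. The 256-unit LSTM, six actions, and 512-unit grid codes follow the caption, but the visual feature size, the stand-in hidden state, and the random projections (in place of the learned actor and critic heads) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

N_LSTM, N_ACT = 256, 6                         # policy LSTM size and action count
W_pi = rng.normal(0, 0.05, (N_ACT, N_LSTM))    # stand-in for the learned actor head
W_v = rng.normal(0, 0.05, (1, N_LSTM))         # stand-in for the learned critic head

# Concatenated policy-LSTM input: visual features (size hypothetical), reward,
# one-hot previous action, current grid code, and goal grid code.
visual = rng.normal(size=64)
reward = np.array([0.0])
prev_action = np.eye(N_ACT)[2]
grid_code = rng.normal(size=512)
goal_grid_code = np.zeros(512)                 # zeros until the goal is first reached
lstm_input = np.concatenate([visual, reward, prev_action, grid_code, goal_grid_code])

h = rng.normal(size=N_LSTM)                    # stands in for the policy LSTM state
policy = softmax(W_pi @ h)                     # actor: distribution over 6 actions
value = (W_v @ h).item()                       # critic: scalar value estimate
```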
Extended Data Fig. 6 Characterization of grid-like representations and robustness of performance for the grid cell agent in the square land maze environment.
a, Spatial activity plots for the 256 linear layer units in the agent exhibit spatial patterns similar to grid, border, and place cells. b, Cumulative reward indexing goal visits per episode (goal, 10 points) when distal cues are removed (dark blue) and when distal cues are present (light blue). Performance is unaffected, hence the dark blue curve largely obscures the light blue one. The average of the best 50% of agent replicas (n = 32) is plotted (see Methods). The grey band displays the 68% CI based on 5,000 bootstrapped samples. c, Cumulative reward per episode when no goal code was provided (light blue) and when the goal code was provided (dark blue). When no goal code was provided, agent performance fell to that of the baseline deep reinforcement learning agent (A3C) (average score over 100 episodes: no goal code, 123.22 versus A3C, 112.06; effect size, 0.21; 95% CI, 0.18–0.28). The average of the best 50% of agent replicas (n = 32) is plotted (see Methods). The grey band displays the 68% CI based on 5,000 bootstrapped samples. d, After locating the goal for the first time during an episode, the agent typically returned directly to it from each new starting position, showing decreased latencies for subsequent visits, paralleling the behaviour exhibited by rodents.
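The 68% bootstrapped confidence band reported here and in other panels can be computed with a percentile bootstrap over agent replicas; the replica scores below are synthetic, not the experimental values:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=5000, level=0.68, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = np.random.default_rng(seed)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()  # resample replicas
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [50 * (1 - level), 50 * (1 + level)])
    return lo, hi

scores = np.random.default_rng(3).normal(120.0, 15.0, size=32)  # synthetic replica scores
lo, hi = bootstrap_ci(scores)
```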
a–c, AUC performance showing robustness to hyperparameters (that is, learning rate, baseline cost and entropy cost; see Supplementary Table 2 in the Supplementary Methods for details of the ranges) and seeds (see Methods). For each environment we ran 60 agent replicas (see Methods). Light purple is the grid agent, blue is the place cell agent and dark purple is A3C. a, Square arena. b, Goal-driven. c, Goal doors. In all cases the grid cell agent shows higher robustness to variations in hyperparameters and seeds. d–i, Performance of the place cell prediction, NavMemNet and DNC agents (see Methods) against the grid cell agent. Dark blue is the grid cell agent (Extended Data Fig. 5), green is the place cell prediction agent (Extended Data Fig. 9a), purple is the DNC agent and light blue is the NavMemNet agent (Extended Data Fig. 9b). The grey band displays the 68% CI based on 5,000 bootstrapped samples. d–f, Performance in goal-driven. g–i, Performance in goal-doors. Note that the performance of the place cell agent (Extended Data Fig. 8b, lower panel) is shown in Fig. 3.
a, The A3C implementation is as described41. b, The place cell agent was provided with the ground-truth place cell and head-direction cell activations (as described above) at each time step. The output of the fully connected layer of the convolutional network was concatenated with the reward r_t, the previous action a_t−1, the ground-truth current place and head-direction codes, and the ground-truth goal place and head-direction codes observed the last time the agent reached the goal (see Methods).
a, The architecture of the place cell prediction agent is similar to that of the grid cell agent, having a grid cell network with the same parameters. The key difference is the nature of the input provided to the policy LSTM. Instead of using grid codes from the linear layer of the grid network, we used the predicted place cell population activity vector and the predicted head-direction population activity vector (the activations present on the output place and head-direction unit layers of the grid cell network, corresponding to the current and goal position, respectively) as input to the policy LSTM. As in the grid cell agent, the output of the fully connected layer of the convolutional network, the reward r_t and the previous action a_t−1 were also input to the policy LSTM. The convolutional network had the same architecture as described for the grid cell agent. b, NavMemNet agent. The architecture implemented is as described3, specifically FRMQN, but the A3C algorithm was used in place of Q-learning. The convolutional network had the same architecture described for the grid cell agent and the memory was formed of two banks (keys and values), each composed of 1,350 slots.
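A key–value memory of the FRMQN type is read by content-based addressing: a query is matched against the key bank and the resulting attention weights mix the value bank. The sketch below uses the 1,350-slot banks mentioned above, but the key dimension, the scaling, and the random contents are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(5)
N_SLOTS, D = 1350, 32                    # slot count from the caption; D is hypothetical
keys = rng.normal(size=(N_SLOTS, D))     # key bank
values = rng.normal(size=(N_SLOTS, D))   # value bank
query = rng.normal(size=D)               # would come from the agent's recurrent controller

attn = softmax(keys @ query / np.sqrt(D))  # content-based addressing over slots
read = attn @ values                       # retrieved memory vector
```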
a, Overhead view of the linear sunburst maze in initial configuration, with only door 5 open. Example trajectory from grid cell agent during training (green line, icon indicates start location). b, Test configuration with all doors open; grid cell agent uses the newly available shortcuts (multiple episodes shown). c, Histogram showing proportion of times the agent uses each of the doors during 100 test episodes. The agent shows a clear preference for the shortest paths. d, Performance of grid cell agent and comparison agents during test episodes. e, f, Example grid cell agent (e) and example place cell agent (f) trajectory during training in the double E-maze (corridor 1 doors closed). g, h, In the test phase, with all doors open, the grid cell agent exploits the available shortcut (g), while the place cell agent does not (h). i, j, Performance of agents during training (i) and test (j). k, l, The proportion of times the grid (k) and place (l) cell agents used the doors on the first to third corridors during test. The grid cell agent shows a clear preference for available shortcuts, while the place cell agent does not.