Main

Space and time are fundamental physical structures in the natural world, and all organisms have evolved strategies for navigating space to forage, mate and escape predation1,2,3. In humans and other mammals, the concept of a spatial or cognitive map has been postulated to underlie spatial reasoning tasks4,5,6. A spatial map is an internal, neural representation of an animal’s environment that marks the location of landmarks, food, water and shelter, which can be queried for navigation and planning. The neural algorithms underlying spatial mapping are thought to generalize to other sensory modes to provide cognitive representations of auditory and somatosensory data7 as well as to construct internal maps of more abstract information including concepts8,9, tasks10, semantic information11,12,13 and memories14. Empirical evidence suggests that the brain uses common cognitive mapping strategies for spatial and non-spatial sensory information, so that shared mapping algorithms might exist that can map and navigate over not only visual but also semantic information and logical rules inferred from experience7,8,15. In such a paradigm, reasoning itself could be implemented as a form of navigation within a cognitive map of concepts, facts and ideas.

Since the notion of a spatial or cognitive map emerged, how environments are represented within the brain and how such maps can be learned from experience have been central questions in neuroscience16. Place cells in the hippocampus are neurons that are active when an animal transits through a specific location in an environment16. Grid cells in the entorhinal cortex fire at regular spatial intervals and likely track an organism’s displacement in the environment17,18. Yet, even with the identification of a neural substrate for representing space, how a spatial map can be learned from sensory data has remained an open question, and the algorithms that enable the construction of spatial and other cognitive maps remain poorly understood.

Empirical work in machine learning has demonstrated that deep neural networks can solve spatial navigation tasks as well as perform path prediction and grid cell formation19,20. Two studies19,20 demonstrate that neural networks can learn to perform path prediction and that the trained networks generate firing patterns resembling those of grid cells in the entorhinal cortex. Other studies20,21,22 demonstrate navigation algorithms that require the environment’s map or that use firing patterns resembling place cells in the hippocampus. These studies allow an agent to access environmental coordinates explicitly19 or initialize a model with place cells that represent specific locations in an arena20. In machine learning and autonomous navigation, a variety of algorithms have been developed to perform mapping tasks, including simultaneous localization and mapping (SLAM) and monocular SLAM algorithms23,24,25,26, as well as neural network implementations27,28,29. Yet, SLAM algorithms contain many specialized inference strategies, such as visual feature and object detection, engineered specifically for map building, wayfinding and pose estimation based on visual information. Although extensive research in computer vision and machine learning uses video frames, these studies do not extract representations of the environment’s map30,31. A unified theoretical and mathematical framework for understanding the mapping of spaces based on sensory information remains incomplete.

Predictive coding has been proposed as a unifying theory of neural function in which the fundamental goal of a neural system is to predict future observations given past data32,33,34. When an agent explores a physical environment, temporal correlations in sensory observations reflect the structure of the physical environment: landmarks near one another in space will also be observed in temporal sequence. In this way, predicting observations in a temporal series of sensory observations requires an agent to internalize some implicit information about a spatial domain. Historically, Poincaré motivated the possibility of spatial mapping through a predictive coding strategy, where an agent assembles a global representation of an environment by gluing together information gathered through local exploration35,36. The exploratory paths together contain information that could, in principle, enable the assembly of a spatial map for both flat and curved manifolds. Indeed, extended Kalman filters25,37 for SLAM perform a form of predictive coding by directly mapping visual changes and movement to spatial changes. However, extended Kalman filters, as well as other SLAM approaches, require intricate strategies for landmark size calibration, image feature extraction and modelling of the camera’s distortion, whereas biological systems solve flexible mapping and navigation problems that engineered systems cannot. Yet, while the concept of predictive coding for spatial mapping is intuitively attractive, a major challenge is the development of algorithms that can glue together local sensory information gathered by an agent into a global, internally consistent environmental map. Connections between mapping and predictive coding in the literature have primarily focused on situations where an agent has explicit access to its spatial location as a state variable38,39,40. The problem of building spatial maps de novo from sensory data remains poorly understood.

Here we demonstrate that a neural network trained on a sensory predictive coding task can construct an implicit spatial map of an environment by assembling observations acquired along local exploratory paths into a global representation of a physical space within the network’s latent space. We analyse sensory predictive coding theoretically and demonstrate that solutions to the predictive sensory inference problem have a mathematical structure that can naturally be implemented by a neural network trained using backpropagation and comprising a ‘path encoder’, an internal spatial map and a ‘sensory decoder’. In such a paradigm, a network learns an internal map of its environment by inferring an internal geometric representation that supports predictive sensory inference. We implement sensory predictive coding within an agent that explores a virtual environment while performing visual predictive coding using a convolutional neural network with self-attention. Following network training during exploration, we find that the encoder network embeds images collected by an agent exploring an environment into an internal representation of space. Within the embedding, the distances between images reflect their relative spatial positions, not object-level similarity between images. During exploratory training, the network implicitly assembles information from local paths into a global representation of space as it performs a next-image inference problem. Fundamentally, we connect predictive coding and mapping tasks, demonstrating a computational and mathematical strategy for integrating information from local measurements into a global self-consistent environmental model.

Mathematical formulation of spatial mapping as sensory predictive coding

In this Article, we aim to understand how a spatial map can be assembled by an agent that is making sensory observations while exploring an environment. Papers in the literature that study connections between predictive coding and mapping have primarily focused on situations where an agent has access to its ‘state’ or location in the environment38,39,40. Here we develop a theoretical model and neural network implementation of sensory predictive coding that illustrates why and how an internal spatial map can emerge naturally as a solution to sensory inference problems. The neural network is a feedforward deep neural network trained using backpropagation, or gradient descent, rather than Helmholtz machines41,42, which are commonly used in predictive coding. We first formulate a theoretical model of visual predictive coding and demonstrate that the predictive coding problem can be solved by an inference procedure that constructs an implicit representation of an agent’s environment to predict future sensory observations. The theoretical analysis also suggests that the underlying inference problem can be solved by an encoder–decoder neural network that infers spatial position based upon observed image sequences.

We consider an agent exploring an environment \({{\varOmega }}\subset {{\mathbb{R}}}^{2}\) while acquiring visual information in the form of pixel-valued image vectors \({{I}}_{x}\in {{\mathbb{R}}}^{m\times n}\) for \(x\in {{\varOmega }}\). The agent’s environment Ω is a bounded subset of \({{\mathbb{R}}}^{2}\) that could contain obstructions and holes. In general, at any given time t, the agent’s state can be characterized by a position x(t) and orientation θ(t), where x(t) and θ(t) are coordinates within a global coordinate system unknown to the agent.

The agent’s environment comes equipped with a visual scene, and the agent makes observations by acquiring image vectors \({{I}}_{{x}_{k}}\in {{\mathbb{R}}}^{m\times n}\) as it moves along a sequence of points xk. At every position x and orientation θ, the agent acquires an image by effectively sampling from the conditional probability distribution P(I ∣ xk, θk), which encodes the probability of observing a specific image vector I when the agent is positioned at xk with orientation θk. The distribution P(I ∣ x, θ) has a deterministic and a stochastic component, where the deterministic component is set by landmarks in the environment while stochastic effects can emerge from changes in lighting, background and scene dynamics. Mathematically, we can view P(I ∣ x, θ) as a function on a vector bundle with base space Ω and total space Ω × I (ref. 43). The function assigns an observation probability to every possible image vector for an agent positioned at a point (x, θ). Intuitively, the agent’s observations preserve the geometric structure of the environment: the spatial structure influences temporal correlations.

In the predictive coding problem, the agent moves along a series of points (x0, θ0), (x1, θ1), …, (xk, θk) while acquiring images I0, I1, …, Ik. The motion of the agent in Ω is generated by a Markov process with transition probabilities P(xi+1, θi+1 ∣ xi, θi). Note that the agent has access to the image observations Ii but not the spatial coordinates (xi, θi). Given the set {I0, …, Ik}, the agent aims to predict Ik+1. Mathematically, the image prediction problem can be solved through statistical inference by (1) inferring the posterior probability distribution P(Ik+1 ∣ I0, I1, …, Ik) from observations. Then, (2) given a specific sequence of observed images {I0, …, Ik}, the agent can predict the next image Ik+1 by finding the image Ik+1 that maximizes the posterior probability distribution P(Ik+1 ∣ I0, I1, …, Ik).

The posterior probability distribution P(Ik+1 ∣ I0, I1, …, Ik) is by definition

$$P({I}_{k+1}\,|\, {I}_{0},{I}_{1},\,\ldots ,{I}_{k})=\frac{P({I}_{0},{I}_{1},\,\ldots ,{I}_{k},{I}_{k+1})}{P({I}_{0},{I}_{1},\,\ldots ,{I}_{k})}.$$

If we consider P(I0, I1, …, Ik, Ik+1) to be a function of an implicit set of spatial coordinates (xi, θi), where the (xi, θi) provide an internal representation of the spatial environment, then we can express the posterior probability P(Ik+1 ∣ I0, I1, …, Ik) in terms of the implicit spatial representation

$$\begin{array}{ll}P\left(I_{k+1} \mid I_0, I_1, \, \ldots, I_k\right) \\=\displaystyle\int_{\Omega} {\mathrm{d}}{x} \,{\mathrm{d}} \theta \,P\left(x_0, \theta_0, x_1, \theta_1, \, \ldots, x_k, \theta_k\right) \frac{P\left(I_0, I_1, \, \ldots, I_k \mid x_0, \theta_0, \, \ldots, x_k, \theta_k\right)}{P\left(I_0, I_1, \, \ldots, I_k\right)} \\ \\\qquad P\left(x_{k+1}, \theta_{k+1} \mid x_k, \theta_k\right) P\left(I_{k+1} \mid x_{k+1}, \theta_{k+1}\right) \\ =\displaystyle\int_{\Omega} {\mathrm{d}}{x} \, {\mathrm{d}} \theta \, \underbrace{P\left(x_0, \theta_0, x_1, \theta_1, \, \ldots, x_k, \theta_k \mid I_0, I_1, \, \ldots, I_k\right)}_{{\mathrm{encoding}}\; ({\mathrm{term }} \, 1) } \\\qquad\underbrace{P\left(x_{k+1}, \theta_{k+1} \mid x_k, \theta_k\right)}_{{\mathrm {spatial}}\; {\mathrm{transition}}\; {\mathrm{probability}}\; ({\mathrm{term}} \, 2) } \underbrace{P\left(I_{k+1} \mid x_{k+1}, \theta_{k+1}\right)}_{{\mathrm{decoding}}\; ({\mathrm{term}} \, 3) } \\\end{array}$$
(1)

where in equation (1) the integration is over all possible paths {(x0, θ0), …, (xk, θk)} in the domain Ω, for differentials dx = dx0, …, dxk and dθ = dθ0, …, dθk. Equation (1) can be interpreted as a path integral over the domain Ω. The path integral assigns a probability to every possible path in the domain and then computes the probability that the agent will observe a next image Ik+1 given an inferred location (xk+1, θk+1). In detail, term 1 assigns a probability to every discrete path {(x0, θ0), …, (xk, θk)} ⊂ Ω as the conditional likelihood of the path given the observed sequence of images {I0, …, Ik}. Term 2 computes the probability that an agent at a terminal position xk moves to the position (xk+1, θk+1), given the Markov transition function P(xk+1, θk+1 ∣ xk, θk). Term 3 is the conditional probability that image Ik+1 is observed, given that the agent is at position (xk+1, θk+1).

Conceptually, the product of terms solves the next-image prediction problem in three steps. First, estimating the probability that an agent has traversed a particular sequence of points given the observed images; second, estimating the next position of the agent (xk+1, θk+1) for each potential path; and third, computing the probability of observing a next image Ik+1 given the inferred terminal location xk+1 of the agent. Critically, an algorithm that implements the inference procedure encoded in the equation would construct an internal but implicit representation of the environment as a coordinate system x, θ that is learned by the agent and used during the next-image inference procedure. The coordinate system provides an internal, inferred representation of the agent’s environment that is used to estimate future image observation probabilities. Thus, our theoretical framework demonstrates how an agent might construct an implicit representation of its spatial environment by solving the predictive coding problem.

The three-step inference procedure represented in the equation for P(Ik+1 ∣ I0, I1, …, Ik) can be directly implemented in a neural network architecture, as demonstrated in the Supplementary Information. The first term acts as an ‘encoder’ network that computes the probability that the agent has traversed a path {(x0, θ0), …, (xk, θk)} given the image sequence I0, …, Ik observed by the network (Fig. 1b). The network can then estimate the next position (xk+1, θk+1) of the agent given an inferred location (xk, θk) and apply a decoding network to compute P(Ik+1 ∣ xk+1, θk+1), outputting the predicted image Ik+1 through a decoder. A network trained through visual experience must learn an internal coordinate system and representation x, θ that not only offers an environmental representation but also establishes a connection between observed images Ij and inferred locations (xj, θj).

Fig. 1: A predictive coding neural network explores a virtual environment.
figure 1

In predictive coding, a model predicts observations and updates its parameters using the prediction error. a, An agent traverses its environment by taking the most direct path to random positions. b, A self-attention-based encoder–decoder neural network architecture learns to perform predictive coding. A ResNet-18 convolutional neural network acts as an encoder; self-attention is performed with eight heads, and a corresponding ResNet-18 convolutional neural network performs decoding to the predicted image. c, The neural network learns to perform predictive coding effectively, achieving a mean-squared error of 0.094 between the actual and predicted images. Conv., convolution; concat., concatenation; norm., normalization.

A neural network performs predictive coding

Motivated by the implicit representation of space contained in the predictive coding inference problem, we developed a computational implementation of a predictive coding agent and studied the representation of space learned by that agent as it explored a virtual environment. We first create an environment in Minecraft using the Malmo framework44. The physical environment measures 40 × 65 lattice units and encapsulates three aspects of visual scenes: a cave provides a global visual landmark, a forest provides degeneracy between visual scenes, and a river with a bridge constrains how an agent traverses the environment (Fig. 1a). An agent follows paths (Supplementary Fig. 5b,c) determined by A* search, which finds the shortest path between randomly sampled positions, and receives visual images along every path.

To perform predictive coding, we construct an encoder–decoder convolutional neural network with a ResNet-18 architecture45 for the encoder and a corresponding ResNet-18 architecture with transposed convolutions for the decoder (Fig. 1b). The encoder–decoder architecture uses the U-Net architecture46 to pass the encoded latent units into the decoder. Multi-headed attention47 processes the sequence of encoded latent units to encode the history of past visual observations. The multi-headed attention has h = 8 heads; for encoded latent units of dimension D = C × H × W, the dimension of a single head is d = (C × H × W)/h for height H, width W and channels C.
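As a concrete illustration, the following is a minimal PyTorch sketch of such an architecture, assuming torchvision’s ResNet-18 as the encoder backbone. The sequence length, latent dimension, input resolution and the decoder’s transposed-convolution stack are illustrative assumptions, and the U-Net skip connections described above are omitted for brevity.

```python
# Minimal sketch of a predictive-coder architecture (not the authors' exact
# model): a ResNet-18 encoder, 8-head self-attention over the latent
# sequence, and a transposed-convolution decoder that emits the next image.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PredictiveCoder(nn.Module):
    def __init__(self, embed_dim=128, n_heads=8):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # conv stack + avgpool
        self.project = nn.Linear(512, embed_dim)
        self.attention = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.decoder = nn.Sequential(  # upsample the 1x1 latent to a 32x32 image
            nn.ConvTranspose2d(embed_dim, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1))    # (B*T, 512, 1, 1)
        z = self.project(z.flatten(1)).view(B, T, -1)
        z, _ = self.attention(z, z, z)            # integrate the observation history
        last = z[:, -1].view(B, -1, 1, 1)         # latent state after the last frame
        return self.decoder(last)                 # predicted next image, (B, 3, 32, 32)
```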

The predictive coder approximates predictive coding by minimizing the mean-squared error between the actual observation and its predicted observation. The predictive coder trains on 82,630 samples for 200 epochs using stochastic gradient descent with Nesterov momentum48, a weight decay of 5 × 10−6 and a learning rate of 10−1 adjusted by OneCycle learning-rate scheduling49. The optimized predictive coder achieves a mean-squared error of 0.094 between the predicted and actual images, with good visual fidelity (Fig. 1c).
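The training configuration described above can be sketched as follows; the batch size and momentum value are assumptions, and a single illustrative step runs on dummy data in place of the Malmo image sequences.

```python
# Sketch of the training setup: SGD with Nesterov momentum, weight decay
# 5e-6 and a OneCycle schedule peaking at lr = 0.1, minimizing the MSE
# between the predicted and actual next image.
import torch

model = PredictiveCoder()                          # from the sketch above
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      nesterov=True, weight_decay=5e-6)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.1, epochs=200, steps_per_epoch=82630 // 64)
loss_fn = torch.nn.MSELoss()

frames = torch.rand(4, 10, 3, 32, 32)              # dummy observation sequences
next_frame = torch.rand(4, 3, 32, 32)              # dummy next-image targets
opt.zero_grad()
loss = loss_fn(model(frames), next_frame)
loss.backward()
opt.step()
sched.step()                                       # OneCycle advances per batch
```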

Predictive coding network constructs an implicit spatial map

We show that the predictive coder creates an implicit spatial map by demonstrating that it recovers the environment’s spatial positions and distances. We encode the image sequences using the predictive coder’s encoder and analyse the encoded sequences as the predictive coder’s latent units. To measure the positional information in the predictive coder, we train a neural network to predict the agent’s position from the predictive coder’s latent units (Fig. 2a). The neural network’s prediction error

$$E(x,\hat{x})={\left\Vert \hat{x}-x\right\Vert }_{{\ell }_{2}}$$

indirectly measures the predictive coder’s positional information. To provide a comparative baseline that lower-bounds the prediction error, we construct a model that outputs the agent’s actual position with small additive Gaussian noise:

$$\hat{x}=x+\epsilon ,\epsilon \sim {{{\mathcal{N}}}}(0,\sigma ).$$

where ε ∼ 𝒩(0, σ) denotes noise drawn from a Gaussian distribution with zero mean and standard deviation σ. To compare the predictive coder with the baseline, we compare the prediction error histograms (Fig. 2b).

Fig. 2: Predictive coding neural network constructs an implicit spatial map.
figure 2

a, The predictive coder’s latent space encodes accurate spatial positions. A neural network predicts the spatial location from the predictive coder’s latent space. A heatmap of the prediction errors between the actual positions and the predictive coder’s predicted positions shows a low prediction error. b, The histogram of prediction errors of positions from the predictive coder’s latent space shows a low prediction error. As a baseline (Noise model (σ = 1 lattice unit)), actual positions with a small noise displacement give an error model. c, The predictive coder’s latent distances recover the environment’s spatial metric. Sequential visual images are mapped to the neural network’s latent space, and the latent space distances (ℓ2) are plotted against physical distances in a joint density plot. A nonlinear regression model \(\left\Vert z-z^{\prime} \right\Vert =\alpha \log \left\Vert x-x^{\prime} \right\Vert +\beta\) is shown as a baseline. d, A correlation plot and a quantile–quantile plot show the overlap between the empirical and model distributions.

The predictive coder encodes the environment’s spatial position with a low prediction error (Fig. 2b). The predictive coder has a mean error of 5.04 lattice units, and >80% of samples have an error <7.3 lattice units. The additive Gaussian model with σ = 4 has a mean error of 4.98 lattice units, and >80% of samples have an error <7.12 lattice units.

We show that the predictive coder’s latent space recovers the local distances between the environment’s physical positions. For every path that the agent traverses, we calculate the local pairwise distances in physical space and in the predictive coder’s latent space within a neighbourhood of 100 time points. To determine whether latent space distances correspond to physical distances, we calculate the joint density between latent space distances and physical distances (Fig. 2c). We model the latent distances by fitting the physical distances, with additive Gaussian noise, to a logarithmic function:

$$d(z,z^{\prime} )=\alpha \log (\left\Vert x-x^{\prime} +\epsilon \right\Vert )+\beta ,\epsilon \sim {{{\mathcal{N}}}}(0,\sigma ).$$

The modelled distribution closely matches the predictive coder’s distribution (Fig. 2d), with a Pearson correlation coefficient of 0.827 and a Kullback–Leibler divergence \(({{\mathbb{D}}}_{{{{\rm{KL}}}}}(\,{p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}}))\) of 0.429 bits.
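The distance analysis can be sketched as follows, with synthetic positions and latent vectors standing in for the real trajectories; the window size follows the 100-time-point neighbourhood described above.

```python
# Sketch of the latent-vs-physical distance analysis: compute pairwise
# distances within a sliding window, fit ||z - z'|| = a*log||x - x'|| + b,
# and score the fit with a Pearson correlation.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=(500, 2))     # physical positions (stand-in)
z = rng.normal(size=(500, 128))           # latent vectors (stand-in)

window = 100                              # neighbourhood of 100 time points
pairs = [(i, j) for i in range(len(x))
         for j in range(i + 1, min(i + window, len(x)))]
dx = np.array([np.linalg.norm(x[i] - x[j]) for i, j in pairs])
dz = np.array([np.linalg.norm(z[i] - z[j]) for i, j in pairs])

def logfit(d, a, b):                      # the regression model from the text
    return a * np.log(d) + b

mask = dx > 0                             # log is undefined at zero distance
(a, b), _ = curve_fit(logfit, dx[mask], dz[mask])
r, _ = pearsonr(dz[mask], logfit(dx[mask], a, b))
print(f"fit: a={a:.3f}, b={b:.3f}, Pearson r={r:.3f}")
```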

Predictive coding network learns spatial proximity, not image similarity

In the previous section, we showed that a neural network that performs predictive coding learns an internal representation of its physical environment within its latent space. Here we demonstrate that the prediction task itself is essential for spatial mapping: prediction forces a network to learn spatial proximity and not merely image similarity. Many frameworks, including principal components analysis, IsoMap50 and auto-encoder neural networks, can collocate images by visual similarity. While similar scenes might be proximate in space, they can also be spatially distant. For example, the virtual environment we constructed has two different ‘forest’ regions that are separated by a lake. The two forest regions might generate similar images but are actually each closer to the lake region than to one another (Fig. 1a).

To demonstrate the central role of prediction in mapping, we compared the latent representation of images generated by the predictive coding network to a representation learned by an auto-encoder. The auto-encoder network has a similar architecture to the predictive coder but encodes a single image observation in a latent space and decodes the same observation. As the auto-encoder operates on a single image rather than a sequence, it learns an embedding based on image similarity, not underlying spatial relationships. As with the predictive coder, the auto-encoder (Fig. 3a) trains to minimize the mean-squared error between the actual and reconstructed images on 82,630 samples for 200 epochs using stochastic gradient descent with Nesterov momentum, a weight decay of 5 × 10−6 and a learning rate of 10−1 adjusted by the OneCycle learning-rate scheduler. The auto-encoder achieves a mean-squared error of 0.039, with high visual fidelity.

Fig. 3: Predictive coding network learns spatial proximity, not image similarity.
figure 3

a, An auto-encoding neural network compresses visual images into a low-dimensional latent vector and reconstructs the image from the latent space. The auto-encoder trains on visual images from the environment without any sequential order. b,c, Auto-encoding encodes positional information at lower resolution. A neural network predicts the spatial location from the auto-encoder’s latent space (b). A heatmap of the prediction errors between the actual positions and the auto-encoder’s predicted positions shows a higher prediction error compared to the predictive coder. Auto-encoding captures less positional information than predictive coding (c). The histogram shows the prediction errors of positions from the latent space of both the auto-encoder and the predictive coder. d, Latent distances, however, show a weaker relationship with physical distances, as the joint histogram between physical and latent distances is less concentrated. e, A correlation plot and a quantile–quantile plot show a lower correlation and a lower density overlap between the empirical and model distributions. f, Predictive coding’s latent units communicate more fine-grained spatial distances, whereas auto-encoding communicates broad spatial regions. Joint density plots show the association between latent distances and physical distances for both predictive coding and auto-encoding. Predictive coding’s latent distances increase with spatial distances, with a higher concentration compared to auto-encoding.

The predictive coder encodes a higher-resolution and more accurate spatial map in its latent space than the auto-encoder. As with the predictive coder, we train an auxiliary neural network to predict the agent’s position from the auto-encoder’s latent units (Fig. 3b). The neural network’s prediction error indirectly measures the auto-encoder’s positional information. For >80% of the auto-encoder’s samples, the prediction error is less than 13.1 lattice units, compared with the predictive coder, for which >80% of samples have a prediction error below 7.3 lattice units (Fig. 3c).

We also show that the predictive coder recovers the environment’s spatial distances with finer resolution than the auto-encoder. As with the predictive coder, we calculate the local pairwise distances in physical space and in the auto-encoder’s latent space, and we generate the joint density between the physical and latent distances (Fig. 3d). As with the predictive coder, the auto-encoder’s latent distances increase with the agent’s physical distance. However, the auto-encoder’s joint density shows a larger dispersion than the predictive coder’s joint density, indicating that the auto-encoder encodes spatial distances with higher uncertainty.

We can quantitatively measure the dispersion in the auto-encoder’s joint density by calculating the mutual information of the joint density (Fig. 3e)

$$I[X;Z]={{\mathbb{E}}}_{p(X,Z)}\left[\log \frac{p(X,Z)}{p(X)p(Z)}\right].$$

The auto-encoder has a mutual information of 0.227 bits, while the predictive coder has a mutual information of 0.627 bits. As a comparison, positions with additive Gaussian noise with a standard deviation σ of 2 lattice units have a mutual information of 0.911 bits. The predictive coder therefore encodes 0.400 additional bits of distance information relative to the auto-encoder; this additional information alone exceeds the auto-encoder’s total of 0.227 bits, indicating that the temporal dependencies encoded by the predictive coder capture more spatial information than visual similarity does.
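A sketch of the mutual-information estimate used for these comparisons, computed from a binned joint density; the bin count and the synthetic log-noise relationship are assumptions.

```python
# Estimate I[X; Z] in bits from a 2D histogram of physical distances (X)
# and latent distances (Z).
import numpy as np

def mutual_information(dx, dz, bins=30):
    joint, _, _ = np.histogram2d(dx, dz, bins=bins)
    p = joint / joint.sum()                              # joint density p(x, z)
    px = p.sum(axis=1, keepdims=True)                    # marginal p(x)
    pz = p.sum(axis=0, keepdims=True)                    # marginal p(z)
    nz = p > 0                                           # skip log(0) terms
    return float((p[nz] * np.log2(p[nz] / (px @ pz)[nz])).sum())

rng = np.random.default_rng(1)
dx = rng.uniform(0.1, 20, 10000)                         # synthetic distances
dz = np.log(dx) + rng.normal(0, 0.5, 10000)              # noisy log relationship
print(f"I[X; Z] = {mutual_information(dx, dz):.3f} bits")
```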

Predictive coding network maps visually degenerate environments, whereas auto-encoding cannot

The sequential prediction task is beneficial for spatial mapping: the predictive coder captures more accurate spatial information than the auto-encoder, and the predictive coder’s latent distances correspond more closely to the environment’s metric. However, it is unclear whether predictive coding is necessary (as opposed to merely beneficial) for recovering an environment’s map; an auto-encoder might still recover the map. In this section, we demonstrate that predictive coding is necessary for recovering an environment’s map. First, we show empirically that there exist environments whose maps auto-encoding cannot recover. Second, we provide insight into why the auto-encoder fails with a theorem showing that auto-encoding cannot recover any environment that contains visually identical yet spatially different locations.

In the previous sections, the agent explores a natural environment with forest, river and cave landmarks. While this environment models exploration in outdoor environments, the lack of controlled visual scenes complicates interpreting the operation of the predictive coder and the auto-encoder. We therefore introduce a circular corridor (Fig. 4a) containing visual scenes that are visually identical, rather than merely similar, yet spatially different. Specifically, the rooms appear clockwise as red, green, red, blue and yellow; there are two distinct red rooms. The two distinct red rooms permit answering two questions: (1) can the predictive coder and auto-encoder recover the map of an environment with visual symmetry? (2) Does the predictive coder recover a global map or a relative map? In other words, does the predictive coder recover the circular corridor’s geometry, or does it learn a linear hallway?

Fig. 4: Predictive coding network can learn a circular topology and distinguishes visually identical, spatially different locations.
figure 4

a, An agent traverses a circular environment with two visually identical red rooms; this provides visually similar yet spatially different locations. b, The predictive coder’s latent distances show a correspondence with the circular environment’s metric, while the auto-encoder’s latent distances show little correlation. c, Similar to Figs. 2 and 3, a different neural network measures the predictive coder’s spatial information by predicting the agent’s location from the predictive coder’s latent space. The predictive coder’s latent space demonstrates a low prediction error. d, Similar to Figs. 2 and 3, the nonlinear regression measures the correspondence between the latent distances \(\parallel z-z^{\prime}\parallel\) and the actual distances \(\parallel {x}-{x}^{\prime}\parallel\) with the model \(\parallel{z-z}^{\prime}\parallel=\alpha\log\parallel{x-x}^{\prime}\parallel+\beta .\) The correlation plot (left) with the nonlinear regression model shows a strong correlation between the predictive coder’s latent distances and the environment’s actual distances (r = 0.827). The quantile–quantile plot (right) between the predictive coder’s latent distances and the regression model shows high overlap (\({{\mathbb{D}}}_{{{{\rm{KL}}}}}({p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}})=0.250\)). e, Without any past information, the auto-encoder cannot distinguish the two different red rooms and produces a high prediction error in these locations. f, The correlation plot (left) with the nonlinear regression model shows little correlation between the auto-encoder’s latent distances and the environment’s actual distances (r = 0.288). The quantile–quantile plot (right) between the auto-encoder’s latent distances and the regression model shows little overlap (\({{\mathbb{D}}}_{{{{\rm{KL}}}}}({p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}})=3.806\)).

Similar to previous sections, we train a neural network (a predictive coder) to perform predictive coding while traversing the circular corridor. In addition, we train a neural network (an auto-encoder) to perform auto-encoding. The auto-encoder fails to recover spatial information in areas with visual degeneracy: it maps the two distinct red rooms to the same location (Fig. 4e). In Fig. 4e, the auto-encoder maps images from the left red room to locations in the right red room, whereas locations with distinct visual scenes (such as the yellow and blue rooms) show a low prediction error (mean error \({\left\Vert x-\hat{x}\right\Vert }_{{\ell }_{2}}=5.004\) lattice units). In addition, the auto-encoder’s latent distances do not separate the two red rooms in latent space, whereas the predictive coder separates them (Fig. 4b). Moreover, the predictive coder demonstrates a low prediction error throughout the environment, including the two visually degenerate red rooms (mean error \({\left\Vert x-\hat{x}\right\Vert }_{{\ell }_{2}}=0.071\) lattice units) (Fig. 4c).

Moreover, we measure the relationship between the predictive coder’s (and auto-encoder’s) metric and the environment’s metric by fitting a regression model (Fig. 4b),

$$\left\Vert z-z^{\prime} \right\Vert =\alpha \log \left\Vert x-x^{\prime} \right\Vert +\beta,$$

between the predictive coder’s (and auto-encoder’s) latent distances (\(\left\Vert z-z^{\prime} \right\Vert\)) and the environment’s physical distances (\(\left\Vert x-x^{\prime} \right\Vert\)). Compared to the natural environment, the auto-encoder’s latent distances deviate more from the environment’s spatial distances, whereas the predictive coder’s latent distances maintain a correspondence with spatial distances. For the predictive coder, the latent metric quantitatively recovers the spatial metric: the correlation plot (Fig. 4d, left) shows a high correlation (r = 0.827) between the latent and spatial distances, and the quantile–quantile plot (Fig. 4d, right) shows a high overlap between the regression model and the observed latent distances (\({{\mathbb{D}}}_{{{{\rm{KL}}}}}({p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}})=0.250\)). The auto-encoder’s latent metric, conversely, does not recover the spatial metric: the correlation plot (Fig. 4f, left) shows a low correlation (r = 0.288) between the latent and spatial distances, and the quantile–quantile plot (Fig. 4f, right) shows a low overlap between the regression model and the observed latent distances (\({{\mathbb{D}}}_{{{{\rm{KL}}}}}({p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}})=3.806\)).

As shown in Fig. 4, the auto-encoder cannot recover the spatial map of the circular corridor, whereas the predictive coder can. Here we show that auto-encoders cannot recover the environment’s map for any environment with visual degeneracy, not just the circular corridor. To show that the auto-encoder cannot learn the environment’s map, we show that no statistical estimator can learn the environment’s map from stationary observations. For clarity and brevity, we provide a proof sketch on a lattice environment X, a closed subset of \({{\mathbb{Z}}}^{2}\).

Theorem 1

Consider an environment X, a closed subset of the lattice \({{\mathbb{Z}}}^{2}\), with a function \(x \,\stackrel{f}{\mapsto} \,I\) that gives an image \({I}_{x}=f(x)\in {{\mathbb{R}}}^{D}\) for the image dimension D and for each position \(x\in X\). Let the environment’s observations be degenerate such that

$$f({x}_{1})=f({x}_{2})\,\,{{{\rm{for}}}}\,{{{\rm{some}}}}\,\,{x}_{1}\ne {x}_{2}.$$

There exists no decoder \(I\,\stackrel{d}{\mapsto}\,x\) that satisfies

$$x=d\circ {I}_{x}=d\circ f(x)\;{\rm{where}}\; d \circ f(x) \triangleq d(f(x)).$$

Proof

The proof follows from the fact that a function has a left inverse if and only if it is one-to-one. Suppose there exists a decoder \(I\,\stackrel{d}{\mapsto}\,x\) that satisfies

$$x=d\circ {I}_{x}=d\circ f(x).$$

Consider

$$f({x}_{1})=f({x}_{2})\,\,{{{\rm{for}}}}\,{{{\rm{some}}}}\,\,{x}_{1}\ne {x}_{2}.$$

Then,

$${x}_{1}=d\circ f({x}_{1})=d(I\,)=d\circ f({x}_{2})={x}_{2},$$

which is a contradiction, as required.

Because Theorem 1 demonstrates there exists no decoder for a visually degenerate environment with stationary observations, an auto-encoder cannot recover a visually degenerate environment; the auto-encoder’s failure arises because two locations with the same observation cannot be discriminated.

Corollary 1

Consider an auto-encoder \(g={\mathrm{dec}}\circ {\mathrm{enc}}\) with an encoder \(I \,\stackrel{{\mathrm{enc}}}{\mapsto}\, z\) and decoder \(z \,\stackrel{{\mathrm{dec}}}{\mapsto}\, I\) that compresses images into a latent space \(z\in {{\mathbb{R}}}^{L}\) for the latent dimension L. There exists no decoder \(z \,\stackrel{h}{\mapsto}\, x\) that satisfies

$$x=h\circ {z}_{x}=h\circ {{{\rm{enc}}}}\circ f(x).$$

Proof

Consider the decoder \(d=h\circ {\mathrm{enc}}:I\to x\). By Theorem 1, this decoder cannot satisfy

$$x=d\circ f(x)=h\circ {{{\rm{enc}}}}\circ f(x),$$

as required.
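A toy numerical illustration of the theorem and corollary, with hypothetical room positions: when two positions emit the same observation, any decoder must return a single answer for both, so its error is bounded away from zero.

```python
# Toy illustration of Theorem 1: two "red rooms" at different positions emit
# an identical image, so any decoder d assigns both the same position. The
# best constant guess (the conditional mean) still errs at both rooms.
x1, x2 = 5.0, 25.0                  # hypothetical 1D positions of the red rooms
observation = "red"                 # identical image observed at both positions
d = {"red": (x1 + x2) / 2}          # MSE-optimal single answer: the midpoint
error = abs(d[observation] - x1)    # equals abs(d[observation] - x2) = 10.0
print(f"decoder returns {d[observation]}, erring by {error} at both rooms")
```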

Predictive coding generates units with localized receptive fields that support vector navigation

In the previous section, we demonstrated that the predictive coding neural network captures spatial relationships within an environment, and that its representation contains more internal spatial information than can be captured by an auto-encoder network that encodes image similarity. Here we analyse the structure of the spatial code learned by the predictive coding network. We demonstrate that each unit in the neural network’s latent space activates at distinct, localized regions of the environment’s physical space, akin to place fields in the mammalian brain (Fig. 5a). These place fields overlap, and their aggregate covers the entire physical space. Each physical location is represented by a unique combination of overlapping regions encoded by the latent units, and this combination recovers the agent’s current physical position. Furthermore, any two physical locations correspond to two distinct combinations of overlapping regions in latent space. Vector navigation is the representation of the vector heading to a goal location from a current location51. We show that overlapping regions (or place fields) can give a heading from a current location to a goal location: a linear decoder recovers the vector to a goal location from a starting location by taking the difference in place fields, which supports vector navigation (Supplementary Fig. 1). Other studies51 traditionally consider grid-cell-supported vector navigation, whereas we consider vector navigation using place cells only.

Fig. 5: The predictive coding network generates place fields that support vector-based distance calculations.
figure 5

a, When encoding past images for predictive coding, the self-attention module generates latent vectors. Each continuous unit in these latent vectors activates in concentrated, localized regions in physical space. These continuous units can be thresholded to generate a binary vector determining whether each unit is active. Each latent unit covers a unique region, and each physical location gives a unique combination of these overlapping regions. As an agent moves away from its original location, the combination of overlapping regions gradually deviates from its original combination. This deviation, as measured by Hamming distance, correlates with physical distance. b, Distance is given by the difference in the latent units’ overlapping regions. Two nearby locations have small deviations in overlap (right), while two distant locations have large deviations (middle). c, Latent units are spatially organized into localized regions. The active latent units are approximated by a two-dimensional Gaussian distribution (bottom) to measure each latent unit’s localization (top). The latent units’ Gaussian approximations are highly localized, with a mean area of 254.6 lattice units for densities P ≥ 0.0005. d, Latent units are distributed across the environment. The number of active latent units was calculated at each lattice block in the environment (left), and the number of lattice blocks was calculated for each active unit (right). The latent units provide a unique combination for 87.6% of the environment, and their aggregate covers the entire environment. e, Distance from the region overlap captures most of the predictive coder’s spatial information. We calculate the distance for every pair of active latent vectors and their respective physical Euclidean distances as a joint distribution. The proposed mechanism captures a majority of the predictive coder’s spatial information, as the proposed mechanism’s mutual information (0.542 bits) compares to the predictive coder’s mutual information (0.627 bits).

To support this proposed mechanism, we first demonstrate that the neural network generates place fields; in other words, units from the neural network’s latent space produce localized regions in physical space. To determine whether a latent unit is active, we threshold its continuous value at its 90th-percentile value. The agent’s head direction varies during data collection, which lets us verify that the regions are stable across all head directions. To measure a latent unit’s localization in physical space, we fit each latent unit’s distribution, with respect to physical space, to a two-dimensional Gaussian distribution (Fig. 5c, top), defined by

$$P(x)=\frac{1}{2\uppi \sqrt{| {{\Sigma }}|} }\exp \left[-\frac{1}{2}{(x-\mu )}^{T}{{{\Sigma }}}^{-1}(x-\mu )\right]$$

for the covariance matrix Σ and the mean vector μ. We measure the area of the ellipsoid given by the Gaussian approximation where P ≥ 0.0005 (Fig. 5c, bottom). The area of a latent unit’s approximation quantifies how localized the unit is relative to the environment’s total area of 40 × 65 = 2,600 lattice units. The latent unit approximations have a mean area of 254.6 lattice units (9.79% of the environment), and 80% of the areas are <352.6 lattice units (13.6% of the environment).
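The localization measurement can be sketched as follows, with a synthetic localized unit standing in for a real latent unit; the ellipse-area formula follows from the Gaussian defined above, since P(x) ≥ p0 bounds the Mahalanobis distance.

```python
# Sketch of the place-field analysis: threshold a latent unit at its 90th
# percentile, fit a 2D Gaussian to the active positions, and compute the
# area of the P >= 0.0005 ellipse.
import numpy as np

rng = np.random.default_rng(2)
pos = np.column_stack([rng.uniform(0, 40, 20000),
                       rng.uniform(0, 65, 20000)])          # visited positions
unit = np.exp(-np.sum((pos - [12, 30]) ** 2, axis=1) / 50)  # a localized unit

active = unit >= np.quantile(unit, 0.9)    # 90th-percentile threshold
mu = pos[active].mean(axis=0)
cov = np.cov(pos[active].T)

# P(x) >= p0 is the ellipse (x-mu)^T Cov^-1 (x-mu) <= c with
# c = -2*ln(2*pi*sqrt(|Cov|)*p0); its area is pi*c*sqrt(|Cov|).
p0 = 0.0005
det = np.linalg.det(cov)
c = -2 * np.log(2 * np.pi * np.sqrt(det) * p0)
area = np.pi * c * np.sqrt(det)
print(f"field centre {mu.round(1)}, ellipse area {area:.1f} lattice units")
```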

The units in the neural network’s latent space provide a unique combinatorial code for each spatial position, and the aggregate of latent units covers the environment’s entire physical space. At each lattice block in the environment, we calculate the number of active latent units (Fig. 5d, left). The combination of active latent units is unique in 87.6% of the lattice blocks. Every lattice block has at least one active latent unit, which indicates that the aggregate of the latent units covers the environment’s physical space. Moreover, to ensure the regions remain stable across shifting landmarks, the environment’s trees were removed and randomly redistributed in the environment (Supplementary Fig. 5a,b). The regions remain stable after changing the tree landmarks, with a Jaccard index \(|{S}_{{\mathrm{new}}}\cap {S}_{{\mathrm{old}}}|/|{S}_{{\mathrm{new}}}\cup {S}_{{\mathrm{old}}}|\) (the intersection over union of the new regions Snew and old regions Sold) of 0.828.

Lastly, we demonstrate that the neural network can measure physical distances and could perform vector navigation—representing the vector heading from a current location to a goal location—by comparing the combinations of overlapping regions in its latent space. We first determine the active latent units by thresholding each continuous value by its 90th-percentile value. At each position, we have a 128-dimensional binary vector that gives the overlap of 128 latent units. We take the bitwise difference z1 − z2 between the overlapping codes z1 and z2 at two varying positions x1 and x2 with the vector displacement x1 − x2 (Supplementary Fig. 1a). We then fit a linear decoder from the code z1 − z2 to the vector displacement x1 − x2,

$${x}_{1}-{x}_{2}=W[{z}_{1}-{z}_{2}]+b,$$

for weight W and bias b. The predicted distance error \(\left\Vert r-\hat{r}\right\Vert\) and the predicted direction error \(\Vert \theta -\hat{\theta }\Vert\) are decomposed from the predicted displacement \({\hat{x}}_{1}-{\hat{x}}_{2}\). The linear decoder has a low prediction error for distance (80% of samples <12.49 lattice units; mean 7.89 lattice units) and direction (80% of samples <48.04°; mean 30.6°) (Supplementary Fig. 1b,c). The code z1 − z2 is highly correlated with direction θ and distance r, with Pearson correlation coefficients of 0.924 and 0.718, respectively (Supplementary Fig. 1d).
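The vector-navigation readout can be sketched with ordinary least squares, using synthetic binary place-field codes in place of the network’s 128 thresholded latent units.

```python
# Sketch of the linear decoder: map the difference of binary place-field
# codes, z1 - z2, to the displacement x1 - x2 via least squares.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 40, size=(5000, 2))            # positions
centers = rng.uniform(0, 40, size=(128, 2))       # one synthetic field per unit
fields = np.exp(-((x[:, None, :] - centers) ** 2).sum(-1) / 60)
z = (fields >= np.quantile(fields, 0.9, axis=0)).astype(float)  # binary codes

i = rng.integers(0, 5000, 20000)                  # random position pairs
j = rng.integers(0, 5000, 20000)
dz = np.column_stack([z[i] - z[j], np.ones(20000)])  # code difference + bias
dx = x[i] - x[j]                                  # true displacements
W, *_ = np.linalg.lstsq(dz, dx, rcond=None)       # fit x1 - x2 = W[z1 - z2] + b
err = np.linalg.norm(dz @ W - dx, axis=1)
print(f"mean displacement error: {err.mean():.2f} lattice units")
```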

We can measure the correspondence between the bitwise distance z1 − z2 and the physical distance \({\left\Vert {x}_{1}-{x}_{2}\right\Vert }_{{\ell }_{2}}\), using the Euclidean distance \(\Vert x\Vert _{{\ell }_{2}}=\sqrt{\mathop{\sum }_{i=1}^{D}{x}_{i}^{2}}\) for dimension D. For the bitwise distance, we threshold the latent units at their 90th percentile and then compute the ℓ1-norm (\(\Vert x\Vert _{{\ell }_{1}}=\mathop{\sum }_{i=1}^{D}| {x}_{i}|\) for dimension D) between the units. Similar to the previous sections, we compute the joint densities of the binary vectors’ bitwise distances and the physical positions’ Euclidean distances. We then calculate their mutual information to measure how much spatial information the bitwise distance captures. The proposed mechanism for the neural network’s distance measurement, the binary vectors’ Hamming distance, gives a mutual information of 0.542 bits, compared to the predictive coder’s mutual information of 0.627 bits and the auto-encoder’s mutual information of 0.227 bits (Fig. 5e). The code from the overlapping regions thus captures the majority of the predictive coder’s spatial information.

Discussion

Mapping is a general mechanism for generating an internal representation of sensory information. While spatial maps facilitate navigation and planning within an environment, mapping is a ubiquitous neural function that extends to representations beyond visual–spatial mapping. The primary sensory cortex, for example, maps tactile events topographically. Physical touches that occur in proximity are mapped in proximity for both the neural representations and the anatomical brain regions52. Similarly, the cortex maps natural speech by tiling regions with different words and their relationships, which shows that topographic maps in the brain extend to higher-order cognition. The similar representation of non-spatial and spatial maps in the brain suggests a common mechanism for charting cognitive maps53. However, it is unclear how a single mechanism can generate both spatial and non-spatial maps.

Here we show that predictive coding provides a basic, general mechanism for charting spatial maps by predicting sensory data from past sensory experience, including in environments with degenerate observations. Our theoretical framework applies to any vector-valued sensory data and could be extended to auditory data, tactile data or tokenized representations of language. We demonstrate that a neural network performing predictive coding can construct an implicit spatial map of an environment by assembling information from local paths into a global frame within the neural network’s latent space. The implicit spatial map depends specifically on the sequential task of predicting future visual images: neural networks trained as auto-encoders do not reconstruct a faithful geometric representation in the presence of physically distant yet visually similar landmarks.

Moreover, we study the predictive coding neural network’s representation in latent space. Each unit in the network’s latent space activates at distinct, localized regions—called place fields—with respect to physical space. At each physical location, there exists a unique combination of overlapping place fields. At two locations, the differences in the combinations of overlapping place fields provide the distance between the two physical locations. The existence of place fields in both the neural network and the hippocampus16 suggests that predictive coding is a universal mechanism for mapping. In addition, vector navigation emerges naturally from predictive coding by computing distances from overlapping place field units. Predictive coding may provide a model for understanding how place cells emerge, change and function.

Predictive coding can be performed over any sensory modality that has some temporal sequence. As natural speech forms a cognitive map, predictive coding may underlie the geometry of human language. Intriguingly, large language models trained on causal word prediction, a form of predictive coding, build internal maps that support generalized reasoning, answer questions and mimic other forms of higher-order reasoning54. Similarities in spatial and non-spatial maps in the brain suggest that large language models organize language into a cognitive map and chart concepts geometrically. These results all suggest that predictive coding might provide a unified theory for building representations of information, connecting disparate theories including place cell formation in the hippocampus, somatosensory maps in the cortex and human language.

Methods

Environment simulation

Forest–cave–river environment

These experiments leverage the Malmo framework44 to construct a controlled environment within Minecraft. This environment is a rectangular space measuring 40 by 65 lattice units and incorporates three key visual features: a prominent cave serving as a global landmark, a forest area introducing some visual ambiguity between scenes and a river with a bridge that restricts agent movement options. Within this environment, an agent traverses paths between randomly chosen waypoints. These paths are determined using the A* search algorithm to ensure that obstacles do not block the agent’s path. The agent varies its speed and direction as it traverses the generated paths. During its exploration, the agent captures visual observations at regular intervals along each path.
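The exploration procedure can be sketched as follows; the obstacle layout is an assumption standing in for the river and trees, and A* uses a Manhattan heuristic on the 40 × 65 lattice.

```python
# Sketch of waypoint exploration: sample random free positions and connect
# them with A* shortest paths that route around obstacles.
import heapq
import random

W, H = 40, 65
obstacles = {(x, 30) for x in range(0, 35)}        # a "river" with a gap (bridge)

def astar(start, goal):
    frontier, came, cost = [(0, start)], {start: None}, {start: 0}
    while frontier:
        _, cur = heapq.heappop(frontier)
        if cur == goal:
            break
        cx, cy = cur
        for nxt in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
            if not (0 <= nxt[0] < W and 0 <= nxt[1] < H) or nxt in obstacles:
                continue
            c = cost[cur] + 1
            if c < cost.get(nxt, float("inf")):
                cost[nxt], came[nxt] = c, cur
                h = abs(goal[0] - nxt[0]) + abs(goal[1] - nxt[1])  # heuristic
                heapq.heappush(frontier, (c + h, nxt))
    path, cur = [], goal
    while cur is not None:                          # walk back through parents
        path.append(cur)
        cur = came[cur]
    return path[::-1]

free = [(x, y) for x in range(W) for y in range(H) if (x, y) not in obstacles]
a, b = random.sample(free, 2)
print(f"{len(astar(a, b))} steps from {a} to {b}")
```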

Circular environment

To explore the model’s ability to differentiate between visually identical but spatially distinct scenes, these experiments used a circular corridor environment. This environment consists of an infinitely repeating sequence of rooms, specifically coloured red, green, red, blue and yellow in a clockwise direction. Notably, there are two distinct red rooms despite their identical appearance. Technically, the environment is an infinitely long hallway segmented into these coloured rooms. Similar to the previous experiment, an agent navigates between randomly chosen waypoints within this environment. The paths are determined using the A* search algorithm, and the agent captures visual observations at regular intervals along its journey.

Predictive coder

Architecture

The proposed neural network follows an encoder–decoder architecture, employing a U-Net structure to process input image sequences and predict future images. The encoder and decoder components are both based on ResNet-18 convolutional neural networks.

The encoding module utilizes a ResNet-18 model to extract hierarchical features from the input image sequence. Each image in the sequence is processed independently through the ResNet-18 encoder, generating a sequence of latent vectors. The encoder consists of residual blocks, each containing convolutional layers, batch normalization and rectified linear unit (ReLU) activations. The downsampling is achieved via strided convolutions within the residual blocks.

The self-attention module utilizes multi-headed attention, which processes the sequence of encoded latent units to encode the history of past visual observations. The network consists of one layer of multi-headed attention. The multi-headed attention has h = 8 heads. For the encoded latent units with dimension D = C × H × W, the dimension d of a single head is d = C × H × W/h.

The latent vectors output by the encoder are concatenated to form an ordered sequence. This sequence is then processed by a self-attention layer to capture temporal dependencies and relationships among the image sequence. The self-attention mechanism enables the model to weigh the importance of each latent vector in the context of the entire sequence, facilitating improved temporal feature representation.

The decoding module mirrors the encoder’s architecture, utilizing a ResNet-18 model adapted for upsampling. The decoder reconstructs the future images from the transformed latent vectors, employing transposed convolutions and residual blocks analogous to those in the encoder.

Training

The predictive coder is trained for 200 epochs using stochastic gradient descent as the optimization algorithm. The training parameters include a learning rate of 0.1, Nesterov momentum of 0.9 and a weight decay of 5 × 10−6. To optimize the learning process, the learning rate is scheduled using the OneCycle learning-rate policy. This policy adjusts the learning rate cyclically between a lower and upper bound, facilitating efficient convergence and improved performance. The OneCycle learning-rate schedule is characterized by an initial increase in the learning rate, followed by a subsequent decrease.

Latent units

The predictive coder’s encoding and self-attention modules were used to analyse the encoded sequences as the predictive coder’s latent units. The image sequence first undergoes processing through the encoder, which extracts a compressed representation capturing the key features within each image. Subsequently, this encoded sequence is fed into the self-attention module. This module specifically focuses on the inherent temporal order of the images within the sequence. The self-attention module’s processed output forms the predictive coder’s latent units.

Auto-encoder

Architecture

Unlike the predictive coder architecture, the auto-encoder architecture transforms the current image (rather than the past images used by the predictive coder) into a low-dimensional latent vector. The proposed neural network follows an encoder–decoder architecture employing a U-Net structure to process input images into a low-dimensional latent vector and to reconstruct the input image. The encoder and decoder components are both based on ResNet-18 convolutional neural networks. However, the auto-encoder architecture does not utilize any self-attention layers to integrate past observations of images.

The encoding module utilizes a ResNet-18 model to extract hierarchical features from the input image sequence. Each image in the sequence is processed independently through the ResNet-18 encoder, generating a sequence of latent vectors. The encoder consists of residual blocks, each containing convolutional layers, batch normalization and ReLU activations. The downsampling is achieved via strided convolutions within the residual blocks.

Unlike the predictive coder, the latent vectors output by the encoder are directly processed by the decoder. Whereas the predictive coder predicts the future images within an image sequence, the auto-encoder predicts the current images, given the low-dimensional latent vector generated by the encoder.

The decoding module mirrors the encoder’s architecture, utilizing a ResNet-18 model adapted for upsampling. The decoder reconstructs the current image from the latent vector, employing transposed convolutions and residual blocks analogous to those in the encoder.

Training

The auto-encoder is trained for 200 epochs using stochastic gradient descent as the optimization algorithm. The training parameters include a learning rate of 0.1, Nesterov momentum of 0.9 and a weight decay of 5 × 10−6. To optimize the learning process, the learning rate is scheduled using the OneCycle learning-rate policy. This policy adjusts the learning rate cyclically between a lower and upper bound, facilitating efficient convergence and improved performance. The OneCycle learning-rate schedule is characterized by an initial increase in the learning rate, followed by a subsequent decrease.

Latent units

The auto-encoder’s encoding module was used to analyse the encoded images as the auto-encoder’s latent units. The image sequence first undergoes processing through the encoder, which extracts a compressed representation capturing the key features within each image. The encoder’s processed output forms the auto-encoder’s latent units.

Positional decoder

To assess the effectiveness of the predictive coder in capturing positional information within the encoded sequences, this analysis employed an auxiliary neural network for position prediction. This network, referred to as the positional decoder, takes the latent units generated by the predictive coder—or auto-encoder—as input. The decoder architecture consists of several layers designed to extract this positional information: a convolutional layer transforms the input to a higher dimension (256), followed by a ReLU activation for non-linearity. A max pooling layer then reduces the spatial resolution while maintaining relevant features. Subsequently, two fully connected (affine) layers with ReLU activations project the data to a lower dimension (64) and finally to a 2-dimensional output, corresponding to the agent’s predicted position (x and y coordinates).

During training, the mean-squared error between the agent’s actual position and the predicted position served as the loss function

$$E(x,\hat{x})={\left\Vert x-\hat{x}\right\Vert }_{{\ell }_{2}}.$$

To optimize this loss, the AdamW optimizer was employed with a two-stage learning-rate schedule. The initial stage utilized a learning rate of 10−4 for 1,000 epochs, followed by a fine-tuning stage with a reduced learning rate of 10−5 for an additional 1,000 epochs.
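The positional decoder and its two-stage schedule can be sketched as follows; the latent channel count and spatial size are assumptions, and dummy tensors stand in for the real latent units and positions.

```python
# Sketch of the positional decoder: conv -> ReLU -> max pool -> two affine
# layers, trained with AdamW at 1e-4 and then fine-tuned at 1e-5.
import torch
import torch.nn as nn

class PositionalDecoder(nn.Module):
    def __init__(self, in_channels=128, spatial=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # halve the spatial resolution
            nn.Flatten(),
            nn.Linear(256 * (spatial // 2) ** 2, 64), nn.ReLU(),
            nn.Linear(64, 2),                      # predicted (x, y) position
        )

    def forward(self, latent):
        return self.net(latent)

decoder = PositionalDecoder()
latents = torch.rand(8, 128, 4, 4)                 # dummy latent units
targets = torch.rand(8, 2) * 40                    # dummy agent positions
for lr, epochs in [(1e-4, 1000), (1e-5, 1000)]:    # two-stage schedule
    opt = torch.optim.AdamW(decoder.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = (decoder(latents) - targets).norm(dim=1).mean()  # l2 position error
        loss.backward()
        opt.step()
```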

Modelling the correspondence between latent and physical distances

This analysis evaluated the ability of the predictive coder’s latent space to encode local positional information. For each path traversed by the agent, we computed the pairwise distances between positions in physical space and the corresponding latent-space distances within a neighbourhood of 100 time steps. To assess the correspondence between these two distance measures, we analysed the joint distribution of physical and latent-space distances. We modelled the relationship between latent distances and their corresponding physical distances using a logarithmic function, with the observed value x̂ deviating from the model prediction x by additive Gaussian noise

$$\hat{x}=x+\epsilon ,\quad \epsilon \sim {{{\mathcal{N}}}}(0,\sigma ).$$

The goodness of fit between the model and the observed data was evaluated using two metrics: the Pearson correlation coefficient, which measures the dependence between the physical and latent distances, and the Kullback–Leibler divergence

$${{\mathbb{D}}}_{{{{\rm{KL}}}}}({p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}}),$$

which quantifies the difference between the observed empirical distribution and the modelled regression distribution.
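The following NumPy/SciPy sketch shows one way to carry out this evaluation, assuming paired 1-D arrays physical_dists and latent_dists (with zero-distance pairs excluded so the logarithm is defined); the bin count and the smoothing constant in the KL estimate are assumptions:

```python
import numpy as np
from scipy.stats import pearsonr

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence between two (unnormalized) histograms."""
    p, q = p / p.sum(), q / q.sum()
    return np.sum(p * np.log((p + eps) / (q + eps)))

r, _ = pearsonr(physical_dists, latent_dists)  # dependence between the two distances

# Fit the logarithmic model latent ~ a*log(physical) + b by least squares.
a, b = np.polyfit(np.log(physical_dists), latent_dists, 1)
sigma = (latent_dists - (a * np.log(physical_dists) + b)).std()

# Compare the empirical joint histogram with a model-generated one on shared bins.
p_emp, xe, ye = np.histogram2d(physical_dists, latent_dists, bins=50)
model_latent = a * np.log(physical_dists) + b + np.random.normal(0, sigma, physical_dists.shape)
p_model, _, _ = np.histogram2d(physical_dists, model_latent, bins=[xe, ye])
dkl = kl_divergence(p_emp, p_model)
```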

Mutual information of the predictive coder and auto-encoder

The spatial information encoded within the latent representations of both the predictive coder and the auto-encoder was evaluated. To this end, this analysis computed the joint densities between the latent distances in each model and the corresponding physical distances within the environment. By analysing these joint densities, we were able to quantify the spatial information within each model’s latent space. Mutual information

$$I[X;Z]={{\mathbb{E}}}_{p(X,Z)}\left[\log \frac{p(X,Z)}{p(X)\,p(Z)}\right]$$

was employed as a metric to assess this physical information. Higher mutual information indicates that the latent distances in a model encode a greater amount of spatial information, signifying a stronger correlation between the distances in the latent space and the actual physical separations between locations in the environment. This comparison allows us to gauge the relative effectiveness of each model in capturing and representing spatial relationships within their respective latent spaces.
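A plug-in estimate of this mutual information can be obtained from a binned joint histogram, as in the sketch below (the bin count is an assumption; physical_dists and the per-model latent-distance arrays are hypothetical stand-ins):

```python
import numpy as np

def mutual_information(x, z, bins=50):
    """Plug-in estimate of I[X; Z] (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, z, bins=bins)
    p_xz = joint / joint.sum()              # joint density p(X, Z)
    p_x = p_xz.sum(axis=1, keepdims=True)   # marginal p(X)
    p_z = p_xz.sum(axis=0, keepdims=True)   # marginal p(Z)
    mask = p_xz > 0                         # avoid log(0) on empty bins
    return np.sum(p_xz[mask] * np.log(p_xz[mask] / (p_x @ p_z)[mask]))

mi_pc = mutual_information(physical_dists, latent_dists_pc)  # predictive coder
mi_ae = mutual_information(physical_dists, latent_dists_ae)  # auto-encoder
```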

Place field analysis

Place field calculation

This analysis investigated the spatial localization of individual units within the neural network’s latent space. First, this analysis computed the histogram of the values of the 128-dimensional latent vectors. To identify active units, a threshold was applied at the 90th-percentile value of the continuous latent unit values, ensuring a focus on units with notable activation levels. The agent’s head direction was varied during data collection to verify that the identified localized regions remained stable regardless of the agent’s orientation.
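A minimal sketch of this extraction, assuming hypothetical arrays latents of shape (n_samples, 128) and positions of shape (n_samples, 2) collected along the agent’s paths:

```python
import numpy as np

# Threshold each latent unit at its own 90th percentile to find where it is active.
threshold = np.percentile(latents, 90, axis=0)   # one threshold per unit
active = latents >= threshold                    # (n_samples, 128) boolean mask

# Histogram the positions at which a given unit is active; a localized blob in
# this map is the unit's putative place field.
unit = 0
field, _, _ = np.histogram2d(
    positions[active[:, unit], 0],
    positions[active[:, unit], 1],
    bins=[40, 65],  # lattice resolution of the 40 x 65 arena
)
```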

Place field statistical fitting

To quantify the degree of localization for each active unit, this analysis fitted a two-dimensional Gaussian distribution

$$P(x)=\frac{1}{2\uppi \,| {{\Sigma }}{| }^{1/2}}\exp \left[-\frac{1}{2}{(x-\mu )}^{T}{{{\Sigma }}}^{-1}(x-\mu )\right]$$

to its corresponding distribution in physical space. The area of the resulting ellipse, defined as the region where the fitted Gaussian exceeds a probability threshold of P ≥ 0.0005, served as our localization metric. This area reflects the spatial extent of the unit’s activation within the environment, relative to the overall environment size of 40 × 65 lattice units (2,600 square lattice units). Units with smaller ellipse areas exhibit a more concentrated activation pattern in physical space, indicating a higher degree of localization.
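Under the Gaussian fit above, the region where the density exceeds a threshold p0 is an ellipse whose area has a closed form, as in this sketch (the fit uses the sample mean and covariance of the positions where the unit is active):

```python
import numpy as np

def place_field_area(active_positions, p0=5e-4):
    """Area of the ellipse where the fitted 2-D Gaussian density exceeds p0."""
    mu = active_positions.mean(axis=0)   # field centre (not needed for the area)
    cov = np.cov(active_positions.T)     # 2 x 2 sample covariance
    det = np.linalg.det(cov)
    # P(x) >= p0 is the ellipse (x - mu)^T cov^-1 (x - mu) <= c, where
    c = -2.0 * np.log(2.0 * np.pi * np.sqrt(det) * p0)
    if c <= 0:
        return 0.0                       # the density never reaches the threshold
    return np.pi * c * np.sqrt(det)      # ellipse area, in squared lattice units
```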

Vector navigation analysis

This analysis investigated the ability of the neural network’s latent space not only to encode positional information but also to represent the vector heading from a current location to a goal location, a capability called vector navigation. To assess this, we compared the latent-space representations of two distinct positions x1 and x2 by computing the bitwise difference z1 − z2 between the corresponding latent codes z1 and z2. Subsequently, we examined the relationship between this difference vector and the actual physical displacement vector x1 − x2 using a linear decoder

$${x}_{1}-{x}_{2}=W\,[{z}_{1}-{z}_{2}]+b.$$

This decoder was trained to predict the displacement vector based solely on the latent-code difference. The predicted displacement was then decomposed into its distance and directional components to calculate the specific errors associated with predicting both the distance and the direction to the goal location. This analysis computed the Pearson correlation coefficient between the predicted and actual values of the distance, the direction and the displacement vector.
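A least-squares sketch of this linear decoder, assuming hypothetical arrays z1, z2 of latent codes with shape (n_pairs, 128) and x1, x2 of positions with shape (n_pairs, 2):

```python
import numpy as np

dz = z1 - z2                                   # latent-code differences
dx = x1 - x2                                   # physical displacement vectors

# Solve dx = W [dz] + b by ordinary least squares with an appended bias column.
A = np.hstack([dz, np.ones((len(dz), 1))])
coef, *_ = np.linalg.lstsq(A, dx, rcond=None)
W, b = coef[:-1].T, coef[-1]

pred = dz @ W.T + b                            # predicted displacements
pred_distance = np.linalg.norm(pred, axis=1)            # distance to goal
pred_direction = np.arctan2(pred[:, 1], pred[:, 0])     # heading to goal (radians)
```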

Mutual information calculation

This analysis employed a complementary approach to evaluate the spatial information encoded within the binary vectors derived from the latent space. Here the joint densities were computed between the bitwise distances of these binary vectors and the Euclidean distances between corresponding physical positions. The mutual information

$$I[X;Z]={{\mathbb{E}}}_{p(X,Z)}\left[\log \frac{p(X,Z)}{p(X)\,p(Z)}\right]$$

was then computed to quantify the amount of spatial information captured by the bitwise distances. This metric reflects how well the bitwise distance between latent codes tracks the actual physical separation between locations in the environment. Finally, to provide context for the obtained value, the mutual information of the binary vectors’ bitwise distance was compared with the mutual information derived from the latent distances of both the predictive coder and the auto-encoder. This comparison assesses the relative effectiveness of each model in capturing spatial information within their respective latent representations.
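This calculation can reuse the mutual_information sketch above, with Hamming distances between binarized codes in place of latent Euclidean distances (latents and positions are the hypothetical arrays used earlier; for long recordings the pair set should be subsampled):

```python
import numpy as np

# Binarize the latent vectors at the per-unit 90th percentile, as above.
codes = (latents >= np.percentile(latents, 90, axis=0)).astype(int)

i, j = np.triu_indices(len(codes), k=1)                 # all unordered pairs
hamming = np.abs(codes[i] - codes[j]).sum(axis=1)       # bitwise distances
physical = np.linalg.norm(positions[i] - positions[j], axis=1)

mi_binary = mutual_information(physical, hamming)       # estimator defined earlier
```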

Place field stability with shifting landmarks

To assess the stability of the identified localized regions within the latent space, this analysis investigated their resilience to changes in the environment’s landmarks. The environment was manipulated: the trees, originally serving as landmarks, were removed and then randomly redistributed throughout the space. Subsequently, the Jaccard index

$$| {S}_{{{{\rm{new}}}}}\cap {S}_{{{{\rm{old}}}}}| /| {S}_{{{{\rm{new}}}}}\cup {S}_{{{{\rm{old}}}}}|$$

was employed to quantify the overlap between the latent units identified in the original environment and those found in the environment with shifted landmarks. The Jaccard index ranges from 0 to 1, where a value of 1 indicates a perfect overlap between the sets of latent units, and 0 signifies no overlap. This analysis allowed us to evaluate how well the latent units maintain their spatial correspondence despite alterations to the environment’s visual features.
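For completeness, a one-line sketch of the overlap measure, where s_old and s_new are the sets of latent-unit indices identified before and after the landmark shuffle:

```python
def jaccard(s_old: set, s_new: set) -> float:
    """Jaccard index: 1 means identical unit sets, 0 means disjoint."""
    return len(s_new & s_old) / len(s_new | s_old)

# Example: jaccard({3, 17, 42}, {3, 42, 99}) == 2 / 4 == 0.5
```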

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.