Abstract
Humans construct internal cognitive maps of their environment directly from sensory inputs, without access to a system of explicit coordinates or distance measurements. Although machine learning algorithms like simultaneous localization and mapping utilize specialized inference procedures to identify visual features and construct spatial maps from visual and odometry data, the general nature of cognitive maps in the brain suggests a unified mapping algorithmic strategy that can generalize to auditory, tactile and linguistic inputs. Here we demonstrate that predictive coding provides a natural and versatile neural network algorithm for constructing spatial maps using sensory data. We introduce a framework in which an agent navigates a virtual environment while engaging in visual predictive coding using a self-attention-equipped convolutional neural network. While learning a next-image prediction task, the agent automatically constructs an internal representation of the environment that quantitatively reflects spatial distances. The internal map enables the agent to pinpoint its location relative to landmarks using only visual information. The predictive coding network generates a vectorized encoding of the environment that supports vector navigation, where individual latent space units delineate localized, overlapping neighbourhoods in the environment. Broadly, our work introduces predictive coding as a unified algorithmic framework for constructing cognitive maps that can naturally extend to the mapping of auditory, sensorimotor and linguistic inputs.
Main
Space and time are fundamental physical structures in the natural world, and all organisms have evolved strategies for navigating space to forage, mate and escape predation^{1,2,3}. In humans and other mammals, the concept of a spatial or cognitive map has been postulated to underlie spatial reasoning tasks^{4,5,6}. A spatial map is an internal, neural representation of an animal’s environment that marks the location of landmarks, food, water and shelter, which can be queried for navigation and planning. The neural algorithms underlying spatial mapping are thought to generalize to other sensory modes to provide cognitive representations of auditory and somatosensory data^{7} as well as to construct internal maps of more abstract information including concepts^{8,9}, tasks^{10}, semantic information^{11,12,13} and memories^{14}. Empirical evidence suggests that the brain uses common cognitive mapping strategies for spatial and non-spatial sensory information so that common mapping algorithms might exist that can map and navigate over not only visual but also semantic information and logical rules inferred from experience^{7,8,15}. In such a paradigm, reasoning itself could be implemented as a form of navigation within a cognitive map of concepts, facts and ideas.
After the notion of a spatial or cognitive map emerged, the question of how environments are represented within the brain and how the maps can be learned from experience has been a central question in neuroscience^{16}. Place cells in the hippocampus are neurons that are active when an animal transits through a specific location in an environment^{16}. Grid cells in the entorhinal cortex fire in regular spatial intervals and likely track an organism’s displacement in the environment^{17,18}. Yet, even with the identification of a substrate for the representation of space, the question of how a spatial map can be learned from sensory data has remained, and the neural algorithms that enable the construction of spatial and other cognitive maps remain poorly understood.
Empirical work in machine learning has demonstrated that deep neural networks can solve spatial navigation tasks as well as perform path prediction and grid cell formation^{19,20}. Two studies^{19,20} demonstrate that neural networks can learn to perform path prediction and that networks generate firing patterns that resemble the firing patterns of grid cells in the entorhinal cortex. Other studies^{20,21,22} demonstrate navigation algorithms that require the environment’s map or that use firing patterns resembling place cells in the hippocampus. These studies allow an agent to access environmental coordinates explicitly^{19} or initialize a model with place cells that represent specific locations in an arena^{20}. In machine learning and autonomous navigation, a variety of algorithms have been developed to perform mapping tasks, including simultaneous localization and mapping (SLAM) and monocular SLAM algorithms^{23,24,25,26}, as well as neural network implementations^{27,28,29}. Yet, SLAM algorithms contain many specific inference strategies, like visual feature and object detection, that are specifically engineered for map building, wayfinding and pose estimation based on visual information. Whereas extensive research in computer vision and machine learning uses video frames, these studies do not extract representations of the environment’s map^{30,31}. A unified theoretical and mathematical framework for understanding the mapping of spaces based on sensory information remains incomplete.
Predictive coding has been proposed as a unifying theory of neural function where the fundamental goal of a neural system is to predict future observations given past data^{32,33,34}. When an agent explores a physical environment, temporal correlations in sensory observations reflect the structure of the physical environment. Landmarks near one another in space will also be observed in temporal sequence. In this way, predicting observations in a temporal series of sensory observations requires an agent to internalize some implicit information about a spatial domain. Historically, Poincaré motivated the possibility of spatial mapping through a predictive coding strategy, where an agent assembles a global representation of an environment by gluing together information gathered through local exploration^{35,36}. The exploratory paths together contain information that could, in principle, enable the assembly of a spatial map for both flat and curved manifolds. Indeed, extended Kalman filters^{25,37} for SLAM perform a form of predictive coding by directly mapping visual changes and movement to spatial changes. However, extended Kalman filters, as well as other SLAM approaches, require intricate strategies for landmark size calibration, image feature extraction and models of the camera’s distortion, whereas biological systems can solve flexible mapping and navigation problems that engineered systems cannot. Yet, while the concept of predictive coding for spatial mapping is intuitively attractive, a major challenge is the development of algorithms that can glue together local sensory information gathered by an agent into a global, internally consistent environmental map. Connections between mapping and predictive coding in the literature have primarily focused on situations where an agent has explicit access to its spatial location as a state variable^{38,39,40}. The problem of building spatial maps de novo from sensory data remains poorly understood.
Here we demonstrate that a neural network trained on a sensory predictive coding task can construct an implicit spatial map of an environment by assembling observations acquired along local exploratory paths into a global representation of a physical space within the network’s latent space. We analyse sensory predictive coding theoretically and demonstrate mathematically that solutions to the predictive sensory inference problem have a mathematical structure that can naturally be implemented by a neural network trained using backpropagation and comprising a ‘path encoder’, an internal spatial map and a ‘sensory decoder’. In such a paradigm, a network learns an internal map of its environment by inferring an internal geometric representation that supports predictive sensory inference. We implement sensory predictive coding within an agent that explores a virtual environment while performing visual predictive coding using a convolutional neural network with self-attention. Following network training during exploration, we find that the encoder network embeds images collected by an agent exploring an environment into an internal representation of space. Within the embedding, the distances between images reflect their relative spatial position, not object-level similarity between images. During exploratory training, the network implicitly assembles information from local paths into a global representation of space as it performs a next-image inference problem. Fundamentally, we connect predictive coding and mapping tasks, demonstrating a computational and mathematical strategy for integrating information from local measurements into a global self-consistent environmental model.
Mathematical formulation of spatial mapping as sensory predictive coding
In this Article, we aim to understand how a spatial map can be assembled by an agent that is making sensory observations while exploring an environment. Papers in the literature that study connections between predictive coding and mapping have primarily focused on situations where an agent has access to its ‘state’ or location in the environment^{38,39,40}. Here we develop a theoretical model and neural network implementation of sensory predictive coding that illustrates why and how an internal spatial map can emerge naturally as a solution to sensory inference problems. The neural network is a feedforward deep neural network trained using backpropagation, or gradient descent, rather than Helmholtz machines^{41,42}, which are commonly used in predictive coding. We first formulate a theoretical model of visual predictive coding and demonstrate that the predictive coding problem can be solved by an inference procedure that constructs an implicit representation of an agent’s environment to predict future sensory observations. The theoretical analysis also suggests that the underlying inference problem can be solved by an encoder–decoder neural network that infers spatial position based upon observed image sequences.
We consider an agent exploring an environment \({{\varOmega }}\subset {{\mathbb{R}}}^{2}\), while acquiring visual information in the form of pixel valued image vectors \({{I}}_{x}\in {{\mathbb{R}}}^{m\times n}\) given an x ∈ Ω. The agent’s environment Ω is a bounded subset of \({{\mathbb{R}}}^{2}\) that could contain obstructions and holes. In general, at any given time t, the agent’s state can be characterized by a position x(t) and orientation θ(t) where x(t) and θ(t) are coordinates within a global coordinate system unknown to the agent.
The agent’s environment comes equipped with a visual scene, and the agent makes observations by acquiring image vectors \({{I}}_{{x}_{k}}\in {{\mathbb{R}}}^{m\times n}\) as it moves along a sequence of points x_{k}. At every position x and orientation θ, the agent acquires an image by effectively sampling from the conditional probability distribution P(I ∣ x, θ), which encodes the probability of observing a specific image vector I when the agent is positioned at position x and orientation θ. The distribution P(I ∣ x, θ) has a deterministic and stochastic component where the deterministic component is set by landmarks in the environment while stochastic effects can emerge due to changes in lighting, background and scene dynamics. Mathematically, we can view P(I ∣ x, θ) as a function on a vector bundle with base space Ω and total space Ω × I (ref. ^{43}). The function assigns an observation probability to every possible image vector for an agent positioned at a point (x, θ). Intuitively, the agent’s observations preserve the geometric structure of the environment: the spatial structure influences temporal correlations.
In the predictive coding problem, the agent moves along a series of points (x_{0}, θ_{0}), (x_{1}, θ_{1}), …, (x_{k}, θ_{k}) while acquiring images I_{0}, I_{1}, …, I_{k}. The motion of the agent in Ω is generated by a Markov process with transition probabilities P(x_{i+1}, θ_{i+1} ∣ x_{i}, θ_{i}). Note that the agent has access to the image observations I_{i} but not the spatial coordinates (x_{i}, θ_{i}). Given the set {I_{0}, …, I_{k}} the agent aims to predict I_{k+1}. Mathematically, the image prediction problem can be solved theoretically through statistical inference by (1) inferring the posterior probability distribution P(I_{k+1} ∣ I_{0}, I_{1}, …, I_{k}) from observations. Then, (2) given a specific sequence of observed images {I_{0}, …, I_{k}}, the agent can predict the next image I_{k+1} by finding the image I_{k+1} that maximizes the posterior probability distribution P(I_{k+1} ∣ I_{0}, I_{1}, …, I_{k}).
The posterior probability distribution P(I_{k+1} ∣ I_{0}, I_{1}, …, I_{k}) is by definition

$$P(I_{k+1}\mid I_{0},I_{1},\ldots,I_{k})=\frac{P(I_{0},I_{1},\ldots,I_{k},I_{k+1})}{P(I_{0},I_{1},\ldots,I_{k})}.$$
If we consider P(I_{0}, I_{1}, …, I_{k}, I_{k+1}) to be a function of an implicit set of spatial coordinates (x_{i}, θ_{i}) where the (x_{i}, θ_{i}) provide an internal representation of the spatial environment, then we can express the posterior probability P(I_{k+1} ∣ I_{0}, I_{1}, …, I_{k}) in terms of the implicit spatial representation

$$P(I_{k+1}\mid I_{0},\ldots,I_{k})=\int_{\varOmega}\underbrace{P(x_{0},\theta_{0},\ldots,x_{k},\theta_{k}\mid I_{0},\ldots,I_{k})}_{\text{term 1}}\,\underbrace{P(x_{k+1},\theta_{k+1}\mid x_{k},\theta_{k})}_{\text{term 2}}\,\underbrace{P(I_{k+1}\mid x_{k+1},\theta_{k+1})}_{\text{term 3}}\,{\mathrm{d}}x\,{\mathrm{d}}\theta \qquad (1)$$
where in equation (1) the integration is over all possible paths {(x_{0}, θ_{0}), …, (x_{k}, θ_{k})} in the domain Ω, for differentials dx = dx_{0}, …, dx_{k} and dθ = dθ_{0}, …, dθ_{k}. Equation (1) can be interpreted as a path integral over the domain Ω. The path integral assigns a probability to every possible path in the domain and then computes the probability that the agent will observe a next image I_{k+1} given an inferred location (x_{k+1}, θ_{k+1}). In detail, term 1 assigns a probability to every discrete path {(x_{0}, θ_{0}), …, (x_{k}, θ_{k})} ⊂ Ω as the conditional likelihood of the path given the observed sequence of images {I_{0}, …, I_{k}}. Term 2 computes the probability that an agent at a terminal state (x_{k}, θ_{k}) moves to the position (x_{k+1}, θ_{k+1}), given the Markov transition function P(x_{k+1}, θ_{k+1} ∣ x_{k}, θ_{k}). Term 3 is the conditional probability that image I_{k+1} is observed, given that the agent is at position (x_{k+1}, θ_{k+1}).
Conceptually, the product of terms solves the nextimage prediction problem in three steps. First, estimating the probability that an agent has traversed a particular sequence of points given the observed images; second, estimating the next position of the agent (x_{k+1}, θ_{k+1}) for each potential path; and third, computing the probability of observing a next image I_{k+1} given the inferred terminal location x_{k+1} of the agent. Critically, an algorithm that implements the inference procedure encoded in the equation would construct an internal but implicit representation of the environment as a coordinate system x, θ that is learned by the agent and used during the nextimage inference procedure. The coordinate system provides an internal, inferred representation of the agent’s environment that is used to estimate future image observation probabilities. Thus, our theoretical framework demonstrates how an agent might construct an implicit representation of its spatial environment by solving the predictive coding problem.
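The three-step decomposition above can be made concrete with a minimal sketch. The toy environment below (a five-site one-dimensional lattice with discrete image labels, deterministic emissions and a random-walk transition model, all invented for illustration) enumerates every candidate path to evaluate terms 1–3 exactly:

```python
from itertools import product

import numpy as np

# Toy 1-D lattice: 5 positions, each deterministically emitting one of
# three discrete "images" (labels 0-2).  Positions 0 and 4 share a label,
# so a single image does not identify position -- the path does.
image_at = np.array([0, 1, 2, 1, 0])
n_pos, n_img = 5, 3

# Markov motion model: step left or right with equal probability
# (reflecting at the walls).
T = np.zeros((n_pos, n_pos))
for x in range(n_pos):
    for nx in (x - 1, x + 1):
        if 0 <= nx < n_pos:
            T[x, nx] += 1.0
T /= T.sum(axis=1, keepdims=True)

def posterior_next_image(observed):
    """Three-step inference: path posterior -> next position -> next image."""
    post = np.zeros(n_img)
    for path in product(range(n_pos), repeat=len(observed)):
        # Term 1 (unnormalized): P(path) restricted to paths consistent
        # with the observed image sequence (emission is deterministic).
        p = 1.0 / n_pos
        consistent = image_at[path[0]] == observed[0]
        for i in range(1, len(observed)):
            p *= T[path[i - 1], path[i]]
            consistent = consistent and image_at[path[i]] == observed[i]
        if not consistent or p == 0.0:
            continue
        # Term 2: distribution over the next position; Term 3: emission.
        for nx in range(n_pos):
            post[image_at[nx]] += p * T[path[-1], nx]
    return post / post.sum()

# Images 0, 1, 2 are consistent with two paths (0->1->2 and 4->3->2); both
# end at position 2, whose neighbours (1 and 3) emit image 1.
p_next = posterior_next_image([0, 1, 2])
```

Even with degenerate emissions, the path posterior, rather than any single image, carries the positional information, so the next-image posterior is sharp.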
The three-step inference procedure represented in the equation for P(I_{k+1} ∣ I_{0}, I_{1}, …, I_{k}) can be directly implemented in a neural network architecture, as demonstrated in the Supplementary Information. The first term acts as an ‘encoder’ network that computes the probability that the agent has traversed a path {(x_{0}, θ_{0}), …, (x_{k}, θ_{k})} given an observed image sequence I_{0}, …, I_{k} (Fig. 1b). The network can then estimate the next position (x_{k+1}, θ_{k+1}) of the agent given an inferred location (x_{k}, θ_{k}) and apply a decoding network to compute P(I_{k+1} ∣ x_{k+1}, θ_{k+1}), while outputting the prediction I_{k+1} using a decoder. A network trained through visual experience must learn an internal coordinate system and representation x, θ that not only offers an environmental representation but also establishes a connection between observed images I_{j} and inferred locations (x_{j}, θ_{j}).
A neural network performs predictive coding
Motivated by the implicit representation of space contained in the predictive coding inference problem, we developed a computational implementation of a predictive coding agent and studied the representation of space learned by that agent as it explored a virtual environment. We first create an environment using the Malmo platform in Minecraft^{44}. The physical environment measures 40 × 65 lattice units and encapsulates three aspects of visual scenes: a cave provides a global visual landmark, a forest provides degeneracy between visual scenes, and a river with a bridge constrains how an agent traverses the environment (Fig. 1a). An agent follows paths (Supplementary Fig. 5b,c), determined by A^{*} search to find the shortest path between randomly sampled positions, and receives visual images along every path.
To perform predictive coding, we construct an encoder–decoder convolutional neural network with a ResNet-18 architecture^{45} for the encoder and a corresponding ResNet-18 architecture with transposed convolutions in the decoder (Fig. 1b). The encoder–decoder architecture uses the U-Net architecture^{46} to pass the encoded latent units into the decoder. Multi-headed attention^{47} processes the sequence of encoded latent units to encode the history of past visual observations. The multi-headed attention has h = 8 heads. For the encoded latent units with dimension D = C × H × W, the dimension d of a single head is d = C × H × W/h for height H, width W and channels C.
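A minimal numpy sketch of the head-splitting arithmetic, assuming invented sizes (C = 8, H = W = 4, so D = 128 and d = D/h = 16 per head); this is plain scaled dot-product self-attention over a sequence of flattened latent units, not the trained network itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes standing in for the latent units described in the text.
C, H, W, h = 8, 4, 4, 8            # channels, height, width, attention heads
D = C * H * W                      # flattened latent dimension
d = D // h                         # per-head dimension, d = C*H*W/h

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """X: (T, D) sequence of flattened latent units for T time steps."""
    T = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # (T, D) each
    # Split D into h heads of width d.
    Q = Q.reshape(T, h, d).transpose(1, 0, 2)  # (h, T, d)
    K = K.reshape(T, h, d).transpose(1, 0, 2)
    V = V.reshape(T, h, d).transpose(1, 0, 2)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d), axis=-1)  # (h, T, T)
    out = (A @ V).transpose(1, 0, 2).reshape(T, D)               # concat heads
    return out @ Wo

T_len = 10
X = rng.normal(size=(T_len, D))
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) * D**-0.5 for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo)
```

The reshape step is where the constraint that h divides D enters: each head attends over the sequence in its own d-dimensional slice, and the slices are concatenated before the output projection.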
The predictive coder approximates predictive coding by minimizing the mean-squared error between the actual observation and its predicted observation. The predictive coder trains on 82,630 samples for 200 epochs with gradient descent optimization with Nesterov momentum^{48}, a weight decay of 5 × 10^{−6} and a learning rate of 10^{−1} adjusted by OneCycle learning-rate scheduling^{49}. The optimized predictive coder has a mean-squared error between the predicted and actual images of 0.094 and a good visual fidelity (Fig. 1c).
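The optimizer settings quoted above (Nesterov momentum, weight decay 5 × 10^{−6}, peak learning rate 10^{−1} with a one-cycle schedule) can be sketched on a toy least-squares problem; the schedule here is a simplified linear warm-up and decay, not the exact scheduler of ref. 49, and the problem is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))        # toy design matrix
w_true = rng.normal(size=5)
y = A @ w_true                      # noiseless targets

def one_cycle(step, total, lr_max):
    """Simplified one-cycle schedule: linear warm-up, then linear decay."""
    half = total / 2
    return lr_max * (step / half if step < half else (total - step) / half)

w = np.zeros(5)
v = np.zeros(5)
momentum, weight_decay, lr_max, steps = 0.9, 5e-6, 1e-1, 400
for t in range(steps):
    lr = one_cycle(t, steps, lr_max)
    # Nesterov momentum: evaluate the gradient at the look-ahead point.
    w_look = w + momentum * v
    grad = A.T @ (A @ w_look - y) / len(y) + weight_decay * w_look
    v = momentum * v - lr * grad
    w = w + v

mse = float(np.mean((A @ w - y) ** 2))
```

The look-ahead gradient is what distinguishes Nesterov momentum from plain heavy-ball momentum; the weight decay enters as an additive term in the gradient, as in decoupled regularization of a quadratic loss.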
Predictive coding network constructs an implicit spatial map
We show that the predictive coder creates an implicit spatial map by demonstrating that it recovers the environment’s spatial position and distance. We encode the image sequences using the predictive coder’s encoder to analyse the encoded sequence as the predictive coder’s latent units. To measure the positional information in the predictive coder, we train a neural network to predict the agent’s position from the predictive coder’s latent units (Fig. 1a). The neural network’s prediction error

$${{{\rm{error}}}}={\left\Vert x-\hat{x}\right\Vert }_{{\ell }_{2}}$$
indirectly measures the predictive coder’s positional information. To provide comparative baselines, we construct a position prediction model. To provide a lower bound for the prediction error, we construct a model that gives the agent’s actual position with small additive Gaussian noise:

$$\hat{x}=x+\varepsilon ,$$
where ε ∼ 𝒩(0, σ) denotes noise drawn from a Gaussian distribution with zero mean and standard deviation σ. To compare the predictive coder to the baselines, we compare the prediction error histograms (Fig. 2b).
The predictive coder encodes the environment’s spatial position to a low prediction error (Fig. 2d). The predictive coder has a mean error of 5.04 lattice units and >80% of samples have an error <7.3 lattice units. The additive Gaussian model with σ = 4 has a mean error of 4.98 lattice units and >80% of samples with an error <7.12 lattice units.
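The baseline numbers are consistent with the Rayleigh statistics of two-dimensional Gaussian noise: with σ = 4, the error norm has mean σ√(π/2) ≈ 5.01 and 80th percentile σ√(−2 ln 0.2) ≈ 7.18 lattice units. A quick numpy check with synthetic positions (arena size taken from the text, everything else invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic agent positions in the 40 x 65 arena (lattice units).
x = rng.uniform([0, 0], [40, 65], size=(n, 2))

# Baseline model: true position corrupted by N(0, sigma) noise, sigma = 4.
sigma = 4.0
x_hat = x + rng.normal(0.0, sigma, size=x.shape)

# Prediction error as the l2 norm, as in the text.
err = np.linalg.norm(x_hat - x, axis=1)
mean_err = float(err.mean())            # analytically sigma * sqrt(pi/2)
q80 = float(np.quantile(err, 0.80))     # analytically sigma * sqrt(-2 ln 0.2)
```

The empirical mean and 80th percentile land near 5.0 and 7.2 lattice units, matching the figures quoted for the additive Gaussian model above.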
We show that the predictive coder’s latent space recovers the local distances between the environment’s physical positions. For every path that the agent traverses, we calculate the local pairwise distances in physical space and in the predictive coder’s latent space with a neighbourhood of 100 time points. To determine whether latent space distances correspond to physical distances, we calculate the joint density between latent space distances and physical distances (Fig. 2c). We model the latent distances as a logarithmic function of the physical distances with additive Gaussian noise:

$$\left\Vert z-z^{\prime}\right\Vert = a\log \left\Vert x-x^{\prime}\right\Vert + b + \varepsilon .$$
The modelled distribution closely matches the predictive coder’s distribution (Fig. 2d), with a Pearson correlation coefficient of 0.827 and a Kullback–Leibler divergence \(({{\mathbb{D}}}_{{{{\rm{KL}}}}}(\,{p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}}))\) of 0.429 bits.
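The distance analysis can be sketched end to end with a toy embedding standing in for the trained encoder (a saturating tanh map with small Gaussian noise, entirely invented); pairwise distances are compared and the logarithmic model is fitted by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(0, 40, size=(n, 2))                        # physical positions

# Toy "latent" embedding: distance-respecting but saturating, plus noise.
z = np.tanh(0.1 * x) + rng.normal(0, 0.01, size=x.shape)

# All pairwise distances in physical and latent space.
i, j = np.triu_indices(n, k=1)
d_phys = np.linalg.norm(x[i] - x[j], axis=1)
d_lat = np.linalg.norm(z[i] - z[j], axis=1)

# Fit d_lat ~ a * log(d_phys) + b by linear least squares.
X = np.column_stack([np.log(d_phys + 1e-9), np.ones_like(d_phys)])
(a, b), *_ = np.linalg.lstsq(X, d_lat, rcond=None)

# Pearson correlation between observed and modelled latent distances.
r = float(np.corrcoef(d_lat, X @ np.array([a, b]))[0, 1])
```

A positive slope a and a high correlation r indicate that latent distances grow monotonically, though sublinearly, with physical distances, which is the qualitative behaviour the logarithmic model captures.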
Predictive coding network learns spatial proximity, not image similarity
In the previous section, we showed that a neural network that performs predictive coding learns an internal representation of its physical environment within its latent space. Here we demonstrate that the prediction task itself is essential for spatial mapping: prediction forces a network to learn spatial proximity and not merely image similarity. Many frameworks, including principal components analysis, IsoMap^{50} and autoencoder neural networks, can collocate images by visual similarity. While similar scenes might be proximate in space, similar scenes can also be spatially divergent. For example, the virtual environment we constructed has two different ‘forest’ regions that are separated by a lake. Thus, the two forest environments might generate similar images but are actually each closer to the lake region than to one another (Fig. 1a).
To demonstrate the central role of prediction in mapping, we compared the latent representation of images generated by the predictive coding network to a representation learned by an autoencoder. The autoencoder network has a similar architecture to the predictive coder but encodes a single image observation in a latent space and decodes the same observation. As the autoencoder operates on a single image, rather than a sequence, it learns an embedding based on image proximity, not underlying spatial relationships. As with the predictive coder, the autoencoder (Fig. 3a) trains to minimize the mean-squared error between the actual image and the predicted image on 82,630 samples for 200 epochs with gradient descent optimization with Nesterov momentum, a weight decay of 5 × 10^{−6} and a learning rate of 10^{−1} adjusted by the OneCycle learning-rate scheduler. The autoencoder has a mean-squared error of 0.039 and a high visual fidelity.
The predictive coder encodes a higher-resolution and more accurate spatial map in its latent space than the autoencoder. As with the predictive coder, we train an auxiliary neural network to predict the agent’s position from the autoencoder’s latent units (Fig. 3b). The neural network’s prediction error indirectly measures the autoencoder’s positional information. More than 80% of the autoencoder’s samples have a prediction error below 13.1 lattice units, compared with the predictive coder, for which >80% of samples have a prediction error below 7.3 lattice units (Fig. 3c).
We also show that the predictive coder recovers the environment’s spatial distances with finer resolution compared to the autoencoder. As with the predictive coder, we calculate the local pairwise distances in physical space and in the autoencoder’s latent space, and we generate the joint density between the physical and latent distances (Fig. 3d). Compared to the predictive coder’s joint density, the autoencoder’s latent distances increase with the agent’s physical distance. The autoencoder’s joint density shows a larger dispersion compared to the predictive coder’s joint density, indicating that the autoencoder encodes spatial distances with higher uncertainty.
We can quantitatively measure the dispersion in the autoencoder’s joint density by calculating the mutual information of the joint density (Fig. 3e)

$$I({d}_{{{{\rm{latent}}}}};{d}_{{{{\rm{physical}}}}})=\sum p({d}_{{{{\rm{latent}}}}},{d}_{{{{\rm{physical}}}}})\,{\log }_{2}\frac{p({d}_{{{{\rm{latent}}}}},{d}_{{{{\rm{physical}}}}})}{p({d}_{{{{\rm{latent}}}}})\,p({d}_{{{{\rm{physical}}}}})},$$

where the sum runs over the binned latent and physical distances.
The autoencoder has a mutual information of 0.227 bits, while the predictive coder has a mutual information of 0.627 bits. As a comparison, positions with additive Gaussian noise having a standard deviation σ of 2 lattice units have a mutual information of 0.911 bits. The predictive coder therefore encodes 0.400 more bits of distance information than the autoencoder; this additional information alone exceeds the autoencoder’s total distance information of 0.227 bits, indicating that the temporal dependencies encoded by the predictive coder capture more spatial information than visual similarity does.
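A sketch of the mutual-information estimate from a binned joint density, applied to synthetic distances seen through additive Gaussian noise (noise levels invented); tighter noise should yield more bits, mirroring the ordering reported above:

```python
import numpy as np

def mutual_information(a, b, bins=30):
    """I(A; B) in bits, estimated from a 2-D histogram of paired samples."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)       # marginal of A
    py = p.sum(axis=0, keepdims=True)       # marginal of B
    nz = p > 0                              # avoid log(0) terms
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
d_phys = rng.uniform(0, 50, size=20_000)    # synthetic physical distances

# "Latent" distances seen through additive Gaussian noise of two widths:
mi_tight = mutual_information(d_phys, d_phys + rng.normal(0, 2, 20_000))
mi_loose = mutual_information(d_phys, d_phys + rng.normal(0, 20, 20_000))
```

The estimator is the plug-in sum over histogram cells; the lower-noise channel retains several times more bits about distance, which is the same comparison made between the predictive coder and the autoencoder.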
Predictive coding network maps visually degenerate environments, whereas autoencoding cannot
The sequential prediction task is beneficial for spatial mapping: the predictive coder captures more accurate spatial information compared to the autoencoder, and the predictive coder’s latent distances have a stronger correspondence to the environment’s metric. However, it is unclear whether predictive coding is necessary (as opposed to beneficial) to recover an environment’s map; an autoencoder may still recover the environment’s map. In this section, we demonstrate that predictive coding is necessary for recovering an environment’s map. First, we show empirically that there exist environments that autoencoding cannot recover. Second, we provide insight into why the autoencoder fails with a theorem showing that autoencoding cannot recover many environments—specifically, environments with visually similar yet spatially different locations.
In the previous sections, the agent explores a natural environment with forest, river and cave landmarks. While this environment models exploration in outdoor environments, the lack of controlled visual scenes complicates the interpretation of the operation of the predictive coder and autoencoder. We therefore introduce a circular corridor (Fig. 4a) containing visual scenes that are visually identical, rather than merely visually similar, yet spatially different. Specifically, the rooms appear clockwise as red, green, red, blue and yellow; there exist two distinct red rooms. The two distinct red rooms permit answering two questions: (1) Can the predictive coder and autoencoder recover the map for environments with visual symmetry? (2) Does the predictive coder recover a global map or a relative map? In other words, does the predictive coder recover the circular corridor’s geometry, or does it learn a linear hallway?
Similar to previous sections, we train a neural network (or a predictive coder) to perform predictive coding while traversing the circular corridor. In addition, we train a neural network (or an autoencoder) to perform autoencoding. The autoencoder fails to recover spatial information in areas with visual degeneracy: it maps the two distinct red rooms to the same location (Fig. 4e). In Fig. 4e, the autoencoder maps images in the left red room to locations in the right red room, whereas locations with distinct visual scenes (such as the yellow and blue rooms) show a low prediction error (mean error \({\left\Vert x-\hat{x}\right\Vert }_{{\ell }_{2}}=5.004\) lattice units). In addition, the autoencoder’s latent distances do not separate the different red rooms in latent space, whereas the predictive coder separates the two red rooms (Fig. 4b). Moreover, the predictive coder demonstrates a low prediction error throughout, including the two visually degenerate red rooms (mean error \({\left\Vert x-\hat{x}\right\Vert }_{{\ell }_{2}}=0.071\) lattice units) (Fig. 4c).
Moreover, we measure the relationship between the predictive coder’s (and autoencoder’s) metric and the environment’s metric by fitting a regression model (Fig. 4b),

$$\left\Vert z-z^{\prime}\right\Vert = a\log \left\Vert x-x^{\prime}\right\Vert + b,$$
between the predictive coder’s (and autoencoder’s) latent distances (\(\left\Vert z-z^{\prime} \right\Vert\)) and the environment’s physical distances (\(\left\Vert x-x^{\prime} \right\Vert\)). Compared to the natural environment, the autoencoder’s latent distances show more deviation from the environment’s spatial distances, whereas the predictive coder’s latent distances maintain a correspondence with spatial distances. For the predictive coder, the latent metric recovers the spatial metric quantitatively: the correlation plot (Fig. 4d, left) shows a high correlation (r = 0.827) between the latent and spatial distances, and the quantile–quantile plot (Fig. 4d, right) shows a high overlap between the regression model and the observed latent distances (\({{\mathbb{D}}}_{{{{\rm{KL}}}}}({p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}})=0.250\)). The autoencoder’s latent metric, conversely, does not recover the spatial metric: the correlation plot (Fig. 4f, left) shows a low correlation (r = 0.288) between the latent and spatial distances, and the quantile–quantile plot (Fig. 4f, right) shows a low overlap between the regression model and the observed latent distances (\({{\mathbb{D}}}_{{{{\rm{KL}}}}}({p}_{{{{\rm{PC}}}}}\parallel {p}_{{{{\rm{model}}}}})=3.806\)).
As shown in Fig. 4, the autoencoder cannot recover the spatial map of the circular corridor, whereas the predictive coder can recover the map. Here we show that autoencoders cannot recover the environment’s map for any environment with visual degeneracy, not just the circular corridor. To show that the autoencoder cannot learn the environment’s map, we show that no statistical estimator can learn the environment’s map from stationary observations. For clarity and brevity, we provide a proof sketch on a lattice environment X, a closed subset of \({{\mathbb{Z}}}^{2}\).
Theorem 1
Consider an environment X, a closed subset of the lattice \({{\mathbb{Z}}}^{2}\), with a function \(x \,\stackrel{f}{\mapsto} \,I\) that gives an image \({I}_{x}=f(x)\in {{\mathbb{R}}}^{D}\) for the image dimension D and for each position x ∈ X. Let the environment’s observations be degenerate such that

$$f(x)=f(x^{\prime})\quad\text{for some}\ x,x^{\prime}\in X\ \text{with}\ x\neq x^{\prime}.$$
There exists no decoder \(I\,\stackrel{d}{\mapsto}\,x\) that satisfies

$$d(\,f(x))=x\quad\text{for all}\ x\in X.$$
Proof
The proof is a consequence of the fact that a function has a left inverse only if it is one-to-one. Suppose there exists a decoder \(I\,\stackrel{d}{\mapsto}\,x\) that satisfies

$$d(\,f(x))=x\quad\text{for all}\ x\in X.$$

Consider the degenerate pair \(x\neq x^{\prime}\) with \(f(x)=f(x^{\prime})\). Then

$$x=d(\,f(x))=d(\,f(x^{\prime}))=x^{\prime},$$

which is a contradiction, as required.
Because Theorem 1 demonstrates there exists no decoder for a visually degenerate environment with stationary observations, an autoencoder cannot recover a visually degenerate environment; the autoencoder’s failure arises because two locations with the same observation cannot be discriminated.
Corollary 1
Consider an autoencoder \({g\,=\,{\mathrm{dec}}\,\circ}\) enc with an encoder \(I \,\stackrel{\rm{enc}}{\mapsto}\, z\) and decoder \(z \,\stackrel{\mathrm{dec}}{\mapsto}\, I\) that compresses images into a latent space \(z\in {{\mathbb{R}}}^{L}\) for the latent dimension L. There exists no decoder \(z \,\stackrel{h}{\mapsto}\, x\) that satisfies

$$h({\mathrm{enc}}(\,f(x)))=x\quad\text{for all}\ x\in X.$$
Proof
Consider the decoder \({d\,=\,{h}\,\circ}\,{\mathrm{enc}}:\,{I}\rightarrow{x}\). By Theorem 1, this decoder cannot satisfy

$$d(\,f(x))=h({\mathrm{enc}}(\,f(x)))=x\quad\text{for all}\ x\in X,$$
as required.
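Theorem 1 can be illustrated with a few lines of code: in a toy environment with two positions emitting the same image (a stand-in for the two red rooms), any decoder that depends on the image alone must mislocate at least one of them:

```python
# Toy degenerate environment: position -> observed "image" (a hashable
# stand-in for the pixel vector).  Two positions emit the same image.
image_at = {
    (0, 0): "red",
    (1, 0): "green",
    (2, 0): "red",    # visually degenerate with (0, 0)
    (3, 0): "blue",
}

# The best any image-only decoder can do is pick one position per image;
# here it keeps the first position seen for each image.
decoder = {}
for pos, img in image_at.items():
    decoder.setdefault(img, pos)

# Decode every position from its own image.
decoded = {pos: decoder[img] for pos, img in image_at.items()}
wrong = [pos for pos, guess in decoded.items() if guess != pos]
```

Exactly the degenerate position is mislocated: both red positions decode to (0, 0), so no assignment of images to positions can be correct everywhere, which is the content of the theorem.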
Predictive coding generates units with localized receptive fields that support vector navigation
In the previous section, we demonstrated that the predictive coding neural network captures spatial relationships within an environment, encoding more spatial information than an autoencoder network that encodes image similarity. Here we analyse the structure of the spatial code learned by the predictive coding network. We demonstrate that each unit in the neural network’s latent space activates at distinct, localized regions, akin to place fields in the mammalian brain, in the environment’s physical space (Fig. 5a). These place fields overlap, and their aggregate covers the entire physical space. Each physical location is represented by a unique combination of overlapping regions encoded by the latent units. This combination of overlapping regions recovers the agent’s current physical position. Furthermore, given two physical locations, there exist two distinct combinations of overlapping regions in latent space. Vector navigation is the representation of the vector heading to a goal location from a current location^{51}. We show that overlapping regions (or place fields) can give a heading from a current location to a goal location. Specifically, a linear decoder recovers the vector to a goal location from a starting location by taking the difference in place fields, which supports vector navigation (Supplementary Fig. 1). Traditionally, other studies^{51} consider grid-cell-supported vector navigation, whereas we only consider vector navigation using place cells.
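A sketch of place-field-supported vector navigation under invented parameters (256 random circular fields of radius 12 over the 40 × 65 arena): positions are encoded by which fields contain them, and a linear decoder is fitted to map the difference of two codes to the displacement vector between the locations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields = 256
centers = rng.uniform([0, 0], [40, 65], size=(n_fields, 2))
radius = 12.0   # invented field size

def place_code(x):
    """Binary activation vector: which circular fields contain position x."""
    return (np.linalg.norm(centers - x, axis=1) < radius).astype(float)

# Fit a linear decoder W: code(goal) - code(start) -> goal - start.
starts = rng.uniform([0, 0], [40, 65], size=(2000, 2))
goals = rng.uniform([0, 0], [40, 65], size=(2000, 2))
dcodes = np.array([place_code(g) - place_code(s) for s, g in zip(starts, goals)])
W, *_ = np.linalg.lstsq(dcodes, goals - starts, rcond=None)

# Held-out start/goal pairs: decode the displacement from the code difference.
test_s = rng.uniform([0, 0], [40, 65], size=(50, 2))
test_g = rng.uniform([0, 0], [40, 65], size=(50, 2))
v_hat = np.array([place_code(g) - place_code(s)
                  for s, g in zip(test_s, test_g)]) @ W
mean_err = float(np.mean(np.linalg.norm(v_hat - (test_g - test_s), axis=1)))
```

The decoded displacement is far more accurate than chance (random displacements in this arena average several tens of lattice units), illustrating that differences of overlapping place-field codes carry heading information that a linear readout can extract.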
To support this proposed mechanism, we first demonstrate that the neural network generates place fields, that is, units in the neural network's latent space that produce localized regions in physical space. To determine whether a latent unit is active, we threshold its continuous value at its 90th-percentile value. The agent's head direction varies during data collection to ensure the regions are stable across all head directions. To measure a latent unit's localization in physical space, we fit each latent unit's distribution over physical space to a two-dimensional Gaussian distribution (Fig. 5c, top), defined by
\(P(x)=\frac{1}{2\pi \sqrt{\det \Sigma }}\exp \left(-\frac{1}{2}{(x-\mu )}^{\top }{\Sigma }^{-1}(x-\mu )\right)\)
for the covariance matrix Σ and the mean vector μ. We measure the area of the ellipsoid given by the Gaussian approximation where P ≥ 0.0005 (Fig. 5c, bottom). The area of the latent unit approximation measures how localized a unit is compared to the environment’s area, which measures 40 × 65 = 2,600 lattice units. The latent unit approximations have a mean area of 254.6 lattice units (9.79% of the environment) and 80% of the areas are <352.6 lattice units (13.6% of the environment).
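This localization metric can be sketched numerically. The following is a minimal illustration, not the authors' code: it fits a two-dimensional Gaussian to the positions where a hypothetical latent unit is active and computes the area of the iso-density ellipse where P ≥ 0.0005, relative to the 40 × 65 = 2,600 lattice-unit arena. The function name `place_field_area` and the synthetic blob of positions are assumptions for illustration.

```python
import numpy as np

def place_field_area(active_xy, p_thresh=0.0005):
    """Fit a 2D Gaussian to the locations where a latent unit is active
    and return the area (in lattice units) of the iso-density ellipse
    where the fitted density exceeds p_thresh."""
    mu = active_xy.mean(axis=0)
    sigma = np.cov(active_xy.T)          # 2 x 2 covariance matrix
    det = np.linalg.det(sigma)
    # P(x) = exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / (2*pi*sqrt(det))
    # P(x) >= p  <=>  (x-mu)^T Sigma^-1 (x-mu) <= c, with:
    c = -2.0 * np.log(p_thresh * 2.0 * np.pi * np.sqrt(det))
    if c <= 0:                           # threshold lies above the density peak
        return 0.0
    # The ellipse {q : q^T Sigma^-1 q <= c} has area pi * c * sqrt(det(Sigma))
    return np.pi * c * np.sqrt(det)

# Hypothetical usage: a unit active in a compact blob of the 40 x 65 arena
rng = np.random.default_rng(0)
blob = rng.normal(loc=[20.0, 30.0], scale=[3.0, 4.0], size=(500, 2))
area = place_field_area(blob)
frac = area / (40 * 65)                  # fraction of the 2,600-unit arena
```

For a blob of this spread, the ellipse covers roughly a tenth of the arena, which is the same order as the mean field area reported above.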
The units in the neural network’s latent space provide a unique combinatorial code for each spatial position. The aggregate of latent units covers the environment’s entire physical space. At each lattice block in the environment, we calculate the number of active latent units (Fig. 5d, left). The number of active latent units is different in 87.6% of the lattice blocks. Every lattice block has at least one active latent unit, which indicates the aggregate of the latent units covers the environment’s physical space. Moreover, to ensure the regions remain stable across shifting landmarks, the environment’s trees were removed and randomly redistributed in the environment (Supplementary Fig. 5a,b). The regions remain stable after changing the tree landmarks, with a Jaccard index (∣S_{new} ∩ S_{old}∣/∣S_{new} ∪ S_{old}∣) (or the intersection over union of new regions S_{new} and old regions S_{old}) of 0.828.
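The stability measure used here is a plain Jaccard index over sets of active lattice blocks. A minimal sketch with hypothetical boolean occupancy masks (the masks and their shapes are illustrative, not the paper's data):

```python
import numpy as np

def jaccard(mask_new, mask_old):
    """Jaccard index |S_new ∩ S_old| / |S_new ∪ S_old| between two
    boolean occupancy masks over the lattice."""
    inter = np.logical_and(mask_new, mask_old).sum()
    union = np.logical_or(mask_new, mask_old).sum()
    return inter / union if union else 1.0

# Hypothetical masks over the 40 x 65 lattice: two overlapping place fields
a = np.zeros((40, 65), dtype=bool); a[5:15, 10:30] = True
b = np.zeros((40, 65), dtype=bool); b[8:18, 10:30] = True
overlap = jaccard(a, b)   # intersection 140 blocks, union 260 blocks
```

A value near 1, such as the 0.828 reported above, indicates that most active blocks are shared between the two conditions.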
Lastly, we demonstrate that the neural network can measure physical distances and could perform vector navigation, representing the vector heading from a current location to a goal location, by comparing the combinations of overlapping regions in its latent space. We first determine the active latent units by thresholding each continuous value at its 90th-percentile value. At each position, we then have a 128-dimensional binary vector that gives the overlap of the 128 latent units. We take the bitwise difference z_{1} − z_{2} between the overlapping codes z_{1} and z_{2} at two positions x_{1} and x_{2} with vector displacement x_{1} − x_{2} (Supplementary Fig. 1a). We then fit a linear decoder from the code z_{1} − z_{2} to the vector displacement x_{1} − x_{2},
\({\hat{x}}_{1}-{\hat{x}}_{2}=W({z}_{1}-{z}_{2})+b\)
for the weight W and bias b. The predicted distance error \(\Vert r-\hat{r}\Vert\) and the predicted direction error \(\Vert \theta -\hat{\theta }\Vert\) are decomposed from the predicted displacement \({\hat{x}}_{1}-{\hat{x}}_{2}\). The linear decoder has a low prediction error for distance (80% of errors <12.49 lattice units; mean 7.89 lattice units) and direction (80% of errors <48.04°; mean 30.6°) (Supplementary Fig. 1b,c). The code z_{1} − z_{2} is highly correlated with the direction θ and distance r, with Pearson correlation coefficients of 0.924 and 0.718, respectively (Supplementary Fig. 1d).
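Fitting such a linear decoder amounts to ordinary least squares from code differences to displacements. A hedged sketch with synthetic stand-ins for the binary codes and positions (`dz`, `dx` and `W_true` are illustrative variables, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: differences of 128-d codes at pairs of positions and
# the true displacement between them (stand-ins for z1 - z2 and x1 - x2).
n, L = 2000, 128
W_true = rng.normal(size=(2, L)) * 0.1
dz = rng.integers(-1, 2, size=(n, L)).astype(float)   # entries in {-1, 0, 1}
dx = dz @ W_true.T + rng.normal(scale=0.05, size=(n, 2))

# Fit the linear decoder  dx ≈ dz @ W.T + b  by least squares.
A = np.hstack([dz, np.ones((n, 1))])                  # append a bias column
coef, *_ = np.linalg.lstsq(A, dx, rcond=None)
W, b = coef[:-1].T, coef[-1]

pred = dz @ W.T + b
# Decompose the error into a distance component (direction is analogous).
r, r_hat = np.linalg.norm(dx, axis=1), np.linalg.norm(pred, axis=1)
dist_err = np.abs(r - r_hat).mean()
```

On data generated from a linear map, the recovered decoder drives the mean distance error down to the noise level, mirroring the low prediction errors reported above.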
We can measure the correspondence between the bitwise distance ∣z_{1} − z_{2}∣ and the physical distance \({\Vert {x}_{1}-{x}_{2}\Vert }_{{\ell }_{2}}\), which uses the Euclidean norm \({\Vert x\Vert }_{{\ell }_{2}}=\sqrt{\mathop{\sum }_{i=1}^{D}{x}_{i}^{2}}\) for dimension D. For the bitwise distance, we threshold the latent units at their 90th-percentile values and then compute the \({\ell }_{1}\)-norm (\({\Vert x\Vert }_{{\ell }_{1}}=\mathop{\sum }_{i=1}^{D}| {x}_{i}|\) for dimension D) between the binary vectors. As in the previous sections, we compute the joint densities of the binary vectors' bitwise distances and the physical positions' Euclidean distances, and we calculate their mutual information to measure how much spatial information the bitwise distance captures. The proposed mechanism for the neural network's distance measurement, the binary vectors' Hamming distance, gives a mutual information of 0.542 bits, compared with the predictive coder's mutual information of 0.627 bits and the autoencoder's mutual information of 0.227 bits (Fig. 5e). The code from the overlapping regions therefore captures the majority of the predictive coder's spatial information.
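A histogram-based mutual-information estimate of the kind described can be sketched as follows; the 32-bin discretization and the synthetic distance samples are assumptions for illustration, not the paper's estimator or data.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Estimate I(A;B) in bits from paired samples via a joint histogram."""
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = pxy / pxy.sum()                          # joint density
    px = pxy.sum(axis=1, keepdims=True)            # marginal of A
    py = pxy.sum(axis=0, keepdims=True)            # marginal of B
    nz = pxy > 0                                   # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

# Hypothetical check: Hamming-like distances that noisily track physical distance
rng = np.random.default_rng(2)
phys = rng.uniform(0, 50, size=5000)
hamming = phys + rng.normal(scale=5.0, size=5000)  # noisy monotone relation
mi = mutual_information(hamming, phys)
```

The tighter the relation between the two distance measures, the higher the estimated mutual information, which is the basis of the bit-valued comparisons above.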
Discussion
Mapping is a general mechanism for generating an internal representation of sensory information. While spatial maps facilitate navigation and planning within an environment, mapping is a ubiquitous neural function that extends to representations beyond visual–spatial mapping. The primary sensory cortex, for example, maps tactile events topographically. Physical touches that occur in proximity are mapped in proximity for both the neural representations and the anatomical brain regions^{52}. Similarly, the cortex maps natural speech by tiling regions with different words and their relationships, which shows that topographic maps in the brain extend to higher-order cognition. The similar representation of nonspatial and spatial maps in the brain suggests a common mechanism for charting cognitive maps^{53}. However, it is unclear how a single mechanism can generate both spatial and nonspatial maps.
Here we show that predictive coding provides a basic, general mechanism for charting spatial maps by predicting sensory data from past sensory experiences—including environments with degenerate observations. Our theoretical framework applies to any vectorvalued sensory data and could be extended to auditory data, tactile data or tokenized representations of language. We demonstrate a neural network that performs predictive coding and can construct an implicit spatial map of an environment by assembling information from local paths into a global frame within the neural network’s latent space. The implicit spatial map depends specifically on the sequential task of predicting future visual images. Neural networks trained as autoencoders do not reconstruct a faithful geometric representation in the presence of physically distant yet visually similar landmarks.
Moreover, we study the predictive coding neural network’s representation in latent space. Each unit in the network’s latent space activates at distinct, localized regions—called place fields—with respect to physical space. At each physical location, there exists a unique combination of overlapping place fields. At two locations, the differences in the combinations of overlapping place fields provide the distance between the two physical locations. The existence of place fields in both the neural network and the hippocampus^{16} suggests that predictive coding is a universal mechanism for mapping. In addition, vector navigation emerges naturally from predictive coding by computing distances from overlapping place field units. Predictive coding may provide a model for understanding how place cells emerge, change and function.
Predictive coding can be performed over any sensory modality that has a temporal sequence. As natural speech forms a cognitive map, predictive coding may underlie the geometry of human language. Intriguingly, large language models, which train on causal word prediction (a form of predictive coding), build internal maps that support generalized reasoning, answer questions and mimic other forms of higher-order reasoning^{54}. Similarities between spatial and nonspatial maps in the brain suggest that large language models organize language into a cognitive map and chart concepts geometrically. These results all suggest that predictive coding might provide a unified theory for building representations of information, connecting disparate theories including place cell formation in the hippocampus, somatosensory maps in the cortex and human language.
Methods
Environment simulation
Forest–cave–river environment
These experiments leverage the Malmo framework^{44} to construct a controlled environment within Minecraft. This environment is a rectangular space measuring 40 by 65 lattice units and incorporates three key visual features: a prominent cave serving as a global landmark, a forest area introducing some visual ambiguity between scenes and a river with a bridge that restricts agent movement options. Within this environment, an agent traverses paths between randomly chosen waypoints. These paths are determined using the A^{*} search algorithm to ensure that obstacles do not block the agent's path. The agent varies its speed and direction to traverse the generated paths. During its exploration, the agent captures visual observations at regular intervals along each path.
Circular environment
To explore the model’s ability to differentiate between visually identical but spatially distinct scenes, these experiments used a circular corridor environment. This environment consists of an infinitely repeating sequence of rooms, specifically coloured red, green, red, blue and yellow in a clockwise direction. Notably, there are two distinct red rooms despite their identical appearance. Technically, the environment is an infinitely long hallway segmented into these coloured rooms. Similar to the previous experiment, an agent navigates between randomly chosen waypoints within this environment. The paths are determined using the A^{*} search algorithm, and the agent captures visual observations at regular intervals along its journey.
Predictive coder
Architecture
The proposed neural network follows an encoder–decoder architecture, employing a UNet structure to process input image sequences and predict future images. The encoder and decoder components are both based on ResNet18 convolutional neural networks.
The encoding module utilizes a ResNet18 model to extract hierarchical features from the input image sequence. Each image in the sequence is processed independently through the ResNet18 encoder, generating a sequence of latent vectors. The encoder consists of residual blocks, each containing convolutional layers, batch normalization and rectified linear unit (ReLU) activations. The downsampling is achieved via strided convolutions within the residual blocks.
The self-attention module utilizes multi-headed attention, which processes the sequence of encoded latent units to encode the history of past visual observations. The network consists of one layer of multi-headed attention with h = 8 heads. For encoded latent units with dimension D = C × H × W, the dimension of a single head is d = (C × H × W)/h.
The latent vectors output by the encoder are concatenated to form an ordered sequence. This sequence is then processed by a self-attention layer to capture temporal dependencies and relationships among the image sequence. The self-attention mechanism enables the model to weigh the importance of each latent vector in the context of the entire sequence, facilitating improved temporal feature representation.
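The self-attention step can be sketched at the level of linear algebra. This minimal NumPy implementation uses random projection matrices as stand-ins for the learned weights, with h = 8 heads over a short sequence of latent vectors; it illustrates how each latent vector is weighted against the whole sequence, and is not the trained network.

```python
import numpy as np

def multi_head_self_attention(X, n_heads=8, seed=0):
    """Single layer of multi-head self-attention over a sequence of latent
    vectors X (shape T x D), with random weights as illustrative stand-ins."""
    T, D = X.shape
    d = D // n_heads                            # per-head dimension d = D / h
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(n_heads):
        q, k, v = (M[:, i * d:(i + 1) * d] for M in (Q, K, V))
        scores = q @ k.T / np.sqrt(d)           # scaled dot-product attention
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)       # softmax over the sequence
        heads.append(w @ v)                     # weighted mix of the sequence
    return np.concatenate(heads, axis=1)        # back to shape T x D

seq = np.random.default_rng(1).normal(size=(10, 128))  # 10 latent vectors
out = multi_head_self_attention(seq)
```

Each output row is a sequence-wide weighted combination, which is how the module integrates the history of past observations.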
The decoding module mirrors the encoder’s architecture, utilizing a ResNet18 model adapted for upsampling. The decoder reconstructs the future images from the transformed latent vectors, employing transposed convolutions and residual blocks analogous to those in the encoder.
Training
The predictive coder is trained for 200 epochs using stochastic gradient descent as the optimization algorithm. The training parameters include a learning rate of 0.1, Nesterov momentum of 0.9 and a weight decay of 5 × 10^{−6}. To optimize the learning process, the learning rate is scheduled using the OneCycle learning-rate policy. This policy adjusts the learning rate cyclically between a lower and upper bound, facilitating efficient convergence and improved performance. The OneCycle learning-rate schedule is characterized by an initial increase in the learning rate, followed by a subsequent decrease.
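The shape of a OneCycle schedule can be sketched as follows. This piecewise-linear form is only an illustration (PyTorch's `OneCycleLR` anneals with a cosine by default), and every parameter value besides the 0.1 peak learning rate is an assumption.

```python
def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div=25.0, final_div=1e4):
    """Piecewise-linear sketch of a OneCycle schedule: ramp up from
    max_lr/div to max_lr, then anneal down to max_lr/final_div."""
    warm = pct_start * total_steps
    if step < warm:                              # warm-up phase
        t = step / warm
        return max_lr / div + t * (max_lr - max_lr / div)
    t = (step - warm) / (total_steps - warm)     # annealing phase
    return max_lr - t * (max_lr - max_lr / final_div)

# The learning rate rises to the 0.1 peak, then decays toward ~1e-5.
lrs = [one_cycle_lr(s, 1000) for s in range(1000)]
```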
Latent units
The predictive coder’s encoding and selfattention modules were used to analyse the encoded sequences as the predictive coder’s latent units. The image sequence first undergoes processing through the encoder, which extracts a compressed representation capturing the key features within each image. Subsequently, this encoded sequence is fed into the selfattention module. This module specifically focuses on the inherent temporal order of the images within the sequence. The selfattention module’s processed output forms the predictive coder’s latent units.
Autoencoder
Architecture
Unlike the predictive coder, the autoencoder transforms the current images, rather than the past images, into a low-dimensional latent vector. The proposed neural network follows an encoder–decoder architecture employing a UNet structure to process input images into a low-dimensional latent vector and to reconstruct the initial image. The encoder and decoder components are both based on ResNet18 convolutional neural networks. However, the autoencoder architecture does not utilize any self-attention layers to integrate past observations of images.
The encoding module utilizes a ResNet18 model to extract hierarchical features from the input image sequence. Each image in the sequence is processed independently through the ResNet18 encoder, generating a sequence of latent vectors. The encoder consists of residual blocks, each containing convolutional layers, batch normalization and ReLU activations. The downsampling is achieved via strided convolutions within the residual blocks.
Unlike in the predictive coder, the latent vectors output by the encoder are processed directly by the decoder. Whereas the predictive coder predicts the future images within an image sequence, the autoencoder reconstructs the current images, given the low-dimensional latent vector generated by the encoder.
The decoding module mirrors the encoder's architecture, utilizing a ResNet18 model adapted for upsampling. The decoder reconstructs the current images from the latent vectors, employing transposed convolutions and residual blocks analogous to those in the encoder.
Training
The autoencoder is trained for 200 epochs using stochastic gradient descent as the optimization algorithm. The training parameters include a learning rate of 0.1, Nesterov momentum of 0.9 and a weight decay of 5 × 10^{−6}. To optimize the learning process, the learning rate is scheduled using the OneCycle learning-rate policy. This policy adjusts the learning rate cyclically between a lower and upper bound, facilitating efficient convergence and improved performance. The OneCycle learning-rate schedule is characterized by an initial increase in the learning rate, followed by a subsequent decrease.
Latent units
The autoencoder’s encoding module was used to analyse the encoded images as the autoencoder’s latent units. The image sequence first undergoes processing through the encoder, which extracts a compressed representation capturing the key features within each image. The encoder’s processed output forms the autoencoder’s latent units.
Positional decoder
To assess the effectiveness of the predictive coder in capturing positional information within the encoded sequences, this analysis employed an auxiliary neural network for position prediction. This network, referred to as the positional decoder, takes the latent units generated by the predictive coder—or autoencoder—as input. The decoder architecture consists of several layers designed to extract this positional information: a convolutional layer transforms the input to a higher dimension (256), followed by a ReLU activation for nonlinearity. A max pooling layer then reduces the spatial resolution while maintaining relevant features. Subsequently, two fully connected (affine) layers with ReLU activations project the data to a lower dimension (64) and finally to a two-dimensional output, corresponding to the agent's predicted position (x and y coordinates).
During training, the mean-squared error between the agent's actual position \(x\) and the predicted position \(\hat{x}\) served as the loss function
\({\mathcal{L}}=\frac{1}{N}\mathop{\sum }_{i=1}^{N}{\left\Vert {x}_{i}-{\hat{x}}_{i}\right\Vert }^{2}\)
To optimize this loss, the AdamW optimizer was employed with a two-stage learning-rate schedule. The initial stage utilized a learning rate of 10^{−4} for 1,000 epochs, followed by a fine-tuning stage with a reduced learning rate of 10^{−5} for an additional 1,000 epochs.
Modelling the correspondence between latent and physical distances
This analysis evaluated the ability of the predictive coder's latent space to encode local positional information. For each path traversed by the agent, we computed the pairwise distances between positions in physical space and the corresponding latent space distances within a neighbourhood of 100 time steps. To assess the correspondence between these two distance measures, we analysed the joint distribution of physical and latent space distances. We modelled the relationship between the latent distances \({d}_{z}\) and their corresponding physical distances \({d}_{x}\) using a logarithmic function with additive Gaussian noise
\({d}_{z}=a\log {d}_{x}+b+\epsilon ,\quad \epsilon \sim {\mathcal{N}}(0,{\sigma }^{2})\)
The goodness-of-fit between the model and the actual data was evaluated using two metrics: the Pearson correlation coefficient, which measures the dependence between the physical and latent distances, and the Kullback–Leibler divergence
\({D}_{\mathrm{KL}}(P\,\Vert \,Q)=\sum _{x}P(x)\log \frac{P(x)}{Q(x)},\)
which quantifies the difference between the modelled regression distribution and the observed empirical distribution.
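The regression step can be sketched as follows, assuming a model of the form latent ≈ a·log(physical) + b with Gaussian noise; the coefficients, synthetic distance samples and omission of the Kullback–Leibler comparison are all illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical paired samples: physical distances and latent distances
# generated from the assumed model  d_latent = a * log(d_phys) + b + noise.
d_phys = rng.uniform(1.0, 100.0, size=4000)
d_latent = 2.0 * np.log(d_phys) + 1.0 + rng.normal(scale=0.3, size=4000)

# Least-squares fit of the logarithmic regression.
A = np.column_stack([np.log(d_phys), np.ones_like(d_phys)])
(a, b), *_ = np.linalg.lstsq(A, d_latent, rcond=None)

# Pearson correlation between the observed and modelled latent distances.
pred = a * np.log(d_phys) + b
r = np.corrcoef(d_latent, pred)[0, 1]
```

A high correlation indicates that the logarithmic model captures the dependence between latent and physical distances.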
Mutual information of the predictive coder and autoencoder
The spatial information encoded within the latent representations of both the predictive coder and the autoencoder was evaluated. To achieve this, this analysis computed the joint densities between the latent distances in each model and the corresponding actual physical distances within the environment. By analysing these joint densities, we were able to quantify the physical information within each model's latent space. The mutual information
\(I(X;Y)=\sum _{x,y}p(x,y){\log }_{2}\frac{p(x,y)}{p(x)\,p(y)}\)
was employed as a metric to assess this physical information. Higher mutual information indicates that the latent distances in a model encode a greater amount of spatial information, signifying a stronger correlation between the distances in the latent space and the actual physical separations between locations in the environment. This comparison allows us to gauge the relative effectiveness of each model in capturing and representing spatial relationships within their respective latent spaces.
Place field analysis
Place field calculation
This analysis investigated the spatial localization of individual units within the neural network's latent space. First, this analysis computed the histogram of the distribution of the 128-dimensional latent vectors. To identify active units, this analysis employed a thresholding technique based on the 90th-percentile value of the continuous latent unit values. This ensured a focus on units with notable activation levels. The agent's head direction was varied during data collection to ensure the identified localized regions remained stable regardless of the agent's orientation.
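This thresholding step reduces to a per-unit percentile cut. A minimal sketch over synthetic activations (the Gaussian samples are a stand-in for the network's continuous latent values):

```python
import numpy as np

def binarize_latents(Z, pct=90):
    """Threshold each latent unit at its own 90th-percentile value,
    turning continuous activations (N samples x 128 units) into a
    binary activity code."""
    thresh = np.percentile(Z, pct, axis=0)   # one threshold per unit
    return Z >= thresh

rng = np.random.default_rng(4)
Z = rng.normal(size=(10000, 128))            # stand-in latent activations
active = binarize_latents(Z)
rate = active.mean(axis=0)                   # each unit is active ~10% of the time
```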
Place field statistical fitting
To quantify the degree of localization for each active unit, this analysis fitted a two-dimensional Gaussian distribution
\(P(x)=\frac{1}{2\pi \sqrt{\det \Sigma }}\exp \left(-\frac{1}{2}{(x-\mu )}^{\top }{\Sigma }^{-1}(x-\mu )\right)\)
to its corresponding distribution in physical space. The area of the resulting ellipsoid, defined by the Gaussian approximation and exceeding a probability threshold of P ≥ 0.0005, served as our localization metric. This area reflects the spatial extent of the unit’s activation within the environment, relative to the overall environment size of 40 × 65 lattice units (2,600 units). Units with smaller ellipsoid areas indicate a more concentrated activation pattern in physical space, suggesting a higher degree of localization.
Vector navigation analysis
This analysis investigated the ability of the neural network's latent space not only to encode positional information but also to represent the vector heading from a current location to a goal location, termed vector navigation. To assess this, we compared the combinations of overlapping regions in the latent space representations of two distinct positions x_{1} and x_{2} by computing the bitwise difference z_{1} − z_{2} between the corresponding latent codes z_{1}, z_{2}. Subsequently, we examined the relationship between this difference vector and the actual physical displacement vector x_{1} − x_{2} using a linear decoder
\({\hat{x}}_{1}-{\hat{x}}_{2}=W({z}_{1}-{z}_{2})+b.\)
This decoder was trained to predict the displacement vector based solely on the latent code difference. The predicted displacement was then decomposed into its distance and directional components, to calculate the specific errors associated with predicting both the distance and direction to the goal location. This analysis computed the Pearson correlation coefficient between the predicted distance, predicted direction and the predicted displacement vector.
Mutual information calculation
This analysis employed a complementary approach to evaluate the spatial information encoded within the binary vectors derived from the latent space. Here the joint densities were computed between the bitwise distances of these binary vectors and the Euclidean distances between corresponding physical positions. The mutual information
\(I(X;Y)=\sum _{x,y}p(x,y){\log }_{2}\frac{p(x,y)}{p(x)\,p(y)}\)
was then computed to quantify the amount of spatial information captured by the bitwise distances. This metric essentially reflects how well the bitwise distance between latent codes reflects the actual physical separation between locations in the environment. Finally, to provide context for the obtained value, the mutual information of the binary vectors’ bitwise distance was compared with the mutual information derived from the latent distances of both the predictive coder and the autoencoder. This comparison assesses the relative effectiveness of each model in capturing spatial information within their respective latent representations.
Place field stability with shifting landmarks
To assess the stability of the identified localized regions within the latent space, this analysis investigated their resilience to changes in the environment's landmarks. The environment was manipulated: the trees, originally serving as landmarks, were removed and then randomly redistributed throughout the space. Subsequently, the Jaccard index
\(J({S}_{\mathrm{new}},{S}_{\mathrm{old}})=\frac{| {S}_{\mathrm{new}}\cap {S}_{\mathrm{old}}| }{| {S}_{\mathrm{new}}\cup {S}_{\mathrm{old}}| }\)
was employed to quantify the overlap between the latent units identified in the original environment and those found in the environment with shifted landmarks. The Jaccard index ranges from 0 to 1, where a value of 1 indicates a perfect overlap between the sets of latent units, and 0 signifies no overlap. This analysis allowed us to evaluate how well the latent units maintain their spatial correspondence despite alterations to the environment’s visual features.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets supporting the findings of this study, including the latent variables for the autoencoding and predictive coding neural networks, as well as the training and validation datasets, are available on GitHub at https://github.com/jgornet/predictivecodingrecoversmaps and via Zenodo at https://doi.org/10.5281/zenodo.11287439 (ref. ^{55}).
Code availability
The code supporting the conclusions of this study is available on GitHub at https://github.com/jgornet/predictivecodingrecoversmaps and via Zenodo at https://doi.org/10.5281/zenodo.11287439 (ref. ^{55}). The repository contains the Project Malmo environment code, training scripts for both the predictive coding and autoencoding neural networks, as well as code for the analysis of predictive coding and autoencoding results.
References
Epstein, R. A., Patai, E. Z., Julian, J. B. & Spiers, H. J. The cognitive map in humans: spatial navigation and beyond. Nat. Neurosci. 20, 1504–1513 (2017).
Wang, Z. J. & Thomson, M. Localization of signaling receptors maximizes cellular information acquisition in spatially structured natural environments. Cell Syst. 13, 530–546 (2022).
Sivak, D. A. & Thomson, M. Environmental statistics and optimal regulation. PLoS Comput. Biol. 10, e1003826 (2014).
Anderson, J. Cognitive Psychology and Its Implications 9th edn (Worth Publishers, 2020).
Rescorla, M. Cognitive maps and the language of thought. Br. J. Philos. Sci. 60, 377–407 (2009).
Whittington, J. C., McCaffary, D., Bakermans, J. J. & Behrens, T. E. How to build a cognitive map. Nat. Neurosci. 25, 1257–1272 (2022).
Aronov, D., Nevers, R. & Tank, D. W. Mapping of a nonspatial dimension by the hippocampal–entorhinal circuit. Nature 543, 719–722 (2017).
Nieh, E. H. et al. Geometry of abstract learned knowledge in the hippocampus. Nature 595, 80–84 (2021).
Whittington, J. C. et al. The Tolman-Eichenbaum machine: unifying space and relational memory through generalization in the hippocampal formation. Cell 183, 1249–1263 (2020).
Wilson, R. C., Takahashi, Y. K., Schoenbaum, G. & Niv, Y. Orbitofrontal cortex as a cognitive map of task space. Neuron 81, 267–279 (2014).
Constantinescu, A. O., O’Reilly, J. X. & Behrens, T. E. J. Organizing conceptual knowledge in humans with a gridlike code. Science 352, 1464–1468 (2016).
Garvert, M. M., Dolan, R. J. & Behrens, T. E. A map of abstract relational knowledge in the human hippocampal–entorhinal cortex. eLife 6, e17086 (2017).
Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E. & Gallant, J. L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458 (2016).
Corkin, S. Lasting consequences of bilateral medial temporal lobectomy: clinical course and experimental findings in H.M. Semin. Neurol. 4, 249–259 (1984).
Behrens, T. E. et al. What is a cognitive map? Organizing knowledge for flexible behavior. Neuron 100, 490–509 (2018).
O’Keefe, J. Place units in the hippocampus of the freely moving rat. Exp. Neurol. 51, 78–109 (1976).
Hafting, T., Fyhn, M., Molden, S., Moser, M.-B. & Moser, E. I. Microstructure of a spatial map in the entorhinal cortex. Nature 436, 801–806 (2005).
Amaral, D. G., Ishizuka, N. & Claiborne, B. in Understanding the Brain Through the Hippocampus: the Hippocampal Region as a Model for Studying Brain Structure and Function (eds StormMathisen, J. et al.) Ch 1 (1990).
Cueva, C. J. & Wei, X.X. Emergence of gridlike representations by training recurrent neural networks to perform spatial localization. In Proc. 6th International Conference on Learning Representations (ICLR) 1512–1530 (Curran Associates, Inc., 2018).
Banino, A. et al. Vectorbased navigation using gridlike representations in artificial agents. Nature 557, 429–433 (2018).
Crane, K., Weischedel, C. & Wardetzky, M. The heat method for distance computation. Commun. ACM 60, 90–99 (2017).
Zhang, T., Rosenberg, M., Jing, Z., Perona, P. & Meister, M. Endotaxis: A neuromorphic algorithm for mapping, goallearning, navigation, and patrolling. eLife 12, RP84141 (2023).
Thrun, S. & Montemerlo, M. The Graph SLAM algorithm with applications to largescale mapping of urban structures. Int. J. Robot. Res. 25, 403–429 (2006).
MurArtal, R. & Tardós, J. D. Visualinertial monocular SLAM with map reuse. IEEE Robot. Autom. Lett. 2, 796–803 (2017).
Mourikis, A. I. & Roumeliotis, S. I. A multistate constraint Kalman filter for visionaided inertial navigation. In Proc. 2007 IEEE International Conference on Robotics and Automation 3565–3572 (IEEE, 2007).
Lynen, S. et al. Get out of my lab: large-scale, real-time visual-inertial localization. In Proc. Robotics: Science and Systems XI (eds Kavraki, L. E., Hsu, D. & Buchli, J.) (RSS, 2015); https://doi.org/10.15607/RSS.2015.XI.037
Gupta, S. et al. Cognitive mapping and planning for visual navigation. In Proc. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 7272–7281 (IEEE, 2017).
Mirowski, P. et al. Learning to navigate in cities without a map. In Proc. 32nd International Conference on Neural Information Processing Systems (eds Bengio, S. & Wallach, H.M.) 2424–2435 (Curran Associates, Inc., 2018).
Duan, Y. et al. RL^{2}: fast reinforcement learning via slow reinforcement learning. Preprint at https://doi.org/10.48550/arXiv.1611.02779 (2016).
Higgins, I. et al. DARLA: improving zero-shot transfer in reinforcement learning. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 1480–1490 (PMLR, 2017); https://proceedings.mlr.press/v70/higgins17a.html
Seo, Y., Lee, K., James, S. L. & Abbeel, P. Reinforcement learning with actionfree pretraining from videos. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 19561–19579 (PMLR, 2022); https://proceedings.mlr.press/v162/seo22a.html
Lee, T. S. & Mumford, D. Hierarchical Bayesian inference in the visual cortex. JOSA A 20, 1434–1448 (2003).
Mumford, D. in First European Congress of Mathematics. Progress in Mathematics Vol. 3 (eds Joseph, A. et al.) 187–224 (Springer, 1994).
Rao, R. P. N. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extraclassical receptivefield effects. Nat. Neurosci. 2, 79–87 (1999).
Poincaré, H. The Foundations of Science: Science and Hypothesis, the Value of Science, Science and Method (Cambridge Univ. Press, 2015).
O’Keefe, J. & Nadel, L. The Hippocampus as a Cognitive Map (Clarendon Press, Oxford Univ. Press, 1978).
Thrun, S., Burgard, W. & Fox, D. Probabilistic Robotics (MIT Press, 2005).
Stachenfeld, K. L., Botvinick, M. M. & Gershman, S. J. The hippocampus as a predictive map. Nat. Neurosci. 20, 1643–1653 (2017).
Recanatesi, S. et al. Predictive learning as a network mechanism for extracting lowdimensional latent space representations. Nat. Commun. 12, 1417 (2021).
Fang, C., Aronov, D., Abbott, L. & Mackevicius, E. L. Neural learning rules for generating flexible predictions and computing the successor representation. eLife 12, e80680 (2023).
Dayan, P., Hinton, G. E., Neal, R. M. & Zemel, R. S. The Helmholtz machine. Neural Comput. 7, 889–904 (1995).
Luttrell, S. P. A Bayesian analysis of selforganizing maps. Neural Comput. 6, 767–794 (1994).
Tu, L. W. Differential Geometry: Connections, Curvature, and Characteristic Classes 1st edn (Springer, 2017).
Johnson, M., Hofmann, K., Hutton, T. & Bignell, D. The Malmo platform for artificial intelligence experimentation. In Proc. TwentyFifth International Joint Conference on Artificial Intelligence (ed. Brewka, G.) 4246–4247 (AAAI Press, 2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).
Ronneberger, O., Fischer, P. & Brox, T. UNet: convolutional networks for biomedical image segmentation. In Medical Image Computing and ComputerAssisted Intervention (MICCAI 2015) (eds Navab, N. et al.) 234–241 (Springer International Publishing, 2015).
Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (eds Von Luxburg, U. et al.) 5999–6009 (Curran Associates, Inc., 2017).
Sutskever, I., Martens, J., Dahl, G. & Hinton, G. On the importance of initialization and momentum in deep learning. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 1139–1147 (PMLR, 2013); https://proceedings.mlr.press/v28/sutskever13.html
Smith, L. N. & Topin, N. Superconvergence: very fast training of neural networks using large learning rates. Preprint at https://doi.org/10.48550/arXiv.1708.07120 (2018).
Tenenbaum, J. B., de Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
Bush, D., Barry, C., Manson, D. & Burgess, N. Using grid cells for navigation. Neuron 87, 507–520 (2015).
Rosenthal, I. A. et al. S1 represents multisensory contexts and somatotopic locations within and outside the bounds of the cortical homunculus. Cell Rep. 42, 112312 (2023).
Behrens, T. E. J. et al. What is a cognitive map? Organizing knowledge for flexible behavior. Neuron 100, 490–509 (2018).
Brown, T. et al. Language models are few-shot learners. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, Inc., 2020).
Gornet, J. jgornet/predictive-coding-recovers-maps: Nature Machine Intelligence pre-release. Zenodo https://doi.org/10.5281/zenodo.11287439 (2024).
Acknowledgements
We thank I. Strazhnik for her contributions to the scientific visualizations and figure illustrations; her expertise in translating our research into clear visuals significantly elevated the clarity and impact of our paper. We are grateful to T. Siapas, E. Lubenov, D. Mobbs and M. Rosenberg for their invaluable and insightful discussions. Their expertise and feedback were instrumental in the development and realization of this research. We also appreciate the insights provided by L. Xu, M. Wang and J. Zheng, which played a crucial role in refining various aspects of our study. We are thankful for the support provided by The David and Lucile Packard Foundation under grant no. 201969662, as well as the Chen Institute at Caltech, the Heritage Medical Research Institute and the Chan Zuckerberg Initiative (M.T. and J.G.).
Author information
Contributions
J.G. and M.T. contributed equally to all stages of the work. J.G. and M.T. conceived the study. M.T. and J.G. jointly derived the mathematical framework. J.G. and M.T. built the environment simulation and the corresponding dataset. J.G. and M.T. constructed the neural network architecture and oversaw its training. J.G. and M.T. performed the analysis of the results.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review information
Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary text and Figs. 1–6.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gornet, J., Thomson, M. Automated construction of cognitive maps with visual predictive coding. Nat Mach Intell 6, 820–833 (2024). https://doi.org/10.1038/s42256-024-00863-1
DOI: https://doi.org/10.1038/s42256-024-00863-1