Predictive learning extracts latent space representations from sensory observations

Neural networks have achieved many recent successes in solving sequential processing and planning tasks. Their success is often ascribed to the emergence of the task’s low-dimensional latent structure in the network activity, i.e., in the learned neural representations. Similarly, biological neural circuits and in particular the hippocampus may produce representations that organize semantically related episodes. Here, we investigate the hypothesis that representations with low-dimensional latent structure, reflecting such semantic organization, result from learning to predict observations about the world. Specifically, we ask whether and when network mechanisms for sensory prediction coincide with those for extracting the underlying latent variables. Using a recurrent neural network model trained to predict a sequence of observations in a simulated spatial navigation task, we show that network dynamics exhibit low-dimensional but nonlinearly transformed representations of sensory inputs that capture the latent structure of the sensory environment. We quantify these results using nonlinear measures of intrinsic dimensionality which highlight the importance of the predictive aspect of neural representations, and provide mathematical arguments for when and why these representations emerge. We focus on how our results can aid the analysis and interpretation of experimental data.


Introduction
Neural network representations are often described as encoding latent semantic information from a corpus of data (1-9). Similarly, the brain forms representations to help it overcome a formidable challenge: to organize episodes, tasks and behavior according to a priori unknown latent variables underlying the experienced sensory information. How does such an organization of semantic information emerge? Two related bodies of work have shown that this can occur due to the process of prediction, giving rise to predictive representations. First, neural networks are able to extract semantic characteristics from linguistic corpora when trained to predict the context in which a given word appears (10-13). The resulting neural representations of words (known as word embeddings) have geometric properties that reflect the semantic meaning of the words they represent (14). Second, models that learn to encode future sensory information give rise to internal representations that encode spatial maps useful for goal-directed behavior (9, 15-17). Characterizing predictive representations can shed light on where and how the brain exploits predictive mechanisms to semantically organize sensory information.
The hippocampus provides a case in point. While traditionally distinct theories of the hippocampus involve declarative memory (18) and spatial navigation (19), considerable effort has been devoted to reconciling these apparently contrasting views (20-23). In particular, Eichenbaum (22) proposed that the hippocampus supports a semantic relational network that organizes related episodes to subserve sequential planning (7, 9, 24). Inspired by this work, our goal here is to build theoretical and data-analytic tools that explain why a predictive learning process in neural networks leads to low-dimensional maps of the latent structure of the underlying tasks, and what the general signatures of such maps in neural recordings might be. We begin with a generative model: observations are generated from latent variables embedded in a low-dimensional manifold. In the special case of spatial navigation, the latent variables are the position and orientation of an agent in the spatial environment, and the observations are high-dimensional sensory inputs specific to a given position and orientation. The predictive learning task we study is to predict future observations. Our central question is whether a recurrent neural network (RNN) trained on this predictive learning task will extract representations of the underlying low-dimensional latent variables. We develop analytical tools to reveal the low-dimensional structure of representations created by predictive learning. Crucial to this is the distinction between linear (25-29) and nonlinear dimensionality (30, 31), which allows us to uncover what we call latent space signal transfer, wherein information about nonlinearly encoded latent variables moves into the linearly defined top principal components of the representation as learning progresses.
Latent space signal transfer is accompanied by clear trends in the linear and nonlinear dimensionality of the underlying representation manifold, and by the formation of neurons with localized activations on the nonlinear manifold, manifold cells (32). Importantly, all of these phenomena are measurable signatures of predictive learning that can be tested in data from biological or machine learning experiments.
Fig. 1. Predictive network solving a navigation task. a) Logic diagram of task and information: an agent explores a latent space X through actions and receives observations regarding it. The network's task is to predict the next sensory observation. By learning to do so it recovers information regarding the underlying hidden latent space. b) Illustration of the agent with sensors in a square maze where the walls have been colored (cf. Methods). The 5 sensors span a 90° angle and perceive the color and distance of the wall along their respective directions. The agent moves in a direction θ which is updated continuously according to draws from a Gaussian distribution (giving a random walk on a circle). c) Diagram of the predictive recurrent neural network: the network receives actions and observations as inputs and is trained to output the next sensory observation. d) Cost during training for the network (cf. Methods). e) Place cell activities: average activity of 100 neurons (one per small quadrant) against the x, y coordinates of the latent space. f) Head direction activities: average activity of 100 neurons (one per small quadrant) on the latent space against the agent's direction θ.

Predictive Learning in a RNN
In predictive learning a neural network is trained to minimize the errors between its output and a stream of future sensory observations. Here we demonstrate our main result: that the network uncovers the low-dimensional latent space structure in the course of optimizing its future predictions (cf. Fig. 1a). This occurs despite the fact that the network has no direct information regarding the latent variables generating the observations. We test our hypothesis that predictive learning extracts the underlying low-dimensional latent variables from a high-dimensional sensory stream in the context of a spatial navigation task. In spatial navigation the latent space is the set of spatial coordinates that identify the agent's state, (x, y, θ), where θ identifies its direction. The observation space depends on the agent's ability to sense the environment. The agent we consider is equipped with simple sensors that span a visual cone of 90° centered on its current direction θ. Each sensor reports the distance and color of the environment's wall along its direction (Fig. 1b). The environment the agent navigates is a discrete grid of locations. Each wall tile, one at each wall location, is colored randomly; a relatively narrow spatial autocorrelation of two tiles induces independent sensory observations across sensors. For simplicity we consider the case of random exploration, where the agent's actions do not depend on the observations. At each step the agent's direction θ is updated by a small random angle dθ. The agent then moves to the discrete grid location most aligned with the updated direction θ + dθ (unless that location is occupied by a wall; cf. Methods for details). Actions are performed by the agent with respect to its allocentric framework, so that there are nine possible choices: for each location there are eight neighbouring locations plus the possibility of remaining in the same location.
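The random exploration described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the simulation used for the figures: the function name `simulate_walk`, the grid size, the standard deviation `sigma_theta` of the heading update, and the rule that a blocked move leaves the agent in place are all hypothetical choices (the Methods define the actual procedure).

```python
import numpy as np

# Nine allocentric actions: index 0 is "stay", 1..8 are the eight neighbours.
ACTIONS = np.array([(0, 0), (1, 0), (1, 1), (0, 1), (-1, 1),
                    (-1, 0), (-1, -1), (0, -1), (1, -1)])


def simulate_walk(n_steps, grid_size=20, sigma_theta=0.3, seed=0):
    """Random exploration on a square grid (all parameters are illustrative)."""
    rng = np.random.default_rng(seed)
    pos = np.array([grid_size // 2, grid_size // 2])
    theta = rng.uniform(0.0, 2.0 * np.pi)
    # Unit vectors of the eight possible moves (the "stay" action is excluded).
    move_dirs = ACTIONS[1:] / np.linalg.norm(ACTIONS[1:], axis=1, keepdims=True)
    states = np.empty((n_steps, 3))
    actions = np.empty(n_steps, dtype=int)
    for t in range(n_steps):
        # Heading performs a Gaussian random walk on the circle.
        theta = (theta + rng.normal(0.0, sigma_theta)) % (2.0 * np.pi)
        heading = np.array([np.cos(theta), np.sin(theta)])
        # Choose the neighbouring cell most aligned with the heading.
        a = 1 + int(np.argmax(move_dirs @ heading))
        target = pos + ACTIONS[a]
        # Assumption: a move into the wall (outside the arena) becomes "stay".
        if np.any(target < 0) or np.any(target >= grid_size):
            a, target = 0, pos
        pos = target
        states[t] = (pos[0], pos[1], theta)
        actions[t] = a
    return states, actions


states, actions = simulate_walk(1000)
```

Here `states` holds the latent trajectory (x, y, θ) and `actions` the index of the chosen action at each step; rendering the sensory observations from these states is left out of the sketch.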
While the agent moves in the environment it collects a stream of observations. In predictive learning, the RNN learns to predict the upcoming sensory observation (see Fig. 1c). This is achieved by minimizing the difference between the RNN output $y_t$ at time t and the upcoming observation $o_{t+1}$ (Fig. 1d). We refer to the activations of the units of the trained RNN as its predictive representation. As the agent traverses the environment, it traces out a trajectory in three spaces: the latent variable space (x, y, θ), the observation space, and the neural activation (representation) space. As the RNN learns to predict the next observation, its representation is influenced both by the observation space (since the task is defined purely in terms of observations) and by the latent space (since the latent variables are a generative model for the observations); a priori, it is not obvious which space's influence will be stronger. At the end of learning, we find that neurons clearly encode the latent space: Fig. 1e shows how the latent variables are encoded in the neural representation space. Moreover, single neurons' receptive fields function as "place" and "border" cells that encode the latent variables x and y, and as "head direction" cells that encode θ (Fig. 1f) (19, 33, 34). Thus, the neural representation has extracted information about the latent space from the observations, without any explicit prompt to do so. In the last section, and in more depth in the Suppl. Mat., we show that this phenomenon is robust to alterations of the sensory observations and network architecture.
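To make the training target concrete, the following is a minimal sketch of a forward pass and of the predictive cost, assuming a vanilla RNN with recurrent, input and output weights `W`, `W_in`, `W_out` and nonlinearity `g`. The action inputs and the training loop are omitted, and the function names are hypothetical; this is not the authors' implementation.

```python
import numpy as np


def rnn_forward(obs, W, W_in, W_out, g=np.tanh):
    """Iterate r_t = g(W r_{t-1} + W_in o_t) and read out y_t = W_out r_t."""
    r = np.zeros(W.shape[0])
    outputs = np.empty((obs.shape[0], W_out.shape[0]))
    for t in range(obs.shape[0]):
        r = g(W @ r + W_in @ obs[t])
        outputs[t] = W_out @ r
    return outputs


def predictive_loss(obs, outputs):
    """Mean squared error between y_t and the NEXT observation o_{t+1}."""
    # outputs[:-1] are y_0..y_{T-2}; obs[1:] are o_1..o_{T-1}.
    return np.mean(np.sum((obs[1:] - outputs[:-1]) ** 2, axis=1))
```

The only difference from an auto-encoding objective is the one-step shift between outputs and targets in `predictive_loss`.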

Latent and neural representation spaces
So far, we have considered how the latent variables are represented one neuron at a time within our predictive learning RNN. How does the neural population as a whole represent the latent space? To answer this question precisely we develop methods for analyzing neural representation manifolds. We begin with the most basic characteristic of a representation manifold, its dimensionality. We start by analyzing a simplified, concrete model of latent space coding. Low-dimensional (Low-D) representation manifolds occur when a large number of neurons are strongly and consistently tuned to a small set of latent variables. Place and grid cells are examples of such coding (19, 35-37). Specifically, given two continuous variables x, y that parametrize a latent space (Fig. 2a), consider an ensemble of N neurons with Gaussian tuning curves that are centered over uniformly distributed locations on the latent space. For example, a neuron may be centered at location $(x_0, y_0)$ and have a Gaussian radial basis tuning curve of width σ, as shown in Fig. 2b. The responses of an ensemble of N neurons map the latent space manifold (Fig. 2a) to a neural response manifold embedded in neural representation space (that is, the N-dimensional space spanned by the activity of all neurons in the population). To visualize the response manifold, we project it onto its first three Principal Components (PCs), Fig. 2c. As the agent traverses a trajectory $x_t$ in the 2d latent space (Fig. 2a, grayscale), the representation $r_t$ traces out a trajectory on the response manifold (Fig. 2c, grayscale). We can view the tuning curve of a single neuron (Fig. 2b) on the response manifold to obtain the manifold tuning curve of this neuron (Fig. 2d). In the next section we will analyze in more depth the meaning and properties of manifold tuning curves. The two dimensions of the latent space completely parametrize the response manifold, resulting in a two-dimensional curved surface.
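The toy model above is easy to reproduce numerically. The sketch below builds a population of Gaussian-tuned neurons over a two-dimensional latent space and projects the responses onto their first three PCs. The unit peak height, the factor 2σ² in the exponent, the unit-square latent space and all sizes are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np


def gaussian_population(points, centers, sigma):
    """Responses of N Gaussian-tuned neurons at each latent point.

    points:  (T, 2) latent positions; centers: (N, 2) field centres.
    Assumed form: exp(-|x - x0|^2 / (2 sigma^2)), with unit peak response.
    """
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def project_on_pcs(responses, n_components=3):
    """Project mean-centred responses onto their leading principal axes."""
    centered = responses - responses.mean(axis=0)
    # Rows of vt are principal axes ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(200, 2))   # N = 200 field centres
points = rng.uniform(0.0, 1.0, size=(1000, 2))   # sampled latent positions
responses = gaussian_population(points, centers, sigma=0.1)
pcs = project_on_pcs(responses)                  # (1000, 3) manifold coordinates
```

Coloring the rows of `pcs` by the corresponding entries of `points` gives a picture of the curved two-dimensional response manifold of the kind described above.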
bioRxiv preprint doi: https://doi.org/10.1101/471987; this version posted July 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
The fact that the representation manifold has two dimensions is revealed by a measure known as Intrinsic Dimensionality (ID), whose formal definition relies on concepts of Riemannian geometry for smooth manifolds (30). While the ID of the representation manifold is two, due to its curvature, many linear components are necessary to cover it in the N-dimensional neural space. This linear dimensionality can be captured by a second measure of dimensionality: the Participation Ratio (PR) of the manifold. This metric is defined over the eigenvalues $\lambda_1, \dots, \lambda_N$ of the covariance matrix C of the neural activity:
$$\mathrm{PR} = \frac{\left(\sum_{i=1}^{N} \lambda_i\right)^2}{\sum_{i=1}^{N} \lambda_i^2} \tag{1}$$
[Fig. 2 panel labels: "Latent space"; "PCs 1, 2, 3".]
• ID reflects the latent variables underlying the inputs. As such, it does not depend on specific details of the neural code.
• PR, by contrast, is a property of the neural code. The more localized the neural fields are (i.e. the smaller the response curve width σ is), the more decorrelated the neural activations are, and, in turn, the higher the linear dimensionality PR is.
Thus, the difference between PR and ID carries information about the non-linear embedding of latent variables in the representation. We suggest a novel metric, Dimensionality Gain (DG), to capture this difference. It measures the extent to which a given representation linearly expands the "true" (i.e. intrinsic) dimensionality of the manifold:
$$\mathrm{DG} = \frac{\mathrm{PR}}{\mathrm{ID}} \tag{2}$$
Fig. 2e shows a key observation that we will return to in the context of predictive representations: the Dimensionality Gain (DG) increases as the width σ of the neural fields decreases. Thus a higher DG is regarded as a signature of low-D coding. In the Suppl. Mat. we give an analytical formula for this relationship as well as a more thorough explanation of relationships among ID, PR, and DG (Fig. S1).
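As an illustration of these two measures, the sketch below computes PR from the eigenvalues of the response covariance and DG as PR divided by a supplied ID, then compares wide and narrow Gaussian fields on a two-dimensional latent space (so ID = 2 is known by construction). The Gaussian field form, grid of centres and the two widths are assumptions chosen for the example, not values from the paper.

```python
import numpy as np


def participation_ratio(responses):
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2 for the response covariance."""
    cov = np.atleast_2d(np.cov(responses, rowvar=False))
    eig = np.linalg.eigvalsh(cov)
    return eig.sum() ** 2 / np.sum(eig ** 2)


def dimensionality_gain(responses, intrinsic_dim):
    """DG = PR / ID, with the intrinsic dimensionality supplied by the caller."""
    return participation_ratio(responses) / intrinsic_dim


def gaussian_population(points, centers, sigma):
    """Assumed unit-peak Gaussian fields exp(-|x - x0|^2 / (2 sigma^2))."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 15)
centers = np.array([(cx, cy) for cx in grid for cy in grid])  # 225 neurons
points = rng.uniform(0.0, 1.0, size=(2000, 2))

dg_wide = dimensionality_gain(gaussian_population(points, centers, 0.30), 2)
dg_narrow = dimensionality_gain(gaussian_population(points, centers, 0.05), 2)
```

In this toy setting `dg_narrow` exceeds `dg_wide`: narrowing the fields decorrelates the neurons and raises PR while the intrinsic dimensionality stays fixed at two.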

The learned neural representation manifold
In the previous section we illustrated signatures of low-D representation manifolds of the latent variable space in the case where neurons function exactly as place cells directly encoding the latent space. This led to interesting and readily measurable phenomena. First, neurons show clear response fields on the response manifold, which we named manifold tuning curves. Moreover, the representation manifold is low-dimensional while appearing higher-dimensional according to linear measures: that is, the representation has a high dimensionality gain (DG). These observations raise the question of whether representation manifolds learned via the predictive learning framework introduced in the first section will have the same properties. We begin by showing in Fig. 3a the neural representation projected into the space of its first three PCs, colored according to each of the three latent variables x, y, and θ. Each point in these plots corresponds to the neural representation at a specific moment in time, and the color of the point is determined by the position or orientation of the agent in the latent environment at that moment. This shows that the agent's location x, y is systematically encoded in the first three PCs, while PCs four and five encode the agent's orientation θ (Fig. 3b). As the agent's input is the observations rather than the latent variables, it is natural to ask whether the observation variables are similarly encoded in the RNN representation. Fig. 3c shows that, while the first three PCs do encode distance, they do not appear to encode the sensor-averaged color in any of the three RGB channels. Intriguingly, this is a consequence of learning: average color information is encoded in the first PCs at the beginning of learning (see more details below). Figs. 3a and 3b, taken together, suggest that the network allocates most of its internal variability to the encoding of latent variables.
We next explore the relationship between the responses of single cells and the population activity along the manifold. In the simplest case of Fig. 2, in which the latent space directly parameterized the responses of individual cells, we showed that the receptive fields of single cells tiled the representation manifold in the same way that they tiled the latent space. Does the same phenomenon occur for learned representations in the RNN? Fig. 3d demonstrates that this is indeed the case, by showing the activity of the same 100 neurons in Fig. 1e averaged over "locations" in the space spanned by the first two PCs. This reveals that single neurons have activities that resemble receptive fields on the neural representation manifold. We refer to these as neural manifold cells. If the neural manifold clearly represents the latent space (Fig. 3a) and neural receptive fields tile the latent space (Fig. 1e), then neural activities are also localized on the manifold. Intriguingly the reverse is also true: localized activities in the latent space (e.g. place cells, cfr. Fig. 1e) follow from tiling of the manifold by single neuron receptive fields. The preceding analysis shows how the neural representation manifold and single neuron coding are tied to one another, via the latent space. We proceed to study how the manifold and its connection to the latent space emerges over the course of predictive learning. In Fig. 2 we highlighted two different ways to assess the dimensionality of the representation: a linear measure (Participation Ratio, PR) and a nonlinear one (Intrinsic Dimensionality, ID). Here, we find that the PR of predictive representations, computed at every training epoch, keeps increasing through learning (Fig. 4a).
The increase corresponds to the formation of place cells with respect to the latent space (Fig. 1e) or, equivalently, manifold cells with respect to the representation manifold (Fig. 2d). While the PR increases, ID decreases until it reaches a value of approximately 5 (Fig. 4b; see also Methods). Recall from our analysis in the previous section that the value of ID is independent of single neuron fields. Although we cannot explain the number 5 precisely, we note that if the latent variables are encoded then the ID cannot be less than the number of latent components (x, y, θ). Furthermore, the encoding of the actions could explain the fact that it is higher than 3. ID is considerably smaller than PR, pointing to a dimensionality gain of roughly DG = PR/ID ≈ 3 toward the end of learning. This is consistent with our previous analysis, where we showed that local manifold fields tend to increase the DG (cf. Fig. 2e and Suppl. Mat.). In Figs. 3a and 3b we showed that the first five PCs of the learned representation are highly correlated with latent space variables. This latent space signal transfer is another signature of predictive learning that we can exploit and track through training. Specifically, we compute the average of the canonical correlation (CC) coefficients between the representation projected into its PCs, and latent space variables x, y, θ. The blue line in Fig. 4c shows the average CC between the representation in PCs 1 to 3 and the position x, y of the agent in latent space. When the average CC is 1, this means that all the signal regarding x, y has been transferred onto PCs 1 to 3. Similar interpretations hold for the other curves we show, which track the transfer of signal relative to the latent space x, y, θ. Fig. 4c shows that, between epoch 50 and 150, most of the information regarding the latent space moves onto the first few PC modes of the neural activities. The same analysis can be carried out with respect to observation space variables.
This is shown in Fig. 4d, and indicates that the observation space signal flows out of the first few PC components as learning progresses. Together Figs. 4c and 4d show that the representation, as interpreted through PC components, encodes more latent space information vs. observation space information as learning progresses (blue and red lines). The transfer of latent variable information to the first PCs of the representation is tightly connected to the linear and non-linear dimensionality of the representation, as discussed in more depth in the Suppl. Mat. Altogether Fig. 4 suggests that predictive learning forms a low-D representation (Fig. 4a), with specific signatures that can be quantified via latent signal transfer and dimensionality (Fig. 4b).
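The latent signal transfer measure can be sketched as follows, assuming it is computed as the mean canonical correlation between the top PCs of the representation and a matrix of latent variables. The canonical correlations are obtained here with the standard QR-plus-SVD construction; the function names are hypothetical, both inputs are assumed to have full column rank, and encoding a circular variable such as θ as (cos θ, sin θ) would be an additional choice not specified in the text.

```python
import numpy as np


def canonical_correlations(A, B):
    """Canonical correlations between the column spaces of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    # Orthonormal bases for the two (centred) column spaces.
    qa, _ = np.linalg.qr(A)
    qb, _ = np.linalg.qr(B)
    # Singular values of qa^T qb are the cosines of the principal angles.
    s = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


def latent_signal_transfer(responses, latents, n_pcs=3):
    """Average canonical correlation between the top PCs and the latents."""
    centered = responses - responses.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_pcs = centered @ vt[:n_pcs].T
    return canonical_correlations(top_pcs, latents).mean()
```

Evaluating `latent_signal_transfer` on the activations saved at each training epoch, once with latent variables and once with observation variables, would give curves of the kind described for Figs. 4c and 4d.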

A neural network mechanism for low-D representation manifolds through predictive learning
Why does predictive learning lead to the discovery, and low-D representation, of the latent space? In this section we provide theoretical arguments suggesting why the predictive step, in particular, can be such an important ingredient in extracting latent manifolds. For simplicity, we consider the case where the movement of the agent in the latent space $\mathcal{X}$ is governed by a discrete-time dynamical system:
$$\mathbf{x}_{t+1} = \mathbf{x}_t + F(\mathbf{x}_t) \tag{3}$$
where $\mathbf{x} = (x, y, \theta)$ and $F(\mathbf{x})$ is a vector field on $\mathcal{X}$; for the arguments below, this vector field may be deterministic or stochastic (as for the "off policy" actions taken by the agent in our simulations). We note that F may depend on a learned policy but, without loss of generality, we omit this detail. The agent's observation at time t is then defined as a differentiable function of the latent variable: $o_t = \phi(\mathbf{x}_t)$. Such a mapping induces a nonlinear dynamical system in the space of the observations o, which can be written in terms of the latent dynamics:
$$o_{t+1} = \phi(\mathbf{x}_t + F(\mathbf{x}_t))$$
Assuming that the trajectory $\mathbf{x}_t$ stays close to a reference point $\mathbf{x}^* \in \mathcal{X}$, we can expand the observation dynamics to first order:
$$o_{t+1} = o_t + D\phi(\mathbf{x}^*)\, F(\mathbf{x}_t) + \text{h.o.t.} \tag{4}$$
where higher order terms can be neglected when the linear regime dominates and $D\phi(\mathbf{x}^*)$ is the Jacobian matrix of ϕ evaluated at $\mathbf{x}^*$. We now turn to the update rules of the artificial recurrent network, also defined as a discrete-time dynamical system:
$$r_t = g(W r_{t-1} + W_{in}\, o_t), \qquad y_t = W_{out}\, r_t \tag{5}$$
where g is a nonlinear function and $W$, $W_{in}$, $W_{out}$ are respectively recurrent, input and output weights (the agent's actions are not considered here, cf. Suppl. Mat. for further details).
We compare the effect of two cost functions on learning in the network, given an agent's trajectory $\{\mathbf{x}_t \mid 0 \le t \le T\}$ in latent space: one predictive and another non-predictive, respectively represented by
$$C_{pred} = \sum_{t=0}^{T-1} \|o_{t+1} - y_t\|^2, \qquad C_{non\text{-}pred} = \sum_{t=0}^{T} \|o_t - y_t\|^2 \tag{6}$$
For the predictive coding objective $C_{pred}$, we use Eq. (4) and Eq. (5) to obtain
$$\|o_{t+1} - y_t\|^2 \approx \|o_t + D\phi(\mathbf{x}^*)\, F(\mathbf{x}_t) - W_{out}\, g(W r_{t-1} + W_{in}\, o_t)\|^2 \tag{7}$$
Assuming that the activity of the network remains in a regime where g is approximately linear (for convenience, with slope 1), we can further simplify Eq. (7) into
$$\|o_{t+1} - y_t\| \lesssim \|(I - W_{out} W_{in})\, o_t\| + \|D\phi(\mathbf{x}^*)\, F(\mathbf{x}_t) - W_{out} W r_{t-1}\| \tag{8}$$
The two terms in this inequality suggest a possible solution to minimizing $C_{pred}$: to "auto-encode" the observation at the current time $o_t$ while learning a linear representation of the observed dynamics. The latter necessarily implies a low dimensional representation, the same as latent space. To see this, consider a sample trajectory of length T in a neighborhood of $\mathbf{x}^*$, $\{\mathbf{x}_t \mid 1 < t < T\}$, and the corresponding network activations $\{r_t \mid 1 < t < T\}$. Let X and R be the $N_{latent} \times T$ and $N \times T$ matrices whose columns are the latent states and the network activations along this trajectory, respectively. It follows that minimizing the contribution of each term in Eq. (8) to minimize $C_{pred}$ is equivalent to solving the ordinary least squares problem:
$$\min_{W,\, W_{in},\, W_{out}} \; \|(I - W_{out} W_{in})\, \phi(X)\|^2 + \|D\phi(\mathbf{x}^*)\, F(X) - W_{out} W R\|^2$$
where ϕ and F are applied column-wise to X. This suggests that $W_{out} W_{in} \approx I$ while the activation vector r mainly encodes a representation of the latent variable's dynamic update rule $F(\mathbf{x})$ (akin to the dynamics' derivative). Furthermore, as X is rank $N_{latent}$ and, assuming $W_{out}$ and W are of higher rank, a natural way to satisfy this is by R also being rank $N_{latent}$. This is consistent with low-dimensional network dynamics.
We emphasize that the analysis above is approximate, and is local, involving linearization around a given point $\mathbf{x}^*$ in the latent space. However, by allowing $\mathbf{x}^*$ to change in time so that the linear approximation holds for trajectories on a longer scale, the network would then learn a collection of local linear dynamics. We observe clues in our numerical experiments that these approximate relationships are indeed respected. Fig. S2b shows that the matrix $W_{out} W_{in}$ has a clear diagonal structure, suggesting that input observations are fed forward to the outputs. The role of recurrent dynamics is then to approximate the local map $D\phi(\mathbf{x}^*) F(\mathbf{x})$. In this sense the representation r doesn't directly encode for $\mathbf{x}$ but rather represents a collection of local linear maps indexed by the position of the agent in the latent space, and coding for its dynamics in this space. By contrast, for the non-predictive objective $C_{non\text{-}pred}$ the terms $\|o_t - y_t\|^2 = \|o_t - W_{out} W r_{t-1} - W_{out} W_{in}\, o_t\|^2$ (again taking g to be approximately linear) are missing the dynamic update and cannot be decomposed as in Eq. (7). The absence of the low-dimensional latent space dynamics in this non-predictive setting suggests that the representation shouldn't discover the latent manifold through learning. We will demonstrate this explicitly in the next section. The arguments above imply that predictive representations will have low ID (i.e., low nonlinear dimensionality). We next give reasoning for why such predictive representations develop localized receptive fields. As shown in Fig. 2e, this leads, in turn, to high PR (i.e., high linear dimensionality) and hence high DG, all phenomena that we have observed in our network simulations above. We begin with the assumption that the low-dimensional predictive representations are a smooth map of the latent space.
A consequence is Lipschitz continuity, which guarantees that nearby points in the latent space $(\mathbf{x}, \mathbf{x}')$ map onto nearby points $(r, r')$ in representation space, at least up to a given radius:
$$d(r, r') \le \kappa\, d(\mathbf{x}, \mathbf{x}')$$
where κ is the Lipschitz constant and d indicates distance. This preservation of distances, or similarities, together with the positivity constraint ($r_i \ge 0$ for each neuron i), is known to lead to localized manifold fields (38, 39) (cf. Suppl. Mat.). Interestingly, in our framework this result appears to be true for both positive representations (when the activation function is a sigmoid) and more general ones (e.g. when the activation function is tanh, data not shown). The arguments above indicate that predictive learning leads to increases in linear dimensionality, as observed in our learning simulations (Fig. 4a). But when should this increase stop? A possible answer is: when the linear dimensionality of the neural representation matches that of the outputs that the network is seeking to produce. We give a simplified argument based on linear readout that suggests why this answer might be correct. Rewriting the cost function in Eq. (6) for a linear readout we obtain $C_{pred} = \sum_{t=0}^{T-1} \|o_{t+1} - W_{out}\, r_t\|^2$, and recognize that (for $W_{out}$ randomly distributed or orthogonal) the linear dimensionality of the representation tends to match the linear dimensionality of the output, as they are directly related through the linear transformation $W_{out}$ (cf. (40-42)). Our numerical studies lend evidence to this: the PR increases through learning until it saturates at about the PR dimensionality of the output, which is 16.2 (Fig. 4a).
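The linear-readout argument can be checked numerically in its simplest special case, a square orthogonal readout, for which the covariance eigenvalues and hence the PR of outputs and representation coincide exactly. The sketch below is only this idealized case with synthetic Gaussian data; the sizes and variance profile are arbitrary assumptions, and the random or non-square readouts mentioned in the text would give only an approximate match.

```python
import numpy as np


def participation_ratio(x):
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2 of the covariance of x."""
    eig = np.linalg.eigvalsh(np.cov(x, rowvar=False))
    return eig.sum() ** 2 / np.sum(eig ** 2)


rng = np.random.default_rng(0)
# Synthetic representation with unequal variances along 30 directions.
r = rng.normal(size=(5000, 30)) * np.linspace(0.1, 2.0, 30)
# Random orthogonal readout from the QR decomposition of a Gaussian matrix.
w_out, _ = np.linalg.qr(rng.normal(size=(30, 30)))
y = r @ w_out.T  # outputs y_t = W_out r_t, one row per time step

pr_r = participation_ratio(r)
pr_y = participation_ratio(y)
```

Because the output covariance is $W_{out} C W_{out}^\top$ with $W_{out}$ orthogonal, `pr_r` and `pr_y` agree up to floating-point error.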

Non-predictive learning fails to extract low-D latent manifold
A central idea in this article is that learning is predictive, so that the underlying RNN is learning to anticipate the observation on the next timestep. But is the predictive aspect really necessary for the network to extract the low-D latent manifold? Here we address this question by directly contrasting predictive learning with the corresponding non-predictive case. We train 100 RNNs, which differ only in the initialization of their weights and the agent's generated trajectory, in two different scenarios: predictive learning vs. non-predictive auto-encoding; that is, predicting the next observation $o_{t+1}$ as described earlier vs. returning the current observation $o_t$ (43, 44). We find that all networks trained through predictive learning show the characteristics outlined above (low Intrinsic Dimensionality with high Dimensionality Gain and latent signal transfer), while the same networks trained with the auto-encoding loss develop qualitatively different representations. Most importantly, with the auto-encoding loss the learned representations do not reflect the latent state variables as they do for the predictive coding loss, i.e. latent variables do not dominate the linear factors. In Figs. 5a and 5b we show that CCA between the first three PCs of the representation and the latent space or the observations yields completely different trends in the predictive vs. non-predictive case. In Fig. 5a the average CCA coefficient between the representation and the latent space grows throughout learning while the average coefficient between the representation and the observations decreases (cf. Figs. 4c and 4d). In contrast, by this metric the networks trained to auto-encode the observations do not develop representations that encode the latent space, but rather only the observations. Consequently, as shown in Fig. 5c, the neural activities are not localized in the latent space. This is in striking contrast with the same plots for the predictive case (Fig. 1e).
The PR and ID dimensions of the learned representations also differ significantly between the predictive and non-predictive settings. For the predictive learning network (Fig. 5d), PR grows and ID decreases throughout training. For the auto-encoding network (Fig. 5e) PR grows but ID does not decrease, as the representation does not extract the latent manifold. We can summarize these properties by analyzing the Dimensionality Gain. Fig. 5f shows that in the predictive case (blue line) DG progressively increases through learning, while this does not occur for the non-predictive case.
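For readers who wish to track ID in their own data, one widely used nonlinear estimator is the two-nearest-neighbour (TwoNN) method, sketched below in its maximum-likelihood form. This is a swapped-in illustration: the estimators actually used for the figures are described in the Methods and may differ. The function name is hypothetical, and the sketch assumes no duplicate points and a sample small enough for a dense distance matrix.

```python
import numpy as np


def twonn_id(points):
    """TwoNN maximum-likelihood estimate of intrinsic dimensionality.

    points: (T, N) array of distinct samples (e.g. network activations).
    Uses the ratio mu = r2 / r1 of second to first nearest-neighbour distance;
    under the TwoNN model the estimate is T / sum(log mu).
    """
    sq = (points ** 2).sum(axis=1)
    # Pairwise squared distances from the Gram matrix, clipped at zero.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbours
    # After partitioning, columns 0 and 1 hold the two smallest distances.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    mu = nearest[:, 1] / nearest[:, 0]
    return len(mu) / np.sum(np.log(mu))
```

Dividing a PR value by such an ID estimate, epoch by epoch, gives a Dimensionality Gain curve that can be compared between predictive and auto-encoding networks.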

Control simulations that test role of recurrence and robustness to task setup
We conducted a series of additional simulations to verify that our main findings are robust and rely on predictive learning, as our theoretical arguments predict. The Suppl. Mat. gives a fuller description of these (Fig. S3, Fig. S4); we give a brief listing here. First, we checked the importance of a trained RNN architecture, showing that both freezing the output weights and using a non-recurrent network hinder the development of predictive representations with the key properties described above. We also checked the importance of the predictive nature of the training objective: training the network to reproduce the observations on the last time step as opposed to predicting those on the next step hinders learning (as does autoencoding the input, as pointed out above). Finally, our results are robust to the details of input statistics, specifically to adding noise in the input and to degrading the input by removing color, action or distance-related information. Altogether these findings corroborate our theoretical arguments (cf. Suppl. Mat. and Fig. S3, Fig. S4).

Discussion
How the brain extracts information about the latent structures of the external world given only indirect sensory observations is a long-standing question. We find that predictive learning in recurrent neural networks (RNNs) leads to an intriguing answer, as it automatically constructs a low dimensional neural representation of the latent space. We explore this phenomenon both through simulations of an egocentric spatial navigation task, a situation that is naturally described by latent variables corresponding to the spatial coordinates, and through intuitive mathematical arguments that indicate the generality of the phenomenon.
Signatures of predictive learning in neural data
What features characterize predictive learning in neural data? When the observations to be predicted arise from an environment with an underlying low-dimensional latent structure, our work suggests several distinct signatures. First, the dimensionality of the set of neural responses will likely appear high when assessed with standard linear measures, such as the participation ratio. However, when assessed through nonlinear metrics sensitive to the dimensionality of curved manifolds, the dimensionality will be lower, tending to the number of independent latent variables. These two signatures taken together imply a high dimensionality gain (DG), or ratio of linear to nonlinear dimension. The presence of a low-D neural representation manifold suggests another signature of predictive learning: neural manifold cells, with responses strongly tuned to the variables which parameterize the neural representation manifold (cfr. Fig. 3d). While locality in latent space is an established aspect of neural receptive fields, locality in the manifold is an allied feature that will be exciting to check in experimental data. This builds on recent work on understanding neuronal representations through the lens of representation dimensionality (26-28, 38, 45).
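The contrast between linear and nonlinear dimensionality can be illustrated on synthetic data. The sketch below is not our analysis pipeline: it assumes a toy population of Gaussian-tuned units driven by two latent variables, and it uses the TwoNN nearest-neighbor estimator as one possible nonlinear measure of ID. All names and parameter values are hypothetical.

```python
import numpy as np

def participation_ratio(X):
    """Linear dimensionality: (sum of covariance eigenvalues)^2 / sum of their squares."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eig.sum() ** 2 / (eig ** 2).sum())

def two_nn_dimension(X):
    """TwoNN maximum-likelihood ID estimate from ratios of 2nd to 1st neighbor distances."""
    n = X.shape[0]
    mu = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                   # exclude the point itself
        r1, r2 = np.partition(d, 1)[:2]  # two smallest distances, in order
        mu[i] = r2 / r1
    return float(n / np.log(mu).sum())

# Toy data (an assumption of this sketch): 2 latent variables, nonlinearly embedded
# in the activity of 200 units with localized Gaussian tuning curves.
rng = np.random.default_rng(0)
n_samples, n_units, width = 1000, 200, 0.1
latent = rng.uniform(0.0, 1.0, size=(n_samples, 2))
centers = rng.uniform(-0.2, 1.2, size=(n_units, 2))
sq_dist = ((latent[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
responses = np.exp(-sq_dist / (2 * width ** 2))

pr = participation_ratio(responses)   # linear measure: well above 2 for a curved manifold
id_est = two_nn_dimension(responses)  # nonlinear measure: close to the 2 latent variables
dg = pr / id_est                      # dimensionality gain: ratio of linear to nonlinear dimension
print(f"PR = {pr:.2f}, ID = {id_est:.2f}, DG = {dg:.2f}")
```

In this toy setting the curved two-dimensional response manifold occupies many linear directions, so PR exceeds ID and DG is greater than one, which is the qualitative signature described above.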
Discovering latent structure in data and sensory observations
Our techniques require no advance knowledge of what the latent variables are, or even how many of them there are. Consequently, both the number and identity of latent variables can be discovered by analysis of a learned neural response manifold, as studied in other settings by (43, 46-48). We introduce latent signal transfer as a viable way to uncover the relevant variables (Fig. 3d): as the response manifold is learned, the position of population responses along the manifold can be increasingly well predicted by the true low-dimensional latent variables, but increasingly poorly predicted by irrelevant variables. Thus, the problem of discovering the low-dimensional, latent structure in complex, high-dimensional dynamic signals becomes that of discovering the variables that parameterize a low-dimensional neural response manifold. We suggest that such parametrization of learning via dimensionality and latent signal transfer (two related phenomena, as discussed in the Suppl. Mat.) may contribute to the understanding of how both biological brains and neural network algorithms solve difficult tasks such as navigating an environment based on complex, high-dimensional cues.
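The logic of comparing candidate variables can be sketched as follows. This is an illustrative stand-in rather than the latent signal transfer measure defined in the Suppl. Mat.: it assumes that the two leading principal components serve as coordinates along the response manifold, that a quadratic polynomial regression serves as the predictor, and that the responses come from a toy population of Gaussian-tuned units. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_units, width = 1000, 400, 0.2

# Toy responses driven by 2 true latent variables; `irrelevant` is unrelated to them.
latent = rng.uniform(0.0, 1.0, size=(n_samples, 2))
irrelevant = rng.uniform(0.0, 1.0, size=(n_samples, 2))
centers = rng.uniform(-0.2, 1.2, size=(n_units, 2))
sq_dist = ((latent[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
responses = np.exp(-sq_dist / (2 * width ** 2))

def leading_coordinates(X, k):
    """Project centered responses onto their top-k principal components."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]

def quadratic_features(V):
    """Design matrix with constant, linear, squared and cross terms of two variables."""
    a, b = V[:, 0], V[:, 1]
    return np.column_stack([np.ones(len(V)), a, b, a * a, b * b, a * b])

def r_squared(candidates, targets):
    """Fraction of variance in `targets` explained by a quadratic fit on `candidates`."""
    A = quadratic_features(candidates)
    coef, *_ = np.linalg.lstsq(A, targets, rcond=None)
    resid = targets - A @ coef
    total = targets - targets.mean(axis=0)
    return float(1.0 - (resid ** 2).sum() / (total ** 2).sum())

coords = leading_coordinates(responses, k=2)   # assumed proxy for manifold position
r2_latent = r_squared(latent, coords)          # candidate set: the true latent variables
r2_irrelevant = r_squared(irrelevant, coords)  # candidate set: irrelevant variables
print(f"R^2 from latent: {r2_latent:.2f}, from irrelevant: {r2_irrelevant:.2f}")
```

Candidate variables that parameterize the manifold explain most of the variance in the manifold coordinates, while irrelevant variables explain almost none; tracking this gap over training is one simple way to follow the emergence of the learned manifold.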

Related frameworks and findings
From an algorithmic and computational perspective, our proposal is motivated by the recent success of predictive models in machine learning tasks that require vector representations reflecting the semantic relationships between the data samples in the tasks. On one hand, information retrieval and computational linguistics have benefited enormously from the geometric properties of word embeddings learned by predictive models (10-12, 46). On the other hand, prediction over observations has been used as an auxiliary task in reinforcement learning to acquire representations favoring goal-directed learning (9, 15-17). Finally, we note that the responses are reminiscent of the types of place-related activity observed in the hippocampus and entorhinal cortex, lending in particular mechanistic grounding to the recent proposal by (22) that the hippocampus builds a semantic relational network. We argue that relevant semantic relations are encoded by neural representations of low intrinsic dimensionality, and that these in turn are constructed by predictive learning to reflect the relevant latent variables in a task. Our results substantiate and build on the importance of allied frameworks in constructing such relational networks (14, 15, 49).
Open questions
Distinctive to our work is the use of nonlinear dimensionality analysis to characterize the relationship between the neural representation manifold and the latent space. In order to reveal this low-dimensional structure, we rely on nonlinear techniques, as more common linear measures would give the illusion of high-dimensional representations. Nonetheless, more work is needed to harness and theoretically formalize the role of nonlinearities in neural population codes. Furthermore, predictive learning is a general framework that goes beyond the example of navigation analyzed here, and future work will expand in other directions (text, visual processing, behavioral tasks, etc.) that may open new theoretical frameworks and new implications for learning and generalization. Finally, it will be crucial to adapt and test these ideas for the analysis of large-scale population recordings of in-vivo neural data, ideally longitudinally, so that the evolution of learned neural representations can be tracked with metrics such as the emergence of a low-D neural representation manifold, dimensionality gain, and latent signal transfer. A very exciting possibility is that this might uncover the presence of latent variables in tasks where they were previously unsuspected or unidentified.