Neural network based successor representations to form cognitive maps of space and language

How does the mind organize thoughts? The hippocampal-entorhinal complex is thought to support domain-general representation and processing of structural knowledge of arbitrary state, feature and concept spaces. In particular, it enables the formation of cognitive maps, and navigation on these maps, thereby broadly contributing to cognition. It has been proposed that the concept of multi-scale successor representations provides an explanation of the underlying computations performed by place and grid cells. Here, we present a neural network-based approach to learn such representations, and its application to different scenarios: a spatial exploration task based on supervised learning, a spatial navigation task based on reinforcement learning, and a non-spatial task where linguistic constructions have to be inferred by observing sample sentences. In all scenarios, the neural network correctly learns and approximates the underlying structure by building successor representations. Furthermore, the resulting neural firing patterns are strikingly similar to experimentally observed place and grid cell firing patterns. We conclude that cognitive maps and neural network-based successor representations of structured knowledge provide a promising way to overcome some of the shortcomings of deep learning on the road towards artificial general intelligence.


Introduction
Cognitive maps are mental representations that serve an organism to acquire, code, store, recall, and decode information about the relative locations and features of objects [1]. Electrophysiological research in rodents suggests that the hippocampus [2] and the entorhinal cortex [3] are the neurological basis of cognitive maps. There, highly specialised neurons including place [4] and grid cells [5] support map-like spatial codes, and thus enable spatial representation and navigation [6], and furthermore the construction of multi-scale maps [7,8]. Human fMRI studies during virtual navigation tasks have also shown that hippocampal and entorhinal spatial codes, together with areas in the frontal lobe, enable route planning during navigation [9], e.g. detours [10], shortcuts or efficient novel routes [11], and in particular hierarchical spatial planning [12] based on distance-preserving representations [13].
Recent human fMRI studies even suggest that these map-like representations might not be restricted to physical space, i.e. places and spatial relations, but also extend to more abstract relations like in social and conceptual spaces [14][15][16], thereby contributing broadly to other cognitive domains [17], and thus enabling navigation and route planning in arbitrary abstract cognitive spaces [18].
The hippocampus also plays a crucial role in episodic and declarative memory [19,20]. Furthermore, the hippocampal formation, as a hub in brain connectivity [21], receives highly processed information via direct and indirect pathways from a large number of multi-modal areas of the cerebral cortex, including language-related areas in the frontal, temporal, and parietal lobe [22]. Finally, some findings indicate that the hippocampus even contributes to the coding of narrative context [23,24], and that memory representations, similar to the internal representation of space, systematically vary in scale along the hippocampal long axis [25]. This scale might be used for goal-directed navigation with different horizons [26], or even encode information from smaller episodes to more complex concepts [27]. This geometry can also be modeled in artificial neural networks when performing an abstraction task [28]. Cognitive maps therefore enable flexible planning through re-mapping of place cells and through the continuous (re-)scaling, generalization or detailed representation of information [29].
A number of computational models try to describe the hippocampal-entorhinal complex. For instance, the Tolman-Eichenbaum Machine describes hippocampal and entorhinal cell types and allows flexible transfer of structural knowledge [30]. Another framework that aims to describe the firing patterns of place cells in the hippocampus uses the successor representation (SR) as a building block for the construction of cognitive or predictive maps [31,32]. The hierarchical structure in the entorhinal cortex can also be modeled by means of multi-scale successor representations [33]. Here, the SR can be learned, for example, with a feature set of boundary vector cells [34] or with a sequence generation model inspired by the entorhinal-hippocampal circuit [35].
To further investigate both the biological plausibility and potential machine learning applications of multi-scale SR and cognitive maps, we developed a neural network-based simulation of place cell behavior under different circumstances. In particular, we trained a neural network to learn the SR for a simulated spatial environment and a navigation task in a virtual maze as proposed by Alvernhe et al. [36]. In addition, we investigated whether the applicability of our model extends from space to language, as the hippocampus is known to also contribute to language processing [37,38]. Therefore, we created a model to learn a simplified artificial language. In particular, the model's task was to learn the underlying grammatical structure, in terms of SR of words, by observing exemplary input sentences only.

Successor Representation
The developed model is based on the principle of the successor representation (SR). As proposed by Stachenfeld et al., the SR can model the firing patterns of the place cells in the hippocampus [32]. The SR was originally designed to build a representation of all possible future rewards V(s) that may be achieved from each state s within the state space over time [39]. The future reward matrix V(s) can be calculated for every state in the environment, where the parameter t indicates the number of time steps in the future that are taken into account, and R(s_t) is the reward for state s at time t. The discount factor γ ∈ [0, 1] reduces the relevance of states s_t that are further in the future relative to the respective initial state s_0 (cf. Equation 1):

V(s) = E[ Σ_t γ^t R(s_t) ]    (1)

Here, E[·] denotes the expectation value.
The future reward matrix V(s) can be re-factorized using the SR matrix M, which can be computed from the state transition probability matrix T of successive states (cf. Equation 2):

V(s) = Σ_{s'} M(s, s') R(s'),  with  M = Σ_t γ^t T^t = (I − γT)^(−1) for t → ∞    (2)

In the case of supervised learning, the environments used for our model operate without specific rewards for each state. For the calculation of these SR we choose R(s_t) = 1 for every state.
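For illustration, this computation can be sketched in a few lines of Python/numpy (a minimal sketch; the function and variable names are ours, not part of the original implementation):

```python
import numpy as np

def successor_representation(T, gamma=0.9, t_max=None):
    """Compute the SR matrix M from a state transition matrix T.

    If t_max is None, the infinite-horizon closed form
    M = (I - gamma * T)^(-1) is used; otherwise the truncated sum
    M = sum_{t=0}^{t_max} (gamma * T)^t is accumulated.
    """
    n = T.shape[0]
    if t_max is None:
        return np.linalg.inv(np.eye(n) - gamma * T)
    M = np.zeros_like(T)
    P = np.eye(n)  # (gamma * T)^0
    for _ in range(t_max + 1):
        M += P
        P = gamma * (P @ T)
    return M

# Two-state toy chain: each state moves to the other with probability 1.
T = np.array([[0.0, 1.0],
              [1.0, 0.0]])
M = successor_representation(T, gamma=0.5)
```

With R(s_t) = 1 for every state, the value vector is then simply the row sums of M, i.e. V = M @ np.ones(n). The truncated sum corresponds to the finite-horizon SR used for a fixed number of time steps t, while the closed form gives the infinite-horizon limit.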

Spatial Environment
The spatial environments created in our framework are designed as discrete grid-like rooms which can be freely explored by the agent. The neighboring states of a particular state are defined as its direct successor states. Walls and barriers are not counted as possible successor states for a neighboring initial state. The square room from the spatial exploration task consisted of 100 states arranged as a 10 × 10 grid (cf. Figure 1). The maze from the spatial navigation task was represented as a 15 × 15 grid, of which only 94 of the 225 states were "allowed" states that could be observed by the agent (cf. Figure 3).
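The construction of such a discrete grid environment can be sketched as follows (a minimal Python sketch, assuming a uniform random walk over 8-connected neighbors; walls and barriers are modeled as blocked cells, which is our naming, not the original code):

```python
import numpy as np

def grid_transition_matrix(rows, cols, blocked=frozenset()):
    """Uniform random-walk transition matrix for a grid world with
    8-connected neighborhoods; blocked cells (walls) are excluded."""
    n = rows * cols
    T = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            s = r * cols + c
            if s in blocked:
                continue
            # Collect all valid, non-blocked neighbors of (r, c).
            neighbors = [(r + dr, c + dc)
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if not (dr == 0 and dc == 0)
                         and 0 <= r + dr < rows and 0 <= c + dc < cols
                         and (r + dr) * cols + (c + dc) not in blocked]
            for nr, nc in neighbors:
                T[s, nr * cols + nc] = 1.0 / len(neighbors)
    return T

# The 10x10 square room of the exploration task, without obstacles.
T = grid_transition_matrix(10, 10)
```

A corner state then has three successors with probability 1/3 each, a wall state five, and an interior state eight, matching the uniform transition structure described above.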

Language Environment
Additionally, we set up a state space with a non-spatial structure, i.e. a linguistic environment. The environment consists of 40 discrete states representing the vocabulary. Each state corresponds to a particular word, and each word belongs to one of five different word classes: adjectives, verbs, nouns, pronouns and question words. The transition probabilities between subsequent words are defined according to a simplified syntax which consists of three types of linguistic constructions: an adjective-noun construction (cf. rule 3a), a descriptive construction (cf. rule 3b) and an interrogative construction (cf. rule 3c).
adjective → noun (3a)

The syntax rules, i.e. constructions, determine the transition probabilities: starting states are chosen at random, and a word from the resulting successor word class is set as the label for the training data. The individual words from the successor word class are chosen with equal probability. The constructed sentences therefore have no particular meaning. For the language data set, 5,000 training samples were generated and the network was trained for 50 epochs.
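The sampling procedure can be sketched as follows. Note that this is a hypothetical reconstruction: the word inventories and the exact form of constructions 3b and 3c below are placeholder assumptions; only rule 3a and the word class sizes are taken from the setup described here:

```python
import random

# Hypothetical word inventory (class sizes follow the 10/10/10/5/5 split
# of the 40-word vocabulary); word strings are placeholders.
CLASSES = {
    "adjective": [f"adj{i}" for i in range(10)],
    "verb": [f"verb{i}" for i in range(10)],
    "noun": [f"noun{i}" for i in range(10)],
    "pronoun": [f"pron{i}" for i in range(5)],
    "question": [f"q{i}" for i in range(5)],
}
CONSTRUCTIONS = [
    ["adjective", "noun"],             # rule 3a
    ["pronoun", "verb", "adjective"],  # stand-in for rule 3b (assumption)
    ["question", "verb", "noun"],      # stand-in for rule 3c (assumption)
]

def sample_pairs(n_samples, seed=0):
    """Generate (word, successor word) training pairs by sampling
    constructions and filling each slot with a uniformly chosen word."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_samples:
        sentence = [rng.choice(CLASSES[c]) for c in rng.choice(CONSTRUCTIONS)]
        pairs.extend(zip(sentence, sentence[1:]))
    return pairs[:n_samples]

pairs = sample_pairs(5000)
```

Each pair of consecutive words then serves as one (input, label) training sample for the network.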

Neural network architectures
To learn the SR by just observing the environment, we set up three-layered neural networks that learn the transition probabilities of the different environments. The input to the network is the momentary state encoded as a one-hot vector. Thus the number of neurons in the input layer is 100 in the exploration task, 225 in the navigation task and 40 in the linguistic task.
The hidden layer neurons have a ReLU activation function, where the number of neurons equals the corresponding number of input neurons in all networks. We also tested network architectures with a smaller number of hidden layer neurons (bottleneck). In all cases, the results were very similar.
The softmax output layer yields a probability distribution over all successor states. The number of neurons in the output layer therefore corresponds to the number of possible states, and hence to the number of input neurons, in all networks.
The training data set is created by sampling trajectories through the spatial structure of the simulated environment. First, a random starting state is chosen as input, and subsequently a random possible successor state is chosen from its neighbors as the desired output. Walls are excluded as input states. For the first experiment we sampled 50,000 states and successor states, and trained for 10,000 epochs.
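A minimal numpy sketch of such a three-layered network and one training step is given below; the use of a cross-entropy loss and plain SGD is our assumption, as the text does not specify the optimizer or loss:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 100  # e.g. the 10x10 exploration task
W1 = rng.normal(0.0, 0.1, (n_states, n_states))  # input -> hidden weights
W2 = rng.normal(0.0, 0.1, (n_states, n_states))  # hidden -> output weights

def forward(x):
    h = np.maximum(0.0, x @ W1)          # ReLU hidden layer
    logits = h @ W2
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return h, e / e.sum()

def train_step(state, successor, lr=0.1):
    """One SGD step on the cross-entropy loss for a (state, successor) pair."""
    global W1, W2
    x = np.zeros(n_states)
    x[state] = 1.0                       # one-hot encoded input state
    h, p = forward(x)
    dlogits = p.copy()
    dlogits[successor] -= 1.0            # gradient of cross-entropy w.r.t. logits
    dh = (W2 @ dlogits) * (h > 0)        # backpropagate through the ReLU
    W2 -= lr * np.outer(h, dlogits)
    W1 -= lr * np.outer(x, dh)
```

Repeatedly calling `train_step` on sampled (state, successor) pairs drives each output row towards the empirical transition probabilities of the environment.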

Reinforcement learning
In an attempt to reproduce experimental data, we simulated a maze as proposed by Alvernhe et al. [36]. Furthermore, a reward system is required for reinforcement learning (RL). Our RL approach enables us to define rewards in the spatial environments, which we use to simulate the food trays of the original experiment. The network structure is again a three-layered network, with a ReLU activation function for the hidden layer neurons and a softmax output layer, which yields the probabilities for the next actions. A DQN agent can choose from several actions, depending on the number of neighboring states of the current state. If the agent chooses a wall state during training, the current training run is terminated and a new starting state is chosen at random. Training was performed for 10,000 epochs.
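As an illustration of the reward-driven setting, the following sketch replaces the DQN agent with simple tabular Q-learning on a toy corridor; it is a simplified stand-in for the actual architecture, not a reconstruction of it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1x5 corridor: states 0..4, reward at state 4 (stand-in for a food tray).
n_states, actions = 5, [-1, 1]   # move left / move right
Q = np.zeros((n_states, len(actions)))

for _ in range(2000):
    s = int(rng.integers(0, n_states - 1))       # random start (not the goal)
    for _ in range(50):
        # Epsilon-greedy action selection.
        a = int(rng.integers(2)) if rng.random() < 0.1 else int(Q[s].argmax())
        s2 = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Standard Q-learning update.
        Q[s, a] += 0.5 * (r + 0.9 * Q[s2].max() - Q[s, a])
        s = s2
        if r > 0:
            break

greedy = [int(Q[s].argmax()) for s in range(n_states - 1)]
```

After training, the greedy policy moves towards the rewarded state from every position, which is the behavior the DQN agent learns in the simulated maze.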

Transition probability and successor representation matrix
After the training process, the network can predict all probabilities of successor states for any given initial state. Concatenating the predictions of all states leads to the transition probability (TP) matrix of our environments, which we use to calculate the SR matrix (cf. Equation 2). In the case of the supervised learning approach (spatial exploration task and language task), the output of the network is a vector shaped like a row of the respective environment's TP or SR matrix and can therefore directly be used to fill the TP or SR matrix, respectively. The reinforcement learning network, however, only yields the probabilities for the direct successors of a given state, which therefore need to be extended to a vector containing all possible states of the environment.
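The expansion step for the RL case can be sketched as follows (function and parameter names are ours, for illustration only):

```python
import numpy as np

def expand_action_probs(action_probs, neighbor_states, n_states):
    """Scatter the RL network's per-action probabilities (one entry per
    neighboring state) into a full row over all states of the environment."""
    row = np.zeros(n_states)
    for p, s in zip(action_probs, neighbor_states):
        row[s] = p
    return row

# Example: a state with two neighbors (states 4 and 6) in a 9-state world.
row = expand_action_probs([0.7, 0.3], neighbor_states=[4, 6], n_states=9)
```

Stacking these expanded rows for all states yields a full TP matrix of the same shape as in the supervised case, from which the SR matrix can be computed as before.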

Reproducing experimental grid cell firing patterns
After training the network, the resulting SR matrices are evaluated. To this end, each state, encoded as a one-hot vector, is fed as input to the network, and the resulting softmax output vectors are concatenated to build the SR matrix. The resulting SR matrix can then be used to calculate its Eigendecomposition. The Eigenvectors can be ordered according to the magnitude of their corresponding Eigenvalues, and are subsequently reshaped to fit the shape of the corresponding state space, i.e. the simulated environment. The reshaped Eigenvectors are expected to form grid-like patterns, and to be a representation of the grid cells' receptive fields [32].
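This evaluation procedure can be sketched in numpy as follows (a minimal sketch; the ordering convention, largest Eigenvalue first, is our choice):

```python
import numpy as np

def grid_cell_maps(M, rows, cols, k=30):
    """Eigendecomposition of the SR matrix M; each eigenvector, reshaped to
    the room's shape, is one putative grid-cell firing map."""
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)   # largest eigenvalue first
    maps = [eigvecs[:, i].real.reshape(rows, cols) for i in order[:k]]
    return eigvals.real[order[:k]], maps

# Example with a symmetric toy SR matrix for a 2x2 room.
M = np.eye(4) + 0.25 * np.ones((4, 4))
vals, maps = grid_cell_maps(M, 2, 2, k=4)
```

For the learned SR matrices of the spatial tasks, the first reshaped eigenvectors are the grid-like maps shown in Figure 2.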

Multi-dimensional scaling
A frequently used method to generate low-dimensional embeddings of high-dimensional data is t-distributed stochastic neighbor embedding (t-SNE) [40]. However, in t-SNE the resulting low-dimensional projections can be highly dependent on the detailed parameter settings [41], sensitive to noise, and may not preserve, but rather often scramble, the global structure of the data [42,43]. In contrast, multi-dimensional scaling (MDS) [44][45][46][47] is an efficient embedding technique to visualize high-dimensional point clouds by projecting them onto a 2-dimensional plane. Furthermore, MDS has the decisive advantage that it is parameter-free and preserves all mutual distances of the points, thereby conserving both the global and local structure of the underlying data.
When interpreting patterns as points in high-dimensional space and dissimilarities between patterns as distances between corresponding points, MDS is an elegant method to visualize high-dimensional data. By color-coding each projected data point of a data set according to its label, the representation of the data can be visualized as a set of point clusters. For instance, MDS has already been applied to visualize word class distributions of different linguistic corpora [48], hidden layer representations (embeddings) of artificial neural networks [49,50], structure and dynamics of recurrent neural networks [51][52][53], or brain activity patterns assessed during e.g. pure tone or speech perception [48,54], or even during sleep [55,56]. In all these cases the apparent compactness and mutual overlap of the point clusters permits a qualitative assessment of how well the different classes separate.
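Classical (Torgerson) MDS can be implemented in a few lines of numpy; the sketch below is a generic textbook implementation, not the authors' code:

```python
import numpy as np

def classical_mds(X, dim=2):
    """Classical (Torgerson) MDS: project the rows of X to `dim` dimensions
    while preserving their mutual Euclidean distances as well as possible."""
    D2 = np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ D2 @ J                 # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:dim] # keep the top eigenpairs
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))
```

Applied to the rows of a TP or SR matrix, this yields the two-dimensional word projections shown in Figures 6 and 7 (up to rotation and reflection, since MDS coordinates are only defined relative to each other).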

Supervised learning reproduces basic firing patterns of place cells in rodents
In the supervised learning approach, the transition probabilities between neighboring places, and hence the SR, are learned by randomly observing places (states) and exploring their potential successors. In the simplest case of a 2D square environment without any obstacles, the transition probabilities from any starting place to all its eight neighbors are identical, i.e. uniformly distributed. Places at the walls (corners) of the room, however, only have five (three) neighboring states, so the transition probabilities corresponding to the missing neighbors behind those walls are zero. The resulting successor representations learned by the neural network are almost identical to the ground truth (cf. Figure 1). Furthermore, these firing patterns are strikingly similar to those of place cells in rodents. They reflect the environment's spatial structure depending on obstacles and room shape [63], i.e. the intensity of the firing patterns is centered around the starting position (as this is also the most probable next state) and directed away from any walls into the open space (cf. Figure 1), as described e.g. in [64], and found experimentally [65].

Eigenvectors of learned SR resemble firing patterns of grid cells in rodents
Stachenfeld et al. [64] propose that the grid-like firing patterns of grid cells in the entorhinal cortex of rodents [66] may be explained by an Eigendecomposition of the SR matrix, where each individual grid cell corresponds to one Eigenvector. To test this assumption in the context of our framework, we calculated the Eigenvectors of the learned SR matrix shown in Figure 1, and reshaped them to the shape of the environment. We find that this procedure indeed leads to grid cell-like firing patterns (cf. Figure 2). The first 30 Eigenvectors, ordered by increasing value of the corresponding Eigenvalues, are shown in Figure 2. As known from neurobiology [66], the grid-like patterns vary in orientation and mesh size (i.e. spatial frequency). In particular, the smaller the Eigenvalue of an Eigenvector, the smaller the mesh size, i.e. the more fine-grained the resulting grid, becomes (cf. Figure 2). Furthermore, it is known that the individual orientation of the grid cells' firing patterns follows no particular order, whereas the mesh size of the grids varies systematically along the long axis of the entorhinal cortex [67]. This feature, especially, is thought to enable multi-scale mapping, route planning and navigation [33].

Reinforcement learning reproduces basic firing patterns of place cells in rodents
In contrast to a goal-free random walk in order to explore a novel environment (as in the previous setting), navigation is usually driven by a specific goal or reward [68], e.g. food. The task is therefore ideally suited for reinforcement learning (RL) [69]. In our simulation, we reproduced a classical rodent maze experiment presented by Alvernhe et al. [36]. As in the supervised learning setting, the successor representations learned in the RL setting are very similar to the ground truth (cf. Figure 3). Also, the resulting place cell firing patterns closely resemble those of place cells in rodents during maze navigation tasks. The SR place fields are clearly different from those obtained without any reward in the goal-free exploration task (cf. Figure 1). A position close to a reward state is associated with highly localized firing patterns, where the highest successor probabilities are directed towards the reward states. Furthermore, places in the middle of the maze are associated with firing patterns that are stretched parallel to the orientation of the maze's main corridor. The highest successor probabilities are localized around the starting position, but still also in reach of the reward states. The side arms of the maze are a detour to the goal (reward position), and are therefore mainly ignored by the network, i.e. associated with the lowest successor state probabilities.

Linguistic structures
Linguistic constructions define a network-like linguistic map

Cognitive maps are, however, not restricted to physical space. On the contrary, cognitive maps may also be applied to arbitrary abstract and complex state spaces. In general, any state space can be represented as a graph. In this case, nodes correspond to states, and edges to state transitions. A prime example of such graph-like (or network-like) state space representations is language. In cognitive linguistics, there is broad agreement that language is represented as a network in the human mind [70][71][72][73][74][75][76], where the nodes correspond to linguistic units at different hierarchical levels, from phonemes, through words, to idioms and abstract argument structure constructions [70].
In particular, "the nodes at one level of analysis are networks at another level of analysis" [77]. Hence, multi-scale SR [33] appears to be an ideal theoretical framework to explain language representation and processing in the human mind, while the systematically varying grid scale along the long axis of the entorhinal cortex [67] might explain its implementation in the human brain.
To investigate this hypothesis, we constructed as a first step a simplified language as described in detail in the Methods section. The lexicon together with the three linguistic constructions result in a network-like linguistic map (cf. Figure 4) that has to be learned by the neural network.

The neural network learns state TP and SR matrices

The learned behaviour of the network in the state space can be displayed as a state TP or SR matrix. After training, the TP matrix predicted by the network is very similar to the ground truth (cf. Figure 5a,c). However, the network also predicts adjectives (states 0-9) as successors of nouns (states 20-29), even though this transition does not explicitly exist in the three pre-defined constructions. Consequently, the network's SR matrix is also slightly different from the ground truth (cf. Figure 5b,d).

Word classes spontaneously emerge as clusters in the TP and SR vector space

The transition probabilities from a given word to each of the 40 words (rows of the TP matrix), as well as the corresponding successor probabilities (rows of the SR matrix), can be represented as vectors, and hence may be interpreted as points in a 40-dimensional TP or SR space, respectively, where each word corresponds to a particular point. In order to further investigate the properties of these high-dimensional representations, we visualize both the TP and the SR space using multi-dimensional scaling. In particular, the 40-dimensional TP and SR vector representations of each word are projected onto a two-dimensional plane as described in detail in the Methods section. By color-coding each word according to its word class, we can assess putative clustering of the vocabulary. Remarkably, the words actually cluster according to their word classes (cf. Figures 6 and 7), even though this information was not provided to the neural network (e.g. as an additional label for each word) at any time during training.

Discussion
In this study, we demonstrated that efficient successor representations can be learned by artificial neural networks in different scenarios. The emerging representations share important properties with network-like cognitive maps, enabling e.g. navigation in arbitrary abstract and conceptual spaces, and thereby broadly supporting domain-general cognition, as proposed by Bellmund et al. [18].
In particular, we created a model which can learn the SR for spatial and non-spatial environments. The model successfully reproduced experimentally observed firing patterns of place and grid cells in simulated spatial environments in two different scenarios: first, an exploration task based on supervised learning in a square room without any obstacles, and second, a navigation task based on reinforcement learning in a simulated maze. Furthermore, our neural network model learned the underlying word classes of a simplified artificial language just by observing sequences of words.
The involvement of the entorhinal-hippocampal complex, as the most probable candidate structure underlying network-like cognitive maps and multi-scale navigation [15,16,29,33,64], in language processing has already been demonstrated experimentally [37,38]. Our study further supports, in particular, the involvement of place cells as the nodes of the "language network" suggested in cognitive linguistics [70][71][72][73][74][75][76]. Early language acquisition, especially, is driven by passive listening [78] and implicit learning [79]. Our model replicates learning by listening and therefore resembles a realistic scenario.
The varying grid cell scaling along the long axis of the entorhinal cortex is known to be associated with hierarchical memory content [25]. The Eigenvectors of the SR matrix are strikingly similar to the firing patterns of grid cells, and therefore provide a putative explanation of the computational mechanisms underlying grid cell coding. These multi-scale representations are perfectly suited to map hierarchical linguistic structures from phonemes through words to sentences, and even beyond, e.g. to events or entire narratives. Indeed, recent neuroimaging studies provide evidence for the existence of "event nodes" in the human hippocampus [23].
Since our neural network model is able to learn the underlying structure of a simplified language, we speculate that the human hippocampal-entorhinal complex similarly encodes the complex linguistic structures of the languages learned by a given individual. Learning further languages with a structure similar to previously learned languages might therefore be easier, since the multi-scale representations and cognitive maps in the hippocampus can be more easily transferred and re-mapped [15,16], i.e. re-used, in major parts for the new language.
Whether the hippocampus is actually involved in multi-scale representation and processing of linguistic structures across several hierarchies needs to be verified experimentally and theoretically. Neuroimaging studies during natural language perception and production, for instance listening to audiobooks [48], need to be performed. Only continuous, connected speech provides such corpus-like rich linguistic structures, which are crucial to assess putative multi-scale processing. Additionally, further theoretical studies are needed to extend the presented model and to apply it to more complex and naturalistic linguistic tasks, e.g. word prediction in a natural language scenario.
As recently suggested, the neuroscience of spatial navigation might be of particular importance for artificial intelligence research [80]. A neural network implementation of hippocampal successor representations, especially, promises advances in both fields. Following the research agenda of Cognitive Computational Neuroscience proposed by Kriegeskorte et al. [81], neuroscience and cognitive science benefit from such models by gaining deeper understanding of brain computations [50,82,83]. Conversely, for artificial intelligence and machine learning, neural network-based multi-scale successor representations to learn and process structural knowledge (as an example of neuroscience-inspired artificial intelligence [84]) might be a further step to overcome the limitations of contemporary deep learning [85][86][87][88], towards human-level artificial general intelligence.

Figure 1 :
Figure 1: Supervised learning to explore a spatial environment: The SR for a 2D square environment is learned with a supervised neural network. The small green squares indicate two sample starting positions (a, d). The corresponding SR are calculated and serve as ground truth (b, e). The neural network learns the transition probabilities for its direct neighbors and estimates the SR for a sequence length of t = 10 (c, f).

Figure 2 :
Figure 2: Grid cell-like Eigenvectors of the SR matrix: The firing patterns of grid cells in the entorhinal cortex are proposed to represent the Eigenvectors of the SR matrix [32]. For the square room depicted in Figure 1, the Eigenvectors for the first 30 Eigenvalues of the SR matrix are shown (re-shaped to the shape of the square environment). Indeed, they resemble grid cell-like firing patterns. Furthermore, the grids vary in orientation and scaling, as observed in electrophysiological experiments in rodents.

Figure 3 :
Figure 3: Reinforcement learning to navigate a spatial environment: We reproduced the rat maze experiment published by Alvernhe et al. [36] (left column). To this end, we simulated the corresponding maze environment; small green squares indicate three different sample starting positions (second column). Based on the transition probabilities to neighboring states, we calculated the successor representation of the maze as ground truth (third column). The predicted SR of the trained network, i.e. the firing patterns of the artificial place cells, are very similar to the underlying ground truth (right column).

Figure 4 :
Figure 4: Network-like map of linguistic constructions: The simplified language model consists of five different word classes and three linguistic constructions defining allowed word class transitions. The word transition matrix can be visualized as a graph or network-like map, where each word corresponds to a node, and edges represent possible word transitions. Different node colors indicate different word classes. Note that edges for transition probabilities smaller than 10^-4 are not shown, for better readability.

Figure 5 :
Figure 5: Word transition probability and word successor representation matrices: After training on the linguistic data, the neural network predicts the word transition probabilities (TP) and the word successor representations (SR). As ground truth, the calculated TP matrix (a) and the SR matrix (b) for t = 2 and γ = 1 are shown. The corresponding predictions learned by the network are very similar to the ground truth in both cases (c, d). States (0-39) correspond to words. 0-9: adjectives, 10-19: verbs, 20-29: nouns, 30-34: pronouns, 35-39: question words.

Figure 6 :
Figure 6: MDS of the word transition probability vectors: A two-dimensional projection of the 40-dimensional word TP vectors (rows of the TP matrix) for the calculated ground truth (a) and the learned TP matrix (b). In both cases, the words build clearly separated, dense clusters according to the respective word class. Different colors correspond to word classes. Note that the scaling of the axes is in arbitrary units, since the coordinates have no particular meaning other than indicating the relative positions of the projected vectors.

Figure 7 :
Figure 7: MDS of the word successor representation vectors: A two-dimensional projection of the 40-dimensional word SR vectors (rows of the SR matrix) for the calculated ground truth (a) and the learned SR matrix (b). In both cases, the words cluster according to the respective word class. Different colors correspond to word classes again. However, the resulting clusters are less dense and located closer to each other than for the TP vectors. Since SR vectors cover several time steps, whereas TP vectors only cover a single time step in the future, this result is intuitive. Note that the scaling of the axes is in arbitrary units, since the coordinates have no particular meaning other than indicating the relative positions of the projected vectors.