Recurrent Spiking Networks Solve Planning Tasks

A recurrent spiking neural network is proposed that implements planning as probabilistic inference for finite and infinite horizon tasks. The architecture splits this problem into two parts: the stochastic transient firing of the network embodies the dynamics of the planning task, and with appropriate injected input these dynamics are shaped to generate high-reward state trajectories. A general class of reward-modulated plasticity rules for these afferent synapses is presented. The updates optimize the likelihood of obtaining a reward through a variant of an Expectation Maximization algorithm, and learning is guaranteed to converge to a local maximum. We find that the network dynamics are qualitatively similar to transient firing patterns during planning and foraging in the hippocampus of awake behaving rats. The model extends classical attractor models and provides a testable prediction for identifying modulating contextual information. In a reaching and obstacle avoidance task on a real robot arm, the ability to represent multiple task solutions is investigated. The neural planning method with its local update rules provides the basis for future neuromorphic hardware implementations, with promising potential for processing large amounts of data and for early initiation of strategies to avoid dangerous situations in robot co-worker scenarios.

(S1)

Equation (S1) is a simple stochastic version of the spike response model.1 Equation (3) of the main text is a special case of this form with a resetting rectangular PSP kernel of one time-step length, and thus ν̃_i(t) = ν_i(t) and ỹ_j(t) = y_j(t).
Solving the inference problem specified in equation (1) in the main text is equivalent to minimizing the Kullback-Leibler divergence defined in equation (5) in the main text. This requires computing the derivative of the log-probability over spike sequences, ∂/∂θ_{kj} log q(ν; θ). Inserting (4) from the main text allows us to rewrite this derivative as

∂/∂θ_{kj} log q(ν; θ) = ∂/∂θ_{kj} ∑_{t=1}^{T} [ν_{t,k} log ρ_{t,k} + (1 − ν_{t,k}) log(1 − ρ_{t,k})]   (S2)

Figure S1. Illustration of the simple probabilistic planning model. A: The neural network architecture considered here for solving the probabilistic planning problem. A layer of state neurons that controls the behavior of the agent receives feedforward input from context neurons, whose activity determines the desired goal. B: A simple planning problem that requires passing through a passage at two specific points in time. One successful movement trajectory is shown (blue). The inset shows the lateral synaptic weights between the state neurons that encode the transition model for this task.
The general update rule in equation (7) in the main text then follows directly from this last result.

Details to A finite horizon planning task
Here, we provide additional details on the network model that was used to solve the finite horizon planning task in Fig. S1.
In this architecture the activity of the K state neurons was constrained to winner-take-all (WTA) dynamics, which ensures that exactly one neuron is active in each time step, i.e., ∑_k ν_{t,k} = 1, ∀t. The spiking probability ρ_{t,k} of neuron k to spike at time t is therefore given by the normalized activation function in equation (S6). The WTA network gives rise to a very simple planning model. The state neurons directly encode the discrete state of the environment, where ν_{t,k} = 1 denotes that the environment is in state k at time t. Furthermore, the lateral synaptic weights directly encode the transition operator T of this planning model. To see this, we note that the dynamics realize a distribution over network state trajectories ν_{1:T}, which can be written in the form

q(ν_{1:T} | θ) = p(ν_0) ∏_{t=1}^{T} q(ν_t | ν_{t−1}; θ)   (S7)

where, by inserting equation (3) of the main text, we identify the two factors given in (S8). Thus the first term of (S8) determines the transition operator T implemented through the lateral weights w_{ki}, and the second term realizes the function φ_t parametrized by the feedforward weights θ_{kj}. From this we can see that the weights w_{ki} encode the log-probabilities of making a transition from state i to state k, i.e., w_{ki} ∝ log T(ν_{t,k} = 1 | ν_{t−1,i} = 1) of the planning model.
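As a concrete illustration, the WTA dynamics can be sketched in a few lines, assuming a standard softmax (normalized exponential) form for the activation in (S6); the random weight values, the random initial state, and the one-hot context activity below are illustrative stand-ins, not the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

K, N, T = 9, 30, 30                 # state neurons, context neurons, horizon
w = rng.normal(0, 0.1, (K, K))      # lateral weights (transition model)
theta = rng.normal(0, 0.1, (K, N))  # feedforward weights from context neurons
y = np.eye(N)[:T]                   # one context neuron active per step (N = T)

def sample_trajectory(w, theta, y, rng):
    """Sample one WTA state trajectory: exactly one spike per time step."""
    K = w.shape[0]
    nu = np.zeros((len(y), K), dtype=int)
    prev = np.zeros(K)
    prev[rng.integers(K)] = 1        # random initial state nu_0
    for t, y_t in enumerate(y):
        u = w @ prev + theta @ y_t   # lateral plus feedforward input
        p = np.exp(u - u.max())
        p /= p.sum()                 # softmax -> WTA probabilities rho_{t,k}
        nu[t, rng.choice(K, p=p)] = 1
        prev = nu[t]
    return nu

nu = sample_trajectory(w, theta, y, rng)
assert (nu.sum(axis=1) == 1).all()   # WTA constraint: sum_k nu_{t,k} = 1
```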
To derive the update rules for the feedforward weights θ_{kj} projecting onto the WTA network, we plug the activation function (S6) into the general form in equation (S5). Using this we find that the weights θ_{kj} become optimal under the learning rule in equation (S9). The updates converge to a lower bound of the KL-divergence, see (5) in the main text. This bound becomes tight and the KL-divergence vanishes if the activity of the context neurons has the following temporal structure: we assume that in each time step exactly one context neuron is active, i.e., N = T and y_{t,j} = 1 if t = j. This allows the network to precisely learn the temporal structure of the planning task. To see this we plug the assumed dynamics of the context neurons and (S6) into equation (S9) and set it to zero, which yields the fixed point θ*_{kj} of parameter θ_{kj} given by θ*_{kj} ∝ log p(ν_{j,k} = 1 | r = 1). The parameters θ_{kj} converge (up to a normalizing constant) to the log-probability that neuron k should spike at time t = j, given that the reward is acquired at the end of the trial.
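This fixed point can be illustrated directly: given a set of rewarded trajectories (randomly generated and purely hypothetical here), setting the weights to the log of the empirical spike probabilities makes the softmax reproduce exactly those probabilities. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 9, 30
n_rewarded = 500

# Hypothetical rewarded trajectories: one-hot state per time step.
trajs = np.zeros((n_rewarded, T, K))
states = rng.integers(0, K, size=(n_rewarded, T))
trajs[np.arange(n_rewarded)[:, None], np.arange(T)[None, :], states] = 1

# Empirical spike probability given reward, p(nu_{j,k} = 1 | r = 1).
p_spike = trajs.mean(axis=0).T + 1e-12       # shape (K, T); avoid log(0)

# Fixed point of the learning rule with one-hot context input (N = T):
theta_star = np.log(p_spike)                 # theta*_{kj} = log p(...)

# The softmax over these weights recovers the empirical distribution.
recon = np.exp(theta_star) / np.exp(theta_star).sum(axis=0, keepdims=True)
assert np.allclose(recon, p_spike / p_spike.sum(axis=0, keepdims=True))
```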
We evaluated the model in the simple navigation task illustrated in Fig. S1. Figure S1A illustrates the WTA network architecture. As described above, the state neurons encode the current state of the agent, which can be at any position on a linear track with K = 9 different locations. Each trial has a duration of T = 30 time steps. A reward is delivered to the agent if it avoids an obstacle at time T/2 and at time T. The planning problem and the synaptic weights w_{ki} that implement the state transition probabilities are illustrated in Fig. S1B. The agent is not allowed to jump to arbitrary positions but only to neighboring states.
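The linear-track transition model, where the agent may only stay or move to a neighboring position, can be written down explicitly; the uniform choice among admissible moves is an illustrative assumption, with w_{ki} ∝ log T as derived above:

```python
import numpy as np

K = 9  # positions on the linear track

# Transition operator T: stay or move to a neighboring position, uniformly.
T_op = np.zeros((K, K))
for i in range(K):
    neighbors = [k for k in (i - 1, i, i + 1) if 0 <= k < K]
    for k in neighbors:
        T_op[k, i] = 1.0 / len(neighbors)   # T(k | i)

# Lateral weights encode log transition probabilities: w_ki ∝ log T(k | i).
w = np.full((K, K), -1e6)                   # effectively forbidden transitions
mask = T_op > 0
w[mask] = np.log(T_op[mask])

assert np.allclose(T_op.sum(axis=0), 1.0)   # each column is a distribution
```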
In the first implementation of this task we used a time-binned binary version of the network that did not use the PSP kernels, i.e., ν̃_{t,k} = ν_{t,k} and ỹ_{t,j} = y_{t,j}. Furthermore, one neuron of the network was forced to fire in each time step according to (S6).
This neuron model allows a direct implementation of the simple planning problem, and we found that we could achieve almost perfect performance with this network (the reward rate averaged over 100 trial runs was 97.80 ± 4.64%).
We compared this neuron model with a more realistic spiking model that lets each state neuron fire with a time-dependent Poisson firing rate of ρ_0 · ρ_{t,k}. The constant ρ_0 = 100 Hz is the maximum firing rate. Furthermore, we used a resetting rectangular PSP kernel of 50 ms length, and thus ν̃_{t,i} and ỹ_{t,j} are one if a presynaptic spike occurred within the last 50 ms and zero else. Here each trial had a duration of 1.5 seconds of equivalent biological time. This time was divided into 30 time bins of equal length of 50 ms. The activity of the 30 context neurons was determined by a firing rate that was set to 100 Hz for the duration of one bin. The decoded discrete state of the environment was assigned to the index of the state neuron with the highest firing rate in each time bin. Figure 1B-D of the main text was produced with this model. With this more detailed neural network model we achieved a reward rate of 87.40 ± 15.08% for the simple linear track task.
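A minimal sketch of this spiking variant: Poisson firing at rate ρ_0 · ρ_{t,k} is simulated in 1 ms steps, and the state is decoded as the neuron with the most spikes per 50 ms bin. The per-bin probabilities ρ_{t,k} used here are illustrative placeholders, not the trained values:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 9
dt = 0.001                 # 1 ms simulation step
bin_len = 0.050            # 50 ms bins (rectangular PSP length)
T_bins = 30                # 1.5 s trial = 30 bins
rho0 = 100.0               # maximum firing rate in Hz

# Hypothetical WTA probabilities rho_{t,k}: one dominant state per bin.
rho = np.full((T_bins, K), 0.02)
true_states = rng.integers(0, K, T_bins)
rho[np.arange(T_bins), true_states] = 1.0 - 0.02 * (K - 1)

steps_per_bin = int(bin_len / dt)
# Bernoulli approximation of Poisson spiking per 1 ms step.
spikes = rng.random((T_bins, steps_per_bin, K)) < rho0 * rho[:, None, :] * dt

# Decode: index of the state neuron with the highest spike count per bin.
decoded = spikes.sum(axis=1).argmax(axis=1)
accuracy = (decoded == true_states).mean()
```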

Details to Extension to infinite horizon problems
Here we provide additional details on the infinite horizon planning model. In the infinite horizon case, we assume that the length T of a trial is unknown. Therefore the agent may receive rewards at any time t during continuous exploration of the state space.
The goal of planning is then to optimize the parameters θ in the neural network so that it generates infinite length trajectories that maximize the expected total discounted reward ∑_{t=0}^{∞} γ^t r_t, where r_t is a scalar reward received at time t and γ ∈ [0, 1) is a discount factor.
Infinite horizon planning can also be formalized as a probabilistic inference problem, as shown in ref. 2. The underlying generative model is an infinite mixture of Markov chains of finite lengths T. In our spiking neural network implementation, a spike train of length T, denoted here by ν_T, is generated in response to the injected context neuron activity. The corresponding joint distribution of the spike train ν_T and the acquired reward r is then given by

q(r, ν_T; θ) = p(T) q(ν_T; θ) p(r | ν_T)   (S10)

where p(T) = (1 − γ)γ^T is the prior distribution over trajectory lengths, and q(ν_T; θ) is the distribution over spike trains of length T according to the network dynamics, equal to (S7). The probability of obtaining a binary reward r at the end of a trajectory of length T is equal to p(r = 1 | ν_T).
Using this, the goal of maximizing the total discounted reward is equivalent to maximizing the likelihood of the reward, i.e., p(r = 1 | θ).3 Using equation (S10), this likelihood can be written as an expectation over all trajectories of arbitrary length. Instead of calculating this expectation exactly, in the learning rule we approximate the expectation over the posterior by drawing a single continuous sample from the network activity and updating the parameters online at each time step t. This corresponds to an online Monte-Carlo EM approximation for the infinite horizon planning problem.4 The main idea of this derivation is that we can interpret sub-sequences of an infinite length trajectory of the network as weighted samples from (S10). For example, if during the run a reward arrives at time t (r_t = 1), then we can select each sequence of states ending at t as a sample weighted with the posterior probability of its length, p(T | r = 1; θ). Moreover, in the stationary case (when the context neurons inject a stationary input into the state neurons) the posterior over trajectory lengths becomes independent of the parameters and we get p(T | r = 1; θ) = p(T). Therefore we can directly weight the sampled spike trajectories by the prior p(T) = (1 − γ)γ^T. This entails that for each time step t where the reward is r_t = 1 (in the stationary case, where t becomes large), the parameter θ_{kj} should undergo the change given by the general update rule for infinite horizon planning problems, which equals the total update of θ_{kj} from all considered sample trajectories up to time t.
The update can be realized by using an eligibility trace e_{t,kj} associated with each synapse kj, with dynamics that are updated in parallel to the synaptic weights, θ_{t,kj} = θ_{t−1,kj} + η r_t e_{t,kj}.
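Such an online update can be sketched as follows: each trace accumulates a local gradient term, discounted by γ, and the weights change only when a reward arrives. The trace dynamics shown, a γ-discounted accumulation of (ν_{t,k} − ρ_{t,k}) ỹ_{t,j}, are an assumed standard form, and the reward schedule is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
K, N = 9, 30
gamma, eta = 0.98, 0.01     # discount and learning rate (illustrative values)

theta = np.zeros((K, N))
e = np.zeros((K, N))        # one eligibility trace per synapse kj

def online_step(theta, e, y_t, r_t, rng):
    """One online update; trace dynamics are an assumed standard form."""
    u = theta @ y_t
    p = np.exp(u - u.max())
    p /= p.sum()                              # WTA spiking probabilities
    nu = np.zeros(K)
    nu[rng.choice(K, p=p)] = 1                # sampled network state
    # Discounted trace of the local log-likelihood gradient (nu - p) y^T.
    e = gamma * e + np.outer(nu - p, y_t)
    theta = theta + eta * r_t * e             # weights change only on reward
    return theta, e

y_t = np.eye(N)[0]                            # stationary context input
for t in range(100):
    r_t = float(rng.random() < 0.1)           # sparse hypothetical reward
    theta, e = online_step(theta, e, y_t, r_t, rng)
```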
As in the finite horizon case, we can also perform offline updates of the parameters θ according to (S15), which minimizes D_KL(p(ν | r = 1) || q(ν; θ)). With the offline learning rule the inputs install a gradient in the state space that incorporates backward smoothing into the sampling process of trajectories. This has the effect that the stochastic trajectories are "attracted" by states in the state space with a high probability of receiving a reward.

Details to Modeling the transient firing in place cells
A notable difference to the biological data is the resolution of the simulated activity. In the paper only 100 place cells were modeled to visualize the corresponding spike events. For a higher density of uniformly distributed place cells (i.e., 900 place cells aligned on a 30-by-30 grid), the place cells' activity for three example movements is shown in Fig. S2.
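The higher-density encoding can be sketched as Gaussian place fields with centers on a 30-by-30 grid; the field width and the normalization to unit peak rate are illustrative assumptions:

```python
import numpy as np

grid = 30                                   # 900 place cells on a 30-by-30 grid
centers = np.stack(np.meshgrid(np.linspace(0, 1, grid),
                               np.linspace(0, 1, grid)), -1).reshape(-1, 2)
sigma = 0.05                                # hypothetical place-field width

def place_cell_rates(pos):
    """Gaussian tuning: relative firing rate of each cell at a 2-D position."""
    d2 = ((centers - pos) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

rates = place_cell_rates(np.array([0.5, 0.5]))
```

A movement trajectory then sweeps a bump of activity across this population, as visualized in Fig. S2.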
Here, from left to right, we illustrate the place cells' activity at seven points in time. In each of the three rows in

Details to Learning obstacle avoidance tasks in a real robot
The proposed spiking neural network is a restricted Boltzmann machine, for which efficient learning methods such as contrastive divergence (CD)8 have been developed. We used CD to learn the transition model encoded in the weights w_{ki}. CD estimates a maximum likelihood gradient by drawing samples from the model distribution q(ν; θ) in equation (4) in the main text. This gradient is used to update the model parameters according to Δw_{ki} = α (ν̂_{t−1,k} ν̂_{t,i} − ν_{t−1,k} ν_{t,i}), with the learning rate α = 0.05, demonstrations ν̂ and model samples ν. The data samples are state transitions ν̂_{t−1,k} ν̂_{t,i} generated from human demonstrations via kinesthetic teach-in. Only recordings of the Cartesian x and y coordinates of a KUKA lightweight arm were used. The trajectory data was transformed into spike patterns using 15 Gaussian basis functions per dimension in an inhomogeneous Poisson process. In our experiments, we used the Gaussian bandwidth parameter σ² = 0.03, an additional Poisson rate offset λ_off = −1.5 and a refractory period of 10 ms.9 The kinesthetic demonstrations covered the operational space of the robot (a grid of 70 × 70 cm). In total, 15 minutes of demonstrations sampled at 1 ms were transformed into spike patterns and used for training. The progress of the weight updates is illustrated in Fig. S3.
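A CD-1 style sketch of this transition learning on a toy chain: for each demonstrated transition, a model transition is sampled from the current weights, and the weights move toward the data statistics and away from the model statistics. The demonstration data and the small state count here are hypothetical stand-ins for the encoded robot trajectories:

```python
import numpy as np

rng = np.random.default_rng(4)
K = 9          # toy number of discrete states (illustrative)
alpha = 0.05   # learning rate from the text

def one_hot(k, K):
    v = np.zeros(K)
    v[k] = 1.0
    return v

def sample_model_transition(w, prev, rng):
    """Draw the model's next state from a softmax over lateral input (CD-1)."""
    u = w @ prev
    p = np.exp(u - u.max())
    p /= p.sum()
    return one_hot(rng.choice(K, p=p), K)

# Hypothetical demonstration: a chain that steps one state to the right.
demo = [one_hot(min(t, K - 1), K) for t in range(K)]

w = np.zeros((K, K))
for _ in range(200):
    for prev_hat, cur_hat in zip(demo[:-1], demo[1:]):
        cur = sample_model_transition(w, prev_hat, rng)
        # Delta w = alpha * (data transition statistics - model statistics)
        w += alpha * (np.outer(cur_hat, prev_hat) - np.outer(cur, prev_hat))
```

After training, the weight for the demonstrated transition 0 → 1 dominates its column, i.e., the model has absorbed the transition statistics of the demonstrations.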