Control of neural systems at multiple scales using model-free, deep reinforcement learning

Recent improvements in hardware and data collection have lowered the barrier to practical neural control. Most of the current contributions to the field have focus on model-based control, however, models of neural systems are quite complex and difficult to design. To circumvent these issues, we adapt a model-free method from the reinforcement learning literature, Deep Deterministic Policy Gradients (DDPG). Model-free reinforcement learning presents an attractive framework because of the flexibility it offers, allowing the user to avoid modeling system dynamics. We make use of this feature by applying DDPG to models of low-level and high-level neural dynamics. We show that while model-free, DDPG is able to solve more difficult problems than can be solved by current methods. These problems include the induction of global synchrony by entrainment of weakly coupled oscillators and the control of trajectories through a latent phase space of an underactuated network of neurons. While this work has been performed on simulated systems, it suggests that advances in modern reinforcement learning may enable the solution of fundamental problems in neural control and movement towards more complex objectives in real systems.


Background
Background on Neural Control. Past approaches to neural control have focused on dynamical systems-based formulations of the control problem, where methods have been designed to control a single model of neural dynamics [3][4][5][6][7][8][9][10] . Specifically, in 9 a method is developed for directly optimizing oscillator coupling strength to impose synchronization on the system. While an interesting approach, this method is unlikely to directly lead to algorithms that can be implemented on real neural systems because the coupling weights between oscillators can be difficult to perturb. The authors of 7,8 use a different model for system inputs than we do: they assume that a common forcing input is applied to all oscillators. In addition, these control models are open-loop strategies: the authors give a number of motivations for this approach, stating that system dynamics and network connectivity is difficult to estimate and state information may be unavailable. For large scale neural systems, while the dynamics are largely unknown, high spatial resolution state observations can be obtained using functional Magnetic Resonance Imaging (fMRI) and detailed network connectivity can be obtained using Diffusion Tensor MRI (DT-MRI). More importantly though, while their results show that this forcing term can lower the minimum coupling strength at which synchronization occurs, there is still a lower-bound below which synchronization no longer occurs, even in the presence of forcing. Our claim is that DDPG is able to synchronize a collection of oscillators past this lower bound and we compare DDPG with this method in the results section.
Other dynamical systems approaches include [3][4][5][6]10 , where the authors employ phase reduction methods to models of neural oscillations to better understand how one might desynchronize [3][4][5][6] or synchronize 10 the oscillators (though in 10 , the problem of control was mentioned as a potential application, but not studied). These methods involve a search for a reduced set of phases (often a single scalar), whose dynamics well-characterize the oscillatory behavior of each oscillator in the network. The study of the movement of these oscillators towards (synchronization) or away from (desynchronization) these limit cycles is the focus of these papers. These methods are different from ours, not only in the fact that they are model-based, but also because they work primarily with low-dimensional dynamics. In particular, it is not clear how these methods would apply to similar control problems where the dynamics are high-dimensional.
Statistical approaches to control exist with both model-based 11,12 and model-free variants 13,14 having been attempted. In particular, the authors of 12 show how a controller based on a Generalized Linear Model (GLM) can be used to produce target spike sequences in an underactuated setting. While impressive, it is not clear how these results extend to the ability to generate or manipulate biologically meaningful states. Relying on correlated activity between neurons, we show how principal component trajectories can be induced in an underactuated setting. Principal Component Analysis (PCA) has been used in a number of experiments attempting to reduce neural dynamics to a lower dimensional manifold 15,[22][23][24] . The dynamics in the phase space defined by a few principal components has been related to simple behaviors, for example, reaching to a specific point in space results in a characteristic trajectory of primary motor cortex in the phase space of its first three principal components. And while the authors of 13,14 also employ model-free reinforcement learning for neural control, they attempt to solve a different problem from the ones we consider in this paper, that of the desynchronization of a network of synchronized, coupled oscillators. The advancements of deep reinforcement learning since the publication of these methods suggest that model-free reinforcement learning might be applied to more difficult, poorly understood problems. Deep Q-Learning. The method used in this paper is a variant of an approach towards optimal control problems called Q-learning. The goal of Q-learning is to estimate an action-value function (i.e. a Q-function) which is a measure of the expected future rewards, given a fixed policy, μ(a t |s t ). This Q-function may then be optimized by finding a policy that maximizes this quantity; this is the primary use we make of the Q-function in this paper. One of the particularly attractive features of Q-learning is that it is a model-free control strategy. This means that no explicit model of the state-transition dynamics is estimated during computation of the policy. Thus for Q-learning, particular importance is placed on finding a good estimator of the Q-function, and in this paper, we use a deep neural network to estimate Q.
The deep reinforcement learning literature typically begins by assuming that the system under consideration can be modeled with a Markov Decision Process (MDP). In a MDP, we have a state space S R n ⊆ , an action space ⊆ A R m , an initial state distribution p(s 1 ), transition dynamics p(s t + 1 |s t , a t ), and a reward function r(s t , a t ). The MDP framework allows for a convenient representation of the Q-function. The Q-function can be written as indicates the sampling of an action from the policy. Under an MDP, it may be written as where this equation is known as the Bellman equation. A common approach to estimating the optimal policy from the Bellman equation is to estimate μ in a greedy fashion by computing a s a Q a s ( ) arg max ( , ) . This is the approach taken in 1 , where a deep convolutional network is used to approximate the Q-function.
While this approach has been extremely successful on a number of different problems, it is impractical for large, continuous action spaces. Lillicrap et al. 2 proposed to modify this approach by incorporating a Deterministic Policy Gradient (DPG) (graphical description in Fig. 1) 25 . In this case, μ is assumed to be a deterministic function and the parameters of the policy are updated along the following gradient where ρ β is a behavior policy potentially distinct from μ, and θ μ and θ Q are the parameters of the policy and the value function respectively. The utility of an off-policy algorithm in the context of neural control (and biological control in general) is significant. For the simulated systems considered in this paper, being able to estimate the policy gradient in an off-policy fashion allows us to avoid resampling states with every step of the gradient optimization and to reuse past samples (i.e. experience replay). In the setting of general biological control, sampling from a policy could involve a procedure with a deleterious cumulative effect (e.g. stimulating a collection of neurons as in deep brain stimulation) and the ability to minimize the number of times this is performed could be quite valuable. The parameters θ Q are updated along the gradient of a separate loss function where y t is given by The loss function L(θ Q ) is known as the temporal difference error (TD-error) and there is a long literature about its properties and potential applications 26 . The gradients, ▽ θμ J and ▽ θQ L, are used to alternately perform updates of θ μ and θ Q respectively. This approach is termed an actor-critic method, where the actor, μ(a t |s t , θ μ ), is used to propose actions and the critic, Q μ (a t , s t |θ Q ), declares the value of those actions.
The creators of DDPG demonstrate its use on a number of physical control problems, but the application of this method or other deep reinforcement learning methods to problems in biology is lacking. In the next section, we address this problem by showing how DDPG can be used for neural control of a network of SLIF neurons and a network of weakly coupled oscillators.

Results
Network of Stochastic, Leaky Integrate and Fire Neurons. A great deal has been published about simulated networks of single neurons, making this class of models an important benchmark for our system. We chose the Stochastic, Leaky Integrate and Fire (SLIF) model for the validation of our system, and connected these neurons in a random, directed network. The SLIF model is given by where τ v is the membrane time constant, C is the membrane capacitance, e i (t) is the standard Gaussian white noise of the i'th cell, η denotes the standard deviation of this noise, u t R ( ) i S ∈ is the extrinsic control input to the i'th cell, ∈ × b R S 1 denotes the influence of the input on the neuron, I syn,j (t) is the synaptic current coming from the j'th neuron firing an action potential at time t s , A i,j is the weight of the connection between the i'th and j'th cells, E syn is the reversal potential of the synapse, g models the constant synaptic conductance, and τ s determines the decay of the synaptic current as time elapses from the incoming spike at t s . For these experiments, we assume the existence of only a single type of cell, that is, a generic excitatory cell. All values of model parameters used to generate the results shown in this paper are given in the Methods section.

Fully-Actuated Network Control.
For the first application of DDPG, we solve the toy problem of inducing an arbitrary spike train in each neuron of the network, where each target spike train is drawn at random from a Poisson distribution. A model-free controller of this system must learn a number of key tasks: to depolarize each neuron at the appropriate times, to hyperpolarize the neuron at other times, and to generate inputs to each cell sufficient to wash out the effects of neighboring neurons. The simple reward function was sufficient to achieve our objective of controlling cells to generate random, independent spike trains. Here, S i,t is the state of the i'th cell at time t and h i,t + 1 is the target activity of the i'th cell at time t + 1. In order to achieve control of the i'th cell using DDPG, we generate a separate policy and value function for each cell, resulting in N policies and value functions being estimated. This approach of controlling each cell individually with a separate policy and value function is reasonable for some modern optogenetic and implanted electrode systems where stimulation is capable of overriding the influence of neighboring inputs and the targeting of single cells is possible. We ran an experiment on a fully-actuated network of 20 SLIF cells where after a 500 episodes, the controller obtains perfect accuracy in inducing target spike trains in each cell in the network. Details regarding the initialization and parameters of this network are presented in the Methods section. One would expect the accuracy of this approach to hold for larger networks if the assumption that stimulation can always override neighboring inputs holds. Thus in this simulated setting, if not for the larger computational cost incurred by fitting and storing additional policies, controlling a small network is not more difficult than controlling a large network. Unfortunately, there are many other practical issues that make controlling a larger network much more difficult than controlling a smaller network. For example, the field of view through which neurons can be accessed in a live organism typically restricts the amount of hardware that can be used, and thus, the number of neurons that can be simultaneously simulated. One way to incorporate this limitation into our simulation involves the relaxation of Under-Actuated Network Control. In general, it is impossible to induce distinct spike trains in n independent neurons with arbitrarily high accuracy using fewer than n policies. Fortunately, in vivo neurons in a network often display highly correlated activity patterns, suggesting that full actuation is not necessary to control many biologically meaningful states. A common model for the correlated behaviors of neurons in a network (and one compatible with the SLIF model) is based on a linear mixture model where the membrane potential of neuron k at time t is given by where y pre,i (t) is the post-synaptic potential delivered from neuron i, w i is the synaptic weight between neurons i and k, and b k is the change in V k induced by neuron k itself (accounts for the term V t e t ( ) ( ) 27 . For example, correlated behaviors arise in this model system in the case of fully-connected networks of cells. In this case, if one cell in the cluster generates an action potential, then the change in the membrane potential at a subsequent timestep of all the other members of the cluster will be correlated. The situation where a network is composed of a collection of highly connected communities with sparse inter-community connections results in a covariance matrix that is approximately block-diagonal.
The study of such covariance matrices has a strong literature in the computational neuroscience community, if only indirectly, because of the popularity of methods such as Principal Component Analysis (PCA). For example, the dynamics of many large populations of neurons are known to oscillate on a low-dimensional (e.g. two or three dimensional) manifold (e.g. as shown in 16 ). Results demonstrating the low-dimensional structure of the activity of large collections of neurons have been produced for a number of different model systems and conditions [22][23][24] . It is known that PCA achieves perfect accuracy in the recovery of communities of neurons interacting via a linear mixture in the case of a covariance matrix with perfect block-diagonal structure 28 . Though, to the best of our knowledge, no previous work exists on controlling principal component trajectories in neural systems. Some recent work exists on spectral control in large scale neural systems 15 , but it isn't clear how this work might be applied to neural systems at different scales. We show how an underactuated system can be controlled to induce an arbitrarily structured oscillation in the phase space defined by a small number of principal components.
To do this, we construct an adjacency matrix of network connectivity with an approximate block-diagonal structure and assume recovery of the correct low-dimensional manifold (in general, this is not possible with PCA, but methods like the Treelet Transform 28 allow the recovery of the correct manifold without requiring perfect block-diagonal structure). Examples of an adjacency matrix with two dimensional dynamics as well as the associated principal components are shown in Fig. 2. Within each community, we pick a neuron at random to receive input from a distinct policy. We then construct one and two dimensional target oscillations and attempt to reconstruct these oscillations in the phase space defined by the first one or two principal components of the neural activity.
The reward function we use for accomplishing this is 1, otherwise, targ cntrl targ cntrl where φ cntrl (t) is the phase of the controlled oscillation, φ targ (t) is the phase of the target oscillation, d(·,·) is a distance function, and ε is a scalar. We define d(·,·) to be 0 when φ cntrl (t) is in the correct half of the unit circle at time t and infinity otherwise. To accomplish this, we discretize the unit circle into two halves, [0, π) and [π, 2π); the reward function attempts to force φ cntrl to be in the same half as φ targ at time t. We use the binary vector of action potentials at time t over all cells in the network to estimate ϕ cntrl by first projecting it onto the first one or two principal components, then estimating the phase angle. Results from these experiments are shown in Figs 3 and 4. We can see from these results that in the underactuated setting, while the objective is not quite as ambitious as in the fully-actuated setting (i.e. we are not able to induce an arbitrary spike-train in each cell), we can still induce physiologically meaningful states in a reasonably accurate way.
Kuramoto Model. The Kuramoto Model (KM) characterizes the dynamics of a network of oscillators. These oscillators interact with each other over the network, and the phase of a single oscillator changes in relation to the phases of its neighboring oscillators. The KM is given by where N is the number of oscillators, K is the coupling strength, A i,j is the edge weight between the i'th and j'th oscillators, φ i is the phase of the i'th oscillator, ω i is the natural frequency of the i'th oscillator, and ρ is a non-linear function characterizing the influence of neighboring oscillators on the i'th oscillator. In the original KM model, sin(⋅) was used as the non-linearity. All values of model parameters used to generate the results shown in this paper are given in the Methods section.  We adopt a recent control strategy for this model which was originally introduced for disrupting the synchronizing effects of the rightmost term in the KM (e.g. 29,30 ). It is known that for large K, the network of oscillators synchronize over time. An analytical solution for desynchronization of the oscillators is developed in 30 . That framework takes the form where i ctrl φ is an additive control term. With DDPG, rather than focusing on the popular problem of desynchronizing oscillators bound to synchronize, we address the problem of inducing synchrony in networks that will not synchronize on their own (i.e. with small K). To relate the controlled KM presented above with the DRL framework, consider a forward Euler discretization of the above ODE:  i t i t i can be defined using a deep neural network with parameters θ μ i . We apply this approach to the problem of synchronizing a network of weakly coupled oscillators by entrainment to a reference oscillator. This problem has recently been considered in 7,8 . An important result from this work is that the bound on the coupling strength below which synchronization no longer occurs, K c unf , is lowered by forcing input to K c . If p(ω) is the distribution from which the natural frequencies are drawn and it is unimodal and symmetric about 0, then K c p unf 2 (0) = π as N → ∞. In contrast, in the presence of forcing, the minimum coupling strength required for synchronization is reduced to = π K c 2 . Below this coupling strength, while entrainment imposes the reference frequency on each oscillator in a network, the phases achieve nominal synchronization (between 0.0 and 0.2) 7 . To demonstrate the utility of DDPG for model-free control of the Kuramoto model, we pick a coupling strength below K c (K = 0.1) and show that the synchronization achieved is significantly higher than this (about what one would expect for a coupling strength of slightly greater than 2 π using the method in 7,8 ). We define the state space to be the set of phases of all oscillators over a 40 timestep history, along with the adjacency matrix of the network. The reward function was defined to be where ε, η ∈ [0, 1] and q and q′ i are defined to be the order of synchronization. This quantity is given by where ψ corresponds to the average phase of the oscilators. The difference between q and q′ is that q corresponds to the synchronization of all N oscillators, while q′ i is defined to be the synchronization of the i'th oscillator with respect to a reference oscillator. It was observed that without this term in the reward function, the policy regularly failed to induce global synchronization. To stabilize the controller, one oscillator in the network was chosen at random to be the reference, though an external oscillator might also be used. Regularization by the norm of the action was also included, to encourage more reliable exploration of the state-action space. Specifically, without this regularization, the controller would favor large actions, even though these actions were rarely optimal. This regularization was thus included to encourage more thorough exploration of the state-action space. It can be seen from Figs 5 and 6 that the network eventually synchronizes. To understand this result, it may help to view the reward function as a shaped reward. This interpretation follows from the fact that synchronization of each oscillator in the network with a single reference oscillator is far easier than inducing global synchronization of the entire network. This point is emphasized by Fig. 6 which shows that almost all oscillators in the network achieve very high synchronization with the reference oscillator, and in general, the average synchronization with the reference is much higher than the global synchronization of the entire network. That first synchronizing with the reference makes global synchronization easier to obtain follows from the fact that the size of the state space to be explored is reduced. So, rather than having to tune frequencies and phases of the oscillators in the network, synchronization with the reference frequency-locks all oscillators and thus, global synchrony can be induced by adjusting their respective phases.

Discussion
The model-free approaches to control of neural systems presented here suggest that deep reinforcement learning has potential for application to this area. We show how the engineering problem is transformed from one that focuses on the design of appropriate system dynamics and the control of these models, to the design of good reward functions that allow for accurate and tractable optimization of the respective objectives. In the case of inducing target spike trains or latent trajectories in the SLIF model, the rewards were able to directly represent these objectives. The problem of inducing synchronization in a weakly coupled network of oscillators required the introduction of a shaped reward to allow for robust synchronization of the network. The problem of designing appropriately shaped rewards is a significant problem, not just in the application of DDPG (e.g. as in 31 ), but for  32 . In this work, we manually design the reward functions used to estimate optimal policies.
Arguably, this is not the most efficient way to find an optimal policy and in fact, several methods exist for combining model-free reinforcement learning with inverse reinforcement learning (IRL) algorithms, which are used to infer a reward function given state-action pairs sampled from an optimal policy [33][34][35][36] . And indeed, this is a interesting direction for future work. We attempted with this work to find a compromise between automating the discovery of good policies and allowing for interpretability of the actions of the learner. IRL presents a powerful set of tools with which policies can be learned to perform complex objectives, but for the first application of deep reinforcement learning to biological systems, we felt it best to ensure that the problems being solved had clear, That is, a point on either the blue or green curves demonstrates the probability of having a given value for qi. The blue histogram shows counts before training while the green histogram shows counts after training. The average synchronization with the reference, q i , is much higher than global synchronization, q, which is explained by the fact that synchronization with the reference is easier to induce than global synchronization.
interpretable control objectives. The ultimate goal of this work is to apply these algorithms to human biology and in all such work, there is necessarily a balance between technological advance and safety. Ideally, we would be able to take an optimal policy and deduce either an analytical form or empirical results explaining its performance. In fact, developing methods to allow deep neural networks to "explain themselves" is an active area of research in the machine learning and statistics communities. These methods are being proposed in response to the classical notion put forth famously by Richard Feynman who said, "What I cannot create, I do not understand". This philosophy would suggest to replace mysterious, poorly understood parts of deep reinforcement learning with manually constructed models (e.g. in a similar vein as what is done to improve the sample-complexity of model-free methods by incorporating manually designed components 18 ). Alternatively, attempts have been made to leave a black-box machine learning algorithm intact and attempt to better understand it. For example, a method using influence functions was recently developed to yield insight into how deep neural networks work when used for supervised learning 37 ; extending methods like that in 37 for problems in reinforcement learning is another interesting direction for future work. Further extensions to alternating IRL/policy optimization solvers (e.g. as in 35 ) is an even longer term goal.
There are other factors to consider in the application of deep reinforcement learning algorithms to biological systems in addition to the ability of humans to understand them. For example, our work assumes full observability of the system state. Relaxing this assumption to partial observability is required in some applications (requiring the use of methods such as 38 ): the manner in which the uncertainty induced by partial observability interacts with modeling uncertainty is an important problem for applications to biology. The off-policy nature of DDPG allows for a reduced number of samples from the policy during learning and thus, the use of experience replay during learning. This can reduce the number of times a controller would need to interact with an actual brain in order to fit a policy. Another approach for improving the efficiency of exploration was proposed in 31 , where the authors show that data generation and resampling efficiency can be improved relative to the number of parameter updates. For complex objectives, it may be helpful to initialize the search for an optimal policy from states that have achieved a partial reward. Both methods help to accelerate discovery of optimal policies for complex control objectives.

Conclusion
We have presented a model-free control strategy for the control of neural systems at multiple scales. We demonstrated the use of this strategy on a large-scale system (the Kuramoto Model) and a small-scale system (a network of Stochastic, Leaky Integrate and Fire neurons). We showed that this approach can be used to solve difficult, unsolved control problems including the induction of latent trajectories in the network of SLIF cells as well as the synchronization of weakly coupled Kuramoto oscillators. The success of this approach has been demonstrated in simulation, but we believe that it has potential for application to physical systems. A number of the extant issues with model-free deep reinforcement learning as applied to biological systems have been discussed and hopefully further work to address these issues will be forthcoming to accelerate progress in this exciting new application area.

Methods
Deep Neural Network Parameters. For the approximation of the Q-function as well as the policy and target networks, deep networks with two hidden layers were used. For the network of SLIF neurons, the first hidden layer had 400 units and the second had 300 units, while for the KM, the first hidden layer had 1200 units and the second had 1000. A value of 0 001 τ = .
was used in the exponential moving average between updated network parameters and past network parameters. ADAM 39 was used for stochastic gradient descent to perform parameter updates. The same learning rate was used for fitting the actor and critic networks for both model systems. For control of the network of SLIF neurons, a learning rate of 0.01 was used, while for control of the KM, a learning rate of 0.0001 was used. These learning rates were chosen based on the rates used in the original DDPG paper (KM control) or slightly modifying these values (SLIF control). The learning rate was increased for SLIF control because it was observed that if this rate were too low, the optimal stimulation strength was never reached and reliable spiking behavior wasn't induced in any of the cells. Mini-batch stochastic gradient descent was performed with a batch size of 32 4-tuples, where each 4-tuple contained (s t , a t , r(s t , a t ), s t + 1 ), for some time t. The discounted rate of future returns used was γ = 0.99. , and E syn = 70.0 17,18 . All simulations were begun with cells at the resting membrane potential. Before the controller delivered input into the system in both training and testing instances, the network was reinitialized to resting membrane potential. For the fully-actuated example, network adjacency matrices were initialized with directed edges whose weights were drawn from a uniform distribution over [0, 1]. This matrix was made symmetric for the under-actuated example. Simulations were run on networks as large as 40 cells for the fully-actuated example; 4 neuron networks were used for the 1-D under-actuated example and 32 neuron networks were used for the 2-D under-actuated example. For the fully-actuated example, the state consisted of a 10 timestep spike history concatenated with a 10 timestep lookahead into the target spike train. Similarly, for the under-actuated example, a 10 timestep spike history was concatenated with a 10 timestep lookahead to construct the state. Since the projection of the binary vector of all neuron spikes onto a principal component effectively gives an instantaneous firing rate of all neurons in a given community, the lookahead in this case consisted of sampled spiking activity of the controlled cell at the target rate.
SCiEntifiC REPORtS | (2018) 8:10721 | DOI:10.1038/s41598-018-29134-x KM Parameters. The controller was tested on a network of 20 oscillators, each initialized to a phase (φ i ) drawn uniformly at random from [0, 2π], with natural frequencies (ω i ) drawn from N(0, 10), and edge weights (A i,j ) drawn from a uniform distribution over [0,1]. For our experiments, we chose K = 0.1. ε, η were 0.1 and 1.0 respectively. Code Availability. All code required to replicate the results presented in this paper will be made publicly available on Github after publication.