Neurons learn by predicting future activity

Understanding how the brain learns may lead to machines with human-like intellectual capacities. It was previously proposed that the brain may operate on the principle of predictive coding. However, it is still not well understood how a predictive system could be implemented in the brain. Here we demonstrate that the ability of a single neuron to predict its future activity may provide an effective learning mechanism. Interestingly, this predictive learning rule can be derived from a metabolic principle, where neurons need to minimize their own synaptic activity (cost), while maximizing their impact on local blood supply by recruiting other neurons. We show how this mathematically derived learning rule can provide a theoretical connection between diverse types of brain-inspired algorithms, thus, offering a step toward development of a general theory of neuronal learning. We tested this predictive learning rule in neural network simulations and in data recorded from awake animals. Our results also suggest that spontaneous brain activity provides “training data” for neurons to learn to predict cortical dynamics. Thus, the ability of a single neuron to minimize surprise: i.e. the difference between actual and expected activity, could be an important missing element to understand computation in the brain.


Suppl.
. Implementation of predictive learning rule in a spiking neuronal network. a, Activity of a sample neuron. Red trace shows neuron internal activation which is a function of synaptic inputs. (Top) Spikes are generated based on the value of internal activation. Gray shaded area marks the extent of free phase which is used to predict steady-state activation. b, Actual vs. predicted steady-state free phase activity. Each dot represents the activity of one neuron during presentation of one stimuli. Activity of 1000 hidden neurons during presentation of 200 stimuli from the test dataset is shown. The correlation coefficient between actual and predicted activity was R=0.99 (p<0.00001). c, Learning curves on MNIST task for training and testing dataset. Error rate of 2.46% on the test set shows that the spiking neuronal network with our learning rule was able to solve the presented task.

Spiking network model.
The code for this network with all implementation details is provided at: https://github.com/ykubo82/bioCHL/tree/master/bioCHLspk . Briefly, our network was based on work by O'Connor et al. (2019) who developed spiking networks using the Equilibrium Propagation algorithm which is an expansion of Contrastive Hebbian Learning (Scellier et al., 2017). The steady state is calculated using the Forward Euler method with each time step t: Aj t = (1 -) Aj t-1 + p'(Aj t-1 ) (∑ p(Ai t-1 ) + bj) , where A is the activation state, p'(Aj) is the derivative of the activation function p, b is the bias, and i and j are neurons indices.  can be seen as the learning rate of the activations. These states are clipped to range [0, 1]. Each neuron communicates only using binary signals: 0 or 1, to model spiking activity. For generating spikes, an encoder converts the neuron activation state to 1 or 0, which is sent as an output signal (Suppl. Fig. 4a). The binary signals received by a neuron are converted into the activation state by a decoder. This could be seen as converting discrete spikes into a continuous post-synaptic membrane potential in actual neurons. We modified this spiking network by implementing our learning rule and by adding a least-squares model for predicting steady-state activation based on initial activation in steps 1-17. We also used AdaGrad (Duchi et al., 2011) to search for hyperparameters, which resulted in setting the learning rate to 0.01. The network architecture was the same as in our main model with 784-1000-10 neurons. Those results are consistent with other studies which showed that spiking networks can perform as well as non-spiking networks (i.e.: Zenke & Ganguli, 2018;Bellec et al. 2020).

Acoustic stimuli
In animal experiments, as stimuli we used 1 s long pure tones (2, 3.3, 5.4, 8.9, 15 and 24 kHz at 60dB), interleaved with 1s periods of silence, as described in (Luczak et al. 2009(Luczak et al. & 2013. Activity occurring >200ms after stimulus offset and before the next stimulus onset was regarded as spontaneous. The stimuli were continuously presented for 20-108 min, depending on how long the animal appeared to be comfortable during the experiment. All stimuli were tapered at the beginning and end with a 5ms cosine window. Experiments took place in a single-walled sound isolation chamber (IAC, Bronx, NY) with sounds presented free field (RP2/ES1, Tucker-Davis, Alachua, FL). To compensate for the transfer function of the acoustic chamber, tone amplitudes were calibrated prior to the experiment using a condenser microphone placed next to the animal's head (7017, ACO Pacific, Belmont CA) and a MA3 microphone amplifier (Tucker-Davis).

Changes in neuronal activity across periods
For each animal we divided the duration of recording into two halves. For each neuron we calculated the mean stimulus evoked firing rate in 30-40 ms time window, averaged over all stimulus presentations in each half of the experiment. Similarly, for each neuron we calculated the average predicted activity for the 30-40 ms time window, averaged across all stimuli in the 1st half of the experiment. We then calculated the difference between activities in the 2nd half minus the 1st half, and we correlated it with the difference between activity in the 1st half minus predicted activity in the 1st half (Fig. 5b). As described in the Methods section, only neurons with average stimulus evoked firing rates higher than 3 SD above pre-stimulus baseline were used in our analyses. Dividing the data in different proportions e.g. 60%-40% gave the same conclusions.
For comparison, analogous analyses were done on artificial neurons (Fig. 5a). For each neuron in the hidden layer the steady-state clamped activity was averaged over all 1200 stimulus presentations in a single learning epoch. Similarly, for each neuron in the hidden layer, the predicted activity was averaged across all stimuli presented in an epoch. We then calculated the difference between clamped activities in 2 epoch: ∆ = <̂, > -<̂, > , where <> denotes average, i is index of neuron, and M and N are indexes of epochs with M > N. This was then correlated with difference between average clamped and predicted activity in epoch N: ∆ = <̂, > -<̃, >. We found that the correlation between ∆C and ∆P was strongest in the earliest epochs where the learning curve was most changing (Fig. 2e). For this simulation we used neural network as described in Methods with 784-1000-10 neurons, and we used the learning rule described in Eq. 3. However, in order to reproduce the repeated presentation of the same stimuli as in animal experiments, we used the same 1200 stimulus in each learning epoch. Presenting different stimuli at each epoch as in our original simulation, gave qualitatively similar results.
Note that usually the activity of a neuron in the 1st half of recording is correlated with its activity in the 2nd half, which typically is a trivial experimental observation. However, it is nontrivial to correctly predict (only from activity in the 1st half) if that neuron activity will increase or if it will decrease in the 2nd half. This is exactly what we are presenting in Fig. 5b, that a change in average firing rate from the 1st to 2nd half can be correctly predicted based on our learning rule. To state it differently, the activity of a neuron in the 1st half (X1) is usually correlated with its activity in the 2nd half (X2), and this can be described as: X2 = X1+eps, where eps is a small positive or negative number representing change in activity from the 1st to the 2nd half. Thus, in Fig. 5b, we are not showing that X1 is similar (correlated) to X2, but rather that eps correlates with 'surprise': the difference between actual and predicted activity in the 1st half, which provides a novel insight about neuronal plasticity.

Predicting evoked dynamics from spontaneous activity
For predicting evoked dynamics from spontaneous, first we analyzed 'types' of spontaneous activity patterns. For this we divided spontaneous packets into 10 groups based on principal component analysis (PCA). Specifically, each packet was represented as a NxT matrix, where N is number of neurons and T=30 is number of 3ms long time bins. This matrix was then converted into a single vector V with length of NxT. This was repeated for each packet, thus we obtained a VxP matrix, where P is the number of spontaneous packets (P for each animal was: 293, 855, 1215 and 2472). We then ran PCA on the VxP matrix and based on values of the 1st PC we divided packets into 10 groups. Note that the distribution of packets in PCA space was continuous, thus from this analysis it should not be interpreted that there are distinct types of packets. Dividing packets into 5-20 groups and including the 2nd and 3rd PC gave consistent results with those presented in the main text.
For predicting evoked dynamics from spontaneous, we faced the problem that defining the onset of spontaneous packets is less accurate than the onset of evoked responses. Thus, to ensure a similar precision of alignment, during each stimulus presentation we detected the onset of evoked responses using the same criteria as for spontaneous packets. Then, the evoked responses were realigned according to that of the detected onset. This allowed for estimating the spontaneous and evoked activity with the same precision. Using the original onset times instead of the realigned gave qualitatively similar results. Fig. 5. Testing effect of different values of β1 and β2. In Eq. 6 we used: β1 = 2 and β2 = 2, which allowed to reduce that formula from exponential to linear. Here we illustrate that resulted linearized expression in form: ( − ) is a reasonable approximation of non-linear version: ( 2 −1 − 1 −1 ) for β1 and β2 in range observed experimentally: 1.7 < β1 < 4.8 and 1.7 < β2 < 2.7 (Devor et al., 2003), and for typical values of x in range between 0-1. a, Relation between ( − ) and ( 2 −1 − 1 −1 ) for β1 = 2.1 and β2 = 2.2. Here we generated 1000 random values for and separately for , uniformly distributed on interval 0-1. Distribution of points along diagonal shows close to linear relation between results of both formulas. Least square regression line is shown in yellow. Correlation coefficient between points obtained with linear and non-linear formula was R=0.9988. b, Similarly, we calculated correlation coefficients between both formulas for all combinations of β1 and β2 in range: 1.7 < β1 < 4.8 and 1.7 < β2 < 2.7, using step size of 0.1. The lowest correlation coefficient was R = 0.9335 for β1 = 4.7 and β2 = 2.7. c, The same plot as in panel a, but for the "worst case" scenario with β1 = 4.7 and β2 = 2.7. Note that even in this case, the relation between both formulas is close to linear.

Maximizing future energy balance.
Intuitively, it makes sense that planning, i.e. making educated predictions, can improve success of an organisms in accessing more energy resources. In this section we show that this holds true even for a single neuron, where maximizing future energy balance is best achieved by predicting future activity. For that, first we will write equation for energy balance for a neuron at time t+n, where t represents current time and n is a small time increment. Using the same logic and notation as in Eq. 4, energy balance of a neuron j at time t+n, can be expressed as a function of: cost of housekeeping processes ( ɛ ), cost of electrical activity (which for simplified linear neurons can be written as sum of its synaptic inputs: , + = ∑ , + ), and energy supply from local blood vessels controlled by combined local activity of neurons (∑ , + ): (Eq. S1): E j,t+n = − ɛ − 1 (∑ , + ) 1 + 2 (∑ , + ) 2 where β1 and β2 describe a non-linear relation between activity and energy (Devor et al., 2003).
Note also that this derivation of learning rule from metabolic principles is obtained simply by only considering input and output energy to the cell, without need for including specific metabolic interactions. This provides an important theoretical contribution by showing how complex relations between cell metabolism and plasticity could be described more simply.

Activation function
To simplify presented derivation of learning rule we used linear model of a neuron. However, we can obtain the same expression even if we use non-linear neuronal model with activation function like ReLU: f(x) = x + = max( 0 , x ) (Glorot et al. 2011). This is because for x > 0: ReLU( x ) = x, which results in the same formulas as presented in our derivation. The only difference is that if we use ReLU then we need to write a separate expression for x < 0, to specify that in such case ReLU( x ) = 0. For sigmoid activation function, showing equivalence of the formulas could be more complicated. However, there are multiple arguments that ReLU could be a better model of biological neurons than logistic sigmoid neurons (Glorot et al. 2011), and our neural network simulations performed similarly for both neuron models.

Biological plausibility of predictive mechanism implementation within a neuron.
In this study, we propose that each neuron implements a linear predictive model, which can be written in general form as: ̃(t) = λ(1)* x(t-1),+…+ λ(n)* x(t-n) + constant, where x(t-n) is past activity at time step t-n, n is a maximum number of time steps contributing to the prediction, and λ(n) is a coefficient describing how past activity at time step t-n contributes to predicted activity at time t. This may appear to be a challenging computation for a cell, considering that it requires storing values of many λ coefficients and to multiply past activity by a specific λ at each time step. Here we explain how this equation could be implemented by known intracellular processes, like regulation of ion concentration.
In neurons, an action potential generates a temporal increase in calcium concentration. This variation in calcium can be approximated by convolution of the action potential with the waveform of the unitary calcium transient (Yaksi et al. 2006). Note that convolution is a linear operation and can be mathematically expressed similarly to our predictive model. Thus, calcium concentration at time t can be described as: y(t) = λ(1)* x(t-1),+…+ λ(n)* x(t-n) , where coefficients λ correspond to values of unitary calcium transient at time steps 1 ... n, and x is the past activity. For instance, if λ(1) has the highest value, then the above equation means that neuron activity at time t-1 is the biggest predictor of calcium concentration at time t. Consequently, decay of λ values for larger n, tells us how quickly calcium ions are cleaned from the cell at those time steps after the action potential. Therefore, values of λ are directly related to the physical properties of calcium channels, such as number of channels and ion transport efficiency, among other factors. This illustrates that our equation for the predictive model could be implemented within each neuron, simply as a physical process of activity-dependent changes in ions concentration.
Our model proposes that, when a neuron receives e.g. larger than predicted top-down modulation, then not only will synapses be modified as described in Eq. 3, but also predictive model parameters λ will be updated to reduce prediction error. As explained above, this could be accomplished by e.g. changing the efficiency or number of ion channels, which would modify the calcium dynamics used for predictions. Although this is only a conjecture, this offers an experimentally testable hypothesis, that the waveform of the unitary calcium transient should change, depending on the frequency at which a neuron was stimulated earlier.