Neurons learn by predicting future activity

The brain uses a learning algorithm that has yet to be discovered. Here we demonstrate that the ability of a neuron to predict its expected future activity may be an important missing component of our understanding of learning in the brain. We show that comparing predicted activity with actual activity can provide an error signal for modifying synaptic weights. Importantly, this learning rule can be derived from minimizing neuron metabolic cost. This reveals an unexpected connection: learning in neural networks could result from each neuron simply maximizing its energy balance. We validated this predictive learning rule in neural network simulations and in data recorded from awake animals. We found that neurons in the sensory cortex can indeed predict their activity ~10-20 ms into the future. Moreover, in response to stimuli, cortical neurons changed their firing rate to minimize surprise, i.e. the difference between actual and expected activity, as predicted by our model. Our results also suggest that spontaneous brain activity provides “training data” for neurons to learn to predict cortical dynamics. Thus, this work demonstrates that the ability of a neuron to predict its future inputs could be an important missing element of our understanding of computation in the brain.


Introduction
Artificial neural networks have shown remarkable performance in a multitude of difficult tasks, such as cancer detection, speech recognition, or self-driving cars [1][2][3]. However, the expertise of such neural networks is restricted to particular domains. This suggests that the human brain, which can solve a wide variety of tasks, e.g. driving a car while discussing the development of new cancer tests, may be using more powerful computational algorithms than those used in artificial neural networks. Currently, the best performing neural networks are trained using the backpropagation algorithm [4]. However, networks that use backpropagation have multiple properties which are difficult to reconcile with biological networks. For example: (1) they require separate phases for sending a bottom-up 'sensory' signal and for receiving a top-down error signal, whereas in the brain top-down feedback is combined with incoming sensory information; (2) backpropagation requires symmetric weights, meaning that a synaptic connection from neuron A to B has to have exactly the same strength as the connection from B to A; (3) the neuron models are biologically unrealistic, as they do not include, for example, spiking activity, internal neuron dynamics, or Dale's law (neurons being either excitatory or inhibitory), among many other simplifications. The question of how to bridge the gaps between biological and artificial learning algorithms is a subject of rapidly growing research at the intersection of neuroscience and computer science [5][6][7].

Synaptic learning rule to minimize prediction error
One of the most promising ideas for how backpropagation-like algorithms could be implemented in the brain is based on using temporal differences in neuronal activity to approximate top-down error signals [8][9][10][11][12][13][14]. A typical example of such algorithms is Contrastive Hebbian Learning [15][16][17], which was proved to be equivalent to backpropagation under certain assumptions [18]. Contrastive Hebbian Learning requires networks to have recurrent connections between hidden and output layers, which allows activity to propagate in both directions (Fig. 1A). The learning consists of two separate phases. First, in the 'free phase', a sample stimulus is continuously presented to the input layer and the activity propagates through the network until the dynamics converge to an equilibrium (the activity of each neuron reaches a steady-state level). In the second, 'clamped phase', in addition to presenting the stimulus to the input, the output neurons are also held clamped at values representing the stimulus category (e.g. 0 or 1), and the network is again allowed to converge to an equilibrium. For each neuron, the difference between activity in the clamped (x̂) and free (x̌) phase is used to modify the synaptic weights w according to (Eq. 1): Δw_ij = α(x̂_i x̂_j − x̌_i x̌_j), where i and j are the indices of the pre- and post-synaptic neurons respectively, and α is a small number representing the learning rate. Intuitively, this can be seen as adjusting weights to push neuron activity in the free phase closer to the desired activity represented by the clamped phase. The obvious problem with the biological plausibility of this algorithm is that it requires the neuron to experience exactly the same stimulus twice, in two separate phases, and that the neuron needs to 'remember' its activity from the previous phase.
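As a concrete illustration, the two-phase update of Eq. 1 can be sketched as follows (a minimal NumPy sketch; the function name and toy sizes are ours, not from any reference implementation):

```python
import numpy as np

def chl_update(w, pre_free, post_free, pre_clamped, post_clamped, alpha=0.1):
    """Contrastive Hebbian Learning (Eq. 1):
    dw_ij = alpha * (clamped co-activity x^_i*x^_j  -  free co-activity xv_i*xv_j)."""
    return w + alpha * (np.outer(pre_clamped, post_clamped)
                        - np.outer(pre_free, post_free))

# Toy example: 3 pre-synaptic and 2 post-synaptic neurons.
w = np.zeros((3, 2))
pre = np.array([1.0, 0.5, 0.0])
post_free = np.array([0.2, 0.8])
post_clamped = np.array([0.0, 1.0])
w_new = chl_update(w, pre, post_free, pre, post_clamped)
```

Note that if the free-phase activity already equals the clamped-phase activity, the two Hebbian terms cancel and no weight change occurs.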
Here, we solve this problem by combining both activity phases into one, inspired by sensory processing in the cortex. For example, when presented with a new picture, visual areas initially show bottom-up driven activity containing mostly visual attributes of the stimulus (e.g. contours), which is then followed by top-down modulation containing more abstract information, e.g. that the object is novel (Suppl. Fig. 1). Accordingly, our algorithm first runs only the initial part of the free phase, which represents bottom-up stimulus-driven activity, and then, after a few steps, the network output is clamped, corresponding to top-down modulation (Fig. 1B).
The main novel insight here is that the initial bottom-up activity is enough to allow neurons to estimate the expected top-down information, and the mismatch between estimated and actual activity can then be used as a teaching signal. To implement this in our model, neuron activity at the initial time steps of the free phase is used to predict its expected steady-state activity. This is then compared with the actual activity at the end of the clamped phase, and the difference is used to update the weights (Fig. 1B, Methods). Thus, to modify synaptic weights, we replaced the free-phase activity in Eq. 1 with the predicted activity x̃ (Eq. 2): Δw_ij = α(x̂_i x̂_j − x̃_i x̃_j). However, this equation implies that a neuron also needs to know the predicted activity of all its presynaptic neurons (x̃_i), which could be problematic. To solve this, we found that x̃_i can be replaced by the actual presynaptic activity in the clamped phase (x̂_i), which leads to the following simplified synaptic plasticity rule (Eq. 3): Δw_ij = α(x̂_i x̂_j − x̂_i x̃_j) = α x̂_i (x̂_j − x̃_j). Thus, to modify its synaptic weights, a neuron only compares its actual activity (x̂_j) with its predicted activity (x̃_j), and applies this difference in proportion to each input's contribution (x̂_i).

Fig. 1B: Initially the network receives only the input signal (free phase), but after a few steps the output signal is also presented (clamped phase). The dashed line shows neuron activity in the free phase if the output were not clamped. The light blue dot represents the steady-state free-phase activity predicted from the initial activity (shaded region). Synaptic weights w are adjusted in proportion to the difference between the steady-state activity in the clamped phase (x̂) and the estimated free-phase activity (x̃).
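A minimal sketch of the Eq. 3 update (the function name and toy values are ours; the full implementation is in the repository linked in the Methods):

```python
import numpy as np

def predictive_update(w, pre_clamped, post_clamped, post_predicted, alpha=0.1):
    """Predictive learning rule (Eq. 3):
    dw_ij = alpha * x^_i * (x^_j - x~_j),
    i.e. pre-synaptic clamped activity times the post-synaptic
    'surprise' (actual minus predicted activity)."""
    surprise = post_clamped - post_predicted
    return w + alpha * np.outer(pre_clamped, surprise)

# If actual activity exceeds the prediction, weights from active inputs grow.
w = np.zeros((3, 2))
pre = np.array([1.0, 0.5, 0.0])
w_new = predictive_update(w, pre,
                          post_clamped=np.array([1.0, 0.0]),
                          post_predicted=np.array([0.4, 0.0]))
```

Silent inputs (x̂_i = 0) leave their synapses unchanged, and a perfect prediction produces no weight change at all.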

Synaptic learning rule derivation by maximizing neuron energy balance.
Importantly, Eq. 3 is not an ad hoc algorithm to solve a computational problem: this form of learning rule naturally arises as a consequence of a neuron minimizing its metabolic cost. Most of the energy consumed by a neuron goes to electrical activity, with synaptic potentials accounting for ~50% and action potentials for ~20% of ATP used [19]. Using a simplified linear model of neuronal activity, this energy consumption can be expressed as −b₁(Σ_i x_i w_ij)^β₁, where x_i represents the activity of pre-synaptic neuron i, w_ij represents synaptic weights, b₁ is a constant to match energy units, and β₁ describes the non-linear relation between neuron activity and energy usage, which is estimated to be between 1.7 and 4.8 [20]. The remaining ~30% of a neuron's energy is consumed on housekeeping functions, which can be represented by a constant −ε. On the other hand, an increase in neuronal population activity also increases local blood flow, leading to more glucose and oxygen entering a neuron (see review on neurovascular coupling: [21]). This activity-dependent energy supply can be expressed as +b₂(Σ_k x_k)^β₂, where x_k represents the spiking activity of neuron k from a local population of K neurons (k ∈ {1, …, j, …, K}), with the activity of neuron j given by x_j = Σ_i x_i w_ij; b₂ is a constant and β₂ reflects the exponential relation between activity and blood volume increase, which is estimated to be in the range 1.7-2.7 [20]. Putting those terms together, the energy balance of neuron j can be expressed as (Eq. 4):

E_j = −ε − b₁(Σ_i x_i w_ij)^β₁ + b₂(Σ_k x_k)^β₂.

Using the gradient ascent method, we can calculate the changes in synaptic weights Δw that will maximize the energy E_j. For that, we need the derivative of E_j with respect to w_ij (Eq. 5):

Δw_ij = α ∂E_j/∂w_ij = α[−b₁β₁ x_i (Σ_i x_i w_ij)^(β₁−1) + b₂β₂ x_i (Σ_k x_k)^(β₂−1)].

If we denote the population activity as x̄ = Σ_k x_k, and consider that Σ_i x_i w_ij = x_j, then we obtain (Eq. 6):

Δw_ij = α x_i (b₂β₂ x̄^(β₂−1) − b₁β₁ x_j^(β₁−1)).

If we also denote α₁ = α b₁β₁ and α₂ = b₂β₂/(b₁β₁), then we can take b₁β₁ in front of the brackets; and considering that β₁ and β₂ > 1.7 (see [20]), we may assume that β₁ ≈ 2 and β₂ ≈ 2, which simplifies Eq. 6 to (Eq. 7):

Δw_ij = α₁ x_i (α₂ x̄ − x_j).
Note that Eq. 7 has the same form as the learning rule in Eq. 3: Δw_ij = α x̂_i (x̂_j − x̃_j). Even if β₁ and β₂ are not exactly 2, Eq. 3 can still provide a good linear approximation of the gradients prescribed by Eq. 6. Note also that in Eq. 3, x̂_j represents top-down modulation; thus Eq. 3 can be interpreted as changing weights to reduce the mismatch between the activity of other neurons and a neuron's own expected activity. Similarly, Eq. 7 can be interpreted as changing weights to reduce the discrepancy between the neuronal population response and a neuron's own activity.
Moreover, if we assume that a neuron adapts to maximize its energy balance in the future, then Eq. 7 changes to (Eq. S7): Δw_ij = α₃ x_{i,t} (α₄ x̄_rec − x̃_j), where x̄_rec represents recurrent population activity, which can be thought of as a type of top-down modulation, similar to x̂. Note also that the activity of neuron j, x_j, becomes its predicted future activity x̃_j (see Suppl. Materials for details of the derivation). Thus, the best strategy for a neuron to maximize its future energy resources requires predicting its own future activity. Altogether this reveals an unexpected connection: learning in neural networks could result from each neuron simply maximizing its energy balance.
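The gradient in Eq. 5 can be checked numerically; below is a small self-contained sketch (the constants b₁, b₂, β₁, β₂, ε and the toy activities are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
b1, b2, beta1, beta2, eps = 1.0, 0.5, 2.0, 2.0, 0.3
x_pre = rng.random(5)    # pre-synaptic activities x_i
w = rng.random(5)        # weights w_ij onto neuron j
x_other = 2.0            # summed activity of the other neurons in the population

def energy(w):
    """Energy balance of neuron j (Eq. 4): supply minus cost minus housekeeping."""
    x_j = x_pre @ w                  # linear neuron: x_j = sum_i x_i w_ij
    xbar = x_other + x_j             # population activity x-bar includes x_j
    return -eps - b1 * x_j**beta1 + b2 * xbar**beta2

# Analytic gradient (Eq. 5)
x_j = x_pre @ w
xbar = x_other + x_j
grad = x_pre * (b2 * beta2 * xbar**(beta2 - 1) - b1 * beta1 * x_j**(beta1 - 1))

# Central finite-difference check of the same gradient
h = 1e-6
num = np.array([(energy(w + h * e) - energy(w - h * e)) / (2 * h)
                for e in np.eye(5)])
```

The two gradients agree to numerical precision, confirming that Eq. 5 is the derivative of Eq. 4.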

Learning rule validation in neural network simulations
To test whether our new learning rule can solve standard machine learning tasks, we created the following simulation. The neural network had 784 input units, 1000 hidden units, and 10 output units, and it was trained on a hand-written digit recognition task (MNIST [22]; Suppl. Fig. 2; Methods). Our network achieved a 1.9% error rate, which is at human-level accuracy on this task (human error rate ~1.5-2.5% [23]). This accuracy is also similar to that of neural networks with comparable architecture trained with the backpropagation algorithm [22]. This demonstrates that a network with our learning rule can solve challenging non-linear classification tasks.
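The free-phase settling dynamics described below can be illustrated with a toy version of such a network (the sizes, weight scales, and synchronous update scheme here are our illustrative choices, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_out = 20, 30, 5            # toy sizes; the paper uses 784-1000-10
W1 = rng.normal(0, 0.1, (n_in, n_hid))    # input -> hidden weights
W2 = rng.normal(0, 0.1, (n_hid, n_out))   # hidden <-> output (symmetric) weights

x = rng.random(n_in)                      # a fixed input 'stimulus'
h, y = np.zeros(n_hid), np.zeros(n_out)
for step in range(500):                   # free phase: relax to steady state
    h_new = sigmoid(x @ W1 + y @ W2.T)    # hidden gets bottom-up and top-down input
    y_new = sigmoid(h_new @ W2)
    delta = max(np.abs(h_new - h).max(), np.abs(y_new - y).max())
    h, y = h_new, y_new
    if delta < 1e-9:                      # converged: activity no longer changes
        break
```

With small weights the dynamics contract, so the network reaches a fixed point after a modest number of update steps; clamping would simply hold y at the target during the same iteration.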
To verify that neurons could correctly predict future free-phase activity, we took a closer look at sample neurons. Figure 2A illustrates the activity of all 10 output neurons in response to an image of a sample digit after the first epoch of training. During steps 1-12, only the input signal was presented and the network was running in the free phase. At step 13, the output neurons were clamped, with the activity of 9 neurons set to 0 and the activity of the one neuron representing the correct image class set to 1. For comparison, this figure also shows the activity of the same neurons without clamped outputs (free phase). It illustrates that after about 50 steps in the free phase, the network achieves steady state, with the predicted activity closely matching the actual steady-state activity. When the network is fully trained, it still takes about 50 steps for the network dynamics in the free phase to converge to steady state (Fig. 2B). Note that although all units initially increase their activity at the beginning of the free phase, they later converge close to 0, except the one unit representing the correct category. Again, predictions made from the first 12 steps of the free phase closely matched the actual steady-state activity. The hidden units also converged to steady state after about 50 steps. Figure 2C illustrates the response of one representative hidden neuron to 5 sample stimuli. Because hidden units experience the clamped signal only indirectly, through synapses from output neurons, their steady-state activity is not bound to converge only to 0 or 1, as in the case of output neurons. Actual and predicted steady-state activity for hidden neurons is presented in Figure 2D. The average correlation coefficient between predicted and actual free-phase activity was R = 1 ± 0.0001 SD (averaged across 1000 hidden neurons in response to 200 randomly selected test images). Note that for all predictions we used a cross-validation approach, where we trained a predictive model for each neuron on a subset of the data and applied it to new examples, which were then used for updating the weights (Methods). Thus, neurons were able to successfully generalize their predictions to new, unseen stimuli. The network error rate for the training and test datasets is shown in Fig. 2E. This demonstrates that our algorithm worked well, and each neuron accurately predicted its future activity. We also tested our learning rule in multiple other network architectures, which were designed to reflect additional aspects of biological neuronal networks. First, we introduced a constraint that 80% of the hidden neurons were excitatory, and the remaining 20% had only inhibitory outputs. This follows the observations that biological neurons release either excitatory or inhibitory neurotransmitters (Dale's law [24]), and that about 80% of cortical neurons are excitatory. The network with this architecture achieved an error rate of 2.66% (Suppl. Fig. 3A), which again is comparable with the human error rate on this task [23]. We also tested our algorithm in a network with two hidden layers, in a network without symmetric weights, and in a network with all-to-all connectivity within the hidden layer. We found that all of those modified networks similarly converged towards the solution (Suppl. Fig. 3B & 4). We also implemented our learning rule in a network with spiking neurons, which again achieved a similar error rate of 2.46% (Suppl. Fig. 5).
Altogether, this shows that our predictive learning rule performs well in a variety of biologically motivated network architectures.

Learning rule validation in awake animals
To test whether real neurons could also predict their future activity, we analyzed neuronal recordings from the auditory cortex of awake rats (Methods). As stimuli we presented 6 tones, each 1 s long and interspersed with 1 s of silence, repeated continuously for over 20 minutes (Suppl. Materials).
For each of the 6 tones we calculated the average onset and offset response, giving us 12 different activity profiles for each neuron (Fig. 3A). For each stimulus, the activity in the 15-25 ms time window was used to predict the average future activity within the 30-40 ms window. We used 12-fold cross-validation, where responses from 11 stimuli were used to train a least-squares model, which was then applied to predict the neuron's activity for the 1 remaining stimulus. This procedure was repeated 12 times for each neuron. The average correlation coefficient between actual and predicted activity was R = 0.36 ± 0.05 SEM (averaged across 55 cells from 4 animals, Fig. 3B). The distribution of correlation coefficients for individual neurons was significantly different from 0 (t-test, p < 0.0001; inset in Fig. 3B). This shows that neurons have predictable dynamics, and that from an initial neuronal response its future activity can be estimated.
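The leave-one-stimulus-out procedure can be sketched as follows (the data here are synthetic stand-ins for one neuron's windowed firing rates, not the recorded responses):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in: 12 stimuli, mean rate in the early (15-25 ms) window,
# with the late (30-40 ms) window linearly related to it plus noise.
early = rng.random(12) * 10
late = 0.8 * early + 1.0 + rng.normal(0, 0.3, 12)

pred = np.empty(12)
for held_out in range(12):                # 12-fold (leave-one-stimulus-out) CV
    train = np.arange(12) != held_out
    A = np.c_[early[train], np.ones(train.sum())]   # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, late[train], rcond=None)
    pred[held_out] = coef[0] * early[held_out] + coef[1]

r = np.corrcoef(late, pred)[0, 1]         # actual vs. cross-validated prediction
```

The resulting r is the per-neuron correlation coefficient that, averaged across cells, gives statistics like the R = 0.36 ± 0.05 reported above.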
Repeated presentation of stimuli over tens of minutes also induced long-term changes in neuronal firing rates [25], similar to perceptual learning. Importantly, based on our model it was possible to infer which individual neurons would increase or decrease their firing rate. To explain this, let us first look at the neural network simulations in Fig. 3C. It shows that, for a neuron, the average change in activity from one learning epoch to the next depends on the difference between clamped (actual) activity and predicted (expected) activity in the previous learning epoch (Fig. 3C; correlation coefficient R = 0.35, p < 0.0001; Suppl. Materials). Similarly, for cortical neurons, we found that the change in firing rate from the 1st to the 2nd half of the experiment was positively correlated with the difference between evoked and predicted activity during the 1st half of the experiment (R = 0.58, p < 0.0001; Fig. 3D, Suppl. Materials). This can be understood in terms of Eq. 3: if actual activity is higher than predicted, then synaptic weights are increased, leading to higher activity of that neuron in the next epoch. Therefore, the similar behavior of artificial and cortical neurons, where the firing rate changes to minimize 'surprise' (the difference between actual and predicted activity), provides strong evidence in support of the learning rule presented here.

This change relates to the difference between stimulus-evoked and predicted activity during the 1st half of the experiment (Suppl. Materials). Each dot represents the activity of one neuron averaged across stimuli. The similar behavior of cortical and artificial neurons suggests that both may be using the same learning rule. Note that although the results in panel A, showing that the firing rate is consistent from the early to the later phase of the response, may not be surprising, the results in panel D, showing that neurons change their firing rate depending on the value of the predicted activity, provide new insight into neuronal behavior.

Deriving predictive model parameters from spontaneous activity
Next, we tested whether spontaneous brain activity could also be used to predict neuronal dynamics during stimulus presentation. Spontaneous activity, such as during sleep, is defined as activity not directly caused by any external stimuli. However, there are many similarities between spontaneous and stimulus-evoked activity [26][27][28][29]. For example, spontaneous activity is composed of ~50-300 ms long population bursts, called packets, which resemble stimulus-evoked patterns [30]. This is illustrated in Figure 4A, where spontaneous activity packets in the auditory cortex are visible before sound presentation [31,32]. In our experiments, each 1 s long tone presentation was interspersed with 1 s of silence, and the activity during 200-1000 ms after each tone was considered spontaneous (animals were in a soundproof chamber; Suppl. Materials). The individual spontaneous packets were extracted to estimate neuronal dynamics (Methods). The spontaneous packets were then divided into 10 groups based on similarity in PCA space (Suppl. Materials), and for each neuron we calculated its average activity in each group (Fig. 4B). Similarly to the previous analyses in Fig. 3A, the initial activity in the 5-25 ms time window was used to derive a least-squares model to predict future spontaneous activity in the 30-40 ms time window (Suppl. Materials). This least-squares model was then applied to predict future evoked responses from the initial evoked activity for all 12 stimuli. Figure 4C shows actual vs. predicted evoked activity for all neurons and stimuli (correlation coefficient R = 0.2 ± 0.05 SEM, averaged over 40 cells from 4 animals; the inset shows the distribution of correlation coefficients of individual neurons; p = 0.0008, t-test). Spontaneous brain activity is estimated to account for over 90% of the brain's energy consumption [33], yet the function of this activity remains a mystery. The above results offer a new insight: because neuronal dynamics during spontaneous activity are similar to evoked dynamics, spontaneous dynamics can be used by neurons as a training set to learn to predict responses to new stimuli.
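The packet-grouping step can be sketched as follows (synthetic data; the paper's exact clustering procedure in PCA space may differ, so binning along the first principal component is only an illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic packets: 200 packets x 40 neurons (mean rate per neuron per packet).
packets = rng.random((200, 40))

# Project packets onto their first two principal components (PCA via SVD).
centered = packets - packets.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[:2].T

# Crude grouping: 10 equal-occupancy bins along the first PC.
edges = np.quantile(scores[:, 0], np.linspace(0, 1, 11))
group = np.digitize(scores[:, 0], edges[1:-1])   # group labels 0..9
```

Each neuron's activity can then be averaged within each of the 10 groups to form the activity profiles used for fitting the predictive model.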

Fig 4. Predicting stimulus-evoked responses from spontaneous activity dynamics. (A) Sample spiking activity in the auditory cortex before and during tone presentation. Note that spontaneous activity is not continuous, but rather composed of bursts, called packets, which are similar to tone-evoked packets. The bottom trace shows smoothed multiunit activity: the summed activity of all neurons (adapted with permission from [32]). (B) Spontaneous packets were divided into 10 groups based on population activity patterns. The activity of a single neuron in 5 different spontaneous packet groups is shown. The gray area indicates the time window used for predicting the future average activity within the 30-40 ms time window (marked by the arrow). This predictive model derived from spontaneous activity was then applied to predict future evoked activity based on the initial evoked response. (C) Actual vs. predicted tone-evoked activity. Plot conventions are the same as in Fig. 3B. The rightward skew of the histogram shows that for most neurons the evoked dynamics can be estimated from the neuron's spontaneous activity.

Discussion
Here we present computational and biological evidence that the basic principle underlying single-neuron learning may rely on minimizing future surprise: the difference between actual and predicted activity. Thus, a single neuron does not only perform a summation of its inputs; it also predicts its expected future activity, which we propose is a crucial component of the learning mechanism. Note that a single neuron has a complexity similar to that of single-cell organisms, which have been shown to have 'intelligent' adaptive behaviors, including predicting the consequences of their actions in order to navigate toward food and away from danger [34][35][36]. This suggests that the typical neuronal models used in machine learning may be too simplistic to account for the essential computational properties of biological neurons. Our work suggests a new computational element within neurons, which could be crucial for reproducing the brain's learning mechanisms.
There are multiple lines of evidence that the brain operates as a predictive system [37][38][39][40][41][42]. However, it remains controversial how exactly predictive coding could be implemented in the brain [8]. Most proposed mechanisms involve specially designed neuronal circuits that allow for comparing expected and actual activity [43][44][45][46]. Although those models are to some degree biologically motivated, they require a precise network configuration, which could be difficult to achieve considering that brain connectivity is highly variable within and between brain areas. Here we propose that this problem can be solved by implementing predictions within a single neuron. Biological neurons have a variety of cellular mechanisms which operate on time scales of 1-100 ms, suitable for implementing predictions [47][48][49][50][51]. For instance, it has been proposed that depolarization of the soma by basal dendrites in pyramidal cells could serve as an expected signal for comparison with top-down modulation arriving from the apical dendrite [52,53]. Another interesting biological aspect of our model is that it belongs to the category of energy-based models, for which it was shown that synaptic update rules are consistent with spike-timing-dependent plasticity [54]. All of this demonstrates the compatibility of our model with neurophysiology.
Our work also suggests that packets could be basic units of information processing in the brain. It is well established that sensory stimuli evoke coordinated bursts (packets) of neuronal activity lasting from tens to hundreds of ms. We call such population bursts packets because they have a stereotypical structure, with neurons active at the beginning conveying bottom-up sensory information (e.g. this is a face), while neurons later in the packet represent additional higher-order information (e.g. this is a happy face of that particular friend) [55]. The later part of the packet can also encode whether there is a discrepancy with expectation (e.g. this is a novel stimulus [56,57]; Suppl. Fig. 1). This is likely because only the later part of the packet can receive top-down modulation, after information about the stimulus has been exchanged between other brain areas [58,59]. Thus, our work suggests that the initial part of the packet can be used to infer what the rest of the brain may 'think' about a stimulus, and the difference from this expectation can be used as a learning mechanism to modify synaptic connections. This could be the reason why, for example, we cannot process visual information faster than ~24 frames/s: only after evaluating whether a given image is consistent with expectation can the next image be processed by the next packet. Our learning rule thus implies that sensory information is processed in discrete units, with each packet representing an elementary unit of perception.

Neural Network
The code to reproduce our network, with all the implementation details, is available at https://github.com/ykubo82/bioCHL. Briefly, the base network has the architecture 784-1000-10 with sigmoidal units and symmetric connections (see Suppl. Fig. 3-5 for more biologically plausible network architectures, which we also tested). To accelerate training, we used AdaGrad [60], and we applied a learning rate of 0.03 to the hidden layer and 0.02 to the output layer. In the standard implementation of Contrastive Hebbian Learning, the learning rate α in Eq. 1 is also multiplied by a small number (~0.1) for all top-down connections (e.g. from output to hidden layer) [61]. This different treatment of feed-forward and feedback connections could be biologically questionable, as cortical circuits are highly recurrent. Therefore, to make our learning rule more biologically plausible, we discarded this feedback gain factor, as we wanted our network to learn by itself what the contribution of each input should be, as described in Eq. 3.

Future activity prediction
For all the predictions we used a cross-validation approach. Specifically, in each training cycle we ran the free phase on 490 examples, which were used to derive a least-squares model for each neuron to predict its future activity at time step 120 (x̃(120)) from its initial activity at steps 1-12 (x̌(1), …, x̌(12)). This can be expressed as a linear function of the initial activity: x̃(120) = Σ_{t=1..12} a_t x̌(t) + a₀, where the coefficients a are fit by least squares. This demonstrates that each neuron can accurately predict its future activity even for novel stimuli which were never presented before.
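This per-neuron regression can be sketched as follows (synthetic exponential-relaxation traces stand in for the network's free-phase dynamics; the time constant and noise level are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic free-phase traces for one neuron: 490 examples x 120 time steps,
# each relaxing exponentially toward a per-example steady state.
steady = rng.random(490)
t = np.arange(1, 121)
traces = steady[:, None] * (1 - np.exp(-t / 20.0)) + rng.normal(0, 0.01, (490, 120))

X = np.c_[traces[:, :12], np.ones(490)]   # initial activity at steps 1-12, plus intercept
y = traces[:, 119]                        # activity at step 120
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Apply the fitted model to a held-out (noise-free) example.
new_trace = 0.5 * (1 - np.exp(-t / 20.0))
y_hat = np.r_[new_trace[:12], 1.0] @ coef
```

The model generalizes to the unseen trace because the mapping from early to steady-state activity is shared across examples, which is exactly the property the cross-validation above tests.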

Surgery, recording and neuronal data
The experimental procedures for the awake, head-fixed experiments have been described previously [31,32]; they were approved by the Rutgers University Animal Care and Use Committee and conformed to NIH Guidelines on the Care and Use of Laboratory Animals. Briefly, a headpost was implanted on the skull of four Sprague-Dawley male rats (300-500 g) under ketamine-xylazine anesthesia, and a craniotomy was performed above the auditory cortex and covered with wax and dental acrylic. After recovery, each animal was trained for 6-8 days to remain motionless in the restraining apparatus. On the day of the surgery, the animal was briefly anesthetized with isoflurane, the dura was resected, and after a recovery period, recording began. For recording we used silicon microelectrodes (NeuroNexus Technologies, Ann Arbor, MI) consisting of 8 or 4 shanks spaced by 200 µm, with a tetrode recording configuration on each shank. Electrodes were inserted into layer V of the primary auditory cortex. Units were isolated by a semiautomatic algorithm (klustakwik.sourceforge.net) followed by manual clustering (klusters.sourceforge.net) [62]. Only neurons with average stimulus-evoked firing rates higher than 3 SD above the pre-stimulus baseline were used in the analysis, resulting in 9, 12, 12, and 22 neurons from the four rats. For predicting evoked activity from spontaneous activity, we additionally required that neurons have a mean firing rate during spontaneous packets above this threshold, which reduced the number of neurons to 40. The spontaneous packet onsets were identified from the spiking activity of all recorded cells as the time of the first spike marking a transition from a period of global silence (30 ms with at most one spike from any cell) to a period of activity (60 ms with at least 15 spikes from any cells), as described before [31,63]. The data presented here are available from the corresponding author upon reasonable request.
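The packet-onset criterion can be sketched directly from its description (the function name is ours and the thresholds mirror the stated 30 ms / 60 ms windows; the original detection code may differ in details):

```python
import numpy as np

def packet_onsets(spike_times, silence_win=0.030, silence_max=1,
                  active_win=0.060, active_min=15):
    """Detect spontaneous packet onsets: the first spike marking a transition
    from near-silence (<= silence_max spikes in the preceding silence_win)
    to activity (>= active_min spikes within the following active_win).
    spike_times: sorted 1-D array of merged spike times (s) from all cells."""
    spike_times = np.asarray(spike_times)
    onsets = []
    for t in spike_times:
        lo, hi = np.searchsorted(spike_times, [t - silence_win, t])
        n_before = hi - lo                     # spikes in [t - silence_win, t)
        lo2, hi2 = np.searchsorted(spike_times, [t, t + active_win])
        n_after = hi2 - lo2                    # spikes in [t, t + active_win)
        if n_before <= silence_max and n_after >= active_min:
            onsets.append(t)
    onsets = np.array(onsets)
    if onsets.size:                            # collapse onsets within one packet
        onsets = onsets[np.r_[True, np.diff(onsets) > active_win]]
    return onsets

# Toy example: one isolated spike, then a 20-spike burst starting at t = 0.1 s.
spikes = np.concatenate(([0.0], 0.1 + np.arange(20) * 0.002))
onsets = packet_onsets(spikes)   # expected: a single onset at 0.1 s
```

The final de-duplication step keeps only the first qualifying spike of each burst, so one packet yields one onset time.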
Title: Neurons learn by predicting future activity.

Suppl. Materials
Maximizing future energy balance.
Intuitively, it makes sense that planning, i.e. making predictions, can improve the success of an organism in accessing energy resources. In this section we show that this holds true even for a single neuron, where maximizing future energy balance is best achieved by predicting its future activity. For that, we first write the equation for the energy balance of a neuron at time t+n, where t represents the current time and n is a small time increment. Using the same logic and notation as in Eq. 4, the energy balance of neuron j at time t+n can be expressed as a function of housekeeping processes (−ε), the cost of electrical activity (which for simplified linear neurons can be written as a sum of synaptic inputs: x_{j,t+n} = Σ_i x_{i,t+n} w_ij), and the activity-dependent energy supply from the local population (Σ_k x_{k,t+n}), giving (Eq. S1):

E_{j,t+n} = −ε − b₁(Σ_i x_{i,t+n} w_ij)^β₁ + b₂(Σ_k x_{k,t+n})^β₂.

In the main text we show in simulations (Fig. 2) and in experimental data (Fig. 3) that for small n, the activity of neuron j at time t+n can be approximated by a linear function of its activity at the earlier time step t: x_{j,t+n} = m_j x_{j,t}, where m_j is a regression coefficient. Thus Eq. S1 can be rewritten as (Eq. S2):

E_{j,t+n} = −ε − b₁(m_j Σ_i x_{i,t} w_ij)^β₁ + b₂(Σ_k x_{k,t+n})^β₂.

Considering that β₁ and β₂ > 1.7 (Devor et al. 2003), we may assume that β₁ ≈ 2 and β₂ ≈ 2, which allows the terms to be simplified and reorganized.

Suppl. Fig. 2. Examples of handwritten digits from the MNIST data set (LeCun et al. 1998). Note that classifying such images is a non-trivial task even for humans; for instance, the digits in the bottom row (5, 6, 7, 8, 9) could be mistaken for 0, 4, 4, 9 and 4, respectively.
Suppl. Fig. 3. (A) The learning curve for a network with 80% excitatory and 20% inhibitory neurons in the hidden layer. For each excitatory neuron, all synaptic connections to output neurons were non-negative (w ≥ 0), and for inhibitory neurons all connections to output neurons were negative or zero (w ≤ 0). The network had the same architecture as our main network (784-1000-10 neurons) and used the same learning rule (Eq. 3). On the test data set it achieved a comparable error rate of 2.66%. (B) The learning curve for a network with 2 hidden layers (784-1000-512-10 neurons). Because it took a longer time for the deeper network to settle to steady state, we extended the free phase to 26 steps and started the clamped phase at step 27. This network solved the MNIST task with similar accuracy (error rate on test data: 3.68%), showing that our learning rule can also be applied to deeper networks.
Suppl. Fig. 4. Learning curves for a network with our learning rule but with asymmetric connections (orange), and for the same network with additional asymmetric lateral connections within the hidden layer (all-to-all connectivity in the hidden layer; black line). To increase the efficiency of model testing, here we used simpler networks with only 50 hidden neurons and trained them on only a fraction of the MNIST examples. For comparison, the learning curve for a network with the same number of neurons using the original Contrastive Hebbian Learning algorithm with symmetric connections is shown in green. This suggests that our learning rule also works in more biologically plausible network configurations.
Note on symmetric and recurrent connections: Typical artificial neural networks trained with the backpropagation algorithm require that the synaptic weight from neuron i to j (w_ij) is exactly the same as the synaptic weight from neuron j to i (w_ij = w_ji; so-called symmetric weights). However, it was recently shown that networks without symmetric connections can still converge on a good solution (Lillicrap et al. 2016). Similar results were also found by Detorakis et al. (2019), who used asymmetric weights with Contrastive Hebbian Learning. Thus, our results are consistent with that previous work.
More interesting is our result that a fully recurrent network with lateral connections also converged similarly. This type of connectivity more closely reflects biological conditions, as cortical neurons form highly recurrent networks with many lateral connections. The main interest here is that the backpropagation algorithm is not well suited for networks with highly recurrent connectivity. Thus, the algorithm presented here with predictive neurons may provide a promising solution for training recurrent networks, but more future work is needed to address this question.

Spiking neuronal network
Neuron activation states were updated as: A_j^t = (1 − ε) A_j^{t−1} + ε p′(A_j^{t−1}) (Σ_i w_ij p(A_i^{t−1}) + b_j), where A is the activation state, p′(A_j) is the derivative of the activation function p, b is the bias, and i and j are neuron indices. ε can be seen as the learning rate of the activations. These states are clipped to the range [0, 1]. Each neuron communicates only using binary signals, 0 or 1, to model spiking activity. For generating spikes, an encoder converts the neuron activation state to 1 or 0, which is sent as an output signal (panel A). The binary signals received by a neuron are converted into the activation state by a decoder. This could be seen as converting discrete spikes into a continuous post-synaptic membrane potential in actual neurons. We modified this spiking network by implementing our learning rule and by adding a least-squares model for predicting steady-state activation based on the initial activation in steps 1-17. We also used AdaGrad (Duchi et al., 2011) to search for hyperparameters, which resulted in setting the learning rate to 0.01. The network architecture was the same as in our main model, with 784-1000-10 neurons.
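A minimal sketch of this settling update, assuming a hard-sigmoid activation p (whose derivative p′ is 1 inside (0, 1) and 0 outside) and a small random all-to-all network; the choice of p, ε, and network size are assumptions for illustration only.

```python
import numpy as np

def p(z):
    """Hard-sigmoid activation function (an assumed choice)."""
    return np.clip(z, 0.0, 1.0)

def dp(z):
    """Derivative p' of the hard sigmoid: 1 on (0, 1), else 0."""
    return ((z > 0.0) & (z < 1.0)).astype(float)

def settle(A, W, b, eps=0.5, steps=12):
    """Iterate A_j^t = (1-eps)*A_j^{t-1}
       + eps * p'(A_j^{t-1}) * (sum_i w_ij * p(A_i^{t-1}) + b_j),
    clipping states to [0, 1] at every step."""
    for _ in range(steps):
        A = (1 - eps) * A + eps * dp(A) * (W.T @ p(A) + b)
        A = np.clip(A, 0.0, 1.0)
    return A

# toy all-to-all network with 20 neurons
rng = np.random.default_rng(1)
n = 20
W = 0.1 * rng.standard_normal((n, n))
b = 0.1 * rng.standard_normal(n)
A = settle(rng.random(n), W, b)
```

Running the loop for more steps lets the activations approach the steady state that the free phase would reach.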

Acoustic stimuli
In animal experiments, we used as stimuli 1 s long pure tones (2, 3.3, 5.4, 8.9, 15 and 24 kHz at 60 dB), interleaved with 1 s periods of silence, as described in (Luczak et al. 2009 & 2013). Activity occurring >200 ms after stimulus offset and before the next stimulus onset was regarded as spontaneous. The stimuli were presented continuously for 20-108 min, depending on how long the animal appeared to be comfortable during the experiment. All stimuli were tapered at the beginning and end with a 5 ms cosine window. Experiments took place in a single-walled sound isolation chamber (IAC, Bronx, NY) with sounds presented free field (RP2/ES1, Tucker-Davis, Alachua, FL). To compensate for the transfer function of the acoustic chamber, tone amplitudes were calibrated prior to the experiment using a condenser microphone placed next to the animal's head (7017, ACO Pacific, Belmont, CA) and an MA3 microphone amplifier (Tucker-Davis).
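The 5 ms cosine taper applied to each tone can be sketched as follows; the 44.1 kHz sampling rate is an assumption for illustration (the actual Tucker-Davis hardware rate may differ).

```python
import numpy as np

def pure_tone(freq_hz, dur_s=1.0, fs=44100, taper_ms=5.0):
    """Generate a pure tone with a raised-cosine ramp at onset and
    offset; fs is an assumed sampling rate, not from the paper."""
    t = np.arange(int(dur_s * fs)) / fs
    tone = np.sin(2 * np.pi * freq_hz * t)
    n = int(taper_ms * 1e-3 * fs)                 # samples in the 5 ms ramp
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))
    tone[:n] *= ramp                              # fade in
    tone[-n:] *= ramp[::-1]                       # fade out
    return tone

tone = pure_tone(8900)   # 8.9 kHz, one of the stimulus frequencies
```

The taper removes the broadband onset/offset clicks that an abruptly gated sinusoid would otherwise produce.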

Changes in neuronal activity across periods
For each animal we divided the recording into two halves. For each neuron we calculated the mean stimulus-evoked firing rate in the 30-40 ms time window, averaged over all stimulus presentations in each half of the experiment. Similarly, for each neuron we calculated the average predicted activity for the 30-40 ms time window, averaged across all stimuli in the 1st half of the experiment. We then calculated the difference between activities in the 2nd half minus the 1st half, and we correlated it with the difference between activity in the 1st half minus predicted activity in the 1st half (Fig. 3D). As described in the Methods section, only neurons with average stimulus-evoked firing rates higher than 3 SD above the pre-stimulus baseline were used in our analyses. Dividing the data in different proportions, e.g. 60%-40%, gave the same conclusions.
For comparison, analogous analyses were done on artificial neurons (Fig. 3C). For each neuron in the hidden layer, the steady-state clamped activity was averaged over all 1200 stimulus presentations in a single learning epoch. Similarly, for each neuron in the hidden layer, the predicted activity was averaged across all stimuli presented in an epoch. We then calculated the difference between clamped activities in two epochs: ∆C_i = <Â_i,M> − <Â_i,N>, where <> denotes average, i is the index of a neuron, and M and N are indexes of epochs with M > N. This was then correlated with the difference between average clamped and predicted activity in epoch N: ∆P_i = <Â_i,N> − <Ã_i,N>. We found that the correlation between ∆C and ∆P was strongest in the earliest epochs, where the learning curve was changing the most (Fig. 2E). For this simulation we used the neural network described in Methods with 784-1000-10 neurons, and we used the learning rule described in Eq. 3. However, in order to reproduce the repeated presentation of the same stimuli as in the animal experiments, we used the same 1200 stimuli in each learning epoch. Presenting different stimuli in each epoch, as in our original simulation, gave qualitatively similar results.
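The ∆C vs. ∆P comparison can be sketched as below; the array layout (stimulus presentations × neurons) and variable names are assumptions for illustration.

```python
import numpy as np

def delta_correlation(clamped_N, clamped_M, predicted_N):
    """Correlate the per-neuron change in average clamped activity
    between epochs N and M (dC) with the clamped-minus-predicted
    difference in epoch N (dP). Inputs: (presentations, neurons)."""
    dC = clamped_M.mean(axis=0) - clamped_N.mean(axis=0)
    dP = clamped_N.mean(axis=0) - predicted_N.mean(axis=0)
    return np.corrcoef(dC, dP)[0, 1]

# sanity check on synthetic data: if the prediction in epoch N already
# equals the epoch-M clamped average, then dP = -dC and r = -1
rng = np.random.default_rng(2)
cN = rng.random((1200, 50))
cM = rng.random((1200, 50))
r = delta_correlation(cN, cM, predicted_N=cM)
```

A strongly negative r in this synthetic case mirrors the logic of Fig. 3C: neurons whose activity exceeds their prediction decrease their rate, and vice versa.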

Predicting evoked dynamics from spontaneous activity
For predicting evoked dynamics from spontaneous activity, we first estimated the range of spontaneous activity patterns. For this we divided spontaneous packets into 10 groups based on principal component analysis (PCA). Specifically, each packet was represented as an N×T matrix, where N is the number of neurons and T=30 is the number of 3 ms long time bins. This matrix was then converted into a single vector V with length N×T. This was repeated for each packet, thus we obtained a
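The packet-vectorization and PCA-based grouping described above can be sketched as follows. The quantile split along the first principal component is an assumption, as the exact grouping criterion is not specified in this excerpt.

```python
import numpy as np

def packets_to_groups(packets, n_groups=10):
    """Flatten each N x T packet into a vector, project onto the first
    principal component, and split packets into n_groups quantile bins
    (the quantile-based split is an assumed grouping rule)."""
    X = np.stack([p.ravel() for p in packets])        # (packets, N*T)
    Xc = X - X.mean(axis=0)                           # center for PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                               # projection on PC1
    edges = np.quantile(scores, np.linspace(0, 1, n_groups + 1))
    groups = np.searchsorted(edges, scores, side='right') - 1
    return np.clip(groups, 0, n_groups - 1)

# toy example: 100 packets, N=50 neurons, T=30 time bins
rng = np.random.default_rng(5)
packets = [rng.random((50, 30)) for _ in range(100)]
groups = packets_to_groups(packets)
```

Each group then pools packets with a similar overall spatiotemporal pattern, giving a discretized description of the range of spontaneous activity.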

Fig. 1. Basics of the algorithm. (A) Schematic of the network. Note that activity propagates back and forth between hidden and output layers. (B) Sample neuron activity. (Bottom traces) Initially the network receives only the input signal (free phase), but after a few steps the output signal is also presented (clamped phase). The dashed line shows neuron activity in the free phase if the output were not clamped. The light blue dot represents steady-state free phase activity predicted from initial activity (shaded region). Synaptic weights (w) are adjusted in proportion to the difference between steady-state activity in the clamped phase (Â) and estimated free phase activity (Ã).

Fig. 2. Neuron prediction of expected activity. (A) Activity of 10 output neurons in response to a sample stimulus at the beginning of network training. The gray area indicates the extent of the free phase (steps 1-12). Solid red lines show the activity of the neurons clamped at step 13. Dashed lines represent free phase activity if output neurons had not been clamped. Dots show predicted steady-state free phase activity based on initial activity (in the gray area). (B) Activity of the same neurons after network training. Note that free phase and predicted activity converged to the desired clamped activity. (C) Activity of a representative neuron in the hidden layer in response to 5 different stimuli after network training. Solid and dashed lines represent the clamped and free phase respectively, and dots show predicted activity. (D) Predicted vs. actual free phase activity. For visualization clarity, only every tenth hidden neuron out of 1000 is shown, in response to 20 sample images. Different colors represent different neurons, but some neurons may share the same color due to the limited number of colors. The distribution of points along the diagonal shows that predictions are accurate. (E) Decrease in error rate across training epochs. Yellow and green lines denote learning curves for the training and test data sets respectively. Note that in each epoch we used only 2% of the 60,000 training examples.

Fig. 3. Predicting future activity of cortical neurons. (A) Response of a representative neuron to different stimuli. For visualization, only 5 out of 12 responses are shown. The gray area indicates the time window which was used to predict future activity. Dots show predicted average activity in the 30-40 ms time window. Colors correspond to different stimuli. (B) Actual vs. predicted activity for 55 cells from 4 animals in response to 12 stimuli. Different colors represent different neurons, but some neurons may share the same color due to the limited number of colors. Inset: histogram of correlation coefficients for individual neurons. The rightward skew of the distribution shows that for most neurons the correlation between actual and predicted response was positive. (C) Average change in clamped steady-state activity between 2 consecutive learning epochs in our network model. This change relates to the difference between clamped and predicted activity in the earlier epoch (N=7; Suppl. Materials). Each dot represents one neuron. The regression line is shown in yellow. (D) Average change in firing rate between the 1st and 2nd half of our experiment with repetitive auditory stimulation. This change relates to the difference between stimulus-evoked and predicted activity during the 1st half of the experiment (Suppl. Materials). Each dot represents the activity of one neuron averaged across stimuli. The similar behavior of cortical and artificial neurons suggests that both may be using the same learning rule. Note that although the results in panel A, showing that the firing rate is consistent from the early to the later phase of the response, may not be surprising, the results in

(Eq. 4): Ã(120) = λ(1)·Ǎ(1) + … + λ(12)·Ǎ(12) + b, where terms in brackets correspond to time steps, and λ and b correspond to coefficients and offset terms found by the least-squares method. Next, 10 new examples were taken, for which the free phase was run only for 12 steps; then the above-derived least-squares model was applied to predict the free phase
steady-state activity for each of the 10 examples. From step 13 the network output was clamped. The weights were updated based on the difference between predicted and clamped activity calculated only from those 10 new examples. This process was repeated 120 times in each training epoch. Moreover, the MNIST dataset has 60,000 examples, which we used for the above-described training, and 10,000 additional examples which were used only for testing. For all plots in Figure 2 we used only test examples which the network never saw during training.
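The least-squares predictor of steady-state activity from the first 12 free-phase steps can be sketched as below; the synthetic data are for illustration only.

```python
import numpy as np

def fit_predictor(early, steady):
    """Fit Atilde(120) = lam(1)*A(1) + ... + lam(12)*A(12) + b by
    ordinary least squares. `early`: (examples, 12), `steady`: (examples,)."""
    X = np.hstack([early, np.ones((early.shape[0], 1))])  # append offset b
    coef, *_ = np.linalg.lstsq(X, steady, rcond=None)
    return coef[:-1], coef[-1]                            # lambdas, b

def predict(early, lambdas, b):
    """Apply the fitted model to new early-phase activity."""
    return early @ lambdas + b

# synthetic example: steady state is an exact linear function of steps 1-12
rng = np.random.default_rng(3)
early = rng.random((200, 12))
true_lam = rng.standard_normal(12)
steady = early @ true_lam + 0.3
lam, b = fit_predictor(early, steady)
```

Fitting on one batch of examples and predicting on fresh ones, as in the training procedure above, keeps the prediction honest with respect to unseen activity.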

Suppl. Fig. 5. Implementation of our learning rule in a spiking neuronal network. (A) Activity of a sample neuron. The red trace shows the neuron's internal activation, which is a function of its synaptic inputs. (Top) Spikes are generated based on the value of the internal activation. The gray shaded area marks the extent of the free phase, which is used to predict steady-state activation. (B) Actual vs. predicted steady-state free phase activity. Each dot represents the activity of one neuron during the presentation of one stimulus. The activity of 1000 hidden neurons during the presentation of 200 stimuli from the test dataset is shown. The correlation coefficient between actual and predicted activity was R=0.99 (p<0.00001). (C) Learning curves on the MNIST task for the training and testing datasets. An error rate of 2.46% on the test set shows that the spiking neuronal network with our learning rule was able to solve the presented task.
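A minimal sketch of the encoder/decoder idea: a Bernoulli spike encoder and a leaky decoder. Both schemes are assumptions for illustration; the paper's actual encoder and decoder may differ.

```python
import numpy as np

def encode(activation, rng):
    """Bernoulli encoder (an assumed scheme): emit a spike (1) with
    probability equal to the activation state in [0, 1]."""
    return (rng.random(activation.shape) < activation).astype(np.int8)

def decode(spikes, estimate, tau=0.1):
    """Leaky integration of binary spikes back into a continuous value,
    akin to a post-synaptic membrane potential."""
    return (1 - tau) * estimate + tau * spikes

# with a constant activation of 0.7, the decoded estimate should hover
# around 0.7 once the leaky integrator has settled
rng = np.random.default_rng(4)
act = np.full(1000, 0.7)
est = np.zeros(1000)
for _ in range(200):
    est = decode(encode(act, rng), est)
```

The key property this illustrates is that neurons exchange only binary signals, while each receiver recovers an approximately continuous activation from the spike stream.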