Abstract
There are two principal approaches to learning in artificial intelligence: error-driven global learning and neuroscience-oriented local learning. Integrating them into one network may provide complementary learning capabilities for versatile learning scenarios. At the same time, neuromorphic computing holds great promise, but it still needs more useful algorithms and algorithm-hardware co-designs to fully exploit its advantages. Here, we present a neuromorphic global-local synergic learning model by introducing a brain-inspired meta-learning paradigm and a differentiable spiking model incorporating neuronal dynamics and synaptic plasticity. It can meta-learn local plasticity and receive top-down supervision information for multiscale learning. We demonstrate the advantages of this model in multiple different tasks, including few-shot learning, continual learning, and fault-tolerance learning in neuromorphic vision sensors. It achieves significantly higher performance than single-learning methods. We further implement the model on the Tianjic neuromorphic platform by exploiting algorithm-hardware co-designs and prove that the model can fully utilize the neuromorphic many-core architecture to develop a hybrid computation paradigm.
Introduction
The majority of neuromorphic models are built on either backpropagation alone or neuroscience-oriented local plasticity (LP) alone, two routes with radically different learning features and advantages. In general, backpropagation is global error-driven learning with two alternative information circuits (top-down and bottom-up). The learning process is implemented via layer-by-layer allocation of global supervised errors. Benefiting from recent progress in deep learning, backpropagation and its many variants have been applied to train neuromorphic models represented by spiking neural networks (SNNs) and have demonstrated good accuracy in certain specific tasks, such as image classification and reinforcement learning (RL)^{1,2,3,4,5,6,7,8}. In contrast, neuroscience-oriented LP is essentially correlation-driven learning, which prominently occurs between presynaptic and postsynaptic neurons and is triggered by asynchronous spike activity. Rooted in neuroscience, LP has been widely used for feature abstraction and facilitates the realization of many advanced brain-inspired learning and memory mechanisms^{9,10,11,12}. Despite remarkable practical advantages in low-latency and energy-efficient computation^{13,14}, applications of neuromorphic models are still limited to a small range of uses, and their overall performance is inferior to state-of-the-art results^{15,16}. Fully exploiting their potential still requires more effective algorithms. At the same time, local correlation-driven learning (called LP) and global error-driven learning (called “global plasticity” (GP)) are the two principal learning routes for artificial intelligence (AI). Both approaches have their unique advantages, but neither can completely outperform the other on all learning problems. Thus, it is highly desirable to integrate them into one single neuromorphic hybrid model to explore complementary synergic learning capabilities in multiple learning scenarios.
However, the different learning circuits and weight update behaviors make it unclear how to develop such hybrid learning, and the further incorporation of complex and diverse spiking dynamics poses an even greater challenge for exploring hybrid learning on neuromorphic models.
Recently, there has been an increasing number of studies involving global-local hybrid models with different interests. A related line of biological study concerns the three-factor learning rule^{17,18}. The three-factor learning rule describes a general framework of synaptic plasticity that incorporates presynaptic activity, postsynaptic activity, and additional third factors, such as neuromodulators and neurotransmitters, representing the top-down supervised signals. Several studies on developing spiking learning algorithms have been inspired by the three-factor learning rule^{8,19,20,21,22,23}. Adopting a biologically plausible approach, they took supervised errors or rewards as the biological third factor to modulate the magnitude of local weight updates and successfully applied the methods to simple image classification tasks^{8,19}, RL tasks^{20} and probabilistic inference tasks^{21,22,23}. Despite great progress in understanding biological learning, because of their different learning goals, most of these works fail to fully exploit the advantages of global gradient learning and are generally not good at solving many complex learning problems^{24,25}.
Another related vein is meta-learning. Meta-learning, also named learning-to-learn, is ubiquitous in nature for continuously improving the learning ability. Existing meta-learning studies^{26,27,28} for SNNs narrowly focus on improving a single GP-based model without integrating LP. How to learn to optimize both the LP and GP with various spiking features and integrate their respective advantages in one neuromorphic model remains an important open issue. Alternatively, in the early 1990s, ref. ^{29} proposed a framework that can optimize LP through supervised signals. In follow-up research, several studies extended the framework to establish large-scale non-SNNs and have demonstrated high performance in solving few-shot learning^{30,31} and unsupervised learning^{32,33}. However, due to the lack of an effective and configurable hybrid mechanism, the potential of global-local learning has not yet been fully explored, either in some important learning capabilities or in computational efficiency. Furthermore, none of these works considered developing such hybrid synergic learning in neuromorphic computing.
The previous lack of powerful hybrid learning for neuromorphic models also affects the use of many-core neuromorphic hardware, which aims to provide ultralow-power hardware solutions for AI implementations. A feasible hardware scheme that can support online hybrid learning has yet to be reported. If neuromorphic hybrid learning models with algorithm-hardware co-design could be developed on neuromorphic platforms, then the neuromorphic many-core architecture could be exploited to explore hybrid on-chip computation schemes and obtain better performance in practical learning scenarios.
Here, we show a spike-based hybrid plasticity (HP) model using a brain-inspired meta-learning paradigm and a differentiable spiking dynamics model with parameterized LP. The approach provides a generic and flexible integration of these two learning methods and facilitates high-efficiency hybrid learning on neuromorphic chips. By developing multiple synergic learning strategies, we demonstrate that, with slight modifications to the local modules, the proposed model can solve three different learning problems: few-shot learning, continual learning, and fault-tolerance learning. Finally, we exploit algorithm-hardware co-design to implement the hybrid model on Tianjic chips and further prove that it can fully utilize the neuromorphic architecture to develop a hybrid computation paradigm.
Results
Hybrid synergic learning model
The proposed spike-based hybrid model exploits two streams of neuroscience experimental cues about synaptic modulation behaviors and a multiscale learning mechanism (Fig. 1a) to model the neuromodulatory mechanism and establish the synergic learning circuit.
First, in the hippocampus, local neural circuits can be simultaneously controlled by multiple types of top-down modulatory signals. These signals act on many synapses and modulate diverse plasticity behaviors, including the learning rate, update polarity, and plasticity consolidation^{17,34,35}. In particular, some neuromodulators, such as adenosine, can affect the actions of synaptic functioning and other modulators in a hierarchical manner^{36,37}. This indicates that neuromodulators can be formulated as a special type of meta-learning parameter acting on synaptic plasticity in a weight-sharing manner.
Second, neuromodulators can modulate synaptic plasticity on multiple temporal scales. Neuromodulators exhibit radically different evolution (learning) scales and individual functions from synapses^{35,38}. The coupling of plasticity and neuromodulatory mechanisms plays a vital role in building complex behavior functions, such as muscle contraction^{39}. It further implies a flexible and multiscale learning mechanism in which the learning processes of neuromodulators and synapses may occur on different spatial and temporal scales; thus, they can be formulated as two types of variables that are learned in different ways in the optimization process.
To this end, we formulate the hyperparameters of LP, such as the learning rate and sliding threshold, as a group of meta-parameters \({\boldsymbol{\theta }}\). Since these meta-parameters \({\boldsymbol{\theta }}\) control the weight update behaviors, we propose modeling \({\boldsymbol{\theta }}\) as the upper-level variable of weight \(w\) and transforming the learning process of synaptic weights and neuromodulators into a bilevel optimization problem^{40}. On this basis, we decouple the learning processes of \({\boldsymbol{\theta }}\) and \(w\) into different optimization loops, which allows us to use the hyperparameter optimization technique^{41} for optimizing \({\boldsymbol{\theta }}\) (see Methods). In this manner, we expect to provide an effective and flexible modeling strategy that supports versatile local learning rules and can learn to optimize both the underlying weights and the local learning rules in different optimization loops, thereby facilitating a generic integration of the two types of learning.
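The two optimization loops can be illustrated with a deliberately small toy problem. The sketch below is not the paper's algorithm: the scalar weight, the pull-toward-input local rule, the function names, and the finite-difference hypergradient are all assumptions made for illustration (the paper refers to a gradient-based hyperparameter optimization technique described in Methods). It only shows the bilevel structure, in which an inner loop adapts \(w\) under a local rule controlled by a meta-parameter and an outer loop tunes that meta-parameter from a supervised error.

```python
def adapt_weight(w0, theta, x, steps=5):
    """Lower level: run a local update whose step size is the meta-parameter theta."""
    w = w0
    for _ in range(steps):
        w += theta * (x - w)  # toy local rule (assumption): pull w toward the input x
    return w


def outer_loss(theta, w0, x, target):
    """Upper level: supervised error of the weight produced by the inner loop."""
    return (adapt_weight(w0, theta, x) - target) ** 2


def meta_learn(theta=0.05, w0=0.0, x=1.0, target=0.8, lr=0.05, iters=200, eps=1e-5):
    """Tune theta by gradient descent on the outer loss.

    The hypergradient is estimated with central finite differences, a stand-in
    for the gradient-based hyperparameter optimization used in the paper.
    """
    for _ in range(iters):
        grad = (outer_loss(theta + eps, w0, x, target)
                - outer_loss(theta - eps, w0, x, target)) / (2 * eps)
        theta -= lr * grad
    return theta


theta_star = meta_learn()
print(theta_star, outer_loss(theta_star, 0.0, 1.0, 0.8))
```

In this toy setting the inner loop has the closed form \(w=x\left(1-{\left(1-\theta \right)}^{5}\right)\), so the outer loop should settle near \(\theta =1-{0.2}^{1/5}\); the point is only that the supervised signal reaches \(w\) indirectly, through the parameters of the local rule.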
Furthermore, we want to incorporate diverse spiking dynamics, LP, and global learning into one unified temporal credit assignment problem. To this end, we jointly derive from membrane potential dynamics and ion-channel dynamics and obtain a differentiable signal propagation equation of subthreshold dynamics. We use a memory gate (Fig. 1d) and make a careful choice of derivative approximation^{8,42} for handling the discontinuous dynamics of spiking neurons (see Methods). Then, we adopt the backpropagation through time (BPTT) algorithm as global learning for training SNNs. Because LP has an independent correlation-based update rule, it is difficult to ensure convergence when local modules with hand-crafted parameters are directly combined with global learning. To incorporate the impact of local weight updates into the entire optimization framework, we take a parametric modeling strategy (Fig. 1c) to transform the local synaptic increments \(\Delta {w}_{{LP}}\) into a parametric function of presynaptic spike activity \({pre}\), postsynaptic spike activity \({post}\), and local hyperparameters \({\boldsymbol{\theta }}\), consisting of the local learning rate, sliding threshold and some other hyperparameters determined by specific local learning rules. By doing so, we not only maintain the underlying weight directly optimized by BPTT but also exclusively model local weight update behaviors as a temporal function of adjacent spiking activities and local meta-parameters.
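As a minimal sketch of what such a parametric local rule might look like, the snippet below writes the increment as a function of presynaptic activity, postsynaptic activity, and a small dictionary of meta-parameters. The specific form (a Hebbian product with a learning rate `eta` and a sliding threshold `beta`) and all names are assumptions for illustration; the rule actually used in the paper is defined in its Methods.

```python
def local_increment(pre, post, theta):
    """Return dw[i][j] = eta * (post[i] - beta) * pre[j].

    pre, post : lists of spike activities (0/1 or rates) for one time step
    theta     : meta-parameters of the local rule (hypothetical names)
    """
    eta, beta = theta["eta"], theta["beta"]
    return [[eta * (po - beta) * pr for pr in pre] for po in post]


pre = [1, 0, 1]                      # presynaptic spikes
post = [1, 0]                        # postsynaptic spikes
theta = {"eta": 0.1, "beta": 0.5}    # learning rate and sliding threshold
dw = local_increment(pre, post, theta)
for row in dw:
    print(row)
```

Because the increment is an ordinary differentiable function of `theta`, the meta-parameters can be placed in the same computational graph as the BPTT-trained weights, which is the property the parametric modeling strategy relies on.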
The adopted decoupling optimization strategy is also inspired by a variant of synaptic dynamics. Specifically, derived from the ion-channel differential dynamics, the synaptic weights \(w\left(t\right)\) have two terms
\[w\left(t\right)=k\left(t\right)w\left({t}_{n}\right)+\widetilde{P}\left(t,{pre},{post};{\boldsymbol{\theta }}\right)\qquad (1)\]
where \(w\left({t}_{n}\right)\) denotes the phase value of the synaptic dynamics at discrete time \({t}_{n}\), \(k\left(t\right)={e}^{\frac{{t}_{n}-t}{{\tau }_{w}}}\) denotes the synaptic decay function, and \({\tau }_{w}\) denotes the synaptic time constant. Here \(\widetilde{P}\left(t,{pre},{post};{\boldsymbol{\theta }}\right)\) denotes the generic local modifications. If we further assume that the top-down signals modify the state \(w\left({t}_{n}\right)\) at the specific time \({t}_{n}\) when supervision signals are provided, and that \(\widetilde{P}\left(t,{pre},{post};{\boldsymbol{\theta }}\right)\) represents the LP driven by neuron activities and meta-parameters \({\boldsymbol{\theta }}\), then Eq. (1) can be used to decompose the weight into two parts, \({w}_{GP}\) and \({w}_{LP}\).
Finally, we develop a global-local synergic learning model that integrates LP and GP into one neuromorphic model by exclusively allocating them to act on different weight parts and time scales. Here, each HP unit has two weight branches (Fig. 1d): \({w}_{GP}\), which is directly updated by BPTT when receiving supervision information, and \({w}_{LP}\), which is updated in an event-driven manner by the meta-learned spiking LP. To fully utilize the various learning features of the LP and GP, we use the meta-learning circuit and bilevel optimization technique to decouple the optimization processes of \({w}_{LP}\) and \({w}_{GP}\). The HP model provides a flexible and configurable solution that allows the construction of different learning modules, supports various local learning rules, and enables the configuration of different learning strategies for solving a variety of complex tasks. In particular, this property is attractive for low-energy and efficient execution in edge computing^{43}, since limited memory resources can be reused and different learning strategies can be adapted.
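The two-branch idea can be sketched as a small class. This is an illustrative assumption rather than the paper's implementation: the class name, the discrete-time leak on the local branch, the Hebbian form of the local step, and the plain gradient step on the global branch are all hypothetical choices; in the paper the global branch is trained with BPTT and the local rule is meta-learned.

```python
import math


class HPSynapse:
    """One synapse with a supervised branch (w_gp) and a local branch (w_lp)."""

    def __init__(self, w_gp, tau_w=10.0, eta=0.1, beta=0.0):
        self.w_gp = w_gp                      # changed only when supervision arrives
        self.w_lp = 0.0                       # changed by local, event-driven updates
        self.decay = math.exp(-1.0 / tau_w)   # one-step decay factor (assumption)
        self.eta = eta                        # local learning rate (meta-parameter)
        self.beta = beta                      # sliding threshold (meta-parameter)

    def local_step(self, pre, post):
        """Leak the local branch, then add a Hebbian-style increment."""
        self.w_lp = self.decay * self.w_lp + self.eta * pre * (post - self.beta)

    def global_step(self, grad, lr):
        """Apply a supervised gradient step to the global branch only."""
        self.w_gp -= lr * grad

    @property
    def weight(self):
        """Effective weight seen by the postsynaptic neuron."""
        return self.w_gp + self.w_lp


syn = HPSynapse(w_gp=0.5)
syn.local_step(pre=1, post=1)       # a coincident spike pair strengthens w_lp
syn.global_step(grad=1.0, lr=0.1)   # supervision touches w_gp and leaves w_lp alone
print(syn.w_gp, syn.w_lp, syn.weight)
```

Keeping the branches separate is what lets the two update rules run on different time scales without overwriting each other.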
Baseline performance evaluation
We comprehensively evaluated the basic performance of the HP models, including accuracy, coding efficiency, and convergence, using image classification datasets (MNIST, Fashion-MNIST, and CIFAR10) and neuromorphic datasets (CIFAR10-DVS and DVS-Gesture). The network architecture and training details are described in Methods.
Table 1 reports the accuracy of the models on the different types of datasets. Because SNNs achieve a balance between performance and efficiency, we mainly compared the proposed model with other advanced spike-based models. In image classification tasks, the HP SNN achieves higher accuracies than other published work. In sequential learning tasks, the HP SNN improves on the performance of a single GP SNN, indicating that local modules are beneficial for the use of longer learning time scales. We provide a deeper analysis of this in Fig. 3.
One prominent feature of our work is the modeling of synaptic ion-channel dynamics, which reveals an interesting relationship between spiking information coding and an event-driven inference capability (see Methods). Next, we show that HP SNNs can support rate coding and temporal rank coding schemes^{44} (Fig. 2a). Figure 2b, c compares the results of different coding methods on Fashion-MNIST and CIFAR10. Specifically, we compared the training curves of the SNNs with rank order coding and those with rate coding using the same simulation time windows. Because of the event-driven property of rank order coding, we computed its average time windows and further compared the training curves of the SNN with rate coding using the same average time windows (called rate-avg.) in Fig. 2b, c. These figures indicate that HP SNNs are well suited for rate-based and rank-based spiking networks. With a longer average time window, the rate-based model obtains higher accuracy. With more flexible and event-driven response characteristics, the rank-based model achieves lower inference latency with a slight accuracy loss. Furthermore, we also plot the details of the average response time of each category for different datasets in Fig. 2d. We found that HP SNNs use a flexible strategy for decision-making. Using the CIFAR10 dataset as an example, for most categories, HP SNNs can make decisions within four time windows; however, for some complicated patterns, the model requires more time to make decisions. This flexible decision process significantly reduces the average inference latency and leverages the high efficiency of neuromorphic hardware. We instantiated our model on Tianjic chips and reported the energy evaluation in Supplementary Table 1.
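The latency difference between the two readouts can be seen in a toy example. The sketch below is an assumption-laden simplification: it treats the output layer as a list of binary spike trains, implements rate coding as a spike count over the full window, and implements the rank-order style readout as "stop at the first output spike". The function names and the tie-breaking by lowest neuron index are hypothetical.

```python
def rate_decision(spikes):
    """Rate coding: pick the output neuron with the most spikes over the full window."""
    counts = [sum(train) for train in spikes]
    window = len(spikes[0])
    return counts.index(max(counts)), window


def first_spike_decision(spikes):
    """Rank-order style readout: pick the first output neuron to fire and stop there."""
    window = len(spikes[0])
    for t in range(window):
        for neuron, train in enumerate(spikes):
            if train[t]:
                return neuron, t + 1   # decision made after t + 1 time steps
    return None, window                # no output spike within the window


# Rows are output neurons, columns are time steps.
spikes = [
    [0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(rate_decision(spikes))          # uses the whole window
print(first_spike_decision(spikes))   # stops at the first output spike
```

Here both readouts select the same neuron, but the event-driven one needs fewer time steps. The two can disagree when an early spike comes from a neuron that does not end up with the highest count, which is one way to picture the trade-off between latency and accuracy described above.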
We further compared the convergence of the proposed approach with other related learning methods. Intuitively, since local learning and global learning have independent update methods, directly combining them cannot ensure convergence. Furthermore, for neuroscience-oriented learning models, such as STDP-based SNNs, how to configure the parameters of local learning rules in a systematic and general way has yet to be resolved. Although hand-designed or bio-simulated hyperparameters can alleviate the above problems, choosing them is time-consuming and makes it difficult to ensure performance. By parametrizing the local module, the HP approach can automatically optimize the local hyperparameters and achieve a synergic learning mode. As a demonstration, we comprehensively compared the loss curves and accuracies of a single LP network, a single GP network, a fine-tuning LP + GP network, and the proposed HP network in Fig. 3. For fairness, we used the same initial weight configurations (see Methods).
Figure 3a shows that the fine-tuning method has poor convergence and performs worse than the other models. Through the proposed compatible design, the HP model significantly improves on the accuracy of the LP model (Fig. 3b) and the fine-tuning hybrid model, indicating that the proposed method can efficiently integrate the LP and GP methods. The single GP model is suitable for optimizing the errors of common classification tasks. The HP model inherits this advantage and achieves comparable convergence on the static MNIST and Fashion-MNIST datasets. Furthermore, the HP model demonstrates higher accuracy and faster convergence than the GP model on the sequential MNIST dataset. Because the LP module provides a correlation-driven weight matrix that stores the correlation product of the presynaptic inputs and postsynaptic activations (i.e., spikes) in the previous steps, as indicated in ref. ^{45}, it can act as a type of attention mechanism to the recent past, with the strength determined by the scalar product between the current hidden vector and earlier input stimulus. Combining the results of Table 1 and Fig. 3 shows that adding LP modules could benefit the use of longer learning time scales. Furthermore, we visualized the activations in the first hidden layer on Sequential MNIST using a 2D embedding obtained with t-distributed stochastic neighbor embedding (t-SNE). Take the yellow cluster in Fig. 3c as an example. Adding LP modules can help the HP model group the points within each class more compactly and push different classes farther apart. Overall, the above results indicate that the HP model can efficiently coordinate the GP and LP methods with stable convergence for common classification tasks.
Fault-tolerance learning
Fault tolerance is essential for the real-time information processing of neuromorphic chips to prevent the influence of internal noise or external interference. For example, neuromorphic vision sensors (NVSs)^{46} can quickly capture per-pixel brightness changes with low latency and a high dynamic range but suffer from the inherent noise of physical devices and the movements of external background objects, which affects their practical performance. Next, we demonstrate that by virtue of Hebbian-based local modules (Fig. 4a) and the hybrid strategy, the HP model may improve the fault tolerance of single GP-based networks. We examined the ability of this model to handle incomplete data using an image classification dataset (MNIST) and a neuromorphic dataset (N-MNIST). We use incomplete data to refer to cropped data (e.g., parts of the image information are masked, Fig. 4b) and noise-mixed data (data mixed with salt-and-pepper noise, Fig. 4b). The models were trained on the standard datasets and tested using incomplete test samples. Figure 4 shows that as the cropping area increases, the HP model exhibits stronger resistance to cropping on the N-MNIST (upper) and MNIST (lower) datasets. Furthermore, the noise experiments also show that the HP model achieves good robustness and mitigates the interference of different types of noise. Similarly, the superiority of the HP model becomes apparent as the noise level increases (Fig. 4d). To obtain a more insightful analysis, we calculated the Euclidean distance (Fig. 4e) and the cosine similarity (Fig. 4f) between the first hidden layer activations of the incomplete data and those of the original data using the same model on the MNIST dataset. As Fig. 4e, f illustrates, the HP model diminishes the pattern distance between the incomplete patterns and original patterns, indicating that the local modules can help the network leverage the previous associative features from incomplete data and therefore benefit the fault-tolerance capability of the network. We provide the effectiveness analyses of the HP models in the Effectiveness analyses section.
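The corruption and comparison steps of this analysis can be sketched with a few small helpers. Everything here is illustrative: the helper names, the row-wise cropping, the particular salt-and-pepper scheme, and the use of flattened pixel values in place of first-hidden-layer activations are assumptions, since the exact protocol is given in the paper's Methods rather than in this section.

```python
import math
import random


def crop(image, rows):
    """Mask (zero out) the first `rows` rows of a 2D image given as nested lists."""
    return [[0.0] * len(r) if i < rows else list(r) for i, r in enumerate(image)]


def salt_and_pepper(image, p, rng):
    """With probability p, replace a pixel by 0.0 or 1.0 chosen at random."""
    return [[rng.choice((0.0, 1.0)) if rng.random() < p else v for v in r]
            for r in image]


def flatten(image):
    return [v for r in image for v in r]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


image = [[0.2, 0.8], [0.5, 0.1]]
noisy = salt_and_pepper(image, p=0.5, rng=random.Random(0))
cropped = crop(image, rows=1)

# In the paper these metrics are computed on hidden-layer activations;
# raw pixels are used here only to keep the sketch self-contained.
print(euclidean(flatten(image), flatten(cropped)))
print(cosine(flatten(image), flatten(noisy)))
```

A model with better fault tolerance would show a smaller Euclidean distance and a larger cosine similarity between the activations for corrupted and clean inputs.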
Few-shot learning
We next investigate the potential of HP models for few-shot learning. In this case, the classifier must adapt to new classes not seen in the training phase when given only a limited number of samples from each class. To efficiently establish a mapping relationship from the limited data, it is vital to leverage prior knowledge or acquire inductive biases. GP-based networks succeed in abstracting useful features; however, it is difficult for these networks to exploit the prior knowledge hidden in the limited datasets without resorting to other techniques. In contrast, the brain is highly efficient in learning from limited data. Neuroscience findings^{11} reveal that the response of cortical neurons to a sensory stimulus can be reliably increased after just a few repetitions by virtue of local synaptic plasticity, indicating that such plasticity may play an important role in exploiting the correlation hidden in limited data. By integrating LP and GP learning, we expect that the synergic learning model can solve this problem through a twofold mechanism: (1) abstracting a sufficiently discriminative representation of the input data with the GP module; (2) finding a useful inductive bias from a limited number of example pairs, mainly with the LP module.
Here, we used the Omniglot dataset to examine the performance of the proposed model. We adopted a widely used network structure^{31,47,48} to abstract features and compared the resulting performance. We also fed the training labels to the last layer during the presentation time, as done in ref. ^{30}, to help the network establish an input-to-output correlation via the LP module. In this manner, when a query sample is received, the LP module may provide an augmented signal based on the correlation-based inner product of the query sample and the centers of each previously presented sample. Figure 5a, b depicts the comparison results. A detailed experimental setup is provided in Methods. Because vanilla backpropagation has difficulty learning useful feature representations from a limited number of samples, a single GP model hardly learns in this task. The best accuracies of our model for the five-way one-shot and twenty-way one-shot tasks are 98.7% and 94.6%, respectively, which are significantly higher than those of previous SNNs. Compared with the single GP model, the improved accuracy indicates that the LP module plays a critical role in performance. In addition, without resorting to additional techniques, the synergic learning model achieves competitive results that are comparable to the state-of-the-art results, as shown in Table 2.
Continual learning
To further explore the benefits that the HP model might provide, we investigate the ability of HP models for continual learning, that is, the ability to learn new tasks without forgetting previous tasks^{49}. Recent studies^{50} have shown that the motor cortex disinhibits a sparse subset of dendritic branches for new tasks, thereby reducing the disruption of network memory for previous tasks. This implies that the brain may multiplex some neurocircuits while highly activating some synaptic connections to represent task-related information when solving new tasks. These findings motivate us to develop a distributed synergic learning paradigm that activates a sparse overlapping subset of weight connections by GP learning and modulates the other synaptic connections by task-sharing LP learning. Unlike the previous XdG method^{25}, which uses a sub-network to solve a sub-task by masking parts of the neurons in each task, our method uses a finer-grained synaptic modulation and a different synergic learning scheme. Here we allow the hybrid model to use a small number of overlapping connections to represent task-specific information, and LP learning to learn common features among tasks. By doing so, we expect to alleviate the disruption of network memory across tasks and expand the learning capacity of the hybrid model to handle multiple tasks.
To this end, we examined the HP model performance on the standard shuffled MNIST dataset and compared it with a single GP model and the state-of-the-art results^{25,51,52}. We ran all models five times and reported the testing results after fifty-task learning (Fig. 5c). For each task, we randomly activated 3% of the connections, sparse and overlapping across tasks, for GP learning and used LP learning for the other connections. The meta-parameters of local learning were trained using the first 35 tasks and fixed in the last 15 tasks for evaluation. A more detailed setting can be found in Methods. Figure 5c indicates that during the 50-task learning, the HP model consistently achieves the best results compared with other works. The proposed model clearly outperforms the sparse GP model, which indicates the effectiveness of the proposed hybrid paradigm. In addition, because this paradigm allows the hybrid model to flexibly allocate different learning methods to different connections, the proposed model can leverage the many-core architecture to optimize the deployment of on-chip resources. We demonstrate this flexibility in the section on hybrid computation on the Tianjic.
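The connection allocation described above can be sketched as follows. The function name, the flat indexing of connections, the independent uniform sampling per task, and the network size are assumptions for illustration; only the 3% fraction and the 50 tasks come from the text, and the actual sampling procedure is specified in Methods.

```python
import random


def task_masks(n_connections, n_tasks, fraction=0.03, seed=0):
    """For each task, sample the set of connection indices trained by GP.

    Masks are drawn independently, so the masks of different tasks may overlap.
    All remaining connections are left to the shared local rule.
    """
    rng = random.Random(seed)
    k = max(1, round(fraction * n_connections))
    return [frozenset(rng.sample(range(n_connections), k)) for _ in range(n_tasks)]


n = 1000                                   # hypothetical number of connections
masks = task_masks(n_connections=n, n_tasks=50)
gp_part = masks[0]                         # updated by BPTT while learning task 0
lp_part = set(range(n)) - gp_part          # updated by the local rule
print(len(gp_part), len(lp_part), len(masks[0] & masks[1]))
```

Because each task touches only a small, randomly chosen subset of connections with supervised updates, later tasks overwrite little of what earlier tasks stored there, while the local rule is shared across all tasks.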
Effectiveness analyses
We next analyze the effectiveness of the hybrid synergic learning model. Because the learning of such hybrid models is affected by both the external supervision error and internal synaptic behaviors, according to the different learning circuits, we assume that the overall loss of the hybrid model can be decomposed into an explicit classification loss and an implicit loss driven by the network dynamics (see Methods). Then we remodel the local weight update from the perspective of optimization and analyze the effectiveness of the approximate regularization based on hetero-associative memory (HAM)^{27,53} and metric learning.
For the fault-tolerance test, if we consider the local weight increment as a derivative of the implicit loss function, we can integrate the local weight increment and derive the implicit loss in a form similar to the energy function used in HAM^{27,53}. Similarly, Hebbian-based operations can help to encode the previous patterns that trigger concurrent neuron firing into a local minimum in the energy landscape. On the one hand, GP learning ensures that the network can selectively activate neuron firing and realize the correct response to input patterns. This indicates that the concurrent firing behaviors are more likely to represent an adequate response to the previous training patterns. On the other hand, LP learning can gradually decrease the energy surface at every update based on the concurrent firing behaviors (Eq. (15)). By implicitly optimizing this surface, local modules place an approximate regularization on the network structures. This encourages the network to selectively strengthen the weights triggering these concurrent firing activities and thereby produce a stronger stimulus for repeated or similar patterns. By combining LP and GP learning, the HP model can optimize the energy regularization and gradually relax the hierarchical representation of networks to local minimum states, which may encode the previous associative patterns, thereby exploiting the correlation embedded in the presented training examples (see Methods).
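As a worked illustration of this argument, assume the simplest Hebbian increment with a hypothetical local learning rate \(\eta\). The rule actually analyzed in the paper (Eq. (15) in Methods) is not reproduced in this section and may contain further terms, such as a sliding threshold, so the following is a sketch under that assumption only:

\[\Delta {w}_{ij}=\eta \,{post}_{i}\,{pre}_{j}=-\eta \frac{\partial E}{\partial {w}_{ij}},\qquad E=-\sum _{i,j}{w}_{ij}\,{post}_{i}\,{pre}_{j}\]

Treating the pre- and postsynaptic activities as fixed during the update, each local step is a gradient step on \(E\). The function \(E\) has the same bilinear form as an associative-memory energy, so patterns that repeatedly drive \({pre}_{j}\) and \({post}_{i}\) together lower the energy assigned to them.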
We discuss the model performance on few-shot learning from the perspective of metric learning. By clamping label signals into the local module, a correlation-based local module of the high-level input features and training labels can be established, and a constraint can be placed on the distribution of classes in the metric space. We prove that the local module can project an input pattern into a cosine-based embedding space and further produce a simple inductive bias by measuring the similarity between the query sample and the centers of each previously presented sample (see Methods). By doing so, the network is forced to learn embedding space representations in which the distance between samples within a class is sufficiently small while the distance between samples from different classes is sufficiently large.
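The metric-learning view can be made concrete with a small sketch. It assumes that clamping a one-hot label at the output turns a Hebbian update into an accumulation of (normalized) support features per class, so that the readout for a query reduces to a cosine similarity with each class centre. The function names, the normalization choices, and the toy feature vectors are hypothetical; the paper's derivation is in its Methods.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def hebbian_memory(features, labels, n_classes):
    """Accumulate W[c] += normalized feature for every support sample of class c.

    This equals the outer product of a one-hot (clamped) label and the feature.
    """
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    for f, c in zip(features, labels):
        for j, x in enumerate(normalize(f)):
            W[c][j] += x
    return W


def classify(W, query):
    """Score each class by cosine similarity between the query and its centre."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(normalize(row), q)) for row in W]
    return scores.index(max(scores)), scores


# Three-way one-shot toy example with hand-written feature vectors.
support = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
labels = [0, 1, 2]
W = hebbian_memory(support, labels, n_classes=3)
pred, scores = classify(W, [0.1, 0.9, 0.3])
print(pred, scores)
```

In this picture the GP-trained layers are responsible for producing features in which same-class samples point in similar directions, while the local module supplies the class centres from only a few labeled examples.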
Through the above analysis, we demonstrate that LP and GP learning complement each other to form the synergic learning model. An interesting finding is that we can adapt the hybrid model to different tasks with only minor modifications to the local modules. This provides another way to design loss functions. Considering that the brain prominently uses local learning to perform tasks, transferring a part of the design of loss functions to local modules is instructive and can bring at least two benefits: (1) it can reduce the number of manually designed hyperparameters in the overall loss functions, for example by converting the original regularization weighting coefficients into learning rates of the local module; (2) the LP-based operation endows a network with online and high-parallelism properties, which especially facilitates implementation on many-core neuromorphic hardware with a fine-grained parallel architecture (see Discussion).
Hybrid computation on the Tianjic by algorithm-hardware co-design
By implementing our model on the Tianjic, we next demonstrate that the proposed hybrid model is well suited to implementation on dedicated neuromorphic hardware. We first instantiated our model on the Tianjic neuromorphic chips (see Methods) and evaluated the efficiency of on-chip inference. As indicated in Fig. 5d, the proposed HP model can flexibly support the rate-based and temporal-based spiking coding schemes and meet different on-chip inference requirements for accuracy and inference latency. As the network size increases, owing to the spike-based paradigm and local-memory structure, the energy consumption scales very slowly. More importantly, by virtue of the neuromorphic many-core architecture, the model can reduce the overall power consumption and achieve speeds that are orders of magnitude faster than a general-purpose computer (Supplementary Table 1).
Because they update weights in different ways, LP and GP learning complement each other in their use of computational resources. This property can be leveraged by the massive parallelism of many-core hardware for on-chip learning. Since no solution reported so far can simultaneously support LP and GP approaches on many-core chips, we exploited algorithm-hardware co-design and designed a hybrid on-chip learning scheme by developing a new cycle-accurate hardware simulator and an online-learning mapping scheme (see Methods), which allowed us to evaluate the computational resources of on-chip hybrid learning. We evaluated the hardware efficiency using continual multitask learning. The evaluation comprised four main steps: mapping scheme design, software tool configuration, simulation of the on-chip running process, and data arrangement (see Methods). Figure 5e, f exhibits the simulation results of the routes and throughput of implementing the LP, GP, and HP models on Tianjic chips. The proposed hybrid approach allows the flexible configuration of GP and LP learning on different connections. Since only a small number of weight connections are used to receive task-specific supervision signals, the workloads of inter-core communications on many-core architectures can be significantly alleviated (Fig. 5e), and the local learning can be further deployed on in-core resources by utilizing the decentralized many-core architecture^{13,54}. With its highly parallel and near-memory computing design, the neuromorphic architecture can efficiently realize local learning and facilitate on-chip learning with high throughput (Fig. 5f).
In addition, as indicated in Fig. 5e, f, the forward and backward dataflow of the GP circuit cannot make full use of the pipelined processing mechanism of many-core chips. Thus, efficiently optimizing the implementation of the GP circuit on neuromorphic chips is an important direction for further improving the overall efficiency. Alternatively, some emerging neuromorphic hardware, such as Loihi^{13}, embeds x86 cores in a single chip, which may provide a potential candidate for implementing GP on chip in the future. Thus, algorithm-hardware co-design is also a feasible direction for further developing a hybrid computing paradigm. For example, the main body of the many-core structure can be used to perform local learning, while the embedded microprocessors perform GP learning, which may improve hybrid on-chip efficiency and promote applications of the hybrid model.
Discussion
In this work, we reported a spike-based hybrid model that endows SNNs with an efficient synergic learning capability for handling multiple learning scenarios. Guided by the hippocampal plasticity mechanism, we developed a brain-inspired meta-learning paradigm to integrate LP and GP learning and further explored multiple synergic learning strategies that can quickly adapt the synergic learning model to different learning scenarios. Our results indicate that, with small modifications of local modules, the hybrid synergic learning model can achieve significantly higher performance than single-learning models on sequential classification tasks and in three different learning scenarios. Finally, we implemented the model on the Tianjic neuromorphic platform by exploiting algorithm-hardware co-design and demonstrated the advantages of the proposed hybrid model on neuromorphic chips.
Understanding how the interconnected neurons in the brain combine top-down modulation information and bottom-up local information to learn to solve tasks is an active research area in both neuroscience and machine learning. The related surrogate gradient methods^{8,42} use a continuous relaxation of the gradients and provide a differentiable spiking network to update weights in a fully local computation. The e-prop^{7} algorithm combines top-down supervision signals and local eligibility traces to approximate the backpropagation of signals through time. Unlike these previous studies, we use a meta-learning method to design the synergic learning circuit. We model the hyperparameters of local learning as a special type of meta-parameters that can be modified by top-down supervision signals and indirectly influence the behavior of synaptic plasticity. Furthermore, by deriving the implicit loss function from the LP, we prove that the local modules can act as a regularization over the network topology and temporal dynamics, indicating that the roles of LP and GP circuits are different from the perspective of optimization. We use associative memory and metric learning as illustrations and demonstrate that, by constructing different correlation-based modules, the hybrid synergic learning model can be linked with several existing powerful learning algorithms, thereby providing support for the effectiveness analysis.
Some meta-learning-based global-local learning methods have been developed for non-spiking models. In the early 1990s, ref. ^{29} proposed a basic framework that can optimize a parameterized local learning rule by global supervision signals. Several recent works^{31,55} use trainable connection weights of local modules and establish large-scale non-SNNs. We adopt the basic idea of learnable LP as in refs. ^{29,31,55}. One main difference from these previous works is that we optimize the underlying weights and LP by bilevel optimization and develop a configurable hybrid strategy that can exploit synergic learning in different ways, with greater flexibility in solving many other learning scenarios. More importantly, we incorporate various spiking dynamics for establishing hybrid neuromorphic models and exploit algorithm-hardware co-design to develop feasible hardware implementations for online hybrid learning. Through the combination with neuromorphic computing, many unreported advantages of global-local learning are revealed, such as the fault tolerance on neuromorphic sensors and the complementary computational efficiency on neuromorphic chips.
A salient feature of this work is the formulation of spike-based neuromorphic models. In principle, an SNN is a special type of neural network that memorizes historical temporal information via intrinsic neuronal dynamics and encodes information into spike trains, thereby enabling event-driven, energy-efficient computation. SNNs are suitable for scenarios with rich spatiotemporal event-driven information and sparse dataflow and have powered many applications in neuromorphic sensors and neuromorphic chips^{28,56}. By deriving from neuronal dynamics and incorporating various dynamic behaviors of spiking neurons, our model retains many prominent biological attributes and provides a general method to meta-learn spike-based LP (Supplementary Note 4). By introducing complementary learning features, the proposed synergic learning shows promise in improving information processing on NVSs. In addition, since neuromorphic chips can leverage the asynchronous spike-based communication and highly parallel computation of local learning, the proposed hybrid model provides an opportunity to promote efficient hardware implementation and facilitate the exploration of the hybrid computing paradigm on neuromorphic hardware platforms. Ref. ^{57} reports an online functional-level simulation scheme of meta-learning models on Loihi chips. However, because functional-level simulation simplifies the model execution environment and loses many underlying fine-grained execution details related to the hardware environment, it is difficult to accurately evaluate the actual consumption and practical advantages of hybrid learning on chips. In contrast, we implement the hybrid model on the Tianjic by developing a new cycle-accurate hardware simulator and mapping scheme. This can promote algorithm-hardware co-design and the exploration of hybrid computing paradigms on neuromorphic architectures.
In summary, the neuromorphic synergic learning model developed in this work exhibits a superior learning ability for solving multiple different learning tasks and the excellent energy efficiency of the hybrid computation paradigm on neuromorphic chips, which may open an avenue for the collaborative development of neuromorphic algorithms and neuromorphic computing chips.
Methods
Model establishment
The hybrid plastic approach is based on two sets of differential equations. The first set describes the membrane potential dynamics as follows:
where \(w_{ij}\) denotes the weight of the synapse connecting pre-neuron j and post-neuron i, \(u_{i}\) denotes the membrane potential of neuron i, \(\tau_{u}\) denotes the membrane time constant, \(s_{j}(t)\) denotes the afferent spike trains, \(t_{j}^{f}\) denotes the firing time, and \(l_{n}\) denotes the number of neurons in the \(l\)th layer.
The second set is based on a type of diffusion dynamics of ion channels^{58}, modeled by
where \(\tau_{w}\) denotes the synaptic time constant. The first term \(w_{ij}^{g}-w_{ij}(t)\) on the right-hand side of Eq. (4) denotes the recovery of \(w_{ij}(t)\) to a ground state \(w_{ij}^{g}\), which we set to zero in the experiments. The second term \(P(\ast)\) represents general LP controlled by the presynaptic spike activity \(pre_{j}(t)\triangleq\{s_{j}(t)\}\), the postsynaptic spike activity \(post_{i}(t)\triangleq\{u_{i}(t),s_{i}(t)\}\), and a group of layer-sharing controllable factors \(\boldsymbol{\theta}\) that includes the local learning rate, sliding threshold, and other hyperparameters determined by the specific local learning rules (see below).
As \(P(\ast)\) is generic for modeling local learning rules, here we take a specific expression, a variant of the Hebbian rule, which is formulated by
where \(k^{corr}\) is a weight hyperparameter, \(\rho(x)\) is a bounded nonlinear function, and \(\beta_{i}\le 0\) is an optional sliding threshold to control weight change directions and prevent weight explosions. The rule can therefore update the weight according to concurrent presynaptic firing and postsynaptic membrane activity. Integrating Eq. (4), we get
where \(w_{ij}(t_{n0})\) denotes the instantaneous phasic state of the synaptic weight at the phase time \(t_{n0}\). Because the HP approach uses a potential trajectory rather than an equilibrium state for computation, the dependence on the initial parameter \(w_{ij}(t_{n0})\) is nontrivial. On this basis, we assume that the HP approach can perform supervised learning by modifying the phasic weight values \(w_{ij}(t_{n0})\) at a certain time \(t_{n0}\), in the form of an instantaneous top-down modulated signal. Consequently, substituting the synaptic Eq. (6) into the membrane potential Eq. (2), we get
To enable the continuous dynamics compatible with backpropagation, we use a modified Euler method to get an explicit iterative version of Eq. (7)
where we use \(t_{m}\) to refer to the simulation timestep. Rearranging the formula and substituting the specific expression of \(P(\ast)\), we get the final set of signal propagation equations as follows
where \(k_{u}\triangleq\frac{dt}{\tau_{u}}\), the upper index \(l\) denotes the \(l\)th layer, and \(H(x)\) is the firing function determined by the Heaviside function. Specifically, if \(u_{i}^{l}(t_{m})\) exceeds the threshold \(v_{th}\), \(H(x)=1\); otherwise \(H(x)=0\). To handle the non-differentiable points of the spike firing function \(H(x)\), we use the surrogate gradient methods proposed by refs. ^{8,42} and adopt a rectangle function^{59} to approximate the derivative of the spiking function. In addition, we multiply the membrane attenuation term \((1-k_{u})u_{i}^{l}(t_{m})\) by the gating signal \((1-s_{i}^{l})\) to realize the fire-and-reset behavior of spiking dynamics.
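As an illustration of the surrogate gradient idea, the following minimal NumPy sketch pairs a Heaviside forward pass with a rectangular pseudo-derivative. The function names and the window width `width` are assumptions made for this example and are not taken from the original implementation.

```python
import numpy as np

def heaviside_spike(u, v_th):
    """Forward pass: emit a spike (1.0) where the membrane potential exceeds v_th."""
    return (u > v_th).astype(np.float64)

def rect_surrogate_grad(u, v_th, width=1.0):
    """Backward pass: rectangular window centred on v_th with unit area.
    `width` is an assumed hyperparameter of this sketch."""
    return (np.abs(u - v_th) < width / 2.0).astype(np.float64) / width

u = np.array([0.0, 0.4, 0.6, 1.5])
spikes = heaviside_spike(u, v_th=0.5)
grads = rect_surrogate_grad(u, v_th=0.5, width=1.0)
```

Only neurons whose potential lies close to the threshold receive a non-zero gradient, which is what allows error signals to pass through the otherwise flat firing function.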
Finally, for clarity, we replace the summation of local activity with an iterative variable \(P_{ij}^{l}(t_{m})\), relax \(k_{ij}^{l,corr}\) into two elastic regular factors, \(k_{ij}^{l,corr}=\alpha_{i}^{l}\eta_{j}^{l}\), and transform Eq. (8) into an iterative version,
where \(dt\) denotes the length of the timestep, \(\alpha^{l}\) controls the impact of the local modules, and \(\eta_{j}^{l}\) controls the local learning rate. This formulates the meta-modules \(\boldsymbol{\theta}^{l}\) as a group of layer-wise parameters \(\{\alpha^{l},\eta^{l},\beta^{l}\}\).
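The iterative scheme described above can be sketched in NumPy as follows. The propagation equations themselves are not reproduced in the text, so the exact update order, the trace decay `k_w`, and the choice of `tanh` for the bounded function \(\rho\) are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def hp_layer_step(u, s_post, P, s_pre, w, alpha, eta, beta,
                  k_u=0.2, k_w=0.9, v_th=0.5):
    """One explicit step of a hybrid-plasticity layer (assumed discrete form).
    u, s_post:   (n_post,) membrane potential and previous output spikes
    P:           (n_post, n_pre) local plasticity trace
    s_pre:       (n_pre,) input spikes at this step
    w:           (n_post, n_pre) weights trained by the global (GP) circuit
    alpha, beta: (n_post,) meta-parameters; eta: (n_pre,) local learning rate
    """
    # Hebbian-like trace: presynaptic spikes times bounded postsynaptic
    # activity shifted by the sliding threshold beta (assumed rho = tanh).
    rho = np.tanh(u)
    P = k_w * P + np.outer(rho + beta, eta * s_pre)
    # Effective weight = global weight + alpha-scaled local trace.
    w_eff = w + alpha[:, None] * P
    # Leaky integration; the factor (1 - s_post) resets neurons that just fired.
    u = (1.0 - s_post) * (1.0 - k_u) * u + k_u * (w_eff @ s_pre)
    s_post = (u > v_th).astype(np.float64)
    return u, s_post, P

rng = np.random.default_rng(0)
n_pre, n_post = 4, 3
u, s, P = np.zeros(n_post), np.zeros(n_post), np.zeros((n_post, n_pre))
w = rng.normal(size=(n_post, n_pre))
alpha, eta, beta = np.full(n_post, 0.1), np.full(n_pre, 0.05), np.full(n_post, -0.1)
for _ in range(5):
    s_pre = (rng.random(n_pre) < 0.5).astype(np.float64)
    u, s, P = hp_layer_step(u, s, P, s_pre, w, alpha, eta, beta)
```

In this sketch `w` would be trained by backpropagation through time, while `alpha`, `eta`, and `beta` play the role of the layer-wise meta-parameters.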
For classification output, we use one-hot encoding and \(N\) neurons of the output layer to represent classification results. We then incorporate different spike coding schemes into a general framework and describe the classification loss function \(C\) by
where \(\mathbf{y}\) is the ground truth, \(n_{l}\) is the number of layers, \(T\) denotes the simulation window, \(q(x)\) is a non-decreasing bounded function depending on the specific coding scheme, and \(z_{t_{m}}\in R_{\ge 0}\) is the weight associated with timestep \(t_{m}\). This formulation adapts to rate-based coding when \(z_{t_{m}}=1/T\) and \(q(x)=H(x-v_{th})\), and to rank-based coding when \(z_{t_{m}}=1[t_{m}=T]\).
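The role of the timestep weights can be made concrete with a small sketch. It assumes the loss is the mean squared error between the one-hot label and a weighted sum over time of the output activity (already passed through \(q\)); the function names are hypothetical.

```python
import numpy as np

def coding_weights(T, scheme):
    """Per-timestep weights z: uniform 1/T for rate coding,
    an indicator on the final step for rank coding."""
    if scheme == "rate":
        return np.full(T, 1.0 / T)
    if scheme == "rank":
        z = np.zeros(T)
        z[-1] = 1.0
        return z
    raise ValueError(f"unknown scheme: {scheme}")

def classification_loss(y, out, z):
    """MSE between one-hot label y (N,) and the z-weighted sum of out (T, N)."""
    readout = z @ out
    return float(np.mean((y - readout) ** 2))

out = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # q(.) over T = 4 steps
y = np.array([1.0, 0.0])
loss_rate = classification_loss(y, out, coding_weights(4, "rate"))
loss_rank = classification_loss(y, out, coding_weights(4, "rank"))
```

With rate coding the readout is the firing rate of each output neuron over the window, whereas with the indicator weights only the state at the final step of the window contributes.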
Given the signal propagation Eq. (10) and a specific decoding scheme, we can ideally search for the optimal values of the network parameters \(\boldsymbol{\theta}\) and \(\mathbf{w}\) using the BPTT algorithm. Because the parameter \(\boldsymbol{\theta}\) is a higher-level variable that modulates weights, we treat the optimization of \(\boldsymbol{\theta}\) separately and formulate a general expression as follows:
The task \(\pi_{i}\triangleq\{C(\mathbf{y},\mathbf{x},\mathbf{w}\boldsymbol{\theta})\}\) is sampled from the task distribution set \(\varGamma\) and consists of a certain loss function \(C\) and a set of training data and validation data. Here we use \(C_{\pi_{i}}^{train}\) and \(C_{\pi_{i}}^{val}\) to distinguish the loss functions on training data and validation data in few-shot learning and multitask learning. In Eq. (12), obtaining precise solutions of \(\mathbf{w}^{\ast}\) is usually computationally prohibitive. To ensure convergence and obtain feasible solutions, we formulate the above problem as a bilevel optimization, which is commonly used for optimizing two associated upper-level and lower-level variables^{41,47,60}. To this end, we follow the work^{47,60} and approximate \(\mathbf{w}^{\ast}\) by a one-step gradient update in one training batch. After updating the weights \(\mathbf{w}\), we alternately update \(\boldsymbol{\theta}\) across a validation task batch using the updated weights to learn fast adaptation of LP. In this manner, the optimization is divided into two parts that iteratively optimize the parameters \(\mathbf{w}\) and \(\boldsymbol{\theta}\). More concretely, in the \(k\)th iteration, we sample a training task batch and update the weights \(\mathbf{w}_{k}\) by BPTT. Then we sample a validation task batch and, using the updated weights \(\mathbf{w}_{k}\), optimize \(\boldsymbol{\theta}_{k}\) over the task batch by the following Eq. (13) until the training converges
where \({\varGamma }_{s}\) is a task batch set sampled from the task distribution set \(\varGamma\), \(\xi \ge 0\) is the learning rate of approximation weight updates that can be used for accelerating convergence^{60}.
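The alternating scheme can be illustrated on a toy problem. The sketch below is not the authors' training loop: it uses a scalar model whose effective weight is `w + theta * p`, replaces BPTT with finite-difference gradients, and the names `bilevel_step`, `xi`, and `lr_theta` are assumptions. It only shows the structure of a one-step lower-level update followed by an upper-level update that differentiates the validation loss through that step.

```python
import numpy as np

def loss(w, theta, x, y, p=1.0):
    """Squared error of a toy model with effective weight w + theta * p."""
    return float(np.mean(((w + theta * p) * x - y) ** 2))

def num_grad(f, v, eps=1e-5):
    """Central finite-difference derivative of a scalar function f at v."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)

def bilevel_step(w, theta, train, val, xi=0.05, lr_theta=0.05):
    # Lower level: one-step gradient update of w on the training batch.
    w_new = w - xi * num_grad(lambda v: loss(v, theta, *train), w)

    # Upper level: validation loss evaluated at the one-step look-ahead
    # weights, differentiated with respect to theta.
    def val_after_update(th):
        w_look = w - xi * num_grad(lambda v: loss(v, th, *train), w)
        return loss(w_look, th, *val)

    theta_new = theta - lr_theta * num_grad(val_after_update, theta)
    return w_new, theta_new

train = (np.array([1.0, 2.0]), np.array([2.0, 4.0]))
val = (np.array([3.0]), np.array([6.0]))
w, theta = 0.0, 0.0
initial_val = loss(w, theta, *val)
for _ in range(30):
    w, theta = bilevel_step(w, theta, train, val)
final_val = loss(w, theta, *val)
```

In the spiking model, `w` corresponds to the synaptic weights updated on the training batch and `theta` to the meta-parameters of the local modules updated on the validation batch.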
Support rank ordering coding
By deriving from a type of ion-channel dynamics model, our model maintains the synaptic decay dynamics, \(k(t)=e^{\frac{t_{n}-t}{\tau_{w}}}\), during information transmission. Because the presynaptic spike signals must pass through the temporal filter \(k(t)\) before reaching the postsynaptic neurons, the arrival time affects the information transmission. On this basis, we find a potential relationship between the HP model and the classic rank order coding assumption (see Supplementary Note 3) and develop an evidence-accumulation temporal decoding scheme. Formally, as soon as the first spike is triggered by the winning neuron in the output layer, the HP SNN stops signal inference and produces results based on the index of the winning neuron. Then, the scaled membrane potential of the output neurons is used as the output representation to calculate the loss. In this manner, in addition to supporting a conventional rate-based decoding scheme, we can utilize the synaptic dynamics and threshold mechanism of spiking neurons to implement an event-driven inference mode.
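The early-stopping decision rule can be sketched as follows. The function name is hypothetical, and breaking ties by the lowest neuron index is an assumption of this example; the text does not specify how simultaneous first spikes are resolved.

```python
import numpy as np

def first_spike_decode(out_spikes):
    """Return (winner index, number of steps run) for output spikes of shape (T, N).
    Inference stops at the first step in which any output neuron fires.
    If no neuron fires within the window, (None, T) is returned."""
    T = out_spikes.shape[0]
    for t in range(T):
        fired = np.flatnonzero(out_spikes[t])
        if fired.size > 0:
            return int(fired[0]), t + 1
    return None, T

spikes = np.zeros((5, 3))
spikes[2, 1] = 1.0  # neuron 1 fires first, at the third step
spikes[3, 0] = 1.0  # a later spike that is never processed
winner, steps_run = first_spike_decode(spikes)
```

Because the loop exits at the first output spike, the number of simulated steps, and hence the latency, depends on how quickly evidence accumulates for the winning class.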
Implicit loss function
We consider the impact of local modules as a form of implicit loss and analyze the effectiveness from the perspective of optimization. Because the learning process of the HP model is affected by both the supervision signals and the internal dynamics, its overall loss function \(E\) likely incorporates not only an explicit classification loss \(C\) but also an implicit loss function \(E_{in}\) generated by the inherent dynamics of the network. According to the different learning circuits, we first make the decomposability assumption on the general overall loss as below:
where \(\mathbf{x}\) and \(\mathbf{y}\) are external input data, \(n_{l}\) denotes the total number of layers, and \(\lambda_{1,2,3}\in R_{\ge 0}\) denote the influence factors of each part. We follow the notation of Eq. (1) and, in a slight abuse of notation, explicitly express the composition \(\lambda_{1}\mathbf{w}_{GP,t},\lambda_{2}\mathbf{w}_{LP,t}\) in \(C\) to highlight the difference between the hybrid model and single-learning-based models. The expression of \(E\) can be regarded as an extension of a single-learning-based network. In the case of \(\lambda_{2}=\lambda_{3}=0\), \(E\) degenerates to the conventional classification loss for the GP-based network, and in the case of \(\lambda_{1}=0\), the network reduces to the LP-based network. We use the implicit loss function to analyze the effectiveness of HP models on fault-tolerance learning and few-shot learning.
Effectiveness analyses on fault-tolerance learning
If we treat the local weight increment as an implicit derivative for a part of the overall loss function E, it inspires us to integrate its weight increment to obtain the implicit loss function \({E}_{in}\). For simplicity, we mainly focus on the impact of local weight rules on the optimization of the current layer weight w and illustrate the effectiveness of local modules for the specific task.
In fault-tolerance learning, we accordingly treat the local weight increment \(\varDelta w_{ij,t}^{l}\) as the implicit derivative of \(E_{in}^{l}\) by
We use the Hebbian rule and set \(\rho(u_{i,t}^{l})=H(u_{i,t}^{l}-v_{th})\) in the derivation. Integrating the above equation, we can get a loss expression \(E_{in}\) as follows:
Since the derivative of Heaviside function is zero for \(u\;\ne\; {v}_{th}\), the following equation holds
We note that the form of the loss function in Eq. (17) is similar to the energy function used in HAM^{27,53,61}, in which Hebbian-based operations help networks encode the previous associative patterns into a local minimum of the energy surface. This inspires us to explain the model effectiveness from the optimization of the energy function. Specifically, in the HP models, GP-based learning ensures that the network can selectively activate some of the neurons and realizes a correct response to input patterns. Thus, the associative patterns of concurrent neuron firing (i.e., \(\mathbf{s}_{t}^{l-1^{T}}\mathbf{s}_{t}^{l}\)) are more likely to represent an effective response to input patterns. On the other hand, as shown in Eq. (17), the Hebbian-based operation can decrease the energy surface at every update. Therefore, by optimizing the energy surface, the local module places an approximate regularization on the network structure with the punishment of \(-\sum_{l=2}^{n_{l}}\mathbf{s}_{t}^{l-1^{T}}\mathbf{w}_{t}^{l}\mathbf{s}_{t}^{l}\). It encourages the network to strengthen the weights that trigger concurrent neuron firing, resulting in a stronger stimulus for similar or repeated patterns. Collectively, by combining the LP and GP methods, the HP model can relax the hierarchical representation of the networks to local minimum states that are more likely to encode an effective response to the previous associative patterns, thereby exploiting the correlation embedded in the training patterns for the recognition of incomplete patterns.
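The claim that a Hebbian increment lowers an energy of this kind can be checked numerically. The sketch assumes a single-layer Hopfield-style energy \(-\mathbf{s}_{pre}^{T}\mathbf{w}\,\mathbf{s}_{post}\) and a plain outer-product update with step size 0.1; both are simplifications introduced for illustration.

```python
import numpy as np

def hebbian_energy(s_pre, w, s_post):
    """Assumed Hopfield-style energy -s_pre^T w s_post; w has shape (n_pre, n_post)."""
    return float(-(s_pre @ w @ s_post))

s_pre = np.array([1.0, 0.0, 1.0])   # presynaptic spikes
s_post = np.array([0.0, 1.0])       # postsynaptic spikes
w = np.zeros((3, 2))

e_before = hebbian_energy(s_pre, w, s_post)
w = w + 0.1 * np.outer(s_pre, s_post)   # strengthen co-active connections
e_after = hebbian_energy(s_pre, w, s_post)
```

The energy of the stored pattern drops after the update because only the connections between co-active neurons are strengthened.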
Please note that, unlike HAM models that use one or more bidirectional iterations for pattern reconstruction^{27,53}, the HP model leverages the memory matrix for the classification of disturbed patterns. This can be further illustrated by analyzing the correlation-based augmented information of local modules. Given a general input and output dataset \(D=\{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i}^{N}\), where \(\mathbf{y}_{i}\in R^{m\times 1}\) refers to the response of the current layer to the input \(\mathbf{x}_{i}\), let us assume that a query sample \(\tilde{\mathbf{x}}\in R^{n\times 1}\) is received and belongs to the \(D_{k}\) category. Then the local module produces augmented information \(\mathbf{I}_{LP}\) by
When the network receives a disturbed sample \(\tilde{\mathbf{x}}\), \(w_{LP}\) provides the augmented information of the previously stored pattern \(\mathbf{y}_{i}\) in the form of a weighting coefficient \((\mathbf{x}_{k}^{T}\tilde{\mathbf{x}})\). Since the inner product with samples from the same class provides stronger augmented information, this also indicates that the local module can exploit the correlation between the input sample and the previously seen associative patterns, thereby facilitating the classification of disturbed patterns.
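A small numerical example may help to see how a Hebbian memory matrix weights stored responses by input similarity. It assumes the memory is the plain sum of outer products \(\sum_{i}\mathbf{y}_{i}\mathbf{x}_{i}^{T}\); the toy patterns and variable names are illustrative only.

```python
import numpy as np

X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])   # stored inputs x_i (one per row)
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])             # stored responses y_i (one per row)

W_lp = Y.T @ X                          # sum_i y_i x_i^T, shape (m, n)
x_query = np.array([0.8, 0.1, 0.0, 0.3])  # disturbed version of the first input

I_lp = W_lp @ x_query                   # augmented information
# Equivalent explicit form: each stored response weighted by x_i^T x_query
I_explicit = sum(Y[i] * (X[i] @ x_query) for i in range(len(X)))
```

The response associated with the most similar stored input dominates the result, which is the mechanism the text invokes for classifying incomplete or noisy patterns.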
Effectiveness analyses on few-shot learning
Next, we show that the correlation-based local module can place a constraint on the distribution of classes in the cosine-based metric space to accelerate convergence. Assume that we have a set of training samples \(D=\{(\mathbf{x}_{k},\mathbf{y}_{k})\}_{k=1}^{N_{D}}\), where \(\mathbf{x}_{k}\in R^{m\times 1}\) is an m-dimensional feature vector, \(\mathbf{y}_{k}\in R^{n\times 1}\) is the one-hot label, and \(N_{D}\) denotes the number of samples in dataset \(D\). Here \(\mathbf{x}_{k}\) refers to general features coming from raw data or from the \((n_{l}-1)\)th layer. By introducing the training labels to the output neurons, the local module constructs a Hebbian-like matrix as follows:
where \(K\) denotes the number of classes in \(D\) and \(D_{k}\) denotes the subset of examples within the same class. Here we set the learning rate \(\eta_{i}=\frac{1}{N_{D_{k}}\Vert\mathbf{x}_{i}\Vert_{2}}\) and let \(\mathbf{c}_{k}=\frac{1}{N_{D_{k}}}\sum_{i=1}^{N_{D_{k}}}\frac{\mathbf{x}_{i}}{\Vert\mathbf{x}_{i}\Vert_{2}}\) denote the sample mean of the class \(D_{k}\). To keep the proof clear, we also simplify the modeling of other meta-parameters. Based on Eq. (18), when a query sample \(\tilde{\mathbf{x}}\) is entered, the local module produces an inductive bias \(\mathbf{I}_{LP}(\boldsymbol{x})\) with intensity \(\sum_{k=1}^{K}\mathbf{y}_{k}(\mathbf{c}_{k}^{T}\tilde{\mathbf{x}})\). Then the membrane potential of the output neurons is governed by
where \(y_{k,i}\) is the \(i\)th element of the one-hot label \(\mathbf{y}_{k}\), which satisfies \(y_{k,i}=1_{\{k\}}(i)\). In the experiment, we initialize \(\lambda_{1}\) with a small positive value to strengthen the impact of \(\mathbf{I}_{LP}\) in the early training phase. In this manner, we can give an intuitive interpretation of the inductive bias \(\mathbf{I}_{LP}\) from the Euclidean distance \(L\) between the label \(\mathbf{y}\) and \(\mathbf{u}^{l_{n}}\). Assuming that the input pattern belongs to class \(i\), \(L\) can be calculated by
where we set \(T=1\) and omit the index t for neatness. Note that \(\lambda_{1}\) is a predefined small amount; thus, \(L_{\phi}\) has two main components, \((1-\mathbf{c}_{i}^{T}\tilde{\mathbf{x}})^{2}\) and \((1-\lambda_{1})^{2}\mathbf{c}_{q}^{T}\tilde{\mathbf{x}}\). To minimize the distance \(L\), the network is forced to learn a feature mapping that makes the distance between samples within a class sufficiently small (by the punishment of \((1-\mathbf{c}_{i}^{T}\tilde{\mathbf{x}})^{2}\)) and the distance between samples from different classes sufficiently large (by the punishment of \(\mathbf{c}_{q}^{T}\tilde{\mathbf{x}}\)). In this manner, the inductive bias \(\mathbf{I}_{LP}\) enables the HP model to learn from the feature similarity between the query data and the feature centers of the previous training data in the metric space.
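The class-centre construction can be sketched in a few lines of NumPy. The helper names and the toy two-class data are assumptions made for this example; the sketch covers only the local inductive bias, not the full membrane potential with its \(\lambda\) weighting.

```python
import numpy as np

def class_centres(X, labels, K):
    """Mean of the L2-normalised feature vectors of each class, shape (K, m)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.stack([Xn[labels == k].mean(axis=0) for k in range(K)])

def lp_bias(centres, x_query):
    """Inductive bias: entry k equals c_k^T x_query, because the one-hot
    label y_k places that similarity on output neuron k."""
    return centres @ x_query

X = np.array([[2.0, 0.0], [1.0, 0.1], [0.0, 3.0], [0.2, 1.0]])
labels = np.array([0, 0, 1, 1])
centres = class_centres(X, labels, K=2)

x_query = np.array([0.9, 0.1])
x_query = x_query / np.linalg.norm(x_query)
bias = lp_bias(centres, x_query)
predicted = int(np.argmax(bias))
```

Each output neuron thus receives a drive proportional to the similarity between the query and the centre of its class, which is comparable to prototype-based metric learning.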
Details of baseline performance evaluation
In the MNIST and Fashion-MNIST experiments, we used Bernoulli sampling to encode the pixels into spike trains. In the sequential MNIST and CIFAR-10 experiments, we took the first spiking layer as the encoding layer^{59} to convert the pixel information into spike signals. Here, sequential MNIST is a variant of the MNIST dataset that inputs the original image row by row. In the CIFAR10-DVS and DVS-Gesture experiments, we accumulated spike trains (over 8 and 10 ms, respectively) for acceleration and directly input them into the SNNs. We applied the batch normalization (BN) technique to the convolutional layers on the DVS-Gesture dataset by following the work^{62}. We optimized the HP model on all datasets using the mean square error (MSE) loss and the adaptive moment estimation (ADAM) optimizer. A four-layer MLP with [28-256FC-256FC-10] was trained on sequential MNIST, a six-layer CNN with [input-128C3-AP2-256C3-AP2-256C3-AP2-512FC-10] was trained on the MNIST and Fashion-MNIST datasets, a nine-layer CNN with [input-64C3S2-BN-128C3S1-BN-256C3S1-BN-256C3S1-BN-256C3S1-BN-256C3S1-AP2-800FC-512FC-11FC] was trained on the DVS-Gesture dataset, and a nine-layer CNN with the CIFARNet structure^{59} was trained on CIFAR-10 and CIFAR10-DVS. We used the local learning rules in Eq. (5) and the meta-parameters in Eq. (10) in all experiments unless otherwise stated. In the comparison with fine-tuning models, we used the same learning rules and fixed these meta-parameters after random initialization during training. To reduce computation, we equipped the hidden fully connected layers with the local module in all classification tasks. Other parameter configurations can be found in Table 3.
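Bernoulli encoding of static images can be sketched as below. The sketch assumes pixel intensities are scaled to [0, 1] and used directly as per-step firing probabilities; the function name and the window length are illustrative.

```python
import numpy as np

def bernoulli_encode(image, T, rng):
    """Encode pixel intensities in [0, 1] as T binary frames. Each pixel fires
    independently at every step with probability equal to its intensity."""
    return (rng.random((T,) + image.shape) < image).astype(np.uint8)

rng = np.random.default_rng(0)
image = np.array([[0.0, 1.0],
                  [0.5, 0.25]])
spikes = bernoulli_encode(image, T=1000, rng=rng)   # shape (1000, 2, 2)
rates = spikes.mean(axis=0)                         # empirical firing rates
```

Averaged over the window, the firing rate of each input neuron approximates the original pixel intensity.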
Details of fault-tolerance learning
In the cropping experiment, we gradually increased the cropping area at the center of each image or each NVS frame, denoted by \((2ci)^{2}\), where ci denotes the cropping intensity with a range of \(0\sim 14\). In the noise experiment, since salt-and-pepper noise maintains the binary spike representation, we used it to evaluate the robustness to noise in the N-MNIST and MNIST experiments. We also gradually increased the proportion of the noise region on each image or each NVS frame, denoted by the noise level (nl), where the nl value refers to the \(nl\times 2{e}^{-2}\) region with a range of \(0\sim 14\). All models were pretrained on the standard training dataset and tested on the cropped (noisy) data with the same parameter configuration, network structure [input-512FC-10FC], and the MSE loss. For the distance comparison, we took the membrane potential of the first hidden layer at the last timestep as the representation to calculate the distance between the incomplete data and the original data. We randomly sampled 1000 testing samples from MNIST and plotted the average distance in Fig. 4e, f.
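The two perturbations can be sketched as follows. The sketch assumes that cropping zeroes a centred square of side 2·ci and reads the noise level as a fraction nl × 0.02 of pixels set to 0 or 1 with equal probability; this reading of the noise proportion, the function names, and the `step` parameter are assumptions.

```python
import numpy as np

def crop_centre(image, ci):
    """Zero a centred (2*ci) x (2*ci) square; ci = 0 leaves the image unchanged."""
    out = image.copy()
    h, w = out.shape
    r0, c0 = h // 2 - ci, w // 2 - ci
    out[r0:r0 + 2 * ci, c0:c0 + 2 * ci] = 0
    return out

def salt_and_pepper(image, nl, rng, step=0.02):
    """Set a fraction nl * step of the pixels to 0 or 1 with equal probability."""
    out = image.copy()
    n_noisy = int(round(nl * step * out.size))
    idx = rng.choice(out.size, size=n_noisy, replace=False)
    out.flat[idx] = rng.integers(0, 2, size=n_noisy)
    return out

rng = np.random.default_rng(0)
cropped = crop_centre(np.ones((28, 28)), ci=3)
fully_cropped = crop_centre(np.ones((28, 28)), ci=14)
grey = np.full((28, 28), 0.5)
noisy = salt_and_pepper(grey, nl=5, rng=rng)
```

For a 28 × 28 image, ci = 14 removes the whole image, which matches the upper end of the stated range.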
Details of few-shot learning
Omniglot is a standard few-shot learning dataset that contains 1623 categories, each with 20 samples. In one training episode, we first randomly selected N classes and sampled S sample pairs from each class (called N-way S-shot). Then, we fed the S labeled samples (named by the presentation time) into the classifier. After that, we randomly sampled a new but unlabeled instance from the same N classes and queried the classifier for its label. We used four convolutional layers with 3 × 3 kernels and a stride of two, followed by a fully connected layer and an N-way classifier layer. We followed the work^{31} to configure the network parameters and divide the training and testing sets. During the training phase, we sampled three task episodes from the task distribution \(\varGamma\), used the one-step updated weights \(\mathbf{w}\) to approximate \(\mathbf{w}^{\ast}\) using training task samples, and alternately updated the meta-parameters θ by resampling from the same tasks. To reduce computation, we equipped the fully connected layers with the local module. The training label was fed into the last classifier layer by the one-hot coding scheme to guide the correct classifications. We adopted the encoding strategy used in ref. ^{59} to produce spike trains. We trained the network for 3,000,000 episodes and reported the best results over the last 1000 episodes.
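Episode construction for N-way S-shot training can be sketched as follows. The data layout (a dictionary from class id to an array of samples), the function name, and the use of episode-local labels are assumptions of this example rather than details taken from the original code.

```python
import numpy as np

def sample_episode(data, n_way, s_shot, rng):
    """data: dict mapping class id -> array of samples with shape (n_samples, ...).
    Returns S labelled support samples per class and one held-out query
    drawn from one of the same N classes; labels are local to the episode."""
    classes = rng.choice(sorted(data), size=n_way, replace=False)
    support_x, support_y, held_out = [], [], []
    for local_label, c in enumerate(classes):
        samples = data[int(c)]
        # Draw S + 1 distinct samples so the query never repeats a support sample.
        picks = rng.choice(len(samples), size=s_shot + 1, replace=False)
        support_x.extend(samples[picks[:s_shot]])
        support_y.extend([local_label] * s_shot)
        held_out.append(samples[picks[-1]])
    q_label = int(rng.integers(n_way))
    return np.array(support_x), np.array(support_y), held_out[q_label], q_label

rng = np.random.default_rng(0)
# Toy stand-in for Omniglot: 10 classes, 20 samples each, value = class id.
toy = {k: np.full((20, 2), float(k)) for k in range(10)}
sx, sy, qx, ql = sample_episode(toy, n_way=5, s_shot=1, rng=rng)
```

The support set is presented with labels, and the classifier is then queried with the held-out sample whose episode-local label is `ql`.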
Details of continual learning
The shuffled MNIST experiments include multiple image classification tasks. All tasks are to classify handwritten digits from zero to nine. Each task is a variant of the MNIST dataset with a different permutation. For each new task, the image pixels were randomly permuted, with the same permutation applied across all digits within a task and different permutations used for different tasks. We trained each task for ten epochs. We adopted a four-layer spiking network with a [784-1024-1024-10] structure and minimized the MSE loss with ADAM. During the training process, we fixed the meta-parameters of the local module and randomly generated a sparse and fixed connection matrix to receive supervision signals. We used BPTT to update these sparse weights and local learning to update the other connections. After each task was learned, we fixed the weights and updated the hyperparameters of the local modules for one epoch. Other comparison methods were adopted from the corresponding publications^{25,51,52} and applied to spiking models.
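The task construction and the sparse supervised connections can be sketched as follows. The mask density of 0.1, the small layer shape, and the function names are hypothetical; the text does not state the sparsity level used in the experiments.

```python
import numpy as np

def make_permutations(n_tasks, n_pixels, rng):
    """One fixed pixel permutation per task, shared by all images of that task."""
    return [rng.permutation(n_pixels) for _ in range(n_tasks)]

def sparse_gp_mask(shape, density, rng):
    """Fixed random binary mask: 1 marks connections updated by the global
    (BPTT) circuit; the remaining connections are left to local learning."""
    return (rng.random(shape) < density).astype(np.float64)

def masked_update(w, grad, mask, lr):
    """Apply the global gradient only on the masked connections."""
    return w - lr * mask * grad

rng = np.random.default_rng(0)
perms = make_permutations(n_tasks=3, n_pixels=784, rng=rng)
images = rng.random((2, 784))            # two flattened toy images
task1_images = images[:, perms[1]]       # same permutation for every image of task 1

mask = sparse_gp_mask((8, 16), density=0.1, rng=rng)
w = np.zeros((8, 16))
w_new = masked_update(w, grad=np.ones((8, 16)), mask=mask, lr=0.5)
```

Connections outside the mask are untouched by the supervised update, so they remain available for local plasticity across tasks.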
Details of hardware implementation
The Tianjic chip is a cross-paradigm neuromorphic computing platform that supports a broad spectrum of neural coding schemes, computational models, and network structures. It is fully digital and fabricated using 28-nm high-performance low-power technology. Each Tianjic chip contains 156 functional cores (FCores), which are arranged in a 2D mesh. Each FCore contains a group of neurons, a group of axons, and the synaptic connections between them. Among the FCores, spikes can be transmitted to one or more cores in the mesh through the routing network in the form of routing packets. At the same time, through the inter-chip communication interface, multiple chips can extend the internal routing network into a larger computing platform.
We mapped the model onto multiple FCores. Different FCores are configured to perform different basic operations and transformations. Taking the MNIST dataset as an example, we deployed 70 FCores to implement a fully connected structure [784-1024-1024-10]. We tested the energy consumption with different coding schemes. Because rank coding shortens the average decision time, it can effectively reduce the on-chip inference latency and the average compute ratio, thereby reducing the average dynamic power consumption. For rate coding and rank coding, respectively, the average on-chip inference latencies are 0.27 and 0.18 μs, the compute ratios are 0.63 and 0.45, and the dynamic power consumptions are 0.48 and 0.38 W. We reported the on-chip performance on the MNIST, F-MNIST, and N-MNIST datasets and compared it with GPU-based results in Supplementary Table 1. With the massive parallelism and near-memory computing architecture, the execution time on the Tianjic can be much shorter than that on a general-purpose computer. The energy consumption scales only slightly as the network size increases, owing to the spike-driven paradigm and local-memory many-core architecture (Fig. 5d).
A cycle-accurate simulator captures hardware properties at runtime well and is commonly used for chip evaluation. Based on Tianjic's hybrid structure, we designed an on-chip hybrid learning scheme and a feasible cycle-accurate hardware simulation scheme to evaluate on-chip computational resources. Here we used an extended version of the Tianjic chip with reconfigurability and functionalities to support continuous execution of multiple operations (see Supplementary Note 1). On this basis, we developed a mapping scheme to decompose the overall dataflow into executable fine-grained basic operations and further transformed the mapping design into an executable configuration (see Supplementary Note 2). We simulated three on-chip learning modes (LP, GP, and HP) using the software toolchain. A detailed simulation scheme is provided in Supplementary Fig. 1. With this simulation scheme, we estimated the throughput and routing cost of different learning modes using an MLP structure [784-512-10] and the time window \(T=3\). For the routing cost in Fig. 5e, we accumulated the data volume whenever a data transmission occurred. For the throughput in Fig. 5f, we recorded the time spent in each phase when executing computational tasks on all allocated FCores. We then summed these times to obtain the total clock consumption and thereby the throughput.
Data availability
All data used in this paper are publicly available and can be accessed at http://yann.lecun.com/exdb/mnist/ for the MNIST dataset, https://www.cs.toronto.edu/~kriz/cifar.html for the CIFAR dataset, https://www.garrickorchard.com/datasets/nmnist for the N-MNIST dataset, and https://github.com/brendenlake/omniglot/ for the Omniglot dataset.
Code availability
The implementation of the hybrid learning model is publicly available at https://github.com/yjwu17/Spikinghybridplasticitymodel. Code related to the detailed hardware implementation is available from the authors upon reasonable request.
References
Mostafa, H. Supervised learning based on temporal coding in spiking neural networks. IEEE Trans. Neural Netw. Learn Syst. 29, 3227–3235 (2018).
Diehl, P. U., Neil, D., Binas, J., Cook, M. & Liu, S. C. In IEEE International Joint Conference on Neural Networks (IJCNN) (2015).
Zhang, W. & Li, P. In Advances in Neural Information Processing Systems. 7800–7811 (MIT Press, 2019).
Rathi, N. & Roy, K. Dietsnn: direct input encoding with leakage and threshold optimization in deep spiking neural networks. Preprint at arXiv:2008.03658 (2020).
Amir, A. et al. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7243–7252 (IEEE, 2017).
Zheng, H., Wu, Y., Deng, L., Hu, Y. & Li, G. Going deeper with directly-trained larger spiking neural networks. Preprint at arXiv:2011.05280 (2020).
Bellec, G. et al. A solution to the learning dilemma for recurrent networks of spiking neurons. Nat. Commun. 11, 3625 (2020).
Zenke, F. & Ganguli, S. Superspike: supervised learning in multilayer spiking neural networks. Neural Comput. 30, 1514–1541 (2018).
Diehl, P. U. & Cook, M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Front. Comput. Neurosci. 9, 99 (2015).
Johansen, J. P. et al. In Proceedings of the National Academy of Sciences Vol. 111 E5584–E5592 (National Academy of Sciences in partnership with Oxford University Press, 2014).
Yger, P., Stimberg, M. & Brette, R. Fast learning with weak synaptic plasticity. J. Neurosci. 35, 13351–13362 (2015).
Magee, J. C. & Grienberger, C. Synaptic plasticity forms and functions. Annu. Rev. Neurosci. 43, 95–117 (2020).
Davies, M. et al. Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 82–99 (2018).
Merolla, P. A. et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345, 668–673 (2014).
Roy, K., Jaiswal, A. & Panda, P. Towards spike-based machine intelligence with neuromorphic computing. Nature 575, 607–617 (2019).
Imam, N. & Cleland, T. A. Rapid online learning and robust recall in a neuromorphic olfactory circuit. Nat. Mach. Intell. 2, 181–191 (2020).
Gerstner, W., Lehmann, M., Liakoni, V., Corneil, D. & Brea, J. Eligibility traces and plasticity on behavioral time scales: experimental support of neo-Hebbian three-factor learning rules. Front. Neural Circuits 12, 53 (2018).
Bailey, C. H., Giustetto, M., Huang, Y.-Y., Hawkins, R. D. & Kandel, E. R. Is heterosynaptic modulation essential for stabilizing Hebbian plasticity and memory? Nat. Rev. Neurosci. 1, 11–20 (2000).
Kaiser, J., Mostafa, H. & Neftci, E. Synaptic plasticity dynamics for deep continuous local learning (DECOLLE). Front. Neurosci. 14, 424 (2020).
Legenstein, R., Pecevski, D. & Maass, W. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS Comput. Biol. 4, e1000180 (2008).
Brea, J., Senn, W. & Pfister, J.-P. Matching recall and storage in sequence learning with spiking neural networks. J. Neurosci. 33, 9565–9575 (2013).
Jimenez Rezende, D. & Gerstner, W. Stochastic variational learning in recurrent spiking networks. Front. Comput. Neurosci. 8, 38 (2014).
Jang, H., Simeone, O., Gardner, B. & Gruning, A. An introduction to probabilistic spiking neural networks: probabilistic models, learning rules, and applications. IEEE Signal Process. Mag. 36, 64–77 (2019).
Lake, B. M., Salakhutdinov, R. & Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science 350, 1332–1338 (2015).
Masse, N. Y., Grant, G. D. & Freedman, D. J. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proc. Natl Acad. Sci. USA 115, E10467–E10475 (2018).
Stewart, K., Orchard, G., Shrestha, S. B. & Neftci, E. In 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS). 223–227 (IEEE).
Acevedo-Mosqueda, M. E., Yanez-Marquez, C. & Acevedo-Mosqueda, M. A. Bidirectional associative memories: different approaches. ACM Comput. Surv. (CSUR) 45, 1–30 (2013).
Bellec, G., Salaj, D., Subramoney, A., Legenstein, R. & Maass, W. Long short-term memory and learning-to-learn in networks of spiking neurons. Advances in Neural Information Processing Systems: NeurIPS 787–797 (2018).
Bengio, S., Bengio, Y., Cloutier, J. & Gecsei, J. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks. (University of Texas, Dallas, 1992).
Munkhdalai, T. & Trischler, A. Metalearning with Hebbian fast weights. Preprint at arXiv:1807.05076 (2018).
Miconi, T., Clune, J. & Stanley, K. O. In International Conference on Machine Learning (ML Research Press, 2018).
Metz, L., Maheswaranathan, N., Cheung, B. & Sohl-Dickstein, J. Meta-learning update rules for unsupervised representation learning. International Conference on Learning Representations (2018).
Gupta, A., Eysenbach, B., Finn, C. & Levine, S. Unsupervised meta-learning for reinforcement learning. Preprint at arXiv:1806.04640 (2018).
Lisman, J., Grace, A. A. & Duzel, E. A neo-Hebbian framework for episodic memory; role of dopamine-dependent late LTP. Trends Neurosci. 34, 536–547 (2011).
Doya, K. Metalearning and neuromodulation. Neural Netw. 15, 495–506 (2002).
Sebastiao, A. M. & Ribeiro, J. A. Neuromodulation and metamodulation by adenosine: impact and subtleties upon synaptic plasticity regulation. Brain Res. 1621, 102–113 (2015).
Dias, R. B., Rombo, D. M., Ribeiro, J. A. & Sebastião, A. M. Ischemia-induced synaptic plasticity drives sustained expression of calcium-permeable AMPA receptors in the hippocampus. Neuropharmacology 65, 114–122 (2013).
Wilmes, K. A. & Clopath, C. Inhibitory microcircuits for top-down plasticity of sensory representations. Nat. Commun. 10, 5055 (2019).
Buzsáki, G. & Mizuseki, K. The log-dynamic brain: how skewed distributions affect network operations. Nat. Rev. Neurosci. 15, 264–278 (2014).
Anandalingam, G. & Friesz, T. L. Hierarchical optimization: an introduction. Ann. Oper. Res. 34, 1–11 (1992).
Bengio, Y. Gradient-based optimization of hyperparameters. Neural Comput. 12, 1889–1900 (2000).
Neftci, E. O., Mostafa, H. & Zenke, F. Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Process. Mag. 36, 51–63 (2019).
Varghese, B., Wang, N., Barbhuiya, S., Kilpatrick, P. & Nikolopoulos, D. S. In 2016 IEEE International Conference on Smart Cloud (SmartCloud). 20–26 (IEEE).
Thorpe, S. & Gautrais, J. In Computational Neuroscience 113–118 (Springer, 1998).
Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z. & Ionescu, C. In Advances in Neural Information Processing Systems. 4331–4339. (MIT Press).
Gallego, G., Delbruck, T., Orchard, G. M., Bartolozzi, C. & Scaramuzza, D. Event-based vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 1–1. https://doi.org/10.1109/TPAMI.2020.3008413 (2020).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. Proc. 34th Int. Conf. Mach. Learn. 70, 1126–1135 (2017).
Koch, G., Zemel, R. & Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2 (2015).
Hassabis, D., Kumaran, D., Summerfield, C. & Botvinick, M. Neuroscience-inspired artificial intelligence. Neuron 95, 245–258 (2017).
Yang, G., Pan, F. & Gan, W.-B. Stably maintained dendritic spines are associated with lifelong memories. Nature 462, 920–924 (2009).
Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning. Vol. 70. 3987–3995 (JMLR.org, 2017).
Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl Acad. Sci. USA 114, 3521–3526 (2017).
Kosko, B. Bidirectional associative memories. IEEE Trans. Syst. Man. Cybern. 18, 49–60 (1988).
Pei, J. et al. Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature 572, 106–111 (2019).
Miconi, T., Rawal, A., Clune, J. & Stanley, K. O. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. International Conference on Learning Representations (2018).
Pfeiffer, M. & Pfeil, T. Deep learning with spiking neurons: opportunities and challenges. Front. Neurosci. 12, 774 (2018).
Stewart, K., Orchard, G., Shrestha, S. B. & Neftci, E. Online few-shot gesture learning on a neuromorphic processor. IEEE J. Emerg. Sel. Top. Circuits Syst. 10, 512–521 (2020).
Gerstner, W., Kistler, W. M., Naud, R. & Paninski, L. Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. (Cambridge University Press, 2014).
Wu, Y. et al. In Proceedings of the AAAI Conference on Artificial Intelligence. 1311–1318. (MIT Press).
Liu, H., Simonyan, K. & Yang, Y. Darts: differentiable architecture search. International Conference on Learning Representations (2018).
Kosko, B. Bidirectional associative memories: unsupervised Hebbian learning to bidirectional backpropagation. IEEE Trans. Syst., Man, Cybern.: Syst. 51, 103–115 (2021).
Fang, W. et al. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2661–2671 (2021).
Sengupta, A., Ye, Y., Wang, R., Liu, C. & Roy, K. Going deeper in spiking neural networks: VGG and residual architectures. Front. Neurosci. 13, 95 (2019).
Scherr, F., Stöckl, C. & Maass, W. One-shot learning with spiking neural networks. Preprint at bioRxiv (2020).
Acknowledgements
This work was partly supported by the National Natural Science Foundation of China (No. 62088102, No. 61836004), the National Key R&D Program of China (2018YFE0200200), the Science and Technology Innovation 2030-Key Project of “New Generation Artificial Intelligence” under Grant 2020AAA0109100, the Brain-Science Special Program of Beijing under Grant Z191100007519009, the CETC Haikang Group-Brain Inspired Computing Joint Research Center, and Beijing Mlink Technology Inc. We thank the IDG/McGovern Institute for Brain Research at Tsinghua University for technical support.
Author information
Contributions
Y.W., R.Z., J.Z., F.C., and L.S. conceived the work. Y.W. and M.X. carried out the experiments. M.X., L.D., G.W., S.M., and Y.W. contributed to the hardware implementation. Y.W., R.Z., M.X., and F.C. contributed to the analysis and interpretation of the results. All authors contributed to the discussion of the model design and experiments, and R.Z. led the discussion. Y.W., R.Z., M.X., and L.S. contributed to the writing of this paper. L.S. proposed the approach of hardware-algorithm integration and supervised the whole project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wu, Y., Zhao, R., Zhu, J. et al. Brain-inspired global-local learning incorporated with neuromorphic computing. Nat Commun 13, 65 (2022). https://doi.org/10.1038/s41467021276532
DOI: https://doi.org/10.1038/s41467021276532