Brain-inspired global-local learning incorporated with neuromorphic computing

There are two principal approaches to learning in artificial intelligence: error-driven global learning and neuroscience-oriented local learning. Integrating them into one network may provide complementary learning capabilities for versatile learning scenarios. Meanwhile, neuromorphic computing holds great promise, but it still lacks the rich set of useful algorithms and algorithm-hardware co-designs needed to fully exploit its advantages. Here, we present a neuromorphic global-local synergic learning model by introducing a brain-inspired meta-learning paradigm and a differentiable spiking model incorporating neuronal dynamics and synaptic plasticity. It can meta-learn local plasticity and receive top-down supervision information for multiscale learning. We demonstrate the advantages of this model on several different tasks, including few-shot learning, continual learning, and fault-tolerance learning in neuromorphic vision sensors. It achieves significantly higher performance than single-learning methods. We further implement the model on the Tianjic neuromorphic platform by exploiting algorithm-hardware co-designs and show that the model can fully utilize the neuromorphic many-core architecture to develop a hybrid computation paradigm.


Supplementary Note 1: Mapping implementation of hybrid on-chip learning
We designed a comprehensive on-chip hybrid learning scheme with a configurable software tool chain from algorithm to hardware to emulate the forward and backward paths in three learning modes. Specifically, we developed a dual spatial-temporal unfolding mapping scheme (DST-UM), a mapping compiler for network partition and resource placement, and a cycle-accurate simulator satisfying chip-specific constraints.
In this manner, we evaluated the model performance under different workloads, including single-paradigm and hybrid-paradigm modes. The mapping implementation scheme of hybrid on-chip learning is presented in Supplementary Figure 1.
As shown in Supplementary Figure 1, we disassembled the dataflow of the entire hybrid learning into basic operations which can be supported by the hardware platform.
Here, the calculation between input spikes and weights is split into multiple spatial basic operations owing to the resource limitations of function cores (FCores), including fan-in, fan-out, and memory space. In addition, the proposed DST-UM scheme enables the execution dataflow to be allocated and optimized in a single spatial dimension or across both the spatial and temporal dimensions.
Given the above mapping implementation, we used the software tool chain reported in ref. 1, which includes a mapping compiler used for network partition and resource placement, and a C++-based cycle-accurate simulator used for hardware simulation that considers all chip-specific constraints. We used the compiler to perform placement and generate configuration files automatically, so that the simulator can take the configuration, simulate the running process, and generate the corresponding analysis results and evaluations. We configured each FCore according to Supplementary Figure 1 and then used the software tool chain to simulate the entire process of hybrid learning. Notably, we also simulated the event-driven attributes and the sparse processing of intermediate variables in the three modes. We utilized the parallel processing mechanism of the many-core architecture to carry out all inference processes: the LP circuit was executed in a pipelined manner, whereas the backward processes of GP and HP were executed serially. Finally, based on this scheme, we estimated the computational resources of online learning.
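The spatial splitting step described above can be sketched as follows. This is a minimal illustration, assuming hypothetical per-core fan-in/fan-out limits (the real Tianjic FCore constraints are not given here): a layer whose weight matrix exceeds a single core's capacity is tiled into basic operations, one per FCore.

```python
# Hypothetical sketch of spatial splitting: tile an n_in x n_out weight
# matrix into per-FCore blocks. The capacity constants are illustrative
# placeholders, not Tianjic's actual limits.

MAX_FAN_IN = 256    # assumed per-core input limit
MAX_FAN_OUT = 256   # assumed per-core output limit

def split_layer(n_in, n_out):
    """Return a list of (row_range, col_range) tiles covering an
    n_in x n_out weight matrix, one tile per FCore."""
    tiles = []
    for r in range(0, n_in, MAX_FAN_IN):
        for c in range(0, n_out, MAX_FAN_OUT):
            tiles.append(((r, min(r + MAX_FAN_IN, n_in)),
                          (c, min(c + MAX_FAN_OUT, n_out))))
    return tiles

# A 784 x 512 layer then occupies ceil(784/256) * ceil(512/256) = 4 * 2 tiles.
print(len(split_layer(784, 512)))  # -> 8
```

Each tile corresponds to one spatial basic operation; the compiler would then place these tiles onto FCores and insert the inter-core routing that merges partial sums.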

Supplementary Note 2: Details of hardware evaluation methods
We chose the three models (LP, GP, and HP) with the same MLP structure [784-512-10] as an example, and set the time window to $T = 3$. We took the following steps during implementation. (1) Designing the mapping scheme. We used the Tianjic mapping scheme to disassemble the dataflow of each model into basic operations and configured them according to the task requirements and mapping schemes to realize the calculation of the three models, as shown in Supplementary Figure 1. Please note that in the throughput cost evaluation, we use $op$ to refer to any of the three operator sets.
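The evaluated workload can be illustrated with a toy forward pass of the 784-512-10 spiking MLP unrolled over the time window $T = 3$. This is a simplified sketch: the leak, threshold, and reset rule below are generic LIF placeholders, not the exact neuron model used on the chip, and the weights and inputs are random stand-ins.

```python
# Toy forward pass of a 784-512-10 spiking MLP over T = 3 timesteps.
# All hyperparameters below (decay, v_th) are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (784, 512))
W2 = rng.normal(0.0, 0.1, (512, 10))
T, decay, v_th = 3, 0.5, 1.0            # assumed LIF hyperparameters

def lif_step(v, current):
    v = decay * v + current             # leaky integration
    spikes = (v >= v_th).astype(float)  # threshold crossing
    return v * (1.0 - spikes), spikes   # reset-to-zero on spike

v1, v2 = np.zeros(512), np.zeros(10)
out = np.zeros(10)
x = (rng.random(784) < 0.2).astype(float)   # sparse binary input spikes
for _ in range(T):
    v1, s1 = lif_step(v1, x @ W1)
    v2, s2 = lif_step(v2, s1 @ W2)
    out += s2                                # spike-count readout over T
print(out.shape)  # -> (10,)
```

On hardware, each of the two weight layers is further tiled across FCores as described in Supplementary Note 1, and the $T$ timesteps give the temporal dimension of the DST-UM unfolding.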
Regarding the route cost evaluation. The route cost (RC) is given by
$$\mathrm{RC} = \sum_{\{i \,:\, \mathrm{Phase}_i \cap \mathcal{C} \neq \varnothing\}} \frac{1}{M_i} \sum_{j=1}^{N_i} D_{i,j},$$
where $D_{i,j}$ is the communication data volume carried by the $j$-th route packet, measured through the simulation process; $N_i$ denotes the total number of route packets in the $i$-th phase; and $M_i$ denotes the number of FCores allocated to perform one task in parallel in the $i$-th phase. The condition $\mathrm{Phase}_i \cap \mathcal{C} \neq \varnothing$ means that we only record the phases that involve inter-core communication.
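Evaluating the route-cost formula above amounts to a simple aggregation over the simulator's packet logs. The sketch below assumes a hypothetical log shape (phase name mapped to a list of packet volumes, plus the per-phase FCore parallelism $M_i$); only phases with inter-core communication are assumed to appear in the log.

```python
# Illustrative route-cost aggregation over assumed simulator logs:
#   packet_log:       {phase: [packet_volume, ...]}   (D_ij per packet)
#   fcores_per_phase: {phase: M_i}                    (parallel FCores)

def route_cost(packet_log, fcores_per_phase):
    rc = 0.0
    for phase, packets in packet_log.items():
        m = fcores_per_phase[phase]   # M_i: FCores working in parallel
        rc += sum(packets) / m        # sum_j D_ij, normalized by M_i
    return rc

log = {"fwd": [128, 64, 64], "bwd": [256, 256]}   # hypothetical volumes
m = {"fwd": 4, "bwd": 2}
print(route_cost(log, m))  # -> 256/4 + 512/2 = 320.0
```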
Regarding the throughput cost evaluation. We recorded the time spent in each phase when executing computational tasks on all allocated FCores. The throughput cost (TC) is given by
$$\mathrm{TC} = \frac{1}{f} \sum_{i} \sum_{op} \max_{k}\, c_{i,op,k},$$
where $f$ denotes the clock frequency of the simulator, $op$ ranges over the three operator types, and $c_{i,op,k}$ denotes the clock consumption of the $k$-th allocated core when executing the $op$-type operator in the $i$-th phase. Because the allocated cores run in parallel, we record the maximum clock consumption among the cores for each operator type in each phase.
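The throughput-cost formula above can likewise be sketched as a fold over cycle-count logs. The log shape and the operator names ("mac", "lut") below are assumed for illustration; the key point is that parallel cores contribute their maximum (slowest-core) cycle count per phase and operator type.

```python
# Illustrative throughput-cost aggregation over assumed cycle logs:
#   clock_log: {phase: {op_type: [cycles_on_core_k, ...]}}

def throughput_cost(clock_log, freq_hz):
    cycles = 0
    for ops in clock_log.values():
        for per_core in ops.values():
            cycles += max(per_core)   # parallel cores: latency = slowest core
    return cycles / freq_hz           # convert total cycles to seconds

log = {"fwd": {"mac": [100, 120, 90], "lut": [30, 25, 40]},
       "bwd": {"mac": [200, 180, 210]}}   # hypothetical cycle counts
tc = throughput_cost(log, 300e6)          # assumed 300 MHz clock
print(tc)  # (120 + 40 + 210) / 3e8 seconds
```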
We accumulated the data volume whenever a data transmission occurred, and summed the time consumption to count the total clock cycles. By combining the recorded data with the evaluation formulas, we computed the route cost and throughput cost and obtained the results in Fig. 5e and Fig. 5f.
Regarding the energy consumption. We pre-trained HP models with different network sizes and allocated FCores accordingly. On this basis, we ran the inference process on the Tianjic chip using the mapping tools and measured the power consumption and running time. Finally, we multiplied the two to obtain the energy consumption shown in Fig. 5d.
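The energy accounting above reduces to energy = power × running time per network size. A minimal sketch, with placeholder figures rather than actual Tianjic measurements:

```python
# Energy = measured power x measured runtime, per network size.
# All numbers below are placeholders, not chip measurements.

measurements = {            # network size -> (power in W, runtime in s)
    "784-512-10":  (0.40, 0.002),
    "784-1024-10": (0.55, 0.003),
}

energy = {size: p * t for size, (p, t) in measurements.items()}
for size, e in energy.items():
    print(f"{size}: {e * 1e3:.2f} mJ")
```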