A framework for the general design and computation of hybrid neural networks

There is a growing trend to design hybrid neural networks (HNNs) by combining spiking neural networks and artificial neural networks to leverage the strengths of both. Here, we propose a framework for general design and computation of HNNs by introducing hybrid units (HUs) as a linkage interface. The framework not only integrates key features of these computing paradigms but also decouples them to improve flexibility and efficiency. HUs are designable and learnable to promote transmission and modulation of hybrid information flows in HNNs. Through three cases, we demonstrate that the framework can facilitate hybrid model design. The hybrid sensing network implements multi-pathway sensing, achieving high tracking accuracy and energy efficiency. The hybrid modulation network implements hierarchical information abstraction, enabling meta-continual learning of multiple tasks. The hybrid reasoning network performs multimodal reasoning in an interpretable, robust and parallel manner. This study advances cross-paradigm modeling for a broad range of intelligent tasks.

Here we would like to summarize three important observations:
1. The proposed HMN's performance exceeds that of the SNN models in all controlled experiments across different settings.
2. When the number of tasks is small (e.g., 32), the smaller the network capacity, the larger the performance gap.
3. When the number of tasks is large (e.g., 128), the larger the network capacity, the larger the performance gap. Even with sufficient network capacity, the performance of the single SNN saturates and remains lower than that of the HMN by a considerable margin.

To investigate the effectiveness of the HMN quantitatively, we analyzed the reuse ratio

Supplementary Note 3: The Universal Approximation of Hybrid Units
The approximation capability of ANNs has been studied on the basis of the universal approximation theorems developed in the 1990s 9,10,11. Hornik et al. 9 proved that there exists a three-layered feedforward network with bounded and non-constant semi-linear activation functions (e.g., the sigmoid function) that can approximate any continuous function arbitrarily well. Blum et al. 10 and Kolmogorov et al. 11 also provided constructive proofs of the approximation property. At the same time, it has frequently been conjectured that generic computations by neural circuits are not digital and are carried out not on static inputs but on functions of time 12. Thus, if we accept the assumption that the firing times of neurons in a biological neural system encode relevant information, there are still many possible spike coding schemes that encode information differently from rate-based traditional networks, such as first-to-spike coding and temporal interval coding. Along this line, previous work has studied the approximation capabilities of SNNs under specific coding paradigms or specific computational frameworks 13,14. However, a theoretical analysis of the universal approximation capability across such versatile spike coding schemes is still lacking.
Furthermore, hybrid information conversion is a map from a set of spiking functions $s(t)$ to a set of real numbers. In principle, SNNs can implicitly encode information in the spiking function $s(t)$ in an arbitrary manner. However, to the best of our knowledge, it is unknown whether, for an arbitrary spike coding scheme, there exists a general model that can approximate any given decoding filter to a given precision.
The problem can be divided into two cases: (1) if we know the specific coding method (i.e., the filtering method), we can construct HUs by design; (2) if the encoding method of the information $I$ is unknown, can the HNN framework capture the effective information contained in it?
We would like to point out that learnable HUs can be used for general approximation. Next, we argue that suitably modified learnable HUs can approximate arbitrary given filters to any desired degree of precision. Our proof relies on the following two assumptions: (1) Positive minimum spiking interval. There exists a minimum interval $\Delta^*$ between any two spikes in a spike train $\{t_i\}$. (2) Compact temporal domain. All spike firing times fall within a bounded interval $[0, T]$; that is, we mainly consider the case in which spike firing happens in a compact temporal domain.
The main goal is to show that, for any filter (i.e., spiking decoding function), suitably modified HUs can approximate the filter to any desired degree of precision. To this end, we first formalize the universal approximation theory of HUs as follows.

Theorem 1. Given an error gap $\epsilon > 0$, for any function $\mathcal{K}: s(t) \to \mathbb{R}$, there exists a composite function

$\tilde{\mathcal{K}} = F * H,$

where $F$ denotes a composable affine map, consisting of a three-layer feedforward network with bounded and non-constant semi-linear activation functions, used for domain transformation; $H$ denotes a composable affine map used for the information conversion between the time domain and the amplitude domain; and $*$ denotes componentwise composition, such that the approximation bound

$\| \tilde{\mathcal{K}}(s(t)) - \mathcal{K}(s(t)) \| \le \epsilon$

holds for any arbitrarily small $\epsilon$.
To accomplish this proof, we first show that the information content of spike trains can be approximated by a finite set of spline basis functions and a finite-dimensional representation matrix. Then we use the existing approximation theories of ANNs or SNNs to support the theoretical effectiveness of HUs.
In the first step, we prove that for any given filter (i.e., spiking decoding function) acting on spike trains with a positive minimum spiking interval, there exists a piecewise linear approximation, as formalized below.

Lemma 1 (The universal approximation capability of HUs for any decoding filter). Given an error gap $\epsilon_1 = \epsilon/2 > 0$, for any filtering function $\mathcal{K}(s(t))$, where the spike trains $s(t): [0, T] \to \{0, 1\}$ are defined on a compact temporal domain, there exists a suitably modified function $\tilde{\mathcal{K}}_1 = F_1 * H$ such that

$\| \tilde{\mathcal{K}}_1(s(t)) - \mathcal{K}(s(t)) \| \le \epsilon_1.$

Proof.
We begin by setting the spiking function to the unit step function (i.e., the piecewise-constant response described by the Type A spiking neuron defined by Maass 14),

$s(t) = \mathbb{1}_{[t_i, t_i + \Delta t]}(t),$

where $\mathbb{1}_{[a, b]}(t)$ denotes the indicator function, which equals 1 for $t \in [a, b]$ and 0 otherwise. In particular, as $\Delta t \to 0$, the Type A spiking functions approach the commonly used delta functions $\delta(t - t_i)$.
We denote the minimum firing interval by $\Delta^*$ and set the first spike firing time as the reference time (i.e., $t_0 = 0$). We partition the compact temporal domain into pieces $[\tau_j, \tau_{j+1}]$ and, on each piece, define the pairwise basis function $b_j(t) = \mathbb{1}_{[\tau_j, \tau_{j+1}]}(t)$. Then any spike train $s_i(t) = \sum_k s(t - t_k)$ with spiking firing function $s$ can be approximated by $\tilde{s}_i(t) = \sum_j a_{ij} b_j(t)$ with binary coefficients $a_{ij} \in \{0, 1\}$. Analytically, the approximation error is caused by the difference between an arbitrary spike firing time $t_k$ and the time $\tau_j$ of the most relevant basis function. Note that when the spike train satisfies the minimum-firing-interval assumption, each neuron fires at most $N_0 = \lceil T / \Delta^* \rceil$ times in the given time domain. On this basis, it can be seen from the construction of the basis functions that when the number of basis functions $N \ge N_0 + 1$ is large enough, the approximation error satisfies $\| \tilde{s}_i - s_i \| \le \epsilon_1$. The set of spike trains $S$ can then be approximated by a binary matrix $A \in \{0, 1\}^{M \times N}$ multiplied by a set of basis functions $B = [b_1, b_2, b_3, \ldots, b_N]$, which yields $\tilde{S} = A B$. The above results indicate that spike trains can be approximated to any given precision within a finite-dimensional Euclidean space: there exists a finite set of basis-function representations that approximates any given spiking representation with precision $\epsilon_1$. Such an approximation of any spiking decoding function can be implemented and identified as $H$ in the HUs. We can then follow the Weierstrass approximation theorem to find an approximation $\tilde{\mathcal{K}}_1$ of the decoding function $\mathcal{K}$.
Given the above results, any spiking function can, in principle, be represented in a finite-dimensional real space that approximates the signal representation to a given precision. By using the existing universal approximation theories for neural networks, it can further be proved that there exists a composable affine map $F$ that projects the information distribution from the basis representation into the target domain with approximation error $\epsilon/2$. Finally, with the help of the triangle inequality, we easily obtain

$\| (F * H)(s(t)) - \mathcal{K}(s(t)) \| \le \| \tilde{\mathcal{K}}_1(s(t)) - \mathcal{K}(s(t)) \| + \| (F * H)(s(t)) - \tilde{\mathcal{K}}_1(s(t)) \|.$

The first term, $\| \tilde{\mathcal{K}}_1(s(t)) - \mathcal{K}(s(t)) \|$, is bounded by Lemma 1 above. For the second term, we can use the existing conclusions on the approximation capability of ANNs or SNNs 9-12 to show that it is likewise bounded by $\epsilon/2$, which completes the proof.
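To make the construction in Lemma 1 concrete, the following minimal numerical sketch (in Python; the grid construction mirrors the proof, but the variable names and example spike times are ours, for illustration only) snaps a spike train onto step-shaped basis functions and checks that the timing error stays below the piece width $T/N$:

```python
import numpy as np

# Numerical sketch of the Lemma 1 construction (illustration only, not the
# paper's code). A spike train with a positive minimum firing interval
# delta_star on the compact domain [0, T] is snapped onto a grid of
# step-shaped basis functions; the binary coefficient vector `a` is the
# finite-dimensional representation identified with H in the HUs.

T = 1.0                                # length of the compact temporal domain
delta_star = 0.05                      # assumed positive minimum spiking interval
N = int(np.ceil(T / delta_star)) + 1   # number of basis functions, N >= N0 + 1

tau = np.linspace(0.0, T, N + 1)       # breakpoints of the pieces [tau_j, tau_{j+1}]

spike_times = np.array([0.07, 0.31, 0.52, 0.90])   # an arbitrary spike train

# Binary coefficients: a_j = 1 iff a spike falls into piece j.
a = np.zeros(N, dtype=np.uint8)
a[np.searchsorted(tau, spike_times, side="right") - 1] = 1

# Reconstructed firing times: the left edge of each active piece.
approx_times = tau[:-1][a == 1]

# The timing error of each spike is at most the piece width T / N.
print("max timing error:", np.max(np.abs(approx_times - spike_times)))
print("piece width T/N :", T / N)
```

Here the vector `a` plays the role of one row of the matrix $A$ above, and enlarging $N$ shrinks the piece width and hence the approximation error, exactly as in the bound of Lemma 1.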

Network Models
Bio-plausibility of the HSN. The P pathway and the M pathway are two well-known pathways for visual information processing 15, with features similar to those of ANNs and SNNs, respectively. Specifically, the P pathway has a slow conduction velocity and responds strongly to changes in colour but only weakly to changes in contrast 16. Given the constraints of the visual system, the transmission of the high-precision information needed for object recognition therefore suffers from low speed. This feature resembles the functional role of ANNs in the HSN. In contrast, the M pathway has a fast conduction velocity 17 and responds to low-contrast stimuli but is less sensitive to changes in colour 18. In this case, the high-speed conduction necessary for sensitive detection sacrifices high precision in terms of colour. This feature matches the role of SNNs in the HSN. To summarize, because of the constraint of accessible computing resources, high precision is needed for detailed object recognition, at the expense of speed, whereas high speed is essential to detect flying events, at the expense of spatial resolution and colour information. Dividing raw visual information into two pathways with distinctive response properties and conquering them separately appears to be an efficient strategy for balancing the intricate, and even contradictory, requirements of real-world scenarios.
Bio-plausibility of the HMN. The adaptivity of brain function across diverse environmental scenarios fundamentally requires neuromodulation: graded and relatively slow changes to fast synaptic transmission and ion-channel properties through diffusible signalling molecules 19,20,21. The two components of neuromodulation, namely slow diffusive modulation and fast synaptic transmission, are comparable to the ANNs and SNNs in the HMN, respectively. We explain this point as follows. Consider neuropeptides as an example: recently characterized fine-grained neuromodulators with high diversity, selectivity and specificity 22,23. Secreted neuropeptides are thought to persist for as long as minutes and to diffuse over hundreds of micrometres 20. With the recent advent of single-cell and neurotaxonomic methods, a specific prediction has been made that dense intra-cortical neuropeptide modulatory networks may play prominent roles in cortical homeostasis and plasticity 22,23. Interestingly, the ANN part of the HMN emerges as a similar fine-grained, dense modulatory network with longer-lasting activity (its operation is slower), broader influence (each ANN unit influences a group of spiking neurons) and higher neuron specificity (modulatory activity is task specific), whose larger spatiotemporal-scale activity can adjust the fast transmission properties of the SNN parts.
Bio-plausibility of the HRN. The reasoning and deduction capability of neural networks implicitly requires symbol-like processing 24.

Experimental data pipeline: First, we synthesized four spike encoders of two different types to generate spike trains from the CIFAR-10 dataset. The first type comprises classical encoders with rate coding, temporal first-to-spike coding and inter-spike-interval (ISI) coding, respectively. The second type is a learnable encoder, which learns the coding scheme directly from the data. Then, we use learnable HUs and the three designed HUs above to decode the generated spike trains. Finally, the decoded results are fed into an ANN-based CNN for classification.

Spike encoder: Suppose the output spike train is denoted by $s(t)$. The rate encoder can be formalized by $\int_0^T s(t)\,dt / T = x$, where $x$ is the value of each pixel and $T$ denotes the time window. In our demonstration, all encoders encode information with $T = 100$. For the temporal encoder, the timestep of the first spike is equal to the value of each pixel. For the ISI encoder, the interval between the first two spikes is equal to the value of each pixel. The learnable encoder is a one-layer HU that can be formalized by $\sigma(W \cdot K(x))$, where $K$ is a rectangular window, $W$ is a trainable tensor with the same dimension as $K$ in the time domain, and $\sigma$ is a step activation function.

Learnable and designable HUs: The learnable HUs can be formalized by $\sigma(W \cdot K(s(t)))$. In our demonstration, $s(t)$ denotes the encoded spike train with 100 timesteps. Therefore, we choose HUs in which $K$ has dimension 100, $W$ has dimension 10 and $\sigma$ is a ReLU function. The weights of $K$ and $W$ are trained independently by mutual information in an unsupervised training paradigm. For the CIFAR-10 image classification task, a set of encoding and decoding methods is chosen for each trial of the experiment to transmit the value of each pixel of the image separately. After transmission, a conventional image-classification training procedure is carried out on the CIFAR-10 dataset. The classifier used here is an AlexNet-style network with a 12-layer feature extractor and a 3-layer MLP classifier. Trials were repeated three times; the average results and standard deviations are reported in Supplementary Table 3.
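As a concrete illustration of this pipeline, the sketch below (our own Python mock-up, not the paper's code; pixel values are assumed to be normalized to [0, 1], and the unsupervised mutual-information training of $K$ and $W$ is omitted, so the HU is shown untrained) implements the three designed encoders and a learnable HU of the assumed form $\sigma(W \cdot K(s))$:

```python
import numpy as np

T = 100  # time window used by all encoders in the demonstration

def rate_encode(x, T=T):
    """Rate coding: the fraction of active timesteps equals the pixel value x."""
    s = np.zeros(T)
    s[np.linspace(0, T - 1, int(round(x * T))).astype(int)] = 1.0
    return s

def first_spike_encode(x, T=T):
    """Temporal first-to-spike coding: the firing time of the single spike encodes x."""
    s = np.zeros(T)
    s[int(round(x * (T - 1)))] = 1.0
    return s

def isi_encode(x, T=T):
    """ISI coding: the interval between the first two spikes encodes x."""
    s = np.zeros(T)
    s[0] = 1.0
    s[min(T - 1, 1 + int(round(x * (T - 2))))] = 1.0
    return s

class LearnableHU:
    """Learnable HU of the assumed form sigma(W @ K(s)): K filters the spike
    train in the time domain and W projects it to the output dimension."""
    def __init__(self, t_dim=100, out_dim=10, rng=np.random.default_rng(0)):
        self.K = rng.standard_normal((t_dim, t_dim)) / np.sqrt(t_dim)
        self.W = rng.standard_normal((out_dim, t_dim)) / np.sqrt(t_dim)

    def __call__(self, s):
        return np.maximum(0.0, self.W @ (self.K @ s))  # sigma = ReLU

hu = LearnableHU()
decoded = hu(rate_encode(0.6))  # 10-dimensional real vector for the downstream CNN
print(decoded.shape)            # (10,)
```

In the actual experiments, the weights of $K$ and $W$ would be trained with the mutual-information objective described above before the decoded vectors are passed to the AlexNet-style classifier.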

Graphics Processing Units
In this section, we first analyze the computational cost of ANNs and SNNs with the same network structure; then, ANNs, SNNs and HNNs are deployed on the same hardware to evaluate their execution times. For simplicity, the following discussion focuses on an MLP structure with n-digit integer weights and activations. The comparison can easily be extended to other structures, such as convolutional or floating-point cases.
For the $l$th layer of an ANN, the dimension of the weight matrix is $n_{l-1} \cdot n_l$, where $n_l$ indicates the number of neurons in the $l$th layer. On this basis, the computational (time) cost of an MLP layer is bounded by $O(n_{l-1} \cdot n_l)$ operations, and the memory (space) cost is bounded by $O(n_{l-1} \cdot n_l + n_l)$. If the weights and activations are n-digit numbers, the cost of an "addition" is $O(n)$, of a "logical AND" is $O(1)$, and of a "multiplication" is $O(n^2)$ 27. Therefore, the computational cost and the memory access cost of the $l$th ANN layer can be described by

$O\big(n_{l-1} n_l (n^2 + n)\big)$ and $O\big((n_{l-1} n_l + n_l) \cdot n\big)$.

Owing to the binary spike activation, SNNs can bypass most "multiplication" operations and require only "addition" and "logical AND" operations. In addition, the binary spikes are transmitted in an event-driven manner. Benefitting from these features, SNNs can significantly alleviate the bandwidth requirement and reduce the computational cost. Supposing that the firing rate of the $l$th layer is $f_l \in [0, 1]$, the computational cost of the "addition" and "logical AND" operations follows

$O\big(f_{l-1} n_{l-1} n_l (n + 1)\big).$

Here, we assume that both ANNs and SNNs run on a general computing core 28 with a peak computation capacity of $C_p$ (operations/s) and a memory bandwidth of $B$ (bit/s). The runtime computing capacity follows $C_r = \min(C_p, B \cdot I)$, where $I$ denotes the operational intensity (operations per bit). Note that GPUs execute a fused multiply-add as a single operation, instead of computing a "multiplication" first and then an "addition" (two different instructions on CPUs). Therefore, all the curves discussed below follow the GPU condition, which differs slightly from the general case.
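To make these bounds concrete, the following back-of-the-envelope sketch (ours; the layer sizes, firing rate and hardware constants are assumed placeholder values) evaluates the per-layer costs above together with the roofline-style runtime capacity $C_r = \min(C_p, B \cdot I)$:

```python
# Back-of-the-envelope cost model for one MLP layer (illustration only;
# the constants are assumptions, not measurements from the paper).

def ann_layer_cost(n_in, n_out, n_digits):
    # n_in * n_out multiplications at O(n^2) plus additions at O(n)
    ops = n_in * n_out * (n_digits**2 + n_digits)
    mem_bits = (n_in * n_out + n_out) * n_digits    # weights + activations
    return ops, mem_bits

def snn_layer_cost(n_in, n_out, n_digits, f):
    # Event-driven: only additions (O(n)) and logical ANDs (O(1)),
    # scaled by the presynaptic firing rate f in [0, 1].
    ops = f * n_in * n_out * (n_digits + 1)
    mem_bits = f * n_in * n_out * n_digits + n_out  # sparse weight fetches + spikes
    return ops, mem_bits

def runtime(ops, mem_bits, C_p, B):
    # Roofline-style bound: attainable throughput is min(C_p, B * I),
    # with operational intensity I = ops / mem_bits (operations per bit).
    C_r = min(C_p, B * ops / mem_bits)
    return ops / C_r  # seconds

C_p, B = 1e12, 1e11  # assumed peak ops/s and memory bandwidth (bit/s)
print("ANN layer:", runtime(*ann_layer_cost(1024, 1024, 8), C_p, B))
print("SNN layer:", runtime(*snn_layer_cost(1024, 1024, 8, f=0.1), C_p, B))
```

With these placeholder numbers, the sparsity of the SNN layer (here $f = 0.1$) yields roughly an order of magnitude lower runtime than the ANN layer, consistent with the discussion above.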
For ANNs, according to (18), we calculate the time consumption against the number of weights, as plotted in Supplementary Figure 4a. The saturation region means that the total computation throughput of ANNs is lower than the sensor's sampling rate, indicating that the hardware does not have enough computational resources to process the data as fast as the sensor samples it. The undersaturation region is the opposite case.
According to the Nyquist rate, the region above twice the curve is the aliasing region. In tracking and detection tasks, this means that even if every single recognized position is correct, the reconstructed trajectory can still be aliased.

For SNNs, owing to their event-driven information processing and local memory, computation can be executed without waiting for the entire input to be ready. If an SNN receives input from an event-based or analogue sensor, the total time consumption yields

$T_{\mathrm{SNN}} = \max\big(t_{\mathrm{ref}}, \textstyle\sum_l O^l_{\mathrm{SNN}} / C_r\big),$

where $O^l_{\mathrm{SNN}}$ denotes the operation count of the $l$th layer and $t_{\mathrm{ref}}$ indicates the refractory period of the input sensor or of the SNN neurons (generally, $t_{\mathrm{ref}} = 15\,\mu s$ for a DVS). Unlike for ANNs, $T_{\mathrm{SNN}}$ is input dependent, because different stimulus patterns produce different output firing rates $f$, as shown in Supplementary Figure 4b.
If the structures of the ANN part and the SNN part in an HNN are the same, in most cases the total time consumption will be

$T_{\mathrm{HNN}} = \sum_{l=1}^{L} \big(T^l_{\mathrm{ANN}} + T^l_{\mathrm{SNN}}\big).$
Additionally, equally dividing the computing resources is the worst but most straightforward strategy, and it is the common case on GPUs.
However, on FPGAs or neuromorphic chips, we can design a dynamic resource-allocation strategy that makes $T_{\mathrm{ANN}} \approx T_{\mathrm{SNN}}$. As shown in Supplementary Figure 4c, the total time consumption then follows $T_{\mathrm{ANN}} < T_{\mathrm{HNN}} < 2\,T_{\mathrm{ANN}}$.
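The toy calculation below (ours; the per-layer times are placeholders) contrasts the two strategies. Under an equal static split, each part runs on half of the resources and therefore roughly twice as slowly, so a layer finishes only when the slower doubled part does; under a balanced dynamic split with $T_{\mathrm{ANN}} \approx T_{\mathrm{SNN}}$, each layer finishes in about $T^l_{\mathrm{ANN}} + T^l_{\mathrm{SNN}}$, which is never worse:

```python
# Toy comparison of resource-allocation strategies for an HNN
# (placeholder per-layer times measured on the full device; ours).
t_ann = [4.0, 3.0, 5.0]   # per-layer ANN time with all resources
t_snn = [1.0, 2.0, 1.5]   # per-layer SNN time with all resources

# Equal static split (common on GPUs): each part gets half of the resources,
# so each runs ~2x slower, and both parts run in parallel per layer.
T_equal = sum(max(2 * a, 2 * s) for a, s in zip(t_ann, t_snn))

# Dynamic allocation (feasible on FPGAs or neuromorphic chips): resources are
# shifted each layer so that both parts finish together; solving
# a / p = s / (1 - p) for the ANN share p gives a per-layer time of a + s.
T_dynamic = sum(a + s for a, s in zip(t_ann, t_snn))

print(T_equal, T_dynamic)   # 24.0 16.5, and sum(t_ann) = 12 < 16.5 < 24 = 2 * 12
```

Since the SNN part is cheaper than the ANN part here, the dynamic total indeed falls between $T_{\mathrm{ANN}}$ and $2\,T_{\mathrm{ANN}}$, matching the bound above.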
In brief, HNNs inherit the event-driven and sparse features of SNNs and can achieve a higher speed than ANNs and multi-time-step SNNs on the same task and hardware. Moreover, because the involvement of ANNs improves the single-time-step representation capability, the SNNs can work without requiring a long time window.