Memory-inspired spiking hyperdimensional network for robust online learning

Recently, brain-inspired computing models have shown great potential to outperform today's deep learning solutions in terms of robustness and energy efficiency. In particular, Spiking Neural Networks (SNNs) and HyperDimensional Computing (HDC) have shown promising results in enabling efficient and robust cognitive learning. Despite this success, the two brain-inspired models have different strengths: SNNs mimic the physical properties of the human brain, whereas HDC models the brain at a more abstract and functional level. Their design philosophies are complementary, which motivates their combination. Guided by the classical psychological model of memory, we propose SpikeHD, the first framework that fundamentally combines spiking neural networks and hyperdimensional computing. SpikeHD generates a scalable and strong cognitive learning system that better mimics brain functionality. SpikeHD exploits spiking neural networks to extract low-level features while preserving the spatial and temporal correlation of raw event-based spike data. It then utilizes HDC to operate over the SNN output by mapping the signal into high-dimensional space, learning the abstract information, and classifying the data. Our extensive evaluation on a set of benchmark classification problems shows that SpikeHD provides the following benefits compared to an SNN architecture: (1) it significantly enhances learning capability by exploiting two-stage information processing, (2) it enables substantial robustness to noise and failure, and (3) it reduces the network size and the number of parameters required to learn complex information.

We can take advantage of this implicit mapping by replacing their decision function with a weighted sum of kernels:

$$f(\vec{x}) = \sum_{i} c_i\, k(\vec{x}, \vec{x}_i),$$

where (x_i, y_i) is the i-th training data sample and the c_i's are constant weights. The study in [95] showed that an inner product can efficiently approximate the Radial Basis Function (RBF) kernel, such that:

$$k(\vec{x}, \vec{y}) = e^{-\frac{\|\vec{x} - \vec{y}\|^{2}}{2\sigma^{2}}} \simeq z(\vec{x}) \cdot z(\vec{y}).$$

The Gaussian kernel function can thus be approximated by the dot product of two vectors, z(x) and z(y). Our proposed encoding method is inspired by this RBF kernel trick. Figure S1b shows the encoding procedure. Assume an input vector in the original space, F = {f_1, f_2, ..., f_n} with F ∈ R^n. The encoding module maps this vector into a high-dimensional vector H = {h_1, h_2, ..., h_D} ∈ R^D, where D ≫ n. The following equation shows an encoding method that maps the input vector into high-dimensional space:

$$\vec{H} = \cos\!\left(\sum_{k=1}^{n} f_k\, \vec{B}_k + \vec{b}\right),$$

where the cosine is applied element-wise, the B_k's are randomly chosen, hence nearly orthogonal, base hypervectors of dimension D ≃ 10k that retain the spatial or temporal location of features in an input, and b = {b_1, ..., b_D} with b_i ∼ U(0, 2π). That is, B_kj ∼ N(0, 1) and δ(B_k1, B_k2) ≃ 0, where δ denotes cosine similarity.

Figure S1: (a) Overview of hyperdimensional classification during both training and testing phases, and (b) HDC non-linear encoder.

However, this cosine activation is not a convex function, thus making it impossible to back-propagate through the HDC encoder. Although HDC learning does not rely on back-propagation, in Section 4.3 we will show that the integration of SNN and HDC requires back-propagation through the HDC encoding module. To address this issue, we exploit the hyperbolic tangent (Tanh) as the activation function instead:

$$\vec{H}_n = \tanh\!\left(\sum_{k=1}^{n} f_k\, \vec{B}_k + \vec{b}\right).$$

After this step, each element h_i of the hypervector H_n has a non-binary value. In HDC, binary (bipolar) hypervectors are often used for computational efficiency. We therefore obtain the final encoded hypervector by binarizing H_n with a sign function (H = sign(H_n)), where the sign function assigns all positive hypervector dimensions to '1' and all zero/negative dimensions to '-1'. The encoded hypervector stores the information of each original data point with D bits. In our example, the feature vector F is highly sparse event-based data; after mapping it through the above encoder, the generated high-dimensional data has a dense binary representation.
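To make the encoding step concrete, the following is a minimal NumPy sketch of the non-linear encoder described above, with the Tanh activation and sign binarization. The function names, the choice of D = 10,000, and the toy 784-dimensional input are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def make_encoder(n_features, D=10_000, seed=0):
    """Draw the random projection used by the non-linear encoder.

    Column k of B is the D-dimensional base hypervector for feature k
    (B_kj ~ N(0, 1)); b holds the random phases b_i ~ U(0, 2*pi).
    """
    rng = np.random.default_rng(seed)
    B = rng.normal(0.0, 1.0, size=(D, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return B, b

def encode(F, B, b, binary=True):
    """Map an n-dimensional feature vector F into D-dimensional space."""
    Hn = np.tanh(B @ F + b)            # Tanh activation (back-prop friendly)
    if not binary:
        return Hn                      # non-binary hypervector H_n
    return np.where(Hn > 0, 1, -1)     # sign(): positive -> +1, else -1

# Toy usage: encode one flattened 28x28 frame.
B, b = make_encoder(n_features=784)
H = encode(np.random.rand(784), B, b)
```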

Hyperdimensional Model Training
We exploit hyperdimensional learning to operate directly over encoded data. Figure S1a shows an overview of HDC classification. HDC identifies common patterns during learning and avoids saturating the class hypervectors during single-pass training. Instead of naively combining all encoded data, our approach adds each encoded data point to the class hypervectors depending on how much new information the pattern adds to them. If a data point already exists in a class hypervector, HDC adds none or only a small portion of the data to the model to prevent hypervector saturation. If the prediction matches the expected output, no update is made, which avoids overfitting. This adaptive update gives non-common patterns a higher chance and weight in representing the final model, and it eliminates the need for costly iterative training.
Let us assume H is a new training data point. HDC computes the cosine similarity of H with all class hypervectors; we denote the similarity of a data point with class hypervector C_i as δ(H, C_i). Instead of naively adding the data point to the model, HDC updates the model based on this δ similarity. If an input with label l correctly matches its class, the model updates as follows:

$$\vec{C}_l \leftarrow \vec{C}_l + \eta\,(1 - \delta_l)\,\vec{H},$$

where η is a learning rate. A large δ_l indicates that the input is a common data point that already exists in the model; the update therefore adds only a very small portion of the encoded query to the model, which eliminates model saturation (1 − δ_l ≃ 0). If the input data instead receives the incorrect label l′, the model updates as:

$$\vec{C}_l \leftarrow \vec{C}_l + \eta\,(\delta_{l'} - \delta_l)\,\vec{H}, \qquad \vec{C}_{l'} \leftarrow \vec{C}_{l'} - \eta\,(\delta_{l'} - \delta_l)\,\vec{H},$$

where δ_l′ − δ_l determines the weight with which the model is updated. A small δ_l′ − δ_l indicates that the query is only marginally mismatched, while a larger mismatch is updated with a larger factor (δ_l′ − δ_l ≫ 0).
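A minimal sketch of this adaptive single-pass update is shown below, assuming the class hypervectors are stored as rows of a matrix C (num_classes × D); the helper names and the learning-rate value are illustrative, not the authors' code.

```python
import numpy as np

def cosine_sim(H, C):
    """Cosine similarity between a query hypervector H and each row of C."""
    return (C @ H) / (np.linalg.norm(C, axis=1) * np.linalg.norm(H) + 1e-12)

def hdc_update(C, H, label, lr=0.05):
    """Adaptive single-pass update of the class hypervectors C (num_classes x D)."""
    delta = cosine_sim(H, C)
    pred = int(np.argmax(delta))
    if pred == label:
        # Common pattern: add only the novel portion (1 - delta) of the query.
        C[label] += lr * (1.0 - delta[label]) * H
    else:
        # Mismatch: strengthen the correct class, weaken the wrongly predicted one.
        weight = delta[pred] - delta[label]
        C[label] += lr * weight * H
        C[pred]  -= lr * weight * H
    return C
```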

Hyperdimensional Inference
In inference, HDC checks the similarity of each encoded test data point against the class hypervectors in two steps. The first step encodes the input (using the same encoding as in training) to produce a query hypervector H. We then compute the similarity (δ) of H with all class hypervectors. The query gets the label of the class with the highest similarity.
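Inference therefore reduces to one encoding and one similarity search. A small sketch, reusing the encode and cosine_sim helpers from the sketches above:

```python
def hdc_predict(F, B, b, C):
    """Encode a raw input F and return the label of the most similar class."""
    H = encode(F, B, b)           # same encoder as used during training
    delta = cosine_sim(H, C)      # similarity to every class hypervector
    return int(np.argmax(delta))
```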

Neuron and Synapse Model
In DECOLLE, the neuron and synapse models follow leaky, current-based integrate-and-fire (I&F) dynamics with a relative refractory mechanism. The dynamics of the membrane potential U_i of a neuron i are governed by the following differential equations [97]:

$$\tau_{mem}\,\frac{dV_i(t)}{dt} = -V_i(t) + I_i(t), \qquad \tau_{ref}\,\frac{dR_i(t)}{dt} = -R_i(t) + S_i(t), \qquad U_i(t) = V_i(t) - R_i(t) + b_i,$$

with S_i(t) the binary value representing whether neuron i spiked at time t. To facilitate the implementation, the membrane potential is thus separated into the two variables U and V. A spike is emitted when the membrane potential reaches the threshold, S_i(t) = Θ(U_i(t)), where Θ is the unit step function, Θ(x) = 0 if x < 0 and 1 otherwise. The constant b_i represents the intrinsic excitability of the neuron, and τ_mem and τ_ref are the time constants of the membrane and refractory dynamics, respectively. I_i denotes the total synaptic current of neuron i, expressed as [97]:

$$\tau_{syn}\,\frac{dI_i(t)}{dt} = -I_i(t) + \sum_j W_{ij}\, S_j(t),$$

where W_ij is the synaptic weight between pre-synaptic neuron j and post-synaptic neuron i. Because V_i and I_i are linear with respect to the weights W_ij, the dynamics of V_i can be rewritten as [97]:

$$V_i(t) = \sum_j W_{ij}\, P_{ij}(t), \qquad \tau_{mem}\,\frac{dP_{ij}(t)}{dt} = -P_{ij}(t) + Q_{ij}(t), \qquad \tau_{syn}\,\frac{dQ_{ij}(t)}{dt} = -Q_{ij}(t) + S_j(t).$$

The states P and Q describe the traces of the membrane and of the current-based synapse, respectively. For each incoming spike, the trace Q undergoes a jump of height 1 and otherwise decays exponentially with time constant τ_syn. Weighting the trace Q_ij with the synaptic weight W_ij yields the post-synaptic potential of neuron i caused by input neuron j. Since P_ij and Q_ij are linear and driven only by S_j, the index i can be dropped; this results in as many P and Q variables as there are pre-synaptic neurons, independently of the number of synapses. Rewriting the equations in discrete time with time step Δt and annotating them with a superscript l denoting the layer to which a neuron belongs, the dynamical equations above become:

$$
\begin{aligned}
U_i^l[t] &= \sum_j W_{ij}^l\, P_j^l[t] - R_i^l[t] + b_i^l, \qquad & S_i^l[t] &= \Theta\!\left(U_i^l[t]\right),\\
P_j^l[t+\Delta t] &= \alpha\, P_j^l[t] + Q_j^l[t], \qquad & Q_j^l[t+\Delta t] &= \beta\, Q_j^l[t] + S_j^{l-1}[t],\\
R_i^l[t+\Delta t] &= \gamma\, R_i^l[t] + S_i^l[t], & &
\end{aligned}
$$

where the constants α = exp(−Δt/τ_mem), γ = exp(−Δt/τ_ref), and β = exp(−Δt/τ_syn) reflect the decay dynamics of the membrane potential U, the refractory state R, and the synaptic state Q during a time step Δt.
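The discrete-time dynamics above translate almost directly into code. The following is a minimal NumPy sketch of a single fully connected LIF layer in this P/Q/R formulation; it is not the decolle library itself, and the class name, weight initialization, and time-constant values are illustrative assumptions.

```python
import numpy as np

class LIFLayerSketch:
    """One fully connected LIF layer in the P/Q/R trace formulation above."""

    def __init__(self, W, b, dt=1e-3, tau_mem=20e-3, tau_syn=5e-3, tau_ref=2e-3):
        self.W, self.b = W, b                      # W: (n_out, n_in), b: (n_out,)
        self.alpha = np.exp(-dt / tau_mem)         # membrane-trace decay
        self.beta = np.exp(-dt / tau_syn)          # synaptic-trace decay
        self.gamma = np.exp(-dt / tau_ref)         # refractory decay
        n_out, n_in = W.shape
        self.P = np.zeros(n_in)                    # membrane trace per input neuron
        self.Q = np.zeros(n_in)                    # synaptic trace per input neuron
        self.R = np.zeros(n_out)                   # refractory state per output neuron

    def step(self, S_in):
        """Advance one time step given a binary input spike vector S_in."""
        self.P = self.alpha * self.P + self.Q
        self.Q = self.beta * self.Q + S_in         # jump of height 1 per input spike
        U = self.W @ self.P - self.R + self.b      # membrane potential
        S_out = (U > 0).astype(float)              # unit step Theta(U)
        self.R = self.gamma * self.R + S_out       # relative refractory mechanism
        return S_out, U

# Toy usage: 100 inputs driving 10 LIF neurons for a few time steps.
layer = LIFLayerSketch(W=np.random.randn(10, 100) * 0.1, b=np.zeros(10))
for _ in range(5):
    spikes, _ = layer.step(np.random.binomial(1, 0.05, size=100).astype(float))
```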

DECOLLE Update Rule
As discussed above, in DECOLLE we attach a random readout to each of the N layers of spiking neurons [97]:

$$Y_i^l = \sum_j G_{ij}^l\, S_j^l,$$

where G_ij^l are fixed, random matrices. The global loss function is then defined as the sum of the layerwise loss functions defined on the random readouts, i.e., L = Σ_{l=1}^{N} L^l(Y^l). To enforce locality, DECOLLE sets all non-local gradients to zero, i.e., ∂L^l/∂W_ij^m = 0 if m ≠ l. With this assumption, the weight update at each layer becomes:

$$\Delta W_{ij}^l = -\eta\, \frac{\partial L^l}{\partial W_{ij}^l} = -\eta\, \frac{\partial L^l}{\partial S_i^l}\, \frac{\partial S_i^l}{\partial U_i^l}\, \frac{\partial U_i^l}{\partial W_{ij}^l},$$

where η is the learning rate. Assuming the loss function depends only on variables in the same time step, the first gradient term on the right-hand side, ∂L^l/∂S_i^l, can be trivially computed using the chain rule of derivatives. For the second gradient term, surrogate gradient-based learning gives [25]:

$$\frac{\partial S_i^l}{\partial U_i^l} \approx \sigma'\!\left(U_i^l\right),$$

where σ′(U_i^l) is the surrogate gradient of the non-differentiable step function Θ(U_i^l). In particular, a piecewise linear surrogate activation function is used, such that its derivative becomes the boxcar function σ′(x) = 1 if x ∈ [−0.5, 0.5] and 0 otherwise. The rightmost term is computed as:

$$\frac{\partial U_i^l}{\partial W_{ij}^l} = P_j^l - \frac{\partial R_i^l}{\partial W_{ij}^l}.$$

Because regularization is used to favor low firing rates in DECOLLE, the terms involving R_i^l can be ignored. Combining the above equations gives the DECOLLE rule governing the synaptic weight update:

$$\Delta W_{ij}^l = -\eta\, \frac{\partial L^l}{\partial S_i^l}\, \sigma'\!\left(U_i^l\right) P_j^l.$$

SpikeHD Trainable Parameters
Figure S2 demonstrates the effect of the trainable-parameter ratio on the performance of the model. To stress-test the model, the shape of the SNN is limited to three LIF layers with a size ratio of 2:3:2, and the HDC dimension is derived from however many trainable parameters are allotted to the HDC memory, up to a residue. For the small SpikeHD (blue line, 10k parameters), there is a sweet spot where the optimal accuracy is reached. This is because the training data requires an SNN large enough to extract meaningful features (which explains the left slope) and an HDC memory large enough to memorize these features (which explains the right slope). Because, in practice, HDC memory accuracy resembles a sigmoid function of the dimension, the accuracy deteriorates rapidly once the dimension drops below a certain threshold. This is demonstrated by the rapid increase in error rate for SpikeHD configurations with a very high SNN proportion. In addition, because this dimension threshold correlates with the output dimension of the SNN rather than with the overall SNN size, the large SpikeHD (black line, 40k parameters) requires a smaller proportion of HDC memory to perform well. The required HDC size also depends on the number of classes; for example, the Omniglot dataset, a few-shot, 1623-class classification task, demands a large HDC memory.
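As a rough illustration of how such a parameter budget could be split, the sketch below derives the HDC dimension from whatever budget remains after the SNN, assuming the HDC memory's trainable parameters are the class hypervectors (num_classes × D); the function name and the example numbers are hypothetical, not taken from the experiments.

```python
def hdc_dimension(total_params, snn_params, num_classes):
    """Derive the HDC dimensionality D from the remaining parameter budget.

    Assumes the HDC memory's trainable parameters are the class hypervectors
    (num_classes x D); any leftover budget smaller than one extra dimension
    per class is simply unused (the 'residue').
    """
    hdc_budget = total_params - snn_params
    D, residue = divmod(hdc_budget, num_classes)
    return D, residue

# Hypothetical example: a 40k-parameter SpikeHD with a 20k-parameter SNN and 10 classes.
D, residue = hdc_dimension(total_params=40_000, snn_params=20_000, num_classes=10)
```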

Figure S3: Training performance of SpikeHD with a convolutional SNN and distinct HDC non-linear encoders on MNIST and DVSGesture. Each bar plot represents the classification accuracy of the SNN along with SpikeHD using HDC encoders with binary, uniform, or Gaussian base hypervectors and with Tanh activation.

SpikeHD Enhancing Convolution-based SNN
We apply a convolutional architecture with two convolution layers of 16 and 32 channels, respectively, each with a kernel of size 7, followed by a fully connected layer of 64 neurons. Each convolutional LIF layer is followed by max pooling with kernel size 2. Figure S3 compares the test classification accuracy of SpikeHD with the convolution-based SNN. For the SNN, we use the conventional DECOLLE architecture in our default configuration. Our evaluation shows that the default SpikeHD outperforms both SNN and HDC in terms of quality of learning. SpikeHD achieves, on average, 1.7% and 1.8% higher classification accuracy than the SNN model after co-training on MNIST and DVSGesture, respectively.
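For reference, the following is a PyTorch-style sketch of the layer shapes in this convolutional front-end. It illustrates the topology only (conv 16 → pool → conv 32 → pool → FC 64); in the actual network each convolutional and fully connected layer drives LIF dynamics (as in the earlier discrete-time sketch) rather than a conventional activation, and the single-channel 28×28 input, the padding, and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvSNNBackboneSketch(nn.Module):
    """Layer shapes of the convolutional SNN front-end described above.

    Plain PyTorch modules are used to show the topology; the 1x28x28 input
    (MNIST-like) and padding choice are assumptions for this sketch.
    """
    def __init__(self, in_channels=1, num_outputs=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=7, padding=3)
        self.pool1 = nn.MaxPool2d(kernel_size=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=7, padding=3)
        self.pool2 = nn.MaxPool2d(kernel_size=2)
        self.fc = nn.LazyLinear(num_outputs)   # 64-neuron fully connected layer

    def forward(self, x):
        x = self.pool1(self.conv1(x))
        x = self.pool2(self.conv2(x))
        return self.fc(torch.flatten(x, 1))

# Example: a batch of 8 single-channel 28x28 frames.
features = ConvSNNBackboneSketch()(torch.randn(8, 1, 28, 28))
```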