Sparse-firing regularization methods for spiking neural networks with time-to-first-spike coding

The training of multilayer spiking neural networks (SNNs) using the error backpropagation algorithm has made significant progress in recent years. Among the various training schemes, the error backpropagation method that directly uses the firing time of neurons has attracted considerable attention because it can realize ideal temporal coding. This method uses time-to-first-spike (TTFS) coding, in which each neuron fires at most once, and this restriction on the number of firings enables information to be processed at a very low firing frequency. This low firing frequency increases the energy efficiency of information processing in SNNs. However, TTFS coding only imposes an upper limit of one firing per neuron, and the information-processing capability of SNNs at even lower firing frequencies has not been fully investigated. In this paper, we propose two spike-timing-based sparse-firing (SSR) regularization methods to further reduce the firing frequency of TTFS-coded SNNs. Both methods are characterized by the fact that they only require information about the firing timing and associated weights. The effects of these regularization methods were investigated on the MNIST, Fashion-MNIST, and CIFAR-10 datasets using multilayer perceptron networks and convolutional neural network structures.


Introduction
Spiking neural networks (SNNs) can process information in the form of spikes in a manner similar to the way information is processed in the brain. SNNs are thereby expected to be able to achieve both high computational functionality and energy efficiency [1]. The spikes are represented as all-or-none binary values, and how information is represented by spikes is closely related to the information-processing mechanism in SNNs. The spike-based information representation methods are divided into two major categories, rate coding and temporal coding [2,3]. In rate coding, information is contained in the average number of spikes generated by a neuron. In this case, the firing frequency can take approximately continuous values as a function of the input intensities; therefore, the resulting SNNs can be treated as differentiable models similar to an artificial neural network (ANN). Using rate coding, ANNs can be converted to SNNs, and the high learning ability of ANNs has been successfully transferred to SNNs [4,5,6]. However, when rate coding is used, information processing in the SNNs is just an approximation of that in ANNs. Furthermore, the precise approximation of an ANN requires many spikes, which reduces energy efficiency when implemented in neuromorphic hardware [7]. It has been experimentally shown that, physiologically, neurons in certain brain regions or specific neuron types exhibit extremely sparse firing characteristics [8], and it is thought that temporal coding using not only the firing frequency but also the firing time is realized in at least some brain regions [9,10,11,12].
Therefore, to achieve brain-like high-capacity, energy-efficient information processing capabilities in SNNs, it is important to use temporal coding that also considers spike timing information.
Because in temporal coding the precise timing of spikes carries information, it is necessary to train the SNNs directly instead of using a converted ANN. In recent years, by incorporating deep learning techniques, it has become possible to directly train SNNs using the backpropagation algorithm [13,14,15,16]. Among the various methods proposed, methods that focus on the displacement of the membrane potential and those that focus on the displacement of the spike time have attracted particular attention because of their high learning performance. In membrane potential displacement methods, the derivative of the output spike with respect to the membrane potential is almost always zero because the spike is a binary value. However, it is possible to approximate this derivative using a surrogate function [17]. This method has been proposed in various forms by various groups [18,19,20,21]. It has been proven to be very flexible, works with various surrogate functions [22], and can be used to efficiently train recurrent neural network structures [23,24]. Recently, it has become possible to train relatively large models [25]. However, with few exceptions [26,27], the neurons exhibit high firing frequencies, and it is debatable whether the timing information is efficiently utilized.
A timing-based learning method is a method that focuses on the displacement of the spike time [28]. The coding most commonly used in this learning method is time-to-first-spike (TTFS) coding, which has the property that each neuron fires at most once [29,30]. Because the information is contained in the timing of a single spike and the gradient is computed directly using the spike timing, this coding is expected to realize an ideal temporal coding. The high learning performance of this method has been shown in various neuron models [31,32,33,34,35,36,37]. Hardware implementation efforts are also underway to achieve high power efficiency by taking advantage of its sparsity characteristics [37,38]. However, the constraint of firing at most once per neuron in TTFS coding may not be sufficiently sparse in some situations. For example, in the brain, there are many neurons that hardly fire at all [8]. Furthermore, in an extremely power-limited environment such as edge AI [39], a sparse firing pattern that goes beyond the constraint of one firing per neuron is desirable.
In this paper, we propose two methods to further improve the sparse firing property of TTFS-coded SNNs. One method is derived from the loss in the value of the membrane potential, and the other is derived from the firing conditions. Both methods are characterized by the fact that they only require information about the firing timing and the weights associated with it, as is the case in timing-based learning. In the following, we describe the two methods and show experimentally how they suppress firing effectively on the MNIST, Fashion-MNIST, and CIFAR-10 datasets.

Spike timing-based sparse-firing regularization methods
We first summarize the proposed spike timing-based sparse-firing regularization (SSR) methods. SSR methods are characterized by the fact that they only require information about the firing timing and the weights associated with it, as is the case in ordinary timing-based learning. In this study, we propose two SSR method variants: membrane-potential-aware SSR (M-SSR) and firing-condition-aware SSR (F-SSR). In both cases, firing is suppressed by adding a new regularization term to the cost function used in supervised learning. M-SSR is based on the idea of reducing the value of the membrane potential, which is realized by adding the membrane potential loss V as a regularization term to the cost function. F-SSR is based on the idea of breaking the firing conditions, which is realized by adding the firing condition loss Q as a regularization term to the cost function. Figure 1 shows the outline of each method. For simplicity, this paper adopts the commonly used leaky integrate-and-fire (LIF) neuron model. The LIF neuron model has as parameters the time constant of the membrane potential τ_v and the time constant of the synaptic current τ_I (see Method). Extensions to other neuron models are straightforward.
First, we explain M-SSR, which is based on the idea of reducing the membrane potential value. The membrane potential loss V is defined as

V = Σ_l ξ^l Σ_i V^(l)_i,  (1)

V^(l)_i = ∫_0^T [v^(l)_i(t) − v]_+ dt,  (2)

where [x]_+ = max(x, 0), V^(l)_i is the loss relating to the membrane potential trajectory of the lth-layer neuron i, and ξ (> 0) is the hyperparameter for leveling the sparsity in each layer. In addition, V_th is the firing threshold, and T is a parameter specifying the time interval during which firing is suppressed. Figure 1(a) shows the loss associated with the membrane potential trajectory of a neuron. Note that v is sufficiently large, and there is only one point t̂^(l)_i at which the membrane potential equals v. In this case, the loss is nonzero only during [t̂^(l)_i, t^(l)_i]. To perform the integration in Eq. (2), we need information about the value of the membrane potential at each time step. However, by setting v → V_th, we obtain t̂^(l)_i → t^(l)_i. This avoids the integral calculation, and Eq. (2) can be solved analytically. When computing the gradient of this integral, it is important to fix the integration range [t̂^(l)_i, t^(l)_i]. If it is not treated as a fixed value, the more rapidly the membrane potential rises, the smaller the membrane potential loss in Eq.
(2) becomes, and thus firing is not effectively suppressed. Finally, taking this limit, we obtain M-SSR (Eq. (3)). Note that in the resulting equations, gradients are not calculated for the variables shown in blue (they are treated as constants in the gradient calculations). This corresponds to fixing the integration range [t̂^(l)_i, t^(l)_i]. Constant terms not involved in the learning are excluded. Γ^(l)_i denotes the index set of spikes that have been input to the lth-layer neuron i up to the firing time t^(l)_i; the auxiliary variables appearing in Eq. (3) are defined in Appendix A, which provides a detailed derivation. In addition, Appendix B discusses the consistency of the M-SSR gradient (Eq. (3)) with that of the integral-form loss (Eq. (2)) when v → V_th.
Next, we explain F-SSR, a method that suppresses firing based on the firing conditions. For the non-leaky integrate-and-fire neuron model (τ_v = ∞, τ_I = τ), assuming that no input spike arrives after the neuron's firing time, the membrane potential asymptotically approaches Σ_{j∈Γ^(l)_i} w^(l)_ij, and we obtain the following firing condition:

Σ_{j∈Γ^(l)_i} w^(l)_ij > V_th.

Because firing will be suppressed if this firing condition is not satisfied, we define the F-SSR term Q as

Q = Σ_l ξ^l Σ_i Q^(l)_i,  (7)

Q^(l)_i = Σ_{j∈Γ^(l)_i} w^(l)_ij,  (8)

where Q^(l)_i is evaluated for neurons that fire within [0, T]; we note that Q^(l)_i = 0 if the neuron does not fire.
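As an illustration of how lightweight this loss is, the following sketch computes the F-SSR term for one layer from spike times and weights alone. This is our own illustration, not the authors' code: the function and variable names are ours, and we assume the firing condition takes the asymptotic form Σ_{j∈Γ_i} w_ij > V_th described above.

```python
import numpy as np

def f_ssr_loss(w, t_in, t_out):
    """Sketch of the firing-condition loss Q for one layer.

    w     : (N_out, N_in) weight matrix
    t_in  : (N_in,) input spike times (np.inf = no spike)
    t_out : (N_out,) output firing times (np.inf = neuron did not fire)

    For each neuron i that fired, Q_i sums the weights of the spikes it
    received before its own firing time (the asymptotic membrane
    potential of the non-leaky model); silent neurons contribute 0.
    """
    q = np.zeros(len(t_out))
    for i, t_i in enumerate(t_out):
        if np.isinf(t_i):            # neuron never fired -> Q_i = 0
            continue
        received = t_in < t_i        # index set Gamma_i
        q[i] = w[i, received].sum()
    return q.sum()
```

Because only firing times and weights appear, no membrane-potential trace needs to be stored, which is the property emphasized in the text.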

Numerical simulations
We trained several SNNs on the MNIST dataset [40], Fashion-MNIST dataset [41], and CIFAR-10 dataset [42] to investigate the effect of SSR on suppressing firing. In the experiments, in addition to the multilayer perceptron (MLP) structure, a convolutional neural network (CNN) structure was used. The image data in the datasets were converted to input spikes, where the intensity of each pixel determines the input time of the corresponding spike (see Method). We define sparsity as the average number of spikes per neuron per input sample in the time window [0, t_ref], where t_ref is the reference time of the output layer firing time (see Method). In addition, we set T = t_ref in Eqs. (2) and (8), and we set the firing threshold V_th to 1. When the integral form of Eq. (2) was used, the integral was approximated by discretizing it with time step ∆t.
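The sparsity measure defined above can be computed directly from recorded spike times; the following is a minimal sketch (the function and variable names are ours):

```python
import numpy as np

def sparsity(spike_times, n_neurons, n_samples, t_ref):
    """Average number of spikes per neuron per input sample in [0, t_ref].

    spike_times : spike times pooled over all neurons and samples;
                  spikes after t_ref (or absent, np.inf) are not counted.
    """
    spike_times = np.asarray(spike_times)
    n_spikes = np.count_nonzero((spike_times >= 0) & (spike_times <= t_ref))
    return n_spikes / (n_neurons * n_samples)
```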
Figure 2 shows the learning results of SNNs with one hidden layer (784-400-10) trained on the MNIST dataset with various M-SSR strengths γ_2 (see Method). Note that all output layer neurons were required to fire because the loss function was defined by the spike timing of the neurons in the output layer. Therefore, sparse firing regularization was applied only to the hidden layers. The upper figures show the raster plots of the firing distribution of each layer for a given input sample, and the lower figures show the time evolution of the membrane potentials of each layer. As the strength of M-SSR was increased, the number of neurons that fired tended to decrease, and it can be seen that most of the hidden layer neurons stopped firing when γ_2 = 1.3 × 10^−5. By contrast, the firing distribution of the output layer did not change significantly with respect to the M-SSR strength. The neuron corresponding to the correct index fired the earliest (t ∼ 4), and the other neurons fired later (t ∼ 8). The membrane potentials of the hidden layer neurons were suppressed as the regularization strength increased, whereas the output layer solved the task using fewer spikes from the hidden layer. This indicates that M-SSR regularization can suppress the firing of the hidden layer without significantly compromising recognition performance.
Figure 3 shows the sparsity-accuracy tradeoff obtained when using the integral-form regularization (Eq. (2)) and M-SSR (Eq. (3)). We trained SNNs with a single hidden layer (784-400-10) using various regularization strengths. The standard deviations were obtained over 10 trials. The upper figures show the results for the MNIST dataset, and the lower figures show the results for the Fashion-MNIST dataset. Results are also shown for different neuron models (τ_v, τ_I). For the MNIST dataset, the tradeoff curves show that a larger v led to a better tradeoff for all neuron models, and the best tradeoff was obtained by M-SSR.

Figure 3: Sparsity-accuracy tradeoff for different regularization forms. We evaluated the integral-form regularization (Eq. (2)) and M-SSR (Eq. (3)) in a 784-400-10 SNN based on various neuron models. The sparsity-accuracy tradeoff is shown when the regularization strength γ_2 was varied from 0 to 10^−4. The standard deviations were obtained over 10 trials. The hyperparameters were t_ref = 8, γ_1 = 10^−4, γ_3 = 0, η = 10^−4, and τ_soft = 0.9. For the integral-form regularization, we used various values of v, and we set ∆t = t_ref/1000.

Note that M-SSR corresponds to v = 1 (= V_th). Similar results were obtained for Fashion-MNIST, although the advantage was not as pronounced as it was for MNIST. This result demonstrates that the integral-form regularization (Eq. (2)) transitions smoothly to the limit form (M-SSR, Eq. (3)), and that taking the limit v → 1 improves the tradeoff between sparsity and accuracy. In addition, good sparsity-accuracy tradeoff properties were obtained for various neuron models (τ_v and τ_I), suggesting that M-SSR is preferable to the integral-form regularization for TTFS-coded SNNs.
Figure 4 compares the results of the two proposed SSR methods, M-SSR (Eq. (3)) and F-SSR (Eq. (8)). The results for an SNN with one hidden layer (784-400-10) are shown in the top figures. On the MNIST benchmark, F-SSR obtained a better sparsity-accuracy tradeoff than M-SSR, whereas on the Fashion-MNIST benchmark, F-SSR and M-SSR yielded similar sparsity-accuracy tradeoffs. The results for an SNN with three hidden layers are shown in the lower figures. For each SSR method and each dataset, the value of ξ (Eq. (1)) was set to obtain the best tradeoff averaged over all layers (except for the output layer). See Appendix C for the tradeoff properties for various values of ξ. For an SNN with three hidden layers, M-SSR and F-SSR showed similar sparsity-accuracy tradeoff characteristics, but the optimal value of ξ differed significantly between M-SSR and F-SSR. For the MNIST dataset, the optimal value was ξ = 6 for M-SSR and ξ = 1 for F-SSR, whereas for the Fashion-MNIST dataset, the optimal value was ξ = 6 for M-SSR and ξ = 4 for F-SSR. This difference may be due to the characteristics of the regularization functions. In M-SSR (Eq. (3)), the error that occurs in the lth layer propagates back to the previous layer via the spike timing t^(l−1)_j. In this case, if the weights from neuron j of the (l−1)th layer to the lth layer were positive overall, t^(l−1)_j increased during training, and consequently the (l−1)th layer also became more sparse. Similarly, the (l−2)th layer was expected to become sparse. Therefore, to counteract this effect, a relatively large value of ξ was optimal. By contrast, in the case of F-SSR (Eq. (8)), the losses that occur in the lth layer do not propagate back to previous layers. Therefore, a relatively small value of ξ was optimal.
Next, we applied the SSR methods to spiking CNNs. Table 1 shows the network structure used for each dataset. Figure 5 shows the effect of SSR regularization on MNIST and Fashion-MNIST. The overall sparsity-accuracy tradeoff for the CNN structure was worse than that for the MLP structure. The SNNs with MLP structures reduced the average number of firings per neuron to about 0.1 to 0.2 with almost no loss in accuracy, whereas the SNNs with convolutional structures only reduced the number of firings to about 0.4. It can be seen that the first convolutional layers are not very sparse, with the exception of the results of F-SSR on the MNIST dataset. This is considered to be caused by the fact that the weight-sharing property makes it difficult to force only a portion of the neurons in a convolutional layer to stop firing.
On the CIFAR-10 dataset, firing tended to be suppressed during training even when SSR was not applied, and we confirmed that neurons in some channels did not fire at any image location for any training data. To avoid this problem, we trained the SNNs to attempt to satisfy the firing conditions. This was achieved by making the regularization strength γ_3 of the F-SSR term negative (see Method). The results of learning by promoting firing are shown in Fig. 6. Because the variance over the 10 trials was large, we plotted the results for the trial with the highest test accuracy for better visualization. Appendix D presents the results for all trials. By sacrificing sparsity, we were able to observe a noticeable improvement in performance. We also conducted an experiment incorporating M-SSR regularization, but the combination of promoting firing and sparse firing regularization did not yield a favorable sparsity-accuracy tradeoff.

Figure 4: Comparison of M-SSR (Eq. (3)) and F-SSR (Eq. (8)). The bottom panel presents the sparsity-accuracy tradeoffs for the sparsity averaged over the hidden layers. The standard deviations were obtained over 10 trials. We used the following hyperparameters: t_ref = 8, γ_1 = 10^−4, η = 10^−4, and τ_soft = 0.9 for the 2-layer SNNs, and t_ref = 9, γ_1 = 10^−4, η = 10^−4, and τ_soft = 0.9 for the 4-layer SNNs. For the MNIST dataset, we set ξ to 6 for M-SSR and 1 for F-SSR. For the Fashion-MNIST dataset, we set ξ to 6 for M-SSR and 4 for F-SSR.

Figure 5: Effects of SSR with a CNN architecture on the MNIST dataset (a) and the Fashion-MNIST dataset (b). From top to bottom, the panels represent the sparsity-accuracy tradeoffs for the first convolutional layer, the second convolutional layer, the first fully connected layer, the second fully connected layer, and the whole network (not including pooling layers). The standard deviations were obtained over 10 trials. We used the following hyperparameters: t_ref = 16, γ_1 = 10^−4, η = 10^−4, and τ_soft = 0.9. For the MNIST dataset, we set ξ to 6 for M-SSR and 2 for F-SSR. For the Fashion-MNIST dataset, we set ξ to 6 for M-SSR and 4 for F-SSR.

Discussion
SNNs with TTFS coding can realize ideal temporal coding by constraining each neuron to fire at most once. Due to this constraint, SNNs with TTFS coding have high firing sparsity, and this approach has been applied in energy-efficient hardware implementations [37,38]. To further improve this sparse firing characteristic, we developed the SSR methods. The two SSR methods were derived from different perspectives. The first one is M-SSR, which was derived by assigning a penalty whenever the membrane potential exceeds a threshold v and taking the limit in which this threshold equals the firing threshold. The other is F-SSR, which was obtained from the firing conditions of neurons. Both SSR methods are characterized by the fact that they do not require information about the membrane potential itself, only the firing time and associated weights. The sparsity-accuracy properties of these two methods were investigated using the MNIST, Fashion-MNIST, and CIFAR-10 datasets. Interestingly, although some differences were observed depending on the dataset and the network structure, both F-SSR and M-SSR showed equally good sparsity-accuracy properties, even though the regularization methods were derived from different perspectives. In particular, for the fully connected layers, it was found that the average number of firings per neuron could be lowered to 0.1 to 0.2. From the experiments conducted in this study, it is difficult to determine which method is superior. We can at least conclude that F-SSR has the advantage of a somewhat smaller computational load than M-SSR due to its simpler formula. To understand the difference between F-SSR and M-SSR, in addition to the sparsity-accuracy property, a detailed analysis of the changes in the firing characteristics and in the information-processing mechanism associated with sparse firing will be required in future work.
For the CNN structures, we found that the SSR methods had more difficulty suppressing the firing of neurons in the convolutional layers than in the fully connected layers. In CIFAR-10 in particular, we observed that firing was suppressed too much and learning became difficult even without SSR. This may be because the weight-sharing property makes it difficult to flexibly decide, depending on the position, whether the outputs belonging to a certain kernel should fire. To prevent this, we found that, in CIFAR-10, learning performance can be improved by promoting firing. Similar firing promotion terms have been introduced in previous studies [31,43]. In timing-based learning of large-scale CNN structures, one way to obtain better sparsity-accuracy properties is to utilize models that allow multiple neuron firings [44] combined with the SSR methods.
Previous studies have developed methods that suppress the firing of SNNs in the framework of the surrogate gradient method [26,27]. They applied direct regularization to the spike variable s(t) ∈ {0, 1}, represented at each time step in the time-discretized SNN. The gradient calculation is made possible using the surrogate function ds(t)/dv(t) = σ(v(t)) [17]. This method is closely related to the M-SSR proposed in this paper. In the surrogate gradient method, the spike variable above is treated as a function of the membrane potential, s(t) = ∫_{−∞}^{v(t)} σ(v′) dv′. In this sense, the idea is similar to the loss in Eq. (2), which integrates the membrane potential. By contrast, M-SSR, unlike the previous methods [26,27], can be transformed from the time-integration form to the timing form by setting v → V_th. This may correspond to the fact that a learning method with a surrogate function can transition to a timing-like learning method by taking a limit [19]. Interestingly, as shown in Fig. 3, the sparsity-accuracy property improves as v approaches V_th. This suggests that timing-based sparse regularization is more effective in timing-based learning. We note that F-SSR is a regularization method using firing conditions, which is unique to timing-based learning.
SNNs can operate efficiently on neuromorphic hardware [26,45]. Because the energy consumed by spike transmission increases as the firing frequency increases, reducing the firing frequency is an important issue in real-world applications. SNNs with TTFS coding are expected to provide significant power advantages in hardware implementations due to their extremely sparse firing [34,35,46]. Several research groups have reported hardware implementations of such SNNs [38,37,47]. The SSR methods are expected to further improve the energy efficiency of SNNs. Moreover, unlike the methods in [26,27], the SSR methods can calculate the gradient without observing the membrane potential, which may simplify the learning system on hardware. Finally, in addition to the reduction in the firing rate, the combination of binarized weights [48] and pruned weights [49,27,50] is expected to make the SNN model more suitable for hardware implementation.

Method SNN models
In this study, we constructed a multilayer SNN using the following LIF neuron model:

τ_v dv^(l)_i(t)/dt = −v^(l)_i(t) + I^(l)_i(t),  (9)

τ_I dI^(l)_i(t)/dt = −I^(l)_i(t) + Σ_{j=1}^{N^(l−1)} w^(l)_ij δ(t − t^(l−1)_j),  (10)

where v^(l)_i is the membrane potential of neuron i in the lth layer, w^(l)_ij is the coupling strength from neuron j in the (l−1)th layer to neuron i in the lth layer, and t^(l−1)_j is the firing time of neuron j in the (l−1)th layer. Furthermore, τ_v is the time constant of the membrane potential, and τ_I is the time constant of the synaptic current. N^(l) is the number of neurons that make up the lth layer. A neuron fires when its membrane potential reaches the firing threshold V_th and generates a spike. After firing, the membrane potential is fixed at 0, and the neuron never fires again. The membrane potential of the model described in Eqs. (9) and (10) can be obtained analytically. The experiments in this study consider the three cases (τ_v, τ_I) ∈ {(2τ, τ), (∞, τ), (∞, ∞)}. Note that the learning characteristics of SNNs with TTFS coding were previously investigated for the cases τ_v = ∞, τ_I = τ and τ_v = ∞, τ_I = ∞ [33,37]. The firing time in each case can be calculated from the condition v^(l)_i(t^(l)_i) = V_th. For example, for τ_v = τ_I = ∞, the membrane potential grows as v^(l)_i(t) = Σ_{j∈Γ^(l)_i} w^(l)_ij (t − t^(l−1)_j), which gives

t^(l)_i = (V_th + Σ_{j∈Γ^(l)_i} w^(l)_ij t^(l−1)_j) / Σ_{j∈Γ^(l)_i} w^(l)_ij,  (14)

where Γ^(l)_i denotes the index set of spikes input to the lth-layer neuron i up to the firing time t^(l)_i. A detailed derivation of the firing time in the case of (τ_v, τ_I) = (2τ, τ) is given in Appendix A.3.
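For the simplest case τ_v = τ_I = ∞, the membrane potential rises piecewise-linearly between input spike events, so the firing time can be found by scanning the input events in temporal order and solving the linear crossing condition on each segment. The following sketch is our own illustration of such an event-driven computation, not the authors' implementation:

```python
import numpy as np

def firing_time_nonleaky(w, t_in, v_th=1.0):
    """First-spike time of a neuron with tau_v = tau_I = inf.

    Each input spike j contributes w[j] * (t - t_in[j]) for t >= t_in[j],
    so between input events the potential is linear in t.  We scan input
    events in temporal order and solve V_th = sum_j w[j] * (t - t_in[j])
    on each segment.  Returns np.inf if the neuron never reaches V_th.
    """
    order = np.argsort(t_in)
    w, t_in = np.asarray(w, float)[order], np.asarray(t_in, float)[order]
    for k in range(1, len(t_in) + 1):
        a = w[:k].sum()                      # current slope
        b = (w[:k] * t_in[:k]).sum()
        if a <= 0:                           # potential not rising
            continue
        t = (v_th + b) / a                   # candidate crossing time
        t_next = t_in[k] if k < len(t_in) else np.inf
        if t_in[k - 1] <= t <= t_next:       # crossing lies on this segment
            return t
    return np.inf
```

With a single input spike of weight 1 at t = 0 and V_th = 1, the neuron fires at t = 1, matching the closed-form expression above.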

Learning algorithms
Supervised learning of the SNN was performed using the following cost function:

C = −Σ_{i=1}^{N^(M)} κ_i ln [ exp(−t^(M)_i/τ_soft) / Σ_{j=1}^{N^(M)} exp(−t^(M)_j/τ_soft) ] + γ_1 T + γ_2 V + γ_3 Q,  (16)

where M represents the output layer and N^(M) is the number of neurons in the output layer. The value of the teacher label κ_i is equal to one when the ith label is assigned and zero otherwise. Parameters γ_1, γ_2, and γ_3 are real numbers, and they respectively control the significance of the temporal penalty term T, the membrane potential loss V (Eq. (1)), and the firing condition term Q (Eq. (7)). Parameter τ_soft is a positive real number, which adjusts the softmax scaling. Learning was performed by minimizing this cost function using the gradient method with the Adam optimizer [51] at a learning rate of η. On the CIFAR-10 task, the coefficient γ_3 was set to a negative number to promote firing. In this case, the firing condition term Q was modified so that it is applied only to neurons that did not fire; that is, we promoted firing only for the neurons that remained silent.
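The classification part of the cost function, a softmax over negative output firing times so that earlier spikes receive higher probability, can be sketched as follows (our own illustration; the function and variable names are assumptions, not the authors' code):

```python
import numpy as np

def ttfs_cross_entropy(t_out, label, tau_soft=0.9):
    """Cross-entropy on a softmax over negative output firing times.

    Earlier spikes get larger probabilities, so minimizing this loss
    drives the correct output neuron to fire first.
    """
    z = -np.asarray(t_out, dtype=float) / tau_soft
    z -= z.max()                        # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()     # softmax probabilities
    return -np.log(p[label])
```

For example, if the correct neuron fires earliest, its softmax probability is the largest and the loss is small; labeling a late-firing neuron as correct yields a larger loss.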

Dataset
The MNIST, Fashion-MNIST, and CIFAR-10 datasets consist of 2-dimensional image data. In the MNIST and Fashion-MNIST datasets, each image has one channel, whereas in the CIFAR-10 dataset, the images have three channels. To process such image data, we first normalized the pixel intensity to [0, 1]. Then, we obtained the input spikes as follows:

t^(0)_{ijk} = τ_in x_{ijk},

where x_{ijk} is the normalized pixel intensity, the first and second indices represent the coordinates of the pixel, and the third index represents the channel number. Here, τ_in is a positive constant. We set τ_in = 5 in all experiments. When spikes are input to a fully connected layer, the input tensors are reshaped into one-dimensional tensors. For the CIFAR-10 dataset, to avoid the problem of the first hidden layer firing too early and ignoring later inputs, the number of input channels was doubled. Furthermore, we used data augmentation (horizontal flipping, rotation, and cropping), as in a previous study [43].
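The intensity-to-latency conversion can be sketched as follows, consistent with the transformation t^(0)_i = τ_in x_i used in Appendix B (the function name is ours):

```python
import numpy as np

def encode_image(x, tau_in=5.0):
    """Convert normalized pixel intensities x in [0, 1] with shape
    (H, W, C) into input spike times t = tau_in * x."""
    x = np.asarray(x, dtype=float)
    assert x.min() >= 0.0 and x.max() <= 1.0, "normalize intensities first"
    return tau_in * x
```

The resulting spike-time tensor keeps the image shape and is flattened before being fed to a fully connected layer, as described above.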

Appendices
A Derivation of M-SSR (Eq. (3))

In this section, we present a detailed derivation of M-SSR (Eq. (3)). If v is sufficiently close to V_th, the time t̂^(l)_i at which the membrane potential equals v can be assumed to be a single point (see Fig. 1(a)). Moreover, the number of input spikes can be assumed to be constant during [t̂^(l)_i, t^(l)_i]. From the above, the integral (Eq. (2)) can be transformed into an integral of the membrane potential over [t̂^(l)_i, t^(l)_i]. Note that V^(l)_i = 0 if the neuron does not fire. Importantly, when computing the gradient of this integral, the integration range [t̂^(l)_i, t^(l)_i] should be fixed. This is because, if the limits are not treated as fixed values, the SNN learns to make the membrane potential rise rapidly, and the effect of firing suppression cannot be obtained. We can omit the constant term v (t^(l)_i − t̂^(l)_i), which is not involved in the learning process, and we only need to calculate the remaining integral of the membrane potential. In the following, the limit v → V_th of this integral is calculated for the various neuron models ((τ_v, τ_I) = (∞, ∞), (∞, τ), and (2τ, τ)).
A.1 Neuron model with τ_v = ∞ and τ_I = ∞

When τ_I = τ_v = ∞, the membrane potential and firing time are given by

v^(l)_i(t) = Σ_{j∈Γ^(l)_i} w^(l)_ij (t − t^(l−1)_j),

t^(l)_i = (V_th + Σ_{j∈Γ^(l)_i} w^(l)_ij t^(l−1)_j) / Σ_{j∈Γ^(l)_i} w^(l)_ij.

Using these equations, the limit (v → V_th) of the integral can be calculated, which yields Eq. (29). Note that the blue variables are related to the integration range, which is treated as a fixed value when calculating the gradient, as mentioned above.
A.2 Neuron model with τ_v = ∞ and τ_I = τ

When τ_v = ∞ and τ_I = τ, the membrane potential and firing time are given by [31]

v^(l)_i(t) = Σ_{j∈Γ^(l)_i} w^(l)_ij (1 − e^{−(t − t^(l−1)_j)/τ}),

e^{t^(l)_i/τ} = Σ_{j∈Γ^(l)_i} w^(l)_ij e^{t^(l−1)_j/τ} / (Σ_{j∈Γ^(l)_i} w^(l)_ij − V_th).

Using these, the limit (v → V_th) can be calculated, which yields the corresponding M-SSR term.

A.3 Alpha-synaptic neuron model with τ_v = 2τ_I = 2τ

When τ_v = 2τ_I = 2τ, the membrane potential is given by a difference of exponentials with time constants 2τ and τ. From the firing condition v^(l)_i(t^(l)_i) = V_th, we obtain a quadratic equation (Eq. (51)) in e^{−t^(l)_i/(2τ)}. From the formula for solving a quadratic equation, the firing time t^(l)_i can be calculated from Eq. (51) as in [37], where the other solution of the quadratic equation is ignored because it indicates the time at which the membrane potential decreases from a value greater than V_th to a smaller value [37]. Using these, the limit (v → V_th) can be calculated.

B Numerical convergence analysis

M-SSR was derived by taking the limit v → V_th of the integral-form regularization (Eq. (2)). In this section, we confirm through numerical simulation that the gradient of the integral-form regularization (Eq. (2)) converges to that of M-SSR (Eq. (3)) as v approaches V_th. The Iris dataset [52] was used as input data. The Iris dataset consists of 150 instances of three-class data with four features. To process the data with an SNN, each feature was normalized to [0, 1], and each element of the vector (x_i) was transformed to the time t^(0)_i = τ_in x_i (i = 0, 1, 2, 3) of the input layer spikes, where τ_in = 5. We also introduced a bias spike t^(0)_4 = 0. These input spikes were fed to an SNN with two hidden layers (5-10-10-3). The cost function (Eq. (16)) was active only for V, with γ_2 = 1. The error for the gradient of the weights of each layer was defined as the relative difference between the gradient of the integral-form regularization and that of M-SSR. The integral-form regularization (Eq. (2)) was computed by setting T = t_ref and discretizing the integral with the time step ∆t = T/N_steps. Figure 7 shows the results of the gradient error for various neuron models. The gradient error was calculated by averaging over the 150 data instances. For all neuron models and values of N_steps, we can confirm that the errors decrease as v is brought closer to V_th, after which the errors tend to increase. This increase in error is due to the fact that, if v is too close to V_th, the integration range becomes extremely narrow, making it difficult to evaluate the integral numerically. As N_steps is increased, the integral can be evaluated more precisely up to points where v is closer to V_th. Moreover, as N_steps is increased, the minimum value of the gradient error decreases uniformly, falling to about 0.1% for N_steps = 10^7.

C Effects of ξ
In the proposed SSR methods (Eqs. (1) and (7)), the regularization loss is larger in the latter layers as the coefficient ξ increases, resulting in sparser firing characteristics in the latter layers. Figure 8 shows the effect of the value of ξ on the sparsity-accuracy tradeoff with M-SSR (Eq. (3)) and F-SSR (Eq. (8)). We trained SNNs with three hidden layers (784-400-400-400-10), and the results are shown for the MNIST and Fashion-MNIST datasets. For each dataset, the top three panels show the sparsity-accuracy tradeoffs for the first, second, and third hidden layers, and the bottom panel shows the average results over all hidden layers. In the case of M-SSR, when ξ = 1, the latter layers are not sparse, but when ξ = 6, all hidden layers are sparse. By contrast, in the case of F-SSR, even when ξ = 1, the latter layers are relatively sparse. This difference can be attributed to the fact that in M-SSR the regularization error propagates backward to the previous layers, whereas in F-SSR the regularization error is local and does not propagate backward.

Figure 8: Effects of ξ on the sparsity-accuracy tradeoff. For each dataset, the first three panels present the sparsity-accuracy tradeoff for the first, second, and third hidden layers, from the top. The bottom panel presents the sparsity-accuracy tradeoff for the sparsity averaged over the three hidden layers. We used the following hyperparameters: t_ref = 9, γ_1 = 10^−4, η = 10^−4, and τ_soft = 0.9.
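Under our reading that the per-layer losses in Eqs. (1) and (7) are combined with geometric weights ξ^l, the layer weighting can be sketched as follows (an assumption for illustration, with names of our choosing):

```python
def combine_layer_losses(layer_losses, xi):
    """Combine per-layer regularization losses with geometric weights
    xi**l: larger xi penalizes later layers more strongly, pushing
    them toward sparser firing."""
    return sum((xi ** l) * loss for l, loss in enumerate(layer_losses, start=1))
```

With ξ = 1 all hidden layers are penalized equally, whereas ξ > 1 shifts the penalty toward the latter layers, which matches the qualitative behavior described above.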

Figure 1: Derivation of the two SSR methods. (a) M-SSR is derived from the membrane potential loss, where the loss is incurred when the membrane potential exceeds v. Assuming that a neuron fires at t_i^(l)

Figure 2: Typical results with M-SSR. We trained SNNs with one hidden layer (784-400-10) on the MNIST dataset with various values of the M-SSR strength γ_2. The upper figures present raster plots for a given input sample, showing the results for the input layer (top), hidden layer (middle), and output layer (bottom). The lower figures show the time evolution of the membrane potentials for the same input sample, with results for the hidden layer (top) and the output layer (bottom). In the panels displaying the membrane potentials of the hidden layer, only 50 neurons are shown. In all cases, the following hyperparameters were used: τ_v = τ_I = ∞, t_ref = 8, γ_1 = 10^{-4}, γ_3 = 0, η = 10^{-4}, and τ_soft = 0.9.

Figure 6: Sparsity-accuracy tradeoff on the CIFAR-10 task. The best case over 10 trials is plotted. From the top, the panels show the sparsity-accuracy tradeoffs for the first, second, and third convolutional layers, followed by the fully connected layer. The sparsity-accuracy tradeoff for the whole network (excluding pooling layers) is presented in the bottom panel. We inverted the effect of F-SSR, so the neurons tend to fire more frequently.

Figure 7: Numerical convergence analysis. The gradient errors in the SNNs (5-10-10-3) are plotted for various values of v and N_steps ∈ {10^4, 10^5, 10^6, 10^7}. The results are shown for the following neuron models: (a) τ_v = τ_I = ∞, (b) τ_v = ∞, τ_I = 5, (c) τ_v = 2τ_I = 10, and (d) τ_v = 2τ_I = 20. In each subfigure, the upper panel shows the gradient error for the first hidden layer, and the lower panel shows that for the second hidden layer. The gradient error was calculated by averaging over the 150 data instances. We set t_ref = 10.

Figure 9 shows the sparsity-accuracy tradeoffs for SNNs trained on the CIFAR-10 dataset. The regularization strength of F-SSR, γ_2, ranged from 0 to −10^{-6}. The training was performed with 10 different initial weights for each value of γ_2. All 190 data points obtained are shown in the figure.

Figure 8: Effect of ξ on the sparsity-accuracy tradeoff. The sparsity-accuracy tradeoffs are plotted for SNNs (784-400-400-400-10) trained with M-SSR and F-SSR for various values of ξ. For each dataset, the top three panels present the sparsity-accuracy tradeoffs for the first, second, and third hidden layers. The bottom panel presents the tradeoff for the sparsity averaged over the three hidden layers. We used the following hyperparameters: t_ref = 9, γ_1 = 10^{-4}, η = 10^{-4}, and τ_soft = 0.9.

Table 1: Convolutional architectures used for each dataset. "Conv(a, b)" denotes a convolutional layer with a kernel size of a × a, b output channels, and stride 1. "Pool" denotes a pooling layer with a kernel size of 2 × 2 and stride 2. The padding of the convolutional layers was set to 0 for MNIST and to 1 for Fashion-MNIST and CIFAR-10.