Neural encoding with unsupervised spiking convolutional neural network

Accurately predicting brain responses to various stimuli poses a significant challenge in neuroscience. Despite recent breakthroughs in neural encoding using convolutional neural networks (CNNs) in fMRI studies, critical gaps remain between the computational rules of traditional artificial neurons and those of real biological neurons. To address this issue, a spiking CNN (SCNN)-based framework is presented in this study to achieve neural encoding in a more biologically plausible manner. The framework uses an unsupervised SCNN to extract visual features of image stimuli and employs a receptive field-based regression algorithm to predict fMRI responses from the SCNN features. Experimental results on handwritten characters, handwritten digits and natural images demonstrate that the proposed approach achieves remarkably good encoding performance and can be used for “brain reading” tasks such as image reconstruction and identification. This work suggests that SNNs can serve as a promising tool for neural encoding.


Introduction
The goal of neural encoding is to predict how the brain responds to outside inputs. It provides an effective approach to exploring the brain's mechanisms for processing sensory information and can form the basis of brain-computer interface systems. Visual perception, as one of the most important ways for us to receive outside information, has been the major target of neural encoding. Over the past two decades, scientists have made remarkable progress in vision-based neural encoding [1][2][3][4] with the development of non-invasive brain imaging techniques, such as functional magnetic resonance imaging (fMRI), making it a central topic of neuroscience.
The common steps for vision-based encoding include feature extraction and response prediction 5.
Feature extraction aims to produce visual features of the stimuli by simulating the visual cortex. A feature extractor that approximates the real visual mechanisms is essential for encoding. Response prediction aims to make voxel-wise predictions of the fMRI responses based on the extracted visual features. This step is usually accomplished by linear regression 6 because the relationship between the features and the responses should be kept as simple as possible. Previous studies demonstrated that the early visual cortex processes information in a way similar to Gabor wavelets [7][8][9]. Inspired by this finding, Gabor filter-based encoding models have been proposed and successfully applied in image identification and movie reconstruction tasks 1,3. In recent years, convolutional neural networks (CNNs) have drawn extensive attention because of their remarkable achievements in the field of computer vision. Representational similarity analysis 10 results show that the human visual cortex shares similar hierarchical representations with CNNs 11,12. Therefore, CNN-based encoding models have come into wide use and have achieved excellent performance 2,4,13,14. Despite the great success of CNNs in encoding applications, the differences between CNNs and the brain in processing visual information cannot be ignored 15.
In terms of computational mechanisms, the key distinction between the artificial neurons in CNNs and biological neurons is that the former propagate continuous digital values, whereas the latter propagate action potentials (spikes). Spiking neural networks (SNNs), regarded as the third generation of neural networks 16, narrow this difference. Unlike traditional artificial neural networks (ANNs), SNNs convey information by spike timing. In SNNs, each neuron integrates spikes from the previous layer and emits spikes to the next layer when its internal voltage reaches a threshold. The most commonly used learning algorithm for SNNs is spike-timing-dependent plasticity (STDP) 17,18, an unsupervised weight-update rule that has been observed in the mammalian visual cortex [19][20][21]. Recent studies have applied STDP-based SNNs to object recognition and achieved considerable performance [22][23][24]. The advantage of SNNs in biological plausibility points to their potential in neural encoding.
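To make the STDP rule discussed above concrete, the sketch below implements the simplified, spike-order-based weight update used in STDP-trained spiking networks (the w·(1−w) stabilizer term follows Kheradpisheh et al., cited in the text); the function name and array shapes are illustrative, not the authors' code:

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.004, a_minus=-0.003):
    """Simplified STDP: the sign of the update depends only on spike
    order (pre before post -> potentiate, else depress), and the
    w*(1-w) factor keeps weights inside [0, 1]. Learning rates follow
    the values quoted in the text."""
    ltp = t_pre <= t_post                        # pre fired before (or with) post
    dw = np.where(ltp, a_plus, a_minus) * w * (1.0 - w)
    return np.clip(w + dw, 0.0, 1.0)
```

Because the update multiplies by w·(1−w), weights saturate toward 0 or 1 as training proceeds, which is what the convergence criterion in the Methods exploits.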
In this paper, we propose a spiking CNN (SCNN)-based encoding framework to fill the gap between CNNs and the real visual system. The encoding procedure consists of three steps. First, we trained an SCNN using the STDP algorithm to extract the visual features of the images. Second, we annotated the coordinates of each voxel's receptive field in the SCNN feature maps based on the retinotopic organization of the visual cortex; that is, each voxel receives visual input from only one fixed location of the feature map. Third, we built a linear regression model for each voxel to predict its responses from the corresponding SCNN features. The framework was tested using two publicly available image-fMRI datasets, namely, handwritten character 25 and natural image 1 datasets. Moreover, we realized two downstream decoding tasks (image reconstruction and image identification) based on our encoding models. We compared the encoding and decoding performance of our method with those of previous methods.
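The three-step procedure can be sketched end to end as follows; the feature extractor and receptive-field search are passed in as placeholder functions, and all names and array shapes are our assumptions rather than the released implementation:

```python
import numpy as np

def build_encoder(images, fmri, extract_scnn_features, find_receptive_field):
    """Hypothetical sketch of the framework: step 1 extracts SCNN
    features (n, C, H, W); step 2 annotates one (row, col) receptive
    field per voxel; step 3 fits a per-voxel linear regression from the
    C channel features at that location to the voxel's responses."""
    feats = extract_scnn_features(images)                 # step 1
    models = []
    for v in range(fmri.shape[1]):                        # one model per voxel
        r, c = find_receptive_field(feats, fmri[:, v])    # step 2
        X = np.c_[feats[:, :, r, c], np.ones(len(feats))] # features + bias
        w, *_ = np.linalg.lstsq(X, fmri[:, v], rcond=None)  # step 3
        models.append(((r, c), w))
    return models
```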

Results
Encoding performance on handwritten character dataset. For this dataset, we first built the SCNN using the images in the TICH dataset (images in the test set were excluded, and 14854 images for the 6 characters were retained), which aimed to maximize the representation ability of the SCNN. Then, we trained the voxel-wise linear regression models with the fMRI data in the train set for each participant.
The encoding performance was defined as the PCC between the predicted and measured responses to the test set images. Moreover, we compared our model with a CNN-based encoding model in which the network architecture of the CNN was constrained to be consistent with that of the SCNN (Supplementary Table 1). The CNN was trained with the Adam optimizer 26 at a learning rate of 0.0001 for 50 epochs on the TICH dataset, and its classification accuracy reached 99% on the test set images. The subsequent encoding procedures for the CNN were the same as for the SCNN. For each subject, the 500 voxels with the highest prediction accuracy in the 3-fold cross-validation on the train set data were selected for the comparison. Figure 2a shows the prediction accuracies for the SCNN- and CNN-based encoding models; the accuracies of SCNN on all three subjects were significantly higher than those of CNN (p < , one-tailed two-sample t-test). This result suggests that the SCNN has greater potential than the CNN for encoding tasks.
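The PCC-based accuracy measure used above can be computed per voxel in a few lines (a routine sketch; array shapes are assumptions):

```python
import numpy as np

def voxel_pcc(pred, meas):
    """Encoding accuracy as defined in the text: Pearson correlation
    between predicted and measured responses, one value per voxel.
    pred, meas: arrays of shape (n_images, n_voxels)."""
    p = pred - pred.mean(axis=0)
    m = meas - meas.mean(axis=0)
    return (p * m).sum(axis=0) / (
        np.linalg.norm(p, axis=0) * np.linalg.norm(m, axis=0))
```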

The predictability of a voxel's activity depends on its degree of involvement in the task. That is, if a voxel receives a large amount of stimulus information, then its fMRI activity will be easy to predict, and vice versa. To verify this, we visualized the distributions of the stimulus intensities and voxel receptive fields. As we annotated the receptive field for each voxel through the 3-fold cross-validation on the train set data, the top 100 voxels with the highest prediction accuracy for each participant were adopted for the analysis. Figure 2b shows the mean stimulus intensities of the train set and the receptive fields of the selected voxels. We found that their spatial distribution patterns, which approximately obeyed Gaussian distributions along the X-axis and uniform distributions along the Y-axis, were quite similar. In other words, the receptive fields of these informative voxels tended to be located where the stimulus intensity was higher. This finding also supports the effectiveness of our receptive field-based feature selection algorithm.
Encoding performance on natural image dataset. Compared with handwritten character images, natural images are more complex and closer to our daily visual experience. To validate whether our approach is feasible for encoding natural image stimuli, we trained and tested the proposed encoding model on this dataset. Task-optimized CNN-based models are not applicable in this encoding task because the visual stimuli in this dataset have no category labels. We therefore compared our approach with the Gabor wavelet pyramid (GWP) model proposed by Kay, et al. 1 and the brain-optimized CNN (GNet) 13,27. Instead of classifying the input images, the CNN in GNet was trained to predict the fMRI responses in an end-to-end fashion. The architecture of GNet can be found in Supplementary Table 2; we trained GNet independently for each visual area in each subject (a total of 6 models). As shown in Table 1, the number of predictive voxels (p < 0.05, Bonferroni corrected) of the proposed model was larger than that of GWP and second only to GNet. We noted that there were fewer predictable voxels in V3, which might be related to its lower signal-to-noise ratio 1. Figure 3 shows the prediction accuracies of the different models on the predictive voxels. For each visual area, the accuracies of SCNN were significantly higher than those of GWP (p < , one-tailed two-sample t-test) and showed no significant difference from those of GNet (p > 0.37, two-tailed two-sample t-test). These results indicate that the unsupervised SCNN-based encoding model is superior to GWP and can even achieve performance comparable to that of a neural network optimized directly on brain responses.

Image reconstruction result. Based on the encoding model, we accomplished the image reconstruction task using the handwritten character dataset.
We adopted the images of the six characters in the TICH dataset (excluding the test set images) as the prior image set and used the encoding model to produce the predicted fMRI responses for each image. Notably, only the 500 voxels selected from the train set data were used for this task. For each image in the test set, we averaged the top 15 images of the prior image set whose predicted responses were most correlated with the observed responses to obtain the reconstructed image. Figure 4a, b show some reconstruction examples. We obtained promising reconstruction results on every character, and images of the same character with different writing styles could be reconstructed. This result indicates that the reconstructions contain both the category and the structure information of the original stimuli. Moreover, we compared our reconstruction results with those of other methods, including CNN, SMLR 28, DGMM 29, Denoiser GAN 30, and D-VAE/GAN 31. The reconstruction performance was evaluated using PCC and the Structural Similarity Index (SSIM) 32. As shown in Table 2, our approach achieved competitive or superior performance relative to the state-of-the-art methods.

Image identification result. The image identification task was accomplished using the natural image dataset. The encoding model was used to produce predicted fMRI responses for all the images in the test set. The images perceived by the participants were identified by matching the measured responses to the predicted responses. Following a previous study 1, the 500 voxels with the highest predictive power were adopted for this task. As shown in Table 2, our approach achieved identification accuracies of 96.67% (116/120) and 90.83% (109/120) for the two participants, which were higher than those of the GWP model (92% and 72%) and GNet (90% and 73.33%). The correlation maps between measured and predicted responses for the two participants are shown in Fig. 5a.
For most of the rows in the correlation maps, the elements on the diagonal were significantly larger than the others, indicating that our approach had excellent identification ability. In addition, we investigated the identification accuracies with 100, 500, 1000, and 2000 voxels. As shown in Fig. 5b, our approach reached the highest accuracies when 500 voxels were adopted.

Discussion
In this work, we proposed an SCNN-based visual perception encoding model, which consists of an SCNN feature extractor and voxel-wise response predictors. Unlike the traditional Gabor- and CNN-based approaches that use real-valued computation, our model relies on a spike-driven SCNN that processes visual information in a more brain-like manner. Using two publicly available datasets as the test bed, the proposed model achieved remarkable success in predicting the brain activity evoked by handwritten characters and natural images with a simple two-layer unsupervised SCNN. Moreover, we accomplished the image reconstruction and image identification tasks with promising results based on our encoding models, indicating that our model can be applied to practical brain-reading problems.
Neural encoding models are usually proposed on the basis of different physiological hypotheses.
Evaluating and comparing the encoding performance of the models can help us estimate the similarities between the hypotheses and the real physiological mechanisms. The SCNN combines the computational rules of SNNs with the network architecture of CNNs, which offers significant advantages in modeling the visual system. Motivated by this notion, we established the SCNN-based encoding model to predict the brain responses induced by different visual inputs. To extract meaningful visual features, we used an SCNN consisting of a DoG layer and a convolutional layer, which aim to mimic information processing in the retina and the visual cortex, respectively. In addition, we developed an efficient feature selection algorithm for encoding based on the neuronal population receptive field properties 33,34 of the fMRI voxels. Compared with the other benchmark methods (Gabor- and CNN-based encoding models), our model showed better encoding performance on the experimental data, which demonstrates the superiority of the SCNN in visual perception encoding.
Whether the brain learns in a supervised or unsupervised manner has long been debated. Instead of supervised CNNs, we used an unsupervised SCNN trained by STDP in our model. Our results suggest that the visual cortex, at least the early visual areas, is more likely to learn visual representations in an unsupervised manner. Additionally, the STDP-based SCNN has the following advantages for neural encoding. First, it is biologically plausible because STDP is a bioinspired learning rule. Second, it can deal with both labeled and unlabeled data. Third, it is more suitable for small-sample datasets, such as fMRI.
The realization of neural decoding tasks is the basis for many brain-reading applications, such as brain-computer interfaces (BCIs) 35. There are two types of decoding models: one is derived from the encoding model, and the other is built directly in an end-to-end fashion. The advantage of the former is that it can provide a voxel-level functional description while completing the decoding tasks 5. However, recent breakthroughs in decoding have mainly been made with the latter models 31,36,37. In this work, we accomplished the downstream decoding tasks of image reconstruction and image identification based on the encoding model. The results show that our approach outperformed state-of-the-art methods on both decoding tasks. This finding further validates the effectiveness of our encoding model and indicates that encoding-based approaches have promising application potential in solving decoding tasks.
In conclusion, this work provides a powerful tool for neural encoding. On the one hand, we combined the structure of CNNs and the computational rules of SNNs to model the visual system and constructed voxel-wise encoding models based on the receptive field mechanism. On the other hand, we showed that our model can be used to implement practical decoding tasks, such as image reconstruction and image identification. We believe that SCNN-based encoding models will provide crucial insights into the visual mechanism and contribute to solving BCI and computer vision tasks. Compared with deep-learning networks, the architectures of SNNs are usually shallower, which limits their ability to extract complex and hierarchical visual features. Several recent studies have attempted to overcome this problem and have made some progress 23,24,38. The use of a deeper SCNN in our model would further improve the encoding performance and allow us to investigate the hierarchical structure of the visual cortex. Furthermore, we would like to extend SNNs to the encoding of other cognitive functions (e.g., imagination and memory) in the future.

Methods
SCNN-based encoding model. We proposed an SCNN-based encoding model to predict the fMRI activities evoked by input visual stimuli. The encoding model consists of an SCNN feature extractor and voxel-wise regression models. In brief, for each input image, we first used the unsupervised SCNN to extract the stimulus features and then built linear regression models to project the SCNN features onto the fMRI responses. The architecture of the encoding model is shown in Fig. 1a.
SCNN feature extractor. We adopted a simple two-layer SCNN to extract stimulus features. The first layer is the Difference of Gaussians (DoG) layer, which mimics the neural processing in retinal ganglion cells 39,40. In this layer, each input image is convolved with six DoG filters with zero padding. ON- and OFF-center DoG filters with sizes of , , and and standard deviations of , , and are used. The padding size is set to 6. Subsequently, the DoG features are converted into spike waves by intensity-to-latency encoding 41 with a length of 30.
Specifically, we sorted the DoG feature values greater than 50 in descending order and distributed them equally into 30 bins to generate the spike waves. Before passing to the next layer, the output spikes underwent a max pooling operation with a window size of and a stride of 2.
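The intensity-to-latency conversion described above can be sketched as follows, using the threshold of 50 and 30 time steps quoted in the text; the exact tie-breaking and binning details of the authors' SpykeTorch-based implementation may differ:

```python
import numpy as np

def intensity_to_latency(dog, n_steps=30, threshold=50):
    """Rank-order coding: DoG values above `threshold` are sorted in
    descending order and split evenly into `n_steps` bins, so larger
    values spike earlier. Returns a binary spike tensor of shape
    (n_steps,) + dog.shape."""
    spikes = np.zeros((n_steps,) + dog.shape, dtype=np.uint8)
    idx = np.argwhere(dog > threshold)           # supra-threshold positions
    if len(idx) == 0:
        return spikes
    order = np.argsort(-dog[tuple(idx.T)])       # descending intensity
    for t, b in enumerate(np.array_split(order, n_steps)):
        for k in b:
            spikes[(t,) + tuple(idx[k])] = 1     # one spike per position
    return spikes
```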
The second layer corresponds to the convolutional layer, which aims to mimic the information integration mechanism of the visual cortex. In this layer, we used 64 convolutional kernels made up of Integrate-and-Fire (IF) neurons to process the input spikes. The window size of the convolutional kernels is , and the padding size is 2. Each IF neuron gathers input spikes from its receptive field and emits a spike when its voltage reaches the threshold, which can be formulated as follows:

$$V_i(t) = V_i(t-1) + \sum_j w_{ij} s_j(t), \qquad \text{neuron } i \text{ fires when } V_i(t) \geq V_{thr},$$

where $V_i(t)$ denotes the voltage of IF neuron $i$ at time step $t$, $w_{ij}$ denotes the synaptic weight between neuron $i$ and input spike $s_j(t)$ within its receptive field, and $V_{thr}$ denotes the firing threshold, which is set to 10. For each image, neurons are allowed to fire once at most. In addition, we adopted an inhibition mechanism in the convolutional layer; that is, at each position in the feature maps, at most one neuron (the neuron with the earliest spike time) is allowed to fire. The synaptic weights are updated through STDP, which can be described as follows:

$$\Delta w_{ij} = \begin{cases} a^{+}\, w_{ij}(1 - w_{ij}), & t_j - t_i \leq 0, \\ a^{-}\, w_{ij}(1 - w_{ij}), & t_j - t_i > 0, \end{cases}$$

where $\Delta w_{ij}$ denotes the weight modification; $a^{+}$ and $a^{-}$ denote the learning rates and are set to 0.004 and −0.003, respectively; and $t_i$ and $t_j$ denote the spike times of neuron $i$ and input spike $j$, respectively. We calculated the learning convergence defined in Kheradpisheh, et al. 23 using the following formula:

$$C = \frac{\sum_{i,j} w_{ij}(1 - w_{ij})}{N},$$

where N denotes the total number of synaptic weights. The training of the convolutional layer is stopped when $C$ is less than 0.01. The implementation of the SCNN was based on the SpykeTorch platform 42. After training the SCNN, we set the firing threshold to infinity and measured the voltage value of each neuron at the last time step as the SCNN feature of the visual stimuli. Given that the voltages in the convolutional neurons accumulated over time and were never reset when the threshold was infinite, the final voltage values (SCNN features) reflected the activations of the SCNN to the visual stimuli.
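The convergence measure can be computed directly from the trained weights (a small sketch of the formula above):

```python
import numpy as np

def stdp_convergence(weights):
    """Learning convergence C = sum(w * (1 - w)) / N over all synaptic
    weights. C approaches 0 as weights saturate toward 0 or 1; training
    stops when C < 0.01, as stated in the text."""
    w = np.asarray(weights).ravel()
    return float((w * (1.0 - w)).sum() / w.size)
```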
Response prediction algorithm. With the SCNN features obtained as described above, we built a linear regression model for each voxel to predict its fMRI response. To determine the optimal receptive field location for each voxel, we went through all the locations on the SCNN feature maps, fit a regression model at each location, and selected the location with the best cross-validated prediction accuracy as the voxel's receptive field.

Downstream decoding tasks. Based on the encoding models, we accomplished two downstream decoding tasks: image reconstruction and image identification. Given the observed fMRI response, the image reconstruction task aims to reconstruct the perceived image, and the image identification task aims to identify which image was seen. The detailed implementations of the two tasks are described as follows.
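The exhaustive receptive-field search described in the response prediction algorithm can be sketched as follows; the 3-fold scheme mirrors the cross-validation mentioned in the Results, but details such as the fold assignment and scoring are our assumptions:

```python
import numpy as np

def find_receptive_field(feats, resp, n_folds=3):
    """For each (row, col) location of the SCNN feature maps, fit a
    linear model from the channel features at that location to the
    voxel response and keep the location with the best cross-validated
    correlation. feats: (n, C, H, W); resp: (n,)."""
    n, C, H, W = feats.shape
    folds = np.arange(n) % n_folds               # assumed fold assignment
    best, best_rc = -np.inf, (0, 0)
    for r in range(H):
        for c in range(W):
            X = np.c_[feats[:, :, r, c], np.ones(n)]   # features + bias
            scores = []
            for f in range(n_folds):
                tr, te = folds != f, folds == f
                w, *_ = np.linalg.lstsq(X[tr], resp[tr], rcond=None)
                pred = X[te] @ w
                if pred.std() > 0 and resp[te].std() > 0:
                    scores.append(np.corrcoef(pred, resp[te])[0, 1])
            score = np.mean(scores) if scores else -np.inf
            if score > best:
                best, best_rc = score, (r, c)
    return best_rc
```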
Image reconstruction. As shown in Fig. 1b, the implementation of the image reconstruction task relied on a large prior image set. First, we used the encoding model to produce the predicted fMRI responses for all the images in the prior image set. Second, we calculated the Pearson's correlation coefficients (PCCs) between the observed fMRI response and the predicted fMRI responses of the prior images. Finally, we averaged the prior images whose predicted fMRI responses had the highest PCCs with the observed one to obtain the reconstruction result.
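The reconstruction procedure above reduces to a few lines of correlation matching and averaging (top_k = 15 in the handwritten-character experiments reported in the Results; array shapes are illustrative):

```python
import numpy as np

def reconstruct(observed, prior_images, predicted, top_k=15):
    """Average the top_k prior images whose predicted responses best
    correlate with the observed response. observed: (n_vox,);
    prior_images: (n_prior, H, W); predicted: (n_prior, n_vox)."""
    o = observed - observed.mean()
    p = predicted - predicted.mean(axis=1, keepdims=True)
    pcc = (p @ o) / (np.linalg.norm(p, axis=1) * np.linalg.norm(o) + 1e-12)
    top = np.argsort(-pcc)[:top_k]               # best-matching priors
    return prior_images[top].mean(axis=0)
```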
Image identification. The implementation of the image identification task is shown in Fig. 1c. The images of the test data were fed into the encoding model to produce the predicted fMRI responses. Then, the PCCs between the predicted fMRI responses and the observed fMRI response were calculated. The image whose predicted fMRI response was most correlated with the observed one was regarded as the image seen by the subject.
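Identification is the same correlation matching as described above, but with an argmax over the candidate images instead of an average (a minimal sketch; shapes are illustrative):

```python
import numpy as np

def identify(observed, predicted):
    """Return the index of the candidate image whose predicted response
    correlates best with the observed response. observed: (n_vox,);
    predicted: (n_candidates, n_vox)."""
    o = observed - observed.mean()
    p = predicted - predicted.mean(axis=1, keepdims=True)
    pcc = (p @ o) / (np.linalg.norm(p, axis=1) * np.linalg.norm(o) + 1e-12)
    return int(np.argmax(pcc))
```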
fMRI datasets. To validate the encoding model, we employed two publicly available datasets that have been widely used in previous studies 1,25,31,36,43: the handwritten character and natural image datasets. Brief descriptions of the datasets are given below; details can be found in the original publications.
The handwritten character dataset contains the fMRI data of three participants as they viewed handwritten character images. A total of 360 images of 6 characters (B, R, A, I, N, and S) with a size of were presented to each participant; the images were taken from the TICH character dataset 44. A fixation point was added to each image as a white square. During the experiment, each image was presented for 1 s (flashed at 2.5 Hz) followed by a 3 s black background, and 3 T fMRI data were acquired simultaneously (TR = 1.74 s, voxel size = ). The voxel-level fMRI responses of visual areas V1 and V2 for each visual stimulus were estimated using general linear models 45. We adopted the same train/test set split as in the original work 25, which contained 270 and 90 class-balanced examples, respectively.
The natural image dataset contains the fMRI data from two participants as they viewed natural images 1 .
The experiment was divided into train and test stages. In the train stage, 1750 images were presented to the participants, and each image was shown for 1 s (flashed at 2 Hz) followed by a 3 s grey background.
In the test stage, participants were shown 120 images that were different from those used in the train stage. 3 T fMRI data were acquired simultaneously in both experimental stages (TR = 1 s, voxel size = ). The voxel-level fMRI responses of visual areas V1, V2, and V3 for each visual stimulus were estimated. To reduce computational complexity, the natural images were downsampled from to pixels.

Code Availability
The code that supports the findings of this study is available from https://github.com/wang1239435478/Neural-encoding-with-unsupervised-spiking-convolutional-spikingneural-networks.

Figures

Figure 1
The flow chart of the encoding and decoding processes. a The illustration of the encoding model. The proposed model uses a two-layer SCNN to extract visual features of the input images, and uses linear regression models to predict the fMRI responses for each voxel. b The diagram for the image reconstruction task, which aims to reconstruct the perceived images from the brain activity. c The diagram for the image identification task, which aims to identify which image is perceived based on the fMRI responses.

Figure 2
The encoding results on the handwritten character dataset. a The encoding performance (mean ± SEM) of the SCNN- and CNN-based models on the selected voxels. The accuracies of SCNN are significantly higher