Characterization of nonlinear receptive fields of visual neurons by convolutional neural network

A comprehensive understanding of the stimulus-response properties of individual neurons is necessary to crack the neural code of sensory cortices. However, a barrier to achieving this goal is the difficulty of analyzing the nonlinearity of neuronal responses. In computer vision, artificial neural networks, especially convolutional neural networks (CNNs), have demonstrated state-of-the-art performance in image recognition by capturing the higher-order statistics of natural images. Here, we incorporated CNNs into encoding models of neurons in the visual cortex to develop a new method of nonlinear response characterization, especially nonlinear estimation of receptive fields (RFs), without assumptions regarding the type of nonlinearity. Briefly, after training a CNN to predict the visual responses of neurons to natural images, we synthesized the RF image such that the image would predictively evoke a maximum response ("maximization-of-activation" method). We first demonstrated a proof of principle using a dataset of simulated cells with various types of nonlinearity, revealing that CNN could be used to estimate the nonlinear RF of simulated cells. In particular, we could visualize various types of nonlinearity underlying the responses, such as shift-invariant RFs or rotation-invariant RFs. These results suggest that the method may be applicable to neurons with complex nonlinearities, such as rotation-invariant neurons in higher visual areas. Next, we applied the method to a dataset of neurons in the mouse primary visual cortex (V1) whose responses to natural images were recorded via two-photon Ca2+ imaging. We could visualize shift-invariant RFs with Gabor-like shapes for some V1 neurons. By quantifying the degree of shift-invariance, each V1 neuron was classified as either a shift-variant (simple) cell or a shift-invariant (complex-like) cell, and these two types of neurons were not clustered in cortical space.
These results suggest that the novel CNN encoding model is useful in nonlinear response analyses of visual neurons and potentially of any sensory neurons.


A goal of sensory neuroscience is to comprehensively understand the stimulus-response properties of neuronal populations. In the visual cortex, such properties were first characterized by Hubel and Wiesel, who discovered the orientation and direction selectivity of simple cells in the primary visual cortex (V1) using simple bar stimuli [1]. Later studies revealed that the responses of many visual neurons, including even simple cells [2][3][4][5], display nonlinearity, such as shift-invariance in V1 complex cells [6]; size, position, and rotation-invariance in the inferotemporal cortex [7][8][9]; and viewpoint-invariance in a face patch [10]. Nevertheless, nonlinear response analyses of visual neurons have been limited thus far, and existing analysis methods are often designed to address specific types of nonlinearity underlying the neuronal responses. For example, the spike-triggered average [11] assumes linearity; moreover, the second-order Wiener kernel [12] and spike-triggered covariance [13][14][15] address second-order nonlinearity at most. In this study, we aim to analyze visual neuronal responses using an encoding model that does not assume the type of nonlinearity.
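The spike-triggered average mentioned above illustrates the linear baseline this approach moves beyond: it is simply the response-weighted mean of the stimuli. A minimal numpy sketch (variable names are ours, for illustration only):

```python
import numpy as np

def spike_triggered_average(stimuli, responses):
    """Response-weighted mean of stimuli (assumes a linear RF).

    stimuli:   (n_stimuli, n_pixels) array
    responses: (n_stimuli,) nonnegative response array
    """
    weights = responses / responses.sum()
    return weights @ stimuli

# Toy check: for a rectified linear cell driven by Gaussian white noise,
# the STA points along the true filter.
rng = np.random.default_rng(0)
true_filter = rng.standard_normal(64)
stimuli = rng.standard_normal((20000, 64))
responses = np.maximum(stimuli @ true_filter, 0)  # rectified linear cell
sta = spike_triggered_average(stimuli, responses)
similarity = sta @ true_filter / (np.linalg.norm(sta) * np.linalg.norm(true_filter))
```

For strongly nonlinear cells such as complex cells, this average washes out, which motivates the CNN-based approach developed here.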

An encoding model that is useful for nonlinear response analyses of visual neurons must capture the nonlinear stimulus-response relationships of neurons. Thus, the model should be able to predict neuronal responses to stimulus images with high accuracy [16] even if the responses are nonlinear. In addition, the features that the encoding model represents should be visualizable, at least in part, so that we can understand the neural computations underlying the responses. Artificial neural networks are promising candidates that may meet these criteria. Neural networks are mathematically universal approximators in that even a one-hidden-layer neural network with many hidden units can approximate any smooth function [17]. In computer vision, neural networks trained with large-scale datasets have yielded state-of-the-art and sometimes human-level performance in digit classification [18], image classification [19], and image generation [20], demonstrating that neural networks, especially convolutional neural networks (CNNs) [21,22], capture the higher-order statistics of natural images through hierarchical information processing.

In addition, recent studies in computer vision have provided techniques to extract and visualize the features learned in neural networks [23][24][25][26].

Several previous studies have used artificial neural networks as encoding models of visual neurons. These studies showed that artificial neural networks are highly capable of predicting neuronal responses to low-dimensional stimuli such as bars and textures [27,28] or to complex stimuli such as natural stimuli [29][30][31][32][33][34][35]. Furthermore, receptive fields (RFs) were visualized by the principal components of the network weights between the input and hidden layers [29], by linearization [31], and by inversion of the network to evoke at most 80% of maximum responses [32]. However, these indirect RFs are not guaranteed to evoke the highest response of the target neuron.

In this study, we first investigated whether nonlinear RFs could be directly estimated by CNN encoding models (Fig 1) using a dataset of simulated cells with various types of nonlinearities. We confirmed that CNN yielded the best accuracy among several encoding models in predicting visual responses to natural images. Moreover, by synthesizing the image such that it would predictively evoke a maximum response ("maximization-of-activation" method), nonlinear RFs could be accurately estimated.

Specifically, by repeatedly estimating RFs for each cell, we could visualize various types of nonlinearity underlying the responses without any explicit assumptions, suggesting that this method may be applicable to neurons with complex nonlinearities, such as rotation-invariant neurons in higher visual areas. Next, we applied the same procedures to a dataset of mouse V1 neurons, showing that CNN again yielded the best prediction accuracy among several encoding models and that shift-invariant RFs with Gabor-like shapes could be estimated for some cells from the CNNs. Furthermore, by quantifying the degree of shift-invariance of each cell using the estimated RFs, we classified V1 neurons as shift-variant (simple) cells and shift-invariant (complex-like) cells. Finally, these cells were not spatially clustered in cortical space. These results verify that nonlinear RFs of visual neurons can be characterized using CNN encoding models.

Nonlinear RFs could be estimated by CNN encoding models for simulated cells with various types of nonlinearities.
We generated a dataset comprising the stimulus natural images (2200 images) and the corresponding responses of simulated cells. To investigate the ability of CNN to handle various types of nonlinearities, we incorporated various basic nonlinearities into the data generation, including rectification, shift-invariance, and in-plane rotation-invariance, which were found in V1 simple cells

The responses were generated using one Gabor-shaped filter for a simple cell, two phase-shifted Gabor-shaped filters for a complex cell, and 36 rotated Gabor-shaped filters for a rotation-invariant cell.

We also added noise sampled from a Gaussian distribution such that the trial-to-trial variability of the simulated data was similar to that of real data.

We first used a dataset of simulated simple cells and complex cells and trained the CNN for each cell to predict responses with respect to the natural images (Fig 1). For comparison, we also constructed the following types of encoding models: an L1-regularized linear regression model (Lasso), an L2-regularized linear regression model (Ridge), a support vector regression model (SVR) with a radial basis function kernel, and a hierarchical structural model (HSM) [31]. The prediction accuracy of CNN, defined as the Pearson correlation coefficient between the predicted responses and actual responses in a 5-fold cross-validation manner, was high and better than that of the other models for both simple cells and complex cells (Fig 2C), ensuring that the stimulus-response relationships of these cells were successfully captured by CNN.
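The accuracy metric used for all model comparisons (Pearson correlation under 5-fold cross-validation) can be sketched as follows, with ridge regression standing in for an encoding model; the regularization strength and the data are illustrative, not from the paper:

```python
import numpy as np

def pearson_r(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cv_accuracy(X, y, fit, predict, k=5):
    """Mean Pearson correlation between held-out predictions and data."""
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(X[train_idx], y[train_idx])
        scores.append(pearson_r(predict(model, X[test_idx]), y[test_idx]))
    return float(np.mean(scores))

# Ridge regression as a stand-in encoding model (lambda chosen arbitrarily).
fit = lambda X, y: np.linalg.solve(X.T @ X + 1.0 * np.eye(X.shape[1]), X.T @ y)
predict = lambda w, X: X @ w

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20))
w_true = rng.standard_normal(20)
y = X @ w_true + 0.1 * rng.standard_normal(500)
acc = cv_accuracy(X, y, fit, predict)
```

The same scheme applies unchanged to any of the compared models; only `fit` and `predict` are swapped out.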

Next, we visualized the RF of each cell using the maximization-of-activation approach (see Materials and Methods) [23,24], where the RF was regarded as the image that evoked the highest activation of the output layer of the trained CNN. We performed this RF estimation 100 times independently for each cell, utilizing the empirical fact that independent iterations of the RF estimation process create different RF images by finding different maxima [23]. Fig 2D and 2F show 20 of the 100 RF images estimated by the trained CNN (CNN RF images) for a representative simple cell and complex cell, respectively. The predicted responses with respect to these RF images were all > 99% of the maximum response in the actual data of each cell, ensuring that the activations of the CNN output layers were indeed maximized. All visualized RF images had clearly segregated ON and OFF subregions, and the structure was close to the
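The maximization-of-activation step is, at its core, gradient ascent on the input image. In the sketch below, a toy differentiable "model" with a known optimum stands in for the trained CNN, and an analytic gradient stands in for backpropagation; only the loop structure and the final normalization mirror the described procedure:

```python
import numpy as np

def maximize_activation(grad_fn, shape, lr=0.1, n_steps=200, seed=0):
    """Gradient ascent on the input image to maximize model output.

    Starts from noise; independent restarts with different seeds can reach
    different maxima, which is how multiple RF images per cell arise.
    """
    rng = np.random.default_rng(seed)
    img = rng.standard_normal(shape)
    for _ in range(n_steps):
        img += lr * grad_fn(img)
    return (img - img.mean()) / img.std()  # zero-mean, unit-std RF image

# Toy "model": output = -||img - target||^2, maximized exactly at target.
target = np.linspace(-1, 1, 16).reshape(4, 4)
grad_fn = lambda img: -2.0 * (img - target)  # analytic gradient of the toy model
rf = maximize_activation(grad_fn, target.shape, lr=0.1)
```

With the real CNN, `grad_fn` would be the gradient of the output-layer activation with respect to the input image, obtained by backpropagation.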

Furthermore, when RF images were compared within a cell, the RF images of cell #29 had ON and OFF subregions in nearly identical positions, while some RF images of cell #31 were shifted in relation to one another. These observations are consistent with the assumption that cell #29 is a simple cell and cell #31 is a complex cell.

For complex cells, we expect that RF estimation using linear methods would fail to generate an image with clearly segregated ON and OFF subregions, whereas nonlinear RF estimation would succeed [14].

Thus, the similarity between a linearly estimated RF image (linear RF) and a nonlinearly estimated RF image is expected to be low for complex cells. We performed linear RF estimations following a previous study [38]. Although the linear RF image and the CNN RF image were similar for cell #29 (Fig 2E), the linear RF image for cell #31 was ambiguous, lacked clear subregions, and was in sharp contrast to the CNN RF image (Fig 2G). These results are again consistent with the assumption that cell #29 is a simple cell and cell #31 is a complex cell.

Next, we comprehensively analyzed the RFs of populations of simulated simple cells and complex cells. Cells with a CNN prediction accuracy ≤ 0.3 were omitted from the analyses (Fig 2C). First, the similarity between a linear RF image and a CNN RF image, measured as the maximum normalized pixelwise dot product between a linear RF image and the 100 CNN RF images, was distinctly different between simple cells and complex cells (Fig 2J), reflecting different degrees of nonlinearity. Second, the accuracy of Gabor-kernel fitting of the CNN RF image, measured as the pixelwise Pearson correlation coefficient between a CNN RF image and the fitted Gabor kernel, was high among all analyzed cells (Fig 2H), confirming that the estimated RFs had a shape similar to a Gabor kernel. Third, the maximum similarity between each filter used in the response generation and the 100 CNN RF images was high for both simple cells and complex cells (Fig 2I). Fourth, the orientations of the CNN RF images, estimated by fitting them to Gabor kernels, were nearly identical to the orientations of the filters of the response generators (circular correlation coefficient [39] = 0.92; Fig 2K). These results suggest that the RFs estimated by the CNN encoding models had a similar structure to the ground truth and that the shift-invariant property of complex cells was successfully visualized from iterative RF estimations.
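The similarity measure used here (maximum normalized pixelwise dot product between one image and a stack of CNN RF images) can be written compactly; `rf_similarity` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def rf_similarity(reference_rf, cnn_rfs):
    """Max normalized pixelwise dot product between a reference RF image
    and a stack of CNN RF images of shape (n_images, h, w)."""
    a = reference_rf.ravel()
    a = a / np.linalg.norm(a)
    sims = []
    for rf in cnn_rfs:
        b = rf.ravel()
        sims.append(a @ b / np.linalg.norm(b))
    return max(sims)

# If the stack contains a copy of the reference, the similarity is exactly 1.
rng = np.random.default_rng(2)
img = rng.standard_normal((8, 8))
stack = np.stack([rng.standard_normal((8, 8)), img])
sim = rf_similarity(img, stack)
```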

We also performed similar analyses using a dataset of simulated rotation-invariant cells. When trained to predict the responses with respect to the natural images, CNNs again yielded high prediction accuracy (Fig 3B). Next, we estimated RFs using the maximization-of-activation approach independently 1000 times for each cell. The predicted responses with respect to these RF images were all > 99% of the maximum response in the actual data of each cell, ensuring that the activations of the CNN output layers were indeed maximized. As shown in Fig 3C, the visualized RF images of cell #1 had Gabor shapes close to the filters used in the response generation (Fig 3A). In addition, some RF images were rotated in relation to one another, consistent with the rotation-invariant response property of this cell. Finally, we quantitatively compared the RFs (1000 RF images for each cell) and the filters of the response generator (36 filters for each cell). For each filter, the maximum similarity with the 1000 CNN RF images was high (Fig 3D),

suggesting that the estimated RFs had various orientations and a similar structure to the ground truth. Thus, using the proposed RF estimation approach, RFs were successfully estimated by the CNN encoding models, and various types of nonlinearity could be visualized from multiple RFs without assumptions, although the hyperparameters and layer structures of the CNNs were unchanged across cells.

CNN yielded the best accuracy for prediction of the visual responses of V1 neurons.

Next, we used a dataset comprising the stimulus natural images (200−2200 images) and the corresponding real neuronal responses (N = 2465 neurons, 4 planes), which were recorded using two-photon Ca2+ imaging from mouse V1 neurons. To investigate whether CNN was able to capture the stimulus-response relationships of V1 neurons, we trained the CNN for each neuron to predict the neuronal responses to the natural images (Fig 1). The prediction accuracy was again measured by the Pearson correlation coefficient between the predicted responses and actual responses of the held-out test data in a 5-fold cross-validation manner (N = 2455 neurons that were not used for the hyperparameter optimizations; see Materials and Methods). Comparison of the prediction accuracies among several types of encoding models revealed that CNN outperformed the other models (Fig 4A), and the predictions of the CNNs were accurate (Fig 4B and 4C).

These results show that the stimulus-response relationships of V1 neurons were successfully captured by CNN, demonstrating the efficacy of using CNN for further RF analyses of V1 neurons.

Estimation of nonlinear RFs of V1 neurons from CNN encoding models.
Next, we visualized the RF of each neuron by the maximization-of-activation approach (see Materials and Methods) [23,24]. Neurons with a CNN prediction accuracy ≤ 0.3 were omitted from this analysis (Fig 4B). The accuracy of Gabor-kernel fitting, measured as the pixelwise Pearson correlation coefficient between the RF image and the fitted Gabor kernel, was high among all analyzed neurons (median r = 0.77; Fig 5E), suggesting that the RF images generated from the trained CNNs (CNN RF images) successfully captured the Gabor-like structure of RFs observed in V1. We also performed linear RF estimations following a previous study [38].

Although the linear RF image and the CNN RF image were similar for neuron #639, the linear RF image for neuron #646 was ambiguous, lacked clear subregions, and was in sharp contrast to the CNN RF image (Fig 5A and 5B), suggesting that neuron #639 would be linear and neuron #646 would be nonlinear. Supporting this idea, further analysis (see below) revealed that neuron #639 was a shift-variant (simple) cell, and neuron #646 was a shift-invariant (complex-like) cell. The similarity between a linear RF image and a CNN RF image, measured as the normalized pixelwise dot product between these two images, varied among the analyzed neurons (Fig 5D), reflecting the distributed nonlinearity of V1 neurons.

Estimated RFs of some V1 neurons were shift-invariant.
We then performed 100 independent CNN RF estimations for each V1 neuron to characterize the nonlinearity of RFs. We especially focused on shift-invariance, the most well-studied nonlinearity in V1.

The predicted responses with respect to these RF images were all > 99% of the maximum response in the actual data of each neuron, ensuring that the activations of the CNN output layers were indeed maximized.

Importantly, the RF images of neuron #639 had ON and OFF subregions in nearly identical positions (Fig 6A).

In contrast, some RF images of neuron #646 were horizontally shifted in relation to one another (Fig 6B),

suggesting that neuron #646 is shift-invariant and could be a complex cell.

Characterization of shift-invariance from iteratively estimated RF images.
To quantitatively understand the shift-invariance, we developed predictive models of visual responses for each simulated complex cell and V1 neuron, termed the simple model and the complex model, inspired by the stimulus-response properties of simple and complex cells. In the simple model, the response to a stimulus was predicted as the normalized dot product between the stimulus image and an RF image. The RF image that yielded the best prediction accuracy was chosen and used for all stimulus images (Fig 7A). In contrast, in the complex model, the response to each stimulus was predicted as the maximum of the normalized dot products between the stimulus image and several RF images (Fig 7B). Here, the RF images used in these models were selected from the 100 RF images as those that were shifted relative to one another. If there was no shifted RF image, the complex model was identical to the simple model (see Materials and Methods).
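The two predictive models can be stated in a few lines; `simple_model` and `complex_model` are our illustrative names for the described computations:

```python
import numpy as np

def normalized_dot(a, b):
    return (a.ravel() @ b.ravel()) / (np.linalg.norm(a) * np.linalg.norm(b))

def simple_model(stimulus, rf):
    """One fixed RF image is used for every stimulus."""
    return normalized_dot(stimulus, rf)

def complex_model(stimulus, shifted_rfs):
    """Max over the shifted-image set; reduces to the simple model
    when the set holds a single RF image."""
    return max(normalized_dot(stimulus, rf) for rf in shifted_rfs)

rng = np.random.default_rng(5)
s, rf1, rf2 = (rng.standard_normal((8, 8)) for _ in range(3))
```

Taking the maximum over shifted copies is what lets the complex model tolerate spatial shifts of the stimulus pattern, mirroring complex-cell behavior.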

We also compared complexness with other indices of linearity and nonlinearity using a dataset

Simple cells and complex-like cells were not spatially clustered in V1.
Finally, we tested whether simple cells and complex-like cells were spatially organized in cortical space. We first investigated the spatial structure of complexness by comparing the difference in complexness with the cortical distance between all neuron pairs (N = 129451 neuron pairs). We found no correlation between complexness and cortical distance (r = −0.01), suggesting no distinct spatial organization of complexness (Fig 9A, left, and 9B). We also calculated the cortical distances of all simple

Estimation of nonlinear RFs from CNN encoding models.
We first revealed that the accuracy of CNN in predicting responses to natural images was high for both simulated cells and V1 neurons (Figs 2C, 3B, 4B). This finding is not surprising in light of the recent successes of artificial neural networks, especially CNNs, in computer vision [18][19][20]. Such successes could be attributed to the ability of CNNs to acquire sophisticated statistics of high-dimensional data [42].

Likewise, the high prediction accuracy of CNN shown in this study is possibly due to its ability to capture

After each stimulus image was gray-scaled, it was preprocessed such that its contrast was 99.9% and its mean intensity across pixels was at an intensity level of approximately 50%, and then masked with a circle with a 60-degree diameter. The stimulus presentation protocol consisted of 3−12 sessions. In one session, images were ordered pseudo-randomly, and each image was flashed three times in a row. Each flash was presented for 200 ms with 200-ms intervals between flashes, during which a gray screen was presented.

Acquisition of simulated data
The following types of artificial cells were simulated in this study: simple, complex, and rotation-invariant cells. A simple cell was modeled using a "linear-nonlinear" cascade formulated as shown below, where the response to a stimulus was defined as the dot product between the stimulus image s and a Gabor-shaped filter f1, followed by a rectifying nonlinearity [2] and Gaussian noise (Fig 2A).

r = max(s · f1, 0) + ε   (2)

where ε is the Gaussian noise. A complex cell was modeled using an energy model with two subunits [36,37]. In this model, each subunit calculated the dot product between the stimulus image s and a Gabor-shaped filter f1 or f2. Then, the outputs of these two subunits were squared, summed together, and the square root was taken. Finally, Gaussian noise was added to define the response (Fig 2B). Here, the Gabor-shaped filters used in this model had identical amplitude, position, size, spatial frequency, and orientation; the phase was shifted by 90 degrees. Note that this procedure, r = √((s · f1)² + (s · f2)²) + ε, can also be viewed as a "linear-nonlinear-linear-nonlinear" cascade [30,56].
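Both generators can be sketched directly from these descriptions. For brevity, the demonstration below uses 1D cosine/sine gratings in place of the 90-degree phase-shifted Gabor filters; the phase-invariance of the energy model is the same either way:

```python
import numpy as np

def simple_cell(s, f1, noise=0.0):
    # r = max(s . f1, 0) + noise  (rectified linear-nonlinear cascade)
    return max(s.ravel() @ f1.ravel(), 0.0) + noise

def complex_cell(s, f1, f2, noise=0.0):
    # Energy model: sqrt((s.f1)^2 + (s.f2)^2) + noise,
    # with f2 the 90-degree phase-shifted version of f1.
    u1, u2 = s.ravel() @ f1.ravel(), s.ravel() @ f2.ravel()
    return np.sqrt(u1**2 + u2**2) + noise

# Phase-invariance check: the energy response to a grating is the same
# regardless of the grating's phase.
x = 2 * np.pi * np.arange(64) / 64
f1, f2 = np.cos(3 * x), np.sin(3 * x)
r0 = complex_cell(np.cos(3 * x + 0.0), f1, f2)
r1 = complex_cell(np.cos(3 * x + 1.0), f1, f2)
```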

A rotation-invariant cell was modeled using 36 subunits. The i-th subunit (1 ≤ i ≤ 36) calculated the dot product between the stimulus image s and a Gabor-shaped filter fi. After the maximum of the outputs of the subunits was taken, Gaussian noise was added to define the response (Fig 3A). Here, the Gabor-shaped filters fi used in this model had identical amplitude, position, size, spatial frequency, and phase; the orientation of the i-th subunit was 5(i − 1) degrees.
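A sketch of this max-over-subunits generator, with plain oriented gratings standing in for the Gabor-shaped filters:

```python
import numpy as np

def rotation_invariant_cell(s, filters, noise=0.0):
    """Max over subunit outputs plus noise; filters is a list of
    oriented arrays at 0, 5, ..., 175 degrees."""
    return max(s.ravel() @ f.ravel() for f in filters) + noise

def oriented_grating(theta, n=16, k=2.0):
    # Stand-in for a Gabor-shaped filter: an oriented cosine grating.
    y, x = np.mgrid[0:n, 0:n] / n
    return np.cos(2 * np.pi * k * (x * np.cos(theta) + y * np.sin(theta)))

filters = [oriented_grating(np.deg2rad(5 * i)) for i in range(36)]
# Stimuli at different orientations drive the cell almost equally,
# because some subunit always matches the stimulus orientation.
r40 = rotation_invariant_cell(oriented_grating(np.deg2rad(40)), filters)
r45 = rotation_invariant_cell(oriented_grating(np.deg2rad(45)), filters)
```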

The noise was randomly sampled from a Gaussian distribution with a mean of zero and a standard deviation of one, which resulted in trial-to-trial variability similar to that of real data.

Data preprocessing
Data analyses were performed using Matlab (Mathworks, Natick, MA, USA) and Python (2.7.13, 3.5.2, and 3.6.1). For real neural data, images were phase-corrected and aligned between frames [57]. To determine regions of interest (ROIs) for individual cells, images were averaged across frames, and slow spatial frequency components were removed from the frame-averaged image with a two-dimensional Gaussian filter whose standard deviation was approximately five times the diameter of the soma. ROIs

(convolutional, max-pooling, or fully connected) were optimized with 5-fold cross-validation for the data of 10 real V1 neurons. The optimized hyperparameters of the CNN were as follows: the size of the mini-batch was 5 or 30 (depending on the size of the dataset), the dropout rate of fully connected layers was 0.5, the optimizer was SGD, the learning rate decay coefficient was 5 × 10−5, and the hidden layer structure was 4 successive convolutional layers and one pooling layer, followed by one fully connected layer (Fig 1). Other hyperparameters were fixed.

The training was formulated as the minimization of a loss function E between the model prediction f(I; W) and the actual response t,

where I is an image, t is the response, W denotes the parameters, and f is the model. In the SGD update with momentum, w is the parameter to be updated, m is the momentum coefficient (0.9), v is the momentum variable, ε is the learning rate (the initial learning rate was 0.1), and E(w) is the loss with respect to the batched data. Adam was formulated as previously suggested [65]. The training iterations were stopped upon saturation of the prediction accuracy for the validation set.
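The classical momentum update this passage describes is usually written v ← m·v − ε∇E(w), w ← w + v; a sketch on a toy quadratic loss (we assume the standard formulation, since the paper's exact variant is not reproduced here):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, m=0.9, eps=0.1):
    """Classical momentum: v <- m*v - eps*grad(E); w <- w + v."""
    v = m * v - eps * grad
    return w + v, v

# Toy quadratic loss E(w) = ||w||^2 / 2, whose gradient is w itself;
# the iterates should spiral in toward the minimum at zero.
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=w, m=0.9, eps=0.1)
```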

The response prediction accuracy of each encoding model was evaluated in a 5-fold cross-validation manner. In the gradient-based RF estimation, γ is the decay coefficient (0.95) and α is the learning rate (1.0). The generated image was finally processed such that its mean was zero and its standard deviation was one (RF image). To confirm that the generated RF image maximally activated the output layer, the whole process was repeated independently until we generated an image for which the predicted response was high (for most cells, > 95% of the maximum response of the actual data of each cell). Note that for the representative cells (Figs 2D, 2E, 3C, and 4B), the predicted responses to the generated RF images were > 99% of the maximum response of the actual data.

To quantitatively assess the generated RF images, we fitted each RF image with a Gabor kernel G(x, y) using sequential least-squares programming implemented in Scipy (0.19.0). A Gabor kernel, a product of a two-dimensional Gaussian envelope and a sinusoidal wave, was parameterized as follows: A is the amplitude, σ1 and σ2 are the standard deviations of the envelope, k0 is the frequency, τ is the phase, (x0, y0) is the center coordinate, and θ is the orientation. The goal of the fitting was to minimize the pixelwise absolute error between the RF image and a Gabor kernel. This optimization was started with seven different initial x0 and seven different initial y0 values to ensure that the optimization reached the global minimum. In addition, to create a reasonable Gabor kernel, we set bounds for some of the parameters: 0 ≤ x0 / Lx ≤ 1, 0 ≤ y0 / Ly ≤ 1, 0 ≤ σ1 / Lx ≤ 0.2, 0 ≤ σ2 / Ly ≤ 0.2, and π/3 ≤ k0 ≤ π, where Lx and Ly are the sizes of the RF image in the x and y dimensions, respectively. The accuracy of the Gabor fitting was evaluated by the pixelwise Pearson correlation coefficient between the original RF image and the fitted Gabor kernel.
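One common parameterization of such a Gabor kernel, consistent with the listed parameters, is sketched below; since the paper's exact formula is not reproduced here, the rotation and phase conventions are our assumptions:

```python
import numpy as np

def gabor_kernel(shape, A, sigma1, sigma2, k0, tau, x0, y0, theta):
    """Gabor kernel: 2D Gaussian envelope times a sinusoid (one common
    parameterization; the paper's exact conventions are assumed)."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    # Rotate coordinates by theta around the center (x0, y0).
    xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    envelope = np.exp(-xr**2 / (2 * sigma1**2) - yr**2 / (2 * sigma2**2))
    return A * envelope * np.cos(k0 * xr + tau)

g = gabor_kernel((32, 32), A=1.0, sigma1=4.0, sigma2=6.0,
                 k0=0.8, tau=0.0, x0=16, y0=16, theta=np.pi / 4)
```

At the center pixel the envelope equals 1 and the sinusoid equals cos(τ), so the peak value is A·cos(τ).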

Linear RF images were created by a regularized pseudoinverse method described previously [38]. The regularization parameter was optimized for each cell by exhaustive grid search in a 10-fold cross-validation manner. For each value in the grid, responses to the held-out test data were predicted using the created RF image. Prediction accuracy was calculated as the Pearson correlation coefficient between the predicted responses and actual responses. The linear RF image was created using the value with the highest prediction accuracy as the regularization parameter.
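In its simplest form, a regularized pseudoinverse RF estimate is ridge regression of the responses on the stimulus pixels; the sketch below shows this standard form (the exact regularizer of [38] may differ):

```python
import numpy as np

def linear_rf(S, r, lam):
    """Regularized pseudoinverse estimate: (S^T S + lam*I)^-1 S^T r.

    S: (n_stimuli, n_pixels) stimulus matrix; r: (n_stimuli,) responses;
    lam: regularization parameter (chosen by grid search in practice).
    """
    n_pixels = S.shape[1]
    return np.linalg.solve(S.T @ S + lam * np.eye(n_pixels), S.T @ r)

# Noiseless linear cell: the estimate should recover the true filter.
rng = np.random.default_rng(3)
S = rng.standard_normal((2000, 25))
true_rf = rng.standard_normal(25)
r = S @ true_rf
est = linear_rf(S, r, lam=1e-3)
```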

Quantification of shift-invariance (complexness)
To distinguish between simple cells and complex-like cells, we created a "shifted image set", which contained CNN RF images that were shifted with respect to one another, selected from the 100 CNN RF images. For this purpose, a zero-mean normalized cross-correlation (ZNCC) was calculated for every pair of RF images I1 and I2, where (u, v) is a pixel shift and Ī1 is the mean of I1. If the ZNCC was above 0.95 for a (u, v) pair ((u, v) ≠ (0, 0)), these two RF images were defined as shifted relative to each other by (u, v) pixels. Then, for each pair of shifted RF images, we calculated the shift distance as the maximum length of the (u, v) vectors projected orthogonally to the Gabor orientation. Finally, starting with the two RF images with the largest shift distance, we iteratively collected RF images that were shifted from the already collected RF images to create the "shifted image set". If none of the 100 RF images were shifted relative to another, the "shifted image set" consisted of the RF image with the highest predicted response.
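A minimal sketch of the ZNCC computation for a given pixel shift; boundary handling is simplified here with wrap-around (`np.roll`), which the paper may not use:

```python
import numpy as np

def zncc(i1, i2, u, v):
    """Zero-mean normalized cross-correlation of i1 with i2 shifted by
    (u, v) pixels (wrap-around boundaries for simplicity)."""
    shifted = np.roll(i2, shift=(v, u), axis=(0, 1))
    a = i1 - i1.mean()
    b = shifted - shifted.mean()
    return (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b))

# An image and its 3-pixel-shifted copy reach ZNCC = 1 at the undoing shift.
rng = np.random.default_rng(4)
img = rng.standard_normal((16, 16))
shifted_img = np.roll(img, shift=(0, 3), axis=(0, 1))
score = zncc(img, shifted_img, u=-3, v=0)
```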

A simple model and a complex model were created for each cell as follows (Fig 7). In the simple model, the response to a stimulus image was predicted as the normalized dot product between the stimulus image and one RF image selected from the "shifted image set". The RF image that yielded the best

where ACCsimple and ACCcomplex are the response prediction accuracies of the simple model and the complex model, respectively. Cells with a Gabor fitting accuracy ≤ 0.6, ACCsimple < 0, or ACCcomplex < 0 were omitted from this analysis.

Spatial organizations of simple cells and complex-like cells
The spatial organizations of simple cells and complex-like cells were evaluated in two ways. First, for each pair of neurons, we calculated the in-between cortical distance and the difference in complexness. A relationship between the cortical distances and the complexness differences is indicative of a spatial