Evidence for the intrinsically nonlinear nature of receptive fields in vision

The responses of visual neurons, as well as visual perception phenomena in general, are highly nonlinear functions of the visual input, while most vision models are grounded on the notion of a linear receptive field (RF). The linear RF has a number of inherent problems: it changes with the input, it presupposes a set of basis functions for the visual system, and it conflicts with recent studies on dendritic computations. Here we propose to model the RF in a nonlinear manner, introducing the intrinsically nonlinear receptive field (INRF). Apart from being more physiologically plausible and embodying the efficient representation principle, the INRF has a key property of wide-ranging implications: for several vision science phenomena where a linear RF must vary with the input in order to predict responses, the INRF can remain constant under different stimuli. We also show that Artificial Neural Networks with INRF modules instead of linear filters have a remarkably improved performance and better emulate basic human perception. Our results suggest a change of paradigm for vision science as well as for artificial intelligence.


From the INRF model to an INRF-module
Let us first define the INRF model for 2D input data u, 2D kernels m, w, g, a scalar parameter λ and a fixed nonlinear activation function σ, at a specific location x:

INRF(x; u, m, w, g, λ, σ) = Σ_{i∈I(x)} m_i u(y_i) − λ Σ_{j∈J(x)} w_j σ( u(y_j) − Σ_{i∈I(x)} g_i u(y_i) )    (1)

where I(x), J(x) denote two sets of indexes corresponding to the neighboring pixels in two 2D windows centered on location x (with a slight abuse of notation, they also refer to the corresponding elements of the 2D kernels m, w and g). Setting g to be one when y_i = x and zero everywhere else, and letting m and w be the same kernel, the equation above reduces to

INRF(x) = Σ_{i∈I(x)} w_i u(y_i) − λ Σ_{i∈I(x)} w_i σ( u(y_i) − u(x) )    (2)

which is the equation that we implement as the INRF-module.
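A direct, single-location reading of equation (2) can be sketched in a few lines of NumPy (an illustration only, not the authors' implementation; the choice of σ = tanh and the border handling are our assumptions):

```python
import numpy as np

def inrf_at(u, x, y, w, lam, sigma=np.tanh):
    """Evaluate equation (2) at location (x, y) for a (2k+1)x(2k+1) kernel w.

    Minimal NumPy sketch: `sigma` (here tanh) stands in for the inner
    nonlinearity, and border locations are simply not handled.
    """
    k = w.shape[0] // 2
    patch = u[x - k:x + k + 1, y - k:y + k + 1]   # neighbours u(y_i)
    linear = np.sum(w * patch)                    # Σ_i w_i u(y_i)
    nonlin = np.sum(w * sigma(patch - u[x, y]))   # Σ_i w_i σ(u(y_i) − u(x))
    return linear - lam * nonlin

u = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0
print(inrf_at(u, 2, 2, w, lam=0.0))   # 12.0: with λ = 0 this is a plain linear RF
```

With λ = 0 the module reduces exactly to the usual linear receptive field response, which makes the role of the nonlinear term easy to isolate.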

Implementation details of an INRF-module
An INRF-module should calculate the expression defined in equation (2) for every possible location x. In order to perform these operations efficiently we have designed an implementation based on convolutions, which allows us to exploit the fast computation of these operations in modern machine learning libraries. Please notice that here we use the term convolution (with the notation '*') to describe the operation of point-wise multiplication followed by a sum of all the resulting products, as is typical in the neural network literature. First, let us slightly modify the notation in equation (2) and write

INRF(x) = Σ_{i∈I(x)} w_i ( u(y_i, x) − λ σ( u(y_i, x) − u(x) ) )    (3)

because the neighbor elements u(y_i, x) are relative to a particular location x. Moreover, we can simplify this notation by defining u_i(x) = u(y_i, x), hence writing

INRF(x) = Σ_i w_i ( u_i(x) − λ σ( u_i(x) − u(x) ) )    (4)

For the sake of simplicity we will continue this explanation for a 3 × 3 kernel w, but all of this can be generalized to any other kernel size. We also present first the case of 2D input and output data. We denote by w_i, i = 1, . . . , 9 the elements of w from top to bottom and left to right, i.e. w_1 is the top-left element of the kernel and w_9 the bottom-right one. Analogously, u_1(x) = u(y_1, x) is the top-left element of the 3 × 3 window centered at u(x) and u_9(x) = u(y_9, x) is the bottom-right element of this same window.

Figure 1. Implementation details of an INRF-module for 2D input and output data.
A summary of the implementation is depicted in figure 1. We will unroll the operations from top to bottom. We first define

s_i(x) = u_i(x) − λ σ( u_i(x) − u(x) )

and then from (4) we have that

INRF(x) = Σ_i w_i s_i(x)

which is step iii) in figure 1. Notice that the same value w_i multiplies s_i(x) at every position x, so we first calculate the products w_i · s_i and then sum them pointwise to obtain the final result. Each s_i is also a matrix of the same size as the input data u that at each location x depends only on u_i(x) and u(x); it can therefore be calculated for all locations x simultaneously (step ii)).
Finally, in step i) we show how the matrices u_i, which are defined as u_i(x) = u(y_i, x), are obtained. The input data u is convolved with kernels K_1, . . . , K_9 to obtain the matrices u_1, . . . , u_9. Each kernel K_i is a matrix that is '0' everywhere except at the i-th position, which stores a '1'. In order to keep the same width and height after each convolution, zero padding is used, as is usual in convolutional neural networks.
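Steps i)-iii) can be sketched for a whole 2D image at once in NumPy (illustrative only, not the authors' GPU code; the function name is ours, σ = tanh is an assumed activation, and the K_i shift-convolutions are realized as slices of a zero-padded copy):

```python
import numpy as np

def inrf_module_2d(u, w, lam, sigma=np.tanh):
    """INRF-module over a full 2D image for a 3x3 kernel w (NumPy sketch).

    Step i)  builds the shifted images u_i via zero padding (the K_i kernels),
    step ii) forms s_i = u_i - lam * sigma(u_i - u),
    step iii) accumulates Σ_i w_i * s_i.
    """
    H, W = u.shape
    up = np.pad(u, 1)                        # zero padding keeps output size
    out = np.zeros_like(u, dtype=float)
    for i in range(3):
        for j in range(3):
            u_i = up[i:i + H, j:j + W]       # step i): shifted copy of u
            s_i = u_i - lam * sigma(u_i - u) # step ii)
            out += w[i, j] * s_i             # step iii)
    return out

u = np.arange(25.).reshape(5, 5)
w = np.ones((3, 3)) / 9.0
print(inrf_module_2d(u, w, lam=0.0)[2, 2])   # 12.0, the linear RF at the centre
```

Because every s_i is computed for all locations simultaneously, the loop runs only over the 9 kernel positions, never over pixels, mirroring the convolution-based design described above.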
This implementation using convolutions allows us to extend the INRF-module to any input data size and any output size by just adapting the size of the kernel w. Let us first write (4) for the case in which the input data u has NCh_in channels:

INRF(x) = Σ_{c=1}^{NCh_in} Σ_i w_i^c ( u_i^c(x) − λ σ( u_i^c(x) − u^c(x) ) )    (5)

where the channel is indicated by the superscript c. Moreover, the output INRF can have NCh_out channels, so the INRF at position x becomes a vector of NCh_out components:

INRF(x) = ( INRF^1(x), . . . , INRF^{NCh_out}(x) )    (6)

and each of these components is obtained as

INRF^o(x) = Σ_{c=1}^{NCh_in} Σ_i w_i^{c,o} ( u_i^c(x) − λ σ( u_i^c(x) − u^c(x) ) )    (7)

This generalization for an arbitrary number of input and output channels is represented in figure 2, again for a 3 × 3 kernel. In fact, the kernel is implemented as 3 · 3 = 9 kernels of size 1 × 1 × NCh_in × NCh_out, where NCh_in, NCh_out are respectively the number of input and output channels.
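The multi-channel generalization can be sketched the same way (a NumPy illustration, not the authors' PyTorch code; names are ours): each of the 9 shift positions applies a 1 × 1 × NCh_in × NCh_out channel-mixing kernel, here expressed with an einsum.

```python
import numpy as np

def inrf_module(u, w, lam, sigma=np.tanh):
    """Multi-channel INRF-module sketch for input u of shape (H, W, C_in)
    and kernel w of shape (3, 3, C_in, C_out).

    Borders use zero padding; sigma is an assumed inner nonlinearity.
    """
    H, W, C_in = u.shape
    C_out = w.shape[-1]
    up = np.pad(u, ((1, 1), (1, 1), (0, 0)))       # pad space, not channels
    out = np.zeros((H, W, C_out))
    for i in range(3):
        for j in range(3):
            u_i = up[i:i + H, j:j + W, :]          # shifted copy, all channels
            s_i = u_i - lam * sigma(u_i - u)       # (H, W, C_in)
            # 1 x 1 x C_in x C_out kernel applied as channel mixing
            out += np.einsum('hwc,co->hwo', s_i, w[i, j])
    return out

u = np.ones((4, 4, 2))
w = np.ones((3, 3, 2, 3))
out = inrf_module(u, w, lam=0.5)
print(out.shape)   # (4, 4, 3): NCh_in = 2 mapped to NCh_out = 3
```

Note that on a constant image u_i − u vanishes at interior locations, so the nonlinear term contributes nothing there regardless of λ, which is a handy sanity check for any implementation.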

Implementing INRF inside a convolutional neural network

INRF layer
A convolutional layer is a linear + nonlinear operation composed of a cross-correlation (a convolution without flipping the kernel w) plus a bias term, evaluated in a nonlinear function, as seen for instance in 1 :

y = σ_2( w ∗ x + b )    (9)

In equation 9 the operation w ∗ x is the cross-correlation (hereafter called convolution, following the literature's terminology) between a tensor of weights w and an input tensor x, b is a bias term, and σ_2 is a nonlinear function (called activation function).
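As a minimal illustration of this terminology, the sketch below evaluates such a "convolution" at one location: pointwise products with the unflipped kernel, summed (names and values are ours, chosen so the lack of kernel flipping is visible):

```python
import numpy as np

def nn_conv_at(x, w, i, j):
    """Neural-network-style 'convolution' (cross-correlation) of x with an
    unflipped (2k+1)x(2k+1) kernel w, evaluated at location (i, j)."""
    k = w.shape[0] // 2
    return np.sum(w * x[i - k:i + k + 1, j - k:j + k + 1])

x = np.arange(16.).reshape(4, 4)
w = np.array([[0., 0., 0.],
              [0., 0., 1.],   # a 1 to the right of centre picks the right-hand neighbour
              [0., 0., 0.]])
print(nn_conv_at(x, w, 1, 1))  # 6.0 = x[1, 2]: no kernel flip
```

A true (flipped) convolution with this kernel would instead return the left-hand neighbour x[1, 0] = 4.0, which is exactly the distinction the paragraph above draws.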

In this work, we replace all the operations inside σ_2 in equation 9 with our INRF-module (equation 2), that is:

y = σ_2( INRF(x) + b )    (10)

This is what we call an INRF-layer; it has trainable weights w and it can fully replace a convolutional layer.

INRF neural networks
A convolutional neural network (CNN) is a stack of convolutional layers which usually ends in a multilayer perceptron (a stack of fully connected layers), see Fig. 3. In this setting, the stack of convolutional layers acts as a feature extractor that feeds a vector to the multilayer perceptron (the classifier). In an INRF neural network (INRFnet), instead of convolutional layers there are INRF layers. In a CNN it is common to find other kinds of layers between the convolutional layers, such as pooling, batch normalization 2 , or dropout 3 . Our INRF layers can be used in combination with these other layers too.

Mapping between layers
An important feature of convolutional layers is their capacity to map between arbitrary dimensions. For instance, if an input x has dimensions m × n × NCh_in, a convolutional layer can map it to an output y with dimensions m × n × NCh_out, where NCh_in and NCh_out are not necessarily equal. An INRF layer is capable of performing this mapping too; this happens in what is referred to as step iii) in figure 2 (see the implementation details in the previous section).

Classification architectures
In this section two INRFnet architectures used for classification are described.

INRFnet2 - A 2-layer net for MNIST
INRFnet2 takes inputs of dimensions 28 × 28 × 1 and uses two INRF layers with λ = 1.1. The first INRF layer has 32 kernels of size 5 × 5 and the second one has 64 kernels of size 5 × 5. After each INRF layer there is a pooling layer with a 2 × 2 kernel. The result of the second pooling layer is passed to a fully connected layer with 500 nodes, followed by a dropout layer (with probability 0.5) that connects to a final fully connected layer of k nodes (one node per desired class). In both INRF layers the nonlinearities σ_1 are rectified linear units (ReLUs) and σ_2 is just the identity.
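The shape bookkeeping for INRFnet2 can be checked in a few lines (a sketch under two assumptions: the INRF layers preserve spatial size via zero padding, as in the implementation section, and each 2 × 2 pooling halves height and width):

```python
def pool2x2(height, width):
    """Spatial size after a 2x2 pooling layer (assumed stride 2)."""
    return height // 2, width // 2

h, w = 28, 28          # MNIST input, 28 x 28 x 1
h, w = pool2x2(h, w)   # INRF layer 1 (32 kernels, 5x5) + pooling -> 14 x 14 x 32
h, w = pool2x2(h, w)   # INRF layer 2 (64 kernels, 5x5) + pooling -> 7 x 7 x 64
flat = h * w * 64      # feature vector fed to the 500-node fully connected layer
print(flat)            # 3136
```

So the 500-node fully connected layer receives a 3136-dimensional vector under these assumptions.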

INRFnet3 - A 3-layer net for CIFAR10, CIFAR100, SVHN
INRFnet3 takes inputs of dimensions 32 × 32 × 3 and uses three INRF layers with λ = 2. All the INRF layers have 192 kernels of size 5 × 5. Each INRF layer is followed by a batch normalization layer; after the first two of these, a pooling layer with a 2 × 2 kernel is used. Finally, an average pooling layer (with kernel size 8 × 8) is stacked after the third batch normalization layer. At the end, a fully connected layer with k nodes (one node per desired class) is stacked. In all the INRF layers the nonlinearities σ_1 are ReLUs and σ_2 is a non-symmetric activation function defined by the equation:

σ_2(z) = p z if z ≥ 0,  q z if z < 0

where p = 0.7 and q = 0.3 for INRFnet3.
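The original equation for this activation did not survive extraction cleanly; a plausible piecewise-linear reading of a non-symmetric activation with slopes p and q is sketched below, with the functional form treated explicitly as an assumption:

```python
def sigma2(z, p=0.7, q=0.3):
    """Non-symmetric piecewise-linear activation: slope p for z >= 0, q for z < 0.

    Assumed form only: the text fixes p = 0.7 and q = 0.3 but the exact
    equation was lost, so this is one plausible reconstruction.
    """
    return p * z if z >= 0 else q * z

print(sigma2(2.0))    # 1.4
print(sigma2(-2.0))   # -0.6
```

Unlike a ReLU, this keeps a nonzero (but smaller) slope for negative inputs, which is what "non-symmetric" suggests here.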

Training considerations for INRFnets
The training procedure of an INRFnet is currently constrained by (1) computation time, (2) improvement techniques designed for CNNs, and (3) a harder optimization problem. (1) Although our implementation of the INRF is optimized to run on a GPU through a highly efficient automatic differentiation library (PyTorch 4 ), its training time lags behind that of a CNN of comparable size and number of parameters. This drawback is not a surprise, since most of the optimization in automatic differentiation libraries is targeted at convolution operations, and implementing the INRF requires additional operations on top of the convolutions themselves. (2) Techniques and empirical tricks such as batch normalization or whitening-based preprocessing, developed through many years of research, are designed for CNNs and not for networks with INRF modules. For instance, preprocessing with whitening is known to improve generalization in CNNs; however, in INRFnets we found that this procedure leads to worse performance. We therefore need to test all these tricks and methods in order to find strong

Figure 3. Schematic of a CNN: an input x is processed by a stack of convolutional layers followed by fully connected layers to produce the output y.