Optimal speed estimation in natural image movies predicts human performance

Accurate perception of motion depends critically on accurate estimation of retinal motion speed. Here we first analyse natural image movies to determine the optimal space-time receptive fields (RFs) for encoding local motion speed in a particular direction, given the constraints of the early visual system. Next, from the RF responses to natural stimuli, we determine the neural computations that are optimal for combining and decoding the responses into estimates of speed. The computations show how selective, invariant speed-tuned units might be constructed by the nervous system. Then, in a psychophysical experiment using matched stimuli, we show that human performance is nearly optimal. Indeed, a single efficiency parameter accurately predicts the detailed shapes of a large set of human psychometric functions. We conclude that many properties of speed-selective neurons and human speed discrimination performance are predicted by the optimal computations, and that natural stimulus variation affects optimal and human observers almost identically.


Supplementary
The influence of training on movies with rigid motion only on optimal receptive fields for speed estimation. A. Optimal space-time receptive fields for speed estimation with only rigid retinal image motion. Just as in the main text, a training set was created by texture-mapping randomly sampled patches from photographs of natural scenes onto surfaces. The surfaces were then drifted behind an aperture. This training set included movies only of frontoparallel surfaces (rigid motion only), whereas the set in the main text also contained movies of surfaces slanted to varying degrees (non-rigid motion). The similarity between these receptive fields and those in the main text provides evidence that the results in the main text are largely robust. However, there are some differences between these receptive fields and those presented in the main text. For example, receptive fields 6-8 appear more like discrete cosine transform components than like receptive fields in cortex. B. Quantifying the similarity between individual receptive fields. Correlation between the space-time receptive fields in the main text (non-rigid and rigid motion) and the receptive fields in A (rigid motion only). The optimal space-time receptive fields are largely but not completely robust to whether the image set contains non-rigid and rigid motion or rigid motion only.

Figure S3. Optimal space-time receptive fields and their pair-wise combinations. The original eight receptive fields are shown on the diagonal. The optimal computations could be implemented by appropriately weighting the squared responses of the receptive fields and their pair-wise combinations. This eclectic mix of receptive fields could be used in one of several possible implementations of the optimal computations (see Discussion, Supplement Note 3). Each receptive field response would get a different weight depending on the preferred speed of the likelihood neuron to which it contributes (Eqs. S4, S5). Therefore, the variety of space-time receptive fields in cortex may play a functional role in speed estimation.

Contrast normalization
In the main text, we claim that the standard equation for contrast normalization in the literature provides a good approximation for the expected value of receptive field responses to encoded images that have been corrupted by noise. Here, we present Monte Carlo simulation results to support the claim. We examine the accuracy of the approximation for an individual image movie, and across the entire set of movies. Eq. (1) in the main text (repeated here as Eq. S1) gives the response of a linear space-time receptive field to a noisy, contrast-normalized stimulus, where the noise is i.i.d. Gaussian with a given standard deviation. That is, Eq. S1 gives the response of a linear receptive field (with normalization).
Most simple cells incorporate two more nonlinearities: half-wave rectification (simple cells cannot produce negative responses) and a squaring nonlinearity. Together, these three nonlinearities (i.e., contrast normalization, rectification, and squaring) account for the characteristic shape of simple-cell contrast response functions. Incorporating these features into Eq. S1 yields Eq. S2, in which the half-bracket represents half-wave rectification. Thus, Eq. S2 gives the expected simple cell response to a noisy input image.
The standard contrast normalization model for V1 simple cell responses (refs. 1-3) is given by Eq. S3, whose terms are the half-saturation constant, the root-mean-squared contrast of the stimulus $c_{\mathrm{RMS}}$ (i.e., $n\,c_{\mathrm{RMS}}^2 = \lVert \mathbf{c} \rVert^2$), and $n$, the number of pixels in the contrast patch $\mathbf{c}$. Thus, Eq. S3 does not explicitly include the effects of input noise.
We asked whether the expected value of the model simple cell responses to noisy input images (Eq. S2) is a reasonable approximation to standard model responses to noiseless input images (Eq. S3). We performed a Monte Carlo simulation to check whether the expected noisy response (see Eq. S2) is approximately equal to r (see Eq. S3). Without loss of generality we can set $r_{\max} = 1.0$. First, we examined the accuracy across the entire set of stimuli for the value of the standard deviation of the contrast noise used in our experiment. The results show that the approximation is unbiased across the stimulus set (Supplementary Fig. S1A). However, accuracy across the stimulus set does not guarantee that the approximation is accurate for individual stimuli. To examine whether the approximation holds for individual stimuli, we selected the stimulus from the training set that produced the largest response from space-time receptive field $f_1$. Then we performed a series of Monte Carlo simulations (10,000 samples each), for a range of values, as the contrast of the stimulus was manipulated. The solid curves in Supplementary Fig. S1C
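
The comparison described in this note can be illustrated with a short simulation. This is a minimal sketch under stated assumptions, not the simulation behind Supplementary Fig. S1: the exact forms of Eqs. S2 and S3 are not reproduced in this text, so the sketch assumes that the noisy response is the squared, half-wave rectified projection of the noisy stimulus divided by its squared norm, that the standard model divides the squared rectified projection by n times the sum of the squared half-saturation constant and squared RMS contrast, and that the half-saturation constant equals the noise standard deviation. The receptive field, the stimulus, and all function and parameter names are hypothetical.

```python
import numpy as np

def noisy_response_mean(f, c, sigma, n_samples=10_000, r_max=1.0, seed=0):
    """Monte Carlo estimate of the mean response to noisy, normalized input.

    Assumed form (not taken from the source):
    R = r_max * max(0, f.(c + n))^2 / ||c + n||^2, with n ~ N(0, sigma^2 I).
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples, c.size))
    noisy = c + noise                          # one noisy stimulus per row
    drive = np.maximum(noisy @ f, 0.0)         # half-wave rectification
    energy = np.sum(noisy ** 2, axis=1)        # ||c + n||^2 for each sample
    return r_max * np.mean(drive ** 2 / energy)

def standard_response(f, c, c50, r_max=1.0):
    """Assumed standard normalization model for a noiseless stimulus."""
    n = c.size
    c_rms_sq = np.sum(c ** 2) / n              # uses n * c_rms^2 = ||c||^2
    drive = max(float(f @ c), 0.0)             # half-wave rectification
    return r_max * drive ** 2 / (n * (c50 ** 2 + c_rms_sq))

# Hypothetical unit-norm receptive field and a stimulus matched to it.
n_pixels = 256
f = np.zeros(n_pixels)
f[:16] = 0.25                                  # sum of squares is 1
sigma = 0.1                                    # assumed noise standard deviation
c = 3.2 * f                                    # RMS contrast of 0.2

mc = noisy_response_mean(f, c, sigma)
std = standard_response(f, c, c50=sigma)
print(f"Monte Carlo mean: {mc:.4f}, standard model: {std:.4f}")
```

Under these assumptions the two values should agree closely when the stimulus drive is large relative to the noise, because the expected squared norm of the noisy stimulus is the squared norm of the stimulus plus n times the noise variance.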

Weights for likelihood neurons
The weights for constructing the speed-tuned likelihood neurons (see Fig. 4a) are given by simple functions of the covariance matrix. Each covariance matrix represents the response covariance of the receptive fields for all movies having a particular speed . We denote with for notational simplicity. The weights on the squared and sum-squared filters (see Figs. 3, S4 ) are given by where I is the identity matrix, 1 is the 'ones' vector, and diag() sets a matrix diagonal to a vector.
The response of the likelihood neuron with preferred speed is then given by

(S5)
The proportionality can be turned into an equality by adding a constant to the exponent having a value proportional to the log of the determinant of the covariance matrix.
A loose analogy can be made between the terms in equation S5 and the properties of neurons. The term in the brackets can be thought of as synaptic contributions to the polarization state of the likelihood neuron. The exponential function can be thought of as the non-linearity that converts voltage (which can be positive or negative) to spike rate (which is always positive).
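
The way a weighted sum of squared and sum-squared responses can stand in for a function of the covariance matrix can be sketched as follows. This is an illustration under assumptions, not a transcription of Eqs. S4 and S5, whose exact forms are not reproduced in this text: it assumes that the receptive field responses for a given speed are zero-mean Gaussian, so that the exponent is the quadratic form -0.5 R^T Σ^-1 R, and it chooses weights that reproduce that quadratic form exactly. The function names and the example covariance matrix are hypothetical.

```python
import numpy as np

def likelihood_weights(cov):
    """Weights on squared and pairwise sum-squared responses (assumed form)."""
    A = np.linalg.inv(cov)                     # inverse covariance for one speed
    off = A - np.diag(np.diag(A))              # off-diagonal part of A
    w_pair = -0.5 * off                        # weight on (R_i + R_j)^2, i != j
    w_self = -0.5 * (np.diag(A) - off.sum(axis=1))  # weight on R_i^2
    return w_self, w_pair

def weighted_sum(R, w_self, w_pair):
    """Weighted sum of squared and pairwise sum-squared responses."""
    total = float(np.dot(w_self, R ** 2))
    n = len(R)
    for i in range(n):
        for j in range(i + 1, n):              # each unordered pair once
            total += w_pair[i, j] * (R[i] + R[j]) ** 2
    return total

def likelihood_neuron_response(R, cov):
    """Unnormalized likelihood response: exponential of the weighted sum."""
    w_self, w_pair = likelihood_weights(cov)
    return np.exp(weighted_sum(R, w_self, w_pair))

# Hypothetical response covariance for one speed and one response vector.
cov = np.array([[2.0, 0.5, 0.1],
                [0.5, 1.5, 0.3],
                [0.1, 0.3, 1.0]])
R = np.array([0.4, -0.2, 0.7])
print(likelihood_neuron_response(R, cov))
```

The weights follow from expanding (R_i + R_j)^2 = R_i^2 + R_j^2 + 2 R_i R_j: each cross term of the quadratic form is carried by a sum-squared unit, and the extra squared terms that this introduces are subtracted from the weights on the individual squared responses.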

Alternate implementations of ideal estimator
The implementation that is schematized in Fig. 4A  respectively-are then squared, combined in an appropriately weighted sum (as before), and passed through the same accelerating nonlinearity (as before) to obtain the L neuron responses. These two ways of implementing the ideal are compact and simple conceptually, but are not biologically plausible because linear neurons do not exist in cortex; neurons, for example, cannot respond with a negative spike rate.
A more biologically plausible implementation of the L neurons would be in the spirit of the classic model for obtaining complex cells; namely, by summing the responses of simple cells (ref. 4).
Simple cells are typically modeled as a linear filtering stage followed by half-wave rectification and a squaring output nonlinearity. The squared output of each optimal space-time receptive field (and of each pairwise sum) could be obtained from a pair of matched on and off units mimicking standard V1 simple cells. The simple cell responses would then be summed with appropriate weights and passed through an accelerating nonlinearity (as before) to obtain the L neuron responses. In this implementation, the L neurons would be a specific type of complex cell optimized for speed estimation. (Special cases of this implementation are the so-called "energy" units, which are obtained by summing the responses of four simple cells corresponding to a pair of receptive fields in quadrature phase (ref. 5).) Complex cells for other tasks (e.g., disparity estimation) could be obtained analogously, but would require different receptive fields and weights (ref. 6). All of the above ways of implementing the ideal estimator are mathematically equivalent. It remains uncertain how the brain might approximately implement such ideal calculations. However, the above arguments show that such calculations could be implemented with well-known neural operations.
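
The equivalence between a squared linear response and the summed responses of matched on and off simple cells can be checked directly. This is a minimal sketch, assuming that the off unit uses the sign-inverted receptive field of the on unit and that each simple cell applies half-wave rectification followed by squaring; the function names and the random receptive field and stimulus are hypothetical stand-ins.

```python
import numpy as np

def simple_cell(f, stimulus):
    """Linear filtering, half-wave rectification, then squaring."""
    drive = float(np.dot(f, stimulus))
    return max(drive, 0.0) ** 2

def squared_response_from_on_off(f, stimulus):
    """Sum of a matched on unit (f) and off unit (-f)."""
    return simple_cell(f, stimulus) + simple_cell(-f, stimulus)

rng = np.random.default_rng(0)
f = rng.normal(size=64)                        # stand-in receptive field
stimulus = rng.normal(size=64)                 # stand-in stimulus

linear_squared = float(np.dot(f, stimulus)) ** 2
on_off = squared_response_from_on_off(f, stimulus)
print(np.isclose(linear_squared, on_off))
```

Only one of the two units is active for any given stimulus, so no unit ever needs to signal a negative response, yet their sum equals the squared output of the linear receptive field.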