Efficient probabilistic inference in generic neural networks trained with non-probabilistic feedback

Animals perform near-optimal probabilistic inference in a wide range of psychophysical tasks. Probabilistic inference requires trial-to-trial representation of the uncertainties associated with task variables and subsequent use of this representation. Previous work has implemented such computations using neural networks with hand-crafted and task-dependent operations. We show that generic neural networks trained with a simple error-based learning rule perform near-optimal probabilistic inference in nine common psychophysical tasks. In a probabilistic categorization task, error-based learning in a generic network simultaneously explains a monkey’s learning curve and the evolution of qualitative aspects of its choice behavior. In all tasks, the number of neurons required for a given level of performance grows sublinearly with the input population size, a substantial improvement on previous implementations of probabilistic inference. The trained networks develop a novel sparsity-based probabilistic population code. Our results suggest that probabilistic inference emerges naturally in generic neural networks trained with error-based learning rules.


Introduction
Humans and other animals have been shown to behave near-optimally when dealing with uncertain sensory information in a wide variety of psychophysical tasks. At the behavioral level, this often implies non-trivial response strategies. For instance, when combining multiple sensory cues, accurate probabilistic inference requires the observer to determine the reliabilities of the individual cues and adjust their weights accordingly on a trial-by-trial basis. How can such computations be implemented in neural circuits? A prominent framework addressing this question is the probabilistic population coding (PPC) framework, according to which the population activity on a single trial encodes a probability distribution rather than a single estimate, and computations with probability distributions can be carried out by suitable operations on the corresponding neural responses [24,41]. Ma et al. [24] showed that if neural variability belongs to a particular class of probability distributions, the posterior distribution in cue combination tasks can be computed with a linear combination of the input responses. Moreover, in this scheme, the form of neural variability is preserved between the input and the output, leading to an elegantly modular code. In more complex tasks, linear operations are insufficient, and it has been argued that multiplication and division of neural responses are necessary for optimal inference [3,4,25,26,33].
On closer inspection, however, these previous implementations of PPC suffer from several shortcomings. First, the networks in these studies were either fully manually designed, or partially manually designed and partially trained with large amounts of probabilistic data to optimize explicitly probabilistic objectives, e.g. minimization of Kullback-Leibler (KL) divergence. Therefore, this literature does not address the important question of learning: how can probabilistic inference be learned from a realistic amount and type of data with minimal manual design of the networks? Second, although there are some commonalities between the neural operations required to implement probabilistic inference in different tasks, these operations generally differ from task to task. For instance, it has been argued that some form of divisive normalization of neural responses is necessary in tasks that involve marginalization [4]. However, the specific form of divisive normalization that individual neurons have to perform differs substantially from task to task. Therefore, it is unclear whether probabilistic inference can be implemented in generic neural networks, whose neurons all perform the same type of neurally plausible operation. Third, in these studies, the number of neurons required to implement a probabilistic task scales unfavorably with the number of input neurons (linearly in the case of cue combination and at least quadratically in all other tasks). Therefore, the question of whether these tasks can be implemented more efficiently remains open.
In this paper, we address these issues. Using a simple error-based learning rule, we trained generic feed-forward and recurrent neural networks on eight psychophysical tasks that span a range of ecologically important computations. The networks learned to perform near-optimal probabilistic inference from a relatively small number of training trials and were able to generalize beyond the particular stimulus conditions used in training. Importantly, neither the training objective nor the training examples were explicitly probabilistic. In all tasks, the number of hidden units required to achieve a given level of performance scaled sub-linearly with the number of input units. Thus, these generic networks perform probabilistic inference much more efficiently than the earlier PPC approaches.
The trained generic networks developed a novel sparsity-based probabilistic population code in their hidden layers, where the sparsity of the hidden layer activity was significantly correlated with sensory reliability. The strength of this relationship varied from task to task in a way that can be explained with a simple mean-field type model. The new sparsity-based probabilistic population code makes specific predictions for physiological experiments.
We further show that in a probabilistic binary categorization task involving arbitrary categories, where both human and monkey subjects have been shown to perform probabilistic inference [33], monkey performance (but not human performance) is consistent with an error-based learning rule in a generic neural network. These results suggest that near-optimal probabilistic inference in standard psychophysical tasks emerges naturally and robustly in generic neural networks trained with error-based learning rules and that these types of networks can be used as simple, plausible neural models of probabilistic inference in non-human observers.

Results
Tasks. We trained generic feed-forward or recurrent neural networks on eight probabilistic psychophysical tasks that are commonly studied in the experimental and computational literature. The tasks were: (a) linear cue combination, (b) modular cue combination, (c) coordinate transformation, (d) Kalman filtering, (e) binary categorization, (f) stimulus demixing, (g) causal inference, and (h) visual search. For tasks with continuous output variables (a-d), linear output units were used, whereas for tasks with categorical output variables (e-h), sigmoidal output units were used.
Networks. The networks all received noisy sensory information about the stimulus or the stimuli in the form of a neural population with Poisson variability. The hidden units of the networks were modeled as rectified linear units (ReLUs). ReLUs are commonly used in neural networks due to their demonstrated advantage over alternative non-linearities in gradient-based learning schemes [11]. Linear (sigmoidal) output units were used in tasks with continuous (categorical) outputs. Schematic diagrams of all the networks used for the tasks considered in this paper are shown in Figure 1. Networks were trained to minimize mean squared error or cross-entropy in tasks with continuous or categorical outputs, respectively. Importantly, the networks were provided only with the actual stimulus values or the correct class labels as feedback in each trial. Thus, they did not receive any explicitly probabilistic feedback, nor were they explicitly trained to perform probabilistic inference.
We manipulated sensory reliability via gain variables g multiplying the responses of the input populations. In each task, networks were tested with a wide range of gains or gain combinations (in tasks with more than a single stimulus). To test the generalization capacity of the networks, we trained them with a limited range of gains or gain combinations, as well as with the full range of test gains or gain combinations. The latter unrestricted training regime is called the "All g" condition, whereas the former limited training regime is called the "Restricted g" condition in what follows. The specific gain ranges and gain combinations used in each task are indicated below (Methods).
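The input encoding and the generic architecture just described can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions (Gaussian tuning curves, a 20-neuron input population, untrained random weights); none of these specific values come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def tuning(s, pref, amp=10.0, width=20.0):
    """Gaussian tuning curves: mean response of each input neuron to stimulus s."""
    return amp * np.exp(-0.5 * ((s - pref) / width) ** 2)

def poisson_population(s, g, pref):
    """Input population: Poisson spike counts, with the gain g scaling the tuning curves."""
    return rng.poisson(g * tuning(s, pref))

def relu_forward(r, W, b, w_out, b_out):
    """Generic network: one ReLU hidden layer, linear output (continuous tasks)."""
    h = np.maximum(0.0, W @ r + b)   # rectified linear hidden units
    return w_out @ h + b_out

pref = np.linspace(-40, 40, 20)      # preferred stimuli of 20 input neurons
r = poisson_population(s=5.0, g=3.0, pref=pref)

n_hid = 50                           # hypothetical hidden layer size
W = rng.normal(0.0, 0.1, size=(n_hid, pref.size))
b = rng.normal(0.0, 0.1, size=n_hid)
w_out = rng.normal(0.0, 0.1, size=n_hid)
out = relu_forward(r, W, b, w_out, 0.0)
```

In training, the scalar output `out` would be compared against the true stimulus value via the squared error, with gradients of that error driving weight updates.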
Generic neural networks trained with error-based learning rules implement probabilistic inference in standard psychophysical tasks. The performance of well-trained networks is shown in Figure 2b and d for both "All g" (black) and "Restricted g" (red) training conditions in all tasks (complete learning curves of the networks, i.e. performance as a function of the number of training trials, are shown in supplementary Figure S1).
In categorical tasks, the output nodes of the networks approximate the posterior probabilities of the classes given the input responses (Figure 2c). Theoretical guarantees ensure that this property holds under general conditions with a wide range of loss functions [16] (Discussion). In all continuous tasks, as well as in two of the categorical tasks, we found strong trial-by-trial correlations between the input gain and the sparsity of hidden layer responses (Figure 5f), which in turn induce correlations between the posterior precision and the sparsity of hidden layer responses in the continuous tasks. This intriguing phenomenon will be explained below.
Random networks. We also considered an alternative architecture, in which the input-to-hidden layer weights and the biases of the hidden units were set randomly and left untrained (Methods). Such random networks can be plausible models of some neural systems [7,36]. Although the performance of random networks improved with training, they required significantly larger numbers of hidden units than the corresponding fully trained networks to achieve a similar level of performance and, in some cases (e.g. in Kalman filtering), failed to achieve error rates as low as the fully trained networks even with very large numbers of hidden units (supplementary Figure S2). A well-known theoretical result can explain the inefficiency of random networks: the approximation error of neural networks with adjustable hidden units scales as O(1/n), with n denoting the number of hidden units, whereas for networks with fixed hidden units (as in our random networks), the scaling is much worse, O(1/n^{2/d}), where d is the dimensionality of the problem [1]. Because random networks were substantially less efficient than fully trained networks, we exclusively consider the fully trained networks for the rest of this paper.
Alternative representations of stimulus reliability in the input populations. Thus far, we have assumed that sensory reliability has a purely multiplicative effect on the responses of input neurons. Although this assumption appears to hold for the effect of contrast on orientation selectivity in visual cortex [35], it is known to be violated for the effect of motion coherence on direction selectivity [10,29] and for the effect of contrast on speed selectivity [20], and is unlikely to hold in the general case. The importance of this observation is that the Poisson-like PPC approach proposed in [24] cannot easily handle cases where "nuisance variables" such as contrast or coherence do not have a purely multiplicative effect on neural responses. By contrast, our approach does not make any assumptions about the representation of stimulus reliability in the input populations. We demonstrated this in two cases: (i) cue combination with tuning functions of the form reported in [10,29], where stimulus coherence affects both the gain and the baseline of the responses (supplementary Figure S3), and (ii) contrast-invariant speed estimation with tuning functions of the form reported in [20], where both the peak response and the preferred speed depend on stimulus contrast (supplementary Figure S4). Thus, our networks can approximate optimal estimates regardless of the specific form in which stimulus reliability is encoded in the input populations, although the hidden layer representations that emerge from training can be different for different schemes (supplementary Figures S3-S4).
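A non-multiplicative reliability code of the first kind can be sketched as follows. The functional form (coherence scaling the stimulus-driven component while also lowering the baseline) follows the qualitative description of [10,29], but the parameter values and function names here are our own illustrative choices:

```python
import numpy as np

def direction_tuning(s, pref, c, amp=20.0, base=5.0, width=30.0):
    """Mean responses of direction-selective neurons to motion direction s.
    Coherence c scales the stimulus-driven bell-shaped component but also
    reduces the baseline, so reliability is NOT a pure multiplicative gain
    on the population response (illustrative parameters, not fitted)."""
    bell = np.exp(-0.5 * ((s - pref) / width) ** 2)
    return c * amp * bell + (1.0 - c) * base

pref = np.linspace(-180, 180, 24)          # preferred directions
low = direction_tuning(0.0, pref, c=0.1)   # low-coherence responses
high = direction_tuning(0.0, pref, c=0.9)  # high-coherence responses

# If coherence acted multiplicatively, high/low would be constant across
# the population; here the ratio varies strongly from neuron to neuron.
ratio = high / low
```

A network receiving Poisson samples of these mean responses must learn the reliability structure from the data, rather than reading it off a single gain variable.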
Modularity of the hidden layer representations. In the linear cue combination task, the representations learned by the hidden units were fully modular, in the sense that the trained hidden units could be directly plugged into a network incorporating an additional input population without any re-training and with little information loss (Figure 1b and Figure 2). This suggests that although the network was trained to minimize the squared error, and hence was asymptotically guaranteed to reproduce only the posterior mean, information about the full posterior distribution was, to a large extent, preserved in the hidden layer.
Generalization to untrained stimulus conditions. As shown in Figure 2b and d (red bars), the networks were able to generalize well beyond the training conditions in all tasks. An example is shown in Figure 3a for the cue combination task. In the "restricted g" condition, we trained the networks with only two gain combinations, g ≡ (g_1, g_2) = (5, 5) and g = (25, 25), and tested them on all gain combinations of the form (g_1, g_2), where g_1, g_2 ∈ {5, 10, 15, 20, 25}, with up to five-fold gain differences between the two input populations ("all g" condition). To demonstrate that the trained networks performed qualitatively correct probabilistic inference, we set up cue conflict conditions (similar to the cue conflict conditions in psychophysical studies [9]) where we presented slightly different

Figure 2 (caption excerpt): (b) Network output plotted against the optimal estimate. (c) Scatter plots of the posterior probability of a given choice vs. the network outputs in "all g" conditions of the categorical tasks. Each dot represents a single trial. For the stimulus demixing task (SD), where the output was four-dimensional, only the marginal posterior along the first dimension is shown for clarity. (d) For categorical tasks, performance is measured in terms of fractional information loss, defined as the KL divergence between the actual posterior and the network's output, normalized by the mutual information between the class labels and the neural responses [4]. In (b) and (d), error bars represent standard errors over 10 independent runs of the simulations. Abbreviations: CC (cue combination); MCC (modular cue combination); CT (coordinate transformation); KF (Kalman filtering); BC (binary categorization); SD (stimulus demixing); CI (causal inference); VS (visual search). Categorical tasks are labeled in green, continuous tasks in orange.
stimuli to the two input populations and manipulated the degree of conflict between the cues. The network achieved low generalization error (fractional RMSE: 10.9%) even after as few as 50 trials in this impoverished training condition and performed qualitatively correct probabilistic inference in the untrained conditions. In particular, the network correctly adjusted the weights assigned to the two cues as the ratio of their reliabilities varied over a 25-fold range (Figure 3a).
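The qualitative signature of correct inference here (the weight on each cue tracking its relative reliability) can be sketched numerically. We assume, as a simplification, matched Poisson populations with equal tuning widths, for which posterior precision is proportional to the input gain, so the optimal weight on cue 1 reduces to g_1/(g_1 + g_2); the trained networks receive only the spiking responses and must discover this weighting:

```python
import numpy as np

def combine_cues(s1, s2, g1, g2):
    """Reliability-weighted cue combination under the simplifying assumption
    that posterior precision is proportional to the input gain (equal tuning
    widths). Returns the combined estimate and the weight on cue 1."""
    w1 = g1 / (g1 + g2)
    return w1 * s1 + (1.0 - w1) * s2, w1

# Cue conflict: cues at -3 and +3 deg, with a five-fold reliability ratio
# in each direction (gains drawn from the {5, ..., 25} range used in testing).
est_lo, w_lo = combine_cues(-3.0, 3.0, g1=5.0, g2=25.0)   # cue 2 more reliable
est_hi, w_hi = combine_cues(-3.0, 3.0, g1=25.0, g2=5.0)   # cue 1 more reliable
```

As the reliability ratio flips, the combined estimate moves from near cue 2 toward cue 1, which is the pattern shown in Figure 3a.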
The successful generalization performance of the neural networks is a result of two factors. First, the target function is approximately invariant to gain manipulations that differ between the training and test conditions. In cue combination, for instance, the target function (Equation 8) is approximately invariant to the scaling of the input populations by arbitrary gains g_1 and g_2. The second factor is the network's inductive biases, i.e. how it tends to behave outside the training domain. These inductive biases depend on the details of the network architecture [31]. The importance of the network architecture in the extrapolation behavior of neural networks can be illustrated with a simple example (Figure 3b). In this example, we trained two neural networks with different hidden unit activation functions on the same two-dimensional regression problem, which involved learning the same output value for two clusters of widely separated training points. The training data are indicated by the red dots in Figure 3b. The network with rectified linear activation functions learned an approximately constant function, whereas the network with cosine activation functions learned a more complex function outside the training domain (Figure 3b), even though both networks performed equally well on the training data.
Sparsity-based representation of sensory reliability. We now explain the observed correlations between the input gain and the sparsity of hidden layer responses in the trained networks. We first note that in the tasks with continuous output variables, the optimal solution is invariant to a multiplicative scaling g of the input responses (Methods). We then investigated the conditions under which the neural network's outputs would be invariant to such scalings. To address this question, we first derived an approximate analytical expression for the mean response of a typical hidden unit, μ̄, as a function of the input gain g, the mean input μ to the hidden unit at unit gain, and the mean μ_b and standard deviation σ_b of the biases of the hidden units (Methods). To quantify the dependence of the mean hidden unit response on g, we introduced a measure T_var of the total sensitivity of μ̄ to variations in g, defined in terms of the derivative μ̄'(g) over the tested gain range (Methods), and numerically minimized T_var with respect to μ, μ_b, and σ_b, subject to the constraint that the mean response across different gains be equal to a positive constant K.
T_var was minimized for a negative mean input μ, a positive μ_b, and a large σ_b value (green dot in Figure 4a). As an approximate rule, decreasing μ and increasing μ_b or σ_b lead to smaller T_var values. Because the input responses are always non-negative, the only way μ can be negative in our networks is if the mean input-to-hidden layer weight is negative. The negativity of the mean input μ implies that as the gain g increases, the distribution of the total input h to the unit shifts to the left (Figure 4b, top), causing a larger proportion of the distribution to remain below the threshold (represented by the dashed line in Figure 4b) and hence decreasing the probability that the neuron will have a non-zero response (Figure 4c, top). This mechanism underlies the sparsification of the hidden unit responses with increasing g. Because increasing g also increases the variance of h, the mean response for those inputs that do cross the threshold increases (Figure 4d, top). As a result, the mean response of the neuron, which is a product of these two terms, remains roughly constant (Figure 4e, top).
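The sparsification mechanism can be illustrated with a small simulation. All parameter values here (μ = -0.5, μ_b = 2, σ_b = 2, population sizes) are our own choices, not the paper's fitted values; the simulation reproduces the qualitative effect that a negative mean weight makes the fraction of active units fall as the gain rises (exact constancy of the mean rate would require the tuned parameter regime of the mean-field analysis):

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_stats(g, mu=-0.5, mu_b=2.0, sigma_b=2.0, n_inputs=100, n_units=2000):
    """Simulate ReLU hidden units driven by Poisson input with gain g.
    mu < 0 sets a negative MEAN input-to-hidden weight, the regime that
    the mean-field analysis identifies as minimizing gain sensitivity."""
    W = rng.normal(mu / n_inputs, 1.0 / np.sqrt(n_inputs), size=(n_units, n_inputs))
    r_in = rng.poisson(g, size=n_inputs).astype(float)   # input spike counts
    b = rng.normal(mu_b, sigma_b, size=n_units)          # heterogeneous biases
    h = W @ r_in + b                                     # total input to each unit
    resp = np.maximum(0.0, h)                            # ReLU responses
    frac_active = (resp > 0).mean()                      # 1 - sparsity
    return frac_active, resp.mean()

lo = hidden_stats(g=5.0)    # low input gain
hi = hidden_stats(g=25.0)   # high input gain
```

Increasing g shifts the input distribution leftward (negative mean drive grows with g), so a smaller fraction of units crosses threshold at the higher gain.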
We demonstrate this sparsification mechanism for a network trained on the coordinate transformation task in Figure 4f-i. Because the coordinate transformation task is gain-invariant (Methods, Equation 9), the input-to-hidden layer weight distribution in the trained network was slightly skewed toward negative values (Figure 4f), as predicted from our simplified statistical description. Consequently, we found a strong positive correlation between the sparsity of hidden layer responses and the mean input response (Figure 4i), but only a weak, slightly negative correlation between the mean hidden layer response and the mean input response (Figure 4h). Moreover, the mean bias of the hidden units, μ_b, was positive, and the distribution of biases was broad (Figure 4g), again consistent with the result from the simple mean-field model above.
The same type of analysis applies to the categorical tasks as well. The difference, however, is that for some of our tasks with categorical outputs, the net input to the output unit in the optimal solution has a strong dependence on g. For example, in causal inference, the input to the sigmoidal output unit scales approximately linearly with g (Equation 13). Similarly, in visual search, both the global and local log-likelihood ratios have a strong dependence on g (through r_i in Equations 19-20). In binary categorization, however, the net input to the sigmoidal output unit is approximately invariant to a scaling g of the input responses r (Equation 12).
In the bottom panels of Figure 4b-e, predictions from the mean-field model are shown for a parameter combination where all three parameters are close to 0 (represented by the magenta dot in Figure 4a). This parameter combination roughly characterizes the trained networks in the causal inference task (Figure 4j-m). In this case, because both μ and μ_b are close to 0, the probability of non-zero responses stays roughly constant as a function of g (Figure 4c, bottom), causing the mean response to increase with g (Figure 4e, bottom).
Based on our simple mean-field model, we therefore predicted that for those tasks where the net input to the output unit is approximately g-invariant, there should be a positive correlation between the sparsity of hidden unit responses and the input gain, and no (or only a weak) correlation between the mean hidden unit response and the input gain. On the other hand, in tasks such as causal inference, where the net input to the output unit has a strong g-dependence, we predicted a positive correlation between the mean hidden unit response and the input gain, and no (or only a weak) correlation between the sparsity of hidden unit responses and the input gain. We tested these predictions on our trained networks and found that they were indeed borne out (Figure 5e-f). For the causal inference and visual search tasks, the correlation between the mean input and the sparsity of hidden layer responses was weak (Figure 5f), whereas for the remaining tasks, it was strong and positive. The opposite pattern was seen for the correlation between the mean input and the mean hidden layer response (Figure 5e). The sparsity-based representation of the input gain in g-invariant tasks was again driven by the negativity of the mean input-to-hidden layer weights (Figure 5a). Finally, in most of our tasks, we also found a strong positive correlation between the biases of the hidden units and the magnitude of their output weights (Figure 5d), which can be explained through a simple signal-to-noise ratio argument (Methods).
The difference between these two types of tasks (g-invariant and g-dependent) was also reflected in the tuning functions that developed in the hidden layers of the networks. For g-invariant (or approximately g-invariant) tasks such as coordinate transformation, increasing the input gain g sharpens the tuning of the hidden units (Figure 6a), whereas for gain-dependent tasks such as causal inference, the input gain acts more like a multiplicative factor, scaling the tuning functions without changing their shape (Figure 6b).
Figure 4 (caption): In the optimal solution (top), the mean input μ is slightly negative and the mean bias μ_b is positive. Increasing the gain g thus shifts the input distribution to the left: compare the black and red lines for input distributions with a small and a large gain, respectively. The means of the input distributions are indicated by small dots underneath. (c) This, in turn, decreases the probability of non-zero responses, but (d) increases the mean of the non-zero responses; hence (e) the mean response of the hidden units, being a product of the two, stays approximately constant as the gain is varied. In the bottom panels of (b-e), the results are also shown for a different parameter combination, represented by the magenta dot in (a). This parameter combination roughly characterizes the trained networks in the causal inference task shown in (j-m). (f) For a network trained in the coordinate transformation task, distributions of the input-to-hidden layer weights, W_ij; (g) the biases of the hidden units, b_i; (h) scatter plot of the mean input r̄_in vs. the mean hidden unit activity r̄_hid; (i) scatter plot of the mean input vs. the kurtosis of hidden unit activity, κ(r_hid). (j-m) Similar to (f-i), but for a network trained in the causal inference task.

Error-based learning in generic neural networks can account for the evolution of behavioral choice and accuracy in a binary categorization task. The dependence of the networks' performance on the number of training trials (Figure S1) suggests a possible explanation for deviations from optimal inference sometimes observed in experimental studies: namely, insufficient training in the task. Testing this hypothesis rigorously is complicated by possible prior exposure of the subjects to similar stimuli or tasks under natural conditions. Among the tasks considered in this paper, the binary categorization task minimizes such concerns, because it involves classifying stimuli into arbitrary categories. Moreover, in this task, the behavior of both human and monkey observers was best explained by heuristic models that were quantitatively suboptimal, but qualitatively consistent with the optimal inference model [33]. Therefore, we sought to test the insufficient-training hypothesis for suboptimal inference in this task. The stimulus distributions for the two categories and the decision boundaries predicted by the optimal (OPT) and three suboptimal models (FIX, LIN, QUAD) are shown in Figure 7a-b. The suboptimal models make different assumptions about the dependence of the decision boundary on the sensory noise, σ (Figure 7b). In particular, the FIX model assumes that the decision boundary is independent of σ, whereas the LIN and QUAD models assume that the decision boundary is a linear or quadratic function of σ, respectively [33]. The learning curve of a monkey subject who performed a large number of trials in this task (monkey L in [33]) is shown in Figure 7 together with the performance of a neural network that received the same sequence of trials as the subject. The input noise of the network was matched to the sensory noise estimated for the subject, and the learning rate of the network was optimized to fit the learning curve of the subject. Besides providing a good fit to the learning curve of the subject (Figure 7c), the neural networks also correctly predicted the progression of the models that best fit the subject's data, i.e.
early on in the training the QUAD model, then the LIN model (Figure 7d). When we performed the same type of analysis on the human subjects' data, human observers consistently outperformed the networks, and the networks failed to reproduce the learning curves of the subjects (Figure 7e). There are several possible, non-exclusive explanations for this finding. First, prior to the experiment, human observers were told about the task, including what examples from each category looked like. This type of knowledge would be difficult to capture with error-based learning alone and might have given human observers a "head start" in the task. Second, human observers might have benefited from possible prior familiarity with similar tasks or stimuli. Third, human observers might be endowed with more powerful computational architectures than simple generic neural networks, allowing them to learn faster and generalize better [13].
Low computational complexity of standard psychophysical tasks and the efficiency of generic networks. For each of our tasks, we empirically determined the minimum number of hidden units, n*, required to achieve a given level of performance (10% fractional RMSE or information loss) as a function of the total number of input units, d, in our generic networks. An example is shown in Figure 8a for the causal inference task with d = 10 and d = 108. The scaling of n* with d was better than O(d), i.e. sub-linear, in all our tasks (Figure 8b). Previous theoretical work suggests that this result can be explained by the smoothness properties of the target functions and the efficiency of generic neural networks with adjustable hidden units. In particular, Barron [1] showed that the optimal number of hidden units in a generic neural network with a single layer of adjustable hidden units scales as C_f(d)/√d with d, where C_f is a measure of the smoothness of the target function, with smoother functions having lower C_f values. As an example, it was shown in [1] that for the d-dimensional standard Gaussian function, C_f can be upper-bounded by √d, leading to an estimate of O(1) hidden units in terms of d. For some of our tasks (e.g. Kalman filtering, binary categorization; Figure 8b), the scaling of n* with d appeared to be approximately constant, suggesting smoothness properties similar to those of a d-dimensional standard Gaussian. For the other tasks, the scaling was slightly worse, but still sub-linear in every case. We can gain some intuition about the relatively benign smoothness properties of our tasks by looking at the analytic expressions for the corresponding target functions (Equations 8-20): although the inputs are high-dimensional, the solutions can usually be expressed as simple algebraic functions of a small number of one-dimensional projections of the inputs.
The efficiency of our generic networks contrasts sharply with the inefficiency of the manually crafted networks in earlier PPC studies [3,4,24-26,33]: except for the linear cue combination task, these hand-crafted networks used a quadratic expansion, which requires at least O(d^2) hidden units. Moreover, unlike generic neural networks, these networks with hand-crafted hidden units are not guaranteed to work well in the general case, for example, if the target function is not expressible in terms of a quadratic expansion. Stacking the quadratic expansions hierarchically to make the networks more expressive would make the scaling of the number of hidden units with d even worse (e.g. [25]). The fundamental weakness of these hand-crafted networks is the same as that of the random networks discussed above: they use a fixed basis set that is theoretically guaranteed to have much worse approximation properties than the adjustable basis of hidden units used in our generic networks [1].
Figure 7 (caption excerpt): OPT-P is equivalent to an OPT model where the prior probabilities of the categories are allowed to differ from 0.5 (in the experiment, both categories were equally likely). (e) Average performance of 6 human subjects in the main experiment of [33] (red), and average performance of 30 neural networks whose input noise was set to the mean sensory noise estimated for the human subjects (blue). Error bars represent standard errors across subjects.

Figure 8 (caption): (a) For each number of input units, d, the number of hidden units required to reach within 10% of optimal performance, n*, is estimated (shown by the arrows). An example is shown here for the causal inference task with d = 10 and d = 108 input units, respectively. (b) n* plotted as a function of the total number of input units, d, in different tasks. Solid lines show linear fits. In these simulations, the number of training trials for each task was set to the maximum number of training trials shown in supplementary Figure S1. For the visual search task, the number of hidden units in the second hidden layer was constrained to be half the number of hidden units in the first hidden layer, with n* referring to the total number of hidden units in both hidden layers.

Discussion
We showed that small generic neural networks trained with a standard error-based learning rule, but without any explicitly probabilistic feedback or training objective, implement probabilistic inference in simple psychophysical tasks and generalize successfully well beyond the conditions in which they were trained.
An important question to consider is why generic neural networks work as well as they do in probabilistic inference problems. For tasks with continuous outputs, our networks were trained to minimize the squared error loss function, which is minimized by the posterior mean estimate. Given the universal approximation guarantees for multilayer neural networks with rectified linear hidden units [22], it is not surprising that our networks can approximate the posterior mean, a sufficient condition for performing probabilistic inference, given enough hidden units and training data. However, the findings that near-optimal performance can be achieved even in small networks trained with relatively few examples, and that the networks can generalize successfully beyond the training data they receive, depend on the particular problems we studied, specifically on their low-dimensional nature and their smoothness properties, and hence are not predicted by the universal approximation results. The networks can, in principle, be trained to minimize loss functions other than squared error, in which case they would approximate alternative Bayes estimators.
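The link between the squared-error objective and the posterior mean can be made explicit with the standard decomposition of the conditional risk for an estimator a(r):

```latex
\mathbb{E}\!\left[(s - a(\mathbf{r}))^2 \mid \mathbf{r}\right]
  = \operatorname{Var}(s \mid \mathbf{r})
  + \bigl(a(\mathbf{r}) - \mathbb{E}[s \mid \mathbf{r}]\bigr)^2 ,
```

which is minimized by setting a(r) equal to the posterior mean E[s | r]; only the second term depends on the estimator. Under an absolute-error loss, the same argument would instead single out the posterior median.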
For tasks with categorical outputs, even better theoretical results hold. In this case, under mild conditions, the output layer of a multilayer neural network is guaranteed to converge to the posterior probabilities of the classes with any "reasonable" loss function, where "reasonable" has a formal definition according to which all commonly used loss functions, such as cross-entropy or mean squared error, are reasonable [16].
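As a minimal numerical illustration of the point above (our own sketch, not part of the original study), the function minimizing the squared error against samples from a generative model is the posterior mean. In the toy Gaussian model below, all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model: s ~ N(0, sig_s^2), noisy measurement x = s + noise.
sig_s, sig_n = 2.0, 1.0
s = rng.normal(0.0, sig_s, 200_000)
x = s + rng.normal(0.0, sig_n, s.size)

# The function f(x) minimizing E[(f(x) - s)^2] is the posterior mean E[s | x].
# Estimate it nonparametrically by binning x and averaging s within each bin.
bins = np.linspace(-3.0, 3.0, 13)
idx = np.digitize(x, bins)
centers = 0.5 * (bins[:-1] + bins[1:])
f_hat = np.array([s[idx == i].mean() for i in range(1, len(bins))])

# For this Gaussian model the posterior mean is linear shrinkage of x,
# so the empirical bin averages should fall on this line.
f_opt = centers * sig_s**2 / (sig_s**2 + sig_n**2)
```

The empirical regression `f_hat` closely tracks the analytic posterior mean `f_opt`, which is what a squared-error-trained network approximates.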
Relationship to previous works. Our work is consistent with the probabilistic population coding framework, according to which, by virtue of the variability of their responses, neural populations encode probability distributions rather than single estimates, and computations with probability distributions can be carried out by suitable operations on the corresponding neural responses [24,41]. However, our work disagrees with the existing literature on the implementation of such computations. We show that these computations do not require any special neural operations or network architectures other than the very generic ones that researchers have been using for decades in the neural network community [34,38,42].
The starting point of the recent literature on probabilistic population coding has been the principle that Poisson-like neural variability be preserved between the input and output of a network, because this leads to a fully modular code that can be decoded with the same type of decoder throughout the network [24]. To obtain actual networks, these studies then postulated a literal, one-to-one correspondence between the computations required at the population level and the computations performed by individual neurons. This literal interpretation has led to inefficient neural architectures containing intermediate neurons that are artificially restricted to summing or multiplying the activities of at most two input neurons and that perform substantially different operations in different tasks (e.g. linear summation, multiplication or different forms of divisive normalization) [3,4,24-26,33].
Our generic networks are not necessarily inconsistent with the principle of the preservation of Poisson-like variability between the input and output of a network. Our categorical networks already satisfy this principle, and our continuous networks satisfy it if, instead of a purely linear decoder, we use a linear decoder that is normalized by the total activity in the hidden layer (supplementary Figure S6). However, our results show that it is unnecessary and inefficient to postulate a direct correspondence between population-level and individual-neuron computations: standard neural networks with rectified linear hidden units that perform the same type of operation independent of the task implement the population-level computations required for optimal probabilistic inference far more efficiently.
Generic neural networks of the type we used in this paper have been used previously to explain neural responses in prefrontal cortex in a context-dependent decision making task [27], neural responses to naturalistic visual stimuli in higher visual areas [40], and the responses of neurons in posterior parietal cortex performing a type of coordinate transformation [42]. Our results suggest that similar networks can be plausible neural models in tasks that require probabilistic computations.
Experimental predictions. Our results lead to several experimentally testable predictions. First, for gain-invariant tasks, we predict a novel sparsity-based coding of sensory reliability in cortical areas close to the readout. Stimulus manipulations that increase sensory reliability, such as an increase in stimulus contrast, would be expected to increase the sparseness of the population activity in these areas. Another straightforward consequence of this relationship would be a positive correlation between the performance of the animal in the task and the population sparsity of neurons recorded from the same areas. Second, for tasks such as causal inference, we predict a different coding of sensory reliability based on the mean activity in areas close to the readout. Moreover, based on our simplified model of the mechanism underlying these two types of codes, we expect a trade-off between them: the stronger the correlation between sparsity and sensory reliability, the weaker the relationship between the mean activity and sensory reliability, and vice versa. This can be tested with population recordings from multiple areas in multiple tasks. Third, at the level of single cells, we predict tuning curve sharpening with increased input gain in tasks where a sparsity-based coding of reliability is predicted (Figure 6a). Such tuning curve sharpening has indeed been observed in cortical areas MT [6], MST [17] and MSTd [29]. On the other hand, we expect the input gain to act more like a multiplicative factor in tasks where a mean-activity-based coding of reliability is predicted (Figure 6b).
Sparse and reliable neural responses have been observed under natural stimulation conditions [8,15,37]. Inhibitory currents have been shown to be crucial in generating such sparse and reliable responses [8,14], reminiscent of the importance of the mean input to a hidden unit being negative in our mean-field model of the sparsity-based coding of sensory reliability (Figure 4b).
Limitations. Our networks are highly idealized models of real neural circuits. First, except for the recurrent network used in the Kalman filtering task, they are all feed-forward networks, whereas cortical networks are usually highly recurrent. However, networks with feedback connections can sometimes behave effectively like a feed-forward network [12,30]. Feed-forward networks also currently provide the best characterization of the neural responses in higher visual cortical areas [40], even though these areas are known to involve abundant feedback connections both within the same area and between different areas. Therefore, we expect that insights gained from understanding feed-forward networks will remain relevant for understanding real cortical circuits. Second, our networks do not respect Dale's law, which states that the synaptic projections of a neuron can be either all inhibitory or all excitatory, but not mixed. However, there are relatively simple ways of turning a generic network that does not respect Dale's law into an equivalent one that does [32]. Thus, we do not expect this deviation from biological reality to be consequential either. Third, our networks were trained with the backpropagation algorithm, which is usually considered to be biologically unrealistic due to its non-locality. Although the backpropagation algorithm in the standard form we have implemented is indeed biologically unrealistic, biologically plausible approximations to backpropagation have been put forward recently [5,23]. Therefore, it is quite likely that one need not compromise the power of backpropagation in order to attain biological plausibility.
The stimuli that we used were far from naturalistic; however, the computations required in our tasks capture the essential aspects of the computations that would be required in similar tasks with natural stimuli. Using simple stimuli allows for the parametric characterization of behavior and makes the derivation of the optimal solution more tractable. We have shown here that new insights can be obtained by combining analytically derived optimal solutions with neural networks. For example, understanding the novel sparsity-based representation of sensory reliability in the hidden layers of the networks in some tasks but not in others relied on the analysis of the optimal solutions in different tasks.
Finally, as exemplified by the inability of error-based learning to account for the performance of human observers in the binary categorization task, we do not expect error-based learning in generic neural networks to fully account for all aspects of the performance of human observers, and possibly non-human observers as well, even in simple tasks.
Relatively mundane manipulations, such as changing the target, the distractors, or the number of distractors in a visual search task, or changing the duration of a delay interval in a short-term memory task, require wholesale re-training of generic neural networks, which seems to be inconsistent with the way human observers, and possibly non-human observers, can effortlessly generalize over such variables. More powerful architectures that combine a neural network controller with an external memory can both learn faster and generalize better [13], offering a promising direction for modeling the generalization patterns of observers in simple psychophysical tasks.

Methods
Neural networks: In all networks, the input units were independent Poisson neurons: r_in ∼ Poisson(f(s, c)), where f is the vector of mean responses (tuning functions), s is the stimulus and c is a stimulus contrast or coherence variable that controls the quality of sensory information. For the main results presented in the paper, we assume that the effect of c can be described as a multiplicative gain scaling: f(s, c) = g(c)f(s), where the individual tuning functions comprising f(s) were either linear, Gaussian or von Mises in different tasks.
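A minimal sketch of this input encoding for the Gaussian-tuning case (the preferred stimuli, tuning width and gain values below are illustrative, not the exact parameters of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian tuning with multiplicative gain: f(s, c) = g(c) f(s).
phi = np.linspace(-10.0, 10.0, 50)   # preferred stimuli (illustrative)
sigma_f = 2.0                        # tuning-curve width (illustrative)

def input_responses(s, g):
    """Poisson spike counts of the input population for stimulus s at gain g."""
    f = np.exp(-(s - phi) ** 2 / (2 * sigma_f ** 2))
    return rng.poisson(g * f)

r_lo = input_responses(0.0, g=5)     # low gain: few, unreliable spikes
r_hi = input_responses(0.0, g=25)    # high gain: more reliable spike counts
```

Higher gain produces larger spike counts and hence more reliable sensory evidence, which is how reliability is controlled trial by trial in these tasks.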
To demonstrate the complete generality of our approach, we also considered two alternative ways in which stimulus contrast or coherence can affect the responses of the input population. In particular, for the cue combination task, we considered tuning functions where c did not have a purely multiplicative effect, but affected the baseline responses as well [10]: f(s, c) = cf(s) + (1 − c)β, with 0 ≤ c ≤ 1, where β was chosen such that the mean response of the input population was independent of c. The results for this case are shown in supplementary Figure S3.
Second, for a contrast-invariant stimulus estimation task, we considered tuning functions where stimulus contrast c affected both the peak response and the preferred stimuli of input neurons, as reported in [20] for speed tuning in area MT; the results for this case are shown in supplementary Figure S4. The hidden units in all networks were rectified linear: r_hid = [W r_in + b]_+, where [·]_+ denotes elementwise rectification. For tasks with continuous output variables, the network output corresponds to a linear combination of the hidden unit responses: y = w^T r_hid + b, and in tasks with categorical variables, the network output was given by a linear combination of the hidden unit responses passed through a sigmoid nonlinearity: y = σ(w^T r_hid + b). For the stimulus demixing task, where there was more than a single output variable, the outputs were softmax units instead.
In the main simulations, feed-forward networks with a single hidden layer had 200 hidden units, the two-hidden-layer feed-forward network for the visual search task had 130 and 70 hidden units in its first and second hidden layers, and the recurrent network for the Kalman filtering task had 30 recurrent units. Partially random networks (supplementary Figure S2) had 10 times the number of hidden units of the corresponding fully trained network.
In the cue combination, modular cue combination, coordinate transformation, Kalman filtering, and causal inference tasks, there were 50 input neurons per input population. To make our results comparable to earlier results, we used 20 input neurons per input population in the binary categorization and visual search tasks and 10 input neurons per input population in the stimulus demixing task.
Training procedure: The feed-forward networks were trained with the standard back-propagation algorithm [34].
The recurrent networks for the Kalman filtering task were trained with back-propagation through time [38]. The algorithms were implemented as stochastic gradient descent with a momentum term. The batch size for the updates was 10. The learning rate was decreased over the course of training according to η_0/(1 + γt), where t = 0, 1, ..., N_e − 1 is the epoch number. The number of training epochs, N_e, was fixed at 100, and the learning rate parameters η_0 and γ, as well as the momentum coefficient, were optimized with grid searches. The weights of the units in the network were initialized to small random values drawn from a zero-mean Gaussian distribution with a standard deviation of 0.01. In some tasks, we found it necessary to initialize the biases of the hidden units to large values; therefore, the biases of the hidden units were initialized to random values drawn from a zero-mean Gaussian distribution with standard deviation σ_b, where σ_b was optimized using a grid search.
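The learning-rate schedule and momentum update described above can be sketched as follows (the loss and all hyperparameter values are illustrative only, since in the paper they were chosen by grid search):

```python
import numpy as np

# Learning-rate schedule from the Methods: eta_t = eta_0 / (1 + gamma * t),
# with t the epoch index. eta_0, gamma and the momentum coefficient mu are
# illustrative values.
eta_0, gamma, mu = 0.1, 0.05, 0.9
n_epochs = 100

def lr(t):
    return eta_0 / (1.0 + gamma * t)

# SGD-with-momentum update on a toy quadratic loss L(w) = 0.5 * ||w||^2.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for t in range(n_epochs):
    grad = w                       # dL/dw for the toy loss
    v = mu * v - lr(t) * grad      # momentum accumulates past gradients
    w = w + v                      # parameter update
```

On this toy loss the iterate decays toward the minimum at the origin; in the actual simulations the same update rule is applied to the network weights, with gradients supplied by backpropagation.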

Training conditions:
The "all g" conditions in different tasks were as follows. In the cue combination and coordinate transformation tasks, all 25 pairs of the form (g_1, g_2) with g_1, g_2 ∈ {5, 10, 15, 20, 25} were presented an equal number of times. In Kalman filtering, g was uniformly drawn between 0.3 and 3 at each time step. In binary categorization, the six gain values, g ∈ {0.37, 0.9, 1.81, 2.82, 3.57, 4}, were presented an equal number of times. These gain values were calculated from the mean noise parameter values reported for the human subjects in [33]. In causal inference, all 25 pairs of the form (g_1, g_2) with g_1, g_2 ∈ {1, 2, 3, 4, 5} were presented an equal number of times. In stimulus demixing, following [4], c was uniformly and independently drawn between 2 and 9 for each source. In visual search, g was randomly and independently set to either 0.5 or 3 for each stimulus.
The "restricted g" conditions in different tasks were as follows. In the cue combination and coordinate transformation tasks, the two pairs (g_1, g_2) ∈ {(5, 5), (25, 25)} were presented an equal number of times. In Kalman filtering, g was randomly and independently set to either 1 or 2 at each time step. In binary categorization, g was always 4.2. This gain value corresponds to 100% contrast as calculated from the mean noise parameter values for the human subjects reported in [33]. In causal inference, the three pairs (g_1, g_2) ∈ {(2, 4), (3, 3), (4, 2)} were presented an equal number of times. In stimulus demixing, c was either set to 2 for all sources or else set to 9 for all sources. Similarly, in visual search, g was either set to 0.5 for all stimuli or else set to 3 for all stimuli.
Mean-field model of hidden unit responses: For a given input activity r, we consider the responses of the hidden units as realizations of a random variable r_hid. The output weights are also assumed to be realizations of a random variable w, and we further assume that w and r_hid are independent. The network's output is then proportional to w r_hid. We want to make this expression invariant to the input gain g; because the output weights are already gain invariant, it suffices to consider the mean hidden unit response. The net input to a typical hidden unit is given by x = g(w_in^T f) + b, where w_in are the input weights to a typical hidden unit. Across hidden units, we model w_in^T f as Gaussian with mean µ and variance σ² and the bias b as Gaussian with mean µ_b and variance σ_b², so that x ∼ N(gµ + µ_b, g²σ² + σ_b²). The mean response of a rectified linear hidden unit is then:

μ̄(g) = m Φ(m/v) + v φ(m/v), with m = gµ + µ_b and v = √(g²σ² + σ_b²), (2)

where Φ(•) and φ(•) are the cdf and the pdf of the standard Gaussian distribution. We then introduce the following measure of the total sensitivity of μ̄ to variations in g: T_var = ∫ |μ̄′(g)| dg, where the prime denotes the derivative with respect to g and the integral runs over a range of g values. Because g always appears as gµ or gσ in μ̄ (Equation 2), the parametrization in terms of g, µ and σ is redundant. We therefore set σ = 1, and hence expressed everything in terms of the scale of σ. We then minimized T_var numerically with respect to µ, µ_b and σ_b subject to the constraint that the mean response across different gains be equal to some positive constant K. This ensures that the degenerate solution where the hidden layer is completely silent is avoided.
Signal-to-noise ratio calculation: The variance of a typical hidden unit can be calculated analytically, from which the inverse of the signal-to-noise ratio, defined as the variance over the mean of the hidden unit activity, follows. After taking the derivative with respect to b, a straightforward analysis shows that both terms in this expression are decreasing functions of b.
Task details: In the linear cue combination task, the objective was to combine two cues, r_1 and r_2, encoding information about the same variable, s, in a statistically optimal way. Assuming a squared error loss function, this can be achieved by computing the mean of the posterior p(s|r_1, r_2). For a uniform prior distribution, the posterior mean is given by an expression of the form [24]:

ŝ = φ^T(r_1 + r_2) / 1^T(r_1 + r_2)

where φ is the vector of preferred stimuli of input neurons and 1 is a vector of ones. During training, the two cues received by the input populations were always non-conflicting: s_1 = s_2 = s, and the gains of the input populations varied from trial to trial. The network was trained to minimize the mean squared error between its output and the common s indicated by the two cues.
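The posterior-mean decoder of [24] for this task can be sketched as follows (the preferred stimuli, tuning width and gains are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)

# Optimal posterior-mean estimate for cue combination with independent Poisson
# populations and Gaussian tuning: s_hat = phi^T (r1 + r2) / 1^T (r1 + r2).
phi = np.linspace(-10.0, 10.0, 50)   # preferred stimuli (illustrative)
sigma_f = 2.0                        # tuning width (illustrative)

def poisson_pop(s, g):
    """Poisson responses of one input population to stimulus s at gain g."""
    return rng.poisson(g * np.exp(-(s - phi) ** 2 / (2 * sigma_f ** 2)))

def posterior_mean(r1, r2):
    r = r1 + r2
    return phi @ r / r.sum()

s_true = 1.5
r1, r2 = poisson_pop(s_true, g=20), poisson_pop(s_true, g=20)
s_hat = posterior_mean(r1, r2)       # close to s_true on a typical trial
```

Note that the same linear read-out handles any combination of gains, which is the reliability-weighted cue integration the trained networks must reproduce.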
In the coordinate transformation task, the eye-centered location of an object in 1-d, s_1, was encoded in a population of Poisson neurons with responses r_1, and the current eye position, s_2, was similarly encoded in a population of Poisson neurons with responses r_2. The goal was to compute the head-centered location of the object, s = s_1 + s_2. Assuming uniform priors, the optimal estimate of s can be expressed in terms of suitable matrices B and A [4] (see [4] for a full derivation).
In the Kalman filtering task, we considered a one-dimensional time-varying signal evolving according to s_t = (1 − γ)s_{t−1} + η_t, where η_t ∼ N(0, σ_η²), with γ = 0.1 and σ_η² = 1. At each time t, the stimulus was represented by the noisy responses, r_in,t, of a population of input neurons with Poisson variability. The input population projected to a recurrent pool of neurons that have to integrate the momentary sensory information coming from the input population with an estimate of the signal at the previous time step (as well as the uncertainty associated with that estimate) to perform optimal estimation of the signal at the current time step. We decoded the estimate of the signal at each time step by a linear read-out of the recurrent pool: ŝ_t = w^T r_rec,t + b. The network was trained with sequences of length 25 using a squared error loss function. The posterior p(s_t|r_in,1:t) is Gaussian with natural parameters given recursively by [4]:

1/σ_t² = 1/σ_in,t² + 1/((1 − γ)² σ_{t−1}² + σ_η²)
μ_t/σ_t² = μ_in,t/σ_in,t² + (1 − γ) μ_{t−1}/((1 − γ)² σ_{t−1}² + σ_η²)

where μ_in,t and σ_in,t² are the mean and variance of p(s_t|r_in,t), which represents the momentary sensory evidence encoded in the input population. These are, in turn, given by μ_in,t = φ^T r_in,t / 1^T r_in,t and σ_in,t² = σ_f² / 1^T r_in,t.

In the binary categorization task, the goal was to classify a noisy orientation measurement into one of two overlapping classes that have the same mean but different variances. Given a noisy activity pattern r over the input population representing the observed orientation, the posterior probabilities of the two classes can be calculated analytically. The log-likelihood ratio d of the two categories can be expressed in terms of the quantities e = φ/σ_f² and a = 1/σ_f² [33]. The posterior probability of the first class is then given by a sigmoidal function of d: p(C = 1|r) = 1/(1 + exp(−d)).
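The generative dynamics of the Kalman filtering task and the corresponding ideal-observer recursion can be sketched as below. As a simplification (ours, not the paper's), the momentary evidence is modeled as a direct noisy observation with fixed variance, rather than being decoded from a Poisson input population; a longer sequence than the length-25 training sequences is used only to get a stable error estimate:

```python
import numpy as np

rng = np.random.default_rng(3)

# Signal dynamics: s_t = (1 - gamma) s_{t-1} + eta_t, eta_t ~ N(0, sigma_eta^2).
gamma, sigma_eta = 0.1, 1.0
T = 200
s = np.zeros(T)
for t in range(1, T):
    s[t] = (1 - gamma) * s[t - 1] + rng.normal(0.0, sigma_eta)

# Momentary evidence: noisy observation with fixed variance var_in
# (a simplification; in the paper mu_in,t and var_in,t are decoded
# from the Poisson input population).
var_in = 4.0
obs = s + rng.normal(0.0, np.sqrt(var_in), T)

# Scalar Kalman filter implementing the recursive posterior update.
m, P = 0.0, sigma_eta ** 2
est = np.zeros(T)
for t in range(T):
    m_pred = (1 - gamma) * m                        # predict from the dynamics
    P_pred = (1 - gamma) ** 2 * P + sigma_eta ** 2
    K = P_pred / (P_pred + var_in)                  # reliability-weighted gain
    m = m_pred + K * (obs[t] - m_pred)              # combine with new evidence
    P = (1 - K) * P_pred
    est[t] = m
```

The filtered estimate has a substantially lower error than the raw observations, which is the benefit the recurrent network must learn to capture.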
In the causal inference task, the goal was to infer whether two sensory measurements are caused by a common source or by two separate sources. The log-likelihood ratio d of these two hypotheses is given in [26]; it depends on J_s, the precision of the Gaussian stimulus distribution, on σ_f², the common variance of the Gaussian tuning functions of the individual input neurons, and on the input responses r_1 and r_2. φ_1 and φ_2 are the preferred stimuli of the neurons in the first and second populations, respectively; for convenience, we assumed φ_1 = φ_2. The optimal probability of reporting "same cause" is then simply given by p(C = 1|r_1, r_2) = 1/(1 + exp(−d)).
In the stimulus demixing task, the goal was to infer the presence or absence of different signal sources in a mixture of signals with unknown concentrations. As a concrete example, the signals can be thought of as different odors, and the task would then be to infer the presence or absence of different odors in an odor mixture with unknown concentrations [4]. Following [4], we assumed a linear mixing model:

o_i = Σ_k w_ik c_k s_k

where s_k denotes the presence or absence of the k-th odor source, c_k denotes its concentration, o_i is the concentration of the i-th odorant and w_ik is the weight of the k-th odor source in the i-th odorant. The task can then be formalized as the computation of the posterior probability of the presence or absence of each odor source, given noisy responses r = {r_i}_{i=1}^{n_o} of populations of Poisson neurons encoding the odorants, i.e. p(s_k = 1|r). The input populations were assumed to have linear tuning for the odorants: r_i ∼ Poisson(o_i f_i + b_i), where f_i and b_i were random vectors with positive entries [4]. As in [4], we assumed four sources and four odorants. The networks were trained to minimize the cross-entropy between the network's outputs and the correct source present/absent labels, s_k.
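A sketch of the linear mixing model and the Poisson encoding of the odorants (the mixing weights and tuning vectors below are random illustrative values; in [4] they were likewise random vectors with positive entries):

```python
import numpy as np

rng = np.random.default_rng(4)

# Linear mixing model: o_i = sum_k w_ik * c_k * s_k.
n_sources, n_odorants, n_per_pop = 4, 4, 10
W = rng.uniform(0.5, 1.5, (n_odorants, n_sources))  # mixing weights (illustrative)

s = rng.integers(0, 2, n_sources)                   # presence/absence of each source
c = rng.uniform(2.0, 9.0, n_sources)                # concentrations, as in the "all g" condition
o = W @ (c * s)                                     # odorant concentrations

# Linear Poisson tuning for the odorants: r_i ~ Poisson(o_i * f_i + b_i).
f = rng.uniform(0.5, 1.0, (n_odorants, n_per_pop))  # tuning slopes (illustrative)
b = rng.uniform(0.1, 0.5, (n_odorants, n_per_pop))  # baselines (illustrative)
r = rng.poisson(o[:, None] * f + b)                 # one population per odorant
```

The network's job is to recover the binary labels s from the spike counts r, marginalizing over the unknown concentrations c.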
In the visual search task, the goal was to infer the presence or absence of a target stimulus s_T among a set of heterogeneous distractors. The log-likelihood ratio d of target presence is expressed in [25] in terms of local target-presence log-likelihoods d_i computed at each of the N stimulus locations on the display (we assumed N = 4). For independent Poisson neurons, the stimulus kernel h(s) entering the local log-likelihoods is given by h(s) = log f(s), where we assumed von Mises tuning functions for individual input neurons. The integral in the second term on the right-hand side of the expression for d_i was calculated numerically. We trained feed-forward neural networks with two hidden layers on this task (Figure 1h), as networks with a single hidden layer performed poorly.

Figure S6: Results for the networks with a normalized linear decoder in the continuous tasks: (a) cue combination, (b) coordinate transformation, (c) Kalman filtering. In these networks, the output of the network was given by ŝ = w^T r_hid / 1^T r_hid. In (a-c), the first column shows the distribution of input-to-hidden layer weights in the trained networks (the distributions are roughly symmetric around zero), the second column is a scatter plot of the mean input vs. the mean hidden layer activity and the third column is a scatter plot of the mean input vs. the kurtosis of hidden layer activity. (d) Correlation between the mean input and the mean hidden unit activity (left) and the correlation between the mean input and the kurtosis of hidden layer activity in the three tasks. Error bars represent standard errors over 10 independent runs of the simulations.

Figure 1 :
Figure 1: Network architectures used for the tasks considered in this paper. Different colors represent different types of units. For tasks with continuous output variables (a-d), linear output units were used, whereas for tasks with categorical output variables (e-h), sigmoidal output units were used.

Figure 3 :
Figure 3: Generalization capacity of neural networks. (a) The weights assigned to cue 1 in the cue conflict conditions as a function of the ratio of the input gains, g_1/g_2. In cue conflict trials, s_1 was first drawn randomly between −4 and 4, then s_2 was generated as s_1 + ∆, where ∆ was one of −16, −12, −8, −4, 4, 8, 12, 16. The cue weight assigned by the network was calculated through the equation ŝ = w ŝ_1,MLE + (1 − w) ŝ_2,MLE, where ŝ is the network output and ŝ_1,MLE and ŝ_2,MLE are the maximum likelihood estimates of s_1 and s_2. The network was trained with only 50 trials in the restricted g condition with non-conflicting cues, hence it did not see any of the conditions shown here during training. Error bars represent standard deviations over 1000 simulated trials. (b) Demonstration of the importance of network architecture, in particular the choice of hidden unit activation function, in the extrapolation behavior of neural networks. Training data are indicated by the red dots.

Figure 4 :
Figure 4: The mechanism underlying the sparsity-based representation of sensory reliability. (a) The variability index T_var (plotted in log units) as a function of the parameters µ, µ_b and σ_b for two different constraint surfaces with K = 2 and K = 3, from the front to the back respectively. The optimal parameter values are represented by the green dot for K = 2. (b) In the optimal solution (top), the mean input µ is slightly negative and the mean bias µ_b is positive. Increasing the gain g thus shifts the input distribution to the left: compare the black and red lines for input distributions with a small and a large gain, respectively. The means of the input distributions are indicated by small dots underneath. (c) This, in turn, decreases the probability of non-zero responses, but (d) increases the mean of the non-zero responses; hence (e) the mean response of the hidden units, being a product of the two, stays approximately constant as the gain is varied. In the bottom panel of (b-e), the results are also shown for a different parameter combination, represented by the magenta dot in (a). This parameter combination roughly characterizes the trained networks in the causal inference task shown in (j-m). (f) For a network trained in the coordinate transformation task, distributions of the input-to-hidden layer weights, W_ij; (g) the biases of the hidden units, b_i; (h) scatter plot of the mean input r_in vs. the mean hidden unit activity r_hid; (i) scatter plot of the mean input vs. the kurtosis of hidden unit activity κ(r_hid). (j-m) Similar to (f-i), but for a network trained in the causal inference task.

Figure 5 :Figure 6 :
Figure 5: Parameter statistics for the trained networks. (a) Mean input-to-hidden unit weight; (b) mean bias of the hidden units; (c) standard deviation of hidden unit biases; (d) correlation between the hidden unit biases and the magnitude of the output weights; (e) trial-by-trial correlation between the mean hidden unit response and the mean input response; (f) trial-by-trial correlation between the sparsity (kurtosis) of hidden layer activity and the mean input response. CC: cue combination; CT: coordinate transformation; KF: Kalman filtering; BC: binary categorization; SD: stimulus demixing; CI: causal inference; VS 1: visual search (first hidden layer); VS 2: visual search (second hidden layer). Error bars represent standard errors across 10 independent runs of the simulations.

Figure 7 :
Figure 7: Error-based learning in generic neural networks can explain a monkey subject's performance, but not human subjects' performance, in a probabilistic binary categorization task involving arbitrary categories. (a) Structure of the two categories. Vertical lines indicate the optimal decision boundaries at low and high noise, σ. (b) Dependence of the decision boundary on sensory noise, σ, for the optimal model (OPT) and three suboptimal heuristic models. (c) Cumulative accuracy of monkey L (red) compared with the cumulative accuracy of a neural network trained with the same set of stimuli (blue). The network was trained fully on-line. The input noise in the network was matched to the sensory noise estimated for the monkey and the learning rate was optimized to match the monkey's learning curve (Methods). Dotted lines indicate the 95% binomial confidence intervals. (d) The overbar shows the winning models for the monkey's behavioral data throughout the course of the experiment according to the AIC metric. The QUAD model initially provides the best account of the data; the LIN model becomes the best model after a certain point during training. The area plot below shows the fractions of winning models, as measured by their AIC scores, over 50 neural network subjects trained with the same set of input noise and learning rate parameters as the one shown in (c). Similar to the behavioral data, early on in the training QUAD is the most likely model to win; LIN becomes the most likely model as training progresses. OPT-P is equivalent to an OPT model where the prior probabilities of the categories are allowed to be different from 0.5 (in the experiment, both categories were equally likely). (e) Average performance of 6 human subjects in the main experiment of [33] (red) and average performance of 30 neural networks whose input noise was set to the mean sensory noise estimated for the human subjects (blue). Error bars represent standard errors across subjects.

Figure 8 :
Figure 8: Low computational complexity of standard psychophysical tasks and the efficiency of generic networks. (a) For each number of input units, d, the number of hidden units required to reach within 10% of optimal performance, n*, is estimated (shown by the arrows). An example is shown here for the causal inference task with d = 10 and d = 108 input units, respectively. (b) n* plotted as a function of the total number of input units, d, in different tasks. Solid lines show linear fits. In these simulations, the number of training trials for each task was set to the maximum number of training trials shown in supplementary Figure S1. For the visual search task, the number of hidden units in the second hidden layer is constrained to be half the number of hidden units in the first hidden layer, with n* referring to the total number of hidden units in both hidden layers.

Figure S3 :
Figure S3: Results for a model with contrast-dependent baseline responses as in [10]. The restricted c training condition corresponds to an equal number of trials of the form c = (0.2, 0.2) and c = (0.8, 0.8), whereas the all c training condition corresponds to all possible combinations of c_1 and c_2 with c_1, c_2 ∈ {0.1, 0.3, 0.5, 0.7, 0.9} (in equal numbers).

Figure S4 :
Figure S4: Results for a model where the stimulus contrast c affects both the peak response and the preferred stimuli of input neurons, as reported for speed tuning in area MT [20]. Tuning functions were log-normal (see Methods). The restricted c training condition corresponds to an equal number of trials with contrast c = 0.05 and c = 1, whereas the all c training condition corresponds to trials with contrast c = 0.05, 0.1, 0.2, 0.4, 1 (in equal numbers).