Abstract
Sensory data about most natural taskrelevant variables are entangled with taskirrelevant nuisance variables. The neurons that encode these relevant signals typically constitute a nonlinear population code. Here we present a theoretical framework for quantifying how the brain uses or decodes its nonlinear information. Our theory obeys fundamental mathematical limitations on information content inherited from the sensory periphery, describing redundant codes when there are many more cortical neurons than primary sensory neurons. The theory predicts that if the brain uses its nonlinear population codes optimally, then more informative patterns should be more correlated with choices. More specifically, the theory predicts a simple, easily computed quantitative relationship between fluctuating neural activity and behavioral choices that reveals the decoding efficiency. This relationship holds for optimal feedforward networks of modest complexity, when experiments are performed under natural nuisance variation. We analyze recordings from primary visual cortex of monkeys discriminating the distribution from which oriented stimuli were drawn, and find these data are consistent with the hypothesis of nearoptimal nonlinear decoding.
Introduction
How does an animal use, or ‘decode’, the information represented in its brain? When the average responses of some neurons are welltuned to a stimulus of interest, this can be straightforward. In binary discrimination tasks, for example, a choice can be reached simply by a linear weighted sum of these tuned neural responses. Yet real neurons are rarely tuned to precisely one variable: variation in multiple stimulus dimensions influence their responses in complex ways. As we show below, when these nuisance variations have nonlinear effects on responses, they can dilute or even abolish the mean tuning to the relevant stimulus. Then the brain cannot simply use linear computation, nor can we understand neural processing using linear models.
A quantitative account of nonlinear neural decoding of sensory stimuli must first express how populations of neurons encode or represent information. Past theories of nonlinear population codes made unsupported assumptions about the covariability of this population responses^{1,2}, leading to substantially underestimated redundancy of large cortical populations. Here we correct this problem by generalizing informationlimiting correlations^{3} to nonlinear population codes, providing a more realistic theory of how much sensory information is encoded in the brain.
Just because a neural population encodes information, it does not mean that the brain decodes it all. Here, encoding specifies how the neural responses relate to the stimulus input, whereas decoding specifies how the neural responses relate to the behavioral output. To understand the brain’s computational strategy we must understand how encoding and decoding are related, i.e. how the brain uses the information it has. These are distinct processes, so the brain could encode a stimulus well while decoding it poorly, or viceversa.
This paper makes four main contributions. First, it weaves together important concepts about tuning curves and nuisance variables, nonlinear computation, and redundant population codes, forming a general, unified description of feedforward encoding and decoding processes in the brain. This description is supported by intuitive explanations and concrete examples to illustrate how these concepts relate to each other and enrich familiar views of neural computation. Second, this paper provides a simple way of testing the hypothesis that the brain’s decoding strategy is efficient, using a simple statistic to assess whether neural response patterns that are informative about the taskrelevant sensory input are also informative about the animal’s behavior in the task. Third, it establishes the technical details needed to apply this test in practical neuroscience experiments. Fourth, we apply this test to analyze V1 data from macaque monkeys, finding direct experimental evidence for optimal nonlinear decoding.
The “Results” section describes the main concepts, their formal connections, and applications. More specifically, the first sections introduce a framework for understanding nonlinear computation, including basic notation, internal and external (nuisance) noise and their effects on information content and formatting, and how this information can be isolated by nonlinear computation of the right statistics. Subsequent sections introduce a formalism for decoding, including notions of linear and nonlinear choice correlations, fine and coarse estimation tasks, and predictions about those correlations under optimal decoding. This section continues by describing how redundancy in the population responses appears as special highorder response statistics, and how they affect the predictions. The last sections present an experimental application of these ideas. A sketch of the details of our general predictions are presented in the “Methods” section, and are derived in full in the Supplement along with details of their application to specific models and our experimental data.
Results
A simple example of a nonlinear code
Imagine a simplified model of a visual neuron that includes an oriented edgedetecting linear filter followed by additive noise, with a Gabor receptive field like simple cells in primary visual cortex (Fig. 1a). If an edge is presented to this model neuron, different rotation angles will change the overlap, producing a different mean. This neuron is then tuned to orientation.
However, when the edge has the opposite polarity, with black and white reversed, then the linear response is reversed also. If the two polarities occur with equal frequency, then the positive and negative responses cancel on average. The mean response of this linear neuron to any given orientation is therefore precisely constant, so the model neuron is untuned.
Notice that stimuli aligned with the neuron’s preferred orientation will generally elicit the highest or lowest response magnitude, depending on polarity. Edges evoking the largest response to one polarity will also evoke the smallest response to its inverse. Thus, even though the mean response of this linear neuron is zero, independent of orientation, the variance is tuned.
To estimate the variance, and thereby the orientation itself, the brain can compute the square of the linear responses. This would allow the brain to estimate the orientation independently from polarity. This is consistent with the wellknown energy model of complex cells in the primary visual cortex, which use squaring nonlinearities to achieve invariance to the polarity of an edge^{4}.
Generalizing from this example, we identify edge polarity as a ‘nuisance variable’—a property in the world that alters how taskrelevant stimuli appear but is, itself, irrelevant for the current task (here, perceiving orientation). Other examples of nuisance variables include the illuminant for guessing a surface color, position for object recognition, the expression for face identification, or pitch for speech recognition. Generically, nuisance variables make it hard to extract the taskrelevant variables from sense data, which is the central task of perception^{5,6,7,8,9,10}. For example, cells in the early visual cortex are not tuned to object identity, since the object could appear at any location and V1 has not yet extracted the complex combinations of features that reveal object type independent of the nuisance variable of position. The brain learns from its history of sensory inputs which statistics of its many sensedata are tuned to the taskrelevant variable. Good nonlinear computations then compute those statistics. In the orientation estimation task above, the relevant statistic was not the mean, but the variance.
Task, stimuli, neural responses, actions
Our mathematical framework describes a perceptual task, a stimulus with both relevant and irrelevant variables, neural responses, and behavioral choices.
In our task, an agent observes a multidimensional stimulus (s, ν) and must act upon one particular relevant aspect of that stimulus, s, while ignoring the rest, ν. The irrelevant stimulus aspects serve as nuisance variables for the task (ν is the Greek letter ‘nu’ and here stands for nuisance). Together, these stimulus properties determine a complete sensory input that drives some responses r in a population of N neurons according to the distribution p(r∣s, ν).
We consider a feedforward processing chain for the brain, in which the neural responses r are nonlinearly transformed downstream into other neural responses R(r), which in turn are used to create a perceptual estimate of the relevant stimulus \(\hat{s}\):
We model the brain’s estimate as a linear function of the downstream responses R. Ultimately these estimates are used to generate an action that the experimenter can observe. We assume that we have recorded activity only from some of the upstream neurons, so we do not have direct access to R, only a subset of r. Nonetheless, we would like to learn something about the downstream computations used in decoding. In this paper, we show how to use the statistics of fluctuations in r, s, and \(\hat{s}\) to estimate the quality of nonlinear decoding.
We first develop the theory for local or finescale estimation tasks: the subject must directly report its estimate \(\hat{s}\) for the relevant stimuli near a reference s_{0}, and we measure performance by the variance of this estimate, \({\sigma }_{\hat{s}}^{2}\). In later sections, we then generalize the problem to allow for binary discrimination as well as coarse tasks, which are more complicated mathematically but not conceptually different.
Signal and noise
The population response, which we take here to be the spike counts of each neuron in a specified time window, reflects both signal and noise, where the signal is the repeatable stimulusdependent aspects of the response, and noise reflects trialtotrial variation. Conventionally in neuroscience, the signal is often thought to be the stimulus dependence of the average response, i.e. the tuning curve \({{{{{{{\bf{f}}}}}}}}(s)={\sum }_{{{{{{{{\bf{r}}}}}}}}}{{{{{{{\bf{r}}}}}}}}\ p({{{{{{{\bf{r}}}}}}}} s)=\left\langle {{{{{{{\bf{r}}}}}}}} s\right\rangle\). (Angle brackets denote an average overall responses given the condition after the vertical bar.) Below we will broaden this conventional definition to allow the signal to include any stimulusdependent statistical property of the population response.
Noise is the nonrepeatable part of the response, characterized by the variation of responses to a fixed stimulus. It is convenient to distinguish internal noise from external noise. Internal noise is internal to the animal and is described by response distribution p(r∣s, ν) when everything about the stimulus is fixed. This could also include uncontrolled variation in internal states^{11,12,13,14}, like attention, motivation, or wandering thoughts. External ‘noise’ is variability generated by the external world—nuisance variables—leading to a neural response distribution p(r∣s) where only the relevant variables are held fixed. Both types of noise can lead to uncertainty about the true stimulus.
Trialtotrial variability can of course be correlated across neurons. Neuroscientists often measure two types of secondorder correlations: signal correlations and noise correlations^{2,15,16,17,18,19,20,21,22}. Signal correlations measure shared variation in mean responses f(s) averaged over the set of stimuli s: ρ_{signal} = Corr(f(s)) where again all averages are taken over all variables not fixed by a condition to the right of the vertical bar. (Internal) noise correlations measure shared variation that persists even when the stimulus is completely identical, nuisance variables and all: ρ_{noise}(s, ν) = Corr(r∣s, ν).
For multidimensional stimuli, however, these correlations are only two extremes on a spectrum, depending on how many stimulus aspects are fixed across the trials to be averaged. We propose an intermediate type of correlation: nuisance correlations. Here we fix the taskrelevant stimulus variable(s) s, and average over the nuisance variables ν: ρ_{nuisance}(s) = Corr(f(s, ν)∣s). Including both internal and external (nuisance) noise correlations gives Corr(r∣s).
Critically, but confusingly, some socalled ‘noise’ correlations and nuisance correlations actually serve as signals. This happens whenever the statistical pattern of trialbytrial fluctuations depends on the stimulus, and thus contain information. For example, a stimulusdependent noise covariance functions as a signal. There would still be true noise, i.e. irrelevant trialtotrial variability that makes the signal uncertain, but it would be relegated to higherorder fluctuations^{23} such as the variance of the response covariance (Fig. 2d, Table 1). Whether from internal or external noise, stimulusdependent correlations lead naturally to nonlinear population codes, as we explain below.
Nonlinear encoding by neural populations
Most accounts of neural population codes actually address linear codes, in which the mean response is tuned to the variable of interest and completely captures all signals about it^{3,24,25,26,27}. We call these codes linear because the neural response property needed to best estimate the stimulus near a reference (or even infer the entire likelihood of the stimulus, Supplement S.1.2.2) is a linear function of the response. Linear codes for different variables may arise early in sensory processing, like orientation in V1, or after many stages of computation^{5,9}, like for objects in the inferotemporal cortex.
If any of the relevant signals can only be extracted from nonlinear statistics of the neural responses^{1,2}, then we say that the population code is nonlinear (Table 1). One straightforward example is a stimulusdependent covariance \(Q(s)=\langle {{{{{{{\bf{r}}}}}}}}{{{{{{{{\bf{r}}}}}}}}}^{\top } s\rangle\); its information can be decoded by quadratic operations R = rr^{⊤} ^{28,29,30}.
A simple example of a nonlinear code is the exclusiveor (XOR) problem. Given the responses of two binary neurons, r_{1} and r_{2}, we would like to decode the value of a taskrelevant signal s = XOR(r_{1}, r_{2}) (Fig. 2a). We do not care about the specific value of r_{1} by itself, and in fact, r_{1} alone tells us nothing about s. The same is true for r_{2}. The usual view on nonlinear computation is that the desired signal can be extracted by applying an XOR or product nonlinearity. However, there is an underlying statistical reason this works: the signal is actually reflected in the trialbytrial correlation between r_{1} and r_{2}: when they are the same then s = −1, and when they are opposite then s = +1. The correlation, and thus the relevant variable s, can be estimated nonlinearly from r_{1} and r_{2} as \(\hat{s}={r}_{1}{r}_{2}\).
Some experiments have reported stimulusdependent internal noise correlations that depend on the signal, even for a completely fixed stimulus without any nuisance variation^{31,32,33,34,35}. Other experiments have turned up evidence for nonlinear population codes by characterizing the nonlinear selectivity directly^{36,37,38}.
More typically, however, stimulusdependent correlations arise from external noise, leading to what we call nuisance correlations. In the introduction (Fig. 1) we showed a simple orientation estimation example in which fluctuations of an unknown polarity eliminate the orientation tuning of mean responses, relegating the tuning to variances. Figure 2b–e shows a slightly more sophisticated version of this example, where instead of two image polarities, we introduce spatial phase as a continuous nuisance variable. This again eliminates mean tuning but introduces nuisance covariances that are orientation tuned.
One might object that although the nuisance covariance is tuned to orientation, a subject cannot compute the covariance (or any other statistic of the encoding model) on a single trial because it does not experience all possible nuisance variables to average over. However, in linear codes, the subject does not have access to the tuned mean response \(\left\langle {{{{{{{\bf{r}}}}}}}} s\right\rangle\) either, just a noisy singletrial version of the mean, namely r. Analogously, the subject does not need access to the tuned covariance, just a noisy singletrial version of the second moments, rr^{⊤} (Table 1). In this simple example, the nuisance variable of the spatial phase ensures that quadratic statistics contain relevant information about the orientation, just like complex cells in V1^{4}.
Choice correlations predicted for optimal linear decoding
To study how neural information is used or decoded, past studies have examined whether neurons that are sensitive to sensory inputs also reflect an animal’s behavioral outputs or choices^{39,40,41,42,43,44,45,46,47}. This choicerelated activity is hard to interpret, because it may reflect decoding of the recorded neurons, or merely correlations between them and other neurons that are decoded instead^{48}. However, testable predictions about the choicerelated activity can reveal the brain’s decoding efficiency for linear codes^{27}. Next, we discuss these predictions, and then generalize them to nonlinear codes.
We define ‘choice correlation’ \({C}_{{r}_{k}}\) as the correlation coefficient between the response r_{k} of neuron k and the stimulus estimate (which we view as a continuous ‘choice’) \(\hat{s}\), given a fixed stimulus s:
This choice correlation is a conceptually simpler and more convenient measure than the more conventional statistic, ‘choice probability’^{49}, but it has almost identical properties (see the “Methods” subsection “Nonlinear choice correlations”)^{27,48}.
Intuitively, if an animal is decoding its neural information efficiently, then those neurons encoding more information should be more correlated with the choice. Mathematically, one can show that choice correlations indeed have this property when decoding is optimal. There are several closely connected versions of this relationship that quantify information by estimator variance or Fisher information for continuous estimates, or by threshold or discriminability for binary estimates^{27}—but the simplest is based on discriminability:
where \(d^{\prime}\) and \({d}_{{r}_{k}}^{\prime}\) are, respectively, the stimulus discriminability^{50} based on the behavior or on neuron k’s response r_{k} (see the “Methods” subsection “Nonlinear choice correlations”). This relationship holds for binary classification derived from a locally optimal linear estimator
for any stimulusindependent noise correlations, regardless of their structure.
Another way to test for optimal linear decoding would be to measure whether the animal’s behavioral discriminability matches the discriminability for an ideal observer of the neural population response. Yet this approach is not feasible, as it requires one to measure simultaneous responses of many, or even all, relevant neurons, with enough trials to reliably estimate their joint information content. In contrast, the optimality test (Eq. (3)) requires measuring only nonsimultaneous single neuron responses, which is vastly easier. Neural recordings in the vestibular system are consistent with nearoptimal decoding according to this prediction^{27}.
Nonlinear choice correlations for optimal decoding
When nuisance variables wash out the mean tuning of neuronal responses, we may well find that a single neuron has both zero choice correlation and zero information about the stimulus. The optimality test would thus be inconclusive.
This situation is exactly the same one that gives rise to nonlinear codes. A natural generalization of Eq. (3) can reveal the quality of neural computation on nonlinear codes. We simply define a ‘nonlinear choice correlation’ between the stimulus estimate \(\hat{s}\) and nonlinear functions of neural activity R(r):
(see the “Methods” subsection “Nonlinear choice correlations”), where R_{k}(r) is a nonlinear function of the neural responses. If the brain optimally decodes the information encoded in the nonlinear statistics of neural activity, according to the simple nonlinear extension to Eq. (4),
then the nonlinear choice correlation satisfies the equation
where \({d}_{{R}_{k}({{{{{{{\bf{r}}}}}}}})}^{\prime}\) is the stimulus discriminability provided by R_{k}(r) (see the “Methods” subsection “Optimality test”). This simple Equation (7) is the most important in this paper, and it is the basis of most predictions and intuitions we present in subsequent sections.
Equation (7) predicts that choice correlations of individual statistics will be stronger for more informative statistics, those with higher discriminability \({d}_{{R}_{k}}^{\prime}\). This reflects either stronger stimulus tuning of the statistic and/or lower variability of that statistic. This effect does not depend on whether the variability is shared. The variability that is not decoded dilutes the choice correlation; variability that is decoded increases it. Finally, choice correlations for optimal decoding can never be negative: if a statistic is tuned to increase with the stimulus, its fluctuations should correlate with choices that increase as well.
As an example of this relationship, we return to the orientation task. Here the response covariance Σ(s) = Cov(r∣s) depends on the stimulus, but the mean \({{{{{{{\bf{f}}}}}}}}=\left\langle {{{{{{{\bf{r}}}}}}}} s\right\rangle =\left\langle {{{{{{{\bf{r}}}}}}}}\right\rangle\) does not. In this model, optimally decoded neurons would have no linear correlation with behavioral choice. Instead, the choice should be driven by the product of the neural responses, R(r) = vec(rr^{⊤}), where vec( ⋅ ) is a vectorization that flattens an array into a onedimensional list of numbers. Such quadratic computation is what the energy model for complex cells is thought to accomplish for phaseinvariant orientation coding^{4}. Figure 3 shows linear and nonlinear choice correlations for pairs of neurons, defined as \({C}_{{r}_{i}{r}_{j}}={{{{{{{\rm{Corr}}}}}}}}({r}_{i}{r}_{j},\hat{s} s)\). When decoding is linear (a suboptimal strategy for this example), linear choice correlations are strong while nonlinear choice correlations are near zero (Fig. 3a, b). When the decoding is quadratic, here mediated by an intermediate layer that multiplies pairs of neural activity, the nonlinear choice correlations are strong while the linear ones are insignificant (Fig. 3c, d).
Redundant codes
It might seem unlikely that the brain uses optimal, or even nearoptimal, nonlinear decoding. Even if it does, there are an enormous number of highorder statistics for neural responses, so the information content in any one statistic could be tiny compared to the total information in all of them. For example, with N neurons there are on the order of N^{2} quadratic statistics, N^{3} cubic statistics, and so on. With so many statistics contributing information, the choice correlation for any single one would then be tiny according to the ratio in Eq. (7), and would be indistinguishable from zero with reasonable amounts of data. Past theoretical studies have described nonlinear (specifically, quadratic) codes with extensive information that grows proportionally with the number of neurons^{2,28}. This would indeed imply immeasurably small choice correlations for large, optimally decoded populations.
A resolution to these concerns is informationlimiting correlations^{3}. The past studies that derive extensive nonlinear information treat large cortical populations in isolation from the smaller sensory population that would naturally provide its input^{2,28}. Yet when a network inherits information from a much smaller input population, the expanded neural code becomes highly redundant: the brain cannot have more information than it receives^{51}. Noise in the input is processed by the same pathway as the signal, and this generates noise correlations that can never be averaged away^{3}.
The previous work^{3} characterized linear informationlimiting correlations for fine discrimination tasks by decomposing the noise covariance into \({{\Sigma }}={{{\Sigma }}}_{0}+\epsilon {{{{{{{\bf{f}}}}}}}}^{\prime} {{{{{{{{\bf{f}}}}}}}}^{\prime} }^{\top }\), where ϵ is the variance of the informationlimiting component and Σ_{0} is noise that can be averaged away with many neurons.
For nonlinear population codes, it is not just the mean responses that encode the signal, \({{{{{{{\bf{f}}}}}}}}(s)=\left\langle {{{{{{{\bf{r}}}}}}}} s\right\rangle\), but rather the nonlinear statistics \({{{{{{{\bf{F}}}}}}}}(s)=\left\langle {{{{{{{\bf{R}}}}}}}}({{{{{{{\bf{r}}}}}}}}) s\right\rangle\). Likewise, the noise does not comprise only secondorder covariance of r, Cov(r∣s), but rather the secondorder covariance of the relevant nonlinear statistics, Γ = Cov(R(r)∣s) (see the “Results” subsection “Signal and noise”). Analogous to the linear case, these correlations can be locally decomposed as
where ϵ is again the variance of the informationlimiting component, and Γ_{0} is any other covariance that can be averaged away in large populations, including internal noise and external nuisance variation. The informationlimiting noise bounds the estimator variance \({\sigma }_{\hat{s}}^{2}\) to no smaller than ϵ even with optimal decoding. Likewise, the Fisher information cannot exceed the value of 1/ϵ, and the discriminability \(d^{\prime}\) cannot exceed \(ds/\sqrt{\epsilon }\) for a stimulus change of ds^{3}. Neither additional cortical neurons nor additional decoded statistics can improve performance beyond this bound.
Consequences of optimal decoding on choice correlations
The simple formula of Eq. (7) provides useful insights into the relationship between neural activity and choice when that activity is decoded optimally in a natural task. First, the choice correlations do not depend on the shape or magnitude of the internal noise or nuisance correlations, because these dimensions are deliberately avoided by optimal decoding, whose weights cancel those correlations (Eq. (18)). The only aspect of shared variability that matters for choice is the informationlimiting component, i.e. that which is indistinguishable from a change in the taskrelevant stimulus. This informationlimiting variability increases choice correlations mostly by decreasing the overall behavioral discriminability \(d^{\prime}\); the shared variability is large only at the population level and typically only has a small contribution to any one statistic^{3,27} and thus to its discriminability \({d}_{k}^{\prime}\). With a smaller denominator and roughly unaffected numerator, the ratio in Eq. (7) rises with informationlimiting correlations.
Likewise, the total number of the decoded statistics only affects the prediction of Eq. (7) insofar as it affects the available information. This is especially important when considering nonlinear statistics, as there are potentially so many of them. When optimally decoded, greater numbers of independently informative statistics will increase the total discriminability exhibited by the behavioral choices, while the discriminability for each statistic remains unchanged. According to Eq. (7), choice correlations for optimal decoding are the ratio of these two, so as the behavioral behavioral \(d^{\prime}\) increases and the individual terms remain fixed, the choice correlations shrink. On the other hand, greater numbers of redundant statistics change neither the total information content nor the behavioral choice. These added statistics can have similar tuning and similar fluctuations as each other (which is what makes them redundant). Any single redundant statistic might or might not be decoded, but it is correlated with others that are. According to Eq. (7), the individual \({d}_{{R}_{k}}^{\prime}\) are unchanged when adding more redundant statistics; the total information \(d^{\prime}\) is unchanged; and thus their ratio is fixed, consistent with the unchanged choice correlations.
When there are many fewer sensory inputs than cortical neurons, as seen in the brain, many distinct statistics R_{k}(r) will carry redundant information. Under these conditions, many choice correlations \({C}_{{R}_{k}}\) can be quite large even for optimal nonlinear decoding: the discriminabilities \({d}_{{R}_{k}}^{\prime}\) of redundant statistics can be comparable to the discriminability \(d^{\prime}\) of the whole population, producing ratios \({d}_{{R}_{k}}^{\prime}/d^{\prime}\) that are a significant fraction of 1 (Fig. 4). Supplementary Information S.7.1 illustrates this effect for a redundant nonlinear population code: the brain need not decode all functions of all neurons to extract essentially all of the information (Fig. S6A), and neuroscientists need not compute choice correlations for all possible statistics to establish decoding efficiency (Fig. S6B).
Which nonlinear statistics?
If the brain’s decoder optimally uses all available information, choice correlations will obey the prediction of Eq. (7) even if the specific nonlinear statistics extracted by the brain’s decoder differ from those selected for evaluating choice correlations (see the “Methods” subsection “Nonlinear choice correlation to analyze an unknown nonlinearity”). The prediction is valid as long as the brain’s nonlinearity can be expressed as a linear combination of the tested nonlinearities (see the “Methods” subsection “Nonlinear choice correlation to analyze an unknown nonlinearity”). Since the brain needs complicated nonlinearities for complicated tasks, it may be difficult to find a suitable basis set for truly natural conditions; feature spaces from deep networks trained on comparable tasks might provide a useful basis^{38,52}. For the modest, controlledcomplexity tasks used in most neuroscience experiments^{53,54,55,56}, polynomials or other simple bases may be sufficient, even when individual neurons do not use polynomial operations.
Indeed, Fig. 5 shows a situation where information is encoded by linear, quadratic and cubic sufficient statistics of neural responses, but a simulated brain decodes them nearoptimally using a generic neural network rather than a set of nonlinearities matched to those sufficient statistics. Despite this mismatch we can successfully identify that the brain is nearoptimal by applying Eq. (7), even without knowing details of the simulated brain’s true nonlinear transformations.
Decoding efficiency revealed by choice correlations
Even if decoding is not strictly optimal, Eq. (7) can be approximately satisfied due to informationlimiting correlations. Decoders that seem substantially suboptimal because they fail to avoid the largest noise components in Γ_{0} can be nonetheless dominated by the bound from informationlimiting correlations. This will occur whenever the variability from suboptimally decoding the noise Γ_{0} is smaller than the informationlimiting variance ϵ. Just as we can decompose the nonlinear noise correlations into informationlimiting and other parts, we can decompose nonlinear choice correlations into corresponding parts as well, with the result that
where χ_{R} depends on the particular type of suboptimal decoding (Supporting Information S.3.2). The slope α between choice correlations and those predicted from optimality is given by the fraction of estimator variance explained by informationlimiting noise, \(\alpha =\epsilon /{\sigma }_{\hat{s}}^{2}\). This slope α therefore provides an estimate of the efficiency of the brain’s decoding.
Figure 4 shows an example of one decoder that would be suboptimal without redundancy, but is nonetheless close to optimal when information limits are imposed. This rescue of optimality does not happen for all decoders, however. The figure also shows another decoder that is so suboptimal that it throws away most of the available information even when there is substantial redundancy. The patterns of choice correlations reflect this.
In realistically redundant models with more cortical neurons than sensory inputs, many decoders could be nearoptimal, as we recently discovered in experimental data for a linear population code^{27}. However, even in redundant codes there may be substantial inefficiencies and information loss^{57}, especially for unnatural tasks^{58}, so it is scientifically interesting to discover nearoptimal nonlinear computation even in a redundant code.
Coarse versus fine, estimation versus classification
Our conceptual framework and predictions are most simply expressed for fine estimation tasks, where here we define ‘fine’ as a stimulus range over which the noise statistics do not vary with the stimulus. Some minor details of our predictions change when moving to binary classification instead of continuous estimation: this introduces a correction factor that depends on the response distribution (Section S.6.4).
More details change when moving to ‘coarse’ tasks, which we define as when noise statistics do change significantly with the stimulus. As for fine discrimination, we again find that when decoding is optimal, random fluctuations in choices are correlated with neural responses to the same degree that those responses can discriminate between stimuli. However, this relationship is slightly more complicated for coarse discrimination. For this reason we introduce a slightly more complicated measure of choice correlation that we call Normalized Average Conditional Choice Correlation (NACCC, Eq. (17)), which removes the stimulusinduced covariation between neuron and choice, and isolates only the remaining shared fluctuations that reflect the brain’s processing. However, the end result is the same: choice correlations for optimal decoding are equal to the ratio of discriminabilities (Eq. (7); Supplemental Information S.5). As for fine estimation, there is a correction factor of order 1 for binary choices instead of continuous estimation (see the “Methods” subsection “Optimality test”, Supplemental Information S.6.4, Eq. (185)).
Evidence for optimal nonlinear computation in macaque brains
We applied our optimality test to data recorded with Utah arrays from primate visual cortex (V1) during a nonlinear decoding task. Monkeys performed a TwoAlternative Forced Choice task (2AFC) in which they categorized an oriented drifting grating based on whether it came from a wide or narrow distribution of orientations^{59} (Fig. 6a, b). The categorical target variable s is therefore the variance of the orientation distribution. This coarse binary discrimination task is a simplified version of a task that might arise in nature when identifying a surface texture or material^{60}; the orientation of the material would be a nuisance variable independent of the material type. Here the observable variable, the orientation, is the product of the target variable and a nuisance variable ν. An additional nuisance variable was the stimulus contrast varying independently of the stimulus variance, although here we only analyze the highest contrast.
Below we analyze whether the trialbytrial nonlinear statistics of V1 multiunit neural responses to these stimuli provide information about the taskrelevant category and the behavioral choice, and whether these two informations are correlated as predicted by our optimal decoding theory. Since the marginal response statistics depend substantially on the stimulus category, at least in part because the corresponding nuisance distributions differ, this is a coarse binary discrimination task. For this reason we test suitable optimal decoding predictions about nonlinear choice correlations in coarse tasks, using the NACCC measure that we outlined in the section “Coarse versus fine, estimation versus classification” and derived in the “Methods” subsection “Application to neural data” and Supplemental Information S.6.
V1 responses contain information about orientation^{61}. Here we found that V1 responses also contain some linear information about the orientation variance (Fig. 6c, blue; \(d^{\prime}\) calculated by Eq. (20)). This implies that within their receptive field they have already performed some nonlinear transformations of the input that are useful for estimating the orientation variance. However, we expect that nonlinear computations downstream can extract still more information. Note that for a fixed contrast, an optimal computation based on the stimulus is simply to threshold the squared deviation from the mean orientation. Because neural responses in this brain area can be linearly decoded to compute orientation, a good downstream decoder for the orientation variance would naturally be quadratic in those responses.
Indeed, we found information in the quadratic statistics of neural responses, \(\delta {r}_{i}^{2}\) and δr_{i}δr_{j} (Fig. 6c, red and green), verifying that downstream nonlinear computations could extract additional information from the neural responses. To isolate the nonlinear information we eliminated the linear stimulus dependence of the response, computing neural nonlinear statistics according to \(\delta {r}_{i}={r}_{i}\langle {r}_{i} {\hat{s}}_{1}\rangle\), where \({\hat{s}}_{1}={{{{{{{{\bf{w}}}}}}}}}_{{{{{{{{\rm{opt}}}}}}}}}\cdot {{{{{{{\bf{r}}}}}}}}+c\) is the optimal estimate decoded only from a linear combination of available neural responses.
These quadratic statistics also contained substantial nonlinear information about the behavioral choice (Fig. 6d). In general, there is no guarantee that the particular nonlinear statistics that are informative about the stimulus are also informative about the choice. However, our theory of optimal decoding predicts specifically that these quantities should be directly proportional to each other. Indeed, in two monkeys, we found that nonlinear choice correlations were highly correlated with nonlinear stimulus information (Fig. 6e). Remarkably, when we compare the measured nonlinear choice correlations to the ratio of discriminabilities after adjusting for the binary data (see the “Methods” subsection “Application to neural data”), the slopes of this relationship for the two animals were near the value of 1 that Eq. (7) predicts for optimal decoding (Fig. 6e).
Monkey 2 performed slightly worse than an ideal observer, with a probability correct of 0.76, compared to the ideal of 0.82 (see the “Methods” subsection “Application to neural data”)—even while its decoding was nearoptimal, with an efficiency according to Eq. (9) of 0.96 ± 0.04 (mean ± 95% confidence intervals). Even at the level of individual sessions, this is consistent with optimal decoding, with efficiencies not significantly different from 1 (p = 0.26, onetailed ttest). This suggests that information is lost in the encoding stage somewhere between the stimulus and the recorded neurons, and not downstream of those neurons. Monkey 1 had similar overall performance (probability correct of 0.74) but worse decoding efficiency (0.75 ± 0.08). Across sessions with reliable slopes (positive coefficients of determination, 77/119 sessions), the efficiencies were significantly different from 1 (p < 10^{−6}, onetailed ttest). This suggests the second monkey’s task performance has limitations arising downstream of the recorded neurons.
Controls to find the origins of choice correlations
To evaluate whether internal noise correlations contribute nonlinear information or choice correlations, we created a shuffled data set that removed internal noise correlations while preserving external nuisance correlations. That is, for each neuron we independently selected responses to highcontrast trials with matched target stimulus (variance), nuisance (orientations within ±1.5), and choice, and repeated our analysis on these shuffled data (Fig. 6f). The observed relationship between predicted and observed choice correlations was the same as in the original test, indicating that nuisance variations were sufficient to drive the nonlinear information and decoding.
We then shuffled the external nuisance correlations by randomly selecting responses to trials with matched target stimulus and choice, but now using unmatched nuisance variables, and again repeated the analysis (Fig. 6g). In other words, we picked responses from different trials that came from the same signal category (wide or narrow) and elicited the same choice but had different orientations, and we picked these trials (and thus their stimulus orientations) independently for neurons i and j. The strong statistical relationship observed between predicted and measured nonlinear choice correlations vanished with this shuffling, indicating that the nuisance variation was necessary for the nonlinear information and nonlinear decoding.
These shuffle controls removed noise correlations and nuisance correlations, respectively. Combining the conclusions from these controls, we find no evidence that the brain optimally decodes any stimulusdependent internal noise correlations in this task.
We looked directly for stimulusdependent internal noise correlations by conditioning on both the signal and the nuisance variable (which here is simply the single number, orientation) and measuring orientationdependent response covariances. The resultant nonlinear tuning was quite weak compared with the trialtotrial variability in those nonlinear statistics, and available nonlinear information arose largely in changing variances rather than covariances; likely arising from Poisson statistics and tuning of the mean firing (Supplementary Fig. S5A, B). Internal noise fluctuations in those directions were not significantly correlated with choice (Supplementary Fig. S5C, p = 0.088, 0.830, 0.969 for linear, square, and cross terms for monkey 1; p = 0.073, 0.094, 0.573 for linear, square, and cross terms for monkey 2 using a twosample Kolmogorov–Smirnov test).
Recent analyses of these same data found that internal noise did in fact influence the monkeys’ behavioral choices^{62}, but this effect was subtle and only apparent when examining the entire neural population simultaneously with a complex trained nonlinearity. In our analysis this effect is buried in the noise, so our method is not sensitive enough to tell if these largescale patterns induced by internal noise are used optimally or suboptimally. Additionally, in this work we analyzed a subset of trials with the highest contrasts and it is possible that at lower contrasts internal noise has a greater influence. However, we can detect that the brain contains information that is encoded nonlinearly due to external nuisance variation, and that this information is indeed decoded nearoptimally by the brain.
Discussion
This study introduced a theory of nonlinear population codes, grounded in the natural computational task of separating relevant and irrelevant variables. The theory considers both encoding and decoding—how stimuli drive neurons, and how neurons drive behavioral choices. It shows how correlated fluctuations between neural activity and behavioral choices could reveal the efficiency of the brain’s decoding. Unlike previous theories of nonlinear population codes^{2,28}, ours remains consistent with biological constraints due to the large cortical expansion of sensory representations by incorporating redundancy through a nonlinear generalization of informationlimiting correlations^{3}. Also unlike past work which largely concentrates on encoding efficiency, we provide mathematical methods to quantify the brain’s nonlinear decoding efficiency. When we applied this method to the neural responses of monkeys performing a discrimination task in which neural statistics were dominated by nuisance variation, we found quantitative results consistent with efficient nonlinear decoding of V1 activity.
The best condition to apply our optimality test is in a task of modest complexity but still possessing fundamentally nonlinear structure. Some interesting examples where our test could have practical relevance include motion detection using photoreceptors^{63}, visual search with distractors (XORtype tasks)^{30,64}, sound localization in early auditory processing before the inferior colliculus^{65}, or context switching in higherlevel cortex^{55}.
Optimal nonlinearities extract the sufficient statistics about the relevant stimulus. These statistics depend not only on the task but also on the nuisance variables. In complex tasks, like recognizing objects from images, nuisance variables push most of the relevant information into higherorder statistics which require more complex nonlinearities to extract. In such highdimensional cases, our proposed test is unlikely to be useful. This is because our method expresses stimulus estimates as sums of nonlinear functions, and while that is universal in principle^{66}, that is not a compact way to express the complex nonlinearities of deep networks. Relatedly, it may be difficult to see statistically significant information or choice correlations for nonlinear statistics that provide many important but small contributions to the behavioral output. Since many stimulusdependent response correlations are induced by external nuisance variation, not internal noise, we might not find informative stimulusdependent noise correlations upon repeated presentations of a fixed stimulus. Indeed, our analysis found no evidence of internal noise generating nonlinear choice correlations (Fig. 6). Those correlations may only be informative about a stimulus in the presence of natural nuisance variation. For example, if a picture of a face is shown repeatedly without changing its pose, then small expression changes can readily be identified by linear operations; if the pose varies then the stimulus is only reflected in higherorder correlations^{9}.
In contrast, we should see some nonlinear choice correlations even when nuisance variables are fixed. This is because neural circuitry must combine responses nonlinearly to eliminate natural nuisance variation, and any internal noise passing through those same channels will thereby influence the choice. Although they may be smaller and more difficult to detect than the fluctuations caused by the nuisance variation, this influence will manifest as nonlinear choice correlations. In other words, nonlinear noise correlations need not predict a fixed stimulus, but they may predict the choice (Supplementary Information S.4).
Our approach is currently limited to spatial feedforward processing, which unquestionably oversimplifies cortical processing. The approach can be generalized to recurrent networks by considering spatiotemporal statistics^{67}. Feedback could also cause suboptimal networks to exhibit choice correlations that seem to resemble the optimal prediction. If the feedback is noisy and projects into the same direction that encodes the stimulus, such as from a dynamic bias^{68,69,70}, then this could appear as informationlimiting correlations, enhancing the match with Eq. (7). This situation could be disambiguated by measuring the internal noise source providing the feedback, though of course this would require more simultaneous measurements.
In principle our method can also be applied to temporal neural response properties like spike timing or time series. For optimal processing, spike timing that is tuned to taskrelevant stimuli should also be correlated with resultant choices, even if the timing is converted to rate codes by downstream processing. On the other hand, it is more difficult to track behavioral consequences of spatiotemporal correlations that evolve through a recurrent network^{67} with dynamic outputs, as in motor control applications. It should be fruitful to develop this theory further for more complex tasks involving time sequences of actions.
Our method to understand nonlinear neural decoding requires neural recordings in a behaving animal. The task must be hard enough that it makes some errors, so that there are behavioral fluctuations to explain. Finally, there should be a modest number of nonlinearly entangled nuisance variables. Unfortunately, many neuroscience experiments are designed without explicit use of nuisance variables. Although this simplifies the analysis, this simplification comes at a great cost, which is that the neural circuits are being engaged far from their natural operating point, and far from their purpose: there is little hope of understanding neural computation without challenging the neural systems with nonlinear tasks for which they are required. In this context, it is especially noteworthy that a mismatch between choice correlations and the optimal pattern might not indicate that the brain is suboptimal, but instead that the nuisance variation in the experimental task may not match the natural tasks the brain has learned. For this reason it is important for neuroscience to use natural tasks, or at least naturalistic ones, when aiming to understand computational function^{71,72,73}.
Methods
Orientation estimation with varying spatial phase
Figure 1 illustrates how nuisance variation can eliminate a neuron’s mean tuning to relevant stimulus variables, relegating the neural tuning to higherorder statistics like covariances. In this example, the subject estimates the orientation of a Gabor image, G(x∣s, ν), where x is spatial position in the image, and s and ν are the orientation and spatial phase (nuisance) of the image, respectively (Supplementary Material S.1.1). The model visual neurons are linear Gabor filters like idealized simple cells in primary visual cortex, corrupted by additive white Gaussian noise. Their responses are thus distributed as r ~ P(r∣s, ν) = N(r∣f(s, ν), ϵI), where ϵ is the noise variance and the mean \({{{{{{{\bf{f}}}}}}}}(s,\nu )=\left\langle {{{{{{{\bf{r}}}}}}}} s,\nu \right\rangle ={\sum }_{{{{{{{{\bf{r}}}}}}}}}{{{{{{{\bf{r}}}}}}}}\ p({{{{{{{\bf{r}}}}}}}} s,\nu )\) is determined by the overlap between the image and the receptive field.
When the spatial phase ν is known, the mean neural response contains all the information about orientation s. The brain can decode responses linearly to estimate orientation near a reference s_{0}.
When the spatial phase varies, however, each mean response to a fixed orientation will be combined across different phases: \({{{{{{{\bf{f}}}}}}}}(s)=\left\langle {{{{{{{\bf{r}}}}}}}} s\right\rangle ={\sum }_{{{{{{{{\bf{r}}}}}}}}}{{{{{{{\bf{r}}}}}}}}\ p({{{{{{{\bf{r}}}}}}}} s)=\int d\nu \ {\sum }_{{{{{{{{\bf{r}}}}}}}}}{{{{{{{\bf{r}}}}}}}}\ p({{{{{{{\bf{r}}}}}}}} s,\nu )p(\nu )\). Since each spatial phase can be paired with another phase π radians away that inverts the linear response, the phaseaveraged mean is f(s) = 0. Thus the brain cannot estimate orientation by decoding these neurons linearly; nonlinear computation is necessary.
The covariance provides one such tuned statistic. We define Cov_{ij}(r∣s, ν) as the neural covariance for a fixed input image (noise correlations), and Cov_{ij}(r∣s) as the neural covariance when the nuisance varies (nuisance correlations). According to the law of total covariance,
where \(\delta {f}_{i}(s,\nu )={f}_{i}(s,\nu ){\left\langle {f}_{i}(s,\nu )\right\rangle }_{\nu }\). Supplementary Information S.1.1 shows in detail how Cov_{ij}(r∣s) is tuned to s.
Exponential family distribution and sufficient statistics
It is illuminating to assume the response distribution conditioned on the relevant stimulus (but not on nuisance variables) is approximately a member of the exponential family with nonlinear sufficient statistics,
where R(r) is a vector of sufficient statistics for the natural parameter H(s), b(r) is the base measure, and A(s) is the logpartition function. In this case, a finite number of sufficient statistics contains all of the information about the stimulus in the population response, and all other tuned statistics may be derived from them.
Estimation and inference are closely connected in the exponential family. In Supplementary Material S.1.2.2, we show that the optimal local estimation can be achieved by linearly decoding the nonlinear sufficient statistics, \(\hat{s}={{{{{{{{\bf{w}}}}}}}}}^{\top }{{{{{{{\bf{R}}}}}}}}({{{{{{{\bf{r}}}}}}}})+c\). The decoding weights minimize the variance of an unbiased decoder,
where \({{{{{{{{\bf{F}}}}}}}}}^{\prime}=\partial \langle {{{{{{{\bf{R}}}}}}}}({{{{{{{\bf{r}}}}}}}}) s\rangle /\partial s\) is the sensitivity of the statistics to changing inputs, and Γ = Cov(R∣s) is the stimulusconditioned response covariance—which generally includes nuisance correlations (see the section “Signal and noise”).
Quadratic encoding
In a quadratic coding model, the distribution of neural responses is described by the exponential family with up to quadratic sufficient statistics, R(r) = {r_{i}, r_{i}r_{j}} for i, j ∈ {1, …, N}. A familiar example is the Gaussian distribution with stimulusdependent covariance Σ(s). In order to demonstrate the coding properties of a purely nonlinear neural code, here we assume that the mean tuning curve f(s) is constant, while the stimulusconditional covariances Σ_{ij}(s) depend smoothly on the stimulus. We can quantify the information content of the neural population using Eq. (61).
Cubic encoding
In our cubic coding model, the distribution of neural responses is described by the exponential family with up to cubic sufficient statistics, R(r) = {r_{i}, r_{i}r_{j}, r_{i}r_{j}r_{k}} for i, j, k ∈ {1, …, N}.
We approximate a threeneuron cubic code first using purely cubic components, and we then apply a stimulusdependent affine transformation to include linear and quadratic statistics. The pure cubic code is used for a vector z with sufficient statistics z_{i}z_{j}z_{k} (and a base measure \({e}^{\parallel {{{{{{{\bf{z}}}}}}}}{\parallel }^{4}}\) to ensure the distribution is bounded and normalizable).
We approximate this distribution by a mixture of four Gaussians. The mixture is chosen to reproduce the tetrahedral symmetry of the cubic distribution (Supplementary Fig. S1), which allows the cubic statistics of responses to be stimulus dependent, leaving stimulusindependent quadratic and linear statistics.
To generate larger multivariate cubic codes for Supplementary Fig. S1, for simplicity we assume the pure cubic terms only couple disjoint triplets of variables, and sample independently from an approximately cubic distribution for each triplet. To convert this purely cubic distribution to a distribution with linear and quadratic information, we shift and scale these cubic samples z in a manner dependent on s:
where f(s) and Σ(s) describes the desired signaldependent mean and covariance (see Supplementary Material S.1.4).
Nonlinear choice correlations
For fine discrimination tasks, the nonlinear choice correlation between the stimulus estimate \(\hat{s}={{{{{{{{\bf{w}}}}}}}}}^{\top }{{{{{{{\bf{R}}}}}}}}+c\) and one nonlinear function R_{k} (the kth element of the vector R) of recorded neural activity r is
where \({{{{{{{{\bf{w}}}}}}}}}^{\top }{{\Gamma }}{{{{{{{\bf{w}}}}}}}}={\sigma }_{\hat{s}}^{2}\) is the estimator variance.
When the relevant response statistics change appreciably over the stimulus range used in the task, such as for the coarse variance discrimination task in the section “Evidence for optimal nonlinear computation in macaque brains”), the relevant quantities change slightly. The optimal linear decoder of nonlinear statistics, \(\hat{s}={{{{{{{\bf{w}}}}}}}}\cdot {{{{{{{\bf{R}}}}}}}}+c\), has weights obtained through linear regression:
where \(\bar{{{\Gamma }}}={\left\langle {{{{{{{\rm{Cov}}}}}}}}({{{{{{{\bf{R}}}}}}}} s)\right\rangle }_{s}\) is the average conditional covariance between R given the stimulus s. The differences from Eq. (12) are \({{\Gamma }}\to \bar{{{\Gamma }}}\) and \({{{{{{{\bf{F}}}}}}}}^{\prime} =d{{{{{{{\bf{F}}}}}}}}/ds\to {{\Delta }}{{{{{{{\bf{F}}}}}}}}/{{\Delta }}s\).
These differences are reflected in a slightly modified measure of correlation that we call normalized average conditional choice correlations (NACCC),
\({B}_{{R}_{k}}\) is actually a correlation coefficient based on the average conditional covariance \(\bar{{{\Gamma }}}\), and is bounded in absolute value by 1. As the stimulus range in a coarse task decreases, and the noise distribution p(R∣s) becomes independent of the stimulus, then Eq. (17) converges toward Eq (15).
The choice correlation for binary choices differs slightly from that for continuous estimation, for both fine and coarse discrimination tasks, by a factor ζ that is typically of order 1 (Supplementary Materials S.6.1).
Optimality test
Substituting the optimal weights (Eq. (12)) into Eq. (15), the optimal nonlinear choice correlation becomes
where \({d}_{{R}_{k}({{{{{{{\bf{r}}}}}}}})}^{\prime}={F}_{k}^{\prime}{{\Delta }}s/\sqrt{{{{\Gamma }}}_{kk}}\) is the fine discriminability provided by R_{k}(r) for a stimulus difference of Δs. The same argument holds for coarse discrimination, where \(\bar{{{\Gamma }}}\) in Eq. (17) is canceled by \({\bar{{{\Gamma }}}}^{1}\) in the optimal weights (Eq. (16)), yielding \({B}_{{R}_{k}({{{{{{{\bf{r}}}}}}}})}^{{{{{{{{\rm{opt}}}}}}}}}={d}_{{R}_{k}}^{\prime}/d^{\prime}\).
For finescale discrimination, optimal choice correlations can be written in many equivalent ways that reflect the simple relationships between four quantities often used to represent information: discriminability dprime is proportional to the square root of the Fisher information \(d^{\prime} ={{\Delta }}s\sqrt{J}\) ^{74}; estimator variance is bounded by the inverse of the Fisher information, \({\sigma }_{\hat{s}}^{2}\ge 1/J\); discrimination threshold is proportional to the estimator standard deviation, \(\theta =\sqrt{{\sigma }_{\hat{s}}^{2}}\) with proportionality given by the threshold condition.
In different experiments (binary discrimination, continuous estimation), it can be most natural to express this optimal decoding prediction as ratios of different measured quantities:
These quantities reflect information between the stimulus and the neural or behavioral responses. Supplemental material S.5 shows how this can be computed easily for general binary discrimination using the total correlation between the responses and the stimuli, \({D}_{{R}_{k}}={{{{{{{\rm{Corr}}}}}}}}({R}_{k},s)\), or a continuously varying behavioral choice \(\hat{s}\) and the stimuli, \({D}_{\hat{s}}={{{{{{{\rm{Corr}}}}}}}}(\hat{s},s)\):
and likewise for \({d}_{{R}_{k}}^{\prime}\). When the behavioral choice is binary rather than continuous, the correlations are modified by a factor δ near 1 (Supplemental Information S.6.3, Eq. (182)). For our experimental conditions, δ ≈ 1.2 ± 0.2.
Nonlinear choice correlation to analyze an unknown nonlinearity
In Fig. 5, we generated neural responses given sufficient statistics that are polynomials up to third order, R(r) = {r_{i}, r_{i}r_{j}, r_{i}r_{j}r_{k}} (see the “Methods” subsection “Cubic encoding”). Our model brain decodes the stimulus using a cascade of linear–nonlinear transformations, with Rectified Linear Units (\({{{{{{{\rm{ReLU}}}}}}}}(x)=\max (0,x)\)) for the nonlinear activation functions. We used a fully connected ReLU network with two hidden layers and 30 units per hidden layer. We trained the network weights and biases with backpropagation to estimate stimuli near a reference s_{0} based on 20,000 training pairs (r, s) generated by the cubic encoding model. This trained neural network extracted 91% of the information available to an optimal decoder.
Informationlimiting correlations
Only specific correlated fluctuations limit the information content of large neural populations^{3}. These fluctuations can ultimately be referred back to the stimulus as r ~ p(r∣s + ds), where ds is zero mean noise, whose variance 1/J_{∞} determines the asymptotic variance of any stimulus estimator. These informationlimiting correlations for nonlinear computation can be characterized by the covariance of the sufficient statistics, Γ = Cov(R∣s) conditioned on s; the informationlimiting component arises specifically from the signal covariance Cov(F(s)∣s). Since the signal for local estimation of stimuli near a reference s_{0} is \({{{{{{{\bf{F}}}}}}}}^{\prime} (s)=\frac{d}{ds}\left\langle {{{{{{{\bf{R}}}}}}}}({{{{{{{\bf{r}}}}}}}}) s\right\rangle\), the informationlimiting component of the covariance is proportional to \({{{{{{{\bf{F}}}}}}}}^{\prime} {{{{{{{{\bf{F}}}}}}}}^{\prime} }^{\top }\):
Here Γ_{0} is any covariance of R that does not limit information in large populations. Substituting this expression into (Eq. (61)) for the nonlinear Fisher Information, we obtain
where \({J}_{0}={{{{{{{{\bf{F}}}}}}}}}^{\prime}{{{\Gamma }}}_{0}^{1}{{{{{{{{\bf{F}}}}}}}}}^{\prime}\) is the nonlinear Fisher Information allowed by Γ_{0}. When the population size grows, the extensive information term J_{0} grows proportionally, so the output information will asymptote to J_{∞}.
Application to neural data
All behavioral and electrophysiological data were obtained from two healthy, male rhesus macaque (Macaca mulatta) monkeys (L and T) aged 10 and 7 years and weighting 9.5 and 15.1 kg, respectively. All experimental procedures complied with guidelines of the NIH and were approved by the Baylor College of Medicine Institutional Animal Care and Use Committee (permit number: AN4367). Animals were housed individually in a room located adjacent to the training facility on a 12 h light/dark cycle, along with around 10 other monkeys permitting rich visual, olfactory, and auditory social interactions. Regular veterinary care and monitoring, balanced nutrition and environmental enrichment were provided by the Center for Comparative Medicine of Baylor College of Medicine. Surgical procedures on monkeys were conducted under general anesthesia following standard aseptic techniques.
Monkeys faced a TwoAlternative Forced Choice (2AFC) to guess whether an oriented drifting grating stimulus came from a narrow or wide distribution of orientations, centered on zero with standard deviations σ_{+} = 15^{∘} and σ_{−} = 3^{∘}. Visual contrast was set to 64%. Each trial was initiated by a beeping sound and the appearance of a fixation target (0.15^{∘} visual angle) in the center of the screen. The monkey fixated on a fixation target for 300 ms within 0.5^{∘}–1^{∘} visual angle. The stimulus appeared at the center of the screen. After 500 ms, colored targets appeared randomly on the left and right, and the monkey then saccades to one of these targets to indicate its choice (red and green targets correspond to narrow and wide distributions).
After the monkey was fully trained, we implanted a 96electrode microelectrode array (Utah array, Blackrock Microsystems, Salt Lake City, UT, USA) with a shaft length of 1 mm over parafoveal area V1 on the right hemisphere. The neural signals were preamplified at the head stage by unity gain preamplifiers (HS27, Neuralynx, Bozeman MT, USA). These signals were then digitized by 24bit analog data acquisition cards with 30 dB onboard gain (PXI4498, National Instruments, Austin, TX) and sampled at 32 kHz. The spike detection was performed offline according to a previously described method^{12,75}. For each behavioral session and in both monkeys, 95 multiunit neural responses r_{k} were measured by spike counts in the 500 ms preceding the saccade target onset.
The animals did not perform well on all days, so for further analysis we selected sessions where the performance exceeded 0.7 for monkey 1 (85% of all sessions) and 0.75 for monkey 2 (68% of all sessions).
The neural data from the two monkeys is of comparable quality, although the monkey with higher task accuracy (monkey 2) performs more trials and has more significantly tuned neurons. When we resampled from both datasets to control for the number of trials and tuned neurons, and then used comparable datasets to do nonlinear choice correlation analysis, we found similar decoding efficiencies for two monkeys as was reported in the section “Evidence for optimal nonlinear computation in macaque brains” (data not shown).
The taskrelevant stimulus s is the large or small variance \({s}_{\pm }={\sigma }_{\pm }^{2}\) of the distribution over orientations. The orientation ϕ is a variable jointly determined by the taskrelevant stimulus and a multiplicative nuisance variable ν through \(\phi =\sqrt{s}\nu\), with \(\nu \sim {{{{{{{\mathcal{N}}}}}}}}(0,1)\). If the orientation itself can be estimated locally from linear functions of the neural responses, then the stimulus can be decoded quadratically from those neural responses according to \(\hat{s}={\hat{\phi }}^{2}\). A binary classification of the variance is given by \({\hat{s}}_{\pm }={{{{{{\mathrm{sgn}}}}}}}\,\ ({\hat{\phi }}^{2}{\theta }^{2})\) where θ is the animal’s orientation threshold. This threshold is optimal where the two stimuli are equally probable: p(ϕ∣s_{+}) = p(ϕ∣s_{−}), implying that \({\theta }_{{{{{{{{\rm{opt}}}}}}}}}^{2}=\left({{{{{{\mathrm{log}}}}}}}\,{s}_{+}{{{{{{\mathrm{log}}}}}}}\,{s}_{}\right)/\left({s}_{}^{1}{s}_{+}^{1}\right)\). The probability of correctly guessing the orientation variance is \(\frac{1}{2}\left(p({\hat{s}}_{\pm }=+ {s}_{+})+p({\hat{s}}_{\pm }= {s}_{})\right)\), where these probabilities can be computed from the cumulative normal distribution on the correct side of the optimal orientation threshold, \(p({\hat{s}}_{\pm }=+ {s}_{+})=2\int\nolimits_{{\theta }_{{{{{{{{\rm{opt}}}}}}}}}}^{\infty }d\phi \ p(\phi  {s}_{+})={{{{{{\mathrm{erfc}}}}}}}\,\left({\theta }_{{{{{{{{\rm{opt}}}}}}}}}/\sqrt{2{s}_{+}}\right)\); similarly, \(p({\hat{s}}_{\pm }= {s}_{})=1{{{{{{\mathrm{erfc}}}}}}}\,({\theta }_{{{{{{{{\rm{opt}}}}}}}}}/\sqrt{2{s}_{}})\). Using values of s_{±} for our task, this gives an optimal fraction correct of 0.82.
We computed choice correlations using NACCC (Eq. (17)), and discriminability based on total correlations between stimulus and response (Eq. (20)). We adjusted the optimal prediction by constant factors ζ and δ to account for binary choices using the equations in Supplement S.6.4, with thresholds estimated by logistic regression between choice and the absolute value of the stimulus orientation. We estimated the slopes of the relationship between measured and predicted choice correlation using the angle of the principal component of the bivariate data. We computed standard deviations for these quantities by bootstrapping 100 times.
For our two shuffle controls testing whether correlations between neurons were informative about the stimulus or choice, we selected responses independently from \({r}_{i} \sim p({r}_{i} s,\phi ,\hat{s})\) (Fig. 6f) or \({r}_{i} \sim p({r}_{i} s,\hat{s})\) (Fig. 6g). We evaluate statistical significance of the measured and predicted optimal choice correlations using pvalues for null distributions based on 100 shuffled choices and 100 shuffled stimuli, while preserving correlations between neural responses. Both null distributions are approximately Gaussian with zero means, so we compute the pvalue of the choice correlations with respect to the corresponding Gaussian, \(p=1\frac{1}{2}{{{{{{\mathrm{erfc}}}}}}}\,( x /\sqrt{2{\sigma }_{x}})\) where x is the quantity of interest and σ_{x} is its standard deviation (Fig. 6c, d).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Code availability
All custom code used for electrophysiology data collection and data processing are made publicly available at github.com/atlab. Experimental data for Fig. 6 and code used for analysis and figure generation are available for download from github.com/xaqlab/nonlinear_choice_correlation.
References
Shamir, M. & Sompolinsky, H. Nonlinear population codes. Neural Comput. 16, 1105–1136 (2004).
Ecker, A. S., Berens, P., Tolias, A. S. & Bethge, M. The effect of noise correlations in populations of diversely tuned neurons. J. Neurosci. 31, 14272–14283 (2011).
MorenoBote, R. et al. Informationlimiting correlations. Neuroscience 17, 1410–1417 (2014).
Adelson, E. H. & Bergen, J. R. Spatiotemporal energy models for the perception of motion. Josa a 2, 284–299 (1985).
DiCarlo, J. J. & Cox, D. D. Untangling invariant object recognition. Trends Cogn. Sci. 11, 333–341 (2007).
Pinto, N., Cox, D. D. & DiCarlo, J. J. Why is realworld visual object recognition hard? PLoS Comput. Biol. 4, e27 (2008).
Rust, N. C. & DiCarlo, J. J. Selectivity and tolerance (“invariance”) both increase as visual information propagates from cortical area v4 to it. J. Neurosci. 30, 12978–12995 (2010).
Pagan, M., Urban, L. S., Wohl, M. P. & Rust, N. C. Signals in inferotemporal and perirhinal cortex suggest an untangling of visual target information. Nat. Neurosci. 16, 1132 (2013).
Meyers, E. M., Borzello, M., Freiwald, W. A. & Tsao, D. Intelligent information loss: the coding of facial identity, head pose, and nonface information in the macaque face patch system. J. Neurosci. 35, 7069–7081 (2015).
Anselmi, F., Patel, A. & Rosasco, L. Neurally plausible mechanisms for learning selective and invariant representations. J. Math. Neurosci. 10, 1–15 (2020).
Ecker, A. S. et al. Decorrelated neuronal firing in cortical microcircuits. Science 327, 584–587 (2010).
Ecker, A. S. et al. State dependence of noise correlations in macaque primary visual cortex. Neuron 82, 235–248 (2014).
Denfield, G. H., Ecker, A. S., Shinn, T. J., Bethge, M., & Tolias, A. S. Attentional fluctuations induce shared variability in macaque primary visual cortex. Nat. commun. 9, 1–14 (2018).
Ecker, A. S., Denfield, G. H., Bethge, M. & Tolias, A. S. On the structure of neuronal population activity under fluctuations in attentional state. J. Neurosci. 36, 1775–1789 (2016).
Bondy, A. G., Haefner, R. M. & Cumming, B. G. Feedback determines the structure of correlated variability in primary visual cortex. Nat. Neurosci. 21, 598–606 https://doi.org/10.1038/s4159301800891 (2018).
Averbeck, B. B., Latham, P. E. & Pouget, A. Neural correlations, population coding and computation. Nat. Rev. Neurosci. 7, 358 (2006).
Cohen, M. R. & Kohn, A. Measuring and interpreting neuronal correlations. Nat. Neurosci. 14, 811 (2011).
Kohn, A., CoenCagli, R., Kanitscheider, I. & Pouget, A. Correlations and neuronal population information. Annu. Rev. Neurosci. 39, 237–256 (2016).
Abbott, L. F. & Dayan, P. The effect of correlated variability on the accuracy of a population code. Neural Comput. 11, 91–101 (1999).
Cohen, M. R. & Maunsell, J. H. Attention improves performance primarily by reducing interneuronal correlations. Nat. Neurosci. 12, 1594 (2009).
Cohen, M. R. & Newsome, W. T. Estimates of the contribution of single neurons to perception depend on timescale and noise correlation. J. Neurosci. 29, 6635–6648 (2009).
Gawne, T. J. & Richmond, B. J. How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci. 13, 2758–2771 (1993).
Beck, J., Bejjanki, V. R. & Pouget, A. Insights from a simple expression for linear fisher information in a recurrently connected population of spiking neurons. Neural Comput. 23, 1484–1502 (2011).
Paradiso, M. A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern. 58, 35–49 (1988).
Zohary, E., Shadlen, M. N. & Newsome, W. T. Correlated neuronal discharge rate and its implications for psychophysical performance. Nature 370, 140–143 (1994).
Sompolinsky, H., Yoon, H., Kang, K. & Shamir, M. Population coding in neuronal systems with correlated noise. Phys. Rev. E 64, 051904 (2001).
Pitkow, X., Liu, S., Angelaki, D. E., DeAngelis, G. C. & Pouget, A. How can single sensory neurons predict behavior? Neuron 87, 411–423 (2015).
Shamir, M. & Sompolinsky, H. Implications of neuronal diversity on population coding. Neural Comput. 18, 1951–1986 (2006).
Burge, J. & Jaini, P. Accuracy maximization analysis for sensoryperceptual tasks: Computational improvements, filter robustness, and coding advantages for scaled additive noise. PLoS Comput. Biol. 13, e1005281 (2017).
Pagan, M., Simoncelli, E. P. & Rust, N. C. Neural quadratic discriminant analysis: nonlinear decoding with v1like computation. Neural Comput. 28, 2291–2319 (2016).
Gutnisky, D. A. & Dragoi, V. Adaptive coding of visual information in neural populations. Nature 452, 220 (2008).
Kohn, A. & Smith, M. A. Stimulus dependence of neuronal correlation in primary visual cortex of the macaque. J. Neurosci. 25, 3661–3673 (2005).
Averbeck, B. B. & Lee, D. Effects of noise correlations on information encoding and decoding. J. Neurophysiol. 95, 3633–3644 (2006).
Ohiorhenuan, I. E. et al. Sparse coding and highorder correlations in finescale cortical networks. Nature 466, 617 (2010).
PonceAlvarez, A., Thiele, A., Albright, T. D., Stoner, G. R. & Deco, G. Stimulusdependent variability and noise correlations in cortical mt neurons. Proc. Natl Acad. Sci. USA 110, 13162–13167 (2013).
Rigotti, M. et al. The importance of mixed selectivity in complex cognitive tasks. Nature 497, 585–590 (2013).
Pagan, M. & Rust, N. C. Dynamic target match signals in perirhinal cortex can be explained by instantaneous computations that act on dynamic input from inferotemporal cortex. J. Neurosci. 34, 11067–11084 (2014).
Yamins, D. L. et al. Performanceoptimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).
Britten, K. H., Newsome, W. T., Shadlen, M. N., Celebrini, S. & Movshon, J. A. A relationship between behavioral choice and the visual responses of neurons in macaque mt. Vis. Neurosci. 13, 87–100 (1996).
Shadlen, M. N., Britten, K. H., Newsome, W. T. & Movshon, J. A. A computational analysis of the relationship between neuronal and behavioral responses to visual motion. J. Neurosci. 16, 1486–1510 (1996).
Dodd, J. V., Krug, K., Cumming, B. G. & Parker, A. J. Perceptually bistable threedimensional figures evoke high choice probabilities in cortical area mt. J. Neurosci. 21, 4809–4821 (2001).
Krajbich, I., Armel, C. & Rangel, A. Visual fixations and the computation and comparison of value in simple choice. Nat. Neurosci. 13, 1292 (2010).
de Lafuente, V. & Romo, R. Neuronal correlates of subjective sensory experience. Nat. Neurosci. 8, 1698 (2005).
Treue, S. & Trujillo, J. C. M. Featurebased attention influences motion processing gain in macaque visual cortex. Nature 399, 575 (1999).
Roitman, J. D. & Shadlen, M. N. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci. 22, 9475–9489 (2002).
Gu, Y., Angelaki, D. E. & DeAngelis, G. C. Neural correlates of multisensory cue integration in macaque mstd. Nat. Neurosci. 11, 1201 (2008).
Purushothaman, G. & Bradley, D. C. Neural population code for fine perceptual decisions in area mt. Nat. Neurosci. 8, 99 (2005).
Haefner, R. M., Gerwinn, S., Macke, J. H. & Bethge, M. Inferring decoding strategies from choice probabilities in the presence of correlated variability. Nat. Neurosci. 16, 235–242 (2013).
Britten, K. H., Newsome, W. T., Shadlen, M. N., Celebrini, S. & Movshon, J. A. A relationship between behavioral choice and the visual responses of neurons in macaque mt. Vis. Neurosci. 13, 87–100 (1996).
Green, D. M. & Swets, J. A. Signal Detection Theory and Psychophysics (John Wiley, 1966).
Kanitscheider, I., CoenCagli, R. & Pouget, A. Origin of informationlimiting noise correlations. Proc. Natl Acad. Sci. USA 112, E6973–E6982 (2015).
KhalighRazavi, S. M. & Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain it cortical representation. PLoS Comput. Biol. 10, e1003915 (2014).
Haag, J., Denk, W. & Borst, A. Fly motion vision is based on reichardt detectors regardless of the signaltonoise ratio. Proc. Natl Acad. Sci. USA 101, 16333–16338 (2004).
Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Contextdependent computation by recurrent dynamics in prefrontal cortex. Nature 503, 78–84 (2013).
Saez, A., Rigotti, M., Ostojic, S., Fusi, S. & Salzman, C. Abstract context representations in primate amygdala and prefrontal cortex. Neuron 87, 869–881 (2015).
Walker, E. Y., Cotton, R. J., Ma, W. J. & Tolias, A. S. A neural basis of probabilistic computation in visual cortex. Nat. Neurosci. 23, 122–129 (2020).
Beck, J. M., Ma, W. J., Pitkow, X., Latham, P. E. & Pouget, A. Not noisy, just wrong: the role of suboptimal inference in behavioral variability. Neuron 74, 30–9 (2012).
Nienborg, H. & Cumming, B. G. Psychophysically measured task strategy for disparity discrimination is reflected in v2 neurons. Nat. Neurosci. 10, 1608 (2007).
Qamar, A. T. et al. Trialtotrial, uncertaintybased adjustment of decision boundaries in visual categorization. Proc. Natl Acad. Sci. USA 110, 20332–20337 (2013).
Karklin, Y. & Lewicki, M. S. Emergence of complex cell properties by learning to generalize in natural scenes. Nature 457, 83–86 (2009).
Hubel, D. H. & Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. 160, 106–154 (1962).
Walker, E. Y., Cotton, R. J., Ma, W. J., & Tolias, A. S. A neural basis of probabilistic computation in visual cortex. Nat. Neurosci. 23, 122–129 (2020).
Poggio, T. & Koch, C. Synapses that compute motion. Sci. Am. 256, 46–53 (1987).
Ma, W. J., Navalpakkam, V., Beck, J. M., Van Den Berg, R. & Pouget, A. Behavior and neural basis of nearoptimal visual search. Nat. Neurosci. 14, 783 (2011).
Davis, K. A., Ramachandran, R. & May, B. J. Auditory processing of spectral cues for sound localization in the inferior colliculus. J. Assoc. Res. Otolaryngol. 4, 148–163 (2003).
Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 4, 251–257 (1991).
Lakshminarasimhan, K., Pouget, A., DeAngelis, G., Angelaki, D. & Pitkow, X. Inferring decoding strategies for multiple correlated neural populations. PLoS Comput. Biol. 14, e1006371 (2018).
Haefner, R. M., Berkes, P. & Fiser, J. Perceptual decisionmaking as probabilistic inference by neural sampling. Neuron 90, 649–660 (2016).
Graf, A. B., Kohn, A., Jazayeri, M. & Movshon, J. A. Decoding the activity of neuronal populations in macaque primary visual cortex. Nat. Neurosci. 14, 239 (2011).
Maynard, E. et al. Neuronal interactions improve cortical population coding of movement direction. J. Neurosci. 19, 8083–8093 (1999).
Pitkow, X. & Angelaki, D. E. Inference in the brain: statistics flowing in redundant population codes. Neuron 94, 943–953 (2017).
Krakauer, J. W., Ghazanfar, A. A., GomezMarin, A., MacIver, M. A. & Poeppel, D. Neuroscience needs behavior: correcting a reductionist bias. Neuron 93, 480–490 (2017).
Niv, Y. The primacy of behavioral research for understanding the brain. Behav Neurosci. 135, 601–609 https://doi.org/10.1037/bne0000471 (2021).
Berens, P., Ecker, A. S., Gerwinn, S., Tolias, A. S. & Bethge, M. Reassessing optimal neural population codes with neurometric functions. Proc. Natl Acad. Sci. USA 108, 4423–4428 (2011).
Tolias, A. S. et al. Recording chronically from the same neurons in awake, behaving primates. J. Neurophysiol. 98, 3780–3790 (2007).
Acknowledgements
The authors thank Jeff Beck, Valentin Dragoi, Arun Parajuli, Alex Pouget, Nicole Rust, and Haim Sompolinsky for helpful conversations. This work was supported by NSF CAREER grant 1552868 to X.P., by NeuroNex grant 1707400 to X.P. and A.T., and by NSF Grant No. PHY1748958, NIH Grant No. R25GM067110, the Gordon and Betty Moore Foundation Grant No. 2919.01. Q.Y. was supported in part by National Natural Science Foundation of China grant No. 32100832 and by the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province China Grant No. 19KJD520001. X.P. and A.T. were supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.
Author information
Authors and Affiliations
Contributions
X.P. and Q.Y. conceived the theoretical framework. X.P. and Q.Y. designed and performed the mathematical analyses. Q.Y. performed the simulations. E.W., R.J.C., and A.S.T. designed the experiments for a study with Wei Ji Ma; E.W., R.J.C., and A.S.T. performed the experiments; E.W. preprocessed the neural data; Q.Y. and X.P. analyzed the neural data. Q.Y. and X.P. wrote the manuscript; all authors discussed the results and commented on the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, Q., Walker, E., Cotton, R.J. et al. Revealing nonlinear neural decoding by analyzing choices. Nat Commun 12, 6557 (2021). https://doi.org/10.1038/s41467021267939
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467021267939
This article is cited by

A distributed and efficient population code of mixed selectivity neurons for flexible navigation decisions
Nature Communications (2023)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.