Optimal dynamic coding by mixed-dimensionality neurons in the head-direction system of bats

Ethologically relevant stimuli are often multidimensional. In many brain systems, neurons with “pure” tuning to one stimulus dimension are found along with “conjunctive” neurons that encode several dimensions, forming an apparently redundant representation. Here we show using theoretical analysis that a mixed-dimensionality code can efficiently represent a stimulus in different behavioral regimes: encoding by conjunctive cells is more robust when the stimulus changes quickly, whereas on long timescales pure cells represent the stimulus more efficiently with fewer neurons. We tested our predictions experimentally in the bat head-direction system and found that many head-direction cells switched their tuning dynamically from pure to conjunctive representation as a function of angular velocity—confirming our theoretical prediction. More broadly, our results suggest that optimal dimensionality depends on population size and on the time available for decoding—which might explain why mixed-dimensionality representations are common in sensory, motor, and higher cognitive systems across species.

We plot the error ratio of pure and conjunctive cells decoding a hypothetical 5D stimulus. Similar to the 2D case ( Fig.   3), we identified here three regimes where the error ratio is equal to, smaller than and larger than the ratio predicted by the FI for large N and T -which in this case equals pure / conj = √ 5. (a) For wide tuning (90 • ), the range of N and T values we used is sufficient to find these regimes and the error ratio in N − T space looks very similar to that in 2D (Fig. 3). (b) For narrow tuning (45 • ) conjunctive cells suffer from very poor coverage of this high-dimensional stimulus space. Thus, for the range of N and T values we used, we can only see regime #1 where pure / conj ≈ √ 5 (blue line corresponding to N = 10 5 , right panel) and regime #2 where pure / conj < √ 5 (left panel). The panel on the right shows the ratio of the decoding error by pure versus conjunctive cells, as a function of T , for 4 different population sizes (N ). Note that for a 5D stimulus, pure cells can outperform conjunctive cells even at large N , and the conjunctive cells will not be accurate unless the population size is very large (above 5,000 neurons) or the decoding time is very short (the green line in the inset [N = 4, 000] is mostly below 1) . The color bar indicates the error-ratio between pure and conjunctive cells for a 5D stimulus. Regime #2 ϵ pure / ϵ conj < 1 The performance of pure and conjunctive neurons encoding a 2D stimulus is shown as a function of decoding time (T ) and the number of cells available for decoding (N ), for a different kind of scaling of peak firing-rates of conjunctive cells than the scaling used in Fig. 3. In 2D, we normalized the tuning curves such that the expected Fisher information at the limit of large N and T is equal for the two populations, but the mean-firing rate of pure and conjunctive cells is no longer the same. In this scaling the peak firing rate of conjunctive cells is chosen so that conjunctive cells fire on average fewer spikes than pure cells (see Methods), because as we discussed in the main text, when both populations have the same mean firing rate, the Fisher information of conjunctive cells is higher than that of the pure cells. Importantly, one can still observe the absolute advantage of either pure or conjunctive cells when N and or T are relatively small, respectively -even though their performance becomes exactly the same at large N and T , as expected from this scaling of their firing rates. (a-d) We adopt here the noise correlation structure found in the head-direction system of rodents by Peyrache et al. 25 , and show the relative performance of pure and conjunctive neurons encoding a 2D stimulus, as a function of decoding time (T ) and number of cells available for decoding (N ), in the presence of these noise correlations.
(a,b) The noise correlation between every pair of neurons depends on their pairwise tuning curve distance d (see Methods), as illustrated in a for pure cells (left) and conjunctive cells (right). The distance is then transformed into a correlation value using the function illustrated in b. Cells with highly overlapping tuning curves are positively correlated, whereas cells with a small overlap are slightly negatively correlated. Given the correlation, spike counts are generated using a procedure detailed in the Methods. (c,d) Plot of the decoding error ratio pure / conj of pure and conjunctive cells. Decoding was done using two different methods: a naive decoder that does not take into account the noise correlations (top) and a decoder that uses both the neurons' spike counts and the noise correlation structure to infer the stimulus (bottom). The dependence of the error ratio on N and T is rather similar to that found in the absence of noise correlations (compare to Fig. 4c). (c) The error ratio at short decoding times is larger than that at long times, indicating the relative advantage of conjunctive cells relative to pure cells at short times. As T increases the ratio saturates to a value that depends on N . (d) At long decoding times (T = 10s), decreasing N leads to smaller error ratio and eventually to an absolute advantage of pure cells, suggesting that at small N the conjunctive population suffers from a loss of coverage. (e,f) We plot the error ratio as a function of N and T in 2D using an alternative method of introducing dependencies between cells. Here there is an overall additive (panel e) or multiplicative (panel f) random factor that modulates the tuning curves of all neurons in a sub-population (pure azimuth, pure pitch, conjunctive). This additive or multiplicative modulation was done prior to drawing the Poisson-distributed spike counts. The effect on the tuning curve is shown at the top panels. The bottom panels show that the error ratio exhibits the same qualitative behavior as in the case with independent spike counts. Because of this additional source of variability, the theoretical error ratio here is not √ 2. In the absence of an analytical calculation of the error ratio for large N and T for these noise models, we used the error ratio between pure and conjunctive cells found for the largest N , T values used in the simulation as the 'boundary' between the regimes.   which decreases as the tuning becomes broader (inset). Fisher Information increases as tuning becomes narrower (inset to Supplementary Fig. 6a), and therefore when the tuning of conjunctive cells is fixed, their advantage over a population of pure cells at large N diminishes as pure cells became more narrowly tuned (Supplementary Fig. 6b).
Correspondingly, N cr increased as pure cells became more narrowly tuned (inset to Supplementary Fig. 6b). Thus, both the difference in stimulus space coverage by pure and conjunctive cells, and the dependence of FI on the tuning width, contribute to the relative advantage of pure cells for small N and narrow tuning. (a) We computed the probabilities that one of the pure subpopulations will fire a very small number of spikes n min (dashed lines); and that the conjunctive population will fire 2n min spikes (solid lines). These probabilities are shown as a function of the expected total number of spikes emitted by the entire population, which is the same for pure and conjunctive cells. When the expected number of spikes is 10-15, there is a large chance that the total spikes will be unevenly divided such that one of the pure subpopulations fires three spikes or less, leading to a very poor estimate of the stimulus along the direction corresponding to that subpopulation. (b) We computed the pure and conjunctive conditional errors -the decoding errors averaged after removing trials with few spikes as described in (a).
The average number of spikes emitted by the pure and conjunctive populations remains equal after these trials are removed (see also Methods). The error ratio pure / conj is shown after successively removing all trials where either of the pure subpopulations fired n min = 0 (taking into account all trials) and n min = 1, 2, 3 spikes. The black line, similar therefore in the model we used the average peak firing rate of all pure cells ("pure combined") as a representative peak firing rate of the pure population. The conjunctive cells had significantly elevated peak-firing rate compared to pure cells. Note that this effect was significant both when the peak-firing rate of conjunctive cells was computed separately for azimuth and pitch dimensions (on the 1D marginal tuning curve for the respective dimension, "conj. azim", and "conj. pitch"), and when defined as the maximal peak-firing rate between the azimuth and pitch 1D tuning curves ("conj. max"). Error bars, mean ± s.e.m.; * P < 0.05, * * P < 0.01, * * * P < 0.001, using Student's t-test.
(b) Tuning widths were very similar between recorded pure-azimuth and pure-pitch cells, and additionally there was no significant difference between the tuning width of pure and conjunctive cells -both when the tuning width was computed separately for the 1D tuning-curves, or taken as the average value across both dimensions ("combined").
Therefore we modeled the tuning curves with the same tuning width parameter for pure and conjunctive cells, in both azimuth and pitch dimensions.

Supplementary Note 1
We study the scaling behavior of the Maximum Likelihood decoding error as a function of the population size N and the decoding time T . As N, T → ∞, the average squared error saturates to the inverse Fisher Information, what is known as the Cramér-Rao bound.
To understand the scaling behavior of the error for finite N, T (i.e., away from the CR bound) we quantify fluctuations of the likelihood function used for decoding. If there were no fluctuations, the maximum of the likelihood function would be at the stimulus presented to the neuronal population. The random fluctuations "shift" the global maximum of the likelihood function away from the stimulus, leading to decoding errors.
The derivation is carried out in three steps: First we average over the noise sources, namely the Poisson spike variability and the random preferred orientation of each neuron, to obtain expressions for the first and second order statistics of the likelihood function that is maximized for the decoded stimulus. Knowing these statistics allows us to write an effective simplified likelihood function that can be analyzed.
Second, for this effective likelihood function, we estimate the decoding error assuming a general algebraic form for the tuning curve.
Finally, using an effective form for any tuning curve shape (including the von Mises function we use throughout) we obtain an implicit equation for the ML decoding error, which is solved numerically. We conclude by comparing the result of our derivation to numerical simulations.
We emphasize that the derivation below is qualitative. As will be seen, it provides an explanation for two important observations we describe in the main text: • In one dimension, for large N , the population size and decoding time control the error only through their product N × T (see Supplementary Fig. 1b).
• In two (or more) dimensions, for short decoding times, the error ratio favors conjunctive cells more than is expected from computing the FI (see blue region in Figs. 3a,b and Fig. 4c).
This derivation does not provide a quantitative account for the decoding error away from the CR bound. We think that the theory of extreme value statistics can be applied to obtain more refined estimates of interesting quantities.
Doing so is beyond the scope of this paper and will be left for future work. We refer the reader to the book by Falk et al. (2004) that provides a thorough introduction to the theoretical tools one might use to study this problem. When relevant we point to key mathematical results we think may be useful.

First and second order statistics of the likelihood function
We start with a 1D stimulus s, and corresponding spike counts n i (s) drawn from a Poisson distribution with rate given by the von Mises tuning curve. The ML estimate of s is θ, and it is found by maximizing the likelihood function Here, κ is a parameter that controls the tuning width, θ i is the preferred stimulus of neuron i, T is the decoding time, R is the peak firing rate and N is the size of the population.
The preferred stimuli are random and uniformly distributed between 0 and 2π. Throughout the derivation we assume that N is large enough such that for any function g(θ), We emphasize that the focus is on scenarios where the error deviates from the CR bound due to short decoding times T , not due to small N . Define, Below we compute the averages L 1 and L 2 1 . The assumption that sums can be replaced by integrals (Eq. 2) implies that L 2 is in fact a constant (it does not depend on the stimulus coordinate θ), so we drop it from our derivation and set L = L 1 . Note that ignoring the term L 2 is equivalent to using the population vector, rather than the maximum likelihood decoder. Averages are taken over multiple stimulus presentations and choices of the neurons' preferred stimuli.
The quantity L 1 . This is simply the population vector. The average spike count is n i (s) = RT exp [κ (cos(s − θ i ) − 1)].
Defining δθ = θ − s we write where we have used standard identities for the modified Bessel function of the first kind I ν (z), and where J is the Fisher information of a single population (see Eq. (7) in Methods).
The quantity L 2 . For completeness, we computed L 2 which measures how evenly the neurons' tuning curves are spread in the stimulus space, and is thus independent of the stimulus coordinate θ and the stimulus itself s. A similar calculation gives, The quantity L 2 1 . Using the fact that for Poisson statistics the spike counts satisfy n 2 i = n i 2 + n i , the second moment of L 1 is L 2 1 (δθ) = L 1 (δθ) 2 + N RT κ 2 e −κ 1 2 (I 0 (κ) + I 2 (κ)) cos(2δθ) + I 0 (κ) sin 2 (δθ) = L 1 (δθ) 2 + κ 1 2 (I 0 (κ) + I 2 (κ)) cos(2δθ) + I 0 (κ) sin 2 (δθ) In the simulations, the decoder's performance is assessed by computing the estimated stimulus from instances of L in which the variability is generated by the Poisson statistics and random preferred stimuli are drawn from a uniform distribution. Possible interaction of these two noise sources and their non-Gaussian statistics makes analyzing how that variability affects the decoder's performance a difficult problem.
Instead, we study the performance of a hypothetical decoder which maximizes an effective likelihood functionL that has Gaussian statistics with the same first and second moments as L that we computed above, i.e., Here ξ(δθ) is a random Gaussian process with mean 0 and unit variance.
In addition to the variance shown in Eq. (6), the function L 1 (δθ) has non-zero auto-correlation: L 1 (δθ)L 1 (δθ + ∆) ∼ I 1 (κ cos(∆/2)) = 0. This means that if the value of the likelihood function is large for some deviation δθ of the inferred stimulus from the actual one, it is likely large also for a "nearby" value of the deviation.
In the present work we aim to only obtain a qualitative understanding of the behavior of the error stemming from the fluctuations of the likelihood function. To do so we neglect the correlations of its fluctuations. It is in fact possible to include them by considering a correlated Gaussian process inL, with correlations matched to those of the actual likelihood function L. This however leads to expressions from which it is difficult to extract the error behavior. Specifically, doing so leads to a factor C 0 (see Eq. 13 below) which depends on . Assuming the autocorrelation is zero for ∆ = 0 leads to a constant C 0 which can be computed.
Substituting Eqs. (4,6) into Eq. (7) and dividing by J, L(δθ) = cos(δθ) + κ 1 2 (I 0 (κ) + I 2 (κ)) cos(2δθ) + I 0 (κ) sin 2 (δθ) where we defined S(κ, δθ) = κ 1 2 (I 0 (κ) + I 2 (κ)) cos(2δθ) + I 0 (κ) sin 2 (δθ) On average, the effective likelihood functionL has a maximum at δθ = 0, so it correctly represents the fact that the ML decoder is unbiased. Instances of the random process ξ can shift the maximum away from 0 which we interpret as the errors that this effective decoder makes. In the next section we analyze the magnitude of these errors. In Supplementary Fig. 11 we show a comparison of numerically computing the likelihood function L 1 and the analytical estimate we use in the simulations (top row, Eq. 4; bottom row, Eq. 6). The formula we obtained breaks down when the entire population emits on average less than one spike (see Supplementary Fig. 11, left column, for which N = 100 and T = 0.01s). The approximation improves as the number of spikes grows (either as N is increased, or as T is increased), as can be seen in the middle and right columns of Supplementary Fig. 11.

The errors made by the effective Gaussian decoder with algebraic tuning
In the previous section we showed that the likelihood function maximized by the decoder can be written as a sum of a deterministic part and a stochastic part. The deterministic part is maximal for δθ = 0 (i.e., the error in the absence of noise is 0). The stochastic part shifts the maximum away from δθ = 0, meaning that in the presence of noise the error is non-zero. Now we want to use our knowledge of the mean and variance of this function to estimate the magnitude of errors by obtaining the characteristic distance of the peak of the likelihood function from δθ = 0.
Analyzing the characteristic value of the error |δθ| at which the effective likelihood function (Eq. 8) has a maximum is difficult, so again we simplify. Consider the random process z(t), starting at t = 0 and going only forward in time, such that where ξ t is a Gaussian random process with mean 0 and unit variance, and m > 0. Examples of this process are shown in Supplementary Fig. 12a,b.
This process is a simplified case of the Gaussian approximation we made to the decoding problem. In other words, z(t), the position of the random process as a function of time, is analogous toL(δθ), the value of the effective likelihood as a function of the decoding error. We refer to the decoding error coordinate δθ as the "time" t to emphasize the connection between the problem we are studying here and that of extremes of non-stationary random processes. In fact, in the Gaussian approximation, the likelihood function is a nonstationary Gaussian process. The mathematical properties of extreme values of this type of stochastic processes are discussed in chapter 10.3 of Falk et al. (2004).
In the case of the Gaussian approximation of the likelihood function (Eq. 8) the decoding error is equal to δθ for which the functionL(δθ) is maximal:L for all −π < δθ ≤ π.
Note that we considered only a one sided process: t ≥ 0. This choice ensures that the base of the deterministic term in Eq. (10) is always ≥ 1. This in turn guarantees that the deterministic contribution to z(t) has monotonic dependence on m: if m 1 > m 2 then for all t > 0 we have that −|1 + t| m1 < −|1 + t| m2 meaning that the decay of the deterministic part of z(t) is faster for larger m. Had we considered also negative times, the m dependence of the deterministic term would change at t = 0.
Thus, in the simplified process (Eq. 10), we are interested in the time t = such that z( ) > z(t) for all t ≥ 0.
We denote this time by , because the characteristic time by which the maximum of z(t) is shifted from t = 0 will be interpreted as the decoding error, as in Eq. (11). Supplementary Fig. 12 illustrates why depends on the exponent m: the deterministic part of the process z(t) decreases faster for larger m, so the noise term is unable to "overcome" this decline such that the process reaches its maximum at later times, compared to the process z(t) with a smaller value of m. The distribution of errors, i.e., times at which the process z(t) reaches its maximum are shown in Supplementary Fig. 12c, along with the averages for three values of m.
The average of z(t) (i.e., the deterministic part −|1+t| m ) decreases as time increases. The maximum of the stochastic part as a function of time, taken over realizations of the Gaussian noise, increases with time. This is true because the longer the process evolves, the more likely it is to have exceeded a given positive value in its history. The distribution of extremes of stationary Gaussian processes (i.e., the case where σ in Eq. (10) is constant) has been computed analytically [see theorem 10.2.1 in Falk et al. (2004)].
Quantitatively computing the characteristic time at which z(t) is maximal can thus be done by adding the deterministic part to the average over the distribution of extremes [Eq. (10.2) in Falk et al. (2004)]. It is also worth noting that this distribution might be useful in deriving an approximation of the distribution of errors (rather than the characteristic value) which is needed to obtain an expression for the periodic Cramér Rao bound [see Routtenberg & Tabrikian 58 ].
Since we are only interested in the scaling of the characteristic time with the standard deviation σ and the exponent m, it is sufficient to inspect the probability that z( ) > z(t = 0) = −1, i.e., the probability that the process exceeds its initial value. For large enough t this probability is negligible meaning that there is essentially no chance that the decoder will make errors of the corresponding magnitude. This probability reads The integrand is a product of two terms: the first decays exponentially with ξ( ); and the second increases monotonically but is bounded by 2 as the argument in the error function increases. Note that the Gaussian process ξ( ) appears in the argument of the error function, and that increasing the error leads to decreasing the second term in the product.
Thus, as grows, the overlap of the region where both the exp −ξ 2 ( )/2 and 1 + erf(·) terms are order 1 decreases.
The maximal such that this overlap is non-negligible must satisfy with C 0 being a positive constant of order one. This is the maximal error ("time" in the language of the process z) for which there is a non-negligible probability of observing a global maximum in z.
For small we substitute (1 + ) m ≈ 1 + m into Eq. (13) giving, This scaling is in excellent agreement with numerical simulations (not shown). We will determine C 0 in the next section where we return to analyzing the approximate likelihood function we derived (Eq. 8), by requiring that the CR bound is saturated for small .

Estimating the errors of a maximum likelihood decoder
In the previous section we presented a simple scaling analysis of a random process with fixed algebraic drift and fixed variance (i.e., m and σ in Eq. 10 do not depend on t). The characteristic time t = at which the process z(t) reaches its maximum is now interpreted as the decoding error, so = |δθ| in Eq. (8). However, the quantities in the effective likelihood function we derived (Eq. 8), do depend on the stimulus coordinate (i.e.,L is in fact a nonstationary process): We assume that the dependence of σ does not change the scaling relationship, and we introduce dependence to the scaling of the drift, i.e., m = m( ), and To determine an appropriate form of m( ) and the value of C 0 consider the following two limits.
The limit 1: here we know from the Taylor expansion of cosine that m( 1) = 2; and that a very small error implies saturation to the CR bound, so = 1/ √ J. Substituting these into Eq. (16) fixes C 0 = 2/S(κ, 0).
The limit of large errors: since we are studying the decoding of a one dimensional angular variable, the error is bounded from above by π/2 regardless of how small J is. For J = 1 we assume the error is maximal, and neglect the dependence on κ. Under these assumptions Eq. (16) is satisfied if m(π/2) = 2. Including the κ dependence or choosing a different value of the FI (of order one) for which the error is maximal does not change the results qualitatively.
Between 0 and π/2, m( ) < 2 because the quartic term in the Taylor expansion of cosine is positive, so the effective drift is slower than − 2 . The simplest form of m( ) that satisfies these requirements is Substituting Eq. (17) into Eq. (16) gives an implicit equation in for the scaling of the decoding error away from the CR bound: We solved Eq. (18) numerically, choosing a range of values for J. To allow comparison to numerical simulation of the ML decoder, we eliminate factors of order 1 by plotting the relative error / CR (where CR = 1/ √ J) which is expected to saturate to 1 from above for large J.
Plotted as a function of T , both the theoretical prediction and the numerical simulations show departure from the CR bound (dashed line at / CR = 1) that depends on the value of N , and both reach a saturation beyond which the error does not grow more with respect to the CR bound as time is decreased (Supplementary Fig. 1). There is no quantitative agreement in terms of the value of T at which the departure is observed, and no quantitative agreement for the maximal ratio. These differences could be expected since in our derivation we neglected factors of order 1 a number of times.
In the numerical solution of Eq. (18) and in the simulations of the ML decoder we used κ = 9.1 (corresponding to 45 • tuning width), R = 1Hz and N between 100 and 10000. Importantly, when the relative error obtained from simulations is plotted as a function of 1/ CR instead of as a function of T , curves corresponding to different values of N collapse onto one another. This means that the error depends on N and T only through their product (which is proportional to J).

Decoding error of conjunctive cells
To obtain an estimate of the decoding error of conjunctive cells, we assume that the peak firing rate of conjunctive cells is normalized such that the average population firing rate of pure and conjunctive cells is equal (the normalization used throughout the paper). Under this assumption, in two dimensions, the Fisher information of the pure cell population is equal to half that of the conjunctive cells.
Therefore, the estimate of the error of a ML decoder that uses conjunctive cell responses is obtained by solving Eq.
(18) with 2J instead of J. One can expect then that the error ratio will exceed √ 2 for some values of J, because before convergence to the CR bound, multiplying the FI by 2 results in the error decreasing by a factor greater than √ 2.
Indeed this was the case when the ratio (J) (2J) was plotted (inset to Fig. 4c) for N = 1000 and N = 2000. Similarly to numerical simulations, the error ratio exceeds √ 2 for short decoding times, and converges to √ 2 for long decoding times, and the time at which the maximum ratio is attained depends on N . The theoretical estimate however fails to predict the fact that the maximal error ratio depends on N . This is accounted for by an analysis that uses a different approach, presented in the Methods section, and in Supplementary Fig. 7.

Supplementary References
Falk M, Hüsler J, Reiss R. 2004 Laws of small numbers: Extremes and rare events. Birkhauser Verlag. Basel, Switzerland.