Scaling of sensory information in large neural populations shows signatures of information-limiting correlations

How is information distributed across large neuronal populations within a given brain area? Information may be distributed roughly evenly across neuronal populations, so that total information scales linearly with the number of recorded neurons. Alternatively, the neural code might be highly redundant, meaning that total information saturates. Here we investigate how sensory information about the direction of a moving visual stimulus is distributed across hundreds of simultaneously recorded neurons in mouse primary visual cortex. We show that information scales sublinearly due to correlated noise in these populations. We compartmentalized noise correlations into information-limiting and nonlimiting components, and then extrapolated to predict how information grows with even larger neural populations. We predict that tens of thousands of neurons encode 95% of the information about visual stimulus direction, much less than the number of neurons in primary visual cortex. These findings suggest that the brain uses a widely distributed, but nonetheless redundant, code that supports recovering most sensory information from smaller subpopulations.


Supplementary Note 1. Generalized Fisher information
Fisher information quantifies how much neural activity r tells us about a stimulus θ around a particular reference value θ_0. As such, it is a measure of fine discrimination performance. Here, we show how linear Fisher information relates to Fisher information in general, show how it can be generalized beyond fine discrimination, and describe some properties of this generalization.

Definition and properties of linear Fisher information
We can derive linear Fisher information in two ways [1,2]. The first is to assume that p(r|θ) is a member of the exponential family with linear sufficient statistics. The second is to show that it is the Fisher information that can be extracted with a minimum-variance unbiased linear decoder. We will provide both derivations in turn.
Let us first assume that neural activity r in response to a stimulus θ follows an exponential family distribution with linear sufficient statistics,
p(r|θ) = g(θ) Φ(r) exp(h(θ)^T r),
where g(θ), Φ(r), and h(θ) are known functions. The partial derivative with respect to θ of the log-likelihood function, ∂/∂θ log p(r|θ), is called the "score". Using that the first moment of the score function is zero, it can be written as
∂/∂θ log p(r|θ) = h'(θ)^T (r − f(θ)),
where f(θ) = E(r(θ)) is the mean population activity vector. The Fisher information can be derived as the variance of the score function [3], which can be written as
I(θ) = h'(θ)^T Σ(θ) h'(θ),
where Σ(θ) = E((r(θ) − f(θ))(r(θ) − f(θ))^T) is the noise covariance matrix. To express the Fisher information in terms of f(θ), we note that differentiating f(θ) = E(r(θ)) with respect to θ yields f'(θ) = Σ(θ) h'(θ). Thus, we have h'(θ) = Σ^{-1}(θ) f'(θ) [4]. Taking this expression to substitute both instances of h'(θ) in the Fisher information results in
I(θ) = f'(θ)^T Σ^{-1}(θ) f'(θ).
To show that linear Fisher information is the information extractable by a minimum-variance unbiased linear decoder, assume that the decoder linearly combines neural activity with a projection vector w. For a fine discrimination task with two close-by stimuli θ_1 = θ_0 − δθ and θ_2 = θ_0 + δθ with small δθ, the unbiased locally linear estimator of θ is given by
θ̂ − θ_0 = w^T (r − f(θ_0)).
The expectation of the right-hand side around θ_0 is E(w^T (r − f(θ_0))) = 0, demonstrating that the estimator is unbiased. Our aim is to find a w that yields a locally unbiased estimate, which imposes the constraint w^T f'(θ_0) = 1.
To find the minimum-variance estimator satisfying this constraint, note that its variance is given by var(θ̂) = w^T Σ w, where Σ is the noise covariance matrix around θ_0. Therefore, we aim to find
min_w w^T Σ w,   s.t.   w^T f'(θ_0) = 1.
Using a Lagrange multiplier to solve the constrained optimization for w results in
w = Σ^{-1} f'(θ_0) / (f'(θ_0)^T Σ^{-1} f'(θ_0)),
with associated estimator variance var(θ̂) = 1 / (f'(θ_0)^T Σ^{-1} f'(θ_0)). By the Cramér-Rao bound [3], the Fisher information is the inverse of this variance,
I(θ_0) = f'(θ_0)^T Σ^{-1} f'(θ_0),
which matches the previously derived expression for the linear Fisher information. This demonstrates that linear Fisher information can be interpreted in multiple ways: it is either the Fisher information when restricting the distribution of neural activity to the exponential family with linear sufficient statistics (which contains independent-Poisson populations with dense tuning curves, as well as other distributions [4]), or the Fisher information that can be extracted with a linear decoder.
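The equivalence of the two derivations can be checked numerically. The sketch below (the toy covariance, dimensions, and function names are our own illustration, not code from the study) computes linear Fisher information directly and verifies that the minimum-variance unbiased linear decoder attains variance 1/I:

```python
import numpy as np

def linear_fisher_information(fp, Sigma):
    """Linear Fisher information I = f'(theta)^T Sigma^{-1} f'(theta)."""
    return float(fp @ np.linalg.solve(Sigma, fp))

def optimal_linear_decoder(fp, Sigma):
    """Minimum-variance locally unbiased readout w = Sigma^{-1} f' / (f'^T Sigma^{-1} f')."""
    u = np.linalg.solve(Sigma, fp)
    return u / (fp @ u)

rng = np.random.default_rng(0)
N = 5
A = rng.normal(size=(N, N))
Sigma = A @ A.T + N * np.eye(N)   # toy positive-definite noise covariance
fp = rng.normal(size=N)           # toy tuning-curve derivative f'(theta_0)

I = linear_fisher_information(fp, Sigma)
w = optimal_linear_decoder(fp, Sigma)
# decoder is locally unbiased (w^T f' = 1), and its variance attains 1/I
assert np.isclose(w @ fp, 1.0)
assert np.isclose(w @ Sigma @ w, 1.0 / I)
```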

Generalizing Fisher information beyond fine discrimination
Let us generalize the above to coarse discrimination. To do so, assume two classes, C_1 and C_2, which represent a pair of stimulus orientations at θ_1 and θ_2 in the experiment. As before, we will derive generalized Fisher information in two ways. First, we will derive it by making particular distributional assumptions on p(r|θ_1) and p(r|θ_2). Then, we will derive it from the perspective of optimal linear discrimination. For the first approach, assume that p(r|θ_j) for both j ∈ {1, 2} follows a Gaussian distribution,
C_1: p(r|θ_1) = N(r|f_1, Σ),   C_2: p(r|θ_2) = N(r|f_2, Σ),
which have different means, but the same covariance matrix. Under the assumption that θ is a random variable (which takes two values, θ ∈ {θ_1, θ_2}, in coarse discrimination tasks), it is easy to find a decision rule that minimizes the expected Bayes risk [5]. We will denote by L_ij the loss of choosing C_j when C_i is correct. Furthermore, we assume a symmetric decision problem with symmetric loss, that is L_12 = L_21 and L_11 = L_22, a uniform prior p(C_1) = p(C_2) = 1/2, and a preference for making correct choices, that is L_11 < L_12. In this case, the expected Bayes risk, Σ_{i∈{1,2}} L_{iD(r)} p(C_i|r), associated with decision rule D(r) ∈ {1, 2} is minimized by choosing C_2 if Λ(r) > 0 and C_1 otherwise, where Λ(r) is the log-likelihood ratio. For the assumed Gaussian likelihoods, this log-likelihood ratio is given by
Λ(r) = log(p(r|θ_2)/p(r|θ_1)) = (f_2 − f_1)^T Σ^{-1} (r − f_0),
where f_0 = (f_1 + f_2)/2, and f_j = E_{r|θ_j}(r) for j ∈ {1, 2}. Letting w = Σ^{-1} δf with δf = f_2 − f_1, we can rewrite Λ(r) as Λ(r) = w^T (r − f_0). In order to identify how likely this decision rule makes the correct choice, observe that Λ(r) follows the following distributions under C_1 and C_2:
Λ(r)|C_1 ~ N(−w^T δf / 2, w^T Σ w),   Λ(r)|C_2 ~ N(w^T δf / 2, w^T Σ w).
Therefore, we can find the probability of making a correct choice under D(r) by
p(correct) = Φ(w^T δf / (2 √(w^T Σ w))),
where Φ(·) is the cumulative distribution function of the standard normal distribution. After replacing both instances of w by its definition, w = Σ^{-1} δf, p(correct) becomes
p(correct) = Φ(√(δf^T Σ^{-1} δf) / 2).
Comparing this expression to Eq.
(6) reveals a close similarity, which we can utilize to define the generalized linear Fisher information for coarse discrimination tasks by
I_g(θ) = δf^T Σ^{-1} δf / δθ²,
where δθ = θ_2 − θ_1 is the stimulus difference. It is easy to see that, for small δθ, generalized linear Fisher information converges to linear Fisher information,
lim_{δθ→0} I_g(θ) = f'(θ)^T Σ^{-1} f'(θ).
As the sensitivity index d' [6] in our case is given by d' = √(δf^T Σ^{-1} δf) [7,8,9], the generalized linear Fisher information can be re-expressed in terms of d' by
I_g(θ) = d'² / δθ².
This relationship furthermore results in
p(correct) = Φ(d'/2) = Φ((δθ/2) √(I_g(θ))),
illustrating the close relationship between p(correct), d', and I_g(θ). An alternative derivation of generalized linear Fisher information is through an optimal linear discriminator with less stringent assumptions on the class-conditional distributions. In this second approach, we assume a linear decoder projecting the neural activity onto a one-dimensional readout,
θ̂ = w^T r.
To assign an observed neural activity to a class, we just need to place a threshold on the readout θ̂. To do so, we optimize w to maximize the class separation following Fisher's linear discriminant analysis [10], which minimizes the within-class variance while maximizing the between-class variance of r. As before, let f_j and Σ_j be the mean and noise covariance of neural activity in class C_j, but without making any further assumptions about the class-conditional densities p(r|C_j). We aim to find the w that maximizes the ratio of the between-class variance to the within-class variance, which is formulated as
max_w (w^T δf δf^T w) / (w^T Σ w),   s.t.   ||w||² = 1,
where δf δf^T is the between-class covariance matrix and Σ = (Σ_1 + Σ_2)/2 is the average within-class covariance matrix. Here, we fix ||w||² = 1, as we are interested in the direction of w but not its length. Using a Lagrange multiplier to solve the constrained optimization for w results in
w ∝ Σ^{-1} δf.
This yields the direction, w, along which to best project the neural activity into one dimension. To find the associated p(correct), note that θ̂ is the sum of a (potentially large) set of random variables.
These random variables are correlated, such that the central limit theorem does not directly apply. Nonetheless, we assume this sum to be approximately Gaussian for both θ̂|C_1 and θ̂|C_2, given by
θ̂|C_j ~ N(w^T f_j, w^T Σ_j w).
This results in the sensitivity index
d' = w^T δf / √(w^T Σ w) = √(δf^T Σ^{-1} δf),
yielding the same expression as before. This makes it straightforward to derive the generalized Fisher information as before.
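The relations between I_g, d', and p(correct) can be illustrated with a small numerical sketch (toy moments and names are our own illustration), assuming I_g = δf^T Σ^{-1} δf / δθ² as above:

```python
import math
import numpy as np

def generalized_fisher_information(df, Sigma, dtheta):
    """I_g = (delta f)^T Sigma^{-1} (delta f) / (delta theta)^2."""
    return float(df @ np.linalg.solve(Sigma, df)) / dtheta**2

def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(1)
N = 4
A = rng.normal(size=(N, N))
Sigma = A @ A.T + N * np.eye(N)            # toy average within-class covariance
f1, f2 = rng.normal(size=N), rng.normal(size=N)
df, dtheta = f2 - f1, 0.5

Ig = generalized_fisher_information(df, Sigma, dtheta)
d_prime = math.sqrt(df @ np.linalg.solve(Sigma, df))

# relations from the text: I_g = d'^2 / dtheta^2 and p(correct) = Phi(d'/2)
assert math.isclose(Ig, d_prime**2 / dtheta**2, rel_tol=1e-9)
p_correct = std_normal_cdf(d_prime / 2.0)
assert 0.5 < p_correct < 1.0
```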

Bias-corrected generalized Fisher information
Evaluating the generalized Fisher information, Eq. (21), by replacing δf and Σ by their empirical moments estimated from neural data with a limited number of trials leads to biased estimates [11]. [11] provide a bias correction for standard Fisher information, but it is unclear if this bias correction also applies to our generalization of Fisher information. In this section, we will derive such a bias correction for our generalization. This correction turns out to be the same as that provided by [11]. This is unsurprising in hindsight, as [11] do not restrict the size of δθ in their derivation, such that it applies to both fine and coarse discrimination. We assume neural activity r_j^t, j = 1, 2, in response to stimulus θ_j in trials t = 1, ..., T to follow a multivariate Gaussian distribution,
r_j^t ~ N(f_j, Σ),
where we assume the same covariance matrix for neural activity in response to θ_1 and θ_2. This is not a restriction, as our above derivation from the perspective of a linear discriminator has shown that, if these covariances differ, we can replace them by their average (which is what we do in practice; see below). Under this assumption, the empirical means and the pooled empirical covariance over the T trials per stimulus are distributed as [12]
µ_j ~ N(f_j, Σ/T),   (2T − 2) S ~ W(Σ, 2T − 2),
where W(V_{p×p}, n) is the p-dimensional Wishart distribution with n degrees of freedom. The naïve estimate of generalized Fisher information, Eq. (21), is obtained by replacing δf and Σ with their unbiased estimates, δµ = µ_2 − µ_1 and S, where E(δµ) = δf, δµ ~ N(δf, 2Σ/T), and E(S) = Σ. Furthermore, the inverse of the sample covariance, S^{-1}, follows an inverse Wishart distribution [12], which has mean
E(S^{-1}) = ((2T − 2)/(2T − N − 3)) Σ^{-1}.
Replacing δf and Σ with δµ and S in Eq.
(21) results in the following naïve estimator of the generalized Fisher information,
Î_g,nv = δµ^T S^{-1} δµ / δθ².
To evaluate the bias of Î_g,nv, we utilize the fact that the sample mean and sample covariance of Gaussian distributions are independent [12], such that we can express the first moment of Î_g,nv by
E(Î_g,nv) = Tr(E(S^{-1}) E(δµ δµ^T)) / δθ² = ((2T − 2)/(2T − N − 3)) (I_g + 2N/(T δθ²)),
where we have used E(δµ δµ^T) = δf δf^T + 2Σ/T. Having the first moment of Î_g,nv, we can obtain the expression for the bias-corrected generalized Fisher information, Î_g,bc, given by
Î_g,bc = ((2T − N − 3)/(2T − 2)) Î_g,nv − 2N/(T δθ²).
This estimate is the same as provided by [11], and will, in expectation, equal the true Fisher information, that is, E(Î_g,bc) = I_g.
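A Monte-Carlo sketch of the correction (the toy population, trial counts, and function names are our own illustration; the correction constants follow the derivation above, cf. [11]) shows that the naïve estimator is biased upward while the corrected one is not:

```python
import numpy as np

def fisher_bias_corrected(R1, R2, dtheta):
    """Naive and bias-corrected generalized Fisher information from T trials
    per stimulus (rows of R1, R2), using the correction constants derived above."""
    T, N = R1.shape
    dmu = R2.mean(axis=0) - R1.mean(axis=0)
    X1 = R1 - R1.mean(axis=0)
    X2 = R2 - R2.mean(axis=0)
    S = (X1.T @ X1 + X2.T @ X2) / (2 * T - 2)  # pooled unbiased covariance
    I_nv = float(dmu @ np.linalg.solve(S, dmu)) / dtheta**2
    I_bc = I_nv * (2 * T - N - 3) / (2 * T - 2) - 2 * N / (T * dtheta**2)
    return I_nv, I_bc

rng = np.random.default_rng(2)
N, T, dtheta, reps = 8, 60, 1.0, 2000
A = rng.normal(size=(N, N))
Sigma = A @ A.T / N + np.eye(N)           # toy noise covariance
f1 = rng.normal(size=N)
df = rng.normal(size=N)                   # toy tuning difference
I_true = float(df @ np.linalg.solve(Sigma, df)) / dtheta**2

L = np.linalg.cholesky(Sigma)
nv, bc = [], []
for _ in range(reps):
    R1 = f1 + rng.normal(size=(T, N)) @ L.T
    R2 = f1 + df + rng.normal(size=(T, N)) @ L.T
    i_nv, i_bc = fisher_bias_corrected(R1, R2, dtheta)
    nv.append(i_nv)
    bc.append(i_bc)

# the correction removes most of the upward bias of the naive estimator
assert abs(np.mean(bc) - I_true) < abs(np.mean(nv) - I_true)
assert abs(np.mean(bc) - I_true) < 0.05 * I_true
```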

Variance of bias-corrected generalized Fisher information
Let us now consider the variance of the bias-corrected generalized Fisher information across different draws of T trials from the same neural population. This variance has already been computed by [11], but only as a function of the true information, I_g, which is an unknown quantity. Here, we re-derive this expression for completeness, and additionally derive an unbiased estimate thereof as a function of Î_g,bc, which can be computed from experimental data. The variance of Î_g,bc is given by
var(Î_g,bc) = ((2T − N − 3)/(2T − 2))² var(δµ^T S^{-1} δµ) / δθ⁴,
where var(δµ^T S^{-1} δµ) can be decomposed, by conditioning on S and using the independence of δµ and S, into
var(δµ^T S^{-1} δµ) = E(var(δµ^T S^{-1} δµ | S)) + var(E(δµ^T S^{-1} δµ | S)).
Both terms of this decomposition can be evaluated in closed form using the moments of the Wishart distribution and of Gaussian quadratic forms. To simplify the resulting expression, note that if ε ~ N(µ, Σ), then, for a constant matrix Λ, we have
E(ε^T Λ ε) = Tr(ΛΣ) + µ^T Λ µ.
Additionally, for a symmetric matrix Λ, the variance of the quadratic form is given by
var(ε^T Λ ε) = 2 Tr(ΛΣΛΣ) + 4 µ^T Λ Σ Λ µ.
Applying Eqs. (46) and (47), and collecting terms, results in the final variance expression provided by [11]. Unfortunately, it is a function of the true information I_g, which is unknown, such that the variance cannot be evaluated from data. To find an unbiased estimate of this variance, note that the true information enters this expression both as I_g and as I²_g. We already have an unbiased estimate of I_g, and will now derive such an unbiased estimate for I²_g. Let us denote this estimate by (I²_g)_bc (in contrast to the square of Î_g,bc, which is Î²_g,bc). We find it by computing E(Î²_g,bc), which is a function of I_g and I²_g. Solving for I²_g and substituting I_g by its bias-corrected estimate Î_g,bc reveals the bias-corrected estimate (I²_g)_bc, which satisfies E((I²_g)_bc) = I²_g. Substituting the bias-corrected estimates of I_g and I²_g into Eq. (49) results, after some algebra, in an unbiased variance estimate which can be computed from data.

Covariance of bias-corrected generalized Fisher information
As we are interested in how information scales with population size, we also need to know how information estimates for different subpopulations relate to each other. Knowing this relationship is essential to our model fits, as fitting the information scaling models to information estimates that are correlated across different population sizes could result in significant mis-estimates if these correlations are ignored. In fact, we will use the results from this section to show in Sec. 3.5 that the increase in information due to adding one more neuron to a population is uncorrelated across different subpopulations. Based on this insight, we thus fitted these information increases rather than absolute information values, as illustrated in Supplementary Figure 4 in the main text.
To identify the relation between the information estimates for different subpopulations, we will focus on two subpopulations with N x and N y neurons (N y ≤ N x ) where the latter consists of a subset of neurons of the former. That is, the subpopulation with N x neurons contains all of the N y neurons in the (possibly) smaller subpopulation. We are interested in how their information estimates co-vary if we estimate both information measures from the same set of T trials.
To find this covariance, let us decompose the true (i.e., non-empirical) moments of the larger subpopulation into
δf_x = (δf_y, δf_z),   Σ_x = [[Σ_y, Σ_u^T], [Σ_u, Σ_z]].
Here, δf_x and δf_y are the population tuning differences of the larger and smaller subpopulation, respectively, and we have ordered the neurons in the larger subpopulation such that it contains all shared neurons first, followed by all non-shared neurons (whose moments we denote by the subscript z). This re-ordering is possible, as the information estimates are independent of how neurons are ordered within a population. Furthermore, Σ_x and Σ_y are the noise covariance matrices of the larger and smaller subpopulation, and Σ_u is the covariance of shared with non-shared neurons. Experimentally, we cannot directly observe these moments, but instead estimate them through the empirical moments,
δµ_x = (δµ_y, δµ_z),   S_x = [[S_y, S_u^T], [S_u, S_z]].
Using the same properties as in the previous section, these empirical moments relate to the true moments by δµ_x ~ N(δf_x, 2Σ_x/T) and (2T − 2) S_x ~ W(Σ_x, 2T − 2). The empirical covariances additionally have the property [13] that S_u conditioned on S_y follows a matrix-normal (MN) distribution with mean Σ_u Σ_y^{-1} S_y. From Eq. (39), the bias-corrected generalized information for the two subpopulations, denoted Î^x_g,bc and Î^y_g,bc, can be written in the same form as before. We can decompose Î^x_g,bc into two terms. The first term is the shared information, which is common to subpopulations x and y, as both of them contain all neurons of subpopulation y. The second term is the information that is gained by adding the non-shared neurons. This decomposition can be expressed as
Î^x_g,bc = Î^y_g,bc + δÎ^{x−y}_g,bc,
where δÎ^{x−y}_g,bc is the information gain due to the non-shared components between subpopulations x and y. The covariance of Î^x_g,bc and Î^y_g,bc is then given by
cov(Î^x_g,bc, Î^y_g,bc) = var(Î^y_g,bc) + cov(Î^y_g,bc, δÎ^{x−y}_g,bc),
where we already have an expression for the variance on the right-hand side (i.e., Eq. (52)), and only need to find an expression for the covariance. To calculate cov(Î^y_g,bc, δÎ^{x−y}_g,bc), let us first find an expression for δÎ^{x−y}_g,bc.
To find this expression, note that, by the decomposition of δµ_x and S_x, and using block matrix inversion,
δµ_x^T S_x^{-1} δµ_x = δµ_y^T S_y^{-1} δµ_y + (δµ_z − S_u S_y^{-1} δµ_y)^T (S_z − S_u S_y^{-1} S_u^T)^{-1} (δµ_z − S_u S_y^{-1} δµ_y).
Substituting this relationship into Eqs. (59) and (60) results in the bias-corrected information gain, where "const" captures all non-stochastic terms that do not contribute to the covariance. To evaluate cov(Î^y_g,bc, δÎ^{x−y}_g,bc), we proceed in stages. First, we evaluate the last expectation in Eq. (66) by taking the expectation of the Schur complement S_z − S_u S_y^{-1} S_u^T. Next, we observe that S_u S_y^{-1} | S_y^{-1} is matrix-normal, which has a simple expression for the expectation of its quadratic form. We do not need to complete the expectation over the remaining random variables, because most of the involved terms cancel each other out later on. Utilizing the same strategy, we evaluate the expectation of the first term in Eq. (66), whose expectation with respect to S_u S_y^{-1} | S_y^{-1} takes an analogous form. Utilizing the fact that δµ_y and δµ_z − Σ_u Σ_y^{-1} δµ_y are jointly Gaussian and uncorrelated, which means they are independent, we can combine Eqs. (72) and (69) to simplify the expression in Eq. (66) to
cov(Î^y_g,bc, δÎ^{x−y}_g,bc) = 0,
which means that the information in the smaller population is uncorrelated with the information gain obtained from the non-shared neurons. As a consequence, cov(Î^x_g,bc, Î^y_g,bc) = var(Î^y_g,bc).
Note that this only holds for the bias-corrected information estimates. For the naïve estimates, a similar derivation shows that Î^y_g,nv and δÎ^{x−y}_g,nv are correlated.

Supplementary Note 2. Information scaling models
We assume that information in the recorded population is limited by the presence of information-limiting correlations [1]. In this case, the noise covariance matrix Σ_N for a population of N neurons decomposes into
Σ_N = Σ_0,N + f'_N f'_N^T / I_∞,
where Σ_0,N is the non-limiting covariance component, I_∞ is the asymptotic information, and f'_N is the derivative of the mean population activity. All of these quantities depend on the stimulus, θ, but we will keep this dependency implicit for notational convenience. In the N → ∞ limit, only the second component limits information, while the information associated with Σ_0,N grows without bounds. To see how information grows in the presence of information-limiting correlations, note that the Sherman-Morrison formula allows us to express Σ_N^{-1} by
Σ_N^{-1} = Σ_0,N^{-1} − (Σ_0,N^{-1} f'_N f'_N^T Σ_0,N^{-1}) / (I_∞ + f'_N^T Σ_0,N^{-1} f'_N).
Let us denote the linear Fisher information associated with the non-limiting component by
I_0,N = f'_N^T Σ_0,N^{-1} f'_N.
Then, after some algebra, the total Fisher information is given by
I_N = I_0,N I_∞ / (I_0,N + I_∞),
or, equally,
I_N^{-1} = I_0,N^{-1} + I_∞^{-1}.
This result forms the core of our information scaling models. For the remainder of this section we will discuss how we would expect the information I_0,N in the non-limiting component to scale, and the impact of measurement noise on overall information scaling.
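The decomposition can be verified numerically. In the sketch below (toy covariance and dimensions are our own illustration), constructing Σ_N with a differential-correlation component reproduces 1/I_N = 1/I_0,N + 1/I_∞ exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
N, I_inf = 6, 10.0
A = rng.normal(size=(N, N))
Sigma0 = A @ A.T + np.eye(N)                # toy non-limiting covariance component
fp = rng.normal(size=N)                     # toy tuning derivative f'_N
Sigma = Sigma0 + np.outer(fp, fp) / I_inf   # information-limiting decomposition

I0 = float(fp @ np.linalg.solve(Sigma0, fp))      # non-limiting information I_0,N
I_total = float(fp @ np.linalg.solve(Sigma, fp))  # total information I_N

# Sherman-Morrison implies 1/I_N = 1/I_0,N + 1/I_inf
assert np.isclose(1.0 / I_total, 1.0 / I0 + 1.0 / I_inf)
```

Because 1/I_N = 1/I_0,N + 1/I_∞, total information can never exceed I_∞, no matter how large I_0,N grows.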

Linear non-limiting information scaling for large N
To characterize the scaling of I_0,N with N, let us use the spectral decomposition
Σ_0,N = Σ_{n=1}^{N} σ²_{N,n} z_{N,n} z_{N,n}^T,
with variances σ²_{N,1}, ..., σ²_{N,N} and principal directions z_{N,1}, ..., z_{N,N}. Then, I_0,N is given by
I_0,N = Σ_{n=1}^{N} (f'_N^T z_{N,n})² / σ²_{N,n} = ||f'_N||² Σ_{n=1}^{N} cos²(α_{N,n}) / σ²_{N,n},
where α_{N,n} is the angle between f'_N and z_{N,n}. To see how I_0,N scales with N, let us assume that the α_{N,n}'s are independent of the σ²_{N,n}'s. Furthermore, assume that no principal direction is systematically preferred, such that all E(cos²(α_{N,n})) are equal. In addition, geometry requires that Σ_{n=1}^{N} cos²(α_{N,n}) = 1, such that E(cos²(α_{N,n})) = 1/N. Together, this yields
E(I_0,N) = (||f'_N||² / N) Σ_{n=1}^{N} σ^{-2}_{N,n}.
Therefore, under these assumptions, the scaling of I_0,N only depends on the scaling of the eigenvalue spectrum {σ²_{N,1}, ..., σ²_{N,N}} of Σ_0,N, and on the growth of ||f'_N||², which is O(N) if each neuron contributes an order-one amount of tuning. For the following, we will assume that each neuron in the population features some small amount of "private" noise that is not correlated with the variability of other neurons. This private noise introduces a lower bound, σ²_0, on the variances, that is, σ²_{N,n} ≥ σ²_0 for all n. Together with the above expression, this allows us to derive a lower bound on the scaling of non-limiting information. In particular, by Jensen's inequality,
(1/N) Σ_{n=1}^{N} σ^{-2}_{N,n} ≥ ((1/N) Σ_{n=1}^{N} σ²_{N,n})^{-1}.
The right-hand side contains the average variance, which is lower-bounded by σ²_0 and of order one, such that its inverse is also of order one. Therefore, the scaling of I_0,N is at least O(N).
To gain further insight into the scaling of I_0,N, assume a sequence of non-limiting covariance matrices Σ_0,M, Σ_0,M−1, Σ_0,M−2, ..., starting with some large population of M neurons. Each consecutive matrix Σ_0,N−1 is constructed from the next-larger matrix Σ_0,N by removing a single neuron, such that they share all entries except for one row and column associated with that neuron. If we order their eigenvalues according to σ²_{N,1} ≥ σ²_{N,2} ≥ ... ≥ σ²_{N,N}, it is known that these eigenvalues obey the interleaved ordering
σ²_{N,n} ≥ σ²_{N−1,n} ≥ σ²_{N,n+1},   n = 1, ..., N − 1.
Writing I_0,N ∝ Σ_{n=1}^{N} σ^{-2}_{N,n}, the information increase when moving from N − 1 to N neurons becomes
I_0,N − I_0,N−1 ∝ Σ_{n=1}^{N−1} (σ^{-2}_{N,n} − σ^{-2}_{N−1,n}) + σ^{-2}_{N,N}.
As σ²_{N,N} is always the smallest eigenvalue of the covariance matrix, an O(1) increase can be guaranteed as long as σ²_{N,N} remains positive with increasing N, which is satisfied by our previous assumption that each neuron has some private noise. If it instead were to go to zero, we would have lim_{N→M} σ^{-2}_{N,N} = ∞, violating the requirement.
For the first term, we observe that the interleaved eigenvalue ordering of nested matrices implies that σ^{-2}_{N,n} ≤ σ^{-2}_{N−1,n} for all n = 1, ..., N − 1. This implies that every element in the sum is negative. However, the information increase I_0,N − I_0,N−1 cannot be negative. Therefore, the second term has to be at least as large as the negative of the first term (i.e., the sum), that is,
|Σ_{n=1}^{N−1} (σ^{-2}_{N,n} − σ^{-2}_{N−1,n})| ≤ σ^{-2}_{N,N}.
As σ^{-2}_{N,N} is O(1), the sum cannot be larger than O(1). Overall, as long as none of the variances become zero with increasing N, the increase in I_0,N will be O(1), which implies that I_0,N scales with O(N).
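The eigenvalue interlacing that this argument relies on can be checked directly for a toy nested covariance pair (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 7
A = rng.normal(size=(N, N))
Sigma_N = A @ A.T + 0.1 * np.eye(N)  # toy covariance with a private-noise floor
Sigma_Nm1 = Sigma_N[:-1, :-1]        # remove one neuron (one row and one column)

ev_N = np.sort(np.linalg.eigvalsh(Sigma_N))[::-1]     # descending eigenvalues of Sigma_N
ev_Nm1 = np.sort(np.linalg.eigvalsh(Sigma_Nm1))[::-1]

# Cauchy interlacing: sigma^2_{N,n} >= sigma^2_{N-1,n} >= sigma^2_{N,n+1}
assert np.all(ev_N[:-1] >= ev_Nm1 - 1e-10)
assert np.all(ev_Nm1 >= ev_N[1:] - 1e-10)
```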

Models for I 0,N
The above argument shows that, under rather general conditions, I_0,N can be expected to scale with O(N). However, it does not tell us how I_0,N behaves for small N, which depends on the details of the structure of Σ_0,N.
To describe the details of this structure, we compared two models for I_0,N. The first, called the lim model, directly follows the scaling results and assumes that I_0,N = cN with some parameter c that is independent of N. The second model, called the lim-exp model, allows the non-limiting information to initially grow supralinearly before converging to a linear growth. We derived this model by integrating c(1 − e^{−n/τ}) from zero to N, resulting in
I_0,N = c (N − τ (1 − e^{−N/τ})),
with the additional parameter τ that controls the extent of the initial supralinearity (in units of N). We have chosen this particular model, as it turns out to be easier to fit than alternative models (such as, for example, integrating a re-scaled logistic sigmoid over the positive half-line) that provide qualitatively similar I_0,N scaling. This model approaches I_0,N = cN in the τ → 0 limit. Model comparison revealed the lim model to significantly outperform the lim-exp model (Supplementary Figure 6), such that we focused on the lim model in the main text.
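The two models can be written down directly; the sketch below (function names and parameter values are our own illustration) confirms that the lim-exp model has growing per-neuron increments and collapses to the lim model as τ → 0:

```python
import math

def I0_lim(N, c):
    """lim model: non-limiting information grows exactly linearly."""
    return c * N

def I0_lim_exp(N, c, tau):
    """lim-exp model: integral of c * (1 - exp(-n/tau)) from 0 to N."""
    return c * (N - tau * (1.0 - math.exp(-N / tau)))

c, tau = 0.8, 20.0
# convex onset: early per-neuron increments are smaller than late ones,
# so the cumulative curve is superadditive over its initial range
d_early = I0_lim_exp(2, c, tau) - I0_lim_exp(1, c, tau)
d_late = I0_lim_exp(200, c, tau) - I0_lim_exp(199, c, tau)
assert d_early < d_late

# tau -> 0 recovers the lim model
assert abs(I0_lim_exp(50, c, 1e-9) - I0_lim(50, c)) < 1e-6
```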

Impact of measurement noise
Our recordings of neural activity might be noisy, introducing additional variability into our estimates of Σ_N and f'_N. To estimate the effect of such measurement noise, we assume it to be of equal magnitude and independent across neurons, such that it adds an additional diagonal term to the covariance decomposition,
Σ_N = Σ_0,N + σ²_rec I + f'_N f'_N^T / I_∞,
where σ²_rec denotes the variance of the measurement noise. We do not assume it to impact differential correlations, as those limit information in the brain, rather than our measurement thereof.
Following the same derivation as in the beginning of this section, the information in a population of N neurons becomes
I_rec,N^{-1} = I_0,rec,N^{-1} + I_∞^{-1},
where
I_0,rec,N = f'_N^T (Σ_0,N + σ²_rec I)^{-1} f'_N
is the non-limiting information, including measurement noise. We can, as before, use the spectral decomposition to write
I_0,rec,N ∝ Σ_{n=1}^{N} (σ²_{N,n} + σ²_rec)^{-1}.
This shows that measurement noise increases all eigenvalues of Σ_0,N by the same magnitude. This has several consequences. First, the added variance baseline results in I_0,rec,N growing more slowly with N than I_0,N. Second, this baseline causes the eigenvalues of Σ_0,N + σ²_rec I to be more similar to each other than those of Σ_0,N alone. As a consequence, the growth of I_0,rec,N ∝ Σ_n (σ²_{N,n} + σ²_rec)^{-1} with N is more linear than that of I_0,N ∝ Σ_n σ^{-2}_{N,n}. This might make I_0,rec,N = c_rec N a good model of non-limiting information growth, even if I_0,N = cN is not. Third, as the measurement noise impacts only I_0,rec,N but not I_∞, measurement noise only impacts our estimates of c but not of I_∞. Fourth, measurement noise will lower our estimates of c, and therefore increase our estimates of N_a = (a/(1 − a)) I_∞/c, which is the population size at which a fraction a of the asymptotic information I_∞ is reached.
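A small numerical sketch (toy covariance and parameter values are our own illustration) demonstrates the key point: adding σ²_rec I lowers the non-limiting information, while the asymptotic bound I_∞ is untouched:

```python
import numpy as np

rng = np.random.default_rng(5)
N, I_inf, sig2_rec = 6, 10.0, 0.5
A = rng.normal(size=(N, N))
Sigma0 = A @ A.T + np.eye(N)   # toy non-limiting covariance component
fp = rng.normal(size=N)        # toy tuning derivative f'_N

def total_info(Sigma0_eff):
    """Total information via 1/I_N = 1/I_0,N + 1/I_inf."""
    I0 = float(fp @ np.linalg.solve(Sigma0_eff, fp))
    return 1.0 / (1.0 / I0 + 1.0 / I_inf)

I_clean = total_info(Sigma0)
I_noisy = total_info(Sigma0 + sig2_rec * np.eye(N))  # noise shifts every eigenvalue up

# measurement noise lowers the non-limiting information (and hence estimates of c),
# but both informations remain bounded by the unchanged asymptote I_inf
assert I_noisy < I_clean
assert I_noisy < I_inf and I_clean < I_inf
```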

Impact of eye movements
To estimate the impact of eye movements that might occur between trials, we assume that their only possible effect was to make the stimulus appear outside some neurons' receptive fields in some trials. We assume that, under those circumstances, the neuron's activity would be set to zero. For tractability, our model does not consider details of the spatial structure of receptive fields. This leads to a simple model in which we assume one indicator variable z_t per trial t that is z_t = 0 if the stimulus appears in all measured neurons' receptive fields in that trial, and z_t = 1 if it might fall outside of some receptive fields. We furthermore introduce the indicator variable z_nt for neuron n in trial t that is z_nt = 0 if the stimulus falls into neuron n's receptive field in trial t, and z_nt = 1 if it might fall outside of it. We assume that p(z_t = 1) = p_t and p(z_nt = 1) = p_n, independently across trials and neurons. Furthermore, p_n is the same across all neurons, and thus a scalar. Neuron n's activity in trial t is set to zero only if z_t = 1 and z_nt = 1. Formally, if r_nt,ori is the neuron's unperturbed (original) spike count in that trial, its perturbed one becomes
r_nt = (1 − z_t z_nt) r_nt,ori.
For simplicity, we here assumed z_t and z_nt to be independent, and zero out neurons only if z_t z_nt = 1. For z_t = 0, this occurs with probability p(z_t z_nt = 1 | z_t = 0) = 0, that is, never. For z_t = 1, it occurs with probability p(z_t z_nt = 1 | z_t = 1) = p_n. We could have equivalently fixed z_nt = 0 for all trials in which z_t = 0, defined p(z_nt = 1 | z_t = 1) = p_n, and set r_nt = (1 − z_nt) r_nt,ori. This choice leads to p(z_nt = 1 | z_t = 0) = 0 and p(z_nt = 1 | z_t = 1) = p_n, illustrating that it yields the same result as our original assumptions. In either formulation, z_t mimics global fluctuations, whereas z_nt mimics fluctuations that are private to individual neurons.

Impact on population activity moments
As linear Fisher information only depends on f' and Σ, we here assess the impact of occasionally zeroing out neurons on these moments. To do so, we denote f_ori, f'_ori, and Σ_ori as the unperturbed moments, and f, f', and Σ as the same moments after perturbation. To find the perturbed mean, observe that
f_n = E((1 − z_t z_nt) r_nt,ori) = (1 − p_t p_n) f_n,ori = a f_n,ori,
where we have defined a = 1 − p_t p_n. The perturbed covariance follows a similar derivation. For n ≠ m,
Σ_nm = (a² + b²) Σ_nm,ori + b² f_n,ori f_m,ori,
where we have defined b² = p_t (1 − p_t) p_n². A similar derivation for n = m results in
Σ_nn = (a² + b²) Σ_nn,ori + b² f²_n,ori + p_t p_n (1 − p_n) (Σ_nn,ori + f²_n,ori).
Together, this yields
Σ_nm = (a² + b²) Σ_nm,ori + b² f_n,ori f_m,ori + δ_nm p_t p_n (1 − p_n) (Σ_nn,ori + f²_n,ori),
where δ_nm = 1 if n = m, and δ_nm = 0 otherwise. Overall, this results in
Σ = (a² + b²) Σ_ori + b² f_ori f_ori^T + p_t p_n (1 − p_n) S,
where S is a diagonal matrix with Σ_nn,ori + f²_n,ori as the nth element of its diagonal. Therefore, the perturbed covariance Σ includes a scaled-down version (as a² + b² ≤ 1) of the unperturbed covariance Σ_ori, with the scaling factor a² + b² decreasing monotonically in both p_n and p_t. Zeroing out neurons results in two additional perturbations. First, it causes the addition of the rank-one matrix f_ori f_ori^T, whose magnitude increases with p_n and with the variance of z_t, var(z_t) = p_t(1 − p_t), which is largest for p_t = 1/2. Second, it adds a diagonal component whose magnitude increases with p_t and with the variance of z_nt, var(z_nt) = p_n(1 − p_n).
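The perturbed-moment expressions can be validated exactly for a two-neuron toy example (all numbers illustrative) by enumerating the eight mask configurations; here we take a = 1 − p_t p_n and b² = p_t(1 − p_t) p_n², which the enumeration below confirms:

```python
import itertools
import numpy as np

p_t, p_n = 0.3, 0.2
f_ori = np.array([1.0, 2.0])
Sigma_ori = np.array([[1.0, 0.3], [0.3, 1.5]])

# exact mask moments, m_n = 1 - z_t * z_nt, by enumerating all (z_t, z_1t, z_2t)
M = np.zeros((2, 2))      # M[n, m] = E[m_n m_m]
mean_mask = np.zeros(2)
for zt, z1, z2 in itertools.product([0, 1], repeat=3):
    p = ((p_t if zt else 1 - p_t)
         * (p_n if z1 else 1 - p_n)
         * (p_n if z2 else 1 - p_n))
    m = np.array([1 - zt * z1, 1 - zt * z2])
    mean_mask += p * m
    M += p * np.outer(m, m)

# exact perturbed moments (the mask is independent of r)
second_ori = Sigma_ori + np.outer(f_ori, f_ori)   # E[r r^T] before perturbation
f_pert = mean_mask * f_ori
Sigma_pert = M * second_ori - np.outer(f_pert, f_pert)

# analytic expressions: a = 1 - p_t p_n, b^2 = p_t (1 - p_t) p_n^2
a = 1 - p_t * p_n
b2 = p_t * (1 - p_t) * p_n**2
S = np.diag(np.diag(second_ori))
Sigma_analytic = ((a**2 + b2) * Sigma_ori + b2 * np.outer(f_ori, f_ori)
                  + p_t * p_n * (1 - p_n) * S)

assert np.allclose(f_pert, a * f_ori)
assert np.allclose(Sigma_pert, Sigma_analytic)
```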
Both f and Σ are computed conditional on a specific stimulus. The Σ we used to compute linear Fisher information is the average across the two considered stimuli. Then, the perturbed Σ becomes a linear combination of the average Σ ori , f ori f T ori , and S, for both stimuli.

Impact on information scaling
Let us consider the impact of the above perturbations on how information scales with population size, and its consequences for the estimated scaling parameters c and I_∞. If we ignore the rank-one and diagonal additions to the covariance matrix, the overall Fisher information for each N is re-scaled by
I_N = ((1 − p_t p_n)² / (a² + b²)) I_N,ori,
with a scaling factor that is close to one for small p_n and p_t, and shrinks monotonically in p_n. Furthermore, it is smallest for interim values of p_t, but becomes one for p_t = 0 and p_t = 1. Based on the expression of our information scaling model, Eq. (78), as I_N,ori^{-1} = (c_ori N)^{-1} + I_∞,ori^{-1}, estimates of both c and I_∞ will be down-scaled by the same factor.
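The properties of this scaling factor can be checked directly; the sketch below (function name and probe values are our own illustration) assumes the factor (1 − p_t p_n)²/(a² + b²) with a and b² as derived above:

```python
def info_scaling_factor(p_t, p_n):
    """Rescaling of Fisher information, (1 - p_t p_n)^2 / (a^2 + b^2),
    assuming a = 1 - p_t p_n and b^2 = p_t (1 - p_t) p_n^2."""
    a2 = (1 - p_t * p_n) ** 2
    b2 = p_t * (1 - p_t) * p_n**2
    return a2 / (a2 + b2)

# the factor is exactly one at p_t = 0 and p_t = 1 ...
assert abs(info_scaling_factor(0.0, 0.3) - 1.0) < 1e-12
assert abs(info_scaling_factor(1.0, 0.3) - 1.0) < 1e-12
# ... below one for intermediate p_t, and it shrinks monotonically in p_n
assert info_scaling_factor(0.5, 0.3) < 1.0
assert info_scaling_factor(0.5, 0.4) < info_scaling_factor(0.5, 0.2)
```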
Adding a diagonal term to Σ has an analogous effect as measurement noise, as discussed in the previous section: it causes information to grow more slowly with N , thus lowering c further, but leaves I ∞ unperturbed.
The impact of the rank-one component f_ori f_ori^T depends on the alignment between f_ori and f'_ori. If they are perfectly aligned, that is, if f'_ori ∝ f_ori, this component introduces differential correlations, thus lowering the asymptotic information I_∞. If the alignment is only partial, then this component does not limit information [1], but might still lower c. Non-alignment is likely if a change in the stimulus results in a simple shift of the population activity pattern, as is the case in our experiments. Perfect alignment could occur, for example, if the stimulus modulates the gain of population activity, making the change in population activity due to a change in stimulus well-aligned with the average population activity. This might occur in contrast discrimination tasks, if contrast modulates the population activity gain.
In summary, zeroing out some neurons in a fraction of trials lowers our estimate of c, depending on a complex interplay of p_t and p_n. If f'_ori is not perfectly aligned with f_ori, then I_∞ is down-weighted by (1 − p_n p_t)²/(a² + b²). In case of perfect alignment, the asymptotic information estimate is suppressed further. We confirmed these effects in simulations, as shown in Supplementary Figure 14.

Supplementary Note 3. Estimating the information scaling moments from neural data
Here, we fix the discrimination (i.e., the pair of drift directions, θ_1 and θ_2) and discuss how we estimate the moments of Fisher information for different population sizes. To do so, we assume a large population with M neurons of which we subsample N neurons, where N ≪ M. Rather than focusing on the moments of the Fisher information I_n for population size n ≤ N, we will instead focus on the moments of the Fisher information increase, ∆I_n = I_n − I_{n−1} (with I_0 = 0), when increasing the population size from n − 1 to n neurons, for reasons that become apparent later. Our aim is to estimate the mean, E(∆I_n), the variance, var(∆I_n), and the covariance, cov(∆I_n, ∆I_m), for different population sizes n and m.

Generative model and desired moments
To describe the stochasticity of ∆I_n, we assume the following generative process. Assume that neurons in the large population have indices 1 to M, and that we uniformly draw a subset of N different neurons with indices i_1, i_2, ..., i_N, denoted i_1:N. This subpopulation has moments f_{i_1:N} and Σ_{i_1:N}, which in turn can be used to compute its associated Fisher information. However, we do not directly observe these moments, but instead record the population activity across T trials for each stimulus, θ_1 and θ_2, from which we compute the empirical moments γ_{i_1:N} and Ω_{i_1:N}. These empirical moments are in turn used to compute the Fisher information increases ∆Î_{1:N}, using the bias-corrected estimates discussed further above. In summary, the generative process follows the Markov chain
i_1:N → (f_{i_1:N}, Σ_{i_1:N}) → (γ_{i_1:N}, Ω_{i_1:N}) → ∆Î_{1:N}.
In this Markov chain, the first and last transitions are deterministic, and the center transition is stochastic. Therefore, we can write the generative model as
p(∆Î_{1:N}) = Σ_{i_1:N} p(i_1:N) p(∆Î_{1:N} | i_1:N),
where the Fisher information increases are a deterministic function of the empirical moments, and the sum is over different subpopulations drawn from the larger population. We assume these draws to be uniform, that is, p(i_1:N) ∝ 1.
To find the moments of ∆Î_n, we use iterated expectation, variance, and covariance, which, for a Markov chain Z → X_1, X_2, are given by

E(X_1) = E_Z(E(X_1 | Z)),
var(X_1) = E_Z(var(X_1 | Z)) + var_Z(E(X_1 | Z)),
cov(X_1, X_2) = E_Z(cov(X_1, X_2 | Z)) + cov_Z(E(X_1 | Z), E(X_2 | Z)).

Applied to our generative model, this yields the decompositions

E(∆Î_n) = E_{i_{1:N}}(E(∆Î_n | i_{1:N})),   (104)
var(∆Î_n) = E_{i_{1:N}}(var(∆Î_n | i_{1:N})) + var_{i_{1:N}}(E(∆Î_n | i_{1:N})),   (105)
cov(∆Î_n, ∆Î_m) = E_{i_{1:N}}(cov(∆Î_n, ∆Î_m | i_{1:N})) + cov_{i_{1:N}}(E(∆Î_n | i_{1:N}), E(∆Î_m | i_{1:N})),   (106)

where both variance and covariance are decomposed into (i) the (co)variance of the information increase for a fixed subpopulation i_{1:N}, averaged across different subpopulations, and (ii) how the average information increase for a fixed subpopulation (co)varies across different subpopulations. Our data do not allow us to directly estimate these moments, for two reasons. First, we do not observe the larger population, and so cannot use it to draw different subpopulations. We address this limitation in the next subsection. Second, we only observe a single set of empirical moments, µ and S, for the subpopulation that we record from. We address this limitation in the remaining subsections.
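The iterated-expectation identities above can be checked numerically on a toy two-level hierarchy; this sketch verifies the law of total variance for a Gaussian chain Z → X with assumed parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Markov chain Z -> X: draw Z, then X | Z ~ N(Z, 1)
Z = rng.normal(0.0, 2.0, size=200_000)
X = rng.normal(Z, 1.0)

# Law of total variance: var(X) = E_Z[var(X|Z)] + var_Z[E(X|Z)].
# Here var(X|Z) = 1 and E(X|Z) = Z, so the right-hand side is 1 + var(Z).
lhs = X.var()
rhs = 1.0 + Z.var()
assert abs(lhs - rhs) < 0.05
```

The same decomposition, with Z playing the role of the subpopulation index i_{1:N}, underlies Eqs. (105) and (106).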

Simulating samples from a large, unobserved population
Our generative model assumes that we subsample N neurons from a large neural population of M neurons. Our data, in contrast, are population recordings from a single neural population with N neurons. To use these recordings to simulate sampling various subpopulations of the larger population, we assume these subpopulations to be statistically similar to the recorded population. That is, each sampled subpopulation contains neurons with activity statistics similar to those of the recorded population. Thus, each sampled subpopulation will contain all neurons from the recorded population, but in a different order for each sample. We simulate this by introducing a new index set j_{1:N} = j_1, j_2, …, j_N that, for each sampled subpopulation i_{1:N}, contains a random ordering of the indices 1, …, N of the neurons in the recorded population. With this, all of the above moments across i_{1:N} become moments across j_{1:N}, while taking into account that the recorded population is used as a proxy for sampling different subpopulations from a larger population. We describe the consequences of this for each of the moments separately.

Estimating the mean
The desired mean of the information increase ∆Î_n is, by Eq. (104), the average information increase for a particular set of empirical moments, γ_{i_{1:N}} and Ω_{i_{1:N}}, of a particular subpopulation i_{1:N}, averaged across different subpopulations. We deal with not being able to sample different subpopulations by replacing i_{1:N} with a randomly ordered recorded population j_{1:N}. Furthermore, we cannot draw different empirical moments γ_{i_{1:N}} and Ω_{i_{1:N}} for a given subpopulation, as would be required to compute E(∆Î_n | i_{1:N}). We replace this expectation with our best estimate thereof, which is the increase in the bias-corrected Fisher information estimated from the empirical moments of the recorded population, µ and S. Overall, this leads to the approximate estimate

E(∆Î_n) ≈ E_{j_{1:N}}(∆Î_n(µ_{j_{1:N}}, S_{j_{1:N}})),

where µ_{j_{1:N}} and S_{j_{1:N}} denote the empirical moments with neurons ordered according to j_{1:N}. As our Fisher information estimate is unbiased, the above estimate is unbiased as well. In practice, we approximate the expectation over j_{1:N} by 10,000 random orderings.
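The Monte Carlo average over orderings can be sketched as follows; `increments_fn` is a hypothetical helper that returns the N increment estimates for one random ordering (the names and interface are assumptions of this sketch).

```python
import numpy as np

def mean_increments(R1, R2, dtheta, increments_fn, n_orders=10_000, seed=0):
    """Approximate E(Delta-I_n) by averaging increment estimates over
    random orderings j_{1:N} of the recorded population.

    increments_fn(R1, R2, dtheta, rng) -> length-N array of increments
    for one random ordering (hypothetical helper).
    """
    rng = np.random.default_rng(seed)
    N = R1.shape[1]
    total = np.zeros(N)
    for _ in range(n_orders):
        total += increments_fn(R1, R2, dtheta, rng)
    return total / n_orders
```

Because each ordering's estimate is unbiased, the average over orderings is unbiased as well, while its Monte Carlo error shrinks with `n_orders`.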

Estimating the variance
The variance, Eq. (105), is decomposed into two terms. The first, E_{i_{1:N}}(var(∆Î_n | i_{1:N})), is the variance of the Fisher information increase for a fixed subpopulation, averaged across many subpopulations. This term captures the uncertainty in ∆Î_n that arises from estimating it from the empirical moments. The second term, var_{i_{1:N}}(E(∆Î_n | i_{1:N})), captures how the average Fisher information increase for a given subpopulation varies across different subpopulations. Our data do not allow us to compute either of these terms directly. However, it turns out that their sum is well-approximated by how the Fisher information increase estimated from the empirical moments, µ and S, varies across different population orderings j_{1:N}, that is,

E_{i_{1:N}}(var(∆Î_n | i_{1:N})) + var_{i_{1:N}}(E(∆Î_n | i_{1:N})) ≈ var_{j_{1:N}}(∆Î_n(µ_{j_{1:N}}, S_{j_{1:N}})).
To understand why this approximation works, we need to consider two components that contribute to the empirical moments of the recorded neurons. The first is that, for each neuron and each neuron pair, these empirical moments are noisy, as they are estimated from a limited number of trials. We can thus approximate the effect of using empirical rather than true moments, as captured by the first term in Eq. (105), by computing the variance across different neurons in the population, as achieved by the variance across different orderings j_{1:N}. The second component is that different neurons contribute different amounts of information to the population. This comes into play in the second term in Eq. (105), and is again well-approximated by the variance across different orderings j_{1:N}. As it may seem paradoxical that the same variance can capture both kinds of effects at the same time, we have verified it in simulations of neural populations, shown in Supplementary Figure 16a.

Estimating the covariance
As with the variance, the covariance, Eq. (106), can be decomposed into two terms that capture different sources of uncertainty. The first term, E_{i_{1:N}}(cov(∆Î_n, ∆Î_m | i_{1:N})), captures the uncertainty associated with estimating empirical moments from a limited number of trials. To find this covariance, assume n ≠ m and note that

cov(∆Î_n, ∆Î_m) = cov(Î_n, Î_m) − cov(Î_n, Î_{m−1}) − cov(Î_{n−1}, Î_m) + cov(Î_{n−1}, Î_{m−1}),

where all covariances are conditional on i_{1:N}. Without loss of generality we can assume that n > m, and use Eq. (75) from Sec. 1.4 to find

cov(∆Î_n, ∆Î_m | i_{1:N}) = var(Î_m) − var(Î_{m−1}) − var(Î_m) + var(Î_{m−1}) = 0.

This shows that, conditional on i_{1:N}, the information increase estimates are uncorrelated. The second term, cov_{i_{1:N}}(E(∆Î_n | i_{1:N}), E(∆Î_m | i_{1:N})), captures how the average Fisher information increase associated with adding the nth neuron correlates with that of adding the mth neuron across different subpopulation samples. On average, these increases will be negatively correlated, for the following reason. The variance of the information estimate Î_n = Σ_{k=1}^n ∆Î_k can be decomposed into

var(Î_n) = Σ_{k=1}^n var(∆Î_k) + 2 Σ_{k=1}^n Σ_{l=1}^{k−1} cov(∆Î_k, ∆Î_l),

which shows the impact of the individual variances, as well as of the covariances between estimates associated with different population sizes. For a population of M neurons, the estimate of total information, Î_M, will be the same irrespective of how the neurons are ordered within that population. Therefore, var(Î_M) = 0. However, as, by definition, var(∆Î_n) ≥ 0, the above decomposition implies that the covariances need to be negative on average, to ensure that the sum of variances and covariances becomes zero. The same principle applies if we estimate the variance of ∆Î_n by shuffling the order j_{1:N} of neurons in a smaller, recorded population. If this population has N neurons, then Î_N is the same irrespective of j_{1:N}, such that var_{j_{1:N}}(Î_N) = 0, and the information increase estimates will be negatively correlated.
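The negative-correlation argument can be demonstrated with a toy example: shuffling a fixed set of per-neuron contributions leaves the total sum unchanged across orderings, which forces the increments to be negatively correlated on average.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.exponential(size=50)      # fixed "per-neuron contributions"

# Across shuffles, the increments are the shuffled values, and the total
# I_N is identical for every ordering, so var(I_N) = 0 across shuffles
# while each var(Delta-I_n) > 0 -> covariances must be negative on average.
shuffles = np.array([rng.permutation(values) for _ in range(5000)])
C = np.cov(shuffles, rowvar=False)     # covariance across orderings

assert np.allclose(shuffles.sum(axis=1), values.sum())  # total is fixed
off_diag = C[~np.eye(len(values), dtype=bool)]
assert off_diag.mean() < 0             # increments negatively correlated
```

For a random permutation of a fixed set, each pairwise covariance is exactly −σ²/(N − 1) in expectation, which is what the empirical off-diagonal mean reflects.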
Recall that we use population order shuffling as a proxy for repeatedly subsampling N neurons from a larger population of M neurons. The shuffling-induced negative correlations arise because the same N recorded neurons are used across all estimates. If we instead subsample a larger population, the different sampled subpopulations are bound to share fewer neurons. For two subpopulations that share no neurons, the estimates would be completely uncorrelated. Even for N ≪ M, however, two random subpopulations of size N are likely to share some neurons of the larger population; indeed, the same intuition underlying the birthday paradox [14] tells us that we are almost guaranteed to find such shared neurons. Yet the correlations depend not only on the presence of shared neurons, but also on how many neurons are shared, and this number decreases significantly for larger M. To show that this significantly lowers the impact of negative correlations on the total variance, we compared this variance computed with and without accounting for these correlations for different M. As Supplementary Figure 16b shows, their impact drops significantly with growing M. Therefore, we approximate them as zero, that is, cov(∆Î_n, ∆Î_m) ≈ 0.
This results in an overestimate of the variance of the Fisher information estimate, which makes our fits less certain and, as a consequence, more conservative.
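The birthday-paradox intuition can be made concrete: the expected overlap of two uniform size-N subsets of M neurons is N²/M (the hypergeometric mean), so shared neurons are near-certain even when N ≪ M, while the shared fraction shrinks as M grows. A quick check with assumed sizes:

```python
import numpy as np

M, N = 10_000, 300
rng = np.random.default_rng(2)

# Expected overlap of two uniform size-N subsets of M is N**2 / M
# (hypergeometric mean): ~9 shared neurons here, so some overlap is
# near-certain even though N << M.
overlaps = [
    len(np.intersect1d(rng.choice(M, N, replace=False),
                       rng.choice(M, N, replace=False)))
    for _ in range(2000)
]
assert abs(np.mean(overlaps) - N**2 / M) < 1.0
```

Doubling M halves the expected overlap, which is why the negative correlations weaken for larger assumed population sizes.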

Supplementary Note 4. Population activity models
We used two different models to simulate population activity, as described below.

A Gaussian population activity model with limited information
We used a simple Gaussian activity model to satisfy the assumptions of Gaussianity underlying the generalized linear Fisher information, and to test some of the properties of our estimates. This model violates some properties of neural activity, like non-negativity, but is convenient for our purposes, as it supports fine control over the eigenvalues of Σ_0, the alignment of f to Σ_0, and the asymptotic information, I_∞. For a population size of N neurons, we generated Σ_0 by drawing a random orthonormal matrix Z_0 of size N × N that forms the eigenvectors of Σ_0. We parameterized the eigenvalues by σ²

A visual hierarchy population activity model
We relied on [15] for a more realistic model of V1 population activity that is driven by pixel-level inputs. Details of this model can be found in [15]. Briefly, a population of N neurons responded to a P × P pixelated image J of an oriented Gabor. The nth neuron's linear filter F_n was determined for each (x, y) pixel by

F_n(x, y) = c exp(−(x² + y²)/(2σ²)) cos(2π(x cos θ_n + y sin θ_n)/λ + φ),

where c is the Michelson contrast, θ_n determines the neuron's tuning, σ² determines the size of the exponential envelope, and λ and φ are the Gabor's frequency and phase, respectively. The filter was computed by the above function for each (x, y) and then standardized to have mean zero and unit variance across all (x, y). Image templates, J(θ), in response to stimulus θ were generated in the same way, with θ_n replaced by the template's orientation, θ. Each neuron's gain, a_n, was drawn from a log-normal distribution with unit mean and variance σ²_a, and then multiplied by the overall gain, g.
Neural population activity is assumed to arise from the image template with Gaussian pixel noise (zero mean, variance σ²_0), followed by application of the per-neuron linear filters, F_n, multiplied by their gains a_n, and a Poisson step. For Supplementary Figure 7, we estimated information from a set of trials, in each of which neural activity was generated from a different pixel noise instantiation. For Supplementary Figure 10, we skipped the Poisson step, as it introduced additional noise and was not required for the point we were trying to make. Instead, we estimated Fisher information from approximations to the neural mean responses and their covariance matrix, following [15]. We computed the mean response of neuron n to image J by

f_n(θ) = [a_n Σ_{xy} F_{n,xy} J_{xy}(θ)]_+,

where [·]_+ is the threshold-linear function that sets negative values to zero. The population noise covariance was computed by

Σ(θ) = σ²_0 (a aᵀ) ⊗ (Fᵀ F) + diag(f(θ)),

where ⊗ denotes the element-wise Hadamard product, a = (a_1, …, a_N)ᵀ is the column vector of per-neuron gains, F is the P² × N filter matrix with per-neuron filters unrolled as vectors along its columns, and f(θ) is the mean population activity in response to stimulus θ. The information was computed from Σ(θ) and f(θ).
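The filter bank, mean responses, and noise covariance described above can be sketched as follows. The Gabor form, image parameters, and all numerical values are assumptions of this sketch, not the parameters used in the text.

```python
import numpy as np

def gabor_filter(P, theta, sigma2, lam, phi, c=1.0):
    """Assumed standard Gabor form for a neuron's linear filter,
    standardized to zero mean and unit variance across pixels."""
    x, y = np.meshgrid(np.arange(P) - P / 2, np.arange(P) - P / 2)
    u = x * np.cos(theta) + y * np.sin(theta)
    F = c * np.exp(-(x**2 + y**2) / (2 * sigma2)) \
          * np.cos(2 * np.pi * u / lam + phi)
    return (F - F.mean()) / F.std()

P, N, sigma0_sq = 16, 40, 0.1                   # assumed sizes
rng = np.random.default_rng(3)
a = rng.lognormal(mean=-0.125, sigma=0.5, size=N)   # unit-mean gains
thetas = np.linspace(0, np.pi, N, endpoint=False)
F = np.stack([gabor_filter(P, t, sigma2=8.0, lam=6.0, phi=0.0).ravel()
              for t in thetas], axis=1)         # P^2 x N filter matrix
J = gabor_filter(P, 0.3, sigma2=8.0, lam=6.0, phi=0.0).ravel()  # template

f = np.maximum(a * (F.T @ J), 0.0)              # threshold-linear means
Sigma = sigma0_sq * np.outer(a, a) * (F.T @ F) + np.diag(f)
```

The pixel-noise term σ²_0 (a aᵀ) ⊗ (Fᵀ F) is equivalent to diag(a) (σ²_0 Fᵀ F) diag(a), i.e., shared pixel noise passed through the gained filters, while diag(f) adds the Poisson-like variance of the means.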

Supplementary Tables
The non-adaptive model fits neural responses y_i across trials i by the linear model

y_i = Σ_{j=1}^8 β_j 1(x_i = θ_j) + ε_i,

where x_i is the stimulus' drift direction in trial i, θ_j is the jth of the eight drift directions used in our experiments, and 1_a is the indicator function, with 1_a = 1 if a is true and 1_a = 0 otherwise. In this model, β_j is the neuron's average activity in response to stimulus θ_j. The adaptive model fits neural responses across trials by the linear model

y_i = Σ_{j=1}^8 β_j 1(x_i = θ_j) + Σ_{j=1}^8 Σ_{k=2}^8 β_{jk} 1(x_i = θ_j) 1(x_{i−1} = θ_k) + ε_i.

In this model, the neuron's mean activity in response to a stimulus with drift direction x_i = θ_j preceded by a stimulus with drift direction x_{i−1} = θ_1 is β_j. The β_{jk}'s then model how this activity changes relative to β_j if the current trial's stimulus is instead preceded by a stimulus with drift direction x_{i−1} = θ_k. For sessions with stimuli of multiple contrasts, the contrast in the table refers to that of the preceding stimulus (i.e., that of the adapting stimulus in trial i − 1). As these models are nested (i.e., setting β_{jk} = 0 for all j and k turns the adaptive model into the non-adaptive one), we used an F-test to test the null hypothesis that the non-adaptive model explains the data as well as the adaptive model. The above table shows the percentage of neurons for which this null hypothesis could be rejected at the p = 0.05 significance level. We performed a one-sided binomial test (H0: fraction = 5%, testing for fraction > 5%) to determine if any of the observed fractions are unlikely to have arisen by chance, and found that none are significantly above 5% (one-sided p > 0.124 for all sessions/mice) [16]. We therefore conclude that none of our datasets featured significant adaptation effects.
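The nested-model F-test can be sketched as follows, with design matrices built from the indicator regressors described above; the function name and interface are assumptions of this sketch. The returned statistic is compared against an F(dfn, dfd) distribution (e.g., via `scipy.stats.f.sf`) to obtain the p-value.

```python
import numpy as np

def nested_f_stat(y, X_small, X_full):
    """F statistic comparing a restricted (non-adaptive) linear model
    against a nested full (adaptive) model via residual sums of squares."""
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid
    n = len(y)
    dfn = X_full.shape[1] - X_small.shape[1]  # extra adaptive parameters
    dfd = n - X_full.shape[1]
    F = ((rss(X_small) - rss(X_full)) / dfn) / (rss(X_full) / dfd)
    return F, dfn, dfd
```

Because the models are nested, rss(X_small) ≥ rss(X_full), so the statistic is always non-negative; large values indicate that the adaptive terms explain additional variance.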
This illustrates that most variability of the raw traces does not impact information, as it is orthogonal to the signal direction. The optimal decoder was computed, for each stimulus drift direction (from temporally deconvolved traces), as the best discriminator between this and the next-closest drift direction.

Supplementary Figure 4: Impact of running speed on information.
For each session, we grouped trials into low (blue) and high (red) running speed by (i) subsampling trials to ensure a similar running speed distribution across all drift directions, and (ii) performing a median split by running speed for each drift direction. To maximize the number of analyzed trials, we did not aim to achieve the same average running speeds within each speed group across sessions; these average running speeds within each group might thus differ across sessions. Furthermore, mice 3 and 4 ran overall less than mice 1 and 2. (a) Information increases more rapidly with population size for higher running speeds (same mouse/session as in Supplementary Figure 3c; mean ± 1SD across random orderings of neurons within the population). The black line shows the mean information growth across all trials (as in Supplementary Figure 3c). The dashed lines show trial-shuffled data that removes pairwise noise correlations, and illustrate that, in all cases, these correlations lower information across all population sizes. (b) The drift direction discrimination threshold (80% correct) inferred from the information estimated in the recorded population is consistent across different drift direction pairs with δθ = 45°, and is lower for high running speeds. The black dots show the inferred thresholds across all trials (as in Supplementary Figure 3f). (c) Higher running speed increases information in the recorded population. Each dot (mean ± 1SD of the information estimate; filled = significant increase, bootstrap, p < 0.05) shows the information estimated for one discrimination with δθ = 45°. Across all sessions, higher running speed significantly increased information (t63 = 6.69, two-sided p ≈ 7 × 10⁻⁹, across non-overlapping δθ = 45° discriminations).
(d) Both estimated asymptotic information and N95 appear impacted by running speed (lines connect median estimates of individual sessions; horizontally shifted to ease comparison; posterior densities as in Supplementary Figure 4c). Across sessions, we found a significant increase in asymptotic information (signed-rank test on median estimates).

Supplementary Figure 10: Information quantifies how easy it is to discriminate two stimuli from the population responses they evoke. This discriminability is measured by the performance of a linear discriminator, normalized by the stimulus difference (δs or δθ). (a) For population responses (dots = mean population activity for one stimulus, shaded areas = 1SD of the noise covariance; δf_i = difference in mean population activity for different δs_i / δθ_i) whose mean response changes linearly with the stimulus s, this information remains unchanged when δs changes (top; δs1 vs. δs2). Population activity that encodes a circular stimulus θ is bound to violate this linearity, and its associated linearly decodable information drops with an increase in δθ (bottom; δθ1 vs. δθ2). This occurs even if a non-linear decoder that accounts for the circularity of θ would recover the same information irrespective of δθ, and is not a failure of the linear decoder, which correctly identifies all linearly decodable information (which drops with δθ). (b) We demonstrate this effect by simulating V1 populations in response to oriented Gabor patterns, and estimating the information encoded about their orientation. We show how information grows with population size for stimulus pairs with different δθ (colors; mean ± 1SD across different orders with which neurons are added to the population). (c) The information at N = 1000, which we use as a proxy for I∞, drops with δθ, for the reason illustrated for the rotational code in (a). Details of the simulations used to generate (b) and (c) are described in Sec. 4.2.
The simulations quantify information about oriented Gabor patterns rather than the drift direction of drifting gratings, and so should only be qualitatively compared to the data in the main text.

Supplementary Figure 11: Same as Supplementary Figure 5, but for trial-shuffled data. As in Supplementary Figure 5, we asked whether a subpopulation appears to encode a disproportionate amount of information across all stimulus drift directions. In contrast to Supplementary Figure 5, we here removed the impact of noise correlations by randomly shuffling the trial identity for each neuron and drift direction. (a) Both panels show that the information increase in the recorded population depends on the order with which neurons are added to the population (colors). The panels differ in the considered drift direction discrimination (left: 0° vs. 45°; right: 45° vs. 90°). The neuron order was optimized by incrementally adding the neuron that resulted in the largest information increase for a 0° vs. 45° (blue) or 45° vs. 90° (orange) drift direction discrimination, or the largest average increase across all discriminations with δθ = 45° (green). The optimal ordering for the 0° vs. 45° discrimination was also applied to the 45° vs. 90° discrimination (blue line in right panel) and vice versa (orange line in left panel). The average information increase across random orders (black) is shown as a baseline reference. Shaded error regions illustrate the uncertainty (mean ± 1SD) due to limited numbers of trials (all curves) and variability across random orderings (black only). The black and green open circles (bootstrapped median ± 95% CI) show the population sizes required to capture 90% of the information in the recorded population for the associated orderings. (b) Plotting population sizes required to capture 90% of the information in the recorded population (bootstrapped median ± 95% CI) for random orderings vs.
orderings optimized to maximize average information across all discriminations revealed a significant difference between the two orderings for some datasets (filled dots). Each dot reflects one discrimination for one session. The difference in population sizes was also significant across all datasets (t-test, t63 = 9.541, two-sided p = 6.1×10⁻¹⁴, across non-overlapping δθ = 45° discriminations).

Supplementary Figure 12: Example tuning curves, tuning curve R²'s, and pairwise correlations for the 10% and 25% contrast trials of mice 5 and 6. We determined the tuning type (untuned, orientation-tuned, or direction-tuned) of each neuron by fitting the 25% contrast trials only, and chose the best-fit tuning curve function f25(θ) (whose functional form depends on the tuning type) as described in Methods. We then jointly fit the 10% and 25% contrast data by using the previously determined f25(θ) function to fit the 25% contrast trials, and f10(θ) = a + b f25(θ) to fit the 10% contrast trials. To do so, we jointly adjusted the parameters of the f25 function, as well as a and b. Except for untuned neurons, this resulted in tuning curve fits with fewer parameters than if we had fitted the tuning curves for each contrast level separately. Bayesian model comparison that accounted for the different numbers of parameters revealed that, for most neurons, this joint fit across both contrast levels explained the data better than separate fits for each contrast level (BICjoint < BICseparate; mouse 5: direction-tuned 76.54% (310 of 405 neurons), orientation-tuned 94.21% (683 of 725 neurons); mouse 6: direction-tuned 70.35% (140 of 199 neurons), orientation-tuned 92.41% (597 of 646 neurons)). (a) Eight examples of untuned, orientation-tuned, and direction-tuned neurons. The pale, small dots show responses in individual trials. The large dots show mean responses for each drift direction, and the solid, vertical lines connect the 25th and 75th percentiles.
The pale lines show the fitted tuning curves. Each panel shows data for both 10% and 25% contrast trials. Data, raw traces, and fitted tuning curves are darker for 25% contrast trials and slightly shifted to the right. Plots are truncated at ∆F/F = 0.2. (b) Cumulative distributions of the coefficients of determination R² for different mice (rows) and sessions (lines) for orientation-tuned (purple) and direction-tuned (orange) neurons for 25% (dark) and 10% (bright) contrast trials. The R² values are computed separately for each contrast level, even though the tuning curves were fit jointly across the two contrast levels. (c) The cumulative distribution of pairwise noise correlations for pairs of neurons for 25% contrast (dark) and 10% contrast (bright) trials, shown separately for each session. (a) used data from one session of mouse 5 whose corresponding R² values are shown in bold in (b) and (c). All data at 0° is replicated at 360° to show the fitted tuning curve across all possible drift directions. The pale vertical lines in (b) and (c) show the average R² and correlation coefficients for each session, tuning type, and contrast level. The observed mean correlations were comparable to those found in previous studies [17]. Note that fitted tuning curves and average pairwise correlations were not used to estimate information, and are provided for reference only.

Supplementary Figure 13: The scaling of the estimated population size with the fraction of asymptotic information. Let Na denote the population size required to encode a% of the total asymptotic information, I∞. Changing a results in a simple re-scaling of Na. This figure illustrates this re-scaling for different a, using N95 as the base measure. For example, if we were interested in N90 instead of N95, we would read off the scaling factor for 90% and re-scale the reported N95 to obtain estimates for N90.

Supplementary Figure 14: Effects of modeled eye movements on information scaling.
We assessed the effects of eye movements on information scaling using a simple eye movement model (see Sec. 2.4 for details). In this model, the activity of a fraction pn of neurons was set to zero in a fraction pt of trials. (a) We simulated population activity by adding a Poisson step to the model described in Sec. 4.2 to simulate spike counts in a 500 ms window in a population of N = 1000 neurons in response to oriented Gabor patterns, and estimated bias-corrected linear Fisher information for different population sizes, different pt's (panels/colors), and different pn's (color shading). As can be seen, information grows more slowly with the number of neurons N with increasing pt and increasing pn. (b) We fitted our information scaling model to the information scaling curves in (a) to estimate the non-asymptotic scaling c (left) and asymptotic information I∞ (center). We used these estimates to, in turn, estimate the number of neurons N95 required to encode 95% of the asymptotic information (right). The posterior estimates are shown as in Supplementary Figure 4c in the main text. As predicted by our theory (see Sec. 2.4), asymptotic information scales with pt(1 − pt): it hardly drops if pt = 1, and drops most strongly for pt = 0.6. This also confirms that the perturbation induced by the modeled eye movements is most likely not perfectly aligned with f′ in our simulations. The estimated c, in contrast, drops monotonically with increases in both pt and pn. In combination, this yields an overestimate of N95 that grows with both pt and pn.

Supplementary Figure 15: A non-linear mapping between spike counts and the ∆F/F signal does not qualitatively impact our results. We added a Poisson step to the model described in Sec. 4.2 to simulate spike counts in a 500 ms window in a population of N = 300 neurons in response to oriented Gabor patterns. We then used linear and non-linear functions that map these spike counts to ∆F/F signals, which were in turn used to estimate how information scales with population size.
(a) The different functions used to map per-neuron spike counts to ∆F/F signals. The black line connects the percentiles (from the 1st to the 99th) of the distribution of simulated spike counts to those of the distribution of ∆F/F responses in the data of session 1 of mouse 1. The near-linear relationship indicates that a linear remapping of these spike counts replicates the observed ∆F/F distribution well. We fitted this relationship with both a linear (green) and a quadratic (blue) function. We furthermore tested various saturating functions (different shades of red; weak: 0.5 tanh((0.5 r)²)^{1/2}, medium: 0.24 tanh((0.1 r)²)^{1/2}, strong: 0.05 tanh((0.45 r)²)^{1/2}, where r is the spike count) to simulate the scenario in which the ∆F/F response saturates for higher spike counts. (b) The cumulative distribution functions of the ∆F/F distributions of the data of session 1 of mouse 1 (black), the linear (green) and quadratic (blue) models, and the different saturating models (shades of red). In particular, the strongly saturating model results in significant deviations from the data. (c) Information scaling computed from the ∆F/F signal (except for the black and grey lines, which are based on spike counts) for the different mappings between spike counts and ∆F/F. The dashed lines show results when trial-shuffling the spike count data before mapping it to ∆F/F signals to destroy the spike count noise correlations (see Supplementary Figure 3). Any difference between the Fisher information computed from spike counts (black) and from the linear model (green) is due to numerical precision, as invertible linear transformations, as used here, do not change the Fisher information. Most importantly, trial-shuffled spike count data yields linear information scaling even after non-linear mappings, as these mappings do not introduce new noise correlations.
Similarly, the information scaling of non-shuffled data saturates even after perturbing the spike counts with a non-linear mapping, as this mapping does not remove the noise correlations. While strong non-linear mappings might lower our information estimates, they do not impact our finding that information saturates.

Supplementary Figure 16: The variance and covariance of Fisher information scaling. We simulated virtual populations of different sizes M as described in Sec. 4.1, yielding f and Σ for each M. (a) To demonstrate that the variance in the Fisher information increase estimate due to shuffling well-approximates the combined variance due to population subsampling and due to estimating the moments from a finite number of trials, we generated one population with M = 10,000 neurons. We in turn drew 100 sets of empirical moments, γ ∼ N(f, 2Σ/(T δθ²)) and Ω ∼ W(Σ/(2T − 2), 2T − 1), corresponding to estimating these moments from T = 1,000 trials for each of two drift directions separated by δθ = 45°. We additionally subsampled N = 300 neurons of the full population ten times, resulting in ten index sets i1:N, and, for each i1:N containing a fixed set of neurons, shuffled their order 100 times, resulting in 100 j1:N per i1:N. For each set of empirical moments, we computed the Fisher information increase for each subsampled, shuffled population j1:N, resulting in 10⁵ estimates for each population size n ∈ {1, …, N}. The figure shows the variance due to shuffling only (blue; averaged over different subsamples and empirical moments) and the variance due to empirical moments only (red; averaged over different subsamples and shuffles). As a comparison, we computed the total variance of the same estimate across 100 subsampled populations with N = 300 neurons for each set of empirical moments (black; variance across 10⁴ estimates), which is the variance we aim to estimate. As the plot shows, the variance due to shuffling well-approximates this total variance.
A naïve sum of the variances due to empirical moments and shuffling (grey dashed) would overestimate the total variance. (b) To estimate the degree to which the variance of the Fisher information estimate is overestimated when ignoring the negative correlations across different ∆Î_n's, we generated populations of different sizes M and their associated moments. For each population, we then estimated the covariances cov(∆Î_n, ∆Î_m) across 1,000 different subsamples i1:N of populations of N = 300 neurons. In turn, we estimated the Fisher information variance once when taking this covariance into account,

var(Î_n) = Σ_{j=1}^n (var(∆Î_j) + 2 Σ_{k=1}^{j−1} cov(∆Î_k, ∆Î_j)),

and once when not doing so,

ṽar(Î_n) = Σ_{j=1}^n var(∆Î_j).

The plot shows the resulting fraction (ṽar(Î_n) − var(Î_n)) / var(Î_n) for different n and M, averaged across ten different generated populations, and shows that the variance overestimate becomes smaller for larger populations.
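The with-vs-without-covariance comparison can be illustrated on the same toy setup as before, using a fixed set of per-neuron contributions shuffled across orderings (sizes here are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(4)

# Draw many orderings of increments with a common fixed total, mimicking
# shuffled subpopulations; compare var(I_n) with and without covariances.
values = rng.exponential(size=100)
samples = np.array([rng.permutation(values) for _ in range(4000)])

C = np.cov(samples, rowvar=False)
# var(I_n) including covariances is the sum of the leading n x n block of C;
# the naive version keeps only the diagonal.
var_full = np.array([C[:n, :n].sum() for n in range(1, 101)])
var_naive = np.array([np.diag(C)[:n].sum() for n in range(1, 101)])

# Ignoring the negative covariances overestimates var(I_n) ...
assert np.all(var_naive[49:] >= var_full[49:])
# ... and at n = N the true variance is ~0, since the total is fixed.
assert var_full[-1] < 1e-4 * var_naive[-1]
```

In the subsampling setting of the text, the overlap between subpopulations (and hence the negative covariance) shrinks with M, so var_full approaches var_naive for large populations, which is the trend shown in Supplementary Figure 16b.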