Inferring collective dynamical states from widely unobserved systems

When assessing spatially extended complex systems, one can rarely sample the states of all components. We show that this spatial subsampling typically leads to severe underestimation of the risk of instability in systems with propagating events. We derive a subsampling-invariant estimator, and demonstrate that it correctly infers the infectiousness of various diseases under subsampling, making it particularly useful in countries with unreliable case reports. In neuroscience, recordings are strongly limited by subsampling. Here, the subsampling-invariant estimator allows to revisit two prominent hypotheses about the brain’s collective spiking dynamics: asynchronous-irregular or critical. We identify consistently for rat, cat, and monkey a state that combines features of both and allows input to reverberate in the network for hundreds of milliseconds. Overall, owing to its ready applicability, the novel estimator paves the way to novel insight for the study of spatially extended dynamical systems.

Although derived for branching processes (BPs), we conjectured that MR estimation is applicable to any process with a first order autoregressive representation (PAR). We here show exemplary results for three different classes of PARs: In AR(1) processes, additive noise ℎ is drawn independently at each time step. Here, we considered a uniform distribution ℎ ∼ (0, 2ℎ). In a Kesten process, additive and multiplicative noise is drawn at each time step, both and ℎ being i.i.d. for all . Here, ∼ ( , 2 ) with = /10 and ℎ ∼ (ℎ, 2 ) with = ℎ/10 are normally distributed. In a BP, each unit at time generates , offspring, which are i.i.d. for all and . In addition, a random number ℎ of units are introduced at each time step. Here, , ∼ Poi( ) and ℎ ∼ Poi(ℎ) are Poisson distributed, 2 and 2 denote the variances of , and ℎ respectively. All three processes satisfy the first-order statistical recursion relation ⟨ +1 | ⟩ = ( ) + ℎ (Eq. (5)). Parameters are chosen such that for all simulations the average activity is identical, ⟨ ⟩ = 100. a. Fully sampled and subsampled (binomial subsampling ∼ Bin( , ) with = 1/10) time series are shown for = 0.9 and ℎ = 10. b. The three classes show the same first-order statistics according to Eq. (5). However, their second order statistics Var[ +1 | ] differ as indicated. c. Conventional linear regression underestimateŝ for all three processes under subsampling. d. MR estimation is applicable to all three processes under full sampling and subsampling, i.e.
∝ holds. e. While MR estimation returns consistent estimates of even under subsampling, the conventional estimator underestimateŝ for all three processes.  Each left panels shows the time series of the activity from one single trial (light blue) and averaged activity from 100 trials (dark blue), recorded from = 50 out of = 10 4 neurons. Each right panels shows the corresponding MR estimation from one single trial. We investigated the following, generic cases for the temporal evolution of the drive rate ⟨ℎ ⟩: a, b. The drive is stationary (⟨ℎ ⟩ identical for all , red), so are the mean rates ⟨ ⟩. c, d. The drive exhibits a transient increase centered around half of the simulation time. The mean rate ⟨ ⟩ is therefore also time-dependent and follows the temporal evolution of ⟨ℎ ⟩. e. The drive shows a linear increase over the simulation. f. The drive exhibits a step function after half the simulation. Nonstationarities (c -f) typically lead to an overestimation of̂, which is particularly severe if the underlying dynamics is Poissonian ( = 0 Mean activity A(t) : Distribution of the estimatê, estimated from 5000 independent copies of a branching process (BP) with = 0.99, ⟨ ⟩ = 100 and length = 10 5 : normalized histograms of the probability of estimatinĝfor full sampling (blue) and binomial subsampling with = 0.001 (red), together with normal distributions ( ,̂2̂). Inset: --plot for the quantiles of ( ,̂2̂) and the quantiles of the estimated̂under both samplings. The estimated̂are found to be distributed normally in both cases (fully sampled: 2 = 0.9995, subsampled: 2 = 0.998). b. The variance 2̂o f the estimatêis estimated from 100 independent copies of a BP. Results for different , mean activities ⟨ ⟩ and time series lengths are plotted as a function of the effective time series length = |{ | > 0}|, the number of nonzero entries. For any given , the variance of̂shows algebraic scaling 2̂∝ . The exponent of this scaling depends on , with higher the closer is to unity. Hence, the benefit from longer time series is larger the closer a system is to criticality. Importantly, the variance does not directly depend on the mean activity ⟨ ⟩, this number only influences the accuracy of MR estimation via the potential change in . c. The variance of the estimatêis estimated from 100 independent copies of a BP with = 0.99, ⟨ ⟩ = 100, and = 10 5 and plotted as a function of the sampling probability under binomial subsampling. While the variance appears to increase dramatically under stronger subsampling, this increase can be attributed to the according decrease of the effective time series length . After rescaling by ( / ) 3/2 (cf. panel b), the rescaled variance remains within one order of magnitude over four orders of magnitude in . Hence, the accuracy of the estimator is not directly influenced by the degree of subsampling. d. The variance 2̂i s estimated from 100 independent copies of a BP with = 0.99, ℎ = 1, and = 10 5 and plotted as function of the distance to criticality = 1 − . The variance is found numerically to scale as 2̂∝ , hence the standard deviation scales aŝ∝ √ . Similar scaling results were found for other linear (like the interquartile range) and quadratic (like the mean squared error) measures of variation.  Supplementary Table 1). a. Data from monkey prefrontal cortex during an working memory task. The third panel shows a oscillation of with a frequency of 50 Hz, corresponding to measurement corruption due to power supply frequency. b. Data from anesthetized cat primary visual cortex. c. Data from rat hippocampus during a foreaging task. In addition to a slow exponential decay, the slopes show the -oscillations of 6 -10 Hz present in hippocampus. Dashed lines indicate results for an exponential model with offset, solid lines results for the model without offset (compare Supplementary Note 5).  Supplementary Table 1). b. Histogram of the single neuron branching ratioŝ, inferred with the conventional estimator and using MR estimation. The difference between these estimates demonstrates the subsampling bias of the conventional estimator, and how it is overcome by MR estimation. Under 0.1% subsampling,̂inferred by the EM algorithm reaches a steady state after 10 h, but is severely biased. The slow rise of̂might lead to a convergance to the proper after several weeks of projected runtime (ignoring common termination criteria). c. Under 0.01% subsampling, inferred by the EM algorithm converge to a biased value. In contrast, MR estimation returns a correct̂in all three cases, and outperforms the EM algorithm by a factor of 10 5 to 10 6 in terms of the runtime. . We discriminate the following cases in this order: A BP with =̂is only considered to explain the data, if the four tests offset , , lin , and≤ 0 are negative (×). If any of offset , , or lin is positive (✓), the data cannot be explained by a BP with any , regardless of the other tests (-), and MR estimation is invalid. If≤ 0 is positive, the additional test 1 =0 becomes relevant: if it is negative, the data cannot be explained by a BP with any . If it is also positive, the data are consistent with Poisson activity (BP with = 0).

Supplementary Note 1 Applicability of MR estimation
We here analytically derive the novel MR estimator for branching processes (BP) [1][2][3] . We expect that analogous derivations apply to any process with a first order autoregressive representation (PAR) 4 , because these processes fulfill Eq. (5). Beside BPs, PARs include autoregressive AR(1) processes, integer-valued autoregressive INAR(1) processes 5 rounded integer-valued autoregressive RINAR(1) processes 6 , and Kesten processes 7 . We emphasize that the MR estimator only requires the subsampled recording of a system with full activity conforming with the definition below. It is not necessary to know either the full system size, the number of subsampled units, nor any of the moments of the full process .

Supplementary Note 2 Branching processes
In a branching process (BP) with immigration 1-3 each unit produces a random number , of units in the subsequent time step. Additionally, in each time step a random number ℎ of units immigrates into the system (drive). Mathematically, BPs are defined as follows 1,2 : Let , be independently and identically distributed non-negative integer-valued random variables following a law with mean = ⟨ ⟩ and variance 2 = Var[ ]. Further, shall be non-trivial, meaning it satisfies P[ = 0] > 0 and P[ = 0] + P[ = 1] < 1. Likewise, let ℎ be independently and identically distributed non-negative integer-valued random variables following a law ℋ with mean rate ℎ = ⟨ℋ ⟩ and variance 2 = Var[ℋ ]. Then the evolution of the BP is given recursively by i.e. the number of units in the next generation is given by the offspring of all present units and those that were introduced to the system from outside.
The stability of BPs is solely governed by the mean offspring . In the subcritical state, < 1, the population converges to a stationary distribution ∞ with mean ⟨ ∞ ⟩ = ℎ/(1 − ). At criticality ( = 1), asymptotically exhibits linear growth, while in the supercritical state ( > 1) it grows exponentially. We will first show results that further specify the mean and variance of subcritical branching processes.

Theorem 1. The stationary distribution of a subcritical BP satisfies
where , 2 , ℎ, and 2 are defined as above.
Proof. The first result was stated before 2 Again, in the stationary distribution Var and hence the stated result follows.

Supplementary Note 3 Subsampling
To derive the MR estimator for subsampled data, subsampling is implemented in a parsimonious way, according to the following definition: (ii) Let ∈ ℕ. Conditioning on does not add further information to the process: The two random variables ( +1 | = , = ) and ( +1 | = ) are identically distributed for any , ∈ ℕ.
Thus the subsample is constructed from the full process based on the three assumptions: (i) The sampling process does not interfere with itself, and does not change over time. Hence the realization of a subsample at one time does not influence the realization of a subsample at another time, and the conditional distribution of ( | ) is the same as ( ′ | ′ ) if = ′ . However, even if = ′ , the subsampled and ′ do not necessarily take the same value. (ii) The subsampling does not interfere with the evolution of , i.e. the process evolves independent of the sampling. (iii) On average is proportional to up to a constant term.
It will be shown later, that the novel estimator is applicable to any time series that was acquired from a BP conforming with this definition of subsampling. We will demonstrate possible applications at the hand of two examples: 1. Diagnosing infections with probability . For example, when a BP represents the spread of infections within a population, each infection may be diagnosed with probability ≤ 1, depending on the sensitivity of the test and the likelihood that an infected person consults a doctor. If each of the infections is diagnosed independently of the others, then the number of diagnosed cases follows a binomial distribution ∼ Bin( , ). Then ⟨ | = ⟩ = is given by the expected value of the binomial distribution. This implementation of subsampling conforms with the definition above, with the sampling probability and the constant in (iii) being identical here.

Sampling a subset of system components.
In a different application, assume a high-dimensional system of interacting units that forms the substrate on which activation propagates. Often, the states of a subset of units are observed continuously, for example by placing electrodes that record the activity of the same set of neurons over the entire recording (Fig. 1b). This implementation of subsampling in finite size systems is mathematically approximated as follows: If out of all model units are sampled, the probability to sample active units out of the actual active units follows a hypergeometric distribution, ∼ Hyp( , , ). As ⟨ | = ⟩ = / , this representation satisfies Def. 1 with = / . Choosing this special implementation of subsampling allows to evaluate Var[ ] further in terms of : The term = ( , , ⟨ ⟩, Var[ ]) is constant when subsampling a given (stationary) system, and quantifies the factor by whicĥ C is biased when using the conventional estimate for . It depends on , and the first two moments of and is thus known for a BP. This relation was used for Fig. 1c.

Supplementary Note 4 MR estimation
We here derive an estimator for the mean offspring based on the autoregressive representation of the BP, This novel estimator is based on multistep regressions 9 (MR estimator), which generalize (5) to arbitrary time steps . From iteration of Eq. (5), it is easy to see that Definition 2 (Multistep regression estimator). Consider a subsampled BP { } of length . Let max ∈ ℕ, max ≥ 2. Then multistep regression (of max -th order) estimates in the following way: † Throughout this manuscript, the conditional random variable ( | = ) is to be read as " given the realization = of the random variable ".
2. Based on the relation 9 = ⋅ , estimatêand̂by minimizing the sum of residuals with the collection of slopes {̂} max =1 obtained from step 1 (Fig. 1f). Then̂is the multistep regression (MR) estimate of the mean offspring . For the application to experimental data, we further applied tests to identify nonstationarities (Supplementary Note 5).
We first prove that the MR estimator is consistent in the fully sampled case, and will then show the consistency under subsampling. First, we need the following result about the individual linear regression slopeŝunder full sampling: Theorem 2. The slopê, obtained from under full sampling, is a consistent estimator for . If the process is subcritical, then the offset̂is also a consistent estimator for ℎ 1− 1− . Remark. For = 1, these results were already obtained by [8,10,11], and details can be found in these sources. Based on their proofs, we here show the generalization to timesteps.
Proof. Let ∈ ℕ, ∈ {0, … , − 1}. Construct a new random process by starting at time and taking every -th time step of the original process . This new process is given by . We will derive that ( , ) is a consistent estimator for . According to [11], it is sufficient to show that the evolution of We now show that ( ( , ) ′ ) ′∈ℕ is a martingale difference sequence for all . From iteration of Eq. (6), it is easy to see that In the critical and supercritical cases, onlŷ( , ) p − → holds following [11]. Hence, we obtain̂p − → for all and̂p − → ℎ(1 − )/(1 − ) if < 1.

Corollary 3. As least square estimation of̂and̂from minimizing the residual (8) is consistent, multistep regression is a consistent estimator for under full sampling,̂p − → .
These results were obtained for BPs. However, the derivation is here only based on the autoregressive representation (5), motivation the following proposition: Numerical results for AR(1) and Kesten processes support this conjecture 9 (Supplementary Fig. 1). Next, we show that MR estimation is consistent in the subcritical case even if only the subsampled is known:

Theorem 5. Let be a PAR with < 1 and a stationary limiting distribution ∞ and let the PAR be started in the stationary distribution, i.e. 0 ∼ ∞ . Let be a subsampling of . Multistep regression (MR) on the subsampled is a consistent estimator of the mean offspring .
Proof. The existence of a stationary distribution ∞ was shown by [2]. The least square estimator for the slope of linear regression is also given by 12̂=̂+̂+ (12) with the the estimated standard deviationŝand̂+ of and + respectively. In the subcritical state, = + because of stationarity. Thus estimating the linear regression slope is equivalent to estimating the Pearson correlation coefficient̂+ = ( ) (which is identical to the autocorrelation function of ). In the following, we calculate the Pearson correlation coefficient for the subsampled time series by evaluating ⟨ + ⟩. We use the law of total expectation in order to express ⟨ + ⟩ not in dependence of , but in terms of : where the inner expectation value is taken with respect to the joint distribution of + and , and the outer with respect to the joint distribution of + and . Through conditioning on both and + , ( | ) and ( + | + ) become independent due to Def. 1. Hence, the joint distribution of ( , + | , + ) factorizes, and the expectation value factorizes as well. By definition, Without loss of generality, we here show the proof for = 0 which is easily extended to the general case. We express ⟨ + ⟩ in terms of Eq. (6) using the law of total expectation again: where the first expectation was taken with respect to the joint distribution of and + . We then used that ⟨ 2 ⟩ and ⟨ ⟩ = ℎ/(1 − ) exist, which follows from stationarity of the process. By a similar argument, and combining these results the covariance is Therefore, we find that the estimator̂converges in probability: Hence, the bias of of the conventional estimator̂C =̂1 is precisely given by the factor = 2 Var[ ] / Var[ ]. However, importantly the relation̂=̂̂still holds for the subsampled . Given a collection of multiple linear regressions1, … ,̂m ax , the least square estimation of̂and̂from minimizing the residual (8) yields a consistent estimator̂for the mean offspring even under subsampling and only requires the knowledge of .
This proof also showed that the conventional estimator 8  Nonstationarity, criticality and supercriticality. The consistency of the estimator in the fully sampled case is included in our proof of Lemma 2 and follows from the results by [8,11]. Our proof for the subsampled case (Theorem 5), in contrast, strictly requires stationarity ( ∼ ∞ for any ) and the existence of the first two moments of . We expect that the MR estimator is also consistent if the subcritical process is not started in the stationary distribution, 0 ≁ ∞ , because the results by [2] show that it will converge to this stationary distribution as → ∞ (Supplementary Fig. 2). Furthermore, numerical results suggest that the MR estimator is also consistent for critical and supercritical cases, where no stationary distribution exists (Fig. 3d).

Supplementary Note 5 Identifying common non-stationarities and Poisson activity.
In many types of analyses, non-stationarities in the time series can lead to wrong results, typically an overestimation of̂. We developed tests to exclude data sets with signatures of common non-stationarities. The different non-stationarities, their impact on the and the rules for rejection of time series are outlined below. First, transient increases of the drive ℎ , e.g. in response to a stimulus, lead to a transient increase in ⟨ ⟩. These transients induce correlations or anti-correlations, which prevail on long time scales (Supplementary Fig. 3c,d). The autocorrelation function is therefore better captured by an exponential with offset, = offset ⋅ offset + offset . If the residual of this exponential with offset 2 offset was smaller than the residual of the MR model 2 exp by a factor of two, offset = (2 ⋅ 2 offset < 2 exp ), then the data set was rejected. The factor two punishes for the differences in degree of freedom: The residuals of a model with two free parameters (exponential with offset) instead of one (exponential only) can only be smaller.
Second, ramping of the drive can lead to overestimation of ( Supplementary Fig. 3e). The comparison of the two models with and without offset introduced above serves as a consistency check able to identify ramping: if the data are captured by a BP, both models should infer identical̂. Thus, a difference between̂e xp and̂o ffset hints at the invalidity of MR estimation. Instead of , we compared the autocorrelation timesô ffset = − / loĝo ffset andê xp obtained from both models, as the logarithmic scaling increases the sensitivity. If their relative difference was too large, then the data are inconsistent with a BP and MR estimation is invalid: = (| exp − offset | / min{ exp , offset } > 2). Third, when a system changes between different states of activity, e.g. up and down states, the drive rate ⟨ℎ ⟩ may experience sudden jumps. These can lead to spurious autocorrelation ( Supplementary Fig. 3f). To identify these trends resulting from non-stationary input ℎ or from choosing too short data sets, we tested whether the sequence of was fit better by a linear regression = 1 + 2 on the pairs ( , ), than by the exponential relation (8). If the residuals 2 lin were smaller than 2 exp : lin = ( 2 lin < 2 exp ), data were rejected. Apart from non-stationarities, even Poisson activity ( = 0, = ℎ ) with stationary rate may lead to a spurious overestimation of̂as well: for subsampled branching processes of finite duration, the Poisson case and processes close to criticality ( = 1) can show very similar autocorrelation results, because the sequence of is expected to be absolutely or almost flat, respectively. Moreover, for = 0 any solution on the manifold with = 0 minimizes the residuals in Eq. (8). Hence, the estimator for̂may yield any value depending on the initial conditions of the minimization scheme. To distinguish between = 0 and > 0, we used the fact that for = 0, all slopes are expected to be distributed around zero, ⟨ ⟩ = 0. In contrast, for processes with > 0, all slopes are expected to be larger than zero ⟨ ⟩ = ⋅ > 0. Thus to identify stationary Poisson activity, we tested (using a one-sided t-test) if the slopes obtained from the data were significantly larger than zero, yielding the -value≤ 0 and the following test ( Supplementary Fig. 3b):≤ 0 = (̄≤ 0 ≥ 0.1). The choice of the significance level should be guided by the severity of type I or II errors here: if it is set too liberal, Poisson activity may be mistaken for correlated activity, potentially even close-to-critical. On the other hand, if the significance level is too conservative, activity with long autocorrelation times may be spuriously considered Poissonian under strong subsampling (when is small and all slopes only slightly differ from zero). For this study, we chose a significance level of≤ 0 < 0.1 in order to not underestimate the risk of large activity cascades. To confirm candidates for Poisson activity identified through positive≤ 0 , we assured that the did not show a systematic trend, i.e. that linear regression of as a function of (see lin above) yielded slope zero: 1 =0 = ( 1 =0 ≥ 0.05). The according significance level for this two sided test is then given by 1 ≠0 < 0.05.
We discriminate the following cases in the order indicated in Supplementary Table 1:̂obtained from MR estimation is only valid if none of the tests (except 1 =0 , which is ignored here) is positive. A positive result for any of offset , , or lin indicates non-stationarities, the data are not explained by a stationary BP, and MR estimation is invalid. If≤ 0 is positive, the data are potentially consistent with Poisson activity ( = 0). This is only the case if 1 =0 is also positive. If otherwise 1 =0 is negative, the Poisson hypothesis is also rejected and MR estimation invalid. This strategy correctly identified the validity of MR estimation for all investigated cases: stationary BPs with = 0.98 and = 0.0 were accepted, while nonstationary BPs with transient changes, ramping, or sudden jumps of the drive were excluded (Supplementary Fig. 3).

Supplementary Note 6 Variance of the estimates.
The distribution of̂is consistent with a normal distribution ( , 2̂) centered around the true mean offspring (Supplementary Fig. 4a; numerical results). The variance 2̂d epends on the branching ratio , the mean activity ⟨ ⟩, the length of the time series, and the sampling fraction . Each of these factors affects 2̂m ainly by changing the effective length of the time series, i.e. the number of non-zero entries = |{ | > 0}|. Thus, regardless of the actual time series length or the mean activity ⟨ ⟩, the variance scales as a power-law in , Var[̂] ∝ − (Supplementary Fig. 4b). The exponent of this power-law depends on .
The closer to criticality the process is, the larger the exponent , i.e. the larger the benefit from longer time series length . For = 0.99, we found ≈ 3/2. The performance of the estimator is in principle independent of the mean activity: Small ⟨ ⟩ only affect the variance of the MR estimator through a potential decrease of .
Similarly, the degree of subsampling only affects the variance of the estimator through a decrease of the effective length of . While there may be a significant rise in 2̂w hen reducing the sampling fraction , this increase can be explained by the coincidental decrease in , as the rescaled variance 2̂⋅ remains within one order of magnitude over four decades of the sampling fraction ( Supplementary Fig. 4c).
How does the variance change close to the critical transition? We found that the answer to this question highly depends on the specific choice of the parameters: if is varied, one can either keep ⟨ ⟩ or ℎ constant, not both at the same time. If the mean activity ⟨ ⟩ is fixed by choosing ℎ = ⟨ ⟩ (1 − ), then the variance of the process scales as Var[ ] ∝ 1/(1 − ) (Theorem 1). As → 1, the activity will inevitably get into a regime, where bursts of activity ( > 0) are disrupted by intermittent quiescent periods ( ), thereby reducing . In turn, the variance of the estimator increases as detailed before.
If however, the drive ℎ is kept constant, we found that the variance scales linearly in the distance to criticality = 1− over at least 5 orders of magnitude of : 2̂∝ (Supplementary Fig. 4d). Thus, the variance decreases when approaching criticality, while the relative variance 2̂/ is constant. Note, however, that even though the standard deviation also decreases when approaching criticality (̂∝ √ ), the relative standard deviation increases (̂/ ∝ 1/ √ ).
For other measures of variation (e.g. quadratic (like the mean squared error MSE) and linear (like the inter-quartile range IQR)), we obtained scaling laws with the same exponents.

Confidence interval estimation.
We used a model based approach to estimate confidence intervals for both simulation and experimental data (for Figs. 1c,d, 2c,d, and 3d), because classical bootstrapping methods underestimate the estimator variance by treating all slopes independently, while they are in fact dependent. We found that our model based approach constructs more conservative and representative confidence intervals.
For simulations, we simulated ∈ ℕ independent copies of the investigated model and applied MR estimation to each copy, yielding a collection of independent estimates {̂( ) } =1 .
For experimental time series with length , mean activity ⟨ ⟩, and number of sampled units , MR estimation yields an estimatê. We then simulated copies of branching networks { ( ) } =1 (for simulation details see Supplementary Note 8) with = 10, 000 units, =̂as inferred by MR estimation, and length and rate ⟨ ⟩ to match the data. The rate was matched by setting the drive to ℎ = ⟨ ⟩ (1 −̂) / . Thereby, after subsampling units, the mean activity of each resulting time series ( ) matched that of the original time series , ⟨ ( ) ⟩ = ⟨ ⟩. This procedure gives copies of a BN that all match in terms of the mean activity, the branching ratio, time series length, and number of sampled units. Applying MR estimation to these BNs yields a collection of independent estimates {̂( ) } =1 . For both simulation and experimental data, the distribution of̂and confidence intervals can be constructed from this collection.

Supplementary Note 7 Expectation maximization based on Kalman filtering
Kalman filtering is a method to predict the original time series given a measurement , defined for AR(1) processes and affine measurement transformation where ℎ and are independent Gaussian random variables ℎ ∼ (ℎ, 2 ) and ∼ ( , 2 ) and and constant real numbers. Assuming that 0 ∼ ( , ), Kalman filtering infers the original time series | , ℳ given a measured time series and the known model ℳ = ( , ℎ, 2 , , , 2 , , ). Based on an iterative expectation maximization algorithm which incorporates Kalman filtering [13][14][15] , the model parameters ℳ can be estimated from a time series . We used this algorithm to infer . Because of the mutual dependence of the model parameters, we also needed to infer ℎ, 2 , , , and 2 . In order to reduce the dimensionality of the maximization step, we disregarded and , as the influence of the initial value decreases if the time series gets long. For initial values, we chose = 0.5 in the center of the range of interest for , ℎ = ⟨ ⟩ ⋅(1− ) (see Supplementary Note 2), = 0.1⋅ℎ , = 1, = 0, and = 0.1. We further chose = ⟨ ⟩ and 2 = Var[ ] for the two model parameters that were not optimized. We considered two termination criteria for the EM algorithm: First, it is recommended to restrict the EM algorithm to 10 -20 cycles in order to avoid overfitting, a common problem with likelihood-based fitting methods for multidimensional model parameters. Therefor we considered̂inferred after 20 EM cycles. Second, we considered̂after the results of two subsequent EM cycles did not differ by more than 0.01%.
We used the publicly available Python implementation of the Kalman EM algorithm, pykalman. All parameters were chosen as detailed above. The analysis was performed on a computer cluster, and reached runtimes of several days up to projected runtimes of weeks. In fact, this computational demand was a limiting factor in terms of widespread application. In contrast, MR estimation terminated within half a second on the same CPUs.

Supplementary Note 8 Simulations
Branching process. We simulated BPs according to Eq. (1) in the following way: Realizations of the random numbers , and ℎ describing the number of offsprings, and the drive, were each drawn from a Poisson distribution: , ∼ Poi( ) with mean , and ℎ ∼ Poi(ℎ) with mean ℎ, respectively. Here, we used Poisson distributions as they allow for non-trivial offspring distributions with easy control of the branching ratio by only one parameter. For the brain, one might assume that each neuron is connected to postsynaptic neurons, each of which is excited with probability , motivating a binomial offspring distribution with mean = . As in cortex is typically large and is typically small, the Poisson limit is a reasonable approximation. For the performance of the MR estimator and the limit behavior of the BP, the particular form of the law is not important such that the special choice we made here does not restrict the generality of our results.
The mean rate ⟨ ⟩ depends on and ℎ (Lemma 1). In the simulation we varied and fixed ⟨ ⟩ = 100 by adjusting ℎ accordingly if not stated otherwise. For subsampling the BP, each unit is observed independently with probability ≤ 1 . Then is distributed following a binomial distribution Bin( , ), and subsampling is implemented by drawing from at each time step. As ⟨ ⟩ = , this implementation of subsampling satisfies the definition of stochastic subsampling with = , = 0.
Branching network. In addition to the classical branching process, we also simulated a branching network model (BN) by mapping a branching process 1,16 onto a fully connected network of = 10, 000 neurons. An active neuron activated each of its postsynaptic neurons with probability = / . Here, the activated postsynaptic neurons were drawn randomly without replacement at each step, thereby avoiding that two different active neurons would both activate the same target neuron. Similar to the BP, the BN is critical for = 1 in the infinite size limit, and subcritical (supercritical) for < 1 ( > 1). As detailed for the BP, ℎ was adjusted to the choice of to achieve ⟨ ⟩ = 100, which corresponds to a rate of 0.01 spikes per neuron and time step. Subsampling 17 was applied to the model by sampling the activity of neurons only, which were selected randomly before the simulation, and neglecting the activity of all other neurons.
Self-organized critical model. The SOC neural network model we used here is the Bak-Tang-Wiesenfeld (BTW) model 18 . Translated to a neuroscience context, the model consisted of = 10, 000 (100 × 100) non-leaky integrate and fire neurons. A neuron spiked if its membrane voltage ( ) reached a threshold : Note that the choice of does not change the activity of the model at all, so we set = 0 for convenience. The model neurons were arranged on a 2D lattice, and each neuron was connected locally to its four nearest neighbors with coupling strength = : where denotes the spike times of neuron , and ℎ ( ) is the Poisson drive to neuron with mean rate ℎ as defined for the BP above. Note that the neurons at the edges and corners of the grid had only 3 and 2 neighbors, respectively. This model is equivalent to the well-known Bak-Tang-Wiesenfeld model 18 if ℎ → 0 and = 1. Subsampling 17 was implemented in the same manner as for the BN.

Parameter choices.
If not stated otherwise, simulations were run for = 10 7 time steps or until exceeded 10 9 , i.e. approximately half of the 32 bit integer range. If not stated otherwise, confidence intervals (Supplementary Note 6) were estimated from = 100 samples, both for simulation and experiments. In Figs. 1c,d, BNs and the BTW model were simulated with = 10 4 units and ⟨ ⟩ = 100. In Fig. 1e, BPs were simulated with = 0.9 and ⟨ ⟩ = 100.
In Fig. 3c, subcritical and critical BNs with = 10 4 and ⟨ ⟩ = 100 were simulated, and = 100 units sampled. Because of the non-stationary, exponential growth in the supercritical case, here BPs were simulated with ℎ = 0.1 and units observed with probability = 0.01.

Supplementary Note 9 Epidemiological recordings
WHO data on measles worldwide. Time series with yearly case reports for measles in 194 different countries are available online from the World Health Organization (WHO) for the years between 1980 and 2014. MR estimation was applied to these time series. Because they contain very few data points and potential long-term drifts, we applied the consistency checks detailed above for every country (Supplementary Table 1). After these checks, 124 out of the 194 surveyed countries were accepted for MR analysis and included in our analysis. Yearly information on approximate vaccination percentages (measles containing vaccine dose 1, MCV1) for the same countries and time span are also available online from the WHO.

RKI data on norovirus, measles and MRSA in Germany.
For Germany, the Robert-Koch-Institute (RKI) surveys a range of infectious diseases on a weekly basis, including measles, norovirus, and invasive meticillin-resistant Staphylococcus aureus (MRSA). Case reports are available through their SURVSTAT@RKI server 19 . Because of possible changes in report policies in the beginning of surveillance, we omitted the data from the first 6 months of each recording. Moreover, we omitted the incomplete week on the turn of the year, thus evaluating 52 full weeks in each year.
The MRSA recording showed a slow, small variation in the case reports that can be attributed to slow changes in the drive rates. To compensate for these slow drifts, we corrected the time series by subtracting a moving average over 3 years (156 weeks). We then applied MR estimation to the obtained time series. The recordings for measles and norovirus showed strong seasonal fluctuations of the case reports, resulting in a baseline oscillation of the autocorrelation function. We therefore used a modified model = ⋅ with a fixed period of = 52 weeks, and estimated̂,, and̂from minimizing the residual of this modified equation.
In order to obtain the naive estimates using the conventional linear regression estimator̂C =̂1, we used the following correction for seasonal fluctuations. Each incidence count was normalized by the incidence counts from the same week, averaged over all years of recording (̄= ⟨ +52⋅ ⟩ with the average taken over the years for any week = 1, … , 52), yielding the deseasonalized time series ′ = /̄m od 52 . Linear regression was performed on this time series ′ .
For Fig. 2d, subsampling was applied to the original time series assuming that every infection is diagnosed and reported with a probability , yielding the binomial subsampling described in Supplementary Note 3. MR estimates were obtained from this subsampled time series according to Eq. (21), for the conventional estimator the subsampled time series was processed as described above.

Supplementary Note 10 Animal experiments
We evaluated spike population dynamics from recordings in rats, cats and monkeys. The rat experimental protocols were approved by the Institutional Animal Care and Use Committee of Rutgers University 20,21 . The cat experiments were performed in accordance with guidelines established by the Canadian Council for Animal Care 22 . The monkey experiments were performed according to the German Law for the Protection of Experimental Animals, and were approved by the Regierungspräsidium Darmstadt. The procedures also conformed to the regulations issued by the NIH and the Society for Neuroscience. The spike recordings from the rats and the cats were obtained from the NSF-founded CRCNS data sharing website [20][21][22][23] .
In rats the spikes were recorded in CA1 of the right dorsal hippocampus during an open field task. We used the first two data sets of each recording group (ec013.527, ec013.528, ec014.277, ec014.333, ec015.041, ec015.047, ec016.397, ec016.430). The data-sets provided sorted spikes from 4 shanks (ec013) or 8 shanks (ec014, ec015, ec016), with 31 (ec013), 64 (ec014, ec015) or 55 (ec016) channels. We used both, spikes of single and multi units, because knowledge about the identity and the precise number of neurons is not required for the MR estimator. More details on the experimental procedure and the data-sets proper can be found in [20,21].
For the spikes from the cat, neural data were recorded by Tim Blanche in the laboratory of Nicholas Swindale, University of British Columbia 22 . We used the data set pvc3, i.e. recordings in area 18 which contain 50 sorted single units 23 . We used that part of the experiment in which no stimuli were presented, i.e., the spikes reflected spontaneous activity in the visual cortex of the anesthetized cat. Because of potential non-stationarities at the beginning and end of the recording, we omitted data before 25 s and after 320 s of recording. Details on the experimental procedures and the data proper can be found in [22,23].
The monkey data are the same as in [24,25]. In these experiments, spikes were recorded simultaneously from up to 16 singleended micro-electrodes (⌀ = 80 m) or tetrodes (⌀ = 96 m) in lateral prefrontal cortex of three trained macaque monkeys (M1: 6 kg ♀; M2: 12 kg ♂; M3: 8 kg ♀). The electrodes had impedances between 0.2 and 1.2 M at 1 kHz, and were arranged in a square grid with inter electrode distances of either 0.5 or 1.0 mm. The monkeys performed a visual short term memory task. The task and the experimental procedure is detailed in [24]. We analyzed spike data from 12 experimental sessions comprising almost 12.000 trials (M1: 4 sessions; M2: 5 sessions; M3: 3 sessions). 6 out of 12 sessions were recorded with tetrodes. Spike sorting on the tetrode data was performed using a Bayesian optimal template matching approach as described in [26] using the "Spyke Viewer" software 27 . On the single electrode data, spikes were sorted with a multi-dimensional PCA method (Smart Spike Sorter by Nan-Hui Chen).
Analysis. For each recording, we collapsed the spike times of all recorded neurons into one single train of population spike counts , where denotes how many neurons spiked in the ℎ time bin . We used = 4 ms, reflecting the propagation time of spikes from one neuron to the next. Note that scales with the bin size (bs) as (bs = ) = (bs = ) , while the corresponding autocorrelation times are invariant under bin size changes. For Fig. 3b and Supplementary Fig. 6, we investigated single neuron activity by applying similar binning to the spike times of each neuron individually.
From these time series, we estimated̂using the MR estimator with max = 2500 (corresponding to 10 s) for the rat recordings, max = 150 (600 ms) for the cat recording, and max = 500 (2000 ms) for the monkey recordings, assuring that max was always in the order of multiple autocorrelation times. Experiments were excluded if the tests according to Supplementary Note 5 detected potential nonstationarities.