Main

Throughout the recent history of neuroscience, computational models have been developed and used ubiquitously to decompose the complex neural mechanisms underlying cognition and behaviour1,2,3,4,5. A dilemma that is inherent to computational modelling but particularly challenging in computational neuroscience is the trade-off between (cross-validated) accuracy and simplicity. Both finely detailed models6 and broadly simplified ones7,8 have their respective proponents. One of the many facets of this trade-off pertains to the use of linear vs nonlinear models. Nonlinearity of dynamics is inevitable at the microscale of individual neurons9 and their components10, and has been demonstrated, although less comprehensively, at the mesoscale of neuronal populations11. Further supported by theoretical derivations12 and motivated by the much larger repertoire of behaviours of nonlinear systems (including chaos, multistability and metastability), an assumption has thus formed13,14,15,16 that accurate models of neurodynamics at the macroscale (i.e., of brain regions) must inevitably be nonlinear.

This assumption raises the question of whether nonlinear models do, in fact, account for the dynamics of neuroimaging and neurophysiological data more accurately than linear ones. This pragmatic modelling question, importantly, is different from the general question of whether any signs of ‘nonlinearity’ can be found in neuroimaging time series17,18,19 (see Discussion).

Few investigations20,21,22 have sought to answer the former question directly by comparing the ‘fit’ of linear and nonlinear models to neurophysiological (electroencephalography, EEG, and intracranial electroencephalography, iEEG) time series. However, even these few works are limited in that each provides a single comparison between one linear and one nonlinear family of models (linear autoregressive (AR) vs nonlinear manifold-based model in ref. 20, linear finite-impulse response vs nonlinear Volterra series model in ref. 21, and linear state space vs nonlinear AR with radial basis function nonlinearities in ref. 22), which need not be the best representatives of linear and nonlinear models in general. The compared linear and nonlinear models were found to be equally predictive of EEG data in ref. 20 and iEEG data in ref. 22, whereas the results were mixed in ref. 21: using scalp EEG data from patients with epilepsy, the latter study found mostly linear dynamics well within and outside of seizures, and mostly nonlinear dynamics (although varying across patients) in the periods around seizure onsets and offsets. While limited in their scope, these works call for a deep and rigorous data-driven investigation into the nonlinearity of macroscopic brain dynamics, as pursued herein.

In the second part of the paper, we seek to answer the question of why nonlinear models do not provide more accurate predictions than linear ones even though neurodynamics are inevitably nonlinear at the microscale. Specifically, we numerically demonstrate, using a simple sigmoidal nonlinearity, that four properties of macroscopic brain dynamics can fundamentally ‘counteract’ or seemingly ‘mask’ nonlinear dynamics present at the microscale: averaging over the activity of large populations of neurons to obtain a single macroscopic time series (averaging over space), natural low-pass-filtering properties of brain processes (averaging over time), observation noise and limited data samples. While the effects of observation noise and limited data samples are technology dependent but otherwise independent from the form of nonlinearity, the effects of spatiotemporal averaging are fundamental to macroscopic neural dynamics and may depend on the functional form of the microscale nonlinearity. We thus also verify and demonstrate the effects of spatiotemporal averaging using a data-driven and biophysically grounded spiking neuron model23. Together, our results provide (1) important evidence that linear models can be as descriptive as nonlinear ones at the macroscale and (2) a methodology based on system identification theory to quantitatively define a ‘best’ model of whole-brain dynamics given a set of specified costs.

Results

System identification and data-driven computational modelling

Among the diverse categories of computational models used in neuroscience, we focus on ordinary differential equation (ODE) models of the general form

$$\dot{\mathbf{x}}(t)=f(\mathbf{x}(t))+\mathbf{e}_1(t),\qquad \mathbf{x}(0)=\mathbf{x}_0$$
(1a)
$$\mathbf{y}(t)=h(\mathbf{x}(t))+\mathbf{e}_2(t)$$
(1b)

where y(t) is an n-dimensional time series of recorded brain activity, in this case via resting-state functional magnetic resonance imaging (fMRI) (rsfMRI) or iEEG (rsiEEG), x(t) is an m-dimensional time series of ‘internal’ brain states, f and h are generally nonlinear vector fields, and e1(t) and e2(t) are time series of process and measurement noise with arbitrary statistics (Fig. 1a,b). It is possible, although not necessary, that m = n. As with any differential equation, the description would not be complete without the initial condition x(0) = x0, determining the state of the brain at the initial recording time t = 0. Note that no external input u(t) is considered due to the resting-state condition of the experiments. Also, given that we can only sample y(t) in discrete time, we implement equation (1) by approximating the derivative \(\dot{\mathbf{x}}(t)\) as a first difference x(t) − x(t − 1) (see Methods).
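Equation (1) and its first-difference implementation can be made concrete with a short simulation. The sketch below assumes, purely for illustration, a linear f and h and Gaussian white noise; all dimensions and parameter values are hypothetical rather than those fitted in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: m internal states, n output channels, N time steps.
m, n, N = 8, 5, 1200

# Illustrative linear choices f(x) = A x and h(x) = C x, with A scaled so that
# the discrete-time update I + A is stable.
A = -0.2 * np.eye(m) + 0.02 * rng.standard_normal((m, m))
C = rng.standard_normal((n, m))

x = np.zeros((N + 1, m))
x[0] = rng.standard_normal(m)                  # initial condition x(0) = x0
for t in range(1, N + 1):
    # First-difference form of equation (1a): x(t) - x(t-1) = f(x(t-1)) + e1(t)
    x[t] = x[t - 1] + A @ x[t - 1] + 0.1 * rng.standard_normal(m)

# Output equation (1b): y(t) = h(x(t)) + e2(t)
y = x @ C.T + 0.05 * rng.standard_normal((N + 1, n))
```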

Fig. 1: Prediction error method for system identification.

a,b, The general category of computational models \(\mathcal{M}\) studied in this work, represented by an ODE describing the resting-state evolution of internal states x(t) (a) and an output model that maps internal states to fMRI/iEEG time series y(t), as shown for fMRI (b). A total of nfMRI = 116 regions were used throughout (see Methods) for fMRI, while 13 ≤ niEEG ≤ 175 channels were used for each iEEG patient. c, A schematic representation of the prediction error system identification framework used in this work. At each time t, all of {y(0),…,y(t − 1)} was used to predict y(t) simultaneously across all channels, denoted for simplicity by \(\hat{\mathbf{y}}(t\mid t-1)\). \(\bar{y}_i\) denotes the temporal average of yi(t) for each channel i. This estimate should not be confused with estimates of FC. d, Functional connectivity (FC) measures the covariation between pairs of channels or, equivalently, how well each yi(t) (y1(t) in the figure) can be predicted from each other yj(t) (y2(t) in the figure), ‘at the same time’ t.

The model in equation (1) is ‘linear’ if the functions f and h are linear functions, that is, matrix operations of the form f(x) = Ax and h(x) = Cx, where \(\mathbf{A}\in{\mathbb{R}}^{m\times m}\) and \(\mathbf{C}\in{\mathbb{R}}^{n\times m}\) are constant (or even time-varying, but state-independent) matrices. Throughout the field of computational neuroscience, numerous models of the form in equation (1) or its discretization (see Methods) are constructed and used, each with different functional forms and noise statistics6,7,9,12,15,22,24. The critical but often overlooked task of system identification25 is then to find the ‘best’ model among all the available options, as judged against experimental data. This comparison, of course, depends on one’s measure of a model’s goodness of fit.

A natural choice, referred to as the prediction error (PE) approach, is based on how well a given model can predict the ‘future’ of the measured time series from its ‘current and past’ values (Fig. 1c). Note that this prediction is precisely what an ODE such as equation (1a) defines: it models the ‘change’ \(\dot{\mathbf{x}}(t)\), and thus the immediate future, of the system’s state (and therefore output) from its current state x(t). Since the state x(t) is not directly available, it should in turn be estimated from the current and past measurements of the output y(t). Therefore, the PE approach in its simplest form seeks to minimize, within any given parametric or non-parametric family of models, the magnitude of the one-step-ahead PE

$$\boldsymbol{\varepsilon}(t)=\mathbf{y}(t)-\hat{\mathbf{y}}(t\mid t-1)$$
(2)

where \(\hat{\mathbf{y}}(t\mid t-1)\) is the Bayes-optimal, minimum variance estimate of y(t) given all of the history {y(0),…,y(t − 1)} of y up to time t − 1 (ref. 25) (Fig. 1c). Notably, the PE approach focuses on the prediction accuracy of the time series itself, rather than the prediction accuracy of functional connectivity (Fig. 1d) or other statistics of the time series (cf. Discussion and Supplementary Fig. 1). This approach can also be easily extended to multistep-ahead predictions (cf. ref. 25 and Supplementary Figs. 20 and 21).

The task of system identification does not end once the parameters of a given family of models are fit to the (training) data. The critical next step is to assess the quality of the fit, particularly to data withheld during the training (cross-validation). In the PE approach, the two most widely used measures are the variance and the whiteness of the PE (ref. 25), where the former is often measured by

$$R_i^2=1-\frac{\sum_t \varepsilon_i(t)^2}{\sum_t \left(y_i(t)-\bar{y}_i\right)^2}$$
(3)

and the latter is often assessed via a χ2 test of whiteness (Methods), for each channel i = 1,…,n. In equation (3), \(\varepsilon_i(t)\) is the ith component of the one-step-ahead prediction error in equation (2), and \(\bar{y}_i\) is the temporal average of yi(t) (often equal to zero due to mean centring); the latter corresponds to a constant predictor that always predicts y(t) equal to its average \(\bar{\mathbf{y}}\). Therefore, \(R_i^2\) is always ≤1 but ‘can be negative’. A value of \(R_i^2=1\) indicates a perfect model (for channel/region i), \(R_i^2=0\) indicates a model as good as the constant predictor, and \(R_i^2 < 0\) indicates a model worse than the constant predictor.
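For concreteness, equations (2) and (3) amount to the following computation, sketched here in Python (array shapes and variable names are ours, for illustration only):

```python
import numpy as np

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> np.ndarray:
    """Per-channel R_i^2 of equation (3).

    y, y_hat: arrays of shape (N, n) holding the measured output y(t) and the
    one-step-ahead predictions y_hat(t|t-1), for t = 1, ..., N.
    """
    eps = y - y_hat                                     # prediction error, equation (2)
    ss_res = np.sum(eps ** 2, axis=0)                   # sum_t eps_i(t)^2
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2, axis=0)  # constant-predictor error
    return 1.0 - ss_res / ss_tot                        # always <= 1, can be negative

# Example: the 'zero' model predicts no change, y_hat(t|t-1) = y(t-1), so its
# per-channel R^2 for a time series y of shape (N, n) is:
# r2_zero = r_squared(y[1:], y[:-1])
```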

Linear models provide maximum prediction accuracy with minimum computational complexity

In this work, we fit and compare several families of linear and nonlinear models, as described below (see Methods for details). We fit each family of models to the data for each participant, thereby finding the globally or locally optimal model within that family (depending on whether the corresponding optimization problem is convex or non-convex). We then compare the resulting best models in each family in terms of their cross-validated fit to held-out data of the same participant (see Methods). The most important ground for comparison is the accuracy of their fit, measured by \(R_i^2\) according to equation (3).

First, consider the results for the fMRI data (Fig. 2a,b). While we describe the results obtained using a relatively coarse parcellation here, similar results also hold for finely parcellated and unparcellated data (cf. Supplementary Figs. 12–14). Overall, linear models that directly fit the BOLD time series without (de)convolving with a haemodynamic response function (HRF), either with dense or sparse effective connectivity, and with or without higher-order AR lags, achieve the highest R2. Among nonlinear models, the manifold-based locally linear model achieves a comparable R2. Yet, upon closer inspection of this model, we observe that its window size (which is chosen optimally, see Methods and Supplementary Fig. 7) is very large, effectively making it a globally linear model. The lack of nonlinearity becomes even clearer when examining the pairwise models. Here we see that a simple linear model performs as well as the minimum mean squared error (MMSE) model, or even slightly better (Fig. 2b, right), due to the numerical errors of distribution estimation. We thus infer that the former achieves the highest prediction accuracy achievable by ‘any’ generally nonlinear model, albeit for pairwise prediction.

Fig. 2: Linear vs nonlinear models of rsfMRI activity.

a, The distribution of cross-validated regional \(R_i^2\), combined across all 116 regions and 700 participants, for linear (green) and nonlinear (yellow) models. The grey box corresponds to the zero model used as a baseline (see Methods for an explanation of each model). Negative values of R2 indicate that models have a worse prediction performance than a constant predictor that always predicts the next value of a signal to be equal to the signal’s mean. Note the higher accuracy of the best linear model (‘VAR-3 (sparse)’) compared with all nonlinear ones. b, The P value of the one-sided Wilcoxon signed-rank test performed between all pairs of brain-wise distributions (left) and all pairs of pairwise distributions (right) of R2 in a. Warm (cold) colours indicate that the distribution labelled on the row has significantly larger (smaller) samples than the distribution labelled on the column. Grey hatches indicate non-significant differences at α = 0.05 with Benjamini–Hochberg false discovery rate (BH-FDR) correction for multiple comparisons. c,d, Similar to a and b but for the statistic Q of the multivariate test of whiteness relative to its rejection threshold Qthr (cf. Methods). Smaller Q/Qthr indicates whiter (better) residuals, with Q/Qthr ≤ 1 required for the null hypothesis of whiteness not to be rejected. e,f, Similar to a and b but for the time that it took for the learning and out-of-sample prediction of each model to run, per participant per cross-validation (see Methods). In all boxplots, the centre line, box limits and whiskers represent the median, upper and lower quartiles, and the smallest and largest samples, respectively. VAR, vector autoregressive; HRF, haemodynamic response function; NMM, neural mass model; DNN, deep neural network; MLP, multilayer perceptron; CNN, convolutional neural network; LSTM, long short-term memory; IIR, infinite impulse response; FIR, finite impulse response; MMSE, minimum mean squared error.

The second ground for comparison is the whiteness of model residuals, also in held-out data, which indicates that all the dynamics in the data have been captured by the model and have not leaked into the residuals (Fig. 2c,d). Here, linear models also score higher than nonlinear models, with AR models clearly outperforming the others. However, it is noteworthy that the null hypothesis of whiteness is rejected for the residuals of all methods (Q/Qthr > 1), suggesting the presence of unexplained variance left by all models. Generally, the number of lags and the sparsity pattern have little effect on the prediction accuracy of linear AR models for rsfMRI data, a positive but weak effect on the whiteness of the residuals and a negative effect on the computational complexity (Supplementary Fig. 2). As with the comparison of R2 values, the only nonlinear model whose whiteness of residuals is comparable to the linear ones is the manifold-based locally linear model which, as explained above, is effectively linear at the global scale. Also as before, the pairwise linear models achieve a degree of whiteness that is almost identical to that of the pairwise MMSE estimator, confirming their optimality among all linear and nonlinear pairwise predictors.
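To make the AR family concrete, a dense VAR model with k lags can be fitted by ordinary least squares, as sketched below. This illustrative outline uses the standard levels form y(t) = Σ W_l y(t − l) + e(t) (equivalent to the difference form used elsewhere in this paper up to an identity shift in W_1) and omits the sparsity constraints, regularization and hyperparameter selection behind models such as ‘VAR-3 (sparse)’:

```python
import numpy as np

def fit_var(y: np.ndarray, k: int) -> np.ndarray:
    """Ordinary least-squares fit of a dense VAR(k),
    y(t) = W_1 y(t-1) + ... + W_k y(t-k) + e(t).

    y: (N, n) time series. Returns W = [W_1 ... W_k] of shape (n, n*k)."""
    N, n = y.shape
    # Regressor rows [y(t-1), ..., y(t-k)] for targets y(t), t = k, ..., N-1.
    X = np.hstack([y[k - l - 1 : N - l - 1] for l in range(k)])
    W, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return W.T

def predict_var(y: np.ndarray, W: np.ndarray, k: int) -> np.ndarray:
    """One-step-ahead predictions y_hat(t|t-1) for t = k, ..., N-1."""
    N, n = y.shape
    X = np.hstack([y[k - l - 1 : N - l - 1] for l in range(k)])
    return X @ W.T
```

In a cross-validated comparison such as ours, W would be fitted on training segments only, and predict_var applied to held-out segments before computing equation (3).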

Third and finally, we can compare the models by considering the total time that it takes for their learning and prediction (Fig. 2e). When comparing the most efficient linear and nonlinear models, we find that linear models take at least one order of magnitude less time to fit than nonlinear models, as expected. However, linear methods can also be extremely complex to learn; linear models with states at the neural level (‘Linear with HRF’) require the most time to learn due to their high flexibility. Notably, this additional complexity of the ‘Linear with HRF’ or nonlinear methods is not counterbalanced by any benefits in their accuracy or whiteness of residuals, making the simplest linear models the preferred choice across all measures.

Next, we perform the same comparisons between linear and nonlinear models, but now on the basis of their fit to resting-state iEEG field potential dynamics (Fig. 3). Similar to rsfMRI data, linear AR models provide the best fit to the data in terms of both the magnitude and whiteness of their cross-validated prediction error. These models also have lower computational complexity than nonlinear ones, with about an order of magnitude (or higher) advantage in computation time.

Fig. 3: Linear vs nonlinear models of rsiEEG activity.

Panels and acronyms parallel those in Fig. 2. a, The distribution of cross-validated regional \(R_i^2\), combined across all electrodes (the number of which varies among participants) and all the recording sessions of the 122 participants (sample size = 776,484). Linear and nonlinear methods are depicted by green and yellow boxes, respectively (see Methods for an explanation of each model). Unlike data presented in Fig. 2, pairwise linear or pairwise MMSE models are not included, owing to the observation that between-electrode connections decrease the cross-validated accuracy of the top model (cf. the 4th and 5th boxplots). In contrast, including scalar autoregressive lags is highly beneficial in iEEG, whereas it is not so in rsfMRI. Therefore, the MMSE model here is scalar, conditioning on the past lags of each region itself. The lower whiskers of the boxplots are trimmed to allow for better illustration of the interquartile ranges. b, The P value of the one-sided Wilcoxon signed-rank test performed between all pairs of distributions of R2 in a. Warm (cold) colours indicate that the distribution labelled on the row is significantly larger (smaller) than the distribution labelled on the column. Grey hatches indicate non-significant differences evaluated at α = 0.05 with BH-FDR correction for multiple comparisons. c,d, Similar to a and b but for the statistic Q of the multivariate test of whiteness relative to its rejection threshold Qthr (cf. Methods). Smaller Q/Qthr indicates whiter (better) residuals, with Q/Qthr ≤ 1 required for the null hypothesis of whiteness not to be rejected. e,f, Similar to a and b but for the time that it took for the learning and out-of-sample prediction of each model. In all boxplots, the centre line, box limits and whiskers represent the median, upper and lower quartiles, and the smallest and largest samples, respectively.

Alongside these similarities between the rsfMRI and rsiEEG data, two major distinctions are notable. First, the R2 values are generally much higher for iEEG, as evidenced by comparing the R2 distributions of the zero model between the two modalities. This difference arises because the iEEG time series has a much higher sampling rate and is therefore smoother. As a result, even predicting each sample to be equal to its previous sample (that is, the zero model) has a median accuracy of more than 97% (see Supplementary Figs. 17–19 for a more detailed assessment of the effects of sampling rate on models’ R2). This fact only highlights the importance of the zero model; without it, the R2 of all models might have seemed satisfactorily high. In comparison to the zero model, however, it becomes clear that a simple 1-lag linear model, for example, has in fact very low predictive power.
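This baseline effect of smoothness is easy to reproduce with a toy signal; in the sketch below, the slowly varying AR(1) process and the subsampling factor are illustrative stand-ins for the high and low sampling rates of iEEG and fMRI, not fits to either dataset:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)

# A slowly varying AR(1) signal, mimicking a heavily oversampled recording.
rho = 0.999
e = rng.standard_normal(500_000)
y = lfilter([1.0], [1.0, -rho], e)      # y(t) = rho * y(t-1) + e(t)

def zero_model_r2(y: np.ndarray) -> float:
    # 'Zero' model: predict no change, y_hat(t|t-1) = y(t-1).
    eps = y[1:] - y[:-1]
    return 1 - np.sum(eps ** 2) / np.sum((y[1:] - y[1:].mean()) ** 2)

print(zero_model_r2(y))         # close to 1: consecutive samples barely differ
print(zero_model_r2(y[::500]))  # far lower once the same signal is sub-sampled
```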

The second major distinction between the two modalities is the amount of history and temporal dependency within them. fMRI data are almost Markovian, such that y(t − 1) contains almost all the information available for the prediction of y(t). A small amount of additional information is contained in y(t − 2), but almost none in timepoints further in the past (Fig. 2a). When considering iEEG data, in contrast, increasing the number of autoregressive lags up to 100 still improves the R2, although the exact optimal number of lags varies between data segments. In this comparison, it is also important to take into account the vast difference in the sampling frequencies between the modalities: 2 lags in the fMRI dataset amount to 1.44 s, while 100 lags of the iEEG data span only 0.2 s. This greater ‘richness’ of iEEG dynamics from a modelling perspective is also responsible, at least in part, for the markedly lower whiteness of residuals of all model families relative to fMRI (see Fig. 3c vs Fig. 2c). This greater richness of iEEG is also consistent with, although not necessarily a direct consequence of, the fact that iEEG data reflect neural signals more directly than fMRI.

The linearizing effects of macroscopic neurodynamics and neuroimaging explain the observed linearity

The above results pose the natural question of why nonlinear models were not able to capture the dynamics in rsfMRI/rsiEEG data beyond linear ones, even though microscopic neuronal dynamics are fundamentally nonlinear. Here we focus on four properties of macroscopic neurodynamics and neuroimaging, and show that, in principle, they either fundamentally counteract or apparently mask nonlinearities. Due to its unique position in neural modelling11,12, we will use the sigmoidal nonlinearity to illustrate these effects; we note, however, that the effects are otherwise applicable to other forms of nonlinearity.

The first property that can fundamentally ‘counteract’ microscopic nonlinearities is spatial averaging. Imaging tools that are capable of measuring macroscopic brain dynamics detect a signal that reflects an average over the activity of hundreds, thousands or even millions of neurons. This spatial averaging can weaken, rather quickly, the nonlinear relationships in the dynamics of individual units (neurons or small-scale neuronal populations) as long as the units are not perfectly correlated, and can completely nullify nonlinearities when correlations decay with distance (Fig. 4a–c). Note that this distance can be the physical distance between the units, as assumed here, or in any relevant space such as that of neural codes and stimulus preference. The key factor in the linearizing effect of spatial averaging is the decay of pairwise correlations between neurons so that not all pairs of neurons in a region are highly correlated (a state of blanket global synchrony).
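The construction of Fig. 4a,b can be sketched in a few lines. In the following, a crude binned conditional-mean estimator stands in for the MMSE predictor, and the R2 values are computed in-sample purely for illustration (the paper's analysis is cross-validated); the gap between the nonlinear and linear R2 shrinks as the number of averaged units grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_r2(x, y):
    """In-sample R^2 of the least-squares line predicting y from x."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def binned_mmse_r2(x, y, bins=30):
    """R^2 of a binned conditional-mean (crude MMSE) predictor of y from x."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    idx = np.digitize(x, edges)                      # bin index of each sample
    y_hat = np.array([y[idx == b].mean() for b in range(bins)])[idx]
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

T = 2000
for n_ave in (1, 5, 100):
    x = rng.standard_normal((n_ave, T))       # independent unit activities x_i(t)
    y = np.tanh(x)                            # microscale sigmoidal nonlinearity
    xm, ym = x.mean(axis=0), y.mean(axis=0)   # macroscopic, spatially averaged signals
    print(f'n_ave = {n_ave:3d}: linear R^2 = {linear_r2(xm, ym):.3f}, '
          f'binned-MMSE R^2 = {binned_mmse_r2(xm, ym):.3f}')
```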

Fig. 4: The linearizing properties of macroscopic brain dynamics and of neuroimaging measurements.

a, The effect of spatial averaging. For each panel, Nave pairs of signals xi(t), t = 1,…,2,000 were randomly and independently generated, \(y_i(t)=\tanh(x_i(t))\) was calculated, and their averages 〈xi〉 and 〈yi〉 were computed. The quantities 〈xi〉 and 〈yi〉 possess a nearly linear relationship for Nave ≈ 5 or higher. b, The cross-validated R2 of the optimal nonlinear (MMSE) and linear predictors for the 〈xi〉–〈yi〉 relationships in a. c, The effect of spatial correlation on spatial averaging. Here we assign (xi(t),yi(t)) pairs to spatial locations in a unit sphere (left) and make each xi(t) and xj(t) correlated in a manner that depends on their spatial distance (middle). The difference between nonlinear and linear R2 always decays with Nave and vanishes if the correlation decays, even slowly, with distance (right). d, The effect of temporal averaging. One pair of \(x(t), y(t)=\tanh(x(t))\) was generated, independently over time, and passed through a Gaussian low-pass filter (LPF) with a cut-off frequency fcut-off that is normalized to the Nyquist frequency; thus, fcut-off = 1 means no LPF. e, Same as b but for the LPF{x}–LPF{y} relationships in d. f, Similar to c but for temporal averaging. We varied the PSD decay rate of x(t) (left) and then low-pass filtered x(t) and \(y(t)=\tanh(x(t))\) as in d. The difference between the optimal linear and nonlinear R2 eventually vanishes as fcut-off decreases, but it happens at smaller fcut-off for larger decay rates p (right). g, The effect of observation SNR. The quantities x(t) and y(t) = tanh(x(t)) are as in d, and their additive noises were generated independently. h, Same as e but for the (x + noise)–(y + noise) relationships shown in g. i, The effect of dimensionality. The values x1(t),…,xn(t) were generated as in a but here y(t) = tanh(x1(t) + ⋯ + xn(t)) generates a one-dimensional nonlinearity in n + 1 dimensions. No noise is included; no spatial or temporal averaging is applied. j, Right: similar to b, e and h except that a manifold-based (locally linear) nonlinear predictor was used, since the conditional density estimation required for MMSE loses accuracy in high dimensions with a fixed number of data points (see Methods). Left: the optimal window size of the manifold-based predictor as a function of dimension n. As n increases, the locally linear predictor automatically chooses larger windows to be able to make reliable predictions, thereby effectively degrading to a globally linear predictor (see also Supplementary Fig. 1). In all boxplots, the centre point, box limits and whiskers represent the median, upper and lower quartiles, and the smallest and largest samples, respectively. Error bars in c, f and j represent 1 s.e.m.

This linearizing effect of spatial averaging is similar to, but distinct from, stochastic linearization (also known as quasi-linearization)26. While the latter ‘approximates’ the relationship y = σ(x) using its expected slope E[∂y/∂x], spatial averaging as discussed here can result in a relationship that is truly linear. The same effect can also be observed when averaging forms of nonlinearity other than the sigmoid. Extended Data Fig. 2 shows the effect of spatial averaging on spiking neurons evolving according to the Izhikevich model23. This model has completely different nonlinearities from the sigmoid (polynomial and discontinuous) and shows a robust nonlinear phenomenon (a limit cycle). Although more than a few (but still no more than 100–\(10^4\)) neurons are required, spatial averaging still dissolves the nonlinear aspects of the dynamics while mostly sparing the linear ones.

The second property capable of completely counteracting microscale nonlinearities is temporal averaging. Macroscopic neural dynamics are often observed, or even defined, through signals that are low-pass-filtered versions of micro- and mesoscale variables. The most notable of these is perhaps the BOLD signal captured by fMRI, which can be seen as an observation of neural activity passed through the low-pass filter of the HRF. Similarly, although to a lesser extent, the local field potentials captured by iEEG most strongly reflect the aggregate pyramidal post-synaptic currents27, which are themselves low-pass-filtered observations of spiking activity through synaptic transmission and the resistive-capacitive circuit of neuronal membranes28. The effect of low-pass filtering is, in essence, temporal averaging, which impacts nonlinearities in a manner similar to that of spatial averaging (Fig. 4d,e). The parallel of spatial correlations here is the autocorrelation function or its frequency-domain representation, the power spectral density (PSD). The autocorrelation represents how the correlation between samples of a signal decays with the temporal distance between those samples. As expected, the smaller the bandwidth of the signal (that is, the faster its PSD decays with frequency before low-pass filtering), the weaker the linearizing effect of low-pass filtering. As a result, stronger low-pass filtering is required to completely nullify nonlinear relationships in signals with narrower bandwidth (Fig. 4f). The linearizing effect of temporal averaging also holds for deterministic dynamics, albeit with the resulting linear dynamics (post averaging) also being deterministic (Extended Data Fig. 3).
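A corresponding sketch for temporal averaging follows, reusing linear_r2, binned_mmse_r2 and rng from the spatial-averaging snippet above; note that a Butterworth low-pass filter is used here for convenience, whereas Fig. 4d used a Gaussian one:

```python
from scipy.signal import butter, filtfilt

x = rng.standard_normal(20_000)               # white signal, independent over time
y = np.tanh(x)                                # its sigmoidal transform
for f_cut in (1.0, 0.3, 0.05):                # cut-off normalized to Nyquist
    if f_cut < 1.0:
        b, a = butter(4, f_cut)               # 4th-order low-pass Butterworth
        xf, yf = filtfilt(b, a, x), filtfilt(b, a, y)
    else:
        xf, yf = x, y                         # f_cut = 1 means no filtering
    print(f'f_cut = {f_cut:.2f}: linear R^2 = {linear_r2(xf, yf):.3f}, '
          f'binned-MMSE R^2 = {binned_mmse_r2(xf, yf):.3f}')
```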

A third property that can counteract or mask nonlinearities is noise. Although both process noise and observation (scanner or electrode) noise may have linearizing effects, here we focus only on the latter. As with any neuroimaging time series, various sources of observation noise can affect the fMRI/iEEG time series29,30 and, in turn, ‘blur’ nonlinear relationships, even if they exist between the underlying noise-free BOLD/LFP time series (Fig. 4g,h). In fact, when the power of noise reaches the power of the signal (signal-to-noise ratio (SNR) ≈ 1), it can completely mask a nonlinear relationship in the absence of any spatial or temporal averaging. In reality, however, the linearizing effect of observation noise combines with that of spatiotemporal averaging, making the 2 ≲ SNR ≲ 14 that we have in rsfMRI data (Supplementary Fig. 4) potentially more than enough to mask any nonlinearities remaining after spatiotemporal averaging. Ironically, the use of linear filtering to ‘clean the data’ is more likely to further linearize the dynamics of the time series, owing to the temporal averaging effects discussed above, than to recover nonlinearities lost to noise (Supplementary Note 1). Nonlinear post-processing steps, on the other hand, may leave their own potentially nonlinear signatures in the data, but such signatures should not be confused with true nonlinear relationships in the original BOLD/LFP signal. Further, although we let the noise in Fig. 4g,h be independent of the signal, as is typically the case for measurement noise, this linearizing effect would still hold if the noise depended linearly on the signal.
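The masking effect of observation noise at SNR ≈ 1 can be illustrated along the lines of Fig. 4g,h, again reusing the helper functions defined above (an in-sample sketch, not the paper's cross-validated analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 20_000
x = rng.standard_normal(T)
y = np.tanh(x)                              # a perfect, noise-free nonlinearity
snr = 1.0                                   # noise power equal to signal power
xn = x + (np.var(x) / snr) ** 0.5 * rng.standard_normal(T)
yn = y + (np.var(y) / snr) ** 0.5 * rng.standard_normal(T)

# The nonlinear (binned-MMSE) and linear R^2 become nearly indistinguishable.
print(f'linear R^2 = {linear_r2(xn, yn):.3f}, '
      f'binned-MMSE R^2 = {binned_mmse_r2(xn, yn):.3f}')
```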

The fourth and final property that we discuss is the number of samples required for detecting nonlinear relationships in large dimensions. Let us assume, despite our discussion so far, that a perfect noise-free nonlinear relationship exists between n-dimensional fMRI or iEEG time series and that a noise-free sensor can capture it perfectly. When only N ≈ 1,000 data points are available, we find that the manifold-based predictor, which was our most predictive nonlinear method for both fMRI and iEEG, is still unable to predict the nonlinear relationship better than a linear model in n ≈ 40 dimensions or higher (Fig. 4i,j). This loss in the predictive power of the nonlinear predictor with increasing dimensionality can be readily seen from the fact that even the smallest mesh, having two points per dimension, requires an exponentially large number of data points, \(N = 2^n\). Indeed, incorporating structural bias into the learning algorithm can arbitrarily reduce this sample complexity ‘if’ the incorporated bias is consistent with the underlying data31 (for example, if one looks for relationships of the form y = σ(x1 + ⋯ + xn) in Fig. 4i,j). However, using predictors with structural bias can also be arbitrarily misleading if their form of nonlinearity is not consistent with the given data32, which is one potential reason for the lower performance of most nonlinear methods in Figs. 2 and 3. This discussion also makes it clear that the inability of our nonlinear system identification methods to outperform linear ones in Figs. 2 and 3 over the entire brain is not a proof that no nonlinear method can possibly do so. We can, nevertheless, be certain about this for pairwise or scalar AR models (for fMRI and iEEG, respectively), where the optimal MMSE predictor was computable and performed as well as a linear one.
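The following sketch illustrates this sample-complexity effect, with a k-nearest-neighbour regressor standing in for the manifold-based locally linear predictor (a simplification) and illustrative choices of N, k and n:

```python
import numpy as np

rng = np.random.default_rng(0)

N, k = 1000, 20                       # sample size and number of neighbours
for n in (2, 10, 40):
    X = rng.standard_normal((N, n))
    y = np.tanh(X.sum(axis=1))        # noise-free nonlinear target
    Xtr, Xte = X[:800], X[800:]       # 800 training, 200 held-out test points
    ytr, yte = y[:800], y[800:]

    # Linear regression with intercept, fitted by least squares.
    A = np.hstack([Xtr, np.ones((len(Xtr), 1))])
    w, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    y_lin = np.hstack([Xte, np.ones((len(Xte), 1))]) @ w

    # k-nearest-neighbour regression: a generic, structure-free nonlinear predictor.
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    y_knn = ytr[np.argsort(d2, axis=1)[:, :k]].mean(axis=1)

    sst = np.sum((yte - yte.mean()) ** 2)
    print(f'n = {n:2d}: linear R^2 = {1 - np.sum((yte - y_lin) ** 2) / sst:.3f}, '
          f'kNN R^2 = {1 - np.sum((yte - y_knn) ** 2) / sst:.3f}')
```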

In conclusion, the process of averaging over space, the process of averaging over time, the existence of observation noise and the acquisition of limited data are each characteristic of macroscale brain dynamics or neuroimaging measurements and can transform microscopically nonlinear dynamics into macroscopically linear ones. In reality, their effects are probably all combined, rendering the optimality of linear models in our comparisons not as unexpected as it might originally seem. This linearity has major implications for computational neuroscience, as we discuss next.

Discussion

Summary

In this work, we set out to test the hypothesis that macroscopic neural dynamics are nonlinear, and using linear models for them results in an inevitable loss of accuracy in exchange for simplicity. We thus compared linear and nonlinear models in terms of how well they can predict rsfMRI and rsiEEG data in a cross-validated PE system identification framework, where the quality of each model’s fit was assessed by the variance and whiteness of its PE (residual). We found that linear models, and AR models in particular, achieve the lowest PE variance and highest PE whiteness, outperforming neural mass models (NMMs), deep neural networks (DNNs), manifold-based models and the optimal MMSE predictors. In the case of fMRI data, we further verified that the higher predictive power of AR models holds not only in aggregate but also in a strong majority of individual regions across the brain (Extended Data Fig. 4a). Interestingly, the spatial (regional) distribution of the R2 of the best model also shows significant differences across established cortical functional networks, a remarkably lower predictability of subcortical regions relative to cortical ones and a close alignment between most methods (all except for ‘Linear with HRF’) (Extended Data Fig. 4b–d). This distinction in predictability highlights significant differences in how spatio-temporally correlated the fMRI time series of different regions are, while the mechanistic physiological and technological reasons behind this distinction remain a warranted avenue for future research.

To further understand the possible causes of the optimality of linear models, we analysed the effects of common elements of macroscopic neural dynamics: averaging over space and time, observation noise and limited data samples. We showed that they can each counteract or mask the nonlinearities present at smaller scales. These linearizing effects add up when combined, suggesting that linear models provide a useful choice for macroscopic neural dynamics at rest; of course, in certain experimental conditions, rigorous system identification methods might still uncover nonlinear dynamics in future studies.

The observed optimality of linear models for the resting state is accompanied by both challenges and opportunities. Having a linear model for neuroscience investigations is computationally ideal, given the extent to which the behaviour of linear systems and their response to stimuli are mapped out. Nevertheless, to what extent these linearly interacting macroscopic signals are informative of and have a causal influence on the underlying microscopic activity remains unclear and represents an invaluable area for future investigation. Our observations also warrant the exploration and development of both linear and nonlinear models of macroscopic neural dynamics beyond those tested here and available in the literature.

Connections to previous literature

It is important to distinguish the pragmatic modelling question that drove our analysis from the rather philosophical question of whether any signs of ‘nonlinearity’ can be found in neuroimaging time series. The latter question has been extensively investigated17,18,19,33 and often uses determinism or chaos as a proxy for nonlinearity. To answer our distinct modelling question, we used a system identification approach that allows for a direct, side-by-side comparison of linear and nonlinear models. In contrast, the aforementioned studies often resorted to indirect, surrogate-based comparisons that rely on strong (and debated) assumptions about the constructed surrogates34. Also related to, but different from, our work are studies that seek to determine whether the end-to-end input–output mapping between stimuli and neuroimaging signals (EEG or BOLD) (refs. 35,36) or between functional connectivity and individual phenotypes37,38 is (highly) nonlinear. Our focus here, however, is on the nonlinearity of the internal network dynamics of the brain. Finally, past studies have also examined the performance of linear models per se in fitting neuroimaging time series (for example, ref. 39) but without a comparison to nonlinear models.

Our analysis of the linearizing effect of spatial correlation is also related to the large body of work investigating the effect of spatial correlations on the information content and decoding accuracy of neural population codes (see, for example, ref. 40). As expected, the stronger the correlation between neurons, the weaker the linearizing effects of spatial averaging. However, nonlinearities can in principle have two opposing effects on the neural code. On the one hand, nonlinearities can substantially increase the computational complexity and expressivity of a neural network, making correlations beneficial for neuronal encoding. On the other hand, if the expressivity is too high, the decodability of one neural population by another may decrease, potentially making the linearizing effects of low correlations favourable. Determining which effect dominates, and whether an optimal point exists at the levels of neural correlation observed in vivo, remains an area for future research.

Results and implications

The implications of the linearity of brain dynamics are far-reaching. Linear systems fundamentally have a more limited repertoire of dynamic behaviours than nonlinear ones, excluding the possibility of multistability, chaos, limit cycles or cross-frequency coupling, to name a few41. When driven by noise, linear systems act as linear filters that shape the power spectrum of their output (here, fMRI or iEEG time series) through their frequency response, essentially amplifying the frequency content near their resonance frequencies and attenuating it elsewhere. Importantly, this shaping of the power spectrum by linear systems acts independently over different frequencies; in contrast, nonlinear systems can drive arbitrarily complex cross-frequency interactions42.

The linearity of brain dynamics has even greater implications for network control43,44. The design and analysis of optimal, robust, adaptive and many other forms of control are much better understood in the context of linear systems than nonlinear ones. This contrast in tractability only grows for large-scale systems such as the brain, thus motivating the recent surge of interest and advancements in using linear control theory in neuroscience45,46,47. Nonlinear models also present additional challenges beyond network control, including analytical and mechanistic understanding of their functionality, obtaining provable guarantees on their performance and even hardware requirements for their use in chronic implantable devices. In this context, the present work shows that the favourable tractability and simplicity of linear models do not necessarily come at the often-presumed cost of model inaccuracy and also provide the necessary tools for identifying the most accurate models for any datasets of interest.

In the analysis of fMRI data, we found that incorporating an HRF component in the model, instead of modelling the dynamics directly at the BOLD level, results in a loss of accuracy in linear models (see ‘Linear (sparse)’ vs ‘Linear with HRF’), and is almost ineffective in nonlinear models (see ‘NMM’ vs ‘NMM with HRF’). It was also in light of this observation that we did not include an HRF component in the majority of our models, such as the DNN or the manifold-based models. This lack of advantage of an explicit HRF component (within the specific context of modelling resting-state fMRI dynamics using ODEs) is understandable on a number of grounds. First, to include an HRF component in the model, one should either learn the HRF from the data, such as in our ‘linear with HRF’ model, which creates marked model flexibility and therefore increases the likelihood of overfitting, or use a typical HRF, such as in our ‘NMM with HRF’ model, which is a source of additional error. Second, by including the HRF in the model, we ultimately seek to recover neural information that is lost through the HRF. This task is difficult, if not impossible, without a high signal-to-noise ratio as well as more accurate HRF models than those currently available. Finally, a linear autoregressive model can automatically capture a linear approximation of the HRF dynamics48, precisely as present in the observed time series. Ultimately, our results encourage a side-by-side comparison of models with and without the inclusion of an HRF component to assess the costs and benefits of such inclusion for any datasets of interest.

A very similar argument also applies to including coloured noise in the model, both for fMRI and iEEG. Even though the so-called ‘noise floor’ in neural signals over which neural oscillations are detected49 has a clear power-law (1/f) PSD, its decay can be well modelled by white noise passing through a linear filter. This property of the 1/f noise is in fact why the AR models, which assume a white noise signal e(t), have prediction errors that are maximally white. This latter fact can be directly seen from Figs. 2c and 3c, when noting that any model’s prediction errors are, by construction, the model’s estimate of the noise e(t) (ref. 25).
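As an illustration of this point, white noise passed through even a single-pole linear filter already produces a strongly low-frequency-weighted PSD, with no nonlinearity involved (the pole location and spectrum-estimation settings below are illustrative):

```python
import numpy as np
from scipy.signal import lfilter, welch

rng = np.random.default_rng(0)

# White noise through a one-pole linear filter: y(t) = 0.98 y(t-1) + e(t).
e = rng.standard_normal(2 ** 18)
y = lfilter([1.0], [1.0, -0.98], e)

f, pxx = welch(y, nperseg=4096)
# Power is orders of magnitude higher at low frequencies and falls off roughly
# as 1/f^2 above the filter's corner frequency; sums of a few such processes
# with different poles can approximate a 1/f decay over a broad band.
print(pxx[1] / pxx[-1])
```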

In addition to considering linear and nonlinear models, we underscore the importance of the zero model. It is not uncommon in the modelling literature to assess the quality of a fitted model per se, without any grounds for comparison. For instance, our DNN multilayer perceptron ‘DNN (MLP)’ model for fMRI had a median R2 of 14% and for some participants it had a median R2 (among all regions) of over 50%. Even more notably, the ‘DNN (MLP)’ model for iEEG had a median R2 of over 97%. Without any comparisons, these numbers may suggest that the ‘DNN (MLP)’ models are quite accurate; yet, as seen in Figs. 2b and 3b, the predictive accuracy of these models is in fact lower than that of the zero model in fMRI and indistinguishable from that of the zero model in iEEG. The act of comparing to a baseline model is therefore an essential step in the assessment of any model’s goodness of fit.

We restricted our analyses in the main text to certain spatiotemporal resolutions for both fMRI (a coarse parcellation) and iEEG (a high sampling rate), naturally raising the question of how robust our findings are to these choices. As shown in Supplementary Figs. 12–19, our main finding (the higher predictive power of linear autoregressive models over all other model families) holds across all resolutions tested. We do, however, observe certain differences between resolutions. In iEEG data, we observe that using lower sampling rates (and therefore longer time intervals) increases the benefit of modelling network interactions, even while lowering the R2 values across all models. In fMRI data, we interestingly see that as we move towards more fine-grained parcellations and ultimately unparcellated data, (1) the simpler ‘Linear (sparse)’ model with fewer parameters gains an advantage over the more populated ‘VAR-3 (sparse)’ model and (2) the overall R2 values of all models are reduced, potentially because the signal-to-noise benefits of averaging over larger parcels are lost.

Methodological considerations

Despite the solid theoretical foundations of the PE method for system identification, our results may still beg a practical question: would the same system identification and side-by-side comparison procedure be able to identify nonlinear dynamics, should they actually exist in the time series? A direct answer to this question can be given, for example, by applying the same procedure to simulated time series generated from a nonlinear model whose ground-truth functional form we know. The result of such an analysis is provided directly in Extended Data Figs. 2 and 3, but also indirectly in Fig. 4. Note that in the latter, we compared the cross-validated predictive power of linear and nonlinear (MMSE or manifold-based) models in identifying the sigmoidal relationship y = σ(x) and its variants after averaging or noise addition. This relationship can equally be viewed as a nonlinear dynamical system \(\dot{x}=\sigma(x)\), whose nonlinearity was correctly identified unless counteracted or masked by the four macroscopic effects we discussed therein.

In Supplementary Fig. 9, we take this validation one step further and compare the ‘DNN (MLP)’ model with linear autoregressive and subspace models, while also tuning their respective hyperparameters, using simulated data from the Izhikevich model and the same stochastic gradient descent (SGD) algorithm used for hyperparameter tuning in the main results. The resulting higher predictive performance of the ‘DNN (MLP)’ model is another testament to the validity of the end-to-end process we used for model training and comparison. Note also that the amount of improvement that nonlinear models can offer over linear ones can vary widely on the basis of the true underlying dynamics and the ability of either model to capture them. In Supplementary Fig. 10, for example, we show a similar side-by-side comparison between the ‘DNN (MLP)’ and ‘Linear (dense)’ models on one of the simplest systems exhibiting a nonlinear behaviour (chaos in this case). Despite the extreme simplicity of the true underlying model, the neural network can achieve near-perfect accuracy with only one hidden layer and 10 hidden units, while the linear model can only achieve an R2 of 50%. Finally, we also trained both the parameters and hyperparameters of the ‘DNN (MLP)’ model on fMRI data while replacing its rectified linear unit (ReLU) activation functions with linear ones, using the same codes and routines used for the actual ‘DNN (MLP)’ model. As seen from Supplementary Fig. 11, the resulting model achieves R2 distributions indistinguishable from those of the (theoretically equivalent) ‘Linear (dense)’ model, except for implementation differences between our hard-coded linear regression solver and that provided by MATLAB’s ‘trainNetwork’ function.

As for the comparison of linear and nonlinear models, one might expect nonlinear models to perform at least as well as linear ones, but not worse, given that the space of nonlinear models includes all linear models as a special case. In our comparison, however, we saw that most of our nonlinear methods actually have a worse prediction performance than linear ones. This behaviour can be understood in light of at least two facts. First, many nonlinear models, such as NMMs, do not include linear models as a special case and have structural biases that can be a source of error if not consistent with the data32. Second, even nonlinear models that do not have structural biases and/or contain linear models as a special case, such as DNN, MMSE or manifold-based models, still have a marked flexibility relative to a linear model. This immense flexibility makes the training of these models and finding their global optimum challenging. As a result, training algorithms are quite likely to return suboptimal models which, in this case, show worse accuracy and generalization than their linear special cases.

A further noteworthy aspect of our study specifically, and the prediction error framework more generally, is the focus on fitting the time series rather than its derivative statistics, such as the functional connectivity (FC) (refs. 50,51,52) or power spectral density53. While the choice of one approach over the other ultimately depends upon the anticipated use of the learned model, it is important to note that the mapping from dynamical systems to FC (or any other such statistic) is not a one-to-one mapping54. In fact, linear systems of the form y(t) − y(t − 1) = Wy(t − 1) + e(t) with completely different W matrices can give rise to almost identical FC matrices (see Extended Data Fig. 1). Therefore, when considering the accuracy of a general-purpose model of the brain, the time series contains the maximum amount of information and thus provides the best target for model fitting.
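This many-to-one mapping can be constructed explicitly. The sketch below (our construction, for illustration; Extended Data Fig. 1 uses models fitted to data) builds two different W matrices whose linear systems y(t) − y(t − 1) = Wy(t − 1) + e(t), driven by unit white noise, share the exact same stationary covariance and hence the same FC:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, sqrtm

rng = np.random.default_rng(0)

n = 6
A1 = rng.standard_normal((n, n))
A1 *= 0.9 / np.max(np.abs(np.linalg.eigvals(A1)))   # enforce stability of A = I + W
P = solve_discrete_lyapunov(A1, np.eye(n))          # stationary covariance: A1 P A1' + I = P
Ph = np.real(sqrtm(P))                              # symmetric square root of P

# For any orthogonal U, A2 = A1 Ph U Ph^-1 satisfies A2 P A2' = A1 P A1',
# so the stationary covariance (and hence FC) is unchanged.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A2 = A1 @ Ph @ U @ np.linalg.inv(Ph)
W1, W2 = A1 - np.eye(n), A2 - np.eye(n)

def fc(P):                                          # covariance -> correlation (FC)
    d = np.sqrt(np.diag(P))
    return P / np.outer(d, d)

P2 = solve_discrete_lyapunov(A2, np.eye(n))
print(np.abs(fc(P) - fc(P2)).max())                 # ~1e-12: identical FC
print(np.abs(W1 - W2).max())                        # order 1: very different models
```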

One modelling approach that we did not employ in this study is dynamic causal modelling (DCM, ref. 55). The reason is that no current variant of DCM is computationally feasible at the scale of our analysis: whole-brain fMRI with n = 116 parcels or large-scale iEEG with up to 175 (and a median of 98) electrodes. The most efficient variant, spectral DCM for fMRI, for instance, is applicable to 30–40 nodes, whereas stochastic DCM (the most relevant to our study) is only applicable to much smaller systems. Nevertheless, in light of our results thus far and of the great computational complexity of the DCM approach (and thus its potential for overfitting), we would not expect its cross-validated R2 to reach that of a linear model, although this comparison remains open at present.

In this work, we demonstrated four properties of macroscopic neurodynamics that can counteract or mask microscopic nonlinearity. In doing so, we purposefully kept the discussion at a conceptual level and generally abstained from tying it to specific micro- or mesoscopic neural models, as doing so would require building on assumptions that our study explicitly seeks to avoid. For instance, it is currently unclear whether and to what extent the dynamics of the ‘mesoscopic’ local field potentials or population firing rates that seem to be the main neural drivers of fMRI or iEEG are nonlinear and, if so, what the precise form of their nonlinearity is at each brain region. A warranted avenue for future research would be the re-analysis of the effects of spatial and temporal averaging, observation noise and limited data samples on precise, data-driven models of mesoscopic brain dynamics, should they possess nonlinear interactions.

Limitations

Finally, we highlight some of the limitations of the present study. First, it is important to note that the space of all nonlinear models in tens to hundreds of dimensions is intractably large, and the fact that our tested nonlinear models did not outperform linear ones is not a proof that no nonlinear model may ever do so.

In particular, some nonlinear models (for example, ‘DNN (MLP)’) have the capacity to implement the ‘Zero’ or even linear AR models as a special case (for example, by setting the output weights of the ‘DNN (MLP)’ model to zero to replicate the ‘Zero’ model, or by adjusting its weights and biases so that all ReLU activation functions operate in their linear range to replicate the ‘Linear (dense)’ model). Therefore, despite our best efforts in training these models and validating our training algorithms in alternative simulated settings (cf. Supplementary Figs. 7–9), the fact that their predictive power does not reach that of the zero/linear models in some datasets suggests suboptimal parameter and/or hyperparameter fitting due, for example, to local minima or insufficient training data. We should therefore highlight that the predictive power we achieved for those nonlinear methods does not necessarily reflect their best performance, and it is possible that future work with more training data and/or alternative optimization approaches may obtain results from these nonlinear models that match or outperform linear models on iEEG/fMRI data.

Our work thus seeks to provide rigorous evidence and methodology towards resolving the linear vs nonlinear modelling dilemma in computational neuroscience, rather than a final resolution thereof. Hybrid models consisting of various combinations of different nonlinear structures (LSTMs and CNNs, for example) may have substantial potential for obtaining more accurate nonlinear models and warrant future research. We can be confident, nevertheless, about the optimality of linear models at the pairwise level for fMRI or the scalar AR level for iEEG, given the equal or higher predictive power of linear regression relative to the optimal MMSE predictor. Moreover, our modelling framework is currently only applicable to resting-state dynamics with no inputs and has been tested on the two modalities of fMRI and iEEG. Inclusion of input signals for system identification of task fMRI/iEEG data requires accurate data-driven ‘input models’ of how experimental stimuli, as well as participants’ voluntary responses, influence the BOLD or LFP signals in each brain region, and is a highly warranted avenue for future research56,57. Under intensive task conditions, moreover, it is more likely, and perhaps certain, that nonlinearities will be observed, at least in the form of saturation effects in the BOLD/LFP signal. However, the precise form and extent of such nonlinearity need to be determined using rigorous system identification routines.

In conclusion, our work sought to ask the often-unasked question of whether the brain is macroscopically linear. Our findings show that simple linear models explain the rsfMRI and rsiEEG data as well as, or even better than, an array of nonlinear ones, thus challenging the commonly held yet untested assumption of higher accuracy of nonlinear models. However, the costs and benefits of nonlinear models are ultimately case-specific. Therefore, instead of offering a universal recommendation on the preferable choices for the modelling of neural dynamics, we rather provide the groundwork for rigorous investigation and informed decision-making in the context of rsfMRI/rsiEEG. When feasible, following a similar system identification routine is always recommended for computational modelling of any datasets of interest to ensure the optimal fit of the models used for subsequent analysis or design.

Methods

Data and preprocessing

For the fMRI analysis, we used ICA-FIX resting-state data from the S1200 Human Connectome Project (HCP) release58,59. The HCP experiments were carried out by the WU-Minn consortium, and their adherence to ethical standards was approved by the Institutional Review Boards of the respective institutions. Explicit informed consent was acquired from all participants involved in the study. rsfMRI images were collected with the following parameters: TR = 720 ms, TE = 33.1 ms, flip angle = 52°, FOV = 208 × 180 mm, matrix = 104 × 90, slice thickness = 2.0 mm, number of slices = 72 (2.0 mm isotropic), multiband factor = 8 and echo spacing = 0.58 ms. Brains were normalized to fsLR-32k via the MSM-AII registration and the global signal was removed. No bandpass filtering was performed (see Supplementary Note 1). Finally, we removed participants from further analysis if any of their four resting-state scans had excessively large head motion, defined as having frames with >0.2 mm frame-wise displacement or a derivative root mean square above 75. Also, participants listed in ref. 60 under ‘3T Functional Preprocessing Error of all 3T RL fMRI runs in 25 Subjects’ or ‘Subjects without Field Maps for Structural scans’ were removed, leaving a total of 700 participants that were used for all the analyses. We parcellated the brain into 100 cortical regions (Schaefer 100 × 7 atlas61) and 16 subcortical ones (Melbourne Scale I atlas62).

For iEEG preprocessing, raw data from the Restoring Active Memory (RAM) dataset we have previously published63,64,65 were segmented into task-free epochs from either before or after task completion that were at least 5 min in length. This process resulted in a total of 283 recordings from 122 participants. Data were then downsampled to the lowest sampling rate used across recording sites (500 Hz). Electric line noise and its harmonics at 60, 120 and 180 Hz were filtered out using a zero-phase distortion 4th order stop-band Butterworth filter with a 1 Hz width. This procedure was implemented using the ‘butter()’ and ‘filtfilt()’ functions in MATLAB. We then rejected noisy channels that were either (1) marked as noisy in the RAM dataset notes, (2) had a line length greater than three times the mean, (3) had z-scored kurtosis >1.5 or (4) had a z-scored power spectral density dissimilarity measure >1.5. The dissimilarity measure used was the average of one minus the Spearman’s rank correlation with all channels. Data were then demeaned and detrended. Channels were grouped according to whether they were grid or depth electrodes and then common average referenced within each group. Following the common average referencing step, plots of raw data and power spectral densities were visually inspected by an expert researcher with 6 yr of experience working with electrocorticography data to ensure that data were relatively clean.
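For reference, the line-noise removal step was implemented with MATLAB's 'butter()' and 'filtfilt()'; a minimal Python equivalent using scipy.signal might look as follows (a sketch of the same operations, not the authors' code; the ±0.5 Hz band edges reflect our reading of the '1 Hz width'):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 500.0                                    # sampling rate after downsampling (Hz)

def remove_line_noise(data: np.ndarray) -> np.ndarray:
    """Zero-phase 4th-order Butterworth band-stop filtering at 60/120/180 Hz,
    each band 1 Hz wide, applied along the time axis of a (samples x channels)
    array; filtfilt gives the zero phase distortion of MATLAB's filtfilt()."""
    for f0 in (60.0, 120.0, 180.0):
        b, a = butter(4, [(f0 - 0.5) / (fs / 2), (f0 + 0.5) / (fs / 2)],
                      btype='bandstop')
        data = filtfilt(b, a, data, axis=0)
    return data
```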

Finally, the aforementioned channel removal criteria are consistent with standard practice in iEEG preprocessing (where channels with exceptional line length, kurtosis or power spectral densities are removed to target high-frequency noise and ictal activity66,67, electrode drift and ictal spikes68,69,70, or line noise and flat power spectral densities65,71, respectively) and are essential from a data quality perspective. Nevertheless, in Supplementary Fig. 22 we reproduce our main finding of the higher predictive power of linear models on data without channel removal, as a validation that channel removal had not confounded our finding by potentially providing an edge to the linear models.

Computing and run-time calculations

All the computations whose run times were measured and reported in Figs. 2e and 3e were performed on the CUBIC cluster at the University of Pennsylvania, using 1 CPU core and 16 or 64 GB of memory per computing job for fMRI and iEEG, respectively. For each method, fMRI participant/iEEG segment and cross-validation fold, one training time, one test time and one total time were computed, where the latter is simply the sum of the former two. Note that for the ‘Zero’, ‘Manifold’ and ‘MMSE’ models, no training time could be defined. This should be clear for the ‘Zero’ model, but it was also the case for the ‘Manifold’ and ‘MMSE’ models due to their ‘model on demand’ nature; that is, all training data were directly used in computing the prediction of each test point, instead of the usual process of learning one model from the training data and then using that model for test predictions. Therefore, the training time was set to zero for these models. For the remaining models, the training time equalled the CPU time taken for all the computations, from the point when the data had been broken into training and test sets until the point when the model was learned. The test time, well defined and non-zero for all the methods, started immediately after the end of the training period and ran until the point when the output time series (one-step-ahead predictions) were computed. Note that computations of the R2 and whiteness statistics were optional post-hoc analyses and were done in the same way for all methods; hence, they were not included in any of these run-time calculations. A breakdown of the run time into separate training and test times is shown in Supplementary Figs. 5 and 6.

Linear and nonlinear families of models

A complete list of the models used in this study is provided in Supplementary Table 1. For all models, the continuous-time dynamics in equation (1) were first discretized. With a slight abuse of notation, we also represent the discretized dynamics as

$${{{\bf{x}}}}(t)-{{{\bf{x}}}}(t-1)=f({{{\bf{x}}}}(t-1))+{{{{\bf{e}}}}}_{1}(t),$$
(4a)
$$\begin{array}{rcl}{{{\bf{x}}}}(0)&=&{{{{\bf{x}}}}}_{0},\qquad t=1,\ldots ,N\\ {{{\bf{y}}}}(t)&=&h({{{\bf{x}}}}(t))+{{{{\bf{e}}}}}_{2}(t),\qquad t=0,\ldots ,N\end{array}$$
(4b)

where the time index t is now an integer for simplicity of notation, but the discretization step size is always equal to 720 ms for the HCP data (equivalent to 1 TR) and 2 ms for the RAM data. This choice means that, for example, the map f in equation (4a) equals 1 time step multiplied by the map f in equation (1a), and T = 864 s in equation (1a) corresponds to N = 1,200 in equation (4a) for the fMRI data. Recall that in this general form, the noise signals e1(t) and e2(t) can have arbitrary statistics, including white or coloured power spectral densities. We then learned the dynamics in equation (4) using the following families of models. The hyperparameters used for each model are listed in Supplementary Table 1 (see ‘Hyperparameter selection’ below and Supplementary Figs. 7 and 8 for details).

Linear models with states at the BOLD/LFP level (‘Linear (dense)’, ‘Linear (sparse)’)

This is our simplest model, in which we let y(t) = x(t), modelling the dynamics directly at the BOLD/LFP level. This choice also allows for combining the noise signals e1(t) and e2(t) into a single noise signal e(t), which was then taken to be white. These simplifications reduce equation (4) to

$${{{\bf{y}}}}(t)-{{{\bf{y}}}}(t-1)=f({{{\bf{y}}}}(t-1))+{{{\bf{e}}}}(t).$$
(5)

If we further let f(y(t)) = Wy(t) be linear, then we get

$${{{\bf{y}}}}(t)-{{{\bf{y}}}}(t-1)={{{{W}}}}{{{\bf{y}}}}(t-1)+{{{\bf{e}}}}(t)$$
(6)

where W is an n-by-n matrix of effective connectivity between brain regions. We fit and compared this model both when W is dense and when it is sparse. The latter was motivated by two facts: (1) from a mechanistic perspective, an important property of brain networks and other large-scale complex networks is their sparsity; and (2) from a machine learning perspective, regularization and reducing the number of free parameters in a model can prevent overfitting and improve generalization. To promote sparsity, we used standard ℓ1-norm (LASSO) regularization with a λ hyperparameter tuned separately for fMRI and iEEG.
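For illustration, the sparse variant of equation (6) can be fit one row of W at a time, as in the following minimal MATLAB sketch (Y and lambda are hypothetical variable names; our exact fitting scripts may differ):

```matlab
% Y is a time-by-regions data matrix; lambda is the tuned LASSO weight.
dY    = diff(Y);                 % one-step differences y(t) - y(t-1)
Yprev = Y(1:end-1, :);           % regressors y(t-1)
n = size(Y, 2);
W = zeros(n, n);
for i = 1:n                      % one LASSO regression per region (row of W)
    W(i, :) = lasso(Yprev, dY(:, i), 'Lambda', lambda)';
end
```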

Linear AR models (‘AR-2 (sparse)’, ‘VAR-2 (sparse)’, ‘AR-3 (sparse)’, ‘VAR-3 (sparse)’, ‘AR-100 (sparse)’, ‘AR-100 (scalar)’)

Motivated by the long history of AR models in neuroscience24,72,73, here we extend equation (6) to

$$\begin{array}{rcl}{{{\bf{y}}}}(t)-{{{\bf{y}}}}(t-1)&=&{{{{W}}}}{{{\bf{y}}}}(t-1)+{{{{{D}}}}}_{2}{{{\bf{y}}}}(t-2)\\ &&+{{{{{D}}}}}_{3}{{{\bf{y}}}}(t-3)+\cdots \\ &&+{{{{{D}}}}}_{d}{{{\bf{y}}}}(t-d)+{{{\bf{e}}}}(t)\end{array}$$
(7)

for an ‘AR-d’ model. The number of lags d was tuned separately for fMRI and iEEG, and the matrix W is either made sparse using LASSO or enforced to be diagonal. Note that the latter results in n scalar AR models, one at each node, which are completely decoupled from each other. We restricted the matrices D2, D3,… to be diagonal in ‘AR’ models but not in full vector autoregressive (‘VAR’) models. In both cases, we used LASSO regularization to promote sparsity in the regressors, signified by the ‘(sparse)’ suffix in method identifiers, with the regularization hyperparameter λ chosen optimally and separately for each model (cf. ‘Hyperparameter selection’ below). In general, we found that λ is a moderately sensitive hyperparameter, more so for the whiteness of residuals than for R2 (see Supplementary Fig. 3 for an example).
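As an illustration, the regressors of the AR-d model for a single region i, under the diagonal restriction on D2,…,Dd, can be assembled as in the following sketch (again with hypothetical variable names):

```matlab
% Y is time-by-regions; fit region i of the AR-d model in equation (7).
t = (d+1):size(Y, 1);
ownLags = zeros(numel(t), d-1);
for p = 2:d
    ownLags(:, p-1) = Y(t-p, i);    % diagonal D_p terms: own lags only
end
Phi = [Y(t-1, :), ownLags];         % full y(t-1) plus region i's deeper lags
theta = lasso(Phi, Y(t, i) - Y(t-1, i), 'Lambda', lambda);
```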

Linear models with states at the neural level (‘Linear with HRF’, only applicable to fMRI data)

A standard step in the computational modelling of fMRI dynamics is to incorporate a model of the HRF and to separate the underlying neuronal variables from the observed BOLD signals. In this family of models, we thus separated the states x from the outputs y while keeping a one-to-one relationship between the two (m = n). We then let the latter be a filtered version of the former through the HRF. For generality, and given the natural and important variability of the HRF across the brain74,75, we allowed the HRF to vary regionally and learned it from the data for all regions in addition to the effective connectivity matrix W. Furthermore, we allowed both e1(t) and e2(t) to be coloured, with power spectral densities that could also differ between regions and were learned from data. Note that this choice included, as a special case, white e1(t) and e2(t). The result was a highly flexible linear model given by

$${{{\bf{x}}}}(t)-{{{\bf{x}}}}(t-1)={{{{W}}}}{{{\bf{x}}}}(t-1)+{{{{\mathcal{G}}}}}_{1}(q){\hat{{{{\bf{e}}}}}}_{1}(t)$$
(8a)
$$\begin{array}{rcl}{{{\bf{y}}}}(t)&=&{{{\mathcal{H}}}}(q){{{\bf{x}}}}(t)+{{{{\mathcal{G}}}}}_{2}(q){\hat{{{{\bf{e}}}}}}_{2}(t)\\ {{{\mathcal{H}}}}(q)&=&\mathop{\sum }\limits_{p=1}^{{n}_{h}}{{{\rm{diag}}}}({{{{{H}}}}}_{:,p}){q}^{-p}\end{array}$$
(8b)
$${{{{\mathcal{F}}}}}_{1}(q)={{{{I}}}}-{{{{\mathcal{G}}}}}_{1}^{-1}(q)=\mathop{\sum }\limits_{p=1}^{{n}_{\phi }}{{{\rm{diag}}}}({{{{{\Phi }}}}}_{:,p}){q}^{-p}$$
(8c)
$${{{{\mathcal{F}}}}}_{2}(q)={{{{I}}}}-{{{{\mathcal{G}}}}}_{2}^{-1}(q)=\mathop{\sum }\limits_{p=1}^{{n}_{\psi }}{{{\rm{diag}}}}({{{{{\Psi }}}}}_{:,p}){q}^{-p}$$
(8d)

Since LASSO regression produced the best results in our BOLD-level linear models, we used LASSO (with the regularization weight λ) to promote sparsity here as well. \({{{\mathcal{H}}}}(q)\) is a diagonal matrix whose (i, i) entry is a linear finite-impulse response (FIR) approximation of the HRF in region i, parameterized as in equation (8b) (q−1 is the standard delay operator, such that q−1x(t) = x(t − 1); see ref. 25). Similarly, \({{{{\mathcal{G}}}}}_{1}\) and \({{{{\mathcal{G}}}}}_{2}\) are diagonal filters, parameterized by the inverse FIR forms in equations (8c) and (8d). The matrices \({{{{{H}}}}}_{n\times {n}_{h}}\), \({{{{{\Phi }}}}}_{n\times {n}_{\phi }}\) and \({{{{{\Psi }}}}}_{n\times {n}_{\psi }}\) contain the learnable FIR parameters of \({{{\mathcal{H}}}}(q)\), \({{{{\mathcal{G}}}}}_{1}^{-1}(q)\) and \({{{{\mathcal{G}}}}}_{2}^{-1}(q)\), respectively. Since the state vector x(t) was not measured, we learned this model by iterating between state estimation and parameter estimation in an expectation-maximization (EM)-like manner. Note that the presence of these filters increases the effective state dimension of the system to \(n \max \{{n}_{\phi }+1,{n}_{h}+{n}_{\psi }\}\), considerably increasing the computational complexity of the state estimation step. The final model was taken from the EM iteration with the highest (training) R2.

Linear models with abstract data-driven states (‘Subspace’)

The previous model, precisely because of its extreme generality and flexibility, has a very large state dimension and is extremely difficult to fit. If we forgo the physiological interpretability of the states, then simpler and lower-dimensional models of the form

$$\begin{array}{rcl}{{{\bf{x}}}}(t)-{{{\bf{x}}}}(t-1)&=&{{{{W}}}}{{{\bf{x}}}}(t-1)+{{{{\bf{e}}}}}_{1}(t)\\ {{{\bf{y}}}}(t)&=&{{{{C}}}}{{{\bf{x}}}}(t)+{{{{\bf{e}}}}}_{2}(t)\\ {{{\rm{Cov}}}}\left(\left[\begin{array}{c}{{{{\bf{e}}}}}_{1}(t)\\ {{{{\bf{e}}}}}_{2}(t)\end{array}\right]\right)&=&\left[\begin{array}{cc}{{{{Q}}}}&{{{{M}}}}\\ {{{{{M}}}}}^{T}&{{{{R}}}}\end{array}\right]\end{array}$$

can be learned via the subspace identification method25. Unlike in the model above, the states here represent abstract low-dimensional regularities within the data, with a dimension nx that was chosen optimally for each data type. The hyperparameters r and s represent the amount of output and input–output history used, respectively, in the output Hankel matrix construction and subspace projection steps of the algorithm (see ref. 25, Ch. 10 for details). The noise sequences e1(t) and e2(t) were assumed to be white but could be correlated, and the covariance matrices Q, M and R were also learned from data. Note that this whiteness assumption on the noise sequences is without loss of generality in this case due to the subspace method’s ability to learn any non-white dynamics (that is, colour) of the noise as part of the abstract state dynamics.
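For illustration only, one standard off-the-shelf implementation of this method is n4sid in MATLAB’s System Identification Toolbox; the sketch below shows such a usage under the assumption that the horizons r and s map onto the toolbox’s ‘N4Horizon’ option, which need not match our exact implementation:

```matlab
% Y is time-by-channels, Ts the sampling time, nx the chosen state dimension.
data = iddata(Y, [], Ts);                     % output-only (time series) data
opt  = n4sidOptions('N4Horizon', [r, s, s]);  % prediction/past horizons
sys  = n4sid(data, nx, opt);                  % innovations-form state space
```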

Nonlinear NMMs (‘NMM’, ‘NMM with HRF’)

Learning of the models above, except for the ‘Linear with HRF’, involves a convex optimization that can be efficiently solved to find its unique global optimum. In contrast, the learning of nonlinear models is less straightforward. Recently, ref. 76 developed an algorithm called MINDy that uses state-of-the-art optimization techniques for learning an NMM of the form

$$\begin{array}{rcl}{{{\bf{x}}}}(t)-{{{\bf{x}}}}(t-1)&=&({{{{W}}}}{\psi }_{{{{\boldsymbol{\alpha }}}}}({{{\bf{x}}}}(t-1))-{{{{D}}}}{{{\bf{x}}}}(t-1)){\Delta }_{T}\\ &&+{{{{\bf{e}}}}}_{1}(t)\\ {{{\bf{y}}}}(t)&=&{{{\mathcal{H}}}}(q){{{\bf{x}}}}(t)+{{{{\bf{e}}}}}_{2}(t)\end{array}$$

using rsfMRI data. In this model, x(t) has the same dimension as y(t) (one neural mass per brain region), ΔT is the sampling time, W is a sparse connectivity matrix, D is a diagonal self-decay matrix, ψα() is an element-wise sigmoidal nonlinearity whose steepness is determined by each element of the vector α (which is also the same size as x), and \({{{\mathcal{H}}}}(q)\) is a scalar linear HRF that is the same and fixed a priori for all regions. The associated toolbox that we used allows the user to either deconvolve y(t) using a canonical HRF to obtain the state x(t) (‘NMM with HRF’), or set \({{{\mathcal{H}}}}(q)=1\) and directly fit the model to y(t) (‘NMM’). We used both methods for fMRI data but only the latter for iEEG. Since the MINDy algorithm was originally tuned for fMRI, we re-tuned its regularization hyperparameters λ1,…,λ4 for use with iEEG data (see ‘Hyperparameter selection’ below).
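For illustration, a one-step-ahead prediction under such a model takes the following form; note that the sigmoid below is a generic illustrative choice, not MINDy’s exact parameterization of ψα:

```matlab
% x: n-by-1 state, W: n-by-n connectivity, Dvec: n-by-1 self-decay terms,
% alpha: n-by-1 sigmoid steepness, DeltaT: sampling time.
psi = @(x, alpha) 1 ./ (1 + exp(-alpha .* x)) - 0.5;  % illustrative sigmoid
xpred = x + (W * psi(x, alpha) - Dvec .* x) * DeltaT; % x(t) from x(t-1)
```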

Nonlinear models via MLP DNNs (‘DNN (MLP)’)

Here we used a model of the form in equation (5) for fMRI and trained a ReLU MLP DNN to approximate the function f(). The DNN consists of an input layer; D ReLU layers, each preceded by a fully connected layer and a batch normalization layer and succeeded by a 50% dropout layer; a final fully connected layer; and the output layer. Given the importance of AR lags in the modelling of iEEG, for this modality we generalized equation (5) as

$${{{\bf{y}}}}(t)-{{{\bf{y}}}}(t-1)=f({{{\bf{y}}}}(t-1),\ldots ,{{{\bf{y}}}}(t-d))+{{{\bf{e}}}}(t)$$
(9)

and similarly approximated f() using an MLP DNN. We used MATLAB’s Deep Learning Toolbox for the training and evaluation of the DNN and tuned the depth D and width W of the DNN separately for fMRI and iEEG (see ‘Hyperparameter selection’ below).
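As an illustration, the layer stack can be assembled roughly as follows (a sketch, not our exact training script; the width is named Wd to avoid a clash with the connectivity matrix W, and Yprev and dY are observations-by-features matrices of y(t − 1) and y(t) − y(t − 1)):

```matlab
layers = featureInputLayer(n);              % n input channels
for k = 1:D                                 % D ReLU blocks
    layers = [layers
              fullyConnectedLayer(Wd)
              batchNormalizationLayer
              reluLayer
              dropoutLayer(0.5)];           % 50% dropout
end
layers = [layers; fullyConnectedLayer(n); regressionLayer];
net = trainNetwork(Yprev, dY, layers, trainingOptions('adam'));
```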

Nonlinear models via convolutional DNNs (‘DNN (CNN)’)

Given the recent success of CNNs in complex learning problems, we also included a model similar to ‘DNN (MLP)’ but with a CNN approximating the function f(). The network consists of an input layer; D one-dimensional convolutional layers (convolving over time using nfilt filters of size lfilt), each succeeded by a batch normalization layer, a ReLU layer and an average pooling layer with a pool size of npool; a final dropout layer with probability pdrop; a fully connected layer; and the output layer. Spatial convolution was not included in the model, as is standard in modelling dynamical systems with CNNs, due to the arbitrary nature of channel numbering. Temporal convolution is nevertheless the basis of this model, and we thus considered d > 1 autoregressive lags for both fMRI and iEEG.

Nonlinear models via long short-term memory neural networks (‘LSTM (IIR)’, ‘LSTM (FIR)’)

The above DNN models are inherently static (that is, feedforward), whereas various recurrent neural network architectures have also been proposed for directly modelling dynamical systems. One of the most successful such architectures is the LSTM, which we implemented here in two forms: infinite impulse response (IIR) and finite impulse response (FIR). These two forms correspond, respectively, to the common sequence-to-sequence and sequence-to-one approaches to modelling time series with LSTMs. In both cases, the network consists of an input layer, a layer of W LSTM units, a fully connected layer and an output layer. The difference is that in the IIR model, the network is initialized once at time 0 and run forward, continuously receiving y(t − 1) as input and generating y(t) − y(t − 1) as output. Each output, therefore, depends on the entire history of the inputs. In the FIR model, on the other hand, the model is initialized and run forward once for each timepoint t, receiving only y(t − d),…,y(t − 1) as input and predicting y(t) − y(t − 1) as output.
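A minimal sketch of the corresponding layer stack (again an illustration, not our exact script; nUnits denotes the number of LSTM units W):

```matlab
% IIR (sequence-to-sequence) variant; for the FIR (sequence-to-one)
% variant, set 'OutputMode' to 'last' and feed d-sample input windows.
layers = [sequenceInputLayer(n)                      % n channels
          lstmLayer(nUnits, 'OutputMode', 'sequence')
          fullyConnectedLayer(n)
          regressionLayer];
```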

Nonlinear manifold-based models (‘Manifold’)

Consider equation (5) or (9) and assume, for simplicity, that f is differentiable. Each of these systems of equations consists of n scalar equations, each of which defines a manifold (surface) in an (n + 1)-dimensional (for equation (5)) or (nd + 1)-dimensional (for equation (9)) space. Various methods have been developed in the machine learning and system identification literature77,78 and used in computational neuroscience20,79 to capitalize on the fact that in a small vicinity of a point, the manifold can be approximated by a linear hyperplane tangential to it at that point. Here we used the simple method of local polynomial modelling of order 1 (ref. 77). To explain this method, first consider the simpler model in equation (5). For each test time tℓ, we approximate the function f() as a linear function in the vicinity of y(tℓ − 1), that is,

$$f({{{\bf{z}}}})\simeq {{{{\bf{c}}}}}^{\ell }+{{{{{W}}}}}^{\ell }\left[{{{\bf{z}}}}-{{{\bf{y}}}}({t}^{\,\ell }-1)\right]$$
(10)

The constant vector cℓ and matrix Wℓ were learned from training data (separately for each test point) as follows. Each training point y(tm − 1) was weighted according to its distance to y(tℓ − 1), that is,

$${k}^{\ell ,m}=\exp \left(-\frac{\parallel {{{\bf{y}}}}({t}^{m}-1)-{{{\bf{y}}}}({t}^{\,\ell }-1){\parallel }^{2}}{2{h}^{2}}\right)$$
(11)

where the hyperparameter h controls how local or global the model is. These weights were then used in a weighted least squares estimation,

$$\begin{array}{rcl}\left[\begin{array}{cc}{{{{\bf{c}}}}}^{\ell }&{{{{{W}}}}}^{\ell }\end{array}\right]&\simeq &\left(\mathop{\sum}\limits_{{t}^{m}}[{{{\bf{y}}}}({t}^{m})-{{{\bf{y}}}}({t}^{m}-1)]{k}^{\,\ell ,m}{{{{{\boldsymbol{\varphi }}}}}^{\ell ,m}}^{T}\right)\\ &&\cdot {\left(\mathop{\sum}\limits_{{t}^{m}}{{{{\boldsymbol{\varphi }}}}}^{\ell ,m}{k}^{\,\ell ,m}{{{{{\boldsymbol{\varphi }}}}}^{\ell ,m}}^{T}\right)}^{{\dagger} }\end{array}$$

where † denotes the pseudo-inverse and

$${{{{\boldsymbol{\varphi }}}}}^{\ell ,m}=\left[\begin{array}{c}1\\ {{{\bf{y}}}}({t}^{m}-1)-{{{\bf{y}}}}({t}^{\ell }-1)\end{array}\right].$$
(12)

Note that only the computed cℓ was ultimately used, since substituting z = y(tℓ − 1) in equation (10) gives

$$f({{{\bf{y}}}}({t}^{\,\ell }-1))\simeq {{{{\bf{c}}}}}^{\ell }$$

which was used for computing the one-step-ahead prediction at time tℓ. All the details remain the same when applying this method to equation (9) for iEEG data, except that y(tℓ − 1) and y(tm − 1) in equations (10), (11) and (12) are replaced with \({[{{{\bf{y}}}}({t}^{\,\ell }-1)^{T}\cdots {{{\bf{y}}}}({t}^{\,\ell }-d)^{T}]{\,}^{T}}\) and \({[{{{\bf{y}}}}{({t}^{m}-1)}^{T}\cdots {{{\bf{y}}}}({t}^{m}-d)^{T}]}{\,}^{T}\), respectively. We tuned h separately for fMRI and iEEG (see ‘Hyperparameter selection’ below), giving rise to values so large that they essentially result in a globally linear model (Supplementary Fig. 1). The value of h was independently optimized for the computations reported in Fig. 4i,j, as described below.
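For concreteness, the weighted least squares solution above can be computed at a single test point in a few lines of MATLAB (a sketch with hypothetical variable names; Ytr holds the training data, ytest is a 1-by-n row vector holding y(tℓ − 1) and h is the bandwidth):

```matlab
dYtr  = diff(Ytr);                                 % targets y(tm) - y(tm-1)
Yprev = Ytr(1:end-1, :);                           % training points y(tm-1)
k   = exp(-sum((Yprev - ytest).^2, 2) / (2*h^2));  % kernel weights, eq. (11)
Phi = [ones(size(Yprev, 1), 1), Yprev - ytest];    % regressors, eq. (12)
cW  = (dYtr' * (Phi .* k)) * pinv(Phi' * (Phi .* k));  % [c W], weighted LS
fhat = cW(:, 1);              % c: the predicted increment y(t) - y(t-1)
```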

Nonlinear MMSE models (optimal) (‘MMSE (pairwise)’, ‘MMSE (scalar)’)

The models in equation (5) (for fMRI) or equation (9) (for iEEG) ultimately define a stochastic mapping from y(t − 1) or (y(t − 1),…,y(t − d)) to y(t) − y(t − 1), such that observing the values of the former provides information to predict the latter. It is not hard to show that for two random variables U and V, the optimal (that is, minimum variance) prediction of U given V = v is its conditional expectation \(\hat{u}={{{\rm{E}}}}[U| V=v]\), known as the MMSE prediction80. Therefore, the optimal prediction of y(t) − y(t − 1) given y(t − 1) or (y(t − 1),…,y(t − d)) is E[y(t) − y(t − 1) | y(t − 1)] or E[y(t) − y(t − 1) | y(t − 1),…,y(t − d)], respectively. Due to its optimality, the MMSE prediction provides a theoretical upper bound on the achievable accuracy of ‘any’ nonlinear model. The difficulty in calculating this estimate, however, lies in the estimation of the conditional distribution of y(t) − y(t − 1) given an observation of y(t − 1) or (y(t − 1),…,y(t − d)). Without imposing additional assumptions (for example, linearity or Gaussianity), this task is not feasible in n ≈ 100 dimensions with our limited data points per recording segment. Estimating this distribution is, however, feasible (1) on a pairwise basis, giving us the optimal predictions E[yi(t) − yi(t − 1) | yj(t − 1)] for all pairs i, j = 1,…,n, or (2) on a scalar AR basis, yielding the optimal predictions E[yi(t) − yi(t − 1) | yi(t − 1),…,yi(t − d)] separately for each i = 1,…,n. We used the former for fMRI and the latter for iEEG. To estimate this conditional distribution for fMRI, we used a Gaussian window with a standard deviation equal to β times the range of yj(t) in the training data to detect the training points close to each test yj(t − 1), and then used an N-point weighted histogram to estimate the (conditional) distribution. More precisely, for any pair of i and j and any given test time tℓ, we first computed the Gaussian weight

$$\begin{array}{rcl}{w}_{j}^{\ell ,m}&=&\exp \left(-\frac{{[{\,y}_{j}({t}^{m}-1)-{y}_{j}({t}^{\ell }-1)]}^{2}}{2{\sigma }^{2}}\right)\\ \sigma &=&\beta \left(\mathop{\max }\limits_{j,{t}^{m}}{y}_{j}({t}^{m})-\mathop{\min }\limits_{j,{t}^{m}}{y}_{j}({t}^{m})\right)\end{array}$$

for all ‘training’ times tm. Note that σ was obtained using only training data. Then, we divided the interval

$$\left[\mathop{\min }\limits_{i,{t}^{m}}{y}_{i}({t}^{m})-{y}_{i}({t}^{m}-1),\,\mathop{\max }\limits_{i,{t}^{m}}{y}_{i}({t}^{m})-{y}_{i}({t}^{m}-1)\right]$$

into N equal bins and constructed a weighted histogram of all the training yi(tm) − yi(tm − 1), where each yi(tm) − yi(tm − 1) was counted with weight \({w}_{j}^{\ell ,m}\). We then normalized this histogram by dividing all bin values by their sum so as to obtain a well-defined (conditional) probability distribution \({p}_{i,\,j}^{\ell }(\Delta {y}_{i}^{k})\), and estimated the expected value of this probability distribution as

$${{{\rm{E}}}}[{\,y}_{i}(t)-{y}_{i}(t-1)| {\,y}_{j}(t-1)]\simeq \mathop{\sum }\limits_{k=1}^{N}\Delta {y}_{i}^{k} {p}_{i,\,j}^{\ell }(\Delta {y}_{i}^{k})$$

where \(\Delta {y}_{i}^{k}\) denotes the centre of the kth histogram bin. The case for iEEG is similar, except that the Gaussian weights were computed on the basis of d-dimensional Euclidean distances given that we conditioned on d-dimensional vectors. In both cases, β and N are hyperparameters that were tuned separately for fMRI and iEEG (see ‘Hyperparameter selection’ below).
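For illustration, the weighted-histogram estimate of E[yi(t) − yi(t − 1) | yj(t − 1)] at a single test point, simplified here to one pair (i, j), can be computed as follows (a sketch with hypothetical variable names; yj_tr and dyi_tr are column vectors of the training yj(tm − 1) and yi(tm) − yi(tm − 1), sigma is defined as above and Nbins plays the role of N):

```matlab
w = exp(-(yj_tr - yj_test).^2 / (2*sigma^2));     % Gaussian weights
edges   = linspace(min(dyi_tr), max(dyi_tr), Nbins + 1);
centres = (edges(1:end-1) + edges(2:end)) / 2;    % bin centres
bin = discretize(dyi_tr, edges);                  % bin membership
p = accumarray(bin, w, [Nbins, 1]);               % weighted histogram
p = p / sum(p);                                   % conditional distribution
dyi_hat = centres * p;                            % conditional expectation
```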

Zero model (‘Zero’)

So far, we have discussed several families of models. Comparisons among them will provide a clear picture of which family provides the best fit to the data ‘relative to’ the others. Note however that this process does not necessarily imply that the best model is good in any absolute sense. In other words, all models may be estimating \(\hat{{{{\bf{y}}}}}(t| t-1)\) at chance level or lower. Therefore, we also considered the zero model (also known as the zero-order hold, naive model or random walk)

$${{{\bf{y}}}}(t)-{{{\bf{y}}}}(t-1)={{{\bf{e}}}}(t)$$

with the trivial estimate \(\hat{{{{\bf{y}}}}}(t| t-1)={{{\bf{y}}}}(t-1)\). Note that this expression corresponds to equation (5) with f(y(t − 1)) = 0 and is only meant to provide a baseline for comparison, not to act as a formal model itself. Also note that this model is different from and often performs better than the constant predictor \(\hat{{{{\bf{y}}}}}(t| t-1)=\bar{{{{\bf{y}}}}}\) which constitutes the denominator of R2.

Hyperparameter selection

For all models that involve the choice of a design hyperparameter, we simultaneously optimized over all the hyperparameters using stochastic gradient descent (SGD) with minibatches, separately for fMRI and iEEG. Let Nparam denote the number of hyperparameters in any of the models. Starting from an initial estimate of the hyperparameter vector, in each iteration, \({3}^{{N}_{{{{\rm{param}}}}}}\) hyperparameter vectors were generated, constituting a hypercubic mesh around the current hyperparameter estimate. For integer-valued hyperparameters, we moved 1 point in each direction, while for real-valued hyperparameters, we moved 10−6 units. Using a minibatch of randomly selected data segments, the mean-over-minibatch of the median-over-regions of the model R2 was computed and maximized over the mesh. The random minibatch selection was independent between mesh points and between iterations. For integer-valued hyperparameters, their value was updated to that of the maximizing mesh point. For real-valued hyperparameters, a gradient-ascent step was taken in the direction of the largest R2. The process was repeated until the hyperparameters stopped consistently increasing or decreasing and instead hovered around a steady-state value (which always happens due to the stochastic nature of SGD) and/or the R2 stopped consistently increasing.

For the ‘DNN (CNN)’ and ‘Subspace’ methods (the latter only in iEEG data), the aforementioned procedure was infeasibly slow, due to the large number of hyperparameters of the CNN model and, for the subspace method, due to cluster jobs frequently becoming hung and needing to be killed and restarted. As such, we slightly modified the above procedure as follows. In each iteration of the SGD, instead of generating \({3}^{{N}_{{{{\rm{param}}}}}}\) search directions at all the points of a hypercubic mesh, we generated 2Nparam + 1 search directions: one at the current optimum estimate and two at the current estimate ±1ek (for integer-valued) or ±10−6ek (for real-valued hyperparameters) for each kth hyperparameter, where ek is the kth canonical unit vector (all zeros except one 1 at the kth location); a sketch of both mesh constructions is given below. The remaining details were similar to the general case above. Finally, in all cases, we used 100 participants for fMRI (out of the total of 700) and 1,500 segments for iEEG (out of the total of 8,490) for hyperparameter tuning and then removed them from the subsequent model fitting and validation experiments to ensure a lack of overfitting to the hyperparameters.
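For concreteness, the two ways of generating candidate hyperparameter vectors around a current integer-valued estimate theta (1-by-Nparam) can be sketched as follows (hypothetical variable names; real-valued hyperparameters would use steps of 10−6 instead of 1):

```matlab
offsets  = dec2base(0:3^Nparam - 1, 3) - '1';  % rows of {-1, 0, 1}^Nparam
meshFull = theta + offsets;                    % full 3^Nparam hypercubic mesh
E = eye(Nparam);
meshCheap = [theta; theta + E; theta - E];     % 2*Nparam + 1 search directions
```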

The hyperparameter and R2 values throughout the process are shown in Supplementary Figs. 7 and 8 for fMRI and iEEG data, respectively, and the final values of the hyperparameters selected for each model are reported in Supplementary Table 1. Note that the initial hyperparameter estimates were chosen on the basis of previous experience, not randomly, which is why they are often very close to or the same as the final values.

Cross-validation

For the comparisons of HCP data in Fig. 2, we performed the cross-validation as follows, with slightly different procedures for brain-wide and pairwise methods. For the brain-wide methods, for each of the 700 participants, we split each of the 4 resting scans of that participant into 2 halves, giving a total of 8 segments, each of length 600 samples. All of our methods were then applied using an 8-fold cross-validation where each time, 1 of the 8 segments was used for testing and the remaining 7 were used for training. For the pairwise methods, we were forced to lower the sample size due to the extremely high computational complexity of the MMSE predictor. Therefore, instead of each of the above 8 segments (per participant), we used only the second quarter of each segment, giving us still an 8-fold cross-validation but on segments of length 150 samples each.

For the comparisons of RAM data in Fig. 3, we first split each of the 283 5-min recordings into 8,490 10-s segments. Even though longer segments would in principle benefit model fitting, 10 s segments ensured that all of our methods could run using the 64 GB of memory available per node on the CUBIC cluster. From the 8,490 segments, those that contained any not-a-number (NaN) entries (316 segments) or for which the subspace method produced NaN predictions (30 segments, due to the bad conditioning of the Φ matrix therein) were removed from further analysis. Since each recording was already split into 30 segments, and due to the large number of segments, we performed only a single-fold cross-validation on each segment, with the first 8 s used for training and the final 2 s used for testing.

Multivariate test of whiteness

A standard measure of the goodness of fit in the prediction error method is the whiteness of residuals, measuring the extent to which all temporal structure (that is, dynamics) in the data has been captured by the model. Note that a multivariate time series e(t) is ‘white’ if it has no statistical dependence across time (that is, e(s) and e(t) are independent if s ≠ t), even though it can have arbitrary statistical dependence across channels (that is, ei(t) and ej(t) can be dependent at the same time t). Parametric (χ2) statistical tests have been devised for multivariate whiteness, such as the classical Box–Pierce portmanteau test81 and its modifications82,83. Under strong assumptions, all of these tests have a statistic Q (defined slightly differently in each) that is asymptotically (in the limit of infinite samples) χ2 distributed. In our datasets, however, we found that Q was not χ2 distributed; we therefore used randomization to generate the true null distribution of Q, shuffling the time indices of e(t) 100 times, computing Q for each shuffle and taking the 95th percentile of the randomized Q values as the significance threshold Qthr. We used the original definition of Q (ref. 81),

$$Q=(N-M\,)\mathop{\sum }\limits_{i=1}^{M}{{{\rm{tr}}}}\left({\hat{{{{{R}}}}}}_{{{{\bf{e}}}}}{(i)}^{T}{\hat{{{{{R}}}}}}_{{{{\bf{e}}}}}{(0)}^{-1}{\hat{{{{{R}}}}}}_{{{{\bf{e}}}}}(i){\hat{{{{{R}}}}}}_{{{{\bf{e}}}}}{(0)}^{-1}\right)$$

where N is the number of (test) samples, M is the number of cross-correlation lags, and

$${\hat{{{{{R}}}}}}_{{{{\bf{e}}}}}(i)=\frac{1}{N-M}\mathop{\sum }\limits_{t=0}^{N-M-1}{{{\bf{e}}}}(t+i){{{\bf{e}}}}{(t)}^{T},\qquad i=0,1,\ldots ,M$$

is a finite-sample estimate of the cross-correlation matrix between channels of e(t) at lag i. Since in practice \({\hat{{{{{R}}}}}}_{{{{\bf{e}}}}}(0)\) may be singular or near-singular, we used the pseudo-inverse of \({\hat{{{{{R}}}}}}_{{{{\bf{e}}}}}(0)\) instead of its inverse in computing Q. Finally, only in the case of ‘pairwise’ fMRI models where the residuals are inherently univariate did we use the simpler χ2 test of whiteness for univariate time series (ref. 25 sec. 16.6).
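For illustration, Q can be computed from a residual matrix in a few lines of MATLAB (a sketch; E is a time-by-channels matrix of residuals):

```matlab
Nt = size(E, 1);
R  = cell(M + 1, 1);
for i = 0:M                                % finite-sample estimate of R_e(i)
    R{i+1} = E(1+i:Nt-M+i, :)' * E(1:Nt-M, :) / (Nt - M);
end
R0inv = pinv(R{1});                        % pseudo-inverse of R_e(0)
Q = 0;
for i = 1:M
    Q = Q + trace(R{i+1}' * R0inv * R{i+1} * R0inv);
end
Q = (Nt - M) * Q;
% Null distribution: recompute Q after shuffling rows, E(randperm(Nt), :).
```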

Nonlinear predictors used for the analysis of the linearizing effects of macroscopic dynamics

Our discussion of linear and nonlinear models and their hyperparameters so far applies to the comparisons shown in Figs. 2 and 3 on neuroimaging time series. In our numerical analysis of the linearizing effects of macroscopic brain dynamics in Fig. 4, we also constructed linear and nonlinear predictors and computed their R2. The linear predictor was always a simple linear regression model, while the nonlinear predictor was the MMSE predictor for two-dimensional predictions (Fig. 4a–h) and the manifold-based predictor for higher-dimensional predictions (Fig. 4i,j). The MMSE predictor was as described above, except that β was adjusted as 0.02 + 0.02/SNR for Fig. 4g,h. For the manifold-based predictor, we used a Gaussian window and, in every iteration, swept logarithmically over its bandwidth hyperparameter h from 0.1 to 10, choosing the value of h that gave the largest R2. Fig. 4j (left panel) shows the average of the resulting optimal h over 100 iterations.

Simulations involving the Izhikevich model

To show the linearizing effects of spatiotemporal averaging on a model whose ground truth is known to be nonlinear, we generated simulated time series from the Izhikevich model23

$$\begin{array}{rcl}\dot{v}(t)&=&0.04v{(t)}^{2}+5v(t)+140-u(t)+I\\ \dot{u}(t)&=&a(bv(t)-u(t))\\ (v(t),u(t))&\leftarrow &(c,u(t)+d)\quad \,{{\mbox{if}}}\,\quad v(t)\ge 30\end{array}$$

with a = 0.02, b = 0.2, c = −65, d = 2 and I = 7. We discretized the model using the Euler method with a 0.1 ms sampling step.
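For illustration, this Euler simulation takes the following form (the initial conditions v = c and u = bc are an assumption of the sketch, not taken from our exact scripts):

```matlab
a = 0.02; b = 0.2; c = -65; d = 2; I = 7; dt = 0.1;   % dt in ms
T = 1e4;  v = zeros(T, 1);  u = zeros(T, 1);
v(1) = c;  u(1) = b * c;                              % assumed initial state
for t = 1:T-1
    v(t+1) = v(t) + dt * (0.04*v(t)^2 + 5*v(t) + 140 - u(t) + I);
    u(t+1) = u(t) + dt * a * (b*v(t) - u(t));
    if v(t+1) >= 30                                   % spike: reset
        v(t+1) = c;  u(t+1) = u(t+1) + d;
    end
end
```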

Estimation of rsfMRI SNR

Here we describe our method for estimating the scanner noise in rsfMRI time series and the resulting SNR reported in Supplementary Fig. 4. From the 700 participants used in the study, 50 were selected uniformly at random, and for each selected participant, 1 of their 4 rest scans was also selected uniformly at random. The following was then performed for each of the 50 participant-scans. The rest scan was motion corrected using intramodal linear registration with 6 degrees of freedom (in general, we kept the amount of preprocessing as minimal as possible throughout the SNR estimation algorithm, since each preprocessing step often involves averaging and/or interpolation that could bias the SNR estimates). The first volume of the motion-corrected rest scan was visually inspected, and 10 voxels outside of the head were selected. Due to the unavailability of phantom scans, we used these voxels to estimate the scanner noise, noting that the two approaches (phantom scans and outside-head voxels) have been shown to yield consistent noise estimates84. For each of the 10 voxels, we calculated the temporal variance of the corresponding time series and averaged the results, providing an estimate of the scanner noise variance \({\sigma }_{N}^{2}\). To estimate the signal power, a grey matter mask was extracted using each participant’s T1 scan and linearly registered back to the participant’s motion-corrected rest scan. We then computed the temporal variance of each grey matter voxel and averaged the results, yielding an estimate of the combined signal and noise variance. Assuming statistical independence between the scanner noise and the participants’ BOLD activity, this combined variance is precisely the sum \({\sigma }_{S}^{2}+{\sigma }_{N}^{2}\) of the signal and noise variances. The SNR was then calculated as σS/σN. Note that this process is inherently conservative and provides an upper bound on the SNR because, for instance, it does not count physiological signals as ‘noise’. Therefore, the ratio between the power of signals of neural origin and that of all other signals contributing to rsfMRI time series may be much lower than the estimated 6.5. An SNR of 6.5, however, is still low enough to yield a notable linearizing effect, highlighting the importance of measurement noise in downstream computational modelling.
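For concreteness, given the outside-head and grey matter voxel time series, the final computation reduces to a few lines (a sketch with hypothetical variable names):

```matlab
% Vout: time-by-10 outside-head voxel time series;
% Vgm:  time-by-#voxels grey matter voxel time series.
sigmaN2 = mean(var(Vout));            % scanner-noise variance
sigmaS2 = mean(var(Vgm)) - sigmaN2;   % signal variance (independence assumed)
SNR = sqrt(sigmaS2 / sigmaN2);        % SNR = sigma_S / sigma_N
```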

Citation diversity statement

Recent work in several fields of science has identified a bias in citation practices such that papers from women and other minority scholars are undercited relative to the number of such papers in the field85,86,87,88,89. Here we proactively sought to choose references that reflect the diversity of the field in thought, form of contribution, gender, race, ethnicity and other factors. First, we obtained the predicted gender of the first and last author of each reference by using databases that store the probability of a first name being carried by a woman89,90. By this measure (and excluding self-citations to the first and last authors of our current paper), our references contained 8.73% woman (first)/woman (last), 18.87% man/woman, 18.34% woman/man and 54.06% man/man. This method is limited in that (1) names, pronouns and social media profiles used to construct the databases may not, in every case, be indicative of gender identity and (2) it cannot account for intersex, non-binary or transgender people. Second, we obtained the predicted racial/ethnic category of the first and last author of each reference using databases that store the probability of a first and last name being carried by an author of colour91,92. By this measure (and excluding self-citations), our references contained 14.39% author of colour (first)/author of colour (last), 15.47% white author/author of colour, 23.76% author of colour/white author and 46.37% white author/white author. This method is limited in that (1) names and Florida Voter Data used to make the predictions may not be indicative of racial/ethnic identity and (2) it cannot account for Indigenous and mixed-race authors, or those who may face differential biases due to the ambiguous racialization or ethnicization of their names. We look forward to future work that could help us to better understand how to support equitable practices in science.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.