Sparse representations of high dimensional neural data

Conventional Vector Autoregressive (VAR) modelling methods applied to high dimensional neural time series data result in noisy solutions that are dense or have a large number of spurious coefficients. This reduces the speed and accuracy of auxiliary computations downstream and inflates the time required to compute functional connectivity networks by a factor that is at least inversely proportional to the true network density. As these noisy solutions have distorted coefficients, thresholding them as per some criterion, statistical or otherwise, does not alleviate the problem. Obtaining a sparse representation of such data is therefore important: it provides an efficient representation of the data and facilitates its further analysis. We propose a fast Sparse Vector Autoregressive Greedy Search (SVARGS) method that works well for high dimensional data, even when the number of time points is relatively low, by incorporating only statistically significant coefficients. In numerical experiments, our method shows high accuracy in recovering the true sparse model. The relative absence of spurious coefficients permits accurate, stable and fast evaluation of derived quantities such as power spectrum, coherence and Granger causality. Consequently, sparse functional connectivity networks can be computed, in a reasonable time, from data comprising tens of thousands of channels/voxels. This enables a much higher resolution analysis of functional connectivity patterns and community structures in such large networks than is possible using existing time series methods. We apply our method to EEG data, where computed network measures and community structures are used to distinguish emotional states, as well as to ADHD fMRI data, where it is used to distinguish children with ADHD from typically developing children.


Neuronal Networks
CoCoMac is a coordinate-independent collection of annotations compiled from "over 2,500 anatomical tracer injections" and from "over 400 published experimental studies" that captures 10,681 anatomical connectivity relations and 16,712 overlap (or mapping) relations between pairs of annotated brain regions. Based on these relations, Modha et al. construct a network hierarchy consisting of 383 brain regions and 6,602 directed edges spanning the entire Macaque brain. The undirected version of the network consists of 5,208 edges, which gives an average (undirected) degree of about 27.

VAR modeling: standard method using Yule-Walker equations
We briefly summarize the standard method of estimating the VAR coefficients. The maximum likelihood method is used to derive estimates for the coefficients A_i. Suppose that the time series data Ŷ, composed of k variables and N observations at regular time intervals, can be modeled by Main [2]. Denoting the observed values of ε_t by e_t, the log-likelihood l of the observed noise (see [1, pg 346, section 5.4.1]) is given by:

l(e_{p+1}, \ldots, e_N, \Sigma) = -\frac{(N-p)k}{2}\log(2\pi) - \frac{N-p}{2}\log|\Sigma| - \frac{1}{2}\sum_{t=p+1}^{N} e_t^T \Sigma^{-1} e_t \qquad (2.1.1)

Maximizing (2.1.1) is clearly equivalent to minimizing the weighted sum of squares S:

S = \sum_{t=p+1}^{N} e_t^T \Sigma^{-1} e_t \qquad (2.1.2)

The invertible linear relationship between the variables \{\epsilon_{p+1}, \ldots, \epsilon_N\} and \{Y_{p+1}, \ldots, Y_N\} implies that there is a one-to-one correspondence between the observed values Ŷ of Y and the observed values e of ε, and also that the Jacobian of the transformation from e to Ŷ is constant (equal to 1 in this case). Hence choosing coefficients A_i that maximize the log-likelihood of e is equivalent to maximizing the log-likelihood of Ŷ. Therefore, to minimize (2.1.2) with respect to the A_i, substitute e_t = y_t - \sum_{i=1}^{p} A_i y_{t-i} into (2.1.2) to get:

S = \sum_{t=p+1}^{N} \Big(y_t - \sum_{i=1}^{p} A_i y_{t-i}\Big)^T \Sigma^{-1} \Big(y_t - \sum_{i=1}^{p} A_i y_{t-i}\Big) \qquad (2.1.3)

Minimizing (2.1.3) for the A_i leads to the system of equations:

\sum_{i=1}^{p} A_i\, R(j-i) = R(j), \qquad j = 1, \ldots, p \qquad (2.1.4)

where:

R(j) = \frac{1}{N-p} \sum_{t} y_t\, y_{t-j}^T

The k × k matrices R(j) are the usual estimates for the autocovariance function of the process Y_t. The equations (2.1.4) are the Yule-Walker equations with estimated autocovariances.
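As a concrete illustration, the least-squares form of this estimate can be sketched as follows. This is a minimal sketch, not the paper's implementation: it solves the normal equations directly on the lagged design matrix via `numpy.linalg.lstsq`, and the function name is ours.

```python
import numpy as np

def fit_var_ls(Y, p):
    """Least-squares fit of a VAR(p) model.

    Y : (N, k) array of observations. Returns the coefficient matrices
    A[0..p-1], each (k, k), so that Y[t] ~ sum_i A[i] @ Y[t-1-i] + e[t].
    """
    N, k = Y.shape
    # Stack lagged values as predictors: row t holds [Y[t-1], ..., Y[t-p]].
    X = np.hstack([Y[p - i - 1:N - i - 1] for i in range(p)])  # (N-p, k*p)
    T = Y[p:]                                                  # (N-p, k)
    # Solve the normal equations X^T X B = X^T T (cf. (2.1.4)).
    B, *_ = np.linalg.lstsq(X, T, rcond=None)
    # Row-block i of B, transposed, is the coefficient matrix at lag i+1.
    return [B[i * k:(i + 1) * k].T for i in range(p)]
```

On data simulated from a known sparse VAR(1), the recovered coefficient matrix matches the generating one up to sampling noise.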
The solution obtained from (2.1.4) above is often not satisfactory, since for large systems with only moderate amounts of data, the amount of noise in the coefficients makes identification of the actual non-zero coefficients difficult. Moreover, variations on this method that select the best model order using information criteria such as the Akaike Information Criterion (AIC) or the Schwarz Bayes Criterion (SBC) all lead to the same issue of noise in the coefficients.
For the procedures to be described, we need to use regression with the lagged values of Ŷ as predictors.
Below, we show that this kind of regression is valid and that it generalizes the Yule-Walker equations.

Coefficients with constraints via regression
The coefficients A_i obtained from (2.1.4) are also exactly the coefficients that are obtained if we use linear least-squares regression on the data with the lagged values of Ŷ as predictors. More precisely, if y^{(r)}, r = 1, 2, \ldots, k, are the individual components of Y, set:

x^{(r)}_j(t) = y^{(r)}(t - j), \qquad j = 1, \ldots, p \qquad (2.2.1)

X = \{x^{(r)}_j : r = 1, \ldots, k;\; j = 1, \ldots, p\} \qquad (2.2.2)

If for each r = 1, 2, \ldots, k, we perform a linear least-squares regression of the observed values ŷ^{(r)} of y^{(r)} on the observed values X̂ of X, the coefficients obtained are identical to those from (2.1.4). This fact is easy to see from the standard formula for the linear regression coefficients:

\hat{X}^T \hat{X}\, \beta_r = \hat{X}^T \hat{y}^{(r)} \qquad (2.2.3)

where β_r = [a^r_1, a^r_2, \ldots, a^r_p]^T is the column vector of coefficients obtained by concatenating the r-th rows of the matrix coefficients A_1, A_2, \ldots, A_p. Putting the k equations above together, we get a system which, on inspection, is identical to (2.1.4). Now suppose that there are linear constraints on the coefficients \{A_q\} of the form:

C\,[a_{qrs}] = c \qquad (2.2.4)

Here a_{qrs} is the element in the r-th row and s-th column of A_q, C is a matrix with l rows and k^2 p columns, and c is a column vector of length l, where l < k^2 p is the number of constraints. The column vector [a_{qrs}], of length k^2 p, is formed by running over all q, r, s.
This represents a general linear relation among the full set of coefficients. For such constrained coefficients, we can use the Lagrange multiplier method to minimize (2.1.3) subject to (2.2.4).
The question arises whether we can continue to estimate the coefficients A_i as before, using regression on the equations for the individual components y^{(r)}. The answer is no in general, but yes if the constraints are of a special form. In general, coefficient estimates obtained by applying the Lagrange multiplier method to equations (2.1.3) and (2.2.4) are not the same as those obtained by performing linear least-squares regression on the individual equations. This is because least-squares regression on individual components fails to account for inter-equation constraints. However, in the special case when (2.2.4) consists only of intra-equation constraints of the form:

C_r\,[a^r_1, a^r_2, \ldots, a^r_p]^T = c_r, \qquad r = 1, \ldots, k \qquad (2.2.5)

where each C_r is a matrix with l_r < kp rows and kp columns, a^r_q is the r-th row of the matrix A_q, and each c_r is a vector of length l_r, the Lagrange multiplier equations simplify to give the individual regression equations as before. In particular, subset constraints of the form:

a_{qrs} = 0 \quad \text{for particular sets of indices } \{q, r, s\} \qquad (2.2.6)

are trivially of the form (2.2.5). Solving for the coefficients A_i is then equivalent to performing a separate linear regression on each of the individual equations. For the r-th equation, only those regressors in the set X (see (2.2.2)) that have non-zero coefficients need to be considered.

Maximum Likelihood for VAR(p) model
We denote the maximum likelihood estimate of A_i by Âi. The value of the maximum likelihood of the observed data can now be found by substituting these estimates into equation (2.1.1). The value of the log-likelihood evaluated at the Âi is:

l(\hat{A}, \Sigma) = -\frac{(N-p)k}{2}\log(2\pi) - \frac{N-p}{2}\log|\Sigma| - \frac{1}{2}\sum_{t=p+1}^{N} e_t^T \Sigma^{-1} e_t \qquad (2.3.1)

where e_t = y_t - \sum_{i=1}^{p} \hat{A}_i y_{t-i}. Now, in practice we do not know the value of the theoretical error covariance Σ in advance, so we have to treat it as a parameter to be estimated. We calculate its maximum likelihood estimate from the expression for the log-likelihood. Equations (2.1.3) already maximize the likelihood with respect to {A_i}, and moreover the estimate for {A_i} is independent of Σ. We therefore only need to maximize (2.3.1) with respect to Σ. We do this by differentiating (2.3.1) with respect to Σ^{-1} and equating to 0. To do this, we calculate the derivative with respect to the ij-th component σ^{ij} of Σ^{-1} and use the symmetry of Σ. We use the fact that for any matrix U, the differential of the determinant |U| with respect to its ij-th entry u_{ij} is (-1)^{i+j} M_{ij}\, du_{ij}, where M_{ij} is the minor of u_{ij}. Also, the differential of e_t^T U e_t with respect to u_{ij} is just (e_t e_t^T)_{ij}\, du_{ij}. We then get:

(N-p)\,\Sigma - \sum_{t=p+1}^{N} e_t\, e_t^T = 0

and hence the estimate for Σ which maximizes the likelihood is:

\hat{\Sigma} = \frac{1}{N-p} \sum_{t=p+1}^{N} e_t\, e_t^T \qquad (2.3.2)

Substituting this in (2.3.1), and after a few simplifications, we get for the maximum likelihood:

l_{max} = -\frac{N-p}{2}\big(k\log(2\pi) + \log|\hat{\Sigma}| + k\big) \qquad (2.3.3)

Time Series with multiple trials
To fit multitrial data with r trials, interleave the trials by placing, for every channel i and time point j, all the trial values at (i, j) adjacent to each other in the same row, and use regressors that are lagged by multiples of r only. For each equation, this procedure is easily seen to be theoretically equivalent to fitting the coefficients by inverting the average, over all trials, of the regressor covariance matrix.
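A minimal sketch of this interleaving, assuming the trials are stored as an array of shape (r, N, k) (the function name is ours, not the paper's):

```python
import numpy as np

def interleave_trials(trials):
    """Interleave r trials, each of shape (N, k), into one (N * r, k) series.

    Row r*j + m holds trial m's observation at time j, so lagging the
    interleaved series by multiples of r corresponds to lagging each
    trial by whole time steps, as described in the text.
    """
    trials = np.asarray(trials)        # shape (r, N, k)
    r, N, k = trials.shape
    # Bring the time axis first, then flatten time and trial together.
    return trials.transpose(1, 0, 2).reshape(N * r, k)
```

After interleaving, the regression of section 2.2 proceeds unchanged except that lag i of the original data corresponds to lag i*r of the interleaved series.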

Likelihood Ratio Test and Wilks' Theorem
Suppose we have two models M_0 and M_1 for a data set {y_1, y_2, \ldots, y_N}, corresponding to parameter sets A^{(0)} and A^{(1)} respectively. The likelihood ratio:

\Lambda(A^{(1)} : A^{(0)}) = \frac{L(A^{(1)})}{L(A^{(0)})}

is a measure of the superiority of the model M_1 over the model M_0. If Λ(A^{(1)} : A^{(0)}) > 1, we can consider model M_1 to be better than model M_0. Hence we need to determine whether the statistic Λ(A^{(1)} : A^{(0)}) is significantly larger than unity. For this we need its probability distribution, which, in general, can be difficult or complicated to find. However, when Λ(A^{(1)} : A^{(0)}) is a ratio of maximum likelihoods, Wilks' theorem ([2]) enables us to approximate its distribution: under the null hypothesis that M_0 is the true model, 2 log Λ is asymptotically χ²(λ) distributed, where λ is the number of additional free parameters in M_1.

Determine significance of estimated GC value
The concept of Granger causality can be extended to apply lagwise instead of just variablewise. Given the time series Y as above, suppose we have some existing model M_0 for it. Given the existing coefficients of the model, consider the likelihood ratio restricted to a block of components Y_to = {Y^{(j_1)}, Y^{(j_2)}, \ldots, Y^{(j_s)}} after entry of another, disjoint block of components Y_from = {Y^{(i_1)}, Y^{(i_2)}, \ldots, Y^{(i_r)}} at some lag q. From (2.3.3), this is given by equation (2.3.6), where |Σ_to| is the error covariance of the to block in the original model M_0, and |Σ̃_to| is the error covariance of the to block after adding the coefficient locations of the from block at lag q and refitting the model; p_0 and p are the orders of the model (restricted to the to block) before and after adding the new coefficients. The third term in equation (2.3.6) is a penalty term somewhat analogous to the penalty term found in information criteria such as AIC. In our case, we fix a maximum possible lag p_max beforehand and use only data starting from p_max + 1. This also has the advantage that genuine initial data is available for all lags up to p_max.
(2.3.6) then reduces to (2.3.7). Because the coefficients are found using linear least-squares regression, (2.3.7) is a ratio of maximum likelihoods, and hence it follows from Wilks' theorem above that (N − p_max) gc_{from→to} asymptotically has a χ²(λ) distribution, where λ is the number of components in the from block. We can therefore test the GC statistic in exactly the same way as the likelihood ratio. If the blocks are atomic (that is, have only one variable), then this Granger causality statistic is closely related to the F-statistic in linear regression. The F-statistic has an F(1, n − λ_0 − 1) distribution, where λ_0 is the number of coefficients in the model M_0. For large n, this distribution approximates the χ²(1) distribution. The results in section Main [3] show that the GC statistic above is an effective way to determine all the significant coefficients at different lags.
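For the atomic case (a single-variable from block, λ = 1), this significance test can be sketched with only the standard library, using the identity that the χ²(1) survival function of x is erfc(√(x/2)). This is a minimal sketch (the function name is ours); for general λ one would use a full χ² survival function, e.g. from scipy.

```python
import math

def gc_p_value(gc, N, p_max):
    """Approximate p-value of a lagwise GC statistic for an atomic
    from block: under H0, (N - p_max) * gc ~ chi^2(1), and the chi^2(1)
    survival function at x is erfc(sqrt(x / 2))."""
    x = (N - p_max) * gc
    return math.erfc(math.sqrt(x / 2.0))
```

A GC value whose scaled statistic equals the familiar χ²(1) critical value 3.841 yields a p-value of about 0.05, as expected.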

Confidence intervals for coefficients
The finite number of time points implies some uncertainty in the size of the coefficients obtained. In addition, this uncertainty places a threshold on the size of the smallest coefficients that can be detected by the algorithm. In order to find the uncertainty in the computed coefficients, first note that for a random variable Z = \sum_{i=1}^{N} \chi_i^2 that is the sum of squares of N independent, identically distributed, zero-mean normal variables χ_i, we have E[Z] = N s_χ² and Var(Z) = 2N s_χ⁴, where s_χ is the common standard deviation of the variables χ_i. Moreover, Z is asymptotically normally distributed as N → ∞. In our case, where N is at least 100 or more, we can take Z to be approximately normal.
We show in section Supp [6], equation Supp[(??)], that when a new coefficient c is added at some stage of the fit, the following equation holds:

\Delta\sigma = c^2\, \bar{x}^T (I - H_0)\, \bar{x} \qquad (2.3.9)

Here x̄ is the new regressor associated with the location of the coefficient c, Δσ is the change in the sum of squares of residuals due to the addition of the new coefficient, and H_0 is given by:

H_0 = Q_0\, Q_0^T

where Q_0 is the orthonormal matrix in the QR factorization of the set of regressors X_0 before the new regressor was added. Now we have:

\Delta\sigma = \sum_i e_{0i}^2 - \sum_i e_i^2 = \bar{e}_0^T \bar{e}_0 - \bar{e}^T \bar{e}

where e_{0i} and e_i are the components at time i of the residuals ē_0 and ē, respectively before and after the entry of c. So we can write:

c^2 = \frac{\bar{e}_0^T \bar{e}_0 - \bar{e}^T \bar{e}}{\bar{x}^T (I - H_0)\, \bar{x}} \qquad (2.3.10)

Now note that the response variable ȳ and the lagged regressors X are jointly normally distributed. Hence the residuals ē, which are a linear combination of the response variable ȳ, the regressors in X and the newly added regressor x̄, are jointly normally distributed. In addition, the residual vector ē is orthogonal to x̄, which implies that x̄ and ē are independent random variables. Now it can be shown (see for example [3, Ch. 9, section 4]) that if M is an idempotent matrix, then x̄^T M x̄ has a χ²(tr(M)) distribution, where tr(M) equals the rank of M. Since I − H_0 is idempotent, the denominator x̄^T(I − H_0)x̄ of (2.3.10) has a χ²(N − r) distribution, where r is the rank of H_0. (Note, incidentally, that by our assumption of sparsity, r is small compared to N.) In addition, the term ē^T ē in the numerator has approximately a χ²(N) distribution. In view of the independence of ē and x̄, this term and the denominator are also independent (H_0 is also formed from lagged regressors), and hence the ratio ē^T ē / x̄^T(I − H_0)x̄ has an F-type (F(N, N − r)) distribution with variance O(1/N²). For the term involving ē_0, we do not have independence of the numerator and denominator, and hence we simply treat it as a ratio of two χ²-distributed quantities; both are sums of squares of (approximately) normal variables, with variances proportional to N. Taking the square root of (2.3.10) and making the obvious approximations, it follows that the uncertainty in the estimation of c is given by:

\sigma_c \approx \frac{s_e}{s_x \sqrt{N}} \qquad (2.3.11)

where s_e and s_x are the common standard deviations, respectively, of the e_i and the x_i.

Entry p-Value for SVARGS
We provide a rationale for our choice of p-values for coefficient entry in the SVARGS algorithm. The argument is heuristic, and we relate it to the concept of the False Discovery Rate (FDR). First note that at each coefficient entry step, we test k²q coefficients (over all equations). For any particular coefficient whose true value is 0, the probability of a false entry at that step is p_entry. Now, if a coefficient was falsely entered, but the true value of its likelihood statistic is unity, then the probability that it will remain at the removal step is p_exit. For any such coefficient (whose true value is 0), let p_n be the probability that it is in the entered state at the n-th step. Then we have the following recursion:

p_{n+1} = p_n\, p_{exit} + (1 - p_n)\, p_{entry}

This can be rewritten as:

p_{n+1} = a\, p_n + b, \qquad \text{where } a = p_{exit} - p_{entry} \text{ and } b = p_{entry}

Solving the recursion up to the n-th step gives:

p_n = a^n p_0 + b\,\frac{1 - a^n}{1 - a}

Since a is small, this becomes:

p_n \approx p_{entry}

which is what one might naively assume without going through the above calculation. In other words, our naive intuition that p_entry is the probability of a zero coefficient being falsely detected is justified. The expected number of false entries E[ñ] is therefore:

E[\tilde{n}] \approx k^2 q\, p_{entry}

We would like E[ñ] to be small compared to the final number of coefficients M; how small is decided by the choice of the False Discovery Rate, FDR. In terms of the FDR, the number of false detections is FDR × M. Equating this to E[ñ] above, we get:

FDR \times M = k^2 q\, p_{entry}

and hence:

p_{entry} = \frac{FDR \times M}{k^2 q} \qquad (2.3.15)

We may choose the FDR based on the expected number of non-zero coefficients in the final model (for example, if we know the approximate sparsity beforehand, one might take FDR = O(1)/M).
The approach we take, after choosing the FDR, is to treat the corresponding value from (2.3.15) as merely indicative of the best p_entry value. In our algorithm, we choose a range of values around this value. At the start of the algorithm we use the lowest value in the range and then progressively increase it (as coefficient addition gets exhausted) up to the maximum value in the range (we can also think of this as testing a range of FDR values instead of a range of p_entry values). The best value is then chosen using the extended SBC information criterion. The additional computational cost of this approach is very moderate because we are not refitting the model for every value of p_entry.
A somewhat more time-intensive alternative is to compute the model for a suitable range of values of p entry and then use some form of cross-validation to determine the best value.
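The relation (2.3.15) can be sketched directly in code. The function and argument names are ours, and `n_coeffs_expected` (M, the anticipated final coefficient count) is an assumption the caller must supply:

```python
def p_entry_from_fdr(fdr, n_coeffs_expected, k, q):
    """Heuristic entry p-value from relation (2.3.15), under the
    approximation E[false entries] ~ k^2 * q * p_entry:
    setting FDR * M = k^2 * q * p_entry gives the value below."""
    return fdr * n_coeffs_expected / (k * k * q)
```

In practice, per the text, one would scan a range of values around this indicative p_entry rather than use it as-is.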

Extended SBC criterion
In our experiments with different information criteria for model selection in the SVARGS algorithm, we found that both the AIC and the SBC criteria were too liberal in their inclusion of coefficients. The huge number of spurious coefficients also resulted in an unacceptable slowdown of the algorithm. The Extended SBC criterion ([4]) is based on somewhat different assumptions about the model. The usual SBC criterion is derived on the basis of the prior probability of each coefficient subset being equal. For large models, this implies that larger models are assigned a higher prior probability (since the number of different subsets of a given size increases exponentially with the number of coefficients). In [4], the extended SBC criterion is derived by assigning equal prior probabilities to different model sizes rather than to different coefficient subsets. The resulting criterion is:

ESBC = SBC + 2\gamma \log \tau(M, m)

Here SBC is the usual Schwarz Bayes information criterion (= −2(log-likelihood) + m log(N)).
The second term is the adjustment for the different assumed prior; m is the complexity of the model, and the quantity τ(M, m) is the number of different subsets of coefficients of size m that can be constructed from a pool of coefficients of size M.
γ is a number between 0 and 1 and must be chosen depending on κ = log(M)/log(N): γ = 1 when κ is approximately 1 or more, γ = 0.5 when κ is somewhat less than 1, and γ = 0 when κ is well below 1. We used a smooth approximation to the step function passing through these points.
The criterion is shown to be consistent, satisfying the condition that the probability of selecting a model other than the true model approaches zero as the number of data points approaches infinity.
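The criterion above can be sketched as follows (a minimal sketch; the function name is ours, and `log_tau` is computed via `lgamma` so that it stays finite even for the very large coefficient pools M = k²q that arise in high dimensional models):

```python
import math

def extended_sbc(log_likelihood, m, N, M, gamma):
    """Extended SBC: the usual SBC plus the adjustment 2*gamma*log(tau(M, m)),
    where tau(M, m) = C(M, m) is the number of size-m coefficient subsets
    from a pool of size M."""
    sbc = -2.0 * log_likelihood + m * math.log(N)
    # log C(M, m) via log-gamma, to avoid overflow for large M.
    log_tau = (math.lgamma(M + 1) - math.lgamma(m + 1)
               - math.lgamma(M - m + 1))
    return sbc + 2.0 * gamma * log_tau
```

With γ = 0 this reduces to the ordinary SBC; larger γ penalizes models drawn from large coefficient pools more heavily.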

SVARGS coefficient confidence and size threshold
The finite number of time points implies some uncertainty in the size of the coefficients obtained. In addition, this uncertainty places a threshold on the size of the smallest coefficients that can be detected by the algorithm. In section 2.3.5 we determined the uncertainty, σ_c, in the estimated size of the coefficients. Coefficients not much larger than σ_c have a fair chance of not being detected, or of being eliminated, due to the natural variance of the finite sample of size N. Coefficients whose size is more than twice σ_c are almost certain to be detected. Based on σ_c, a rule of thumb for the threshold on the size of detectable coefficients can be taken to be:

c_{min} \approx 2\sigma_c = \frac{2 s_e}{s_x \sqrt{N}}

where s_e is the standard deviation of the residuals of the equation containing the coefficient c, and s_x is the standard deviation of the regressor corresponding to c. In general, the greater the explanatory power of the regressors, the smaller the variance of the residuals and the better the detectability of the coefficients.

General comments on SVARGS
Note that the SVARGS fit need not give the best subset of predictors in any well-defined sense. This is a well-known drawback of stepwise regression in general. There may be approximate redundancies among the lagged predictors, so that a best subset of predictors may not even exist in a strictly mathematical sense. At best, one can say that the fit gives coefficient locations at which the (penalized) likelihood has a local maximum. Also note that the p-values obtained as the fit progresses are only used for model comparison, in order to determine the next coefficient location to add or remove. These p-values are entirely dependent on the path taken by the fitting process, and only the relative sizes of consecutive p-values are relevant for the progress of the fit. Thus these p-values are not used in any way to rank the final set of model coefficients.
For the intended applications of this method, this is perfectly acceptable, since we are mainly interested in the question: "What is the (relatively) small set of significant interactions, delayed or otherwise, between variables?" A rough idea of the importance of the interactions can be gleaned from the (conditional) Granger causality strengths. Moreover, if for some reason it is necessary to rank the coefficients, one can compute p-values from the confidence intervals (computed in section 2.3.5) for the final set of coefficients.
In view of the above, we also observe that the final model obtained was fairly robust with respect to a range of FDR values. A higher FDR led to more spurious coefficients, but the prediction rate did not vary much as long as the FDR was not too large. This was found to be true both for data generated from sparse models and for real data. In our opinion, this indicates that in our applications (and, we suspect, often in many other real-world applications), there is a wide gulf between those interactions that are significant and those that are not, with very little in the grey area. Note, however, that increasing the FDR increases the computation time by increasing the coefficient density as well as the reversal ratio, that is, the ratio of the number of backward steps to the total number of non-zero coefficients (see the supporting information of [5] under the section on time complexity). In particular, a somewhat low FDR can speed up the fit considerably.
In addition, the method works best, both in terms of quality of fit and time complexity, when the final model is known to be sparse, with the number of non-zero coefficients of order O(mk), where m ≪ N and m ≪ kq (here k is the number of variables in the model, q is the number of lags, and N is the number of time points). It is not suitable for dense or even semi-sparse models. As the final coefficient count increases, the time complexity of the method degrades more than proportionally with the final number of coefficients. See the supporting information in [5], under the section on time complexity, for a discussion of the time complexity of the algorithm.

Quantities derived from VAR model
In order to find the Granger causalities (in both the time and frequency domains) from the fitted model, we first need to compute the following quantities: 1. ACVF: matrix of variances and cross-covariances at any given lag.
2. Sub-Model Coefficients: Given a subset of variables and a subset of lags, compute a VAR model as if all the "data" is to be explained by only the specified subset of the original set of variables and the specified subset of the original set of lags.
3. Transfer Function: Transfer function of the operator associated with the VAR model.
4. Spectrum: matrix of power and cross spectra at any given frequency.
To compute the Granger causality strengths, we only require (1) and (2). For the GC spectrum we require (2) as well as (3) and (4). Under a certain technical condition that we shall state at the end of this section (section 3.5), the integral over all frequencies of the GC spectrum is identical to the GC strength. In the following sections, we describe how to calculate these quantities and, subsequently, the Granger causalities.

ACVF, Transfer Function and Spectrum
Given the VAR model Main [2] of order p in k variables, one can construct an equivalent VAR model of order 1 in kp variables using the transformations:

X_t = \begin{bmatrix} Y_t \\ Y_{t-1} \\ \vdots \\ Y_{t-p+1} \end{bmatrix}, \qquad E_t = \begin{bmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}

This yields the extended model:

X_t = A\, X_{t-1} + E_t \qquad (3.1.2)

where the extended vector X_t is a column vector of length kp and the extended matrix A is the (kp × kp) companion coefficient matrix:

A = \begin{bmatrix} A_1 & A_2 & \cdots & A_{p-1} & A_p \\ I & 0 & \cdots & 0 & 0 \\ 0 & I & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & I & 0 \end{bmatrix}

As such, the form (3.1.2) is not efficient for direct numerical computation. However, it can be used to conveniently derive more efficient formulas for the quantities mentioned above. We do this below.

Autocovariance Function (ACVF)
From the extended order-1 model above, the equations for the autocovariance can be found by multiplying both sides of the equation by X_t^T and by X_{t-1}^T in turn, and taking expectations on both sides. With G = E[X_t X_t^T] and Ḡ = E[X_t X_{t-1}^T], the resulting two equations are:

G = A\,\bar{G}^T + \tilde{\Sigma}, \qquad \bar{G} = A\,G

which can also be written as a single equation:

G = A\,G\,A^T + \tilde{\Sigma} \qquad (3.1.3)

The matrix Σ̃ is zero except for the top-left (k × k) block, which is the same as the noise covariance of the original model in Y. G and Ḡ are block-Toeplitz matrices representing the extended autocovariances at lag 0 and lag 1 respectively, with blocks formed from the autocovariances G_j of the original variable Y_t. Note that G_{-m} = G_m^T. Substituting these into the equations for the extended autocovariance and performing the block multiplications, we get, after some matrix algebra, the following p equations:

G_j = \sum_{i=1}^{p} A_i\, G_{j-i}, \qquad j = 1, \ldots, p \qquad (3.1.4)

For a description of the methods used to solve (3.1.3) or (3.1.4), see section 5.
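For a stable model, the lag-0 equation can be sketched by fixed-point iteration on the discrete Lyapunov equation G = A G A^T + Σ̃ (a minimal sketch under the assumption of stability; production code would use the more efficient solvers referenced in section 5, and the function name is ours):

```python
import numpy as np

def extended_acvf(A_list, Sigma, n_iter=500):
    """Lag-0 autocovariance of a stable VAR(p) via the extended order-1
    form: G satisfies G = A G A^T + Sigma_ext, solved here by simple
    fixed-point iteration."""
    p = len(A_list)
    k = A_list[0].shape[0]
    # Build the extended (kp x kp) companion matrix A.
    A = np.zeros((k * p, k * p))
    A[:k, :] = np.hstack(A_list)
    A[k:, :-k] = np.eye(k * (p - 1))
    # Extended noise covariance: only the top-left block is non-zero.
    S = np.zeros((k * p, k * p))
    S[:k, :k] = Sigma
    G = S.copy()
    for _ in range(n_iter):
        G = A @ G @ A.T + S
    return G[:k, :k]          # lag-0 autocovariance of the original Y
```

For a scalar AR(1) with coefficient a and noise variance σ², this recovers the textbook value σ²/(1 − a²).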
Note: In the remainder of this section, we set α(ω) = e 2πiω .

Transfer Function
The transfer function H_p(ω) is calculated in a straightforward way from the formula:

H_p(\omega) = L_p(\omega)^{-1}, \qquad L_p(\omega) = I - \sum_{j=1}^{p} A_j\, \alpha(\omega)^{-j} \qquad (3.1.6)

where p is the order of the model.
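This formula can be sketched directly (with α(ω) = e^{2πiω} as set above; the function name is ours):

```python
import numpy as np

def transfer_function(A_list, omega):
    """Transfer function H(omega) = L(omega)^{-1} of a VAR(p) model,
    where L(omega) = I - sum_j A_j * exp(-2*pi*i*omega*j)."""
    k = A_list[0].shape[0]
    L = np.eye(k, dtype=complex)
    for j, Aj in enumerate(A_list, start=1):
        L -= Aj * np.exp(-2j * np.pi * omega * j)
    return np.linalg.inv(L)
```

For a scalar AR(1) with coefficient 0.5, H(0) = 1/(1 − 0.5) = 2 and H(1/2) = 1/(1 + 0.5) = 2/3, as expected.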

Spectrum
Taking the Fourier transform of both sides of (3.1.2) and averaging, we obtain the spectrum of Y as the top-left (k × k) sub-block of the matrix function:

S_X(\omega) = H(\omega)\,\big(G - A\,G\,A^T\big)\,H(\omega)^* \qquad (3.1.8)

where A is the extended coefficient matrix of (3.1.2), G is the covariance matrix (autocovariance at lag 0) of the extended vector X, and H(ω) is the corresponding transfer function of the extended model (3.1.2). (By (3.1.3), the middle factor G − A G A^T is just the extended noise covariance Σ̃.) To find the top-left (k × k) sub-block of (3.1.8), we solve the resulting block-matrix equations, which, after some algebraic simplification, gives an expression (3.1.9) in terms of the autocovariances G_j of Y and the matrices L_j(ω) of (3.1.6), whose inverses are the transfer functions H_j(ω) for order-j models.
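Equivalently, the spectrum can be sketched in the original k variables as H(ω) Σ H(ω)^* (a standard identity consistent with the sub-block extraction described above; this is a minimal sketch, not the paper's more efficient formula (3.1.9), and the function name is ours):

```python
import numpy as np

def var_spectrum(A_list, Sigma, omega):
    """Spectral density S(omega) = H(omega) Sigma H(omega)^* of a VAR(p)
    model with noise covariance Sigma, where H is the transfer function."""
    k = A_list[0].shape[0]
    L = np.eye(k, dtype=complex)
    for j, Aj in enumerate(A_list, start=1):
        L -= Aj * np.exp(-2j * np.pi * omega * j)
    H = np.linalg.inv(L)
    return H @ Sigma @ H.conj().T
```

As a sanity check, integrating the spectrum over one period of ω recovers the process variance (the lag-0 autocovariance).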

Sub-models
Given the VAR model for the full set of variables and the set of lags L = {1, 2, \ldots, p}, where p is the order of the model, we now describe how to compute the VAR submodel for a subset of the variables and a subset of the lags. More precisely, we would like to compute the VAR model as if all the "data" were to be explained by only the specified subset of the original set of variables and the specified subset of the original set of lags. Denote the original set of variables by W = {w^{(1)}, w^{(2)}, \ldots, w^{(k)}}^T. We write the full VAR model, with assumptions as in Main [2]:

W_t = \sum_{i \in L} A_i\, W_{t-i} + \epsilon_t

Let X denote the subset of variables for which we require the sub-model, and let Z denote the complement of X in W. Without loss of generality, we may assume that for some s < k,

X = \{w^{(1)}, w^{(2)}, \ldots, w^{(s)}\} \qquad (3.2.2)

(otherwise, permute the variables appropriately).
Partition the noise terms ε_t and each coefficient matrix A_i of the original model as:

\epsilon_t = \begin{bmatrix} \mu^x_t \\ \mu^z_t \end{bmatrix}, \qquad A_i = \begin{bmatrix} B^{xx}_i & B^{xz}_i \\ B^{zx}_i & B^{zz}_i \end{bmatrix}

where B^{xx}_i is the (s × s) coefficient block at lag i pertaining only to X, and μ^x_t is the noise pertaining only to X. Also denote by J the subset of the original lags L over which the fit is required, and let M denote the complement of J in L. We can then write the equations for X as:

X_t = \sum_{i \in L} \big(B^{xx}_i\, X_{t-i} + B^{xz}_i\, Z_{t-i}\big) + \mu^x_t \qquad (3.2.7)

and, for the purpose of fitting the submodel, we write:

X_t = \sum_{i \in J} C_i\, X_{t-i} + \nu_t \qquad (3.2.8)

The right-hand side of (3.2.8) is the 'noise', which is, by hypothesis, uncorrelated with X(t − i) for i ≥ 1. We therefore have to "fit" the coefficients C_i to the model represented by (3.2.8). The autocovariance function (which can be thought of as the autocovariance of the "data") can be calculated directly from the coefficients of the original model. We can therefore use the Yule-Walker equations for the original model (restricted to the components in X) to fit the coefficients C_i in (3.2.8) and then compute the new noise term ν_t. From equations (3.2.7) and (3.2.8), this noise term is:

\nu_t = \sum_{i \in L} D_i\, W_{t-i} + \mu^x_t, \qquad D_i = \big[\, B^{xx}_i - \mathbf{1}_J(i)\, C_i \;\big|\; B^{xz}_i \,\big]

where 1_J is the indicator function of J on L.
As per the original model hypothesis, the W_{t−i}, for i ≥ 1, are uncorrelated with μ^x_t. Using this fact, and after some algebraic simplification, the covariance matrix of ν_t is seen to be:

\Sigma_\nu = \Sigma_{\mu^x} + \sum_{i, j \in L} D_i\, G_{j-i}\, D_j^T

where G_h is the autocovariance of the original full set of variables W at lag h, and Σ_{μ^x} is the covariance matrix of μ^x_t.
This allows us to calculate the coefficients C_i and the new VAR model. Note that it involves no further calculation other than solving the Yule-Walker equations restricted to the subset X and computing the new noise term ν_t. For the left-hand side matrix of the Yule-Walker equations, a one-time computation of the autocovariances of the full model as per (3.1.4) suffices for all submodels. We discuss an efficient implementation of the subfit computations above in section 5.
Using the procedures outlined in sections 3.1 and 3.2, we are now ready to calculate the Granger causality strengths and spectra as below. For more details on how to obtain the Granger causality formulas below, see [6, pg. 451-474]. For insight into the reasoning and rationale behind the definitions below, see [7].

Granger Causalities
Define the lag operator L by L(X_t) = X_{t−1}. In the remainder of this section and the next, we will find it convenient to write the VAR equations Main [2] in operator form:

\big(I - A(L)\big)\, Y_t = \epsilon_t \qquad (3.3.1)

Here A(L) = A_1 L + A_2 L^2 + \cdots + A_p L^p.

Time Domain (GC strengths)
Let W be a stochastic process that can be represented under the same general conditions as in equation Main [2], and suppose that W is composed of two sub-processes X and Y, with W = [X, Y]. We can then represent W as:

X_t = \sum_{i=1}^{p} \big(a_i X_{t-i} + b_i Y_{t-i}\big) + \epsilon^x_t, \qquad
Y_t = \sum_{i=1}^{p} \big(c_i X_{t-i} + d_i Y_{t-i}\big) + \epsilon^y_t

We write these in operator form as in (3.3.1):

\begin{bmatrix} I - a(L) & -b(L) \\ -c(L) & I - d(L) \end{bmatrix}
\begin{bmatrix} X_t \\ Y_t \end{bmatrix} =
\begin{bmatrix} \epsilon^x_t \\ \epsilon^y_t \end{bmatrix} \qquad (3.3.2)

with the error covariance matrix for ε given by:

\Sigma = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}

In addition, assume that the two sub-processes X and Y individually satisfy the same general assumptions as in Main [2], so that we can write them as:

X_t = \sum_{i=1}^{p} a'_i X_{t-i} + \delta^x_t, \qquad
Y_t = \sum_{i=1}^{p} d'_i Y_{t-i} + \delta^y_t

with the error covariances for δ given by Γ_xx = Cov(δ^x_t) and Γ_yy = Cov(δ^y_t). In the time domain, the Granger causality F_{Y→X} from Y to X is defined by:

F_{Y \to X} = \ln \frac{|\Gamma_{xx}|}{|\Sigma_{xx}|}

where |·| denotes the determinant. The Granger causality from X to Y is defined analogously. Having fitted a VAR model to the full set of data, the GC strengths can be computed using the above formulas together with the formulas in section 3.2 for the coefficients and noise covariance of the sub-fitted models.
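For two scalar series, this definition can be sketched by comparing the residual variances of the restricted and full regressions. This is a minimal sketch using direct least-squares fits (the paper instead computes the submodel covariances analytically via section 3.2), and the function name is ours:

```python
import numpy as np

def granger_causality(x, y, p):
    """Time-domain GC strength F_{y->x} = ln(var_restricted / var_full)
    for two scalar series, via least-squares VAR(p) fits."""
    N = len(x)

    def resid_var(target, predictors):
        # Design matrix of lags 1..p of each predictor series.
        X = np.column_stack([s[p - j - 1:N - j - 1]
                             for s in predictors for j in range(p)])
        t = target[p:]
        beta, *_ = np.linalg.lstsq(X, t, rcond=None)
        r = t - X @ beta
        return r @ r / len(r)

    v_restricted = resid_var(x, [x])       # x explained by its own past
    v_full = resid_var(x, [x, y])          # ... plus the past of y
    return np.log(v_restricted / v_full)
```

On simulated data where y drives x but not conversely, the statistic is clearly positive in the driving direction and near zero in the other.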

Frequency Domain (GC spectrum)
First left-multiply both sides of equation (3.3.2) by the normalizing matrix:

P = \begin{bmatrix} I & 0 \\ -\Sigma_{yx}\,\Sigma_{xx}^{-1} & I \end{bmatrix}

The resulting equations include a projection of Y onto not only the past, but also the present, of X, in such a way that the cross-covariance between the X and Y blocks of the new noise terms vanishes:

\tilde{\Sigma} = \begin{bmatrix} \Sigma_{xx} & 0 \\ 0 & \tilde{\Sigma}_{yy} \end{bmatrix}, \qquad \tilde{\Sigma}_{yy} = \Sigma_{yy} - \Sigma_{yx}\,\Sigma_{xx}^{-1}\,\Sigma_{xy}

Because of this independence, the spectrum S^{(xx)} of X can be decomposed as:

S^{(xx)}(\omega) = \tilde{H}_{xx}(\omega)\,\Sigma_{xx}\,\tilde{H}_{xx}(\omega)^* + \tilde{H}_{xy}(\omega)\,\tilde{\Sigma}_{yy}\,\tilde{H}_{xy}(\omega)^*

where H̃ is the transfer function of the normalized equations. The Granger causality f_{Y→X}(ω) from Y to X at the frequency ω is then defined by:

f_{Y \to X}(\omega) = \ln \frac{|S^{(xx)}(\omega)|}{|\tilde{H}_{xx}(\omega)\,\Sigma_{xx}\,\tilde{H}_{xx}(\omega)^*|}

As in the case of the GC strengths, once a VAR model is fit to the full set of data, the GC spectrum can be computed using the above formulas together with the formulas in section 3.2 for the coefficients and noise covariance of the sub-fitted models.

Conditional Granger Causalities
For three or more time series, one can take pairs of components and compute the Granger causality for each pair. Such a pairwise analysis has inherent limitations: we do not know whether Y is genuinely driven by X, or only appears to be driven by X due to the presence of an intermediary Z which transmits the influence to Y. To determine whether a causal influence from variable X to Y is partially or wholly mediated by some other variable Z, we need the concept of conditional Granger causality, in both the time and the spectral domains. We briefly state the relevant equations here. For more details on how to obtain these formulas, see [8] or [9, pg. 228-237] and the references therein. For insight into the reasoning and rationale behind the definitions below, see [10].

Time Domain (conditional GC strengths)
Consider three stochastic processes X_t, Y_t and Z_t. Suppose that a pairwise analysis shows Granger Causality from Y to X. To determine whether this influence is partially or wholly due to the presence of Z, first write the VAR model equations for the processes X_t and Z_t taken together: Denote the covariance matrix of the noise terms by: Next, write the equations for all three processes X, Y and Z taken together: and denote the matrix of error covariance terms by: where Υ_yx = Υ_xy^T, Υ_zx = Υ_xz^T and Υ_zy = Υ_yz^T.
Then the Granger Causality from Y to X, conditional on Z, is defined to be: The Granger Causality from X to Y, conditional on Z, is defined analogously. As in the case of the pairwise GC strengths, once a VAR model is fitted to the full set of data, the conditional GC strengths can be computed using the above formulas together with the formulas in section 3.2 for the coefficients and noise covariance of the sub-fitted models.
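The conditional definition can likewise be illustrated with a regression sketch (again, not the paper's sub-fitting procedure): F_{Y→X|Z} compares the X-residual variance of the (X, Z) model with that of the full (X, Y, Z) model. In the toy chain below, Y influences X only through Z, so conditioning on Z suppresses the apparent Y → X influence. All helper names are hypothetical.

```python
import numpy as np

def lagmat(v, p):
    """Stack lags 1..p of a (T, k) series into a (T - p, k*p) design matrix."""
    T = v.shape[0]
    return np.hstack([v[p - j - 1:T - j - 1] for j in range(p)])

def resid_var(target, design):
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    r = target - design @ beta
    return (r * r).mean()

def conditional_gc(x, y, z, p=1):
    """F_{Y->X|Z}: restricted (X,Z) model vs. full (X,Y,Z) model."""
    tgt = x[p:]
    small = resid_var(tgt, np.hstack([lagmat(x, p), lagmat(z, p)]))
    full = resid_var(tgt, np.hstack([lagmat(x, p), lagmat(y, p), lagmat(z, p)]))
    return np.log(small / full)

# chain Y -> Z -> X: the direct Z -> X link survives conditioning on Y,
# while the mediated Y -> X link vanishes when conditioning on Z
rng = np.random.default_rng(1)
T = 8000
y = rng.standard_normal((T, 1))
z = np.zeros((T, 1))
x = np.zeros((T, 1))
for t in range(1, T):
    z[t] = 0.9 * y[t - 1] + 0.3 * rng.standard_normal()
    x[t] = 0.5 * x[t - 1] + 0.9 * z[t - 1] + 0.1 * rng.standard_normal()
```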

Frequency Domain (conditional GC spectrum)
The computation of the conditional Granger Causality in the frequency domain involves a series of steps. In what follows, I_x, I_y and I_z denote identity matrices of the appropriate block sizes.
As in the case of the pairwise GC spectrum, first multiply both sides of equation (3.4.1) by the normalizing matrix: This projects the variable Z onto the past as well as the present of X in a way that absorbs the cross-covariance between the noise terms of X and Z. Therefore, the resulting equations have independent noise covariance: Similarly, multiply both sides of equation (3.4.3) by P (= P_2 P_1), where: Again, this projects the variable Y onto the past and present of X, and the variable Z onto the past and present of both X and Y, in such a way that the resulting equations have independent noise covariance: Now let H_b(ω) and H_c(ω), respectively, denote the transfer functions, as defined in section 3, of the operators on the left-hand sides of equations (3.4.6) and (3.4.8). Then define the matrix Q(ω) by: The conditional Granger Causality spectrum from Y to X conditional on Z is then defined as: For more details and insight into these formulas, see [10]. The conditional Granger Causality spectrum from X to Y conditional on Z is defined analogously. As in the case of the pairwise GC spectrum, once a VAR model is fitted to the full set of data, the conditional GC spectrum can be computed using the above formulas together with the formulas in section 3.2 for the coefficients and noise covariance of the sub-fitted models.

Consistency between GC strengths and spectrum
For the pairwise Granger causalities and spectrum, the consistency condition that ensures the equality of F_{Y→X} and the integral of f_{Y→X}(ω) over ω can be stated as follows: the pairwise GC spectrum is consistent with the GC strengths only if the VAR operator B_yy in the equations above satisfies: For conditional Granger Causality, the corresponding condition is: Let L_Q^{yz} be the YZ block of the VAR operator whose transfer function is the left-hand side Q of equation (3.4.10). Then the conditional GC spectrum is consistent with the conditional GC strengths only if: For the Granger causalities in the opposite directions, reverse the roles of X and Y everywhere. For proofs, see [7] and [10] respectively.
Note that to compute all pairs of Granger causalities in a multivariable model, one would use the sub-fitting procedure described in section 3.2 on each pair of variables (or each pair of blocks, as the case may be) and then use the formulas for the Granger causalities above. For the conditional Granger causalities, to get f_{Y→X|Z}, subfits must be performed for the XZ and XYZ blocks. Thus the GC consistency condition must be checked separately for each pair of interacting blocks.

GC consistency condition for random models
Our numerical tests (on pairwise GC) show that random dense models are unlikely to satisfy the GC consistency condition for all interactions. On generating a large number of random dense models, we found that the probability that a random dense model satisfies the consistency condition decreases as the model size grows, and in general this probability is very small but nonzero. On the other hand, similar tests on sparse models show that random sparse models (at the sparsity levels that concern us here) satisfy the consistency condition with high probability.
The practical implication is that for sufficiently dense VAR models, there will be large discrepancies between the integrated GC spectrum and the GC strengths for those interactions that do not satisfy the GC consistency condition. The discrepancies become particularly egregious as the coefficient density increases.

Additional performance measures
We include a few more performance measures here in addition to the ones in section Main [3].
The reference models against which the tests were made were all checked for the pairwise consistency condition stated in section 3.5 and, as expected, were all GC-consistent with respect to both pairwise and conditional Granger Causalities. In general, sparse models appear to have a high probability of passing the GC consistency test, while dense models have a fairly high probability of failing it for at least some interactions.
The models obtained via SVARGS+CGC were all stable (or required a very small deflation (< 1%) of the coefficient size to make them stable) and all satisfied the GC consistency condition for pairwise GC. We did not check the theoretical consistency condition for the conditional GC, because it would have involved approximating a theoretically infinite-order VAR operator from its transfer function. However, the normalized difference between the conditional GC strengths and the sum of the conditional GC spectrum was less than 0.05 in all cases, which strongly suggests that the conditional GC consistency condition is satisfied.
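A numerical check of this kind, comparing a GC strength against the frequency average of its spectrum, might be sketched as follows. This is a simple Riemann-sum approximation; the exact normalization conventions used in the text are not fully specified, so the form below is an illustrative assumption.

```python
import numpy as np

def gc_consistency_gap(F_time, f_spectrum):
    """Normalized difference between a time-domain GC strength and the
    frequency average of its GC spectrum (a Riemann-sum approximation of
    the integral over one period); values near zero indicate consistency."""
    integrated = float(np.mean(np.asarray(f_spectrum)))
    return abs(F_time - integrated) / max(abs(F_time), 1e-12)

# a flat spectrum equal to the strength is trivially consistent
gap = gc_consistency_gap(0.7, np.full(512, 0.7))
```

In practice one would flag an interaction as inconsistent when this gap exceeds a tolerance such as the 0.05 threshold mentioned above.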

Performance with Added Noise
We also computed VAR models from simulated data to which simulated external (measurement) noise had been added. The standard deviations of the simulated noise used were 0, 0.2, 0.4 and 0.6 times the standard deviations of the intrinsic noise. For each noise level, the methods were evaluated on 10 randomly generated VAR models. Figure 6 shows the performance of SVARGS in the presence of added (measurement) noise.

5 Autocovariance and Granger Causality computation

Autocovariance computation
There is no convenient closed-form expression for the autocovariances G_j in (3.1.4). For smaller systems (up to about 200 variables), we can efficiently compute the G_j by setting up the coefficients in the matrix equations and determining, programmatically, the matrix corresponding to the linear system to be solved. This is much faster than directly using the equations for the extended autocovariance. For larger systems, however, this becomes inefficient, and we instead use an iterative solver for the extended autocovariance equations (3.1.3), which have the form of discrete Lyapunov equations. The Matlab dlyap method can be used to solve these.
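In Python, the same discrete Lyapunov formulation can be sketched with SciPy's `solve_discrete_lyapunov` (the analogue of Matlab's `dlyap`), assuming the standard block companion form of the VAR model; the function names below are our own.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def companion(A_list):
    """Block companion matrix of VAR coefficient matrices A_1..A_p."""
    k, p = A_list[0].shape[0], len(A_list)
    A = np.zeros((k * p, k * p))
    A[:k, :] = np.hstack(A_list)
    A[k:, :-k] = np.eye(k * (p - 1))
    return A

def extended_autocov(A_list, Sigma):
    """Solve G = A G A^T + Sigma_ext, the discrete Lyapunov form of the
    extended autocovariance equations; requires a stable model."""
    k, p = Sigma.shape[0], len(A_list)
    A = companion(A_list)
    Sig_ext = np.zeros((k * p, k * p))
    Sig_ext[:k, :k] = Sigma
    return solve_discrete_lyapunov(A, Sig_ext)

# univariate AR(1) sanity check: Var(x) = sigma^2 / (1 - a^2)
G = extended_autocov([np.array([[0.5]])], np.array([[1.0]]))
```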
Another, faster option is a power-doubling method described in [11], modified by us to take into account the special structure of the matrix A. This is particularly suitable for sparse models and is much faster than dlyap. Both of these methods were used successfully in our applications.
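A generic version of the power-doubling (squaring) iteration for X = A X Aᵀ + Q might look as follows; the authors' modification exploiting the special structure of A is not reproduced here.

```python
import numpy as np

def lyap_doubling(A, Q, tol=1e-12, max_iter=64):
    """Power-doubling iteration for X = A X A^T + Q.
    Each step doubles the number of terms of the series
    X = sum_j A^j Q (A^T)^j, so convergence is very fast for stable A,
    and sparse A keeps the early matrix products cheap."""
    X, Ak = Q.copy(), A.copy()
    for _ in range(max_iter):
        update = Ak @ X @ Ak.T
        X = X + update
        if np.abs(update).max() < tol:
            break
        Ak = Ak @ Ak  # square the propagator: doubles the series length
    return X

A = np.array([[0.5, 0.1], [0.0, 0.3]])
Q = np.eye(2)
X = lyap_doubling(A, Q)
```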
Another approach, which works well for dense models, is to diagonalize the block companion matrix A in the expression for the extended autocovariance in section 3.1.1 as A = PΛP^{-1}, where Λ is a diagonal matrix with the eigenvalues λ_i on the diagonal and P is the matrix of eigenvectors of A.
Equation (3.1.3) for the extended autocovariance then becomes: where G̃ = P^{-1} G P^{-T} and Σ̃ = P^{-1} Σ P^{-T}. The matrix A must be non-defective up to numerical tolerance for this to work properly.
Equation (5.1.1) has the explicit solution: where: This works better for dense than for sparse matrices, since a dense A is much more likely to be non-defective. For defective A, a similar explicit solution can be written in terms of the Jordan canonical form, but this is impractical to compute for all but the smallest matrices.
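As a sketch under the stated non-defectiveness assumption, the equation decouples entrywise in the eigenbasis, so the transformed solution is Σ̃_ij / (1 − λ_i λ_j):

```python
import numpy as np

def lyap_eig(A, Q):
    """Solve X = A X A^T + Q by diagonalizing A = P Lam P^{-1}:
    in the eigenbasis, Xt_ij = Qt_ij / (1 - lam_i * lam_j)."""
    lam, P = np.linalg.eig(A)
    Pinv = np.linalg.inv(P)
    Qt = Pinv @ Q @ Pinv.T              # Q transformed to the eigenbasis
    Xt = Qt / (1.0 - np.outer(lam, lam))
    X = P @ Xt @ P.T
    # for real A, Q the solution is real up to round-off
    return X.real if np.isrealobj(A) and np.isrealobj(Q) else X

A = np.array([[0.5, 0.2], [0.1, 0.3]])  # stable, non-defective example
Q = np.eye(2)
X = lyap_eig(A, Q)
```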
Note that the iterative methods for computing the autocovariance require that the VAR model be stable.

Efficient subfits for conditional Granger Causality computation
With reference to section 3.2, when computing conditional Granger causalities, a "leave one block out" subfit needs to be performed for each of the blocks for which the conditional Granger Causality is required. We compute the conditional Granger Causality between two variables (or two, typically small, blocks of variables) conditioned on all the remaining variables. More generally, this is the situation where, for each subfit, the subset of variables X to be subfit contains most of the variables in the system W (that is, in each case the subset of variables Z to be left out is small relative to W). In such cases the subfit can be computed quite efficiently, without having to invert a large autocovariance matrix each time to find C and without having to perform the large matrix multiplication given by equation (3.2.12). This is done by computing the differentials in the noise and the coefficients rather than the quantities themselves. We state this result as a proposition (refer to section 3.2 for notation): Proposition 1 (Noise delta and coefficient delta for VAR model subfit).
Let G be as in equation (3.1.3) and denote the inverse of G by Q. Let X (with k_x variables) and Z (with k_z variables) be a partition of the variables into two blocks as in equations (3.2.2). Let G_x be the (k_x × k_x) submatrix of G that contains the lag-extended autocovariance terms of X, and let g be the (k_z × k_x) submatrix of G that contains the lag-extended cross-covariance between Z and X, so that: Let µ_x be the noise in the equations for the X variables, as in equation (3.2.5). Let C = vertcat(C_i) be the subfitted coefficients for X, and let ν be the sub-fitted noise as in equation (3.2.8). Then the coefficient delta and the noise delta are given, respectively, by: The proof of (5.2.2) involves writing the equations for the coefficients in block form; it is straightforward but tedious, so we omit it here.

Fast computation of conditional GC due to known sparse locations
The sizes of the variable blocks between which the conditional GC is to be computed are usually small. So the number of rows in B_zx and the number of columns in q are small (for atomic blocks, exactly the number of lags). Similarly, q_z and g_z are just small square matrices. In addition, equation (5.2.2) tells us that the conditional GC from block i to block j can be non-zero only if, in the original equations, the coefficient block (j, i) is non-zero at some lag. This further greatly reduces the amount of work required. The only expensive computation is the one-time calculation of the inverse, Q, of G. Therefore, computing the coefficient and noise deltas using (5.2.2) is much faster than using the full equations in section 3.2.
If, on the other hand, the variable blocks are large, then the total number of GC values to be calculated is small (since the number of GC values is the square of the number of blocks), and so the cost of computing the new coefficients by inverting the covariance matrix corresponding to the complement of Z is not so relevant.
In our implementation, we use equations (5.2.2) if the size of the variable block is less than k/2, where k is the total number of variables; otherwise we directly use the equations in section 3.2.
Note that computing the subfits requires the computation of the autocovariance. As remarked in the previous section, if an iterative method is used to compute the autocovariance, then the model must be stable.

SVARGS computation
A fast implementation of the main computations involved in the SVARGS method is described in the supplementary information of [5], under the section "Efficient implementation of the algorithm". The same implementation can easily be extended to variable blocks, as long as the blocks are all of the same size.

7 Measure and node (channel) preselection procedure

To reduce the number of measures/nodes, start with the full set of features F, i.e. the union of the features over all the feature blocks (each feature block corresponds to one measure or one node), and set the initial loss L_0 = 1. We use the iterative procedure briefly described below:

1. Randomly partition the dataset into an appropriate number K of folds.
2. For each test fold, train a model using the features in F on the corresponding training fold using AdaBoost-SAMME, up to the required number of rounds.
3. Use the model obtained to classify the test fold and record the predictions. This step involves calculating all the individual gains at every split on the test fold.
4. Since the individual gains are now calculated for each measure/node, average the gain of the corresponding feature block over all rounds.
5. Perform multiple runs of the AdaBoost-SAMME algorithm by repeating steps 1 to 4 for different K-fold partitions of the data, average the resulting gains, and compute the aggregate loss L_F from the predictions over all the runs.
7. Prune the feature blocks by clustering the set of gains into high gains and low gains. The heuristic used for this is as follows: (a) Sort the gain values in decreasing order, resulting in a curve G.
(b) Put the high outliers in G in the high-gain cluster, put the low outliers and zeros in the low-gain cluster, and remove all of these from G.
(c) Find the index x_0 where the slope of G is minimum, and impose the lower bound x_0 = max(M/2, x_0), where M is the number of elements in G.
(d) Add G(x), for x < x_0, to the high-gain cluster and put the rest in the low-gain cluster.
8. Remove the feature blocks that have low gain. If no feature blocks have been removed, STOP.
9. Let F be the union of the features over the high-gain feature blocks, set L_0 = L_F, and return to step 1.
The features selected are then the ones remaining in F.
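The gain-clustering heuristic of step 7 might be sketched as follows. This is a simplification rather than the exact procedure: the outlier pre-step (b) is omitted for brevity, and the function name is our own.

```python
import numpy as np

def split_gains(gains):
    """Sort gains in decreasing order and cut at the index of steepest
    descent (minimum slope), with the lower bound x0 = max(M/2, x0)
    from the text; returns (high-gain indices, low-gain indices)."""
    order = np.argsort(gains)[::-1]            # indices, highest gain first
    G = np.asarray(gains, dtype=float)[order]
    M = len(G)
    slopes = np.diff(G)                        # negative; minimum = steepest drop
    x0 = int(np.argmin(slopes)) + 1            # cut just after the steepest drop
    x0 = max(M // 2, x0)
    high = set(order[:x0].tolist())
    low = set(order[x0:].tolist())
    return high, low

gains = [0.9, 0.85, 0.8, 0.02, 0.01, 0.0]      # three clearly useful blocks
high, low = split_gains(gains)
```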
2. The recording was passed to the PREP pipeline along with the boundary event markers described above. The PREP pipeline was set up to perform the following actions: remove line noise at 50 Hz in a filter-agnostic manner; compute the common average reference; and detect and interpolate channels that were bad due to large deviations relative to the common average reference.
3. The cleaned channels from PREP were then epoched to remove inter-stimulus segments and fed to the amica program with the PCA reduction option. The ICA sphering and weight matrices returned by amica were used to compute the ICA components.
4. Next, the ICA components were passed to the ADJUST toolbox. The ADJUST function identifies four types of artifacts in the independent components: horizontal eye movements, vertical eye movements, eye blinks and "generic discontinuities". For each subject, ADJUST identified between 1 and 7 eye movement and blink components over the two batch recordings. If "generic discontinuities" were included as artifacts, then ADJUST identified between 2 and 14 artifact components for each subject.
5. ICA components identified as eye blinks, horizontal eye movements or vertical eye movements were removed. Generic discontinuities were ranked by importance, and only as many were removed as would cap the maximum number of channels removed at 10. The EEG channels were then reconstructed from the remaining non-artifact ICA components using the inverse ICA matrices.
6. In view of the above, the maximum rank of the (channels × time) batch recordings was reduced from 32 to a number between 21 and 31. Since we needed a common set of channels for all the subjects, 21 common reconstructed EEG channels were chosen from the 10-20 system in a way that excluded interpolated channels and maintained an even distribution of channel locations over the scalp. The channels chosen were: AF3, F7, F3, FC5, T7, CP5, P7, P3, Pz, O1, O2, P4, P8, CP6, T8, FC6, F4, F8, AF4, Fz and Cz.
7. As a final step, data values that were extreme outliers in each channel of each recording were interpolated.The outliers were detected by computing a robust mean and robust standard deviation for each channel and using a cutoff of 2.5 standard deviations.
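The final outlier-interpolation step might be sketched as follows. The text does not specify which robust estimators were used, so the median/MAD choice below is an assumption, as is the function name.

```python
import numpy as np

def clean_channel(x, cutoff=2.5):
    """Flag samples more than `cutoff` robust standard deviations from a
    robust center (median, with MAD scaled to match the std under
    normality) and replace them by linear interpolation."""
    x = np.asarray(x, dtype=float).copy()
    med = np.median(x)
    mad = np.median(np.abs(x - med)) * 1.4826   # MAD -> std under normality
    bad = np.abs(x - med) > cutoff * mad
    idx = np.arange(len(x))
    x[bad] = np.interp(idx[bad], idx[~bad], x[~bad])
    return x, bad

sig = np.sin(np.linspace(0, 6, 200))
sig[50] = 25.0                                  # inject one extreme outlier
cleaned, flagged = clean_channel(sig)
```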

DEAP feature preselection
To reduce the number of features, we used the procedure described in section 7 above. This procedure was performed separately for different sets of measures, where each set of measures had the same block size; this was done to prevent unequal block sizes from skewing the results. The order in which each blockwise elimination was performed was: 1. blocks generated by each node-wise measure; 2. blocks generated by each channel, channel asymmetry and channel caudality.
For the time domain feature based classification, all 14 weighted node-wise network measures were used while, interestingly, the binary network measures were eliminated (see table Main [1]).
For the CGC spectrum based classification using band-summed features, the measures eliminated depended on the particular emotion being classified. The measures that remained and were used in the final classification for each emotion are listed below (see table Main [1] for the full list of node-wise measures).

Figure 2 :
Figure 2: Coefficient scatter, glmnet-lasso. The figures show a scatter plot of the actual coefficients vs. the coefficients predicted by glmnet-lasso. For each of the four model sizes, the plots show the combined scatter over 50 random VAR models and 7 data lengths: 150, 200, 250, 300, 350, 400, 450.
(a) Correlation with true coefficients.

Figure 3 :
Figure 3: Additional coefficient performance measures. The figures show error measures for the coefficients computed by SVARGS as well as those computed by glmnet-lasso. For each number of variables above, the results were averaged over 50 random VAR models and 7 data lengths: 150, 200, 250, 300, 350, 400, 450.
(d) Single-core time (seconds) to compute CGC.

Figure 4 :
Figure 4: Conditional GC performance. The figures show error measures for the Conditional GC computed by SVARGS+CGC. For each number of variables above, the results were averaged over 50 random VAR models and 7 data lengths: 150, 200, 250, 300, 350, 400, 450.

Figure 6 :
Figure 6: Performance with added noise, plotted against the relative size of the standard deviation of the added noise. The data length used was 200 and the results were averaged over 10 samples.

Table 4 :
Feature counts for different combinations of feature groups (spectral).