Large-scale nonlinear Granger causality: A data-driven, multivariate approach to recovering directed networks from short time-series data

To gain insight into complex systems it is a key challenge to infer nonlinear causal directional relations from observational time-series data. Specifically, estimating causal relationships between interacting components in large systems with only short recordings over few temporal observations remains an important, yet unresolved problem. Here, we introduce a large-scale Nonlinear Granger Causality (lsNGC) approach for inferring directional, nonlinear, multivariate causal interactions between system components from short high-dimensional time-series recordings. By modeling interactions with nonlinear state-space transformations from limited observational data, lsNGC identifies casual relations with no explicit a priori assumptions on functional interdependence between component time-series in a computationally efficient manner. Additionally, our method provides a mathematical formulation revealing statistical significance of inferred causal relations. We extensively study the ability of lsNGC to recovering network structure from two-node to thirty-four node chaotic time-series systems. Our results suggest that lsNGC captures meaningful interactions from limited observational data, where it performs favorably when compared to traditionally used methods. Finally, we demonstrate the applicability of lsNGC to estimating causality in large, real-world systems by inferring directional nonlinear, multivariate causal relationships among a large number of relatively short time-series acquired from functional Magnetic Resonance Imaging (fMRI) data of the human brain.


Introduction
Detecting causal influences between components of a complex system, especially from simultaneously observed time-series, is an actively growing area of research [1,2,3,4]. The cause-effect relationships quantified by any approach can be represented as a network graph. Networks are ubiquitous in natural as well as man-made systems. Examples of networks are interactions between individual neurons, regions in the brain, protein interaction, genetic networks, etc. These network graphs can be analyzed to reveal important properties of the system being studied. For example, investigating such a graph constructed from brain activity of healthy subjects and patients with some form of neurodegeneration can reveal vital information useful for diagnosis and treatment.
One of the most widely used approaches for estimating causal relations from time-series data is Granger causality analysis [5]. It estimates causal influence from one time-series to another, if the prediction quality of the influenced time-series improves when the past of an influencer time-series is used, as compared to its prediction quality when the past of the influencer is not used. The definition of Granger causality (GC) establishes causal interactions by having a strict flow of time. Since it was originally formulated for linear models, its application to nonlinear formulations was treated with skepticism. However the problem of GC was extended to non-linear systems with promising results [6]. Ideally, a causality analysis method should 1) be multivariate, 2) be able to capture nonlinear dependencies, 3) work for systems with large number of variables, and 4) be data-driven. While GC is a multivariate analysis approach with both linear and nonlinear variants, it, very often, cannot be extended to large systems since the problem becomes an ill-posed, underdetermined one.
In systems containing more than two time-series, performing a bivariate analysis -i.e. considering only pairs of time-series at a time without taking into account the effects of confounding variables -results in spurious causalities as a consequence of indirect connections [7]. In [8], the authors have shown that a bivariate analysis could result in misleading information regarding propagation of influence, while multivariate analysis distinguishes direct from indirect influences. Although a multivariate analysis can produce better estimates [9] of casual relations, the complexity of the model increases with increasing number of time-series in the system, hence making it computationally infeasible. Additionally, redundant variables can lead to underestimation of causal influences [10]. In summary, an approach that estimates multivariate interactions, while reducing redundancy, and being computationally feasible would be desired.
Furthermore, most systems in nature show complex dynamics which cannot be represented well by linear approaches [11,12,13]. Nonlinear time-series analysis approaches have the advantage of capturing interaction patterns among components of such systems. Nonlinear models are more realistic, however, at the cost of increased computation time, and possible increase in number of parameters to be estimated. For linear stochastic systems, using nonlinear analysis approaches may not provide any benefit and could produce worse results. Nevertheless, ideally, an approach that can possibly capture multivariate, nonlinear interactions in large systems is desired.
Non-linear extensions of GC have been proposed in recent literature [6], including, kernel-based non-linear GC approaches [14,15,16] has gained most traction. However, nonlinear Granger causality methods, such as those cited above, have not been applied to large systems. Possible reasons for this restrictions are: Besides computational expense, the extendibility to multivariate analysis of high-dimensional dynamic systems based on a low number of temporal observations, involves specific challenges regarding parameter optimization of sophisticated nonlinear time-series models on limited data. In this work, we address this bottleneck and introduce a large-scale Non-Linear Granger Causality (lsNGC) approach for estimating network structure. By introducing a nonlinear dimension reduction step, lsNGC aims at directed, multivariate time-series causality analysis in large complex networks. We demonstrate the applicability of lsNGC on estimating connectivity from resting-state functional MRI. However, lsNGC may be useful for other domains as well, given that the data is represented as signals simultaneously acquired.
In the following sections we discuss lsNGC and the various networks we investigate. We compare our approach with two standard methods: 1) Kernel Granger causality [15], which is multivariate and nonlinear, and 2) an approach that is bivariate and nonlinear, using local models (LM) [11] to estimate causal influences. We test performance of simulated data with known ground truth of connections. Additionally, we demonstrate the application of the proposed lsNGC approach on real time-series data recorded using functional Magnetic Resonance Imaging (fMRI) from subjects presenting with symptoms of HIV associated neurocognitive disorder (HAND) and healthy controls. If lsNGC measures can characterize brain connectivity well, it should be useful in distinguishing the two subject groups.

Large-scale Nonlinear Granger causality
Large-scale nonlinear Granger causality adopts theoretical concepts from Granger causality analysis. Granger causality (GC) is based on the concept of time-series precedence and predictability; here, the improvement in the prediction quality of a time-series in the presence of the past of another time-series is quantified. This reveals if the predicted time-series was influenced by the time-series whose past was used in the prediction, uncovering the causal relationship between the two series [5] under investigation. The supplementary material (section 1) details the theoretical concepts involved in Granger causality analysis.
LsNGC estimates causal relationships by first creating a nonlinear transformation of the statespace representation of the time-series, whose influence on others is to be measured, and another representation of the rest of the time-series in the system. Consider a system with N time-series, each with T temporal samples. Let the time-series ensemble X ∈ R N ×T be X = (x 1 , x 2 , ..., x N ) T , where x n ∈ R T , n ∈ {1, 2, ..., N }, x n = (x n (1), x n (2), ..., x n (T )). The time-series ensemble X can also be represented as Let's say that we are interested to learn if x s influences x r . We first construct the phase space representation of x s with embedding dimension d, as W s . The state at time t is w s , and t ∈ d, ..., T − 1. Say we are interested in quantifying the influence of x s on x r in the presence of all confounding variables and also by modeling nonlinearities present in the data. Confounding variables can be accounted for by performing a multivariate analysis. Additionally, we account for nonlinear interactions among time-series by transforming the original space using a nonlinear transformation function.
To perform a multivariate analysis, it is desirable to have a phase space reconstruction, where prediction is performed using all the time-series, apart from x s whose influence is to be quantified. From the time-series ensemble X\x s we construct the phase space reconstruction Z s . The state of this multivariate system at a given time-point is It should be noted that Z s does not contain any terms from x s .
In brief, we have constructed two systems represented by phase states W s and Z s . W s represents the states of only the time-series whose influence we want to quantify, i.e x s , and Z s represents the multivariate phase space incorporating all time-series but x s .
Coming back to Granger causality (GC), GC works on the principle that if the prediction quality of a time-series x r improves in the presence of x s as compared to its prediction quality in the absence of x s , having considered the rest of the time-series in both models, then x s Granger causes x r . It quantifies boost in the prediction quality, by comparing two models, one that uses information from the states of x s and the other that does not. Let f and g represent two nonlinear functions. The two estimates of x r are given by:x r,s = a 11 f (Z s ) + a 12 g(W s ) (1) In the above equations, a and b are the weights or model parameters, obtained by minimizing the mean squared errors in the estimate of x r . The quantitiesx r,s andx r,s are the estimates of x r calculated by the two models. The subscript (r, s) denotes that these models were constructed to investigate the influence of x s on x r . In this study, we use the generalized radial basis function (GRBF) as nonlinear transformations f and g. Mathematical formulation of the GRBF is provided in the supplementary material (section 2.2). In brief, representative clusters of the state space Z s and W s are obtained using clustering methods, such as k-means clustering, where k can be seen as the number of hidden neurons in a GRBF neural network. Let c z and c w be the number of hidden layer neurons for Z s and W s , respectively.
The f -statistic can be obtained by recognizing that the two models, equations (1) and (2), can be characterized as the unrestricted model and the restricted model, respectively. Residual sum of squares (RSS) of the restricted model, RSS R , and residual sum of squares of the unrestricted model, RSS U , are obtained. Here, n is the number of time-delayed vectors, p U and p R are the number of parameters to be estimated for the unrestricted and restricted model, respectively. For equations (1) and (2), p U = c z + c w and p R = c z , respectively.
A measure of lsNGC can be obtained using the f -statistic, given by: F xs→xr quantifies the influence of x s on x r , by testing the equality of variances of errors in prediction of the x r by both the models i.e. equations (1) and (2). If the variance of the error in predicting x r is lower when x s is used, then x s is said to Granger cause x r . The measure F xs→xr is stored in the affinity matrix S at position (S) s,r , where S is an N × N matrix of lsNGC indices. Each lsNGC measure in the affinity matrix can be represented as a directed edge connecting the s th node to the r th node in a network graph. Implementation specifics that make lsNGC computationally efficient and various parameter information are provided in the supplementary material, section 2. We will also make publicly available the lsNGC MATLAB software.

Results
To evaluate the approach, several benchmark simulations are considered and performance is compared to two state of the art approaches, Kernel Granger Causality (KGC) [15] and mutual nonlinear cross-mapping methods [11] using local models (LM). These two approaches are discussed briefly in the supplementary material, section 5.

Simulated data network models
We begin by creating benchmark simulations. Fifty different sets of each type of simulation were created, this is useful for estimating the consistency of the method. All the simulations were generated to have 500 time-points. Two species logistic model: Before investigating empirical data or systems with a large number of time-series, it is imperative to test performance on a simple network structure with directed interaction. To this end, the two species logistic model which is one of the commonly studied [17] chaotic time-series systems is considered: where r 1 = 3.7, r 2 = 3.8, and γ 1,2 and γ 2,1 are the coupling constants. We adopt all values used from [11,17]. For the unidirectional case, the coupling constants take the values γ 2,1 = 0.32, and γ 1,2 = 0. Uniformly distributed random numbers between [0, 1] are used as initial conditions and the first 50 time points are discarded. In our results, we refer to this network as 2-logistic ( Figure 1). All the lsNGC scores are estimated using 3. Figure 2 is a histogram of the lsNGC scores (normalized between 0 and 1 using min-max normalization for display), assigned to x 1 → x 2 and x 2 → x 1 over the 50 different sets of the simulation. LsNGC is able to capture directed connections from x 1 → x 2 well which is evident from Figure 2 by the high scores assigned to x 1 → x 2 compared to x 2 → x 1 . Detailed comparative quantitative results for the performance of various algorithms on all simulated networks shown in Figure  Complex system with three nodes: Following this, we consider a slightly more complicated system with three nodes. The coupling parameters of this system can be set such that it can show fan-in and fan-out causality, as can be seen in Figure 1.
In the fan-out case, nodes x 2 and x 3 are both driven by a common source, node x 1 , hence the dynamics of the two driven nodes contain information from x 1 . Thus, although x 2 and x 3 do not causally influence each other, they may be correlated. Such motifs can be challenging and an approach that is able to characterize these connections well is desirable. LsNGC is able to capture the connections well and is able to recover the fan-out network structure ( Figure 3a). Figure 3a, clearly demonstrates that the scores assigned across all 50 simulations by lsNGC for x 1 → x 2 and x 1 → x 3 are much higher than any of the other connections. Additionally, no spurious connection is estimated between x 2 and x 3 .
The challenge faced when estimating connections with fan-in motifs is that since x 3 is influenced by x 1 and x 2 , detected relationships are generally weak, since the dynamics of x 3 is affected by two time-series. From Figure 3b, we observe that the highest strengths of connection across all 50 simulations is rightly assigned to x 1 → x 3 and x 2 → x 3 . However, we do observe lsNGC assigns relatively high strengths (∼0.5) to x 1 → x 2 and x 2 → x 1 for a few of the 50 simulations. It would be interesting to dive into presence of such connections in more detail. We suspect this happens as a consequence of a multivariate model, since the v-structure [18] Nevertheless, as can be seen from Figure 3, the scores of the true connections are still higher than those of such spurious connections with only minimal overlap of the histograms. Five node nonlinear network: We also generate time-series data similar to that described in [19] using the KGC toolbox (https://github.com/danielemarinazzo/KernelGrangerCausality). This toolbox contains both linear and non-linear implementations of interactions between time-series. Here, we present equations governing the non-linear 5-node network (5-nonlinear), while equations for the linear system (5-linear) are provided in the supplementary material.
Results estimating direction of connection are provided in the supplementary material (section 4). 34 node Zachary network: Systems in nature involve of a number of interacting factors. Hence, it is important to evaluate systems with a considerably large number of interacting time-series. To test lsNGC on networks with a large number of nodes, we consider the Zachary dataset [21] consisting of 34 nodes. The nodal interactions is as follows and adopted from [15]: Here, a = 1.8, s = 0.01, c = 0.05, where c i,j represents the influence j has on i, and τ is Gaussian noise with unit variance and zero mean. These quantities were adopted from [15], where, the authors construct directed networks by assigning an edge, with equal probability of being in either direction. Apart from the directed connections, we randomly select 5 edges to be bidirectional. We construct 50 such networks and obtain 50-sets of time-series data from the corresponding network (34-Zachary2). One of the 50 networks used is shown in Figure 1. From the generated time-series data we estimate the underlying network structure of the 50 different networks. We also construct another 50 sets of time-series using the original undirected Zachary network with c = 0.025 in equation (7) (34-Zachary1).

Evaluating network recovery of simulations
LsNGC derives measures of nonlinear connectivity scores represented as edges in a network. These are non-binary scores, from which we obtain a measure of the Area Under the receiver operating characteristic Curve (AUC). However, before deriving AUC measures, the connectivity matrix is log transformed to reduce the skew in the f -statistic measures. The Receiver Operating Characteristic (ROC) plots the true positive rate (TPR) versus the false positive rate (FPR), which shows the tradeoff between these quantities. Ideally, TPR should be 1 and FPR = 0 at any one threshold applied on the connectivity graphs, i.e. affinity matrix, for the AUC to equal 1. An AUC of 0.5 represents assignment of random connections, analogous to guessing the absence or presence of connections. Since the AUC quantifies both, the strength of connections and the direction of information flow, it is used to evaluate performance in estimating the true network structure. The AUC derives evaluation measures from the non-binarized connectivity matrix. However, it is also important to evaluate the statistical significance of connections i.e. edges in the network. The lsNGC measures of connectivity, expressed as f -statistic values, can be used to derive p-values for connections. Significant connections are obtained after multiple comparisons correction using False Discovery Rate (FDR) method at p < 0.05. From the thresholded affinity matrix, measures of sensitivity and specif icity are derived.
Here, we present quantitative results on the recovered network structure for the various simulations. For every network investigated in this study, 50 different sets of time-series were simulated. Results are summarized as boxplots (example: Figure 4, 5). The circle with a dot inside the box represents the distribution median. The box spans the first quartile to the third quartile which is its interquartile range (IQR). The vertical extensions from the box, whiskers, have a maximum length of 1.5 times the IQR. The median of the AUC, sensitivity and specificity are represented asÃU C method ,sens method andspec method , respectively, where method refers to the analysis method, i.e., lsNGC, LM or KGC.
Results in Figure 4 were obtained for all network structures in Figure 1 generated with 500 time-points. In this figure, red, blue and grey correspond to lsNGC, LM and KGC respectively. LsNGC, LM and KGC work very well for the smaller networks i.e. 2-logistic, 3-Fan out and 3-Fan-in networks. However, KGC's and LM's performance drops when using the linear system with 5 nodes, with aÃU C KGC = 0.82 andÃU C LM = 0.87, compared toÃU C lsN GC = 1. Additionally, LM performs poorly for the nonlinear 5 node networkÃU C LM = 0.62, while lsNGC and KGC perform comparably. For both, directed and undirected 34-node Zachary networks, performance of lsNGC drops compared to its performance for smaller networks. It is, however, comparable to bivariate LM. The multivariate KGC performs poorly, with most recovered networks being random at medians of 0.52 and 0.51 for the networks. KGC performs poorly since it is cannot capture right connections for a relatively large network with just 500 time-points. In the original paper [17], the authors tested the Zachary network, but with 10,000 time-points; which is an unrealistic scenario for most practical applications. The bottom end of the box represents the first quartile and the top of the box represents the third quartile. The circle with a dot in the box represents the distribution median. A general trend that is noticeable here is that the performance drops for all approaches as the number of nodes increases. Note that lsNGC performs competitively for all networks with LM slightly inferior for 5-linear and KGC significantly worse for all non-trivial networks.
Given, the F-statistic for the lsNGC measure, we obtain significant connections in the recovered network as described in section 2.1. Figure 5 plots the sensitivity, specificity and a measure combining them. Here, we observe that for small networks with 2-3 nodes, all approaches perform well with the exception of KGC for the fan-out network. For the 5-node networks, LM performs quite poorly, whereas overall, lsNGC does better than the two. Large networks, such as the Zachary network with 34 nodes, are generally difficult to recover since the total number of possible connections grows as a function of N (N − 1). Here, we observe that KGC is the poorest of the methods tested. LM and lsNGC are comparable. However, it should be stressed that the calculation of significant connections for LM is not as straightforward as compared to lsNGC. Since a statistical measure cannot be derived directly from the score assigned by LM, surrogate time-series need to be created, from which LM measures are obtained, creating a null distribution. Details of this can be found in the supplementary material section 6.
Due to various constraints, such as cost, sensor limitations, manpower, time, etc., it is not always feasible to collect a small number of observations (time-points) for the factors under investigation. Thus, it is also essential to test the network recover-ability of lsNGC for fewer number of observations. To this end, the time-series length is varied from 500 to 50 time-points. Figure 6 compares AUC results across methods for decreasing number of time-points. Clearly, of the three approaches, KGC performs the poorest. For small networks, having 2-3 nodes, all methods perform comparably; however, a sudden drop in performance of KGC occurs when the time-series length reduces to 50 time-points for the 3 node networks. It is interesting to note that the sudden drop in KGC's performance is observed much earlier, at 200 time-points, for the 5-node network, while for the . We also plot a measure combining the two measure to estimate overall performance. In general we observe that KGC performs most poorly.
34-node Zachary network, its performance oscillates across a median AUC of 0.5, indicating detection of random connections for 500 to 50 time-points. LsNGC and LM perform equally well for the networks with 2-3 nodes across different time-series lengths. For the 5-node network, results indicate that LM reaches a bottleneck in performance, and is not able to improve as much as lsNGC can with increased time-series length. LsNGC has an AUC = 1 for the 5-node linear network for T = 500, whereas, the best performance with LM is a median AUC = 0.87. Nevertheless, it is important to note that network structure recovered with LM does not degrade with decreasing time-series lengths. For the 5-node nonlinear network, we see that the best performance with LM isÃU C LM = 0.62, whereas the worst for lsNGC isÃU C lsN GC = 0.82. Additionally, we investigate both undirected (34-Zachary1) and directed networks (34-Zachary2). The drop in performance of lsNGC is a lot steeper than that of LM with decreasing time-series lengths. However, its performance is comparable to LM at T > 200.

Functional Magnetic Resonance Imaging data
LsNGC showed promising results on the simulations. Nevertheless, its performance on real data can give more insight into its usability for various applications involving network graph estimation from signals. In this work, we analyze its performance on functional Magnetic Resonance Imaging (fMRI) data. It has been demonstrated that individuals presenting with symptoms of HAND have quantifiable differences in connectivity [22,23] from controls. We hypothesize that if lsNGC can capture brain connectivity from fMRI data for the subjects well, differences in connectivity between subjects with HAND and controls should be observed. Hence, we tested how well a classifier was able to discriminate the two subject groups. The classifier is able to learn relevant differences from the two groups using connectivity derived with lsNGC (AUC =0.88 and accuracy = 0.77) suggesting that lsNGC was able to characterize the network structure well. More details on the data and analysis approach can be found in appendix C.

Discussion
In this paper we introduce a novel approach called large-scale nonlinear Granger causality (lsNGC). This approach is a nonlinear, multivariate variant to Granger causality that can estimate connections in systems with a large number of time-series. We demonstrated its applicability on various simulations and on real data. In this section, we discuss the performance of lsNGC in instances where it effectively recovered network structure, as well as in simulations where it did not. All investigated methods, i.e. lsNGC, LM andnKGC have in common that for a given number of observations (time-points) their performance drops with increasing number of nodes. However, performance may be improved by increasing the number of observations. We observe that lsNGC outperforms the two other nonlinear approaches in most cases. Further, KGC is a lot more susceptible to poor performance with increased number of nodes (time-series, Figure 6) compared to lsNGC.. When increasing the number of nodes in the network, the number of time-points has to be significantly increased to produce meaningful results with KGC. Although, lsNGC's performance gradually drops with decreasing number of time-points, for all practical lengths of time-series, it outperforms KGC, making lsNGC more reliable for larger systems with fewer time-points.
Comparing lsNGC with LM, it is seen that the network structure recovered with LM does not degrade as rapidly with decreasing time-series lengths. This can be attributed to the lesser complexity of LM, hence fewer parameters to be estimated compared to lsNGC. As such, given very short time-series LM may be able to outperform lsNGC. LM, having fewer parameters, is a less complex model but with high bias. Such high bias comes at a the price of significant performance drop of LM, which is clearly demonstrated in the 5-node nonlinear network as a good example. Here, we see that LM performs best withÃU C LM = 0.62, whereas the worst result for lsNGC is AU C lsN GC = 0.84. To put it simply, LM is a low variance, high bias model, whereas lsNGC is a higher variance, lower bias model. This becomes more evident when analyzing the network with 34 nodes. The drop in performance of lsNGC is a lot steeper than that of LM with decreasing time-series lengths. However, its performance is comparable to LM at T > 200. Additionally, the formulation of lsNGC can be directly used to estimate significant connections, unlike LM, where a null distribution needs to be created from a set of surrogate time-series (Supplementary material section 5). This increases the computational cost to a large extent. The flexibility of estimating significant connections with lsNGC is a significant advantage over traditional approaches for detecting causality.
The success of lsNGC on simulated data motivated us to test its performance on real data. To this end, we evaluated the connectivity matrices derived using lsNGC on real fMRI data from healthy controls and subjects with HIV associated neurocognitive disorder (HAND). The connectivity measures used as features in a classifier were highly discriminative. This demonstrates that lsNGC is able to capture relevant information regarding interaction between different regions in the human brain.
In summary, lsNGC is quite robust in recovering network structure across different network architectures. Similar to all investigated methods, network size does affect its performance; however, it significantly outperforms conventionally used methods such as KGC for practical time-series lengths ( Figure 6). In comparison, as LM is a bivariate method, it does not consider confounding time-series in its models, and falls short at capturing indirect connections. LsNGC has the benefit of being a nonlinear, multivariate method whose formulation provides control over the number of parameters to be estimated, as such, lsNGC can effectively estimate network structure even in high-dimensional systems with short time-series.

Conclusion
In this work, we introduce an approach for estimating underlying directed network structure from time-series data. We propose a multivariate, nonlinear method called large-scale nonlinear Granger causality (lsNGC) for detecting causal interactions between time-series. Most approaches proposed in literature for performing multivariate nonlinear causality analysis are limited by the number of samples of the time-series. Additionally, few multivariate approaches can handle large number of nodes in the network due to computational limitations, making them impractical for large networks. The practicality and limitations of methods can be investigated through experimentation and analysis with different types of networks. In this study, we demonstrated the advantage of lsNGC over traditional multivariate and bivariate approaches to recovering directed network graphs. The high AUC, good sensitivity and specificity results for realistic lengths of time-series data demonstrates its potential to be applied to real world data. Additionally, although the bivariate LM performs comparable to lsNGC in some instances, obtaining the true, unweighted network structure using LM is difficult since it requires the calculation of surrogate time-series, which is computationally very expensive, especially for large networks. Finally, we have demonstrated the applicability of lsNGC in inferring brain network graphs from brain activity data obtained using functional magnetic resonance imaging (fMRI). Besides clinical applications for diagnosis of neurological disorders, such an approach may be able to reveal useful insights about directed interactions in the brain.

A. Granger causality analysis
The principle of Granger causality (GC) is based on the concept of precedence and predictability, where the improvement in prediction quality of a time-series in the presence of the past of another time-series is evaluated and quantified, revealing the directed influence between the two series [5] investigated. As lsNGC is an extension of traditional multivariate GC (mvGC), the basic concepts of mvGC are briefly described here.
Consider a system with N time-series, each with T temporal samples. Let the time-series ensemble X ∈ R N ×T be X = (x 1 , x 2 , ..., x N ) T , where x n ∈ R T , n ∈ {1, 2, ..., N }, x n = (x n (1), x n (2), ..., x n (T )). The time-series ensemble X can also be represented as We use multivariate vector auto-regression which is the most common prediction scheme used in GC analysis [24,25].
Here, the matrices A j are the model parameters obtained by minimizing the mean squared errors in the estimate of X, where A j is an N × N matrix with j ∈ 1, 2, ..., d, and d is the lag order. We defineX as the predicted system (1) without the error term, and E as the set of residuals defining the difference between the actual and predicted values of X.
Granger causality (GC) analysis establishes a causal influence score from time-series x s to x r on the premise that, if the predictability of time-series x r improves in the presence of the past of time-series x s , then x s Granger causes x r . GC analysis estimates causal relationships in a multivariate sense by considering the full ensemble of time-series including confounding time-series that neither represent x s nor x r . We obtain the influence of x s on x r by quantifying the reduction in prediction quality of x r in the absence of time-series x s . Equations (2) and (3) obtain the prediction of all series when the full ensemble of time-series is used. Next, we obtain the prediction error E X\xs , when series x s is removed from the ensemble X.
If the prediction quality of x r is higher when x s is used (2) rather that when it is not used for predicting time-series x r , then it is said that x s Granger causes x r . The f -statistic can be obtained by recognizing that the two models can be characterized as the unrestricted model and the restricted model, where the unrestricted model is eq. (9), and the restricted model is constructed in the absence of time-series x s . Residual sum of squares (RSS) of the restricted model, RSS R , and residual sum of squares of the unrestricted model RSS U are obtained. n is the number of observations in the regression, q U and q R are the number of parameters to be estimated for the unrestricted and restricted model, respectively. For traditional multivariate GC, q R = d × (N − 1) and q U = d × N , respectively.
A measure of GC can be obtained using the f -statistic, given by: F GC xs→xr quantifies the influence of x s on x r , by testing the equality of variances of errors in prediction of the x r by both the models. If the variance of the error in predicting x r is lower when x s is used, then x s Granger causes x r . Significant interactions can be obtained using the f -statistic.

B.Implementation specifics for large-scale Nonlinear Granger causality (lsNGC) Implementation specifications
Nonlinear Granger causality analysis has been investigated and theoretical work laying the foundation of mathematical formulation to perform such an analysis has been studied [14,16,26]. However, while sound theoretical concepts are prerequisite, practical implementation, especially for large systems having many nodes (time-series) is not always straightforward. In this section, we briefly describe implementation specifications that increase scalability and reduce computation time significantly.
Let us say that we are interested to learn if x s influences x r . We first construct the phase space representation of x s with embedding dimension d, as W s . The state at time t is w s (t) = [x s (t − (d − 1)), ..., x s (t − 1), x s (t)], and t ∈ d, ..., T − 1.
To perform a multivariate analysis, a phase space reconstruction is constructed, where prediction is performed using all the time-series apart from x s whose influence is to be quantified. From the time-series ensemble X\x s we construct the phase space reconstruction Z s . The state of this multivariate system at a given time-point is It should be noted that Z s does not contain any terms from x s . Let f and g represent two nonlinear functions. The two estimates of X are given by: In the above equations, A 1 , A 2 and B 1 are the weights or model parameters, obtained by minimizing the mean squared errors in the estimate of all time-series in X. The time-seriesx r,s fromX andx r,s fromX\x s are the estimates obtained for x r . The subscript (r, s) denotes that these are estimates of x r obtained to investigate the influence of x s on x r . The two constructed models, given in eq. (12) and (13), are the unrestricted and restricted model, respectively. The RSS of the two models gives us an f -statistic measure. In this study, we use the generalized radial basis function (GRBF) neural network as nonlinear transformations f and g.
where, i ∈ {1, 2 ... c g } Analogously, cluster centers U T s ∈ R c f ×(N −1)d are calculated for the state space Z s , where c f is the number of clusters obtained with k-means clustering. Activation function f in (5) and (6) is calculated as follows: where, i ∈ {1, 2 .
.. c f } The embedding dimension d is chosen using Cao's method described in [27]. In this study, c f = 25 and c g = 5 is chosen empirically from preliminary analysis.

5-node linear network
The linear implementation of interactions between the 5-node network time-series (whose network structure is the same as in example 3 of reference [19]) is provided here:

5-node time-series results
The true connections of the 5 node network can be summarized as follows, x 1 → x 2 , x 1 → x 3 , x 1 → x 4 , x 4 → x 5 and x 5 → x 4 , for the linear and non-linear case. LsNGC in the linear case clearly assigns high scores to the right connections, Figure 7. LsNGC correctly assigns a high score to Resting-state fMRI data: Functional MRI scans were obtained from human subjects at the Rochester Center for Brain Imaging (Rochester, NY, USA) using a 3T, Siemens Magneton TrioTim scanner. The study protocol included: (i) High-resolution structural imaging using T1-weighted magnetization-prepared rapid gradient echo sequence (MPRAGE, TE = 3.44 ms,TR= 2530 ms, isotropic voxel size 1 mm, flip angle = 7 • ).(ii) Resting-state fMRI scans using a gradient spin echo sequence (TE = 23 milliseconds, TR = 1650 milliseconds, 96 × 96 acquisition matrix, flip angle of 84 • ). The acquisition lasted 6 minutes and 54 seconds, and 250 temporal scan volumes were obtained. A total of 25 slices, each 5 mm thick, were acquired for each volume. During acquisition, the subject was asked to lie down still with eyes closed. The data were acquired as part of a NIH sponsored study (R01-DA-034977). Prior to computation of connectivity measures, the fMRI data used in this study was preprocessed using standard methodology. For each dataset, the first ten (of 250) volumes of functional magnetic resonance images were eliminated to analyze only those that reached steady-state imaging. Next, motion correction, brain extraction and correction for slice timing acquisition were performed. Additional nuisance regression was carried out to remove variations due to head motion and physiological processes. Each dataset was finally registered to the 2 mm MNI [28] standard space using a 12-parameter affine transformation [29]. All preprocessing steps were carried out using the C-PAC software [30] and its corresponding dependencies in FMRIB Software Library (FSL) [31]. Finally, the time-series were normalized to zero mean and unit standard deviation to focus on signal dynamics rather than amplitude [32]. Based on the commonly used Automated Anatomic Labeling (AAL) template [33], the registered MRI volumes were divided into 90 regions, excluding the brain stem and cerebellar regions, 45 in each hemisphere. A representative time-series for each region was computed by averaging the time-series of all voxels within it. Subjects in this study were recruited as part of a NIH funded study (R01-DA-034977) at the University of Rochester Medical Center. In total, 15 healthy controls (mean age 42 ± 10 years) and 14 HIV positive subjects with symptoms of HIV associated neurocognitive disorder (HAND, mean age 45 ± 16 years) were recruited as part of this study. A standard battery of neuropsychological (NP) tests was used to assess cognitive abilities of subjects, covering six cognitive domains: executive function, speed of information processing, attention, memory, learning, and motor function. These scores were converted to age and education adjusted z-scores. An overall z-score combining the scores from individual domains was generated and used to assess HAND [34]. All participants provided written informed consent prior to participation as per protocol approved by the institutional IRB.
Application to fMRI data: It has been demonstrated that individuals presenting with symptoms of HAND have quantifiable differences in connectivity [22,23] from controls. We hypothesize that if lsNGC can capture brain connectivity from fMRI data for the subjects well, differences in connectivity between subjects with HAND and controls should be observed. Hence, we tested how well a classifier was able to discriminate the two subject groups. First, matrices of interaction using lsNGC, connectivity matrices, were obtained from the fMRI time-series of every subject in this study. LsNGC produced connectivity profiles conveying different information about interactions amongst regional time-series for every subject. These connectivity matrices were vectorized such that data from each subject was represented as an F dimensional vector of interactions, which can be viewed as features to train a classifier. Since, the number of features were large (∼ 8000 for N = 90, where N is the number of regions defined by the AAL template) compared to the number of subjects in our study, before vectorizing the connectivity matrices, we symmetrized them such that F = N (N − 1)/2, reducing the number of features to 4005 for each subject.
To further reduce the number of redundant and/or noisy features, we performed feature selection, using Kendall's τ correlation coefficient [35] . Feature selection aims at estimating interactions that best discriminate subjects presenting with HAND symptoms and controls and reduces the computational complexity of a classifier. Feature selection was performed independently for every set of training/test set split. Feature ranking estimated by Kendall's coefficient feature selection approach was used to select s features where s ∈ {2, 3,5,8,10,15,20,25, 50, 80}.
The ranked features were classified with the AdaBoost [36] classifier. It is an ensemble classifier that uses an ensemble of weak classifiers, such as decision stump classifier, to produce a strong classifier. The WEKA [37] implementation of AdaBoost was imported into MATLAB 2016 and used in our analysis.
We ensured a strict split between training and testing data, with 90%/10% train/test separation within an iterative cross-validation scheme with 100 different data splits. The training data was used for both feature selection and training the classifier. Classification accuracy and the AUC were adopted to evaluate performance. An AUC of 1 indicates perfect classification while an AUC of 0.5 indicates random classification. The classifier for discriminating subjects with HAND and healthy controls achieved best mean AUC = 0.88 and accuracy = 0.77 with 15 features (Figure 9) which suggests that lsNGC was able to characterize the network structure well, hence the classifier was able to learn discriminate characteristics well.

D. Comparative methods for evaluating lsNGC
In this section we briefly summarize the two comparison methods used to evaluate the performance of lsNGC.

Causality analysis with local models
An example for the use of local models (LM) to estimate casual relationships between time-series in a system is described in [11]. First, the attractor manifold of a time-series is constructed from time-delayed versions of itself and embedded into a state-space reconstruction (SSR). Such a space of embedded points transforms the observed time-series into a manifold [38,39] representing the evolution of states, i.e., its dynamics. The basic idea here is to build local models (using nearest neighbors) of the state space dynamics for every time-series, which cross-predicts the trajectory of an influencing time-series. Where, cross-prediction implies predicting say x 1 from the states of x 2 . If x 2 cross-predicts x 1 well, then x 1 influences x 2 .

Kernel Granger causality (KGC)
Kernel Granger causality (KGC) was proposed in 2008 [15]. It calculates Granger causality in the feature space of kernel functions. In this work, we use radial basis function kernels for KGC. [19] We used the publicly available KGC toolbox (https://github.com/danielemarinazzo/KernelGrangerCausality) to perform this analysis.

E. Estimating significance of connections from networks obtained
Measures of significance using lsNGC can be obtained from the f -statistic (equation (3)) and details regarding obtaining significance of connections with KGC can be found in [15]. Unlike lsNGC and KGC, LM measures cannot be used directly to glean significance measures. Significance is calculated by establishing a null distribution, which was obtained by estimating LM measures between non-interacting pairs of surrogate time-series. The surrogates were generated by using the Iterative Amplitude Adjusted Fourier Transform [40] algorithm generated with the Chaotic Systems Toolbox [20]. Sensitivity and specificity were calculated on the connectivity matrix thresholded at p ¡ 0.05, followed by multiple comparisons correction using False Discovery Rate, where the null hypothesis is generated by estimating connections between non-interacting surrogate time-series.  Figure 6: The effect of time-series length on performance of lsNGC, LM and KGC is studied here. We observe that for large networks, KGC requires a lot samples than LM or lsNGC to effectively recover the true network structure. LsNGC, on the other hand, performs well, with its performance reducing with fewer samples. LM is comparable to lsNGC; however, its does not perform too well for the 5-node network. The results are visualized as boxplots. The bottom end of the box represents the first quartile and the top of the box represents the third quartile. The circle with a dot in the box represents the distribution median.  : Plot of AUC and accuracy results for different number of retained features. Shaded regions above and below each solid line, corresponding to the mean AUC/accuracy, represent the 95% confidence interval of the AUC/accuracy value. The general trend observed here is that lsNGC is able to discriminate between the two subject groups well.