Fast and effective pseudo transfer entropy for bivariate data-driven causal inference

Identifying, from time series analysis, reliable indicators of causal relationships is essential for many disciplines. Main challenges are distinguishing correlation from causality and discriminating between direct and indirect interactions. Over the years many methods for data-driven causal inference have been proposed; however, their success largely depends on the characteristics of the system under investigation. Often, their data requirements, computational cost or number of parameters limit their applicability. Here we propose a computationally efficient measure for causality testing, which we refer to as pseudo transfer entropy (pTE), that we derive from the standard definition of transfer entropy (TE) by using a Gaussian approximation. We demonstrate the power of the pTE measure on simulated and on real-world data. In all cases we find that pTE returns results that are very similar to those returned by Granger causality (GC). Importantly, for short time series, pTE combined with time-shifted (T-S) surrogates for significance testing strongly reduces the computational cost with respect to the widely used iterative amplitude adjusted Fourier transform (IAAFT) surrogate testing. For example, for time series of 100 data points, pTE and T-S reduce the computational time by \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$82\%$$\end{document}82% with respect to GC and IAAFT. We also show that pTE is robust against observational noise. Therefore, we argue that the causal inference approach proposed here will be extremely valuable when causality networks need to be inferred from the analysis of a large number of short time series.


Results
First, we use the three DGPs described in Models to compare the performance of pTE, GC and TE in terms of the power and size. If by construction there is no causality from X to Y, the percentage of times the causality is higher than the significance threshold returned by the surrogate analysis will be called "size" of the test, i.e., is the probability that a causality is detected when there is no causality by construction. On the other hand, if by construction X causes Y, the percentage of times the method finds causality from X to Y is called "power" of the test. With the surrogate analysis adopted, the causality between the original data will be compared to the maximum one found within 19 surrogates 51 , and the probability that the original data displays by chance the highest causality is 5%.
We analyze the power and size for the two possible causal directions ( X → Y and Y → X ), as a function of the coupling strength and of the length of the time series. Figure 1 displays the power and size of the three methods, pTE, GC and TE, for the linear model, when the coupling is such that there is causality from Y to X (the size is shown in the top row, and the power, in the bottom row). The similarity between pTE and GC in finding the true causality is evident. With a coupling strength C < 0.1 the three methods fail to detect causality, while for C > 0.4 , for both pTE and GC, the number of data points in the time series needed to find causality is quite small, in fact 100 data points are sufficient to achieve a power of 100. In Fig. 2 we show results when we move along an horizontal or a vertical line in Fig. 1: we plot the power/size vs. the time series length, keeping fixed the coupling strength (left panel, C = 0.5 ) and vs. the coupling strength, keeping fixed the time series length (right panel, N = 500 ). In the left panel we notice that for C = 0.5 , a minimum of 200 data points are needed to retrieve the correct causality for all three methods with a power above 95. In the right panel, we notice that with 500 data points, a minimum coupling strength of C ≈ 0.25 is necessary to find a power larger than 95 for all three methods. Figure 3 displays the results obtained for the nonlinear model, and we notice that they are very similar to the ones obtained with the linear model, probably due to the weak nonlinearity considered. We note that, in comparison with the linear model, in this model, with short time series the power and size returned by the three methods are more similar.
Regarding the two chaotic Lorenz oscillators, which are coupled in the first variable, the situation is very different, as shown in Fig. 4. When looking at the causality between the coupled variables, for both pTE and GC the causality is detected for a moderate coupling strength and a rather long time series. Causality X → Y is not detected for any (coupling strength, time series length), which is correct by construction. TE instead finds causality also for X → Y , which is wrong by construction. This observation for TE can be attributed to insufficient conditioning treated by Paluš 19,52 , in fact the directionality of the coupling cannot be inferred when the systems are fully synchronized.
Next, we compare the computational cost of using pTE, GC and TE. Figure 5 displays the time required to calculate X → Y and Y → X causalities, as a function of the length, N, of the time series. The figure shows the time required when the codes are run on Google colab CPUs ( Intel ® Xeon ® CPU @ 2.20GHz), and includes preprocessing the time-series (detrending and normalizing) and performing the statistical significance test. www.nature.com/scientificreports/ For short time series we see a large advantage of using pTE instead of GC. TE sits back as the slowest of the three methods. The reason is attributed to the scaling of parameter k in the k-nearest neighbors method used to compute TE, which scales as √ N. Table 1 displays the computational time required to calculate X → Y and Y → X causalities, and the corresponding power and size obtained using the linear model. While in Fig. 5 we showed the total computational time, in Table 1 we show only the time required for the calculation of the pTE, GC and TE values (without signal preprocessing and without performing statistical significance analysis). We see that, for time series of 25 data points, the time required for pTE calculation (averaged over 1000 runs) is 200% faster than GC; however, this percentage reduces to 12% for time series of 500 data points. From these results, we argue on the value of using pTE to analyze a large number of short time series, which is often the case when causality methods are used to build complex networks from observed data. We remark that all the codes used to generate the results shown in this article are publicly available at GitHub 50 .
The use of time-shifted (T-S) surrogates 51,53 results in a substantial reduction of the computational time, in comparison to the widely used IAAFT surrogates, as seen in Fig. 5 and Table 2. The computational cost is reduced by approximately 98% , albeit displaying very similar results in terms of power and size. Clearly, T-S surrogates give a major boost in causality testing. As an example, for time series of length N = 100 , using pTE with T-S Figure 1. Power and size [the percentage of times that causality is detected when there is causality (power) and when there is no causality (size)] obtained using pTE (first column), GC (second column) and TE (third column) on the linear model, as a function of the length of the time series and of the strength of the coupling, for the two possible causality directions (top row: X → Y , bottom row: Y → X ). By construction the model has causality from Y to X; therefore, the top row displays the size, and the bottom row, the power. The performance of pTE and GC is very similar, as both find the correct causality with moderate coupling strength even for short time series. TE finds the correct causality, but for stronger coupling.  Fig. 1). In the left panel, we fix the coupling strength to 0.5 and we plot the power/size of the linear model as a function of the time series length for pTE, GC and TE. In the right panel, we fix the number of data points to 500 and plot the power/size as a function of the coupling strength. To study the resilience to observational noise, we add, to the time series generated with the DGPs, X and Y, a Gaussian noise ξ 1,2 of zero mean and unit variance, tuning its contribution with a parameter D ∈ [0, 1] . In this way we generate and analyse the signals X ′ and Y ′ given by X Figure 6 shows that pTE and GC perform very similarly (they are almost indistinguishable) and are quite resilient to noise. For the linear DGP, up to 40% of noise contribution can be present without a significant effect on the results, while for the nonlinear DGP, the methods start failing for a lower noise level. For the chaotic DGP the three methods are very resilient to noise. As previously noticed in Fig. 4, TE detects causality in both directions.
Finally, moving beyond synthetic data, we apply the pTE measure to two well-known climatic indices, and compare the results with GC and TE. The time series analysed, the NINO3.4 index and All India Rainfall (AIR) index, shown in Fig. 7, represent the dynamics of two large-scale climatic phenomena, the El Niño-Southern Oscillation (ENSO) and the Indian Summer Monsoon (ISM), whose causal inter-relationship is represented by Figure 3. As Fig. 1, but using the nonlinear model. We again see that pTE and GC both find the correct causality, and their performance is very similar. TE finds the correct causality, but for stronger coupling.  Fig. 1, but using the chaotic model composed by two coupled Lorenz systems. The performance of pTE and GC is very similar, as both find the correct causality when the time series is long enough, and the coupling strength is moderate. TE finds Y → X causality, but it also finds X → Y causality, which is wrong by construction.  Table 3 displays the results of the analysis of monthly-sampled data, and of yearly-sampled data. In the latter case we used the average of December, January and February (DJF) values, where the ENSO phenomenon peaks, and the average of June, July and August (JJA), where the monsoon peaks. Therefore, the length of the yearly-sample time series is 152 data points because for the last year the last data point, DJF, is not available. We used, for the yearly-sampled data, an autoregressive integrated moving average (ARIMA) model of order 4 (consistent with 16 ) and, for the monthly-sampled data, of order 3. The order of the model was selected by using the Akaike information criterion (AIC).  www.nature.com/scientificreports/ In Table 3 we see that for the yearly-sampled data, pTE and GC only detect the dominant causality (ENSO→ AIR), while TE detects both (in good agreement with 16 ). We note similarities with the results presented in Fig. 4: while unidirectional causality is found with pTE and GC, TE causality is found in both directions. The computational times clearly show that pTE is faster than GC (and of course also faster than TE, which is the slowest method). In the monthly-sampled data we see an opposite direction of causality, a result that we interpret as due   www.nature.com/scientificreports/ to different time scales in the mutual influence between ENSO and ISM: while ENSO effects on the Indian monsoon precipitations are pronounced on an annual time scale, the influence of the Indian monsoon on ENSO acts on a shorter, monthly time scale. To exclude the fact that this change in directionality is an artifact due to the different time series lengths, we analyzed the monthly-sampled time series using segments of 152 consecutive data points (which is the length of the annually-sampled data). In this case we did not find any significant causality, which suggests that the change in directionality when considering annually-sampled or monthly-sampled data is not an artifact but has a physical origin, that we interpret as due to different time scales in the mutual interaction and that 152 data points are not sufficient to find causality (in any direction) in the monthly-sampled data. Finally, we note that the computational times shown in Table 3 are higher than those that can be estimated from Fig. 5. In Fig. 5 we see that, for 150 datapoints, the time required for the GC calculation with T-S surrogate analysis is about 0.11 s while in Table 3 we see that the time required for GC and T-S calculation (two directions) is 0.36 s. The difference is due to the fact that in Fig. 5 a model of order 1 was used, while in Table 3, for the yearly-sampled data, a model of order 4 is used. The computational time increases with the order of the model, especially for GC, because the algorithm used (statsmodels grangercausalitytest) computes causality for all model orders up to the chosen one. For the NINO3.4 and AIR indices we also analysed the effect of varying the order of the model (from 1 to 10) and found either the same significant causal directionality (with stronger or weaker values), or we did not find any significant causality.

Discussion
We have proposed a new measure, pseudo transfer entropy (pTE), to infer causality in systems composed by two interacting processes. Using synthetic time series generated with processes where the underlying causality is known, and also, a real-world example of two well-known climatic indices, we have found a remarkable similarity between the results of pTE and Granger causality (GC), in terms of the power and size, and the robustness to noise, but pTE can be significantly faster, particularly for short time series. For example, for time series of 100 datapoints, while giving extremely similar results, pTE with time-shifted (T-S) surrogate testing reduces the computational time by approximatelly 92% with respect to GC with IAAFT surrogate testing, and by 48% with respect to GC with T-S surrogate testing (on Google colab CPU, the total computational time for pTE and T-S is 2.5 ms, while for GC and IAAFT is 32.5 ms, and for GC and T-S, 4.7 ms).
Since the computational cost is of capital importance for the analysis of large datasets, the causality testing methodology proposed here will be extremely valuable for the analysis of short and noisy time series whose probability distributions are approximately Gaussian. We remark that many real-world signals follow distributions that are nearly normal. Although we do not claim that our method can be applied to any pair of signals, the information presented in the Additional information supports the method's generic applicability. The algorithms are freely downloadable from GitHub 50 .

Derivation of the pseudo Transfer Entropy (pTE).
Transfer entropy 20 is a well-known measure that quantifies the directionality of information transfer between two processes. In the case of information transfer from process Y to X, it is defined as where p(·, ·, ·) and p(·|·) are joint and conditional probability distributions that describe the processes, i n+1 represents the state of process X at time step n + 1 , i  (1) can be re-written as which, by using the definition of conditional probabilities and entropies, can be re-written as The computation of the TE with Eq. (1) is challenging because a good estimation of the probability distributions is often not available. Considering the processes X and Y to follow normal distributions i.e. X ∼ N(x | µ x , � x ) and Y ∼ N(y | µ y , � y ) , the computation simplifies substantially, using in fact that the entropy of a p-variate normal variable x, is given by By definition of the multivariate Gaussian, we can rewrite Eq. (4) as which, by the property of the logarithm of products becomes www.nature.com/scientificreports/ where | | is the determinant of the p × p positive definite covariance matrix. By substituting Eq. (7)  n are the matrices containing the previous k and l values of processes X and Y respectively. Whenever X and Y are not Gaussian processes, we call the quantity in Eq. (9) pseudo Transfer entropy (pTE). For Gaussian variables pTE coincides with the Transfer Entropy and is equivalent to Granger causality 35 . The Gaussian form for CMI/TE for causality inference was also previously used 56-59 . Statistical significance. We used surrogate data to test the significance of the pTE, TE and GC values. The number of surrogates needed depends on the characteristics of the data, the available computational resources and time limitations: given enough resources and time, one should use a large number of surrogates and select a confidence interval 19 ; however, with limited time or computational resources, when the spread of surrogates data is not too large one can use an alternative strategy: analyze a small number of surrogates and, in the case of a one sided test, select as significance threshold the maximum or minimum value obtained with the surrogates. In this case, M = K/α− 1 surrogates should be generated, where K is a positive integer number and α is the probability of false rejection 51 . Therefore, a minimum of 19 surrogates ( K = 1 ) are required for a significance level of 95%.
We used the algorithm developed by Schreiber and Schmitz 60,61 known as iterative amplitude adjusted Fourier transform (IAAFT), which preserves both, the amplitude distribution and the power spectrum (for details, see Lancaster et al. 51 and references therein). The python routine to compute the IAAFT surrogates is contained in the NoLiTSA package 62 . We also tested the time-shifted (T-S) surrogates 51,53 , which consist in randomly choosing a time shift independently for each surrogate and then shifting the signal in time, wrapping its end to the beginning. These surrogates are very fast to generate and they fully preserve all the properties of the original signal. Both surrogates test the null hypothesis of two processes with arbitrary linear or nonlinear structure but without linear or nonlinear inter-dependencies.

Implementation.
To calculate pTE we developed an algorithm in python (available on GitHub 50 ), while we used the statsmodels implementation of GC 63 and the pyunicorn implementation of TE 64 . The code has been thought to be as user friendly as possible to be used to build networks. It takes as arguments all the time series of the studied system, the embedding parameter and the statistical significance test that the user decides to apply. As result it returns the matrix of pTE values computed from the original data, and the matrix of the maximum values obtained from the surrogates (i.e., the statistically significant thresholds).
In the analysis of synthetic data generated with the DGPs the causality measures were run over 1000 realizations with different initial conditions and noise seeds. For each realization the first 100 data points were discarded. For the computation of GC and pTE we chose a lag equal to 1, which implies considering the models as auto-regressive processes of order 1, AR(1), since by the considered models construction, the dependent variable is influenced by the previous step of the independent one; for the computation of TE the k-nearest neighbors method is used, and we chose k = √ N , where N is the number of data points in the time series 65 . In the analysis of the empirical data, from the physics of the problem, the choice of the order of the AR model used to represent the data is not trivial. We used an autoregressive integrated moving average (ARIMA) and the Akaike information criterion (AIC) to select order 4 for yearly-sampled data and order 3, for the monthlysampled data. www.nature.com/scientificreports/ To calculate the causality between two time series, the time series were first linearly detrended and L2-normalized. The significance of the pTE, GC and TE values obtained were then tested against the values obtained from 19 couples of surrogates (as explained in the previous section, 19 surrogates is the minimum for achieving a significance level of 95% ). Unless otherwise specifically stated, the results presented in the text were obtained by using IAAFT surrogates.

Models
In the main text three data generating processes (DGPs) were analyzed. For these DGPs the null hypothesis of non-causality is not satisfied for process Y to process X. Results obtained with other DGPs are presented in the Additional information.
The first DGP is a linear model 66 given by: where ǫ 1t and ǫ 2t are white noises with zero mean and unit variance, and C is the coupling strength.
The second DGP is a nonlinear model 67 that reads: The third DGP consists of two Lorenz chaotic systems, coupled on the first variable: Examples of time series of these three DGPs, normalized to zero mean and unit variance, are displayed in Fig. 8.

Additional information
Comparison with literature. The linear DGP was used by Diks and DeGoede 66 to test nonlinear Granger causality. With a coupling strength of C = 0.5 and a time series length of 100 points with a lag of 1, they obtained a power of 95.6 and a size of 3.0. Using pTE under the same conditions, we obtain a power of 99.8 and a size of 3.9.
The nonlinear DGP was used by Taamouti et al. 67 to quantify linear and nonlinear Granger causalities. With a coupling strength of C = 0.5 , 200 data points, a pvalue of 5% and a resampling bandwidth k for the bootstrap as the integer part of 2 · 200 1/2 , they obtained a power of 100 and a size of 4.4. Using pTE we obtained a power of 100 and a size of 3.3.
The coupled Lorenz systems studied by Krakovská et al. 68 , are very similar to those studied here. By using three state-space based methods, including cross-mapping, they noticed that the highest directionality in the causality is for a coupling C ≈ 4 . From C > 4 synchronization is obtained, finding causality in both directions, using time series of 50000 data points. This observation is very similar to our results with TE, while for pTE and GC, once synchronization has been achieved, no causality is found. This supports their conclusion, warning the reader that the blind application of causality test can easily lead to incorrect conclusions. While GC and pTE can successfully be used to analyze AR processes and weakly nonlinear Gaussian-like processes, for more complex processes (high dimensional and/or highly nonlinear) advanced information-theoretic methods such as TE are needed.
Additional data generating processes analyzed. Figure 8. Examples of time series of the three data generating processes (DGPs) analyzed in the main text. In the three cases there is causality from Y to X; the coupling strength is (a), (b) C = 0.5 , (c) C = 8.   67 .

M12-M13
x t = x t−1 + ρ + K 2π sin(2π x t−1 ) + + βǫ 1t mod 1 + C 1 (x t−1 − y t−1 ) y t = y t−1 + ρ + K 2π sin(2π y t−1 )  Table 5. Power and size obtained with the DGPs listed in Table 4 using pTE, GC and TE. We can notice that there are no significant differences between pTE and GC. The results were obtained using time series of length 1000, where the first 100 are discarded and they are averaged over 1000 realizations. The last three columns correspond to the directionality index DI, eg. (pTE Y →X − pTE X→Y )/(pTE Y →X + pTE X→Y ) , which shows that pTE performs better in most of the models in assessing the directionality. The pTE has been calculated with an embedding parameter of 1 for all models except for M10, where an embedding parameter of 2 has been used to match the causality lag imposed by construction. Bold indicates best values obtained for each model; italic indicates the wrong values obtained for each model. www.nature.com/scientificreports/