Unravelling the community structure of the climate system by using lags and symbolic time-series analysis

Many natural systems can be represented by complex networks of dynamical units with modular structure in the form of communities of densely interconnected nodes. Unravelling this community structure from observed data requires the development of appropriate tools, particularly when the nodes are embedded in a regular spatial grid and the datasets are short and noisy. Here we propose two methods to identify communities, and validate them by analysing climate datasets recorded on a regular grid of geographical locations covering the Earth's surface. By identifying mutual lags among time-series recorded at different grid points, and by applying symbolic time-series analysis, we are able to extract meaningful regional communities, which can be interpreted in terms of large-scale climate phenomena. The methods proposed here are valuable tools for the study of other systems represented by networks of dynamical units, allowing the identification of communities through time-series analysis of the observed output signals.


Influence of the symbolic representation
The community identification algorithm uses a symbolic representation of time-series, known as ordinal analysis, by which SAT anomalies are represented by a set of ordinal patterns. In this way the information on the evolution of the SAT anomalies at the seasonal time-scale is encoded into transition probability matrices. This procedure is quite general, and can be applied to any symbolic representation of the time-series. To study the influence of the symbolic encoding method, here we use an alternative approach and discretise each time-series into a finite set of values.
We performed this analysis using 10 equally spaced values spanning the range between the minimum and the maximum of each time-series. We then followed the same procedure as in the main text: we computed the transition matrices and the distances between them, thresholded the distances to obtain a climate network, and finally used the Infomap algorithm to identify the communities. The results are shown in Fig. 2.
As can be seen, the community structure obtained with this symbolic representation is much noisier than the one obtained through ordinal patterns. This is probably because, with the ordinal patterns, each symbol encodes information about the seasonal evolution of SAT anomalies; in contrast, with the simpler symbolic encoding used here, the symbols only encode information about the coarse-grained SAT anomaly values. We speculate that by using "blocks" composed of several coarse-grained values (ref. 2), a less noisy community structure could be revealed; this is an interesting issue to investigate in future work.
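The coarse-grained encoding described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used for Fig. 2: the function names `symbolise` and `transition_matrix` are hypothetical, each value is mapped to the nearest of the equally spaced levels, and transitions are counted between consecutive samples (the `lag` parameter is an assumption; the time-scale used in the main text may differ).

```python
def symbolise(series, n_levels=10):
    """Map each value to the index of the nearest of n_levels equally
    spaced levels spanning [min(series), max(series)] (assumed scheme)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries a single symbol.
        return [0] * len(series)
    step = (hi - lo) / (n_levels - 1)
    # Adding 0.5 before truncating rounds to the nearest level.
    return [min(int((x - lo) / step + 0.5), n_levels - 1) for x in series]


def transition_matrix(symbols, n_levels=10, lag=1):
    """Row-normalised frequencies of transitions symbol[t] -> symbol[t + lag].
    Rows of symbols that never occur as a starting point stay at zero."""
    counts = [[0] * n_levels for _ in range(n_levels)]
    for a, b in zip(symbols, symbols[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

With this encoding a symbol records only the amplitude of the anomaly at one instant, whereas an ordinal pattern records the relative ordering of several values. This is the difference the paragraph above points to as the probable source of the noisier communities.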

Influence of the threshold used to construct the network
In order to construct a climate network, the weight matrix has to be pruned by using an adequate threshold W. Decreasing the threshold leads to a more connected network, while increasing it results in a sparser one. The number of communities depends on the number of connections, which in turn depends on the threshold. In order to uncover a coherent, well-defined community structure, the threshold has to be carefully chosen.
We report in Fig. 3 the number of communities and the average degree as a function of the threshold. It can be seen that the number of communities and the average degree are negatively correlated. The fragmentation of the network into smaller communities (such as community 7 in Fig. 1a or communities 8, 9 and 10 in Fig. 1b) could be due to the removal of relevant links that hold the bigger communities together. Thus, to obtain a meaningful community structure, we selected a threshold that provided the best compromise between the need to limit the proliferation of small communities and the need to include in the network only the relevant links.
We stress that the same qualitative behaviour is also found when using other community detection algorithms.
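The trade-off between average degree and fragmentation can be illustrated with a small threshold sweep. The sketch below is a simplified illustration, not the analysis behind Fig. 3: it assumes a symmetric weight matrix in which a link is kept when its weight is at least W, and it counts connected components as a crude stand-in for the communities returned by Infomap. All function names are hypothetical.

```python
def prune(weights, threshold):
    """Adjacency lists keeping only links with weight >= threshold."""
    n = len(weights)
    return {i: [j for j in range(n) if j != i and weights[i][j] >= threshold]
            for i in range(n)}


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def count_components(adj):
    """Number of connected components (stand-in for community count)."""
    seen = set()
    count = 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        # Depth-first traversal of the component containing `start`.
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return count


def sweep(weights, thresholds):
    """Return (threshold, average degree, number of components) per threshold."""
    rows = []
    for w in thresholds:
        adj = prune(weights, w)
        rows.append((w, average_degree(adj), count_components(adj)))
    return rows
```

In this toy setting, raising the threshold removes links, lowers the average degree and splits the graph into more pieces, which mirrors the negative correlation described above. Connected components only give a lower bound on fragmentation, since a community detection algorithm can also split a connected graph.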

Comparison with networks constructed by using the Pearson correlation coefficient
In this section we contrast the community structure obtained by using the proposed methodology (based on transition probabilities computed with symbolic analysis) with that obtained with the classical approach, which measures the dynamical similarity of two time series with the Pearson correlation coefficient, $r_{ij}$. As in ref. 3, we use a threshold W = 0.5 to prune the $r_{ij}$ matrix.
Applying the Infomap algorithm to the resulting network yields 8604 communities, but only 20 are composed of more than 2 nodes. Figure 4 shows the largest 16 communities.
As can be seen, only communities 0 and 1 correspond to coherent structures, namely the El Niño basin and the tropical oceans, while the others appear to be just noise.
We also report the effect of the threshold value on the community structure in this case. As can be seen in Fig. 5, the number of communities increases as the threshold is increased, as it does when using ordinal analysis (Fig. 3); however, here the change is more abrupt (note the logarithmic vertical scale) and occurs at low values of the threshold.
Moreover, when a low threshold is used in order to limit the number of communities, the resulting community structure is meaningless, as can be seen in Fig. 6.
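For reference, the classical construction used in this comparison can be sketched as follows. This is a minimal illustration with hypothetical function names, not the code used to produce Figs. 4 to 6. It assumes that a link is kept when the absolute value of the Pearson coefficient is at least W; whether the sign is retained in the original analysis is not stated here, so that choice is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined for a constant series; treat as no link.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_network(series_list, threshold=0.5):
    """Undirected edge list (i, j), i < j, with |r_ij| >= threshold.
    Using the absolute value is an assumption of this sketch."""
    n = len(series_list)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(series_list[i], series_list[j])) >= threshold:
                edges.append((i, j))
    return edges
```

The resulting edge list can be passed to any community detection algorithm. Unlike the transition-matrix approach, this measure compares the two series only at zero lag and through their linear co-variation.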