Multiplex visibility graphs to investigate recurrent neural network dynamics

A recurrent neural network (RNN) is a universal approximator of dynamical systems, whose performance often depends on sensitive hyperparameters. Tuning them properly may be difficult and is typically based on a trial-and-error approach. In this work, we adopt a graph-based framework to interpret and characterize the internal dynamics of a class of RNNs called echo state networks (ESNs). We design principled unsupervised methods to derive hyperparameter configurations yielding maximal ESN performance, expressed in terms of prediction error and memory capacity. In particular, we propose to model the time series generated by each neuron's activations with a horizontal visibility graph, whose topological properties have been shown to be related to the underlying system dynamics. Subsequently, the horizontal visibility graphs associated with all neurons become layers of a larger structure called a multiplex. We show that topological properties of such a multiplex reflect important features of ESN dynamics that can be used to guide the tuning of its hyperparameters. Results obtained on several benchmarks and a real-world dataset of telephone call data records show the effectiveness of the proposed methods.

A current research trend aims at investigating complex time-variant systems through graph theory, by considering suitable features associated with vertices and edges 1 . Of particular interest are those systems that also perform a computation when driven by an external input signal. An example is that of artificial RNNs [2][3][4] , which are computational dynamical systems whose link with physics and neuroscience dates back to the 1980s, with pioneering works by Jordan 5 and Amit et al. 6 . Nowadays, RNNs are gaining renewed interest in neuroscience due to their biological plausibility 7-10 and in computer science and engineering for their modeling ability 11,12 . RNNs are capable of generating complex dynamics and performing inference based on current inputs and internal state, the latter maintaining a vanishing memory of past inputs 13,14 .
Let us consider trajectories describing the evolution of a dynamical system in state space, i.e., the space containing all possible system states. As an example, in Fig. 1 we show the trajectories of a dynamical system operating in ordered (left) and chaotic (right) regimes, whose state is defined by the values of variables θ 1 , θ 2 , and θ 3 at time t. Discriminating between order and chaos is of fundamental importance for investigating the properties of a dynamical system. Such properties are manifested in the memory of the dynamical system and in the divergence rate of its state trajectories 15 . Fading memory is a desirable property of a dynamical system, and is characterized by ordered (contractive) dynamics. This is also referred to as the echo state property in the reservoir computing community 16 and ensures that the current state/output of the system depends only on a finite number of past states/inputs 17 . At the same time, a high divergence rate between state trajectories (a property of chaotic dynamics) is also a desirable feature of RNNs. Results show that RNNs operating in a chaotic regime are able to produce meaningful patterns of activity 18 , and a balance must be struck in order to meet both properties. As a consequence, a (computational) dynamical system has to operate on the transition between order and chaos, in a region of the controlling parameter space called "edge of criticality". On the edge of criticality, RNN internal dynamics become richer, meaning that neuron activations become heterogeneous 15 . Such diversity improves both the memory capacity of RNNs and their capability to reproduce complex target dynamics 19,20 . The notion of edge of criticality permeates several complex systems [21][22][23] , including random Boolean networks 24

Figure 1. Trajectories in the state space of a system described by the evolution of variables θ 1 , θ 2 , and θ 3 over time. The behavior of the system is controlled by a parameter (not specified), which we assume to be able to produce ordered (left plot) and chaotic (right plot) dynamics, depending on its value. In the ordered regime, the trajectories converge to the same fixed point when starting from two different initial conditions (red dots). In this configuration, the system is characterized by a fading memory of previous states. In the chaotic regime, on the other hand, trajectories remain separated if the system starts from different initial conditions.

Scientific RepoRts | 7:44037 | DOI: 10.1038/srep44037

state of an ESN through a set of vertex properties of the multiplex (e.g., degree, clustering coefficient). Edges in HVGs might cover a relevant time interval (contained within the longest period of the signal), while they are still local in terms of topology. Accordingly, the ESN state can also be characterized by information that is non-local in time. To find the hyperparameters yielding the highest prediction accuracy, we search for hyperparameter configurations producing neuron activations that are as diverse as possible. This occurs when neuron dynamics are maximally heterogeneous (critical dynamics), a characteristic that, as we will show in the paper, is well captured by the average entropy of vertex properties of the multiplex. Subsequently, to quantify the amount of memory in an ESN, we check for the existence of neuron activations that are "similar" to different delayed versions of the input. In fact, memory in ESNs depends on the ability to reproduce past input sequences from information kept within some neuron activations. We describe the dynamics of delayed inputs and of neuron activations through unsupervised graph-based measures. Then, by evaluating the agreement between such measurements, we quantify the memory capacity.
We provide experimental evidence that our methods achieve performance comparable with supervised techniques for identifying hyperparameter configurations with high prediction accuracy and large memory capacity.

Methods
Echo state networks. A standard ESN architecture consists of a large recurrent layer of non-linear neurons, sparsely interconnected by edges with randomly generated weights and called a reservoir, and a linear, feedforward readout layer that is usually trained with regularized least-squares optimization 27 . De facto, the recurrent layer acts as a non-linear kernel 48 that maps inputs to a high-dimensional space. The time-invariant difference equations describing the ESN state-update and output are, respectively, defined as
The reservoir contains N r neurons whose transfer/activation function f(·) is usually a hyperbolic tangent and g(·) is usually the identity function. At time instant t, the ESN is driven by the input signal
Large values for ω i tend to saturate the non-linear activation functions. The second hyperparameter is the spectral radius ρ of W r r (the largest absolute value among its eigenvalues), which is related to the echo state property discussed earlier. For a detailed discussion on the relationships between stability, performance and ρ, we refer the interested reader to ref. 30 and references therein. Here, it suffices to say that a widely adopted rule of thumb 49 suggests setting ρ to a value slightly smaller than 1 (e.g., ρ = 0.99). However, to reach higher performance in some practical tasks, it could be necessary to pick a small value for ρ or to breach the aforementioned "safety" bound and push its value beyond unity. Note that in this latter case asymptotic stability of the ESN might still hold, even if some assumptions are locally violated.
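The role of the two hyperparameters can be illustrated with a minimal NumPy sketch. It assumes a commonly used simplified state update without output feedback, h[t] = tanh(W_rr h[t-1] + ω_i W_ir x[t]), which is not necessarily the exact form adopted in this work. The function names (`make_reservoir`, `run_reservoir`), the uniform weight initialization, and the reading of `sparsity` as the fraction of non-zero weights are illustrative assumptions.

```python
import numpy as np

def make_reservoir(n_r, rho, sparsity=0.25, seed=0):
    """Random sparse reservoir matrix rescaled to spectral radius rho (assumed recipe)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_r, n_r))
    # Zero out entries so that roughly a `sparsity` fraction of weights stays non-zero.
    W[rng.random((n_r, n_r)) > sparsity] = 0.0
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho / radius)

def run_reservoir(x, W_rr, W_ir, omega_i):
    """Collect the states h[t] = tanh(W_rr h[t-1] + omega_i * W_ir x[t]), starting from h = 0."""
    h = np.zeros(W_rr.shape[0])
    states = np.empty((len(x), W_rr.shape[0]))
    for t, x_t in enumerate(x):
        h = np.tanh(W_rr @ h + omega_i * (W_ir @ np.atleast_1d(x_t)))
        states[t] = h
    return states

# Usage: a 50-neuron reservoir driven by a sinusoid.
W_rr = make_reservoir(50, rho=0.9)
W_ir = np.random.default_rng(1).uniform(-1.0, 1.0, size=(50, 1))
states = run_reservoir(np.sin(0.1 * np.arange(200)), W_rr, W_ir, omega_i=0.7)
```

Each column of `states` is the activation time series of one neuron, which is the object the graph-based analysis operates on.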
In this work, we focus only on the tuning of the hyperparameters ρ and ω i , without loss of generality: the proposed methods can be adopted to tune other hyperparameters as well. The optimal values of ρ and ω i for the task at hand are typically identified with a cross-validation procedure, with the potential associated shortcomings mentioned above. For that reason, different unsupervised approaches have been proposed for their tuning 30,31,33 . For example, the effect of ρ and ω i on the ESN computational capability can be investigated through the maximal local Lyapunov exponent (MLLE), which measures the divergence rate in state space of trajectories with similar initial conditions. In autonomous (not input-driven) systems, chaos occurs when the maximal Lyapunov exponent becomes positive, while in input-driven systems, like ESNs, one typically relies on local first-order approximations of this quantity (see ref. 30 for details). Accordingly, the onset of criticality in ESNs can be detected by checking when the MLLE crosses 0. Another quantity, which was shown to be more accurate in detecting criticality in dynamical systems and well correlated with ESN performance, is the minimal singular value (on average, over time) of the reservoir Jacobian, denoted as λ 33 . λ is unimodal and, at its maximum, the dynamical system is far from singularity, has many degrees of freedom, has good excitability, and separates the input signals well in state space 33 . By assuming a null input in Eq. (3), the Jacobian matrix of the reservoir at time t is given by
where the diag(·) operator returns a diagonal matrix. In this paper, λ will serve as a baseline for comparison to our proposed graph-based unsupervised methods for improved ESN hyperparameter tuning, as discussed in the next section.
Horizontal visibility graph and multiplex network.
In the HVG 41 associated with a finite univariate time series, each datum x[i] is assigned a vertex v i . The adjacency matrix A characterizes the graph: two vertices v i and v j , i ≠ j, are connected by an edge (A[i, j] = 1) iff the corresponding data fulfill the criterion $x[i], x[j] > x[p], \ \forall\, p : i < p < j$. In a multivariate scenario, the data stream is composed of N r different time series x l , l = 1, …, N r , of equal length t max . A multivariate time series can be mapped into a multiplex visibility graph with N r layers 40 . Specifically, the lth multiplex layer is defined by the HVG G l constructed from x l . In the multiplex, a vertex is replicated on all layers and such replicas are linked by inter-layer connections, while intra-layer connections might change in each layer. From now on, we denote by v l [t] the vertex of G l in layer l associated with time index t.
In this paper we introduce a weighted HVG (wHVG), with edge values defined as $A_w[i, j] = 1 / \sqrt{(j - i)^2 + (x[i] - x[j])^2} \in [0, 1]$ for every pair of data points x[i], x[j] connected by the visibility rule. Since self-loops are forbidden in HVGs, edge weights are always well-defined (i.e., finite). The weights capture additional information, as they account for the distance in time (j − i) and the amplitude difference (x[i] − x[j]) of two data points connected by the visibility rule. This weighting scheme is motivated by our need to characterize and exploit the instantaneous state by means of a suitable measure of heterogeneity (discussed in the next section). To distinguish the original adjacency matrix from the one of the wHVG, we refer to the former as the binary adjacency matrix and to the latter as the weighted adjacency matrix. Algorithm 2 delivers the pseudo-code for constructing a HVG (and a wHVG) from a time series x. The worst-case complexity of this algorithm is $O(t_{\max}^2)$, which occurs when the values in x are monotonically decreasing. Instead, the best-case complexity is $O(t_{\max})$ and arises for monotonically increasing values in x.
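A minimal Python sketch of one way to build the binary and weighted adjacency matrices is given here; it is not the authors' algorithm listing. It applies the horizontal visibility criterion directly, and, as an assumption drawn from the description of the weights, sets each weight to the inverse Euclidean distance between the two connected points in the (time, amplitude) plane. The function name `hvg_adjacency` is hypothetical.

```python
import numpy as np

def hvg_adjacency(x, weighted=False):
    """Adjacency matrix of the (weighted) horizontal visibility graph of series x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    A = np.zeros((n, n))
    for i in range(n - 1):
        highest = -np.inf  # largest value seen strictly between i and j
        for j in range(i + 1, n):
            # Visibility rule: both endpoints exceed every intermediate value.
            if x[i] > highest and x[j] > highest:
                # Assumed weighting: inverse distance in time and amplitude.
                w = 1.0 / np.hypot(j - i, x[i] - x[j]) if weighted else 1.0
                A[i, j] = A[j, i] = w
            highest = max(highest, x[j])
            if highest >= x[i]:
                break  # x[i] cannot see anything beyond this point
    return A

# Usage: vertex degrees of a short series.
A = hvg_adjacency([3.0, 1.0, 2.0, 4.0])
degrees = A.sum(axis=0)  # array([3., 2., 3., 2.])
```

The early `break` is what produces the complexity behavior described in the text: for a monotonically increasing series the inner loop stops after one step, while for a monotonically decreasing series it always scans to the end.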

Input: Time series x
Output: Adjacency matrix A of (weighted) HVG

Indexes of the vertices on each layer l of the multiplex are associated one-to-one with the time index of the original time series. Hence, the ESN state
In Table 1 we introduce four indexes: vertex degree, clustering coefficient 50 , betweenness and closeness centrality 51 .
Heterogeneity of neurons dynamics. The capability of an ESN to reproduce the dynamics of a target system, hence to predict its trajectory in state space, is maximized on the edge of criticality, where the internal dynamical patterns of an ESN become sufficiently rich. In the literature, such a "richness" is usually expressed in terms of diversity of connection weights 52 , entropy or rank of the matrix of neuron activations 1,5,32 .
In the same spirit, in order to find hyperparameters yielding maximal prediction accuracy, here we look for those hyperparameter configurations giving rise to neuron activations that are as heterogeneous as possible. Figure 2 provides a visual example of the concept herein discussed. As an illustration, we consider an ESN driven by a sinusoid. We depict the neuron activations and the corresponding HVGs. We select a time step (marked by a black square in the picture) and we show the correspondence between the element in each time series of the neuron activations and the vertices in the associated HVGs. When the ESN operates with contractive dynamics (Fig. 2a), neuron activations weakly depend on previous ESN states and they are all very similar to the input signal. This results in a lack of diversity among activations. Accordingly, the corresponding HVGs in the multiplex contain vertices with similar properties, e.g., similar degree and clustering coefficient. When the ESN approaches the edge of criticality, neuron activities highly depend on previous internal states, which encode information on past inputs and on the internal structure of neuron connections. As we can see from Fig. 2b, on the edge of criticality the activations have different frequencies and their phases are shifted. Such heterogeneity in the ESN instantaneous state is captured by vertex properties of the multiplex. Finally, if pushed beyond the edge of criticality, the ESN transits into a chaotic regime, as shown in Fig. 2c. In this case, neuron activations become noise-like and disordered oscillations generate HVGs with vertex properties very different from previous configurations. However, the diversity of the patterns in different time series disappears, hence heterogeneity is again lost. This lack of variety is also highlighted by the recurrence of similar motifs in the corresponding HVGs.

Table 1.
Vertex degree: Number of edges incident to vertex v.
Clustering coefficient: Clustering coefficient of vertex v, applicable to both weighted and binary graphs (in ref. 68, extended to include also v and its edges).
Betweenness centrality: Measures the centrality of vertex v. σ ij is the total number of shortest paths from i to j and σ ij (v) denotes the number of shortest paths passing through v.
Closeness centrality: Total distance of a vertex v from all other vertices in the graph. |σ(i, v)| denotes the length of the shortest path between i and v.
To determine the heterogeneity of neuron activations, we consider the entropy of the related vertex property distribution. In particular, heterogeneity is computed as follows:
The procedure described above is visually represented in Fig. 3. By referring to the figure, we observe that, at a given time t, the ESN state is represented by those vertices in the multiplex tagged with the same label across the layers. For example, the ones in red describe the instantaneous state of the ESN at time step t 4 , and a vector of vertex properties Φ * [t 4 ] is computed for this set of vertices. Then, we estimate the distribution p(Φ * [t 4 ]) and compute its entropy. In order to compute the average entropy H * , the procedure is repeated for each time step. Finally, H * is used to characterize the current hyperparameter configuration. The average entropy depends on the specific vertex property chosen for the analysis (we take into account four properties, i.e., vertex degree, clustering coefficient, betweenness and closeness centrality -see Table 1). These vertex properties lead to four different entropy values H DG , H CL , H BC , and H CC .
To summarize, given an input signal, we select hyperparameter configurations that maximize such entropy values. This criterion is inspired by the aforementioned observation linking performance of a computational dynamical system (i.e., prediction accuracy and memory) with heterogeneity of its dynamics (critical dynamics). Accordingly, in this case this information is exploited in order to derive, in an unsupervised way, the configuration yielding highest prediction accuracy.
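The selection criterion can be sketched in a few lines of NumPy, under stated assumptions: the vertex property values are already arranged in an array `props[l, t]` (layer l, time step t), the distribution at each time step is estimated with a histogram whose bin range is shared across time steps, and the entropy is the Shannon entropy in nats. The function name `average_entropy` and these estimation choices are illustrative, not taken from the original procedure.

```python
import numpy as np

def average_entropy(props, bins=50):
    """Average over time of the entropy of a vertex property across layers.

    props[l, t] holds the property (e.g. degree) of the vertex associated
    with time step t in layer l of the multiplex.
    """
    props = np.asarray(props, dtype=float)
    lo, hi = props.min(), props.max()
    if hi == lo:
        return 0.0  # every vertex has the same value: no heterogeneity
    entropies = []
    for t in range(props.shape[1]):
        # Histogram estimate of the across-layer distribution at time t.
        counts, _ = np.histogram(props[:, t], bins=bins, range=(lo, hi))
        p = counts[counts > 0] / counts.sum()
        entropies.append(-np.sum(p * np.log(p)))
    return float(np.mean(entropies))

# Usage: 20 layers, 10 time steps of a made-up vertex property.
rng = np.random.default_rng(0)
H = average_entropy(rng.random((20, 10)))
```

A hyperparameter search would then evaluate this quantity for each configuration and keep the one with the largest value; identical layers give zero, which matches the contractive case described earlier.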
Other multiplex complexity measures. Recently 40 , two measures have been proposed in order to characterize the dynamics of a system observed through a multivariate time series and represented as a multiplex composed of HVGs. Here, we consider these measures in order to evaluate whether they are useful for identifying the hyperparameter configurations yielding maximum accuracy performances or not.
The Average Edge Overlap (AEO) computes the expected number of layers of the multiplex on which an edge is present. For binary HVGs, it is defined as
is defined as a stationary uncorrelated noisy signal.
In the following, we propose an unsupervised graph-based method to identify hyperparameter configurations for which an ESN achieves large memory capacity. Given the input time series x, we determine whether there exists a subset of neuron activations that is correlated with a past input sequence. Let G l be the HVG representing the lth layer of the multiplex; Φ l DG is the sequence of its vertex degrees ordered according to the time index (not to be confused with the degrees of vertices relative to time t across the different layers). With G x , instead, we refer to the HVG constructed over the input x[t], while Φ x DG is the vector of its vertex degrees. First, we define a measure of maximum agreement between Φ x DG and each sequence Φ l DG as $\delta_{DG} = \max_{l = 1, \ldots, N_r} \kappa_*(\Phi^{DG}_x, \Phi^{DG}_l)$. κ * (·,·) is a similarity measure between sequences: in this paper we consider the Pearson correlation κ PC (·,·), the Spearman correlation κ SC (·,·), and the mutual information κ MI (·,·).
A second measure of agreement is defined on the adjacency matrix . In this case, κ * (·,·) is directly evaluated on the time series rather than on the sequences of vertex degrees. This last measure of agreement is taken into account in order to quantitatively show (in the experiments) the benefits of using HVGs for representing neuron activations.
By referring to the illustrative example in Fig. 4, we generate the HVGs G x and G 1 , …, G N r , relative to the delayed input x 15,10 and to the activations of the N r reservoir neurons, respectively. Subsequently, we evaluate their similarities by means of an agreement measure (δ DG in this example). The similarity measure κ * is chosen among the three previously proposed measures: κ PC , κ SC , or κ MI .
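The degree-based agreement measure can be sketched as follows, assuming it is the maximum over neurons of the similarity between the degree sequence of the delayed-input HVG and the degree sequence of each neuron's HVG, here with the Spearman correlation as κ. The helper names (`hvg_degrees`, `rank_average`, `delta_dg`) are hypothetical, and Spearman correlation is computed as the Pearson correlation of tie-averaged ranks to keep the example NumPy-only.

```python
import numpy as np

def hvg_degrees(x):
    """Degree sequence of the binary horizontal visibility graph of x."""
    x = np.asarray(x, dtype=float)
    deg = np.zeros(len(x), dtype=int)
    for i in range(len(x) - 1):
        highest = -np.inf  # largest value strictly between i and j
        for j in range(i + 1, len(x)):
            if x[i] > highest and x[j] > highest:
                deg[i] += 1
                deg[j] += 1
            highest = max(highest, x[j])
            if highest >= x[i]:
                break
    return deg

def rank_average(a):
    """Ranks of a, with tied values sharing their average rank."""
    a = np.asarray(a, dtype=float)
    ranks = np.empty(len(a))
    ranks[np.argsort(a, kind="mergesort")] = np.arange(len(a), dtype=float)
    for v in np.unique(a):
        tie = a == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def delta_dg(x_delayed, states):
    """Max over neurons of the Spearman correlation between degree sequences.

    states[t, l] is the activation of neuron l at time t. Neurons with a
    constant degree sequence yield NaN and are ignored by nanmax.
    """
    phi_x = rank_average(hvg_degrees(x_delayed))
    scores = [np.corrcoef(phi_x, rank_average(hvg_degrees(h)))[0, 1]
              for h in states.T]
    return float(np.nanmax(scores))
```

Because a HVG depends only on the ordering of the values, any neuron whose activation is a positively rescaled copy of the delayed input yields a score of 1, which is the invariance to amplitude changes that motivates working on graphs rather than on the raw series.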

Results
In the following, we perform two experiments in order to evaluate the two proposed unsupervised methods for finding hyperparameter configurations giving rise to ESNs with, respectively, high prediction accuracy and large memory capacity. In the first experiment, we show that, on different real and synthetic tasks, (supervised) prediction accuracy is maximized for the same hyperparameter configurations that yield the largest heterogeneity for the vertex properties of the multiplex. In the second experiment, we show the reliability of the graph-based memory measures in identifying hyperparameters where the (supervised) MC is maximized.
Test for prediction accuracy. In this experiment, we consider several prediction tasks and, for each of them, we set the forecast step τ f > 0 to be the smallest time lag that guarantees the measurements in a time window of size τ f to be uncorrelated (e.g., the first zero in the autocorrelation function of the input signal). Prediction error is evaluated by the Normalized Root Mean Squared Error, $\mathrm{NRMSE} = \sqrt{\langle \lVert \hat{y}[t] - y[t] \rVert^2 \rangle / \langle \lVert y[t] - \langle y[t] \rangle \rVert^2 \rangle}$, where $\hat{y}[t]$ is the prediction provided by the ESN and y[t] is the desired/teacher output. The prediction accuracy is defined as γ = max{1 − NRMSE, 0}.
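As a small illustration of the accuracy score, the sketch below computes γ = max{1 − NRMSE, 0} for a scalar output, assuming the usual normalization of the mean squared error by the variance of the target. The function names are illustrative.

```python
import numpy as np

def nrmse(y_pred, y_true):
    """Root mean squared error normalized by the standard deviation of the target."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    num = np.mean((y_pred - y_true) ** 2)
    den = np.mean((y_true - y_true.mean()) ** 2)
    return float(np.sqrt(num / den))

def accuracy(y_pred, y_true):
    """Prediction accuracy gamma = max(1 - NRMSE, 0)."""
    return max(1.0 - nrmse(y_pred, y_true), 0.0)

# Usage: a perfect prediction scores 1, predicting the target mean scores 0.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(accuracy(y, y))                      # 1.0
print(accuracy(np.full(4, y.mean()), y))   # 0.0
```

Under this normalization, a predictor that always outputs the mean of the target has NRMSE equal to 1, so the clipping at 0 simply treats anything worse than that trivial predictor as having no accuracy.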
In the following, we describe the datasets used in this experimental campaign.
Sinusoidal input. We feed an ESN with a sinusoid y(t) = sin(ψt) and we predict future input values with a forecast step τ f = 2π/ψ.

Mackey-Glass time series.
The Mackey-Glass (MG) system is commonly used as a benchmark in chaotic time series prediction. The input signal is generated from the MG time-delay differential equation $\frac{dx}{dt} = \frac{\alpha\, x(t - \tau_{MG})}{1 + x(t - \tau_{MG})^{10}} - \beta\, x(t)$. We adopt the standard parameters τ MG = 17, α = 0.2, β = 0.1, initial condition x(0) = 1.2, and integration step equal to 0.1. The forecast step here is τ f = 6.
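A series of this kind can be generated with a short NumPy sketch. It assumes the standard Mackey-Glass form with exponent 10, a forward Euler integrator, and a constant history x(t) = x(0) for t ≤ 0; the integration scheme and history actually used are not stated in the text, so both are assumptions, as is the function name `mackey_glass`.

```python
import numpy as np

def mackey_glass(n_steps, tau=17.0, alpha=0.2, beta=0.1, x0=1.2, dt=0.1):
    """Integrate dx/dt = alpha*x(t-tau)/(1+x(t-tau)**10) - beta*x(t) with forward Euler."""
    delay = int(round(tau / dt))          # delay expressed in integration steps
    x = np.full(n_steps + delay, x0)      # assumed constant history for t <= 0
    for t in range(delay, n_steps + delay - 1):
        x_tau = x[t - delay]
        x[t + 1] = x[t] + dt * (alpha * x_tau / (1.0 + x_tau ** 10) - beta * x[t])
    return x[delay:]

# Usage: 5000 integration steps, i.e. 500 time units.
series = mackey_glass(5000)
```

In practice the first part of the generated series is often discarded as a transient and the remainder subsampled before being fed to the network; whether and how this was done here is not specified.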
Multiple superimposed oscillator. Prediction of superimposed sinusoidal waves with incommensurable frequencies is a hard forecasting exercise, due to the extension of the wavelength 54 . The ESN is fed with the multiple superimposed oscillator (MSO) For this task, the ESN is trained to predict future input values, with forecast horizon τ f = 16.
NARMA. The chosen Non-Linear Auto-Regressive Moving Average (NARMA) task 55 consists in modeling the output of the r-order system:
where x[t] is a uniform random noise signal in [0, 1] and is the input of the ESN, which is trained to reproduce y[t + 1]. The NARMA task is known to require a memory of at least r past time steps, since the output is determined by inputs and outputs from the last r time steps. For this task we set r = 20 and τ f = 15.
Polynomial task. The ESN is fed with uniform noise in [− 1, 1] and is trained to reproduce the following output
Telephone call load time series. As a last test, we consider a real-world dataset relative to the load of phone calls registered over a mobile network. The data come from the Orange telephone dataset, published in the Data for Development (D4D) challenge 57 . D4D is a collection of call data records, containing anonymized events of Orange's mobile phone users in the Ivory Coast, in a period spanning from December 1, 2011 to April 28, 2012. The dataset comprises 6 time series: number and volume of incoming calls, number and volume of outgoing calls, and day and time (1 hour resolution) when the telephone activity was registered. More detailed information is available at the related website 58 . All 6 time series are fed into the ESN as inputs; the goal is to predict 6 hours ahead the volume of incoming calls -the profile of this latter time series is depicted in Fig. 5.
In each test, we evaluate the correlation between the average entropy of vertex properties in the multiplex and the prediction accuracy γ, as we vary the hyperparameters ρ and ω i . Multiplexes are generated using both the binary and weighted versions of the HVG adjacency matrix. To appreciate the effectiveness of our methodology, we also estimate the correlations of γ with λ, the minimum singular value of the Jacobian of the reservoir (see Methods). Additionally, we consider the correlation of γ with the two layer-based measures IMI and AEO (see Methods). Correlations are evaluated as follows. For each configuration (ρ = k, ω i = j), we have the prediction accuracy γ k,j , the entropy H DG k,j , and so on. The values assumed by these quantities by varying ρ and ω i generate a two-dimensional manifold.
Due to the stochastic nature of the ESN initialization, for each configuration (ρ = k, ω i = j) we compute γ k,j and all the other measures 15 different times. Subsequently, we compute the correlations among their average values. We use a reservoir with N r = 100 neurons and sparsity of the internal connectivity equal to 25%. The readout is trained by standard ridge regression with regularization parameter set to 0.05. The distributions of the vertex properties are estimated by histograms with b = 50 bins. We note that the reservoir size N r can be increased or decreased without affecting the applicability of the proposed methodology (only the number of bins used to estimate the vertex property distributions might need to be modified -see Methods). However, the number of neurons has an impact on the overall performance of the network and on the time and space complexities of the proposed method. Figure 6 depicts the values assumed by γ and four graph-based measures in the case of the MSO prediction task. As can be seen, a high correlation emerges between the (average) entropy of the vertex properties and the prediction accuracy.
Since our approach is fully unsupervised, the proposed graph-based measures can approximate well the accuracy γ, regardless of the task learned by the readout (prediction, function approximation, reconstruction of past inputs, etc.). In Table 2, we report the average correlation values and their statistical significance (expressed by p-values) on all tasks. As we can see, the highest (and statistically significant) correlation is achieved by using one of the four average entropy measures of vertex properties. In particular, the measure based on the vertex clustering coefficient distribution, H CL , achieves the best results in 5 of the 6 tasks. As for the D4D time series, we observe that λ achieves high correlation with γ, but still lower than the one achieved by H CL . This demonstrates the effectiveness of the proposed methodology, also in the case of a real-world application. In the SIN and MSO tasks, the graph-based quantifiers estimated on the weighted HVG achieve higher accuracy than their binary counterparts. In these cases, the additional information on temporal and amplitude differences between connected data points allows a better representation of the system dynamics. Finally, it is worth noting that IMI takes high, yet negative, correlation values on both the MG and POLY tasks. In such cases, results are close to the ones achieved with our approach.
Test for memory capacity. The experiment consists in generating 100 different random reservoirs, each one characterized by an increasing value of the spectral radius ρ in the [0.1, 2] interval. As ρ varies, we evaluate the MC by training four readouts to reproduce different time-lagged versions of the input signal, x 10,5 , … , x 25,20 . Then, on the output of each reservoir, we evaluate the similarities δ TS , δ DG , and δ AND , which are high if there exists at least one series of activations that is similar to the considered past input sequence. Similarity is evaluated according to the measure κ * taken into account. Even if some neurons retain the dynamics of previous input sequences, the reservoir introduces shifts in the phase and the amplitude of the input signal. To filter out these differences, in this test we consider only HVGs defined by binary adjacency matrices, which do not account for differences in the amplitude of the connected values. To evaluate the effectiveness of the proposed unsupervised memory measures, we compute the correlation between the supervised MC and δ TS , δ DG , δ AND , as ρ varies within the chosen interval. Note that we only monitor the effect of ρ on the dynamics, since it is the hyperparameter that most affects the memory capacity 32 . We keep the input scaling fixed, ω i = 0.7, while the remaining hyperparameters are configured as in the previous experiment. As before, we repeat each experiment 15 times with different and independent random initializations. In Table 3, we show the mean correlation values, along with the standard deviations, between the MC and the proposed unsupervised measures of memory capacity. From the table, we observe that the best agreement with the MC is achieved by the measures derived from the HVGs. In particular, δ DG configured with the Spearman rank correlation κ SC is always highly correlated with MC and, in three of the four delayed input sequences taken into account, is the best performing one.
In each setup, δ DG works better if configured with κ SC rather than κ PC . Instead, results obtained with κ MI are significantly worse in all cases. δ AND achieves the best results only for the first time lag taken into account, while the agreement with MC is lower in the remaining cases. Interestingly, several measures show a high degree of correlation with the MC as the size of the delay increases. δ TS , the unsupervised measure computed directly on the input time series and neuron activations, shows positive correlations with the MC, but the agreement is always lower with respect to the graph-based measures. For δ TS , the setting with κ PC works better than κ SC . Finally, also in this case by using κ MI we obtain the worst performance. In Fig. 7, we show an example of the values of MC, δ TS (configured with κ PC ), δ DG (configured with κ SC ), and δ AND , as ρ is varied within the [0.2, 2] interval.

Discussion
Experimental results show satisfactory correlations for average entropy of vertex properties with respect to prediction accuracy. Moreover, the two unsupervised graph-based memory measures that we proposed (δ DG and δ AND ) correlate well with the supervised measure of memory capacity.
Table 2. Each measure is computed on both the binary (b) and weighted (w) versions of the HVG adjacency matrix (adj). We also report the correlations of γ with the manifolds relative to the minimum singular value of the reservoir Jacobian over time (λ) and the two multiplex-based measures AEO and IMI, presented in ref. 40. In each task, the highest correlations with γ are highlighted in bold.
We first discuss the results of the prediction accuracy test, where we analyzed topological properties of vertices in the multiplex, representing the ESN instantaneous state. On all tests taken into account, we observed a remarkable correlation between γ and the average entropy of the clustering coefficient distribution H CL , suggesting that the clustering coefficient describes the heterogeneity of the activations well. To explain this result, it is necessary to elaborate on the properties of the clustering coefficient CL(·). In the HVG literature, the behavior of CL(·) has been analyzed for time series characterized by different Hurst exponents 59 . Additionally, an upper bound (CL(v) ∈ [0, 2/DG(v)]) is provided for HVGs derived from random time series 41 . In the following, we present an in-depth interpretation of the results by accounting for geometrical properties of the clustering coefficient. In a HVG, CL(v) measures the inter-visibility among neighbors of v. For convex functions, it is possible to connect any two points with a straight line. This feature is also (partially) captured by the HVG. If v is contained in a convex part of the related time series, there is a high degree of inter-visibility among the neighbor vertices to which v is connected, hence CL(v) is high.
Additionally, moving along the same convex part of the time series results only in minor changes of the clustering coefficient in the associated HVG vertices. Instead, if v is a local maximum of a concave part, then it is connected to points belonging to two different basins, which do not have reciprocal visibility. In this case, CL(v) is low and its value rapidly changes as one moves away from the maximum. This results in great losses of visual information. Therefore, large values of CL(·) indicate the presence of dominating convexities, while low values characterize concavities 60 . Accordingly, CL(·) can be used to measure the length of a convex (concave) part of the time series and how fast the convexity is changing, which is a measure of the fluctuations in the time series 61 . In a regime characterized by contractive dynamics, convexity changes at the same (slow) rate in different neuron activations and this results in a low entropy value of the clustering coefficient distribution among vertices in different layers. On the edge of criticality, instead, convex and concave parts in the time series of activations are characterized by heterogeneous lengths and they change at different rates. This corresponds to a high degree of clustering coefficient diversity of the same HVG vertex, replicated at different layers in the multiplex. Finally, in the chaotic regime, all time series fluctuate very rapidly and their convexity changes every few time steps. In this case, in each time series of activations the lengths of convex and concave parts are always very short and hence the desired heterogeneity is again lost.
As for the experiments on memory, the best overall results in terms of agreement with the supervised MC are achieved by the graph-based measure δ DG . As previously discussed, such a measure evaluates the maximum similarity between the sequence of vertex degrees of the input HVG G x and that of the HVG G l of each neuron's activations. This measure is closely related to the degree distribution P(k), whose importance is known in the HVG literature 41 . For example, it has been shown that for time series generated from an i.i.d. process, the degree distribution is $P(k) = \frac{1}{3}\left(\frac{2}{3}\right)^{k-2}$ and the mean degree is ⟨k⟩ = 4. As the correlations in the time series increase, the i.i.d. assumption is lost and P(k) decays faster. Furthermore, vertex degrees are key parameters to describe dynamic processes on the graph, such as synchronization of coupled oscillators, percolation, epidemic spreading, and linear stability of equilibrium in networked coupled systems 62 . Their role has also been studied in the HVG framework 63 . HVGs have been studied in the context of time series related to processes with power-law correlations 59 . In our case, the time series of neuron activations have short-term correlations and increments in the correlation coefficients can have opposite signs at consecutive time lags. For these cases, we are not aware of any previous study in terms of HVGs.
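As a consistency check on the i.i.d. degree distribution quoted above, the following short derivation (a worked verification, not part of the original analysis) confirms that it is normalized and that its mean is 4. It uses the substitution m = k − 2 and the geometric series identities for q = 2/3.

```latex
\begin{align*}
\sum_{k=2}^{\infty} P(k)
  &= \frac{1}{3}\sum_{m=0}^{\infty} q^{m}
   = \frac{1}{3}\cdot\frac{1}{1-q}
   = \frac{1}{3}\cdot 3 = 1,
  \qquad q = \tfrac{2}{3}, \\
\langle k \rangle
  &= \sum_{k=2}^{\infty} k\,P(k)
   = \frac{1}{3}\sum_{m=0}^{\infty} (m+2)\,q^{m}
   = \frac{1}{3}\left[\frac{q}{(1-q)^{2}} + \frac{2}{1-q}\right]
   = \frac{1}{3}\,(6 + 6) = 4 .
\end{align*}
```

The minimum degree k = 2 reflects the fact that every interior point of a series always sees its two immediate neighbors.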
In networks which are inherently degree disassortative, the range of degree values increases with network size, with a consequent decrease of the assortativity value 64 . In such networks, the Spearman rank correlation is a more suitable choice than degree-degree Pearson correlations. It is important to notice that the rank is computed through a non-linear rescaling, which is data dependent. The information on the actual values of the data is discarded, as only its inherent ordering (rank) is preserved. We argue that HVGs convey the same type of information captured by the Spearman correlation. Hence, the latter should be preferred to the Pearson correlation to characterize the vertices and the related topological properties in HVGs. This fact justifies the higher agreement with memory capacity achieved by δ DG when configured with κ SC , which accounts for Spearman correlations between the sequences of vertex degrees in the HVGs related to the input signal and the neuron activations.
Modeling ESN dynamics through a multiplex network allowed us to connect two seemingly different research fields, thus fostering multidisciplinary research in the context of recurrent neural networks. By converting a temporal problem into a topological one, we handled temporal dependencies introduced by ESNs (as well as by other types of RNNs), hence overcoming technical limitations of statistical approaches that require independence of samples. We performed and discussed several experiments that provided empirical evidence that our methodology achieves performance higher than other unsupervised methods and comparable to cross-validation techniques. These results suggest that further effort should be devoted to improving the effectiveness of unsupervised learning methods in the context of ESNs and RNNs.
Finally, we would like to stress that, while this paper is primarily focused on network structures in machine learning, our results might suggest new ideas for the theoretical understanding of recurrent structures in biological models of neuronal networks 7,65,66 . In particular, we believe that it is possible to identify emergent structural patterns in the developed graph-based representations of network dynamics. This would allow us to further explore and analyze the route to chaos in input-driven neural models by exploiting the language of graph theory, which is an already established framework within the neuroscience field 67 .