A sampling-guided unsupervised learning method to capture percolation in complex networks

The use of machine learning methods in classical and quantum systems has led to novel techniques to classify ordered and disordered phases, as well as uncover transition points in critical phenomena. Efforts to extend these methods to dynamical processes in complex networks is a field of active research. Network-percolation, a measure of resilience and robustness to structural failures, as well as a proxy for spreading processes, has numerous applications in social, technological, and infrastructural systems. A particular challenge is to identify the existence of a percolation cluster in a network in the face of noisy data. Here, we consider bond-percolation, and introduce a sampling approach that leverages the core-periphery structure of such networks at a microscopic scale, using onion decomposition, a refined version of the k-core. By selecting subsets of nodes in a particular layer of the onion spectrum that follow similar trajectories in the percolation process, percolating phases can be distinguished from non-percolating ones through an unsupervised clustering method. Accuracy in the initial step is essential for extracting samples with information-rich content, that are subsequently used to predict the critical transition point through the confusion scheme, a recently introduced learning method. The method circumvents the difficulty of missing data or noisy measurements, as it allows for sampling nodes from both the core and periphery, as well as intermediate layers. We validate the effectiveness of our sampling strategy on a spectrum of synthetic network topologies, as well as on two real-word case studies: the integration time of the US domestic airport network, and the identification of the epidemic cluster of COVID-19 outbreaks in three major US states. The method proposed here allows for identifying phase transitions in empirical time-varying networks.


I. INTRODUCTION
with the former having a homogeneous population of nodes across layers, while the latter containing dense layers interspersed by sparse regions. We use a hybrid unsupervised learning method, combining t −SNE and k−means clustering, and train it on subsets of nodes sampled from both the sparse and dense layers to distinguish dynamical states above and below φ c . We show that sampling from the dense layers with nodes containing similar dynamical information in the percolation process, provides significantly higher accuracy than compared to sampling from the sparse layers or sampling nodes randomly, independent of whether nodes lie in the core or the periphery.
Having determined the optimal sampling strategy, we next use the confusion scheme to identify the critical occupation probability φ c , and once again demonstrate high accuracy as compared to the ground-truth estimates of the threshold values. Perhaps, most surprisingly, we show that optimal samples are not limited to the core of the underlying network, but there exists multiple subset of nodes in the entire range of layers in the onion spectrum. This finding bears particular significance for using such methods in empirical networks, the majority of which have heavy-tailed distributions of links, and whose measurements are noisy. Finally, we apply our formalism to two examples of real-world examples: We determine the exact critical integration time of the US air transportation network , as well as identify the epidemic cluster for COVID-19 in three major states in the US. We end with a discussion of the implications of our findings.

II. PERCOLATION ON DIFFERENT NETWORK TOPOLOGIES
Consider a network G where V = {v 1 , · · · , v N } is the node set that undergoes bond-percolation, and let time t denote the dynamical state of a certain configuration of occupied bonds. The data X generated during the process is contained in a M × N matrix, where N is the number of nodes and M is the total number of dynamical states at different values of the occupation probability φ. The entries of X are binary; Nodes are ordered from the inner-to the outer-most layers which are labeled in descending order. The onion spectrum is shown as inset, indicating that the square-lattice consists of a single shell, whereas the power-law network contains three distinct shells. Layers are populated uniformly in the square-lattice, and in a punctuated fashion in the power-law network. In (c) and (d), the percolation process in the range 0 ≤ φ ≤ 1 for the square-lattice and power-law network. The critical bond-occupation probabilities φ SL c = 0.524 and φ P L c = 0.248 are marked as the red dashed line. Nodes part of the GCC are colored black, those outside are colored white. Nodes are ordered the same as in the upper-panel.
Next, we use the onion decomposition method to uncover the core-periphery structure of the two different types of networks. In addition to the coreness metric produced by k-core decomposition that identifies nested maximal subnetworks with nodes having at least k connections, the onion decomposition improves the coreness information by assigning a layer to each node, to further indicate its position within the core and make its internal organization apparent [43].
In Fig. 1, panels a and b we show the onion spectra of the square lattice and the power-law network. Nodes are sorted with respect to their layer value in descending order, from the innerto the outer-most layer. The square lattice has a uniform spectrum where nodes populate each layer equally in the network, however the power-law network shows a spectrum with sparse- The figure indicates that in the case of the square lattice, groups of nodes across layers show common dynamics for a wide-range of φ. Nodes attach and detach from the GCC (as φ changes) in a similar fashion independent of what layer they belong to. In contrast, in the power-law network there is wide variation across layers; for a given value of φ large swathes of nodes in the inner-and outer-most layers are not part of the GCC. Further, whether a node is part of the GCC as φ is increased, varies from layer-to-layer. Finally, we note the presence of high-fidelity samples in the core-, intermediate-and peripheral-layers. For any classification algorithm, such class-imbalanced datasets pose a challenge as learning methods fail to capture the distributive characteristics of the data and produces unsatisfactory accuracies [46]. The performance of such algorithms is poor on subsets with under and over-represented classes as it tends to partition phases into relatively uniform sizes. The pre-processing of the data using the onion decomposition method, instead allows for the identification of of node subsets with similar dynamical information in the percolation process. That is, layers where nodes disconnect from the GCC at comparable values of the control parameter, hence yielding a balanced training data in the subsequent learning phase.

III. EFFECT OF SAMPLING ON CLUSTERING
Next, we focus on the classification scheme for clustering nodes as part of-or excluded fromthe GCC. We construct a hybrid unsupervised learning model with t −SNE [47] a non-linear dimensionality reduction technique, and k−means clustering [48] used for identifying predetermined number of clusters from an unlabeled dataset. We add Gaussian noise, N (0, 0.01), to the input data X (Eqn. 1) to help spread the data points and sample subsets of nodes in bins of size 20 (corresponding to ≈ 50 samples), ranging from the innermost layer to the peripheral layer. We project the subset of 20 nodes into a two-dimensional plane with t −SNE, and then use k−means with k = 2 to assign labels as in Eq. (1). To assess the performance of the algorithm we compare the labels assigned by the unsupervised learning methodŷ to the ground-truth label y and define the accuracy αŷ ,y as where the summand is the Kronecker delta-function.
In Fig. 2 we plot the results of our analysis for the square-lattice. Panel a shows αŷ ,y as a function of the sampled subset of nodes, ordered the same as in Fig The results indicate the importance of adopting a considered sampling strategy for training-sets in power-law topologies, given that unlike in networks with uniform topologies, random sampling is sub-optimal.
Indeed, very few real-world networks have uniform topologies, instead exhibiting heavy-tailed distributions, implying that for any realistic application, identifying high-quality samples a priori is of paramount importance. Given issues of data sparsity, it is of note, that such samples exist in multiple layers of the power-law network, including the core-, intermediate-and peripheral layers.

IV. IDENTIFYING CRITICAL TRANSITION POINTS
The procedure described thus far, while effective in identifying configurations below and above the percolation phase, in itself, cannot identify the critical bond occupation probability φ c . To do so, we make use of the confusion scheme, first introduced to study phase transi- 248. The random sample in the square lattice has a clear W shape, but the middle peak is noisy and is flat around φ SL c . For the power-law network the random sampling strategy fails to provide any information on the transition probability. In both cases, the sampling-guided strategy yields high accuracy on φ SL,P L c , with the middle peaks in the black curves occurring at 0.520 and 0.240. tions in Kitaev chains, the classical Ising model and in disordered quantum spin chains [14].
Recently it has been extended to uncover the critical transition probability in dynamical phase transitions in complex networks [42]. Next, we show that our sampling-guided strategy adopted to the confusion scheme is quite effective in terms of identifying the value of φ c .
The method does not take as input labels of the dynamical states, instead a synthetic label space y is associated with an input matrix I, with entires 0 and 1, corresponding to the before and after states in the percolation process. At φ = 1 the label vector y = 0 i.e. all configurations are in the before state, and for φ = 0, the vector y = 1, each snapshot is labeled as after state.
The boundary between 0 and 1 (corresponding to the critical threshold) in this artificial labelset is varied in the entire range of the control parameter φ ∈ [0, 1] with increments of ∆ = 0.005 (yielding 200 steps in total) and associated with a optimal subsample I ⊂ X selected using the method described in Sec. III. For both the square-lattice and the power-law network we select samples from the periphery that yield high accuracy.
A feed-forward neural network (FFNN) is trained with this data consisting of pairs {I : y} in the form of supervised learning problem using the PyTorch library [49]. The input layer contains neurons at the same number of the chosen subsample size, followed by a hidden layer of layer contains a single neuron with sigmoid activation function, that predicts the probability of a configuration belonging to one of the states. A binary cross-entropy loss-function is minimized in training, which is well suited for binary classification problems. For stochastic optimization we us the Adam method [50] with learning rate 10 −3 . To prevent over-fitting we use Dropout regularization with probability 10 −1 . In each step, the dataset is split into a training set that is fed into the FFNN and prediction accuracy is evaluated on the test set. Highest accuracies are achieved at endpoints of the threshold range due to the constant nature of the label space. As one spans the range between [0, 1], initially lower accuracy values are observed as some configurations are associated with incorrect labels. At the transition probability, the artificial label space matches the ground truth, leading to a high classification accuracy. In this process, the accuracy curve follows a W-shape as a function of φ, where the middle peak corresponds to the estimated transition probability [42] .
In Fig. 4, we show the output of the confusion scheme on the square lattice (a) and the power-law network (b). In both panels, the brown curve corresponds to a random sample of 20 nodes and the black curve to the high-accuracy subset from the peripheral layers, identified using the unsupervised learning method. The vertical dashed lines mark the value for the ground-truth value of φ c in each network. The random sampling strategy in the square lattice is reasonably effective, generating a W-shape, although the peak near φ c is not well-defined. In the power-law network the random sampling accuracy curve is noisy and flat yielding little-tono information on the transition probabilities. However, in both cases the black curve yields a clear W-shape and the middle-peaks line up well with the ground-truth values of φ c . Thus unlike existing methods, the sampling-guided scheme outlined here simultaneously identifies nodes in the GCC as well as provides accurate estimates for the bond-occupation probability.   [52] or percolation in ground-transportation networks to identify critical bottleneck roads in local flows [53,54]. Recently, the integration process of air traffic into a temporally connected network was modeled as as a time-varying percolation process [55].

V. TRANSITIONS IN TIME-VARYING REAL-WORLD NETWORKS
The critical integration time T c , at which the network forms a temporal spanning cluster, is proposed as a measure of global reliability of air-traffic.
We test our scheme on the air-transportation network, to identify both T c as well as label nodes that belong to the time-varying GCC. We construct the temporal air-transportation net-work [56] starting from t 0 = 7 : 30AM on September 5, 2019 and spanning a 18-hour period until the integration process is completed. The resulting network consists of 288 airports as nodes and 1903 edges, where a link corresponds to at least one flight between two airports. We generate states with a time window of [t 0 , t 0 + T ] where T is incremented in intervals of 1-minute generating ∼1100 instances as the training and test set.
The task of the unsupervised learning method is to discriminate between states before and after the integration process. Unlike in the synthetic networks studied thus far, the transition happens at an early stage (T c = 91 mins), and therefore the dataset is unbalanced. After running the onion-decomposition scheme to identify the layers, we pick two samples with 10 airports: Sample 1 from the core of the air transportation network (Atlanta, Austin, Nashville, Boston,

Charlotte, Denver, Detroit, Fort Lauderdale-Hollywood, Las Vegas, Los Angeles, Chicago O'Hare)
and Sample 2 from the periphery (Norfolk, Worcester, Southwest Oregon, Barkley, Palm Beach, Hilton Head, Punta Gorda, Pitt-Greenville, Newport News/Williamsburg, Ithaca Tompkins). We then use the t −SNE method to cluster the states. The results are shown in Fig. 5 panels a and b, indicating that the method performs well; the labels are assigned by k−means with an accuracy of αŷ ,y = 0.99 in both samples of nodes. We then use these two samples as a training set on the confusion scheme, and plot the resulting accuracy curve as a function of the time t in Fig. 5c.
The curve corresponding to the core-sample is shown as black circles, while the that for the peripheral sample is shown as red-squares. As a reference we show the case for a random sample of 10 airports (Reno-Tahoe, Albany, Rapid City Regional, Hector, Miami, Waterloo, Kansas City, Corpus Christi, Columbia Metropolitan, Baltimore/Washington) as green triangles, whereas T c is shown as the vertical dashed line. For both samples, the accuracy curve αŷ ,y shows a clear W-shape with peaks at T = 93 mins and T = 94 mins. The random sample yields a noisy curve with no clear peak (behavior near peak for all three curves shown as inset). The results illustrate the versatility of the sampling-scheme with near-perfect identification of airports in the temporally connected cluster and the ability to identify the critical integration time to within 3 − 4%, and the flexibility in sampling from the core or the periphery of the network.
Next, we consider a different dynamical process of particular relevance; the spread of COVID-19 in the United States [57]. We investigate the possibility of employing our clustering method to use as a diagnostic tool that signals at a relatively early stage, whether an epidemic outbreak is about to occur based on real-time data. We pick three major states; Texas, Georgia and New York with 3.0 , 1.1 and 1.9 (×10 7 ) inhabitants respectively. We consider a spatial resolution at the level of counties leading to 254 nodes for Texas, 159 for Georgia and 62 for New York. We construct mobility networks from the United States census bureau's LODES [58] commuting data, where the edges represent population-flows between counties corresponding to 20262 (Texas), 11042 (Georgia) and 1883 (New York) undirected links. We then follow the temporal evolution of the number of detected cases in each county from January 21 st to May 18 th 2020 [59]. Counties are labeled "infected" when the number of cases per-capita is above a threshold of 10 −4 . In Fig. 6a, we plot the empirical temporal evolution of the number of infected counties finding an emergence of an epidemic cluster around day 60 for all three states. In the top-row of Fig. 6, we train our unsupervised learning model with samples of size 10 selected from the high-fidelity layers of of the mobility networks in cumulative intervals of 20 days starting from day 0. In Fig. 6b we plot the accompanying accuracy curve αŷ ,y in function of time. Before day 60, there is no epidemic cluster, and therefor αŷ ,y = 0.5 equivalent to model-independent simple guessing of labels. After day 60, however, once an epidemic cluster emerges, the model is able to reliably split counties into infected and disease-free clusters and make more accurate predictions. The increase in the accuracy curve tracks the increase in the epidemic cluster and reaches perfect accuracy at around day 80. We note that from a point of real-time forecasting the model reaches accuracies of ≈ 70% when the size of the epidemic cluster is ≈ 0.4.

VI. DISCUSSION
Taken together, our work sheds light on the role of the micro-and mesoscopic structure of networks in machine learning the phases in bond percolation. Our sampling guided approach reveals the importance of choosing particular subset of nodes from layers of the onion spectrum of the network that enables unsupervised learning methods to distinguish between percolating and non-percolating states. Labels assigned by k−means matches the ground truth labels with near-perfect accuracy. We show that this is facilitated by sampling nodes chosen from both core and peripheral layers, identified using onion decomposition, that identifies subsets of nodes in a spectrum of layers, that follow similar paths in the percolation process, i.e. they detach and attach to the percolating cluster at comparable values of φ. The sampling-guided strategy carries over to other learning tasks, such as identifying the critical occupation probability φ c using the confusion scheme. This gain in performance is particularly pronounced for networks with heavy-tailed degree distributions, where the method significantly outperforms random sampling. Indeed, to the best of our knowledge, the framework presented here is the first to simultaneously enable the clustering of nodes into different dynamical states, as well as identify φ c in networks with heterogenous topologies. This bears significance, given that many empirical networks exhibit right-skewed distributions of links.
To validate our results, we demonstrate two possible applications of our findings on realworld time-varying networks that exhibit a percolation transition; the exact integration time of the US domestic air transportation network, as well as the emergence of the COVID-19 epidemic cluster in three large US states. In both cases the framework yields excellent perfor-mance. The application to pandemic settings is of particular interest, as a possible diagnostic tool to assess the current state of disease-spread with real-time data. The ability to accurately classify (with reasonable accuracy) regions into infected and disease-free states (close to when the epidemic cluster first emerges) could prove useful in terms of mitigation strategies. Indeed, techniques have been proposed to study immunization strategies in networks where only a small subset of nodes are observed at a time, to slow-down epidemic spread [60,61]. Given the limited knowledge of network structure, immunizing a small sample of nodes provides significant improvement in the global level immunization of the network [62]. Similar considerations apply in the diffusion of rumors or "fake news" in social media and online platforms [63]. The approach proposed here, can in principle be easily extended such types of dynamical processes on networks.