Metrics for network comparison using egonet feature distributions

Identifying networks with similar characteristics in a given ensemble, or detecting pattern discontinuities in a temporal sequence of networks, are two examples of tasks that require an effective metric capable of quantifying network (dis)similarity. Here we propose a method based on a global portrait of graph properties built by processing local node features. More precisely, a set of dissimilarity measures is defined by processing the distributions, over the network, of a few egonet features, namely the degree, the clustering coefficient, and the egonet persistence. The method, which does not require the alignment of the two networks being compared, exploits the statistics of the three features to define one- or multi-dimensional distribution functions, which are then compared to define a distance between the networks. The effectiveness of the method is evaluated using a standard classification test, i.e., recognizing the graphs originating from the same synthetic model. Overall, the proposed distances perform comparably to the best state-of-the-art techniques (graphlet-based methods) with similar computational requirements. Given its simplicity and flexibility, the method is proposed as a viable approach for network comparison tasks.

In this paper, we propose an alignment-free method based on the distribution, over the network, of a few indicators that locally describe the neighborhood of the nodes. By neighborhood we mean the 1-step ego-network, or egonet (hence the name EgoDist, short for ego-distances, for the distances proposed here), while the indicators we consider are the (normalized) degree, the clustering coefficient, and the egonet persistence (see the next section for definitions), which describe in increasing detail the connectivity patterns inside and outside the egonet. The statistics of the three indicators can be used, in different combinations, to define a distribution function in 1, 2, or 3 dimensions, which is taken as a synthesis of the network properties. The distance between two networks is then defined as the distance between the corresponding distributions. Thus, similarly to graphlet-based techniques, the local graph structure around each node is summarized by a low-dimensional vector of features, whose statistics are then used as a global network descriptor. However, the features we use are elementary and quick to compute.
The effectiveness of the proposed ego-distances is evaluated by means of a standard clustering exercise [11,12]: a large number of test networks are synthetically generated with seven randomized models, parameterized with different sizes and densities. The proposed ego-distances are then used for classification, that is, to discriminate pairs of graphs originating from the same model from those originating from different models, with size and density acting as confounding factors. The results of the experiments show that the proposed ego-distances perform comparably to the best graphlet-based distance (Graphlet Correlation Distance GCD11), with similar computational requirements, despite the challenging experimental environment created by the subtle topological differences between the network models in the pool.

Methods: Network distance
Let us consider an undirected, unweighted network of size $N$ (number of nodes), which is therefore completely described by the $N \times N$ adjacency matrix $A$, with $A_{ij} = 1$ if $i$ and $j$ are connected by an edge and $A_{ij} = 0$ otherwise. If we denote by $L = \frac{1}{2}\sum_{i,j=1}^{N} A_{ij}$ the number of edges, then the density of the network is $\rho = \frac{2L}{N(N-1)}$. Given node $i$, its degree $m_i = \sum_{j=1}^{N} A_{ij}$ is the number of its neighbors, and its egonet is the induced subgraph identified by the node set $E_i = \{i\} \cup \{j \mid A_{ij} = 1\}$, i.e., the union of node $i$ and all its neighbors, for a total of $|E_i| = m_i + 1$ nodes.
Three scalar quantities, all taking values in $[0, 1]$, can be used to characterize the egonet $E_i$: the normalized degree, the clustering coefficient, and the egonet persistence.
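The definitions above can be made concrete with a minimal sketch. It assumes the graph is stored as a dictionary mapping each node to the set of its neighbors; this storage format, the function name `egonets`, and the toy graph are illustrative choices, not part of the original method description.

```python
def egonets(adj):
    """Return density, degrees and egonet node sets of an undirected graph.

    `adj` maps each node to the set of its neighbours (assumed format).
    """
    n = len(adj)
    degree = {i: len(nbrs) for i, nbrs in adj.items()}      # m_i
    n_edges = sum(degree.values()) // 2                     # L = (1/2) sum_ij A_ij
    density = 2 * n_edges / (n * (n - 1))                   # rho = 2L / (N (N - 1))
    ego = {i: {i} | set(nbrs) for i, nbrs in adj.items()}   # E_i = {i} U neighbours
    return density, degree, ego


# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
rho, m, E = egonets(toy)
print(rho, m, E)
```

In the toy graph every egonet has $m_i + 1$ nodes, as stated above.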

Normalized degree
For a graph with node degrees ranging in $m_{\min} \le m_i \le m_{\max}$, we define the normalized degree $d_i$ as
$$d_i = \frac{m_i - m_{\min}}{m_{\max} - m_{\min}}, \tag{1}$$
where we have assumed $m_{\min} \ne m_{\max}$. We compute the normalized degree distribution by discretizing the interval $0 \le d_i \le 1$ with step $\Delta$ (then $r = 1/\Delta$ is the number of intervals) and by directly calculating the discrete distribution function $P_d(h)$ (i.e., the normalized histogram) by counting the proportion of $d_i$'s in each interval. Using the indicator function ($\mathbf{1}_S(x) = 1$ if $x \in S$ and zero otherwise), we can write:
$$P_d(h) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_{[(h-1)\Delta,\, h\Delta)}(d_i), \qquad h = 1, 2, \ldots, r, \tag{2}$$
with values $d_i = 1$ conventionally counted in the last interval $h = r$. Figure 1 (left panel) shows examples of $P_d$ distributions for networks generated by three different, well-known models, with the same size $N$ and density $\rho$. It is preferable to use the cumulative distribution function (cdf) $Q_d(h) = \sum_{k=1}^{h} P_d(k)$, which is numerically more stable for small $N$. Given the two graphs $G'$ and $G''$, we can define the distance between them as the (Euclidean) distance between the two respective cdf's $Q'_d$ and $Q''_d$:
$$D_d(G', G'') = \sqrt{\sum_{h=1}^{r} \left(Q'_d(h) - Q''_d(h)\right)^2}. \tag{3}$$
Notice that the above quantity, as well as the analogous ones defined in the following, is well defined even when $G'$ and $G''$ have different sizes, provided that the cdf's are calculated with the same step $\Delta$.
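The pipeline from degrees to a degree-based distance can be sketched as follows. This is a minimal standard-library illustration, assuming the plain (unnormalized) Euclidean norm between cdf's; the function names and the two toy degree sequences (a 5-node star and a 4-node path) are hypothetical, and `bins` plays the role of $r = 1/\Delta$.

```python
import math


def normalized_degrees(degrees):
    """Rescale degrees to [0, 1]; undefined when all degrees coincide."""
    lo, hi = min(degrees), max(degrees)
    if lo == hi:
        raise ValueError("m_min == m_max: normalized degree is undefined")
    return [(m - lo) / (hi - lo) for m in degrees]


def cdf(values, bins=100):
    """Cumulative histogram Q(h) of values in [0, 1] over `bins` equal intervals."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1   # v == 1 goes in the last bin
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running / len(values))
    return out


def cdf_distance(q_a, q_b):
    """Euclidean distance between two cdf's computed with the same bins."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q_a, q_b)))


star = [4, 1, 1, 1, 1]   # degrees of a star with four leaves
path = [1, 2, 2, 1]      # degrees of a path with four nodes
q_star = cdf(normalized_degrees(star), bins=4)
q_path = cdf(normalized_degrees(path), bins=4)
print(q_star, q_path, cdf_distance(q_star, q_path))
```

The two toy graphs have different sizes, yet the distance is well defined because both cdf's use the same number of bins.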

Clustering coefficient
The (local) clustering coefficient is defined, for a node $i$ with $m_i > 1$ neighbors connected by $e_i$ edges, by
$$c_i = \frac{2 e_i}{m_i (m_i - 1)}, \tag{4}$$
while we set $c_i = 0$ if $m_i \le 1$. The global (i.e., averaged over all nodes) clustering coefficient $C = \frac{1}{N}\sum_{i=1}^{N} c_i$ has been used as a metric for comparing networks: the distance between two graphs with clustering coefficients $C'$ and $C''$ is simply given by
$$D_{C\,\mathrm{global}}(G', G'') = |C' - C''|. \tag{5}$$
The simplicity of this approach makes it very attractive: its performance, e.g., in classification tasks, is fairly good and, in any case, it is the best among the "simple" network indicators [11,12].
Here we want to show that the performance obtained with the clustering coefficient can be improved by exploiting the distribution of the $c_i$'s over the entire network, which is much more informative than the global value $C$ alone. As we did above for $d_i$, we discretize the interval $0 \le c_i \le 1$ and we compute the distribution function $P_c(h)$ (i.e., the normalized histogram) by counting the proportion of the $c_i$'s in each interval. Figure 1 (middle panel) shows that $P_c(h)$ has a fair capability of differentiating the three networks analyzed: in ER practically all $c_i$'s are zero, SFBA has a small number of nonzero but low values, while the distribution of GEO has its mass concentrated around 0.5. Given two graphs $G'$ and $G''$, we can define their distance as
$$D_c(G', G'') = \sqrt{\sum_{h=1}^{r} \left(Q'_c(h) - Q''_c(h)\right)^2}, \tag{6}$$
i.e., as the (Euclidean) distance between the corresponding cdf's $Q_c(h) = \sum_{k=1}^{h} P_c(k)$.
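The local clustering coefficient and the simple global distance $|C' - C''|$ can be sketched as below. The adjacency format (node mapped to a set of neighbors), the function names, and the two toy graphs are assumptions made for illustration only.

```python
def local_clustering(adj):
    """c_i = 2 e_i / (m_i (m_i - 1)), with c_i = 0 when m_i <= 1."""
    c = {}
    for i, nbrs in adj.items():
        m = len(nbrs)
        if m <= 1:
            c[i] = 0.0
            continue
        # Each edge among the neighbours of i is seen from both endpoints,
        # so this sum equals 2 * e_i.
        twice_e = sum(len(adj[u] & nbrs) for u in nbrs)
        c[i] = twice_e / (m * (m - 1))
    return c


def global_clustering(adj):
    """Average of the local coefficients over all nodes."""
    c = local_clustering(adj)
    return sum(c.values()) / len(c)


# Toy graphs: a triangle with a pendant node, and a 4-node star (no triangles).
tri = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
c_tri = local_clustering(tri)
d_global = abs(global_clustering(tri) - global_clustering(star))
print(c_tri, d_global)
```

The per-node values in `c_tri` are what the distribution-based distance would histogram; `d_global` collapses them into a single number per graph.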

Egonet persistence
The egonet persistence $p_i$ is defined as the persistence probability [17,18] of the egonet $E_i$, i.e., the probability that a random walker, located in any of the nodes of $E_i$ at step $t$, remains in any node of $E_i$ at step $t + 1$. For an undirected and unweighted network, it can be proved [17] that this quantity is equal to
$$p_i = \frac{\sum_{j \in E_i} m_j^{\mathrm{int}}}{\sum_{j \in E_i} \left(m_j^{\mathrm{int}} + m_j^{\mathrm{ext}}\right)}, \tag{7}$$
where $m_j^{\mathrm{int}}$ (resp. $m_j^{\mathrm{ext}}$) denotes the internal (resp. external) degree of node $j$, i.e., the number of its neighbors internal (resp. external) to $E_i$ (we conventionally set $p_i = 0$ when $E_i = \{i\}$, i.e., $i$ is an isolated node). As above, we discretize the interval $0 \le p_i \le 1$ and we compute the distribution function $P_p(h)$ by counting the proportion of $p_i$'s in each interval.
While the clustering coefficient $c_i$ only captures the connectivity within the egonet, $p_i$ measures the balance between internal and external connectivity, as exemplified in Fig. 2: it quantifies the proportion of edges that the nodes of the egonet direct into the egonet itself, rather than to external nodes. Therefore the two indicators carry, in general, independent information, as emerges from Fig. 1, which shows three examples of the distribution $P_p(h)$ for networks generated by different models, and as is further demonstrated in Fig. 3, where 2-dimensional scatter plots are shown for the same three networks.
Given the two graphs $G'$ and $G''$, we define their distance as
$$D_p(G', G'') = \sqrt{\sum_{h=1}^{r} \left(Q'_p(h) - Q''_p(h)\right)^2}, \tag{8}$$
i.e., as the (Euclidean) distance between the corresponding cdf's $Q_p(h) = \sum_{k=1}^{h} P_p(k)$.
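The persistence of an egonet, read as the ratio between the internal degrees and the total degrees of its nodes, can be sketched in a few lines. The adjacency format, the function name, and the toy graph (a triangle with a pendant node plus one isolated node) are illustrative assumptions.

```python
def egonet_persistence(adj):
    """p_i = (sum of internal degrees in E_i) / (sum of total degrees in E_i)."""
    p = {}
    for i, nbrs in adj.items():
        ego = {i} | nbrs
        total = sum(len(adj[j]) for j in ego)            # internal + external
        internal = sum(len(adj[j] & ego) for j in ego)   # neighbours inside E_i
        p[i] = internal / total if total else 0.0        # isolated node: p_i = 0
    return p


toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
p = egonet_persistence(toy)
print(p)
```

In the toy graph, the egonet of node 0 is the triangle, which loses a random walker only through the edge from node 2 to node 3, hence $p_0 = 6/7$; the egonet of node 2 spans the whole connected component, hence $p_2 = 1$.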

Multi-dimensional distributions and distances
So far we have introduced three graph-to-graph distances, $D_d$, $D_c$, and $D_p$, each based on a single indicator. Since normalized degree, clustering coefficient, and egonet persistence carry in general independent information, they can be combined to define more complex metrics. A simple solution would be to add them:
$$D_{\mathrm{sum}}(G', G'') = D_d(G', G'') + D_c(G', G'') + D_p(G', G''), \tag{9}$$
or, more generally, to combine them linearly (this would require, however, properly tuning the parameters of the linear combination).
A more general approach is inspired by the plots in Fig. 3, which clearly show that different network models give rise to different distributions of the two-dimensional variable $(c_i, p_i)$. Therefore, we define the two-dimensional discrete distribution function $P_{c,p}(h, k)$ as the normalized 2D histogram:
$$P_{c,p}(h, k) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_{[(h-1)\Delta,\, h\Delta)}(c_i)\, \mathbf{1}_{[(k-1)\Delta,\, k\Delta)}(p_i), \qquad h, k = 1, 2, \ldots, r, \tag{10}$$
with values $c_i = 1$ (resp. $p_i = 1$) conventionally counted in the last interval $h = r$ (resp. $k = r$). Given the two graphs $G'$ and $G''$, we define their distance as
$$D_{c,p}(G', G'') = \sqrt{\sum_{h=1}^{r}\sum_{k=1}^{r} \left(Q'_{c,p}(h, k) - Q''_{c,p}(h, k)\right)^2}, \tag{11}$$
that is, the Frobenius norm of the difference between their cdf's $Q_{c,p}(h, k) = \sum_{u=1}^{h}\sum_{v=1}^{k} P_{c,p}(u, v)$. Of course, in the same way as we did in (10), (11), we can define two more 2D measures, namely $D_{d,c}(G', G'')$, based on the two-dimensional variable $(d_i, c_i)$, and $D_{d,p}(G', G'')$, based on the pair $(d_i, p_i)$. Finally, a distance measure that fully exploits all the available information is obtained by considering the multivariate distribution of the three-dimensional variable $(d_i, c_i, p_i)$, which captures the different patterns originating from the different network models (Fig. 4). This requires partitioning the set $[0, 1]^3$ into $r^3$ discretization cubes, computing the three-dimensional discrete distribution function $P_{d,c,p}(h, k, n)$ as
$$P_{d,c,p}(h, k, n) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_{[(h-1)\Delta,\, h\Delta)}(d_i)\, \mathbf{1}_{[(k-1)\Delta,\, k\Delta)}(c_i)\, \mathbf{1}_{[(n-1)\Delta,\, n\Delta)}(p_i), \tag{12}$$
and defining the network distance between $G'$ and $G''$ as
$$D_{d,c,p}(G', G'') = \sqrt{\sum_{h=1}^{r}\sum_{k=1}^{r}\sum_{n=1}^{r} \left(Q'_{d,c,p}(h, k, n) - Q''_{d,c,p}(h, k, n)\right)^2}, \tag{13}$$
where $Q_{d,c,p}(h, k, n) = \sum_{u=1}^{h}\sum_{v=1}^{k}\sum_{w=1}^{n} P_{d,c,p}(u, v, w)$.
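A two-dimensional cdf and the corresponding Frobenius-norm distance can be sketched as follows. This is a standard-library illustration under the assumption of an unnormalized Frobenius norm; the function names and the two toy feature sets are hypothetical, and `bins` stands for $r$.

```python
import math


def cdf2d(xs, ys, bins=100):
    """2D cumulative histogram Q(h, k) of paired features in [0, 1] x [0, 1]."""
    n = len(xs)
    hist = [[0.0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        h = min(int(x * bins), bins - 1)    # value 1 goes in the last bin
        k = min(int(y * bins), bins - 1)
        hist[h][k] += 1 / n
    # Cumulative sum along both axes via inclusion-exclusion.
    q = [[0.0] * bins for _ in range(bins)]
    for h in range(bins):
        for k in range(bins):
            q[h][k] = (hist[h][k]
                       + (q[h - 1][k] if h else 0.0)
                       + (q[h][k - 1] if k else 0.0)
                       - (q[h - 1][k - 1] if h and k else 0.0))
    return q


def frobenius_distance(q_a, q_b):
    """Frobenius norm of the difference between two 2D cdf's."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(q_a, q_b)
                         for a, b in zip(row_a, row_b)))


# Toy (c_i, p_i) samples for two hypothetical two-node graphs.
q_a = cdf2d([0.0, 1.0], [0.0, 1.0], bins=2)
q_b = cdf2d([0.0, 0.0], [1.0, 1.0], bins=2)
print(frobenius_distance(q_a, q_b))
```

The three-dimensional case follows the same pattern with one more nested index, at the cost of $r^3$ cells instead of $r^2$.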

Classification of synthetic networks
To have a fair evaluation of the efficacy of the ego-distances proposed above, we reproduce the experimental setup described in Refs. [11,12] by generating synthetic networks from the same seven models used therein (see the section Materials for details and references): Erdős-Rényi model (ER); ER degree distribution preserving model (ERDD); Barabási-Albert scale-free preferential attachment model (SFBA); scale-free gene duplication and divergence model (SFGD); geometric random graph model (GEO); geometric model with gene duplication (GEOGD); stickiness-index based model (STICKY). For each model, we generate networks with size $N$ = 1000, 2000, and 4000, and density $\rho$ = 0.004, 0.01, and 0.02, for a total of 7 × 3 × 3 = 63 model/size/density combinations. For each combination, we randomly generate 10 network instances, so that the experimental setup includes 630 networks.
It should be noted that most of the above network models do not necessarily produce a connected network (SFBA and ERDD are the only exceptions; see the section Materials for details). For example, at $\rho$ = 0.004 most of the networks in our sample are not connected. However, the proposed measures have no difficulty handling disconnected networks. In particular, the three proposed egonet features are well defined also for isolated nodes: they all take the value zero, thus shaping the distributions of the features in a way that suitably characterizes the network.
For each pair of networks, we compute the ego-distances defined above, using discretization step $\Delta$ = 0.01 (the results are largely insensitive to this parameter thanks to the use of cdf's). We also consider $D_{C\,\mathrm{global}}$ (eq. (5)), which is an easy-to-calculate distance based on a global network feature. Finally, to have a challenging comparison, we compute the Graphlet Correlation Distance GCD11 ($D_{\mathrm{GCD11}}$) [11,12], which is considered one of the most effective distances [10].
The goal of the classification task is to recognize when two networks come from the same model. For this purpose, the performance of each distance is evaluated in the usual Precision/Recall framework: two networks form an actual positive pair if they are generated by the same model, and a predicted positive pair if their distance is below a threshold $\varepsilon$. Precision and Recall are then defined as $P = \frac{tp}{tp + fp}$ and $R = \frac{tp}{tp + fn}$, where $tp$, $fp$, and $fn$ are, respectively, the number of true positive, false positive, and false negative network pairs. The Precision/Recall curve provides a graphical representation of the simultaneous evolution of $P$ and $R$ with $\varepsilon$, and the area under the curve (denoted by AUPR, i.e., Area Under the Precision/Recall curve, 0 ≤ AUPR ≤ 1) is a quantity that summarizes the performance of each distance, with the limit AUPR = 1 obtained in the ideal case [19,20].
Table 1. AUPR (Area Under the Precision/Recall curve) values for the classification of ER, GEO, and SFBA networks, for the ego-distances defined in the section Methods and for the Graphlet Correlation Distance GCD11 [11].
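The Precision/Recall evaluation can be sketched as a threshold sweep over the pairwise distances. The sketch below is an illustration under two assumptions that the text does not fix: a pair is predicted positive when its distance is at most $\varepsilon$, and the area is accumulated with the step-wise (average precision) rule. Function names and the toy distances are hypothetical.

```python
def precision_recall_curve(distances, same_model):
    """Sweep eps over the observed distances; return (eps, precision, recall)."""
    n_pos = sum(same_model)
    if n_pos == 0:
        raise ValueError("at least one actual positive pair is required")
    pairs = sorted(zip(distances, same_model))
    curve, tp, fp, i = [], 0, 0, 0
    while i < len(pairs):
        eps = pairs[i][0]
        # All pairs tied at this distance become predicted positives together.
        while i < len(pairs) and pairs[i][0] == eps:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        curve.append((eps, tp / (tp + fp), tp / n_pos))   # fn = n_pos - tp
    return curve


def aupr(curve):
    """Step-wise area under the Precision/Recall curve."""
    area, prev_recall = 0.0, 0.0
    for _, precision, recall in curve:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy data: four network pairs, two of which come from the same model.
ideal = aupr(precision_recall_curve([0.1, 0.2, 0.3, 0.4], [True, True, False, False]))
mixed = aupr(precision_recall_curve([0.1, 0.2, 0.3, 0.4], [True, False, True, False]))
print(ideal, mixed)
```

A distance that ranks all same-model pairs before all different-model pairs reaches the ideal value 1, as in the first toy case.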
As clearly highlighted in Ref. [12], an ideal distance should be able to recognize networks generated by the same model without being confused by possible differences in size and density, being influenced only by structural differences in the topology, which is a very challenging task. On the other hand, a distance can be considered good, though not ideal, when it is able to perform the above task at least when the networks are of the same size and density. For this reason, we also systematically evaluate the performance of each distance on a subset of network pairs, namely those (possibly) generated by different models but having the same size and density.
We start by limiting the analysis to just three network models, namely ER, GEO, and SFBA. In the previous section, their characteristics were compared in Figs. 1, 3, and 4, showing that each model has some peculiarities, in terms of the adopted egonet features, which differentiate it from the other two. Thus we expect the proposed ego-distances to be able to correctly classify the network model: Table 1 shows that this is indeed the case for all ego-distances if size and density are the same, and also for most ego-distances when networks are of different size and/or density. In other words, for this simplified task the egonet features are sufficiently differentiated that most combinations of two or three cdf's distinguish the three models perfectly. As an example, we report in Fig. 5 the distance matrix for $D_{d,c,p}$, which highlights the clear separation between the different models.
Much more challenging is the task of recognizing the correct model when there are subtle topological differences, as in the case where all seven models are in the pool: the results are summarized in Table 2 (upper part). In terms of AUPR, all the proposed ego-distances outperform GCD11 when sizes/densities are mixed. On the other hand, GCD11 achieves much better results in the simpler task of pairing networks of the same size/density.
The performance of ego-distances further improves if we restrict the computation of the cdf's of $d_i$, $c_i$, $p_i$ to a sub-range of $[0, 1]$. Indeed, if we analyze the quadratic error terms $(Q'(h) - Q''(h))^2$ which form the core of the distances (3), (6), (8) and plot their dependence on $h = 1, 2, \ldots, r$ covering the interval $[0, 1]$, we see (Fig. 6) that the differences between networks vanish as the upper bound is approached. This is not surprising, since $d_i$, $c_i$, $p_i$ rarely take values close to 1. Therefore, by restricting the computation to the range where the above terms are significant, we increase the sensitivity of the distances and, as a by-product, reduce the computational effort. For this purpose, we define a cap value $0 < T \le 1$ (i.e., a maximum value) as the upper limit for computing the cdf differences: in (3), (6), (8), all sums remain extended to $h = 1, 2, \ldots, r$, but now $r = T/\Delta$ instead of $r = 1/\Delta$.
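The effect of the cap value can be sketched by truncating the sum of squared cdf differences at $r = T/\Delta$. The function name, the default arguments, and the toy cdf's below are illustrative assumptions; the cdf's are taken as already computed with step $\Delta$.

```python
import math


def capped_distance(cdf_a, cdf_b, step=0.01, cap=0.5):
    """Euclidean distance between two cdf's restricted to the range [0, cap]."""
    if not 0 < cap <= 1:
        raise ValueError("cap must satisfy 0 < T <= 1")
    r = round(cap / step)                 # r = T / Delta instead of 1 / Delta
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cdf_a[:r], cdf_b[:r])))


# Toy cdf's on four bins (Delta = 0.25).
q_a = [0.8, 0.8, 0.8, 1.0]
q_b = [0.5, 0.5, 0.5, 1.0]
full = capped_distance(q_a, q_b, step=0.25, cap=1.0)
half = capped_distance(q_a, q_b, step=0.25, cap=0.5)
print(full, half)
```

With $T = 0.5$ only the first half of the bins contributes, so fewer terms are summed.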
Table 2 (lower part) and Fig. 7 show the results obtained with cap value $T$ = 0.5. Compared to the case $T$ = 1, the AUPR values do not vary appreciably when all sizes/densities are mixed, while they increase significantly when comparing networks with the same size/density. The Precision/Recall curves of Fig. 7 [...] (the F1 score being the harmonic mean of Precision and Recall). On the other hand, when the comparison is restricted to networks with the same size/density, Fig. 7 (right panel) confirms that the Graphlet Correlation Distance $D_{\mathrm{GCD11}}$ remains overall superior, although some ego-distances are able to achieve comparable F1 values.
We now summarize the above results, in order to provide guidance on which of the proposed ego-distances to adopt. We observe that our results depend on the experimental dataset, which is rich and diversified but does not include all possible network structures: any conclusion must therefore be evaluated with caution. In general, the AUPR (Table 2) should be the primary selection criterion, as it represents the average precision across all possible recall values. Not surprisingly, AUPR highlights the superiority of 2D and 3D measures, which exploit more information than 1D measures. In most cases, $D_{d,c,p}$ performs best (or nearly so) and should therefore be considered the preferred metric, an expected result, since it uses the most information in the most structured form, that is, 3D distributions. Furthermore, the Precision/Recall curves of Fig. 7 show that $D_{d,c,p}$ maintains high Precision for Recall → 0 (left side of the curve), has a monotone and sufficiently regular behavior, and retains a rather high Precision even for Recall → 1 (right side of the curve). As we will see in the next section (Computational requirements), however, $D_{d,c,p}$ is the most computationally expensive of the ego-distances. If its use is prohibitive in a specific application, $D_{d,c}$ and $D_{d,p}$ are valid alternatives, as shown in Table 2.

Computational requirements
The same pool of synthetic networks, previously used to evaluate the classification capabilities of the proposed ego-distances, was also used to empirically test their computational requirements. A theoretical prediction is quite difficult, given the mixed sequence of operations involved in each ego-distance, namely the computation of one or more egonet features, of the cdf's, and of their Euclidean distance. The first task (computing the egonet features), however, is definitely dominant for medium to large networks. In general terms, assuming that checking the connection of a node pair $(i, j)$ requires a fixed time, computing the degree of a node requires time $O(N)$, which becomes $O(N^2)$ for all $N$ nodes. For the clustering coefficient (eq. (4)), checking connections between the neighbors of a node of degree $m_i$ requires $O(m_i^2)$ operations, which is $O(N^2)$ in the worst case and therefore $O(N^3)$ for the whole network. The same holds for the egonet persistence (eq. (7)): in the numerator we check the $(m_i + 1)^2$ possible connections internal to $E_i$, while in the denominator we check the $(m_i + 1)N$ possible connections of the nodes of $E_i$ with all nodes of the network. Both terms are $O(N^2)$ in the worst case, which leads to a complexity $O(N^3)$ for all $N$ nodes. On the other hand, the computational complexity of the Graphlet Correlation Distance GCD11 is $O(N m_{\max}^3)$ [12,21]. However, typical networks are often far from the worst case and therefore the computational requirements are milder.
To get an empirical estimate of the computational requirements of the ego-distances, we run experiments for all nine combinations of size (N = 1000, 2000, and 4000) and density (ρ = 0.004, 0.01, and 0.02) used above: for each pair (N, ρ), we compute the 70 × 69/2 = 2415 distances between all possible pairs of the 70 networks having the prescribed size and density (recall that we have 7 network models and 10 replicas for each model), and we add up the time required for the computation.Finally, we have an aggregated time for each pair (N, ρ), which is obtained from models with mixed characteristics and thus is representative of the average computational requirements of the distance used.
Figure 8 shows the results of the above experiments. First of all, we must mention that the time needed to compute the three ego-distances $D_d$, $D_c$, and $D_p$ (omitted in the figure) increases from the first to the last, as one would expect, with the first almost negligible compared to the others. This is why it clearly emerges from the figure that, for fixed $N$ and $\rho$, $D_{d,c}$ and $D_{d,p}$ (in that order) are the fastest to compute, followed by $D_{c,p}$, which is based on the two most expensive features, and by $D_{\mathrm{sum}}$ and $D_{d,c,p}$, which require all three features (the figure shows that the latter is the slowest of the ego-distances, due to the need to process 3D distributions). All ego-distances are faster than GCD11 anyway (at least in our implementation, see section Materials below for details). The three panels in the first row of Fig. 8 show, for all $\rho$ values, a computation time approximately increasing as $t \propto N^{\alpha}$, with $\alpha$ between 1.89 and 2.93. GCD11 scales more favorably in this regard, with $\alpha$ ranging from 1.33 to 1.99. The plots showing the dependence on $\rho$ (second row of the figure) confirm the ranking between the distances in terms of computation time.
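An exponent $\alpha$ of a scaling law $t \propto N^{\alpha}$ is commonly estimated as the slope of a least-squares line in log-log coordinates. The sketch below assumes this fitting procedure, which the text does not specify; the timings are synthetic placeholders generated from an exact quadratic law, not measurements.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(t) versus log(N)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


sizes = [1000, 2000, 4000]
synthetic_times = [1e-6 * n ** 2 for n in sizes]   # placeholder data, alpha = 2
alpha = scaling_exponent(sizes, synthetic_times)
print(alpha)
```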

Example of application: European air transportation network
We demonstrate the use of the ego-distances introduced above on data describing the European air transportation network. The dataset includes 37 networks, corresponding to airlines, each with 448 nodes representing European airports (only the connected component of each network is considered, formed by the airports connected by flights of the corresponding airline; the complete lists of airports and airlines can be found in Refs. [10,22]). To show how ego-distances can be used flexibly to spot specific network (dis)similarities, we compute and compare the results of the two simplest distances introduced above, namely $D_c$ and $D_d$. Figure 9 (top row) shows the dendrograms summarizing the results of the hierarchical cluster analysis based on these two distances. In the same figure, the bottom row shows the graphs of five of the networks analyzed, related to five distinct airlines. Despite the apparent dissimilarity of the graphs, the two distances reveal that some of the networks have similar characteristics.
For example, networks #9 (KLM) and #31 (Czech Airlines) are at zero distance according to $D_c$: they are both pure stars and, as such, all their nodes have zero clustering coefficient. But they are also very similar according to $D_d$, as can be read from the right dendrogram in Fig. 9. Indeed, the normalized degree distribution is in both cases concentrated on the two extremes 0 and 1, corresponding respectively to the minimum and maximum degree, with slightly different values due to the unequal number of nodes in the two cases.
Despite the different topology, network #4 (British Airways) also has $D_c = 0$ from #9 and #31: although it is not a pure star, it contains no triangles either, so that all its nodes have zero clustering coefficient. Network #4 is also quite close to #9 and #31 in terms of $D_d$ because, once again, the degree is concentrated in the extreme values (the leaves and the two hubs), with very few exceptions of nodes with degree 2.
As a final example, consider networks #1 (Lufthansa) and #35 (Wideroe). Although they differ in size and topology, their common trait is the abundance of triangles, resulting in similar distributions of the clustering coefficient. Indeed, the dendrogram based on $D_c$ places the two networks at a short distance from each other, and at the maximum possible distance from the triangle-free networks #4, #9, and #31. On the other hand, although the two networks are similar in terms of $D_c$, the dendrogram based on $D_d$ reveals that they are the furthest apart, in the analyzed dataset, in terms of degree distribution, a result that could be guessed, to some extent, by looking at the two graphs.
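A hierarchical cluster analysis of the kind summarized by a dendrogram can be sketched with a naive agglomerative procedure on a precomputed distance matrix. The linkage rule is not stated in the text, so complete linkage (`max`) is assumed here; the 4 × 4 matrix contains made-up values for illustration and is unrelated to the airline data.

```python
def agglomerate(dist, linkage=max):
    """Naive agglomerative clustering; returns merges as (left, right, height)."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster-to-cluster distance under the chosen linkage rule.
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical distances: items 0 and 1 coincide, items 2 and 3 are close.
toy = [[0.0, 0.0, 1.0, 1.0],
       [0.0, 0.0, 1.0, 1.0],
       [1.0, 1.0, 0.0, 0.1],
       [1.0, 1.0, 0.1, 0.0]]
merges = agglomerate(toy)
print(merges)
```

Each merge height corresponds to the level at which two branches of the dendrogram join; passing `linkage=min` would give single linkage instead.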

Example of application: Mobility networks during COVID-19 lockdown
When a sequence of time-stamped networks is available, a typical task is to quantify the (dis)similarity between the graphs to identify anomalous instants in their time evolution. Below we briefly show the results of the analysis of the sequence of mobility networks between Italian cities, estimated from the digital traces of over 4 million individuals [23] across the 2020 lockdown period caused by the COVID-19 pandemic.
The dataset includes 32 networks, each of which aggregates the mobility of individuals over a week. The first week begins on February 24, when the first infected individuals had just been detected in Northern Italy and only restrictions on local mobility around the affected cities were imposed. The full lockdown started on March 9 (https://en.wikipedia.org/wiki/COVID-19_lockdowns_in_Italy) and was gradually lifted from May 4. The period covered by the dataset ends in early October. Each network describes the flow of individuals among approximately 3000 municipalities. We binarize the networks by neglecting weights, thus preserving only the structure, with the aim of analyzing the evolution of the mobility backbone over time. Recall that ego-distances are alignment-free measures, so the focus is on the evolution of the network structure, rather than on the variation of specific flows from city to city.
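The binarization step can be sketched as follows. This is only an illustration under stated assumptions: the record format (origin, destination, flow) is hypothetical, any positive flow is taken to define an edge, and the result is treated as an undirected graph; the toy records are invented and are not taken from the dataset.

```python
def binarize(weighted_flows):
    """Drop weights (and, by assumption, directions) from flow records.

    weighted_flows is an iterable of (origin, destination, flow) tuples.
    Returns an undirected adjacency map {node: set of neighbors}.
    """
    adj = {}
    for origin, dest, flow in weighted_flows:
        # Keep an edge for any positive flow; ignore self-loops.
        if flow > 0 and origin != dest:
            adj.setdefault(origin, set()).add(dest)
            adj.setdefault(dest, set()).add(origin)
    return adj


# Hypothetical toy records (illustrative values only).
flows = [
    ("Milano", "Bergamo", 1250.0),
    ("Bergamo", "Milano", 900.0),
    ("Milano", "Roma", 310.0),
    ("Roma", "Roma", 5000.0),
]
backbone = binarize(flows)
```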
The results using D_{c,p} are summarized in Fig. 10 (similar results are obtained with other ego-distances). Panels (a)-(b) clearly highlight the strong anomaly of the network structure during the lockdown period. Interestingly, although the nationwide measures (including the mobility block) had been in place since March 9, the plots show that it took a couple of weeks to reach the maximum deviation from the unperturbed situation (it was only on March 21 that it was decided to close all unnecessary businesses and industries), and that such a regime lasted for a rather short period. In fact, the return to the normal regime begins a couple of weeks before May 4, when intra-regional mobility was allowed again. After that date, the mobility network becomes once more very similar to that of the pre-COVID period. Incidentally, panel (b) reports, during the summer period, small alterations in mobility in the days close to Republic Day (June 2) and during the central weeks of August, the traditional holiday weeks.
Panels (c)-(d) show how the structural variation of the network is captured by the two ego-features included in D_{c,p}. Due to the interruption of many mobility corridors, the median of the clustering coefficient drops drastically (it remains at zero for many weeks), and the entire distribution is strongly compressed. Less evident is the impact on the egonet persistence, whose median does not vary much: the distribution, however, becomes wider during the weeks of lockdown, because a number of egonets become more cohesive due to the interruption of external connections (see Fig. 2).
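The boxplot summaries of panels (c)-(d) reduce each weekly feature distribution to its quartiles. A minimal sketch with the Python standard library is given below; the two samples are invented toy values (not taken from the mobility data), chosen only to mimic a distribution that collapses toward zero.

```python
from statistics import quantiles


def box_summary(values):
    """Return (25th percentile, median, 75th percentile) of a sample."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, median, q3


# Hypothetical clustering coefficients of five nodes in two different weeks.
weekly_clustering = {
    "ordinary week": [0.2, 0.3, 0.4, 0.5, 0.6],
    "lockdown week": [0.0, 0.0, 0.0, 0.5, 1.0],
}
summary = {week: box_summary(vals) for week, vals in weekly_clustering.items()}
```

Tracking these three numbers week by week gives the same qualitative picture as the boxplots: a median stuck at zero signals that most egonets have lost all their triangles.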

Concluding remarks
This work introduced a family of alignment-free network-to-network dissimilarity measures (EgoDist), based on the comparison of the distributions, on the network, of a few features that locally characterize the egonets: the degree, the clustering coefficient, and the egonet persistence. The dissimilarity between two graphs is defined as the distance between the corresponding distributions (one-dimensional or multi-dimensional). The ability of the proposed measures to discriminate networks with subtly different characteristics was evaluated by means of a standard experimental setup. Overall, EgoDist measures perform comparably to graphlet-based measures, with similar computational requirements.
The method has several possible generalizations. On the one hand, the extension to weighted and/or directed networks is conceptually immediate, since all the egonet features adopted here have their relevant generalizations [17,24]. On the other hand, the set of node features can be enlarged by using, for example, the distribution of any centrality indicator of nodes [25-27], instead of, or in addition to, the quantities used here (this, however, implies loosening the assumption of using only egonet features). Incidentally, this opens up the problem of finding the (minimum) set of indicators that achieves the "best" network classification.
It goes without saying that all of the above generalizations imply a significant increase in computational requirements. In this regard, to keep the method scalable to large networks one could exploit parallelization (the computation of egonet features can be completely parallelized) and also devise sampling techniques, i.e., compute the distributions of the egonet features on a sample of nodes rather than across the entire network. This requires the use of graph sampling techniques [28], whose effect on the performance of the proposed approach should be thoroughly evaluated.
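As an illustration of the sampling idea, the sketch below estimates the distribution of the clustering coefficient from a uniform random sample of nodes. Uniform node sampling is an assumption made here for simplicity; it is only one of many possible graph sampling schemes, and its effect on the distances is not assessed in this work. The function names are hypothetical.

```python
import random


def clustering(adj, v):
    """Clustering coefficient of node v; adj maps nodes to neighbor sets."""
    nb = adj[v]
    k = len(nb)
    if k < 2:
        return 0.0
    # Count the edges among the neighbors of v (each edge counted once).
    links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
    return 2 * links / (k * (k - 1))


def sampled_clustering(adj, sample_size, seed=0):
    """Clustering coefficients of a uniform random sample of nodes."""
    rng = random.Random(seed)
    nodes = rng.sample(sorted(adj), min(sample_size, len(adj)))
    return [clustering(adj, v) for v in nodes]


# Toy check on the complete graph with 6 nodes: every node has c = 1.
complete = {v: set(range(6)) - {v} for v in range(6)}
estimate = sampled_clustering(complete, sample_size=3)
```

Since each egonet feature is computed independently for each node, the same loop can also be split across workers, which is the parallelization mentioned above.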
The Matlab code of the function implementing the computation of the ego-distances is available at [URL].

Materials
Network models
Below we summarize the algorithms for generating the seven synthetic network models used in the evaluation experiments: to have comparable results, they are the same used in Refs. [11,12]. In all cases, the network is defined by the size N (number of nodes) and the density ρ = 2L/(N(N−1)), where L is the number of edges. Notice that the average degree m_avg = 2L/N can be expressed as m_avg = ρ(N−1).
ER (Erdős-Rényi model) Each node pair (i, j) is connected with probability ρ [25,29].
SFBA (Barabási-Albert scale-free preferential attachment model) We define η = ρN/2, which is approximately m_avg/2 for large N, and note that η assumes integer values for all the pairs (N, ρ) used in the article. We initialize the network with a clique (complete graph) of η + 1 nodes, then add one node at a time until we reach the prescribed size N. Each added node must connect its η edges to η target nodes, which are randomly selected with probability proportional to their degree in the current network (preferential attachment [30,31]).
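The two generators above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the experiments: function names are hypothetical, and in the SFBA case the η distinct targets are drawn one at a time from a list of edge endpoints, which is a common way to approximate degree-proportional selection.

```python
import random
from itertools import combinations


def density(n, n_edges):
    """Density rho = 2L / (N (N - 1))."""
    return 2 * n_edges / (n * (n - 1))


def erdos_renyi(n, rho, rng):
    """ER: each node pair is connected independently with probability rho."""
    return {(i, j) for i, j in combinations(range(n), 2) if rng.random() < rho}


def barabasi_albert(n, rho, rng):
    """SFBA: clique of eta + 1 nodes, then preferential attachment."""
    eta = round(rho * n / 2)
    edges = set(combinations(range(eta + 1), 2))
    # Each node appears in 'stubs' once per incident edge, so a uniform draw
    # from 'stubs' selects a node with probability proportional to its degree.
    stubs = [v for edge in edges for v in edge]
    for new in range(eta + 1, n):
        targets = set()
        while len(targets) < eta:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.add((t, new))
            stubs += [t, new]
    return edges


rng = random.Random(1)
er_edges = erdos_renyi(200, 0.1, rng)
ba_edges = barabasi_albert(100, 0.04, rng)  # eta = 2
```

With N = 100 and ρ = 0.04 the SFBA sketch starts from a triangle (η = 2) and adds two edges per new node, for a total of 3 + 97 × 2 = 197 edges, i.e., a density close to, but not exactly equal to, the nominal ρ.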
ERDD (ER degree distribution preserving model) An SFBA network is first created with the prescribed N, ρ (see above), then all edges are shuffled while preserving the individual degree of each node (degree-preserving randomization [25,32]).
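Degree-preserving randomization is commonly implemented through repeated double edge swaps, where two edges (a, b) and (c, d) are replaced by (a, d) and (c, b) whenever this creates neither self-loops nor duplicate edges. The sketch below follows this scheme; the number of swaps and the toy input graph are arbitrary choices made for illustration, not values taken from the paper.

```python
import random
from collections import Counter


def degree_sequence(edges):
    """Degree of each node, as a Counter {node: degree}."""
    return Counter(v for edge in edges for v in edge)


def degree_preserving_shuffle(edges, n_swaps, rng):
    """Randomize a simple undirected graph by double edge swaps."""
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # randomize the orientation of the second edge
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_set


# Toy input: a ring of 20 nodes with additional chords (all degrees equal 4).
ring = {tuple(sorted((i, (i + 1) % 20))) for i in range(20)}
chords = {tuple(sorted((i, (i + 5) % 20))) for i in range(20)}
original = ring | chords
shuffled = degree_preserving_shuffle(original, n_swaps=100, rng=random.Random(7))
```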
STICKY (stickiness-index based model) An SFBA network is first created with the prescribed N, ρ (see above), thus defining the degree sequence m_1, m_2, ..., m_N; then all edges are removed. Finally, each pair of nodes (i, j) is connected with probability m_i m_j / ∑_h m_h [33].
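A minimal sketch of the final step of the STICKY model is shown below. For simplicity it takes the degree sequence as input instead of extracting it from an SFBA network, and it caps the connection probability at 1, a safeguard that is our assumption and is not stated in the model description.

```python
import random
from itertools import combinations


def sticky(degrees, rng):
    """Connect each pair (i, j) with probability m_i * m_j / sum_h m_h."""
    total = sum(degrees)
    edges = set()
    for i, j in combinations(range(len(degrees)), 2):
        p = min(1.0, degrees[i] * degrees[j] / total)  # cap is an assumption
        if rng.random() < p:
            edges.add((i, j))
    return edges


# Hypothetical degree sequence: 200 nodes, all with degree 3.
degrees = [3] * 200
sticky_edges = sticky(degrees, random.Random(3))
```

By construction the expected degree of node i is close to m_i, so the expected number of edges is close to half the sum of the prescribed degrees (about 300 in this toy case).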
SFGD (scale-free gene duplication and divergence model) We initialize the network with a seed of 2 connected nodes, then add one node at a time until we reach the prescribed size N. For each node i to be added, an existing node j is selected uniformly at random, and i is connected to all neighbors of j. Furthermore, the pair (i, j) is connected with probability 0.5. Then, we consider all nodes h that are common neighbors of i and j and, with probability q, remove either the edge (h, i) or (h, j) (with random selection). The value of q is iteratively adjusted to reach the prescribed density ρ (on average over the 10 network replications) [34].
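The duplication and divergence steps can be sketched as follows for a given value of q. The outer loop that tunes q to match the prescribed density is not included, and the function name is hypothetical.

```python
import random


def sfgd(n, q, rng):
    """Gene duplication and divergence for a fixed q; returns adjacency sets."""
    adj = {0: {1}, 1: {0}}  # seed: two connected nodes
    for i in range(2, n):
        j = rng.randrange(i)  # existing node selected uniformly at random
        common = sorted(adj[j])  # these become the common neighbors of i and j
        # Duplication: i is connected to all neighbors of j.
        adj[i] = set(common)
        for h in common:
            adj[h].add(i)
        # The pair (i, j) is connected with probability 0.5.
        if rng.random() < 0.5:
            adj[i].add(j)
            adj[j].add(i)
        # Divergence: with probability q, drop (h, i) or (h, j) at random.
        for h in common:
            if rng.random() < q:
                loser = i if rng.random() < 0.5 else j
                adj[h].discard(loser)
                adj[loser].discard(h)
    return adj


network = sfgd(300, q=0.6, rng=random.Random(11))
n_edges = sum(len(nb) for nb in network.values()) // 2
```

Larger values of q remove more of the duplicated edges and therefore give sparser networks, which is what makes q usable as a tuning knob for the density.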
GEO (geometric random graph model) The N nodes are thought of as points in the unit cube, whose 3D coordinates are selected uniformly at random. Then the nodes (i, j) are connected if and only if their Euclidean distance is smaller than a given r, the value of which is iteratively adjusted to reach the prescribed density ρ (on average over the 10 network replications) [35].
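A possible way to adjust r is a bisection on the number of edges, as in the sketch below. Note the simplification: here r is tuned on a single set of points, whereas the procedure described above matches the density on average over the 10 replications. Function names are hypothetical.

```python
import random
from itertools import combinations
from math import dist


def geometric_graph(points, r):
    """Connect two points iff their Euclidean distance is smaller than r."""
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) < r
    }


def radius_for_density(points, rho, n_iter=50):
    """Bisection on r so that the graph reaches (at least) density rho."""
    n = len(points)
    target = rho * n * (n - 1) / 2  # prescribed number of edges
    lo, hi = 0.0, 3 ** 0.5  # hi is the diagonal of the unit cube
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        if len(geometric_graph(points, mid)) < target:
            lo = mid
        else:
            hi = mid
    return hi


rng = random.Random(5)
points = [tuple(rng.random() for _ in range(3)) for _ in range(100)]
r = radius_for_density(points, rho=0.05)
geo_edges = geometric_graph(points, r)
geo_density = 2 * len(geo_edges) / (100 * 99)
```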
GEOGD (geometric model with gene duplication) The N nodes are thought of as points in the unit cube. Given a prescribed r > 0, we initialize the set of nodes with 2 nodes at a much shorter Euclidean distance than r, then we add one node at a time until we reach the prescribed size N. For each node i to be added, an existing node j is selected uniformly at random, and i is placed in the unit cube at a random position within distance 2r from j. After all N nodes have been placed, each pair (i, j) is connected by an edge if and only if the two nodes are at a Euclidean distance smaller than r, the value of which is iteratively adjusted to reach the prescribed density ρ (on average over the 10 network replications) [36].
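The node placement of GEOGD can be sketched with rejection sampling, as below. Several details are assumptions made for illustration: the two seed nodes are placed at distance r/100, the new position is drawn uniformly in the ball of radius 2r around the selected node, and candidates falling outside the unit cube are discarded. The tuning of r is not included.

```python
import random
from math import dist


def geogd_points(n, r, rng):
    """Place n points in the unit cube by repeated local duplication."""
    first = tuple(rng.random() for _ in range(3))
    # Seed: a second node much closer than r (assumed offset r/100 along x).
    step = r / 100
    x = first[0] + step if first[0] + step <= 1 else first[0] - step
    points = [first, (x, first[1], first[2])]
    while len(points) < n:
        parent = rng.choice(points)  # existing node, uniformly at random
        candidate = tuple(c + rng.uniform(-2 * r, 2 * r) for c in parent)
        inside_cube = all(0 <= c <= 1 for c in candidate)
        # Keep the candidate only if it lies in the cube and within 2r.
        if inside_cube and dist(candidate, parent) <= 2 * r:
            points.append(candidate)
    return points


radius = 0.2
placed = geogd_points(50, radius, random.Random(9))
# Edges are then defined as in the GEO model: distance smaller than r.
geogd_edges = {
    (i, j)
    for i in range(len(placed))
    for j in range(i + 1, len(placed))
    if dist(placed[i], placed[j]) < radius
}
```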

Measuring computation time
The computation times shown in Fig. 8 were obtained on a desktop PC with Intel i7 CPU at 2.90 GHz using Matlab R2021b. To limit possible confounding factors, the times reported refer only to the computation of the distances between the 2415 network pairs, as described in section Computational requirements, i.e., all data loading and organization are ignored. For the EgoDist measurement we used the code available at https://piccardi.faculty.polimi.it/highlights.html, which implements all the distances proposed in this paper. For the Graphlet Correlation Distance GCD11 we used the MNA Matlab interface for ORCA [21], available at https://github.com/muellsen/MNA/tree/master/GraphletComputation.
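The timing protocol (only the pairwise distance computations are timed, with data preparation excluded) can be sketched as follows. The sketch is in Python rather than Matlab, and the placeholder distance and the integer "networks" are purely hypothetical stand-ins for the actual ego-distance code and data.

```python
import time
from itertools import combinations


def time_pairwise(networks, distance):
    """Time the computation of the distance over all unordered pairs."""
    start = time.perf_counter()
    values = [distance(a, b) for a, b in combinations(networks, 2)]
    elapsed = time.perf_counter() - start
    return values, elapsed


# Placeholder data: 70 "networks" (7 models x 10 replications), represented
# here by plain numbers, and a trivial stand-in for the distance function.
networks = list(range(70))
values, elapsed = time_pairwise(networks, lambda a, b: abs(a - b))
```

With 70 networks the loop evaluates 70 × 69/2 = 2415 distances, which is the number of pairs behind each point of Fig. 8.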

Figure 3. Scatter plots of clustering coefficient and egonet persistence for the three networks of Fig. 1. The x,y-coordinates of each dot are the features of a network node.

Figure 4. Scatter plots of normalized degree, clustering coefficient, and egonet persistence for the three networks of Fig. 1. The x,y,z-coordinates of each point (colored dots) are the egonet features of a network node. The gray dots are the projection onto the horizontal plane and are shown for readability only.
j, l) is the cdf. In the following, we will use the term ego-distances for the graph-to-graph measures introduced in this section and based on the statistics of the egonet indicators: D_d, D_c, D_p, D_sum (1D distances), D_{c,p}, D_{d,c}, D_{d,p} (2D), and D_{d,c,p} (3D).

Figure 5. Distance matrix D_{d,c,p} between the networks ER, GEO, and SFBA (3 models × 3 sizes × 3 densities × 10 replications = 270 networks): the different models are clearly separated, regardless of the different combinations of size/density.

Figure 7. Precision/Recall curves obtained by clustering the complete set of networks (7 models × 3 sizes × 3 densities × 10 replications = 630 networks) with five ego-distances defined in section Methods (D_sum, D_{c,p}, D_{d,c}, D_{d,p}, D_{d,c,p}) with cap value T = 0.5, and with the Graphlet Correlation Distance GCD11 (for readability, we omit the curves of D_d, D_c, D_p). For each Recall/Precision point, the F1 value is specified by the dotted contour line. Left: the performance is assessed by mixing sizes/densities. Right: only networks with the same size/density are compared.

Figure 9. Top row: Dendrograms of the hierarchical cluster analysis of the 37 European airline networks, using distances D_c (left) and D_d (right) with average linkage. Bottom row: Five examples of networks from five distinct European airlines.

Figure 8. Computation time as a function of the network size N at fixed density ρ (first row), and as a function of ρ at fixed N (second row), for five ego-distances defined in section Methods (D_sum, D_{c,p}, D_{d,c}, D_{d,p}, D_{d,c,p}, with Δ = 0.01, T = 1) and for the GCD11 distance. Each point is the aggregated time of computing the 70 × 69/2 = 2415 distances among all the possible pairs of the 70 networks having the prescribed size and density (7 network models × 10 replications).

Figure 10. (a) The D_{c,p} distance matrix among the 32 Italian weekly mobility networks during the COVID-19 lockdown period in 2020: the date on the axes is the first day of the corresponding week. (b) The first row of the distance matrix D_{c,p}, i.e., the distance of each mobility network from the week 1 network. (c)-(d) Boxplots of the distribution of the clustering coefficient (c) and egonet persistence (d). The boxes range from the 25th to the 75th percentile, the point being the median value.
In the egonet E_i, the degree m_i depends only on the connectivity of node i (left); the clustering coefficient c_i describes the connectivity between the neighbors of i (center); the egonet persistence p_i captures the balance between internal and external connectivity of E_i (right), as it quantifies the proportion of edges that the nodes of the egonet direct into the egonet itself.
. The best-ranked distances are highlighted in bold italic.
pair if they are generated by the same model, an actual negative pair otherwise. To be effective for correct classification, the distance between two networks generated by the same model should be much smaller than the distance between two networks originating from different models. Given a distance D and a threshold ε > 0, a network pair is a predicted positive sample if D < ε, a predicted negative sample otherwise. Then Precision and Recall are given, for each ε, by P_ε = tp/(tp + fp) and R_ε = tp/(tp + fn), where tp, fp, and fn are the numbers of true positives, false positives, and false negatives, respectively.
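The computation of one point of the Precision/Recall curve can be sketched as follows. The toy distances and labels are invented for illustration; returning a precision of 1 when no pair is predicted positive is a convention assumed here.

```python
def precision_recall(distances, same_model, eps):
    """Precision and Recall of the rule 'positive if D < eps'.

    distances[k] is the distance of the k-th network pair, and same_model[k]
    is True if the two networks were generated by the same model.
    """
    pairs = list(zip(distances, same_model))
    tp = sum(1 for d, same in pairs if d < eps and same)
    fp = sum(1 for d, same in pairs if d < eps and not same)
    fn = sum(1 for d, same in pairs if d >= eps and same)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example with four network pairs.
toy_distances = [0.1, 0.2, 0.6, 0.7]
toy_same_model = [True, True, False, True]
strict = precision_recall(toy_distances, toy_same_model, eps=0.5)
loose = precision_recall(toy_distances, toy_same_model, eps=0.65)
```

Sweeping ε over the range of observed distances and collecting the resulting (R_ε, P_ε) points traces the whole curve: raising the threshold can only increase Recall, usually at the expense of Precision.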

Table 2. AUPR (Area Under the Precision/Recall curve) value for the classification of all network models, for the distances defined in the section Methods and for the Graphlet Correlation Distance GCD11 [11]. The best-ranked distances are highlighted in bold italic.