Comparison of large networks with sub-sampling strategies

Networks are routinely used to represent large data sets, making the comparison of networks a tantalizing research question in many areas. Techniques for such analyses range from simply comparing network summary statistics to sophisticated but computationally expensive alignment-based approaches. Most existing methods either do not generalize well to different types of networks or do not provide a quantitative similarity score between networks. In contrast, alignment-free, topology-based network similarity scores allow us to analyse large sets of networks of different types and sizes. Netdis is one such score; it defines network similarity through the counts of small sub-graphs in the local neighbourhood of each node. Here, we introduce a sub-sampling procedure based on neighbourhoods, which links naturally with the framework of network comparison through local neighbourhood comparisons. Our theoretical arguments justify basing the Netdis statistic on a sample of similar-sized neighbourhoods. Our tests on empirical and synthetic data sets indicate that often only 10% of the neighbourhoods of a network suffice for optimal performance, leading to a drastic reduction in computational requirements. The sampling procedure is applicable even when only a small sample of the network is known, and thus provides a novel tool for network comparison of very large and potentially incomplete data sets.

Figure: Netdis performance as measured by the $k_C$-NN score under sub-sampling for 25 independent copies of synthetic networks data set 1 with 2160 nodes.

The variations in the results between the data sets in Figures 1 and 2 come mostly from the mixing between the configuration model and the Chung-Lu model. This mixing is not surprising, given that these two models converge for large networks. Another source of variation is the duplication divergence model, which in certain realizations produces atypical networks. Again, this is not surprising, since the duplication divergence model is known to be quite unstable [10]. For instance, the same parameter values can produce networks that differ significantly with respect to the number of edges. The results given in Figures 3, 4 and 5 show that the qualitative behaviour of Netdis under sub-sampling is the same for all independent data sets we considered.
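The closeness of the two null models mentioned above is easy to illustrate: both are driven by a degree sequence, which the configuration model realizes exactly and the Chung-Lu model only in expectation. A minimal sketch in networkx, using an illustrative degree sequence of our own choosing (not from the paper):

```python
import networkx as nx

# Illustrative degree sequence (not from the paper); its sum must be even
# for the configuration model to be well defined.
degrees = [3, 3, 2, 2, 2, 1, 1, 1, 1, 2]

# Configuration model: the degree sequence is realized exactly (as a multigraph).
cm = nx.configuration_model(degrees, seed=42)

# Chung-Lu model: the same numbers act only as *expected* degrees.
cl = nx.expected_degree_graph(degrees, seed=42, selfloops=False)

print(sorted(d for _, d in cm.degree()))  # exactly the prescribed degrees
print(sorted(d for _, d in cl.degree()))  # matches only on average
```

For large networks the realized Chung-Lu degrees concentrate around their expectations, which is why the two models become hard to distinguish.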
Results with Erdős-Rényi background

Figures 6 and 7 show the results obtained using an Erdős-Rényi random graph with 5,000 nodes and 50,000 edges as the gold standard in Netdis.
Although we observe slight variations in the starting points of the plots when compared to using the DIP-core yeast interaction network as the gold standard, the results show that the behaviour of Netdis under sub-sampling is not significantly affected by the choice of gold-standard network. The most notable of these variations occurs for the protein interaction data set, where the initial nearest-neighbour score (i.e., the score without sampling) is 0.6, compared to 1.0 when the DIP-core network was used as the gold standard. Despite this variation, we still observe that the performance of Netdis degrades significantly only below a 10% sampling probability when compared to the performance without sub-sampling.
In the case of data sets containing large networks, using an Erdős-Rényi random graph as the gold standard decreases the overall $1$-NN and $k_C$-NN scores for small sample sizes compared to the DIP-core gold standard, but one still obtains a strong signal even when as few as 10 ego-networks are sampled.
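The gold-standard setup just described, together with the sampling of a handful of 2-step ego-networks, can be sketched as follows; the graph sizes match those quoted above, while the function name and seeds are our own illustrative choices:

```python
import random
import networkx as nx

# Erdős-Rényi gold standard with 5,000 nodes and 50,000 edges, as in the text.
gold = nx.gnm_random_graph(5000, 50000, seed=1)

def sample_two_step_egos(G, n_samples, seed=0):
    """Sample n_samples centre nodes uniformly at random and return
    their 2-step ego-networks."""
    rng = random.Random(seed)
    centres = rng.sample(list(G.nodes()), n_samples)
    return [nx.ego_graph(G, c, radius=2) for c in centres]

# Per the text, as few as 10 sampled ego-networks still carry a strong signal.
egos = sample_two_step_egos(gold, 10)
print(len(egos), sum(e.number_of_nodes() for e in egos) // len(egos))
```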

An example where sub-sampling fails
Here we consider a data set that consists of 5 Erdős-Rényi random graphs on 10,000 nodes with 15,000 edges, together with 5 Erdős-Rényi random graphs on 10,000 nodes with 15,000 edges to each of which a disconnected complete graph of size 30 is added. For this data set, the nearest-neighbour scores start to deteriorate much more quickly than for the other data sets, and the signal is almost completely absent even for sampling probabilities as high as 0.5. Plots for the $1$-NN and $k_C$-NN scores are given in Figure 8.
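This data set can be reconstructed in networkx as follows (a sketch; seeds and variable names are ours):

```python
import networkx as nx

# Five plain Erdős-Rényi graphs: 10,000 nodes, 15,000 edges each.
plain = [nx.gnm_random_graph(10000, 15000, seed=s) for s in range(5)]

with_clique = []
for s in range(5, 10):
    g = nx.gnm_random_graph(10000, 15000, seed=s)
    # Append a disconnected complete graph K_30 (adds 30 nodes and 435 edges).
    g = nx.disjoint_union(g, nx.complete_graph(30))
    with_clique.append(g)

print(with_clique[0].number_of_nodes(), with_clique[0].number_of_edges())
```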

2-step ego-network size distributions
In order to check that the observed performance of Netdis under sub-sampling is not due to 2-step ego-networks typically covering most of the network, we computed the size distributions of 2-step ego-networks. Figure 9 shows the size distribution for the 50,000-node synthetic networks data set. We observe that even for networks with scale-free degree distributions, typical 2-step ego-networks do not cover most of the network. Similar results hold for the protein interaction networks, where typical 2-step ego-networks span less than 5% of the network.

Figure 8: Netdis performance under sub-sampling measured by the nearest-neighbour scores $1$-NN and $k_C$-NN for a data set consisting of 5 Erdős-Rényi random graphs on 10,000 nodes with 15,000 edges and 5 Erdős-Rényi random graphs on 10,000 nodes with 15,000 edges to which a disconnected complete graph of size 30 is added. The dashed red lines correspond to the average nearest-neighbour scores over a sample of 50 random distance matrices. Here, the assumptions of Proposition 1 are strongly violated and the performance of Netdis under sub-sampling is poor. This is because the 2-step ego-networks in the disconnected complete graph all have size 30, are identical and intersect completely, and are of a different size from those typically found in the Erdős-Rényi random graphs.

Proof of Theorem 0.1

Here we provide a proof of Theorem 0.1. Recall the setting: the neighbourhood of dependence for the random variable $X_i$ is given by $S_i = \{j : i \sim j, j \neq i\}$, and we let $\gamma_i = |S_i|$, $i = 1, \ldots, N$, denote their sizes. Our sampling procedure draws counts $(k_1, \ldots, k_n)$ from the multinomial distribution $\mathrm{Mult}(\kappa; \frac{1}{n}, \ldots, \frac{1}{n})$: we throw $\kappa$ balls into $n$ boxes independently, with each box being equally likely. If index $i$ is chosen, then we sample the whole dependency neighbourhood of $X_i$. We study the empirical measure $\xi_n$ with weights $w_j = \frac{n}{\kappa} \sum_{t \sim j} \frac{k_t}{\gamma_t}$.
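The sampling weights can be simulated directly. The sketch below, assuming the weight formula $w_j = \frac{n}{\kappa} \sum_{t \sim j} k_t/\gamma_t$ with $\gamma_t$ the neighbourhood size as above, checks numerically that $\mathbb{E}w_j = \sum_{t \sim j} 1/\gamma_t$, the quantity denoted $A(S_j)$ later in the proof; the graph and parameter values are purely illustrative:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

# Illustrative graph; gamma_t is the size of the dependency neighbourhood
# of t, here taken to be the set of graph neighbours of t.
G = nx.erdos_renyi_graph(50, 0.2, seed=3)
n = G.number_of_nodes()
A = nx.to_numpy_array(G)                 # adjacency matrix, A[t, j] = 1 iff t ~ j
gamma = np.maximum(A.sum(axis=1), 1.0)   # neighbourhood sizes (guarded against 0)
kappa = 200                              # number of balls thrown

# Throw kappa balls into n equally likely boxes, repeated many times.
reps = 20000
K = rng.multinomial(kappa, np.full(n, 1.0 / n), size=reps)

# w_j = (n / kappa) * sum_{t ~ j} k_t / gamma_t, vectorised over repetitions.
W = (n / kappa) * (K / gamma) @ A

# E[w_j] should equal A(S_j) = sum_{t ~ j} 1 / gamma_t.
expected = (1.0 / gamma) @ A
print(np.max(np.abs(W.mean(axis=0) - expected)))
```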
We compare this measure to a centred Gaussian random measure $G_{\mathrm{mult,dep,samp}}$ with the corresponding covariance matrix, which we derive in Lemma 0.2. We quantify the distance between the random measure $\xi_n$ and the limiting Gaussian random measure in terms of cylinder-type functions $H$ which take measures as input and are of the form $H(\mu) = f(\langle \mu, \phi_1 \rangle, \ldots, \langle \mu, \phi_m \rangle)$. For $\phi \in C_b^\infty(\mathbb{R})$, the set of infinitely differentiable real-valued functions with bounded derivatives of all orders, let $\|\phi\| = \sup_x |\phi(x)|$ and $\Delta\phi = \sup_{x, y} |\phi(x) - \phi(y)|$. Define the set $\mathcal{C} \subset C_b^\infty(\mathbb{R})$ of test functions, and let $\mathcal{F}$ be the set of cylinder functions $H$ as above with $\|f^{(j)}\| \le 1$, $\|f^{(i,j)}\| \le 1$ and $\|f^{(i,j,k)}\| \le 1$, and $\phi_i \in \mathcal{C}$ for $i = 1, \ldots, m$. Here $f^{(j)}$ is the partial derivative of $f$ in direction $x_j$, and similarly $f^{(i,j)}$ and $f^{(i,j,k)}$ denote higher partial derivatives. It is shown in [11] that this class of functions is convergence-determining for vague convergence. We shall prove the following.

Proposition 0.1
In the above bootstrap procedure, for all $H \in \mathcal{F}$, the difference $|\mathbb{E} H(\xi_n) - \mathbb{E} H(G_{\mathrm{mult,dep,samp}})|$ admits an explicit bound.

Before proving Proposition 0.1 we derive the covariance structure of our bootstrap procedure.

Lemma 0.2
For our bootstrap procedure, the covariances $\mathrm{Cov}(w_i - 1, w_j - 1)$ admit an explicit form in terms of the neighbourhood sizes.

Proof. First note that $\mathrm{Cov}(w_i - 1, w_j - 1) = \mathrm{Cov}(w_i, w_j)$ by linearity. To calculate this covariance, we view the multinomial vector $k = (k_1, \ldots, k_n)$ as resulting from $\kappa$ independent ball tosses into $n$ urns, where each urn has probability $\frac{1}{n}$ of being hit. Thus, writing $T(i) = j$ if ball $i$ lands in urn $j$, the collection $(\mathbf{1}(T(i) = j))_{i = 1, \ldots, \kappa}$ consists of independent Bernoulli($\frac{1}{n}$) variables, and $k_j = \sum_{i=1}^{\kappa} \mathbf{1}(T(i) = j)$. Here $\mathbf{1}(\cdot)$ is the indicator function, which equals 1 if the event in its argument is true, and 0 otherwise. Hence $\mathbb{E} k_i = \frac{\kappa}{n}$, $\mathrm{Var}(k_i) = \frac{\kappa}{n}\left(1 - \frac{1}{n}\right)$ and $\mathrm{Cov}(k_i, k_j) = -\frac{\kappa}{n^2}$ for $i \neq j$. We use these results to calculate the covariance: first we compute $\mathrm{Var}(w_i)$, and then, by a similar argument, the variance of the corresponding sum over any subset $U \subset \{1, \ldots, n\}$ as well as, for disjoint $U$ and $V$, the covariance between the corresponding sums. Combining the cases $j = i$ and $j \neq i$ finishes the proof. Q.E.D.
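The moment identities used in this proof, $\mathbb{E}k_i = \kappa/n$, $\mathrm{Var}(k_i) = \frac{\kappa}{n}(1 - \frac{1}{n})$ and $\mathrm{Cov}(k_i, k_j) = -\kappa/n^2$ for $i \neq j$, can be checked by simulation; the parameter values below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, kappa, reps = 10, 60, 200000

# kappa independent ball tosses into n equally likely urns, repeated many times.
K = rng.multinomial(kappa, np.full(n, 1.0 / n), size=reps)

mean = K.mean(axis=0)                   # theory: kappa / n = 6.0
var = K.var(axis=0)                     # theory: (kappa / n) * (1 - 1/n) = 5.4
cov01 = np.cov(K[:, 0], K[:, 1])[0, 1]  # theory: -kappa / n**2 = -0.6

print(mean[0], var[0], cov01)
```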
The proof of Proposition 0.1 is based on Stein's method for empirical measures; see [11]. We equip the space $M_f(\mathbb{R})$ of real-valued bounded Radon measures on $\mathbb{R}$ with the topology of vague convergence, as follows. Let $C_c(\mathbb{R})$ be the space of real-valued continuous functions on $\mathbb{R}$ with support contained in a compact set. Let $(\nu_n)_n$ be a family of measures in $M_f(\mathbb{R})$, and let $\nu$ be a measure in $M_f(\mathbb{R})$. We say that $\nu_n$ converges vaguely to $\nu$, in short $\nu_n \overset{v}{\Rightarrow} \nu$, if and only if for all functions $f \in C_c(\mathbb{R})$ we have $\langle \nu_n, f \rangle \to \langle \nu, f \rangle$ as $n \to \infty$. Here we use the notation $\langle \nu, f \rangle = \int f \, \mathrm{d}\nu$. To describe a Gaussian random measure, we assume that $b$ is a quadratic form on test functions such that, for any $m \in \mathbb{N}$ and for all $\phi_1, \ldots, \phi_m$, the matrix $B = (b(\phi_i, \phi_j))_{i,j=1}^m$ is a valid covariance matrix.
Let $\zeta$ be a random measure taking values in the space of finite signed measures $M_f(\mathbb{R})$ almost surely such that, for all $m \in \mathbb{N}$ and for all $\phi_1, \ldots, \phi_m$, $(\langle \zeta, \phi_1 \rangle, \ldots, \langle \zeta, \phi_m \rangle) \sim \mathrm{MVN}_m(0, B)$,
where $\mathrm{MVN}_m(0, B)$ denotes the multivariate normal law with mean vector 0 and covariance matrix $B$. Then $\zeta$ is a Gaussian random measure. Moreover, for $H \in \mathcal{F}$ of the form (5), a so-called Stein equation (6) corresponding to the Gaussian random measure $\zeta$ can be set up. The equation can be solved using a semigroup technique as in [12] (see also [13]): for each $H \in \mathcal{F}$ of the form (5), there is a function $F \in \mathcal{F}$ solving (6). To prove a Gaussian approximation, we may employ the following result. Let $H$ be of the form (5) and let $F$ be the solution of the Stein equation (6); write $F$ in the form (1). Then, as suggested in Proposition 0.3, we bound the quantity of interest, where the covariance operator is as in Lemma 0.2.

We start with the term $\mathbb{E} \sum_{j=1}^m f^{(j)}(\langle \xi_n, \phi \rangle) \langle \xi_n, \phi_j \rangle$; writing out $\xi_n$, we obtain an expansion in the weights. Recall that $\mathbb{E} w_j = A(S_j)$; we expand the right-hand side accordingly. Using the definition of $F$ we can bound the resulting expression. With $T(i) = j$ if ball $i$ lands in urn $j$ in our sub-sampling procedure, we expand further. Now we exploit that $\mathbf{1}(T(l) = t)$ and $w_j$ are typically only weakly dependent. To this purpose, let $w_j^{(l)}$ denote the weights re-calculated without ball $l$, and define $\xi_n^{(l)}$ similarly; note that the difference $\xi_n - \xi_n^{(l)}$ depends only on the random quantity $T(l)$ and not on any of the other balls. A Taylor expansion, with remainder evaluated at an intermediate point determined by some $0 < \theta < 1$, yields a remainder term $R_2$, which we bound later. By independence of the ball tosses, the leading terms factorize. Assembling the argument shown so far gives the main bound together with remainder terms $R_2$ and $R_3$. For $R_2$ we hence obtain a bound from the above estimates. For $R_3$, using Taylor expansion and the bounds on $\phi$ and $w$ as well as on the third partial derivatives of $f$, we obtain a bound involving the quantities $A(S_k)$.
Collecting the bounds yields the assertion. Q.E.D.