A novel method for assessing and measuring homophily in networks through second-order statistics

We present a new method for assessing and measuring homophily in networks whose nodes have categorical attributes, namely when the nodes of networks come partitioned into classes (colors). We probe this method in two different classes of networks: (i) protein–protein interaction (PPI) networks, where nodes correspond to proteins, partitioned according to their functional role, and edges represent functional interactions between proteins (ii) Pokec on-line social network, where nodes correspond to users, partitioned according to their age, and edges respresent friendship between users.Similarly to other classical and well consolidated approaches, our method compares the relative edge density of the subgraphs induced by each class with the corresponding expected relative edge density under a null model. The novelty of our approach consists in prescribing an endogenous null model, namely, the sample space of the null model is built on the input network itself. This allows us to give exact explicit expression for the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathbf {z}}$$\end{document}z-score of the relative edge density of each class as well as other related statistics. The \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathbf {z}}$$\end{document}z-scores directly quantify the statistical significance of the observed homophily via Čebyšëv inequality. The expression of each \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathbf {z}}$$\end{document}z-score is entered by the network structure through basic combinatorial invariant such as the number of subgraphs with two spanning edges. Each \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathbf {z}}$$\end{document}z-score is computed in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n+m)$$\end{document}O(n+m) time for a network with n nodes and m edges. This leads to an overall efficient computational method for assesing homophily. We complement the analysis of homophily/heterophily by considering \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathbf {z}}$$\end{document}z-scores of the number of isolated nodes in the subgraphs induced by each class, that are computed in O(nm) time. Theoretical results are then exploited to show that, as expected, both the analyzed network classes are significantly homophilic with respect to the considered node properties.


Introduction
The homophily principle states that "similarity breeds connections" [15].This principle-born in sociologyonce declined into Network Theory, reads as nodes in a network are more likely to be linked to nodes sharing similar attributes.The effectiveness of homophily in social networks has been extensively demonstrated across various instances [1,2,11,12,16,24]: social networks exhibit homophily with respect to attributes such as gender, age, ethnicity, occupation, social class and many others.This simply means that people preferentially interact with people sharing the same cultural and sociological attributes.Putting it succintely: "birds of a feather flock together" [15,5].In contrast, heterophilic networks are those networks whose nodes preferentially interact with nodes having different attributes values.Homophily can also be seen as the categorical counterpart of "assortative mixing"-the correlation of attributes across linkand, as such, at least beyond a certain amount of assortativity, it binds the structure of networks [17], and influences the curvature of the cumulative degree distribution under the preferential attachment evolutionary mechanism [10].In view of this discussion, homophily qualifies as a genuine network property, namely, a property that when possessed to some extent, impacts non trivially on the structure of the network.A quantitative understanding of homophily in networks is therefore useful both from a theoretical and a practical point of view.A step forward in this direction is taken once we realize that homophily in networks certainly fits in the frame of "community detection" [14,7]; observe that communities in complex networks identify high order homogeneous structures.Arguing as in [30], network community detection can be seen as a procedure consisting of two stages: one stage consists of extracting communities by relying on the geometric structure of the networks, while the second stage consists in "abstracting" communities, namely, in identifying the common features and attributes of community members.Such common features and attributes are usually referred to as functions or node characteristic [20].From this perspective, communities are first detected based on their geometry and then evaluated based on their functions.The other way round is also meaningful: given a functional description of a network, namely a partition of its nodes into sets of nodes with the same node characteristic, assessing whether or not the class of the partition have a certain amount of geometric structure, i.e. assessing whether or not such classes are communities, is tantamount to assessing whether or not the network is homophilic with respect to node characteristic.Innocent as it may seem, this observation already provides a way of quantifying homophily: if the functional description correlated with the geometry of the network or, equivalently, if the network were homophilic with respect to node characteristics, then the node-induced subgraphs of each class of the partition should be relatively denser than what we expect under some suitable null model-the relative edge density of a subgraph H of a given graph G is the density of H over the density of G. Newman's celebrated modularity index [18] formulates the null hypothesis as the relative expected density of a random graph with the same degree distribution as the input graph.Modularity is thus the relative edge density of monochromatic subgraphs minus the expected relative edge density of monochromatic subgraphs under the null hypothesis that edges are distributed at random among nodes.Since the index lies in interval [− 1  2 , 1], its value directly quantifies network homophily: the larger the index the more homophilic the network is.Modularity thus provides a scale for comparing homophily of different networks.Notice that Newman's index resorts to an exogenous model (the configuration model ) to test homophily.In this paper, by revising the idea in [20], we propose to measure network homophily by testing the observed structure (i.e. the relative edge density of functional classes) against the expected structure under an endogeneous random model: the input graph itself will be the sample space for the null hypothesis.With this aim in mind, in this paper we propose a new statistical model that builds on the approach in [20] (developed for networks with only two functional classes), extend it to an arbitrary number of classes, and strengthen it by exploiting second order statistics based on a uniformly random coloring of the input network with the same color distribution.This machinery yields an explicit exact formula for the z-score of a suitable defined homophily index as well as of the number of isolated nodes of each functional class.The statistical significance of the observed homophily is then obtained through Čebyšëv inequality.As one may expect, the structure of the network enters second order statistics through the number of its subgraphs with two spanning edges, namely, the number of its P 3 's (if the two edges are adjacent) and the number its 2K 2 's (if the two edges are not adjacent).This means that our analysis does not require exogenous models (random graphs, for instance) to make comparisons for assessing homophily.Throughout the rest of the paper P 3 is the graph on three nodes joined by two edges, namely the graph • • • , while 2K 2 is the graph on four nodes with two edges without common endpoints, namely, the graph • • • • .
In line with the work of [20], we probe our theoretical results on two different network classes: i) Protein-Protein Interaction (PPI) networks, where nodes correspond to proteins, partitioned according to their functional role, and edges represent functional interactions between proteins ii) on-line social network where nodes correspond to users, partitioned according to their age, and edges represent friendship between users.As expected, numerical results provide strong evidence of the homophilic nature of the considered networks with respect to the corresponding node properties: protein function for PPI and age class for social network.

Homophily in networks
As sketched in Section 1, we look at homophily as a network parameter (actually as an array of parameters, see Section 3) measuring to what extent the attributes (node characteristics, functions) of the nodes of the networks correlate across the edges.To give a precise meaning to such a correlation, we follow the approach in [20] which we now discuss in more details.Of course, nothing bad is happening if we think of node characteristics as node colors and, consistently, of the functional description as a partition of the node set into color classes so that (potential) communities are sets of nodes with the same color.Consequently, we deal with a simple undirected graph G with n nodes and m edges whose nodes are partitioned into a number s of color classes.The simple original model in [20] refers to the case of two colors (s = 2) denoted by 0 and 1. Edges of G are then classified as (0, 0)-edges, (0, 1)-edges and (1, 1)edges according to the color at their endpoints.Let c 0 (resp., c 1 ) be the number of nodes of G having color 0 (resp., 1), with c 0 + c 1 = n; furthermore, let m i,j be the number of (i, j)-edges, i, j ∈ {0, 1}, with m 0,0 + m 0,1 + m 1,1 = m: if the functional definition of the communities correlated with the structure of G, then we should expect a statistical significant deviation between m 0,0 , say, and what we would expect if characteristic 0 were randomly distributed among the nodes of the graph, namely, if any node had an equal chance of possessing it.In [20], it is proposed to measure this deviation by the three ratios: where, for i, j ∈ {0, 1} and i = j , are the expected number of (i, i)-edges and (i, j)-edges, respectively, under the hypothesis that properties 0 and 1 are randomly distributed over the node set of G (see Section 2.2 for proofs).Just by rewriting ω i and η i,j as and one sees that ω i is nothing but the normalized intracommunity density; analogously, η i,j is the normalized intercommunity density [30].In this perspective, homophily (and heterophily) provides a suggestive interpretation of basic structural graph properties (those that can be captured by first order moments of functions of random partitions into two classes with c 0 nodes labeled 0 and c 1 nodes labeled 1).In this simple model, graph G is declared i-homophilic (or homophilic with respect to property i), i ∈ {0, 1}, if mi,j > 1 (we ask the reader to bear the pedantic reference to the indices i, j in view of the generalization to more than 2 properties).Without any other clue about the likelihood or the variability of ω i and η i,j , it is clear that both the assertions have no statistical significance behind their descriptive power.Moreover, it follows from (1) that ω i lies in the interval [0, 1/ρ(G)], ρ(G) being the edge density of G and such an interval might be really wide for sparse graphs.To overcome this limitation, [20] developed a computational model (feasible only for the case of two colors) aimed at evaluating the likelihood of an observed instance (ω 0 , η 0,1 ) in the form of a phase diagram in the m 0,0 m 0,1 -plane.Each point of such a diagram is the frequency of all partitions of the node set G into two parts C 0 and C 1 with c 0 and c 1 nodes, respectively, such that the subgraph of G induced by C 0 has m 0,0 edges, while the subgraph induced by C 1 has m − (m 0,0 + m 0,1 ) edges, m being the size of G.The diagram is computed by exhaustive enumeration for small graphs, while for large graphs only the boundary of the diagram is heuristically computed.In either cases, the likelihood of the observed pair is determined by its position and its darkness (in a grayscale) in the phase diagram.Although this approach has been proven successfully for a wide range of real networks (with only two functional classes), including certain PPI networks [20], it still suffers of the following limitations: (a) it is computationally expensive.In fact, an exact evaluation of the phase diagram requires time exponential in the number of nodes in the network, and can be applied to large instances only by exploiting heuristic algorithms on a sample.Also, after sampling a subgraph with m nodes, the complexity is O(n 2 m); (b) it can be applied only to two functional classes; (c) it is rather qualitative.
To overcome these limitations, we propose to compute the z-score of ω i and η i,j under the null model described in the next section.Since, as we show, this can be done for any number s of colors in O(s(n+m)) time, and s is usually a small constant, we have that our algorithm is time optimal, hence (a) and (b) are settled.As for (c), if Z(ω i ), say, is the z-score of ω i , then by Čebyšëv inequality the probability of the event (Z(ω i ) > λ) is at most λ −2 under the null model.Hence Z(ω i ) −2 directly measures the statistical significance of ω i , at the same time making the method completely quantitative.Moreover, we propose to evaluate the z-score of the number of isolated nodes in the subgraph induced by each color, that are expected to be negative values in the case of homophilic network.This computation is computationally harder, requiring O(nm) time, but experimental results show to be quite fast on networks with order of 10 5 edges, and is still applicable to sparse networks with about 10 6 nodes.

Design of the new model
Throughout the rest of the paper, we think of a network as an undirected graph G with node-set V (G) and edge-set E(G).An s-coloring of G is a surjective map g : where [s] := {1, . . ., s} is the set of colors.As previously stipulated, we think of g as the functional description of the network, and of the set g −1 (i), consisting of the nodes of G having color i, as the functional classes of the description.These classes are our (potential) communities.Hence, in the pair (G, g), G encodes the geometrical description of the network and g encodes its functional description.For instance, Protein-Protein Interaction networks (PPI for shortness) are graphs whose nodes are proteins and whose edges model functional interactions between proteins.Since proteins are classified by the biological function they are responsible for, each protein is uniquely associated with one of the 19 functional classes listed in Table 2 and which we identify by their labels.Therefore, given a PPI network G, the correspondence protein →function defines a surjective map g from the set of nodes of G into a set of 19 labels and, after thinking of the labels as colors, such a correspondence will be our 19-coloring g.For the Pokec social network graph, we partitioned the node set into five age classes.Therefore a correspondence user →age defines a 5-coloring.
Notice that the classification of ages is not frequency based, so that node classes differ substantially in size.
Let c i , i ∈ [s], be the number of nodes of G of color i under g and call the integer vector c = (c 1 , • • • , c s ) the profile of g.Any other coloring f : V (G) → [s] with the same profile as g will be referred to as a c-coloring of V (G) (or simply c-coloring when V (G) is understood).Our next step is to introduce a probability space that allows us to formulate null hypotheses to test against alternative hypotheses about (G, g).To this end, let Φ(c) be the set of all c-colorings of V (G).Since the multinomial coefficient with parts c 1 , c 2 • • • c s , denoted by one of the two symbols below A random c-coloring is the random variable F with values in Φ(c) and with probability mass function given by , namely, all c-colorings are equally likely (see the Appendix for a more formal definition not needed here).
Having the probability space (Φ(c), P n,c ) we test functions of (G, g) versus the same functions under the null hypothesis (G, F ), where F is a random c-coloring of V (G).We therefore define several random variables as functions of the random variable F , and such variables enable us to give first and second order moments of those statistics crucial for our purposes.We close this section by describing the former ones, deferring the description of the latter ones to the next section.
For a node v ∈ V (G) and a color i ∈ [s], let X i v be the Bernoulli random variable that equals to 1 if and only if node v has color i under the random c-coloring F , i.e.X i v is the indicator of the event F (v) = i.Since X i v is a Bernoulli random variable, by (11) in the Appendix, one has Analogously, for the product of two such variables for u, v ∈ V , u = v, and i, j ∈ [s], after resorting to (11) and (12) in the Appendix, one has where, after adhering to the notation in [13], for a positive integer a and a nonnegative integer r, we have denoted by the symbol a r the falling r-th power of a (see also the Appendix for more details), namely a r = a(a − 1) • • • (a − r + 1), with a 0 = 1.Thus, the 2-nd falling power a 2 of a equals a(a − 1).The above formula immediately shows that the random variables X i v as v runs in V (G) and i runs in [s] are not independent (neither are X i u and X j v ).Without pretending to be rigorous, this is only due to the fact that a random c-coloring can be thought of as the outcome of experiments where one draws from a bin "without replacement".However, variables in } are exchangeable, in the sense that the joint distribution of any subset of them does not depend on the order of drawing (the distribution is symmetric with respect to permuting indices).Hence, as long as we consider statistics based only on linear combinations of X i v , there is no other dependency other than the one inherited by the sampling procedure.To let the graph come into the structure of the dependency among variables, we have to consider second order statistics.
Let us come to edges now and, for an edge uv ∈ E(G) and colors i, j ∈ [s], let Y i,j uv be the Bernoulli random variable which is equal to 1 if and only if one of the endpoints of uv has color i and the other one has color j.
One more random variable is needed to compute the first two moments of the statistics we are interested in.Let T be a nonempty subset of V (G) and let i ∈ [s] be a color; define D i T as the number of elements of T having color i; by definition, D i T has the following expression: Let A and B be disjoint subsets of V (G).To determine the distribution of D i T we are interested in the probability of the event that all the elements of A have color i while all those of B have not.Let Ω i (A, B) denote this event (for more on events of this type refer to the Appendix).Thus and since the events on the right hand side of the identity above are mutually incompatible, after equation (9) in the Appendix and after setting t = |T |, one has and the close resemblance with the binomial distribution with parameters t and ci n is clear: powers are replaced by falling powers.This is not an accident: D i T follows a hypergeometric distribution Hyp(n, c i , t) giving the probability of success by drawing without replacement t balls from an urn containing n balls, c i of which are successfull.By choosing T equal to the neighborhood of a node v ∈ V (G), one immediately gets the distribution of the random number of neighbors of node v with color i, i.e.

Homophily, heterophily and isolated nodes: first and second order moments
We are now in position to describe statistics capable of assessing whether PPI networks are homophilic.Let (G, g) be a pair consisting of a PPI network G with n nodes and m edges and a c-coloring g.We classify the m edges of G according to the colors of their endpoints.Consequently, we say that edge uv ∈ E(G) is a (i, j)-edge of (G, g) if {g(u), g(v)} = {i, j}, i, j ∈ [s]-with a little abuse of notation we also admit i = j.Notice that (i, i)-edges, the intra-community edges, are the edges of G induced by the nodes in color class i (those responsible for the homophily of (G, g)) and, for i = j, (i, j)-edges, the inter-community edges, are the edges with one endpoint in color class i and the other one in color class j (those responsible for the heterophily of (G, g)).Let m i,i and m i,j be the number of (i, i)-edges and (i, j)-edges of (G, g), respectively.Therefore, for any two (possibly equal) colors i, j ∈ [s], the random variable uv counts the number of (i, j)-edges of (G, F ) where F is a random c-coloring.Let m i,j be the expected value of M i,j : by ( 3) and the linearity of expectation it follows straightforwardly that which generalizes to an arbitrary number of colors the corresponding expressions given above for two colors.Analogously, we define the i-homophily of (G, g) and (i, j)-heterophily of (G, g), i = j, as the ratios namely, the relative intra-and inter-community density, respectively (recall the identities in ( 1)).If for all i, j ∈ [s] (possibly i = j) we knew the variance σ 2 i,j of M i,j , then we could compute the z-score of the observed ω i e η i,j as the ratios By Čebyšëv inequality, if we assume, for instance, the null hypothesis that the observed value ω i is a value assumed by the random variable M i,i mi,i in the probability space (Φ(c), P c,n )-which is tantamount to assume that (G, g) does not display i-homophily-then the confidence level for accepting the null hypothesis would be at most Z −2 (ω i ).Deferring for a while the computation of σ 2 i,j , let us examine another useful statistic for (G, g): the number l i of isolated nodes in the subgraph induced by color i, i.e. the number of nodes in color class i having no neighbors in color class i.Call any such node i-isolated and observe that by definition the number of i-isolated nodes is Let L i be the random variable defined as the number of i-isolated nodes of (G, F ), where F is a random c-coloring of V (G).Although the random variables L i 's and M i,i 's are clearly dependent (as confirmed by results plotted in Fig. 7) in the next section-at the extreme cases, for instance, the joint knowledge of corresponding statistics l i and ω i is still quite informative.Indeed, consider two graphs G and G on the same node set and let g be a c-coloring of V (G).The i-homophily of (G, g) and ( G, g) could be well the same, but the number of i-isolated nodes can be significantly different as in the following example.
Example 1 For a positive integer t denote by K t the complete graph on t nodes and by K t its complement, namely the graph with t nodes and no edges.Also denote by K 1,t the complete bipartite graph with one node in a color class and t nodes in the other class.Finally, for graphs G and H denote by G + H their disjoint union, namely the graph obtained by picking a copy of G a copy of H disjoint from G, and then forming the union of the two copies.Consider the subgraphs G i and Gi induced by color i in G and G, respectively.If, for some positive integer p, one has then G i and Gi have the same i-homophily but the number of i-isolated nodes in G i is twice the number of i-isolated nodes in Gi .
Therefore, if we knew that ω i ≤ ωi and l i ≥ li , then this fact would support the claim that ( G, g) is more i-homophilic than (G, g) because the relative density of property i is less concentrated in ( G, g) than in (G, g).In conclusion, to assess i-homophily of (G, g) the use of the statistics (Z(ω i ), Z(l i )), where Z(ω i ) and Z(l i ) are the z-scores of ω i and l i , respectively, could be useful.The next theorem, besides summarizing what we have said about the first order moments of the statistics considered so far, also gives the announced expression for σ 2 i,j and the expression for the variance of L i .We then exploit these results to compute z-scores as a tool for analyzing networks in the next section.
Theorem 1 Let G be a graph with n nodes and m edges and let (Φ(c), P n,c ) be the probability space of the random c-colorings, where c = (c 1 , . . ., c s ).Assume c i > 0, ∀i ∈ [s].Moreover, let π 3 (G) denote the number of (not necessarily induced) copies of P 3 in G.For i, j ∈ [s], consider the random variables M i,j and L i defined on (Φ(c), P n,c ).Then 1) for i ∈ [s] the expected value and the variance of random variable , namely the random number of (i, i)-edges of (G, F ) under a random coloring F , are respectively given by 2) for i, j ∈ [s], i = j, the expected value and the variance of random variable , namely the random number of (i, j)-edges of (G, F ) under a random coloring F , are respectively given by 3) for i ∈ [s] let L i be the random number of i-isolated nodes of (G, F ) under a random coloring F , namely the random variable ; then the expected value and the variance of L i are respectively given by is the random number of nodes of color i spanned by the (i, i)-edges.
A formal proof of Theorem 1 is given in the Appendix.
A couple of facts are notable before closing the section.
Statistics presented in points 1) and 2) in Theorem 1 can be easily computed in O(n+ m) time, where n is the number of nodes and m is the number of edges in the input graph, assuming we have a constant number of colors.Hence, computing the s 2 z-scores for the number of edges M i,i and M i,j is computationally efficient for any input instance.We observe that the method in [20] requires exponential time for an exact evaluation, or O(s 2 n 3 ) time, where s is the number of functional classes, if optimisation heuristics are exploited.Computing statistics for the number of isolated nodes L i presented in point 3) in Theorem 1 is more time consuming.As shown in the Appendix, it requires O(smn) time, that can be improved to O s • v∈V deg(v) 2 .This is still efficient for sparse large graphs, with up to millions of nodes and edges.
All of the second order statistics presented in the theorem have an expression that encodes part of the structure of the input graphs, e.g. its number of P 3 's, 2K 2 's as well as the cardinalities of the set of common neighbors of nonadjacent pair of nodes.This means that the coefficient of variation of ω i , defined as σ i,i /m i,i is completely determined by G and c i and that different c-colorings (inducing different functional description) have the same scale.In this respect the homophily of the pair (G, g) is an intrinsic measure of the same pair and the coefficient of variation of ω i is an invariant of the pair (G, c).We can thus answer the question "how homophilic the network is?" without resorting to comparisons with other networks.

Assessing and measuring homophily
In this section we reap the crops of the last theorem by devising a methodological recipe to assess and measure homophily in networks.The main tools in this respect are the z-scores computed in the previous section.Given a pair (G, g) consisting of a network and one of its functional description ga partition of the node-set of the network into s classes of nodes having the same characteristic, e.g.age, marital status, biological function, kind of phone subscription, geographical localization etc.-we can define the s × s random matrix D whose i, j-th entry is the standardized random variable (M i,j − m i,j )/σ i,j and, analogously, the s-dimensional random vector d 0 whose i-th entry is the random variable (L i − E L i )/ var(L i )-notice that D is symmetric because M i,j and M j,i are the same variable.From (G, g) we can compute the arrays Z and z 0 consisting, respectively, of the z-scores of intra-and intercommunity edges (with the former displayed on the main diagonal of the s × s matrix Z) and of the s z-scores of the i-isolated nodes (nodes of color i none of whose neighbors has color i), for i = 1, . . ., s.We refer to Z and z 0 as the z-score arrays of (G, g).Hence we may think of Z and z 0 as the observed values of D and d 0 , respectively-notice that Z is symmetric as well.For an array A (matrix or a vector) denote by 1/A 2 the array of the same dimensions as A whose generic entry b is a −2 , a being the corresponding entry of A. Call the arrays 1/Z 2 and 1/z 2 0 U -values arrays.By Čebyšëv inequality, the U -values arrays give (entry-wise) an upper bound of the probability of observing a value at least as extreme as the one observed for the corresponding random variable.Hence U -values are upper bounds of the corresponding p-values-so called in the Theory of statistical hypotheses.Although U -values arrays: • do not capture the statistical dependency structure of the corresponding random arrays-this subject deserves further research; • do not ensure a tight approximation of the corresponding p-values: though using only second order moments Čebyšëv bounds are undoubtedly the best possible bounds, such bounds can be actually rather loose yielding (possibly) too conservative methods (especially in conjunction with the pervious point), U -values arrays certainly exhibit the following merits: • robustness: U -values do not require distributional assumptions and therefore have an endogenous nature; • complexity: U -values can be efficiently computed (see Section 4.3); • rigour: U -values are computed exactly and do not require sampling or estimates and have precise quantitative meaning for homophily.
Notice that the U -values arrays (1/Z 2 , 1/z 2 0 ) and the z-scores arrays (Z, z 0 ) convey the same statistical information.Hence (Z, z 0 ) is already a direct measure of the homophily of G with respect to g.We spend the remainder of the section to substantiate this claim.
Descriptive power of z-score arrays and comparisons of networks The generic entry of Z = {z i,j } measures the distance from the expected value of the corresponding random variables on a scale whose unit is the mean square error.At the same time, such an entry bounds from above the likelihood of this distance through the U -values, namely, the map z i,j → z −2 i,j .Similar considerations hold for the array z 0 .It follows that z-score arrays can be conveniently described as heat-maps that provide a visual representation of homophily.These kind of diagrams can be particularly useful when comparing different networks that use the same set of colors because all the arrays involved have the same dimensions and thus the corresponding heat-maps are comparable.This can be done for PPI networks, for instance, because they have the same functional description (see Section 4.1 and Section 4.4).In this case one can also refine the analysis with the help of vector z 0 to provide a measure of the concentration of homophily in each color class (however we did not pursue this idea numerically).
Multiple Testing The natural extension of Park and Barabasi's method [20] is the following procedure, which we present first in a scalar form to clarify the need for the Bonferroni correction and then in a more algebraic form to confirm the descriptive power of matrix Z.Although in what follows, when dealing with hypothesis testing, it would be more appropriate to use one-sided Čebyšëv inequality (a.k.a.Cantelli's inequality)-this amounts to consider (1 + z 2 i,j ) −1 in place of z −2 i,j -for simplicity we stick to the two-sided Čebyšëv inequality.
Procedure.Given the pair (G, g) fix a significance level α.Compute the z-scores )). Array z 0 can be dealt with in the same way and can be used to refine the analysis.(6) While the procedure above correctly assesses homophily (heterophily) of the marginal entries of D, it is not true that the same significance level is valid for the joint distribution of D. For assessing joint homophily (heterophily) we have to look at Procedure (6) as a multiple testing procedure which therefore requires multiple testing corrections.One of such correction, the most conservative one, is Bonferroni's correction which, in its simplest form, scales level α-the level below which the null hypothesis is rejected-by the reciprocal of the number h of testing performed.For instance, suppose we want to assess whether a pair (G, g) is jointly homophillic at level α.Then we need to simultaneously test the s diagonal elements of D. In this case, Procedure (6) specializes by declaring that (G, g) is i-homophilic when z i,i > s √ α .Clearly, as the number of testing increases, the procedure becomes too conservative especially in conjunction with Čebyšëv bounds.This limitation is unavoidable without further information about the statistical dependence structure among the marginals of D. Nonetheless, by using a slightly refined form of Bonferroni correction, we can still devise a method to measure homophily in a given network and to compare homophily between different networks that use the same set of colors.For (i, j) ∈ [s]×[s], with i = j, consider the alternative hypothesis H 1 i,j : D i,j > 0 versus the null hypothesis H 0 i,j : D i,j ≤ 0 at the significance level α i,j .Pair (i, j) is said to positive at the significance level α i,j whenever Procedure ( 6) accepts } and set S is called positive at the joint significance level α whenever H 1 i,j is accepted by Procedure (6) for all (i, j) ∈ Q.The main observation is as follows.If we prescribe the individual significance level α i,j = z −2 i,j for (i, j) ∈ Q, then S will be positive at the joint significance level min{1, Q z −2 i,j }.In particular, if , then the set of diagonal positions of Z, namely the positions of the z-scores of the intra-community densities, is positive at joint confidence level given by the trace of the U -value array 1/Z 2 .This observation suggests that we can relate the number of positive elements in a set Q at a significance level α with the sum of entries of 1/Z 2 indexed by Q.Indeed, let Q(α) be the largest subset of Q such that (i,j)∈Q(α) z −2 i,j ≤ α and let q(α) be the cardinality of Q(α).Notice that q(α) can be 0. Hence Q contains exactly q(α) positive elements at the joint significance level α.Parameter, q(α) depends only on Q, Z and α and therefore can be used to compare different networks that use the same set of colors.On the other hand, by definition, q(α) is related to Z by the following fact: for a real number λ, let It is clear that for each α there exists a λ (not in general unique) such that Q(α) = J(λ) ∩ Q.Therefore, family {J(λ) | λ ∈ R} globally conveys the same information as family {q(α) | α ∈ [0, 1)} and we can get rid of the significance level α when comparing networks that use the same set of colors.Notice however that {J(λ) | λ ∈ R} conveys globally the same information as the heat-map of the z-score matrix Z with the temperature acting as an inverse transform of the significance level.
Synthetic measure via Multidimensional Čebyšëv-type inequalities Multidimensional Čebyšëv inequalities [6] provide a somewhat dual method to the multiple testing procedure above.Recall that if X is a d-dimensional real random vector whose marginals have zero mean and unitary variance, X is the Euclidean norm of X, and t is a positive real number, then the following multidimensional Čebyšëv-type inequality holds by a straightforward application of Markov inequality to the random variable X 2 .The same inequality holds for matrices but replacing the Euclidean norm by the Frobenius norm and adjusting for dimensions.More generally, it holds by vectorializing any subset of entries of a given matrix (after adjusting for dimensions).For instance, direct application of inequality above yields: with diag(A) denoting the vector formed by the diagonal entries of the square matrix A. Hence, the sum of the squares of the diagonal entries of Z gives a global synthetic measure of homophily: the higher such sum is the more globally homophillic the network is.Therefore, is a global index of homophily lying in [0, 1], like Newman's modularity index [18].

Numerical tests on real networks
We now probe our theoretical results on two different network classes: i) Protein-Protein Interaction (PPI) networks, where nodes correspond to proteins, partitioned according to their functional role, and edges represent functional interactions between proteins ii) on-line social networks, where nodes correspond to users, partitioned according to their age, and edges represent friendship between users.As shown in the previous section, the major character of our methodology is the z-score matrix Z.Let us discuss data and the running time of the method in some details before going to the numerical tests.

Protein-protein interaction networks
We consider ten PPI networks retrieved from STRING database (https://string-db.org/)[25,26], setting a high confidence score cut-off (0.70).The selected networks, listed in Table 1, are mainly related to Bacteria (8 out of 10, belonging to diffent Phyla or classes), we also included in the study Saccharomices cerevisiae (Fungi -Ascomycota) and Pyrococcus abyssi (Euryarcheota -Thermococci) for comparison.The 8 bacterial organisms were chosen as representatives of Bacteria Kingdom, including different Phyla (Alpha, Gamma, Epsilon proteobacteria, Actinobacteria, Firmicutes/Bacilli, Spirochaetes).Organisms were also chosen on the basis of their network sizes (number of nodes and edges), in order to build an etherogeneous dataset.Species, Kingdom, Phylum/Class as well as number of nodes, number of edges, and density of the relative network are reported in Tab. 1 for each organism.Each organism's network is an undirected graph, in which each node represents a protein associated to a color denoting one of the functional classes listed in Table 2, and each edge represents the interaction between two proteins, weighted according to the likelihood of the given interaction.A PPI graph is thus represented by two text files, the first lists node labels and the associated colors, the second lists edges as pairs of nodes and the associated weight in range [0, 999].Edges have been cut-off at a 700 minimum weight, usually considered as a high confidence threshold.Isolated nodes in the resulting graph have been deleted.Some networks present a very limited number of nodes (some units) labeled by similar values (e.g.jhp0681 1 and jhp0681 2 in the node file for organism Helicobacter pylori) representing different isoforms of the same protein, but these nodes were simply denoted by a unique label (e.g.jhp0681) in the edge listing file.We merged such nodes in a single node; in the few cases in which they were associated to different functional classes, we merged them associating the functional class X to that node.

Pokec social network
Pokec is the most popular Slovak on-line social network.Datasets, obtained during May 25-27 2012, are anonymized and contain relationships and user profile data of the whole network [27].Friendships in the Pokec network are originally oriented.We decided to consider only symmetric pairs, so that we derived an undirected graph where nodes x, y are adjacent if and only if both x is a friend of y and y is a friend of x, so that it can be assessed that the two considered members had an actual interaction; also in this case, isolated nodes have been discarded.The network obtained contains more than one million nodes and 8 millions edges.Nodes are partitioned in classes according to the age declared by members, where about 34% of them either did not declare age, or declared a patently untrue value-in some cases even less than 10 or over 100.So, we decided to put into a "fake" age class denoted by X all members whose age is not a numeric value in [12,60).The size of each subgraph induced by the 5 age classes, possibly containing isolated nodes, is shown in Table 3, together with the size of the entire network.4, excluding time elapsed in file I/O.As it clearly appears from the table, the ratio between the number of edges in the graph and the time needed to compute edge z-scores is close to be constant (varying from 340k to 460k edges per second), confirming the asymptotic complexity O(n + m)-assuming the number of colors is constant.

Class
An efficient computation of singleton z-scores requires some more care.Expression for var(L i ) in point 3)) in Theorem 1 requires O(n 3 ) time to be computed.Actually, it can be manipulated (details are discussed in the Appendix), so that the complexity of computing var(L i ) for each color i is lowered to O(nm).More precisely, its complexity is strictly related to the number of pairs of nodes at distance 2, which in turns is bounded by π 3 , i.e. the number of P 3 's in the graph.It is immediate to see that The sum of squared degrees for all experimented networks is reported in Table 4, where it is confirmed to be proportional to computing times for singleton z-scores (with a ratio varying from 28k to 61k P 3 's per second).For each network we report the number of nodes, the number of edges and the sum of squared degrees.
The complexity of singleton z-scores computation strongly depends on the sum of squared degrees.

Numerical results
In order to have a pictorial quantitative perception of homophily and heterophily in the considered networks, we present matrix Z of the z-scores of the intra-and inter-community edges (see Section 3) in the form of heat-maps.Color scale is logarithmic on z-scores, traslated in order to avoid negative values.Each entry of Z corresponds to a square in the diagram.Green squares corresponding to entry i, j represent positive z-scores, while pink squares represent negative z-scores.Results related to PPI networks are shown in Figure 1.Homophily of PPI's with respect to their functional description is clearly readable from all the heat-maps by the green squares in all diagonals-showing the relative intracommunity density-except for the poorly characterized X function class.A majority of off-diagonal z-scores are negative (more than 79.6%), while diagonal z-scores tend to show very high values.As a global result, neglecting all i, j pairs where either i = X or j = X, we recap that: • the average value of the diagonal entries Z is 36.26,with standard deviation 49,85, ranging from a -0.3183 minimum to a 326.6 maximum; • more than 91% of diagonal entries Z are greater than 5; • the average value of off-diagonal entries Z is -0.836, with standard deviation 3.707, ranging from a -5.983 minimum to a 55.67 maximum; • more than 65% of off-diagonal entries Z are less than -1.
Haemophilus influenzae Helicobacter pylory J99 Streptococcus pneumoniae TIGR4    Concerning the off-diagonal entries of Z (namely, those corresponding to inter-community edges) it is worth noting that some classes show significant values, highlighting a unexpected heterophily although in most cases the associated classes belong to close functional classes such as class J, K and L, that can be grouped in the higher category Information, storage and processing.

Saccharomyces cerevisiae
In particular significant heterophilic z-scores are reported, in most of the organism networks, for classes J-L and class J-U representing Translation, ribosomal structure and biogenesis (class J), Replication, recombination and repair (class L) and Intracellular trafficking, secretion, and vesicular transport (class U).These heterophilic relationships can be considered reasonable from a biological point of view, since nodes associated to protein synthesis in the ribosome (class J) are related to nodes involved in DNA replication (class L) and also to intra-cellular transport (class U) according to the mechanics of protein biosynthesis (when DNA is transcribed, the resulting RNA copy is transported to the ribosome and after translation the protein can be transported away from the ribosome and onto the relevant part of the cell).These results provide consistency to our work as a real-world validation of our method.
To have a global and comparative glimpse of the whole scenario concerning PPI, we isolated the diagonal entries of Z and plotted them in Figure 2 on a different scale.
A large majority of z-scores (diagonal) shows very high values corresponding to extremely significant deviation from expected ones.As expected, the exception regards last column related to X class (Function unknown or General function prediction only) showing z-score values typically negative including very small values (-14 for Saccharomyces cerevisiae, -7 for Pyrococcus abyssi and -6 for Escherichia coli) with only two organisms showing positive values (0.44 for Mycobacterium tubercolosis and 1.9 for Vibrio cholerae) (Fig. 2).This typical scenario is consistent with what we could expect from a biological point of view, since it is reasonable that proteins, envolved in a common task, could on average preferentially interact or be close to each other in the PPI.Proteins belonging to X class do not share a common task since in most of cases they are not associated to any given functional class, so it is reasonable that they are not likely to interact with each other.Some functional classes seem to show extremely high values, shared among almost all the organisms.It is evident for class J (Translation, ribosomal structure and biogenesis) showing the highest values, reaching huge z-scores (335 for Escherichia coli, 280 for Mycobacterium tubercolosis) always higher than 124.Also class N (Cell motility) shows extremely high z-score values reaching 229 for Brucella mellitensis and 206 for Escherichia coli, with the only exception of Mycobacterium tubercolosis -2.47 -that is anyway more than two standard deviations greater than the expected one.Genes coding for proteins in bacteria are known to typically occur phisically close on chromosome, according to the operon paradigm, and it was shown, consistently with our findings (see [23]), that especially genes coding for proteins envolved in translation and cell motility task are very close to each other, favoring their syncronous transcription and the interaction of their protein products.As for the Pokec social network, results are presented in a completely analogous manner: see Fig. 3 for the heat-maps, while in (4) we isolated the diagonal elements.As expected Pokec shows a significant homophilic beahavior with respect to the considered node attribute, age class, as reported in Tab. 3.
All diagonal z-scores, excepting class X (no age or non reliable value), reported in Fig. 3 and in Fig.To complement the analysis, we also computed vector z 0 .Recall that the i-th entry of such vector is the z-score of the number of isolated nodes in the subgraph induced by color i (functional class for the PPI and age class for Pokec).As explained in Section 2.2, although correlated with the intra-community densities (as confirmed for PPIs in Fig. 7: the higher the density, the lower the likelihood to find isolated nodes), the entries of z 0 provides a measure of the concentration of the intra-community edges within color classes and, as expected, they are typically negative, consistently with what they represent.A negative entry means that subgraph induced by the corresponding functional classes for the PPI and age class for Pokec contains less isolated nodes that expected.As can be observed in both Fig. 5 and Fig. 6, except for the X class which shows a z-score value close to zero for Pokec and few values close to zero and a vast majority of positive values for PPI, z-scores associated to all other classes assume very low (negative) values (around 75% of values are smaller than -5 in PPI -values around -140 for class C and D and around 80 and 60 for class E and F in Pokec), that can be considered extremely significant from a statistical point of view.
-15   Finally, as we said in Sections 2.2 and 3, the entries of the p-values arrays 1/Z 2 and 1/z 2 0 (obtained simply by squaring the reciprocal of the entries of the z-score arrays) can be rather loose estimates of the corresponding true quantiles.In this respect our method is rather conservative.Nonetheless, as shown in Fig. 8, a large majority of p-values entries are under the threshold of 0.05, which is usually considered as reliable (for individual testing) with the exceptions already discussed above.

Conclusions and discussion
In this paper we presented a new approach to assess and measure homophily in networks.The model, described in Section 3, relies on computing • the z-scores of m i,j , the number of edges with one endpoint in functional class i and the other endpoint in functional class j (with possibly i = j), • the z-scores of l i , the number of nodes in functional class i with no neighbours in class i, under the hypothesis that these numbers are samples from the corresponding random variables M i,j and L i under the random coloring model (Φ(c), P n,c ) (the null model).These z-scores are either directly interpreted as a refined measure of network homophily (through heat-maps) or serve as the basis either for more synthetic measure via multiple testing or via the significance level of the Euclidean distance between the observed intra-community densities and the expected ones under the random coloring model.The idea of random coloring is implicit in [20] from which we also borrowed terminology.As a result, we extended their model to an arbitrary number of colors and made it computationally efficient and also quantitative (via the z-score).The method is clearly applicable to any kind of network and to any of its functional description.Different networks with the same functional description can also be compared directly.Moreover, we noticed that the coefficients of variations of the M i,j 's and L i 's are invariant for the pair (G, c), where G is the network and c is the profile of the functional description g of G.
Obtained results provide evidence of the strong homophilic nature of PPIs, in terms of protein function, and of Pokec social network, in terms of age classes, making our method reliable and affordable since homophilic nature of PPIs and social networks is something expected and known to some extent.
Network homophily is directly linked to network communities and to the paradigm of Guilt By Association (GAS) [19].According to this paradigm, attribute of a given node can be inferred by analyzing the attributes of its neighbours [4,21].In this view assessing and measuring network homophily can be extremely significant for the applicability of the GAS paradigm, allowing to classify nodes according to neighbor attributes.The analysis of Z matrix in Pokec network can provide an example of how GAS paradigm can be concretely applied.Users belonging to X class (age not reported or non reliable) are significantly close (according to the values of entries of Z matrix) to classes C and D, showing an heterophilic behavior while they are not close to users of classes D and E. This leads to hypothesize that users of class X could have, even if they did not report it, an age associated to class C or D. It is worth noting anyway that in some networks, in particular in PPIs, node attributes can be already classified through GAS paradigm, leading to a bias or to a tautological analysis, generating a circular argument.
Concerning PPI networks, comparison of Z matrices shows that the homophilic behavior is not linked to evident stronger similarity among close related species (also Saccharomyces cerevisiae and Pyrococcus abyssi show similar homophilic/heterophilic z-scores), so that homophilic behavior can be considered as an intrinsic characteristic of PPIs.Interestingly, some functional classes are more associated than expected showing an heterophilic behavior, especially classes J, K and L, that can be grouped in the higher category "Information, storage and processing".Another significant z-score highlights heterophily in most of organism networks with respect to classes J and U representing "Translation, ribosomal structure and biogenesis" (class J) and "Intracellular trafficking, secretion, and vesicular transport" (class U ) respectively.
The model has been implemented in Python, and experimental results confirm that the computational complexity of the proposed model is optimal for edge density computation, requiring O(n + m) time to compute the Z matrix.Computing the z-score of the number of i-isolated nodes is more time consuming, requiring O(nm) time, but experiments show that it is still efficient in practice for sparse large networks.
In conclusion we are confident that this work can provide a significant contribution allowing to assess and measure, through a robust statistical method, homophily in networks.
Let J ⊆ [s].The contraction by J of vector c = (c 1 , . . ., c t ) is the vector c ′ obtained from c by suppressing the entries whose indices are in J.We make use of the following multinomial identity which follows straightforwardly by the definition of the multinomial coefficient: where n and c are as above and c ′ is the contraction of c by J.For instance, if J = {1}, then c ′ = (c 2 , . . ., c s ) and the expression above reads as We now define the notion of random c-colorings in some more depth.Let Φ(c; V ) be the set of all c-colorings of V (in our case V = V (G) for some graph G).When V is understood (as we have assumed throughout the paper) the notation is abridged into Φ(c).Thus We now equip Φ(c) with the uniform measure P n,c and define the random c-coloring of V , which we denote by F , as the dentity map on Φ(c), namely the random c-coloring of V is essentially the probability space (Φ(c), P n,c ) itself and it can be visualized as the random object F taking the value f ∈ Φ(c) with probability Pr {F = f } = P n,c (f ).A statistic based on the random c-coloring F of V is simply any measurable function on (Φ(c), P n,c ), for instance, the indicator X i v of the event (F (v) = i), for some i ∈ [s] and v ∈ V , is one of such.Notice that the inverse image of event (F (v) = i) is the set {f ∈ Φ(c) | f (v) = i}.This is the essence of our statistical model.
For our purposes, for some two disjoint subset A and B of V and some color i ∈ [s], we are interested in the probability of the event that all the elements of A have color i while all those of B have not.Let Ω i (A, B) denote this event.Hence We are also interested in computing the probability of the intersection of two such events for two distinct colors.We summarize these calculations in the next lemma and then we show how to use the lemma for computing the probability of certain simpler events.
Lemma 1 Let A, A ′ , B, B ′ be subsets of V and let a, a ′ , b, b ′ be their respective cardinalities.Suppose Then, for each two distinct colors i and j, one has Proof.Since the elements of A have to be mapped to i and those of B have not, the elements that have color i can be chosen in n−(a+b) ci−a ways.After this choice, we are left with n − c i elements that have to be assigned to [s] \ {i} in such a way that all the elements in A ′ must be mapped to j and those in B ′ cannot.Among the elements of B ′ the are possibly some that have been already assigned to i.

Therefore we can perform the choice in
ways.After this choice has been done, we are left with n − (c i + c j )) elements that have to be assigned to colors in [s] \ {i, j}, namely with the number of c ′ -colorings of a set of n − (c i + c j )) elements where c ′ is the contraction of c by {i, j}.If follows that , where we used Formula (7) at the denominator.One obtains Formula (8) after simplifying, expanding the binomial coefficients and resorting to the definition of falling factorial.
The way we use the lemma to compute the probability of certain basic events is to read the events as a special case of the event Ω i (A, B) ∧ Ω j (A ′ , B ′ ) and to plug in the formula the corresponding parameters a, b, . . .b ′′ .Note that, for any j ∈ [s], by choosing By, taking B = ∅-and hence b = 0-has the effect of suppressing the constraint (F (b) = i, ∀b ∈ B).Therefore, for instance, and, in particular, for any pair of elements u, v ∈ V any color i ∈ [s], Pr {F (u) = i} = c Analogously, since for any pair of elements u, v ∈ V and any two distinct colors i, j ∈ [s], it holds that it follows that two compute the probability of such an event one has to put a Proof of Theorem 1 Proof.The expected values m i,i and m i,j , i = j have already been computed.Let us prove the formula for the expected value of L i .By definition , namely the event that v has color i while all of its neighbors have not.Thus, after (4), Hence, by linearity of expectation Let us compute the variance of the random variables in 1), 2) and 3).Observe that all such variables are sums of Bernoulli random variables, namely they are of the form S = ν∈N B ν where N is a finite index set and B ν is a Bernoulli random variable for each index ν ∈ N .The variance of S is thus given by var where we used the fact that Let us first specialize the formula above to M i,i and M i,j .Notice that in both cases N = E(G) and that the summation set in the last equality of ( 13 Therefore, the variance of S assumes the following form We obtain expressions for the variance of M i,i and M i,j by plugging the expectation of the corresponding variable in the formula above and specializing a and b for B e = Y i,j e and B e = Y i,j e , with e = uv for some nodes u and v. Let us start with a, namely, the value of Pr {B e = 1 ∧ B e ′ = 1} when (e, e ′ ) ∈ Q. Hence e ∼ e ′ .After regarding edges as sets of two nodes, one has e ∼ e ′ if and only if |e ∪ e ′ | = 3 (recall that the graph is loopless and has no parallel edges).Let e ∪ e ′ = {u, v, w} where u is the unique node in e ∩ e ′ .Recall that for disjoint subsets A and B of V (G) we denote by Ω i (A, B) the event that all the nodes of A have color i while all those of B have not.Now, if S = M i,i , then B e = Y i,i e for all e ∈ E(G), and thus a = Pr {Ω i ({u, v, w}, ∅)}; else, if S = M i,j , then B e = Y i,j e for all e ∈ E(G); in this case observe a is the sum of the probability of two mutually exclusive events: the event that u has color i while the nodes in {v, w} have color j, namely the event Ω i ({u}, {v, w}) ∧ Ω j ({v, w}, {u}), and the the event that u has color j while the nodes in {v, w} have color i, namely the event Ω j ({u}, {v, w}) ∧ Ω i ({v, w}, {u}).Therefore, by Lemma  for all e ∈ E(G), and thus a is the probability of the event Ω i ({u, u ′ , v, v ′ }, ∅), namely the probability that all the four nodes have color i under F ; else, if S = M i,j , then B e = Y i,j e for all e ∈ E(G); observe that there are two bipartitions of {u, u ′ , v, v ′ } into sets A and B such that |A| = |B| = 2 and neither A nor B induces one of the edges e and e ′ .Hence b is two times the probability that all the nodes in A have one of the colors i or j and all the nodes in B have the other color.Hence b is four times the probability of the event that all nodes in A have color i and all nodes in B have color j, that is b = 4Pr {Ω i (A, B) ∧ Ω j (B, A)}.By plugging the values of a and b (as well as the corresponding expected values) in ( 14) one achieves the desidered expressions for σ 2 i,i and σ 2 i,j .It only remains to prove the formula for the variance of L i .By specializing (13) with and since Pr W i u = 1, W i v = 1 = 0 whenever u and v are adjacent nodes of G, it follows that and after plugging this expression in the latter sum we obtain the stated formula.The proof is thus completed.

Classes' size in organism's networks
For each organism's network, we report in Table 5 the number of nodes for each functional class, and the total number of nodes.Nodes in classes A, B, Y, and Z are included in the total size, but were not considered in the analisys.

Speeding-up L i computation
We show how to compute efficiently statistics in point 3) in Theorem 1, in particular the variance expression Trivially computing the summation in (15) requires O(n 3 ) time.We show now that the time complexity can be lowered to O u∈V (G) deg 3 G (u) , that becomes O u∈V (G) deg 2 G (u) expected time (with very high probability) if hash-tables are used to represent sets, and falling factorial values x y are approximated by applying Stirling formula.Since huge networks are usually very sparse, this represents a deep improvement with respect to the computation based on (15).
We first observe that The second and third summations in (16) contain respectively only O(m) and O(n) terms.The first summation in (16) contains O(n 2 ) terms, and for each pair u, v a different value of exponent b(u, v) could be needed.This actually only occurs for pairs u, v having some common neighbor, while if all pairs had distance larger than 2 a substantial speed-up could be possible.We actually compute the first summation in (16) as if all pairs u, v had no common neighbors, so that b(u, v) = deg G (u) + deg G (v), and then we fix the correct value for pairs u, v such that dist(u, v) = 2-adjacent pairs have already been taken into account in the second summation.
Let us denote deg G (u) + deg G (v) by b ′ (u, v): (u,v)∈V (G) The first summation in ( 17) is easily computed by means of the degree histogram of G, where deg −1 G (d) is the number of nodes having degree d in G: and can be computed in O(m) time, since at most 2 √ m distinct degree values may occur in a graph.The second summation in (17) can be computed by exploring the neighborhood of each node, since dist(u, v) = 2 if and only if u, v ∈ N G (z) for some node z and uv ∈ E(G); this can be done in O z∈V (G) deg 2 (z) , that is much smaller that n 2 for sparse graphs.It is immediate to see that O z∈V (G) deg 2 (z) is the dominating term in computing the value of (15).
Experiments have been performed for social networks with over 10 6 nodes and 8 • 10 6 edges, for which the number of pairs of nodes u, v at distance 2 was order of 10 8 .

Figure 1 :
Figure 1: Heat-maps corresponding to Z matrices of the ten organism PPI networks.Diagonal entries correspond to intra-community edges z-scores, while off-diagonal entries correspond to inter-community edges z-scores.Values in the color scale have been cut to interval [−10, 60].

Figure 2 :
Figure 2: z-score intra-community density values (diagonal entries of Z) of each functional class (x-axis) are reported in different colors (each color representing a different organism as indicated in the top right legend of the plot).

Figure 3 :
Figure 3: Heat-map corresponding to Z matrix of Pokec social network.Values in the color scale have been cut to interval [−100, 100].

Figure 4 :
Figure 4: Diagonal z-score values related to age class.

Figure 5 :
Figure 5: z 0 values (y-axis) of each functional class (x-axis) are reported in different colors (each color representing a different organism as indicated in the top right legend of the plot).

Figure 7 :
Figure 7: Correlation between diagonal Z values (x-axis) and z 0 values (y-axis).Each point represents a given functional class of a given organism.

Figure 8 :
Figure 8: Diagonal entries of U -value arrays.The x-axis is labelled by functional classes and each color represents a different organism as indicated in the top right legend of the plot.
) is E(G) × E(G) \ {(e, e) | e ∈ E(G)}.Denote the latter set by P .Since two edges e and e ′ of G can have at most one node in common, it follows that P = Q ∪ R where Q = {(e, e ′ ) ∈ P | e ∼ e ′ } and R = {(e, e ′ ) ∈ P | e ∼ e ′ } and where we have written e ∼ e ′ if e and e ′ share a node and e ∼ e ′ otherwise.Clearly Q It is clear that Pr {B e = 1 ∧ B e ′ = 1} assumes only two values over the set P : it assumes the value a on Q, and the value b on R.Moreover, since |P | = (m 2 − m) = 2 m 2 and since e ∼ e ′ if and only if e and e ′ spans a P 3 , it follows that

3 if 2 j +c 2 i cj n 3 if.
B e = Y i,j e cic B e = Y i,j e Let us compute b.In this case e ∩ e ′ = ∅.Let e = uv and e ′ = u ′ v ′ .If S = M i,i , then B e = Y i,i e

4 if
B e = Y i,j e .

Table 1 :
Complete list of considered organisms, together with their network size (nodes and edges).Density is expressed as the ratio between the actual number of edges and the number of edges in the complete graph with the same number of nodes.Functional classes of proteins of the considered ten organisms were obtained from NCBI database (ftp://ftp.ncbi.nih.gov/pub/COG/COG/).Proteins were partitioned into 25 different functional classes, but only 19 were taken into account in this work, since: • 5 classes (A -RNA processing and modification, B -Chromatin structure and dynamics, Y -Nuclear structure, Z -Cytoskeleton, W -Extracellular structures) had no representatives (or only a few) for most of bacterial organisms;• classes R -general function prediction, and S -Function unknown, were merged into the X class.The 19 considered classes are reported in Tab 2. The number of proteins for each functional class in each organism is reported in the Appendix.

Table 2 :
Protein functional classes, partitioned into higher categories.

Table 3 :
Classes of Pokec social network.For each class the number of nodes is reported, with the number of edges joining nodes in the same class.

Table 4 :
Computing times for edge z-scores and singleton z-scores, on organisms and Pokec networks.

Table 5 :
For each organisms, the total number of nodes (classes A, B, Y, Z included) and the number of nodes in each considered functional class.