Computing the statistical significance of optimized communities in networks

In scientific problems involving systems that can be modeled as a network (or “graph”), it is often of interest to find network communities, i.e., strongly connected node subsets, for unsupervised learning, feature discovery, anomaly detection, or scientific study. The vast majority of community detection methods proceed via optimization of a quality function, which is possible even on random networks without communities. Therefore there is usually no easy way to tell whether a community is “significant”, in this context meaning more internally connected than would be expected under a random graph model without communities. This paper generalizes existing null models and statistical tests for this purpose to bipartite graphs, and introduces a new significance scoring algorithm called Fast Optimized Community Significance (FOCS) that is highly scalable and agnostic to the type of graph. Compared with existing methods on unipartite graphs, FOCS is more numerically stable and better balances the trade-off between detection power and false positives. On a large-scale bipartite graph derived from the Internet Movie Database (IMDB), the significance scores provided by FOCS correlate strongly with meaningful actor/director collaborations on serial cinematic projects.


I. INTRODUCTION
Many natural systems can be modeled as a network, with network nodes representing entities and network edges representing links or relationships between those entities. As such, a wide variety of network models and graph algorithms have been developed, generalized, and improved over many decades, forming the field of network science and the study of complex networks [1]. A sub-field of network science is focused on methodology for and applications of "community" detection. Defined loosely, a community is a subset of nodes in a network that are more connected to each other than they are to other nodes. There are many distinct, precise definitions of a community, with utilities that vary by application [2]. In practice, the purpose of community detection is to discover dynamics or features of the networked system that were not known in advance. Community detection has been profitably applied to naturally arising networks in diverse fields like machine learning, social science, and computational biology [3].
Usually, the aim of community detection is to find communities in a network that are optimal with respect to some quality function or search procedure. Often, a partition of the network is the object being optimized with the quality function. Arguably the most commonly used and studied quality function for partition optimization is modularity, which is the sum of the first-order deviations of each community's internal edge count from a random graph null model [4]. Other community detection methods aim to find a collection of communities, where the requirement that communities be disjoint and exhaustive is relaxed (partitions are also collections). In some approaches, collections of communities are found by optimizing communities one-by-one, according to a community-level quality function [5][6][7].
Hundreds of distinct community detection methods have been introduced in recent decades. Despite this, relatively few articles discuss issues of statistical significance related to community detection. In particular, there is often no immediate way to determine if the communities returned by a community detection algorithm are of higher "quality" than would be expected (on average) if the algorithm were run repeatedly on a random graph model without true communities. When significance is discussed or addressed, it is usually with reference to the overall partition, rather than individual communities [e.g. 8, 9]. This paper introduces a method called Fast Optimized Community Significance (FOCS) for scoring the statistical significance of individual communities that, importantly, have been optimized by a separate method. As discussed in Section I A, there are several existing methods that score optimized communities. This paper makes two advancements in this area:
1. Null models for scoring optimized communities are made explicit and newly generalized to bipartite graphs.
2. A new method (FOCS) is introduced which enjoys some benefits over existing methods:
• A core algorithm that is transparent, easy to implement, and general enough to apply to either unipartite or bipartite graphs without modification.
• Higher numerical stability and 10-100x faster runtimes.
• Conservative scores on optimized communities from null networks, while maintaining comparable or dominant detection power on true communities.
In this paper, a network is denoted by G := (V, A), where V is a set of vertices and A is an adjacency matrix. Let n := |V|. For u, v ∈ V, the entry A(u, v) is equal to the number of edges between nodes u and v. Unless explicitly stated otherwise, all networks considered will be undirected, so that A(u, v) = A(v, u).

A. Existing work
Currently there are several methods for scoring optimized communities by statistical significance. In one recent publication, a simulation-based method called the QS-Test was proposed [10]. The QS-Test generates 500 independent configuration-model networks, each with a degree distribution matching the observed network. (The repetition count 500 is the default value for the method, and can be changed.) On each network, a community detection algorithm is run, and a kernel density estimator is applied to the resulting quality-function values. This provides a null distribution against which to compare observed values of the quality function.
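The QS-Test procedure described above can be sketched schematically as follows. This is a hedged illustration of the general idea, not the authors' implementation; the `detect` and `quality` callbacks are placeholders for the user's chosen detection algorithm and quality function.

```python
import networkx as nx
from scipy.stats import gaussian_kde

def qs_null_density(G, detect, quality, reps=500, seed=0):
    """Estimate the null density of a community quality function by
    simulating degree-matched configuration-model graphs, running the
    detection algorithm on each, and pooling the quality values."""
    degrees = [d for _, d in G.degree()]
    samples = []
    for r in range(reps):
        # configuration-model replicate with the observed degree sequence
        H = nx.Graph(nx.configuration_model(degrees, seed=seed + r))
        H.remove_edges_from(nx.selfloop_edges(H))  # simple-graph projection
        samples.extend(quality(H, c) for c in detect(H))
    # kernel density estimate of the null quality distribution
    return gaussian_kde(samples)
```

Observed quality values can then be compared against this estimated null density, which is the expensive step the analytical methods below avoid.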
The QS-Test approach has many desirable features. First, it is general, in that it can be applied with any quality function and any community detection algorithm. Furthermore, it is (at least in principle) evaluating community significance against a direct estimate of its quality function's null distribution. However, the approach also has drawbacks. First, unless given an unlimited number of machines, it is not scalable, as it requires hundreds or thousands of simulated graphs with the same number of nodes and edges as the observed network. It also requires the community detection algorithm of choice to be run on each of those networks. Even if a separate machine were available for each simulation, the collection of the results and subsequent density estimation procedure would be cumbersome in an online data setting.
An older approach introduced in [11] uses an analytical approximation to compute the statistical significance of a community. This approach was referenced in [10] and included in that paper's simulation study. The authors of [11] begin with a conditional configuration model which fixes the internal edge count of the community of interest. Under this model, the in-community edge count d_u(C) of any external node u ∉ C follows a certain hypergeometric distribution. The authors reason that, if the community is a false positive, the in-degree of its worst node should be distributed as the maximum hypergeometric order statistic of the external nodes. They derive a basic score from this observation, and then propose a modified version of the score for an optimized community. The particulars of this method will be discussed further in Section II, as the FOCS approach has a similar foundation.
Building upon their score based on a community's worst node, the authors then propose to test nodes up to the k-th worst node in the community. They show through empirical studies that this "B-Score" (for "border" score) is more powerful while remaining conservative on false-positive communities. The strength of the B-Score approach over the QS-Test is that it is analytical and thus faster to compute. A drawback of the approach is that it makes more approximations to the null distribution than the QS-Test, and does not have the notion of effect size or quality score which is inherent to that method.

II. THE FOCS ALGORITHM
The methodology introduced in this paper is based on a conditional configuration model, similar (but not identical) to that introduced in [11]. The null model and new methodology used in this paper are described first for unipartite graphs; a novel generalization to bipartite graphs is presented in Section II A. Given a community of interest C, the core FOCS algorithm focuses on the edge distribution of external nodes u ∈ C' under a random graph null model. The null model breaks all edges coming out of C and all edges internal to C', and randomly re-assigns the edges of u without replacement. Under this model, the degree of u in C has a hypergeometric pmf:

P(d̃_u(C) = k) = binom(K, k) binom(N − K, d_u − k) / binom(N, d_u), with K := d_C(C') and N := d_C(C') + d_C' − d_u,   (1)

where binom(a, b) denotes the binomial coefficient "a choose b": u's d_u stubs are matched without replacement to the pool of N free stubs, of which K lie in C. If C is optimized, the least-connected or "worst" in-community node w ∈ C should be at the maximum quantile of P among external nodes. Explicitly, defining g_P(u, C) := P(d̃_u(C) ≥ d_u(C)) as the null upper-tail probability of u's observed in-community degree, the significance score on which FOCS is based is

f(C) := F_m(g_P(w, C)),   (2)

where F_m(x) = 1 − (1 − x)^m is the CDF of the minimum of m := |C'| + 1 uniform random variables on the unit line. The score f(C) has the standard interpretation given to traditional p-values: a low value of f(C) implies that the connectivity observed in C is unlikely to have arisen in a random (community-less) network.
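The core score can be sketched in a few lines, assuming the hypergeometric parameterization reconstructed in Equation (1). This is an illustrative sketch, not the authors' reference implementation; the function and argument names are ours.

```python
from scipy.stats import hypergeom

def g_p(d_u_in_C, d_u, free_stubs_in_C, pool_size):
    """Null upper-tail probability g_P(u, C) = P(d~_u(C) >= d_u(C)).

    d_u_in_C       : observed degree of u inside C
    d_u            : total degree of u
    free_stubs_in_C: free stubs inside C under the null, i.e. d_C(C')
    pool_size      : total free stubs u's edges can attach to
    """
    # hypergeom.sf(k - 1, N, K, n) gives P(X >= k)
    return hypergeom.sf(d_u_in_C - 1, pool_size, free_stubs_in_C, d_u)

def focs_single(g_worst, n_external):
    """f(C) = F_m(g_P(w, C)), with F_m the CDF of the minimum of
    m = |C'| + 1 uniform variables: F_m(x) = 1 - (1 - x)^m."""
    m = n_external + 1
    return 1.0 - (1.0 - g_worst) ** m
```

A node whose in-community degree is extreme relative to the null receives a small g_P value, and hence a small (significant) score f(C).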
The idea of using the worst node of a community to test optimized communities was introduced in [11]. However, those authors proposed adjusted hypergeometric parameters that account for perfect community optimization, as more fully described in their publication. The approach in the present paper uses the observation that, in practice, communities are rarely perfectly optimized. In fact, exact modularity optimization is exponentially complex and computationally infeasible on networks with any more than a few hundred nodes [12]. Furthermore, the modularity maximization surface is glassy, with many local optima extremely close to the true maximum [13]. This suggests that for a locally optimized, yet truly false-positive, community, the distribution of worst nodes can be adequately described by the simpler procedure outlined above. Note that the null model described is well-defined for node pairs with multiple edges.
There may be multiple nodes in an optimized community that are spurious, in the sense that moving them to another community would not significantly change the quality score of the overall partition [13]. Therefore, instead of a single worst node, a "worst set" of nodes may be a more robust test subject for determining significance. To test a worst set of nodes, the FOCS method computes f(C), removes the worst node, re-computes f, and so on until a given proportion p of nodes has been tested. The pseudocode for FOCS is given in Algorithm 1. In practice, to resolve computational issues arising from the lack of continuity in f, the cumulative probabilities given by Equation 1 are sampled around their observed values, and the median FOCS score arising from these samples is used. Also, note that the test-set proportion p is a free parameter in the method. Setting p < 0.5 is safest, as testing the "best" or most interior nodes of an optimized community may lead to spuriously low values of f, even under the null, since the community has been optimized. In our simulations and real data applications, a globally applied setting of p = 0.25 appears to perform well.

Algorithm 1 FOCS
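The iterative worst-set procedure described above can be sketched as follows. This is a hedged outline, not the published pseudocode: `score_fn` is a hypothetical callback returning the pair (f(C), worst node) for the current community, and aggregating the score trajectory by its minimum is one plausible choice.

```python
def focs(community, score_fn, p=0.25):
    """Iterative FOCS sketch: score, drop the worst node, repeat until a
    proportion p of the community's nodes has been tested."""
    C = set(community)
    n_test = max(1, int(p * len(C)))  # size of the "worst set" to test
    scores = []
    for _ in range(n_test):
        f_val, worst = score_fn(C)
        scores.append(f_val)
        C.discard(worst)  # remove the current worst node and re-score
    # aggregate the trajectory; the minimum is used here for illustration
    return min(scores)
```

With p = 0.25, a community of eight nodes has its two worst nodes tested, which matches the default setting used in the experiments below.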
The FOCS algorithm has multiple practical benefits. First, it is simple to implement and fast to compute. Second, testing multiple worst nodes is beneficial when there are ground-truth communities in the network. As mentioned above, modularity optimization is necessarily local, and thus even real communities may be contaminated with noise nodes.
Using FOCS helps to bypass noise nodes in a real community, increasing detection power.

A. Extension to bipartite and directed networks
The unipartite null model and the FOCS algorithm can be naturally extended to bipartite networks. The node set of a bipartite network is divided into two sides U and V such that each u ∈ U can form edges only with nodes in V, and vice versa. Consider a candidate bipartite community C = (C_U, C_V), and an exterior node u ∈ C'_U := U \ C_U. In the bipartite null, analogously to the unipartite model, all outgoing edges from C and all edges between C'_U and C'_V := V \ C_V are broken, and the edge stubs coming from u are re-assigned without replacement. In this setting, the degree of u in C_V has the hypergeometric pmf

P(d̃_u(C_V) = k) = binom(K, k) binom(N − K, d_u − k) / binom(N, d_u), with K := d_{C_V}(C'_U) and N := d_{C_V}(C'_U) + d_{C'_V},   (3)

since u's d_u stubs are matched to the pool of free V-side stubs, of which K lie in C_V. The edge-breaking bipartite null model which produces this distribution is illustrated in Figure 1. Using (3) instead of (1), the rest of the FOCS approach follows unchanged, with order-statistic quantiles computed with respect to the union C'_U ∪ C'_V. As for directed networks, the use of FOCS depends on the type of community optimization, that is, whether in-degree, out-degree, or joint in-out-degree communities are being optimized. Each of these cases results in a pmf similar to those for the undirected unipartite and bipartite cases, and can be used straightforwardly within the general iterative algorithm given above.
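The bipartite null pmf is again a standard hypergeometric, so it can be evaluated directly; the sketch below uses the parameterization reconstructed in Equation (3). The argument names are illustrative, not from the paper's code.

```python
from scipy.stats import hypergeom

def bipartite_null_pmf(k, d_u, free_stubs_in_CV, d_CV_complement):
    """P(d~_u(C_V) = k) for an exterior U-side node u: u's d_u stubs are
    drawn without replacement from the free V-side stub pool.

    free_stubs_in_CV: stubs of C_V freed by breaking edges to C'_U,
                      i.e. d_{C_V}(C'_U)
    d_CV_complement : total degree of V \ C_V (all of its edges broken)
    """
    N = free_stubs_in_CV + d_CV_complement  # total free V-side stubs
    return hypergeom.pmf(k, N, free_stubs_in_CV, d_u)
```

As with the unipartite case, the survival function of this distribution supplies g_P(u, C), and the rest of the FOCS iteration is unchanged.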

III. SIMULATIONS
This section presents simulation results which compare the significance scores of FOCS and existing methods, on communities from both null networks and networks with communities. In all cases, the QS-Test and B-Score methods were run with default parameter settings (as presented in the associated papers and code manuals), and FOCS was run with p = 0.25.

A. Null Networks
The first simulation experiment involves networks distributed according to the configuration model. Each network had 100 nodes, and the degree distribution was generated by a power law with exponent −2 on the range [10, 50]. The total number of simulation repetitions was 1,000. At each repetition, the Louvain algorithm for modularity maximization was run [14], and a community for scoring was chosen uniformly at random from the communities in the partition containing more than two nodes. The resulting score distributions (Figure 2) show that the QS-Test is anti-conservative on null networks. In other words, applying the QS-Test with a significance cut-off of α = 0.05 to a given community will yield a false-positive probability greater than α. An explanation for this behavior is not obvious, as the method performs simulations directly from a null model. The error may be due to poor interaction of the quality function's kernel density estimator with the null model. In contrast, the FOCS and B-Score methods are conservative for α ≤ 0.05.
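One repetition of this null experiment can be sketched compactly. This is a hedged illustration of the setup described above, not the experiment's actual code; the inverse-CDF power-law sampler and the networkx Louvain call are our choices.

```python
import random

import networkx as nx
import numpy as np
from networkx.algorithms.community import louvain_communities

def null_repetition(n=100, dmin=10, dmax=50, gamma=2.0, seed=0):
    """Build a configuration-model null network with power-law degrees,
    run Louvain, and return a random community with more than two nodes."""
    rng = np.random.default_rng(seed)
    # inverse-CDF sample of a truncated power law with exponent -gamma
    u = rng.random(n)
    a, b = dmin ** (1 - gamma), dmax ** (1 - gamma)
    degs = [int(x) for x in np.floor((a + u * (b - a)) ** (1 / (1 - gamma)))]
    if sum(degs) % 2:  # the configuration model needs an even stub total
        degs[0] += 1
    G = nx.Graph(nx.configuration_model(degs, seed=seed))
    G.remove_edges_from(nx.selfloop_edges(G))
    parts = louvain_communities(G, seed=seed)
    candidates = [c for c in parts if len(c) > 2]
    return random.Random(seed).choice(candidates)
```

Scoring the returned community with each method, over many repetitions, yields the null score distributions compared in Figure 2.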

B. LFR Networks
The second simulation experiment involves community-laden networks generated by the LFR model [15], which helps assess the detection power of each method. The central parameter of this model is µ ∈ [0, 1], which controls the average proportion of out-edges of each community. If µ is 1, all edges from each node point outside the node's community, and if µ is 0, all communities are externally disconnected. Other parameters of the model control the distribution of community sizes and the degree distribution. In this experiment, four LFR network settings are tested: "small" networks with n = 1,000 vs. "large" networks with n = 5,000, and "small" communities with sizes in [10, 50] vs. "large" communities with sizes in [20, 100]. Note that all these networks are tiny by today's industry standard, but QS-Test and B-Score are prohibitively slow on networks beyond this order of magnitude. In each setting, five LFR networks were simulated at each µ on an even grid, and the average significance scores for each method were computed across the ground-truth communities from all five repetitions. These average curves, plotted on the −log10 scale so that larger values imply greater significance, are displayed in Figure 3. Figure 3 shows that the detection power of the methods varies with both network size and community size. On small networks with small communities, FOCS is the dominant method. On small networks with large communities, FOCS is comparable to B-Score, while QS-Test outperforms both these approaches. On large networks, FOCS is the dominant method, and surprisingly, QS-Test loses much of its detection power.

IV. PERFORMANCE ON STANDARD REAL-WORLD DATASETS
This section presents results from FOCS, B-Score, and QS-Test on real-world datasets commonly used in the networks literature. The datasets used were obtained from the open-access data repository KONECT [16] and through links provided at Dr. Mark Newman's website (http://www-personal.umich.edu/~mejn/netdata/), and were chosen so that these results could be compared to those from [10]. The data sets are listed and briefly described in Table I, and Table II provides some of their quantitative characteristics.

Network name        Description
zachary [17]        social ties between karate club members
dolphins [18]       interaction ties between dolphins
moreno lesmis [19]  character co-appearance network from Les Miserables
enron [20]          email network from ENRON data
netscience [21]     collaboration ties between graph researchers
polblogs [22]       hyperlinks between internet political blogs
airports [16]       flight network between U.S. airports
moreno propro [23]  protein interaction network in yeast
chess [16]          chess player game network
astro-ph [24]       collaboration ties between physics researchers
internet [16]       autonomous systems connections network

TABLE I: Description of real-world benchmark datasets.

A. Detection rates on real data
To compare the methods (FOCS, QS-Test, and B-Score) on a particular data set, first the Louvain algorithm was run on the network. Each method was run on each community in the resulting partition, and the proportion of communities with a significance score below 0.05 is shown in Table II. Two patterns from the simulation study are reflected in these results. First, FOCS detection rates are more correlated with those from B-Score than with those from QS-Test. Second, QS-Test detection rates are much lower on large networks, with the exception of the internet data set, which may be due to the fact that that network had relatively larger communities. These observations suggest that on real data, the methods perform similarly to how they did in the simulation study. Note that in this experiment, higher detection rates do not necessarily imply better performance. For instance, the FOCS method declared two communities significant on the political blogs data set, whereas B-Score declared four. However, the two communities FOCS found significant were the large communities corresponding to (respectively) liberal and conservative sentiments. Other smaller, less-focused communities were ignored, which is a reasonable result.
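The detection-rate summary reported in Table II is simply the proportion of a partition's communities scored below the 0.05 cutoff, which can be stated directly:

```python
def detection_rate(scores, alpha=0.05):
    """Proportion of communities whose significance score falls below
    the cutoff alpha (0.05 throughout this section)."""
    return sum(s < alpha for s in scores) / len(scores)
```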

B. Stability and runtime
On some representative small-to-medium-sized real data sets, each method's significance score computation was repeated 30 times with different seeds, for the purposes of measuring (i) numerical stability and (ii) runtime. The larger data sets were not included in this study, as the runtimes for QS-Test and B-Score on these data sets were prohibitively slow.
Numerical stability was measured because each method (including FOCS, as described in Section II) has randomized steps in its algorithm. The metric used to measure stability on a fixed community and network is the coefficient of variation (CV) of the significance score across multiple runs of the algorithm. A low CV score implies that the randomized parts of the method being tested did not drastically affect the significance scores on the particular community. Figure 4 shows the distribution of CV scores (via boxplots) across communities, within each data set. The results show that each method had the best numerical stability on some data set. Interestingly, however, the FOCS CV metrics were by far the most consistent, which suggests that, in contrast to the other methods, the expected numerical stability of FOCS scores does not depend on either the particular community or the particular data set, which is desirable.
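The stability metric just described is the standard coefficient of variation, computed per community over repeated runs:

```python
import statistics

def cv(scores):
    """Coefficient of variation (sample std. dev. / mean) of one
    community's significance scores across repeated runs."""
    m = statistics.mean(scores)
    return statistics.stdev(scores) / m if m else float("inf")
```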
The stability and runtime analyses were performed on 2.20 GHz Intel(R) Xeon(R) CPU E7-8890 processors, and the QS-Test computations were distributed across 24 processors, using parallelization options provided with the authors' package (see https://github.com/skojaku/qstest).Computations for B-Score and FOCS methods were not parallelized.
Table III gives the mean and standard deviation of runtimes of each method, over the computation repetitions. Note that each runtime (out of thirty runtimes) is the sum of the runtimes from each individual community. On all data sets, FOCS achieved the lowest average runtime compared with the other methods, often by two or three orders of magnitude.
FIG. 4: Boxplots of score coefficient of variations across communities, by method and dataset.

V. APPLICATION TO IMDB DATA
This section describes an application of FOCS to a regularly updated IMDB database (https://datasets.imdbws.com). To display FOCS's handling of diverse network types, a bipartite actor-movie network was constructed from the data. The existing community scoring methods discussed in this paper were not included in this application, because they do not handle bipartite graphs. The movie set was restricted to those released in the US with more than 100 ratings on IMDB, and the actor set was restricted to those with at least one movie from this set. Note that writers and directors were also included as "actors". The resulting network had 37,611 movies, 151,571 actors, and 362,850 edges.
To find optimized communities in the network, we used a simplified, low-cost-exploration simulated annealing algorithm, in the style of the standard method first described in [25].
Given a community score function and an initial community, the algorithm computes, for each node in the network, the increase in the score possible by moving the node in or out of the community. Ignoring nodes with negative increase, it chooses a node to move with probability proportional to the increase. The algorithm terminates after no score increases are possible, or after a pre-specified number of iterations. The community score function we use is related to bipartite modularity [26], and takes the square-root-scaled form

S(C) := ( d_{C_U}(C_V) − d_{C_U} d_{C_V} / m ) / √( d_{C_U} d_{C_V} / m ),

where C = (C_U, C_V) is a candidate actor-movie community and m := Σ_{u∈V_actors} d_u = Σ_{v∈V_movies} d_v. The square-root scaling ensures that trivial increases in the score are ignored; further theoretical motivation for this type of scaling is discussed in [7].
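The search loop described above can be sketched as follows. This is an illustrative outline, not the experiment's code: `gain(node, community)` is a hypothetical callback returning the score increase from toggling that node's membership, standing in for the bipartite-modularity score in the text.

```python
import random

def optimize_community(nodes, init, gain, max_iter=1000, seed=0):
    """Low-cost-exploration search sketch: repeatedly move one node,
    chosen with probability proportional to its positive score gain."""
    rng = random.Random(seed)
    C = set(init)
    for _ in range(max_iter):
        positive = {u: g for u in nodes if (g := gain(u, C)) > 0}
        if not positive:
            break  # no improving move remains: local maximum reached
        # sample a move with probability proportional to its gain
        r, acc = rng.random() * sum(positive.values()), 0.0
        for u, g in positive.items():
            acc += g
            if acc >= r:
                C ^= {u}  # toggle u's membership in the community
                break
    return C
```

Run from many random initializations, a procedure of this shape produces the large collection of candidate communities scored below.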
The algorithm described above produced 29,223 communities in the filtered IMDB bipartite network. Each community was scored with FOCS and ranked by decreasing −log10 FOCS score. The extracted bipartite communities with the highest log-scores were those with movie sets that had persistent involvement from all actors in the actor set. Therefore, many of the top-ranked communities featured well-known movie series or collections and their directors, lead writers, and lead actors. Since the null model used by FOCS involves global re-assignment of edge stubs, it makes sense that focused, persistent activity by groups of actors across related films would receive the highest significance scores. In other words, a movie series with a consistent cast should seem most anomalous with respect to the conditional configuration model. A sample of the top-ranked communities is shown in Table IV. To quantitatively assess the true intra-relatedness of each community, the Jaccard similarity between the movie set and the union of the sets of movies that each actor is "known for", according to the IMDB metadata, was computed. This score correlated highly with FOCS scores. In particular, the median Jaccard similarities were increasingly large as ranges of the FOCS scores decreased on a quasi-logarithmic scale (see Figure 5). This shows that, in this application, the community ranking and threshold provided by FOCS aligned with ground-truth signal.
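The validation statistic used here is the plain Jaccard similarity between two sets (the community's movie set and the union of its actors' "known-for" movies):

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two sets; defined as
    0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0
```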
Interestingly, the majority of communities produced by the node-swapping algorithm were not significant. Communities with large FOCS scores exhibited much less internal coherence: many contained mostly unrelated movies with few actor overlaps.
Simply put, these communities were poor local maxima of the algorithm. It is a particular utility of a method like FOCS that it can distinguish between these communities and meaningful, strongly connected communities.

VI. DISCUSSION
This paper introduces new models and tests for optimized communities in networks, and presents FOCS, a new algorithm for significance scoring with performance benefits over existing approaches. FOCS uses a core scoring approach that exploits the fact that communities are rarely optimized perfectly, and that weakly connected nodes in communities therefore distribute edges approximately according to a random graph null model. Because of this, FOCS has a simplicity that previous methods lack, making it more scalable, more numerically stable, and more generalizable. Despite its simplicity and speed, FOCS performs ahead of or comparably to preceding methods in terms of a reduced tendency for false positives and reduced significance scores on true communities. On a large-scale bipartite movie-actor network derived from IMDB data, the highest FOCS-ranked communities produced by an extraction method were those with highly related movie sets sharing continuous involvement from a dedicated cast and crew. This suggests that FOCS can be useful in detecting communities exhibiting anomalous, persistent involvement from their constituents.
The FOCS method has some limitations. First, as with the existing methods, FOCS uses resampling methods in its computation. Additionally, FOCS is based on arguably plausible yet non-rigorous ideas about the distribution of nodes in optimized communities.
Therefore, FOCS is not an exact statistical test, and its results should be reported with these caveats. It should be noted that existing methods also rely on approximations, which is often necessary when dealing with the intractable distributions presented by graph models.
Finally, the simulations on null networks in Section III A showed that FOCS may be overly conservative. This means there may be headroom to improve FOCS by making it less conservative and more powerful, which is an area for future research.
Despite these limitations, the FOCS method appears to improve greatly on the existing options for scoring the significance of individual communities. Given its scalability and straightforward implementation, it can be readily used in real-time anomaly detection, machine learning pipelines, and scientific studies. The basic implementation of the FOCS method used in the experiments discussed in this paper can be found at https://github.com/google/fast-optimized-community-significance, and the pipeline of experiments can be reproduced with code at https://github.com/jpalowitch/focs_experiments.
In the following sections, denote the degree of u ∈ V by d_u := Σ_{v∈V} A(u, v). Let C ⊆ V denote any node subset. With a slight abuse of notation, let d_C := Σ_{u∈C} d_u be the total degree of a subset. Analogously, d_u(C) := Σ_{v∈C} A(u, v), and d_C(C') := Σ_{u∈C} d_u(C'), where C' := V \ C. In general, the notation d_a(B) can be read and understood as "the degree of a in set B". Note that for undirected networks, d_C(C') = d_{C'}(C) for any C ⊆ V.
where d̃_w(C) is the random version of d_w(C) with respect to P. Define the worst node by w := arg max_{u∈C} g_P(u, C), i.e., the in-community node whose observed in-degree is least extreme under the null. Then, among the statistics {d_v(C) : v ∈ {w} ∪ C'}, the random variable d̃_w(C) should be treated as the maximum order statistic. Thus it is possible to test the significance of C by comparing g_P(w, C) to the distribution of the minimum of |C'| + 1 uniform random variables from the unit line. Writing the CDF of this minimum as F_m yields the significance score f(C) used by FOCS.

FIG. 1 :
FIG. 1: Illustration of the bipartite null model for the FOCS score. Circles and squares represent U nodes and V nodes, respectively. Blue nodes are the community to be scored. The red circle indicates an arbitrary node u which will have its edges re-assigned under the null.

Figure 2
Figure 2 shows the −log10-scale distribution of significance scores from the three methods, plotted against the grid of uniform quantiles that would be expected in a perfectly null distribution of scores. Purple dotted lines show the standard 0.05 significance cutoff on the −log10 scale. Therefore, the top-left quadrant formed by the purple dotted lines is the region in which observed scores would indicate significance but uniform-generated scores would not. The bottom-right quadrant is vice versa. The figure suggests that the QS-Test is anti-conservative in this null setting.

FIG. 3 :
FIG. 3: Results from the three methods on the four tested LFR settings. Flat lines span regions where raw scores went below machine precision.

FIG. 5 :
FIG. 5: Distribution of Jaccard similarities between cluster movie sets and actor "known-for" movie sets, across clusters, within ranges of FOCS scores. x-axis labels display the upper endpoint of the range, which extends back to the previous (left) upper endpoint. The lowest range extends to zero.

TABLE II :
Summary numbers for the considered data sets: number of nodes, number of edges, number of communities found by the Louvain algorithm, and the proportion of communities found significant (score < 0.05) by each method.

TABLE III :
Average runtime in seconds of methods across 30 repetitions.QS-Test computations were distributed across 24 machines.

TABLE IV :
Some top-ranked IMDB bipartite communities with well-known titles, ordered by FOCS score. Omitted communities exhibited the same pattern of persistent movie-series themes and actor participation.