Completeness of Community Structure in Networks

By defining a new measure of community structure, the exclusive modularity, and building on the cavity method of statistical physics, we develop a mathematically principled method to determine the completeness of community structure, that is, whether a partition, either annotated by experts or given by a community-detection algorithm, carries complete information about the community structure in the network. Our results demonstrate that the expert partition is surprisingly incomplete in some networks, such as the famous political blogs network, indicating that the relation between meta-data and community structure in real-world networks needs to be re-examined. As a byproduct, we find that the exclusive modularity, which introduces a null model based on the degree-corrected stochastic block model, is of independent interest. We discuss its applications as principled ways of detecting hidden structures, finding hierarchical structures without removing edges, and obtaining low-dimensional embeddings of networks.

as political blogs network, our algorithm reports that the expert partition is incomplete, showing that there is other information hidden by the expert partition.
Our method can be used to detect such hidden partitions and, further, the hierarchies that are common in real-world networks. A commonly adopted top-down way of finding hierarchies is to iteratively detect sub-communities in the subgraphs composed of nodes that belong to the same, already-found community. This approach has to remove the edges connecting different subgraphs, and hence inevitably discards some information about the structure. We show that by applying our method recursively in a combination-exclusion way, we obtain a principled method of finding hierarchies without removing the edges connecting different communities.

Results
Completeness of the planted partition in the stochastic block model. Our main assumption is that if the given partition {t} contains complete information about the community structure in the graph, there should be no other community structure that is detectable after {t} is excluded. The existence of such a community structure is characterized by the existence of the retrieval state, as we define in the Method section. A straightforward example to illustrate this is the stochastic block model (SBM), a popular ensemble of networks with community structure. In the SBM, a partition {t} with q groups is planted as the ground-truth community structure, and edges are generated independently according to a q × q matrix P. Usually we consider the case where P has two distinct entries, $P_{\mathrm{in}}$ on the diagonal and $P_{\mathrm{out}}$ off the diagonal. Recently, a detectability transition 14 has been discovered, below which no algorithm can detect any finite amount of information about the planted partition. It has been shown in ref. 10 that the belief propagation algorithm, with the Hamiltonian given by the negative classic modularity, works all the way down to the detectability transition of the SBM, as the retrieval phase representing the planted partition always exists in the detectable phase of the phase diagram. Thus, as a sanity check of our assumption, excluding the planted partition should remove the retrieval phase completely from the phase diagram, reflecting that there are no other statistically significant structures in the SBM except the excluded one.
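The SBM ensemble described above can be sketched in a few lines. This is a minimal illustration, not the generator used for the figures: the function name `sample_sbm`, the uniform random assignment of planted labels, and the parameter values in the usage line are assumptions made for the example.

```python
import numpy as np

def sample_sbm(n, q, p_in, p_out, seed=0):
    """Sample an undirected SBM graph; returns (adjacency matrix, planted labels)."""
    rng = np.random.default_rng(seed)
    t = rng.integers(0, q, size=n)                    # planted partition {t} (assumed uniform)
    same = t[:, None] == t[None, :]                   # True where t_i == t_j
    probs = np.where(same, p_in, p_out)               # edge probability for every pair
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # draw each pair i < j exactly once
    A = (upper | upper.T).astype(int)                 # symmetrise; diagonal stays zero
    return A, t

# Hypothetical parameter values, chosen only to give a sparse assortative graph.
A, t = sample_sbm(n=200, q=2, p_in=0.10, p_out=0.02)
```

Drawing only the strict upper triangle and mirroring it keeps the graph simple (no self-loops, no multi-edges), which matches the independent-edge description in the text.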
In Fig. 1(a), on a network generated by the SBM, we plot the exclusive modularity (with the planted partition excluded) and the convergence time of belief propagation (BP) (see Eqs (4) and (5)). We can see from the figure that there are just two phases, the paramagnetic phase, defined by zero exclusive modularity, and the spin-glass phase, defined by non-convergence of BP, separated by a transition at $\beta^{*} = \log[q/(\sqrt{\hat{c}} - 1) + 1]$ (see the Method section and Supplementary Information for details on computing this transition), where $\hat{c}$ is the average excess degree. Clearly, no retrieval phase exists in the phase diagram, verifying our analysis that the planted partition is complete in this synthetic network.
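As a worked illustration of the transition point, the sketch below evaluates the inverse temperature under the assumption that it has the form β* = log[q/(√ĉ − 1) + 1] used in ref. 10, and that the average excess degree is computed as ⟨d²⟩/⟨d⟩ − 1 (a standard definition, assumed here). The function names are hypothetical.

```python
import numpy as np

def average_excess_degree(A):
    """Average excess degree <d^2>/<d> - 1 of an undirected graph (assumed definition)."""
    d = np.asarray(A).sum(axis=1)
    return float((d ** 2).mean() / d.mean() - 1.0)

def beta_star(q, c_excess):
    """Assumed transition point beta* = log(q / (sqrt(c_excess) - 1) + 1)."""
    if c_excess <= 1.0:
        raise ValueError("the formula requires an average excess degree above 1")
    return float(np.log(q / (np.sqrt(c_excess) - 1.0) + 1.0))

# Complete graph on 5 nodes: every degree is 4, so the excess degree is 3.
K5 = np.ones((5, 5), dtype=int) - np.eye(5, dtype=int)
c_hat = average_excess_degree(K5)
beta_c = beta_star(q=2, c_excess=4.0)   # sqrt(4) - 1 = 1, so beta* = log(3)
```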
To illustrate further the effects of partition exclusion, we propose the biSBM, a stochastic block model with two planted partitions (detailed in Supplementary Information), and investigate whether our method can detect the second partition when the first partition is excluded. Analogous to the standard SBM, the biSBM uses the parameter $\varepsilon = p_{\mathrm{out}}/p_{\mathrm{in}}$ to adjust the relative strength of the community structure. In addition to ε, the biSBM introduces another parameter α to control the ratio of edges that belong to the two community structures. Clearly, a single partition is incomplete in describing the whole community structure of such a network, and excluding either partition leaves a retrieval phase representing the other community structure in the phase diagram.
Figure 1. ... subgroups excluded. In these figures P, R, and SG denote the paramagnetic, retrieval and spin-glass phases, respectively. (c) The overlap (fraction of correctly reconstructed labels) of the two planted partitions given by BP. Here c = 20 and α = 0.7. We average over 100 instances of networks generated by the biSBM with size N = 1000. The detectability thresholds $\varepsilon^{*}_{\mathrm{fir}}$ and $\varepsilon^{*}_{\mathrm{sec}}$ are theoretical results.
On a network generated by the biSBM, the exclusive modularity (with the first partition excluded) and the convergence time are plotted as a function of β in Fig. 1(b). We can see from the figure that between the paramagnetic and spin-glass phases there is a retrieval phase, where BP converges to a fixed point with large exclusive modularity that is correlated with the second partition. Thanks to the locally tree-like structure of the graph, we claim that our algorithm for finding the second partition is optimal, both in accuracy and in its ability to detect the second partition better than a random guess all the way down to the detectability transition (see SI). In Fig. 1(c), the accuracy of detecting the second partition (measured by the overlap, the fraction of correctly reconstructed labels) is plotted against the parameter ε. We can see that BP works all the way down to the detectability transition for both partitions. In other words, our results indicate that our algorithm does not throw away any information about the second partition when excluding the first one.
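The overlap used as the accuracy measure can be sketched as follows. The text defines it as the fraction of correctly reconstructed labels; maximizing over relabelings of the groups is an assumption added here, since group labels are only defined up to a permutation. The function name `overlap` is hypothetical, and the brute-force search is only practical for small q.

```python
from itertools import permutations
import numpy as np

def overlap(true_labels, inferred_labels, q):
    """Fraction of correctly reconstructed labels, maximised over label permutations."""
    true_labels = np.asarray(true_labels)
    inferred_labels = np.asarray(inferred_labels)
    best = 0.0
    for perm in permutations(range(q)):
        relabeled = np.array(perm)[inferred_labels]   # rename inferred group k to perm[k]
        best = max(best, float(np.mean(relabeled == true_labels)))
    return best

perfect = overlap([0, 0, 1, 1], [1, 1, 0, 0], q=2)    # same split, swapped names
chance = overlap([0, 0, 1, 1], [0, 1, 0, 1], q=2)     # uncorrelated split
```

With q = 2 and balanced groups, an uncorrelated partition scores 0.5 under this unnormalized definition, which is the random-guess baseline mentioned in the text.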
Completeness of expert partition in real-world networks. For many real-world networks, for example the famous karate club network 19 and the political blogs network 20 , annotations (i.e. expert partitions) are believed to reflect the true community structure of the network, and hence have been widely used in validating community detection algorithms. However, little attention has been paid to the validity of the expert partitions themselves. In this section we use our method to investigate the completeness of expert partitions in representing community structure in real-world networks.
Our results are shown in Fig. 1(d)-(f). In the karate club network, we find that the retrieval phase is completely absent when the expert partition is excluded, which indicates that the expert partition is complete. In the political blogs network, Fig. 1(e), we find that the retrieval phase is still present after the expert partition is excluded. This means that the expert partition, which separates liberals from conservatives, is incomplete in describing the underlying community structure. We note that there are already studies on this network 10,21 , reporting that in addition to the liberals-conservatives division, there are more groups forming a hierarchical structure with a 4-layer dendrogram. To check the completeness of these results, rather than using the expert partition, we take the 6 groups in the top 2 levels of the dendrogram found in ref. 10 as the excluded partition. Figure 1(f) shows that the retrieval phase disappears, hence there is no hidden community structure. This indicates that the partition with 6 groups contains complete information about the underlying community structure. It has been reported 22 that many more than 6 groups can be found in the political blogs network using other techniques. The reason that we find 6 groups to be sufficient may be that our current approach focuses only on assortative structures, using a positive modularity. A simple extension of our approach to study disassortative structures is to run BP with a negative β.
Another network we examine is a network of school students drawn from the US National Longitudinal Study of Adolescent to Adult Health, which consists of students of a high school (US grades 9 to 12) and its feeder middle school (grades 7 and 8). In the dataset, the network structure is accompanied by annotations of gender, ethnicity and school grade. We find that the retrieval phase is present after the partition given by the annotations (the combination of the three properties) is excluded. This means that the information in the annotations is not enough to characterize the community structure of the network.
Other applications of the exclusive modularity. It is straightforward to see that our method not only determines the completeness of a partition {t}, but also gives a new partition {g} as a complement to {t}, namely the marginalized partition in the retrieval state, which assigns each node to its most probable group according to the BP marginals. We call this process detecting hidden partitions, with "hidden" reflecting the fact that {g} is essentially covered by {t} and can be detected only when {t} is excluded. In fact, we note that any community detection method (explicitly or implicitly) excludes the partition that puts all nodes into one group, as this "ferromagnetic partition" is a strong and valid solution to the community detection task, but is obviously unwanted.
This exclusion process can naturally be used to embed a network into a low-dimensional space using the BP marginals. We again take the network of school students as an example. On the X-axis of Fig. 2 we plot the marginal probability $\Psi^{i}_{1}$ of node i being in the first group (we use q = 2 groups, so the probability of it being in the other group is simply $1 - \Psi^{i}_{1}$) with no partition excluded (i.e. using the classic modularity). On the Y-axis we plot the marginal probability $\Psi^{i}_{2}$ given by BP with the first detected partition excluded. From the figure one can see that the first detected partition recovers the expert partition (the classification into high and middle school) very precisely: almost all students in middle (high) school are placed on the left (right) side of the vertical boundary (the vertical dashed line at Ψ = 0.5). The second partition, represented by $\Psi_{2}$, recovers the ethnicity very well.
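The embedding step can be sketched as below, assuming the BP marginals of each run are already available as an n × q array. The arrays `psi_first` and `psi_second` are made-up toy values for four nodes, not results from the school network, and the function names are hypothetical.

```python
import numpy as np

def marginalized_partition(psi):
    """Assign each node to its most probable group under the marginals psi (n x q)."""
    return np.argmax(psi, axis=1)

def embed_2d(psi_first, psi_second):
    """2-D coordinates: first-group marginal of each of the two BP runs."""
    return np.column_stack([psi_first[:, 0], psi_second[:, 0]])

# Toy marginals (hypothetical): run 1 without exclusion, run 2 with run 1's partition excluded.
psi_first = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
psi_second = np.array([[0.7, 0.3], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])

coords = embed_2d(psi_first, psi_second)        # one (x, y) point per node
first_split = marginalized_partition(psi_first)
```

Thresholding each coordinate at 0.5 reproduces the two marginalized partitions for q = 2, which is why the vertical line at 0.5 acts as the boundary of the first partition in the plot.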
Low-dimensional embedding, using e.g. spectral methods, is very helpful in understanding and visualizing large networks 23 . However, to the best of our knowledge, our method is the first one that uses the marginals of a message-passing algorithm for embedding. Our method is superior to existing methods in the sense that BP marginals give close-to-optimal detection accuracy in the stochastic block model, while existing ones such as spectral methods can usually be seen as linear approximations to the optimal message-passing ones 13 , and hence are sub-optimal.
The partition we want to exclude, {t}, is not necessarily assortative (as in the planted partition or expert division), but could be of any kind. Once we have found a partition {g} by excluding {t}, we can combine {g} and {t} into a partition {h} = {g}⊗{t} 24 , then find a new partition by running the BP algorithm with {h} excluded. By repeating this combining-excluding procedure we are able to find hierarchical structures layer by layer in the dendrogram. There are many existing algorithms for finding hierarchical structure in a top-down way. The standard procedure is to build subgraphs at each level of the dendrogram from the nodes in the same community of the level above, removing the edges between subgraphs, and then to run community detection on each subgraph. This standard process is slow, as the community detection algorithm has to be run on every subgraph. Moreover, removing edges inevitably discards some information about the community structure. Our method is more efficient than existing top-down methods in finding hierarchies, because each run of BP finds one layer of the dendrogram, and more accurate, because it does not remove any edges.
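The combining step can be sketched as follows, under the assumption that {g}⊗{t} denotes the product partition, in which two nodes share a group only if they share a group in both {g} and {t}. The function name `combine_partitions` is hypothetical.

```python
import numpy as np

def combine_partitions(g, t):
    """Product partition: one new label per distinct (g_i, t_i) pair (assumed meaning of g ⊗ t)."""
    pairs = np.column_stack([g, t])                           # row i holds (g_i, t_i)
    _, h = np.unique(pairs, axis=0, return_inverse=True)      # index of each row's unique pair
    return h.reshape(-1)                                      # flat label vector of length n

h_split = combine_partitions([0, 0, 1, 1], [0, 1, 0, 1])      # two crossing 2-way splits
h_same = combine_partitions([0, 0, 1, 1], [0, 0, 1, 1])       # identical splits
```

Two crossing two-way splits give four groups, while combining a partition with itself leaves it unchanged, so repeated combining can only refine the excluded partition from one layer to the next.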

Discussion
We have presented a method for validating the completeness of a given partition in characterizing community structure in networks. We defined the exclusive modularity by excluding the given partition, and proposed an efficient belief propagation algorithm, with the negative exclusive modularity as the Hamiltonian, to determine whether there is a retrieval phase in the system representing statistically significant community structure. The existence of such a phase implies that the given partition is incomplete.
We applied our method to validate expert partitions in real-world networks. Our results reveal that in some networks, such as the karate club network, the expert partition is complete in describing the community structure, while in other networks, such as the political blogs network, the expert division is incomplete, indicating that there are hidden structures that are ignored by the expert division. We believe our method gives a principled way to examine the relation between the meta-data of networks and their large-scale structures. In addition to completeness validation, we also discussed applications of our method in detecting hidden structures, finding hierarchical structures without removing edges, and obtaining low-dimensional embeddings of networks.
There has been much work on the statistical significance of community structures. This includes examining the likelihood ratio 25 , using the Bethe free energy 14 and adopting the minimum description length 21 . The difference between our method and the others is that it combines the classic measure of community structure, the modularity, with Bayesian statistics. In essence, our algorithm looks for the consensus of many good partitions, rather than a single best one.
A possible extension of our method to structures other than assortative ones, such as core-periphery structures, is to adopt inference of the stochastic block model with a given partition excluded, i.e. treated as a null model. We leave this for future work.

Method
The core idea of this work is to study the statistical mechanics of community detection with a partition {t} excluded. We do this by giving each partition {g} an exclusive modularity Q({g}|{t}) as a measure of community structure,
where $p_{ij}(\{t\})$ is the expected probability of node i connecting to node j in a model described as follows. Suppose i and j belong to groups r and s of partition {t}, with $t_i = r$ and $t_j = s$ respectively. Let $p^{i}_{sr}$ be the probability that node i is the endpoint of one edge randomly chosen from the edges connecting groups r and s. Therefore one has
It is easy to see that the probability $p_{ij}(\{t\})$ is constructed as if {t} were the planted partition of (a variant of) the degree-corrected stochastic block model (DCSBM) 15 . In this sense, the exclusive modularity uses a DCSBM (with the given partition as the planted partition) as a null model, as opposed to the classic modularity 7 , which uses the configuration model as a null model. As a consequence, the classic modularity can be seen as a special case of the exclusive modularity that excludes the all-one vector. It is straightforward to see that Q({t}|{t}) = 0, which gives a simple check that the partition {t} is indeed excluded from our consideration of community structure. For a partition {g}, a larger exclusive modularity Q({g}|{t}) reveals that {g} has more internal edges than the expected number of internal edges in random partitions (except the excluded one, {t}), and hence is more likely to represent the underlying community structure.
As pointed out in ref. 10, directly maximizing the modularity is prone to overfitting, finding partitions with high modularity even in random networks. In this sense, simply maximizing an objective function cannot answer the model selection problem of whether a hidden community structure exists. Thus, we need to add some notion of statistical significance. Here we generalize the method proposed in ref. 10, which uses the belief propagation algorithm to find the consensus of many high-modularity partitions, to our case, which uses the exclusive modularity as the measure of community structure. In more detail, we tackle the problem of determining whether statistically significant communities exist using ideas from the spin-glass theory of statistical physics. We assign to each partition a Hamiltonian equal to the negative exclusive modularity, then give each partition a Gibbs-Boltzmann weight at a finite temperature,
where Z is the partition function and β denotes the inverse temperature. Then we scan the whole range of β to look for a phase that indicates the existence of a significant community structure with partition {t} excluded.
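The two sanity checks stated above, Q({t}|{t}) = 0 and the reduction to the classic modularity when the all-one partition is excluded, can be illustrated with a small sketch. The null model below is an assumption reconstructed from the verbal description only: p_ij = d_i^(s) d_j^(r) / M_rs, where d_i^(s) is the number of neighbors of i in group s and M_rs counts edge endpoints between groups r and s. The normalization, a sum over all ordered pairs divided by 2m, is likewise assumed, and the function names are hypothetical.

```python
import numpy as np

def dcsbm_null(A, t):
    """Assumed DCSBM-style null model p_ij = D[i, t_j] * D[j, t_i] / M[t_i, t_j]."""
    A = np.asarray(A, dtype=float)
    _, idx = np.unique(np.asarray(t), return_inverse=True)
    idx = idx.reshape(-1)                       # group index 0..q-1 of every node
    T = np.eye(idx.max() + 1)[idx]              # one-hot membership matrix, n x q
    D = A @ T                                   # D[i, s]: neighbours of i in group s
    M = T.T @ D                                 # M[r, s]: edge endpoints between r and s
    D_to = D[:, idx]                            # D_to[i, j] = D[i, t_j]
    M_ij = M[np.ix_(idx, idx)]                  # M_ij[i, j] = M[t_i, t_j]
    return np.divide(D_to * D_to.T, M_ij, out=np.zeros_like(A), where=M_ij > 0)

def exclusive_modularity(A, g, t):
    """Q(g|t) = (1/2m) * sum_ij (A_ij - p_ij(t)) * [g_i == g_j] (assumed normalisation)."""
    A = np.asarray(A, dtype=float)
    g = np.asarray(g)
    same = g[:, None] == g[None, :]
    return float(((A - dcsbm_null(A, t)) * same).sum() / A.sum())

# Toy graph: two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
t = [0, 0, 0, 1, 1, 1]
q_self = exclusive_modularity(A, t, t)             # excluded partition scores zero
q_classic = exclusive_modularity(A, t, [0] * 6)    # all-one partition excluded
```

On this toy graph, excluding the all-one partition gives 5/14, the classic modularity of the two-triangle split, while excluding the split itself gives zero up to floating-point rounding.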
Although the above Boltzmann distribution is difficult to solve in general, on sparse graphs we can efficiently compute approximate marginals using BP, which computes the marginal probability of node i being in group h as
Here $Z_{i \to j}$ is the normalization factor and ∂i\j is the set of neighbors of i except j. On a network, we iterate the BP equations, Eq. (5), until convergence, or stop once a given maximum number of iterations is exceeded, then compute the marginals using Eq. (4), the marginalized partition $\{\hat{t}\}$ and its exclusive modularity. Using the convergence time and the exclusive modularity of the partition $\{\hat{t}\}$, we can further separate the system into different phases, as studied in ref. 10. As we have emphasized, we are mostly interested in whether there exists a significant structure with a large exclusive modularity at a certain temperature. This amounts to finding an extremal Gibbs state at a certain temperature where the spin-glass susceptibility does not diverge. On sparse networks the divergence of the susceptibility is characterized by non-convergence of BP. So in practice we just need to look for a retrieval state, where BP converges and the BP messages are non-trivial. A simple way to do this is to scan the whole range of β, or to use a binary search. In fact, as pointed out in ref. 10, one only needs to check the convergence property and the retrieval modularity at a critical temperature β = − + 