A generalised significance test for individual communities in networks

Many empirical networks have community structure, in which nodes are densely interconnected within each community (i.e., a group of nodes) and sparsely across different communities. Like other local and meso-scale structures of networks, communities are generally heterogeneous in various aspects such as the size, density of edges, connectivity to other communities and significance. In the present study, we propose a method to statistically test the significance of individual communities in a given network. Compared to the previous methods, the present algorithm is unique in that it accepts different community-detection algorithms and the corresponding quality function for single communities. The present method requires that the quality of each community can be quantified and that community detection is performed as optimisation of such a quality function summed over the communities. Various community-detection algorithms, including modularity maximisation and graph partitioning, meet this criterion. Our method estimates a distribution of the quality function for randomised networks to calculate a likelihood of each community in the given network. We illustrate our algorithm with synthetic and empirical networks.


Supplementary Information: A generalised significance test for individual communities in networks
Sadamori Kojaku and Naoki Masuda

I. p-VALUE UNDER THE NULL MODEL
In our statistical test, the p-value is given by $p = F_{\tilde{q}}(q_c) = P(\tilde{q} \geq q_c \mid s_c)$, where $q_c$ is the quality of a focal community $c$, $\tilde{q}$ is the quality of a community detected in the randomised networks and $s_c$ is the size of the focal community. Function $F_{\tilde{q}}(q_c)$ is the cumulative probability of $\tilde{q}$ over $[q_c, \infty)$. In general, the cumulative probability $F_X(Y)$ for continuous random variables $X$ and $Y$ obeys the uniform distribution over $[0, 1]$ if $Y$ obeys the same probability distribution as $X$ [1]. Therefore, under the null model, where $q_c$ and $\tilde{q}$ obey the same probability distribution, the p-value obeys the uniform distribution over $[0, 1]$.
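The uniformity of the p-value under the null model can be checked numerically. The following is a minimal sketch, not the estimator used in the main text: it assumes that the null quality follows a standard normal distribution (an arbitrary choice made only for illustration), ignores the conditioning on community size, and estimates the upper-tail probability as the fraction of null samples that are at least as large as the focal quality. The function names are hypothetical.

```python
import random


def upper_tail_p(q_c, null_samples):
    """Fraction of null samples whose quality is at least q_c."""
    return sum(1 for q in null_samples if q >= q_c) / len(null_samples)


def simulate_null_p_values(n_tests=2000, n_null=500, seed=0):
    """Draw the focal quality from the same distribution as the null samples
    and return the resulting p-values (assumption: standard normal qualities)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        null_samples = [rng.gauss(0.0, 1.0) for _ in range(n_null)]
        q_c = rng.gauss(0.0, 1.0)  # focal quality obeys the null distribution
        p_values.append(upper_tail_p(q_c, null_samples))
    return p_values


def histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts


p_values = simulate_null_p_values()
counts = histogram(p_values)
print(counts)  # roughly equal counts per bin are expected under the null model
```

If the focal quality were drawn from a distribution shifted to larger values, the histogram would instead concentrate near zero, which is the situation in which a community is judged significant.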

RANDOMISED NETWORKS
In this section, we examine the robustness of the statistical results with respect to the number of generated randomised networks. We use the 12 empirical networks used in the main text, which consist of different numbers of nodes and communities. For each community $c$, we compute the p-value using $R$ randomised networks, denoted by $p_c^{[R]}$. In the main text, we set $R = 500$. Then, we generate another $R$ randomised networks and compute the p-value using all $2R$ randomised networks, denoted by $p_c^{[2R]}$, for each community $c$. We measure the quality and the size of a community using $q_c^{\mathrm{mod}}$ and $\mathrm{vol}_c$, respectively. We use the Louvain algorithm to detect communities in the randomised networks.
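The comparison between the p-value obtained with R randomised networks and that obtained with 2R randomised networks can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the main text: the randomised networks and the Louvain step are replaced by a hypothetical generator of (quality, size) pairs, and the conditioning on size is approximated by keeping only null communities whose size lies within a relative tolerance of the focal size. All names and parameter values are hypothetical.

```python
import random


def conditional_p_value(q_c, s_c, null_communities, rel_tol=0.5):
    """Empirical P(quality >= q_c) among null communities of similar size.

    null_communities is a list of (quality, size) pairs. The size window
    |size - s_c| <= rel_tol * s_c is an assumption made for this sketch.
    Returns None if no null community falls inside the window.
    """
    similar = [q for q, s in null_communities if abs(s - s_c) <= rel_tol * s_c]
    if not similar:
        return None
    return sum(1 for q in similar if q >= q_c) / len(similar)


def fake_randomised_network(rng, n_communities=5):
    """Hypothetical stand-in for 'randomise the network and detect communities'.

    Returns synthetic (quality, size) pairs rather than real communities.
    """
    communities = []
    for _ in range(n_communities):
        size = rng.randint(10, 100)
        quality = rng.gauss(0.01 * size, 0.5)
        communities.append((quality, size))
    return communities


rng = random.Random(1)
R = 500
first_batch = [c for _ in range(R) for c in fake_randomised_network(rng)]
second_batch = [c for _ in range(R) for c in fake_randomised_network(rng)]

q_c, s_c = 1.5, 50  # quality and size of a hypothetical focal community
p_R = conditional_p_value(q_c, s_c, first_batch)
p_2R = conditional_p_value(q_c, s_c, first_batch + second_batch)
variation = abs(p_R - p_2R)
print(p_R, p_2R, variation)
```

In a real analysis, `fake_randomised_network` would be replaced by the generation of a randomised network followed by community detection, and the quality and size would be computed with the measures chosen for the test.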
The p-value computed with 500 randomised networks is close to that computed with 1,000 randomised networks for most communities (Fig. S1(a)). The Pearson correlation coefficient, denoted by r, between the p-value with 500 networks and that with 1,000 networks is equal to r = 0.999. Additionally, the p-value with R = 1,000 is smaller than that with R = 500 for most communities, which indicates that the present statistical test is conservative when R is small. Therefore, with R = 500 employed in the main text, which is relatively small, we are not overestimating the significance of the detected communities.
A large network or a large community may require many randomised networks, R, for a reliable estimation of the p-value. To examine this possibility, we plot the variation in the p-value, defined by $|p_c^{[R]} - p_c^{[2R]}|$, for each community $c$ in Fig. S1(c). The variation in the p-value tends to be small for large communities, although the correlation between the variation and $\mathrm{vol}_c$ is weak (r = −0.144). This negative, albeit weak, correlation suggests that a larger community requires a smaller value of R, which encourages the application of our statistical test to networks larger than those examined in the present article. Finally, we examine the robustness of the p-value with respect to the number of nodes in the network. To this end, for each empirical network, we average the variation, $|p_c^{[R]} - p_c^{[2R]}|$, over all communities in the network. The averaged variation is not strongly correlated with the number of nodes in the networks (Fig. S1(e); r = 0.087).
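The quantities used in this robustness analysis (the variation in the p-value per community, its Pearson correlation with another variable, and its average over the communities of each network) can be computed as in the following minimal sketch. The function names and the input layout are hypothetical, and the numbers in the usage example are made up for illustration; they are not results from the article.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def variations(p_with_R, p_with_2R):
    """Absolute difference between the two p-values of each community."""
    return [abs(a - b) for a, b in zip(p_with_R, p_with_2R)]


def mean_variation_per_network(records):
    """Average the variation over the communities of each network.

    records is a list of (network_name, p_with_R, p_with_2R) tuples,
    one tuple per community.
    """
    totals = {}
    for name, a, b in records:
        total, count = totals.get(name, (0.0, 0))
        totals[name] = (total + abs(a - b), count + 1)
    return {name: total / count for name, (total, count) in totals.items()}


# Made-up illustrative values for three communities in two networks.
records = [("net_a", 0.10, 0.12), ("net_a", 0.30, 0.30), ("net_b", 0.50, 0.45)]
volumes = [40, 120, 75]  # hypothetical community volumes
var = variations([r[1] for r in records], [r[2] for r in records])
print(pearson(var, volumes))
print(mean_variation_per_network(records))
```

The same `pearson` function can be applied to the per-network averages and the numbers of nodes to examine the dependence on network size.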