On the statistical significance of communities from weighted graphs

Community detection is a fundamental procedure in the analysis of network data. Despite decades of research, there is still no consensus on the definition of a community. To analytically test the realness of a candidate community in weighted networks, we present a general formulation from a significance testing perspective. In this formulation, each edge-weight is modeled as a censored observation, reflecting the noisy characteristics of real networks. In particular, the edge-weights of missing links are incorporated as well: they are set to zero under the assumption that they are truncated or unobserved. The community significance assessment issue is thereby formulated as a two-sample test problem on censored data. More precisely, the Logrank test is employed to conduct the significance testing on two sets of augmented edge-weights: the internal weight set and the external weight set. The presented approach is evaluated on both weighted and un-weighted networks. The experimental results show that our method can outperform prior widely used evaluation metrics on the task of individual community validation.

www.nature.com/scientificreports/

Our formulation addresses the community significance assessment problem from a new angle. In particular, for un-weighted networks, we can reveal some interesting connections between our formulation and the community evaluation metric in Ref. 17. We evaluate our method on both weighted and un-weighted networks. Experiments demonstrate that our method is comparable with prior state-of-the-art metrics on individual community assessment. The rest of this article is organized as follows. In the next section, our formulation is described in detail. Thereafter, the experimental results are presented and some discussions are given in the last two sections.

Method
Notations. Let G = (V, E, W) be a weighted network, where V is the node set, E is the edge set, and W is the set of positive edge-weights. For a given subset of nodes S (S ⊆ V), its induced subgraph G[S] can be regarded as a candidate community. All edges incident on the nodes in S can be divided into two groups: the internal edges of G[S], which connect two nodes from S, and the external edges of G[S], which connect one node from S and one node from V \ S.
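As an illustration, the edge partition above can be sketched in Python (the representation of the graph as (u, v, w) triples and all names here are illustrative, not part of the original method):

```python
# Sketch of the edge partition described above, assuming the graph is
# given as an iterable of weighted undirected edges (u, v, w).
def partition_edges(edges, S):
    """Split all edges incident on nodes in S into internal and external weights."""
    internal, external = [], []
    for u, v, w in edges:
        if u in S and v in S:
            internal.append(w)   # both endpoints inside S: internal edge
        elif u in S or v in S:
            external.append(w)   # exactly one endpoint inside S: external edge
    return internal, external

edges = [(1, 2, 0.9), (2, 3, 0.8), (1, 3, 0.7), (3, 4, 0.2), (4, 5, 0.5)]
S = {1, 2, 3}
w_in, w_out = partition_edges(edges, S)
# w_in == [0.9, 0.8, 0.7]; w_out == [0.2]; the edge (4, 5) touches no node of S
```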
Formulation. To test if a given candidate community in the weighted network is a real one, we present a formulation based on the non-parametric two-sample test for censored data. The main workflow of our method is shown in Fig. 1, which will be elaborated below.
In the first step of our method, we regard each edge-weight as a censored observation. As shown in Fig. 1a, the set of edge-weights is augmented by including the weights of missing links. The edge-weights of these missing links are set to 0 since they are either truncated or unobserved. Then, we can construct two augmented sets of edge-weights: the set of internal edge-weights W_S^in = {w_1^in, w_2^in, ..., w_{|S|(|S|-1)/2}^in} and the set of external edge-weights W_S^out = {w_1^out, w_2^out, ..., w_{|S|(|V|-|S|)}^out}. Let F_in(·) and F_out(·) be the complementary cumulative distribution functions of the internal and external weights, respectively. Then, the community assessment issue can be modeled as a two-sample test problem, in which the hypotheses under consideration are: H_0: F_in(w) = F_out(w) for all w; H_1: F_in(w) > F_out(w) for at least one w, where w ≥ 0 is a non-negative weight. Note that a larger edge-weight indicates a stronger connection between the two corresponding nodes and that both F_in(·) and F_out(·) are complementary cumulative distribution functions. If H_0 is rejected in favor of H_1, then there are more internal edges with positive edge-weights and the internal edges carry larger weights. Hence, the proposed two-sample test is capable of quantifying the realness of a target community in a statistically sound manner.
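A minimal sketch of the augmentation step, assuming the only inputs are the observed weight lists and the sizes |V| and |S| (function and variable names are illustrative):

```python
from math import comb

def augment_weights(w_in, w_out, n_nodes, s_size):
    """Pad observed weight lists with zeros for missing links.

    The internal set is padded to |S|(|S|-1)/2 entries and the external set
    to |S|(|V|-|S|) entries, so that every possible link contributes one
    observation (0 for an absent link, treated as truncated/unobserved).
    """
    n_in_total = comb(s_size, 2)                # all possible internal links
    n_out_total = s_size * (n_nodes - s_size)   # all possible external links
    aug_in = list(w_in) + [0.0] * (n_in_total - len(w_in))
    aug_out = list(w_out) + [0.0] * (n_out_total - len(w_out))
    return aug_in, aug_out

# 5-node graph, |S| = 3: three internal links (all observed), six possible
# external links (one observed), so five zeros are appended to the external set.
aug_in, aug_out = augment_weights([0.9, 0.8, 0.7], [0.2], n_nodes=5, s_size=3)
```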
To solve the above two-sample test problem, many effective methods can be utilized. Here we adopt the Logrank test due to its popularity in censored data analysis. As shown in Fig. 1b, W_S^in and W_S^out are first merged into a new set W_S, and then all edge-weights in W_S are sorted in non-increasing order. For each distinct edge-weight w_i in W_S, we can construct a 2 × 2 table as shown in Fig. 1c. In Fig. 1c, d_i^in, d_i^out and d_i denote the number of internal edges, the number of external edges and the total number of edges in W_S whose edge-weights are equal to w_i. Meanwhile, n_i^in, n_i^out and n_i denote the number of internal edges, the number of external edges and the total number of edges in W_S whose edge-weights are not bigger than w_i.
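The computation can be sketched as follows. This is the standard logrank statistic written in the notation above; treating the zero weights of missing links as censored observations (counted in the risk sets n_i but never as events d_i) is our reading of the formulation, and the function name is illustrative:

```python
from math import sqrt, erf

def logrank_z(aug_in, aug_out):
    """One-sided logrank statistic comparing internal vs. external weights.

    For each distinct positive weight w_i, d_i counts edges whose weight
    equals w_i ("events") and n_i counts edges whose weight is not bigger
    than w_i (the risk set). Zero weights (missing links) enter the risk
    sets but are treated as censored, so they never count as events.
    """
    weights = sorted({w for w in aug_in + aug_out if w > 0}, reverse=True)
    num, var = 0.0, 0.0
    for w in weights:
        d_in = sum(1 for x in aug_in if x == w)
        d_out = sum(1 for x in aug_out if x == w)
        d = d_in + d_out
        n_in = sum(1 for x in aug_in if x <= w)
        n_out = sum(1 for x in aug_out if x <= w)
        n = n_in + n_out
        num += d_in - d * n_in / n            # "observed minus expected"
        if n > 1:                             # hypergeometric variance term
            var += d * (n_in / n) * (n_out / n) * (n - d) / (n - 1)
    z = num / sqrt(var) if var > 0 else 0.0
    # one-sided p-value under the standard normal approximation
    p = 0.5 * (1.0 - erf(z / sqrt(2.0)))
    return z, p

# toy community: dense, heavy internal weights vs. one light external link
z, p = logrank_z([0.9, 0.8, 0.7], [0.2, 0.0, 0.0, 0.0, 0.0, 0.0])
# a large positive z (small one-sided p) supports the candidate community
```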
Based on the above notations, the test statistic Z of the Logrank test is:

Z = [Σ_{i=1}^{q} (d_i^in − d_i · n_i^in / n_i)] / sqrt(Σ_{i=1}^{q} d_i (n_i^in / n_i)(n_i^out / n_i)(n_i − d_i)/(n_i − 1)),  (1)

where q is the number of distinct edge-weights in W_S and each term d_i^in − d_i · n_i^in / n_i can be regarded as the "observed minus expected" difference with respect to the number of internal edges of a specific weight. Thus, a real community should be associated with a large Z statistic. Under H_0, Z approximately follows the standard normal distribution; based on this approximation, the corresponding p-value can be obtained to assess the statistical significance of G[S].
In regard to the combinatorial nature of community detection, the community significance assessment issue is actually a multiple hypothesis testing problem. Hence, we need to conduct a multiple testing correction by calculating an adjusted p-value for each candidate community. The most popular method for multiple testing correction is probably the Bonferroni correction, in which the original p-value is multiplied by the number of tested hypotheses to obtain an adjusted p-value. In our context, the number of tested hypotheses can be calculated as the number of possible communities of the same size. More precisely, for a given community G[S] of size |S|, its adjusted p-value is calculated as the original p-value multiplied by C(|V|, |S|), where |V| is the number of nodes of the graph. In the experiments, the adjusted p-value is used instead of the original p-value in our method.
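A sketch of this correction, assuming the multiplier is the binomial coefficient C(|V|, |S|) as described above and capping the result at 1 (the function name is illustrative):

```python
from math import comb

def bonferroni_adjust(p_value, n_nodes, s_size):
    """Bonferroni-adjusted p-value for a community of size |S|.

    The number of tested hypotheses is the number of possible node subsets
    of the same size, C(|V|, |S|); the adjusted p-value is capped at 1.
    """
    return min(1.0, p_value * comb(n_nodes, s_size))

# e.g. p = 1e-6 for a community of size 3 in a 10-node graph:
# C(10, 3) = 120 hypotheses, so the adjusted p-value is 1.2e-4
adj = bonferroni_adjust(1e-6, 10, 3)
```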
Un-weighted special case. For un-weighted networks, the edge-weight is either 1 or 0, and thus q = 2 in Eq. (1). For the weight w_1 = 1, we have n_1^in = |S|(|S| − 1)/2 and n_1^out = |S|(|V| − |S|), and thus Eq. (1) can be rewritten in a closed form. In Ref. 17, a new community evaluation metric has been presented, which can be expressed as Z′ based on our notations. Thus, we can derive the mathematical relation between Z (our test statistic for un-weighted networks) and Z′ (the community evaluation metric in Ref. 17). As shown in Eq. (4), we can quantitatively establish the connection between the special case of our method for un-weighted networks and one popular metric in the literature.

Results
To test whether the presented p-value is effective for community evaluation, we conduct a series of experiments according to the pipeline shown in Fig. 2. First, existing community detection algorithms are employed to produce a set of identified communities on networks with ground-truth communities. Then, we use both internal validation metrics (e.g. our method, modularity) and external validation metrics (e.g. precision, recall) to quantitatively validate each identified community. Since external validation metrics are calculated based on the ground-truth information, they can serve as the "gold standard". In other words, an internal validation metric is a good community validation index if it is highly correlated with each external validation metric on the assessment of identified communities. Based on this assumption, we calculate the Pearson's correlation coefficient between two vectors (one generated from an internal validation metric and the other produced by an external validation metric), where each vector is composed of the validation index values on a set of identified communities. Finally, the Friedman test 18 and three post-hoc tests, the Nemenyi test 19, the Bonferroni-Dunn test 20 and the Holm's step-down test 21, are employed to check if our method is significantly better than other popular internal validation metrics.

Data sets. In our experiment, we use two groups of data sets. One group is composed of four weighted PPI (Protein-Protein Interaction) networks: Collins2007 22, Gavin2006 23, Krogan2006_core 24 and Krogan2006_extended 24. There are three sets of ground-truth communities for these weighted PPI networks, where each set is collected from one of the following databases of protein complexes: CYC2008 25, MIPS 26 and SGD 27. These three sets contain 408, 203 and 323 ground-truth communities, respectively.
Another group is composed of six real un-weighted networks: Karate 28 , Football 29 , Personal Facebook (Personal) 9 , Political blogs (PolBlogs) 30 , Books about US politics (PolBooks) 31 , and Railways 32 . The topological characteristics of these six un-weighted networks are provided in Supplementary Table S1.
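The per-community correlation step of the pipeline above can be sketched as follows; the numeric vectors are toy values for illustration only:

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# scores of one internal metric vs. one external metric on three
# identified communities (illustrative numbers)
internal = [9.39, 6.31, 7.56]   # e.g. negative log p-values
external = [1.0, 0.8, 0.75]     # e.g. Jaccard coefficients
r = pearson(internal, external)
# a high r means the internal metric tracks the gold-standard ranking
```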
Parameter setting. We choose three classical community detection methods: SLPAw 33 , Infomap 34 and Louvain 35 to detect communities. In our experiment, we run these three methods with their default parameter settings for weighted graphs.

Experiment. We compare our method with four internal metrics: conductance 36, modularity 37, the p-value in OSLOM 7 and the p-value in CCME 11. The p-value of OSLOM is obtained with its default setting, and the p-value of a community in CCME is the maximal p-value over all nodes within the community. Since conductance and the p-values in our method, OSLOM and CCME are negatively correlated with the Jaccard coefficient, Precision and Recall, we use the negative Pearson's correlation coefficient for these metrics in the performance comparison. Then, for the set of reported communities on each data set from each community detection algorithm, we can use the Pearson's correlation coefficient with respect to each external validation metric to check which internal validation metric is better. More precisely, a better internal validation metric should have a larger correlation coefficient. The detailed results are recorded in Supplementary Tables S2-S5. From these tables, we can obtain the rank distribution for each internal validation metric, where a larger correlation coefficient is assigned a smaller rank. The rank distributions on weighted PPI networks and un-weighted networks are provided as box plots in Fig. 3a,b, respectively. From Fig. 3, it can be observed that our method achieves the smallest average rank among the five internal validation metrics. To check if our method is really better than the other four internal validation metrics, we first apply the Friedman test to assess the null hypothesis that all methods have the same rank. The χ²_F value in the Friedman test is 200.5704 on weighted PPI networks and 45.6889 on un-weighted networks. This means that the performance gaps among different internal validation metrics are statistically significant at the significance level of 0.05.
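A minimal sketch of the Friedman statistic over a score table (rows are data-set/algorithm settings, columns are internal validation metrics); ties are ranked arbitrarily here, whereas the actual evaluation uses average ranks:

```python
def friedman_chi2(scores):
    """Friedman chi-square over a table of correlation scores.

    scores[i][j] is the correlation of metric j under setting i; larger
    is better, so rank 1 goes to the largest score in each row.
    """
    n, k = len(scores), len(scores[0])
    avg_rank = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])  # best metric first
        for r, j in enumerate(order, start=1):
            avg_rank[j] += r / n                         # accumulate mean ranks
    # chi^2_F = 12N/(k(k+1)) * [sum_j R_j^2 - k(k+1)^2/4]
    return (12 * n / (k * (k + 1))) * (
        sum(r * r for r in avg_rank) - k * (k + 1) ** 2 / 4)

# toy table: 3 settings, 3 metrics, metric 0 always ranks first
chi2 = friedman_chi2([[0.9, 0.5, 0.4], [0.8, 0.6, 0.3], [0.95, 0.4, 0.2]])
```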
Then, we further employ the Nemenyi test, the Bonferroni-Dunn test and the Holm's step-down test to compare our method with each competing internal validation metric in a pair-wise manner.
In Table 1, we record the rank difference between our method and each competing method on both weighted PPI networks and un-weighted networks. As shown in Table 1, the rank difference values between our method and Modularity, Conductance and CCME on weighted PPI networks are larger than the critical difference (CD) thresholds for both the Nemenyi test and the Bonferroni-Dunn test when the significance level is 0.05. This indicates that our method is significantly better than Conductance, Modularity and CCME under these two tests. On the un-weighted networks, our method is significantly better than OSLOM and CCME according to the Bonferroni-Dunn test and the Nemenyi test.
In Table 2, we list the p-values for the average rank difference between our method and each competing method based on the Holm's step-down test. Besides, the adjusted significance level for each position after sorting the p-values in non-decreasing order is provided as well. As shown in Table 2, all the p-values on PPI networks are smaller than the corresponding adjusted significance levels. This indicates that the superiority of our method over the other four metrics on weighted networks is also confirmed by the Holm's step-down test. Similar to the results in Table 1, we can claim that our method is significantly better than OSLOM and CCME on un-weighted networks based on the hypothesis testing results in Table 2.
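Holm's step-down procedure can be sketched as follows, with the standard threshold schedule α/(m − i + 1) for the i-th smallest p-value; the input p-values here are illustrative, not those of Table 2:

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure over m p-values.

    Sort the p-values in non-decreasing order and compare the i-th smallest
    with alpha / (m - i + 1); reject down the list until the first failure.
    Returns a rejection decision for each original position.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):          # step = 0 gives alpha / m
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break                             # stop at the first failure
    return reject

# four pairwise comparisons against a reference method (toy p-values):
# thresholds are 0.05/4, 0.05/3, 0.05/2, 0.05/1 in sorted order
decisions = holm_reject([0.001, 0.01, 0.04, 0.3])
# decisions == [True, True, False, False]
```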

Discussion
We have presented a general approach for assessing the statistical significance of a community from weighted networks. In this formulation, the weights of missing links are set to zero and all edge-weights are treated as censored observations. Based on this assumption, the community validation issue is modeled as a two-sample test problem on censored data. The presented formulation provides a general framework for community validation from a significance testing perspective. Based on this framework, we can either reveal the rationale underlying some existing community validation metrics or develop new community evaluation measures.

Figure 2. Each identified community can be assessed using both the internal validation metric and the external validation metric. For example, the negative log p-value of our method and the precision for the identified community A are 9.39 and 1, respectively. Then, a vector for each metric on the set of identified communities is readily available. For instance, the validation index vectors for our method and the Jaccard coefficient are (9.39, 6.31, 7.56) and (1, 0.8, 0.75), respectively. Consequently, the Pearson's correlation coefficient is calculated between each pair of vectors: one from an internal validation metric and the other from an external validation metric. Based on the correlation coefficients, four statistical tests are further applied to check if our method is really better than other internal validation metrics on the task of individual community assessment.

Table 1. The rank difference RD(a, b) between two methods, where a denotes our method and b represents a competing internal validation metric. In the last two rows, the critical difference (CD) thresholds at the significance level α = 0.05 for the Nemenyi test and the Bonferroni-Dunn test are listed.

Table 2. The p-value P(a, b) for the rank difference between two methods under the Holm's step-down test on weighted PPI networks (the left column) and un-weighted networks (the right column), where a denotes our method and b represents a competing internal validation metric. In the middle column, the adjusted significance level at α = 0.05 for each position after sorting the p-values in a non-decreasing order is listed.