A Regularized Stochastic Block Model for robust community detection in complex networks

The stochastic block model is able to generate random graphs with different types of network partitions, ranging from traditional assortative structures to disassortative structures. Since the stochastic block model does not specify which mixing pattern is desired, the inference algorithms discover the locally most likely partition of the nodes, regardless of its type. Here we introduce a new model that constrains the nodes' internal degree ratios in the objective function, guiding the inference algorithms to converge to the desired type of structure in the observed network data. We show experimentally that, given the regularized model, inference algorithms such as Markov chain Monte Carlo reliably and quickly find the assortative or disassortative structure as directed by the value of a single parameter. In contrast, when the sought-after assortative community structure is not strong in the observed network, the traditional inference algorithms using the degree-corrected stochastic block model tend to converge to undesired disassortative partitions.

Since the standard stochastic block model makes nodes in the same block statistically indistinguishable, the most likely block assignment often groups nodes of similar degrees into the same block. The resulting blocks of degree-homogeneous nodes are too limited to represent traditional community structures.
The degree-corrected stochastic block model [1] clusters together nodes with heterogeneous degrees and defines the expected number of edges between nodes i and j as λ_ij = ω_{g_i g_j} β_i β_j, where for an arbitrary node l, its block assignment is denoted g_l and its model parameter is denoted β_l. This simple yet effective extension improves the performance of statistical inference of block structure in real-world networks because nodes in a community tend to have broad degree distributions. To simplify technical derivations, the authors of [1] approximate the Bernoulli distribution with the Poisson distribution. This approximation is tight when the edge probability is small, which holds in sparse networks. Given a partition of the network, i.e., its block assignments {g_i}, the authors of [1] obtain the maximum likelihood estimates of the model parameters β̂_i = k_i/κ_{g_i} and ω̂_rs = m_rs, where k_i is the degree of node i, κ_r is the sum of the degrees of all nodes in block r, and m_rs is the total number of edges between blocks r and s; if r = s, it is twice the number of edges inside r. Note that self-loops and multi-edges are allowed in this model. Ignoring the constant terms that are irrelevant to the statistical inference of g, the log-likelihood of a graph G given the partition g can be expressed as

L(G | g) = Σ_rs m_rs log( m_rs / (κ_r κ_s) ).

This criterion has an information-theoretic interpretation when presented in the alternative form

L(G | g) = 2m Σ_rs (m_rs/2m) log[ (m_rs/2m) / ((κ_r/2m)(κ_s/2m)) ] + const,

which shows that this log-likelihood represents the Kullback-Leibler divergence between the probability of observing an edge (i, j) between blocks r and s in the real network, i.e., P[i ∈ r, j ∈ s] = m_rs/2m, and the probability of observing such an edge in a random graph with the same degree sequence, i.e., P[i ∈ r, j ∈ s] = (κ_r/2m)(κ_s/2m). Seen this way, the statistical inference maximizes the information gain achieved when replacing a random graph model with the degree-corrected stochastic block model.
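As a quick sanity check of these estimates, a minimal Python sketch (with an arbitrary toy graph and block assignment of our choosing, not taken from the paper) verifies that plugging β̂_i and ω̂_rs into λ_ij reproduces every observed degree, i.e., Σ_j λ_ij = k_i:

```python
from collections import defaultdict

# Toy graph: two triangles joined by a single edge; blocks chosen by hand.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
blocks = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

k = defaultdict(int)               # observed degrees k_i
kappa = defaultdict(int)           # total degree per block, kappa_r
m = defaultdict(int)               # m[r, s]; within-block counted twice
for i, j in edges:
    k[i] += 1; k[j] += 1
    r, s = blocks[i], blocks[j]
    kappa[r] += 1; kappa[s] += 1
    m[r, s] += 1; m[s, r] += 1

beta = {i: k[i] / kappa[blocks[i]] for i in blocks}   # MLE beta_hat_i
lam = lambda i, j: m[blocks[i], blocks[j]] * beta[i] * beta[j]  # omega_hat = m_rs
expected = {i: sum(lam(i, j) for j in blocks) for i in blocks}  # sum_j lambda_ij
```

With these MLE values, `expected[i]` equals `k[i]` for every node, illustrating that the degree correction preserves the observed degree sequence in expectation.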
In other words, the most likely partition is the one that requires the most information to describe the observed graph starting from a random graph that has no block structure at all. It is worth mentioning that, in the standard stochastic block model where the number of edges is drawn from the Bernoulli distribution, it is rare that ω_rs is close to 1. This is because the numbers of edges between blocks are usually very small when the network is sparse. A Bernoulli random variable with a small mean is well approximated by a Poisson random variable having the same mean [2]. For this reason, [3] found no significant difference in the solutions produced by the model described here, which uses the Poisson distribution instead of the Bernoulli distribution used in the original model.
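The quality of this approximation is easy to quantify: the total variation distance between Bernoulli(p) and Poisson(p) works out to p(1 − e^(−p)) ≤ p², so it vanishes quickly as p shrinks. A minimal sketch:

```python
import math

def bernoulli_poisson_tv(p):
    """Total variation distance between Bernoulli(p) and Poisson(p).

    TV = 0.5 * sum_k |P_bern(k) - P_pois(k)|. Bernoulli has mass only on
    {0, 1}; all remaining Poisson mass sits on k >= 2.
    """
    pois0 = math.exp(-p)          # Poisson P(0)
    pois1 = p * math.exp(-p)      # Poisson P(1)
    tv = 0.5 * (abs((1 - p) - pois0) + abs(p - pois1))
    tv += 0.5 * (1 - pois0 - pois1)   # Poisson tail on k >= 2
    return tv
```

For p = 0.01 the distance is already below 10⁻⁴, which is why sparse-network inference is insensitive to the Bernoulli-versus-Poisson choice.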

Markov chain Monte Carlo
We use the Markov chain Monte Carlo (MCMC) algorithm [4] to infer the block assignment using the stochastic block model and its extensions. The MCMC algorithm samples the distribution of all possible block assignments in the partition space. Suppose there are K blocks and N nodes in a network. The naive Markov chain Monte Carlo approach is not practical because the size of the partition space, O(K^N), grows exponentially with the number of nodes N. Therefore, Peixoto [5] proposes an optimized MCMC algorithm with a greedy heuristic to infer the block assignment. Initially, every node in the network is independently assigned to a random block. Then, one attempts to move a node from block r to block s with a probability conditioned on the block assignment t of a randomly chosen neighbor. The proposal probability function is defined as

p(r → s | t) = (m_ts + ε) / (m_t + εK),

where m_t = Σ_s m_ts and ε > 0 is a free parameter that fulfills the ergodicity condition, which requires that any block assignment can be reached from the current one in a finite expected number of steps and that the chain is aperiodic. When ε → ∞, the proposal function reduces to the naive scheme that assigns a random block to the current node. However, in that case, the probability of the current node being assigned to the correct block is very low, so the majority of the moves do not increase the log-likelihood. Consequently, these moves are rejected very frequently, wasting computational resources. Applying a relatively small ε makes the moves more likely to be accepted. This is because if there are many edges between blocks s and t, a node with many neighbors in block t is likely to reside in block s. Consequently, moving such a node to block s is more likely to be accepted, saving the computational cost of many rejected proposals.
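A minimal sketch of this proposal step (the data-structure names are illustrative; `m` holds the block-to-block edge counts m_ts and is assumed to be kept up to date by the caller):

```python
import random

def propose_block(node, blocks, neighbors, m, K, eps=0.1):
    """Peixoto-style move proposal: pick a random neighbor of `node`, note
    its block t, then propose block s with probability
    (m[t][s] + eps) / (m_t + eps * K), where m_t = sum_s m[t][s]."""
    t = blocks[random.choice(neighbors[node])]
    row = m.get(t, {})                     # edge counts m[t][s]
    m_t = sum(row.values())
    r = random.random() * (m_t + eps * K)  # inverse-CDF sampling
    acc = 0.0
    for s in range(K):
        acc += row.get(s, 0) + eps
        if r < acc:
            return s
    return K - 1                           # numerical safety fallback
```

With large `eps` this degenerates to a uniform proposal; with small `eps` it steers proposals toward blocks that are well connected to the neighbor's block t.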
To ensure detailed balance, each proposed assignment is accepted in the Metropolis-Hastings fashion [6] with probability

a = min{ 1, e^∆L · [Σ_t n_t p(s → r | t)] / [Σ_t n_t p(r → s | t)] },

where the node of the proposed assignment has n_t neighbors in block t and ∆L is the change of the log-likelihood caused by the move. A list of the edges adjacent to each block is used to compute Eq. 4, which can be done in O(k_i) time [5], where k_i is the degree of node i. Thus, processing moves for all nodes in the network requires O(E) operations, where E is the number of edges in the network.
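The acceptance step can be sketched as follows (a hedged illustration: the per-block neighbor counts n_t and the forward and reverse proposal probabilities p(r → s | t) and p(s → r | t) are assumed to be precomputed by the caller):

```python
import math
import random

def accept_move(delta_L, n_t, p_fwd, p_bwd):
    """Metropolis-Hastings acceptance for a proposed move r -> s.

    delta_L : change in log-likelihood produced by the move
    n_t     : dict, number of the node's neighbors in each block t
    p_fwd   : dict, forward proposal probabilities p(r -> s | t)
    p_bwd   : dict, reverse proposal probabilities p(s -> r | t)

    Computes a = min(1, exp(delta_L) * sum_t n_t*p_bwd / sum_t n_t*p_fwd)
    and returns True if the move is accepted.
    """
    fwd = sum(n_t[t] * p_fwd[t] for t in n_t)
    bwd = sum(n_t[t] * p_bwd[t] for t in n_t)
    a = min(1.0, math.exp(delta_L) * bwd / fwd)
    return random.random() < a
```

When the proposal is symmetric the ratio cancels and the rule reduces to the familiar min(1, e^∆L) Metropolis criterion.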

Supplementary Proofs of Theorems
Proof of Theorem 1
Proof.
Also, it is obvious that the sum of cross-entropy terms on the RHS of Eq. 9 is constant given the block assignment g. The log-likelihood of our model defined by Eq. 9 can be rewritten in a form in which all the constant terms are grouped into the last term on the RHS. Setting the derivative of the RHS over θ_i to zero, we obtain k_i/θ_i = Σ_s m_{g_i,s}/Θ_{g_i}, which means that for every node i the MLE estimator is θ̂_i = Θ_{g_i} k_i/κ_{g_i}. Plugging θ̂_i into the log-likelihood above, we obtain the log-likelihood expression in Eq. 1 after dropping all the constant terms.
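The derivative step can be spelled out explicitly; the last equality uses the identity Σ_s m_{g_i,s} = κ_{g_i}, which follows directly from the definitions of m_rs and κ_r:

```latex
\frac{\partial L}{\partial \theta_i}=0
\;\Longrightarrow\;
\frac{k_i}{\theta_i}=\frac{\sum_s m_{g_i,s}}{\Theta_{g_i}}
\;\Longrightarrow\;
\hat{\theta}_i=\frac{\Theta_{g_i}\,k_i}{\sum_s m_{g_i,s}}
             =\frac{\Theta_{g_i}\,k_i}{\kappa_{g_i}} .
```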

Proof of Theorem 2
Proof. For a given {f_i}, the derivative of the log-likelihood defined in Eq. 8 over θ_i is zero when the maximum likelihood estimator θ̂_i satisfies the condition given in Eq. 5. Even without a closed-form expression for θ̂_i, Eq. 5 ensures that the expected degree of node i matches its observed degree. Since θ_i = θ̂_i, we have I_i = f_i θ̂_i and O_i = (1 − f_i) θ̂_i, so the RHS of Eq. 7 is the RHS of Eq. 5 multiplied by θ̂_i; hence it equals the observed node degree k_i in the network. Therefore, Σ_j λ_ij = k_i.

Proof of Theorem 3
Proof. We start with the equivalent definition of the generalized model in Eq. 8. We first compute the first-order derivatives of the log-likelihood over I_i and O_i. Given the block assignment g, when the log-likelihood achieves its maximum, the stable points of I_i and O_i satisfy ∂L/∂I_i = ∂L/∂O_i = 0, which leads to the corresponding system of equations. Suppose there are R nodes in block r, so we have R independent equations of this form; the other R equations are I_i + O_i = θ_i = k_i, which represent the assumed constraints. These 2R independent equations produce at most one unique set of solutions for the 2R variables, and it is easy to verify that I_i = k_i^+ and O_i = k_i^- satisfy all 2R equations. Therefore, Eq. 9 can be rewritten in a form where Λ_rs = κ_r^+ κ_s^+ if r = s, and Λ_rs = κ_r^- κ_s^- otherwise. Here κ_r^+ = Σ_{i∈r} k_i^+ and κ_r^- = Σ_{i∈r} k_i^- are the sums of the internal and external degrees, respectively, of the nodes in block r. The first sum on the RHS of this equation represents the Kullback-Leibler divergence between the distribution of the edges residing within and across blocks r and s, i.e., p(g_i = r, g_j = s) = m_rs/2m, and the corresponding null distribution Λ_rs/(2m)². Note that the null model here does not require that all the edges across the entire network are randomly distributed; it only requires that the edges inside each block and across different blocks are so distributed.
Since Σ_i k_i = 2m, the term k_i/2m can be treated as the probability of selecting an edge endpoint of node i. Thus, the second sum on the RHS can be treated as the expectation of the binary entropy function H under this distribution.
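Written out in standard notation, with f_i denoting the internal degree ratio of node i, this expectation is

```latex
\sum_i \frac{k_i}{2m}\, H(f_i),
\qquad
H(f_i) = -\,f_i \log f_i \;-\; (1 - f_i)\log(1 - f_i).
```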

Real networks for experiments
Supplementary Table S1 shows the numbers of nodes and edges of the real networks used in the experiments. The number of blocks is set to the values generally accepted in previous publications. As shown in Supplementary Table S1, the Karate club network, the Dolphin social network and the Les Miserables characters network are relatively sparse, with average degrees ranging from 4 to 6. To investigate the performance of the regularized model, we add noise to these graphs by randomly removing edges, further increasing their sparsity. For each network, we iteratively remove random edges from the original set of edges and repeat this process twice (each time starting from the same original network). If the removal of an edge leaves isolated nodes in the network, we remove them as well.

Dimensionality reduction of the partition space
In Figures 1 and 2, we project the multidimensional space of partitions onto the 2D x-y plane. Specifically, for each MCMC trial, we record all the sampled partitions of the network and their corresponding log-likelihoods. Given two sampled partitions {g_i} and {b_i}, a new partition {l_i} is randomly generated such that l_i = g_i with probability 0.5 and l_i = b_i otherwise. Then, we scatter all the sampled partitions along with the randomly generated partitions on the x-y plane of Figure 1. In this figure, the multidimensional partition space is projected onto a two-dimensional x-y plane by the multidimensional scaling (MDS) algorithm [10]. The z-axis indicates the log-likelihood of the partitions.

Illustrative toy networks
We use an illustrative toy network to demonstrate the effect of the regularization introduced here on the stochastic block model. Consider the "twin stars" network shown in Supplementary Figure S1.
There are three possible partitions of this twin stars network. The first option is to cluster the two central hubs into one block and the remaining nodes into another. This partition corresponds to a core-periphery structure, as shown in Supplementary Figure S1(a), where the color of a node represents its block assignment. If we exchange the block assignments of the nodes connected to the two hubs, we obtain a twisted core-periphery structure, as shown in Supplementary Figure S1(b). Both of these partitions are disassortative. For community detection problems, however, the most suitable partition should divide the network into two identical stars, which corresponds to the partition shown in Supplementary Figure S1(c).
Supplementary Table S2 lists the log-likelihoods of these three partitions under the standard stochastic block model (SSBM), the degree-corrected stochastic block model (DCSBM) and our block model with regularization terms (RSBM). Under the degree-corrected stochastic block model, the log-likelihood of the assortative community partition is smaller than the log-likelihood of the inappropriate twisted core-periphery partition, whereas our new model identifies the assortative community partition as the most appropriate one. We tested the log-likelihood of the regularized stochastic block model with the f(k) function as defined by Eq. 10, using the model parameter α = 0.3, 0.6 and 0.9, respectively. Note that, in Eq. 9, it is possible that Λ_rs = 0 if a block r includes only degree-one nodes, since f(1) = 1, causing a log-likelihood of negative infinity according to the model definition. In this case, we set the term m_rs log(m_rs/Λ_rs) = −10,000 in Eq. 9 to ensure the log-likelihood is small enough. In general, larger α values result in a larger gap between the log-likelihood of the assortative community partition and those of the other two partitions. However, all three α values, 0.3, 0.6 and 0.9, ensure that the assortative community partition fits the regularized model introduced here much better than the other candidates. The reason is that, no matter which value of α is selected, we have f(1) = 1. To reduce the cross-entropy terms in Eq. 14, the inference algorithm is likely to assign a degree-one node to the block in which its sole neighbor resides. Therefore, the inference algorithm is likely to return the assortative communities as the most likely partition of the network.
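To make the DCSBM comparison reproducible, the log-likelihood L = Σ_rs m_rs log(m_rs/(κ_r κ_s)) of the three partitions can be evaluated on a concrete twin-stars instance (here, two hubs joined by an edge with five leaves each; the size is our illustrative choice, not taken from the paper):

```python
import math
from collections import defaultdict

# Twin stars: hubs 0 and 1 joined by an edge, each with five leaves.
edges = ([(0, 1)]
         + [(0, v) for v in range(2, 7)]     # leaves of hub 0
         + [(1, v) for v in range(7, 12)])   # leaves of hub 1

def dcsbm_ll(blocks):
    """DCSBM log-likelihood up to constants, with 0 log 0 = 0."""
    m, kappa = defaultdict(int), defaultdict(int)
    for i, j in edges:
        r, s = blocks[i], blocks[j]
        m[r, s] += 1; m[s, r] += 1           # within-block counted twice
        kappa[r] += 1; kappa[s] += 1
    return sum(v * math.log(v / (kappa[r] * kappa[s]))
               for (r, s), v in m.items() if v)

hubs = {v: int(v < 2) for v in range(12)}                # (a) core-periphery
twisted = {0: 0, 1: 1,
           **{v: 1 for v in range(2, 7)},
           **{v: 0 for v in range(7, 12)}}               # (b) twisted
communities = {v: 0 if v in (0, *range(2, 7)) else 1
               for v in range(12)}                       # (c) two stars
```

On this instance the twisted core-periphery partition attains a higher DCSBM log-likelihood than either alternative, reproducing the failure mode that the regularization is designed to correct.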