Cross-validation estimate of the number of clusters in a network

Network science investigates methodologies that summarise relational data to obtain better interpretability. Identifying modular structures is a fundamental task, and assessment of the coarse-grain level is its crucial step. Here, we propose principled, scalable, and widely applicable assessment criteria to determine the number of clusters in modular networks based on the leave-one-out cross-validation estimate of the edge prediction error.

Introduction -Mathematical tools for graph or network analysis have wide applicability in various disciplines of science.In fact, many datasets, e.g., biological, information, and social data, that represent interactions or relations among elements have been successfully studied as networks in machine learning, computer science, and statistical physics.In a broad sense, a major goal is to identify macroscopic structures, including temporal structures, hidden in the data.To this end, for example, degree sequences, community and core-periphery structures, and various centralities have been extensively studied.In this Letter, we focus on identifying modular structures including community structures, bipartite structure, and a type of core-periphery structure through graph clustering.Bayesian approaches using the stochastic block model [1] are a powerful tool for this task.Graph clustering consists of two steps: selecting the number of clusters and determining the cluster assignment of each vertex.These steps may be performed repeatedly.Some methods require the number of clusters as an input, whereas others determine it automatically.In a Bayesian framework, the former step is called model selection, and this is our major focus.Model selection and its assessment for modular networks have been discussed in several ways.A classical prescription is to optimize an objective function, e.g., modularity [2,3] and the map equation [4], or to utilize the spectral method and count the number of eigenvalues outside of the spectral band [5,6].In the Bayesian framework, one principle is to select a model that maximizes the model's posterior probability [7][8][9][10] or the one with the minimum description length [11].Finally, minimization of the prediction error is a well-accepted principle, and cross-validation estimates it adequately [12,13].Unfortunately, a straightforward implementation of cross-validation is computationally expensive when the number of samples for validation is large; for instance, for leave-one-out cross-validation (LOOCV), to assess q ∈ {2, . . ., q max }, one has to run the learning algorithm a total of N (q max − 1) times for very similar training sets.Nevertheless, we show that the LOOCV is an exception and can be applied without the need to perform learning N times by exploiting the fact that the cavity biases in belief propagation (BP) are exactly the ingredients of the LOOCV.Throughout this Letter, we consider undirected sparse networks, and we ignore multi-edges and self-loops for simplicity.We denote by E the set of edges in the network.We denote by N and L the total numbers of vertices and edges, respectively.All the detailed derivations of the results can be found in the Supplemental Material.
Stochastic block model -The hyperparameters that specify the standard stochastic block model are the number of clusters q, the fraction of the cluster size γ σ , and the so-called affinity matrix ω σσ ′ , which indicates the connection probabilities within and between clusters, where σ is the cluster label.Because the networks we consider are sparse, ω = O(N −1 ).Assuming that the edges are generated independently and randomly on the basis of the affinity matrix, the probability that an adjacency matrix A is generated, i.e., the likelihood of the model, is expressed as The cluster assignment of the vertices σ is the hidden variable, and one usually conducts hyperparameter learning with respect to (γ, ω) and cluster inference using the expectation-maximization (EM) algorithm so that the marginal log-likelihood, or the negative free energy, log σ p(A, σ|γ, ω, q), is maximized.BP and the Bethe free energy -The EM algorithm for a block model inference requires computation of the marginal probability of cluster assignment ψ i σ for each vertex.Under the tree approximation, BP offers its estimate by calculating the cavity bias ψ i→j σ , which is the marginal probability of vertex i without the marginalization of vertex j. (See Ref. [9] or the Supplemental Material for details.)Using the cavity biases, the negative marginal log-likelihood per vertex is estimated as where c is the average degree.Each term in the summations is where ∂i indicates the set of neighboring vertices of i, and Note that undirected networks have the symmetry ω σσ ′ = ω σ ′ σ .Although the result of BP is generally an approximation, it is exact when the network is a tree and is quite accurate when the network is sparse.The function f Bethe is called the Bethe free energy, and the parsimonious model that minimizes it is expected to give the correct number of clusters q of the generative model, which corresponds to the maximum likelihood estimation of the hyperparameters.In [9], −c log N/2 is added to (2) to make the Bethe free energy extensive [14]; in numerical experiments, we follow their convention.
Bayes prediction error -The need to evaluate the predictability of the learned model is another principle of model assessment.In the context of a network, we quantify how well our model predicts the existence or nonexistence of edges for "new data."In reality, however, we do not have the "new data" in a given dataset.Therefore, we consider the cross-validation estimate as a proxy of the prediction error; we select a model with a small value of the error.We denote by A \(i,j) the adjacency matrix of a network in which A ij is missing, i.e., in which it is unknown whether A ij = 0 or A ij = 1.The posterior predictive distribution p(A ij = 1|A \(i,j) ) of the model in which vertices i and j are connected given dataset A \(i,j) , or the marginal likelihood of the learned model for the vertex pair, is The error should be small when the prediction of edge existence for every vertex pair is accurate.Thus, the probability distribution ( 5) is close, in the sense of the Kullback-Leibler (KL) divergence, to the actual distribution, which is given as the empirical distribution.Therefore, it is natural to employ, as a measure of the prediction error, the cross-entropy error function [15] where X ≡ i<j Aij ={0,1} P (A ij )X(A ij )/L.Note that we chose the normalization so that E Bayes is typically O(1) in sparse networks.We refer to (6) as the Bayes prediction error, which corresponds to the LOOCV estimate of the stochastic complexity [16].As long as the model we use is consistent with the data, the posterior predictive distribution is optimal as an element of the prediction error, because the intermediate dependence (σ i , σ j ) is fully marginalized and gives the smallest error.In (5), p(A ij = 1|σ i , σ j ) = ω σiσj by model definition.An important observation is that, because the cavity bias ψ i→j σi represents the marginal probability without the information for vertex j, this is exactly what we need for prediction in the LOOCV, that is, σi ω σiσj ψ j→i σj /Z ij for the conditional probability, where Z ij is defined in (4), and p(A ij = 1|A \(i,j) ) = Z ij .Note that utilizing BP for the LOOCV itself is not totally new; this idea has been addressed in a different context in the literature, e.g., Ref. [17].By using the fact that L = O(N ) and p(A ij = 1|A \(i,j) ) = O(N −1 ), E Bayes (q) can be written as ). ( 7) Equation ( 7) indicates that the prediction with respect to the non-edges contributes only as a constant, so E Bayes (q) essentially measures whether the existence of edges is correctly predicted in a sparse network.Using the relation , where Z i→j is the normalization factor of ψ i→j σ , we can also write (7) as where we ignored the constant and O(N −1 ) factors.
Gibbs prediction error -Although the prediction error using the posterior predictive distribution is the best choice when the model we use is consistent with the data, this assumption is often invalid in practice.In that case, the Bayes prediction error may no longer be optimal for prediction.In (6), we employed − log p(A ij |A \(i,j) ) as the error of a vertex pair.Instead, we can consider the log-likelihood of cluster assignment E(A ij |σ i , σ j ) = − log p(A ij |σ i , σ j ) as a fundamental element and measure E as the prediction error of a vertex pair, where • • • is the average over p(σ i , σ j |A \(i,j) ); that is, the cluster assignment (σ i , σ j ) is drawn from the posterior distribution, and the error of the vertex pair is measured with respect to this fixed assignment.Then, the corresponding cross-entropy error function is Again, we omitted the constant and O(N −1 ) factors.We refer to (9) as the Gibbs prediction error.If the probability distribution of cluster assignment is highly peaked, E Gibbs will be close to E Bayes , and E Gibbs and E Bayes will be very small if those assignments predict the actual network well.Alternatively, the maximum a posteriori (MAP) estimate of ( 9) is often measured for the Gibbs prediction error; hence, we refer to E MAP (q) as the prediction error with ψ i→j σi replaced by δ σi,argmax{ψ i→j σ } in (9).
Gibbs training error -Finally, we define the BP estimate of the training error.That is, not only do we use A \(i,j) , but we also use A ij for cluster inference.This can be done by taking the average over p(σ i , σ j |A) instead in (9).Thus, omitting the constant and O(N −1 ) factors, we have which measures the goodness of fit of the assumed model to the given data.We refer to E training as the Gibbs training error.Interestingly, as shown in the Supplemental Material, this quantity appears in the free energy (not the Bethe free energy) as a part of the internal energy.
Relations among the errors -By exploiting Bayes' rule, we have Note here that the left-hand side does not depend on σ i and σ j .If we take the average with respect to p(σ i , σ j |A \(i,j) ) on both sides, where D KL (p||q) is the KL divergence.Taking the sample average of the edges, we obtain If we take the average over p(σ i , σ j |A) in (12) instead, Because the KL divergence is non-negative, E training < E Bayes < E Gibbs .Essentially the same relations as ( 13) and ( 14) were derived in the context of neural networks in a slightly different manner [16].We can go even further.
If the cluster assignment distributions with small q can be regarded as the coarse graining of that with a larger q, the information monotonicity [18,19] of the KL divergence ensures that E Gibbs always estimates a smaller number of clusters q than E Bayes and E training .(See the Supplemental Material for the detailed argument.)When the inference of BP correctly predicts the edges, E MAP is biased so that the error becomes small.Therefore, E MAP tends to be smaller than E Gibbs .As we will observe later, E Gibbs typically performs better than E Bayes in practice.Equation (13) implies that detailed information about the difference in the cluster assignment distribution is often not relevant and simply causes overfitting.As shown in Fig. 1, when the network is truly generated by the stochastic block model, the Bethe free energy and all the prediction errors saturate at the planted value of q, as they should.
Degree-corrected stochastic block model -In practice, it is observed that the standard stochastic block model is often not flexible enough to fit real-world data with heterogeneous degree distributions.In such a case, the assessment of the number of clusters q may not make any sense.Therefore, in addition, we conduct the analysis for the degree-corrected stochastic block model [20] in parallel.In the degree-corrected stochastic block model of a simple graph, the probability p(A ij = 1|σ i , σ j ) that a pair of vertices is connected given their cluster assignments σ i and σ j is θ i ω σi,σj θ j instead of ω σi,σj .With this replacement, we can obtain the corresponding Bethe free energy and error functions analogously.Figure 1 shows an equal-size stochastic block model with a power-law degree distribution, which we generated as an instance of the LFR network [21].The mixing parameter µ is set to 0.1.The planted number of clusters is correctly estimated using both the Bethe free energy and the prediction and training errors in this case.
Real-world networks -Finally, and most importantly, model assessment using various error functions is applied to real-world networks.Unlike the case for synthetic networks, the selection of q is not very obvious for many networks because the error functions do not saturate clearly as q increases.This makes sense, because real-world networks may not have a clear simple modular structure; thus, they may not be perfectly fitted by either the standard or degree-corrected stochastic block models.In other words, by plotting the error functions, we can see how confident we can be about our model selection.Recall that each cross-validation estimate is given as an average error per edge, so we can also measure its standard errors.To select the parsimonious model, the "one-standard error" rule [22] is often used, in which the most parsimonious model whose error is no more than one standard error above the error of the best model is selected.To apply this empirical rule, we plotted the standard errors of the cross-validation estimates as shadows.For example, although the best model of the network of books about US politics (which we refer to as political books) is q = 7 for the standard stochastic block model, we choose q = 5 as the most parsimonious model.Although the estimated q from the Bethe free energy and prediction errors coincide in some cases, as far as we examined, the Bethe free energy does not saturate at a reasonable value of q, as already pointed out in [23].The Bayes prediction error E Bayes and Gibbs training error E training perform similarly.In contrast, the Gibbs prediction error E Gibbs shows good performance in the sense that its suggestion often coincides with the number of "ground-truth" communities of well-studied networks.Note that, even when we do not measure the Bethe free energy for model assessment, we still minimize the Bethe free energy in the cluster inference step.
Summary and Discussion -We derived crossvalidation estimates for various types of errors in terms of the distribution obtained by BP.This approach is incomparably more efficient than a straightforward appli-FIG.2. (Color online) Bethe free energies (left) and prediction errors (right) of the network of books about US politics [24] as functions of the number of clusters q.They are plotted in the same manner as in Fig. 1.The standard errors are shown as shadows.
cation of LOOCV and offers a reasonable model assessment.Moreover, we also showed the relations between the objectives for model assessment.This is quite important, because we can determine the exact cause of overfitting.The codes we used can be found at [25].Although the generation of edges is highly correlated, the validity of the cross-validation is justified because we fit the data based on a stochastic block model, which assumes that every edge is generated independently and randomly.In addition, although one may expect that the LOOCV estimates the conditional prediction error because it uses very similar training sets, it reportedly estimates rather the expected prediction error [22].Fitting with the stochastic block models is flexible, so the algorithm can infer not only the assortative structure, but also more complex structures.However, this is not always an advantage in practice.The flexibility also means that slightly different models may fit the data as well as the best model.Therefore, as a trade-off, model selection becomes more difficult.By restricting the structure that we can detect, it is possible to find a good balance of the cluster inference and model selection performance.We will address this problem in a future publication.Supplemental Material: Cross-validation model assessment for modular networks

BP INFERENCE OF CLUSTER ASSIGNMENT AND HYPERPARAMETER LEARNING FOR THE STOCHASTIC BLOCK MODEL
For self-containedness, we briefly summarize the EM algorithm with BP of a stochastic block model introduced in [9,23].The goal of cluster assignment inference is to evaluate the marginal probability of the cluster assignment ψ i σ for each vertex, provided that the hyperparameter set, γ and ω, is fixed at the estimated values.To this end, we iteratively compute the BP equation based on the likelihood p(A, σ|γ, ω, q); that is, for an edge (i, j) ∈ E, where ∂i\j indicates the set of neighboring vertices of i in the network except for j, and ψ i→j σi and Z i→j are the marginal probability of vertex i without the marginalization from vertex j and its normalization factor as defined in the main text.The external field, , is due to the effect of non-edges (i, k) / ∈ E, where the full marginal ψ i σi is As defined in the main text, Z i is the normalization factor of the full marginal.With these marginals in hand, we can update the estimate of the hyperparameters to γ and ω as

BP INFERENCE OF CLUSTER ASSIGNMENT AND HYPERPARAMETER LEARNING FOR THE DEGREE-CORRECTED STOCHASTIC BLOCK MODEL
Because the degree distribution of the standard stochastic block model is always the Poisson distribution, it is sometimes not flexible enough to fit the data.To overcome this issue, the degree-corrected stochastic block model was proposed [20].Although the EM algorithm with BP updates was discussed in [26], we write it in a form similar to those for the standard stochastic block model in the previous section.The likelihood of the degree-corrected stochastic block model is where θ i is an arbitrary hyperparameter for degree correction.By grouping vertices into clusters, the likelihood can be rewritten as where m σσ ′ = ij A ij δ σσi δ σ ′ σj is the number of edges between clusters σ and σ ′ if σ = σ ′ , and it is doubly counted if σ = σ ′ .By assuming that A ij is either zero or one for any vertex pair, we neglect the first product.As a normalization constraint of θ, i δ σσi θ i = n σ is usually imposed, where n σ is the number of vertices within cluster σ.Then, the log-likelihood reads The BP equation corresponding to (S5) is The expansion that ignores the O(N −1 ) factors yields, analogously to (S2), and the BP equation (S8) for (i, j) ∈ E is approximated as By setting θ i = 1 for every vertex, (S1) is recovered.According to the saddle-point conditions of (S7), the hyperparameters γ σ and ω σσ ′ should be updated as (S3) and (S4).For θ i of vertex i with degree The cluster assignment σ i is determined as that with the maximum value in ψ i σ .The Bethe free energy can also be written analogously to that in the standard stochastic block model.
The first and second terms are Note again that we have the symmetry ω σσ ′ = ω σ ′ σ .The non-edge part of (S13) is

DERIVATION OF THE ERROR FUNCTIONS
In this section, we explain the derivation of the Bayes prediction error E Bayes (q), Gibbs prediction error E Gibbs (q), and Gibbs training error E training (q).Because L = O(N ) and p(A ij = 1|A \(i,j) ) = O(N −1 ), the Bayes prediction where we used Precisely speaking, the quantity above is a function of the hyperparameters, because i<j Z ij is the expectation of the total number of edges.However, we can assume that its dependence is negligible because the total number of edges is a macroscopic quantity.Analogously to E Bayes (q), the Gibbs prediction error E Gibbs (q) is p(σ i , σ j |A \(i,j) ) log p(A ij = 1|σ i , σ j ) More examples of real-world networks: adjacency network of common adjectives and nouns in a novel (word adjacencies) [31], metabolic network of C. elegans (c-elegans) [32], and network of hyperlinks between blogs on US politics (political blog) [33].The panels are placed in the same order as in Fig. S4.The word adjacencies network is known to have a bipartite structure.Again, the networks are converted to undirected, unweighted, simple graphs.The c-elegans network and political blog network have hub structures, which are inconsistent with the standard stochastic block model; this inconsistency can be observed clearly for the political blog network.Although the Gibbs prediction error E Gibbs often saturates at a value slightly larger than the number of ground-truth communities, its performance is much better than that of the Bethe free energy and other error functions, as is the case for other real-world networks.

FIG. 1 .
FIG.1.(Color online) Bethe free energies (left) and prediction errors (right) of the standard stochastic block model (top) and the stochastic block model with a power-law degree distribution (bottom).Both models consist of four equal-size clusters, and N = 10, 000 in total.For the standard stochastic block model, we set the average degree c to 6 and ǫ to 0.1, where ǫ = ωout/ωin, and ω σσ ′ = ωin for σ = σ ′ and ωout otherwise.For the other model, which we generated as the LFR network, we set the average degree c to 9.58, mixing parameter µ to 0.01, and exponent of the degree distribution τ to −2, with a maximum degree dmax of 100.The four data in the right panel are the Bayes prediction errors EBayes (red circles), Gibbs prediction errors E Gibbs (green triangles), Gibbs training errors Etraining (blue diamonds), and MAP estimates EMAP of E Gibbs (yellow squares).

1 L 1 L 1 − 1 L
FIG. S4.More examples of real-world networks: adjacency network of common adjectives and nouns in a novel (word adjacencies)[31], metabolic network of C. elegans (c-elegans)[32], and network of hyperlinks between blogs on US politics (political blog)[33].The panels are placed in the same order as in Fig.S4.The word adjacencies network is known to have a bipartite structure.Again, the networks are converted to undirected, unweighted, simple graphs.The c-elegans network and political blog network have hub structures, which are inconsistent with the standard stochastic block model; this inconsistency can be observed clearly for the political blog network.Although the Gibbs prediction error E Gibbs often saturates at a value slightly larger than the number of ground-truth communities, its performance is much better than that of the Bethe free energy and other error functions, as is the case for other real-world networks.