Detecting mesoscale structures by surprise

Marchese, Emiliano; Caldarelli, Guido; Squartini, Tiziano

doi:10.1038/s42005-022-00890-7


Article
Open access
Published: 30 May 2022


Communications Physics volume 5, Article number: 132 (2022)

Abstract

The importance of identifying mesoscale structures in complex networks can hardly be overestimated. So far, much attention has been devoted to detecting modular and bimodular structures on binary networks. This effort has led to the definition of a framework based upon the score function called ‘surprise’, i.e. a p-value that can be assigned to any given partition of nodes. Here, we take a step further and extend the entire framework to the weighted case: six variants of surprise, induced by just as many variants of the hypergeometric distribution, are thus considered. As a result, a general, statistically grounded approach for detecting mesoscale network structures via a unified, surprise-based framework is presented. To illustrate its performance, both synthetic benchmarks and real-world configurations are considered. Moreover, we attach to the paper a Python code implementing all variants of surprise discussed in the present manuscript.

Introduction

The importance of identifying the signature of some kind of mesoscopic organization in complex networks (be it due to the presence of communities or bipartite, core-periphery, bow-tie structures) can hardly be overestimated^1,2. The best example of complex systems whose behavior is deeply affected by their mesoscopic structural organization (e.g. resilience to the propagation of shocks, to the failure of nodes, etc.) is provided by financial networks^3,4,5,6.

So far, much attention has been devoted to the detection of binary mesoscale structures, i.e. communities and, to a far lesser extent, core-periphery structures: the efforts to solve these problems have led to a number of approaches that are briefly sketched below (for a detailed review of them, see refs. ^7,8).

Community detection was initially approached by attempting a definition of communities based on the concepts of clustering, cliques, k-core; core-periphery structures have, instead, been defined in a purely top-down fashion, by imagining a fully connected subgraph (i.e. the core) surrounded by (peripheral) vertices exclusively linked to the first ones³. As stressed in ref. ⁵, the deterministic character of these definitions makes their tout-court application to real-world systems extremely difficult. This is the reason why the intuitive requirements that ‘the number of internal edges is larger than the number of external edges’ and that ‘the core portion of a network is densely connected, while its periphery is loosely connected’⁸ are, now, interpreted in a purely probabilistic way: a community has, thus, become a subgraph whose vertices have a larger probability to be inter-connected than to be connected to any other vertex in the graph, and analogously for the core-periphery structure. In other words, the top-down approach defining a gold standard and looking for (deviations from) it has given way to a bottom-up approach where structures are supposed to emerge as the result of non-trivial (i.e. non-random) interactions between the nodes.

This change of perspective leads to a number of problems. The first one concerns the definition of models stating how edges are formed and has been solved by adopting the rich formalism defining the Exponential Random Graphs framework⁹; the second, and most important, one concerns the definition of a (statistically sound) procedure for selecting the best model among the ones providing competing descriptions of the data. This has led to the identification of three broad classes of algorithms—for an alternative classification see ref. ¹⁰: according to our intuition, all of them implement some kind of statistical inference, the major difference lying in the way the corresponding test of hypothesis is implemented; from a practical point of view, instead, all these methods are designed for optimization, the functional form of the specific score function determining the class to which a given algorithm belongs.

The most representative algorithms among those belonging to the first class are the ones based on modularity. Although these methods compare the empirical network with a benchmark, they do not provide any indication of the statistical significance of the recovered partition, the reason being that none of them is designed as a proper statistical test. For instance, let us consider the definition of modularity, whose generic addendum is proportional to the term (a_ij − p_ij): while it embodies a comparison between the (empirical) adjacency matrix A and the matrix of probability coefficients P defining the benchmark, it does not implement any proper test of hypothesis. The algorithms prescribing to maximize a plain likelihood function belong to this group as well. Although popular, this way of proceeding is known to be affected by overfitting issues: an example is provided by the straight maximization of the likelihood defining the Stochastic Block Model (SBM), or its degree-corrected version, over the whole set of possible partitions, that outputs the trivial one where each vertex is a cluster on its own¹¹.

The aforementioned, major limitation is overcome by the algorithms belonging to the second class. They implement tests of hypothesis either à la Fisher or à la Neyman-Pearson, i.e. either defining a single benchmark (the null hypothesis of the first scenario) or two, alternative ones (the null and the alternative hypothesis of the second scenario): from a practical point of view, such a result is achieved by identifying the aforementioned benchmarks with proper probability distributions and the best partition of nodes with the one minimizing the corresponding p-value. Surprise-based algorithms belong to this second group: as it has been shown in ref. ¹² (for a particular case; such a result will be generalized in what follows), optimizing (asymptotic) surprise amounts to carrying out a (sort of) Likelihood Ratio Test aimed at choosing between two alternative models.

Hypothesis testing can be further refined by allowing for more than two hypotheses to be tested at a time: results of this kind are particularly useful for model selection and, in fact, have produced a plethora of criteria (e.g. the Akaike Information Criterion, the Bayesian Information Criterion and the Minimum Description Length) for singling out the best statistical model out of a basket of competing ones. Generally speaking, optimization, here, seeks the maximum of a corrected likelihood function embodying the trade-off between accuracy and parsimony of a description. An example of the algorithms belonging to this third class is represented by Infomap; other examples are represented by the recipes employing the SBM within a Bayesian framework (see ref. ¹³ and the references therein).

With this paper, we place ourselves within the second research line and adopt a bottom-up approach that prescribes comparing any empirical network structure with the outcome of a properly-defined benchmark model. To this aim, we devise a unified framework for the detection of mesoscale structures based upon the score function called ‘surprise’, i.e. a p-value that can be assigned to any given partition of nodes, on both undirected and directed networks (our function is named ‘surprise’ since it generalizes the function proposed in ref. ¹⁴ for community detection; in our case, however, it indicates a proper probability and not its logarithm, as in ref. ¹⁴): while for binary community detection this is achieved by employing the binomial hypergeometric distribution^14,15, with bimodular structures like the bipartite and the core-periphery ones, one needs to consider its multinomial variant¹². Here, we aim at taking a step further, by extending the entire framework to the weighted case: as a result, we present a general, statistically grounded approach to the problem of detecting mesoscale structures on networks via a unified, surprise-based framework.

Other examples of the use of the hypergeometric distribution to carry out tests of hypothesis on networks can be found in the papers¹⁶ (where the authors introduce a method to provide a statistically validated, monopartite projection of a bipartite network—the considered null hypothesis encoding the heterogeneity of the system),¹⁷ (where the authors employ the same validation procedure to detect cores of communities within each set of nodes of a bipartite system) and¹⁸ (where the authors extend the framework proposed in the aforementioned references to carry out a statistical validation of motifs observed in hypergraphs). For a review on the use of the hypergeometric distribution for network analyses see ref. ¹⁹ and the references therein.

Results

Surprise has recently received a lot of attention: the advantages of employing such a score function have been extensively discussed in refs. ^12,14,15,20,21,22,23, where researchers have tested and compared its performance from a purely numerical perspective. A characterization of the statistical properties of surprise is, however, still missing: for this reason, we will, first, make an effort to translate the problem of detecting a given mesoscale network structure into a proper, exact significance test (to the best of our knowledge, the only other attempt of the kind, specifically to detect communities, is the one in ref. ²⁴) and, then, show how the rich, yet still underexplored, surprise-based formalism can properly answer such a question.

The basic equation underlying exact tests reads

$$\Pr (x\ge {x}^{* })=\mathop{\sum}\limits_{x\ge {x}^{* }}f(x)$$

(1)

and returns the probability of observing an outcome of the random variable X that is more extreme than the realized one, i.e. x*. In the setting above, f represents the distribution encoding the null hypothesis and Pr(x ≥ x*)—commonly known with the name of p value—answers the question ‘is the realized value X = x* compatible with the (null) hypothesis that X is distributed according to f?’.

The p-value constitutes the basic quantity for carrying out any significance test: hence, in what follows we will tackle the problem of detecting the signature of a (statistically significant) mesoscale network structure by individuating a specific test (or, equivalently, a suitable functional form for f).

Community detection in binary networks

The topic of nodes partitioning into densely-connected groups has traditionally received a lot of attention, with applications ranging from the analysis of social networks to the definition of recommendation systems^1,2. Within the surprise-based framework, binary community detection is carried out via the identification

$$f({l}_{\bullet }) \equiv {{{{{{{\rm{H}}}}}}}}({l}_{\bullet }| V,{V}_{\bullet },L)\\ =\frac{{\prod }_{i = \bullet ,\circ }\big({{{V}_{i}}\atop {{l}_{i}}}\big)}{\big({{V}\atop {L}}\big)}=\frac{\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{{V}_{\circ }}\atop {{l}_{\circ }}}\big)}{\big({{V}\atop {L}}\big)}=\frac{\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{V-{V}_{\bullet }}\atop {L-{l}_{\bullet }}}\big)}{\big({{V}\atop {L}}\big)}$$

(2)

i.e. by calculating the p value

$${{{{{{{\mathscr{S}}}}}}}}\equiv \mathop{\sum}\limits_{{l}_{\bullet }\ge {l}_{\bullet }^{* }}f({l}_{\bullet })$$

(3)

of a binomial hypergeometric distribution whose parameters read as above. In this formalism, the • subscript will be meant to indicate quantities that are internal to communities, while the ∘ subscript will be meant to indicate quantities that are external to communities. More precisely, the binomial coefficient $\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)$ enumerates the number of ways l_• links can be redistributed within communities, i.e. over the available V_• node pairs, while the binomial coefficient $\big({{{V}_{\circ }}\atop {{l}_{\circ }}}\big)$ enumerates the number of ways the remaining l_∘ = L − l_• links can be redistributed between communities, i.e. over the remaining V_∘ = V − V_• node pairs. Notice that, although l_• is ‘naturally’ bounded by the value V_•, it cannot exceed L, whence the usual requirement ${l}_{\bullet }\in [{l}_{\bullet }^{* },\min \{L,{V}_{\bullet }\}]$.

From a merely statistical point of view, surprise interprets a network as a population of V node pairs, L of which have been drawn; out of the L extracted ones, l_• node pairs have the desired feature of being internal to communities. Hence, for a given partition of nodes, ${{{{{{{\mathscr{S}}}}}}}}$ quantifies the probability of observing at least ${l}_{\bullet }^{* }$ successes (i.e. intra-cluster edges) out of L draws: the lower this probability, the ‘more surprising’ the observation of the corresponding partition, hence the better the partition itself.
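The tail sum of Eqs. (2)-(3) can be evaluated exactly with integer arithmetic. The snippet below is a minimal sketch, not the code attached to the paper: the function name `binary_surprise` and the toy network (two triangles joined by a single link) are assumptions introduced here for illustration.

```python
from fractions import Fraction
from math import comb


def binary_surprise(V, V_in, L, l_in_obs):
    """Exact tail probability P(l_in >= l_in_obs) of Eqs. (2)-(3).

    V        total number of node pairs
    V_in     number of intra-community node pairs
    L        total number of links
    l_in_obs observed number of intra-community links
    """
    numerator = 0
    # l_in can exceed neither L nor V_in
    for l_in in range(l_in_obs, min(L, V_in) + 1):
        # math.comb(n, k) returns 0 when k > n, so infeasible terms vanish
        numerator += comb(V_in, l_in) * comb(V - V_in, L - l_in)
    return Fraction(numerator, comb(V, L))


# Hypothetical toy case: two triangles joined by one link, with the two
# triangles taken as communities.
# N = 6 nodes -> V = 15 pairs, V_in = 3 + 3 = 6, L = 7, l_in = 6.
print(binary_surprise(V=15, V_in=6, L=7, l_in_obs=6))  # 1/715
```

Exact fractions are used only because the example is tiny; for large networks one would presumably work with logarithms or with an asymptotic expression instead.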

Bimodular structures detection in binary networks

The surprise-based framework can be easily extended to detect what can be called bimodular structures, a term that will be used to compactly indicate core-periphery^25,26,27 and bipartite structures^28,29. The reason for adopting such a terminology lies in the evidence that both kinds of structures are defined by bimodular partitions, i.e. partitions of nodes into two different groups.

As shown elsewhere¹², the issue of detecting binary bimodular structures can be addressed by considering a multivariate (or multinomial) hypergeometric distribution, i.e. by identifying

$$f({l}_{\bullet },{l}_{\circ }) \equiv {{{{{{{\rm{MH}}}}}}}}({l}_{\bullet },{l}_{\circ }| V,{V}_{\bullet },{V}_{\circ },L)\\ =\frac{{\prod }_{i = \bullet ,\circ ,\top }\big({{{V}_{i}}\atop {{l}_{i}}}\big)}{\big({{V}\atop {L}}\big)}=\frac{\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{{V}_{\circ }}\atop {{l}_{\circ }}}\big)\big({{{V}_{\top }}\atop {{l}_{\top }}}\big)}{\big({{V}\atop {L}}\big)}\\ =\frac{\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{{V}_{\circ }}\atop {{l}_{\circ }}}\big)\big({{V-({V}_{\bullet }+{V}_{\circ })}\atop {L-({l}_{\bullet }+{l}_{\circ })}}\big)}{\big({{V}\atop {L}}\big)}$$

(4)

where V_⊤ ≡ V − (V_• + V_∘) indicates the number of node pairs between the modules • and ∘ and l_⊤ ≡ L − (l_• + l_∘) indicates the number of links that must be assigned therein. While the binomial coefficient $\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)$ enumerates the number of ways l_• links can be redistributed within the first module (e.g. the core portion) and the binomial coefficient $\big({{{V}_{\circ }}\atop {{l}_{\circ }}}\big)$ enumerates the number of ways l_∘ links can be redistributed within the second module (e.g. the periphery portion), the third binomial coefficient $\big({{V-({V}_{\bullet }+{V}_{\circ })}\atop {L-({l}_{\bullet }+{l}_{\circ })}}\big)$ enumerates the number of ways the remaining L − (l_• + l_∘) links can be redistributed between the first and the second module, i.e. over the remaining V − (V_• + V_∘) node pairs.

This choice induces the definition of the binary bimodular surprise

$${{{{{{{{\mathscr{S}}}}}}}}}_{\!/\!/}\equiv \mathop{\sum}\limits_{{l}_{\bullet }\ge {l}_{\bullet }^{* }}\mathop{\sum}\limits_{{l}_{\circ }\ge {l}_{\circ }^{* }}f({l}_{\bullet },{l}_{\circ });$$

(5)

analogously to the univariate case, l_• and l_∘ are naturally bounded by the values V_• and V_∘ — notice, however, that the sum l_• + l_∘ cannot exceed L (although it may not reach such a value, e.g. in case V_• + V_∘ < L).
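The double sum of Eq. (5) can be sketched along the same lines. The function name `bimodular_surprise` and the toy core-periphery network (a triangular core with one pendant node attached to each core node) are assumptions made here for illustration; the summation limits follow the bounds stated above.

```python
from fractions import Fraction
from math import comb


def bimodular_surprise(V, V_core, V_per, L, l_core_obs, l_per_obs):
    """Exact double tail sum of Eqs. (4)-(5) for a two-module partition."""
    V_top = V - (V_core + V_per)  # node pairs between the two modules
    numerator = 0
    for l_core in range(l_core_obs, min(L, V_core) + 1):
        # l_core + l_per cannot exceed L
        for l_per in range(l_per_obs, min(L - l_core, V_per) + 1):
            l_top = L - (l_core + l_per)
            numerator += (comb(V_core, l_core)
                          * comb(V_per, l_per)
                          * comb(V_top, l_top))
    return Fraction(numerator, comb(V, L))


# Hypothetical toy case: 3 fully connected core nodes, each with one pendant
# periphery node. N = 6 -> V = 15, V_core = 3, V_per = 3, L = 6,
# l_core = 3 (core links), l_per = 0 (no periphery-periphery links).
print(bimodular_surprise(V=15, V_core=3, V_per=3, L=6,
                         l_core_obs=3, l_per_obs=0))  # 4/91
```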

Community detection in weighted networks

Within the surprise-based framework, the problem of detecting binary communities has been rephrased as an aleatory experiment whose random variable is the number of links within communities. Interestingly enough, such an experiment can be easily mapped into a counting problem, allowing us to interpret ${{{{{{{\mathscr{S}}}}}}}}$ as being proportional to the number of configurations whose number of internal links (i.e. within communities) is at least as large as the observed one.

When dealing with weighted networks, we would like to proceed along similar guidelines and consider the total, internal weight as our new random variable, to be redistributed across the available node pairs. Adopting this approach has three major consequences: (1) weights must be considered as composed by an integer number of binary links; (2) each node pair must be allowed to be occupied by more than one link; (3) the total weight must be allowed to vary even beyond the network size (when handling real-world networks, the case W ≫ V is often encountered).

The proper setting to define an aleatory experiment satisfying the requests above is provided by the so-called stars and bars model, a combinatorial technique that has been introduced to handle the counting of configurations with multiple occupancies. Basically, the problem of counting in how many ways w_• particles (our links) can be redistributed among V_• boxes (our node pairs), while allowing more than one particle to occupy each box, can be tackled by allowing both the particles and the bars delimiting the boxes to be permuted³⁰. Since V_• boxes are delimited by V_• − 1 bars, a term like $\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)$ is needed.

In order to better grasp the meaning of such a term, let us consider a simple example. Let us imagine observing a network with three nodes and two links, carrying a weight of 1 and 2, respectively. Now, were we interested in a purely binary analysis, we may ask ourselves in how many ways we could place the two links among the $\frac{N(N-1)}{2}=\frac{3(3-1)}{2}=3$ available pairs: the answer is provided by the binary binomial coefficient $\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)=\big({{3}\atop {2}}\big)=3$. The implicit assumption we make is that the two links must not occupy the same node pair, otherwise the total number of connections would not be preserved.

This perspective changes from the purely weighted point of view. Since we are now interested in preserving just the total weight of our network, irrespectively of the number of connections it is placed upon, the number of admissible configurations becomes precisely $\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)=\big({{2+3}\atop {3}}\big)=10$. Such a number is larger than before since, now, weights are disaggregated into binary links and multiple occupations of the latter ones are allowed.
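The two counts of the three-node example can be checked by brute-force enumeration. The sketch below uses only the standard library; the variable names are illustrative.

```python
from itertools import combinations, combinations_with_replacement
from math import comb

# The three node pairs of a three-node network
pairs = list(combinations(range(3), 2))

# Binary view: 2 links placed on distinct node pairs
binary_configs = list(combinations(pairs, 2))

# Weighted view: a total weight of 3, split into unit links that may share
# the same node pair (stars and bars)
weighted_configs = list(combinations_with_replacement(pairs, 3))

print(len(binary_configs), comb(3, 2))            # 3 3
print(len(weighted_configs), comb(3 + 3 - 1, 3))  # 10 10
```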

The considerations above lead us to generalize the community detection problem to the weighted case by identifying

$$f({w}_{\bullet }) \equiv {{{{{{{\rm{NH}}}}}}}}({w}_{\bullet }| V+W,W,{V}_{\bullet })\\ =\frac{{\prod }_{i = \bullet ,\circ }\big({{{V}_{i}+{w}_{i}-1}\atop {{w}_{i}}}\big)}{\big({{V+W-1}\atop {W}}\big)}=\frac{\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)\big({{{V}_{\circ }+{w}_{\circ }-1}\atop {{w}_{\circ }}}\big)}{\big({{V+W-1}\atop {W}}\big)}\\ =\frac{\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)\big({{(V-{V}_{\bullet })+(W-{w}_{\bullet })-1}\atop {W-{w}_{\bullet }}}\big)}{\big({{V+W-1}\atop {W}}\big)}$$

(6)

i.e. by replacing the binomial hypergeometric distribution considered in the purely binary case with a negative hypergeometric distribution, a choice inducing the definition of the weighted surprise

$${{{{{{{\mathscr{W}}}}}}}}\equiv \mathop{\sum}\limits_{{w}_{\bullet }\ge {w}_{\bullet }^{* }}f({w}_{\bullet })$$

(7)

where the binomial coefficient $\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)=\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{V}_{\bullet }-1}}\big)$ enumerates the number of ways w_• links can be redistributed within communities, i.e. over the available V_• node pairs, and the binomial coefficient $\big({{{V}_{\circ }+{w}_{\circ }-1}\atop {{w}_{\circ }}}\big)=\big({{{V}_{\circ }+{w}_{\circ }-1}\atop {{V}_{\circ }-1}}\big)$ enumerates the number of ways the remaining w_∘ = W − w_• links can be redistributed between communities, i.e. over the remaining V_∘ = V − V_• node pairs. Differently from the binary case, the sum ranges up to the empirical weight of the network, i.e. ${w}_{\bullet }\in [{w}_{\bullet }^{* },W]$.
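Eqs. (6)-(7) can be sketched in the same way as the binary case, with stars-and-bars coefficients replacing the plain binomial ones. The names `multiset` and `weighted_surprise`, the convention adopted for zero boxes and the toy weights are assumptions made here for illustration.

```python
from fractions import Fraction
from math import comb


def multiset(boxes, units):
    """Ways to place `units` indistinguishable unit links on `boxes` node
    pairs with multiple occupancy allowed (stars and bars)."""
    if boxes == 0:
        # Assumed convention: no node pairs can host only zero weight
        return 1 if units == 0 else 0
    return comb(boxes + units - 1, units)


def weighted_surprise(V, V_in, W, w_in_obs):
    """Exact tail probability P(w_in >= w_in_obs) of Eqs. (6)-(7)."""
    numerator = 0
    for w_in in range(w_in_obs, W + 1):  # the sum ranges up to W
        numerator += multiset(V_in, w_in) * multiset(V - V_in, W - w_in)
    return Fraction(numerator, multiset(V, W))


# Hypothetical toy case: two triangles whose six links carry weight 2 each,
# joined by one link of weight 1 -> V = 15, V_in = 6, W = 13, w_in = 12.
value = weighted_surprise(V=15, V_in=6, W=13, w_in_obs=12)
print(float(value))
```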

Bimodular structures detection in weighted networks

Let us now introduce the third generalization of the surprise-based formalism: following the same line of reasoning that led us to approach the detection of binary bimodular structures by considering the multinomial analogue of the distribution introduced for binary community detection, we are now led to focus on the multinomial (or multivariate) negative hypergeometric distribution, i.e.

$$f({w}_{\bullet },{w}_{\circ }) \equiv {{{{{{{\rm{MNH}}}}}}}}({w}_{\bullet },{w}_{\circ }| V+W,W,{V}_{\bullet },{V}_{\circ })\\ =\frac{{\prod }_{i = \bullet ,\circ ,\top }\big({{{V}_{i}+{w}_{i}-1}\atop {{w}_{i}}}\big)}{\big({{V+W-1}\atop {W}}\big)}\\ =\frac{\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)\big({{{V}_{\circ }+{w}_{\circ }-1}\atop {{w}_{\circ }}}\big)\big({{{V}_{\top }+{w}_{\top }-1}\atop {{w}_{\top }}}\big)}{\big({{V+W-1}\atop {W}}\big)}\\ =\frac{\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)\big({{{V}_{\circ }+{w}_{\circ }-1}\atop {{w}_{\circ }}}\big)\big({{V-({V}_{\bullet }+{V}_{\circ })+W-({w}_{\bullet }+{w}_{\circ })-1}\atop {W-({w}_{\bullet }+{w}_{\circ })}}\big)}{\big({{V+W-1}\atop {W}}\big)}$$

(8)

Here, the binomial coefficient $\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)$ enumerates the number of ways w_• links can be redistributed within the first module (e.g. the core portion), the binomial coefficient $\big({{{V}_{\circ }+{w}_{\circ }-1}\atop {{w}_{\circ }}}\big)$ enumerates the number of ways w_∘ links can be redistributed within the second module (e.g. the periphery portion) and the binomial coefficient $\big({{V-({V}_{\bullet }+{V}_{\circ })+W-({w}_{\bullet }+{w}_{\circ })-1}\atop {W-({w}_{\bullet }+{w}_{\circ })}}\big)$ enumerates the number of ways the remaining w_⊤ ≡ W − (w_• + w_∘) links can be redistributed between the first and the second module, i.e. over the remaining V_⊤ ≡ V − (V_• + V_∘) node pairs. Such a position induces the definition of the weighted bimodular surprise

$${{{{{{{{\mathscr{W}}}}}}}}}_{\!/\!/}\equiv \mathop{\sum}\limits_{{w}_{\bullet }\ge {w}_{\bullet }^{* }}\mathop{\sum}\limits_{{w}_{\circ }\ge {w}_{\circ }^{* }}f({w}_{\bullet },{w}_{\circ });$$

(9)

as for the (weighted) community detection, weights are understood as integer numbers—equivalently, as composed by an integer number of binary links. For what concerns the limits of the summations, w_• and w_∘ are naturally bounded by W; notice, however, that the sum w_• + w_∘ itself cannot exceed such a value.
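A sketch of the double sum in Eq. (9) follows, with the constraint w_• + w_∘ ≤ W enforced by the inner loop limit. The function names, the zero-box convention and the toy weights are assumptions made here for illustration.

```python
from fractions import Fraction
from math import comb


def multiset(boxes, units):
    """Stars-and-bars count: `units` unit links over `boxes` node pairs."""
    if boxes == 0:
        # Assumed convention: no node pairs can host only zero weight
        return 1 if units == 0 else 0
    return comb(boxes + units - 1, units)


def weighted_bimodular_surprise(V, V_core, V_per, W, w_core_obs, w_per_obs):
    """Exact double tail sum of Eqs. (8)-(9)."""
    V_top = V - (V_core + V_per)
    numerator = 0
    for w_core in range(w_core_obs, W + 1):
        # w_core + w_per cannot exceed W
        for w_per in range(w_per_obs, W - w_core + 1):
            w_top = W - (w_core + w_per)
            numerator += (multiset(V_core, w_core)
                          * multiset(V_per, w_per)
                          * multiset(V_top, w_top))
    return Fraction(numerator, multiset(V, W))


# Hypothetical toy case: triangular core with links of weight 2 (w_core = 6),
# three pendant periphery nodes attached with weight 1 (w_top = 3), w_per = 0.
value = weighted_bimodular_surprise(V=15, V_core=3, V_per=3, W=9,
                                    w_core_obs=6, w_per_obs=0)
print(float(value))
```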

Enhanced community detection

The recipe to detect communities on weighted networks can be further refined to account for the information encoded into the total number of links, besides the one provided by the total weight, by combining two of the distributions introduced above. To this aim, let us proceed in a two-step fashion: first, let us recall that the number of ways L links can be placed among V node pairs, in such a way that l_• connections are internal to the clusters (while the remaining L − l_• ones are, instead, external) is precisely

$${{{{{{{\rm{H}}}}}}}}({l}_{\bullet }| V,{V}_{\bullet },L)=\frac{\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{V-{V}_{\bullet }}\atop {L-{l}_{\bullet }}}\big)}{\big({{V}\atop {L}}\big)};$$

(10)

now, for each of the binary configurations listed above, W − L links remain to be assigned: while w_• − l_• of them must be placed within the clusters, on top of the l_• available internal links, the remaining (W − L) − (w_• − l_•) ones must be placed between the clusters, on top of the L − l_• available inter-cluster connections. Hence, the conditional negative hypergeometric distribution reading

$${{{{{{{\rm{NH}}}}}}}}({w}_{\bullet }| W,W-L,{l}_{\bullet }) = \frac{\big({{{l}_{\bullet }+({w}_{\bullet }-{l}_{\bullet })-1}\atop {{w}_{\bullet }-{l}_{\bullet }}}\big)\big({{(L-{l}_{\bullet })+(W-L)-({w}_{\bullet }-{l}_{\bullet })-1}\atop {(W-L)-({w}_{\bullet }-{l}_{\bullet })}}\big)}{\big({{L+(W-L)-1}\atop {W-L}}\big)}$$

(11)

remains naturally defined; now, combining the two distributions above, simplifying and re-arranging, the generic term of the enhanced hypergeometric distribution can be rewritten as

$${{{{{{{\rm{EH}}}}}}}}({l}_{\bullet },{w}_{\bullet }| V,{V}_{\bullet },L,W) = {{{{{{{\rm{H}}}}}}}}({l}_{\bullet }| V,{V}_{\bullet },L)\cdot \\ \cdot {{{{{{{\rm{NH}}}}}}}}({w}_{\bullet }| W,W-L,{l}_{\bullet })\\ = \frac{\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{{V}_{\circ }}\atop {{l}_{\circ }}}\big)}{\big({{V}\atop {L}}\big)}\cdot \frac{\big({{{w}_{\bullet }-1}\atop {{w}_{\bullet }-{l}_{\bullet }}}\big)\big({{{w}_{\circ }-1}\atop {{w}_{\circ }-{l}_{\circ }}}\big)}{\big({{W-1}\atop {W-L}}\big)};$$

(12)

with a clear meaning of the symbols. An analytical characterization of it is provided in Supplementary Note 3: for the moment, let us simply notice that the definition provided above works for the values 0 < l_• < L.

By posing f(l_•, w_•) ≡ EH(l_•, w_•∣V, V_•, L, W), our distribution induces the definition of the enhanced surprise, i.e.

$${{{{{{{\mathscr{E}}}}}}}}\equiv \mathop{\sum}\limits_{{l}_{\bullet }\ge {l}_{\bullet }^{* }}\mathop{\sum}\limits_{{w}_{\bullet }\ge {w}_{\bullet }^{* }}f({l}_{\bullet },{w}_{\bullet });$$

(13)

although l_• and w_• − l_• are naturally bounded by V_• and W − L, respectively, the former one cannot exceed L. In order to better understand how the enhanced surprise works, let us consider again the aforementioned example: given a network with three nodes and two links, carrying a weight of 1 and 2, respectively, we observe $\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)=\big({{3}\atop {2}}\big)=3$ (purely binary) configurations with exactly the same number of links and $\big({{{V}_{\bullet }+{w}_{\bullet }-1}\atop {{w}_{\bullet }}}\big)=\big({{2+3}\atop {3}}\big)=10$ (purely weighted) configurations with exactly the same total weight. If we, now, constrain both the total number of links and the total weight of the network, the number of admissible configurations becomes $\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{{w}_{\bullet }-1}\atop {{w}_{\bullet }-{l}_{\bullet }}}\big)=\big({{3}\atop {2}}\big)\big({{3-1}\atop {3-2}}\big)=3\cdot 2=6$, as it can be easily verified upon explicitly listing them. Naturally, the configurations admissible by the enhanced surprise are a subset of the configurations admissible by the weighted surprise, i.e. precisely the ones with the desired number of links (see also Fig. 1).
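The count of six configurations, and the enhanced surprise of Eqs. (12)-(13), can be sketched as follows. The helper `compositions`, the function name `enhanced_surprise` and the toy network are assumptions made here; in particular, the handling of the boundary cases l_• = 0 and l_• = L (which fall outside the range 0 < l_• < L stated above) is a convention adopted for this sketch only.

```python
from fractions import Fraction
from itertools import combinations, combinations_with_replacement
from math import comb

# Brute-force check of the three-node example: total weight 3 placed on
# exactly 2 distinct node pairs, each link carrying weight >= 1.
pairs = list(combinations(range(3), 2))
enhanced_configs = [c for c in combinations_with_replacement(pairs, 3)
                    if len(set(c)) == 2]
print(len(enhanced_configs))  # 6


def compositions(w, l):
    """Ways to split a total weight w over l links of weight >= 1 each."""
    if l == 0:
        return 1 if w == 0 else 0  # assumed boundary convention
    if w < l:
        return 0
    return comb(w - 1, w - l)


def enhanced_surprise(V, V_in, L, W, l_in_obs, w_in_obs):
    """Exact double tail sum of Eqs. (12)-(13)."""
    numerator = 0
    for l_in in range(l_in_obs, min(L, V_in) + 1):
        l_out = L - l_in
        binary = comb(V_in, l_in) * comb(V - V_in, l_out)
        for w_in in range(w_in_obs, W + 1):
            # infeasible weight splits contribute 0 via compositions()
            numerator += (binary * compositions(w_in, l_in)
                          * compositions(W - w_in, l_out))
    return Fraction(numerator, comb(V, L) * compositions(W, L))


# Hypothetical toy case: two triangles (link weight 2) joined by one link of
# weight 1 -> V = 15, V_in = 6, L = 7, W = 13, l_in = 6, w_in = 12.
print(enhanced_surprise(V=15, V_in=6, L=7, W=13,
                        l_in_obs=6, w_in_obs=12))  # 1/1430
```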

**Fig. 1: Graphical comparison of different ways of counting admissible configurations when dealing with surprise.**

Enhanced bimodular structures detection

The last generalization of surprise concerns its use for the detection of bimodular structures within the enhanced framework. This amounts to considering the following multinomial variant of the enhanced hypergeometric distribution

$$ {{{{{{{\rm{MEH}}}}}}}}({l}_{\bullet },{l}_{\circ },{w}_{\bullet },{w}_{\circ }| V,{V}_{\bullet },{V}_{\circ },L,W)\\ \quad=\frac{\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{{V}_{\circ }}\atop {{l}_{\circ }}}\big)\big({{{V}_{\top }}\atop {{l}_{\top }}}\big)}{\big({{V}\atop {L}}\big)}\cdot \frac{\big({{{w}_{\bullet }-1}\atop {{w}_{\bullet }-{l}_{\bullet }}}\big)\big({{{w}_{\circ }-1}\atop {{w}_{\circ }-{l}_{\circ }}}\big)\big({{{w}_{\top }-1}\atop {{w}_{\top }-{l}_{\top }}}\big)}{\big({{W-1}\atop {W-L}}\big)}\\ \quad=\frac{\big({{{V}_{\bullet }}\atop {{l}_{\bullet }}}\big)\big({{{V}_{\circ }}\atop {{l}_{\circ }}}\big)\big({{V-({V}_{\bullet }+{V}_{\circ })}\atop {L-({l}_{\bullet }+{l}_{\circ })}}\big)}{\big({{V}\atop {L}}\big)}\cdot \frac{\big({{{w}_{\bullet }-1}\atop {{w}_{\bullet }-{l}_{\bullet }}}\big)\big({{{w}_{\circ }-1}\atop {{w}_{\circ }-{l}_{\circ }}}\big)\big({{W-({w}_{\bullet }+{w}_{\circ })-1}\atop {(W-L)-(({w}_{\bullet }+{w}_{\circ })-({l}_{\bullet }+{l}_{\circ }))}}\big)}{\big({{W-1}\atop {L-1}}\big)}$$

(14)

where V_⊤ ≡ V − (V_• + V_∘) indicates the number of node pairs between the modules • and ∘ and l_⊤ ≡ L − (l_• + l_∘) indicates the number of links that must be assigned therein. An analytical characterization of it is provided in Supplementary Note 3: for the moment, let us simply notice that the definition provided above works for the values 0 < l_•, l_∘ < L. The position f(l_•, l_∘, w_•, w_∘) ≡ MEH(l_•, l_∘, w_•, w_∘∣V, V_•, V_∘, L, W) induces the definition of the enhanced bimodular surprise

$${{{{{{{{\mathscr{E}}}}}}}}}_{\!/\!/}\equiv \mathop{\sum}\limits_{{l}_{\bullet }\ge {l}_{\bullet }^{* }}\mathop{\sum}\limits_{{l}_{\circ }\ge {l}_{\circ }^{* }}\mathop{\sum}\limits_{{w}_{\bullet }\ge {w}_{\bullet }^{* }}\mathop{\sum}\limits_{{w}_{\circ }\ge {w}_{\circ }^{* }}f({l}_{\bullet },{l}_{\circ },{w}_{\bullet },{w}_{\circ }).$$

(15)

Notice that l_• and l_∘ are naturally bounded by V_• and V_∘: still, their sum cannot exceed L; analogously, w_• − l_• and w_∘ − l_∘ are naturally bounded by W − L: still, their sum cannot exceed W − L.

As for its binomial counterpart, the expression of the MEH can be rearranged in a term-by-term fashion, in such a way that the module-specific binomial coefficients can be grouped together. Upon doing so, it becomes clearer that the MEH counts the number of ways w_• − l_• links can be placed on top of the l_• binary links characterizing the connectance of the • module, times the number of ways w_∘ − l_∘ links can be placed on top of the l_∘ binary links characterizing the connectance of the ∘ module, times the number of ways the remaining W − (w_• + w_∘) − (L − (l_• + l_∘)) links can be placed on top of the L − (l_• + l_∘) binary links characterizing the connectance of the third module.
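The term-by-term reading of the MEH given above translates into a four-fold sum. The sketch below is illustrative only: the names `compositions` and `enhanced_bimodular_surprise`, the boundary convention for modules hosting no links and the toy core-periphery weights are assumptions, not the authors' implementation.

```python
from fractions import Fraction
from math import comb


def compositions(w, l):
    """Ways to place w - l extra unit links on top of l binary links,
    i.e. to split a weight w over l links of weight >= 1 each."""
    if l == 0:
        return 1 if w == 0 else 0  # assumed boundary convention
    if w < l:
        return 0
    return comb(w - 1, w - l)


def enhanced_bimodular_surprise(V, V_core, V_per, L, W,
                                l_core_obs, l_per_obs,
                                w_core_obs, w_per_obs):
    """Exact four-fold tail sum of Eqs. (14)-(15)."""
    V_top = V - (V_core + V_per)
    numerator = 0
    for l_c in range(l_core_obs, min(L, V_core) + 1):
        for l_p in range(l_per_obs, min(L - l_c, V_per) + 1):
            l_t = L - (l_c + l_p)
            binary = comb(V_core, l_c) * comb(V_per, l_p) * comb(V_top, l_t)
            if binary == 0:
                continue
            for w_c in range(w_core_obs, W + 1):
                for w_p in range(w_per_obs, W - w_c + 1):
                    w_t = W - (w_c + w_p)
                    numerator += (binary
                                  * compositions(w_c, l_c)
                                  * compositions(w_p, l_p)
                                  * compositions(w_t, l_t))
    return Fraction(numerator, comb(V, L) * compositions(W, L))


# Hypothetical toy case: triangular core (link weight 2), three pendant
# periphery nodes (link weight 1): V = 15, L = 6, W = 9.
value = enhanced_bimodular_surprise(15, 3, 3, 6, 9,
                                    l_core_obs=3, l_per_obs=0,
                                    w_core_obs=6, w_per_obs=0)
print(float(value))
```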

Table 1 in the Methods section gathers all the variants of the surprise-based formalism, illustrating both the full and the asymptotic expression for each of them. To sum up, detecting a weighted mesoscale structure implies considering the negative version of the probability mass function working in the corresponding binary case (e.g. moving from the hypergeometric to the negative hypergeometric one); detecting a bimodular structure, instead, implies considering the multinomial version of the probability mass function working in the corresponding binary case (e.g. moving from the binomial hypergeometric to the multinomial one).

Table 1: All the generalizations of the surprise-based formalism proposed in the present paper.

Comparing methods for the detection of mesoscale structures

Let us now carry out a comparison among some of the methods designed to detect mesoscale structures (for a consistency check of our surprise-based formalism and a theoretical comparison between modularity and surprise, we redirect the interested reader to the Supplementary Notes 4 and 5). To this aim, let us consider two popular algorithms for mesoscale structures detection, i.e. modularity maximization (Q has been considered in its full definition, e.g. $\langle {a}_{ij}\rangle ={p}_{ij}=\frac{{k}_{i}{k}_{j}}{2L}$, ∀ i < j for binary undirected configurations) and Infomap³¹. Upon doing so, we are able to compare one algorithm per class, i.e. modularity for the first class, surprise for the second class and Infomap for the third class.

To this aim, we have focused on different kinds of benchmarks, i.e. classes of synthetic networks with well-defined planted partitions, in order to assess how well a given algorithm recovers the imposed partition. To quantify the quality of the partition retrieved by each algorithm, we have followed ref. ¹⁷ and employed three different indices (see the Methods section for their definition): the normalized mutual information (NMI), the adjusted Rand index (ARI) and the adjusted Wallace index (AWI).
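As an illustration of how two of these indices can be computed from a pair of label assignments, the sketch below implements the NMI and the ARI with the standard library only. The normalization chosen for the NMI (twice the mutual information divided by the sum of the two entropies) and the Hubert-Arabie form of the ARI are assumptions: the definitions actually used in the paper are those given in the Methods section. The AWI is left out for the same reason.

```python
from collections import Counter
from math import comb, log


def nmi(a, b):
    """NMI = 2 I(A;B) / (H(A) + H(B)) for two label lists of equal length."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(c / n * log(c / n) for c in pa.values())
    h_b = -sum(c / n * log(c / n) for c in pb.values())
    mi = sum(c / n * log(c * n / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    # two single-cluster partitions carry no entropy: treat them as identical
    return 1.0 if h_a + h_b == 0 else 2 * mi / (h_a + h_b)


def ari(a, b):
    """Adjusted Rand index computed from the contingency table."""
    n = len(a)
    sum_ab = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ab - expected) / (max_index - expected)
```

Both functions are invariant under a relabelling of the clusters, so a planted partition and a retrieved one can be compared directly even if their labels differ.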

First, let us inspect the performance of modularity, surprise and Infomap in detecting cliques arranged in a ring. Specifically, we have considered seven different ring-like configurations, each one linking twenty binary cliques of the same size (i.e. K₃, K₄, K₅, K₈, K₁₀, K₁₅, K₂₀). As Fig. 2 reveals, surprise always recovers the planted partition; on the other hand, modularity maximization misses the partitions with K₃ and K₄ and Infomap misses the partition with K₃, a result that may be a consequence of the resolution limit affecting both of these algorithms.
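A ring-of-cliques configuration of this kind can be generated with a few lines of code. The sketch below is a minimal illustration that assumes consecutive cliques are joined by a single link and that at least three cliques are requested; the function name and the node numbering are hypothetical.

```python
from itertools import combinations


def ring_of_cliques(n_cliques: int, clique_size: int):
    """Return (edges, planted) for n_cliques copies of K_m joined in a ring.

    Nodes are integers; clique c owns nodes c*m, ..., c*m + m - 1.
    Assumes n_cliques >= 3 and clique_size >= 2, with consecutive cliques
    joined by exactly one link.
    """
    m = clique_size
    edges = []
    planted = {}
    for c in range(n_cliques):
        nodes = range(c * m, (c + 1) * m)
        edges.extend(combinations(nodes, 2))   # all links inside the clique
        for v in nodes:
            planted[v] = c
        # link the last node of this clique to the first node of the next one
        nxt = (c + 1) % n_cliques
        edges.append((c * m + m - 1, nxt * m))
    return edges, planted


# twenty triangles arranged in a ring, as in the smallest configuration
edges, planted = ring_of_cliques(20, 3)
```

The returned `planted` dictionary is the partition an algorithm is expected to recover, i.e. one community per clique.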

**Fig. 2: Performance of community detection algorithms on ‘homogeneous’ rings of cliques.**

Let us now ask ourselves whether the presence of weights affects the detection of mesoscale structures. Generally speaking, the answer is yes, as the comparison between the ring of binary cliques and the ring of weighted cliques shows. In particular, the result that surprise minimization is able to discriminate the inter-linked cliques changes once the weight of the links connecting any two cliques is raised: in fact, the algorithm now reveals two tightly-connected pairs of cliques as communities. We explicitly notice that the results shown in Fig. 3 also depend on the relative magnitude of the weights within and between cliques: however, the result holds true as long as the inter-clique weight is up to two orders of magnitude larger than the intra-clique one.

**Fig. 3: Comparison between the ‘ring of binary cliques’ and the ‘ring of weighted cliques’ cases.**

In order to expand the set of comparisons, we have focused on two different kinds of well-established benchmarks, i.e. the Lancichinetti–Fortunato–Radicchi (LFR) one and Aldecoa’s relaxed-caveman (RC) one³² (see Supplementary Note 6 for their definition and a pictorial example of them).

Results on specific implementations of the LFR benchmark are shown in Fig. 4. Infomap is, generally speaking, a strong performer; as evident upon looking at the first row, however, its performance decreases abruptly as the mixing parameter exceeds a threshold value that depends on the particular setting of the LFR benchmark. Modularity, instead, seems to be more robust (i.e. its performance degrades less rapidly as the parameter μ_t increases), although the resolution limit manifests itself when configurations with small communities are considered. Overall, the performance of surprise seems to constitute a good compromise between the robustness of modularity and the steadily high accuracy of Infomap. We would also like to stress that surprise competes with modularity although it employs much less information than the latter: in fact, while the benchmark employed by modularity coincides with the (sparse version of the) Configuration Model, and hence encodes the information on the entire degree sequence, surprise compares the RGM with the SBM, hence employing only the information on the link density, both in a global and in a block-wise fashion. Surprise becomes the best performer when binary, directed configurations are considered (see the second row of Fig. 4): while the performance of modularity starts decreasing as soon as the value of μ_t is raised and NMI_Infomap ≃ 0 when μ_t crosses the value of 0.6, the performance of surprise degrades much more slowly; in fact, for some instances of the LFR benchmark, it achieves a large value of NMI even for values μ_t ≥ 0.8.
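To illustrate how little information the binary surprise needs, the sketch below evaluates it for a given partition of a binary undirected graph as the upper tail of a hypergeometric distribution: the only inputs are the number of node pairs V, the number of intra-cluster pairs V_•, the number of links L and the number of intra-cluster links l_•. This is the form commonly used for community detection and is offered as an assumption about the binary case; Table 1 remains the reference for the expressions adopted in the paper. The function name is illustrative.

```python
from collections import Counter
from math import comb


def binary_surprise(edges, labels):
    """Hypergeometric p-value of a partition of a binary undirected graph.

    edges  : iterable of (i, j) pairs, without self-loops or duplicates
    labels : dict mapping every node to its cluster label
    """
    edges = list(edges)
    n = len(labels)
    V = comb(n, 2)                                   # all node pairs
    V_in = sum(comb(s, 2)                            # intra-cluster pairs
               for s in Counter(labels.values()).values())
    L = len(edges)                                   # all links
    l_in = sum(1 for i, j in edges if labels[i] == labels[j])
    # probability of observing at least l_in intra-cluster links
    tail = sum(comb(V_in, i) * comb(V - V_in, L - i)
               for i in range(l_in, min(L, V_in) + 1))
    return tail / comb(V, L)
```

Smaller values indicate a more significant partition, so a search over partitions would minimize this quantity. Exact integer arithmetic is used throughout, which is convenient for small graphs but becomes slow on large ones.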

**Fig. 4: Comparison of different community detection algorithms on the Lancichinetti–Fortunato–Radicchi (LFR) benchmark.**

Let us now comment on the performance of our algorithms when weighted configurations are considered. The results are, again, shown in Fig. 4 (note that we have kept one of the two parameters fixed and studied the dependence of the NMI on the other: specifically, we have frozen the topological mixing parameter and studied the dependence of the results on μ_w, thus inspecting the performance of our algorithms as the weights are redistributed on a fixed topology). Infomap is, again, a strong performer, although its performance still decreases abruptly as μ_w exceeds a threshold value depending on the particular setting of the LFR benchmark; modularity, instead, performs worse than in the binary case although it is still more robust than Infomap. Although degrading less sharply than Infomap, the performance of the purely weighted surprise seems to be the worst here; on the other hand, the enhanced surprise outperforms the competing algorithms for intermediate values of the topological mixing parameter, irrespective of the size of the communities: in fact, NMI_surprise = 1 even for μ_w = 0.8. Similar considerations hold true when weighted, directed configurations are considered (with the only difference that, now, modularity steadily performs worse than the other algorithms, except for the largest values of the mixing parameter). As for the binary cases, surprise competes with modularity although it employs much less information than the latter: in fact, while the benchmark employed by modularity now coincides with the (sparse version of the) Weighted Configuration Model, and hence encodes the information on the entire strength sequence, surprise compares the WRGM with the WSBM, hence employing only the information on the magnitude of the total weight, both in a global and in a block-wise fashion.

Let us now consider the RC benchmark. Results on specific implementations of this are shown in Fig. 5: surprise outperforms both competing algorithms across the entire domain of the degradation parameter p. More specifically, while modularity degrades slowly as the value of p is raised, Infomap degrades abruptly once p ≥ 0.4. Hence, for small values of such a parameter, Infomap outperforms modularity; on the other hand, for large values of p, modularity outperforms Infomap (although both NMI_modularity and ARI_modularity achieve a value which is around 0.6, i.e. already far from the maximum). Interestingly, for small values of p, the performance of Infomap and that of surprise overlap, both achieving NMI and ARI values which are very close to 1: as p crosses the value of 0.4, however, the two trends become increasingly different, with Infomap being outperformed by modularity which, in turn, is outperformed by surprise. From a more general perspective, these results confirm what has already been observed elsewhere¹⁴, i.e. that the best-performing algorithms on the LFR benchmarks often perform poorly on the RC benchmarks and vice versa.

**Fig. 5: Comparison of different community detection algorithms on the Relaxed-Caveman (RC) benchmark.**

Let us now inspect the performance of surprise in recovering binary bimodular structures. To this aim, we have defined a benchmark mimicking the philosophy of the RC one, i.e. progressively degrading an initial, well-defined configuration:

let us consider N_c core nodes and N_p periphery nodes. The core is completely connected (i.e. the link density of the N_c × N_c block is 1) and the periphery is empty (i.e. the link density of the N_p × N_p block is 0). So far, our benchmark is reminiscent of a core-periphery structure à la Borgatti–Everett;
let us now focus on the topology of the N_c × N_p bipartite network embodying the connections between the core and the periphery: in particular, let us consider each entry of such an adjacency matrix and set a_cp = 1 with probability p_cp. Upon doing so, such a subgraph will have an expected link density equal to p_cp;
let us now ‘degrade’ such an initial configuration, by progressively filling the periphery and emptying the core. This can be achieved by (1) considering all peripheral node pairs and linking them with probability q_p; (2) considering all core node pairs and keeping them linked with probability 1 − q_c (or, equivalently, disconnecting them with probability q_c). Upon doing so, we end up with a core whose expected link density is 1 − q_c and with a periphery whose expected link density is q_p. Now, varying q_p in the interval [0, p_cp] and q_c in the interval [0, 1 − p_cp] allows us to span a range of configurations starting with the Borgatti–Everett one and ending with an Erdős–Rényi one.
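The three steps above can be sketched as a small sampling routine. The code below is an illustration only: the function name, the node numbering (core nodes first) and the `seed` argument are assumptions introduced here and are not part of the paper's code.

```python
import random
from itertools import combinations


def degraded_core_periphery(n_core, n_periphery, p_cp, q_c, q_p, seed=None):
    """Sample one instance of the benchmark as a set of undirected edges.

    Nodes 0, ..., n_core - 1 form the core; the remaining ones form
    the periphery.
    """
    rng = random.Random(seed)
    core = range(n_core)
    periphery = range(n_core, n_core + n_periphery)
    edges = set()
    # core pairs: start fully connected, keep each link with prob. 1 - q_c
    for i, j in combinations(core, 2):
        if rng.random() < 1 - q_c:
            edges.add((i, j))
    # periphery pairs: start empty, add each link with prob. q_p
    for i, j in combinations(periphery, 2):
        if rng.random() < q_p:
            edges.add((i, j))
    # core-periphery pairs: link each pair with prob. p_cp
    for i in core:
        for j in periphery:
            if rng.random() < p_cp:
                edges.add((i, j))
    return edges


# undegraded starting point with the sizes quoted in the text
edges = degraded_core_periphery(100, 300, p_cp=0.5, q_c=0.0, q_p=0.0, seed=0)
```

Setting q_c = q_p = 0 returns the Borgatti–Everett-like starting point, while q_p = p_cp and q_c = 1 − p_cp make all three blocks equally dense on average.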

Specifically, here we have considered N_c = 100, N_p = 300 and p_cp = 0.5. The result of our exercise is shown in Fig. 6: as expected, the performance of surprise worsens as the degradation parameter approaches p_cp = 0.5; however, both the NMI and the ARI indices steadily remain very close to 1, meaning that surprise optimization is able to correctly classify not only true positives (i.e. to keep the nodes originally in the same communities together) but also the other possible kinds of node pairs.

**Fig. 6: Benchmark for testing surprise on recovering core-periphery structures.**

As with the exercise on community detection, let us now ask ourselves how the presence of weights impacts the detection of bimodular structures. Let us consider a toy core-periphery network: raising the weight of any two links connecting the core with the periphery allows the two nodes originally part of the periphery to be detected as belonging to the core (see Fig. 7c, d). Analogously, if a bipartite topology is modified by adding weights between some of the nodes belonging to the same layer, $\mathscr{W}_{\!/\!/}$ will detect a core-periphery structure as significant, the core nodes being the ones linked by the heaviest connections (see Fig. 7a, b).

**Fig. 7: Impact of weights on the detection of bimodular structures.**

Testing surprise on real-world networks

When it comes to studying real systems, particularly insightful examples are provided by social networks. To this aim, let us consider the one induced by the co-occurrences of characters within the ‘Star Wars’ saga (i.e. the three trilogies)³³. As shown in Fig. 8, we have considered both the binary and the weighted version of it. In both cases, two major clusters are visible. For what concerns the binary version of such a network, the optimization of $\mathscr{S}$ reveals that those clusters are induced by the characters of Episodes I–III (e.g. Yoda, Qui-Gon, Obi-Wan, Anakin, Padme, the Emperor, Count Dooku, etc.) and by the characters of Episodes IV–IX (e.g. C-3PO, Leia, Han, Lando, Poe, Finn); a third cluster, instead, concerns the villains of Episodes VII–IX (i.e. Snoke, Kylo Ren, Phasma, Hux). Interestingly, Rey, BB-8, Maz-Kanata and other characters living on Jakku are clustered together; moreover, the interactions between the characters of Episodes IV–VI and those of Episodes VII–IX cause the former and the latter to be recovered within the same cluster. This picture is further refined once weights are taken into account: in fact, two of the aforementioned clusters are now merged, giving rise to the cluster of heroes of Episodes IV–IX (e.g. C-3PO, Leia, Han, Lando, Poe, Finn, Rey, Maz-Kanata).

**Fig. 8: Application of the surprise-based formalism for the detection of communities on real-world networks.**

Let us now inspect the effectiveness of our framework in revealing weighted communities by considering the friendship network among the terrorists involved in the train bombing of Madrid in 2004³⁴ and the one among the residents living in an Australian University Campus³⁴ (see Fig. 8). As the optimization of $\mathscr{W}$ reveals, while fully connected subsets of nodes are considered as communities in case links have unitary weights, sparser subgraphs can be considered as communities as well whenever their inner connections are heavy enough. On the other hand, both bottom panels of Fig. 8 seem to confirm that one of the main limitations of surprise-like functionals is that of recovering a large number of small clusters of nodes.

Let us now compare the performance of $\mathscr{S}_{\!/\!/}$ and $\mathscr{W}_{\!/\!/}$ in order to see if, and how, the presence of weights affects the bimodular mesoscale organization of networks. To this aim, let us focus on the network of co-occurrences of the characters of the novel ‘Les Miserables’³⁴. As shown in Fig. 9, link weights indeed modify the picture provided by just considering the simple presence of links (see also ref. ¹²): the core of the weighted network is, in fact, constituted by the nodes connected by the heavier links, irrespective of the link density of the core itself.

**Fig. 9: Bimodular structures detection on real-world social and financial networks.**

After having applied our framework to the analysis of social networks, let us move on to consider financial networks. One of the most popular examples of the kind is provided by the electronic Italian Interbank Money Market (e-MID)³⁵, depicted in Fig. 9. Notice that, for such a network, the vast majority of core links are also the heavier ones, which confirms a tendency that is ubiquitous in financial and economic systems, i.e. that binary and weighted quantities are closely related, even at the mesoscale.

The surprise-based formalism presented in this paper can also be employed in a hierarchical fashion to highlight either nested communities or nested bimodular structures. To clarify this point, let us consider the World Trade Web (WTW) in the year 2000 as a case study³⁶. First, let us run $\mathscr{W}_{\!/\!/}$ to highlight the core portion of the weighted version of such a network; as Fig. 10 shows, the bipartition distinguishes countries with a large strength from those whose trade volume is low (basically, a number of African, Asian and South-American countries). Repeating our analysis within the core portion of the network allows us to discover the presence of a (statistically significant) nested core: in fact, the second-round optimization reveals that the core-inside-the-core is composed of countries such as Canada, USA, the richest European countries, China and Russia.

**Fig. 10: Detection of nested bimodular structures.**

Let us now compare the purely weighted surprise with $\mathscr{E}_{\!/\!/}$, also run in a hierarchical fashion. The results of our exercise are shown in Fig. 10. As is evident from it, the enhanced surprise is more restrictive than the purely weighted one, as a consequence of constraining the degrees besides the strengths. Hence, while the first run excludes the countries with both a small degree and a small strength, the second run excludes Russia, a result seemingly indicating that while its strength is large enough to allow it to be a member of the core, its degree is not. In a sense, the optimization of $\mathscr{E}_{\!/\!/}$ corrects the picture provided by the optimization of $\mathscr{W}_{\!/\!/}$, as the core becomes less populated by low-degree nodes; this effect is likely to become more evident on systems that are neither financial nor economic in nature.

Let us now run, and compare, modularity, Infomap and surprise on the set of real-world networks above. Table 2 sums up the results. A first observation concerns the number of detected communities: while Infomap is the algorithm producing the smallest number of clusters, surprise is the one producing the largest number of clusters; more precisely, surprise outputs more and smaller clusters than the other two methods. As a consequence, our three algorithms produce partitions with an overall small overlap, as indicated by the NMI; the ARI confirms such an observation, although indicating that the pictures provided by Infomap and surprise are (overall) more similar than those provided by modularity and surprise (and, as a consequence, by modularity and Infomap). Interestingly enough, the values of the AWI are quite large, and larger than the corresponding NMI and ARI values: since the AWI just focuses on the percentage of true positives, a good performance under such an index indicates that the two tested algorithms agree on the nodes to be clustered together (although they may not, and in general will not, agree on the number of communities). Hence, the discrepancy between the ARI and the AWI may be explained by the presence of statistical noise (i.e. misclassified pairs of nodes, although the word may not be correct as the information about the true partition is not available) around the bulk of nodes to be put together.

Table 2: Comparison of the performance of modularity, Infomap and surprise on the real-world networks considered in the present paper.


Let us stress once more that, whenever real-world networks are considered, information about the existence of a true partition is rarely available; for this reason, exercises such as the one we have carried out here may be useful to gain insight into the system under study: instead of trusting just one algorithm, combining pairs of them (e.g. by considering as communities the subsets of nodes output by both) may be the right solution to overcome the limitations affecting each single method.
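One possible reading of ‘the subsets of nodes output by both’ is the intersection of the two partitions, i.e. grouping together only those nodes that share a cluster under both algorithms. The sketch below implements this reading; it is an assumption about how such a combination could be carried out, not a procedure specified in the paper, and the function name is illustrative.

```python
from collections import defaultdict


def overlap_partition(labels_a, labels_b):
    """Group nodes that share a cluster in BOTH input partitions.

    labels_a, labels_b : dicts mapping the same nodes to cluster labels,
                         e.g. the outputs of two different algorithms.
    Returns a list of node sets (the intersection of the two partitions).
    """
    groups = defaultdict(set)
    for node in labels_a:
        # two nodes end up together only if both labels coincide
        groups[(labels_a[node], labels_b[node])].add(node)
    return list(groups.values())


# hypothetical outputs of two algorithms on a four-node network
first = {0: "x", 1: "x", 2: "x", 3: "y"}
second = {0: 1, 1: 1, 2: 2, 3: 2}
consensus = overlap_partition(first, second)
```

By construction, the resulting partition is at least as fine as each of the two inputs, so it tends to produce more and smaller clusters than either algorithm alone.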

Discussion

The hypergeometric distribution, together with its many variants, has recently attracted renewed interest from researchers, who have employed it to define novel network ensembles³⁷, recipes for projecting bipartite networks³⁸, etc.

The distributions related to it allow for a wide variety of benchmarks to be defined, each one embodying a different set of constraints. In the present paper we have explored the power of the hypergeometric formalism to carry out the detection of mesoscale structures: it allows proper statistical tests to be defined for revealing the presence of modular, core-periphery and bipartite structures on any kind of network, be it binary or weighted, undirected or directed. According to the classification proposed in the introductory section of the paper, we believe surprise to belong to the second class of algorithms, its asymptotic expression embodying a sort of LRT aimed at choosing between alternative hypotheses (the RGM and the SBM, the WRGM and the WSBM, etc.).

More generally, our approach reveals the superiority of the algorithms for mesoscale structure detection belonging to the second and to the third class with respect to those belonging to the first one: still, the two classes of statistically grounded approaches compete on some specific benchmarks, as the comparisons carried out on the LFR and the RC ones clearly show (specifically, methods performing well on LFR benchmarks do not perform well on RC benchmarks and vice versa). This also suggests a strategy to handle mesoscale structure detection on real-world networks: as the information on the presence of a possible, true partition is rarely available, a good strategy may be that of running different algorithms on the same empirical networks, checking the consistency of their outputs and combining them, e.g. by taking the overlap, in a way that is reminiscent of multi-model inference.

Although the surprise-based approach is powerful and versatile, its downside is represented by the specific kinds of tests that are induced by the optimization of surprise-like score functions: the benchmarks being compared, in fact, ignore the local structure of nodes (i.e. their degree). While this seems to be perfectly reasonable when considering core-periphery structures (see also the contribution²⁷ whose authors claim that a core-periphery structure is always compatible with a network degree sequence), this is no longer true for the community detection task³⁹. Indeed, as has been noticed elsewhere, ignoring degrees may be at the origin of the large number of singletons output by surprise, as assigning nodes with few neighbors to larger clusters may be disfavored from a statistical point of view.

The observations above call for the extension of the hypergeometric formalism we have studied here to include more refined benchmarks, such as those constraining the entire degree sequence.