Introduction

The definition of a suitable data-driven spatial social network model is a widely studied problem in computational social sciences. Various dynamical processes (e.g., the spread of diseases) can be represented on such networks, and the topology of the network has a direct impact on the evolution of the process. Having general models for social interactions, based on available data and capable of reconstructing stylized facts known from the literature, is of utmost importance to avoid reaching conclusions biased by incorrect or ill-defined assumptions. With widely available survey and census data, it is now possible to generate synthetic geo-localized populations, stratified by age and organized into households. However, there is no equally direct way to accurately model interpersonal relationships.

Many real social networks exhibit some form of group mixing, driven by the tendency of individuals to socialize with their peers1. This property can be reproduced using the so-called Stochastic Block Model (SBM), which rose to prominence as a way to generate networks with a known community structure2,3. In the SBM, the vertex set is partitioned into disjoint blocks and the probability of an edge between two nodes depends on the blocks to which the two nodes belong. In the original formulation of the SBM, all vertices belonging to the same block are indistinguishable, so that the degree distribution within each block tends to be Poisson-like for large graphs2. To produce a more realistic network, a few extensions to the model have been proposed in the literature, including the Degree-Corrected Block Model (DCBM)2 and its maximum-entropy version4. These models are, in some sense, the SBM-equivalent of the well-known configuration model. They are based on enforcing both the desired group mixing and a target degree-sequence, either exactly or in expectation. If \(L_{IJ}\) is the number of links between blocks I and J (or twice that number, if \(I=J\)) and \(\deg _i\) is the degree of node \(v_i\), the maximum-entropy DCBM works by imposing that \(\left\langle L_{IJ}\right\rangle =K_{IJ}\) and \(\left\langle \deg _i\right\rangle =k_i\) for suitable constants \(K_{IJ}\) and \(k_i\). The internal consistency of the model requires \(\sum _J K_{IJ}=\sum _{i\in I} k_i\) for all I, and the density of the network is fixed at \(\frac{\sum _{I,J} K_{IJ}}{N(N-1)}\), where N is the size of the network.

In this paper, we define and analyze the Fitness-Corrected Block Model (FCBM), a variation of the DCBM where the network density p is a configuration parameter. In the FCBM the block-level mixing is specified in terms of a matrix of edge-densities \(\Delta \), as in the original SBM, whereas a sequence \(\vec {f}\) of vertex intrinsic fitness values5,6 measures the propensity of each vertex to establish links and can be used to enforce the desired intra-block heterogeneity. In the DCBM the constants \(K_{IJ}\) and \(k_i\), which constrain \(\left\langle L_{IJ}\right\rangle \) and \(\left\langle \deg _i\right\rangle \) respectively, need to be known explicitly. The DCBM was in fact conceived as an instrument to randomize an observed graph while preserving some of its features. In the FCBM, instead, \(K_{IJ}\) and \(k_i\) are determined from p, \(\Delta \) and \(\vec {f}\), making the FCBM a suitable model for generating random graphs whose properties can be tuned to the characteristics of a given population or set of entities.

To obtain a model that is maximally random, the FCBM is defined following the approach first presented in Ref.4 for the DCBM, and later clarified and generalized in Ref.7. In a nutshell, the approach consists of: (1) the definition of an ensemble of networks, each with the same number of nodes, but with all possible configurations of links; (2) the constrained maximization of the entropy associated with the network ensemble, via the method of Lagrange multipliers. By imposing the conditions \(\left\langle L_{IJ}\right\rangle =K_{IJ}\) and \(\left\langle \deg _i\right\rangle =k_i\), for all I, J, i, the probability per graph in the ensemble factorises in terms of probabilities per link. The numerical values of the Lagrange multipliers have to be calculated by solving the system of nonlinear equations given by the model constraints. Solving this system can become expensive when the network is large and, to the best of our knowledge, no publicly available software library implements it. For the present paper, we implemented a parallel solver for the maximum-entropy DCBM, written in C using the Intel MKL scientific library and the OpenMP API, which can also be used to solve our FCBM. The solver is released as open-source software at https://gitlab.com/cranic-group/dcbm_solver.

A known sparse-regime approximation for the edge probability of the DCBM implies that, in the sparse FCBM, the probability of a link binding two individuals i and j is directly proportional to their sociability and to the cohesion of their blocks, i.e., \(p_{ij}\propto p f_i f_j \Delta _{I_iJ_j}\) for all \(i\in I\) and \(j\in J\). Under the sparse-regime approximation, the maximum entropy condition leads to a system of equations that admits a closed-form solution. We make use of this approximate solution to find two closed-form estimates for the degree distribution \(p_k\) of the FCBM: one more accurate, the other neater. These two estimates relate \(p_k\) directly to the fitness distribution \(p_f\), showing that the degree distribution of the FCBM, albeit not known a priori, can be essentially controlled through the model’s parameters. In particular, if \(p_f\) follows a power-law, lognormal or exponential distribution, then the same holds, approximately, for \(p_k\).

Among the many possible applications, the FCBM is especially well suited for generating a data-driven social network of geo-referenced and age-stratified individuals. The partition of the population into blocks can be obtained by grouping the individuals having similar position (e.g., residence) and age, whereas the expected block-to-block edge-density matrix \(\Delta \) can be calibrated based on survey data that quantify the dependence of contact frequencies upon geographic and socio-demographic factors. Finally, the fitness vector \(\vec {f}\) may be drawn from a suitable probability distribution, modelled upon measurable features such as wealth, employment, or mobility. In this context, the sparse FCBM is well approximated by the phenomenological model presented in Refs.8,9.

To show how the FCBM can be used in practice, and to provide empirical support to our analytical findings, we generate a set of synthetic social networks for a stylized urban population distributed on a disk of radius 2.5 km. We use the SOCRATES10,11 tool to extract age-based social mixing patterns, and we embrace the widely accepted assumption that an inverse-power-law relation binds the distance between two individuals and the frequency of their social interactions12,13. We generate instances of our FCBM using either the exact model or its sparse-regime approximation, varying both the spatial density of the individuals in the territory and the fitness distribution. We show that the empirical degree distribution obtained with all considered configurations is in close agreement with the analytical estimates. The implementation of the FCBM is publicly available as part of the Urban Social Networks (USN) framework at https://gitlab.com/cranic-group/usn.

Contributions and results

In the following, we summarize the main contributions and present the main analytical and experimental results of this paper. For all methodological details, we refer the reader to the “Methods” section.

The Fitness Corrected Block Model

We propose the Fitness-Corrected Block Model (FCBM), a new maximum-entropy model for modular networks, parameterized by the network density \(p\in (0,1)\), the block-wise mixing structure \(\Delta \)—a symmetric matrix such that \(\sum _{I,J}\Delta _{I,J}=2\)—and the vertex-intrinsic fitness \(\vec {f}\)—a vector that controls the tendency of each vertex to establish links with other vertices.

Formally, let V be a vertex set of size N partitioned into n blocks \(\{B_I\}_{I=0}^{n-1}\). The FCBM is defined as the maximum entropy probability distribution P, over all networks having vertex set V, fulfilling the following two conditions:

$$\begin{aligned} \left\langle L_{IJ}\right\rangle _P&= p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \Delta _{IJ} \quad \text {for all } I,J \end{aligned}$$
(1)
$$\begin{aligned} \left\langle \deg _i\right\rangle _P&= p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \frac{f_i}{\sum _{u\in I_i} f_u} \sum _J \Delta _{I_iJ} \quad \text {for all } i \end{aligned}$$
(2)

where \(L_{IJ}\) is the number of edges between \(B_I\) and \(B_J\); \(\deg _i\), \(f_i\) and \(I_i\) are the degree, fitness and pertaining block of vertex \(v_i\); \(\left\langle \cdot \right\rangle _P\) denotes the expected value with respect to P.
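As an illustration of conditions (1) and (2), the following minimal sketch computes their right-hand sides for a toy configuration. The function name `fcbm_targets`, the list-based data layout and the toy values are our own assumptions for illustration and are not taken from the released software.

```python
from math import comb

def fcbm_targets(p, delta, fitness, block_of):
    """Return (K, k): the right-hand sides of conditions (1) and (2).

    delta    -- symmetric block mixing matrix (list of lists) summing to 2
    fitness  -- fitness value f_i of every vertex
    block_of -- block index I_i of every vertex
    """
    n_blocks = len(delta)
    total_links = p * comb(len(fitness), 2)  # p * C(N, 2)
    # Condition (1): K[I][J] = p * C(N, 2) * Delta[I][J]
    K = [[total_links * delta[I][J] for J in range(n_blocks)]
         for I in range(n_blocks)]
    # Total fitness of each block, i.e. the denominator in condition (2)
    block_fitness = [0.0] * n_blocks
    for f_i, I in zip(fitness, block_of):
        block_fitness[I] += f_i
    # Condition (2): fitness share of the vertex times the block total
    k = [f_i / block_fitness[I] * sum(K[I])
         for f_i, I in zip(fitness, block_of)]
    return K, k

# Hypothetical toy configuration: 5 vertices, 2 blocks.
delta = [[0.8, 0.4], [0.4, 0.4]]
fitness = [1.0, 2.0, 3.0, 1.0, 1.0]
block_of = [0, 0, 0, 1, 1]
K, k = fcbm_targets(0.3, delta, fitness, block_of)
```

In this toy case the expected degrees of each block add up to the corresponding row sum of K, and all expected degrees add up to twice the expected number of edges, \(2p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \).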

The FCBM can be seen as a generalization of the Degree-Corrected Block Model (DCBM). However, unlike the DCBM, the FCBM is consistent for any choice of the configuration parameters, as we show, making it suitable for generating random graphs with tunable topological properties.

Leveraging the framework of entropy-based null-models, we prove that the probability P(G) of generating a specific graph G with the FCBM can be factorized as the product of independent edge probabilities, namely \(P(G) = \prod _{i,j} p_{ij}\), where \(p_{ij}\) is the probability of an edge between \(v_i\) and \(v_j\). The edge probabilities can be recovered by solving a system of non-linear equations obtained from (1) and (2).

Efficient solver for FCBM/DCBM system of equations

The system of nonlinear equations, needed to explicitly calculate edge probabilities, becomes computationally expensive as the size of the network increases. Already with a few thousand nodes, a straightforward implementation may be too slow to be used in practical applications. We implemented a parallel C program that efficiently solves this system, as well as the analogous system arising from the DCBM. The solver follows the Sequential Quadratic Programming approach presented in Ref.14, using Newton’s method for the Hessian approximation. To the best of our knowledge, this is the first publicly available solver for the DCBM. The source code is released as open-source software at https://gitlab.com/cranic-group/dcbm_solver.

Properties of the FCBM

The fitness sequence \(\vec {f}\) guarantees intra-block heterogeneity. In fact, (2) can be rewritten as

$$\begin{aligned} \left\langle \deg _i\right\rangle _P = \frac{f_i}{\sum _{u\in I_i} f_u} \left\langle \deg _I\right\rangle _P \end{aligned}$$

where \(\left\langle \deg _I\right\rangle _P\) is the expected total degree of block \(B_I\), i.e., the total number of edges incident to \(B_I\).

When the network is sparse, we show that the system from which all \(p_{ij}\)’s must be derived admits a closed-form approximate solution. The sparse-regime approximation allows us to estimate the expected degree distribution of the sampled graph based on the fitness distribution \(p_f\). We derive two estimates for the degree distribution \(p_k\) of the network. The first estimate reads

$$\begin{aligned} p_k(k) \approx \frac{\left\langle f\right\rangle _{p_f}}{N} \sum _I p_f\left( k \frac{\left\langle f\right\rangle _{p_f}}{\mu _I}\right) \frac{N_I}{\mu _I} \end{aligned}$$
(3)

where \(\mu _I\) is the expected average degree of the vertices in \(B_I\). The second estimate, less accurate but easier to interpret and use in practice, reads

$$\begin{aligned} p_k(k) \approx p_f\left( k \frac{\left\langle f\right\rangle _{p_f}}{\mu }\right) \frac{\left\langle f\right\rangle _{p_f}}{\mu } \end{aligned}$$
(4)

where \(\mu \) is the average degree of the network. The obtained analytical expressions for \(p_k\) have a very desirable property: for many choices of \(p_f\)—including power-law, lognormal or exponential distributions—\(p_k\) essentially has the same “shape” of \(p_f\).
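To make the two estimates concrete, the sketch below evaluates (3) and (4) for an exponential fitness distribution. The function names, the rate value and the numerical integration grid are illustrative assumptions. For an exponential \(p_f\), estimate (4) reduces analytically to an exponential density with mean \(\mu \), which the code checks numerically.

```python
import math

def degree_pdf_global(k, fitness_pdf, mean_fitness, mu):
    """Estimate (4): p_k(k) ~ p_f(k <f> / mu) * <f> / mu."""
    scale = mean_fitness / mu
    return fitness_pdf(k * scale) * scale

def degree_pdf_blockwise(k, fitness_pdf, mean_fitness, block_sizes, block_mu):
    """Estimate (3): block contributions weighted by N_I / mu_I."""
    n = sum(block_sizes)
    total = 0.0
    for n_I, mu_I in zip(block_sizes, block_mu):
        total += fitness_pdf(k * mean_fitness / mu_I) * n_I / mu_I
    return mean_fitness / n * total

# Hypothetical exponential fitness distribution with rate 2 (mean 1/2).
rate = 2.0

def exponential_pdf(x):
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

mu = 25.0
step = 0.01
grid = [(m + 0.5) * step for m in range(100000)]  # midpoints in [0, 1000]
values = [degree_pdf_global(k, exponential_pdf, 1.0 / rate, mu) for k in grid]
mass = sum(values) * step                                  # total probability
mean_degree = sum(k * v for k, v in zip(grid, values)) * step
```

When all blocks share the same expected average degree, the two functions return the same value, as expected from comparing (3) with (4).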

Data-driven FCBM for spatial social networks

We propose an application for our FCBM as a maximum-entropy model for data-driven spatial social networks. In particular, we envision its application to generate geographic networks informed by census data, contact surveys, and geospatial data. Indeed, the phenomenological model described at the end of this section has already been employed to develop a realistic social network at the urban scale9 and to study the spread of an epidemic process on it15,16,17.

Let the vertex set V describe an age-stratified population of N individuals living in a territory tessellated into square tiles of side l. Each \(v_i\) is thus characterized by two data-driven discrete attributes: its tile of residence \(t_i\in T\), that is, the discretized position of \(v_i\) in the territory, and its age-group \(g_i\in \Gamma \). These two attributes induce a partition of the population into \(n=|T|\cdot |\Gamma |\) blocks \(\{B_I\}_{I=0}^{n-1}\), with \(v_i\in B_I=(t_I,g_I)\) if and only if \(t_i=t_I\) and \(g_i=g_I\).

To define a data-driven block-wise mixing matrix \(\Delta \), we observe that:

  • Social mixing patterns, derived from heterogeneous data sources such as surveys, cell phones or wearable sensors18,19,20, can be used to reconstruct a data-driven age-based mixing matrix S9, whose \(s_{IJ}\) element measures the tendency of age groups \(g_I\) and \(g_J\) to socialize with each other.

  • The frequency of social relations between individuals living in \(t_I\) and \(t_J\) is generally assumed to decay as \(d_{IJ}^{-\beta }\), where \(d_{IJ}\) is the normalized (geographic or euclidean) distance between tiles \(t_I\) and \(t_J\), and the exponent \(\beta >0\) depends on the type of relation and on the extension of the territory12,13.

This leads to:

$$\begin{aligned} \Delta _{IJ} = \frac{d_{IJ}^{-\beta }s_{IJ}}{\sum _{O\le Q}d_{OQ}^{-\beta }s_{OQ}} \end{aligned}$$
(5)
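Expression (5) can be transcribed directly, as in the following sketch. The function name `mixing_matrix` and the toy values are hypothetical. We also assume that the normalized distance of a block from itself is strictly positive, so that \(d_{II}^{-\beta }\) is finite. The sketch follows (5) literally and does not apply any additional convention to the diagonal entries.

```python
def mixing_matrix(dist, s, beta):
    """Block mixing matrix of expression (5).

    dist -- symmetric matrix of normalized block-to-block distances,
            assumed strictly positive (diagonal included)
    s    -- symmetric matrix of age-based mixing weights per block pair
    beta -- decay exponent of the inverse power law
    """
    n = len(dist)
    weight = [[dist[I][J] ** (-beta) * s[I][J] for J in range(n)]
              for I in range(n)]
    # Normalization over unordered block pairs O <= Q, as in (5)
    norm = sum(weight[O][Q] for O in range(n) for Q in range(O, n))
    return [[weight[I][J] / norm for J in range(n)] for I in range(n)]

# Hypothetical toy input with three blocks.
dist = [[0.5, 1.0, 2.0],
        [1.0, 0.5, 1.0],
        [2.0, 1.0, 0.5]]
s = [[3.0, 1.0, 1.0],
     [1.0, 2.0, 1.0],
     [1.0, 1.0, 4.0]]
delta = mixing_matrix(dist, s, beta=1.0)
upper_sum = sum(delta[O][Q] for O in range(3) for Q in range(O, 3))
```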

Finally, we extract the fitness vector \(\vec {f}\) from a suitable distribution \(p_f\), set the density parameter \(p\in (0,1)\) and, for all pairs ij, compute the edge probability \(p_{ij}\) as prescribed by the FCBM. As a result, the graph sampling probability P guarantees that: (1) the expected number of links between \(B_I\) and \(B_J\) is proportional to \(s_{IJ}\) and decays as \(d_{IJ}^{-\beta }\); (2) the expected degree of \(v_i\) is proportional to \(f_i\) and to the expected total degree of block \(I_i\).

In this case, the sparse-regime approximation yields

$$\begin{aligned} p_{ij} \approx p \left( {\begin{array}{c}N\\ 2\end{array}}\right) \frac{f_i}{\sum _{u\in I_i} f_u} \frac{f_j}{\sum _{w\in J_j} f_w}\frac{d_{I_iJ_j}^{-\beta }s_{I_iJ_j}}{\sum _{I\le J}d_{IJ}^{-\beta }s_{IJ}} \end{aligned}$$
(6)
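The sparse-regime expression lends itself to a direct implementation. The sketch below writes (6) in terms of a generic mixing matrix \(\Delta \) and samples one graph by independent coin flips. The function names, the toy parameters and the clipping of probabilities at 1 are our own assumptions and are not part of the model definition.

```python
import random
from math import comb

def sparse_edge_probabilities(p, delta, fitness, block_of):
    """Approximate p_ij = p C(N,2) (f_i/F_I) (f_j/F_J) Delta[I][J]."""
    n = len(fitness)
    total_links = p * comb(n, 2)
    block_fitness = [0.0] * len(delta)
    for f_i, I in zip(fitness, block_of):
        block_fitness[I] += f_i
    # Fitness share of every vertex within its own block
    share = [f_i / block_fitness[I] for f_i, I in zip(fitness, block_of)]
    prob = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = (total_links * share[i] * share[j]
                     * delta[block_of[i]][block_of[j]])
            # Clipping is a safeguard added here, not part of (6).
            prob[i][j] = prob[j][i] = min(1.0, value)
    return prob

def sample_graph(prob, rng):
    """Draw every edge (i, j), i < j, independently with probability p_ij."""
    n = len(prob)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < prob[i][j]]

# Hypothetical toy configuration: 5 vertices, 2 blocks.
delta = [[0.8, 0.4], [0.4, 0.4]]
fitness = [1.0, 2.0, 3.0, 1.0, 1.0]
block_of = [0, 0, 0, 1, 1]
prob = sparse_edge_probabilities(0.3, delta, fitness, block_of)
edges = sample_graph(prob, random.Random(0))
cross_links = sum(prob[i][j] for i in (0, 1, 2) for j in (3, 4))
```

In this toy case no probability is clipped, and the expected number of links between the two blocks equals \(p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \Delta _{01}=1.2\).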

Expression (6) defines a phenomenological model, analogous to the one presented in Ref.9, where the probability of two individuals being connected is proportional to their sociability and to the cohesion of their age-groups, while decaying as a power of their distance. Clearly, the estimates obtained in (3) and (4) for the degree distribution stay valid in the data-driven model. If the network is sufficiently sparse and the population of all tiles/groups is sufficiently large, the degree distribution of the sampled graph G is controlled by the available data and by the chosen fitness distribution \(p_f\).

Experimental analysis

We implemented both the exact FCBM and its sparse-regime approximation. The code is released, as open-source software, as part of the Urban Social Networks (USN) framework at https://gitlab.com/cranic-group/usn.

We used our data-driven FCBM to generate instances of a spatial social network for a stylized city of 10K inhabitants living in a disk of radius 2.5 km. We set p so that the average degree of the network is \(\mu =25\), we set \(\beta =1\), we used age-density data for Italy as released by the Italian National Institute of Statistics (ISTAT), and we extracted the matrix S from data released by the POLYMOD project21. We considered three possible spatial densities (uniform, and increasing or decreasing with the distance from the disk’s center) and three possible fitness distributions (Pareto, lognormal and exponential). For all nine combinations, we generated 10 independent graph instances.

In Fig. 1 we show the empirical degree distribution, averaged over the 10 instances of the FCBM for each configuration, considering both the exact model and the sparse-regime approximation. We also show the two estimates (3) and (4). The plots confirm that the approximation is sound and that the two estimates can be safely used in practice—at least, for sparse networks—with (3) working especially well in all cases—except, possibly, for very small degrees.

Figure 1

The degree distribution for an ideal city (circular, 10K inhabitants) for 3 different densities (uniform, radial increasing, radial decreasing) and 3 different fitness distributions (power-law, lognormal, exponential).

Discussion

Developing realistic network models for social interactions is of paramount importance to understand the underlying mechanisms that lead to the observed features of such networks and to study all dynamical processes, such as disease spreading, that are strongly influenced by the network topology. Extreme care must be taken to avoid unintentionally injecting any bias into the model. For this reason, maximum-entropy models are extensively used in network analysis, either as null models, or to generate synthetic networks that have specific characteristics, but are otherwise maximally random.

In this paper, we introduced the maximum-entropy Fitness-Corrected Block Model (FCBM), an adjustable-density model for modular and heterogeneous networks, whose block-wise mixing pattern is known in expectation. The model has a general and flexible formulation, but it was designed with a key application in mind: providing a working tool for building synthetic social networks informed with census, survey and geospatial data. Publicly available spatial density and demographic data are in fact necessarily discrete, thus inducing a partition of the population into blocks of agents who belong to the same age-group and live in the same area. The expected block-to-block edge-density can be estimated based on empirical findings that quantify the dependence of contact frequencies upon geographic and demographic features. However, both the density and the degree of heterogeneity of real-world social networks depend on the considered type of interpersonal relations and are rarely explicitly known beforehand. Unlike other block-models in the literature, the FCBM makes both these network features adjustable. In particular, the desired intra-block heterogeneity can be enforced through a vertex-intrinsic social fitness, possibly modelled upon observable population-level variables, such as wealth, employment or mobility.

We implemented the FCBM and made it publicly available as open-source software. The released software includes a parallel code that speeds up the computation of the most expensive part of the required maximization procedure, a step of the algorithm that is also needed in the well-known Degree-Corrected Block Model. We tested our implementation of the FCBM by reconstructing instances of a social network connecting the individuals of a stylized city of 10K inhabitants. The experiments allowed us to verify that the exact model and its efficient sparse-regime approximation yield networks with almost identical degree distributions. We also showed and experimentally verified that, in the sparse regime, the expected degree distribution of the output network can be estimated by two closed-form expressions. Thanks to these two estimates, the shape of the degree distribution can be predicted based on the chosen fitness distribution.

In the near future, we plan to use the FCBM (and, possibly, a temporal extension of the model) to simulate dynamical processes in real-world territories and understand how socio-demographic features and social habits affect the outcomes of these processes. To this end, we will work towards gaining a better understanding of how the topological properties of the FCBM depend on the configuration parameters, with special attention paid to a set of network properties, such as the clustering coefficient and the excess degree distribution, that can be used to study percolation and diffusion dynamics on the network.

Related work

Under the pressure of the COVID-19 pandemic, several simulation frameworks have been developed to provide realistic descriptions of the disease spread process on different spatial and temporal scales, from a single building to complex urban areas up to a global scale22,23,24,25,26,27,28,29. World-scale meta-population models and agent-based systems describing small and large areas can be informed by a variety of data sources. Census data and/or grid based population counts can be integrated to reconstruct populations that are statistically indistinguishable from real ones, including age, geographic distribution, education and wealth. The number and intensity of contacts in specific settings, such as workplaces, schools, or households, can be collected through surveys, questionnaires, diaries, and, if possible, supplemented with data obtained from digital technologies such as cell phones or wearable sensors18,19,20. These data are then used to reconstruct contact matrices, individual or group schedules, and are widely used to reproduce synthetic interactions30,31.

In comparison, generative network models that can reproduce the characteristics of real-world social networks by incorporating information from data have received much less attention. The task of inferring a realistic distribution of social ties is quite challenging, since friendship ties can only be measured for a small subset of real-world networks, and the mechanism underlying tie formation, while thoroughly studied, is still far from being fully understood. Ideally, the models should reproduce the main features of real-world social networks, well summarized in Ref.32. These networks show a heavy-tailed (e.g., lognormal) degree distribution, often with a finite cutoff in agreement with Dunbar’s number. The transitivity of the networks is high, compared to a random graph model, as a consequence of the well-established principle that “friends of my friends are my friends”. Moreover, they show positive assortativity by degree and type. By quantitatively looking at ego networks, mobile phone networks, and online social networks we now have a better understanding of some of their peculiar features and underlying mechanisms. Education, wellness, age and spatial proximity are regarded as critical elements in the formation of friendship bonds33,34. Among these, age is probably the most studied, possibly due to the fact that age-related data is actually available at different spatial scales35. While there is wide evidence that geographical factors alone cannot explain the structure of real-world spatial social networks12,36,37, the dependence of friendship on distance is widely assumed to follow an inverse power-law with exponent \(\beta \in [0.5, 2]\)12,13,36,37,38,39,40,41, and this surprisingly holds even for online relationships42. In particular, \(\beta <1\) seems to work better for short range contacts (\(<20\) km)13 and for urban networks40, in line with sociological studies43.
However, contrary to other real-world networks44,45, such networks do not present very large hubs12,37,38,46. Their degree distribution is right skewed and relatively long-tailed12,37, and it has been, at times, approximated by a power-law with a large (5–8) exponent38,46 or by a lognormal distribution13. Within cities, population density impacts on the frequency of close-range contacts, but usually not on the overall size of each person’s network41. While geographical proximity and community structure appear to be related37,40,41, some authors argue that only small clusters (\(<30\) members) are geographically bounded39, whereas the large ones may span across very large areas of a city37.

Defining simple models that capture all of these features is not an easy task. Models designed to mimic the scale-free degree distribution emerging in many real networks, for instance, may fail to yield the expected clustering structure47,48. Exponential random graphs have been shown to overcome some of these limitations49,50. Stochastic Block Models (SBMs)2,3, on the other hand, have been specifically developed to reproduce networks with a community structure, a typical characteristic of real-world social networks, which follows from the presence of some kind of group-level homophily. In this type of network models, the nodes are partitioned into disjoint sets named blocks and the probability of an edge existing between two nodes depends on the blocks to which the two nodes belong. The SBM and its generalization have gained their success in the last decades as they can be used to discover and understand the structure of a network, as well as for clustering purposes51,52.

Spatial network models are often obtained by incorporating vertices into a metric space and induced constraints can determine some of the network properties53. Introducing a penalty on “long” edges, which mimics a penalty in maintaining long-distance relationships, has an impact on clusters, path lengths, degree distributions, and more54. Recently, network instances having suitable features have been generated by means of the so-called random geometric models55,56,57, where the popularity and similarity of the nodes depend on their position in some latent metric space58. Embedding the vertices into a hyperbolic disk55 has proved a way to obtain both high clustering and heavy-tailed degree distribution.

In the present paper, we leverage the framework of entropy-based null-models for real complex networks, reviewed in Ref.7. Among the very first fundamental papers, the work of Park and Newman has a particular relevance59: based on Jaynes’ derivation of Statistical Physics from Information Theory60, they proposed a general maximum entropy approach for the randomization of complex networks. Besides the extension to a different context, the main innovations of Ref.59 are the introduction of local constraints, such as the degree sequence, and the interpretation of the general framework of Exponential Random Graphs (ERGs) in terms of maximum entropy models. The present construction was later extended to the analysis of real networks61,62, tailoring the entropy-based model on the observed network. As a matter of fact, the various Lagrange multipliers, introduced for the entropy maximisation in Ref.59, can be numerically calculated by maximising the (Log-)Likelihood associated with the real network. Such a construction represents a perfect benchmark for the analysis of real systems, since it is maximally random (due to the entropy maximisation) and tailored on the observed system (due to the Likelihood maximisation). It is not surprising that it has been extensively applied to the study of non-trivial structural patterns of different systems, such as financial and trade networks, biological systems and online social networks7,63. Moreover, the general framework can be easily extended to tackle different kinds of networks, such as undirected, directed, weighted, directed and weighted64,65,66, bipartite67, bipartite weighted68 and degree corrected block models4. Another relevant branch of research focuses on the reconstruction of networks from limited information69. This application is of particular interest for risk assessment of financial networks, where only limited information is available due to privacy concerns70.
Reconstruction approaches based on entropy-based null models have proven particularly effective in this context, and the maximally random nature of the framework described here is critical to avoid the introduction of bias into the predictions69,70,71,72,73.

Methods

Formalism

Let \({\mathcal {G}}\) be the ensemble of all simple graphs of N vertices. If P is a probability distribution over \({\mathcal {G}}\), P(G) is the probability of graph \(G\in {\mathcal {G}}\), and \(\left\langle \cdot \right\rangle _P\) denotes the expectation with respect to P. The vertex set \(V=\{v_i\}_{i=0}^{N-1}\) is partitioned into n blocks \(\{B_I\}_{I=0}^{n-1}\). The size of block I is \(N_I=|B_I|\) and, for each \(v_i\in V\), \(I_i\) denotes the index of the block to which \(v_i\) belongs. For all pairs IJ, \(N_{IJ}\) denotes the number of possible pairs (ij) with \(v_i\in I\), \(v_j\in J\) and \(i\ne j\), i.e., \(N_{II}=N_I(N_I-1)\) and \(N_{IJ}=N_IN_J\) if \(I\ne J\)—notice that in the case of \(N_{II}\) we are counting twice the number of couples, as we will do for the counts of edges in the following.

Each \(G\in {\mathcal {G}}\) is uniquely determined by its adjacency matrix \(A(G)=\{a_{ij}(G)\}_{i,j=0}^{N-1}\), where \(a_{ij}(G)=1\) if edge \((i,j)\in E(G)\) and \(a_{ij}(G)=0\) otherwise. The degree of vertex \(v_i\) in G is \(\deg _i(G)=\sum _j a_{ij}(G)\). The total degree of block I is \(\deg _I(G) = \sum _{i\in I} \deg _i(G)\) and, for all IJ, \(L_{IJ}(G)=\sum _{i\in I}\sum _{j\in J} a_{ij}(G)\) is the number of edges between \(B_I\) and \(B_J\) in G, or, if \(I=J\), twice that number. Therefore, using the definitions, \(\deg _I(G)=\sum _J L_{IJ}(G)\). For the sake of simplicity, the dependence of these quantities on the specific graph G will be often omitted in the following.
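The quantities just defined can be computed from an adjacency matrix as in the following sketch, which also illustrates the identity \(\deg _I(G)=\sum _J L_{IJ}(G)\). The function name and the toy path graph are hypothetical choices made for illustration.

```python
def block_statistics(adj, block_of, n_blocks):
    """Vertex degrees, link counts L[I][J] and block degrees of one graph."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    L = [[0] * n_blocks for _ in range(n_blocks)]
    for i in range(n):
        for j in range(n):
            # Ordered pairs are summed, so an edge inside a block
            # contributes twice to L[I][I], as in the text.
            L[block_of[i]][block_of[j]] += adj[i][j]
    block_deg = [0] * n_blocks
    for i in range(n):
        block_deg[block_of[i]] += deg[i]
    return deg, L, block_deg

# Toy path graph 0-1-2-3 with blocks {0, 1} and {2, 3}.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
block_of = [0, 0, 1, 1]
deg, L, block_deg = block_statistics(adj, block_of, 2)
```

Here each block contains one internal edge, counted twice in \(L_{II}\), and one edge crosses the two blocks.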

Maximum entropy Degree-Corrected Block Model4

The maximum entropy Degree-Corrected Block Model (DCBM) is defined as the maximum entropy probability distribution P over \({\mathcal {G}}\) in which the number of links per block and the degree sequence are constrained on average, i.e.

$$\begin{aligned} \left\langle L_{IJ}\right\rangle _P&= K_{IJ} \quad \text {for all } I,J \end{aligned}$$
(7)
$$\begin{aligned} \left\langle \deg _i\right\rangle _P&= k_i \quad \text {for all } i, \end{aligned}$$
(8)

with \(\sum _J K_{IJ}=\sum _{i\in I} k_i\), for all I. If H(P) denotes the Shannon entropy of P, the sought P can be obtained by finding the stationary points of

$$\begin{aligned} H'(P,\vec {\eta },\vec {\theta })=H(P)-C(P,\vec {\eta },\vec {\theta }) = -\left\langle \ln P(G)\right\rangle _P - \left( \sum _{I\le J}\eta _{IJ}\left( \left\langle L_{IJ}\right\rangle _P-K_{IJ}\right) + \sum _i\theta _i\left( \left\langle \deg _i\right\rangle _P-k_i\right) +\alpha (\sum _G P(G)-1)\right) \end{aligned}$$
(9)

where \(\eta _{IJ}\), \(\theta _i\) and \(\alpha \) are Lagrange multipliers: while \(\eta _{IJ}\) and \(\theta _i\) control the conditions (7) and (8), respectively, \(\alpha \) is necessary for the normalization of the probability P(G). Since the functional derivatives with respect to P(G) are \(\frac{\mathrm{d}\left\langle L_{IJ}\right\rangle _P}{\mathrm{d}P(G)}= \sum \nolimits _{i\in I}\sum \limits _{j\in J} a_{ij}(G)\) and \(\frac{\mathrm{d}\left\langle \deg _i\right\rangle _P}{\mathrm{d}P(G)}= \sum \nolimits _j a_{ij}(G)\), then

$$\begin{aligned} \frac{\mathrm{d}C(P,\vec {\eta },\vec {\theta })}{\mathrm{d}P(G)}= \sum _{I\le J}\eta _{IJ}\sum _{i\in I}\sum _{j\in J} a_{ij}(G) + \sum _i \theta _i \sum _j a_{ij}(G) +\alpha = \sum _{i<j}\eta _{I_iJ_j}a_{ij}(G) + \sum _{i<j} (\theta _i + \theta _j) a_{ij}(G)+\alpha \end{aligned}$$

This results in

$$\begin{aligned} P(G)\propto \exp \left[ -\sum _{i<j}(\theta _i+\theta _j+\eta _{I_iJ_j})a_{ij}(G)\right] \end{aligned}$$

and the probability per graph factorises in terms of probabilities per link as

$$\begin{aligned} p_{ij}={\left\{ \begin{array}{ll} \dfrac{x_ix_jy_{I_iJ_j}}{1+x_ix_jy_{I_iJ_j}} &{}\text {if } i\ne j\\ 0 &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(10)

where \(x_i=e^{-\theta _i}\) and \(y_{IJ}=e^{-\eta _{IJ}}\). For all i and all \(I\le J\), \(x_i\) and \(y_{IJ}\) can be found by solving the system of equations

$$\begin{aligned}&\sum _{i\in I}\sum _{j\in J}p_{ij} = K_{IJ} \quad \text {for all } I\le J \end{aligned}$$
(11)
$$\begin{aligned}&\sum _j p_{ij} = k_i \quad \text {for all } i \end{aligned}$$
(12)
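As an illustration, the following minimal sketch evaluates the edge probabilities (10) for given parameters and returns the left-hand sides of (11) and (12), so that a candidate solution can be checked against the targets \(K_{IJ}\) and \(k_i\). The function names and the array layout (an integer block label per vertex, a symmetric \(n\times n\) array for \(y\)) are assumptions made here for illustration; no specific solver is implied.

```python
import numpy as np

def dcbm_edge_probabilities(x, y, block):
    """Edge probabilities of Eq. (10).

    x     : (N,) positive vertex parameters x_i
    y     : (n, n) symmetric positive block parameters y_IJ
    block : (N,) integer block label I_i of each vertex
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    block = np.asarray(block)
    # z_ij = x_i * x_j * y_{I_i J_j}
    z = np.outer(x, x) * y[block[:, None], block[None, :]]
    p = z / (1.0 + z)
    np.fill_diagonal(p, 0.0)  # p_ii = 0: no self-loops
    return p

def constraint_values(p, block, n_blocks):
    """Left-hand sides of Eqs. (11) and (12) for a probability matrix p."""
    member = np.eye(n_blocks)[np.asarray(block)]  # (N, n) one-hot block membership
    L = member.T @ p @ member                     # L[I, J] = sum_{i in I} sum_{j in J} p_ij
    deg = p.sum(axis=1)                           # expected degree of each vertex
    return L, deg
```

Comparing `L` with \(K\) and `deg` with \(\vec {k}\) gives the residuals that any numerical root-finding scheme for (11) and (12) would have to drive to zero.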

Sparse DCBM

When the average degree \(\left\langle k\right\rangle \) is small, i.e., when the network is sparse, the edge probability of the DCBM can be approximated as \(p_{ij}\approx x_ix_jy_{I_iJ_j}\). This allows (11) and (12) to be rewritten as

$$\begin{aligned} y_{IJ} \sum _{i\in I} x_i \sum _{\begin{array}{c} j\in J\\ j\ne i \end{array}}x_j&= K_{IJ} \quad \text {for all } I\le J \end{aligned}$$
(13)
$$\begin{aligned} x_i \sum _J y_{I_iJ} \sum _{\begin{array}{c} j\in J\\ j\ne i \end{array}}x_j&= k_i \quad \text {for all } i \end{aligned}$$
(14)

Since the constants K and \(\vec {k}\) are, by construction, bound by the relation \(\sum _{J} K_{IJ} = \left\langle \deg _I\right\rangle = \sum _{i\in I} k_i\), (13) and (14) admit the following solution

$$\begin{aligned} x_i&= k_i \quad \text {for all } i\\ y_{IJ}&= \frac{K_{IJ}}{\sum _{i\in I}k_i \sum _{\begin{array}{c} j\in J\\ j\ne i \end{array}}k_j} \quad \text {for all } I\le J \end{aligned}$$
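The sparse solution can be sketched numerically as follows. This is a minimal illustration, not a reference implementation: the function name is hypothetical, and the denominator of \(y_{IJ}\) is read as the sum of \(k_ik_j\) over all pairs \(i\in I\), \(j\in J\) with \(j\ne i\), which makes (13) hold exactly. Each block is assumed to contain at least two vertices with positive target degree.

```python
import numpy as np

def sparse_dcbm(k, K, block):
    """Sparse-regime DCBM parameters and edge probabilities p_ij = x_i x_j y_IJ.

    k     : (N,) target degrees k_i
    K     : (n, n) symmetric target link counts K_IJ (K_II counts links twice)
    block : (N,) integer block label of each vertex
    """
    k = np.asarray(k, dtype=float)
    K = np.asarray(K, dtype=float)
    block = np.asarray(block)
    member = np.eye(K.shape[0])[block]     # (N, n) one-hot block membership
    D = member.T @ k                       # D[I] = sum_{i in I} k_i
    S2 = member.T @ (k ** 2)               # sum_{i in I} k_i^2
    # sum over i in I, j in J, j != i of k_i k_j (the j != i term only matters for I = J)
    denom = np.outer(D, D) - np.diag(S2)
    y = K / denom
    p = np.outer(k, k) * y[block[:, None], block[None, :]]
    np.fill_diagonal(p, 0.0)               # no self-loops
    return k.copy(), y, p
```

Outside the sparse regime some entries of `p` may exceed 1, which signals that the approximation \(p_{ij}\approx x_ix_jy_{I_iJ_j}\) is no longer appropriate.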

Maximum entropy Fitness-Corrected Block Model

Given a scalar \(p\in (0,1)\), a symmetric matrix \(\Delta \) such that \(\sum _{I,J}\Delta _{IJ}=2\), and a fitness sequence \(\vec {f}\), we define the Fitness-Corrected Block Model (FCBM) as the maximum entropy model fulfilling the following two conditions:

$$\begin{aligned} \left\langle L_{IJ}\right\rangle _P&= p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \Delta _{IJ} \quad \text {for all } I, J \end{aligned}$$
(15)
$$\begin{aligned} \left\langle \deg _i\right\rangle _P&= p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \frac{f_i}{\sum _{u\in I_i} f_u} \sum _J \Delta _{I_iJ} \quad \text {for all } i \end{aligned}$$
(16)

The FCBM is a variation of the DCBM where the network density p is a configuration parameter. As in the original stochastic block model, the block-level mixing is specified in terms of a set of edge-densities \(\Delta _{IJ}\), rather than a set of edge-counts \(K_{IJ}\). Similarly, the degree sequence \(\vec {k}\) is replaced by a vertex-intrinsic fitness sequence \(\vec {f}\), in line with previous models available in the literature5,6. The fitness \(f_i\) measures the propensity of \(v_i\) to establish links, and \(\left\langle \deg _i\right\rangle _P\) is set proportional to \(f_i\) through a constant that depends on p and on the block \(I_i\). By design, (15) and (16) imply \(\sum _J\left\langle L_{IJ}\right\rangle _P = \left\langle \deg _I\right\rangle _P = \sum _{i\in I} \left\langle \deg _i\right\rangle _P\), and (16) can be rewritten as

$$\begin{aligned} \left\langle \deg _i\right\rangle _P = \frac{f_i}{\sum _{u\in I_i} f_u} \left\langle \deg _I\right\rangle _P \end{aligned}$$

which clarifies the role of \(\vec {f}\) as an element of intra-block heterogeneity.
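For completeness, the consistency between (15) and (16) can be spelled out by summing (16) over the vertices of a block I; the fitness shares sum to one, so that

$$\begin{aligned} \sum _{i\in I}\left\langle \deg _i\right\rangle _P = p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \left( \sum _J \Delta _{IJ}\right) \frac{\sum _{i\in I} f_i}{\sum _{u\in I} f_u} = p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \sum _J \Delta _{IJ} = \sum _J\left\langle L_{IJ}\right\rangle _P = \left\langle \deg _I\right\rangle _P \end{aligned}$$

Substituting \(p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \sum _J \Delta _{I_iJ} = \left\langle \deg _{I_i}\right\rangle _P\) back into (16) gives the rewritten form.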

For fixed \(\Delta \) and \(\vec {f}\), the derivation of the maximum entropy FCBM is identical to that of the DCBM: the maximum entropy probability per graph factorizes into the probability per edge given by (10), and the vectors of constants \(\vec {x}\) and \(\vec {y}\) can be obtained by solving the analogues of (11) and (12), i.e.

$$\begin{aligned}&\sum _{i\in I}\sum _{j\in J}p_{ij} = p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \Delta _{IJ} \quad \text {for all } I\le J \end{aligned}$$
(17)
$$\begin{aligned}&\sum _j p_{ij} = p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \frac{f_i}{\sum _{u\in I_i} f_u} \sum _J \Delta _{I_iJ} \quad \text {for all } i \end{aligned}$$
(18)

Sparse FCBM

In the sparse regime, i.e., when \(p\ll 1\), using \(p_{ij}\approx x_ix_jy_{I_iJ_j}\) yields the following approximate solution to (17) and (18):

$$\begin{aligned} x_i&= p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \frac{f_i}{\sum _{u\in I_i} f_u} \sum _J \Delta _{I_iJ} = \frac{f_i}{\sum _{u\in I_i} f_u} \left\langle \deg _I\right\rangle _P \quad \text {for all } i\\ y_{IJ}&= \frac{p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \Delta _{IJ}}{ p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \left( \sum _O \Delta _{IO} \right) p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \left( \sum _O \Delta _{OJ} \right) } = \frac{p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \Delta _{IJ}}{\left\langle \deg _I\right\rangle _P \left\langle \deg _J\right\rangle _P} \quad \text {for all } I\le J \end{aligned}$$

In this regime, the probability per edge can thus be rewritten as

$$\begin{aligned} p_{ij} \approx p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \frac{f_i}{\sum _{u\in I_i} f_u} \frac{f_j}{\sum _{w\in J_j} f_w} \Delta _{I_iJ_j}. \end{aligned}$$
(19)
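A minimal sketch of (19) is given below. The function name and the array layout (integer block labels, a symmetric \(n\times n\) array for \(\Delta \)) are assumptions made here for illustration.

```python
import numpy as np

def sparse_fcbm_probabilities(p, delta, f, block):
    """Sparse-regime FCBM edge probabilities, Eq. (19).

    p     : target network density, 0 < p < 1
    delta : (n, n) symmetric edge-density matrix with entries summing to 2
    f     : (N,) positive fitness values
    block : (N,) integer block label of each vertex
    """
    delta = np.asarray(delta, dtype=float)
    f = np.asarray(f, dtype=float)
    block = np.asarray(block)
    N = f.size
    n_pairs = N * (N - 1) / 2.0                    # binomial(N, 2)
    block_fitness = np.bincount(block, weights=f, minlength=delta.shape[0])
    share = f / block_fitness[block]               # f_i / sum_{u in I_i} f_u
    pij = p * n_pairs * np.outer(share, share) * delta[block[:, None], block[None, :]]
    np.fill_diagonal(pij, 0.0)                     # no self-loops
    return pij
```

For two distinct blocks I and J, the entries of the returned matrix sum exactly to \(p\left( {\begin{array}{c}N\\ 2\end{array}}\right) \Delta _{IJ}\); within a block the excluded diagonal makes the match only approximate, in line with the sparse-regime approximation.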

Degree distribution for the sparse FCBM

Let us assume that each fitness value \(f_i\) is drawn from a suitable distribution \(p_f\). The sparse-regime approximation allows us to estimate the expected degree distribution of the sampled graph G with respect to both \(p_f\) and the graph sampling probability P. If the block size \(N_{I_i}\) is large enough, we have \(\sum _{u\in I_i }f_u \approx \left\langle f\right\rangle _{p_f} N_{I_i}\). Now, following the approach used in Ref. 5, we have

$$\begin{aligned} k(f,I) = \left\langle \deg _i\bigm \vert f_i=f,I_i=I\right\rangle _{p_f,P} \approx \dfrac{f}{\left\langle f\right\rangle _{p_f}N_{I}}\left\langle \deg _I\right\rangle _P = \dfrac{f}{\left\langle f\right\rangle _{p_f}} \mu _I \end{aligned}$$
(20)

where \(\mu _I = \frac{\left\langle \deg _I\right\rangle _P}{N_I} = \left\langle \deg _i \bigm \vert I_i=I\right\rangle _P\) is the expected average degree of the vertices in I. Equation (20) can be inverted, leading to the estimate

$$\begin{aligned} f(k,I) \approx k \frac{\left\langle f\right\rangle _{p_f}}{\mu _I} \end{aligned}$$

so that

$$\begin{aligned} p_k(k,I) = \Pr [k(f)=k\bigm \vert I] = \Pr [f(k)=f\bigm \vert I]\frac{\mathrm{d}f(k)}{\mathrm{d}k} \approx p_f\left( k \frac{\left\langle f\right\rangle _{p_f}}{\mu _I}\right) \frac{\left\langle f\right\rangle _{p_f}}{\mu _I} \end{aligned}$$
(21)

and, hence,

$$\begin{aligned} p_k(k) = \Pr [k(f)=k] = \sum _I p_k(k,I)\frac{N_I}{N} \approx \sum _I p_f\left( k \frac{\left\langle f\right\rangle _{p_f}}{\mu _I}\right) \frac{\left\langle f\right\rangle _{p_f}}{\mu _I} \frac{N_I}{N} = \frac{\left\langle f\right\rangle _{p_f}}{N} \sum _I p_f\left( k \frac{\left\langle f\right\rangle _{p_f}}{\mu _I}\right) \frac{N_I}{\mu _I} \end{aligned}$$
(22)
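The estimate (22) is a mixture of rescaled copies of \(p_f\), one per block, and can be evaluated numerically as in the following sketch. The function name and arguments are hypothetical; `fitness_pdf` stands for any vectorized implementation of \(p_f\) supplied by the user.

```python
import numpy as np

def degree_density(k, fitness_pdf, mean_f, mu, sizes):
    """Approximate degree density of Eq. (22).

    k           : scalar or array of degree values
    fitness_pdf : callable implementing p_f on NumPy arrays
    mean_f      : mean fitness <f>
    mu          : (n,) expected average degree mu_I of each block
    sizes       : (n,) block sizes N_I
    """
    k = np.asarray(k, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    scale = mean_f / mu                   # <f> / mu_I for each block
    weights = sizes / sizes.sum()         # N_I / N
    # p_f(k <f> / mu_I) * <f> / mu_I, weighted by N_I / N and summed over blocks
    terms = fitness_pdf(k[..., None] * scale) * scale * weights
    return terms.sum(axis=-1)
```

Because each term is a properly rescaled density, the result integrates to one whenever \(p_f\) does, and its mean equals \(\sum _I \mu _I \frac{N_I}{N}\).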

A slightly less accurate, yet much simpler, approximation can be obtained by first computing

$$\begin{aligned} k(f) = \left\langle \deg _i\bigm \vert f_i=f\right\rangle _{p_f,P} = \sum _I k(f,I) \frac{N_I}{N} \approx \frac{f}{\left\langle f\right\rangle _{p_f}} \sum _I \mu _I \frac{N_I}{N} = \frac{f}{\left\langle f\right\rangle _{p_f}} \mu \end{aligned}$$
(23)

where \(\mu =\sum _I \mu _I \frac{N_I}{N}\) is the average degree of the network. Then, (23) can be inverted as

$$\begin{aligned} f(k) \approx k \frac{\left\langle f\right\rangle _{p_f}}{\mu } \end{aligned}$$

yielding

$$\begin{aligned} p_k(k) = \Pr [k(f)=k] = \Pr [f(k)=f]\frac{\mathrm{d}f(k)}{\mathrm{d}k} \approx p_f\left( k \frac{\left\langle f\right\rangle _{p_f}}{\mu }\right) \frac{\left\langle f\right\rangle _{p_f}}{\mu } \end{aligned}$$
(24)

In many cases, \(p_k\) belongs to the same family of probability distributions as \(p_f\): e.g., if \(p_f\) follows a power-law, lognormal or exponential distribution, then the same holds for \(p_k\).
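As a worked instance of (24), consider an exponential fitness distribution with rate \(\lambda \), a symbol introduced here only for this example:

$$\begin{aligned} p_f(f)=\lambda e^{-\lambda f},\quad \left\langle f\right\rangle _{p_f}=\frac{1}{\lambda } \quad \Longrightarrow \quad p_k(k)\approx \lambda \exp \left( -\lambda k\frac{1}{\lambda \mu }\right) \frac{1}{\lambda \mu } = \frac{1}{\mu }e^{-k/\mu } \end{aligned}$$

that is, an exponential degree distribution with mean \(\mu \), independent of \(\lambda \).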

Data-driven FCBM for spatial social networks

Our FCBM can be readily tuned on real data and empirical findings to produce instances of a maximum entropy spatial social network. On the one hand, the local density and demographic profile of the population are generally available in the form of discrete, geographically located (e.g., residents in 500 m \(\times \) 500 m tiles) and/or age-stratified (e.g., 0–5 years old) population segments. These data naturally induce a partition of the population into blocks. On the other hand, intra-block population heterogeneity can be controlled by a vertex-related social fitness, possibly modelled upon measurable features such as wealth, employment or mobility.

Formally, let the vertex set V describe an age-stratified population of N individuals living in a territory tessellated into square tiles of side l. Each \(v_i\) is characterized by two data-driven discrete attributes: its tile of residence \(t_i\in T\), that is, the discretized position of \(v_i\) in the territory, and its age-group \(g_i\in \Gamma \). The attributes \(t_i\) and \(g_i\) may either be directly available (in the case of a real population) or be drawn, respectively, from a given spatial density \(p_t\) and age distribution \(p_g\) (in the case of a synthetic population). These two attributes induce a partition of the population into \(n=|T|\cdot |\Gamma |\) blocks \(\{B_I\}_{I=0}^{n-1}\), with \(v_i\in B_I=(t_I,g_I)\) if and only if \(t_i=t_I\) and \(g_i=g_I\). We embrace the widely acknowledged assumption that an inverse-power-law relation binds the distance \(d_{IJ}\) and the frequency of social relations between individuals living in \(t_I\) and \(t_J\)12,13. For all pairs of blocks I, J, we thus define the edge-density

$$\begin{aligned} \Delta _{IJ} = \frac{d_{IJ}^{-\beta }s_{IJ}}{\sum _{O\le Q}d_{OQ}^{-\beta }s_{OQ}} \end{aligned}$$

where \(d_{IJ}\) is the normalized (geographic or Euclidean) distance between tiles \(t_I\) and \(t_J\); the normalization is obtained by dividing by \(\frac{l}{2}\). We set \(d_{II}=1\), so that the distance between individuals in the same tile is half the distance between individuals living in neighboring tiles. The exponent \(\beta >0\) is a configuration parameter. The term \(s_{IJ}\) measures the tendency of age groups \(g_I\) and \(g_J\) to socialize with each other; such a \(|\Gamma |\times |\Gamma |\) symmetric age-based social mixing matrix S can be obtained, by imposing reciprocity and normalizing, from a suitable data-driven contact matrix9.
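The construction of \(\Delta \) can be sketched as follows. The function and argument names are hypothetical, Euclidean distances between tile centres are used, and two assumptions are made explicit: tiles have distinct centres, and any two blocks sharing a tile (including blocks with different age groups) are assigned the normalized distance 1.

```python
import numpy as np

def edge_density_matrix(tile_xy, block_tile, block_age, S, side, beta):
    """Edge densities Delta_IJ from tile distances and age-group mixing.

    tile_xy    : (|T|, 2) coordinates of the tile centres
    block_tile : (n,) tile index t_I of each block
    block_age  : (n,) age-group index g_I of each block
    S          : (|Gamma|, |Gamma|) symmetric age-based mixing matrix
    side       : tile side l
    beta       : positive distance-decay exponent
    """
    tile_xy = np.asarray(tile_xy, dtype=float)
    block_tile = np.asarray(block_tile)
    block_age = np.asarray(block_age)
    xy = tile_xy[block_tile]                              # centre of each block's tile
    diff = xy[:, None, :] - xy[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1)) / (side / 2.0)  # distance normalized by l / 2
    same_tile = block_tile[:, None] == block_tile[None, :]
    d[same_tile] = 1.0                                    # assumption: d = 1 within a tile
    s = np.asarray(S, dtype=float)[block_age[:, None], block_age[None, :]]
    w = d ** (-beta) * s                                  # d_IJ^(-beta) * s_IJ
    return w / np.triu(w).sum()                           # normalize by the sum over O <= Q
```

With this normalization the entries with \(I\le J\) sum to one, following the denominator of the definition of \(\Delta _{IJ}\).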

Finally, we extract the fitness vector \(\vec {f}\), set the density parameter \(p\in (0,1)\) and, for all pairs i, j, compute the edge probability \(p_{ij}\) as described for the FCBM. As a result, the graph sampling probability P guarantees that: (1) the expected number of links between \(B_I\) and \(B_J\) is proportional to \(s_{IJ}\) and decays as \(d_{IJ}^{-\beta }\); (2) the expected degree of \(v_i\) is proportional to \(f_i\) and to the expected total degree of block \(I_i\).

In this case, the sparse-regime approximation yields

$$\begin{aligned} p_{ij} \approx p \left( {\begin{array}{c}N\\ 2\end{array}}\right) \frac{f_i}{\sum _{u\in I_i} f_u} \frac{f_j}{\sum _{w\in J_j} f_w}\frac{d_{I_iJ_j}^{-\beta }s_{I_iJ_j}}{\sum _{I\le J}d_{IJ}^{-\beta }s_{IJ}} \end{aligned}$$
(25)

Expression (25) defines a phenomenological model, where the probability of two individuals being connected is proportional to their sociability and to the cohesion of their age-groups, while decaying as a power of their distance. Clearly, the estimates obtained in (22) and (24) for the degree distribution remain valid in the data-driven model. If the network is sufficiently sparse and the population of all tiles/groups is sufficiently large, the degree distribution of the sampled graph G is controlled by the available data and by the chosen fitness distribution \(p_f\).
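Since the graph probability factorizes over vertex pairs, an instance of the network can be drawn with one independent Bernoulli trial per pair. The sketch below assumes a precomputed symmetric matrix of edge probabilities (for example, from (25)); the function name is hypothetical, and values above 1, which can arise when the sparse approximation is stretched, are clipped as a pragmatic choice rather than as part of the model.

```python
import numpy as np

def sample_graph(pij, rng):
    """Draw an undirected simple graph with independent edges.

    pij : (N, N) symmetric matrix of edge probabilities
    rng : numpy.random.Generator
    Returns the (N, N) symmetric 0/1 adjacency matrix.
    """
    pij = np.clip(np.asarray(pij, dtype=float), 0.0, 1.0)
    N = pij.shape[0]
    iu = np.triu_indices(N, k=1)                 # each unordered pair i < j once
    draws = rng.random(iu[0].size) < pij[iu]     # one Bernoulli trial per pair
    A = np.zeros((N, N), dtype=np.int8)
    A[iu] = draws
    return A + A.T                               # symmetrize; the diagonal stays zero
```

A typical call is `A = sample_graph(pij, np.random.default_rng(seed))`, after which `A.sum(axis=1)` gives the sampled degree sequence to be compared with the estimates (22) and (24).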