Community detection with node attributes in multilayer networks

Community detection in networks is commonly performed using information about interactions between nodes. Recent advances have been made to incorporate multiple types of interactions, thus generalising standard methods to multilayer networks. Often, though, one can access additional information regarding individual nodes, such as attributes or covariates. A relevant question is thus how to properly incorporate this extra information in such frameworks. Here we develop a method that incorporates both the topology of interactions and node attributes to extract communities in multilayer networks. We propose a principled probabilistic method that does not assume any a priori correlation structure between attributes and communities but rather infers it from data. This leads to an efficient algorithmic implementation that exploits the sparsity of the dataset and can be used to perform several inference tasks; we provide an open-source implementation of the code online. We demonstrate our method on both synthetic and real-world data and compare performance with methods that do not use any attribute information. We find that including node information helps in predicting missing links or attributes. It also leads to more interpretable community structures and allows quantifying the impact of the node attributes given as input.


Introduction
Community detection is a fundamental task when investigating network data. Its goal is to cluster nodes into communities and thus find large-scale patterns hidden behind interactions between many individual elements.
The range of application of this problem spans several disciplines. For instance, community detection has been used in sociology to analyze terrorist groups in online social networks [1]; in finance to detect fraud events in telecommunication networks [2]; in engineering to refactor software packages in complex software networks [3]; and in biology to investigate lung cancer [4] and to explore epidemic spreading processes [5]. In recent years, the variety of fields interested in this topic has broadened and the availability of rich datasets is increasing accordingly. However, most research approaches rely only on the information about interactions among nodes, in other words, the network's topological structure. This information can be complex and rich, as is the case for multilayer networks where one observes different types of interactions. For instance, in social networks, interactions could entail exchanging goods, socializing, giving advice, or requesting assistance. Most network datasets, however, contain additional information about individuals, attributes that describe their features, for instance their religion, age or ethnicity. Node attributes are often neglected a priori by state-of-the-art community detection methods, in particular for multilayer networks. They are instead commonly used a posteriori, acting as candidates for "ground truth" for real-world networks to measure the quality of the inferred partition [6,7]; a practice that can also lead to incorrect scientific conclusions [8]. It is thus a fundamental question how to incorporate node attributes into community detection in a principled way. This is a challenging task, because one has to combine two types of information [9] while evaluating the extent to which topological and attribute information contribute to the network's partition [10].
To tackle these questions, we develop MTCOV, a mathematically rigorous and flexible model to address this problem for the general case of multilayer networks, i.e., in the presence of different types of interactions. To the best of our knowledge, MTCOV is the first overlapping community detection method proposed for multilayer networks with node attributes. The model leverages two sources of information, the topological network structure and the node covariates (or attributes), to partition nodes into communities. It is flexible, as it can be applied to a variety of network datasets, whether directed, weighted, or multilayer, and it outputs overlapping communities, i.e., nodes can belong to multiple groups simultaneously. In addition, the model does not assume any a priori correlation structure between the attributes and the communities. On the contrary, the contribution of the attribute information is quantitatively given as an output of the algorithm by fitting the observed data. The magnitude of this contribution can vary based on the dataset. Even if this contribution is not very high, for instance if the attributes are noisy or sparse, the model is nevertheless able to use this extra information to improve performance. At the same time, if incorporating attribute information hurts inference tasks, the model will downweight this contribution and instead rely mostly on the topological network structure.
Our method allows domain experts to investigate particular attributes and select relevant community partitions based on the type of node information they are interested in studying. In fact, by choosing the input data, we can drive the algorithm to select communities that are more relevant to the attribute under study. If the attribute hurts performance and is consequently downweighted by the algorithm, this can be used as a signal that the attribute might not correlate well with any partition, given the remaining topological information available, and thus inform the expert accordingly.
We study MTCOV on synthetic multilayer networks, a variety of single-layer node-attributed real networks and several real multilayer networks of social support interactions in two Indian villages. We measure performance based on prediction tasks and overlap with ground-truth (when this is known). For single-layer networks, we compare the performance of MTCOV to state-of-the-art community detection algorithms with node attributes; for multilayer networks, we test against a state-of-the-art algorithm that does not use any node attribute information and measure the extent to which knowing both types of information helps inference. We find that MTCOV performs well in predicting missing links and attributes. It also leads to more interpretable community structures and allows the quantification of the impact of the node attributes given as input.
Related Work. Several methods have been proposed to study community detection in networks [11]. In particular, we are interested in those valid for multilayer networks [12]. These generalize single-layer networks in that they can model different types of interactions and thus incorporate extra information which is increasingly available. Among these, we focus on generative models for multilayer networks [13][14][15][16][17][18][19], which are based on probabilistic modeling such as Bayesian inference or maximum likelihood optimization. These are flexible and powerful in terms of allowing multiple inference tasks, injecting domain knowledge into the theoretical framework, and being computationally efficient. However, the majority of these methods do not consider node attributes as input along with the network information. In fact, the few methods developed for community detection in multilayer networks with node attributes are based on first aggregating the multilayer network into a single layer, either by directly combining the adjacency matrices of each layer [20] or by using similarity matrices derived from them along with the node attributes [21,22]. In the context of data mining, a similar problem can be framed as learning low-dimensional representations of heterogeneous data with both content and linkage structure (what we call attributes and edges). This is tackled by using embeddings extracted via deep architectures [23], which is rather different from our approach based on statistical inference. Our problem shares some common ground with the one studied by Sachan et al. [24] for extracting communities in online social networks, where users gather based on common interests; they adopt a Bayesian approach, but with the rather different goal of associating different types of edges to topics of interest.
A related but different problem is that of performing community detection with node attributes on multiple independent networks [25,26]; this differs from modeling a single multilayer network in that it assumes that the covariates influence all the nodes in a network in the same way, but the various networks in the ensemble in different ways. For single-layer networks, there has been more extensive work recently on incorporating extra information on nodes [9,25,[27][28][29][30][31][32][33]. Among those adopting probabilistic modeling, some incorporate covariate information into the prior of the latent membership parameters [25,34,35], while others include covariates in an additive way along with the latent parameters [36,37], so that covariates influence the probability of interactions independently of the latent membership.
These works show the impact of adding node attributes a priori into community detection models to uncover meaningful patterns. One might then be tempted to adopt such methods also for multilayer networks by collapsing the topological structure into a suitable single network that can then be given as input to these single-layer, node-attributed methods, as done by Gheche et al. [20]. However, collapsing a multilayer network often leads to an important loss of information, and one needs to be careful in determining when this collapse is appropriate and how it should be implemented, as shown for community detection methods without attribute information [38,39]. Hence the need for a method that incorporates not only different types of edges but also node attributes.

Methods
We adapt recent ideas from the generative model behind MULTITENSOR [13], a multilayer mixed-membership model based on a Poisson tensor factorization [40], to incorporate node attributes in a principled manner. It can take as input directed and undirected networks, allowing different topological structures in each layer, including arbitrary mixtures of assortative, disassortative and core-periphery structures. We move beyond MULTITENSOR by incorporating node covariates through a proper likelihood term that accounts for this extra information. We use the formalism of maximum likelihood estimation: we combine the structural and the node information into a global likelihood function and provide a highly scalable Expectation-Maximization algorithm for the estimation of the parameters.

Model description and notation
Consider a multilayer network of N nodes and L layers. This is a set of graphs G = {G^(α)(V, E^(α))}_{1≤α≤L} defined on a set V of N vertices shared across L ≥ 1 layers, where E^(α) is the set of edges in layer α. Each layer α ∈ {1, . . . , L} is a graph G^(α)(V, E^(α)) with adjacency matrix A^(α) = [a_ij^(α)], where a_ij^(α) is the number of edges of type α from i to j; here we consider only non-negative integer entries. For binary entries, E = Σ_{i,j,α} a_ij^(α) is the total number of edges. Alternatively, we can consider a 3-way tensor A with dimensions N × N × L. In addition, for each node i ∈ V consider the vector of covariates X_i ∈ R^{1×K} (also called attributes or metadata), where K is the total number of attributes. Here, for simplicity, we focus on the case K = 1 and categorical covariates with Z different categories. However, we can easily generalize to more than one covariate by encoding each possible combination of them as a different value of a single covariate. For example, for the two covariates gender and nationality, we can encode X_i as one covariate with possible values female/American, male/Spanish, and so forth. One could also consider real-valued covariates by binning them. Nevertheless, a future extension should include the possibility of working with any type of metadata. A community is a subset of vertices that share some properties. Formally, each node belongs to a community to an extent measured by a C-dimensional vector denoted membership. Since we are interested in directed networks, for each node i we assign two such vectors, u_i and v_i (for undirected networks we set u = v); these determine how i forms outgoing and incoming links, respectively. Each layer α has an affinity matrix W^(α) = [w_kl^(α)] ∈ R^{C×C} which describes the density of edges between each pair (k, l) of groups. Each community k ∈ {1, . . . , C} is linked to a category z ∈ {1, . . . , Z} by a parameter β_kz, which quantifies how much information of the z-th category is used in creating the k-th community. To summarize, we consider two types of observed data: the adjacency tensor A = {A^(α)}_{1≤α≤L} and the design matrix X = {X_i}_{i∈{1,...,N}}; the former contains information about the network's topological structure, the latter about the node covariates. In addition, we have the model parameters, which we compactly denote as Θ = {U, V, W, β}.
The goal is to find the latent parameters Θ using the data A and X. In other words, given an observed multilayer network with adjacency tensor A and design matrix X, our goal is to simultaneously infer the node's membership vectors u i and v i ∀i ∈ {1, ..., N }; the affinity matrices W (α) , ∀α ∈ {1, . . . , L}, and the matrix β = [β kz ] ∈ R C×Z , which captures correlations between communities and attributes. We consider a probabilistic generative model where MTCOV generates the network and the attributes probabilistically, assuming an underlying structure consisting of C overlapping communities. We adopt a maximum likelihood approach where, given the latent parameters Θ, we assume that the data A and X have independent likelihoods; in other words, we assume that A and X are conditionally independent given the latent parameters Θ. In addition, we assume that the memberships U and V couple the two datasets, as they are parameters shared between the two likelihoods; whereas the W and β are specific to the adjacency and design matrix respectively. We describe separately the procedures for modeling the topology of the network and the node attributes and then we show how to combine them in a unified log-likelihood framework.

Modeling the network topology
In modeling the likelihood of the network topology, we adopt the ideas behind MULTITENSOR: we assume that the expected number of edges of type α from i to j is given by the parameter:

λ_ij^(α) = Σ_{k,l} u_ik v_jl w_kl^(α) .    (1)

We then assume that each entry a_ij^(α) is drawn from a Poisson distribution with mean λ_ij^(α). This is a common choice for network data [41][42][43] as it leads to tractable and efficient algorithmic implementations, compared for instance with other approaches that use Bernoulli random variables [9,27]; it also allows the flexibility of treating both binary and integer-weighted networks. We further assume that, given the memberships and affinity matrices, the edges are distributed independently; this is again a conditional independence assumption.
We can then write the likelihood of the network topology as:

P(A | U, V, W) = Π_{i,j,α} e^{−λ_ij^(α)} (λ_ij^(α))^{a_ij^(α)} / a_ij^(α)! ,    (2)

which leads to the log-likelihood L_G(U, V, W) for the structural dimension:

L_G(U, V, W) = Σ_{i,j,α} [ a_ij^(α) log λ_ij^(α) − λ_ij^(α) ] ,    (3)

where we have neglected constants that do not depend on the parameters.
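As an illustrative sketch (not the released implementation, which exploits the sparsity of the data), the structural term can be computed with dense tensors as follows; the function names are ours and the `eps` guard is a numerical convenience.

```python
import numpy as np

def expected_edges(u, v, w):
    """lambda[i, j, a] = sum_{k,l} u[i,k] * v[j,l] * w[a,k,l], as in equation (1).

    u, v: (N, C) outgoing/incoming memberships; w: (L, C, C) affinity matrices.
    """
    return np.einsum('ik,jl,akl->ija', u, v, w)

def log_likelihood_topology(A, u, v, w, eps=1e-12):
    """Poisson log-likelihood of the adjacency tensor A (N, N, L), dropping
    the constant log(a!) terms: sum of a*log(lambda) - lambda."""
    lam = expected_edges(u, v, w)
    return float(np.sum(A * np.log(lam + eps) - lam))
```

The dense `einsum` makes the bilinear form of the expected-edge parameter explicit; a sparse implementation would sum the `a*log(lambda)` term over observed edges only.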

Modeling the node attributes
In modeling the likelihood of the attributes, we assume that this extra information is generated from the membership vectors; this captures the intuition that knowing a node's community membership helps in predicting the value of the node's attribute. This assumption has also been made in other models for single-layer attributed networks [9], where one wants to enforce the tendency of nodes in the same community (for assortative structures) to share common attributes. Different approaches [36,37] assume instead independence between attributes and membership, which follows the different idea of observing an interaction between two individuals if they belong to the same community (for assortative structures), or if they share an attribute, or both. We then model the probability of observing the z-th category for the attribute of node i as the parameter:

π_iz = (1/2) Σ_k (u_ik + v_ik) β_kz ,    (4)

where β_kz is the probability of observing a particular category z together with a community k; thus π_i = (π_i1, . . . , π_iZ) is a Z-dimensional vector such that π_iz ∈ [0, 1] and Σ_{z=1}^Z π_iz = 1, ∀i. For convenience, we consider a one-hot encoding for x_i = (x_i1, . . . , x_iZ), the realization of the random variable X_i: x_iz = 1 if node i has the attribute corresponding to category z, and 0 otherwise, with Σ_{z=1}^Z x_iz = 1; the original design matrix X_{N×1} is thus translated into a binary matrix X_{N×Z}. We then assume that each entry X_i of the design matrix is drawn from a multinomial distribution with parameter π_i, which yields the likelihood of the covariates:

P(X | U, V, β) = Π_i Π_z π_iz^{x_iz} .    (5)

In order to satisfy the sum constraint Σ_{z=1}^Z π_iz = 1, we impose the normalizations Σ_{z=1}^Z β_kz = 1, valid ∀k, and Σ_{k=1}^C u_ik = Σ_{k=1}^C v_ik = 1, valid ∀i. These constraints are a particular case in which the general constraint on the multinomial parameter is satisfied. Although they are not the only possible choices, they allow us to give a probabilistic meaning to the components of β and of the memberships U and V.
As done for the network's edges, we assume conditional independence of the attributes across the various nodes. This leads to the log-likelihood L_X(U, V, β) for the attribute dimension:

L_X(U, V, β) = Σ_i Σ_z x_iz log [ (1/2) Σ_k (u_ik + v_ik) β_kz ] .    (6)

Note that we assume the attributes have values that can be binned into a finite number Z of unordered categories; the attributes need not be one-dimensional. Indeed, we can encode each combination of several attributes as a different value of a one-dimensional "super-attribute". The model is not affected, but the computational complexity might increase.
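A corresponding sketch for the attribute term, assuming the convex-combination form of the multinomial parameter π described above (an average of the two memberships weighted by β); the helper names are again illustrative.

```python
import numpy as np

def attribute_probabilities(u, v, beta):
    """pi[i, z] = 0.5 * sum_k (u[i,k] + v[i,k]) * beta[k,z].
    With the rows of u, v and beta normalized to 1, each row of pi sums to 1."""
    return 0.5 * (u + v) @ beta

def log_likelihood_attributes(X, u, v, beta, eps=1e-12):
    """Multinomial log-likelihood L_X = sum_{i,z} x_iz * log(pi_iz),
    with X a one-hot (N, Z) design matrix."""
    pi = attribute_probabilities(u, v, beta)
    return float(np.sum(X * np.log(pi + eps)))
```

Because X is one-hot, only the log-probability of each node's observed category contributes to the sum.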

Inference with the EM algorithm
Having described how the model works and its main assumptions and intuitions, we now turn our attention to describing how to fit the parameters to the data, in other words, how to perform inference. We assume conditional independence between the network and attribute variables, thus we can decompose the total log-likelihood into a sum of two terms L(U, V, W, β) = L_G(U, V, W) + L_X(U, V, β). However, in practice, we can improve the inference of the parameters by better balancing the contributions of the two terms, as their magnitudes can be on different scales, with the risk of biasing the total likelihood maximization towards one of the two terms. For this, we introduce a scaling parameter γ ∈ [0, 1] that explicitly controls the relative contribution of the two terms. The total log-likelihood is then:

L(U, V, W, β) = (1 − γ) L_G(U, V, W) + γ L_X(U, V, β) .    (7)

Varying γ from 0 to 1 lets us interpolate between two extremes: analyzing the data purely in terms of the network topology or purely in terms of the attribute information. One can either fix this a priori based on the goal of the application (closer to 0 for link prediction, closer to 1 for attribute classification), or treat it as a hyperparameter to be estimated, whose optimal value is obtained by fitting the data via tuning techniques (for instance cross-validation). This approach provides a natural quantitative measure of the dependence between the communities and the two sources of information. Notice that one can also rescale each likelihood term individually a priori in order to further control their magnitudes, before combining them in equation (7). This choice should be made based on the dataset at hand. Here we consider rescaling L_G and L_X only when studying the social support networks of the Indian villages, as we have enough data for estimating the normalization coefficients; see Supplementary Section S2.1 for details.
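A minimal dense sketch of the combined objective; the convex combination with weights (1 − γ) and γ follows the interpolation described above, and the function name and shapes are our own illustration.

```python
import numpy as np

def total_log_likelihood(A, X, u, v, w, beta, gamma, eps=1e-12):
    """L = (1 - gamma) * L_G + gamma * L_X: gamma = 0 uses only the network
    topology, gamma = 1 only the attributes. Shapes: A (N, N, L),
    X one-hot (N, Z), u, v (N, C), w (L, C, C), beta (C, Z)."""
    lam = np.einsum('ik,jl,akl->ija', u, v, w)   # Poisson means (structure)
    L_G = np.sum(A * np.log(lam + eps) - lam)    # structural term
    pi = 0.5 * (u + v) @ beta                    # multinomial parameters
    L_X = np.sum(X * np.log(pi + eps))           # attribute term
    return float((1 - gamma) * L_G + gamma * L_X)
```

Because the objective is linear in γ, sweeping γ linearly interpolates between the two pure objectives.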
We wish to find the Θ = (U, V, W, β) that maximizes equation (7). In general, this is computationally difficult, but we make it tractable by adopting a variational approach using an Expectation-Maximization (EM) algorithm [44], similar to what was done by De Bacco et al. [13], but extended here to include attribute information. Namely, we introduce two probability distributions: h_izk and ρ_ijkl^(α). For each i, z with x_iz = 1, h_izk represents our estimate of the probability that the i-th node has the z-th category given that it belongs to community k. Similarly, ρ_ijkl^(α) is a probability distribution over the pairs of groups (k, l) responsible for the edges of type α between i and j. Using Jensen's inequality log⟨x⟩ ≥ ⟨log x⟩ for each log-likelihood term gives:

L_X(U, V, β, h) = Σ_{i,z} x_iz Σ_k h_izk log [ (u_ik + v_ik) β_kz / (2 h_izk) ]    (8)

L_G(U, V, W, ρ) = Σ_{i,j,α} [ Σ_{k,l} a_ij^(α) ρ_ijkl^(α) log ( u_ik v_jl w_kl^(α) / ρ_ijkl^(α) ) − Σ_{k,l} u_ik v_jl w_kl^(α) ] .    (9)

These lower bounds hold with equality when

h_izk = (u_ik + v_ik) β_kz / Σ_{k'} (u_ik' + v_ik') β_k'z ,   ρ_ijkl^(α) = u_ik v_jl w_kl^(α) / Σ_{k',l'} u_ik' v_jl' w_k'l'^(α) ;    (10)

thus maximizing L_X(U, V, β) is equivalent to maximizing L_X(U, V, β, h); similarly for L_G(U, V, W) and L_G(U, V, W, ρ) (the same result was derived in [13]). Overall, we aim at maximizing

L(Θ, h, ρ) = (1 − γ) L_G(U, V, W, ρ) + γ L_X(U, V, β, h) ,    (11)

in analogy with what was done before. These maximizations can be performed by alternately updating one set of parameters while keeping the others fixed. The EM algorithm performs these steps by alternately updating h, ρ (Expectation step) and Θ (Maximization step); this is done starting from a random configuration until L(Θ, h, ρ) reaches a fixed point. Calculating equation (10) represents the E-step of the algorithm. The M-step is obtained by computing partial derivatives of L(Θ, h, ρ) with respect to the various parameters in Θ and setting them equal to zero. We add Lagrange multipliers λ = (λ^(β), λ^(u), λ^(v)) to enforce the constraints. For instance, focusing on the update for β_kz, setting the derivative of equation (11) with respect to it to zero and enforcing the constraint Σ_{z=1}^Z β_kz = 1 gives λ_k^(β) = γ Σ_{i,z} x_iz h_izk; plugging this back finally gives:

β_kz = Σ_i x_iz h_izk / Σ_{i,z} x_iz h_izk ,    (12)

which is valid for γ ≠ 0.
Doing the same for the other parameters yields the updates for u_ik, v_ik and w_kl^(α) in equations (13)-(15) (see Supplementary Section S1 for details), where equation (15) is valid for γ ≠ 1. The EM algorithm thus consists of randomly initializing the parameters Θ and then repeatedly alternating between updating h and ρ using equation (10) and updating Θ using equations (12)-(15) until L(Θ, h, ρ) reaches a fixed point. A pseudo-code is given in Algorithm 1. In general, the fixed point is a local maximum, but we have no guarantee that it is also the global one. In practice, we run the algorithm several times, starting from different random initializations, and take the run with the largest final L(Θ, h, ρ). The computational complexity per iteration scales as O(M C² + N C Z), where M is the total number of edges summed across layers. In practice, C and Z are of similar order of magnitude, usually much smaller than the system size M; for sparse networks, as is often the case for real datasets, M ∝ N, thus the algorithm is highly scalable, with a total running time roughly linear in the system size.
Notice that, although we started from a network log-likelihood L_G(U, V, W) similar to the one proposed in the MULTITENSOR model [13], the only update preserved from it is the one for w_kl^(α) in equation (15). The updates for u_ik and v_ik are instead quite different; the main reason is that here we incorporate the node attributes, which appear both explicitly and implicitly (through h) inside the updates. In addition, here we enforce normalizations like Σ_k u_ik = 1, which are not enforced in MULTITENSOR. This implies that our model restricted to γ = 0, i.e., no attribute information, does not correspond exactly to MULTITENSOR. It also implies that, upon convergence, we can directly interpret the memberships as soft (or overlapping) community assignments without the need to post-process their values; in words, u_ik represents the probability that node i belongs to the outgoing community k, and similarly v_ik for the incoming membership. This distinction is necessary when considering directed networks. If one is interested in recovering hard memberships, where a node is assigned to only one community, then one can choose the community corresponding to the maximum entry of u or v.
Algorithm 1 MTCOV-EM algorithm. Initialize U, V, W, β at random. Repeat until convergence:
1. Calculate h and ρ (E-step) using equation (10).
2. Update the parameters (M-step):
i) for each node i and community k, update the memberships u_ik and v_ik;
ii) if γ ≠ 1, for each pair of communities (k, l) and each layer α, update the network-affinity matrix W^(α);
iii) if γ ≠ 0, for each community-attribute pair (k, z), update the attribute-affinity matrix β.
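To make one E/M sweep concrete, the following sketch updates the attribute-affinity matrix β; the closed form used for h is our reading of the tight Jensen bound for the attribute term, and the helper name is hypothetical.

```python
import numpy as np

def em_sweep_beta(X, u, v, beta, eps=1e-12):
    """One EM sweep for beta. E-step: h[i, z, k], our estimate that node i's
    attribute falls in category z via community k, proportional to
    (u_ik + v_ik) * beta_kz. M-step: renormalize the h-weighted one-hot
    counts (the scaling gamma cancels in this particular update)."""
    m = u + v                               # (N, C) combined memberships
    # E-step: h[i, z, k] proportional to m[i, k] * beta[k, z]
    h = m[:, None, :] * beta.T[None, :, :]  # (N, Z, C)
    h /= h.sum(axis=2, keepdims=True) + eps
    # M-step: beta_kz = sum_i x_iz h_izk / sum_{i,z} x_iz h_izk
    num = np.einsum('iz,izk->kz', X, h)
    return num / (num.sum(axis=1, keepdims=True) + eps)
```

Each row of the returned β remains a probability distribution over categories, matching the normalization constraint.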

Evaluation metrics
We adopt two different criteria for performance evaluation, based on having or not having access to ground-truth values for the community assignments. The first case applies to synthetic-generated data, the second to both synthetic and real-world data. We explain performance metrics in detail below.
a. Ground-truth available. In the presence of a known partition, we measure the agreement between the set of ground-truth communities C* and the set of detected communities C, using metrics for recovering both hard and soft assignments. For hard partitions, the idea is to match every detected community with its most similar ground-truth community and measure the similarity δ(C*_i, C_j) (and vice versa, matching every ground-truth community against a detected community), as done by Yang et al. [9]. The final performance is the average of these two comparisons:

(1 / 2|C*|) Σ_{C*_i ∈ C*} max_{C_j ∈ C} δ(C*_i, C_j) + (1 / 2|C|) Σ_{C_j ∈ C} max_{C*_i ∈ C*} δ(C*_i, C_j) ,

where we consider as similarity metric δ(·) the F1-score and the Jaccard similarity.
In both cases, the final score is a value between 0 and 1, where 1 indicates a perfect matching between detected and ground-truth communities. For soft partitions, we consider two standard metrics for measuring the distance between vectors, as done by De Bacco et al. [13]: the cosine similarity (CS) and the L_1 error, averaged over the nodes:

CS = (1/N) Σ_i (u_i · u⁰_i) / (‖u_i‖_2 ‖u⁰_i‖_2) ,   L_1 = (1/2N) Σ_i ‖u_i − u⁰_i‖_1 ,

where u_i is the C-dimensional vector containing the i-th row of U, representing the detected membership, and similarly u⁰_i for the ground truth U⁰. The factor 1/2 ensures that the L_1 distance ranges from 0 for identical distributions to 1 for distributions with disjoint support. Similarly to what is done for hard partitions, we match the ground-truth and detected communities by choosing the permutation of the C groups that gives the highest cosine similarity or the smallest L_1 distance.
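The soft-partition scores can be sketched as follows; for the small C used here an exhaustive search over label permutations is feasible, as in the matching procedure described above.

```python
import numpy as np
from itertools import permutations

def soft_partition_scores(U, U0):
    """Average cosine similarity (higher is better) and halved L1 error
    (lower is better) between inferred memberships U and ground truth U0,
    each (N, C), optimized over permutations of the C community labels."""
    C = U.shape[1]
    best_cs, best_l1 = -1.0, np.inf
    for perm in permutations(range(C)):
        Up = U[:, list(perm)]
        num = np.sum(Up * U0, axis=1)
        den = np.linalg.norm(Up, axis=1) * np.linalg.norm(U0, axis=1) + 1e-12
        best_cs = max(best_cs, float(np.mean(num / den)))
        best_l1 = min(best_l1, float(np.mean(0.5 * np.abs(Up - U0).sum(axis=1))))
    return best_cs, best_l1
```

For larger C, the exhaustive permutation search would be replaced by a linear-assignment solver.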
b. Ground-truth not available. In the absence of ground truth, these metrics cannot be computed, and one must resort to other approaches for model evaluation. Here we consider performance in prediction tasks when hiding part of the input datasets while fitting the parameters, and in particular the extent to which partial knowledge of the network edges helps predict node attributes and vice versa. Thus we consider one measure for link prediction and one for the correct retrieval of the attributes. For link prediction, we use the AUC statistic, equivalent to the area under the receiver-operating characteristic (ROC) curve [45]. It represents the probability that a randomly chosen missing connection (a true positive) is given a higher score than a randomly chosen pair of unconnected vertices (a true negative). Thus, an AUC statistic equal to 0.5 indicates random chance, while the closer it is to 1, the better our model's predictions are compared to chance. We measure the probability of observing an edge via the predicted expected Poisson parameters of equation (1). For the attributes, instead, we use the accuracy as a quality measure. For each node, we compute the predicted expected multinomial parameter π_i using equation (4). We then assign to each node the category with the highest probability, computing the accuracy as the ratio of correctly classified examples over the total number of nodes. As baselines, we compare with the accuracy obtained with a uniform random probability and with the highest relative frequency observed in the training set.
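Both evaluation measures admit compact implementations; the pairwise AUC estimator below (ties counted as 1/2) is one standard way to compute the statistic, and the function names are ours.

```python
import numpy as np

def auc_score(pos_scores, neg_scores):
    """AUC: probability that a held-out edge (true positive) receives a higher
    predicted score than a non-edge (true negative); ties count as 1/2."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean(pos > neg) + 0.5 * np.mean(pos == neg))

def attribute_accuracy(pi, X):
    """Fraction of nodes whose most probable predicted category (argmax of the
    multinomial parameter pi, N x Z) matches the observed one-hot attribute X."""
    return float(np.mean(pi.argmax(axis=1) == X.argmax(axis=1)))
```

The pairwise comparison is quadratic in the number of held-out entries; a rank-based estimator would be preferable for very large test sets.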

Cross-validation tests and hyperparameter settings
We perform prediction tasks using cross-validation with 80-20 splits: we use 80% of the data for training the parameters and then measure AUC and accuracy on the remaining 20% test set. Specifically, for the network topology, we hold out 20% of the triples (i, j, α); for the attributes, we hold out 20% of the entries of the categorical vector. Our model has two hyperparameters, the scaling parameter γ and the number of communities C. We estimate them using 5-fold cross-validation along with a grid search over their possible values. We then select the combination (Ĉ, γ̂) that returns the best average performance over the cross-validation runs. Standard cross-validation considers performance in terms of a particular metric. However, here we have two qualitatively different ones, i.e., AUC and accuracy. Depending on the task at hand, one can define performance as a combination of the two, bearing in mind that the values of (Ĉ, γ̂) at the maximum of either of them might not coincide. Here we select (Ĉ, γ̂) as the values jointly closest to both maxima. In the experiments where one of the two hyperparameters is fixed a priori, we run the same procedure but grid-search only over the unknown hyperparameter.
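The 80-20 holdout over adjacency triples and attribute entries can be sketched as follows; the mask-based bookkeeping is our own illustration of the split described above.

```python
import numpy as np

def holdout_masks(N, L, test_frac=0.2, seed=0):
    """80-20 split: boolean masks marking held-out test entries, one over the
    (i, j, alpha) triples of the adjacency tensor and one over the nodes'
    attribute entries; parameters are fitted on the ~80% left visible."""
    rng = np.random.default_rng(seed)
    edge_test = rng.random((N, N, L)) < test_frac   # held-out triples
    attr_test = rng.random(N) < test_frac           # held-out attribute entries
    return edge_test, attr_test
```

For a 5-fold scheme, five disjoint masks of this kind would be drawn so that every entry is held out exactly once.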

Results
We test MTCOV's ability to detect communities in multilayer networks with node attributes by considering both synthetic and real-world datasets. We compare against MULTITENSOR [13], an algorithm similar to ours but that does not include node attributes. We also test MTCOV's performance on single-layer networks, as the mathematical framework behind MTCOV still applies. Given this potential use and the paucity of algorithms suitable for comparison for multilayer networks, such comparisons assess the general utility of MTCOV.

Multilayer synthetic networks with ground-truth
To illustrate the flexibility and the robustness of our method, we generate multilayer synthetic networks with different kinds of structures in the various layers adapting the protocol described in De Bacco et al. [13] to accommodate node attributes. We generate attributes as done in Newman and Clauset [27]: we match them with planted communities in increasing ratios varying from 0.3 to 0.9; these values correspond also to the γ parameters that we fix for MTCOV. Specifically, we generate three types of directed networks using a stochastic block model [46], all with C = 2 communities of equal-size unmixed group membership and N = 1000 nodes, but with different numbers and kinds of layers, similar to De Bacco et al. [13]. The first network (G 1 ) has L = 2 layers, one assortative (W α has higher diagonal entries) and one disassortative (W α has higher off-diagonal entries); the second (G 2 ) has L = 4 layers, two assortative and two disassortative and the third (G 3 ) has L = 4 layers, one assortative, one disassortative, one core-periphery (W α has higher diagonal entries but one of the two is bigger than the other) and one with biased directed structure (W α has higher off-diagonal entries but one of the two is bigger than the other). We generate ten independent samples of each of these types of networks and use all the evaluation metrics described in the Methods section in the presence of ground-truth. We use the membership inferred by the algorithms using the best maximum likelihood fixed point over 10 runs with different random initial conditions. As shown in Table I, MTCOV performs significantly better than MULTITENSOR on the first and second network. This suggests that incorporating attribute information can significantly boost inference, with an increasing benefit for a smaller number of layers. Figure 1 shows an example of this result. 
On the other hand, notice that G 2 requires a smaller match (γ = 0.5) between attributes and communities than G 1 (γ = 0.7) to achieve similar performance. G 1 and G 2 have similar structure but the second has twice as many layers. Thus, increasing the number of layers may require less contribution from the extra information of the attributes, a possible advantage for multilayer networks. This intuition is reinforced by noticing not only that the best performance is achieved for γ < 0.9, but also that both the algorithms perform very well in the third network, regardless of the value of the match between attributes and communities. Contrary to G 2 , G 3 has a different structure in each of the 4 layers. This diversity can be even more beneficial than having more but correlated layers (as in G 1 vs G 2 ). These synthetic tests demonstrate the impact of leveraging both node attributes and topological information: when topological structure is not very informative (as in G 1 with only two layers), adding node attributes can significantly help in recovering the communities. In contrast, when topological information is more complex (as in G 3 where all layers are different), properly combining the different layers' structures can compensate for a limited access to extra information on nodes. Overall, this shows the need for methods suited for exploiting various sources of information and the complexity behind multilayer networks.  3 denotes a match of 0.3, this is also the value we use to fix γ) between attributes and planted communities on synthetic directed multilayer networks. Results are averages and standard deviations over 10 networks samples for each network type G m , m = 1, 2, 3; we take the average performance over the in-coming and out-going memberships, i.e., the matrices U and V , and the best performances are in boldface. Networks are generated with stochastic block model with C = 2, N = 1000 and average degree k = 4. 
Denote by W_a, W_d, W_cp and W_bd the affinity matrices of the assortative, disassortative, core-periphery and biased directed layers, respectively; their entries follow the structures described in the text. The F1-score, Jaccard, CS and L_1 are performance metrics as defined in the Methods section.
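As a concrete illustration of the generation protocol above, the following sketch samples a directed multilayer SBM with hard planted memberships and attributes matched to the communities at a given ratio. The function names and the exact degree scaling are our own illustrative choices, not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def multilayer_sbm(N=1000, C=2, avg_degree=4, affinities=None):
    """Sample a directed multilayer SBM; `affinities` is a list of CxC
    matrices W^(alpha), one per layer (assortative, disassortative, ...)."""
    z = rng.integers(0, C, size=N)               # hard planted memberships
    layers = []
    for W in affinities:
        # scale W so the expected out-degree is roughly avg_degree
        P = W[z][:, z] * avg_degree / (W.mean() * N)
        A = (rng.random((N, N)) < P).astype(int)
        np.fill_diagonal(A, 0)                   # no self-loops
        layers.append(A)
    return np.stack(layers), z

def matched_attributes(z, C, match=0.7):
    """Attribute equals the planted community for a `match` fraction of
    nodes, and is drawn uniformly from the non-matching values otherwise."""
    x = z.copy()
    flip = rng.random(len(z)) > match
    x[flip] = (z[flip] + rng.integers(1, C, size=flip.sum())) % C
    return x

# G1-style network: one assortative and one disassortative layer
W_a = np.array([[0.9, 0.1], [0.1, 0.9]])   # assortative
W_d = np.array([[0.1, 0.9], [0.9, 0.1]])   # disassortative
A, z = multilayer_sbm(affinities=[W_a, W_d])
x = matched_attributes(z, C=2, match=0.7)
```

The resulting tensor A (shape L × N × N) and attribute vector x can then be fed to any of the compared algorithms.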

Multilayer social support network of rural Indian villages
We demonstrate our model beyond synthetic data by applying it to the social support networks of two villages in Tamil Nadu, India, which we call by the pseudonyms "Tenpaṭṭi" (Ten) and "Alakāpuram" (Ala) [47-49]. Data were collected through surveys in which adult residents were asked to nominate the individuals who provided them with various types of support, including running errands, giving advice, and lending cash or other household items. The data were collected in two rounds, one in 2013 and the other in 2017. Each type of support corresponds to a layer in the network; we consider only those layers present in both rounds, for a total of L = 6 layers. After pre-processing the data, by keeping only those individuals with at least one outgoing edge and removing self-loops, the resulting networks have the sizes reported in Table II. In addition, several attributes were collected, which include information about age, religion, caste, and education level.

FIG. 1: Colors denote the inferred partition; the attributes in (c) and (d) are generated by matching them with the true community assignments for 50% and 70% of the nodes, respectively, and are otherwise chosen uniformly at random from the non-matching values; squares and triangles denote the synthetic dummy attribute (squares are matched with the red group, triangles with the blue), and node size shows whether a node is matched with its true community (bigger means deterministic match, smaller means uniform-at-random match). We use the matrix U for the membership.

Ethnographic observation in these villages [47] and previous analyses [48,49] suggest that social relations are strongly structured by religious and caste identity, with these divisions shaping where people live, who they marry, and who they choose to associate with. In other words, they suggest a dependence between the attributes religion and caste and the mechanisms driving edge formation in these social support networks. Motivated by these insights, here we consider the attributes caste and religion and add them to the model. In addition, we test the importance of variables that we expect to be less informative, such as gender and age. The latter, being continuous, is also an example of a non-categorical variable. Provided it has a finite range, as is the case for age, we can encode it as categorical by binning its values; here we use equal bins of size 5 years. Without assuming any ground truth a priori, we measure performance using the AUC and accuracy as explained in the Methods section. We compare with MULTITENSOR to measure the extent to which adding the attributes helps predict edges and attributes; in addition, for the accuracy values, we consider two baselines for further comparison: i) a uniform-at-random probability over the number of possible categories (RP); and ii) the maximum relative frequency of the attribute value appearing most often (MRF). We fix the hyperparameters using 5-fold cross-validation and obtain values of γ ∈ [0.2, 0.9], signaling relevant correlations between attributes and communities. For details, see Supplementary Table S2. Empirically, we observe that when γ > 0.5 the algorithm achieves better performance in terms of link and attribute prediction, by balancing the log-likelihood of the attribute dimension against that of the network structure.
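The binning of a continuous attribute such as age into equal 5-year categories, followed by the one-hot encoding required by the design matrix, can be sketched as follows (a minimal illustration, not the released pre-processing code):

```python
import numpy as np

def bin_ages(ages, width=5):
    """Bin a continuous age into categories of `width` years and
    one-hot encode the result, so that each row of X sums to 1 as
    required for a categorical attribute."""
    labels = np.asarray(ages) // width        # e.g. ages 20-24 -> bin 4
    bins = np.unique(labels)                  # the observed categories
    X = (labels[:, None] == bins[None, :]).astype(int)
    return X, bins

X, bins = bin_ages([23, 24, 31, 67, 20])
```

Each row of X has exactly one nonzero entry, consistent with the multinomial assumption on the attribute.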

TABLE II (header): Village | Year | Nodes | Edges | ⟨k⟩ | Caste | Religion | Age | Gender
For validation, we split the dataset into training/test sets uniformly at random as explained in the Methods section. Table III reports the average results over ten runs for each network, and shows that MTCOV is capable of leveraging two sources of information to improve both performance metrics. In fact, our algorithm systematically achieves the highest accuracy for attribute prediction and the highest AUC for edge prediction (boldface). While good performance in attribute prediction is expected by design, as we add this data to the model, the fact that it also boosts performance in terms of edge prediction is not guaranteed a priori. Instead, it is a quantitative way to show that an attribute plays an important role in the system. It also demonstrates the potential of capturing correlations between two different sources of information, which can have relevant applications, in particular when information of one kind is missing. Notice in particular the improvement in AUC when using caste compared to using no attribute (MULTITENSOR). The other attributes are less informative; in particular, age has a performance similar to MULTITENSOR in edge prediction, signaling that it does not help inform edge formation. Indeed, it has the smallest inferred γ (always < 0.5), and it also gives worse accuracy than the baseline, signaling again that this attribute may not be correlated with the community structure. All these results show the flexibility of MTCOV in adapting to the data given in input: if warranted, it is able to ignore attributes that are not correlated with the network division and instead find communities that are mainly based on the network structure. Next, we test how adding node attributes impacts robustness against unbalanced data, where the ratio of positive examples (existing edges) observed in the training set is different from that in the test set.
We denote the total probability of selecting an edge in the test set as tpe and consider values tpe ∈ {0.001, 0.004, 0.015, 0.03}, denoting under-representation (0.001), equality (0.004), and over-representation (0.015 and 0.03) compared to uniform-at-random selection (for which we empirically find tpe = 0.004). In these tests, we hold out 20% of the entries of A, biasing their selection using the tpe values; in addition, we give as input the whole design matrix X (attributes) and measure link-prediction performance. We observe that MTCOV is significantly more robust than the algorithm that does not use any attribute information, regardless of the value of γ. In fact, even though performance deteriorates as we decrease the number of positive examples in the training set (i.e., higher tpe), MTCOV is less impacted by this, as shown in Fig. 2 (results reported in Supplementary Table S3). Notice in particular the performance discrepancies when using the attribute caste in the difficult regimes (tpe ∈ {0.015, 0.03}): MTCOV's performance deteriorates only a little, while using the other attributes or none makes performance significantly worse, with AUC down to 0.6 from a value higher than 0.8 in the easier regimes. Moreover, notice that attributes with the same scaling-parameter value can give different prediction results, underlining the necessity of considering both the value of the estimated γ and the quality of the attribute when quantifying its importance. This could explain why caste always provides better results: its categories are more heterogeneous (i.e., carry more information) than those of religion and gender. The robustness of MTCOV is also confirmed by analyzing performance on a trial-by-trial basis, each trial being a random sample of the held-out entries. As we show in Fig. 3, MTCOV better predicts links in 89% of the trials and never goes below the threshold of 0.5, the baseline of random choice.
These results demonstrate how adding another source of information helps when only a limited number of network edges is observed.

TABLE III: Results are averages and standard deviations over 10 independent trials of cross-validation with 80-20 splits selected uniformly at random (i.e., tpe = 0.004); the best performances are in boldface. Datasets are described in Table II. RP is the performance of the uniform random probability and MRF that of the maximum relative frequency; see the Methods section for details.

Qualitative analysis of a social support network
To demonstrate our MTCOV model beyond prediction tasks and highlight its potential for interpretability, we show as an example its qualitative behavior on the real network of Alakāpuram in 2017 (see Table II). Specifically, we compare the communities extracted by our algorithm with those inferred by MULTITENSOR. To ease comparison, we fix the same number of groups, C = 4, for both algorithms, set γ = 0.8 as obtained with cross-validation, and measure how caste membership is distributed across communities. Figure 4 shows the magnitude of each individual's inferred outgoing membership u_i in each group. While the communities identified by MTCOV and MULTITENSOR show substantial similarities, MTCOV generally classifies castes more consistently into distinct communities, as we show in Figs. 4 and 5. To quantify this different behavior, we measure the entropy of the attribute inside each community, H_k = −(∑_{z=1}^{Z} f_z log f_z)/log Z, where f_z is the relative frequency of the z-th caste inside group k and the denominator is the entropy of a uniform distribution over the Z castes, our baseline for comparison. Values of H_k close to 1 denote a more uniform distribution of castes, whereas smaller values denote an unbalanced distribution with most of the people belonging to a few castes. We find that MTCOV has smaller entropies over the groups, with two groups having the smallest values, whereas MULTITENSOR has the highest, showing its tendency to cluster individuals of different castes in the same group. In addition, we observe that MTCOV has a more heterogeneous group-size distribution, which seems to be correlated with caste. Notably, the algorithms differ in how they place two caste groups that live in hamlets separated from the main village (the Hindu Yātavars and CSI Paraiyars). With MULTITENSOR they are grouped together, while with MTCOV the Hindu Yātavars are joined into a community with Paḷḷars and Kulālars.
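The normalized entropy H_k defined above is straightforward to compute; the following helper (our own sketch, not part of the released code) implements it for one community:

```python
import numpy as np

def normalized_entropy(attr, communities, k):
    """H_k = -(sum_z f_z log f_z) / log Z for the distribution of the
    attribute inside community k; values near 1 mean a near-uniform mix
    of categories, values near 0 mean the group is dominated by few."""
    attr = np.asarray(attr)
    values = np.unique(attr)                 # the Z attribute categories
    members = attr[np.asarray(communities) == k]
    f = np.array([(members == z).mean() for z in values])
    f = f[f > 0]                             # 0 log 0 = 0 by convention
    return float(-(f * np.log(f)).sum() / np.log(len(values)))
```

For example, a community whose members are evenly split between two castes gives H_k = 1, while a single-caste community gives H_k = 0.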
While MULTITENSOR is clearly picking up the structural similarities of the two hamlets, this division makes little sense socially and culturally. In contrast, the way in which MTCOV defines a community that spans caste boundaries (MTCOV C1) aligns with ethnographic knowledge of the relations between these castes. Finally, we remark that there may be multiple meaningful community divisions in the network, and the fact that MTCOV's partition seems to better capture the distribution of the attribute caste does not mean that one algorithm is better than the other. In fact, there might be other hidden topological properties that MULTITENSOR's partition picks up by being agnostic to caste membership. The choice of which algorithm to use should be made based on the final goal of the application at hand.

Results on single-layer networks
Our model can be used for single-layer networks as well. For these we can compare against two state-of-the-art algorithms, both probabilistic generative models but different in their assumptions: CESNA [9], which considers overlapping communities and posits two independent Bernoulli distributions for network edges and node attributes; and the model proposed by Newman and Clauset [27] (NC) for non-overlapping communities, a Bayesian approach where the priors on the community memberships depend on the node attributes. CESNA, similarly to our model, assumes conditional independence of the two likelihoods and introduces a regularization parameter between them; it uses block-coordinate ascent for parameter estimation, while NC uses an EM algorithm, similarly to what we do here. We test MTCOV against them on both synthetic and real single-layer networks with node attributes, with and without ground truth. We transform directed networks into undirected ones because neither CESNA nor NC accounts for edge directionality. Results on synthetic data show that MTCOV and NC have similar performance in correctly classifying nodes into their ground-truth communities, and both are better than CESNA; the main difference is that MTCOV is more stable and has lower variance for high attribute correlation, in particular in the hard regime where classification is more difficult. We leave the details to Supplementary Section S3. For single-layer real networks, we use datasets with ground-truth candidates and node attributes: the ego-Facebook network [50] (facebook), a set of 21 networks built from the connections between a person's friends, where the potential ground truth is given by circles of friends hand-labeled by the ego herself; the American College football network (football) [51], a network of football teams playing against each other, where a ground-truth candidate is the conference each team belongs to; and a network of political blogs [52] (polblogs), where the potential ground-truth communities are given by the division into left/liberal and right/conservative political parties; see Supplementary Section S3 for details. For each network, we run a 5-fold cross-validation procedure combined with a grid search to fix the hyperparameter γ. For facebook we find that the average over the 21 networks is γ = 0.15, which signals a low correlation between the covariates and the communities, whereas for the football and polblogs networks we obtain much higher values of γ, equal to 0.6 and 0.75 respectively. MTCOV has better performance in terms of F1-score and Jaccard similarity across the majority of datasets, as shown in Table IV.

TABLE IV: The results are averages and standard deviations over ten independent runs; the best outcomes are in boldface.
This is also supported by a trial-by-trial comparison, shown in Fig. 6 for the F1-score (similar results are obtained for the Jaccard similarity), where we find that MTCOV is more accurate than NC and CESNA in 59% and 90% of the cases, respectively.

Discussion
We present MTCOV, a generative model that performs overlapping community detection in multilayer networks with node attributes. We show its robustness in adapting to different scenarios, and its flexibility in exploiting the attributes that are more informative while ignoring those that are less correlated with the network communities. Our method is capable of quantitatively estimating the contribution given by the attributes and of incorporating them to improve prediction performance, both in terms of recovering missing attributes and in terms of link prediction. This allows domain experts to investigate particular attributes and select relevant community partitions based on the type of node information they are interested in. There are valuable possible extensions of this work. One example is to incorporate the modeling of more complex data types for the attributes, for instance combinations of discrete and continuous attributes, or other types of extra information, such as time-varying network elements, whether attributes, nodes, edges, or combinations of these. From a technical point of view, when the topological and attribute datasets are very unbalanced in size, this might impact their relative likelihood weights and thus inference. One should then consider automating the process of rescaling them accordingly, as a pre-processing step to be incorporated into the model. Similarly, hyperparameter selection would benefit from an automated routine when more than one performance metric is considered. The relations between attributes and communities could also be transferred across networks to predict missing information when one has access to similar but incomplete datasets. We show an example of this setting here, where we study two snapshots of the same village networks across time. While we leave these questions for future work, we provide an open-source version of the code.

Data availability
The code used for the analysis and to generate the synthetic data is publicly available and can be found at https://github.com/mcontisc/MTCOV.

Supporting Information (SI)

S1. Methods: EM detailed derivation
We show the derivations of the updates given in equations (12)-(15) of the main manuscript. The partial derivative with respect to the elements of the affinity matrices W^(α) is given in equation (S1). The valid update when γ is different from 1 is obtained by setting equation (S1) to zero, which yields equation (S2). In order to take the derivative with respect to β_kz we need to consider the Lagrange multiplier λ^(β)_k because of the constraint in equation (11). Setting the resulting derivative to zero gives equation (S4); enforcing the constraint (11) then determines the multiplier, equation (S6). Plugging (S6) into (S4), we obtain the update (S7). Turning to the elements of the matrix U, we first note that, plugging the update for w^(α)_kl given in equation (S2) into the log-likelihood of the structural dimension L_G, the last term becomes a constant: indeed, ∑_ij a^(α)_ij is the number of links in layer α when the network is directed (or twice this value in the undirected case). Thus, we can ignore this term when estimating u_ik. Using the same strategy used in computing the update of β, we compute the Lagrange multiplier λ^(u)_i for the constraint in equation (11), which leads to the stationarity condition (S10). Enforcing the constraint ∑_k u_ik = 1 yields equation (S12), where we use ∑_{k,l} ρ^(α)_ijkl = 1 and ∑_z x_iz = 1. Plugging the result of equation (S12) into the equality (S10), we obtain the update (S13). In order to compute the update for V, we fix j and l and rewrite the attribute dimension L_X(U, V, β, h) as

L_X(U, V, β, h) = ∑_{j,z,l} x_jz ( h_jzl log(β_lz (u_jl + v_jl)) − h_jzl log h_jzl ).   (S14)

Analogously to before, we consider the Lagrange multiplier to satisfy the constraint given in equation (11), obtaining the update (S15). In each iteration of the EM algorithm, the parameters in Θ = (U, V, W, β) are updated with equations (S2), (S7), (S13) and (S15), until the log-likelihood L(Θ, h, ρ) reaches a fixed point.
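The overall alternation can be summarized by the following structural sketch. The bodies of the update steps correspond to the closed-form equations (S2), (S7), (S13) and (S15) and are deliberately left abstract here: the function only illustrates the fixed-point loop, and its name and signature are ours, not the released implementation's.

```python
def em_fixed_point(theta, log_likelihood, updates, tol=1e-4, max_iter=500):
    """Alternate the closed-form M-step updates (for W, beta, U and V in
    turn) until the log-likelihood L(Theta, h, rho) reaches a fixed point,
    i.e. its change falls below `tol`."""
    prev = log_likelihood(theta)
    for _ in range(max_iter):
        for update in updates:          # apply each closed-form update
            theta = update(theta)
        curr = log_likelihood(theta)
        if abs(curr - prev) < tol:      # fixed point reached
            break
        prev = curr
    return theta
```

As a toy check, iterating the contraction t → (t + 2)/2 with objective −(t − 2)² converges to the maximizer t = 2.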

A. Normalization
A way to control the magnitude of the likelihood terms is to rescale each term individually. Here, we estimate two linear regression models to obtain the normalization constants for the two terms. Given access to several network datasets of the same kind (social support networks in Indian villages), we collect log-likelihood values as a function of the number of nodes (N) and edges (E) of the observed network and the number of categories of the categorical attribute (Z). Quantitatively, this means normalizing as in equation (S16). The superscripts and subscripts of the c parameters indicate the dependent variable (L_X or L_G) and the input regressor they refer to, respectively. In particular, we collect the data by running the model for each pair of network and categorical attribute, arbitrarily fixing the number of communities C = 3 and the scaling parameter γ = 0.5. Table S1 reports the values of the statistically significant coefficients for the estimation of the log-likelihood terms, and only those coefficients are used in the normalized equation (S16).
TABLE S1: Statistically significant regression coefficients (** and *** denote significance levels):

         L_X           L_G
c_N      −0.486***     −1.778**
c_E                    −6.158***
c_Z      −33.862***

On the one hand, this procedure allows us to obtain coefficient estimates useful for analyzing all four Indian village networks in a quantitative and automated way. On the other hand, we are aware that these coefficients are strictly tied to the type of network we are working with, i.e., their values cannot be used on network datasets other than the social support networks used here. Future work should investigate an automated normalization procedure applicable to any network dataset as a pre-processing step.

C. Biased link prediction
We use sampling-bias techniques to assign higher or lower sampling probability to the entries of the adjacency tensor, which correspond to edges and non-edges. Defining tpe as the total probability of selecting an edge (non-zero entry), we assign to the entries a^(α)_ij > 0 a probability of being selected in the test set, and a complementary one to the 0 entries, where E and N²L are the total numbers of edges and of entries of the adjacency tensor, respectively. These two probabilities are assigned to the entries a^(α)_ij to perform a biased selection when choosing test and train sets, as in the selection of a binary mask. The tpe is used to select entries for the test set; if entries were selected uniformly at random, this value would be around 0.004. This low value is due to the sparsity common in real networks, where the number of non-zero entries is much lower than the number of zeros. We create three different situations, starting from tpe = 0.001, where the probability of selecting an edge is lower than the probability of choosing a non-edge, and the number of edges in the training set is much higher than in the test set. Then we have tpe = 0.015 and finally tpe = 0.03, where the probability of selecting an edge for the test set is higher than the probability of choosing a non-edge, and the test set contains a larger number of positive examples. These settings follow an increasing order of difficulty, starting from an under-represented case (tpe = 0.001) and ending with a difficult task where the number of edges in the test set is over-represented (tpe = 0.03). We run 10 independent trials for each setting and model.

TABLE S3: Results are averages and standard deviations over 10 independent trials of cross-validation with 80-20 splits applied to the entries of A (the whole X is given in input); best performances are in boldface. Datasets are described in the main manuscript.
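One way to implement this biased selection is sketched below. The exact per-entry probabilities are our assumption, consistent with the description above but not taken from the released code: the mass tpe is spread uniformly over the E edges and the remaining mass over the non-edges.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def biased_test_mask(A, tpe=0.004, test_frac=0.2):
    """Select `test_frac` of the entries of the adjacency tensor A
    (shape L x N x N) as the test set, giving the edges a total
    selection probability `tpe`: each edge gets tpe / E and each
    non-edge gets (1 - tpe) / (N^2 L - E)."""
    E = int((A > 0).sum())
    total = A.size                        # N^2 * L entries
    p = np.where(A > 0, tpe / E, (1 - tpe) / (total - E)).ravel()
    n_test = int(test_frac * total)
    idx = rng.choice(total, size=n_test, replace=False, p=p)
    mask = np.zeros(total, dtype=bool)
    mask[idx] = True                      # True = held out for testing
    return mask.reshape(A.shape)

A = np.zeros((2, 10, 10))
A[0, 0, 1] = 1
A[1, 2, 3] = 1
mask = biased_test_mask(A, tpe=0.01, test_frac=0.2)
```

Larger tpe values concentrate more of the positive examples in the test set, reproducing the increasingly difficult regimes described above.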

S3. Single-layer networks experiments
We generate 10 independent realizations with different random initializations for each network. We run NC setting the maximum number of BP steps before aborting to 40 and the maximum number of EM steps before aborting to 500, to be consistent with MTCOV. MTCOV and CESNA use the fraction of attribute match as the weight between network and attributes. Moreover, for CESNA we add a new community containing all non-classified nodes; this is done for all tests. NC and MTCOV allow choosing between a hard membership and a mixed one, thanks to the posterior probabilities given as output, by assigning the community with the highest probability (even though NC does not mention the possibility of mixed membership in the paper). On the other hand, CESNA does not return those values, and we have to work with its output files, which provide an overlapping partition of the nodes.

A. Single-layer synthetic networks
In analogy with what was done for multilayer synthetic networks, we test MTCOV's ability to detect communities on synthetic single-layer networks, using the approach proposed by Newman and Clauset [27], where synthetic networks are generated with a known community structure embedded within them. Networks are generated using the stochastic block model [46,53], with N = 1000 nodes and C = 2 non-overlapping communities of equal size. Edge probabilities are p_in = c_in/N and p_out = c_out/N, for within-group and between-group edges, respectively. The difference c_in − c_out measures the strength of the community structure: when c_in is much greater than c_out, the communities are easy to detect from network structure alone, and detection becomes harder when the two quantities are close. Discrete-valued attributes are generated on the nodes; they match the true community assignments of the nodes a given fraction of the time, and are otherwise chosen uniformly at random from the non-matching values. This allows us to control the correlation between attributes and community structure, and hence to test the algorithm's ability to exploit extra information of varying quality. The match between attributes and planted communities varies between 0.5 and 0.9; the higher this value, the more the node attributes help in predicting edges. In Fig. S1 we show the fraction of correctly classified nodes in terms of F1-score (results in terms of Jaccard are similar) for these experiments. We first notice a clear pattern where all methods increase their performance and reduce their variance as the difference c_in − c_out gets bigger, going from a hard-to-detect to an easy-to-detect regime. In the hard regime, where community structure weakens, both MTCOV and NC remain robust in detecting the communities in scenarios where the attributes are correlated. However, MTCOV shows lower variance and more stable results for high attribute correlations.
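Since community labels are arbitrary, classification scores such as the F1-score must be maximized over label permutations before comparing an inferred hard partition with the planted one. A minimal sketch of such an evaluation (our own helper, not the released evaluation code) is:

```python
import numpy as np
from itertools import permutations

def best_match_f1(true, pred, C=2):
    """Macro F1-score of the inferred hard partition against the planted
    one, maximized over permutations of the community labels (swapping
    the labels 0 and 1 must not be penalized)."""
    true, pred = np.asarray(true), np.asarray(pred)
    best = 0.0
    for perm in permutations(range(C)):
        relabeled = np.array(perm)[pred]
        f1s = []
        for k in range(C):
            tp = ((relabeled == k) & (true == k)).sum()
            fp = ((relabeled == k) & (true != k)).sum()
            fn = ((relabeled != k) & (true == k)).sum()
            f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
        best = max(best, float(np.mean(f1s)))
    return best
```

For instance, a prediction that is the planted partition with the two labels swapped still scores a perfect 1.0.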
In addition, we empirically observe that NC does not reach convergence in 19% of the trials, while MTCOV fails to converge in only 13% of them. CESNA always shows lower performance, probably penalized by the relatively high number of non-classified examples, which we also observe experimentally.
B. Single-layer real networks

a. The facebook ego-networks. We consider the attributes Birthday (B), Hometown (H) and Location (L), attributes that plausibly correlate with the community partitions. In addition, they have a reasonable number of values, compared to other attributes whose proportions of missing values are too high. We also check the row sums of the design matrix: since we assume a multinomial distribution for the attribute, each row must sum to 1. As a first step, we combine all 10 ego networks, merging both the edges and the covariates; in this way we double-check the real number of edges and the number of nodes. Moreover, we build complete design matrices for B, H and L, which are used to retrieve the largest possible number of attributes for each node. However, we decide to work with the ego networks separately, to have a clear classification of the ground truth. We build the input files with the following procedure, considering only the nodes having at least one edge and the attribute. We obtain 30 networks: 3 for each ego network, according to the attribute B, H or L; for each combination of ego and attribute we have a different adjacency matrix A and different ground-truth circles. However, we analyse 21 networks, because we have to discard three ego networks (3437, 3980, 698) due to their having no communities or too few nodes with ground truth.
b. American College Football. We use the corrected version of the dataset [54], where, among other corrections, the independent teams are assigned a unique community label, rather than a single community label as in the original football dataset. In this way the number of communities, given by the number of conferences, is 19. c. Political Blogs (polblogs). We remove the isolated vertices and the self-loops. The ground-truth communities are left/liberal and right/conservative, so C = 2.
For the analysis of these networks, we run CESNA with the default parameter α = 0.5, because the released code does not allow performing a cross-validation procedure on the scaling parameter. The facebook networks have overlapping communities, while for the other two datasets we assume non-overlapping communities, in accordance with the proposed ground truth. For facebook we consider the 21 ego networks as different iterations. For the other two, since both MTCOV and NC are based on an EM algorithm that is not guaranteed to reach a global maximum, we perform 10 restarts of the algorithms with different random initializations. Results are presented in the main manuscript.
FIG. S1: Accuracy of classification for synthetic single-layer networks with two communities of equal size, generated with the stochastic block model. Each plot shows results with a given match between metadata and planted communities. The results are averaged over 10 independent trials and the bars represent the standard deviation. The accuracy is measured with F1-score as similarity measure.