Introduction

Complex networks provide a powerful tool for representing real-world complex systems1. Social networks, the World Wide Web, protein-protein interaction networks, academic citation and coauthor networks, and hyperlinked blogs are typical examples of such networks, where nodes denote objects and links denote pairwise relations between them. In recent years, much effort has been focused on identifying communities: groups of related nodes with dense internal connections and few external connections2,3,4,5. In addition to node connectivity information, most real-world networks have node-associated attributes. In this case, two types of information are available: graph data that represent the relationships between objects, and attribute data that characterize individual objects. Thus, nodes can be grouped either by data clustering methods using only their attributes6, or by community detection methods using only their link structure4, 7. However, clustering objects by attribute similarity ignores the relationships between them, while identifying communities using only links between pairs of nodes ignores the attributes of the nodes within communities. Therefore, various methods have been developed to uncover communities in networks by combining structural and attribute information, such that nodes in a community are not only connected more densely than nodes outside of it, but also share similar attributes.

Existing methods can be classified roughly into two categories. The first category is composed of probabilistic generative models that formulate joint models of link connections and node attributes, and that use the models to infer the posterior community memberships of nodes in a network8,9,10,11,12,13,14,15,16,17. The second category contains three types of hybrid methods. The first represents links as a type of node feature and uses node attributes and link connections together to perform vertex clustering18,19,20. The second makes use of node attributes to help identify communities in networks21. The third uses node attributes and link structure together to optimize a unified objective function22, 23.

Probabilistic generative models include CESNA15, PCL-DC9, PPL-DC10, PPSB-DC11, cohsMix12, BAGC13, GBAGC14, BNPA17, and Metacode16. CESNA employs the probabilistic generative process of BIGCLAM24 for generating links, together with a logistic model of attributes, to infer the distribution of community memberships. PCL-DC, PPL-DC, and PPSB-DC project the discriminative content (DC) model of attributes into a generative model of links (such as PCL9, PPL10, and PPSB11) via community memberships. cohsMix embeds numerical node attributes into the MixNet model25 for generating link classes. BAGC and GBAGC extend the cohsMix model to process categorical attributes and weighted networks. BNPA introduces node attributes and Bayesian priors to Newman's mixture model26 and integrates the Chinese Restaurant Process to infer the number of communities. Metacode represents node attributes as metadata describing properties of nodes and incorporates the metadata into the degree-corrected stochastic block model27 to infer the correlation between metadata and network structure. These models have good interpretability and provide powerful tools to discover overlapping communities or general structures. However, existing models deal with only one type of attribute (either binary, categorical, or numerical) and are sensitive to initial values.

SA-cluster18 and Inc-cluster19, 20 are typical examples of vertex clustering methods that use node attributes and link connections. SA-cluster views node attributes as virtual vertices, constructs an attribute-augmented graph, and performs a random walk on the attribute-augmented graph to obtain a unified distance. It then adopts the K-medoids algorithm to cluster the nodes based on the learned pairwise distances. Inc-cluster was introduced as a slightly faster version of SA-cluster. CODICIL21 constructs content edges by selecting the top \(\bar{K}\) neighbors of each vertex using their attributes, obtains the combined similarity of each pair of nodes, and then sparsifies the newly constructed graph with content edges28. Finally, a fast graph clustering algorithm (Metis29 or MLR-MCL30) is used to partition the sparsified graph into K communities. GLFM22 extends MLFM31 (the multiplicative latent factor model) to give a unified model of homophily in networks, such that an edge is more likely to exist between two nodes with similar attributes than between nodes with different attributes. A minorization-maximization algorithm is then used to optimize the latent eigenmodel of GLFM. PICS23 finds cohesive clusters of nodes that have similar connectivity patterns and exhibit high levels of attribute homogeneity by optimizing a unified objective function defined by minimum description length. Compared to probabilistic generative models, these hybrid methods are more efficient. Nonetheless, they were designed to process networks with binary or categorical attributes only.

Nearly all of the methods mentioned above assume that the cluster memberships implied by node attributes are consistent with the community memberships determined by the link structure of a network. However, this is not always true in real-world networks. Although nodes in the same community tend to have similar features according to the homophily hypothesis32, a community may contain nodes that share similar attributes but are not linked, owing to the sparseness of a real network. Therefore, for each node, we used only a small number of its nearest neighbors, measured by attribute similarity, to alleviate the sparsity of the network while strengthening its community structure. Consequently, in this study, we propose a node attribute-enhanced community detection approach, named kNN-enhance, based on the kNN (e.g., k ≤ 10) graph of node attributes. We instantiated kNN-enhance into two algorithms, kNN-nearest and kNN-Kmeans, to test the efficiency and effectiveness of the approach. In the first stage, we constructed a kNN graph enhanced network by adding the kNN graph of node attributes to the original network. We then selected the number of communities and the community centers on the enhanced network using the idea behind K-rank-D33, an extended version of the data clustering method proposed by Rodriguez and Laio34. In the second stage, we used kNN-nearest or kNN-Kmeans to cluster nodes into groups: kNN-nearest assigns each remaining node to the cluster of its nearest neighbor with higher centrality, while kNN-Kmeans clusters nodes iteratively by the K-means method. Our experimental results suggest that kNN-enhance improves upon existing algorithms through its ability to process networks with binary, categorical, or numerical attributes. Moreover, the approach can handle large-scale attributed networks by combining fast approximate kNN graph algorithms35,36,37 with fast community detection algorithms such as BGLL38 and Infomap39.

Results

A Description and Illustration of kNN-enhance

Networks in real applications are often sparse and contain noise in the form of spurious edges. This sparseness and noise blur the community structure of a network. Yet nodes in the same community are likely to be connected to each other and to share similar interests, even if some of them are 'silent'. We can therefore build a kNN graph from the node attributes and combine it with the original network to compensate for sparsity, thereby strengthening the community structure of the network. Figure 1 illustrates kNN-enhance. Figure 1a shows an attributed network in which each node has four attributes: degree, research area, affiliation, and location. This original network is sparse, and its community structure is not clear. If, for each node, we add links to its nearest neighbors with common attributes (Fig. 1b), the resulting attribute-enhanced network shows a distinctive community structure. A community detection algorithm such as K-rank-D can then be used to discover the community structure in the newly generated, attribute-enhanced network.

Figure 1

An illustrated example of kNN-enhance.

Figure 2 illustrates the effectiveness of kNN-enhance through its partition process. Figure 2a shows the decision graph, produced by K-rank-D, of an original LFR network40 with μ = 0.9 and n = 1000. The network contained 38 communities. One hundred binary attributes with the same cluster structure as the network were attached to each node at a noise ratio of 20%. In the original LFR network, the community structure was unclear and the 38 community centers were not sufficiently separated in the upper right corner of the decision graph. As a result, it was difficult to determine the number of communities and the community centers, as well as to detect the community structure in the network. When the kNN graph was added to the original network and the decision graph of the kNN-graph enhanced network was created with k = 10 using K-rank-D (Fig. 2b), the community structure became clearer and the 38 community centers were well separated in the upper right part of the decision graph, making the number of communities much easier to determine. The red nodes in Fig. 2b are the top 38 nodes with the highest comprehensive value (computed by Equation (4) in the Methods section), while the nodes in the square were selected by manually drawing a rectangle in the upper right section of the graph. Using the manually selected nodes as initial centers, all nodes were correctly partitioned with respect to the ground truth; using the top 38 nodes (red nodes) as initial centers, the accuracy (computed by Equation (5) in the Methods section) was only 95%. Because it is sometimes difficult to select exactly K community centers manually from a decision graph (see Fig. 2a for an example), we automatically selected the top K nodes with the highest comprehensive value as the centers in the following experiments.

Figure 2

The decision graph of an original LFR network and that of its kNN enhanced network.

Experiment Results

We generated two groups of LFR40 benchmark networks, with binary and numerical node attributes, respectively. We tested existing state-of-the-art algorithms, including probabilistic models (PCL-DC, PPL-DC, PPSB-DC, CESNA, cohsMix, BAGC, and GBAGC) and hybrid methods (SA-cluster, Inc-cluster, CODICIL, and GLFM), on these synthetic benchmarks. We then evaluated these algorithms on several commonly used real networks, some with and some without associated ground truth. We compared the two instantiations of our kNN-enhance approach, kNN-nearest and kNN-Kmeans, to these existing algorithms. In addition, we compared kNN-nearest and kNN-Kmeans with K-rank-D using only link information, K-means using only node attributes, and cluster-dp34 using both node attribute and link information, to show whether the proposed approach outperforms existing similar methods as well as methods using either links or attributes alone.

Experimental Results on Synthetic Networks

Largeron et al.41 provided a generator for networks with community structure and numerical node attributes. However, it cannot generate networks with binary attributes. Therefore, we generated our own series of networks based on the commonly used LFR benchmark40.

LFR benchmark networks were introduced by Lancichinetti et al.40. They mimic real networks by reproducing two of their characteristic properties: heterogeneity in the distributions of node degree and community size. The LFR benchmark uses several parameters to generate a network: n (the number of vertices), μ (the mixing parameter), 〈k〉 (the average degree of vertices), \(k_{max}\) (the maximum degree of vertices), \(C_{min}\) (the minimum community size), \(C_{max}\) (the maximum community size), and γ and β (the exponents of the power-law distributions of node degree and community size). The mixing parameter μ controls the clearness of the community structure: each node shares a fraction 1 − μ of its links with other nodes in its community and a fraction μ with nodes outside it. Thus, the smaller μ is, the clearer the community structure of an LFR network. When μ ≤ 0.6, all algorithms are able to classify nearly all vertices into the correct communities; therefore, we added node attributes only to LFR networks with μ = 0.7, 0.8, or 0.9. Following previous studies33, 40, we generated a group of LFR benchmarks with 1000 nodes, \(\langle k\rangle =20\), \(k_{max}=50\), \(C_{min}=10\), \(C_{max}=50\), γ = 2, and β = 1.

We generated two types of node attributes, binary and numerical, for the LFR benchmarks. For simplicity, we did not generate categorical attributes, since these can be formulated as binary attributes. We first attached D-dimensional binary attributes to each node, giving nodes in the same community the same d (d < D) attributes. In this group of experiments, we set D = 100 and d = 10 to test high dimensional attributes. To blur the attribute cluster structure, we added 10% to 50% noise by randomly flipping the corresponding portion of binary attributes; as the noise ratio increased, the cluster structure became less clear. We then used the Gaussian cluster generator (http://personalpages.manchester.ac.uk/mbs/Julia.Handl) to generate D dimensions of numerical attributes following multivariate normal distributions, such that the cluster structure of the attributes matched the community structure of the corresponding network. For a single multivariate cluster, the mean was uniformly distributed in the range [−10, 10], the off-diagonal entries of the covariance matrix were random numbers in the range [−1, 1], and the diagonal entries were the sum of all off-diagonal entries plus a random number in the range \([0,\,20\cdot \sqrt{D}]\). We set D = 10, 5, 3, and 2 in these groups of experiments; higher dimensionality led to clearer attribute clusters.
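To make the binary attribute generation concrete, the following minimal Python sketch (our own illustration; the function name and defaults are ours, not the original generation code) attaches community-aligned binary attributes and then flips a given fraction of entries as noise:

```python
import numpy as np

def binary_attributes(labels, D=100, d=10, noise=0.2, seed=0):
    """Attach D-dimensional binary attributes: nodes in the same community
    share the same d 'on' attributes; then flip a `noise` fraction of all
    entries at random to blur the cluster structure."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    X = np.zeros((len(labels), D), dtype=int)
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        on = rng.choice(D, size=d, replace=False)  # community-specific attributes
        X[np.ix_(members, on)] = 1
    flip = rng.random(X.shape) < noise             # noise: random 0/1 flips
    X[flip] = 1 - X[flip]
    return X
```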

We first compared six probabilistic generative models (PCL-DC, PPL-DC, PPSB-DC, BAGC, GBAGC, and CESNA) and seven hybrid methods (CODICIL, SA-cluster, Inc-cluster, GLFM, cluster-dp, kNN-nearest, and kNN-Kmeans) on the sample sets with binary attributes, where the noise ratio ranged over {10%, 20%, …, 50%} at μ = 0.7, 0.8, or 0.9, respectively. We also compared all algorithms to K-rank-D using only links and K-means using only attributes. We reported the average results and standard deviations over 10 sample sets for each setting in Tables 1, 2 and 3, where columns indicate the noise ratio of the LFR benchmarks, the three numbers in each cell represent the average values and standard deviations of the three accuracy metrics (ACC, NMI, and PWF, defined by Equations (5–7) in the Methods section) for the corresponding algorithm, and the best performing algorithm is marked in bold. Details of the parameter settings of the compared algorithms can be found in the Methods section. We did not include results for PICS and BNPA in the tables because they did not converge to the real number of communities, which distorted the meaning of the accuracy metrics (ACC, NMI, and PWF).

Table 1 Results on LFR networks with binary attributes, μ = 0.7.
Table 2 Results on LFR networks with binary attributes, μ = 0.8.
Table 3 Results on LFR networks with binary attributes, μ = 0.9.

Since only kNN-nearest, kNN-Kmeans, cohsMix, and cluster-dp can be used to cope with networks having numerical node attributes, we then compared these four algorithms on LFR benchmarks with numerical attributes at D = 10, 5, 3, or 2 and μ = 0.7, 0.8, or 0.9. We also compared these four algorithms with K-means using only numerical attributes (results of K-rank-D using only links can be seen in Tables 1, 2 and 3). The experimental results are shown in Tables 4, 5 and 6, where columns represent the dimension of the numerical attribute space (D = 10, 5, 3, or 2), the three numbers in each cell represent the average values and standard deviations of the three accuracy metrics (ACC, NMI, and PWF) of the corresponding algorithm over 10 samples, and '−' indicates that cohsMix was trapped in a saddle point. The best algorithm in each column is marked in bold.

Table 4 Results on LFR networks with numerical attributes, μ = 0.7.
Table 5 Results on LFR networks with numerical attributes, μ = 0.8.
Table 6 Results on LFR networks with numerical attributes, μ = 0.9.

We also tested the algorithms on LFR networks with 5000 nodes, \(\langle k\rangle =20\), \(k_{max}=50\), \(C_{min}=20\), \(C_{max}=100\), γ = 2, and β = 1. Because there were many testing samples and the results were similar to those of the first group of networks with 1000 nodes, we do not report them here. Instead, to give a glimpse of the time complexity of the compared algorithms, Table 7 reports the time cost of each algorithm on a randomly generated sample containing 40% noise for binary attributes with {n = 1000, \(C_{min}=10\), \(C_{max}=50\)}, {n = 5000, \(C_{min}=20\), \(C_{max}=100\)}, and {n = 10000, \(C_{min}=20\), \(C_{max}=200\)}, respectively, at \(\langle k\rangle =20\), \(k_{max}=50\), γ = 2, β = 1, and μ = 0.8. All algorithms were run only once; each number represents the running time of the corresponding algorithm in seconds, '—' indicates that the time cost exceeded 48 hours, and '*' indicates that the algorithm ran out of memory. These experiments were performed on a laptop with an Intel 2.50 GHz processor and 4 GB of main memory running Windows 7. CESNA was implemented in C++, CODICIL in Python and C/C++, cohsMix in R, and the remaining algorithms in MATLAB.

Table 7 Time cost of the compared algorithms (in seconds).

From the data in Tables 1–7, we conclude that adding node attributes improves the performance of community detection in most cases. Taking the results of kNN-Kmeans as an example, most of its results were better than those of the basic K-rank-D algorithm on links and of K-means on attributes. As Tables 1–3 show, in most cases kNN-nearest and kNN-Kmeans performed best among the 13 tested algorithms, including both probabilistic generative models and hybrid methods, and they outperformed the other hybrid methods in all cases. Although kNN-nearest performed slightly worse than kNN-Kmeans, it was more efficient (see Table 7), since each node simply receives its community label from the nearest node with higher centrality. In our experiments, kNN-Kmeans converged quickly because the community centers were carefully selected. In some cases, the probabilistic generative model PCL-DC displayed the best performance, but it ran too slowly to be used for processing large networks in real applications (see Table 7). CESNA and CODICIL also performed well in this group of experiments. Among the probabilistic methods, CESNA was the fastest; however, it was still much slower than the majority of the hybrid heuristic methods. CODICIL ran quickly because it uses the fast graph partition program Metis to cut networks into communities. GBAGC performed well because it used Metis on links to obtain the initial partition. Moreover, as Tables 4–6 show, both kNN-nearest and kNN-Kmeans discovered communities effectively in LFR networks with numerical attributes. In summary, this empirical study on LFR benchmarks demonstrates the flexibility, effectiveness, and efficiency of the kNN-enhance approach.

Experimental Results on Real Networks

In addition to our experiments using synthetic networks, we tested the algorithms on two groups of real networks. The nodes in the first group were associated with binary/categorical attributes, while those in the second group possessed numerical attributes. The first group of data sets included Cora42, Citeseer42, and DBLP10K18. Sinanet (https://github.com/smileyan448/Sinanet) and PubMed (http://linqs.umiacs.umd.edu/projects//projects/lbc/) belonged to the second group. Detailed information on these data sets is described below.

The Cora data set consisted of machine learning papers classified into one of the following seven classes: CBR (case based reasoning), GA (genetic algorithms), NN (neural networks), PM (probabilistic methods), RL (reinforcement learning), RLT (rule learning), and Theory. The papers were selected in such a way that, in the final corpus, every paper cited or was cited by at least one other paper. With each node representing a paper, there were 2,708 nodes and 5,429 citations. After stemming and removing stop-words and words with document frequency less than 10, the corpus had a vocabulary of 1,433 unique words. Each paper was described by a 1,433-dimensional 0/1 vector indicating the absence/presence of the corresponding words from this vocabulary.

The Citeseer data set was also a citation network in the field of machine learning. These papers were classified into one of the following six classes: Agents, AI (artificial intelligence), DB (database), IR (information retrieval), ML (machine learning), and HCI (human-computer interaction). The papers were selected in the same way as the Cora dataset. There were 3,312 papers in the corpus and 4,732 citations between papers. A paper was described by a 0/1 word vector indicating the absence/presence of the corresponding words from the dictionary of the 3,703 unique words.

The DBLP data set was a coauthor network extracted from the DBLP Bibliography data. This network contained 10,000 authors and their coauthor relationships. The authors were distributed across four research fields: databases, data mining, information retrieval, and artificial intelligence. Each author was associated with two attributes: prolific and primary topic. The attribute prolific had three possible values: authors with ≥20 publications were labeled as highly prolific, authors with ≥10 and <20 publications as prolific, and authors with <10 publications as low prolific. The attribute primary topic had 99 values: each author was assigned one of 99 topics extracted by a topic model from the titles of the author's papers. For this data set, we did not know the exact number of communities or to which community a node belonged.

The Sinanet data set was a microblog user relationship network that we extracted from the Sina microblog website (http://www.weibo.com). We first selected 100 VIP Sina microblog users distributed across 10 major forums: finance and economics, literature and arts, fashion and vogue, current events and politics, sports, science and technology, entertainment, parenting and education, public welfare, and normal life. Starting from these 100 VIP users, we extracted their followees and published micro-blogs. Using a depth-first search strategy, we extracted three layers of user relationships and obtained 8,452 users, 147,653 user relationships, and 5.5 million micro-blogs in total. We merged all micro-blogs published by a user to characterize that user's interests43. After removing silent users (those who posted fewer than 5,000 words), we were left with 3,490 users and 30,282 relationships. Had we used the word frequencies of a user's merged blogs to describe the user's interests, the dimension of the feature space would have been too high to process. Instead, we used each user's topic distribution over the 10 forums, obtained by the LDA topic model (http://gibbslda.sourceforge.net/), to describe the user's interests. Thus, besides the followee relationships between pairs of users, we have 10-dimensional numerical attributes describing the interests of each user. This data set is available at https://github.com/smileyan448/Sinanet.

The Diabetes data set consisted of 19,717 scientific publications on diabetes from the PubMed database, each classified into one of three classes: Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. These publications formed a citation network with 44,338 edges representing citation relationships between pairs of publications. Each publication was described by a TF/IDF weighted word vector over a dictionary of 500 unique words.

The experimental results of the methods on Cora and Citeseer are shown in Table 8, where columns represent the data sets used in the evaluation, the cells of each row give the ACC, NMI, and PWF values of the corresponding algorithm, and the best performing algorithm in each of the two groups (probabilistic methods and hybrid methods) is marked in bold. Because the DBLP network had two categorical attributes, we tested SA-cluster on the two categorical attributes, with 3 and 99 values respectively, and named the result SA-cluster-cate. We also tested SA-cluster on DBLP when viewing these 102 values as binary attributes and named the result SA-cluster-bina (the same was done for Inc-cluster). Since no ground truth was available for DBLP, we report the Modularity and Entropy (defined by Equations (8–9) in the Methods section) of all algorithms in Figs 3 and 4 at different K (the number of communities). In Figs 3 and 4, we do not show the results of PCL-DC, PPL-DC, and PPSB-DC because either the time or space complexity of these algorithms was too high for networks as large as DBLP10K. The results of the methods on the Sinanet and PubMed networks with numerical attributes are shown in Table 9, where '*' indicates that the algorithm ran out of memory on the corresponding data set. For the probabilistic methods PCL-DC, PPL-DC, and PPSB-DC on Cora and Citeseer, and cohsMix on Sinanet, we ran the algorithms 10 times and reported the result with the largest likelihood. For K-means on attributes alone, we reported the best results over 10 runs. Details of the parameter settings of the compared algorithms in this group of experiments can be found in the Methods section.

Table 8 Performance of compared algorithms on Cora and Citeseer.
Figure 3

Modularity of the compared algorithms on DBLP10k.

Figure 4

Entropy of the compared algorithms on DBLP10k.

Table 9 Performance of compared algorithms on Sinanet and PubMed Diabetes.

We drew the following conclusions from Tables 8 and 9 and Figs 3 and 4: (1) According to Table 8, the probabilistic methods PCL-DC, PPL-DC, and PPSB-DC showed the best performance on the Cora and Citeseer data sets. However, their time cost was too high for real applications. In contrast, the kNN-enhance approach achieved high accuracy compared to the other hybrid methods and was much faster than PCL-DC, PPL-DC, and PPSB-DC (see Table 7). (2) As Figs 3 and 4 show, the Entropy of kNN-Kmeans on DBLP was the lowest, especially when the number of communities K was larger than 200, and the Modularity of kNN-enhance indicates that the partitioned network maintained community structure. Therefore, kNN-enhance was able to identify a clear community structure (large Modularity) with a high level of attribute homogeneity (low Entropy) in the network. (3) The kNN-enhance approach was capable of processing networks with numerical attributes (see Table 9). Even though the accuracy of cohsMix was higher than that of kNN-nearest and kNN-Kmeans on Sinanet, cohsMix ran much more slowly than kNN-enhance, and its reported result was the best of 10 runs on the Sinanet data. Moreover, cohsMix could not deal with a network as large as the Diabetes data set due to its high memory usage for storing the similarity matrix and all hidden variables.

Discussion

We have proposed a simple and flexible node attribute-enhanced community detection approach, kNN-enhance. The approach first constructs the k nearest neighbor graph of node attributes and then merges this kNN graph with the original network. In this way, it alleviates the sparsity of the original network, reduces noise effects, and strengthens the network's community structure, so that a clear community structure can be uncovered in the kNN graph enhanced network by a community detection algorithm such as K-rank-D. Our two implementations, kNN-nearest and kNN-Kmeans, achieved better performance than existing state-of-the-art algorithms. Furthermore, they can deal with networks containing binary, categorical, or numerical attributes and can easily be extended to process large-scale networks.

In the future, we intend to test this approach on large-scale networks with millions of edges by combining fast approximate kNN graph construction algorithms (such as NN-Descent36, with \(O(n^{1.14})\) empirical cost) with fast community detection algorithms such as BGLL38 and Infomap39. Moreover, besides strengthening the community structure of a network using node attributes, we plan to design a more effective method that also removes easily detected weakly-linked edges from the network. In this study, we were concerned with detecting community structures in which nodes have more links to each other than to nodes outside their communities. However, it has been observed that trees and tree-like networks have high modularity44, 45 (the classical objective function used to discover communities and to measure their strength46) and that many real-world networks have tree-like structures47,48,49. Existing methods use connections only to decompose a network into tree-like components. Combining node attributes with topology to cluster the nodes of a tree-like network into groups is a challenging task, and we will investigate whether our kNN-enhance approach is capable of partitioning attributed tree-like networks.

Methods

Community Detection in Attributed Networks

Suppose that G = (V, E, X) is a network with node attributes, where V is a set of nodes (\(\Vert V\Vert =n\)); E is an edge set indicating relationships between pairs of nodes (\(\Vert E\Vert =m\)), usually represented by an adjacency matrix \(A=[A_{ij}]\) (\(A_{ij}=1\) if there is an edge between nodes i and j, \(A_{ij}=0\) otherwise); and \(X=\{x_{1},x_{2},\cdots ,x_{n}\}\) \(({x}_{i}=({x}_{i1},{x}_{i2},\cdots ,{x}_{iD}),i=1,2,\cdots ,n)\) is a set of vectors, each of which holds the values of the D attributes associated with node i. We call this an 'attributed network' or 'attributed graph'. Community detection in an attributed network involves partitioning the nodes into clusters such that nodes in the same cluster are not only densely connected to each other but also exhibit a high level of attribute homogeneity.
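As a minimal illustration of this representation (toy values only, assuming numerical attributes), an attributed network can be stored as an adjacency matrix together with an n × D attribute matrix:

```python
import numpy as np

# Toy attributed network G = (V, E, X) with n = 5 nodes and D = 3 attributes.
n, D = 5, 3
edges = [(0, 1), (0, 2), (1, 2), (3, 4)]   # E: undirected edge list
A = np.zeros((n, n), dtype=int)            # adjacency matrix A = [A_ij]
for i, j in edges:
    A[i, j] = A[j, i] = 1                  # A_ij = 1 iff there is an edge {i, j}
X = np.random.rand(n, D)                   # row x_i: the D attribute values of node i
```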

An Active Method for Community Detection in Networks

cluster-dp is a recently-developed clustering algorithm similar to the K-means method34. The algorithm assumes that cluster centers are surrounded by neighbors with lower local density and that they lie at a relatively large distance from any data point with higher local density. Therefore, for each data point i, two quantities, the local density \(\rho_{i}\) and the distance \(\delta_{i}\) from points of higher density, are defined as follows to quantify the likelihood of a data point being a cluster center:

$${\rho }_{i}=\sum _{j}\chi ({d}_{ij}-{d}_{c}),\qquad {\delta }_{i}=\mathop{\min }\limits_{j:{\rho }_{j} > {\rho }_{i}}({d}_{ij})$$
(1)

where \(d_{ij}\) is the distance between data points i and j, χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, \(d_{c}\) is the cutoff distance, and \({\delta }_{i}=\mathop{\max }\limits_{j}({d}_{ij})\) for the point with the highest density.

If we scatter all data points on a decision graph drawn by their values of \(\rho_{i}\) and \(\delta_{i}\) for all \(i\in \{1,2,\cdots ,n\}\), the cluster centers tend to occupy the upper right part of the graph. After cluster centers with both relatively large \(\rho_{i}\) and \(\delta_{i}\) are manually selected from the decision graph, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. In this way, cluster-dp uncovers the cluster structure of the data points once the number of clusters and the cluster centers are actively determined.
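A compact sketch of these two steps, assuming a precomputed distance matrix and manually chosen centers (our own illustrative code, not the reference implementation):

```python
import numpy as np

def cluster_dp(dist, d_c, centers):
    """Compute rho and delta from Eq. (1), then assign each remaining point
    to the cluster of its nearest neighbor of higher density.
    Assumes the densest point is among the chosen `centers`."""
    n = dist.shape[0]
    rho = (dist < d_c).sum(axis=1) - 1          # chi(d_ij - d_c), excluding the point itself
    order = np.argsort(-rho)                    # points by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.zeros(n, dtype=int)
    delta[order[0]] = dist[order[0]].max()      # convention for the densest point
    nearest_higher[order[0]] = order[0]
    for pos in range(1, n):
        i = order[pos]
        higher = order[:pos]                    # all points with larger rho
        j = higher[np.argmin(dist[i, higher])]
        delta[i], nearest_higher[i] = dist[i, j], j
    labels = np.full(n, -1)
    labels[np.asarray(centers)] = np.arange(len(centers))
    for i in order:                             # parents are labeled before children
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return rho, delta, labels
```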

However, the following issues exist: (1) When the cluster structure is not clear (i.e., there is no distinguishable boundary between cluster centers and other data points on the decision graph), it is difficult to obtain the correct number of clusters and cluster centers, which leads to poor partitioning. (2) The parameter \(d_{c}\) must be tuned in many cases, and it is usually difficult to know which value is best. (3) The input for cluster-dp is a distance matrix, whose quality strongly affects the clustering result. When the algorithm is used to discover community structure in a network, the topological structure implied in the network is not fully utilized.

In a network, we suppose that community centers are: (1) influential and surrounded by less influential nodes, and (2) located far from each other in the network. Therefore, we proposed K-rank-D33, which uses two quantities, \({v}_{i}\in v=\{{v}_{1},{v}_{2},\cdots ,{v}_{n}\}\) and \({\bar{\delta }}_{i}\), to describe the centrality and the dispersion of each node i, respectively. The centrality vector v can be calculated efficiently using PageRank50 centrality as follows:

$${v}^{t+1}=\left((1-\beta )P+e\frac{\beta }{n}\right){v}^{t},\qquad {P}_{ij}=\frac{{A}_{ij}}{{\sum }_{k}{A}_{ik}},\quad i,j\in \{1,2,\cdots ,n\}$$
(2)

where β is the restart probability (fixed at 0.15), e is the unit matrix, \(v^{0}\) is an n-dimensional unit vector, and \(v^{t}\) is normalized to 1 in each iteration. The dispersion of a node i relative to nodes of higher centrality is defined by \({\bar{\delta }}_{i}=\mathop{\min }\limits_{j:{v}_{j} > {v}_{i}}({d}_{ij})\), with \({\bar{\delta }}_{k}=\mathop{\max }\limits_{i\ne k}({\bar{\delta }}_{i})\) for the node k with the highest centrality. Here \(d_{ij}\) is the structural distance between nodes i and j. It can be computed using the Euclidean distance \({\Vert \cdot \Vert }_{2}\) after τ-step signal propagation51 by the following equations:

$$S={(A+I)}^{\tau },\qquad {\bar{S}}_{ij}={S}_{ij}\Big/\sqrt{\sum _{k}{S}_{ik}^{2}},\qquad {d}_{ij}={\Vert {\bar{S}}_{i}-{\bar{S}}_{j}\Vert }_{2}.$$
(3)

where τ = 3 in our implementation and \({\bar{S}}_{i}\) is the i-th row of \(\bar{S}\). When the community structure of a network is fuzzy, we define the comprehensive value of each node i as follows:

$$CV(i)=\frac{{v}_{i}\cdot {\bar{\delta }}_{i}}{\mathop{\max }\limits_{j=1}^{n}({v}_{j})\cdot \mathop{\max }\limits_{j=1}^{n}({\bar{\delta }}_{j})}.$$
(4)

The top K nodes with the highest comprehensive value can then be automatically selected as the initial centers of K-rank-D.
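Putting Equations (2)–(4) together, a minimal sketch of the center-selection step might look as follows (illustrative code under the stated settings β = 0.15 and τ = 3; the power-iteration count is our own choice, and every node is assumed to have at least one link):

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_rank_d_centers(A, K, beta=0.15, tau=3, iters=100):
    """PageRank centrality v (Eq. 2), signal-propagation distances d_ij
    (Eq. 3), dispersion delta_bar, comprehensive value CV (Eq. 4), and
    the top-K nodes as initial centers."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)        # P_ij = A_ij / sum_k A_ik
    v = np.ones(n) / n
    for _ in range(iters):                      # power iteration, as in Eq. (2)
        v = (1 - beta) * (P @ v) + beta / n
        v /= v.sum()                            # normalize v to 1 each iteration
    S = np.linalg.matrix_power(A + np.eye(n), tau)           # tau-step propagation
    S_bar = S / np.sqrt((S ** 2).sum(axis=1, keepdims=True))
    d = cdist(S_bar, S_bar)                     # d_ij = ||S_bar_i - S_bar_j||_2
    delta_bar = np.empty(n)
    for i in range(n):
        higher = np.flatnonzero(v > v[i])       # nodes with higher centrality
        delta_bar[i] = d[i, higher].min() if higher.size else d[i].max()
    cv = v * delta_bar / (v.max() * delta_bar.max())         # Eq. (4)
    centers = np.argsort(-cv)[:K]               # top-K comprehensive values
    return centers, v, d, S_bar
```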

kNN-enhance: a Node Attribute-enhanced Community Detection Approach

Given an attributed network G = (V, E, X), we first construct the kNN graph of node attributes. The kNN graph for a set of nodes V is a directed graph with vertex set V and an edge from each v ∈ V to its k most similar objects in V under a given similarity measure on attributes. For any \(x_{i},x_{j}\in X\), the cosine similarity \({x}_{i}\cdot {x}_{j}^{T}\) is used to compute the similarity of a pair of nodes with binary attributes, and \(1-norm({\Vert {x}_{i}-{x}_{j}\Vert }_{2})\) is used for a pair of nodes with numerical attributes, where \(norm({\Vert {x}_{i}-{x}_{j}\Vert }_{2})\) is the normalized Euclidean distance between \(x_{i}\) and \(x_{j}\). We then add the kNN graph of attributes to the original network: if an edge of the kNN graph does not exist in the original network, we add it; otherwise, we keep the original edge unchanged.
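A minimal sketch of this construction (our own illustration; in practice a fast approximate kNN method would replace the brute-force neighbor search):

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_enhance(A, X, k=10, binary=True):
    """Build the kNN graph of node attributes under the similarities above
    and merge it with the original adjacency matrix A: a kNN edge is added
    only if it is not already present in the original network."""
    if binary:
        sim = (X @ X.T).astype(float)           # x_i . x_j^T for binary attributes
    else:
        d = cdist(X, X)                         # Euclidean distances
        sim = 1 - d / d.max()                   # 1 - norm(||x_i - x_j||_2)
    np.fill_diagonal(sim, -np.inf)              # a node is not its own neighbor
    E = A.copy()
    for i in range(A.shape[0]):
        for j in np.argsort(-sim[i])[:k]:       # the k most similar nodes to i
            E[i, j] = E[j, i] = 1               # add the edge; existing edges unchanged
    return E
```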

After the kNN-enhanced network is established, we use the K-rank-D method introduced above to perform node clustering. Building on K-rank-D, we employ two node assignment strategies after selecting the K community centers from the decision graph. kNN-nearest uses the cluster-dp strategy34, assigning each remaining node to the same cluster as its nearest neighbor of higher PageRank centrality computed by Equation (2). kNN-Kmeans uses the strategy of the K-means method, taking the data matrix \(\bar{S}=[{\bar{S}}_{ij}]\) 51 as input and iteratively updating the community centers. It should be pointed out that kNN-nearest and kNN-Kmeans are just two implementations of the kNN-enhance approach. Fast approximate kNN graph construction methods35,36,37 and highly efficient community detection algorithms38, 39 can be combined with it to process large-scale networks.
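The two assignment strategies can be sketched as follows, reusing the quantities v, d, S_bar, and centers from the center-selection sketch above (illustrative code; kNN-Kmeans is shown with scikit-learn's KMeans seeded at the selected centers):

```python
import numpy as np
from sklearn.cluster import KMeans

def knn_nearest(v, d, centers):
    """kNN-nearest: each remaining node takes the label of its nearest
    neighbor of higher PageRank centrality; labels propagate down the
    centrality ranking (the top-centrality node is assumed to be a center)."""
    labels = np.full(len(v), -1)
    labels[np.asarray(centers)] = np.arange(len(centers))
    for i in np.argsort(-v):                    # decreasing centrality
        if labels[i] == -1:
            higher = np.flatnonzero(v > v[i])   # already-labeled candidates
            labels[i] = labels[higher[np.argmin(d[i, higher])]]
    return labels

def knn_kmeans(S_bar, centers):
    """kNN-Kmeans: K-means on the signal matrix S_bar, initialized with
    the rows of the selected community centers."""
    km = KMeans(n_clusters=len(centers), init=S_bar[np.asarray(centers)], n_init=1)
    return km.fit_predict(S_bar)
```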

Metrics for Evaluating Algorithm Quality

In this study, we use two groups of metrics to evaluate the performance of each algorithm. The first group includes ACC (Accuracy), NMI (Normalized Mutual Information), and PWF (Pairwise F-Measure)9, 11. These are commonly used to evaluate an algorithm on a data set with ground truth; larger values indicate better performance. The other group consists of Modularity 38, 46 and Entropy 13, 18. Modularity measures the quality of communities in a network, with larger values indicating better partition quality. Entropy measures the degree of attribute consistency within communities, with lower values indicating greater consistency. These metrics are typically used when an algorithm is run on a network without ground truth. Formal definitions are provided below:

ACC. Given a node i, let \(l_{pi}\) be the label assigned by an algorithm and \(l_{ti}\) its true label. The accuracy is defined by

$$ACC=\sum _{i=1}^{n}\delta ({l}_{ti},{p}_{map}({l}_{pi}))/n$$
(5)

where δ(·) is the Kronecker delta function, \({p}_{map}({l}_{pi})\) is a permutation mapping function that maps the label \(l_{pi}\) to its corresponding label \(l_{ti}\) in the ground truth, and n is the total number of nodes in the network.
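The permutation mapping is commonly implemented with the Hungarian algorithm on the confusion matrix; a minimal sketch (our own illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def acc(l_true, l_pred):
    """ACC (Eq. 5): build the confusion matrix, find the label permutation
    that maximizes the number of matched nodes, and report the fraction."""
    l_true, l_pred = np.asarray(l_true), np.asarray(l_pred)
    K = max(l_true.max(), l_pred.max()) + 1
    cost = np.zeros((K, K), dtype=int)
    for t, p in zip(l_true, l_pred):
        cost[t, p] += 1                          # confusion counts
    rows, cols = linear_sum_assignment(-cost)    # Hungarian algorithm (maximization)
    return cost[rows, cols].sum() / len(l_true)
```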

NMI

Suppose \(C=\{{C}_{1},{C}_{2},\cdots ,{C}_{K}\}\) is the set of K ground truth communities of a network and \(C^{\prime} =\{{C}_{1}^{^{\prime} },{C}_{2}^{^{\prime} },\cdots ,{C}_{K}^{^{\prime} }\}\) is the set of K communities obtained by a specific algorithm. NMI is defined by

$$NMI(C,C^{\prime} )=\frac{-2{\sum }_{i=1}^{K}{\sum }_{j=1}^{K}{n}_{ij}\,\mathrm{log}\frac{n\cdot {n}_{ij}}{{n}_{i}^{C}\cdot {n}_{j}^{{C}^{^{\prime} }}}}{{\sum }_{i=1}^{K}{n}_{i}^{C}\,\mathrm{log}\frac{{n}_{i}^{C}}{n}+{\sum }_{j=1}^{K}{n}_{j}^{{C}^{^{\prime} }}\,\mathrm{log}\frac{{n}_{j}^{{C}^{^{\prime} }}}{n}}$$
(6)

where \(n_{ij}\) is the number of nodes in the ground truth community \(C_{i}\) that are assigned to the computed community \({C}_{j}^{^{\prime} }\), \({n}_{i}^{C}\) is the number of nodes in the ground truth community \(C_{i}\), and \({n}_{j}^{{C}^{^{\prime} }}\) is the number of nodes in the computed community \({C}_{j}^{^{\prime} }\).

PWF

Let T denote the set of nodes in the ground truth communities and W denote the set of nodes assigned by a given algorithm in the corresponding communities. PWF is defined as follows:

$$PWF=\frac{2\times precision\times recall}{precision+recall}$$
(7)

where \(precision=\Vert W\cap T\Vert /\Vert W\Vert \), \(recall=\Vert W\cap T\Vert /\Vert T\Vert \), and \(\Vert \cdot \Vert \) denotes the cardinality of a set.
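Direct implementations of Equations (6) and (7); for PWF we read T and W as the sets of same-community node pairs, which is the usual pairwise interpretation (an assumption on our part):

```python
import numpy as np
from itertools import combinations

def nmi(l_true, l_pred):
    """NMI (Eq. 6), computed from the contingency table n_ij.
    Assumes all communities are non-empty."""
    l_true, l_pred = np.asarray(l_true), np.asarray(l_pred)
    n = len(l_true)
    nij = np.zeros((l_true.max() + 1, l_pred.max() + 1))
    for t, p in zip(l_true, l_pred):
        nij[t, p] += 1
    ni, nj = nij.sum(axis=1), nij.sum(axis=0)    # community sizes n_i^C, n_j^C'
    mask = nij > 0                               # 0 * log 0 terms contribute nothing
    num = (nij[mask] * np.log(n * nij[mask] / np.outer(ni, nj)[mask])).sum()
    den = (ni * np.log(ni / n)).sum() + (nj * np.log(nj / n)).sum()
    return -2 * num / den

def pwf(l_true, l_pred):
    """PWF (Eq. 7) over same-community node pairs."""
    pairs = lambda l: {(i, j) for i, j in combinations(range(len(l)), 2)
                       if l[i] == l[j]}
    T, W = pairs(list(l_true)), pairs(list(l_pred))
    precision, recall = len(W & T) / len(W), len(W & T) / len(T)
    return 2 * precision * recall / (precision + recall)
```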

Modularity

Given a network with n nodes and m edges, Modularity can be calculated as follows:

$$Modularity=\frac{1}{2m}\sum _{i,j}({A}_{ij}-\frac{{k}_{i}\cdot {k}_{j}}{2m})\delta ({c}_{i},{c}_{j})$$
(8)

where \(A=[A_{ij}]\) is the adjacency matrix of the network, \(k_{i}\) is the degree of node i, δ(·,·) is the Kronecker delta function, and \(c_{i}\) is the community to which node i belongs.

Entropy

Given a network with n nodes, suppose that each node is associated with D attributes \(({a}_{1},{a}_{2},\cdots ,{a}_{D})\) and that the nodes are partitioned into K communities. Let \(n_{c}\) be the number of nodes in the c-th community and \(p_{ic}\) be the fraction of nodes in the c-th community taking attribute \(a_{i}\). The total Entropy of attributes over the communities is then defined in the following way:

$$Entropy=-\sum _{c=1}^{K}\frac{{n}_{c}}{n}\sum _{i=1}^{D}{p}_{ic}\,\mathrm{log}({p}_{ic})$$
(9)

Entropy measures the homogeneity of attributes within communities: the lower the Entropy, the more attribute-homogeneous the communities.
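Minimal sketches of both metrics for an undirected network with binary attributes (our own illustrations; Eq. (9) is followed as stated, summing over the fractions \(p_{ic}\) of 'on' attributes):

```python
import numpy as np

def modularity(A, labels):
    """Modularity (Eq. 8) from the adjacency matrix and community labels."""
    labels = np.asarray(labels)
    k = A.sum(axis=1)                            # node degrees k_i
    two_m = k.sum()                              # 2m for an undirected network
    same = labels[:, None] == labels[None, :]    # Kronecker delta(c_i, c_j)
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

def entropy(X, labels):
    """Attribute Entropy (Eq. 9): p_ic is the fraction of nodes in
    community c taking (binary) attribute a_i; lower means more homogeneous."""
    labels = np.asarray(labels)
    n, total = len(labels), 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        p = members.mean(axis=0)
        p = p[p > 0]                             # 0 * log 0 contributes nothing
        total -= (len(members) / n) * (p * np.log(p)).sum()
    return total
```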

Parameter Settings on Synthetic and Real-world Networks

As mentioned above, the PCL-DC, PPL-DC, and PPSB-DC methods are sensitive to initial values. For the experiments on synthetic attributed networks, we ran these algorithms 10 times on each sample set, selected the best result determined by maximum likelihood, and then reported the average results and standard deviations over the 10 samples. We set the maximum iteration number and the convergence threshold of PCL-DC, PPL-DC, and PPSB-DC to 2000 and \(10^{-8}\), respectively. We set the regularization coefficient λ = 1 for PCL-DC and CESNA and λ = 0.1 for PPL-DC and PPSB-DC, since these settings gave the best performance. Similarly, for K-means on attributes, which is sensitive to initial values, we ran it 10 times on each sample set, selected the best result with the highest accuracy, and then reported the average results and standard deviations over the 10 samples. For BAGC and GBAGC, the maximum iteration number was set to 10. For CODICIL, we tried \(\bar{K}=30\), 50, and 70 and selected the setting with the highest accuracy. For cluster-dp, we used the cosine similarity \({x}_{i}\cdot {x}_{j}^{T}\) to compute the similarity of node attributes and the signal similarity \(1-norm({\Vert {\bar{S}}_{i}-{\bar{S}}_{j}\Vert }_{2})\) to compute that of the link structure. We set the weight α between attribute similarity and link similarity to 0.5 for cluster-dp and CODICIL, since it was difficult to tune the weight adaptively for each sample. We set \(\bar{D}=50\) for GLFM. For kNN-nearest and kNN-Kmeans, we set k = 10: we only wanted to strengthen the community structure of the original network, for which a small k is sufficient. We used default values for all algorithm parameters not mentioned above. Similarly, for the probabilistic method cohsMix, we set the maximum iteration number to 200, chose the best result determined by maximum likelihood among 10 runs for each sample set, and then reported the average values and standard deviations over the 10 samples of each test setting.

In the group of experiments on real-world networks, we used the same parameter settings as in the original method publications in nearly all cases. We set λ = 5 for PCL-DC, PPL-DC, and PPSB-DC and λ = 1 for CESNA because these settings produced the best performance. We set the maximum iteration number to 10 for BAGC and GBAGC. We chose \(\bar{K}=50\) for CODICIL because it gave the best performance among the options \(\bar{K}\in \{30,50,70\}\). We set \(\bar{D}\) of GLFM to 20. The weight between link structure and node attributes was 0.5 for cluster-dp and CODICIL, and the maximum iteration number of cohsMix was 200. Because we used node attributes only to make up for the sparsity of the original network and to strengthen its community structure, the parameter k of the kNN attribute graph was again 10 for all real networks, with the exception of the Diabetes data set, for which we set k = 60 because it contained only 3 large communities with thousands of nodes and a larger k provided better performance.