Introduction

Complex networks provide a powerful tool for representing real-world complex systems1. Social networks, the World Wide Web, protein-protein interaction networks, academic citation and coauthor networks, and hyperlinked blogs are typical examples of such networks, where nodes denote objects and links denote pairwise relations between them. In recent years, much effort has been focused on identifying communities: groups of related nodes with dense internal connections and few external connections2,3,4,5. In addition to node connectivity information, most real-world networks have node-associated attributes. In this case, two types of information are available: graph data that represent the relationships between objects, and attribute data that characterize individual objects. Thus, nodes can be grouped either by data clustering methods using only their attributes6, or by community detection methods using only their link structure4, 7. However, clustering objects by attribute similarity ignores the relationships between them, while identifying communities using only links between pairs of nodes ignores the attributes of the nodes within communities. Therefore, various methods have been developed to uncover communities in networks by combining structural and attribute information, such that nodes in a community are not only connected more densely than nodes outside of it, but also share similar attributes.

Existing methods can be classified roughly into two categories. The first category is composed of probabilistic generative models that formulate joint models of link connections and node attributes, and that use the models to infer the posterior community memberships of nodes in a network8,9,10,11,12,13,14,15,16,17. The second category contains three types of hybrid methods. The first represents links as a type of node feature and uses node attributes and link connections together to perform vertex clustering18,19,20. The second makes use of node attributes to help identify communities in networks21. The third uses node attributes and link structure together to optimize a unified objective function22, 23.

Probabilistic generative models include CESNA15, PCL-DC9, PPL-DC10, PPSB-DC11, cohsMix12, BAGC13, GBAGC14, BNPA17, and Metacode16. CESNA employs the probabilistic generative process of BIGCLAM24 for generating links, together with a logistic model of attributes, to infer the distribution of community memberships. PCL-DC, PPL-DC, and PPSB-DC project the discriminative content (DC) model of attributes into a generative model of links (such as PCL9, PPL10, and PPSB11) via community memberships. cohsMix embeds numerical node attributes into the MixNet model25 for generating link classes. BAGC and GBAGC extend the cohsMix model to process categorical attributes and weighted networks. BNPA introduces node attributes and Bayesian priors to Newman's mixture model26 and integrates the Chinese Restaurant Process to infer the number of communities. Metacode represents node attributes as metadata describing properties of nodes and incorporates the metadata into the degree-corrected stochastic block model27 to infer the correlation between metadata and network structure. These models have good interpretability and provide powerful tools to discover overlapping communities or general structures. However, existing models deal with only one type of attribute (either binary, categorical, or numerical) and are sensitive to initial values.

SA-cluster18 and Inc-cluster19, 20 are typical examples of vertex clustering methods that use node attributes and link connections. SA-cluster views node attributes as virtual vertices, constructs an attribute-augmented graph, and performs a random walk on the attribute-augmented graph to obtain a unified distance. It then adopts the K-medoids algorithm to cluster the nodes based on the learned pairwise distances. Inc-cluster was introduced as a slightly faster version of SA-cluster. CODICIL21 constructs content edges by selecting the top \(\bar{K}\) neighbors of each vertex using their attributes, obtains the combined similarity of each pair of nodes, and then sparsifies the newly constructed graph with content edges28. Finally, a fast graph clustering algorithm (Metis29 or MLR-MCL30) is used to partition the sparsified graph into K communities. GLFM22 extends MLFM31 (the multiplicative latent factor model) to give a unified model of homophily in networks, such that an edge is more likely to exist between two nodes with similar attributes than between nodes with different attributes. A minorization-maximization algorithm is then used to optimize the latent eigenmodel of GLFM. PICS23 finds cohesive clusters of nodes that have similar connectivity patterns and exhibit high levels of attribute homogeneity by optimizing a unified objective function defined by minimum description length. Compared to probabilistic generative models, these hybrid methods are more efficient. Nonetheless, they were designed to process networks with binary or categorical attributes only.

Nearly all of the methods mentioned above assume that the cluster memberships implied by node attributes are consistent with the community memberships determined by the link structure of a network. However, this is not always true in real-world networks. Although nodes in the same community tend to have similar features according to the homophily hypothesis32, a community may contain nodes that share similar attributes but are not linked, owing to the sparseness of a real network. Therefore, for each node, we used only a small number of its nearest neighbors, measured by attribute similarity, to alleviate the sparsity of the network while strengthening its community structure. Consequently, in this study, we propose a node attribute-enhanced community detection approach, named kNN-enhance, based on the kNN (e.g., k ≤ 10) graph of node attributes. We instantiated kNN-enhance into two algorithms, kNN-nearest and kNN-Kmeans, to test the efficiency and effectiveness of the approach. In the first stage, we constructed a kNN graph enhanced network by adding the kNN graph of node attributes to the original network. We then selected the number of communities and the community centers on the enhanced network using the idea behind K-rank-D33, an extended version of the data clustering method proposed by Rodriguez and Laio34. In the second stage, we used kNN-nearest or kNN-Kmeans to cluster nodes into groups: kNN-nearest assigns each remaining node to the cluster of its nearest neighbor with higher centrality, while kNN-Kmeans clusters nodes iteratively by the K-means method. Our experimental results suggest that kNN-enhance improves upon existing algorithms through its ability to process networks with binary, categorical, or numerical attributes. Moreover, the approach can handle large-scale attributed networks by combining fast approximate kNN graph algorithms35,36,37 with fast community detection algorithms such as BGLL38 and Infomap39.

Results

A Description and Illustration of kNN-enhance

Networks in real applications are often sparse and contain noise in the form of spurious edges. This sparseness and noise blur the community structure of a network. Yet nodes in the same community are likely to be connected to each other and to share similar interests, even if some of them are 'silent'. We can therefore build a kNN graph from the node attributes and combine it with the original network to compensate for sparsity, thereby strengthening the community structure of the network. Figure 1 illustrates kNN-enhance. Figure 1a shows an attributed network in which each node has four attributes: degree, research area, affiliation, and location. This original network is sparse, and its community structure is not clear. If, for each node, we add links to its nearest neighbors with common attributes (Fig. 1b), the resulting attribute-enhanced network shows a distinctive community structure. A community detection algorithm such as K-rank-D can then be used to discover the community structure in the newly generated, attribute-enhanced network.

Figure 1

An illustrated example of kNN-enhance.

Figure 2 illustrates the effectiveness of kNN-enhance through its partition process. Figure 2a shows the decision graph, produced by K-rank-D, of an original LFR network40 with μ = 0.9 and n = 1000. The network contained 38 communities. One hundred binary attributes with the same cluster structure as the network were attached to each node at a noise ratio of 20%. In the original LFR network, the community structure was unclear and the 38 community centers were not sufficiently separated in the upper right corner of the decision graph. As a result, it was difficult to determine the number of communities and the community centers, as well as to detect the community structure in the network. When the kNN graph was added to the original network and the decision graph of the kNN-graph enhanced network was created with k = 10 using K-rank-D (Fig. 2b), the community structure became clearer and the 38 community centers were well separated in the upper right part of the decision graph, making the number of communities much easier to determine. The red nodes in Fig. 2b are the top 38 nodes with the highest comprehensive value (computed by Equation (4) in the Methods section), while the nodes in the square were selected by manually drawing a rectangle in the upper right section of the graph. Using the manually selected nodes as initial centers, all nodes were correctly partitioned with respect to the ground truth; using the top 38 nodes (red nodes) as initial centers, the accuracy (computed by Equation (5) in the Methods section) was only 95%. Because it is sometimes difficult to select exactly K community centers manually from a decision graph (see Fig. 2a for an example), we automatically selected the top K nodes with the highest comprehensive value as the centers in the following experiments.

Figure 2

The decision graph of an original LFR network and that of its kNN enhanced network.

Experiment Results

We generated two groups of LFR40 benchmark networks, with binary and numerical node attributes, respectively. We tested existing state-of-the-art algorithms, including probabilistic models (PCL-DC, PPL-DC, PPSB-DC, CESNA, cohsMix, BAGC, and GBAGC) and hybrid methods (SA-cluster, Inc-cluster, CODICIL, and GLFM), on these synthetic benchmarks. We then evaluated these algorithms on several commonly used real networks, some with and some without associated ground truth. We compared the two instantiations of our kNN-enhance approach, kNN-nearest and kNN-Kmeans, to these existing algorithms. In addition, we compared kNN-nearest and kNN-Kmeans with K-rank-D using only link information, K-means using only node attributes, and cluster-dp34 using both node attribute and link information, to show whether the proposed approach outperforms existing similar methods as well as methods using either links or attributes alone.

Experimental Results on Synthetic Networks

Largeron et al.41 provided a generator for networks with community structure and numerical node attributes. However, it cannot generate networks with binary attributes. Therefore, we generated our own series of networks based on the commonly used LFR benchmark40.

LFR benchmark networks were introduced by Lancichinetti et al.40. They mimic real networks by reproducing two of their characteristic properties: heterogeneity in the distributions of node degree and community size. The LFR benchmark uses several parameters to generate a network: n (the number of vertices), μ (the mixing parameter), 〈k〉 (the average degree of vertices), \(k_{max}\) (the maximum degree of vertices), \(C_{min}\) (the minimum community size), \(C_{max}\) (the maximum community size), and γ and β (the exponents of the power-law distributions of node degree and community size). The mixing parameter μ controls the clearness of the community structure: each node shares a fraction 1 − μ of its links with other nodes in its community and a fraction μ with nodes outside it. Thus, the smaller μ is, the clearer the community structure of an LFR network. When μ ≤ 0.6, all algorithms are able to classify nearly all vertices into the correct communities; therefore, we added node attributes only to LFR networks with μ = 0.7, 0.8, or 0.9. Following previous studies33, 40, we generated a group of LFR benchmarks with 1000 nodes, \(\langle k\rangle =20\), \(k_{max}=50\), \(C_{min}=10\), \(C_{max}=50\), γ = 2, and β = 1.

We generated two types of node attributes, binary and numerical, for the LFR benchmarks. For simplicity, we did not generate categorical attributes, since these can be formulated as binary attributes. We first attached D-dimensional binary attributes to each node, giving nodes in the same community the same d (d < D) attributes. In this group of experiments, we set D = 100 and d = 10 to test high dimensional attributes. To blur the attribute cluster structure, we added 10% to 50% noise by randomly flipping the corresponding portion of binary attributes; as the noise ratio increased, the cluster structure became less clear. We then used the Gaussian cluster generator (http://personalpages.manchester.ac.uk/mbs/Julia.Handl) to generate D dimensions of numerical attributes following multivariate normal distributions, such that the cluster structure of the attributes matched the community structure of the corresponding network. For a single multivariate cluster, the mean was uniformly distributed in the range [−10, 10], the off-diagonal entries of the covariance matrix were random numbers in the range [−1, 1], and the diagonal entries were the sum of all off-diagonal entries plus a random number in the range \([0,\,20\cdot \sqrt{D}]\). We set D = 10, 5, 3, and 2 in these groups of experiments; higher dimensionality led to clearer attribute clusters.
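To make the binary attribute generation concrete, the following minimal Python sketch (our own illustration; the function name and defaults are ours, not the original generation code) attaches community-aligned binary attributes and then flips a given fraction of entries as noise:

```python
import numpy as np

def binary_attributes(labels, D=100, d=10, noise=0.2, seed=0):
    """Attach D-dimensional binary attributes: nodes in the same community
    share the same d 'on' attributes; then flip a `noise` fraction of all
    entries at random to blur the cluster structure."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    X = np.zeros((len(labels), D), dtype=int)
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        on = rng.choice(D, size=d, replace=False)  # community-specific attributes
        X[np.ix_(members, on)] = 1
    flip = rng.random(X.shape) < noise             # noise: random 0/1 flips
    X[flip] = 1 - X[flip]
    return X
```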

We first compared six probabilistic generative models (PCL-DC, PPL-DC, PPSB-DC, BAGC, GBAGC, and CESNA) and seven hybrid methods (CODICIL, SA-cluster, Inc-cluster, GLFM, cluster-dp, kNN-nearest, and kNN-Kmeans) on the sample sets with binary attributes, where the noise ratio ranged over {10%, 20%, …, 50%} at μ = 0.7, 0.8, or 0.9, respectively. We also compared all algorithms to K-rank-D using only links and K-means using only attributes. We reported the average results and standard deviations over 10 sample sets for each setting in Tables 1, 2 and 3, where columns indicate the noise ratio of the LFR benchmarks, the three numbers in each cell represent the average values and standard deviations of the three accuracy metrics (ACC, NMI, and PWF, defined by Equations (5–7) in the Methods section) for the corresponding algorithm, and the best performing algorithm is marked in bold. Details of the parameter settings of the compared algorithms can be found in the Methods section. We did not include results for PICS and BNPA in the tables because they did not converge to the real number of communities, which distorted the meaning of the accuracy metrics (ACC, NMI, and PWF).

Table 1 Results on LFR networks with binary attributes, μ = 0.7.
Table 2 Results on LFR networks with binary attributes, μ = 0.8.
Table 3 Results on LFR networks with binary attributes, μ = 0.9.

Since only kNN-nearest, kNN-Kmeans, cohsMix, and cluster-dp can be used to cope with networks having numerical node attributes, we then compared these four algorithms on LFR benchmarks with numerical attributes at D = 10, 5, 3, or 2 and μ = 0.7, 0.8, or 0.9. We also compared these four algorithms with K-means using only numerical attributes (results of K-rank-D using only links can be seen in Tables 1, 2 and 3). The experimental results are shown in Tables 4, 5 and 6, where columns represent the dimension of the numerical attribute space (D = 10, 5, 3, or 2), the three numbers in each cell represent the average values and standard deviations of the three accuracy metrics (ACC, NMI, and PWF) of the corresponding algorithm over 10 samples, and '−' indicates that cohsMix was trapped in a saddle point. The best algorithm in each column is marked in bold.

Table 4 Results on LFR networks with numerical attributes, μ = 0.7.
Table 5 Results on LFR networks with numerical attributes, μ = 0.8.
Table 6 Results on LFR networks with numerical attributes, μ = 0.9.

We also tested the algorithms on LFR networks with 5000 nodes, \(\langle k\rangle =20\), \(k_{max}=50\), \(C_{min}=20\), \(C_{max}=100\), γ = 2, and β = 1. Because there were many testing samples and the results were similar to those of the first group of networks with 1000 nodes, we do not report them here. Instead, to give a glimpse of the time complexity of the compared algorithms, Table 7 reports the time cost of each algorithm on a randomly generated sample containing 40% noise for binary attributes with {n = 1000, \(C_{min}=10\), \(C_{max}=50\)}, {n = 5000, \(C_{min}=20\), \(C_{max}=100\)}, and {n = 10000, \(C_{min}=20\), \(C_{max}=200\)}, respectively, at \(\langle k\rangle =20\), \(k_{max}=50\), γ = 2, β = 1, and μ = 0.8. All algorithms were run only once; each number represents the running time of the corresponding algorithm in seconds, '—' indicates that the time cost exceeded 48 hours, and '*' indicates that the algorithm ran out of memory. These experiments were performed on a laptop with an Intel 2.50 GHz processor and 4 GB of main memory running Windows 7. CESNA was implemented in C++, CODICIL in Python and C/C++, cohsMix in R, and the remaining algorithms in MATLAB.

Table 7 Time cost of the compared algorithms (in seconds).

From the data in Tables 1–7, we conclude that adding node attributes improves the performance of community detection in most cases. Taking the results of kNN-Kmeans as an example, most of its results were better than those of the basic K-rank-D algorithm on links and of K-means on attributes. As Tables 1–3 show, in most cases kNN-nearest and kNN-Kmeans performed best among the 13 tested algorithms, including both probabilistic generative models and hybrid methods, and they outperformed the other hybrid methods in all cases. Although kNN-nearest performed slightly worse than kNN-Kmeans, it was more efficient (see Table 7), since each node simply receives its community label from the nearest node with higher centrality. In our experiments, kNN-Kmeans converged quickly because the community centers were carefully selected. In some cases, the probabilistic generative model PCL-DC displayed the best performance, but it ran too slowly to be used for processing large networks in real applications (see Table 7). CESNA and CODICIL also performed well in this group of experiments. Among the probabilistic methods, CESNA was the fastest; however, it was still much slower than the majority of the hybrid heuristic methods. CODICIL ran quickly because it uses the fast graph partition program Metis to cut networks into communities. GBAGC performed well because it used Metis on links to obtain the initial partition. Moreover, as Tables 4–6 show, both kNN-nearest and kNN-Kmeans discovered communities effectively in LFR networks with numerical attributes. In summary, this empirical study on LFR benchmarks demonstrates the flexibility, effectiveness, and efficiency of the kNN-enhance approach.

Experimental Results on Real Networks

In addition to our experiments using synthetic networks, we tested the algorithms on two groups of real networks. The nodes in the first group were associated with binary/categorical attributes, while those in the second group possessed numerical attributes. The first group of data sets included Cora42, Citeseer42, and DBLP10K18. Sinanet (https://github.com/smileyan448/Sinanet) and PubMed (http://linqs.umiacs.umd.edu/projects//projects/lbc/) belonged to the second group. Detailed information on these data sets is described below.

The Cora data set consisted of machine learning papers classified into one of the following seven classes: CBR (case based reasoning), GA (genetic algorithms), NN (neural networks), PM (probabilistic methods), RL (reinforcement learning), RLT (rule learning), and Theory. The papers were selected in such a way that, in the final corpus, every paper cited or was cited by at least one other paper. With each node representing a paper, there were 2,708 nodes and 5,429 citations. After stemming and removing stop-words and words with document frequency less than 10, the corpus had a vocabulary of 1,433 unique words. Each paper was described by a 1,433-dimensional 0/1 vector indicating the absence/presence of the corresponding words from this vocabulary.

The Citeseer data set was also a citation network in the field of machine learning. These papers were classified into one of the following six classes: Agents, AI (artificial intelligence), DB (database), IR (information retrieval), ML (machine learning), and HCI (human-computer interaction). The papers were selected in the same way as the Cora dataset. There were 3,312 papers in the corpus and 4,732 citations between papers. A paper was described by a 0/1 word vector indicating the absence/presence of the corresponding words from the dictionary of the 3,703 unique words.

The DBLP data set was a coauthor network extracted from the DBLP Bibliography data. This network contained 10,000 authors and their coauthor relationships. The authors were distributed across four research fields: databases, data mining, information retrieval, and artificial intelligence. Each author was associated with two attributes: prolific and primary topic. The attribute prolific had three possible values: authors with ≥20 publications were labeled as highly prolific, authors with ≥10 and <20 publications as prolific, and authors with <10 publications as low prolific. The attribute primary topic had 99 values: each author was assigned one of 99 topics extracted by a topic model from the titles of the author's papers. For this data set, we did not know the exact number of communities or to which community a node belonged.

The Sinanet data set was a microblog user relationship network that we extracted from the Sina microblog website (http://www.weibo.com). We first selected 100 VIP Sina microblog users distributed across 10 major forums: finance and economics, literature and arts, fashion and vogue, current events and politics, sports, science and technology, entertainment, parenting and education, public welfare, and normal life. Starting from these 100 VIP users, we extracted their followees and published micro-blogs. Using a depth-first search strategy, we extracted three layers of user relationships and obtained 8,452 users, 147,653 user relationships, and 5.5 million micro-blogs in total. We merged all micro-blogs published by a user to characterize that user's interests43. After removing silent users (those who posted fewer than 5,000 words), we were left with 3,490 users and 30,282 relationships. Had we used the word frequencies of a user's merged blogs to describe the user's interests, the dimension of the feature space would have been too high to process. Instead, we used each user's topic distribution over the 10 forums, obtained by the LDA topic model (http://gibbslda.sourceforge.net/), to describe the user's interests. Thus, besides the followee relationships between pairs of users, we have 10-dimensional numerical attributes describing the interests of each user. This data set is available at https://github.com/smileyan448/Sinanet.

The Diabetes data set consisted of 19,717 scientific publications on diabetes from the PubMed database, each classified into one of three classes: Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. These publications formed a citation network with 44,338 edges representing citation relationships between pairs of publications. Each publication was described by a TF/IDF weighted word vector over a dictionary of 500 unique words.

The experimental results of the methods on Cora and Citeseer are shown in Table 8, where columns represent the data sets used in the evaluation, the cells of each row give the ACC, NMI, and PWF values of the corresponding algorithm, and the best performing algorithm in each of the two groups (probabilistic methods and hybrid methods) is marked in bold. Because the DBLP network had two categorical attributes, we tested SA-cluster on the two categorical attributes, with 3 and 99 values respectively, and named the result SA-cluster-cate. We also tested SA-cluster on DBLP when viewing these 102 values as binary attributes and named the result SA-cluster-bina (the same was done for Inc-cluster). Since no ground truth was available for DBLP, we report the Modularity and Entropy (defined by Equations (8–9) in the Methods section) of all algorithms in Figs 3 and 4 at different K (the number of communities). In Figs 3 and 4, we do not show the results of PCL-DC, PPL-DC, and PPSB-DC because either the time or space complexity of these algorithms was too high for networks as large as DBLP10K. The results of the methods on the Sinanet and PubMed networks with numerical attributes are shown in Table 9, where '*' indicates that the algorithm ran out of memory on the corresponding data set. For the probabilistic methods PCL-DC, PPL-DC, and PPSB-DC on Cora and Citeseer, and cohsMix on Sinanet, we ran the algorithms 10 times and reported the result with the largest likelihood. For K-means on attributes alone, we reported the best results over 10 runs. Details of the parameter settings of the compared algorithms in this group of experiments can be found in the Methods section.

Table 8 Performance of compared algorithms on Cora and Citeseer.
Figure 3

Modularity of the compared algorithms on DBLP10k.

Figure 4

Entropy of the compared algorithms on DBLP10k.

Table 9 Performance of compared algorithms on Sinanet and PubMed Diabetes.

We drew the following conclusions from Tables 8 and 9 and Figs 3 and 4: (1) According to Table 8, the probabilistic methods PCL-DC, PPL-DC, and PPSB-DC showed the best performance on the Cora and Citeseer data sets. However, their time cost was too high for real applications. In contrast, the kNN-enhance approach achieved high accuracy compared to the other hybrid methods and was much faster than PCL-DC, PPL-DC, and PPSB-DC (see Table 7). (2) As Figs 3 and 4 show, the Entropy of kNN-Kmeans on DBLP was the lowest, especially when the number of communities K was larger than 200, and the Modularity of kNN-enhance indicates that the partitioned network maintained community structure. Therefore, kNN-enhance was able to identify a clear community structure (large Modularity) with a high level of attribute homogeneity (low Entropy) in the network. (3) The kNN-enhance approach was capable of processing networks with numerical attributes (see Table 9). Even though the accuracy of cohsMix was higher than that of kNN-nearest and kNN-Kmeans on Sinanet, cohsMix ran much more slowly than kNN-enhance, and its reported result was the best of 10 runs on the Sinanet data. Moreover, cohsMix could not deal with a network as large as the Diabetes data set due to its high memory usage for storing the similarity matrix and all hidden variables.

Discussion

We have proposed a simple and flexible node attribute-enhanced community detection approach, kNN-enhance. The approach first constructs the k nearest neighbor graph of node attributes and then merges this kNN graph with the original network. In this way, it alleviates the sparsity of the original network, reduces noise effects, and strengthens the network's community structure, so that a clear community structure can be uncovered in the kNN graph enhanced network by a community detection algorithm such as K-rank-D. Our two implementations, kNN-nearest and kNN-Kmeans, achieved better performance than existing state-of-the-art algorithms. Furthermore, they can deal with networks containing binary, categorical, or numerical attributes and can easily be extended to process large-scale networks.

In the future, we intend to test this approach on large-scale networks with millions of edges by combining fast approximate kNN graph construction algorithms (such as NN-Descent36, with \(O(n^{1.14})\) empirical cost) with fast community detection algorithms such as BGLL38 and Infomap39. Moreover, besides strengthening the community structure of a network using node attributes, we plan to design a more effective method that also removes easily detected weakly-linked edges from the network. In this study, we were concerned with detecting community structures in which nodes have more links to each other than to nodes outside their communities. However, it has been observed that trees and tree-like networks have high modularity44, 45 (the classical objective function used to discover communities and to measure their strength46) and that many real-world networks have tree-like structures47,48,49. Existing methods use connections only to decompose a network into tree-like components. Combining node attributes with topology to cluster the nodes of a tree-like network into groups is a challenging task, and we will investigate whether our kNN-enhance approach is capable of partitioning attributed tree-like networks.

Methods

Community Detection in Attributed Networks

Suppose that G = (V, E, X) is a network with node attributes, where V is a set of nodes (\(\Vert V\Vert =n\)); E is an edge set indicating relationships between pairs of nodes (\(\Vert E\Vert =m\)), usually represented by an adjacency matrix \(A=[A_{ij}]\) (\(A_{ij}=1\) if there is an edge between nodes i and j, \(A_{ij}=0\) otherwise); and \(X=\{x_{1},x_{2},\cdots ,x_{n}\}\) \(({x}_{i}=({x}_{i1},{x}_{i2},\cdots ,{x}_{iD}),i=1,2,\cdots ,n)\) is a set of vectors, each of which holds the values of the D attributes associated with node i. We call this an 'attributed network' or 'attributed graph'. Community detection in an attributed network involves partitioning the nodes into clusters such that nodes in the same cluster are not only densely connected to each other but also exhibit a high level of attribute homogeneity.
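As a minimal illustration of this representation (toy values only, assuming numerical attributes), an attributed network can be stored as an adjacency matrix together with an n × D attribute matrix:

```python
import numpy as np

# Toy attributed network G = (V, E, X) with n = 5 nodes and D = 3 attributes.
n, D = 5, 3
edges = [(0, 1), (0, 2), (1, 2), (3, 4)]   # E: undirected edge list
A = np.zeros((n, n), dtype=int)            # adjacency matrix A = [A_ij]
for i, j in edges:
    A[i, j] = A[j, i] = 1                  # A_ij = 1 iff there is an edge {i, j}
X = np.random.rand(n, D)                   # row x_i: the D attribute values of node i
```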

An Active Method for Community Detection in Networks

cluster-dp is a recently-developed clustering algorithm similar to the K-means method34. The algorithm assumes that cluster centers are surrounded by neighbors with lower local density and that they lie at a relatively large distance from any data point with higher local density. Therefore, for each data point i, two quantities, the local density \(\rho_{i}\) and the distance \(\delta_{i}\) from points of higher density, are defined as follows to quantify the likelihood of a data point being a cluster center:

$${\rho }_{i}=\sum _{j}\chi ({d}_{ij}-{d}_{c}),\qquad {\delta }_{i}=\mathop{\min }\limits_{j:{\rho }_{j} > {\rho }_{i}}({d}_{ij})$$
(1)

where \(d_{ij}\) is the distance between data points i and j, χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, \(d_{c}\) is the cutoff distance, and \({\delta }_{i}=\mathop{\max }\limits_{j}({d}_{ij})\) for the point with the highest density.

If we scatter all data points on a decision graph drawn by their values of \(\rho_{i}\) and \(\delta_{i}\) for all \(i\in \{1,2,\cdots ,n\}\), the cluster centers tend to occupy the upper right part of the graph. After cluster centers with both relatively large \(\rho_{i}\) and \(\delta_{i}\) are manually selected from the decision graph, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. In this way, cluster-dp uncovers the cluster structure of the data points once the number of clusters and the cluster centers are actively determined.
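A compact sketch of these two steps, assuming a precomputed distance matrix and manually chosen centers (our own illustrative code, not the reference implementation):

```python
import numpy as np

def cluster_dp(dist, d_c, centers):
    """Compute rho and delta from Eq. (1), then assign each remaining point
    to the cluster of its nearest neighbor of higher density.
    Assumes the densest point is among the chosen `centers`."""
    n = dist.shape[0]
    rho = (dist < d_c).sum(axis=1) - 1          # chi(d_ij - d_c), excluding the point itself
    order = np.argsort(-rho)                    # points by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.zeros(n, dtype=int)
    delta[order[0]] = dist[order[0]].max()      # convention for the densest point
    nearest_higher[order[0]] = order[0]
    for pos in range(1, n):
        i = order[pos]
        higher = order[:pos]                    # all points with larger rho
        j = higher[np.argmin(dist[i, higher])]
        delta[i], nearest_higher[i] = dist[i, j], j
    labels = np.full(n, -1)
    labels[np.asarray(centers)] = np.arange(len(centers))
    for i in order:                             # parents are labeled before children
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return rho, delta, labels
```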

However, the following issues exist: (1) When the cluster structure is not clear (i.e., there is no distinguishable boundary between cluster centers and other data points on the decision graph), it is difficult to obtain the correct number of clusters and cluster centers, which leads to poor partitioning. (2) The parameter \(d_{c}\) must be tuned in many cases, and it is usually difficult to know which value is best. (3) The input for cluster-dp is a distance matrix, whose quality strongly affects the clustering result. When the algorithm is used to discover community structure in a network, the topological structure implied in the network is not fully utilized.

In a network, we suppose that community centers are: (1) influential and surrounded by less influential nodes, and (2) located far from each other in the network. Therefore, we proposed K-rank-D33, which uses two quantities, \({v}_{i}\in v=\{{v}_{1},{v}_{2},\cdots ,{v}_{n}\}\) and \({\bar{\delta }}_{i}\), to describe the centrality and the dispersion of each node i, respectively. The centrality vector v can be calculated efficiently using PageRank50 centrality as follows:

$${v}^{t+1}=\left((1-\beta )P+e\frac{\beta }{n}\right){v}^{t},\qquad {P}_{ij}=\frac{{A}_{ij}}{{\sum }_{k}{A}_{ik}},\quad i,j\in \{1,2,\cdots ,n\}$$
(2)

where β is the restart probability (fixed at 0.15), e is the unit matrix, \(v^{0}\) is an n-dimensional unit vector, and \(v^{t}\) is normalized to 1 in each iteration. The dispersion of a node i relative to nodes of higher centrality is defined by \({\bar{\delta }}_{i}=\mathop{\min }\limits_{j:{v}_{j} > {v}_{i}}({d}_{ij})\), with \({\bar{\delta }}_{k}=\mathop{\max }\limits_{i\ne k}({\bar{\delta }}_{i})\) for the node k with the highest centrality. Here \(d_{ij}\) is the structural distance between nodes i and j. It can be computed using the Euclidean distance \({\Vert \cdot \Vert }_{2}\) after τ-step signal propagation51 by the following equations:

$$S={(A+I)}^{\tau },\qquad {\bar{S}}_{ij}={S}_{ij}\Big/\sqrt{\sum _{k}{S}_{ik}^{2}},\qquad {d}_{ij}={\Vert {\bar{S}}_{i}-{\bar{S}}_{j}\Vert }_{2}.$$
(3)

where τ = 3 in our implementation and \({\bar{S}}_{i}\) is the i-th row of \(\bar{S}\). When the community structure of a network is fuzzy, we define the comprehensive value of each node i as follows:

$$CV(i)=\frac{{v}_{i}\cdot {\bar{\delta }}_{i}}{\mathop{\max }\limits_{j=1}^{n}({v}_{j})\cdot \mathop{\max }\limits_{j=1}^{n}({\bar{\delta }}_{j})}.$$
(4)

The top K nodes with the highest comprehensive value can then be automatically selected as the initial centers of K-rank-D.
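Putting Equations (2)–(4) together, a minimal sketch of the center-selection step might look as follows (illustrative code under the stated settings β = 0.15 and τ = 3; the power-iteration count is our own choice, and every node is assumed to have at least one link):

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_rank_d_centers(A, K, beta=0.15, tau=3, iters=100):
    """PageRank centrality v (Eq. 2), signal-propagation distances d_ij
    (Eq. 3), dispersion delta_bar, comprehensive value CV (Eq. 4), and
    the top-K nodes as initial centers."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)        # P_ij = A_ij / sum_k A_ik
    v = np.ones(n) / n
    for _ in range(iters):                      # power iteration, as in Eq. (2)
        v = (1 - beta) * (P @ v) + beta / n
        v /= v.sum()                            # normalize v to 1 each iteration
    S = np.linalg.matrix_power(A + np.eye(n), tau)           # tau-step propagation
    S_bar = S / np.sqrt((S ** 2).sum(axis=1, keepdims=True))
    d = cdist(S_bar, S_bar)                     # d_ij = ||S_bar_i - S_bar_j||_2
    delta_bar = np.empty(n)
    for i in range(n):
        higher = np.flatnonzero(v > v[i])       # nodes with higher centrality
        delta_bar[i] = d[i, higher].min() if higher.size else d[i].max()
    cv = v * delta_bar / (v.max() * delta_bar.max())         # Eq. (4)
    centers = np.argsort(-cv)[:K]               # top-K comprehensive values
    return centers, v, d, S_bar
```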

kNN-enhance: a Node Attribute-enhanced Community Detection Approach

Given an attributed network G = (V, E, X), we first construct the kNN graph of node attributes. The kNN graph for a set of nodes V is a directed graph with vertex set V and an edge from each v ∈ V to its k most similar objects in V under a given similarity measure on attributes. For any \(x_{i},x_{j}\in X\), the cosine similarity \({x}_{i}\cdot {x}_{j}^{T}\) is used to compute the similarity of a pair of nodes with binary attributes, and \(1-norm({\Vert {x}_{i}-{x}_{j}\Vert }_{2})\) is used for a pair of nodes with numerical attributes, where \(norm({\Vert {x}_{i}-{x}_{j}\Vert }_{2})\) is the normalized Euclidean distance between \(x_{i}\) and \(x_{j}\). We then add the kNN graph of attributes to the original network: if an edge of the kNN graph does not exist in the original network, we add it; otherwise, we keep the original edge unchanged.
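A minimal sketch of this construction (our own illustration; in practice a fast approximate kNN method would replace the brute-force neighbor search):

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_enhance(A, X, k=10, binary=True):
    """Build the kNN graph of node attributes under the similarities above
    and merge it with the original adjacency matrix A: a kNN edge is added
    only if it is not already present in the original network."""
    if binary:
        sim = (X @ X.T).astype(float)           # x_i . x_j^T for binary attributes
    else:
        d = cdist(X, X)                         # Euclidean distances
        sim = 1 - d / d.max()                   # 1 - norm(||x_i - x_j||_2)
    np.fill_diagonal(sim, -np.inf)              # a node is not its own neighbor
    E = A.copy()
    for i in range(A.shape[0]):
        for j in np.argsort(-sim[i])[:k]:       # the k most similar nodes to i
            E[i, j] = E[j, i] = 1               # add the edge; existing edges unchanged
    return E
```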

After the kNN-enhanced network is established, we use the K-rank-D method introduced above to perform node clustering. Building on K-rank-D, we employ two node assignment strategies after selecting the K community centers from the decision graph. kNN-nearest uses the cluster-dp strategy34, assigning each remaining node to the same cluster as its nearest neighbor of higher PageRank centrality computed by Equation (2). kNN-Kmeans uses the strategy of the K-means method, taking the data matrix \(\bar{S}=[{\bar{S}}_{ij}]\) 51 as input and iteratively updating the community centers. It should be pointed out that kNN-nearest and kNN-Kmeans are just two implementations of the kNN-enhance approach. Fast approximate kNN graph construction methods35,36,37 and highly efficient community detection algorithms38, 39 can be combined with it to process large-scale networks.
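The two assignment strategies can be sketched as follows, reusing the quantities v, d, S_bar, and centers from the center-selection sketch above (illustrative code; kNN-Kmeans is shown with scikit-learn's KMeans seeded at the selected centers):

```python
import numpy as np
from sklearn.cluster import KMeans

def knn_nearest(v, d, centers):
    """kNN-nearest: each remaining node takes the label of its nearest
    neighbor of higher PageRank centrality; labels propagate down the
    centrality ranking (the top-centrality node is assumed to be a center)."""
    labels = np.full(len(v), -1)
    labels[np.asarray(centers)] = np.arange(len(centers))
    for i in np.argsort(-v):                    # decreasing centrality
        if labels[i] == -1:
            higher = np.flatnonzero(v > v[i])   # already-labeled candidates
            labels[i] = labels[higher[np.argmin(d[i, higher])]]
    return labels

def knn_kmeans(S_bar, centers):
    """kNN-Kmeans: K-means on the signal matrix S_bar, initialized with
    the rows of the selected community centers."""
    km = KMeans(n_clusters=len(centers), init=S_bar[np.asarray(centers)], n_init=1)
    return km.fit_predict(S_bar)
```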

Metrics for Evaluating Algorithm Quality

In this study, we use two groups of metrics to evaluate the performance of each algorithm. The first group includes ACC (Accuracy), NMI (Normalized Mutual Information), and PWF (Pairwise F-Measure)9, 11. These are commonly used to evaluate an algorithm on a data set with ground truth; larger values indicate better performance. The other group consists of Modularity 38, 46 and Entropy 13, 18. Modularity measures the quality of communities in a network, with larger values indicating better partition quality. Entropy measures the degree of attribute consistency within communities, with lower values indicating greater consistency. These metrics are typically used when an algorithm is run on a network without ground truth. Formal definitions are provided below:

ACC. Given a node i, let \(l_{pi}\) be the label assigned by an algorithm and \(l_{ti}\) its true label. The accuracy is defined by

$$ACC=\sum _{i=1}^{n}\delta ({l}_{ti},{p}_{map}({l}_{pi}))/n$$
(5)

where δ(·) is the Kronecker delta function, \({p}_{map}({l}_{pi})\) is a permutation mapping function that maps the label \(l_{pi}\) to its corresponding label \(l_{ti}\) in the ground truth, and n is the total number of nodes in the network.
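The permutation mapping is commonly implemented with the Hungarian algorithm on the confusion matrix; a minimal sketch (our own illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def acc(l_true, l_pred):
    """ACC (Eq. 5): build the confusion matrix, find the label permutation
    that maximizes the number of matched nodes, and report the fraction."""
    l_true, l_pred = np.asarray(l_true), np.asarray(l_pred)
    K = max(l_true.max(), l_pred.max()) + 1
    cost = np.zeros((K, K), dtype=int)
    for t, p in zip(l_true, l_pred):
        cost[t, p] += 1                          # confusion counts
    rows, cols = linear_sum_assignment(-cost)    # Hungarian algorithm (maximization)
    return cost[rows, cols].sum() / len(l_true)
```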

NMI

Suppose \(C=\{{C}_{1},{C}_{2},\cdots ,{C}_{K}\}\) is the set of K ground truth communities of a network and \(C^{\prime} =\{{C}_{1}^{^{\prime} },{C}_{2}^{^{\prime} },\cdots ,{C}_{K}^{^{\prime} }\}\) is the set of K communities obtained by a specific algorithm. NMI is defined by

$$NMI(C,C^{\prime} )=\frac{-2{\sum }_{i=1}^{K}{\sum }_{j=1}^{K}{n}_{ij}\,\mathrm{log}\frac{n\cdot {n}_{ij}}{{n}_{i}^{C}\cdot {n}_{j}^{{C}^{^{\prime} }}}}{{\sum }_{i=1}^{K}{n}_{i}^{C}\,\mathrm{log}\frac{{n}_{i}^{C}}{n}+{\sum }_{j=1}^{K}{n}_{j}^{{C}^{^{\prime} }}\,\mathrm{log}\frac{{n}_{j}^{{C}^{^{\prime} }}}{n}}$$
(6)

where \(n_{ij}\) is the number of nodes in the ground truth community \(C_{i}\) that are assigned to the computed community \({C}_{j}^{^{\prime} }\), \({n}_{i}^{C}\) is the number of nodes in the ground truth community \(C_{i}\), and \({n}_{j}^{{C}^{^{\prime} }}\) is the number of nodes in the computed community \({C}_{j}^{^{\prime} }\).

PWF

Let T denote the set of nodes in the ground truth communities and W denote the set of nodes assigned by a given algorithm in the corresponding communities. PWF is defined as follows:

$$PWF=\frac{2\times precision\times recall}{precision+recall}$$
(7)

where \(precision=\Vert W\cap T\Vert /\Vert W\Vert \), \(recall=\Vert W\cap T\Vert /\Vert T\Vert \), and \(\Vert \cdot \Vert \) denotes the cardinality of a set.
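Direct implementations of Equations (6) and (7); for PWF we read T and W as the sets of same-community node pairs, which is the usual pairwise interpretation (an assumption on our part):

```python
import numpy as np
from itertools import combinations

def nmi(l_true, l_pred):
    """NMI (Eq. 6), computed from the contingency table n_ij.
    Assumes all communities are non-empty."""
    l_true, l_pred = np.asarray(l_true), np.asarray(l_pred)
    n = len(l_true)
    nij = np.zeros((l_true.max() + 1, l_pred.max() + 1))
    for t, p in zip(l_true, l_pred):
        nij[t, p] += 1
    ni, nj = nij.sum(axis=1), nij.sum(axis=0)    # community sizes n_i^C, n_j^C'
    mask = nij > 0                               # 0 * log 0 terms contribute nothing
    num = (nij[mask] * np.log(n * nij[mask] / np.outer(ni, nj)[mask])).sum()
    den = (ni * np.log(ni / n)).sum() + (nj * np.log(nj / n)).sum()
    return -2 * num / den

def pwf(l_true, l_pred):
    """PWF (Eq. 7) over same-community node pairs."""
    pairs = lambda l: {(i, j) for i, j in combinations(range(len(l)), 2)
                       if l[i] == l[j]}
    T, W = pairs(list(l_true)), pairs(list(l_pred))
    precision, recall = len(W & T) / len(W), len(W & T) / len(T)
    return 2 * precision * recall / (precision + recall)
```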

Modularity

Given a network with n nodes and m edges, Modularity can be calculated as follows:

$$Modularity=\frac{1}{2m}\sum _{i,j}({A}_{ij}-\frac{{k}_{i}\cdot {k}_{j}}{2m})\delta ({c}_{i},{c}_{j})$$
(8)

where \(A=[A_{ij}]\) is the adjacency matrix of the network, \(k_{i}\) is the degree of node i, δ(·,·) is the Kronecker delta function, and \(c_{i}\) is the community to which node i belongs.

Entropy

Given a network with n nodes, suppose that each node is associated with D attributes \(({a}_{1},{a}_{2},\cdots ,{a}_{D})\) and that the nodes are partitioned into K communities. Let \(n_{c}\) be the number of nodes in the c-th community and \(p_{ic}\) be the fraction of nodes in the c-th community taking attribute \(a_{i}\). The total Entropy of attributes over the communities is then defined in the following way:

$$Entropy=-\sum _{c=1}^{K}\frac{{n}_{c}}{n}\sum _{i=1}^{D}{p}_{ic}\,\mathrm{log}({p}_{ic})$$
(9)

Entropy measures the homogeneity of attributes within communities: the lower the Entropy, the more attribute-homogeneous the communities.
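Minimal sketches of both metrics for an undirected network with binary attributes (our own illustrations; Eq. (9) is followed as stated, summing over the fractions \(p_{ic}\) of 'on' attributes):

```python
import numpy as np

def modularity(A, labels):
    """Modularity (Eq. 8) from the adjacency matrix and community labels."""
    labels = np.asarray(labels)
    k = A.sum(axis=1)                            # node degrees k_i
    two_m = k.sum()                              # 2m for an undirected network
    same = labels[:, None] == labels[None, :]    # Kronecker delta(c_i, c_j)
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

def entropy(X, labels):
    """Attribute Entropy (Eq. 9): p_ic is the fraction of nodes in
    community c taking (binary) attribute a_i; lower means more homogeneous."""
    labels = np.asarray(labels)
    n, total = len(labels), 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        p = members.mean(axis=0)
        p = p[p > 0]                             # 0 * log 0 contributes nothing
        total -= (len(members) / n) * (p * np.log(p)).sum()
    return total
```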

Parameter Settings on Synthetic and Real-world Networks

As mentioned above, the PCL-DC, PPL-DC, and PPSB-DC methods are sensitive to initial values. For the experiments on synthetic attributed networks, we ran these algorithms 10 times on each sample set, selected the best result determined by maximum likelihood, and then reported the average results and standard deviations over the 10 samples. We set the maximum iteration number and the convergence threshold of PCL-DC, PPL-DC, and PPSB-DC to 2000 and \(10^{-8}\), respectively. We set the regularization coefficient λ = 1 for PCL-DC and CESNA and λ = 0.1 for PPL-DC and PPSB-DC, since these settings gave the best performance. Similarly, for K-means on attributes, which is sensitive to initial values, we ran it 10 times on each sample set, selected the best result with the highest accuracy, and then reported the average results and standard deviations over the 10 samples. For BAGC and GBAGC, the maximum iteration number was set to 10. For CODICIL, we tried \(\bar{K}=30\), 50, and 70 and selected the setting with the highest accuracy. For cluster-dp, we used the cosine similarity \({x}_{i}\cdot {x}_{j}^{T}\) to compute the similarity of node attributes and the signal similarity \(1-norm({\Vert {\bar{S}}_{i}-{\bar{S}}_{j}\Vert }_{2})\) to compute that of the link structure. We set the weight α between attribute similarity and link similarity to 0.5 for cluster-dp and CODICIL, since it was difficult to tune the weight adaptively for each sample. We set \(\bar{D}=50\) for GLFM. For kNN-nearest and kNN-Kmeans, we set k = 10: we only wanted to strengthen the community structure of the original network, for which a small k is sufficient. We used default values for all algorithm parameters not mentioned above. Similarly, for the probabilistic method cohsMix, we set the maximum iteration number to 200, chose the best result determined by maximum likelihood among 10 runs for each sample set, and then reported the average values and standard deviations over the 10 samples of each test setting.

In the group of experiments on real-world networks, we used the same parameter settings as in the original method publications in nearly all cases. We set λ = 5 for PCL-DC, PPL-DC, and PPSB-DC and λ = 1 for CESNA because these settings produced the best performance. We set the maximum iteration number to 10 for BAGC and GBAGC. We chose \(\bar{K}=50\) for CODICIL because it gave the best performance among the options \(\bar{K}\in \{30,50,70\}\). We set \(\bar{D}\) of GLFM to 20. The weight between link structure and node attributes was 0.5 for cluster-dp and CODICIL, and the maximum iteration number of cohsMix was 200. Because we used node attributes only to make up for the sparsity of the original network and to strengthen its community structure, the parameter k of the kNN attribute graph was again 10 for all real networks, with the exception of the Diabetes data set, for which we set k = 60 because it contained only 3 large communities with thousands of nodes and a larger k provided better performance.