Node Attribute-enhanced Community Detection in Complex Networks

Community detection involves grouping the nodes of a network such that nodes in the same community are more densely connected to each other than to the rest of the network. Previous studies have focused mainly on identifying communities in networks using node connectivity. However, each node in a network may be associated with many attributes. Identifying communities in networks combining node attributes has become increasingly popular in recent years. Most existing methods operate on networks with attributes of binary, categorical, or numerical type only. In this study, we introduce kNN-enhance, a simple and flexible community detection approach that uses node attribute enhancement. This approach adds the k Nearest Neighbor (kNN) graph of node attributes to alleviate the sparsity and the noise effect of an original network, thereby strengthening the community structure in the network. We use two testing algorithms, kNN-nearest and kNN-Kmeans, to partition the newly generated, attribute-enhanced graph. Our analyses of synthetic and real world networks have shown that the proposed algorithms achieve better performance compared to existing state-of-the-art algorithms. Further, the algorithms are able to deal with networks containing different combinations of binary, categorical, or numerical attributes and could be easily extended to the analysis of massive networks.


Results
A Description and Illustration of kNN-enhance. Networks in real applications are often sparse and contain noise in the form of spurious edges. This sparseness and noise blur the community structure of a network. Yet nodes in the same community are likely to be connected to each other and to share similar interests, even though some of them are 'silent'. Therefore, we can obtain a kNN graph by using a set of node attributes. The kNN graph is then combined with the original network to compensate for sparsity, thereby strengthening the community structure of the network. Figure 1 is an illustration of kNN-enhance. Figure 1a shows an attributed network in which each node has four attributes: degree, research area, affiliation, and location. This original network is sparse and its community structure is not clear. If we add a link between pairs of nodes that are nearest neighbors with common node attributes (Fig. 1b), the attribute-enhanced network shows a distinctive community structure. Optionally, a community detection algorithm such as K-rank-D can be used to discover the community structure in the newly generated, attribute-enhanced network.
Figure 2 illustrates the effectiveness of kNN-enhance through its partition process. Figure 2a is an example of the decision graph of an original LFR network 40 with μ = 0.9 and n = 1000 using K-rank-D. The original network contained 38 communities. One hundred binary attributes with the same cluster structure as the original network were attached to each node at a noise ratio of 20%. In the original LFR network, the community structure was unclear and the 38 community centers were not sufficiently separated in the upper right corner of the decision graph. As a result, it was difficult to determine the number of communities and the community centers, and therefore to detect the community structure in the network.
Subsequently, the kNN graph was added to the original network, and the decision graph of the kNN-graph-enhanced network was created with k = 10 using K-rank-D (Fig. 2b). The community structure became clearer and the 38 community centers were separated in the upper right part of the decision graph. This made the community structure much easier to determine. In addition, the red nodes in Fig. 2b were the top 38 nodes with the highest comprehensive value (computed by Equation (4) in the Methods section). The nodes in the square were selected by manually drawing a rectangle in the upper right section of the graph. Using the manually selected nodes as initial centers, all nodes were correctly partitioned when compared to the ground truth. Yet, using the top 38 nodes (red nodes) as initial centers, the accuracy (computed by Equation (5) in the Methods section) was only 95%. In some cases it is difficult to select the exact K community centers in decision graphs (see Fig. 2a as an example). We therefore automatically selected the top K nodes with the highest comprehensive value as the centers in the following experiments.
Experiment Results. We generated two groups of LFR 40 benchmark networks with binary and numerical node attributes, respectively. We tested existing state-of-the-art algorithms, including probabilistic models (PCL-DC, PPL-DC, PPSB-DC, CESNA, cohsMix, BAGC, and GBAGC) and hybrid methods (SA-cluster, Inc-cluster, CODICIL, and GLFM), on these synthetic benchmarks. We then evaluated these algorithms on several commonly used real networks, some with and some without associated ground truth. We compared two instantiations of our kNN-enhance approach, kNN-nearest and kNN-Kmeans, to these existing algorithms.
In addition, we compared kNN-nearest and kNN-Kmeans with K-rank-D using only link information, K-means using only node attributes, and cluster-dp 34 using both node attribute and link information on these networks to show whether the proposed approach performed better than existing similar methods and methods using either links or attributes alone.
Experimental Results on Synthetic Networks. Largeron et al. 41 have provided a generator to generate networks with community structure and numerical node attributes. However, the generator cannot be used to generate networks with binary attributes. Therefore, we generated our own series of networks based on a commonly used LFR benchmark 40 .
LFR benchmark networks were introduced by Lancichinetti et al. 40 . They mimic real networks by introducing associated characteristics, i.e., the heterogeneity in the distribution of node degree and community size. The LFR benchmark method uses several parameters to generate a network, including n (the number of vertices), μ (the mixing parameter), 〈k〉 (the average degree of vertices), $k_{\max}$ (the maximum degree of vertices), $C_{\min}$ (the minimum community size), $C_{\max}$ (the maximum community size), and γ and β (exponents of the power-law distributions of node degree and community size). The mixing parameter μ controls the clearness of community structure in a network. Each node shares a fraction 1 − μ of its links with other nodes in its community and a fraction μ with the other nodes in the network. Thus, the smaller μ is, the clearer the community structure in an LFR network. When μ ≤ 0.6, all algorithms are able to classify nearly all vertices into the correct communities. Therefore, we only added node attributes to LFR networks when μ = 0.7, 0.8, or 0.9. Following the example of previous studies 33, 40 , we generated a group of LFR benchmarks with 1000 nodes, 〈k〉 = 20, $k_{\max}$ = 50, $C_{\min}$ = 10, $C_{\max}$ = 50, γ = 2, and β = 1.
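As a worked illustration of the mixing parameter, and assuming for simplicity a node whose degree equals the average degree of 20 used in these benchmarks, the expected number of links that the node keeps inside its own community at the three tested settings is:

```latex
\[
(1-\mu)\,\langle k \rangle =
\begin{cases}
0.3 \times 20 = 6, & \mu = 0.7,\\
0.2 \times 20 = 4, & \mu = 0.8,\\
0.1 \times 20 = 2, & \mu = 0.9.
\end{cases}
\]
```

At μ = 0.9 such a node therefore has about 2 intra-community links against 18 links to the rest of the network, which is why link information alone becomes unreliable at these settings.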
We generated two types of node attributes, binary and numerical, for the LFR benchmarks. For simplicity, we did not generate categorical attributes, since these can be formulated as binary attributes. We first attached D-dimensional binary attributes to each node and gave nodes in the same community the same d (d < D) attributes. In this group of experiments, we set D = 100 and d = 10 to test high-dimensional attributes. To blur the attribute cluster structure, we added 10% to 50% noise by randomly flipping the corresponding portion of binary attributes. As the noise ratio increased, the clearness of the cluster structure decreased. We then used the Gaussian cluster generator (http://personalpages.manchester.ac.uk/mbs/Julia.Handl) to generate D dimensions of numerical attributes following multivariate normal distributions such that the cluster structure of the attributes was the same as the community structure of the corresponding network. For a single multivariate cluster, the mean was uniformly distributed in the range [−10, 10], the off-diagonal entries of the covariance matrix were generated as random numbers in the range [−1, 1], and the diagonal entries of the covariance matrix were generated as the sum of all off-diagonal entries plus a random number in the range $[0, 20 \cdot D]$. We set D = 10, 5, 3, and 2 in these groups of experiments. Higher dimensionality led to clearer attribute clusters.
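The binary attribute construction described above can be sketched as follows. This is a minimal illustration rather than the generator used in the study: the function name `make_binary_attributes`, the random choice of each community's d shared attributes, and the decision to flip exactly `round(noise * D)` entries per node are all assumptions made for the example.

```python
import random

def make_binary_attributes(labels, D=100, d=10, noise=0.2, seed=0):
    """Attach a D-dimensional 0/1 vector to every node.

    Nodes with the same community label share the same d 'on' attributes;
    afterwards a fraction `noise` of each node's entries is flipped.
    """
    rng = random.Random(seed)
    # Assumption: each community's d shared attributes are drawn at random.
    signature = {c: set(rng.sample(range(D), d)) for c in sorted(set(labels))}
    n_flip = round(noise * D)  # assumption: exact number of flips per node
    X = []
    for c in labels:
        row = [1 if j in signature[c] else 0 for j in range(D)]
        for j in rng.sample(range(D), n_flip):
            row[j] = 1 - row[j]  # flip 0 <-> 1
        X.append(row)
    return X, signature
```

With `noise=0.0` all nodes of a community carry identical vectors; raising `noise` toward 0.5 makes the attribute clusters progressively harder to separate.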
We first compared six probabilistic generative models (PCL-DC, PPL-DC, PPSB-DC, BAGC, GBAGC, and CESNA) and seven hybrid methods (CODICIL, SA-cluster, Inc-cluster, GLFM, cluster-dp, kNN-nearest, and kNN-Kmeans) on the sample sets with binary attributes, where the noise ratio was in the range {10%, 20%, …, 50%} at μ = 0.7, 0.8, or 0.9. We also compared all algorithms to K-rank-D using only links and K-means using only attributes. Tables 1, 2, and 3 report the average results and standard deviations over 10 sample sets for each setting. In these tables, columns indicate the noise ratio of the LFR benchmarks, the three numbers in each cell are the average values and standard deviations of the three accuracy metrics (ACC, NMI, and PWF, defined by Equations (5)-(7) in the Methods section) of the corresponding algorithm, and the best-performing algorithm is marked in bold. Details of the parameter settings of the compared algorithms can be found in the Methods section. We did not include the results for PICS and BNPA in the tables because they did not converge to the real number of communities, which distorted the meaning of the accuracy metrics (ACC, NMI, and PWF).
Since only kNN-nearest, kNN-Kmeans, cohsMix, and cluster-dp can be used to cope with networks having numerical node attributes, we then compared these four algorithms on LFR benchmarks with numerical attributes at D = 10, 5, 3, or 2 when μ = 0.7, 0.8, or 0.9. We also compared these four algorithms with K-means using only numerical attributes (the results of K-rank-D using only links can be seen in Tables 1, 2, and 3).
We also tested the algorithms on LFR networks with 5000 nodes, 〈k〉 = 20, $k_{\max}$ = 50, $C_{\min}$ = 20, $C_{\max}$ = 100, γ = 2, and β = 1. Because there were too many testing samples and the results were similar to those of the first group of networks with 1000 nodes, we did not report the results of this group of experiments in the manuscript. Instead, to give a glimpse of the time complexity of the compared algorithms, we reported the time cost of each algorithm on a randomly generated sample containing 40% noise for binary attributes when {n = 1000, $C_{\min}$ = 10, …} in Table 7. All algorithms were run only once. Each number is the running time of the corresponding algorithm in seconds, '-' indicates that the time cost of the corresponding algorithm exceeded 48 hours, and '*' indicates that the algorithm ran out of memory. These experiments were performed on a laptop with an Intel 2.50 GHz processor and 4 GB of main memory running Windows 7.0. CESNA was implemented in C++, CODICIL was implemented in Python and C/C++, cohsMix was implemented in R, and the remaining algorithms were implemented in MATLAB.
From the data in Tables 1-7, we concluded that adding node attributes improves the performance of community detection in most cases. Taking the results of kNN-Kmeans as an example, most of the results were better than those of the basic K-rank-D algorithm on links and of K-means on attributes.
As Tables 1-3 show, in most cases kNN-nearest and kNN-Kmeans performed best among the 13 tested algorithms, including probabilistic generative models and hybrid methods, and they outperformed the other hybrid methods in all cases. Although kNN-nearest performed slightly worse than kNN-Kmeans, it was more efficient (see Table 7) since each node received its community label from the nearest node with higher centrality. In our experiments, kNN-Kmeans converged quickly because the community centers were carefully selected. In some cases, the probabilistic generative model PCL-DC displayed the best performance but ran too slowly to be used for processing large networks in real applications (see Table 7). CESNA and CODICIL also showed good performance in this group of experiments. Among the probabilistic methods, CESNA was the fastest algorithm. However, it was much slower than the majority of the hybrid heuristic methods. CODICIL ran quickly due to the fast graph partitioning program Metis, which was used to cut the networks into communities. GBAGC performed well because it used Metis on links to obtain the initial partition. Moreover, as Tables 4-6 show, both the kNN-nearest and the kNN-Kmeans algorithms allowed us to discover communities effectively in LFR networks with numerical attributes. In summary, this empirical study on LFR benchmarks demonstrates the flexibility, effectiveness, and efficiency of the kNN-enhance approach.
Experimental Results on Real Networks. In addition to our experiments using synthetic networks, we tested the algorithms on two groups of real networks. The nodes in the first group were associated with binary/categorical attributes, while those in the second group possessed numerical attributes. The first group of data sets included Cora 42 , Citeseer 42 , and DBLP10K 18 . Sinanet (https://github.com/smileyan448/Sinanet) and PubMed (http://linqs.umiacs.umd.edu/projects//projects/lbc/) belonged to the second group. These data sets are described in detail below.
The Cora data set consisted of machine learning papers. These papers were classified as belonging to one of the following seven classes: CBR (case-based reasoning), GA (genetic algorithms), NN (neural networks), PM (probabilistic methods), RL (reinforcement learning), rule learning, or theory. The papers were selected in such a way that in the final corpus every paper cited or was cited by at least one other paper. With each node representing a paper, there were 2,708 nodes and 5,429 citations. After stemming and removing stop-words and words with document frequency less than 10, the corpus retained a vocabulary of 1,433 unique words. Each paper was described by a 1,433-dimensional 0/1 vector indicating the absence/presence of the corresponding words from the dictionary of these unique words.
The Citeseer data set was also a citation network in the field of machine learning. These papers were classified into one of the following six classes: Agents, AI (artificial intelligence), DB (database), IR (information retrieval), ML (machine learning), and HCI (human-computer interaction). The papers were selected in the same way as in the Cora data set. There were 3,312 papers in the corpus and 4,732 citations between papers. A paper was described by a 0/1 word vector indicating the absence/presence of the corresponding words from the dictionary of 3,703 unique words.
The DBLP data set was a co-author network extracted from DBLP Bibliography data. This network contained 10,000 authors and their co-author relationships. These authors were distributed across four research fields: databases, data mining, information retrieval, and artificial intelligence. Each author was associated with two relevant attributes: prolific and primary topic. The attribute prolific had three possible values: authors with ≥20 publications were labeled as highly prolific, authors with ≥10 and <20 papers were labeled as prolific, and authors with <10 papers were labeled as low prolific. The attribute primary topic had 99 values. Each author was assigned a primary topic out of 99 extracted by a topic model from a collection of paper titles of the authors. For this data set, we did not know the exact number of communities or to which community a node belonged.
The Sinanet data set was a microblog user relationship network that we extracted from the sina-microblog website (http://www.weibo.com). We first selected 100 VIP sina-microblog users distributed across 10 major forums: finance and economics, literature and arts, fashion and vogue, current events and politics, sports, science and technology, entertainment, parenting and education, public welfare, and normal life. Starting from these 100 VIP sina-microblog users, we extracted the followees of these users and their published micro-blogs. Using a depth-first search strategy, we extracted three layers of user relationships and obtained 8,452 users, 147,653 user relationships, and 5.5 million micro-blogs in total. We merged all microblogs that a user published to characterize that user's interests 43 . After removing silent users (those who posted fewer than 5000 words), we were left with 3,490 users and 30,282 relationships. If we had used the word frequencies of a user's merged blogs to describe the user's interests, the dimension of the feature space would have been too high to be processed successfully. We therefore used each user's topic distribution over the 10 forums, obtained by the LDA topic model (http://gibbslda.sourceforge.net/), to describe users' interests. Thus, besides the followee relationships between pairs of users, we had 10-dimensional numerical attributes to describe the interests of each user. This data set is available at https://github.com/smileyan448/Sinanet.
The Diabetes data set consisted of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes: Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. These publications formed a citation network with 44,338 edges representing the citation relationships between pairs of publications. Further, each publication in the data set was described by a TF/IDF weighted word vector from a dictionary that consisted of 500 unique words.
The experimental results of the methods on Cora and Citeseer are shown in Table 8, where columns represent the data sets used in the evaluation, the cells of each row represent the values of ACC, NMI, and PWF for the corresponding algorithm, and the algorithm with the best performance is marked in bold in each of the two groups (probabilistic methods and hybrid methods). Because the DBLP network had two categorical attributes, we tested SA-cluster on the two category attributes with 3 values and 99 values, respectively. We named the result SA-cluster-cate. We also tested SA-cluster on DBLP when we viewed these 102 values as binary and named the result SA-cluster-bina (the same was done for Inc-cluster). Since no ground truth was available on DBLP, we reported Modularity and Entropy (defined by Equations (8)(9) in the Methods section) of all algorithms in Figs 3 and 4 at different K (K is the number of communities). In Figs 3 and 4, we do not show the results of PCL-DC, PPL-DC, and PPSB-DC since either the time or space complexity of these algorithms was too high to handle large networks like DBLP10k. The results of the methods for processing Sinanet and PubMed networks with numerical attributes are shown in Table 9, where '*' indicates that the algorithm ran out of memory on the corresponding data set. For the probabilistic methods PCL-DC, PPL-DC, PPSB-DC on Cora and Citeseer, and cohsMix on Sinanet, we ran the algorithms 10 times and reported the result with the largest likelihood. For K-means on attributes alone, we reported the best results of these networks over 10 runs. The details of the parameter settings for the compared algorithms in this group of experiments can be found in the Methods section.
We drew the following conclusions from Tables 8 and 9 and Figs 3 and 4: (1) According to Table 8, the probabilistic methods PCL-DC, PPL-DC, and PPSB-DC showed the best performance on the Cora and Citeseer data sets. However, the time cost of these methods was too high and they would not be appropriate for real applications. In contrast, the kNN-enhance approach achieved high accuracy in comparison to other hybrid methods and was much faster than the probabilistic methods PCL-DC, PPL-DC, and PPSB-DC (see Table 7). (2) According to Figs 3 and 4, the Entropy of kNN-Kmeans on DBLP was the lowest, especially when the number of communities K was larger than 200. The Modularity of kNN-enhance indicates that the partitioned network maintained community structure. Therefore, kNN-enhance was able to identify a clear community structure (large Modularity) with a high level of attribute homogeneity (low Entropy) in the network. (3) The kNN-enhance approach was capable of processing networks with numerical attributes (see Table 9). Even though the accuracy of cohsMix was higher than that of kNN-nearest and kNN-Kmeans on Sinanet, cohsMix ran much slower than kNN-enhance, and its results were selected from 10 runs on the Sinanet data. Moreover, cohsMix was not capable of dealing with a large network such as the one from the Diabetes data set due to its high memory usage when storing the similarity matrix and all hidden variables.
Table 8. Performance of compared algorithms on Cora and Citeseer.

Discussion
We have proposed a simple and flexible node attribute-enhanced community detection approach, kNN-enhance. This method first constructs the k nearest neighbor graph of node attributes and then merges the kNN graph with the original network. With this approach we were able to alleviate the sparsity of the original network, reduce noise effects, and strengthen the community structure of the original network. As a result, a clear community structure could be partitioned within the kNN-graph-enhanced network by a community detection algorithm such as K-rank-D. Our two implementations, kNN-nearest and kNN-Kmeans, achieved better performance than the existing state-of-the-art algorithms. Furthermore, the algorithms were able to deal with a network containing binary, categorical, or numerical attributes and could be easily extended to process large-scale networks.
Table 9. Performance of compared algorithms on Sinanet and PubMed Diabetes.
In the future we intend to test this approach on large-scale networks with millions of edges by combining fast approximate kNN graph construction algorithms (such as NN-Descent 36 with $O(n^{1.14})$ empirical cost) with fast community detection algorithms such as BGLL 38 and Infomap 39 . Moreover, besides strengthening the community structure of a network using node attributes, we plan to design a more effective method by removing some easily detected weakly linked edges from the network. In this study we were concerned with detecting community structures containing nodes with more links to each other than to nodes outside their communities. However, it has been observed that trees and tree-like networks have high modularity 44,45 , the classical objective function used to discover communities and to measure their strength 46 , and that many real-world networks have tree-like structures 47-49 . Existing methods use connections only to decompose a network into tree-like components. It is a challenging task to combine node attributes with topology to cluster nodes in a tree-like network into groups, and we will investigate whether our kNN-enhance approach is capable of partitioning attributed tree-like networks.

Methods
An attributed network is denoted by G = (V, E, X), where V is the set of n nodes, E is the set of edges, and $X = \{x_1, x_2, \ldots, x_n\}$ is a set of vectors, each of which denotes the values of D attributes associated with a node i. We call this an 'attributed network' or 'attributed graph'. Community detection in an attributed network involves partitioning nodes into clusters such that nodes in the same cluster are not only densely connected to each other but also exhibit a high level of attribute homogeneity.

An Active Method for Community Detection in Networks.
cluster-dp is a recently developed clustering algorithm similar to the K-means method 34 . The algorithm assumes that cluster centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any data point with a higher local density. Therefore, for each data point i, two quantities, the local density $\rho_i$ and the distance $\delta_i$ from points of higher density, are defined as follows to quantify the likelihood of a data point being a cluster center:
$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \qquad \delta_i = \min_{j:\,\rho_j > \rho_i} d_{ij}, \tag{1}$$
where $d_{ij}$ is the distance between data points i and j, χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, $d_c$ represents the cutoff distance, and $\delta_i = \max_j d_{ij}$ for the point with the highest density.
If we scatter all data points on a decision graph drawn from their values of $\rho_i$ and $\delta_i$ for all $i \in \{1, 2, \ldots, n\}$, the cluster centers tend to occupy the upper right part of the graph. After cluster centers with both relatively large $\rho_i$ and $\delta_i$ are manually selected on the decision graph, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. This allows cluster-dp to uncover the cluster structure of data points by actively determining the number of clusters and the cluster centers.
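The two decision-graph quantities of cluster-dp can be computed directly from a distance matrix. The sketch below is illustrative only: the function name `density_peaks` and the list-of-lists input format are assumptions, and points that tie for the highest density are all treated as "highest density" points.

```python
def density_peaks(dist, d_c):
    """Return (rho, delta) for a symmetric distance matrix (list of lists).

    rho[i]   : number of other points closer to i than the cutoff d_c
    delta[i] : distance from i to the nearest point of higher density,
               or the largest distance from i if no denser point exists
    """
    n = len(dist)
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Six points on a line forming two groups: {0, 1, 2} and {10, 11, 12}.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
rho, delta = density_peaks(dist, d_c=1.5)
# rho   -> [1, 2, 1, 1, 2, 1]
# delta -> [1, 11, 1, 1, 11, 1]
```

The middle point of each group has both the largest density and the largest distance to a denser point, so these two points would sit in the upper right of the decision graph and be chosen as cluster centers.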
However, the following issues exist: (1) When the cluster structure is not clear (i.e., there is not a distinguished boundary between cluster centers and other data points on the decision graph), it is difficult to obtain the correct number of clusters and cluster centers. This leads to poor partitioning. (2) The parameter d c must be tuned in many cases, and it is usually difficult to know which parameter value is best. (3) The input for cluster-dp is a distance matrix. The quality of the matrix has a strong effect on the clustering result. When the algorithm is used to discover community structure in a network, the topological structure implied in the network is not fully utilized.
In a network, we suppose that community centers are (1) influential and surrounded by less influential nodes, and (2) located far from each other in the network. Therefore, we proposed K-rank-D 33 , which uses two quantities, $v_i$ and $\delta_i$, to describe the centrality and the dispersion of each node i, respectively. The centrality vector v is calculated efficiently using PageRank 50 centrality (Equation (2)), where β is the restart probability (fixed at 0.15), e is the unit matrix, $v_0$ is an n-dimensional unit vector, and $v_t$ is normalized to 1 in each iteration. The dispersion of a node i with respect to other nodes with higher centrality is defined by
$$\delta_i = \min_{j:\, v_j > v_i} d_{ij},$$
with $\delta_k = \max_j d_{kj}$ for the node k with the highest centrality. Here $d_{ij}$ is the structural distance between nodes i and j. It is computed using the Euclidean distance $\lVert \cdot \rVert_2$ after τ-step signal propagation 51 , that is, $d_{ij} = \lVert S_i - S_j \rVert_2$, where τ = 3 in the implementation and $S_i$ is the i-th row of the signal matrix S. When the community structure of a network is fuzzy, we define a comprehensive value for each node i (Equation (4)). The top K nodes with the highest comprehensive value can then be automatically selected as the initial centers of K-rank-D.
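To make the centrality step concrete, the sketch below implements the textbook PageRank power iteration with a restart probability β = 0.15. It is an assumption that this standard form matches the exact update in Equation (2); the function name `pagerank_restart`, the adjacency-dictionary input, and the handling of isolated nodes are illustrative choices.

```python
def pagerank_restart(adj, beta=0.15, iters=100, tol=1e-10):
    """Textbook PageRank with restart on an undirected graph.

    adj  : dict mapping each node to a list of its neighbours
    beta : restart probability
    Returns a dict of scores that sum to 1.
    """
    nodes = list(adj)
    n = len(nodes)
    v = {u: 1.0 / n for u in nodes}              # uniform start vector
    for _ in range(iters):
        nxt = {u: beta / n for u in nodes}       # restart mass
        for u in nodes:
            if adj[u]:
                share = (1.0 - beta) * v[u] / len(adj[u])
                for w in adj[u]:
                    nxt[w] += share
            else:
                # Assumption: an isolated node spreads its mass uniformly.
                for w in nodes:
                    nxt[w] += (1.0 - beta) * v[u] / n
        total = sum(nxt.values())                # renormalize each iteration
        nxt = {u: s / total for u, s in nxt.items()}
        diff = sum(abs(nxt[u] - v[u]) for u in nodes)
        v = nxt
        if diff < tol:
            break
    return v
```

On a star graph, for instance, the hub receives the largest score and all leaves receive equal scores, which is the behavior expected of an "influential node surrounded by less influential nodes".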
Scientific Reports | 7: 2626 | DOI:10.1038/s41598-017-02751-8
kNN-enhance: a Node Attribute-enhanced Community Detection Approach. Given an attributed network G = (V, E, X), we first construct the kNN graph of node attributes. The kNN graph for a set of nodes V is a directed graph with vertex set V and an edge from each v ∈ V to its k most similar objects in V under a given similarity measure on attributes. For all $x_i, x_j \in X$, the cosine similarity $\frac{x_i \cdot x_j^T}{\lVert x_i \rVert \, \lVert x_j \rVert}$ is used to compute the similarity of a pair of nodes with binary attributes, and $1 - \overline{\lVert x_i - x_j \rVert}$ is used to compute the similarity of a pair of nodes with numerical attributes, where $\overline{\lVert x_i - x_j \rVert}$ is the normalization of the Euclidean distance between $x_i$ and $x_j$. We then add the kNN graph of attributes to the original network. For each edge of the kNN graph, if it is not already present in the original network, we add it to the original network; otherwise, we keep the existing edge in the original network unchanged.
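The construction just described (build the attribute kNN graph, then add its edges to the original network) can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not the authors' implementation: the names `cosine` and `knn_enhance` are hypothetical, kNN edges are treated as undirected when merged, and ties in similarity are broken by node index.

```python
import math

def cosine(a, b):
    """Cosine similarity of two attribute vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_enhance(edges, X, k, sim=cosine):
    """Merge the attribute kNN graph into the original edge set.

    edges : iterable of undirected node pairs (i, j) with integer node ids
    X     : list of attribute vectors, X[i] belonging to node i
    Returns a set of undirected edges stored as (min, max) tuples.
    """
    n = len(X)
    enhanced = {(min(i, j), max(i, j)) for i, j in edges}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim(X[i], X[j]), reverse=True)
        for j in ranked[:k]:
            # Existing edges are left unchanged; only new edges are added.
            enhanced.add((min(i, j), max(i, j)))
    return enhanced
```

The all-pairs similarity search costs O(n²) comparisons, so for large networks it would be replaced by an approximate kNN graph construction method.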
After the kNN-enhanced network is established, we use the K-rank-D method introduced above to perform node clustering. In addition to K-rank-D, we employ two node assignment strategies after selecting K community centers from the decision graph. kNN-nearest uses the cluster-dp strategy 34 , which involves assigning each remaining node to the same cluster as its nearest neighbor of higher PageRank centrality computed by Equation (2). kNN-Kmeans uses the strategy of the K-means method, where the input is the data matrix 51 . It iteratively updates its community centers. It should be pointed out that kNN-nearest and kNN-Kmeans are just two implementations of the kNN-enhance approach. Fast approximate kNN graph construction methods 35-37 and highly efficient community detection algorithms 38,39 can be combined to process large-scale networks.

Metrics for Evaluating Algorithm Quality.
In this study, we use two groups of metrics to evaluate the performance of each algorithm. The first group includes ACC (Accuracy), NMI (Normalized Mutual Information), and PWF (Pairwise F-Measure) 9,11 . These are commonly used to evaluate an algorithm running on a data set with ground truth. Larger values indicate better algorithm performance. The other group consists of Modularity 38,46 and Entropy 13,18 . Modularity is used to measure the quality of communities in a network, and a larger Modularity value indicates better partition quality. Entropy is used to measure the degree of attribute consistency in a community, and a lower Entropy value indicates greater consistency. These metrics are often used when an algorithm is run on a network without ground truth. Formal definitions are provided below.
ACC. Given node i, $l_{pi}$ is the node label assigned by an algorithm and $l_{ti}$ is its true label. The accuracy is defined by
$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\big(l_{ti}, P_{map}(l_{pi})\big)}{n}, \tag{5}$$
where δ(·) is the Kronecker function, $P_{map}(l_{pi})$ is a permutation mapping function that maps the label $l_{pi}$ to its corresponding label $l_{ti}$ in the ground truth, and n is the total number of nodes in the network.
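A small sketch of the ACC computation follows. It assumes that the permutation mapping is the one-to-one assignment of predicted labels to true labels that maximizes the number of matches, and it finds that assignment by brute force over permutations, which is only practical for a handful of communities (an assignment solver such as the Hungarian algorithm is the usual choice for larger K). The function name `accuracy` is hypothetical.

```python
from itertools import permutations

def accuracy(true_labels, pred_labels):
    """ACC under the best one-to-one mapping of predicted to true labels."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so surplus predicted labels can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(true_labels, pred_labels)
                   if mapping[p] == t)
        best = max(best, hits)
    return best / len(true_labels)
```

For example, a prediction that is correct up to a renaming of the community labels scores 1.0, since the mapping absorbs the renaming.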
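NMI. The normalized mutual information is defined by
$$\mathrm{NMI} = \frac{-2 \sum_{i} \sum_{j} n_{ij} \log\!\left(\frac{n_{ij}\, n}{n_i^{C}\, n_j^{C'}}\right)}{\sum_{i} n_i^{C} \log\!\left(\frac{n_i^{C}}{n}\right) + \sum_{j} n_j^{C'} \log\!\left(\frac{n_j^{C'}}{n}\right)}, \tag{6}$$
where $n_{ij}$ is the number of nodes in the ground truth community $C_i$ that are assigned to the computed community $C'_j$, $n_i^{C}$ is the number of nodes in the ground truth community $C_i$, and $n_j^{C'}$ is the number of nodes in the computed community $C'_j$.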
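Using the contingency counts described above, NMI can be computed as in the sketch below. It assumes the widely used normalization in which twice the mutual information is divided by the sum of the two partition entropies; the function name `nmi` and the convention of returning 1.0 when both partitions consist of a single community are illustrative choices.

```python
import math
from collections import Counter

def nmi(true_labels, pred_labels):
    """Normalized mutual information between two partitions of the same nodes."""
    n = len(true_labels)
    n_ij = Counter(zip(true_labels, pred_labels))  # overlap counts
    n_i = Counter(true_labels)                     # ground truth sizes
    n_j = Counter(pred_labels)                     # computed sizes
    num = -2.0 * sum(c * math.log(c * n / (n_i[a] * n_j[b]))
                     for (a, b), c in n_ij.items())
    den = (sum(c * math.log(c / n) for c in n_i.values())
           + sum(c * math.log(c / n) for c in n_j.values()))
    # Assumption: two trivial one-community partitions count as identical.
    return num / den if den != 0 else 1.0
```

Identical partitions give a value of 1 and statistically independent partitions give 0, regardless of how the community labels are named.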
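PWF. Let T denote the set of node pairs that belong to the same community in the ground truth and W denote the set of node pairs assigned to the same community by a given algorithm. PWF is defined as follows:
$$\mathrm{precision} = \frac{|W \cap T|}{|W|}, \qquad \mathrm{recall} = \frac{|W \cap T|}{|T|}, \qquad \mathrm{PWF} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \tag{7}$$
Modularity. The Modularity of a partition is defined by
$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j), \tag{8}$$
where $A = [A_{ij}]$ is the adjacency matrix of the network, m is the number of edges, $k_i$ is the degree of node i, δ(·,·) is the Kronecker function, and $c_i$ is the community to which node i belongs.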
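Both PWF and Modularity are short to compute for small examples. The sketch below assumes the pairwise reading of PWF (precision and recall over pairs of nodes placed in the same community) and the standard Newman form of Modularity for an undirected, unweighted graph without self-loops; the function names are hypothetical, and the Modularity routine uses the equivalent per-community form rather than the double sum over node pairs.

```python
from itertools import combinations

def pairwise_f(true_labels, pred_labels):
    """Pairwise F-measure over node pairs grouped together."""
    idx = range(len(true_labels))
    T = {(i, j) for i, j in combinations(idx, 2)
         if true_labels[i] == true_labels[j]}
    W = {(i, j) for i, j in combinations(idx, 2)
         if pred_labels[i] == pred_labels[j]}
    if not T or not W:
        return 0.0
    precision = len(T & W) / len(W)
    recall = len(T & W) / len(T)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def modularity(edges, labels):
    """Newman modularity; labels[i] is the community of integer node i."""
    m = len(edges)
    degree, intra = {}, 0
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if labels[u] == labels[v]:
            intra += 1                      # edge inside a community
    deg_sum = {}
    for node, k in degree.items():
        deg_sum[labels[node]] = deg_sum.get(labels[node], 0) + k
    # Q = sum over communities of (intra-edge fraction - (degree fraction)^2)
    return intra / m - sum((s / (2 * m)) ** 2 for s in deg_sum.values())
```

For two triangles joined by a single edge and split into their natural two communities, `modularity` evaluates to 6/7 − 1/2, since six of the seven edges are internal and each community holds half of the total degree.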
Entropy. Given a network with n nodes, we suppose that each node is associated with D attributes $(a_1, a_2, \ldots, a_D)$ and that the nodes can be partitioned into K communities. Let $n_c$ be the number of nodes in the c-th community and $p_{ic}$ be the fraction of nodes in the c-th community taking attribute $a_i$. The total Entropy of attributes in communities is then given by Equation (9). Entropy measures the homogeneity of communities and their shared attributes.
Parameter Settings on Synthetic and Real-world Networks. As mentioned above, the PCL-DC, PPL-DC, and PPSB-DC methods are sensitive to initial values. For the experiments on these algorithms using synthetic attributed networks, we ran the algorithms 10 times on each sample set, selected the best result determined by maximum likelihood, and then reported the average results and standard deviations on 10 samples. We set the max iteration number and the convergence threshold of PCL-DC, PPL-DC, and PPSB-DC to 2000 and $10^{-8}$, respectively. We set the regularization coefficient λ = 1 for PCL-DC and CESNA and λ = 0.1 for PPL-DC and PPSB-DC since they perform best when λ is set accordingly. Similarly, for K-means on attributes, we ran it 10 times on each sample set (since it is sensitive to its initial values), selected the result with the highest accuracy, and then reported the average results and standard deviations on 10 samples. For BAGC and GBAGC, the max iteration number was set to 10. For CODICIL, we set K = 30, 50, and 70 and selected the setting with the highest accuracy. We used cosine similarity to compute both the similarity of node attributes and that of link structure for cluster-dp. We set the weight α of attribute similarity and link similarity to 0.5 for cluster-dp and CODICIL since it was difficult to tune the weight adaptively for each sample. We set D = 50 for GLFM. For kNN-nearest and kNN-Kmeans, we set k = 10 because we wanted only to strengthen the community structure of the original network, so a small k is sufficient. We used default values for algorithm parameters not mentioned above. Similarly, for the probabilistic method cohsMix, we set the max iteration number to 200, chose the best result of cohsMix determined by maximum likelihood among 10 runs for each sample set, and then reported the average values and the standard deviations on 10 samples of each test setting.
In the group of experiments on real-world networks, we used the same parameter settings as in the original method publications in nearly all cases. We set λ = 5 for PCL-DC, PPL-DC, and PPSB-DC and λ = 1 for CESNA because these settings produced the best performance. We set the max iteration number to 10 for BAGC and GBAGC. We chose K = 50 for CODICIL because it resulted in the best performance among the options K ∈ {30, 50, 70}. We set D of GLFM to 20. The weight between link structure and node attributes was 0.5 for cluster-dp and CODICIL. The max iteration number of cohsMix was 200. The parameter k of the kNN attribute graph was also 10 for all real networks, since we used node attributes only to make up for the sparsity of the original network and strengthen its community structure. The exception was the Diabetes data set, for which we set k = 60 because there were only 3 large communities with thousands of nodes and a larger k provided better performance.