Reduced network extremal ensemble learning (RenEEL) scheme for community detection in complex networks

Guo, Jiahao; Singh, Pramesh; Bassler, Kevin E.

doi:10.1038/s41598-019-50739-3

Published: 02 October 2019

Jiahao Guo^1,2,
Pramesh Singh^1,2 &
Kevin E. Bassler^1,2,3

Scientific Reports volume 9, Article number: 14234 (2019) Cite this article

Abstract

We introduce an ensemble learning scheme for community detection in complex networks. The scheme uses a Machine Learning algorithmic paradigm we call Extremal Ensemble Learning. It uses iterative extremal updating of an ensemble of network partitions, which can be found by a conventional base algorithm, to find a node partition that maximizes modularity. At each iteration, core groups of nodes that are in the same community in every ensemble partition are identified and used to form a reduced network. Partitions of the reduced network are then found and used to update the ensemble. The smaller size of the reduced network makes the scheme efficient. We use the scheme to analyze the community structure in a set of commonly studied benchmark networks and find that it outperforms all other known methods for finding the partition with maximum modularity.

Introduction

Among the most basic and important problems in Network Science is to find the structure within a network^1,2. One way of doing this is to find the community, or modular, structure of the nodes. In many real-world networks, the community structure has been found to control much of their dynamical or functional behavior. Although there are many possible definitions of community^3,4, a commonly used definition assumes that a community is a group of nodes that are more densely connected than what would occur randomly. This intuitively appealing concept of community can be used to define a metric, called Modularity Q, that quantifies the extent to which a partition of the nodes of a network is modular². The community structure of a given network can then be obtained by finding the partition of the network’s nodes that has the maximum modularity Q_max. Finding this partition, however, is an NP-hard problem⁵. It is of considerable interest and importance to develop an algorithm that completes in polynomial time and robustly finds an accurate solution to this optimization problem. The accuracy of a solution can be measured by how close the value Q of the partition found is to the value of Q_max. Any solution provides a lower bound estimate of the value of Q_max. Thus, the higher a solution’s value of Q is, the more accurate it and its estimate of Q_max are.

A number of polynomial time complexity algorithms for finding a network partition that enables Q_max to be estimated have been proposed. Some are quite fast, such as random greedy agglomeration^6,7,8 and the Louvain method⁹. These algorithms, however, don’t generally find very accurate solutions. Far more accurate solutions can generally be found with spectral clustering algorithms^10,11 that iteratively bisect the set of nodes. The most accurate algorithm of this type¹² combines bi-sectioning based on the eigenvector of largest eigenvalue of the modularity matrix¹⁰, tuning with generalized Kernighan–Lin refinements^13,14, and agglomeration. Until recently this was the most accurate algorithm known. Virtually all algorithms for maximizing modularity are partially stochastic, as they make random choices at intermediate steps among what are seemingly equivalent options at that point. These choices can affect the final partition, and, thus, different runs can produce different partitions. Because of this, to find the partition that provides the best estimate of the maximum modularity, algorithms are often run multiple times to produce an ensemble of partitions and the best of those partitions is chosen.

It has, however, recently been demonstrated that partitions with even more accurate estimates of Q_max can be obtained with a scheme that uses information contained within an ensemble of partitions generated with conventional algorithms. This idea is known as ensemble learning. Its use distinguishes a new class of modularity maximizing algorithms^15,16. An ensemble learning scheme known as Iterative Core Group Graph Clustering (CGGCi)¹⁷ was the most accurate algorithm for finding the network partition that maximizes modularity in the 10th DIMACS Implementation Challenge¹⁸. The CGGCi scheme starts with an ensemble of partitions obtained by using a conventional “base algorithm” and identifies “core groups” of nodes that are grouped together in the same community in every partition in the ensemble. It then transforms the original network into a weighted reduced network by collapsing each of these core groups into a single “reduced” node and summing all link weights between original nodes to assign weights to the links between the reduced nodes. A base algorithm is then used to find an ensemble of partitions of the reduced network, and that ensemble is used to find a new reduced network. This procedure is iterated until no further improvement in Q is found. The best partition of the final reduced network is then mapped back onto the original network to identify the communities.

In this paper, we introduce a different ensemble learning scheme for network community detection. It uses an algorithmic paradigm we call Extremal Ensemble Learning (EEL). Our scheme, which we refer to as Reduced Network Extremal Ensemble Learning (RenEEL), starts with an ensemble of partitions obtained using a conventional base algorithm, and then iteratively updates the partitions in the ensemble until a consensus about which partition is best is reached within the ensemble. To find the partitions used to update the ensemble efficiently, core groups of nodes are identified and used to form a reduced network that is partitioned using a base algorithm. RenEEL then uses a partition of the reduced network to update the ensemble through extremal updating. We will show that an algorithm using the RenEEL scheme improves the quality of community structure discovered, especially for larger networks for which estimating the partition with Q_max becomes challenging. Testing our scheme on a wide range of real-world and synthetic benchmark networks, we show that it outperforms all other existing methods, consistently finding partitions with the highest values of Q ever discovered.

Methods

Community detection via modularity maximization

Modularity Q is a metric that quantifies the amount of modular structure there is in a given partition of a network’s nodes into disjoint communities P = {c₁, c₂, …, c_r}, where c_i is the ith community of nodes and r is the number of communities. It is defined as²

$$Q=\sum_{i}\left[\frac{m_{i}}{m}-\left(\frac{2m_{i}+e_{i}}{2m}\right)^{2}\right] \tag{1}$$

where the sum is over communities, m_i and e_i are respectively the number of internal and external links of community c_i, and m is the total number of links in the network. The first term in Eq. 1 is the fraction of links inside communities, and the second term is the expected fraction if all links of the network were randomly placed. For a weighted network, m_i, e_i and m are sums of link weights instead of numbers of links. Modularity measures the deviation of the structure of a network partition from that expected in a random null model. The community structure of a network corresponds to the partition P of its nodes that maximizes Q. The number of communities in P is free to vary. The challenge of detecting the community structure of a network, therefore, is to find the partition with the maximum modularity Q_max.
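As an illustration of Eq. 1, the following minimal Python sketch computes Q for an undirected network given as a list of links. The function name `modularity`, the edge-list input format, and the `community_of` dictionary are assumptions made for this example and are not taken from the paper. A self-loop is counted as internal weight, which is the convention needed for the reduced networks described in the next section.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of Eq. 1 for an undirected network.

    edges: iterable of (u, v) or (u, v, weight) tuples (assumed input format).
    community_of: dict mapping each node to its community label.
    """
    internal = defaultdict(float)  # m_i: link weight inside community i
    external = defaultdict(float)  # e_i: link weight leaving community i
    m = 0.0                        # total link weight of the network
    for edge in edges:
        u, v = edge[0], edge[1]
        w = edge[2] if len(edge) > 2 else 1.0
        m += w
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            internal[cu] += w
        else:
            external[cu] += w
            external[cv] += w
    return sum(
        internal[c] / m - ((2 * internal[c] + external[c]) / (2 * m)) ** 2
        for c in set(community_of.values())
    )


# Two triangles joined by a single link, split into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, partition)
```

For this example each community has m_i = 3 and e_i = 1 with m = 7, so Eq. 1 gives Q = 2(3/7 − 1/4) = 5/14. Placing all nodes in a single community gives Q = 0.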

Reduced networks

To find a reduced network G′ starting from a network G and an ensemble of partitions of it ${\mathscr{P}}$, we first identify the core groups in G. A core group is a set of nodes that are found together in the same community in every partition in the ensemble. Any node that is not found in the same community with some other node in every partition in ${\mathscr{P}}$ is itself a core group. G′ is then formed by collapsing core groups of nodes into single reduced nodes and combining their links to other nodes by summing their weights. An example of this is shown in Fig. 1. Each circle containing multiple nodes of G that are colored the same in Fig. 1(a) denotes a core group. Two nodes that do not belong to any circle are shown in black and dark green. The core groups are collapsed to reduced nodes of the same color in the reduced network G′ shown in Fig. 1(b). The link weights in the reduced network are the sum of link weights between core groups in the original network. The weighted self-loops in G′ result from the total internal weights of the core groups in G.
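The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `core_groups` and `reduce_network`, the representation of partitions as dictionaries, and the edge-list format are choices made for this example. Two nodes belong to the same core group exactly when they carry the same sequence of community labels across all partitions in the ensemble, so that sequence can serve as a key.

```python
from collections import defaultdict


def core_groups(nodes, ensemble):
    """Assign a core-group id to every node.

    ensemble: list of partitions, each a dict node -> community label.
    Nodes with identical labels in every partition share a core group.
    """
    ids = {}       # label signature -> core-group id
    group_of = {}
    for node in nodes:
        signature = tuple(partition[node] for partition in ensemble)
        group_of[node] = ids.setdefault(signature, len(ids))
    return group_of


def reduce_network(edges, group_of):
    """Collapse core groups into reduced nodes, summing link weights.

    Returns a dict {(a, b): weight} with a <= b; entries with a == b are
    the weighted self-loops holding the internal weight of a core group.
    """
    reduced = defaultdict(float)
    for edge in edges:
        u, v = edge[0], edge[1]
        w = edge[2] if len(edge) > 2 else 1.0
        a, b = sorted((group_of[u], group_of[v]))
        reduced[(a, b)] += w
    return dict(reduced)


# Two triangles joined by the link (2, 3); the two partitions disagree on node 2.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
ensemble = [
    {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"},
    {0: "a", 1: "a", 2: "b", 3: "b", 4: "b", 5: "b"},
]
group_of = core_groups(range(6), ensemble)
reduced = reduce_network(edges, group_of)
```

Here nodes 0 and 1 form one core group, node 2 is a core group by itself, and nodes 3, 4 and 5 form a third, so the six-node network reduces to three nodes.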

Reduced network extremal ensemble learning scheme

The RenEEL scheme is summarized in the flowchart shown in Fig. 2 and is described as follows. First, an ensemble ${\mathscr{P}}$ of at most k_max partitions $P$ of the network G is obtained from multiple runs of a base algorithm. The base algorithm can be, for example, any of the conventional ones that have been developed to find a partition to estimate Q_max. Alternatively, a set of base algorithms can be used to find ${\mathscr{P}}$. The partitions in ${\mathscr{P}}$ are then ordered according to their modularity values, from the one with the largest value P_best to the one with the smallest value P_worst. Next, the core groups of nodes in the ensemble ${\mathscr{P}}$ are identified and used to construct the reduced network G′. An ensemble ${\mathscr{P}}^{\prime} $ consisting of k′ partitions P′ of G′ is then obtained using a base algorithm. The base algorithm used for this step can either be the same as or different from the base algorithm used to find ${\mathscr{P}}$. The steps in which a base algorithm is used to find the ensembles ${\mathscr{P}}$ and ${\mathscr{P}}^{\prime} $ are shown in red in Fig. 2(a). The partition in ${\mathscr{P}}^{\prime} $ with the largest modularity value ${P^{\prime} }_{{\rm{best}}}$ is then identified and used to perform an extremal update of ensemble ${\mathscr{P}}$. This step is shown in blue in Fig. 2(a) and detailed in Fig. 2(b). If Q(${P^{\prime} }_{{\rm{best}}}$) > Q(P_worst), then ${P^{\prime} }_{{\rm{best}}}$ is expanded into a partition of G and either used in place of P_worst in ${\mathscr{P}}$ (if k = k_max) or added to the ensemble ${\mathscr{P}}$ (if k < k_max) as shown in Fig. 2(b). In doing so ${\mathscr{P}}$ is enriched with a better quality partition. However, it is possible that at any iteration either ${P^{\prime} }_{{\rm{best}}}$ is already contained in ${\mathscr{P}}$, or Q(${P^{\prime} }_{{\rm{best}}}$) ≤ Q(P_worst). In both cases, in order to move toward consensus within ${\mathscr{P}}$, its current size k is reduced by 1 by deleting P_worst from it. This procedure is repeated until there is only one partition left in the ensemble ${\mathscr{P}}$. This consensus partition is the partition with the largest modularity found. It can be used to identify the communities of the network, and its modularity Q_best estimates Q_max.
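The extremal updating loop described above can be sketched in Python as follows. This is a simplified sketch under several assumptions, not the authors' implementation: `random_greedy` is a hypothetical toy stand-in for a base algorithm (it is not the randomized greedy algorithm of ref. 8), networks are lists of weighted links `(u, v, w)` with no isolated nodes, duplicate partitions are dropped from the initial ensemble, and a small numerical tolerance is added to the comparison with Q(P_worst).

```python
import random
from collections import defaultdict

def modularity(edges, label):
    """Eq. 1 for weighted links (u, v, w); a self-loop counts as internal weight."""
    m = sum(w for _, _, w in edges)
    inside, degree = defaultdict(float), defaultdict(float)
    for u, v, w in edges:
        degree[label[u]] += w
        degree[label[v]] += w
        if label[u] == label[v]:
            inside[label[u]] += w
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def random_greedy(edges, rng):
    """Hypothetical toy base algorithm: greedy merges tried in random order."""
    m = sum(w for _, _, w in edges)
    label = {x: x for u, v, _ in edges for x in (u, v)}
    improved = True
    while improved:
        improved = False
        degree, between = defaultdict(float), defaultdict(float)
        for u, v, w in edges:
            a, b = label[u], label[v]
            degree[a] += w
            degree[b] += w
            if a != b:
                between[(a, b)] += w
                between[(b, a)] += w
        order = sorted(degree)
        rng.shuffle(order)
        for a in order:
            # Change in Q from merging community a with each neighbor b.
            gains = {b: w / m - degree[a] * degree[b] / (2 * m * m)
                     for (x, b), w in between.items() if x == a}
            if gains and max(gains.values()) > 0:
                target = max(gains, key=gains.get)
                for node in label:
                    if label[node] == a:
                        label[node] = target
                improved = True
                break
    return label

def reduce_network(edges, ensemble):
    """Collapse core groups (identical labels in every partition) into nodes."""
    ids, group = {}, {}
    for node in sorted({x for u, v, _ in edges for x in (u, v)}):
        signature = tuple(p[node] for p in ensemble)
        group[node] = ids.setdefault(signature, len(ids))
    weights = defaultdict(float)
    for u, v, w in edges:
        weights[tuple(sorted((group[u], group[v])))] += w
    return group, [(a, b, w) for (a, b), w in weights.items()]

def canonical(partition):
    """Label-independent form of a partition, used to detect duplicates."""
    groups = defaultdict(set)
    for node, c in partition.items():
        groups[c].add(node)
    return frozenset(frozenset(g) for g in groups.values())

def reneel(edges, base_algorithm, k_max=10, k_prime=3, rng=None):
    rng = rng or random.Random(0)
    unique = {}
    for _ in range(k_max):
        p = base_algorithm(edges, rng)
        unique[canonical(p)] = p
    ensemble = list(unique.values())
    while len(ensemble) > 1:
        ensemble.sort(key=lambda p: modularity(edges, p), reverse=True)
        group, reduced = reduce_network(edges, ensemble)
        trials = [base_algorithm(reduced, rng) for _ in range(k_prime)]
        best_reduced = max(trials, key=lambda p: modularity(reduced, p))
        expanded = {node: best_reduced[g] for node, g in group.items()}
        is_new = canonical(expanded) not in {canonical(p) for p in ensemble}
        # The 1e-12 tolerance is an assumption guarding against rounding ties.
        is_better = modularity(edges, expanded) > modularity(edges, ensemble[-1]) + 1e-12
        if is_new and is_better:
            if len(ensemble) == k_max:
                ensemble[-1] = expanded    # replace P_worst
            else:
                ensemble.append(expanded)  # enrich the ensemble
        else:
            ensemble.pop()                 # delete P_worst
    return ensemble[0]
```

Because P_worst is the only partition ever deleted or replaced, the partition returned by this sketch has a modularity at least as large as that of every partition in the initial ensemble.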

Computational complexity and practical implementation

The most computationally complex and time consuming steps of the RenEEL scheme are those that use a base algorithm to find an ensemble of partitions. These steps are colored in red in the flowchart in Fig. 2. Assuming that the sizes of the ensembles ${\mathscr{P}}$ and ${\mathscr{P}}^{\prime} $ are fixed, the computational complexity of executing these steps is simply a fixed multiple of the computational complexity of the base algorithm used. The scaling of the computational complexity of base algorithms is typically between ${\mathscr{O}}$(n²) and ${\mathscr{O}}$(n³), where n is the number of nodes in the network. All other steps of the scheme have lower complexity; the steps of network reduction, colored purple in Fig. 2, and network expansion both have a computational complexity that scales as ${\mathscr{O}}$(n²), and the rest all have a computational complexity that is ${\mathscr{O}}$(1). Thus, since each iteration of the scheme has only one step that uses the base algorithm a fixed number of times, each iteration has a computational complexity that scales the same as that of the base algorithm used. As the scheme progresses, however, the size of the reduced network monotonically decreases, significantly increasing the speed of later iterations.

A RenEEL algorithm applied to a finite network is sure to complete since new partitions are added to the ensemble ${\mathscr{P}}$ only if they have a modularity that is greater than Q(P_worst) and the size of ${\mathscr{P}}$ is bounded. However, it is difficult to determine the precise scaling of the number of iterations required in general for an algorithm implementing the scheme to complete, as it depends on the structure of the specific network under consideration. For the networks we analyzed, the number of iterations required was approximately proportional to k_max. Thus, we find empirically that the overall complexity of a RenEEL algorithm scales roughly as the base algorithm times k′ times k_max.

The base algorithm used to obtain the results presented in this paper is a randomized greedy agglomerative hierarchical clustering algorithm⁸. It is commonly used to find the community structure in complex networks¹⁷ and has an expected time complexity that scales as ${\mathscr{O}}$(m ln n)⁸, where m is the number of links in the network. There can be, at most, ${\mathscr{O}}$(n²) links, so the base algorithm scales at worst as ${\mathscr{O}}$(n² ln n). The overall complexity of the algorithm used here thus scales approximately as ${\mathscr{O}}$(k_max k′ n² ln n). The particular choice of parameters k_max and k′ is important for the quality of community structure as well as the computational time. In general, higher k′ and k_max yield higher Q_best.

Co-clustering analysis

In order to visualize the evolution of the clustering results in the RenEEL scheme, co-clustering matrices at various stages of the scheme are shown in Fig. 3. In Fig. 4 the results of the core group co-clustering at the different stages are combined to show their evolution. A co-clustering matrix S is a matrix whose elements s_ij are defined as the fraction of times node i and node j are in the same community in an ensemble of partitions ${\mathscr{P}}$. The order of the nodes in Figs 3 and 4 was determined using simulated annealing to optimize the block-diagonal structure of the matrices. Starting from a random ordering of the nodes, their order was rearranged to minimize a cost function, or “Hamiltonian”, that is a function of the minimum distance d_ij of matrix element (i, j) from the diagonal, assuming periodic boundary conditions on the order:

$$H=\sum_{i<j} s_{ij}\,d_{ij}^{\alpha}, \tag{2}$$

where α is an adjustable exponent that controls the non-linear dependence of H on d_ij. The results in Figs 3 and 4 were obtained using α = 3. Simulated annealing seeks to find the order of nodes that minimizes H. For the Monte Carlo updates in our simulated annealing, Metropolis rates¹⁹ with Boltzmann factor $e^{-\Delta H/T}$ were used. Starting from a relatively high temperature where the order of the nodes is random, the temperature was systematically lowered each Monte Carlo step until the node order stabilized.
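A minimal Python sketch of the co-clustering matrix, the cost function of Eq. 2, and a Metropolis annealing of the node order is given below. The function names, the geometric cooling schedule, and the default values of `steps`, `t_start` and `cooling` are assumptions made for this example; the text above states only that the temperature was lowered at each Monte Carlo step. For simplicity the sketch recomputes H from scratch after every proposed swap, which is adequate only for small matrices.

```python
import math
import random


def co_clustering_matrix(nodes, ensemble):
    """s[i][j]: fraction of partitions in which nodes[i] and nodes[j] co-cluster."""
    n = len(nodes)
    counts = [[0] * n for _ in range(n)]
    for partition in ensemble:
        for i in range(n):
            for j in range(n):
                if partition[nodes[i]] == partition[nodes[j]]:
                    counts[i][j] += 1
    return [[c / len(ensemble) for c in row] for row in counts]


def hamiltonian(s, order, alpha=3):
    """Eq. 2, where order[pos] is the matrix index placed at position pos."""
    n = len(order)
    position = {index: pos for pos, index in enumerate(order)}
    h = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            gap = abs(position[i] - position[j])
            d = min(gap, n - gap)  # periodic boundary conditions on the order
            h += s[i][j] * d ** alpha
    return h


def anneal_order(s, steps=2000, t_start=10.0, cooling=0.999, alpha=3, rng=None):
    """Metropolis annealing over pair swaps; the cooling schedule is assumed."""
    rng = rng or random.Random(0)
    n = len(s)
    order = list(range(n))
    rng.shuffle(order)  # start from a random ordering
    h = hamiltonian(s, order, alpha)
    t = t_start
    for _ in range(steps):
        a, b = rng.sample(range(n), 2)
        order[a], order[b] = order[b], order[a]
        delta = hamiltonian(s, order, alpha) - h
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            h += delta  # accept the swap
        else:
            order[a], order[b] = order[b], order[a]  # reject: undo the swap
        t *= cooling
    return order, hamiltonian(s, order, alpha)


nodes = [0, 1, 2, 3]
ensemble = [
    {0: "a", 1: "a", 2: "b", 3: "b"},
    {0: "a", 1: "b", 2: "b", 3: "b"},
]
s = co_clustering_matrix(nodes, ensemble)
h_identity = hamiltonian(s, [0, 1, 2, 3])
order, h_final = anneal_order(s)
```

In this four-node example the non-zero off-diagonal elements are s_01 = s_12 = s_13 = 0.5 and s_23 = 1, so with the nodes in their original order and α = 3 the cost is H = 0.5 + 0.5 + 0.5·2³ + 1 = 6.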

To get the three co-clustering matrices shown in Fig. 3, which respectively show results at the initial, intermediate, and final stages of the RenEEL scheme, the following procedure was used in the simulated annealing Monte Carlo. First, nodes were reordered by considering swaps of random pairs of nodes so as to minimize H in the final stage co-clustering matrix. Then, swaps of pairs of final stage core groups and swaps of pairs of nodes within the final stage core groups were considered to minimize H in the intermediate stage co-clustering matrix. Finally, swaps of pairs of final stage core groups, swaps of pairs of intermediate stage core groups within a final stage core group, and swaps of pairs of nodes within an intermediate stage core group were considered to minimize H in the initial stage co-clustering matrix. The order of nodes that resulted is used in all three co-clustering matrices in Fig. 3 and in Fig. 4.

Benchmark networks used for comparison

To test the effectiveness of our methods of community detection we studied a set of networks. All of these networks were used in the 10th DIMACS challenge¹⁸. The networks are unweighted and undirected. They also have no self-loops. They may be connected or disconnected. The networks we studied are listed and described in Table 1. These networks have been compiled from various sources and cover a wide range of sizes, functions and other characteristics. Hence, they are often used as benchmarks for testing community detection methods. The lists of links defining the Email, Jazz, PGPgc, and Metabolic networks were downloaded from ref.²⁰. Those for Adjnoun, Polblog, Netscience, Power, Astro-ph, As-22july06, and Cond-mat-2005 were downloaded from ref.²¹. That for Memplus was downloaded from ref.²². Those for Smallworld and CAIDARouterLevel were downloaded from ref.²³.

Table 1 Benchmark networks.

Results

Evolution of core groups

The essence of how the RenEEL scheme works and why it is efficient can be seen by the evolution of the co-clustering of the nodes across the ensemble ${\mathscr{P}}$. Figure 3 shows the co-clustering results during a typical realization of the scheme on the Email network²⁴ (see Table 1) at the initial, intermediate and final stages. In the three sub-figures, the intensity with which a pixel (i, j) is colored white corresponds to the frequency that nodes i and j are in the same community in the member partitions of ${\mathscr{P}}$. The pixels colored blue, red, and yellow indicate that the nodes are in the same community in all member partitions. The nodes in the blue, red, and yellow blocks on the diagonal are the core groups that are used to form the reduced network. Nodes are listed in the same order in each of the three sub-figures. Figure 4 shows the evolution of just the core groups in the same realization.

The Email network has n = 1133 nodes. Initially, as shown in Fig. 3(a), there are 446 core groups, most of which contain only one or two nodes. After 100 iterations of the scheme, as shown in Fig. 3(b), the number of core groups is reduced to 192. Finally, in the stable state, after about 300 iterations of the scheme, only 10 core groups remain, as shown in Fig. 3(c). This reduction, from the original network of 1133 nodes to a reduced network of 10 nodes, is a tremendous simplification and greatly improves the overall speed of network clustering.

Within a network G it is generally “easy” to determine that certain groups of nodes should be clustered together. All partitions group them together. These are the core groups of nodes. The hard work in finding the optimal partition is to determine whether nodes that are grouped together in only some of the partitions should indeed be in the same community, that is, to determine whether or not core groups should combine. This is precisely what RenEEL focuses on. The formation and evolution of core groups in RenEEL is an agglomerative process²⁵. Once a core group is formed, RenEEL never subsequently divides it. As the scheme progresses, core groups grow and merge with each other and the number of core groups monotonically decreases.

Evolution of the ensemble ${\mathscr{P}}$

A defining characteristic of RenEEL is that the ensemble of partitions ${\mathscr{P}}$ evolves as the scheme progresses. The ensemble “learns” what the partition with Q_best is by using extremal updating to incorporate new partitions, replace existing ones with higher quality ones, or remove low quality partitions. The new partitions are partitions of the reduced network G′. They are used in RenEEL to improve the quality of ${\mathscr{P}}$ at every iteration of the scheme until a consensus is reached about what the optimal partition is.

A typical way that ${\mathscr{P}}$ evolves as the scheme progresses can be seen in the results shown in Fig. 5 from an example run of RenEEL that partitions the As-22july06 network²¹ (see Table 1). In this example run, k_max = 100 and k′ = 20. Figure 5(a,b) shows the modularity value Q of P_best, the best partition in ${\mathscr{P}}$ (red dots), of P_worst, the worst partition in ${\mathscr{P}}$ (black dots), and of ${P^{\prime} }_{{\rm{best}}}$, the new partition of G′ considered for the enrichment of ${\mathscr{P}}$ (blue dots), as a function of the number of iterations. The main panel of Fig. 5(a) shows the full results of the scheme, from start to finish. An enlarged view of the results for the initial 150 iterations is shown in the inset of Fig. 5(a). The main panel of Fig. 5(b) shows an enlarged view of the vertical Q axis near the final result of the entire scheme. An enlarged view of both axes at the end stages of the scheme is shown in the inset of Fig. 5(b). Figure 5(c) shows the size of the reduced network, or equivalently the number of core groups, as a function of the number of iterations. The main panel of Fig. 5(c) shows the results on linear axis scales, and the inset shows the same results on log scales. Figure 5(d) shows the ensemble size k as a function of the number of iterations.

In the example run, as can be seen from the inset of Fig. 5(a), for the first 100 iterations the modularity values Q(${P^{\prime} }_{{\rm{best}}}$) of the new partitions are all significantly better than that of the worst partition in the ensemble, Q(P_worst). In fact, all of the first 100 new partitions generated by RenEEL are better than every one of the 100 original ones in ${\mathscr{P}}$ generated by the base algorithm. (The number of partitions in ${\mathscr{P}}$ initially is k_max = 100.) So, for the first k_max iterations RenEEL systematically replaced each of the original partitions. There is a large increase in Q(P_worst) at iteration 100. Although it’s difficult to see in the figure, there are other similar, significant increases in Q(P_worst) at iterations 200 and 300, indicating that RenEEL also replaces its first and second 100 new partitions with entirely new sets in the second and third 100 iterations, respectively. After the first 300 iterations, the quality of the new partitions starts to become comparable to that of the existing partitions. Throughout the process, Q(P_best) intermittently rises when a new best partition is discovered.

Figure 5(c) shows that the size of the reduced network keeps decreasing as the scheme progresses. It initially decreases exponentially, then there is what appears to be a power-law decay from iteration 100 to iteration 1000 (see inset of Fig. 5(c)), followed by a sharp, perhaps exponential, decay in the final iterations of the scheme. The original size of this network, n = 22963, is reduced to 38 core groups at the termination step. The size of the ensemble, shown in Fig. 5(d), varies when new partitions are discovered and added to ${\mathscr{P}}$ or when low quality partitions are deleted as the scheme drives ${\mathscr{P}}$ toward consensus. The plot shows that as the ensemble learns, its size grows and shrinks multiple times before its size falls to unity and the scheme terminates. There are two main periods in which the size of the ensemble grows, one beginning at about iteration 900 and the other at about iteration 1200. During these periods the value of Q(P_best) increases quickly, as can be seen in the main panel and inset of Fig. 5(b). These are periods when the ensemble ${\mathscr{P}}$ has made a “breakthrough” by discovering a new set of high quality partitions. The example run ends with a consensus choice that a partition with modularity Q_best = 0.678579 is the one that maximizes modularity for this network, a value higher than that of any previously reported partition (see Table 2).

Table 2 Comparison of results using RenEEL to the previous best results for benchmark networks.

Distribution of results for Q_best

Since virtually all conventional algorithms are stochastic, ensemble learning schemes that use them as base algorithms will also be stochastic. Thus, a range of results for Q_best are possible with each realization of virtually all methods of modularity maximization. As an example, Fig. 6 shows the distribution of Q_best that three different methods of community detection produce for the Email network. Results from 250 realizations for each method are shown. Results from the RenEEL and CGGCi ensemble learning schemes and from naive ensemble analyses are shown in red, green, and blue, respectively. The results for all three of these schemes were obtained using a randomized greedy algorithm as the base algorithm and an ensemble size of k_max = 100. Each of the blue data points was obtained by running the algorithm 100 times and choosing the largest value from those runs. The distributions from the three different methods are all non-overlapping, with the RenEEL results having the largest values, followed by those of CGGCi and then those of the naive ensemble analyses with the conventional algorithm. The distribution of Q_best for RenEEL is also narrower than those of the other two schemes, which suggests that the results from RenEEL are close to the value of Q_max for the network.

Application to benchmark networks

To test the accuracy of the RenEEL scheme, we applied it to the benchmark networks listed in Table 1. In Table 2, the maximum modularity value Q_best found for these networks by RenEEL is compared to the best previously published values. Many of these values were the best result in the 10th DIMACS challenge²⁶. To be consistent, all realizations had k_max = 100 and k′ = 20 and used the randomized greedy algorithm as the base algorithm. 100 different realizations of RenEEL were run on the smaller networks, up to and including the Netscience network, and 5 were run on the larger networks. For the smaller networks the value of Q_best reported in the table was consistently obtained. For the larger networks a range of results was obtained and the largest one is listed. As the table shows, the partitions found by RenEEL have a value of Q_best that is higher than or equivalent to the best previously reported value for every benchmark network. The difference between Q_best found by RenEEL and the previous best values increases with network size. This is due to the fact that for small networks it is generally easier to find the modularity maximizing partition, but the task becomes more challenging for larger networks.

Our results are significant for every network studied. For the smallest networks, our best partition has the same modularity as that of the previous best result. This is presumably because we find the true best partition that other algorithms have also found. For larger networks, however, our results are better than any previously reported result. For some medium size networks, our value of Q_best may be only slightly better than the previous best, but, in these cases, finding any new better result is remarkable and mathematically noteworthy. Furthermore, for these networks, we may be discovering the true best partition. For larger networks, our accuracy improvement is substantial.

Perhaps a better way of quantifying the mathematical significance of our results would be, if one knew what the value of true best Modularity Q_max is, to consider results for 1/ΔQ, where ΔQ ≡ Q_max − Q, instead of the results for Q. Unfortunately, that’s not possible as the value of Q_max for most networks is not known. If we could though, it would be clear that our results are indeed highly significant, for every network studied.

Discussion

Recent advances in Machine Learning and Artificial Intelligence have enabled progress to be made toward solving a range of difficult computational problems²⁷. In this paper, we have introduced a powerful algorithmic paradigm for graph partitioning that we call Extremal Ensemble Learning (EEL). EEL is a form of Machine Learning. An EEL scheme creates an ensemble of partitions and then uses information within the ensemble to find new partitions that are used to update the ensemble using extremal criteria. Through the updating procedure, the ensemble learns how to form improved partitions, as it works toward a conclusion by achieving consensus among its member partitions about what the optimal partition is.

The particular EEL scheme we have introduced, Reduced Network Extremal Ensemble Learning (RenEEL), uses information in the ensemble of partitions to create a reduced network that can be efficiently analyzed to find a new partition with which to update the ensemble. We have used RenEEL to find the partition that maximizes the modularity of networks. This is a difficult, NP-hard computational problem⁵. We have shown that an algorithm using the RenEEL scheme outperforms all existing modularity maximizing algorithms when analyzing a variety of commonly studied benchmark networks. For those networks it finds partitions with the largest modularity ever discovered. For the larger benchmark networks, the partitions that we discovered are novel.

Although we have only demonstrated the effectiveness of our algorithm for the well-known problem of finding the network partition that maximizes modularity, the EEL paradigm and the RenEEL scheme can be used to solve other network partitioning problems. For example, the algorithm we used can be straightforwardly adapted to optimize other metrics such as modularity density²⁸, or excess modularity density²⁹. Work is underway to explore the effectiveness of RenEEL for solving those problems. Its potential effectiveness for finding the partition that maximizes excess modularity density may be especially important. Using excess modularity density largely mitigates the resolution limit problem in community detection by maximizing modularity³⁰, making it a preferred metric for applications where the resolution limit is problematic, such as finding the community structure in gene regulatory networks^31,32.

There is potential to improve upon our results using the RenEEL scheme. As previously discussed, any conventional algorithm can be used as the base algorithm of the scheme. There is also freedom to vary the size of the ensembles used in the scheme. Which base algorithm and what ensemble sizes are best to use depends on the network to be analyzed. Using a high quality base algorithm though, such as the Iterative Spectral Bisectioning, Tuning and Agglomeration algorithm¹², is likely to yield more accurate results for many of the networks studied. There is also potential to improve the RenEEL scheme itself. For instance, currently, a naive ensemble analysis of partitions of the reduced network is used to find a new partition with which to update the ensemble. Another method, such as a recursive use of the RenEEL scheme, may yield better results. Also, currently, once the original ensemble of partitions is created, no new information is ever added to the system during the learning processes. It may be beneficial to occasionally use a new partition of the original network instead of the reduced network to update the ensemble. Work is in progress to explore if these ideas lead to improved results.

Finally, the principal reasons why the RenEEL scheme is both efficient and effective should be noted. Its efficiency stems from its use of an ensemble of partitions to form reduced networks. The smaller size of the reduced networks allows them to be partitioned much more quickly than the full network. Also, because the scheme is so effective, highly accurate results can be obtained even if a fast, but low quality, base algorithm is used. This allows significantly larger networks to be analyzed than what would otherwise be possible. The remarkable effectiveness of RenEEL, even relative to other Ensemble Learning schemes, is mainly due to its extremal updating of the ensemble of partitions. It is of course just one example of a scheme using the EEL paradigm. Its success, though, suggests that EEL is an algorithmic paradigm that will be useful for solving a variety of graph theoretic problems.

Data Availability

The data used in this study are publicly available from the sources that are cited in the main text.

References

Fortunato, S. Community detection in graphs. Physics Reports 486, 75–174 (2010).
Newman, M. E. & Girvan, M. Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004).
Schaub, M. T., Delvenne, J.-C., Rosvall, M. & Lambiotte, R. The many facets of community detection in complex networks. Applied Network Science 2, 4 (2017).
Peel, L., Larremore, D. B. & Clauset, A. The ground truth about metadata and community detection in networks. Science Advances 3, e1602548 (2017).
Brandes, U. et al. On modularity clustering. IEEE Transactions on Knowledge and Data Engineering 20, 172–188 (2008).
Clauset, A., Newman, M. E. & Moore, C. Finding community structure in very large networks. Physical Review E 70, 066111 (2004).
Newman, M. E. Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133 (2004).
Ovelgönne, M. & Geyer-Schulz, A. Cluster cores and modularity maximization. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, 1204–1213 (IEEE 2010).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment P10008 (2008).
Newman, M. E. Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74, 036104 (2006).
Newman, M. E. Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103, 8577–8582 (2006).
Treviño, S. III., Nyberg, A., Del Genio, C. I. & Bassler, K. E. Fast and accurate determination of modularity and its effect size. Journal of Statistical Mechanics: Theory and Experiment P02003 (2015).
Kernighan, B. W. & Lin, S. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal 49, 291–307 (1970).
Sun, Y., Danila, B., Josić, K. & Bassler, K. E. Improved community structure detection using a modified fine-tuning strategy. Europhysics Letters 86, 28004 (2009).
Polikar, R. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 6, 21–45 (2006).
Sagi, O. & Rokach, L. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, e1249 (2018).
Ovelgönne, M. & Geyer-Schulz, A. An ensemble learning strategy for graph clustering. Graph Partitioning and Graph Clustering 588, 187 (2012).
10th DIMACS Implementation Challenge, https://www.cc.gatech.edu/dimacs10/.
Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by simulated annealing. Science 220, 671–680 (1983).
Alex A datasets, http://deim.urv.cat/~alexandre.arenas/data/welcome.htm.
Network data, http://www-personal.umich.edu/~mejn/netdata/.
Hamm/memplus | SuiteSparse Matrix Collection, https://sparse.tamu.edu/Hamm/memplus.
10th DIMACS Implementation Challenge, https://www.cc.gatech.edu/dimacs10/archive/clustering.shtml.
Guimera, R., Danon, L., Diaz-Guilera, A., Giralt, F. & Arenas, A. Self-similar community structure in a network of human interactions. Physical Review E 68, 065103 (2003).
Rokach, L. & Maimon, O. Clustering Methods, 321–352 (Springer US, Boston, MA, 2005).
Index of /dimacs10/results, https://www.cc.gatech.edu/dimacs10/results/.
Mohammed, M., Khan, M. B. & Bashier, E. B. M. Machine Learning: Algorithms and Applications. (CRC Press, 2016).
Chen, M., Kuzmin, K. & Szymanski, B. K. Community detection via maximization of modularity and its variants. IEEE Transactions on Computational Social Systems 1, 46–65 (2014).
Chen, T., Singh, P. & Bassler, K. E. Network community detection using modularity density measures. Journal of Statistical Mechanics: Theory and Experiment, 053406 (2018).
Fortunato, S. & Barthelemy, M. Resolution limit in community detection. Proceedings of the National Academy of Sciences 104, 36–41 (2007).
Treviño, S. III., Sun, Y., Cooper, T. F. & Bassler, K. E. Robust detection of hierarchical communities from Escherichia coli gene expression data. PLOS Computational Biology 8, 1–15 (2012).
Mentzen, W. I. & Wurtele, E. S. Regulon organization of Arabidopsis. BMC Plant Biology 8, 99 (2008).
Gleiser, P. M. & Danon, L. Community structure in jazz. Advances in Complex Systems 6, 565–573 (2003).
Duch, J. & Arenas, A. Community detection in complex networks using extremal optimization. Physical Review E 72, 027104 (2005).
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabási, A.-L. The large-scale organization of metabolic networks. Nature 407, 651 (2000).
Overbeek, R. et al. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Research 28, 123–125 (2000).
Adamic, L. A. & Glance, N. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link discovery, 36–43 (ACM, 2005).
Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393, 440 (1998).
Boguñá, M., Pastor-Satorras, R., Díaz-Guilera, A. & Arenas, A. Models of social networks based on social distance attachment. Physical Review E 70, 056122 (2004).
Newman, M. E. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences 98, 404–409 (2001).
Davis, T. A. & Hu, Y. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011).
CAIDA Skitter Router-Level Topology and Degree Distribution, http://www.caida.org/data/router-adjacencies.
Aloise, D. et al. Modularity maximization in networks by variable neighborhood search. In Graph Partitioning and Graph Clustering (2012).


Acknowledgements

We thank Peter Grassberger and Eve S. Wurtele for fruitful discussions. This work was supported by the NSF through grants DMR-1507371 and IOS-1546858. Some of the computations in this work were done on the uHPC cluster at the University of Houston, acquired through NSF Award Number 1531814.

Author information

Authors and Affiliations

Department of Physics, University of Houston, Houston, Texas, 77204, USA
Jiahao Guo, Pramesh Singh & Kevin E. Bassler
Texas Center for Superconductivity, University of Houston, Houston, Texas, 77204, USA
Jiahao Guo, Pramesh Singh & Kevin E. Bassler
Department of Mathematics, University of Houston, Houston, Texas, 77204, USA
Kevin E. Bassler

Authors

Jiahao Guo
Pramesh Singh
Kevin E. Bassler

Contributions

J.G., P.S. and K.E.B. conceived of the project. J.G. performed the simulations. J.G., P.S. and K.E.B. analyzed the results and wrote the paper. All authors read and approved the manuscript.

Corresponding author

Correspondence to Kevin E. Bassler.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Guo, J., Singh, P. & Bassler, K.E. Reduced network extremal ensemble learning (RenEEL) scheme for community detection in complex networks. Sci Rep 9, 14234 (2019). https://doi.org/10.1038/s41598-019-50739-3


Received: 15 May 2019
Accepted: 17 September 2019
Published: 02 October 2019
DOI: https://doi.org/10.1038/s41598-019-50739-3
