Local clustering via approximate heat kernel PageRank with subgraph sampling

Graph clustering, a fundamental technique in network science for understanding structures in complex systems, presents inherent problems. Though studied extensively in the literature, graph clustering in large systems remains particularly challenging because massive graphs incur a prohibitively large computational load. The heat kernel PageRank provides a quantitative ranking of nodes, and a local cluster can be efficiently found by performing a sweep over the heat kernel PageRank vector. But computing an exact heat kernel PageRank vector may be expensive, and approximate algorithms are often used instead. Most approximate algorithms compute the heat kernel PageRank vector on the whole graph, and thus are dependent on global structures. In this paper, we present an algorithm for approximating the heat kernel PageRank on a local subgraph. Moreover, we show that the number of computations required by the proposed algorithm is sublinear in terms of the expected size of the local cluster of interest, and that it provides a good approximation of the heat kernel PageRank, with approximation errors bounded by a probabilistic guarantee. Numerical experiments verify that the local clustering algorithm using our approximate heat kernel PageRank achieves state-of-the-art performance.

, which are commonly used as goodness measures for clusters and communities. As with standard Pag-eRank, computing an exact heat kernel PageRank vector is prohibitively computationally expensive, and most methods use approximate algorithms instead 21,24 . Yet these algorithms still compute the heat kernel PageRank vector on the whole graph, and thus the number of computations is dependent on the size of the whole graph.
In this paper, we present a local algorithm for approximating the heat kernel PageRank. Instead of working on the whole graph, we compute the values of the heat kernel PageRank vector only for nodes in a subgraph near the seed node. This subgraph is obtained by sampling nodes near the seed node in a greedy way. We then simulate random walks inside this sampled subgraph and evaluate the probability distributions resulting from these random walks. We show that, by making a few mild assumptions about the graph, the approximation errors induced by restricting the random walks in a subgraph can be well bounded and do not cause significant changes to the heat kernel PageRank vector. Finally, we apply the proposed local clustering algorithm to computer-generated benchmark networks and real-world networks. When applied to networks with known communities, our local clustering algorithm finds local clusters that show excellent agreement with the groundtruth communities. When applied to real-world networks without a-priori information of communities, our local clustering algorithm yields promising results that aid our understanding of the organization and structure in these massive complex networks.

Preliminaries
Let G = (V , E) be an undirected simple graph on n nodes, where V is the set of nodes and E is the set of edges. We use u ∼ v to denote that a pair of nodes u and v are connected by an edge {u, v} ∈ E , and the nodes in such a pair are called neighbors. The degree d v of a node v is the number of edges attached to the node v. The volume of a subgraph S ⊂ V is defined as the sum of the degrees of its nodes, which is also known as the total degree of subgraph S. The edge boundary ∂(S) of subgraph S is the set of edges with one end in S and another end not in S, For a given subgraph S of a graph G, the conductance value �(S) , also called the Cheeger ratio, of the induced cut (S, V \S) is defined as which is the ratio of the number of edges between the two parts of the cut and the volume of the smaller part of the cut. This ratio can also be interpreted as the fraction of edges leaving the small part of the cut. We will use this ratio as the objective function to search for a local cluster near a seed node.
Let p : V → [0, 1] be a probability distribution vector over the nodes. Consider a probability-per-degree ranking of the nodes in the graph where Let S i be the set of the first i nodes per the ranking: The set S i is called a segment. The process of investigating all local subgraphs induced by segments to find an optimal local community is called performing a sweep over the probability distribution vector p. The ς-local Cheeger ratio � ς (p) of a sweep over a vector p is the minimum Cheeger ratio among all segments with volume from 0 to 2ς . For any subgraph S, we define p(S) to be the sum of all values of p(v) for v ∈ S: A heat kernel PageRank vector ρ t,f is defined as , (5) S i = {v 1 , v 2 , . . . , v i }.  www.nature.com/scientificreports/ where t is a non-negative real value representing temperature, f : V → R is a preference vector on the graph representing the starting state of the heat diffusion, and P is the transition probability matrix If f is a probability distribution vector, properly normalized so that v∈V f (v) = 1 , then the heat kernel PageRank vector ρ t,f can be regarded as the expectation of the ending state of a random walk whose length follows a Poisson distribution with mean t and whose starting state is f. This observation sheds light on an approximate calculation of the heat kernel PageRank vector: we simulate sufficiently many random walks with lengths following a Poisson distribution with mean t and record the nodes at which these random walks end.

Sampling local subgraphs
Most local clustering algorithms calculate an approximate heat kernel PageRank vector on a whole graph, and thus the number of computations is dependent on the size of the whole graph. For example, the state-of-the-art method by Chung and Simpson requires O(log n) computations 21 , where n is the size of the whole graph. We improve on the work of Chung and Simpson by stipulating that all simulated random walks must be local, i.e., we simulate random walks only around the seed node, to avoid unnecessary computations irrelevant to the local clustering that we wish to perform. In the local clustering problem, given a seed node s, we wish to find a local cluster with a low conductance value around s. Based on this problem formulation, we propose to perform a rough sampling in the neighborhood of the seed node s and find a local subgraph. The local cluster of interest will be a subset of the sampled subgraph, and will occupy a non-vanishing fraction of this subgraph. Then we calculate an approximate heat kernel PageRank vector by simulating random walks in the sampled subgraph. Through this process, we can ensure that the number of computations of the heat kernel PageRank approximation is dependent only on the size of the local cluster, and not on the size of the whole graph.
Our proposed subgraph sampling algorithm is formulated as follows. We start with the seed node s and all its neighbors. Then we iteratively add nodes that roughly cause the greatest decrease, or the least increase, in the conductance value into a subgraph S, thereby expanding S in a greedy way while keeping the conductance value �(S) low. In each iteration, we investigate all neighboring nodes of the current subgraph S, namely all nodes with connections to nodes in S. We compute for each neighboring node v the fraction of connections to S in its total connections d v . Then we add the nodes with the largest fraction into the subgraph S. We repeat this process iteratively until the size of the subgraph reaches the expected size, which is chosen to be proportional to the expected size ς of the local cluster of interest. Algorithm 1 provides the details of our subgraph sampling algorithm.

Algorithm 1: Subgraph sampling
Input : Seed node s, graph G, expected volume ς of local cluster, constant α Output: Sampled subgraph S 1 Initiate S as s and all neighbors of s; 2 while vol(S) < ας do 3 Find all neighbors of S, namely all nodes with connections to any member of S; 4 Denote as κ v the fraction of connections to S in total connections d v of a neighboring node v; 5 Denote as κ as the maximum of fractions κ v among all neighbors v of S; 6 Add all neighbors with κ fractions into S.

Approximate heat kernel PageRank in sampled subgraph
We then calculate an approximate heat kernel PageRank vector on the support of the sampled local subgraph S. As stated earlier, we perform this calculation by simulating random walks only in subgraph S. Recall that the heat kernel PageRank vector ρ t,s on graph G with the starting preference vector χ s is given as where and s is the seed node. Since χ s is properly normalized on graph G, the heat kernel PageRank vector ρ t,s is thus the expectation of the ending state of a random walk starting from the seed node s and tracing a length that follows a Poisson distribution with mean t. Based on this observation, the intuition of our algorithm of approximating www.nature.com/scientificreports/ ρ t,s on subgraph S works as follows. We perform r random walk simulations to approximate the infinite sum in (9) on the support of the sampled subgraph S, and choose r large enough to bound the error. In each simulation, we track the steps of the simulated random walk. If this random walk leaves the subgraph S, we terminate this simulation and restart. We also note that the Poisson distribution is highly concentrated around the mean value t, and thus we can choose a finite number K and discard all random walks with length greater than K. Our proposed algorithm for the approximate heat kernel PageRank computation is summarized in Algorithm 2. Starting from the seed node, we generate r random walks, with lengths following a Poisson distribution. When a random walk reaches a node v in the subgraph S, one of the two following cases can happen. With probability proportional to the number of edges connecting to the outside of S, this random walk is terminated. Also, with probability proportional to the number of edges connecting to other nodes in S, this random walk chooses one of the neighbors of v in S at random as the next step. If a random walk is not terminated in the middle of the simulation, the ending node is recorded in the PageRank vector. The choice of the number of iterations r depends on an error tolerance parameter ǫ , as we will discuss in detail later.

Algorithm 2: Approximate heat kernel PageRank
Input : Seed node s, sampled subgraph S, expected size ς of local cluster, true degree k v of each node v in graph G, error tolerance parameter 0 < < 1 Output: Approximate PageRank vector ρ 1 r ← 16 3 log ς; 4 Denote as κ v the number of edges that connect node v to other nodes in subgraph S; 5 Initiate a zero vector ρ of dimension |S|; 6 for r iterations do 7 Simulate a random walk starting from seed node s with length k taken with probability e −t t k k! , and at each step at node v do the follows; 8 Generate a random value τ uniformly in the range [0, 1); 9 If τ < (k v − κ v )/k v , then this random walk is terminated; 10 Otherwise, choose one of the neighbors of node v in subgraph S uniformly at random as next step; 11 If a random walk is not terminated, let u be the ending node of this random walk and ρ(u) ← ρ(u) + 1; 12 end 13 Return 1/r · ρ Restricting our attention to subgraph S comes with a price: we lose track of simulated random walks once they cross the boundary of S and enter the complementary part V \S . But we are interested only in the part of the heat kernel PageRank vector that is supported on the subgraph S, and we can simply discard the other part that is supported on V \S , since it is irrelevant to the local clustering we wish to perform. Thus, the only random walks that will interrupt our approximation of the heat kernel PageRank are those that leave the subgraph S first and then come back. But we will argue that, with reasonable assumptions, the contribution of these random walks is vanishingly small if the whole network is sufficiently large.
We will consider a network model with community structure as follows. The network is divided into a few non-overlapping communities, and each node belongs to one of these communities. Each node v has a mixing parameter µ v ∈ (0, 1) , such that a fraction 1 − µ v of its edges connects to other nodes in its community, and a fraction µ v of its edges connects to other nodes outside its community. We additionally assume that µ v is an independent and identically distributed (i.i.d.) random variable, with an expected value µ . We note that our assumptions fit with commonly used network models with community structure, for example, the Girvan-Newman benchmark 25 , the LFR benchmark 26 , and the stochastic blockmodel 27 .
For a fixed seed node s in a local cluster of interest C, and for any node v in this community, we denote by X v k the indicator random variable defined as X v k = 1 if a random walk with length k starting from the seed node s ends at node v. By definition, the expectation E[X v k ] is equal to the fraction of random walks with length k starting from the seed node s that ends at node v. We denote by X v the random variable corresponding to the combined random walk process that starts from the seed node s, ends at node v, has a length that follows a Poisson distribution with mean value t and a maximum value of K, and has all steps in this cluster C. Then we can write X v as where t 1 , . . . , t k are the nodes visited in a random walk of length k. This random variable X v is a weighted sum of the random variables X v k . The values of the weights follow a Poisson distribution corresponding to the distribution www.nature.com/scientificreports/ of the lengths of random walks, and are multiplied by a sequence of factors of the form 1 − µ , which restricts all the steps of random walks to within subgraph S. The expected value of X v is This expected value represents the fraction of random walks starting from the seed node s that end at node v, which have lengths distributed according to a Poisson distribution with mean value t and a maximum value of K, and have all steps in this cluster C. We will also consider the approximate form of the infinite sum in (9) by limiting the length of random walks to at most K. Let ρ(K) t,s denote the contribution to ρ t,s from random walks with a length of at most K, It follows from definition that which is the expected value of the random variable Y v corresponding to the combined random walk process that starts from the seed node s, ends at node v, and has length distributed according to a Poisson distribution with mean value t and has a maximum value of K: represents the fraction of random walks starting from the seed node s that end at node v, and whose lengths follow a Poisson distribution with mean value t and have a maximum value of K. We denote by Poiss(t) a Poisson random variable with mean t. By the Chernoff bound 28 , we have As a result, we have with probability at least 1 − ǫ , for each node v in the network. Our analysis of the approximation error of (11) relies on the Chernoff bound as summarized in Lemma 1 20 .
We now state our bound on the approximation error of our approximate heat kernel PageRank vector.

Theorem 1 Let G be a graph and s be a node in a local cluster
Let ρ t,s be the heat kernel PageRank vector with 1 u as the starting distribution. For any 0 < ǫ < 1 , Algorithm 2 outputs an approximate heat kernel PageRank vector ρ t,s satisfying for all nodes v ∈ C , with a probability of at least 1 − ǫ . The running time of the algorithm is O log ǫ −1 log ς ǫ 3 log log ǫ −1 , where ς is the volume of the local cluster C. www.nature.com/scientificreports/ The proof of Theorem 1 is given in "Appendix A". Theorem 1 states that, as long as the local cluster C of interest is sufficiently small compared with the whole network G, then Algorithm 2 provides a good approximation of ρ t,s for all nodes in the local cluster C. Many real-world networks are massive, and we are usually interested only in local structures near the seed node of interest, namely a small local cluster. The results in Theorem 1 indicate that we can efficiently compute an approximate heat kernel PageRank for a small local subgraph when analyzing a massive network, and thus probe the local structures near the seed node without incurring huge computational loads.

Finding a good local cluster
Our local clustering algorithm finds a subset of nodes near the seed node s by performing a sweep on the approximate heat kernel PageRank vector ρ t,s . We now show that the ranking deduced from our approximate heat kernel PageRank can be used to find a good local cluster near the seed node. Specifically, by performing a sweep on ρ t,s with an appropriate seed node s in a subgraph C with �(C) < φ , we can find a subset of nodes with a conductance value of at most O( √ φ) . The performance guarantee for the local cluster estimated by our proposed algorithm relies on the results summarized in Lemma 2 21,29 .
The ranking deduced from the approximate heat kernel PageRank vector ρ t,s is not very different from that of a true vector ρ t,s in the local cluster C, as demonstrated by our bound on approximation errors in Theorem 1 (see "Computational complexity" section for an experimental analysis). For nodes in the subgraph S but not in the local cluster C, we note that we discard all simulated random walks that exit S and thus ρ t,s (v) < ρ t,s (v) for any v ∈ S\C . However, unlike the local cluster C, the sampled subgraph S is not guaranteed to have few boundary edges. Thus the approximation errors in S\C are not bounded from below as in Theorem 1. We note in addition that the approximation errors decrease the values of the approximate PageRank vector ρ t,s . Thus, compared with the ranking deduced from the true PageRank vector ρ t,s , the ranking deduced from the approximate PageRank vector ρ t,s keeps the relative ordering of the nodes in C and also keeps the nodes in S\C behind the nodes in C. In other words, the segments S i with small i's would more likely correspond to subsets of C. Then, as shown by Chung and Simpson 21 , we can apply the results in Lemma 2 to our approximate heat kernel PageRank vector ρ t,s . Therefore by the bounds in (20) and Lemma 2, where c = |C| is the number of nodes in C, and in particular, for a set of nodes with volume at least ς/2 . We now present our performance guarantee using the bounds in (23).

Theorem 2
Suppose that G is a network and C ⊂ G is a local cluster with vol(C) = ς , |C| = c , and �(C) ≤ φ . There exists a subset C t ⊂ C with vol(C t ) ≥ ς/2 , such that for any node s ∈ C t , the approximate heat kernel Pag-eRank vector ρ t,s computed by Algorithm 2 satisfying the following two assumptions induces a ranking for which a sweep finds a set with ς-local Cheeger ratio at most √ 8φ.
The proof of Theorem 2 is given in "Appendix B". Theorem 2 implies that, we can find a good local cluster by choosing parameters t and ǫ following the assumptions in (25) and (26). Specifically, for a subset C with Cheeger ratio bounded by φ , for any seed node s in C t ⊂ C as defined in Theorem 2, performing a sweep on the output of Algorithm 2 yields a local cluster with conductance at most √ 8φ.

Computational complexity
Sampling a subgraph using Algorithm 1 is sublinear in terms of the expected volume of the subgraph, which is in proportion to the expected volume ς of the local cluster. According to the construction of the sampling algorithm, multiple nodes can be added to the subgraph in each round. Hence the number of computations required for subgraph sampling is sublinear in terms of ς. We adopt the assumptions that drawing from a probability distribution with bounds and performing a random walk step both require constant time. For each of the r iterations, at most K = log ǫ −1 log log ǫ −1 random walk steps are simulated, and the number of simulated random walks is r = 16 ǫ 3 log ς . Thus the number of computations for computing the approximate heat kernel PageRank is O log ǫ −1 log ς ǫ 3 log log ǫ −1 . Performing a sweep on the approximate heat kernel PageRank vector ρ involves sorting the nodes according to their probability per degree and calculating the conductance of each segment. Recall that there are m nodes in the sampled subgraph S, which is in proportion to the expected size c of the local cluster C. Thus the number of computations for performing a sweep over ρ is O(m log m) = O(c log c).
To summarize, the computational work of our local clustering algorithm is dominated by computing the approximate heat kernel PageRank, and the total number of computations is O log ǫ −1 log ς ǫ 3 log log ǫ −1 . We emphasize that, unlike some other approximation algorithms based on iterations, our proposed algorithm is based on simulated random walks, and thus allows parallel computations. As a result, in practice our proposed algorithm can be transformed into distributed computation tasks, thus improving running time 30 .

Numerical experiments
We test our proposed local clustering algorithm on benchmark networks with ground-truth communities and compare the results to those of other existing local clustering algorithms. We will compare the clustering results in terms of the accuracy and the conductance of the estimated clusters.
Benchmark. We use benchmark networks to compare our local clustering algorithm with the algorithm by Chung and Simpson 21 , the TEA+ algorithm 24 , and with the results of performing a sweep on the exact heat kernel PageRank with maximum length K Benchmark networks are computer-generated graphs with ground-truth communities. We use the Lancichinetti-Fortunato-Radicchi (LFR) network as the benchmark for our numerical experiments 26 . In this benchmark, each computer-generated graph contains n nodes and is partitioned into l communities. The degrees of the nodes and the sizes of the ground-truth communities both follow a power-law distribution, with exponents γ and β respectively. The minimal and maximal degrees, k min , k max , and the minimal and maximal community sizes, s min , s max , are chosen to satisfy k min < s min and k max < s max . Each node is assigned to one of the ground-truth communities, and the community structure is controlled by a mixing parameter µ . For each node, a fraction 1 − µ of its edges connect to other members of the same community chosen at random, and the remaining fraction µ of its edges connect to nodes outside its community also chosen at random. In our numerical simulations, we choose as parameters n = 10000 , l = 30 , k min = 30 , k max = 300 , γ = 2 , s min = 50 , s max = 2000 , and β = 1.1 , and we vary the mixing parameter µ from 0.1 to 0.4. For each mixing parameter, we generate 400 graphs independently. For each generated graph, we randomly choose one of the ground-truth communities and then randomly choose one node in this community as a seed node. Then we apply the local clustering algorithms to find a local community of this seed node. Subgraph sampling. As the first part of our numerical experiments, we evaluate the performance of our subgraph sampling algorithm. We vary the mixing parameter from 0.1 to 0.6 to better demonstrate the performance of this algorithm. As a measure of goodness, we choose the fraction of nodes in the chosen local cluster that is also in the sampled subgraph where C indicates the local cluster chosen, and S indicates the local subgraph found by our subgraph sampling algorithm. We choose 2vol(C) as the expected size of the sampled subgraph, and note that in practice we can choose a larger expected size when working with large networks.
The average fraction with varying mixing parameters of our subgraph sampling algorithm is shown in Fig. 1. The sampled subgraph contains all the nodes of the chosen local cluster as the value of the mixing parameter µ varies from 0.1 to 0.55, and starts to degrade slightly only after the value of µ exceeds 0.55. Thus, our subgraph sampling algorithm always finds a local subgraph that contains the local cluster of interest as a subgraph, as long as the nodes in this local cluster are more strongly connected internally than to nodes outside this cluster.
We also evaluate the goodness of the local subgraphs found by our subgraph sampling algorithm by measuring the conductance values of these local subgraphs. Figure 2 shows the average conductance values of sampled subgraphs with varying mixing parameters µ . When µ = 0.1 , the average conductance value of sampled subgraphs is around 0.2. The average conductance value gradually increases as µ becomes larger, reaching around 0.65 when µ = 0.6 . We note that, for each local cluster, the mixing parameter is the average fraction of edges that connect to nodes in other clusters, and so approximately equals the conductance values of these local clusters. We also show in Fig. 2 the average conductance value of benchmark clusters. The average conductance value of the sampled subgraphs grows almost in parallel with that of the benchmark clusters, with an increase of about 0.1. Thus our subgraph sampling algorithm finds local subgraphs with conductance values slightly greater than those of the local clusters of interest.  Fig. 3, where the term "heat kernel" refers to performing sweeps on the heat kernel PageRank vector with maximum length, as in (13), and "benchmark" refers to the conductance values of the chosen clusters in the benchmark networks, averaged over experiments with the same mixing parameters. As we can see, for all values of the mixing parameter µ , all methods except the TEA+ algorithm yield conductance values equal to those of benchmark clusters. In particular, our proposed algorithm, despite the approximation errors in computing the heat kernel PageRank, finds local clusters as good as the ground-truth clusters, in terms of conductance. The clusters estimated by the TEA+ algorithm all have conductance values greater than those of the benchmark clusters. We evaluate the similarity between two clusters C 1 and C 2 using the normalized mutual information 31  www.nature.com/scientificreports/ where C 1 = (C 1 , C 1 ) and C 2 = (C 2 , C 2 ) are the bipartitions of the network induced by the clusters C 1 and C 2 , respectively, p(c) is the probability that a randomly chosen node belongs to a subgraph c, p(c 1 , c 2 ) is the probability that a randomly chosen node belongs to both subgraphs c 1 and c 2 , and H(C) is the Shannon entropy, defined as By the above definition, the mutual information I n (C 1 , C 2 ) equals 1 if the two subgraphs C 1 and C 1 are identical, and equals 0 if they are independent. In our numerical experiments, C 1 is the ground-truth community specified by the benchmark and chosen in our experimental setup, and C 2 is the local cluster detected by different local clustering algorithms. The average mutual information values with varying mixing parameters of different local clustering algorithms are shown in Fig. 4. The proposed local clustering algorithm maintains almost complete mutual information in all experiments, demonstrating that the approximation errors in computing the heat kernel PageRank do not degrade performance in estimating local clusters. In other words, our proposed approximate heat kernel PageRank vector is accurate enough for finding good clusters. For other methods, when µ is small, all methods yield a good estimate of the ground-truth local community, achieving almost complete mutual information and so detecting correct clusters. When µ increases, the average mutual information of the output of the TEA+ algorithm clearly decreases, demonstrating that the clusters estimated by this algorithm deviate from the groundtruth cluster.
To conclude, even though our proposed approximate heat kernel PageRank might suffer from increased approximation errors due to discarding part of the simulated random walks, the performance of our local clustering algorithm is not significantly influenced by these errors. In fact, our local clustering algorithm achieves state-of-the-art performance on benchmark networks; it always finds local clusters that are as good as benchmark clusters and clusters estimated by other methods. Our proposed approximate computation of the heat kernel PageRank is accurate enough to find good local clusters.

Application to real-world networks
College football network. We first apply the proposed local clustering algorithm to a real-world network with a known labeled community structure. This network represents the schedule of American football games between 116 football teams from Division IA colleges during the regular season in Fall 2000 25 . A visualization of this network is shown in Figs. 5, 6, and 7 using Cytoscape 33 , where nodes represent teams and edges mean that there exist regular season games between the pair of teams connected. These football teams belong to 12 conferences, which we use as reference ground-truth communities. Every conference contains around 8 to 12 teams, represented as circles of nodes in our visualization. The five teams in the middle of the figure are not involved in any of the circles. These independent teams do not belong to any conference. In principle, teams tend to play more games within conferences than between conferences, which generates more edges within each conference.
Each subfigure of Figs. 5, 6, and 7 shows the local clusters found by the proposed method, using different seed nodes. We choose one node from each conference as a seed node, and use the size of the conference as the expected size of a local cluster. We mark the seed node in green, and mark nodes found to be a member of the local cluster in red. We also highlight in red the edges between the members of the found local cluster. In most cases, as the local cluster, the proposed algorithm correctly finds the conference to which the seed node belongs.  www.nature.com/scientificreports/ We also observe in some cases that the resulting local cluster does not fit very well with the reference communities labeled by conferences. In Fig. 6a, node 110 is not included in the local cluster of the seed node 57, which is reasonable if we carefully inspect the network connections of this node. There are no connections between node 110 and any other members of this conference; instead, this node forms several connections with almost all members of the conference at eight o' clock. Thus it is no wonder that node 110 is identified as part of the local cluster of seed node 49, as shown in Fig. 7b. We also note that the independent team node 42 is regarded as part of the local cluster of seed node 38, as shown in Fig. 6b. This team plays regular games with four teams from the conference at four o' clock, and with, at most, one team each from any other conference. So our algorithm identifies it as part of the detected local cluster. Moreover, as seen in Fig. 7b, two teams, nodes 28 and 58 of the conference at eight o' clock, are not included in the local cluster of seed node 49. But node 28 has no connections within this conference, while at the same time it maintains several connections outside this conference, especially to the four teams colored in the conference at three o' clock, as shown in Fig. 7c. This pattern of connections explains why this team is included in the local cluster of seed node 24. On the other hand, the team of node 58 plays regular games with only two teams within the conference at eight o' clock, but plays regular games with one or two teams from four other conferences and independent teams. Due to such an atypical connection pattern among multiple conferences, this node is not included in any local cluster in our experiment. Similar phenomena are also observed for the uncolored nodes in the conference at three o' clock, as shown in Fig. 7c, and thus these nodes are also left outside any local clusters. By contrast, the independent team of node 90 plays regular games with all the colored teams in this conference, and so is included in this local cluster.
In conclusion, our proposed local clustering algorithm perfectly reflects the local community structure around the selected seed nodes, established by regular-season-game association. In addition, our algorithm detects the lack of associations within conferences, as reflected by network connections.
YouTube social network. Our second example is a social network on the video-sharing website YouTube 34 .
In the YouTube social network, nodes represent users and edges represent friendships formed by users with each other. The whole network is composed of 1,134,890 nodes and 2,987,624 edges. In this network, users can create groups which other users can join, and these user-defined groups are considered as reference "ground-truth" communities 34 . We apply the proposed clustering method to one of these user-defined groups composed of 121 users, which has a conductance value of 0.2030. We randomly choose one of the group members as the seed node, and then find a local cluster of this seed node. The estimated local cluster is composed of 225 nodes, contains all 121 users in the user-defined group, and has a conductance value of 0.0168. The local cluster found by our algorithm is visualized in Fig. 8 using Cytoscape 33 , with members of the estimated local cluster in red. Due to limited graphical processing capabilities, we show only 5,846 nodes (about 0.5% of the total number of nodes) and the edges between these nodes that are closest to the seed node. This visualized subgraph is chosen using the subgraph sampling algorithm in Algorithm 1. The YouTube social network is relatively sparse and contains several locally dense groups, shown as small circles of nodes. The estimated local cluster contains three such dense groups, visualized as the three circles of nodes at top right. Two of these groups have obvious hubs, and other members surrounding them and connected only to them by edges. The other group, on the right, does not show such concentration and forms edges among its members uniformly. These three groups have multiple edges between them, and most of these edges are linked to the hubs. Some scattered www.nature.com/scientificreports/ nodes have many connections with these three groups, especially with the hubs of these groups, and thus these nodes are also identified as part of the local cluster.

DBLP collaboration network.
Our third example is a collaboration network on the DBLP computer science bibliography, which provides a comprehensive list of research papers in computer science 34 . In the DBLP collaboration network, nodes represent authors of these research papers, and two authors are connected by an edge if they collaborate and publish at least one research paper together. The whole network is composed of 317,080 nodes and 1,049,866 edges. The reference "ground-truth" communities are defined by publication venues, for example, journals or conferences, and authors who have published research papers in a certain journal or conference are considered to form a reference community 34 . The publication-venue group we choose is composed of 36 nodes, and has a conductance value of 0.1034. As the seed node, we randomly choose one of the group members, and then apply our proposed local clustering algorithm to find a local cluster of this seed node. The estimated local cluster is composed of 59 nodes and contains all 36 authors in the publication-venue group, and it has a conductance value of 0.0308. The local cluster found by our algorithm is visualized in Fig. 9 using Cytoscape 33 , with members of the estimated local cluster colored in red. As before, here we show only 6,342 nodes (about 2% of the total number of nodes) and the edges between these nodes that are most close to the seed node chosen using our proposed subgraph sampling algorithm. As shown in the figure, the DBLP collaboration network has a much denser connection pattern than the YouTube social network, and does not show explicit local groups. But our local clustering algorithm successfully finds a small research community at the top of the visualization, with dense collaborations among group members, and only four external links, all made by one group member. Thus our algorithm finds a research group in which all the members but one collaborate only with other group members and not with researchers outside the group.
Amazon product co-purchasing network. Our fourth example is a product co-purchasing network of Amazon users. This network was collected by crawling the Amazon website and is based on a co-purchasing www.nature.com/scientificreports/ feature, "Customers Who Bought This Item Also Bought, " of the Amazon website 34 . In the Amazon product copurchasing network, nodes represent products, and two products are connected by an edge if these two products are frequently co-purchased. The whole network is composed of 334,863 nodes and 925,872 edges. The reference "ground-truth" communities are defined by the product categories provided by Amazon. The product-category group we choose is composed of 96 products, and has a conductance value of 0.0235. The seed node is chosen as a random product in this category, and we then find a local cluster of this seed node using our proposed local clustering algorithm. The estimated local cluster is composed of 184 nodes, contains all 96 products in the product-category group, and has a conductance value of 0.0064. The local cluster found by our algorithm is visualized in Fig. 10 using Cytoscape 33 , with members of the estimated local cluster colored in red. As before, we show only 6,701 nodes (about 2% of the total number of nodes) and the edges between these nodes that are closest to the chosen seed node using our proposed subgraph sampling algorithm. As can be seen, the Amazon co-purchasing network has a giant and dense group of connections in the middle, with many branches that are relatively sparse and contain some locally dense groups. Our algorithm finds a local cluster that is part of one of these branches. The estimated local cluster contains approximately three local dense groups, with multiple edges connecting them. The estimated local cluster has only two edges connecting itself to the main part of the network, effectively separating it from the giant dense group in the middle. We also note two other dense groups on this branch, but each has only one edge connected to our estimated local cluster. As a result, these two dense groups are not regarded as part of our estimated local cluster.

Conclusion and discussion
In this paper, we introduced a local algorithm for computing approximate heat kernel PageRank vectors, and based on this algorithm we devised a novel local clustering algorithm. We restricted our computation of heat kernel PageRank vectors to within a local subgraph near the seed node, discarding simulated random walks that depart from this subgraph. As a result, the number of computations required to achieve a good approximation is no longer dependent on the size of the whole network, and is determined only by the expected size of the local www.nature.com/scientificreports/ cluster of interest. We showed that, under reasonable assumptions, the approximation error caused by restricting simulated random walks to within a local subgraph is bounded by a probabilistic guarantee. We also presented a greedy local search algorithm for finding a local subgraph near the seed node. The numbers of computations required for subgraph sampling and computing approximate heat kernel PageRank vectors are both sublinear in terms of the expected size of the local cluster of interest, and thus are both independent of the size of the whole network.
Using numerical experiments, we demonstrated that the resulting local subgraph almost certainly contains the local cluster of interest as a subgraph, as long as the nodes in this local cluster are more strongly connected internally than to nodes outside this cluster. This local cluster has a conductance value only slightly greater than that of the local cluster of interest. Putting these results together, we then argued that, by performing a sweep on the proposed approximate heat kernel PageRank vector, our local clustering algorithm can find a local cluster having a small conductance value with high probability. Then we tested our local clustering algorithm on benchmark networks with well-established ground-truth communities, and on a football network with reference communities. In both cases, our local clustering algorithm achieved state-of-the-art performance, and the local clusters it found aligned well with ground-truth communities. Finally, we applied our local clustering algorithm to massive real-world networks with reference communities labeled using side information. Specifically, our algorithm provided insight into the online groupings of a video-sharing network, the collaboration patterns of a group of researchers, and the co-purchasing habits of Amazon users. In addition, our proposed approximate computation of the heat kernel PageRank is based on Monte Carlo simulation of random walks, and thus can be transformed into distributed algorithms in practical settings. www.nature.com/scientificreports/ outside of its community C ′ , and thus the probability of this random walk leaving C ′ at this step is µ u . There are v / ∈C ′ µ v d v edges that this random walk can choose if it leaves C ′ , and among them |∂(C)| edges lead to our community of interest C. Denote by ν u the probability that this random walk enters C from node u, and we have where the approximation becomes exact when the network is large enough. Then the average probability that a random walk enters C, denoted as ν , is The combined random variable corresponding to random walks with length at most K that leaves the cluster C exactly once and then comes back exactly once is given as where t 1 , . . . , t j , t j+1 are nodes in C that are visited in a k-step random walk and s 1 , . . . , s k−2−j , s k−1−j are nodes outside C that are visited in a k-step random walk. Then the expected contribution to ρ(K) t,s (v) from random walks that leave C once and then enter C once is Now consider the more general case that random walks leave the cluster C for exactly ℓ times, enter the cluster C for exactly ℓ times, and end at node v, where ℓ is a positive integer less than k/2. The random variable corresponding to these random walks is thus Similarly, the expected contribution to ρ(K) t,s (v) from these random walks can be written as Define Z v as ℓ Z v ℓ , and then the expected contribution to ρ(K) t,s (v) of all random walks that first leave the cluster C and then come back and end at node v is Noting that 1 − µ , 1 − ν , µ are all less than 1, we have where the last inequality comes from K 2 ν/2 < ǫ and K = log ǫ −1 log log ǫ −1 . Suppose that we perform r > 16 ǫ 3 log ς simulated random walks, where ς is the volume of the sampled subgraph S. For every component with ρ(K) t,s (v) > ǫ , by the third part of Lemma 1, we have (1 − µ t 1 ) · · · (1 − µ t j )(1 − ν s 1 ) · · · (1 − ν s k−2−j )µ t j+1 ν s k−1−j , (1 − µ) j (1 − ν) k−2ℓ−j µ ℓ ν ℓ . (37) (1 − µ) j (1 − ν) k−2ℓ−j µ ℓ ν ℓ .
(38) We note that our approximate PageRank vector equals ρ t,s (v) = ρ(K) t,s (v) − 1 r Z v . Thus for any node v in the local cluster C with ρ(K) t,s (v) > ǫ , we have with probability at least 1 − ς −4/ǫ , and for any node v in the local cluster C with ρ(K) t,s (v) ≤ ǫ , we have with probability at least 1 − ς −4/ǫ 2 . Thus, combined with (18), for each node v in the local cluster C, with probability at least 1 − max(ǫ, ς −4/ǫ ) , which completes our proof.

B Proof of Theorem 2. Proof
Let C t be the set of nodes satisfying the inequality in (24). Now rearranging this inequality, we get and further rearrangement of the right-hand side yields Then by the assumption in (26), we have and thus Then by the assumption in (25), we have Applying that fact that � * (C) < φ and rearranging (48) yields Thus a sweep over the approximate PageRank vector ρ t,s finds a subset with Cheeger ratio at most O( √ φ) as long as s is in C t .