Estimation and update of betweenness centrality with progressive algorithm and shortest paths approximation

Betweenness centrality is one of the key measures of node importance in a network. However, it is computationally intractable to calculate the exact betweenness centrality of nodes in large-scale networks. To solve this problem, we present an efficient CBCA (Centroids based Betweenness Centrality Approximation) algorithm based on progressive sampling and shortest paths approximation. Our algorithm first approximates the shortest paths by generating network centroids according to the adjacency information entropy of the nodes; it then constructs an efficient error estimator using Monte Carlo Empirical Rademacher averages to determine a sample size that balances efficiency and accuracy; finally, we present a novel centroid updating strategy based on network density and clustering coefficient, which can effectively reduce the computational burden of updating shortest paths in dynamic networks. The experimental results show that our CBCA algorithm can efficiently output high-quality approximations of the betweenness centrality of a node in large-scale complex networks.


Nan Xiang 1,2,3* , Qilin Wang 1 & Mingwei You 1
Network analysis 1 is a technique to investigate the structure and properties of networks, and one of its important tasks is to calculate the centrality of a node in the network [2][3][4] , which measures how connected or influential a node is within the network. Common centrality measures are degree centrality 5 , betweenness centrality 6 , closeness centrality 7 , etc. The betweenness centrality of a node has many applications in various domains, such as identifying critical nodes in transportation networks 8 , detecting essential proteins in protein networks 9 , and improving clustering 10 and community detection algorithms 11 .
Betweenness centrality (BC) measures the importance of a vertex or an edge based on the shortest paths in a graph (i.e., a vertex or an edge with higher BC appears more frequently on the shortest paths in the graph). Several exact algorithms for computing betweenness centrality have been proposed [12][13][14][15][16] , among which Brandes' algorithm 12 is representative: it uses the single-source shortest paths (SSSP) idea to optimize the computation. The time complexity of this algorithm is O(nm + n² log n) for weighted graphs and O(nm) for unweighted graphs, where n and m are the numbers of vertices and edges in the graph, respectively.
However, exact algorithms are infeasible for large-scale networks 17 due to their increasing size, and research emphasizes the ordering of nodes over exact values. Hence, some sampling-based approximation algorithms have emerged [18][19][20][21][22][23][24][25][26][27][28][29][30] , which can generate a betweenness centrality approximation with high probability (1 − δ) and a bounded maximum deviation, satisfying an ε-approximation 31 . However, this category of approximation algorithms faces several challenges: determining the smallest sample size that can represent the global distribution of the parameters; selecting samples that can better estimate the parameter distribution; and adapting to dynamic changes in the network. Researchers have made many attempts to address these problems. For example, Brandes 29 proposed the bp algorithm based on Hoeffding's inequality and the union bound, but this algorithm depends too heavily on the number of nodes n, resulting in excessive running overhead. Riondato et al. 22 then proposed the rk algorithm based on Vapnik-Chervonenkis (VC) dimension theory, which bounds the sample size so that it no longer depends on the number of nodes in the graph but on the diameter of the graph (i.e., the number of nodes on the longest shortest path). Nonetheless, the VC dimension does not yield the best ε-approximation for a given δ; therefore, Riondato et al. 20 proposed the ab algorithm based on Rademacher averages 32 and progressive sampling 33 .

Graphs and betweenness centrality
Let G = (V, E) be a graph, either undirected or directed, where each edge has a non-negative weight. We denote by n = |V| the number of nodes in the graph. Consider any pair of distinct nodes (u, v) in the graph, with u ≠ v. Let σ_uv denote the number of shortest paths between node u and node v, and let σ_uv(w) be the number of those shortest paths that pass through w. For convenience, we abbreviate shortest paths as SPs.
Given a graph G = (V, E), the normalized betweenness centrality b(w) of a node w ∈ V is defined as:

b(w) = (1 / (n(n − 1))) Σ_{(u,v)∈V×V, u≠v} σ_uv(w) / σ_uv
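As a concrete reference point, the normalized definition above can be computed exactly with Brandes' single-source accumulation. A minimal sketch for unweighted, undirected graphs, with the graph given as a dict of neighbor sets:

```python
from collections import deque

def normalized_betweenness(adj):
    """Exact normalized betweenness centrality b(w) for an unweighted,
    undirected graph adj = {node: set(neighbors)}, using Brandes'
    accumulation over single-source shortest paths."""
    nodes = list(adj)
    n = len(nodes)
    bc = {v: 0.0 for v in nodes}
    for s in nodes:
        # BFS from s: count shortest paths (sigma) and record predecessors
        sigma = {v: 0 for v in nodes}; sigma[s] = 1
        dist = {v: -1 for v in nodes}; dist[s] = 0
        preds = {v: [] for v in nodes}
        order, q = [], deque([s])
        while q:
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        # back-propagate dependencies in reverse BFS order
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # normalize by the number of ordered pairs (u, v), u != v
    return {v: bc[v] / (n * (n - 1)) for v in nodes}
```

For a path graph 1-2-3, only the ordered pairs (1, 3) and (3, 1) route through node 2, so b(2) = 2/6 = 1/3 and the endpoints have centrality 0.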

Rademacher averages
Rademacher averages 36 are a core notion of statistical learning theory 53 and allow measuring the convergence speed of sample means to their expectations. More detailed information on Rademacher averages can be found in 35,36 . Define a finite domain P and a uniform distribution µ over the elements of P. Let F be a family of functions from P to [0,1], and let S = {S_1, ..., S_m} be a set of m independent identically distributed samples drawn from P under µ. The sample average of a function f on S and its expectation are, respectively:

ρ_S(f) = (1/m) Σ_{i=1}^{m} f(S_i),    ρ_µ(f) = E_µ[f],

and E[ρ_S(f)] = ρ_µ(f), i.e., ρ_S(f) is an unbiased estimator of ρ_µ(f). Now, for a given S, we focus on an upper bound on the supremum deviation S(F, S) of ρ_S(f) from ρ_µ(f) over all f ∈ F:

S(F, S) = sup_{f∈F} |ρ_S(f) − ρ_µ(f)|.

The supremum deviation S(F, S) is the central notion in the study of empirical processes. The Empirical Rademacher Average (ERA) R(F, S) of F on S is the quantity that yields a sample-dependent upper bound on S(F, S), taking the data distribution into account. Let λ = (λ_1, ..., λ_m) be a collection of m independent identically distributed (i.i.d.) Rademacher random variables, each taking the values −1 and 1 with equal probability 1/2. The ERA is:

R(F, S) = E_λ[ sup_{f∈F} (1/m) Σ_{i=1}^{m} λ_i f(S_i) ].

However, R(F, S) is difficult and costly to compute. Monte Carlo estimation 35 gives an efficient way to obtain sharp probabilistic bounds on the ERA. For any k ≥ 1, let λ ∈ {−1, +1}^{k×m} be a k × m matrix of i.i.d. Rademacher random variables. The k-Trials Monte Carlo Empirical Rademacher Average (k-MCERA) R^k_m(F, S, λ) of F on S using λ is:

R^k_m(F, S, λ) = (1/k) Σ_{j=1}^{k} sup_{f∈F} (1/m) Σ_{i=1}^{m} λ_{j,i} f(S_i).

The ERA R(F, S) is the expectation of the k-MCERA R^k_m(F, S, λ), which controls the probability bound on S(F, S). It is not efficient to use the
Rademacher averages to obtain sharp probability bounds directly; instead, the k-MCERA strikes a better balance between sample size and accuracy, because it can directly estimate the supremum deviation of the function family in a data-dependent way. We define the empirical wimpy variance α as:

α = sup_{f∈F} (1/m) Σ_{i=1}^{m} f(S_i)².

Before stating Theorem 2.2, we need an upper bound on the wimpy variance (Theorem 2.1). Theorem 2.2 then shows how to use the k-MCERA to bound the supremum deviation S(F, S) using only sample-dependent quantities.

Theorem 2.1 gives an upper bound on the wimpy variance for any family F of functions from the domain P to [0,1], where µ is a uniform probability distribution over the elements of P. The proof is given in Online Appendix A.1; it leverages the properties of betweenness centrality in large-scale networks and the basic variance formula.

Theorem 2.2 For k, m ≥ 1, let F be a family of functions from P to [0,1]. Let λ ∈ {−1, +1}^{k×m} be a k × m matrix of Rademacher random variables, independent and equal to −1 or +1 with probability 1/2 each. Let S be a sample of size m drawn i.i.d. from P under the distribution µ. For δ ∈ (0, 1), with probability at least 1 − δ over the choice of S and λ, the supremum deviation S(F, S) is bounded by a quantity computed from R^k_m(F, S, λ) and α (Eq. (2)). The proof of Theorem 2.2 is given in Online Appendix A.2, using the self-bounding function 54 and the symmetrization inequality 55 , as well as the substitution theorem.

Observing Theorem 2.2, we see that α is the dominant factor in the bound and controls the supremum deviation S(F, S). Thus, we can achieve a better balance between sample size and accuracy, obtaining a uniform variance bound for most families of functions. This is why the k-MCERA outperforms rk and bp. The rk algorithm uses VC dimension theory to obtain an upper limit on the sample size; this bound is data-independent and depends only on the properties of the graph itself, so it does not exploit data dependence. The bp algorithm uses Hoeffding's inequality and the union bound, which results in an excessive sample size. Both methods require a large number of samples to guarantee a high-quality approximation, so their sample sizes are suboptimal, while the k-MCERA captures the relationship between sample size and accuracy very well.
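The k-MCERA definition above is directly computable from the sampled function values. A minimal sketch, assuming the family F is given as a mapping from each function to its values on the m samples:

```python
import random

def k_mcera(sample_values, k=25, rng=random):
    """k-Trials Monte Carlo Empirical Rademacher Average.
    sample_values: dict {f: [f(S_1), ..., f(S_m)]} giving each function's
    values on the m samples.
    Returns (1/k) * sum_j sup_f (1/m) * sum_i lambda_{j,i} * f(S_i)."""
    m = len(next(iter(sample_values.values())))
    total = 0.0
    for _ in range(k):
        # one row of i.i.d. Rademacher variables in {-1, +1}
        lam = [rng.choice((-1, 1)) for _ in range(m)]
        # supremum over the function family of the signed sample mean
        total += max(sum(l * v for l, v in zip(lam, vals)) / m
                     for vals in sample_values.values())
    return total / k
```

If F contains the all-zero function, each trial's supremum is at least 0, so the estimate lies in [0, 1] for functions with range [0, 1]; this is a sketch, not the paper's exact implementation.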

Network density and clustering coefficient
In this section, we focus on the dynamic changes of complex networks.
We know that dynamic networks change through the addition or deletion of nodes and edges, which requires updating and maintenance. However, not all additions and deletions of nodes or edges have a significant impact on the network, and we can ignore minor changes while pursuing approximate estimates. According to [56][57][58] , network density and clustering coefficient are closely related to the power-law property and small-world property of complex networks, which are the key assumptions of our algorithm. Therefore, we choose network density and clustering coefficient as the two parameters that reflect the features of the network, and consider the network to have changed measurably when their changes exceed a certain threshold value. The centroid is a critical factor that affects the efficiency of the whole algorithm.
We present a strategy to update the network centroid based on the network density and clustering coefficients to detect the dynamic changes, thus avoiding a large amount of updating time.
Network density 59 : it characterizes how densely the nodes of a network are interconnected, and is defined as the ratio of the number of edges present in the network to the upper limit of the number of edges the network can accommodate. It is commonly used to measure the intensity of social relationships and evolutionary trends in online social networks. For an undirected graph G = (V, E) with n nodes and m edges, the network density is defined as:

ρ(G) = m / (n(n − 1)/2) = 2m / (n(n − 1)).

Clustering coefficient 60,61 : it quantifies how densely nodes form cliques in a graph. There is evidence 62 that nodes tend to create tightly bound groups in most real-world networks, especially social networks. Clustering coefficients are divided into a global clustering coefficient and a local clustering coefficient. We choose the local clustering coefficient as one of the parameters to reflect the features of the network, because it can capture the local structural changes of nodes and their neighbors in dynamic networks.
For G = (V, E), let V = (v_1, v_2, ..., v_n) denote the collection of vertices and E = {e_ij : (i, j) ∈ U ⊂ [1, ..., n]²} the collection of edges (e_ij denotes the edge connecting vertices v_i and v_j). Denote by L(i) the set of neighbors of vertex v_i, L(i) = {v_j : e_ij ∈ E ∨ e_ji ∈ E}, and let z_i be the degree of node i. The local clustering coefficient of vertex v_i in an undirected graph is:

C_i = 2 |{e_jk : v_j, v_k ∈ L(i), e_jk ∈ E}| / (z_i(z_i − 1)).

The local clustering coefficient of vertex v_i in a directed graph is:

C_i = |{e_jk : v_j, v_k ∈ L(i), e_jk ∈ E}| / (z_i(z_i − 1)).

To facilitate the reader's understanding of the article, we provide explanations of the main parameters in Table 1.
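The two change-detection parameters are cheap to compute. A minimal sketch for undirected graphs, following the density ratio and the local clustering coefficient described above:

```python
def network_density(n, m, directed=False):
    """Edges present divided by the maximum possible number of edges."""
    possible = n * (n - 1) if directed else n * (n - 1) / 2
    return m / possible

def local_clustering(adj, i):
    """Local clustering coefficient of node i in an undirected graph
    adj = {node: set(neighbors)}:
    C_i = 2 * (# edges among i's neighbors) / (z_i * (z_i - 1))."""
    nbrs = adj[i]
    z = len(nbrs)
    if z < 2:
        return 0.0
    # each neighbor-neighbor edge is seen from both endpoints, so halve it
    links = sum(1 for u in nbrs for v in adj[u] if v in nbrs) / 2
    return 2 * links / (z * (z - 1))
```

A triangle has density 1 and clustering coefficient 1 at every node, while the middle node of a path graph has clustering coefficient 0.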

Methods
This section shows our proposed CBCA algorithm, which is an efficient approximation algorithm based on progressive sampling and shortest paths approximation.
Progressive sampling is a technique that gradually increases the sample size until a desired accuracy is achieved.It allows us to avoid over-sampling or under-sampling the network, and adapts to the dynamic changes of the network.Shortest paths approximation is a technique that uses the network centroids to estimate the shortest paths between nodes.It allows us to reduce the computational complexity and memory requirement of the algorithm.
We first introduce, in section "Shortest paths approximation based on network centroids", the basic process and related results for approximating the shortest paths based on the network centroids, which allows the betweenness centrality values to be computed efficiently once the shortest paths have been approximated; the data-dependent bounds for these results are provided in Theorem 2.2. Then, we describe the specific steps and parameters of the CBCA algorithm in section "CBCA algorithm description and analysis", which uses this improved bound to obtain high-quality betweenness centrality approximations with high probability and ensures, via progressive sampling, that the betweenness centrality values of all nodes are within the additive error ε. The detailed theoretical and experimental results regarding the initial sample size, the selection of the sample schedule, and the updating of the network centroids are given in the experimental sections.

Shortest paths approximation based on network centroids
Common shortest paths algorithms do not exploit the properties of complex networks, i.e., the scale-free characteristic (power-law property) and the small-world property. We leverage these two properties to build a shortest paths approximation method based on network centroids. A small-world network has two important properties: a high clustering coefficient and a short average shortest-path length. The scale-free characteristic means that few nodes have a high degree while most nodes have a low degree. We can therefore naturally select some nodes with high degree or high adjacency information entropy as the centroids, and divide the nodes near or adjacent to the centroids into subgroups. Within these subgroups, we use a shortest paths algorithm to calculate the distance between each node and its centroid. This reduces the search space and time while improving the approximation accuracy. Therefore, the number and quality of the centroids are crucial.

Centroids screening strategy
We use VC dimension theory from statistical learning to determine the number t of centroids, and we refer the reader to 32 for more details on the VC dimension theory:

t = c(⌊log₂(D − 2)⌋ + 1 + ln(1/δ))
where t is the number of centroids, c takes the value 1/3, D is the diameter of the graph, and δ ∈ (0, 1). The equation is derived from Riondato 22 , but c does not carry the meaning of the coefficient used there; the two are unrelated. The reason for employing this equation is the data independence of the VC dimension: the number of centroids does not depend on the number of nodes, but only on the diameter of the graph (i.e., the maximum length of a shortest path). This reduces t and speeds up the approximation. At the same time, the value of c affects that of t. We need to trade off the number of centroids against the ratio T₁/T₀ (where T₁ is the shortest paths cost after approximation and T₀ is the shortest paths cost without approximation) to get an acceptable value. According to our experiments, the quality of the centroids is best when c = 1/3 and t = 2 or 3. This range of values facilitates the calculation of the shortest paths approximation. Furthermore, Eq. (14) requires the diameter D of the graph. One way to compute the exact value of D is to solve the shortest paths problem between all pairs of nodes. However, this exact calculation of the diameter is undesirable, because it requires a time complexity of O(n³), which defeats our experimental purpose. Given the short average path length of complex networks and the data-independent property of the VC dimension, we can reason that the error in the diameter has little or even negligible effect on the result. Thus, in this paper we adopt different approximate diameter calculation methods for directed and undirected graphs, respectively.
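The centroid-count formula is a one-liner; a sketch using the values quoted later in the paper (c = 1/3, δ = 0.1, D = 10, giving t ≈ 2.1, i.e., two centroids):

```python
import math

def num_centroids(D, delta, c=1/3):
    """t = c * (floor(log2(D - 2)) + 1 + ln(1/delta)),
    the VC-dimension-based number of centroids; depends only on the
    diameter D and the confidence parameter delta, not on n."""
    return c * (math.floor(math.log2(D - 2)) + 1 + math.log(1 / delta))
```

With D = 10 and δ = 0.1 this yields t ≈ 2.1, which the paper rounds down to two centroids.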
Approximate diameter methods: 1. Let G = (V, E) be an undirected graph with all edges having equal weights. Choose a vertex u ∈ V uniformly at random from the graph and compute the shortest paths from u as the source to all other nodes. We approximate the diameter D as the sum of the two largest shortest-path distances from u, to two other distinct nodes w, v. 2. For a directed graph with weights, we note that D is not necessarily equal to the longest among the shortest paths between all pairs of nodes, because the edge weights may affect the distance sums. This makes the calculation of the approximate diameter more complicated. We can use the size of the maximum weakly connected component as an upper bound on the diameter D.
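Method 1 above needs only one BFS. A minimal sketch for unweighted, undirected, connected graphs:

```python
import random
from collections import deque

def approx_diameter(adj, rng=random):
    """Approximate diameter (method 1): BFS from a random source u and
    return the sum of the two largest shortest-path distances from u.
    The result is at least the true diameter (by the triangle inequality
    through u) and at most twice the eccentricity of u."""
    u = rng.choice(list(adj))
    dist = {u: 0}
    q = deque([u])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    far = sorted(dist.values(), reverse=True)
    return far[0] + (far[1] if len(far) > 1 else 0)
```

On a 5-node path graph (true diameter 4), any choice of source yields a value between 4 and 8.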

Centroids quality screening strategy
We use the adjacency information entropy formula to effectively filter out high-quality centroids based on the small-world and scale-free characteristics of complex networks. Commonly used information entropy formulas are 50,63,64 . In this paper, we choose a popular adjacency information entropy formula 50 and give a reasonable explanation.
The undirected unweighted graph G(V, E): the node degree can be obtained by h_i = Σ_{j=1}^{l} o_ij, where l is the number of neighbors of node i, j is a neighbor of node i, and o_ij is equal to 1 if an edge exists between node i and node j, otherwise 0. The undirected weighted graph G(V, E): the node degree can be obtained by h_i = Σ_{j=1}^{l} w_ij, where l is the number of neighbors of node i, j is a neighbor of node i, and w_ij represents the weight of the edge between node i and node j.
Degrees in a directed graph are divided into in-degrees and out-degrees, and we consider two cases, directed weighted and directed unweighted, as follows.

Directed weighted graph G(V, E): the in-degree and out-degree of node i can be obtained as h_i^in = Σ_{j∈ξ_i} w_ji and h_i^out = Σ_{j∈ξ_i} w_ij, where ξ_i is the set of the neighbors of node i.
When discussing the influence of nodes, we need to distinguish between directed and undirected graphs. In a directed graph, each node has an in-degree and an out-degree, which indicate the number of edges pointing to and from that node, respectively. In an undirected graph, each node has a single degree, which indicates the number of edges connected to that node. Thus, we introduce an influence factor ζ to measure the influence of a node. It is a constant between 0 and 1 that regulates the contribution of the in-degree and out-degree to the influence. Usually, ζ = 0.7 is a reasonable choice. In an unweighted graph, the calculation of influence is relatively simple: multiply the in-degree of the node by ζ and add the out-degree multiplied by 1 − ζ, i.e., I_i = ζ h_i^in + (1 − ζ) h_i^out. In a weighted graph, the calculation of influence is more complicated, since it must consider the weights and directions of the edges; the directed unweighted and directed weighted cases are handled analogously using the corresponding degrees. Summing over the neighborhood, we can easily obtain A_i = Σ_{j∈ξ_i} k_j if the graph is undirected.
Definition 3 (Selection probability P_ij). We define the selection probability of node i, i.e., the probability of node i being selected by its neighbor j, as:

P_ij = I_i / A_j,

where, in the undirected case, the influence I_i reduces to the degree k_i. Definition 4 (Adjacency information entropy). This is the key technical formula used in this paper to filter the t network centroids so as to exploit the properties of complex networks, matching our assumptions; the entropy of node i aggregates the selection probabilities over its neighborhood:

E_i = − Σ_{j∈ξ_i} P_ji log₂ P_ji.

By the above means, we can reasonably screen out high-quality centroids via the adjacency information entropy.
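For the undirected unweighted case, the screening rule can be sketched as follows. This assumes the standard adjacency-entropy form, with A_i the degree sum over i's neighborhood and the selection probabilities k_j / A_i forming a distribution over i's neighbors; the exact formulas in reference 50 may differ in detail:

```python
import math

def adjacency_entropy(adj):
    """Adjacency information entropy of each node in an undirected,
    unweighted graph adj = {node: set(neighbors)}."""
    k = {v: len(adj[v]) for v in adj}
    # A_i: total degree of i's neighborhood
    A = {v: sum(k[u] for u in adj[v]) for v in adj}
    ent = {}
    for i in adj:
        e = 0.0
        for j in adj[i]:
            p = k[j] / A[i]          # probability that i selects neighbor j
            e -= p * math.log2(p)
        ent[i] = e
    return ent

def pick_centroids(adj, t):
    """Pick the t nodes with the highest adjacency information entropy."""
    ent = adjacency_entropy(adj)
    return sorted(ent, key=ent.get, reverse=True)[:t]
```

On a 4-leaf star graph the hub has entropy log₂ 4 = 2 while each leaf has entropy 0, so the hub is chosen as the centroid, matching the intuition that high-degree nodes make good centroids.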

Subgraphs construction for the shortest paths approximation
In this section, we describe in detail how the subgraphs are generated and how the centroid nodes are chosen, taking an undirected unweighted graph as an example; the same method applies to the other types of graphs.
Step 1 We use an adjacency list to represent the graph structure, so one Breadth-First Search (BFS) traversal gives us the degrees and weights of all nodes, as shown in Fig. 1a.
Step 2 While performing the BFS traversal, we obtain the diameter via the approximate diameter calculation method in section "Centroids screening strategy" (i.e., D = 10). In addition, we use the methods for screening the number and quality of centroids described in sections "Centroids screening strategy" and "Centroids quality screening strategy". The number of centroids t is determined from the diameter of the graph and a given confidence parameter δ via t = c(⌊log₂(D − 2)⌋ + 1 + ln(1/δ)), where c = 1/3 and δ = 0.1. We get t = 2.1, so there are two centroids. Then we use the adjacency information entropy formula to obtain the two high-quality centroid nodes, as shown in Fig. 1b.
Step 3 The nodes surrounding the centroids can be easily obtained by traversing the neighboring nodes one at a time, as in Fig. 2a.Then continue to traverse the neighboring nodes and finally a complete subgraph can be obtained as shown in Fig. 2b.
Eventually, we get the complete graph with two centroids by the above method.
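Steps 1-3 amount to letting each centroid claim the nodes nearest to it. A minimal sketch using a multi-source BFS, with ties broken by visiting order (a simplification of the figure-based walkthrough above):

```python
from collections import deque

def build_subgroups(adj, centroids):
    """Assign every node of an unweighted graph adj = {node: set(neighbors)}
    to its nearest centroid via multi-source BFS, yielding the subgroups
    used for the shortest paths approximation. Also returns each node's
    BFS distance to its owning centroid."""
    owner = {c: c for c in centroids}
    dist = {c: 0 for c in centroids}
    q = deque(centroids)
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in owner:           # first centroid to reach w claims it
                owner[w] = owner[v]
                dist[w] = dist[v] + 1
                q.append(w)
    groups = {c: [] for c in centroids}
    for v, c in owner.items():
        groups[c].append(v)
    return groups, dist
```

On a 6-node path with centroids at the two ends, each centroid claims its nearer half of the path.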
We now explain in detail how to perform the shortest paths approximation. According to section "Shortest paths approximation based on network centroids", we can reasonably infer that most of the shortest paths pass through the centroids of the network. Eqs. (22) and (23) can effectively reduce the influence of paths that do not pass through the centroids on the sampling and estimation of the betweenness centrality. The experimental section "Different types of networks" verifies our theoretical hypothesis.

CBCA algorithm description and analysis
In this section, we present the CBCA algorithm, based on the contributions of section "Shortest paths approximation based on network centroids", for computing a strict approximation of the betweenness centrality of all nodes in the graph. We first describe, in section "Effective estimator description and analysis", the effective estimators that satisfy the ε-approximation, an important component of the CBCA algorithm for estimating the betweenness centrality. Then we describe the CBCA algorithm in section "Our CBCA algorithm flow".

Effective estimator description and analysis
Our CBCA algorithm takes as input a graph G = (V, E), which can be directed or undirected and can have non-negative edge weights, together with two parameters ε, δ ∈ (0, 1). It outputs a set B̂ = { b̂(w), w ∈ V } such that, with probability at least 1 − δ, B̂ is an ε-approximation of the betweenness centrality values {b(w), w ∈ V}. Let P = {(u, v) ∈ V × V, u ≠ v} be the collection of all distinct node pairs. For each node w ∈ V, let f_w be a function from P to [0,1].
To improve computational efficiency, we use an estimator defined over the set of node pairs (u, v) sampled independently and uniformly from P, where t_i, t_j refer to the centroids associated with u_i, u_j, respectively. From this we define the estimator of b(w), whose expectation equals b(w).

Our CBCA algorithm flow
Our CBCA algorithm is based on a progressive algorithm, i.e., an algorithm that checks whether a certain stopping condition is satisfied after each iteration: if it is satisfied, the final result is output; if not, the iteration continues. The goal of the CBCA algorithm is to estimate an approximation b̂(w) of the betweenness centrality b(w) of each node w in the graph and to output the approximate set B̂ = { b̂(w), w ∈ V }.
To improve efficiency, the CBCA algorithm does not iterate over all possible node pairs, but samples node pairs into a set S_i, which is used to estimate b(w); here m_i = |S_i| denotes the size of S_i. It is therefore important to choose a suitable stopping condition, which affects both the accuracy and the speed of the CBCA algorithm, especially for large graphs, where the computational cost of each sample is high. The input parameters of the CBCA algorithm are: a graph, a failure probability δ ∈ (0, 1), the number k of trials for the k-MCERA, a user-specified error ε, a sample schedule, and a suitable initial sample size m₀. The output is a pair (B̂, ε̂), where B̂ is a set of pairs (v, b̂(v)) for each v ∈ V, b̂(v) is the estimate of b(v), and ε̂ ∈ (0, 1) is the accuracy that is probabilistically guaranteed by Theorem 3.1.

Theorem 3.1 states the guarantee that holds with probability at least 1 − δ for the output (B̂, ε̂) of the CBCA algorithm; the proof is presented in Online Appendix A.3, using the converse method. Our CBCA algorithm computes an ε-approximation of B = {b(v), v ∈ V} by using the technique introduced in section "Rademacher averages", and can be divided into two phases.

For the number k of Monte Carlo trials, referring to Bavarian 18 and MCRapper 44 , we fixed k = 25, which shows that sharp bounds can be obtained even with a small number of Monte Carlo trials; we also experimented with k = 50, 100, and 200 and found no improvement over k = 25. We ran all the algorithms 5 times and report Avg ± stddev, where stddev is the standard deviation of a single measurement.
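The progressive loop can be sketched as follows. This is a skeleton under stated assumptions, not the paper's exact procedure: the helper names `sample_pairs` (draws node pairs and updates the BC estimates) and `estimate_bound` (the sample-dependent deviation bound of Theorem 2.2) are hypothetical, and the failure probability is split geometrically across iterations as one common union-bound choice:

```python
def progressive_bc(sample_pairs, estimate_bound, schedule, eps, delta):
    """Progressive sampling skeleton: grow the sample along the schedule,
    recompute the k-MCERA-based error bound, and stop once it is <= eps.
    - sample_pairs(m): draws m new node pairs
    - estimate_bound(samples, delta_i): bound on the supremum deviation
    Returns the final sample and the achieved error bound eps_hat."""
    samples = []
    eps_hat = float("inf")
    for i, m_i in enumerate(schedule):
        samples += sample_pairs(m_i - len(samples))    # grow to size m_i
        # split the failure probability delta across iterations
        eps_hat = estimate_bound(samples, delta / (2 ** (i + 1)))
        if eps_hat <= eps:                             # guarantee achieved
            break
    return samples, eps_hat
```

For instance, with a mock bound of 1/m and schedule [10, 100], a target ε = 0.05 fails at m = 10 (bound 0.1) and succeeds at m = 100 (bound 0.01).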

Sample size
We first show a comparison of the required sample sizes on different datasets in Fig. 3. Figure 3a illustrates the ratio of the sample sizes required by Silvan and CBCA to achieve a high-quality approximation. First, for the large graphs Wiki_Talk and Wiki_topcats, CBCA requires a sample size between 10.3% and 19.7% smaller than Silvan's. Although both Silvan and CBCA employ the k-MCERA technique, CBCA adopts the shortest paths approximation, which effectively reduces the time to compute the shortest paths and thus decreases the sample size. In some small graphs, such as ca-GrQc, Silvan's sample size is smaller than CBCA's; this is because CBCA relies on the graph diameter, while Silvan uses an empirical peeling technique to reduce the required sample size. Finally, the CBCA algorithm performs better on large graphs, which is consistent with our experimental objective.
As shown in Fig. 3b, the CBCA algorithm requires much smaller sample sizes than the rk algorithm: the difference is up to an order of magnitude, and at least several times. This is closely related to the different methods they use to reduce the sample sizes. The rk algorithm relies on the VC dimension, which can guarantee a high-quality approximation, but it only considers the diameter of the graph and ignores other features, leading to an overly conservative sample size. Our CBCA algorithm instead employs the state-of-the-art k-Monte Carlo trials technique to estimate the maximum correlation between the nodes, with sharp variance-aware probability tail bounds. This effectively reduces the sample size while providing better guarantees.
In conclusion, the CBCA algorithm can better obtain the minimum number of samples required for a high-quality approximation, which illustrates the importance of our shortest paths approximation and k-Monte Carlo trials techniques.

In Fig. 4, we can observe that the running time increases linearly with the sample size, because the main time-consuming step of the algorithm is sampling, which accounts for more than 90% of the running time. Similarly, from Fig. 4b,c, both the number of samples and the running time decrease as the error ε increases, because a larger error ε reduces the number of samples required. This also demonstrates that our algorithm does not waste time on calculating the centroids, but focuses on the useful work.
We can see the ratio of the times required to achieve the approximation in Fig. 5a. For the large graphs Wiki_Talk and Wiki_topcats, CBCA performs better than Silvan for different errors ε, saving 7.1% to 15.2% of the time. This is because the algorithm spends more than 90% of its time on sampling: our CBCA algorithm uses the shortest paths approximation, which speeds up the sampling, while Silvan uses bidirectional BFS traversal, which increases the sampling time. When the graph is small, such as ca-GrQc, Silvan's sampling is faster than CBCA's. The results show that CBCA's running time is shorter on large graphs.
As shown in Fig. 5b, the CBCA algorithm can be two to three orders of magnitude faster than rk on large graphs and one order of magnitude faster even on small graphs. This is attributed to the improvement of CBCA based on the shortest paths approximation, which reduces the running time, as well as to the sharp wimpy-variance technique for obtaining high-quality approximations, which reduces the number of samples.

Accuracy
We show a comparison of the absolute errors on the four datasets in Fig. 6.
In this section, we discuss the accuracy of the algorithms introduced in section "Preliminaries". As shown in Fig. 6a-d, the CBCA algorithm can always guarantee that all nodes satisfy the ε-approximation with probability 1 − δ. Moreover, the computed absolute error is smaller than ε, even by an order of magnitude, which indicates that the algorithm performs better than the theoretical guarantee; this is due to the use of the sharp variance-aware tail bounds. Compared with CBCA, the errors of both the bp and rk algorithms are smaller. This is because the VC dimension theory used by rk only considers the diameter of the graph, without a more detailed understanding of the other structural distributions of the graph; this results in too many samples, making it better in terms of accuracy guarantees. Moreover, the error of bp is minimal because the algorithm uses Hoeffding's inequality and the union bound. Although both algorithms can satisfy the ε-approximation with a low error and probability 1 − δ, they consume a lot of running time: increasing the number of samples to improve the accuracy sacrifices the running time.

Different types of networks
We generate ER random networks 65 , WS small-world networks, and BA scale-free networks with 3000, 6000, 9000, 12,000, and 15,000 nodes (i.e., five graphs with random network properties, five with the small-world property, and five with the power-law property, respectively).
Our shortest paths approximation theory is based on two properties of complex networks: the small-world property and the power-law property. To verify this theory, we apply the shortest paths approximation to a random network. The clustering coefficient of random networks is small, while that of small-world networks is large; random networks do not have power-law degree distributions, while scale-free networks have a few high-degree nodes and a large number of low-degree nodes. These differences affect the effectiveness of the shortest paths approximation.

We set the threshold value THV to 0.2, 0.3, 0.4, 0.5, and 0.6. It can be seen from Table 3 that the original centroids need to be updated only when the threshold THV exceeds 0.5. This shows that our CBCA algorithm updates the centroids only if the THV is set to a value greater than 0.5, saving a lot of time.

Conclusion
In this paper, we present a novel betweenness centrality approximation algorithm based on progressive sampling and shortest paths approximation. The algorithm first uses the adjacency information entropy to generate network centroids and constructs an efficient shortest paths approximation strategy; it then uses the k Monte Carlo trials technique to trade off sample size against error and obtain a high-quality approximation of the betweenness centrality of all nodes. The algorithm can also handle dynamic networks with frequent BC changes by using a centroid updating strategy based on network density and clustering coefficients.
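The entropy-based centroid generation summarized above can be sketched as follows. The entropy formula used here (each neighbor contributes in proportion to its degree) is one common formulation of adjacency information entropy and is an assumption; the paper's exact definition appears in "Preliminaries", and the helpers `adjacency_information_entropy` and `select_centroids` are hypothetical names.

```python
import math
import networkx as nx

def adjacency_information_entropy(G, v):
    """One common formulation: neighbor j contributes p = deg(j) / (sum of
    neighbor degrees), and H(v) = -sum p * log2 p.  The paper's definition
    may differ in normalization."""
    degs = [G.degree(u) for u in G.neighbors(v)]
    total = sum(degs)
    if total == 0:
        return 0.0  # isolated node carries no adjacency information
    return -sum((d / total) * math.log2(d / total) for d in degs)

def select_centroids(G, t):
    """Pick the t nodes with the highest adjacency information entropy."""
    return sorted(G.nodes,
                  key=lambda v: adjacency_information_entropy(G, v),
                  reverse=True)[:t]

G = nx.karate_club_graph()
centroids = select_centroids(G, 3)
```

In the full algorithm, t itself is derived from the graph diameter, and shortest paths are then approximated through these centroids rather than computed exactly.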
Our experimental results show that, for the same probability guarantee, our algorithm outperforms the baseline algorithms on various networks, and that it can efficiently output high-quality approximations of the node betweenness centrality in large-scale complex networks. Our algorithm can also be applied to network analysis tasks such as identifying the most influential or central nodes in a network, which helps us understand the network's structure and function, as well as optimize its performance or resilience. However, our algorithm also has limitations and challenges: in particular, it relies on the quality of the network centroids, which may not always be optimal or representative. In future work, we therefore intend to address these issues by exploring different methods or criteria for selecting and updating network centroids, applying our algorithm to different types of networks, and applying it to other network analysis tasks.
Moreover, the network centroid and shortest paths approximation methods proposed in this paper can also be applied to other centrality measures, such as degree centrality and closeness centrality. These measures likewise reflect the importance or influence of network nodes in different respects and have many practical applications. We believe that our methods can provide an effective and flexible approximation strategy for other centrality measures, and can adapt to networks of different scales and characteristics. We also believe that the proposed methods can improve the speed and quality of data analysis and mining, for example, finding similar or dissimilar points or groups in data faster, or evaluating important or anomalous nodes and edges more accurately. These are important and valuable problems in the field of data science with many practical applications.

https://doi.org/10.1038/s41598-023-44392-0

Figure 1. (a) Initial input graph. (b) The required number of centroids is calculated from the diameter, and the adjacency information entropy is used to screen the high-quality centroids.

Figure 3. The ratio of the number of samples that can satisfy the high-quality ε − approximation. (a) The ratio of the number of samples between Silvan and CBCA. (b) The ratio of the number of samples between rk and CBCA.

Table 2. The 9 graphs, where D is the diameter of the graph (the longest shortest path); t represents the number of centroids; type A indicates a directed graph and B indicates an undirected graph.