An algorithm for discovering vital nodes in regional networks based on stable path analysis

Vital node discovery is a hotspot in network topology research. The key is to use the Internet's routing characteristics to remove noisy paths and describe the network topology accurately. In this manuscript, we propose an algorithm for discovering vital routing nodes in regional networks based on routing characteristics. We analyze the stability of multiple rounds of measurement results to overcome the path deviation of a single vantage point. Unstable paths are eliminated from the regional network constructed by probing the target area, so the pruned topology is more in line with real routing rules. Finally, we weight the edges based on the actual network's routing characteristics and discover vital nodes using the weighted degree. Unlike existing algorithms, the proposed algorithm reconstructs the network topology based on communication and transforms unweighted network connections into weighted connections, so node importance can be evaluated in a more realistic network structure. Experiments on Internet measurement data (275 million probing results collected in 107 days) demonstrate that the proposed algorithm outperforms four existing typical algorithms. Among 15 groups of comparisons in 3 cities, our algorithm found more (or the same number of) backbone nodes in 10 groups and more (or the same number of) national backbone nodes in 13 groups.

www.nature.com/scientificreports/
sampling technique to sample the data in the topology graph to reduce the calculation cost of BC and approximated the BC of vital nodes in the original topology on this basis. Dong et al. [14] proposed a localized strategy that can find vital nodes without global knowledge of the network. Sunil [15] provided a GNN-based (Graph Neural Network) inductive framework to approximate BC using the message passing mechanism. These methods are almost all based on the macro-statistical characteristics of graphs [16] and pay less attention to the routing characteristics of the actual Internet.
Unlike the above methods, another line of research takes certain characteristics of the actual Internet into account. Ulrik et al. [17] proposed Traffic Load Centrality (TLC). TLC simulates the transmission of network data packets and uses only the transmission on shortest paths to describe the load carried by a node, which in turn describes the node's importance. Linton et al. [18] proposed Flow Betweenness Centrality (FBC), which considers shortest and non-shortest paths at the same time. Under FBC, the larger the proportion of paths passing through a node among all non-repeated paths in the network, the more critical the node is. Shlomi et al. [19] combined FBC with network routing and proposed Routing Betweenness Centrality (RBC). They assumed that the routing table is known and mined vital nodes according to the number of paths connected by the target node. Leonardo et al. [20] proposed Load Centrality (LC), which mines vital nodes by calculating the expected load on routing nodes. Alain et al. [21] considered the heterogeneity of edges between nodes in real-world networks and introduced Weighted Degree Centrality to measure node importance. To address the tendency of low-degree nodes to have higher clustering coefficients, Xuefei et al. [22] proposed the Weighted Clustering Coefficient, which assesses the top-k key nodes by taking into account both a node's clustering coefficient and its degree. Methods of this kind model network characteristics such as paths, traffic and protocols to a certain extent and mine vital nodes on that basis. They account for some phenomena on the Internet, but they are still difficult to adapt to the actual network.
In view of the above problems and the difficulty of obtaining Internet routing tables, this manuscript proposes an algorithm for discovering vital nodes in regional networks based on stable path analysis. The main idea is to obtain stable paths from the vantage points to the targets through a large number of repeated, long-term probes, and to discover the vital nodes based on statistical theory. First, we deploy vantage points inside and outside the target area and obtain the path information between nodes of the target area through Internet measurement. On this basis, a preliminary topology graph is constructed. Second, we extract the stable paths of the target network from the measured path information and eliminate the unstable paths to denoise the preliminary network topology. Finally, we weight the edges according to the number of stable paths passing through adjacent nodes, and rank the nodes according to the weighting results. The main contributions of this manuscript are the following:
• We propose a network topology denoising method based on stable paths. This method can effectively reduce the data processing scale and reveal the role of stable paths in actual networks.
• We combine the edge-weighting method with stable paths, which can accurately describe the role of edges between nodes.
The rest of this manuscript is organized as follows. In section Vital nodes discovery algorithm in regional networks based on stable path analysis, we give the details of the proposed algorithm and its main steps. In section Algorithm analysis, we analyze the effect of the proposed algorithm in principle. In section Experiments, we perform experimental evaluations to quantify the benefits of our algorithm and discuss the results. Section Conclusion concludes the manuscript.

Vital nodes discovery algorithm in regional networks based on stable path analysis
The communication among nodes on the Internet is determined by routing rules, which are difficult to obtain directly. However, these routing rules can be approximated by a large number of repeated probes and statistical analyses. In addition, the massive measured data contain a large amount of path information, including the stable paths determined by the routing rules. This is similar to travel planning on a highway: the planned route is fixed when there is no congestion. Therefore, it is possible to obtain a stable path from the vantage point to the target through Internet measurement and to construct a network topology composed only of stable paths. Based on this idea, this section proposes an algorithm for discovering vital nodes in regional networks based on stable path analysis. The proposed algorithm analyzes the stability of multi-round measurement results to overcome the path deviation caused by a single measurement, eliminates unstable paths in the network, and obtains a regional network topology that is more in line with real routing rules. Unlike existing algorithms, this algorithm reconstructs the network topology based on traffic, transforms the unweighted network connections into weighted connections, and assesses node importance in a structure closer to the actual Internet. This can effectively overcome the inapplicability of traditional algorithms to non-cooperative networks whose size, node relationships and routing rules are almost unknown to us. Figure 1 illustrates its overall architecture.
The main steps of the algorithm are as follows.
Step 1: Deploy the vantage points. When only a single vantage point is used to probe the target IP, the measurement results are prone to spatial offset and chance effects. Therefore, a set of vantage points $V_V$ is selected, including $n_I$ vantage points located inside the target area $A$ and $n_O$ vantage points located outside the target area.
Step 2: Acquire the preliminary topology of the target area. First, retrieve the IP address segments $S_A$ assigned to target area $A$ from databases $D$ such as IPIP, WHOIS, and IP2location (detailed in section Experimental setup), and obtain more accurate IP address segments by intersecting the address segments from multiple data sources. Then, enumerate the IP addresses in each IP address segment to form the target IP set $V_T$. Finally, use the vantage point set $V_V$ to measure the target IP set $V_T$ with multi-round, continuous, high-frequency Internet measurement, and acquire network topology information such as paths and delays. From the measurement results, the node set $V_A$ and the edge set $E_A$ located in the target area are extracted, and the preliminary topology $G$ is constructed.
Step 3: Optimize the network topology based on stable paths. Count the paths, and find the stable paths $P_S$ in the network according to the routing rules. Then, eliminate the unstable paths from the topology, and obtain the denoised topology $G_S$ that retains only stable paths.
Step 4: Weight edges based on routing characteristics. By applying formula (5) to weight the edges of the denoised topology, we obtain a weighted topology of the target area. The weights represent the actual traffic carried by the edges.
Step 5: Identify the vital nodes in the regional network. Calculate the routing weighted degree centrality (RWDC) of each node in the topology graph, rank the nodes according to RWDC, and then identify the vital nodes in the regional network. The calculation of the Routing Weighted Degree Centrality for node $v$ is shown in formula (1).
(1) The topology denoising method based on stable paths. Among the paths between node $v_i$ and node $v_j$ (from vantage point to target IP), the path that occurs most often is regarded as the stable path. The path $P_n$ between $v_i$ and $v_j$ is denoted as:

$$P_n = (v_i, \ldots, v_m, \ldots, v_j)$$

where $v_m$ represents a node in the path $P_n$.
The set of paths between $v_i$ and $v_j$ obtained by $N$ probes in time $t$ is denoted as $P(i, j)$:

$$P(i, j) = \{P_1, P_2, \ldots, P_N\}$$

The number of occurrences of path $p$ is denoted as $N_p$. Then the stable path set $P_S(i, j)$ between $v_i$ and $v_j$ is:

$$P_S(i, j) = \{\, p \in P(i, j) \mid N_p = \max_{q \in P(i, j)} N_q \,\}$$

For ease of understanding, this section describes the process of denoising network topology based on stable paths with the following example.
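The selection rule (for each vantage point and target, keep the path observed most often) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stable_paths`, the tuple representation of a path and the sample data are all assumptions, and keeping every path that ties for the maximum count is one possible reading of the rule.

```python
from collections import Counter, defaultdict


def stable_paths(observations):
    """Map each (vantage, target) pair to its most frequently observed path(s).

    observations: iterable of (vantage, target, path), where path is a
    sequence of hop identifiers (hypothetical representation).
    """
    counts = defaultdict(Counter)
    for vantage, target, path in observations:
        counts[(vantage, target)][tuple(path)] += 1

    stable = {}
    for pair, counter in counts.items():
        top = max(counter.values())
        # Assumption: paths that tie for the highest count are all kept.
        stable[pair] = {p for p, n in counter.items() if n == top}
    return stable


# Hypothetical data: five probing rounds from vantage point VP1 to target T1.
observations = [
    ("VP1", "T1", ("VP1", "IP1", "IP2", "T1")),
    ("VP1", "T1", ("VP1", "IP1", "IP2", "T1")),
    ("VP1", "T1", ("VP1", "IP1", "IP3", "T1")),
    ("VP1", "T1", ("VP1", "IP1", "IP2", "T1")),
    ("VP1", "T1", ("VP1", "IP4", "IP3", "T1")),
]
result = stable_paths(observations)
```

In this sample the path through IP1 and IP2 occurs three times out of five, so it is the only path retained; the two paths seen once each would be discarded as unstable.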
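(2) The edge-weighting method based on routing characteristics. The proposed method takes the number of stable paths passing through an edge as the weight of the edge. If the set of all stable paths in the network is denoted as $P_S$, the weight $w(e_{g,k})$ of the edge $e_{g,k}$ between two adjacent nodes $v_g$ and $v_k$ is calculated as follows:

$$w(e_{g,k}) = \sum_{p \in P_S} \delta(p, e_{g,k})$$

If the path $p$ in $P_S$ passes the edge $e_{g,k}$, then $\delta(p, e_{g,k}) = 1$; otherwise $\delta(p, e_{g,k}) = 0$:

$$\delta(p, e_{g,k}) = \begin{cases} 1, & v_g, v_k \in V_p \text{ and } p \text{ passes } e_{g,k} \\ 0, & \text{otherwise} \end{cases}$$

where $V_p$ is the set of all nodes on path $p$.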
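The weighting rule (an edge's weight is the number of stable paths that traverse it) can be illustrated with a short sketch. Two points are assumptions rather than statements from the text: edges are treated as undirected, and RWDC is computed as the sum of the weights of a node's incident edges, which is the usual definition of weighted degree. The names `edge_weights` and `rwdc` and the sample paths are hypothetical.

```python
from collections import Counter


def edge_weights(stable_path_list):
    """Count, for every edge, how many stable paths pass through it."""
    weights = Counter()
    for path in stable_path_list:
        # Use a set so that a path contributes at most 1 to each edge.
        edges = {frozenset(pair) for pair in zip(path, path[1:])}
        for edge in edges:
            weights[edge] += 1
    return weights


def rwdc(weights):
    """Weighted degree: sum of incident edge weights (assumed form of RWDC)."""
    score = Counter()
    for edge, w in weights.items():
        for node in edge:
            score[node] += w
    return score


# Hypothetical stable paths; three of them share the edge IP4-IP5.
paths = [
    ("IP1", "IP4", "IP5", "IP6"),
    ("IP2", "IP4", "IP5", "IP7"),
    ("IP3", "IP4", "IP5", "IP8"),
]
w = edge_weights(paths)
scores = rwdc(w)
ranking = [node for node, _ in scores.most_common()]
```

Here the shared edge IP4-IP5 receives weight 3, while every other edge receives weight 1, so IP4 and IP5 each score 6 and the leaf nodes each score 1.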
For ease of understanding, this section describes the method of edge-weighting based on routing characteristics with the example in Fig. 3.
As shown in Fig. 3, in actual communication, there are 3 stable paths passing the edge ($IP_4$-$IP_5$): Stable path 1:

Algorithm analysis
In the proposed algorithm, topology denoising based on stable paths and edge weighting based on routing characteristics are the most important steps, and their effectiveness is analyzed in this section. Accurate topological characterization is essential for vital nodes discovery. The proposed algorithm eliminates edges that have a negative impact on vital nodes discovery and weights the edges between nodes more accurately. Therefore, it can accurately reflect the topological characteristics of the regional network.

Analysis of the topology denoising method based on stable paths.
When conducting research on vital nodes discovery, it is necessary to consider the amount of communication carried by nodes and the amount of transmission on the edges between nodes. There are two kinds of edges: edges on fixed paths and edges on non-fixed paths. On the one hand, communication protocols are often designed for ideal conditions at the outset, without considering unstable paths. On the other hand, unstable paths appear because of network congestion, which has many causes, so it is difficult to assess the importance of nodes based on unstable paths. Therefore, vital nodes discovery should be based only on stable paths, and noise data such as unstable paths should be eliminated.
An example diagram for the proposed topology denoising method.
Existing research on vital nodes discovery usually adds all observed nodes and edges to the network topology graph. Analysis of the measurement results leads to the conclusion that stable paths exist for communication between nodes on the Internet. Therefore, the proposed algorithm denoises the topology graph based on stable paths, and only the stable paths in actual communication are retained in the final network topology graph.
Among the results of 40-day Internet measurements in the three cities, there are 28,987,966 responses containing 168,594 different paths, of which 79,166 are stable paths. For the completed measurement results in the three cities, we count, for each target, the share of all path occurrences accounted for by the most frequent path. The results are shown in Fig. 4.
In Fig. 4, the x-axis represents the ratio of the most frequent path's occurrence times to the total number of path occurrences to a single IP; the y-axis represents the number of paths in the interval. As shown in Fig. 4, in the measurement results of Zhengzhou, Hangzhou and Chengdu, the most frequent path to a target node mostly accounts for more than 50% of occurrences, and these paths are called stable paths. As shown in Table 1, the proportions of stable paths in the measurement results are 83.1%, 86.1%, and 85.5%, respectively. This indicates that stable paths do exist in actual network communication.
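The quantity plotted on the x-axis of Fig. 4 (the share of a target's observations that belongs to its most frequent path) can be computed as in the sketch below. The function name `dominant_share` and the sample observations are hypothetical; the real measurements are not reproduced here.

```python
from collections import Counter, defaultdict


def dominant_share(observations):
    """For each target, return (count of most frequent path) / (all observations)."""
    counts = defaultdict(Counter)
    for target, path in observations:
        counts[target][tuple(path)] += 1
    return {t: max(c.values()) / sum(c.values()) for t, c in counts.items()}


# Hypothetical observations for two targets.
observations = [
    ("T1", ("VP", "a", "b", "T1")),
    ("T1", ("VP", "a", "b", "T1")),
    ("T1", ("VP", "a", "b", "T1")),
    ("T1", ("VP", "a", "b", "T1")),
    ("T1", ("VP", "a", "c", "T1")),
    ("T2", ("VP", "a", "b", "T2")),
    ("T2", ("VP", "a", "c", "T2")),
]
shares = dominant_share(observations)
# Targets whose most frequent path covers more than half of the observations.
majority_targets = sorted(t for t, s in shares.items() if s > 0.5)
```

In this sample T1 has a dominant path covering four of five observations, whereas T2 is split evenly between two paths and does not exceed the 50% mark.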
Due to limited network resources, network O&M personnel need to manage router nodes hierarchically to ensure the network's QoS. From the perspective of routing characteristics, because of load balancing and other strategies, some communications do not pass through stable paths. These paths are noise data in the process of vital nodes discovery. Taking the highway as an analogy, when road conditions are good, the driver chooses the optimal route; when congestion occurs, the driver chooses a sub-optimal route to avoid it. Obviously, the nodes on the optimal path are the actual vital nodes. Routing rules determine the existence of stable paths. Therefore, the network topology denoising method based on stable paths proposed in this manuscript reduces the data size, makes the data easier to process, reduces the interference caused by load balancing, improves the efficiency of vital nodes discovery, and yields more accurate discovery results. Taking the data of the first 40 days in Chengdu as an example, the network scale before and after denoising is compared in Table 2.
As can be seen from Table 2, denoising the network removes about 3.8% of nodes, 55.4% of edges, and 54.4% of paths. The removal of a small number of nodes is caused by the deployment of vantage points and by load balancing. These nodes are not on any stable path from the selected vantage points, so they are removed during topology optimization. This process has a certain impact on the coverage of vital nodes, but has no effect on the accuracy of vital nodes discovery. To increase the coverage of vital nodes, different combinations of vantage points can be selected to measure the target network separately.
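A before-and-after comparison of this kind can be reproduced from any set of paths by counting distinct nodes, edges and paths in the full and the stable-only topologies. The sketch below is illustrative only: `topology_scale`, `reduction` and the toy paths are hypothetical names and data, and edges are assumed to be undirected.

```python
def topology_scale(paths):
    """Count distinct nodes, undirected edges and paths in a set of paths."""
    unique_paths = {tuple(p) for p in paths}
    nodes = {n for p in unique_paths for n in p}
    edges = {frozenset(e) for p in unique_paths for e in zip(p, p[1:])}
    return {"nodes": len(nodes), "edges": len(edges), "paths": len(unique_paths)}


def reduction(before, after):
    """Percentage of each quantity removed by denoising, rounded to 0.1."""
    return {k: round(100 * (before[k] - after[k]) / before[k], 1) for k in before}


# Hypothetical measured paths; the second one is assumed to be unstable.
all_paths = [
    ("VP", "a", "b", "T1"),
    ("VP", "a", "c", "T1"),
    ("VP", "a", "b", "T2"),
]
stable_only = [all_paths[0], all_paths[2]]

before = topology_scale(all_paths)
after = topology_scale(stable_only)
removed = reduction(before, after)
```

Dropping the single unstable path in this toy case removes one node (c), two edges and one path, which mirrors the pattern in the text: edges and paths shrink far more than nodes.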
The existing research on vital nodes discovery usually targets static network models, which do not consider the transmission of traffic in the network or simply assume that traffic is distributed equally over the edges. However, because of routing rules, the number of paths passing through the edges between different node pairs differs significantly. These edges therefore differ greatly in the traffic they carry and the roles they play, and their influence on the connected nodes also differs. For this reason, this manuscript weights edges based on stable paths in actual communication, and then combines the edge weights to evaluate the importance of nodes and obtain more accurate results of vital nodes discovery. Take Fig. 5 as an example to illustrate the necessity of constructing a weighted topology graph. Suppose the paths existing in the communication from $IP_1$ to $IP_9$, $IP_{10}$ and $IP_{11}$ are:
The unweighted topology graph can be constructed from the above paths, as shown in Fig. 5a. The proposed method weights edges according to the number of paths passing through an edge, and the weighted topology graph is constructed, as shown in Fig. 5b.
Calculate the degree centrality (DC) of all nodes in Fig. 5a. The results are shown in Table 3.
Consider the weights of the edges and calculate the weighted degree centrality (WDC) of all nodes in Fig. 5b. The results are shown in Table 4.
As can be seen from Tables 3 and 4, $IP_3$ ranks higher than $IP_2$ in the unweighted graph; that is, $IP_3$ appears more important than $IP_2$. In the weighted graph, however, $IP_2$ ranks higher than $IP_3$; that is, $IP_2$ appears more important than $IP_3$. In Internet communication, traffic must pass through $IP_2$ to reach $IP_3$. We use the node deletion method to evaluate their importance. After removing $IP_2$ and $IP_3$, respectively, the topology graph of the network is shown in Fig. 6. After removing $IP_3$, the remaining nodes in the network can still communicate with each other. After removing $IP_2$, however, many nodes can no longer communicate with each other. So $IP_2$ plays a more critical role in the network than $IP_3$. It can be seen that more accurate ranking results can be obtained by using the weighted network topology.
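The contrast between DC and WDC, and the node deletion check, can be reproduced on a small example. The topology below is hypothetical and is not the one in Fig. 5; it was chosen only so that the unweighted and weighted rankings disagree in the same way. The helper `connected_pairs`, which counts node pairs that can still reach each other after one node is removed, is likewise an illustrative assumption.

```python
from collections import Counter, defaultdict

# Hypothetical paths from IP1 (not the paths of Fig. 5).
paths = [
    ("IP1", "IP2", "IP3", "IP9"),
    ("IP1", "IP2", "IP3", "IP10"),
    ("IP1", "IP2", "IP3", "IP11"),
    ("IP1", "IP2", "IP5", "IP9"),
    ("IP1", "IP2", "IP5", "IP10"),
    ("IP1", "IP2", "IP5", "IP11"),
]

# Edge weight = number of paths that traverse the (undirected) edge.
weights = Counter(frozenset(e) for p in paths for e in zip(p, p[1:]))

dc, wdc = Counter(), Counter()
for edge, w in weights.items():
    for node in edge:
        dc[node] += 1   # unweighted degree
        wdc[node] += w  # weighted degree


def connected_pairs(weights, removed):
    """Number of node pairs still mutually reachable after deleting `removed`."""
    adj, nodes = defaultdict(set), set()
    for edge in weights:
        a, b = tuple(edge)
        nodes.update((a, b))
        if removed not in edge:
            adj[a].add(b)
            adj[b].add(a)
    nodes.discard(removed)
    seen, total = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one connected component
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        total += size * (size - 1) // 2
    return total


pairs_without_ip3 = connected_pairs(weights, "IP3")
pairs_without_ip2 = connected_pairs(weights, "IP2")
```

In this toy graph DC places IP3 (degree 4) above IP2 (degree 3), while WDC places IP2 (12) above IP3 (6). Deleting IP3 leaves all 15 remaining node pairs connected, whereas deleting IP2 cuts IP1 off and leaves only 10, which agrees with the weighted ranking.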
In existing research, weighting-based mining of vital nodes is not based on actual topology data; it still mines the mathematical characteristics of a known topology. The proposed algorithm starts from actual data, and the proposed weighting method is closer to actual network characteristics, so it better reflects the importance of different edges in the network.

Experiments
To verify the feasibility and effectiveness of the proposed algorithm, this section conducts vital nodes discovery experiments. If the actual communication paths between all node pairs in the target area could be obtained, we would get the most accurate results of vital nodes discovery. However, this requires deploying a probe at each node in the target network, which is difficult even for a medium-sized city. Therefore, this section selects some vantage points to carry out continuous probing (lasting 107 days) of the IP addresses of the target area. The measurement results approximate the communication of the actual networks. The experimental results show that the proposed algorithm performs better than existing algorithms, indicating that the approximation is reasonable.

Experimental setup.
The experimental settings in the data acquisition stage are shown in Table 5.
In Table 5, A represents the target areas, D represents the IP address databases, V represents the vantage points, and T represents the probing cycle. Considering realistic conditions, this section chooses three cities in China as the target areas: Chengdu (Sichuan Province), Hangzhou (Zhejiang Province), and Zhengzhou (Henan Province). We then select the IP address blocks located in the target areas from 6 IP address databases and retain the IP blocks that appear in at least 3 of them to form the IP block set $S_A$ of the target area.
The real subnet structure and division method are difficult to obtain directly, so this section extracts IP addresses from network segments for probing. IP addresses in the same network segment are often similar in routing strategy, geographical location and other settings, and often belong to the same organization [23][24][25]. Based on this, this section selects one IP from each /24 IP block to construct the target set $V_T$. Then, $V_T$ is probed from the vantage points in V (V is composed of three vantage points located in Zhengzhou, Hangzhou and Chengdu).
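The target set construction (keep the /24 blocks reported by at least 3 databases, then take one address per /24) might look like the following sketch. Several details are assumptions that the text does not specify: database entries are taken to be IPv4 CIDR blocks no smaller than /24, and the first host address of each /24 is chosen as the probe target. The database names and prefixes are placeholders from reserved documentation and benchmarking ranges, not the study's data.

```python
import ipaddress
from collections import Counter


def target_set(db_blocks, min_databases=3):
    """Pick one IP per /24 that appears in at least `min_databases` databases.

    db_blocks: dict mapping a database name to a list of CIDR strings.
    """
    counts = Counter()
    for blocks in db_blocks.values():
        covered = set()
        for block in blocks:
            net = ipaddress.ip_network(block)
            # Split larger blocks into /24s (assumes prefix length <= 24).
            covered.update(net.subnets(new_prefix=24))
        counts.update(covered)  # each database votes once per /24
    kept = sorted(n for n, c in counts.items() if c >= min_databases)
    # Assumption: probe the first host address of each retained /24.
    return [str(net[1]) for net in kept]


# Hypothetical database contents.
db_blocks = {
    "db_a": ["192.0.2.0/24", "198.18.0.0/23"],
    "db_b": ["192.0.2.0/24", "198.18.0.0/24"],
    "db_c": ["192.0.2.0/24", "198.18.1.0/24", "203.0.113.0/24"],
    "db_d": ["198.18.0.0/24"],
}
targets = target_set(db_blocks)
```

Only 192.0.2.0/24 and 198.18.0.0/24 reach three votes here, so the sketch yields two probe targets; 198.18.1.0/24 (two votes) and 203.0.113.0/24 (one vote) are dropped.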
Internet measurement.
This section uses Scamper [26], developed by CAIDA, for Internet measurement. The IP address blocks of the three target cities were selected from 6 IP address databases released in November 2019: IPIP, Whois, IPPlus, IP2location, Maxmind and IPcn.
There are 12,748,117 IP addresses in the three target cities, and the three target IP sets contain 60,337 target IP addresses in total. The number of IP addresses and target IP addresses of the three cities are shown in Table 6. In 2019-2020, we probed the target IP addresses in the three cities, obtaining 275,893,827 results in total. The number of /24 blocks covered by the measurement results, and the number of routing nodes and paths extracted from the results are shown in Table 6.
Due to the layered architecture of China's Internet, communication between Internet Service Providers (ISPs) without interconnection needs to be forwarded through Internet Exchange Points (IXPs) deployed in specific cities. Therefore, to avoid the interference of cross-city data, this section selects only a single operator for experimentation. This manuscript uses the data of China Telecom in the above data set as an example to conduct the following vital nodes discovery experiments.

Results of vital nodes discovery experiment.
After obtaining the topological data of the target cities, the weighted network topology graph can be constructed and denoised based on the stable paths to obtain the routing weighted degree centrality (RWDC) of the nodes. This section conducts three experiments: an experiment on the effect of different Internet measurement durations on the algorithm's performance, a comparison of nodes discovery before and after denoising, and a comparison of the proposed algorithm with baseline algorithms. The experimental results are validated against the existing database.
Notations used in this section are listed in Table 7.
Effect of different durations on the performance of the algorithm.
This section compares the network size and ranking results on measurement results collected in 5 days (60 rounds), 40 days (360 rounds) and 107 days (1,284 rounds), respectively. The results for Chengdu are shown as an example in Tables 8 and 9.
As can be seen from Table 8, the numbers of nodes, edges and paths in the 107-day data are 1.12, 2.41 and 2.49 times those in the 40-day data, and 1.22, 4.47 and 4.03 times those in the 5-day data, respectively. From Table 9, we can see that in the 5-day results, 10 national-level backbone nodes and 4 provincial-level backbone nodes are found; in the 40-day results, 10 national-level backbone nodes and 7 provincial-level backbone nodes are found; in the 107-day results, 10 national-level backbone nodes and 7 provincial-level backbone nodes are found.
This shows that when the measurement durations differ greatly, the numbers of paths, edges, and nodes in the obtained topology graph change significantly. However, neither the data scale after denoising nor the vital nodes discovery results of the proposed algorithm change much. At the same time, when the measurement duration is short, it is impossible to find enough vital nodes because the number of stable paths is insufficient. Therefore, it is necessary to mine the vital nodes after the data collection reaches a certain scale.
When the number of stable paths is sufficient, the proposed algorithm can discover all the vital nodes that can be mined under this vantage point.
Comparison of experimental results before and after denoising.
This section compares the scale of the networks and the ranking accuracy before and after denoising. The results are shown in Table 10 and Fig. 7.
Table 10 shows the network scale before and after denoising, including the number of nodes, the number of edges, and the number of paths in the network topology. $B_N$, $B_P$ and $B$ represent the number of national backbone nodes, provincial backbone nodes and backbone nodes, respectively, among the top-k nodes in the ranking results. As can be seen from Table 10, removing alternative paths eliminates about 7% of nodes, 80% of edges, and 80% of paths, which significantly reduces the scale of data processing. In addition, a bold number indicates the larger value in the comparison before and after denoising. We can see that in a total of 15 groups of comparisons in 3 cities, the ranking metric after denoising (i.e., RWDC) performs better in 10 groups.
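The evaluation counts behind $B_N$, $B_P$ and $B$ can be sketched as below. This is illustrative only: the node scores and the backbone labels are hypothetical, the reference labels are assumed to come from an external validation database, and $B$ is assumed to equal $B_N + B_P$ (that is, national and provincial are taken to be the only backbone levels).

```python
def backbone_counts(ranking, level, k):
    """Count national and provincial backbone nodes among the top-k ranked nodes.

    ranking: node identifiers ordered from most to least important.
    level:   dict mapping a node to "national" or "provincial";
             nodes missing from the dict are treated as non-backbone.
    """
    top = ranking[:k]
    b_n = sum(1 for node in top if level.get(node) == "national")
    b_p = sum(1 for node in top if level.get(node) == "provincial")
    # Assumption: B is the total of the two backbone levels.
    return {"B_N": b_n, "B_P": b_p, "B": b_n + b_p}


# Hypothetical scores and validation labels.
scores = {"r1": 40, "r2": 35, "r3": 20, "r4": 12, "r5": 3}
level = {"r1": "national", "r2": "provincial", "r4": "national"}

ranking = sorted(scores, key=scores.get, reverse=True)
top3 = backbone_counts(ranking, level, k=3)
top5 = backbone_counts(ranking, level, k=5)
```

Running the same function on the rankings produced before and after denoising, for each k, gives the paired values that a comparison such as Table 10 reports.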
The green, orange and blue sectors in Fig. 7 represent the number of national-level backbone nodes, provincial-level backbone nodes and backbone nodes, respectively, among the top-k nodes in the ranking results. It can be seen that, in most cases, the results of the proposed algorithm have a larger green area and a smaller blue area, indicating that the proposed algorithm finds more (or the same number of) national-level backbone nodes than the ranking before denoising.
Combining the results in Table 10 and Fig. 7, it can be concluded that the proposed topology denoising method can significantly improve the accuracy of the ranking result and reduce the scale of data processing.
In Table 11, Top-k represents the number of backbone nodes in the top k nodes obtained by the various algorithms, and the value in bold is the maximum number of backbone nodes found by the 4 algorithms. In Fig. 11, the green, orange, gray and blue bars represent the experimental results of RBC, BC, DC and RWDC, respectively; the light bars represent the number of provincial-level backbone nodes, and the dark bars represent the number of national-level backbone nodes.
Taking the results of Hangzhou as an example, it can be seen from Fig. 9 and Table 11 that among the top-10/20/30/40/50 nodes obtained by the various algorithms, the proposed algorithm finds the largest number of national-level and provincial-level backbone nodes. In addition, among 15 groups of comparisons in 3 cities, the proposed algorithm finds more (or the same number of) backbone nodes in 10 groups, and more (or the same number of) national-level backbone nodes in 13 groups. We conclude that the proposed algorithm can find more vital nodes than DC, BC and RBC.
Taking the experimental results in Chengdu as an example, the top-10 nodes and validation results under the 4 metrics are shown in Table 12.
According to Table 12, the proposed algorithm discovers 9 national-level backbone nodes and 1 provincial-level backbone node in the top 10 nodes, whereas DC finds no backbone node in the top 10 nodes. BC discovers 1 national-level backbone node and 4 provincial-level backbone nodes; RBC discovers no national-level backbone node and 10 provincial-level backbone nodes.
The experimental results demonstrate that the proposed algorithm can find more backbone nodes than DC, BC and RBC, and the results are more accurate.

Conclusion
This manuscript proposes an algorithm for discovering vital nodes in regional networks based on stable path analysis. The network topology denoising method based on stable paths effectively reduces the scale of the processed data, and the edge-weighting method based on routing characteristics clearly distinguishes the roles of edges in actual communication. Experimental results show that the proposed algorithm can find more vital nodes than existing algorithms. However, due to the impact of load balancing and the limited deployment of vantage points, this algorithm cannot find all the vital nodes in the target area; the nodes it can find are determined by the stable paths that pass from the deployed vantage points. For this reason, in future work we will study how to deploy vantage points to obtain a relatively complete regional network topology and thereby improve the ability to discover vital nodes in the target area. https://doi.org/10.1038/s41598-023-39174-7

Figure 1. Overall architecture of the proposed algorithm.

Figure 3. An example diagram for the proposed edge-weighting method.

Figure 4. Path proportion statistics for target IP.

Figure 9. Comparison of the ranking results of the 4 metrics.

Table 1. The ratio of occurrence times of stable paths in all paths.

Table 2. Comparison of network scale before and after denoising.

Table 3. Result of nodes ranking by DC in the unweighted graph.

Table 4. Result of nodes ranking by WDC in the weighted graph.

IP 2 IP 3 IP 1 IP 9 IP 5 IP 6 IP 10 IP 4 IP 7 IP 8 IP 11
Figure 6. Comparison of the topology graph after removing $IP_2$ and $IP_3$.

Table 6. Statistics of the dataset in Internet measurement.

Table 7. Symbol definition.
$B_N$, $B_P$ and $B$: Backbone nodes covered by top k nodes in the node importance ranking result.

Table 8. Comparison of the data scale collected in 5 days, 40 days and 107 days.

Table 9. Comparison of the ranking results based on the data collected in 5 days, 40 days and 107 days.

Table 10. Comparison of the network scale and ranking results before and after denoising. Significant values are in bold.

Table 12. Comparison of top-10 nodes ranked by RWDC, DC, BC and RBC in Chengdu. Significant values are in bold.