Sampling unknown large networks restricted by low sampling rates

Graph sampling plays an important role in data mining for large networks. Specifically, larger networks typically require lower sampling rates. In this situation, traditional traversal-based samplings for large networks usually show an excessive preference for densely-connected network core nodes. To address this issue, this paper proposes a sampling method for unknown networks at low sampling rates, called SLSR, which first adopts a random node sampling to evaluate a degree threshold, used to distinguish the core from the periphery, together with the average degree of the unknown network, and then runs a double-layer sampling strategy on the core and the periphery. SLSR is simple, which yields high time efficiency, and experiments verify that the proposed method can accurately preserve many critical structures of unknown large scale-free networks at low sampling rates and with low variances.


Introduction
Graph sampling extracts nodes or edges to create subgraphs representing an original network. It is often used as pre-processing before data mining to reduce the scale of datasets [1][2][3][4][5] or as post-processing after optimization to represent the original network more accurately [6][7][8][9]. The former usually adopts traversal-based samplings [5] that simulate walkers travelling on unknown networks based on the neighbor information of the nodes they are accessing. The traversal-based samplings [5] are time efficient, which enables complex mining algorithms, such as graph convolutional networks [1], subgraph pattern mining [2,3], and network embedding [4], to be applied to networks with more than one million nodes. The latter adopts samplings that construct optimization models [6][7][8][9] to minimize the difference between known original networks and sampled subgraphs. Approximation algorithms used for solving the optimization models are powerful in representing objective structures of the original networks but are time-consuming [6][7][8][9]. This paper focuses on the traversal-based samplings in pre-processing systems that travel on unknown original networks, and intends to represent more important structures of the networks with high time efficiency.
Metropolis-Hastings random walk (MHRW) [10] and simple random walk (SRW) [10,11] are two classical traversal-based samplings. Based on Markov chain random models, MHRW is unbiased, sampling each node with a uniform stationary distribution, whereas SRW is biased, that is, the probability of a node being sampled is proportional to its degree [10,11]. This paper focuses on ubiquitous scale-free networks exhibiting core-periphery structures, which consist of a dense core and a sparse periphery [12][13][14][15]. The core determines many important structures of the networks, such as the low diameter; however, it is almost ignored by unbiased samplings since the number of nodes in the core is extremely small. On the contrary, biased samplings preserve more structures determined by the high-degree core nodes. However, the above-mentioned node sampling probability of SRW corresponds to the convergence state of a Markov chain [11].
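To make the contrast concrete, the two walkers can be sketched in a few lines of Python on a toy adjacency list. The graph, function names, and stopping rule here are illustrative, not taken from the paper; the acceptance rule min(1, d(u)/d(w)) is the standard MHRW correction of the SRW degree bias.

```python
import random

# Toy undirected graph as an adjacency list (illustrative example data).
demo_adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [3]}

def simple_random_walk(adj, seed, n_nodes, rng):
    """SRW: move to a uniformly chosen neighbor; at convergence a node
    is visited with probability proportional to its degree (biased)."""
    sampled, v = {seed}, seed
    while len(sampled) < n_nodes:
        v = rng.choice(adj[v])
        sampled.add(v)
    return sampled

def metropolis_hastings_rw(adj, seed, n_nodes, rng):
    """MHRW: accept a move u -> w with probability min(1, d(u)/d(w)),
    which corrects the degree bias toward a uniform stationary
    distribution."""
    sampled, u = {seed}, seed
    while len(sampled) < n_nodes:
        w = rng.choice(adj[u])
        if rng.random() <= len(adj[u]) / len(adj[w]):
            u = w  # accept; otherwise stay at u (a self-loop step)
        sampled.add(u)
    return sampled
```

Both walkers stop once the desired number of distinct nodes has been collected, which is exactly the regime where low sampling rates prevent the chains from converging.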
The sampling rate is defined as the ratio of the number of nodes (or edges) in the sampled network to that in the original network [5]. Low sampling rates are needed to improve time efficiency on large networks, but they also make it difficult to achieve the convergence of the Markov chain. In the core-periphery structures of scale-free networks, each core node is well connected by periphery nodes, but the latter are not well connected to each other [12][13][14][15]; that is, the walkers of biased samplings are more likely to be attracted to the core under the constraint of low sampling rates, resulting in the loss of structures related to the periphery, which contains the vast majority of nodes in the networks. Thus, this paper proposes a sampling for unknown networks at low sampling rates, called SLSR, with the objective of achieving a balanced sampling of the core-periphery structures.
The organization of this paper is as follows: Section 3 investigates the problem formulation and design principles of SLSR. Section 4 provides a random node sampling to evaluate the average degree (AD) and degree threshold (DT) of original unknown large networks. Section 5 designs the traversal-based sampling SLSR. Specifically, SLSR starts with the AD and DT evaluation, then limits the sampling process to the periphery using the DT, and designs a bisection method constrained by the AD to preserve the core structure. Section 6 compares SLSR with related methods and verifies that SLSR can capture many critical structures in addition to degree, including shortest path length, clustering, graph spectrum, centrality, and communities.
The contributions of this paper are as follows:
- Analysis of the advantages of the random node sampling in capturing the DT and its high time efficiency, as well as the shortcomings of the sampling in subgraph representation, such as the loss of critical high-degree core nodes and of the periphery topological structure.
- Design of a simple traversal-based sampling that relies only on the node set and the adjacent-node information of sampled nodes, without involving complex topological characteristics, yet can preserve critical properties at low sampling rates. Simplicity corresponds to a high time efficiency that is important for pre-processing systems.
- Analysis of the entropy and variance of sampled subgraphs at low sampling rates. Assuming Ω is a sample space with N samples, the probability of extracting a sample from Ω uniformly at random is 1/N, and the entropy of the probability distribution is H = −∑_{i=1}^{N} (1/N) log(1/N) = log N. That is, with increasing N, the uncertainty measured by the entropy grows [16]. The traversal-based samplings randomly choose the next node from the adjacent nodes of the current node, that is, the adjacent node set of the current node constitutes the sample space for the random choice. We prove that SLSR sharply reduces the scale of the sample space in most cases by a simple and deterministic bisection method (i.e., reduces the entropy), and experimentally verify that the reduced entropy can ensure the low variances of many critical statistics of the sampled SLSR subgraphs.
- We experimentally find that the smaller and denser the core of an original network, the stronger the preference of the traditional traversal-based samplings for high-degree core nodes at low sampling rates. This is difficult to prove mathematically via Markov chain theory [10], because the low sampling rates prevent the Markov chain random process from reaching the convergence state.
- Time efficiency and community visualization are analyzed in depth.
- The codes of SLSR are provided at
https://github.com/jiaoboleetc/SLSR.
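The entropy identity H = log N used in the contributions can be checked numerically; the snippet below is a minimal illustration (not from the SLSR code base) of why shrinking a sample space lowers the uncertainty of a uniform random choice.

```python
import math

def uniform_entropy(n):
    """Entropy of drawing uniformly from a sample space of size n:
    H = -sum_{i=1}^{n} (1/n) * log(1/n) = log n."""
    p = 1.0 / n
    return -sum(p * math.log(p) for _ in range(n))

# The uncertainty grows with the size of the sample space, so shrinking
# the neighbor sample space (as SLSR does) lowers the entropy.
```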

Graph sampling on unknown networks
Node/edge-based samplings choose a set of nodes (or edges) at random and extract the subgraphs induced by the chosen nodes (or edges). They include uniform samplings, such as random node (RN) and random edge (RE), and non-uniform samplings, such as random degree node (RDN) and random PageRank node (PRN) [5,17]. Specifically, nodes can be sampled proportionally to the degree centrality by RDN [5] and proportionally to the PageRank weight by PRN [5]. Recently, Wang et al. [18] investigated the relations between edges and their edge neighbors caused by the reconciliation of scale-free and self-similar structure, and proposed a series of sampling algorithms based on these relations, which can keep important statistical characteristics of original networks.
Traversal-based samplings start with one or more seeds and crawl on unknown original networks based on the neighbor information of the nodes they are accessing. Forest fire (FF), which is a variant of breadth first (BF) and snowball (SB), performs superiorly in time efficiency since each node in the unknown networks is traversed no more than once [19]. FF starts with a random seed, then burns a fraction of its neighbors that have not been traversed, where the fraction is randomly drawn from a geometric distribution, and the process is recursively repeated for each burnt neighbor until the desired sample size is obtained [5]. SRW [11] starts with a random seed and moves from a node to one of its neighbors chosen uniformly at random, until the expected fraction of nodes is collected. In addition, further random walk samplings, namely non-backtracking random walk (NBRW), circulated neighbor random walk (CNRW), and common neighbor awareness random walk (CNARW) [20], have been proposed to reduce the asymptotic variance of sampled subgraphs and overcome the slow convergence of SRW, as a simple random walker tends to be stuck in local loops [20]. The principles and pseudo-codes of the three improved random walk samplings can be found in recent review articles [10,20]. Rank degree (RD) is a multi-seed sampling [21], which adopts a predetermined number of random starting seeds to avoid the sampling being trapped locally, then iteratively explores the top-highest-degree neighbors of each seed and adds them to the seed set. Moreover, some samplings were designed to capture specific network structures, such as community structure expansion (CSE) [22]. Recently, the node/edge-based and traversal-based samplings have become important tools for efficient network intervention and AD evaluation on large unknown networks [23].
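The FF procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: the geometric draw with parameter p and the restart-on-dead-fire behavior are one common parameterization, not necessarily the exact variant of [19]; all names are illustrative.

```python
import random
from collections import deque

def forest_fire_sample(adj, n_target, p=0.7, rng=None):
    """FF sketch: starting from a random seed, burn a geometrically
    distributed number of not-yet-visited neighbors of each burnt node;
    restart from a fresh seed if the fire dies out before n_target
    nodes are collected. May slightly overshoot n_target while
    finishing the current node."""
    rng = rng or random.Random()
    unvisited = set(adj)
    visited = set()
    while len(visited) < n_target and unvisited:
        seed = rng.choice(sorted(unvisited))
        visited.add(seed); unvisited.discard(seed)
        queue = deque([seed])
        while queue and len(visited) < n_target:
            v = queue.popleft()
            burn = 0                      # draw burn ~ Geometric(p)
            while rng.random() < p:
                burn += 1
            fresh = [u for u in adj[v] if u in unvisited]
            rng.shuffle(fresh)
            for u in fresh[:burn]:
                visited.add(u); unvisited.discard(u); queue.append(u)
    return visited
```

Because each node is burned at most once, the traversal cost is linear in the number of touched edges, which is the source of FF's high time efficiency.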
Stream-based samplings generate subgraphs from activity networks that can be treated as a stream of edges [24,25] .In the networks, besides the unknown topology, the node set and the neighbor information of any node are unobtainable.

Graph sampling on known networks
If all network information of a dataset is known, complex structures hidden in the dataset can be discovered in advance. Hong et al. [6] first extracted precise structures, such as k-core, closeness, betweenness, and eigenvector centrality, from known original networks, and then reduced the scale of the networks under the guidance of these structures. Martin et al. [8] created an optimization model for large-scale network reduction towards scale-free structure. Jiao et al. [9] adopted a strategy of removing edges from known original networks one by one; however, a lower sampling rate means more edges need to be removed. Sampling on known networks helps preserve more precise structures, but usually comes at the cost of time [6][7][8][9].

Problem formulation
This paper focuses on simple, undirected, scale-free original networks, in which self-loops, multi-edges, and the direction of edges are ignored. We assume that the topological information of the original networks, such as community, clique, and global statistical characteristics, is unknown, but that the node set and the neighbors of sampled nodes can be accessed [5,[17][18][19][20][21][22][23]. We intend to quickly obtain subgraphs representing the unknown large original networks. The notation used by our SLSR sampling is listed in Table 1.

G = (V, E)  An unknown original network, where V and E respectively denote the node set and edge set. Please note that G is a simple and undirected graph.
d̄  The average degree of G, which can be evaluated by a random node sampling [23].
DT  A degree threshold of G, which can be evaluated by a random node sampling.
N(v)  The set of neighbor nodes of a sampled node v in G, which can be obtained by the traversal-based samplings.
‖·‖  The cardinality of a set.
A − B  The set consisting of elements that belong to A but not to B, where A and B denote two sets.
The sampling rate of the random node sampling for evaluating the AD and DT.
The sampling rate of the SLSR sampling.

Design principles
The classical Barabási-Albert (BA) scale-free evolving network model [26] confirms that the degree distribution of the network remains almost the same as the scale changes. Please note that the distribution mainly represents the low degrees of nodes in the periphery, ignoring the high degrees of nodes in the core, because the number of core nodes is far smaller than that of periphery nodes [12][13][14][15]. Based on the preferential attachment (PA) rule adopted by the BA model [26], which attaches each newly-added node preferentially to high-degree nodes, the degrees of core nodes grow quickly with increasing network scale; that is, the larger an original network, the greater the difference in degree between the core and periphery nodes, which causes the biased samplings on large networks to be overly attracted to the core at low sampling rates. The degree distribution is an important metric, and the existing biased samplings are good at capturing it under specific conditions [17][18][19][20][21][22]. Thus, the first principle P1 is to create a core-periphery framework in which the existing biased samplings continue to be used but are limited to the periphery sampling; that is, the core, which hinders the capture of the degree distribution, is stripped off and processed separately.
During changes in a scale-free network, such as scale reduction, the core has a low variability [15]. In addition, based on the fractal characteristic [26], the communities of the network also exhibit core-periphery structures [9,27]. Specifically, the community cores are mainly located in the core of the network; that is, the network core represents the structure of the community centers. Thus, the second principle P2 is to maximize the preservation of the connections within the network core.
Owing to the sparse connections between periphery nodes, faster information exchange between these nodes depends on core nodes [12][13][14][15]. Specifically, based on the PA rule [26], the higher the degree of a core node, the greater its probability of being connected by newly-added periphery nodes; that is, the core node has a stronger ability to shorten the path length between periphery nodes. Thus, the third principle P3 is to preserve a proportion of the core neighbors with the top highest degrees for each sampled periphery node, where the proportion can be determined by the AD of the original network, which can be evaluated by a random node sampling [23]. Please note that the second and third principles are helpful in preserving the path length distribution.
The periphery is the main contributor to the clustering coefficient distribution since it contains the vast majority of nodes in the network [12][13][14][15]. Based on the PA rule [26], the neighbors of a periphery node tend to be located in the core; that is, the connections between high-degree core nodes have a significant impact on the distribution. Thus, the three principles P1, P2, and P3 are all helpful in preserving the distribution.

Random node sampling for evaluating parameters

Random node evaluation
Core-periphery detection [27] refers to a partition of a network into two groups of nodes, called the core and the periphery, and is a useful tool to realize P1. An important procedure of the detection is to provide rank orders of nodes for the partition. Many measures based on cliques, communities, centrality, and probability [12][13][14][15][27] have been adopted for the ranking. These complex measures can help improve detection accuracy, but are difficult to evaluate quickly on unknown networks. Thus, we use a simple measure, namely the degree, to rank the nodes, and adopt a random node sampling to evaluate the AD and DT of the unknown networks, which are critical for P1 and P3. Specifically, nodes with degrees larger than the DT are classified into the core, while the other nodes belong to the periphery. To clearly distinguish the core nodes from the periphery nodes, the DT is determined by maximizing the number of edges connecting the two types of nodes.

Random node sampling algorithm: Evaluating the AD and DT

(Algorithm listing not reproduced here; lines 4 and 10 of the listing respectively output the evaluated AD and DT, and line 12 closes the While loop.)
In the above sampling, line 4 evaluates the AD of the unknown original network G, and lines 5 to 12 evaluate the DT of G. Our random node sampling differs from the RN sampling [5] introduced in Section 2.1, which generates a subgraph induced by randomly chosen nodes. Our sampling collects the degrees d(v) and neighbor sets N(v) of the randomly chosen nodes v in G, and evaluates the AD and DT based on these degrees and neighbors.
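Since the full algorithm listing is not reproduced above, the following is a plausible Python sketch of the described procedure, not the authors' exact code: sample nodes uniformly, estimate the AD as their mean degree, and choose as the DT the threshold that maximizes the number of observed edges crossing the core/periphery boundary. It assumes, as the paper does for sampled nodes, that the degree of an observed neighbor can be queried; `degree_of` and `neighbors_of` are hypothetical accessors.

```python
import random

def evaluate_ad_dt(degree_of, neighbors_of, nodes, rate, rng=None):
    """Sketch of the random node evaluation: the DT is the candidate
    threshold t maximizing the number of incident edges whose two
    endpoints fall on different sides of the deg > t boundary."""
    rng = rng or random.Random()
    sample = rng.sample(nodes, max(1, int(rate * len(nodes))))
    ad = sum(degree_of(v) for v in sample) / len(sample)
    # degree pairs of all edges incident to sampled nodes
    incident = [(degree_of(v), degree_of(u))
                for v in sample for u in neighbors_of(v)]
    best_t, best_cross = 0, -1
    for t in sorted({d for d, _ in incident}):
        cross = sum(1 for dv, du in incident if (dv <= t) != (du <= t))
        if cross > best_cross:
            best_t, best_cross = t, cross
    return ad, best_t
```

On a star graph the procedure recovers the intuitive threshold: the hub is classified as core and every leaf as periphery.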

Analysis of the AD evaluation
Recently, Qi et al. [23] confirmed the effectiveness of the random node sampling for AD evaluation. We further analyze the shortcomings of the sampling in subgraph representation. Table 2 lists the degree rank and the number of nodes with each degree in a scale-free network (com-Youtube with 1,134,879 nodes [28], described in Section 6.2, was chosen for the analysis). All degrees existing in the com-Youtube network [28] are ranked in decreasing order in Table 2, which shows that the top highest degrees each belong to only one or a few nodes. Please note that the phenomenon in Table 2 is common in scale-free networks. Under the uniform distribution, the probability of a k-degree node being sampled equals P(k), defined as the ratio of the number of k-degree nodes to the total number of nodes in the original network. Thus, the top highest-degree nodes in scale-free networks are easily lost by unbiased samplings at low sampling rates, such as the random node sampling; consequently, the AD of the subgraph induced by the nodes randomly chosen by the RN sampling [5] is extremely small, as shown in Fig. 1.

Fig. 1.
Random node sampling on a simple core-periphery graph, in which the neighbor set N(v) is defined in the original network, not in the RN subgraph, and can be obtained by the traversal-based samplings. The random node sampling is suitable for evaluating the AD of the original network [23], but cannot directly output a sampled subgraph capturing the degree property.
The AD of the original network G is equal to ∑_k k·P(k), where k is the degree defined in G. The random node sampling is prone to losing the top highest degrees. However, owing to the extremely small P(k) of these degrees, their loss has almost no impact on the AD evaluation.
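This robustness can be illustrated with back-of-the-envelope numbers (illustrative, not taken from the paper's datasets): even a very large hub barely shifts the AD when the network has a million low-degree nodes, because the hub's P(k) is only 1/(n + 1).

```python
# Illustrative numbers: 10^6 periphery nodes of degree 2 plus one hub
# of degree 30,000 that the random node sampling is likely to miss.
n_low, d_low, d_hub = 10**6, 2, 30_000

ad_full = (n_low * d_low + d_hub) / (n_low + 1)   # true average degree
ad_without_hub = d_low                             # estimate if the hub is missed

rel_err = abs(ad_full - ad_without_hub) / ad_full
# Missing the hub changes the AD estimate by under 2% here.
```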

Analysis of the DT evaluation
Let us return to the random node sampling algorithm described in Section 4.1. Assume that the candidate threshold in line 8 is not more than the actual DT value; then the corresponding nodes in line 8, and the nodes with degree equal to the candidate considered in line 9, are classified into the periphery. One of the edge sets maintained in line 9 contains the edges that connect periphery nodes to core nodes in the original network G, while the other consists of the edges connecting two periphery nodes. Please note that there are dense connections between core and periphery nodes but sparse connections between periphery nodes [12][13][14][15][27]. Although the node set chosen in line 3 by the random node sampling loses the top highest-degree nodes of the core, most of the chosen nodes can directly reach the core through only one hop in the scale-free original network G, because the PA rule [26] causes the core to be densely connected by the periphery [12][13][14][15][27]; this ensures that the top highest-degree nodes are not lost from the neighbor sets N(v) of most chosen nodes v (this will be further verified in Section 4.5). Thus, the connections between periphery nodes and core nodes in G are preserved in the collected neighbor sets. The nodes are independently chosen with uniform distribution; thus, the random node sampling not only loses top highest-degree nodes in the core but also ignores the complex topological correlation between sampled periphery nodes. However, the determination of the DT value depends on the connections between core and periphery nodes, while it is only weakly correlated with the connections between periphery nodes. The uniform distribution enables the random node sampling to choose periphery nodes without preference, which is critical for the accurate DT evaluation of our random node sampling.

Analysis of the variance and runtime
We choose five real-world large scale-free networks [28], described in Section 6.2, for the variance and runtime analysis, as listed in Table 3. Table 3 reports the mean and standard errors (below the mean) of the evaluated AD, evaluated DT, and runtime (seconds) from 100 independent realizations for each sampling rate. The real AD and DT were obtained at a 100% rate. The running environment is described in Section 6.3. As the sampling rate increases, the standard errors of the evaluated values show a decreasing trend. Owing to the very high time efficiency of the random node sampling, which results from ignoring the complex topological correlation among the randomly chosen nodes, we choose a 35% rate to pursue a low variance. Note that the random node sampling cannot directly output a subgraph capturing the degree or other important properties, as shown in Fig. 1; however, our purpose is to obtain a subgraph representing the original network.

Analysis of diverse core-periphery structures
Based on the DT value, the original network G = (V, E) can be partitioned into core nodes with d(v) > DT, periphery nodes with d(v) ≤ DT, core edges that connect two core nodes, periphery edges that connect two periphery nodes, and vertical edges that connect a periphery node to a core node. As shown in Table 4, the five real-world scale-free networks consist of a dense core and a sparse periphery. In addition, a few core nodes are densely connected by the periphery, and more than 55% of periphery nodes can directly reach the core through only one hop. Moreover, for the com-Youtube, web-Stanford, and loc-Gowalla networks in Table 4, we find that their core node percentages are much smaller, and their core edge densities (defined as the ratio of the number of core edges to the number of core nodes) are much higher, than those of the other networks.
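The partition just described is mechanical once the DT is known; the sketch below (illustrative names, integer node ids assumed so edges can be deduplicated by ordering) also computes the one-hop reachability statistic reported in Table 4.

```python
def partition_by_dt(adj, dt):
    """Partition a graph into core nodes (deg > dt), periphery nodes,
    and core/periphery/vertical edge sets, given a degree threshold."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    core = {v for v in adj if deg[v] > dt}
    periphery = set(adj) - core
    core_e, peri_e, vert_e = [], [], []
    for v in adj:
        for u in adj[v]:
            if v < u:  # count each undirected edge once
                kind = (v in core) + (u in core)   # 0, 1, or 2 core endpoints
                (peri_e, vert_e, core_e)[kind].append((v, u))
    # fraction of periphery nodes adjacent to at least one core node
    reach = sum(1 for v in periphery if any(u in core for u in adj[v]))
    return core, periphery, core_e, peri_e, vert_e, reach / max(1, len(periphery))
```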
Restricted by low sampling rates, the smaller and denser the core, the stronger its attraction to traditional traversal-based samplings. Because the Markov chain cannot reach convergence at low sampling rates [20], this paper further confirms the impact of the core structure of the original networks on the sampling results through experimental comparisons in Section 6.
Table 4. Percentage distributions of nodes and edges in the core-periphery structures partitioned by the DT. Two DT values were chosen for each original network: bold represents the accurate value obtained at a 100% sampling rate, and non-bold represents the mean of the values evaluated at a 35% rate, as shown in Table 3. The one-hop reachability is defined as the ratio of the number of periphery nodes that can directly reach the core through only one hop to the total number of periphery nodes.

Unknown network sampling SLSR

Traversal-based sampling algorithm at low sampling rates
Traversing an original network establishes topological connections between sampled nodes, but the walker may be overly attracted to the high-degree core nodes at low sampling rates. Thus, this section designs a new traversal-based sampling, SLSR, which only adopts the information of the node set and the neighbors of sampled nodes; that is, no complex topological information of the unknown original network, such as community, clique, or real statistical characteristics, can be used in the design of SLSR. To improve time efficiency, a low sampling rate is needed, but the sampled subgraph should still capture as many properties of the original network as possible.
Our SLSR creates a core-periphery framework for existing traversal-based samplings that run on unknown networks, such as FF [19], SRW [11], NBRW [20], CNRW [20], CNARW [20], and RD [21]. First, one sampling is chosen from the existing methods as the periphery walker; SLSR then restricts the walker to traverse only the periphery using the evaluated DT. Specifically, the set of neighbors of the node being accessed by the walker is restricted to its periphery neighbors (Eq. (2)), and the other principles and steps of the walker remain unchanged. Please note that the set of stripped core neighbors, obtained simultaneously with Eq. (2), should be saved for the core sampling. The walker then runs on the periphery of G with the given sampling rate, producing a sampled subgraph of the periphery together with its node set and edge set. According to P3, a percentage parameter is predefined to preserve the top-(‖N(v)‖ × percentage) highest-degree core neighbors for each sampled periphery node v, and the preserved core neighbors of v form a subset of N(v).
We define the union of the preserved core neighbors over all sampled periphery nodes (Eq. (5)) as the set of sampled core nodes. According to P2, all the connections between the nodes in this set should be preserved, which can be implemented by accessing the neighbors of each node in the set. Please note that the sampled core nodes and the connections between them constitute the sampled core. To ensure the connectivity of the sampled subgraph, the vertical edge set defined in Eq. (6) should also be preserved.
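The two per-node operations of this framework, P1's neighbor restriction and P3's top-degree core preservation, can be sketched as follows. This is a minimal reading of Eq. (2) and the P3 rule under the assumption that a neighbor's degree can be queried; function names are illustrative.

```python
def split_neighbors(adj, v, dt):
    """P1: the walker only sees periphery neighbors (degree <= DT);
    the stripped core neighbors are saved for the core sampling."""
    peri = [u for u in adj[v] if len(adj[u]) <= dt]
    core = [u for u in adj[v] if len(adj[u]) > dt]
    return peri, core

def top_core_neighbors(adj, v, dt, pct):
    """P3: keep the top-(|N(v)| * pct%) highest-degree core neighbors
    of a sampled periphery node v."""
    core = sorted((u for u in adj[v] if len(adj[u]) > dt),
                  key=lambda u: len(adj[u]), reverse=True)
    keep = int(len(adj[v]) * pct / 100)
    return core[:keep]
```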
Bisection method: Determining the percentage parameter

Next, we analyze how to determine the percentage parameter. As the parameter grows, the number of vertical edges increases, while the sampled periphery remains unchanged and the sampled core induced by the sampled core nodes has a low variability [15]. Thus, the AD of the subgraph, composed of the vertical edges, the sampled periphery, and the sampled core, increases monotonically as the parameter grows; consequently, a bisection method can be used to determine the parameter under the guidance of the evaluated AD, which serves as the target AD of the sampled subgraph.
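The monotonicity argument above admits a standard bisection. The sketch below abstracts the expensive part behind a callable `subgraph_ad(pct)` (a hypothetical name for "assemble the subgraph at this percentage and measure its AD"), so only the search logic is shown.

```python
def bisect_percent(subgraph_ad, target_ad, iters=20):
    """Binary-search the percentage in [0, 100]: subgraph_ad(pct) is
    assumed to grow monotonically with pct, so the interval is narrowed
    toward the pct whose assembled-subgraph AD matches the target."""
    lo, hi = 0.0, 100.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if subgraph_ad(mid) < target_ad:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Twenty iterations shrink the interval to 100/2^20, far below any practical percentage granularity.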
In the bisection method proposed above, the number of iterations of the While loop is not more than log₂ 100. In addition, d(v) ≤ DT for each sampled periphery node v in line 7, and the number of sampled core nodes in line 10 is far smaller than the number of sampled periphery nodes, since the scale of the core is much smaller than that of the periphery, as shown in Table 4. Thus, the time complexity of the bisection method is low. Once the percentage parameter is determined, the sampled subgraph of G can be obtained using lines 6 to 10 of the bisection method. The Main function of SLSR is described as follows: determine the percentage parameter using the bisection method proposed above, then obtain the sampled subgraph. Please note that the process of obtaining the subgraph is the same as lines 6 to 10 of the above-mentioned bisection method.
The sampling framework created by SLSR is simple, and simpler methods typically have higher time efficiency. Moreover, the framework can significantly improve the accuracy of multi-structure preservation at the low sampling rates that are needed for high time efficiency.
PeripherySampling: Choosing FF [19] as the periphery walker

(Algorithm listing not reproduced here.) We arbitrarily choose FF [19] as the walker, since it traverses each node no more than once, which leads to its high time efficiency. Users can substitute any other existing traversal-based sampling. Once the walker is chosen, the periphery sampling is determined.

Analysis of the variance of sampled SLSR subgraphs
Low sampling rates may lead to high variances in the sampling results of large sample spaces. However, our SLSR sampling can control the variances based on the following three points:
- The bisection method is deterministic; there is no randomness in it.
- The random node sampling has very high time efficiency, as shown in Table 3; thus, setting its sampling rate to 35% not only reduces the variance but also has little impact on the time efficiency of our traversal-based SLSR sampling.
- The smaller the scale of a sample space, the lower the uncertainty of randomly extracting a sample from it, which is ensured by the theory of information entropy [16]. In the periphery sampling of SLSR, described in Section 5.1, the set of neighbors of the node being accessed by the walker is compressed to its periphery neighbors. In a scale-free network, the neighbor set of a core node is far larger than that of a periphery node, as shown in Table 2, and the number of vertical edges connecting a periphery node to a core node is much larger than the number of edges connecting two periphery nodes, as listed in Table 4. Thus, the compressed neighbor set of a periphery node is far smaller than its full neighbor set, and the size of this set determines the sample space when the walker moves from the current node to a neighbor. Compared to the traditional traversal-based samplings that excessively prefer high-degree core nodes at low sampling rates, the sample space in our SLSR sampling is sharply compressed in most cases.
The above three points are critical for controlling the uncertainty of SLSR with sampling rates of at most 10%, and the experimental results in Section 6 further verify the low variance.

Metrics
This paper proposes a traversal-based sampling, SLSR, that only uses the information of the node set and the neighbors of sampled nodes, without involving complex topological characteristics, yet solves the issue of excessive preference for high-degree core nodes at low sampling rates. Thus, some metrics that can measure this excessive preference are included in this section.
AD is defined as ∑_k k·P(k) = 2‖E‖/‖V‖ in a simple and undirected graph G = (V, E) with node set V and edge set E, where P(k) denotes the fraction of nodes with degree k in G [29][30][31]. The statistic reflects whether a sampling favors core nodes with high degrees.
Average clustering coefficient (ACC), defined as c̄ = ∑_k P(k)·c(k), represents how close a node's neighbors are to forming a clique [32][33][34][35], where c(k) = 2m(k)/(k(k − 1)) and m(k) denotes the average number of links between two neighbors of k-degree nodes. A related distribution characteristic is the clustering coefficient distribution [32], defined as c(k) vs. k. Average path length (APL) is defined as l̄ = ∑_l l·P(l), which represents the mutual reachability of nodes, where P(l) denotes the fraction of node pairs with shortest path length l between the two nodes. A related distribution characteristic is the shortest path length distribution, defined as P(l) vs. l [32,33,34].
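The three statistics just defined can be computed directly on small graphs; the sketch below uses plain BFS and adjacency lists (illustrative helper names, connected undirected graph assumed for the APL).

```python
from collections import deque

def avg_degree(adj):
    """AD = 2|E| / |V| for an undirected adjacency-list graph."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def avg_clustering(adj):
    """Mean of per-node clustering c(v) = 2 * links(N(v)) / (k(k-1))."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:]
                    if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def avg_path_length(adj):
    """Mean shortest path length over reachable node pairs, via BFS."""
    total, pairs = 0, 0
    for s in adj:
        dist, q = {s: 0}, deque([s])
        while q:
            v = q.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    q.append(u)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```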
The ratio of the weighted spectral distribution to the node number (RWSD) represents the connection relationship between low-degree nodes [36], most of which are in the periphery. The weighted spectral distribution is defined as ∑_i (1 − λ_i)^4, where λ_i denotes an eigenvalue of the normalized Laplacian spectrum of G [36][37][38]. Please note that the statistic can be quickly calculated by a 4-cycle enumeration algorithm without computing the eigenvalues [39].
Ratio of maximum degree to node number (RMD) represents the influence of the node with maximum degree, which is suitable for the comparison of graphs with different scales [40] .
The closeness centrality (CC) of a node is defined as (N − 1)/s, where N is the total number of nodes and s denotes the sum of the lengths of the shortest paths from the node to all other nodes; it reflects how efficiently the node exchanges information with others [41].
The betweenness centrality (BC) of a node represents the fraction of the shortest paths between any pair of nodes that pass through the node; it describes the potential power of the node in controlling the information flow in a network [42,43].
Community represents local densely-connected structures that are visually salient [44,45] .Thus, a visual evaluation was adopted.Specifically, the communities were detected by a Louvain method [45] and visually displayed by a force-directed method [46] .The correspondence between the communities of an original network and its sampled subgraphs was established by their shared nodes.

Original networks and sampled node number
The above metrics and five widely-used large original networks chosen from the Stanford Large Network Dataset Collection [28] were adopted for the evaluation. Please note that the original networks listed in Table 5 were simplified to undirected graphs; namely, the self-loops, multi-edges, edge directions, and a few isolated nodes with degree zero in the five chosen networks were removed. In addition, the important statistics of the original networks are listed in Tables 6 to 10 for the convenience of comparison.
Table 5. Descriptions of the five widely-used large original networks [28], and the mean and standard errors (below the mean) of the sampled node numbers of the SLSR subgraphs from 100 independent realizations for each sampling rate. According to Section 5.1, our SLSR sampling sequentially executes the AD and DT evaluation, the periphery sampling with a low sampling rate, and the bisection method for preserving core and vertical edges. Since the original network G = (V, E) is unknown, SLSR cannot know in advance the actual number of periphery nodes in G. Because the number of core nodes is far smaller than that of periphery nodes, the product of the sampling rate and ‖V‖ is approximately equal to the expected number of periphery nodes to be sampled. Thus, this rate is not a strict sampling rate, which would be defined as the ratio of the number of nodes in a sampled subgraph to ‖V‖. In order to compare more fairly with the traditional traversal-based samplings on unknown networks, we first obtain the sampled SLSR subgraphs and report the mean and standard errors of their sampled node numbers from 100 independent realizations for each original network and given sampling rate in Table 5. Then, each traditional traversal-based sampling outputs 100 subgraphs whose node number equals this mean for each original network.

Variance comparison with statistics
At a low sampling rate, low variance is important for the reliability of sampling results. Thus, we first compare the standard errors of the statistics AD, ACC, AC, RWSD, RMD, and the CC of the maximum degree node of the sampled subgraphs from 100 independent realizations, where the maximum degree node in an original network is easily preserved in the sampled subgraphs by the chosen sampling methods. Please note that, owing to the high time complexity, we compare the standard errors of the statistics APL and the BC of the maximum degree node from 5 or 10 independent realizations.
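As a sketch of how such standard errors are obtained, the snippet below computes AD and ACC over repeated induced subgraph samples and summarizes their mean and standard error. The base graph, sample size, and realization count are illustrative assumptions, not the paper's networks:

```python
import networkx as nx
import numpy as np

def avg_degree(G):
    """AD: average degree of a graph."""
    return 2 * G.number_of_edges() / G.number_of_nodes()

def stat_summary(subgraphs, stat):
    """Mean and standard error of a statistic over independent realizations."""
    vals = np.array([stat(G) for G in subgraphs], dtype=float)
    return vals.mean(), vals.std(ddof=1) / np.sqrt(len(vals))

# hypothetical setup: 20 induced subgraphs from one scale-free network
base = nx.barabasi_albert_graph(2000, 3, seed=1)
rng = np.random.default_rng(1)
subs = [base.subgraph(rng.choice(base.number_of_nodes(), 200, replace=False)).copy()
        for _ in range(20)]
ad_mean, ad_sem = stat_summary(subs, avg_degree)
acc_mean, acc_sem = stat_summary(subs, nx.average_clustering)  # ACC
```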

The BC of the original com-Youtube network with 1,134,879 nodes and 2,987,595 edges is not provided in Table 10 due to its extremely high computation and memory requirements. Please note that the RWSD of the network can be obtained within 2 hours [39], whereas the APL and the path length distribution of the network have to be computed by a parallel algorithm. On a computer with an Intel Core i7-8700 CPU (3.20 GHz) and 16 GB memory, the parallel algorithm with 5 threads used for calculating the APL and the path length distribution runs for about 12 days. Because the periphery sampling in SLSR was chosen as the FF sampling for the experiments, we first compare the variance between SLSR and FF. Based on Tables 6 to 10, we can observe that the standard errors of the statistics of SLSR are generally smaller than those of FF, except for a few cases where the mean is very small. Please note that the low variance of SLSR has been analyzed in Section 5.2. NBRW, CNRW and CNARW are improved random walk samplings designed to reduce the asymptotic variance [10,20]. Although the three samplings provided rigorous mathematical proofs based on Markov chains, their theoretical basis requires that the Markov chain converges, which is difficult to guarantee at low sampling rates. Thus, based on the analysis in Section 5.2 and Tables 6 to 10, the standard errors of the statistics of SLSR can be effectively controlled in most cases. RD consists of two steps: the first extracts a predetermined number of starting seeds using the random node sampling, and the second adopts a deterministic algorithm without randomness [21]. Thus, RD can also effectively control the standard errors in most cases.
Mean is another important indicator of the statistics. Thus, Section 6.3.2 will use the mean to analyze the excessive preference for high-degree core nodes at low sampling rates.
Table 7. The statistics of the original loc-Gowalla network, and the mean and standard errors (below the mean) of the statistics of the sampled subgraphs. APL and the BC of the maximum degree node are related to 10 independent realizations, while the other statistics are related to 100 independent realizations. The sampling rate of our SLSR sampling, and the sampled node number of the other chosen samplings FF, RD, SRW, NBRW, CNRW, and CNARW, have been illustrated in Section 6.2.
Table 8. The statistics of the original com-DBLP network, and the mean and standard errors (below the mean) of the statistics of the sampled subgraphs. APL and the BC of the maximum degree node are related to 5 independent realizations, while the other statistics are related to 100 independent realizations. The sampling rate of our SLSR sampling, and the sampled node number of the other chosen samplings FF, RD, SRW, NBRW, CNRW, and CNARW, have been illustrated in Section 6.2.

Mean comparison with statistics
This section first analyzes the importance of the statistics AD, ACC, APL, RWSD, and RMD in measuring the excessive preference for high-degree core nodes, and then experimentally studies the influence of the core-periphery structures on the excessive preference.
Scale-free networks consist of a dense core and a sparse periphery, and the periphery connects densely to the core. Thus, high-degree core nodes tend to be densely connected to each other. If more high-degree core nodes are sampled, the subgraph induced by the sampled nodes preserves more edges; that is, AD becomes larger as the preference grows stronger.
Given a periphery node, owing to the dense core, its local clustering coefficient tends to be larger as the proportion of high-degree core nodes among its neighbors increases. Thus, ACC generally becomes larger as the preference grows stronger.
Owing to the sparse periphery, nodes in the periphery shorten the path length between each other by routing through the core; that is, APL becomes smaller as the preference grows stronger.
According to the study of Jiao et al. [36], RWSD indicates the feature of connections between low-degree nodes in large networks, and the statistic decreases as the connections become sparser. The excessive preference for core nodes makes the connections between low-degree periphery nodes much sparser, which leads to a smaller RWSD.
In addition, the excessive preference for core nodes generally induces a larger RMD because the maximum degree node must be located in the core.
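The effect on AD described above can be checked on a synthetic scale-free graph: sampling nodes proportionally to degree (a crude stand-in for a core-biased walker) yields a denser induced subgraph than uniform sampling. All graph parameters and sample sizes here are illustrative assumptions:

```python
import networkx as nx
import numpy as np

def induced_ad(G, nodes):
    """Average degree of the subgraph induced by `nodes`."""
    H = G.subgraph(nodes)
    return 2 * H.number_of_edges() / max(H.number_of_nodes(), 1)

G = nx.barabasi_albert_graph(5000, 3, seed=42)  # dense core, sparse periphery
nodes = np.array(G.nodes())
deg = np.array([d for _, d in G.degree()], dtype=float)
rng = np.random.default_rng(42)
k = 250  # a 5% sampling rate

uniform = rng.choice(nodes, size=k, replace=False)
biased = rng.choice(nodes, size=k, replace=False, p=deg / deg.sum())

ad_uniform = induced_ad(G, uniform)
ad_biased = induced_ad(G, biased)  # larger: the degree bias favors core nodes
```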
According to the mean values of the above-mentioned statistics and the core-periphery structures of the original networks shown in Table 4, we study the influence of the core structures on the excessive preference for core nodes. This study is an important contribution of this paper because it helps improve the traditional traversal-based samplings at low sampling rates.
Table 4 shows that the com-Youtube network has the smallest core node percentage and the largest core edge density, while the mean values of the statistics in Table 10 confirm that the traditional traversal-based samplings exhibit the strongest excessive preference for its core. Conversely, Table 4 shows that the com-DBLP network has the largest core node percentage and the smallest core edge density, while the mean values of the statistics in Table 8 verify that the traditional traversal-based samplings can avoid the excessive preference for its core. According to Tables 6 to 10, our SLSR sampling performs outstandingly in Tables 7, 9 and 10. These results were analyzed in Section 4.5: the cores of the com-Youtube, web-Stanford and loc-Gowalla networks in Table 4 are much smaller and denser than those of the other chosen networks.
Moreover, we compare two centralities (i.e., CC and BC) of some important nodes between the original networks and their sampled subgraphs. The nodes were chosen as the maximum degree nodes in the original networks, and they can be preserved in the sampled subgraphs by the biased samplings being compared. The periphery sampling chosen in SLSR is competent for capturing the peripheral degree distribution without interference from the core, all the edges that connect to nodes in the core are preserved due to P3, and the structure of the core is preserved to the maximum extent possible due to P2. All these clues strongly influence the centralities of the top highest-degree core nodes in the SLSR sampled subgraphs, and Tables 6 to 10 verify that the clues are helpful in capturing their CC and BC.

Distribution comparison
This section chooses the degree complementary cumulative distribution [29], the clustering coefficient distribution [32], and the path length distribution [32,34], which are commonly-used measures, for further comparison. Because it is difficult to display the distributions of 100 realizations simultaneously, we choose only one realization, with AD closest to the mean, for the comparison. In addition, we choose the top five sampling methods with the minimum AD standard errors for each original network, since high variance corresponds to high uncertainty.
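The degree complementary cumulative distribution can be computed directly from a degree sequence. A minimal sketch, with a synthetic scale-free graph standing in for the real networks:

```python
import networkx as nx
import numpy as np

def degree_ccdf(G):
    """Return (degrees, P(D >= d)): the complementary cumulative distribution."""
    degs = np.sort(np.array([d for _, d in G.degree()], dtype=float))
    uniq = np.unique(degs)
    ccdf = np.array([(degs >= d).mean() for d in uniq])
    return uniq, ccdf

# hypothetical stand-in for one of the original networks
G = nx.barabasi_albert_graph(1000, 2, seed=7)
ds, ccdf = degree_ccdf(G)
```

Plotting `ccdf` against `ds` on log-log axes gives the usual straight-line signature of a scale-free degree distribution.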
The distribution of nodes with lower degrees is more important for the first two measures. For example, in Fig. 4(a) and (b), degrees not exceeding 10 correspond to 85.09% of the total number of nodes in the com-DBLP network. The path length distribution of a graph was calculated on the maximum connected component of the graph.
Force-directed layout [46] is a powerful visualization tool for communities, but it has two shortcomings when applied to large networks: communities overlap severely with each other, and the layout speed is extremely slow. To make up for these shortcomings, we first use a Louvain method [45] to detect the communities of an original network and its sampled subgraphs, and then extract and visualize only the largest communities. As shown in Fig. 7, the boundaries of distinct communities in the original web-Stanford network are clearer than those in the original loc-Gowalla network. Thus, we chose the former for the visualization comparison. The difficulty of the visualization under low sampling rates lies in the uncertainty of sampling results induced by high variances. To address this issue, we only visualize the subgraphs obtained by RD, SLSR, and CNARW, which exhibit the lowest AD standard errors in Table 9. In addition, we choose two realizations for each sampling method, one with AD closest to the mean and the other with AD farthest from the mean, as shown in Fig. 8. Fig. 7 also shows that some communities gather with other communities where the boundaries between them are vague, and the Louvain method [45] used for community detection is a randomized algorithm that may divide such vague boundaries into different communities.
Both SLSR and RD adopt the random node sampling that chooses nodes uniformly at random, which can effectively avoid getting stuck locally at low sampling rates, as shown in Figs. 9 and 10. Different from RD, SLSR separates the random node sampling from the subgraph representation; that is, the random node sampling is only used for the evaluation of AD and DT and does not participate in obtaining the subgraphs. Specifically, SLSR uses the periphery sampling to construct the complex topological connections between periphery nodes, and adopts the bisection method to preserve important high-degree core nodes. Thus, SLSR not only inherits the advantage of the random node sampling in avoiding getting stuck locally, but also compensates for its shortcomings analyzed in Sections 4.2 and 4.3: the random node sampling loses the top highest-degree nodes in the core and ignores the complex topological correlation between sampled periphery nodes.
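The detect-then-extract step used for the visualizations can be sketched with NetworkX's built-in Louvain implementation (available since NetworkX 2.8). The planted-partition test graph and the choice of six communities below are illustrative assumptions:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def top_k_communities(G, k=6, seed=0):
    """Detect communities with Louvain and keep only the k largest for layout."""
    comms = louvain_communities(G, seed=seed)
    return sorted(comms, key=len, reverse=True)[:k]

# hypothetical stand-in for an original network with clear community structure:
# 8 planted groups of 50 nodes, dense inside (p_in=0.3), sparse between (p_out=0.005)
G = nx.planted_partition_graph(8, 50, 0.3, 0.005, seed=3)
top = top_k_communities(G, k=6)
```

Only the node sets in `top` would then be passed to the force-directed layout, which sidesteps both the overlap and the speed problems noted above.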
The initial seed in Fig. 11(a) falls in the gathering center of overlapping communities, while the huge magenta and green communities in Fig. 11(b) are caused by the initial seed falling into the corresponding vaguely-bounded communities in Fig. 7(a). Although the random walk-based samplings [10,11,20], such as SRW, NBRW, CNARW, and CNRW, focus on theoretical proofs of asymptotic variance, mathematical theory is usually based on simplified assumptions about the real world, and Figs. 8(a) and 11 and Table 10 experimentally show that low sampling rates do not meet the simplified assumptions of the Markov chain theory.

Time efficiency of the traversal-based samplings
The seven sampling methods, namely SLSR, FF, SRW, NBRW, CNRW, CNARW and RD, ran on another computer with an Intel Core i7-8550U CPU (1.80 GHz) and 20 GB memory. The running times are listed in Table 11, which shows that SLSR maintains the high time efficiency expected of unknown-graph samplings. The running time of SLSR depends on the bisection method and the periphery sampling designed in Section 5. Specifically, the periphery sampling corresponds to the chosen sampling, and the time complexity of the bisection method is restricted by the number of sampled periphery nodes multiplied by a logarithmic factor, where that number is sharply decreased, in contrast to the number of nodes in the original network G = (V, E), owing to the low sampling rate (≤ 10%). Please note that the periphery sampling can be replaced by other sampling methods with high time efficiency in Table 11.
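The logarithmic factor in the bisection cost follows the generic pattern below: binary-searching a degree threshold so that a target number of high-degree nodes lies above it. This is an illustrative sketch of such a bisection, under assumed inputs, not the exact procedure of Section 5:

```python
def bisect_degree_threshold(degrees, target_core_size):
    """Binary-search the smallest threshold t with |{d : d >= t}| <= target.
    Each probe scans the degree list once; the loop runs O(log(max degree)) times."""
    lo, hi = min(degrees), max(degrees)
    while lo < hi:
        mid = (lo + hi) // 2
        core = sum(1 for d in degrees if d >= mid)
        if core > target_core_size:
            lo = mid + 1  # threshold too low: the core is still too large
        else:
            hi = mid
    return lo

# hypothetical degree sequence: 90 periphery nodes of degree 1, 10 core nodes
degrees = [1] * 90 + list(range(10, 20))
t = bisect_degree_threshold(degrees, target_core_size=10)
core_size = sum(1 for d in degrees if d >= t)
```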

Figs. 3 to 5. Comparisons of distribution characteristics for the remaining original networks, analogous to Fig. 2.
Fig. 2. Comparison of distribution characteristics between the original ego-Twitter network and its subgraphs sampled by SLSR and related methods with relatively low variances. The sampled node number of the subgraphs has been illustrated in Section 6.2.

Fig. 6. Comparison of distribution characteristics between the original com-Youtube network and its subgraphs sampled by SLSR and related methods with relatively low variances. The sampled node number of the subgraphs has been illustrated in Section 6.2.
Based on the comparisons of Figs. 2 to 6, SLSR can preserve the three distributions.

Community visualization
Fig. 8. AD histograms of 100 realizations for each method (i.e., SLSR, CNARW and RD) on the original web-Stanford network, and the chosen realizations for the visualization: (a) AD histogram; (b) six chosen realizations for the visualization. The sampled node number of the three methods has been illustrated in Section 6.2.

Fig. 9. Visualization of communities in the subgraphs with AD 14.23 and with AD 13.11 that were sampled by SLSR from the original web-Stanford network and were illustrated in Fig. 8. (a) Top 6 largest communities in the former, in which 79% of blue nodes fall in C1, 63% of cyan nodes fall in C2, 51% of magenta nodes fall in C3, and 44% of magenta nodes fall in C4. (b) Top 6 largest communities in the latter, in which 39% of blue nodes fall in C1, 67% of cyan nodes fall in C2, 43% of magenta nodes fall in C3, and 53% of magenta nodes fall in C4.

Fig. 11. Visualization of communities in the subgraphs with AD 22.71 and with AD 17.88 that were sampled by CNARW from the original web-Stanford network and were illustrated in Fig. 8. (a) Top 6 largest communities in the former, in which 53% of blue nodes fall in C1 and 95% of cyan nodes fall in C2. (b) Top 6 largest communities in the latter, in which 46% of blue nodes fall in C1, 76% of cyan nodes fall in C2, 56% of magenta nodes fall in C3, and 82% of green nodes fall in C4.

Table 4. Core-periphery structures of the original networks [28]: the node percentage distribution over the core and the periphery, and the edge percentage distribution over the core, the vertical edges, and the periphery.

Table 6 .
The statistics of the original ego-Twitter network, and the mean and standard errors (below the mean) of the statistics of the sampled subgraphs. APL and the BC of the maximum degree node are related to 10 independent realizations, while the other statistics are related to 100 independent realizations. The sampling rate of our SLSR sampling, and the sampled node number of the other chosen samplings FF, RD, SRW, NBRW, CNRW, and CNARW, have been illustrated in Section 6.2.

Table 9 .
The statistics of the original web-Stanford network, and the mean and standard errors (below the mean) of the statistics of the sampled subgraphs. APL and the BC of the maximum degree node are related to 10 independent realizations, while the other statistics are related to 100 independent realizations. The sampling rate of our SLSR sampling, and the sampled node number of the other chosen samplings FF, RD, SRW, NBRW, CNRW, and CNARW, have been illustrated in Section 6.2.

Table 10 .
The statistics of the original com-Youtube network, and the mean and standard errors (below the mean) of the statistics of the sampled subgraphs. APL and the BC of the maximum degree node are related to 5 independent realizations, while the other statistics are related to 100 independent realizations. The sampling rate of our SLSR sampling, and the sampled node number of the other chosen samplings FF, RD, SRW, NBRW, CNRW, and CNARW, have been illustrated in Section 6.2.

Table 11 .
The mean and standard errors (below the mean) of the running time (seconds) of SLSR and related methods from 100 independent realizations that sample the original networks in Table 5. The sampled node number has been illustrated in Section 6.2.