Introduction

Network science has proved to be an exceptionally useful tool, especially with the large-scale data sets now available in every discipline. Many systems are driven by pairwise interactions at the microscopic level and the network formalism is ideal as it represents these bilateral relationships as edges between nodes. The real power of network analysis is that it gives us tools to look at the emergent behaviour of such systems even when driven by interaction’s on meso- and macroscopic scales1.

A good example is the key task of finding the most important nodes in the system. In network science, this is done by using a centrality measure which assigns nodes a centrality value with larger values indicating greater importance. Centrality has been developed over the last 70 years2,3,4,5,6,7,8,9,10,11,12,13,14,15 and is a core part of many introductory texts on Network Science16,17,18,19. The simplest centrality measure is a degree, the number of neighbours each node has, probably the first node property examined in any study. However, the degree is insensitive to the wider network structure which is the primary goal of network analysis.

In order to probe the broader network structure, many centrality measures are based on the distance between nodes defined as the length of the shortest path between nodes. For instance, our focus is on one of the oldest and most widely used centrality measures, closeness2, the inverse of the average distance from a node to all others. Closeness is an important indicator in many different contexts. It has been used to investigate the role of different academic disciplines when academics choose their journals and bibliographies12. Closeness centrality measures the impact of an author on a field and their social capital10. When used to select potential leads in customer data, closeness led to a significant gain in the success rate9. In an air transport network, it has been shown that the closeness of a city is highly correlated with socio-economic indicators such as gross regional domestic product13. Closeness has been applied to biological networks6 and closeness measures were able to identify more than 50% of the global regulators within the top 2% of the ranked genes8. Essential genes have been found to have higher closeness than nonessential genes in protein-interaction networks7. In a metabolic network, the closeness of nodes can identify the most important metabolites20. Centrality measures such as closeness are an important tool to analyse many types of data.

However, there are a vast number of centrality indices available, as visualised nicely by Schoch21,22, suggesting that there is a lot of redundancy. Many different centrality measures encode similar information as seen in the strong correlations between centrality indices19,23,24,25,26,27,28,29,30,31,32,33. In particular, Pearson correlation coefficients are invariably used which are most sensitive to linear correlations between centrality measures. That is we measure \({\rho }_{cd}=\langle ({c}_{u}-\bar{c})({d}_{u}-\bar{d})\rangle /({\sigma }_{c}{\sigma }_{d})\) where cu and du are two centrality measures for node u, the average is over all nodes with the means (\(\bar{c}\), \(\bar{d}\)) and standard deviations (σc, σd) found using the same ensemble average. There seems to be no clear consensus from these studies other than there are often strong relationships between centrality measures but these vary from network to network. For instance, in the introduction of Schoch et al.30 it is suggested that “Reported results, however, are often inconsistent with regard to the similarity of centrality indices”. Of particular interest here is the conjecture made in the Introduction section of Valente et al.27 where it is stated that “We expect that measures of degree and closeness centrality will be more highly correlated with each other than with other measures because they are both based on direct ties.” Later, in the discussion of results by Valente et al.27, the authors conclude that “The amount of correlation between degree, betweenness, closeness, and eigenvector indicates that these measures are distinct, yet conceptually related” and the closeness-degree pair is only the third most correlated pair of centrality indices in their study.

In this paper, we will focus on the relationship between closeness centrality and degree. Our result is that closeness centrality and degree have a non-linear relationship, namely that the inverse of closeness (‘farness’34, a normalised version of ‘status’5,6,35) is linearly dependent on the logarithm of the degree. This explains why linear correlation measures often link degree and closeness centrality but at the same time, no general pattern has been seen before. Equally our results suggest that studies based on linear correlation measures may well miss important other features in the landscape of centrality measures.

This non-linear relationship means that measuring closeness gives the same broad information as closeness making the computationally expensive calculation of the latter redundant to a first approximation. To extract useful information from closeness the dependence on degree must first be removed and only then will closeness reveal important and distinct information on individual nodes.

Our starting point is that the shortest paths from any one node to all other nodes can be arranged as a spanning tree. We then conjecture that the branches of this tree are statistically similar, implying that the closeness of a node can only depend on the number of such branches, i.e., the degree of the node. If we also assume that the number of nodes in each branch grows exponentially, we find that the inverse of closeness is linearly dependent on the logarithm of degree centrality. The success of this non-linear relationship in a wide range of real-world networks suggests that both these conjectures are broadly true, an insight that can be exploited in a much wide range of network studies.

Results and discussion

Theory

General definitions

For simplicity, we will assume throughout this paper that we are analysing a simple graph \({{{{{{{\mathcal{G}}}}}}}}\) with just one component. We will denote the degree of each node v as kv. A path in a network of length is a sequence of ( + 1) distinct nodes such that each consecutive pair of nodes in the path is connected by an edge. We will define the distance between two nodes u and v in a network to be the length of the shortest path between two nodes, denoted here as duv.

The closeness cv2,16,17,18,19 of a vertex v is then defined to be the inverse of the average distance from v to every other vertex in the graph, so

$$\frac{1}{{c}_{v}}=\frac{1}{(N-1)}\mathop{\sum}\limits_{u\in {{{{{{{\mathcal{V}}}}}}}}\setminus v}{d}_{uv}$$
(1)

where \({{{{{{{\mathcal{V}}}}}}}}\) is the set of nodes and \(N=| {{{{{{{\mathcal{V}}}}}}}}|\) is the number of nodes. Clearly the closer a vertex v is to other vertices in the network, the larger the closeness. Thus closeness mimics the properties of points in a geometric shape where those points closest to the geometric centre will have highest closeness.

Trees16,17,18,19 are connected networks with no loops so the number of edges is always one less than the number of nodes. Here we use a spanning tree36 which is a connected subgraph of the original graph \({{{{{{{\mathcal{G}}}}}}}}\) containing all the original vertices \({{{{{{{\mathcal{V}}}}}}}}\) but a subset of (N − 1) edges that are just sufficient to keep every node connected to all others. We are also going to work with rooted trees \({{{{{{{\mathcal{T}}}}}}}}(r)\) in which we have singled out one special node, the root r of the tree.

Estimate of closeness

We start from the idea that some of the statistical properties of real-world networks may be captured by spanning trees36. Here we are interested in closeness which uses the lengths of shortest paths between nodes so the most useful trees for this work are the shortest-path trees, \({{{{{{{\mathcal{T}}}}}}}}(r)\), that contains one shortest path from a root node r to each remaining node in the network. As our networks are unweighted, the shortest-path trees always exist and are easily defined as part of a breadth-first search algorithm, see the Supplementary Note 2 on the Shortest-Path Tree Algorithm for more details. Every node can act as a root node so there is at least one shortest-path tree, \({{{{{{{\mathcal{T}}}}}}}}(r)\), for every node r. These trees are not unique as there can be many shortest paths between a pair of nodes.

Our picture for these shortest-path trees is shown in Fig. 1. We start with the observation that close to the root node the structure of these shortest-path trees will vary and in particular, the number of nearest neighbours kr of the root vertex r will vary. However, as we move further away from the root node, the number of nodes nr() at some distance from root node r grows exponentially with each step in most networks. This is the origin of the small-world effect seen in many networks, where the distance between nodes is typically much smaller than is found in similar size networks that are constrained by Euclidean geometry, such as a regular square grid of streets or a random geometric graph. Regardless of the local context of a root node, the shortest-path trees quickly access a similar set of nodes in the main bulk of the network provided there is no large-scale inhomogeneity in the network. Thus we conjecture that the structure and statistical properties of these trees away from the root node are similar for all possible root nodes. The contribution to the closeness of each node in the bulk is bigger as they are further from the root and more numerous. So we expect that the largest contributions to closeness always come from the same bulk regions where we can expect statistical similarity.

Fig. 1: An illustration of the shortest-path tree approximation.
figure 1

Every node r, here the red star, is considered to be the root node of a shortest-path tree \({{{{{{{\mathcal{T}}}}}}}}(r)\). This has kr nearest neighbours, as indicated by the five solid black lines. Each of the neighbouring nodes, here the blue circles, is treated as the root of a branch of the shortest-path tree. These branches are treated as being statistically identical with a branching number \((1+\bar{z})\) as indicated here through the use of the same shaded shape rooted on each neighbouring node. The grey dashed lines represent some of the many edges in the graph \({{{{{{{\mathcal{G}}}}}}}}\) that are not included in the shortest-path tree.

The most important difference when comparing different root nodes, therefore, comes from the local structure of each root node. In particular, the initial value for the exponential growth in the number of nodes at a distance from the root will have a large effect. This will depend on the local structure and the simplest effect comes from the number of immediate neighbours the root node has, i.e. the degree of the root node kr. The simplest approximation for the growth of these shortest-path trees is therefore \({n}_{r}(\ell )={k}_{r}{\bar{z}}^{\ell}\), where \(\bar{z}\) is some measure of the rate of exponential growth of the shortest-path tree. Note that our assumption of statistical similarity suggests that the branching factor of these trees is, on average, the same so we use a single parameter \(\bar{z}\) to represent the exponential growth from any root node r.

Our crude approximation is clearly going wrong when we look at large distances from the root as eventually any real network will run out of vertices so n() = 0 for large . One can model the end of the exponential growth in different ways but we will use the simplest. Namely, we will define a sharp upper cutoff Lr and assume that nr() = 0 for  > Lr which gives

$$N=1+\mathop{\sum }\limits_{\ell =1}^{{L}_{r}}{k}_{r}\,{\bar{z}}^{\ell -1}=1+{k}_{r}\frac{({\bar{z}}^{{L}_{r}}-1)}{(\bar{z}-1)}\,.$$
(2)

We can invert this expression to express this cutoff Lr in terms of parameters N and kr, giving us Lr = L(N, kr) where for large N

$$L(N,k)\approx \frac{\ln \left(N(\bar{z}-1)/k\right)}{\ln (\bar{z})}\,.$$
(3)

Individual distances are integers but it is clear from the form in Eq. (3) that we need L to be a real number; in some sense L is an average over the actual distances from the root to the leaves (nodes with degree one) of the tree. This also tells us that the distance scale associated with a node depends on the logarithm of that node’s degree. Since the inverse closeness is a sum of distances, this distance scale L controls the result and that explains why \(\ln (k)\) appears in our expression for inverse closeness. As an aside, the \(\ln (N)\) dependence of this distance scale L in Eq. (3) reflects the small-world effect seen in most networks.

We now have that \({n}_{r}(\ell )={k}_{r}{\bar{z}}^{\ell -1}\) for  < L(N, kr) and zero for larger which depends on local parameters kr, the degree of each root node, and two global parameters, the total number of nodes N and some measure of the growth rate of the shortest-path trees \(\bar{z}\). We can now rewrite the closeness cr Eq. (1) of a vertex r in terms of nr() as \(1/{c}_{r}={(N-1)}^{-1}\mathop{\sum }\nolimits_{\ell = 1}^{{L}_{r}}\ell {n}_{\ell }\) to find that

$$\frac{1}{{c}_{r}}=\frac{1}{(N-1)}\mathop{\sum }\limits_{\ell =1}^{{L}_{r}}\ell {k}_{r}{\bar{z}}^{\ell -1}=\frac{k}{(N-1)}\left(\frac{({L}_{r}+1){\bar{z}}^{{L}_{r}}}{\bar{z}-1}-\frac{({\bar{z}}^{{L}_{r}+1}-1)}{{(\bar{z}-1)}^{2}}\right)$$
(4)

By using Eqs. (2) and (3) we can eliminate Lr to find that

$$\frac{1}{{c}_{r}}=-\frac{1}{\ln (\bar{z})}\ln ({k}_{r})+\beta \,.$$
(5)

Our calculation shows the parameter β is also independent of the root vertex r chosen but it is a function of the global network parameters N and \(\bar{z}\) so that \(\beta =\beta (\bar{z},N)\) where

$$\beta (\bar{z},N)=\left(\frac{1}{(\bar{z}-1)}+\frac{\ln (\bar{z}-1)}{\ln (\bar{z})}\right)+\frac{1}{\ln (\bar{z})}\ln (N)\,.$$
(6)

Our prediction is that the inverse of closeness cr of any node r should show a linear dependence on the logarithm of the degree kr of that node with a slope that is the inverse of the log of the growth parameter \(\bar{z}\).

In our analysis, we will not assume the parameter β is given by Eq. (6). By adding one additional fitted parameter we lose a little predictive power since many parameters are needed to characterise a network. This leaves us with a conjecture based on the number of nodes N and the degree of each node kr which are usually known. Then in principle, we have two unknown global parameter values which we find from a linear fit to our data for cr and kr giving \({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}}\) and β(fit).

Numerical results for theoretical models

We looked at the relationship between closeness and degree using simple networks produced from three different theoretical models17,18,19: the Erdős-Réyni (ER) model37 the Barabási-Albert model with pure preferential attachment38 and the configuration model39 network starting from a network generated with the same Barabási-Albert model. In the first and third model, the edges are completely randomised so there are no vertex-vertex correlations. The last two models both have fat-tailed degree distributions. Our networks built from artificial models were created using standard methods in the networkx package40.

For a single network, we get several nodes with the same degree and we use this variation to find a mean and standard error in the mean shown. The fit is done using Eq. (5) with two free parameters \({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}}\) and β(fit) and the goodness of fit measures in Table 1 shows this is a good fit, confirmed visually by the plots in Fig. 2a. Note that a good linear fit of status ((N − 1)/c) to the logarithm of degree for one example of each of the Erdős-Réyni and Barabási-Albert models was found in Fig. 1 of Wuchty & Stadler6.

Table 1 Results for artificial networks.
Fig. 2: Closeness and degree for artificial networks.
figure 2

In (a), each plot shows results for networks formed from one artificial model: the Erdős-Réyni (ER) model, the Barabási-Albert (BA) model, and the configuration model network starting from a Barabási-Albert model (Config-BA). The dashed lines shows the best linear fit of 1/c to \(\ln (k)\) using Eq. (5). The same data from all nine artificial networks is shown in the scatter plot in (b) with data \(1/\hat{c}\) against predicted value c obtained from the best fit Eq. (5) and the shaded region corresponds to a 5% deviation from the theoretical prediction. Panel c shows the fractional error, the fitted value of closeness divided by data value. The results are for three different sized networks: N = 1000 (red points) N = 2000 (blue points) and N = 4000 (yellow points) where N is the number of nodes. All networks have average degree 10.0 and 100 realisations were taken for each case. The values of closeness for each value of degree are binned, the mean is shown as the data point with error bars the standard error of the mean. The results show that the non-linear correlation of closeness and degree predicted in Eq. (5) works most of the time within a 2% variation. There are some hints of small but systematic at higher degree value but the data are sparse and less reliable here.

Roughly speaking we find that mean inverse closeness values for any one degree are typically within 2% of the prediction made from the best fit as we can see in “Results” are shown in Fig. 2b. The small deviations seen in Fig. 2c, especially for the Config-BA model, are for higher degree values where the data are sparse and uncertainties are large. This means no firm conclusions can be drawn from Fig. 2c about the presence of higher-order corrections to our form Eq. (5).

We now turn to look at the actual values obtained from these fits of data on closeness and degree from the artificial networks to Eq. (5). As Table 1 shows there is a small amount of variation in value of \({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}}\), the fit for the shortest-path tree growth factor, with the size of the network. What is of more interest are the differences in values between these three types of artificial networks. All these networks had an average degree of about 10.0 and an infinite tree with a constant degree 10 (a Bethe lattice) would have a growth factor \(\bar{z}=9\), one less than the average degree. So the best fit values for the growth factor \(\bar{z}\) in the Erdős-Réyni networks are a little higher than this while the Barabási-Albert network and its randomised version are a lot bigger.

Another possible reference value for the shortest-path tree growth factor \(\bar{z}\) is the average degree of a neighbour in a random graph with the same degree distribution which is 〈knn = 〈k2〉/〈k〉. This is the relevant value for diffusive processes on a random graph. For our finite Erdős-Réyni networks we have that 〈knn ≈ 〈k〉 so again the growth factor found to give the best fit, \({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}}\), in actual Erdős-Réyni networks is still a bit higher than this estimate. For the Barabási-Albert networks and their randomised versions, the 〈knn is around twenty-two to twenty-five for the networks in Table 1. This value is much closer but still not in complete agreement. This suggests our shortest-path trees are sampling nodes in different ways from diffusion but still with a bias to higher degree nodes.

Since spanning trees have many fewer edges than the original graphs, it is perhaps somewhat surprising that we find that the growth factors are comparable with any measures of the average degree in the original network. So the high values of \(\bar{z}\) are telling us that the shortest-path trees are sampling the nodes of their networks with a large bias towards high-degree nodes in the parts of the tree close to the root node and that is why we need such a high growth rate \({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}}\) when we fit our data for closeness. That way when we prune the edges to produce a tree we will still have high degrees in the tree close to the root node. The corollary is that the outer parts of shortest-path trees are dominated by leaves (degree one nodes) and other low-degree nodes, and these also correspond to low-degree nodes in the original network.

It is also clear that node correlations play an important role as these are present in the Barabási-Albert model but absent in the randomised version. The large difference in \(\bar{z}\) values for these two cases show such node correlations are important and yet, the non-linear relationship in Eq. (6) still holds well in these artificial networks, with or without these correlations.

The β parameter in Eq. (5) is harder to interpret but Table 1 shows a comparison between the two values of β. The first is β(fit) derived from a two-parameter fit of the data to Eq. (5). The second value is \(\beta ({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}},N)\) the value predicted using Eq. (6) where we use the \(\bar{z}\) value obtained from the same two-parameter fit and the number of nodes N. What we can see is that the values derived using Eq. (6), \(\beta ({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}},N)\), are consistently poorer than the values β(fit) derived from a two-parameter fit. It highlights that the details of our theoretical form, such as the precise formula for β, here Eq. (6), can be improved. However, our simple calculation has captured the important features of the problem so that the form Eq. (5) does work in these theoretical models provided we treat both \(\bar{z}\) and β in Eq. (5) as free parameters to be determined.

Numerical results for real data

For networks representing real-world data, we used data that is open access and easily obtained41,42,43,44. We aimed for a wide range of networks both in terms of size and in terms of the type of interaction encoded in these real-world networks.

The first set of eighteen data sets we refer to as the Konect-SNAP networks. These were derived from real-world data and were chosen to reflect five broad categories of networks: social networks (social-…), communication networks (commun-…), citation networks (citation-…), co-author networks (coauth-…), and hyperlink networks (hyperlink-…). A more detailed description of these Konect-SNAP networks is given in Supplementary Note 3 on data sets.

Summary statistics are given in Table 2. The reduced chi-square \({\chi }_{{{{{{{{\rm{r}}}}}}}}}^{2}\) measure is between 1.05 and 1.61 for ten, more than half, of our examples and another four networks have values between 2.09 and 2.86. Given the wide range of both size and nature of these networks and the simplicity of our theoretical derivation, this level of agreement may not have been expected. We also give the Pearson correlation measure between closeness and degree, ρ(c, k), and this is generally high as has been noted before23,24,25,26,27,28,29,30,31,32. The success of our non-linear relationship between closeness and degree is not incompatible with high ρ(c, k) values.

Table 2 Results for the largest components the Konect-SNAP networks.

The data for each network is shown in more detail in Fig. 3. Again, we can see that within the error bars the average closeness at each degree generally follows the form we predict within 5% when the best fit parameters are used. Further, the uncertainties estimated for these data points suggest that the vast majority of average closeness values are statistically consistent with the predicted value for that degree, something already captured by the reduced chi-square values in Table 2.

Fig. 3: Closeness and degree for Konect-SNAP networks.
figure 3

Results for the largest components of eighteen Konect-SNAP networks derived from real-world data, see Table 2 for the statistics of each dataset. The yellow shaded region corresponds to 5% deviation and grey region corresponds to a 10% deviation. Plots in column (a) show the inverse of the predicted result \(1/\hat{c}\) from the best fit against the inverse of the mean measured value 1/c = 1/〈ck averaged over nodes with the same degree k. Both axes are essentially \(\ln (k)\). Column (b) shows the ratio of best fit value \(\hat{c}\) over measured value c as a function of degree k rescaled by the largest degree in each network \({k}_{\max }\). If the prediction matched data perfectly, points will lie on the dashed lines. The error bars represent from standard error of mean of the inverse closeness. For majority of points, we can see our prediction Eq. (5) captures the relation between closeness and degree, usually with within a 5% margin.

Finally, we looked at our relationship between degree and closeness in 112 networks taken from the Netzschleuder archive of 276 network data sets44. Our only selection criterion was that we could automatically download a network and that it could be analysed successfully by our standard code without further work. This excluded several examples in this archive such as those with multiple networks (e.g. amazon_copurchases), or some which were too large for our code (e.g. academia_edu). Supplementary Note 3 contains further information on these Netzschleuder networks. As a result, our sample contains many for which we would not expect much success: some are very small, some are very dense, and some have added additional known structures, e.g., a bipartite network. At the same time there are many for which we would expect to be successful, typically anything sparse and large. We represented each as a simple graph and analysed the largest connected component which had up to 40,000 nodes and an average degree of <300.

The results are shown in Fig. 4 with additional information and a table of results on the Netzschleuder networks provided in Supplementary Note 4. We found 50 networks had an excellent fit with a reduced chi-square \({\chi }_{r}^{2}\) close to 1.0 and always <2.0. Another 13 networks gave a reasonable fit \(2.0\le {\chi }_{r}^{2}\ < \ 3.0\) and 9 networks had \(3.0\le {\chi }_{r}^{2}\ < \ 4.0\). For the remaining networks, our code did not find a \({\chi }_{r}^{2}\) for 13 of these as there was at most one node for each degree value and our code could not estimate the uncertainty in the measurement of closeness. Overall, of the 99 networks where we had a \({\chi }_{r}^{2}\) result, our degree-closeness relationship was very successful (\({\chi }_{r}^{2}\ < \ 2.0\)) in these arbitrary networks 50% of the time.

Fig. 4: Closeness and degree for Netzschleuder real-world networks.
figure 4

Results for the largest components of an additional 112 networks taken from the Netzschleuder archive44. Each point represents one network, plotted using the number of nodes N and the average degree 〈k〉 as coordinates. The symbol colour indicates reduced chi-square (\({\chi }_{r}^{2}\)) value with yellow being the best results (closest to 1.0). The symbol shape also indicates \({\chi }_{r}^{2}\) values with triangles representing best results where \({\chi }_{r}^{2}\ < \ 2.0\), squares are for \(2.0\le {\chi }_{r}^{2}\ < \ 3.0\) and circles represent poor performance with \(3.0\le {\chi }_{r}^{2}\). The stars (white fill) are networks where our \({\chi }_{r}^{2}\) calculation failed due to a lack of an error estimate on each point which occurs when we have one node per k value. To the right (left) of the dashed (dotted) line, networks have a density 〈k〉/(N − 1) of <0.01 (>0.1).

However, from Fig. 4 we can see that the density of a network, the number of edges divided by the number of node pairs, has an impact on the success rate. Our sample has included several very dense networks where there was a lot of additional information on the edges so the existence of an edge was much less important. Indeed from Table 3 we see that for Netzschleuder networks with densities below 0.04 we have around 70% fitting with a reduced chi-square of <2.0, rising to roughly 80% for \({\chi }_{r}^{2}\ < \ 3.0\).

Table 3 The success rate for sparse Netzschleuder networks.

Using closeness

Our numerical results confirm our analytical work that the inverse of closeness depends linearly on the logarithm of degree Eq. (5) for most networks. This is a correlation, true on average but not an exact result for every node. It has long been known that nodes with larger degrees tend to have smaller closeness which leads to significant correlation measures19,23,24,25,26,27,28,29,30,31,32,33 but this is often discussed in terms of the Pearson correlation coefficient which is most sensitive to linear correlations. A non-linear relationship between degree and closeness is not discussed in previous studies and our relationship puts this intuitive understanding of a degree-closeness relationship on a firm footing.

Our work suggests that in the majority of networks, to a first approximation, closeness of individual nodes captures little more information on average than is contained in the degree. There is no point spending time calculating closeness if you only want a rough measure of centrality, you may as well just use degree. However, closeness measurements can give useful information on a network if used correctly.

First, closeness and degree measurements yield our fitted parameters, β(fit) and \({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}}\). These characterise each network and can be used to compare different types of networks and even networks of different sizes. In particular, \({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}}\) characterises the exponential growth that is a feature of most complex networks and which is behind many well-known phenomena as we shall note later in this discussion.

There are a number of ways to determine \(\bar{z}\) and β from our relation Eq. (5). We have found the most effective approach is the simpler method where we determined \(\bar{z}\) and β from a linear fit of our data to Eq. (5). There are alternatives but these throw light on various possible approximations rather than being of practical use, see the sections on determining \(\bar{z}\) and β in Supplementary Note 1 and results for higher-order polynomial fits in Supplementary Note 4.

For networks where degree and closeness are linked by our relationship Eq. (5), our work shows that useful information from closeness centrality for individual nodes can only come by comparing values for individual nodes against the expected value derived using our relationship Eq. (5) using the logarithm of degree. The only useful information in closeness values is the deviation from their expected value. In these situations, we could start by examining the degree centrality of every node. This would be the primary measure of centrality. We then fit our closeness values using Eq. (5) to produce an expected value of closeness \({c}_{v}^{({{{{{{{\rm{fit}}}}}}}})}\) for each node. Finally, we use this fit to find nodes which are noticeably more (or less) central than expected. One way to do this is to look at the normalised closeness

$${c}_{v}^{({{{{{{{\rm{norm}}}}}}}})}=\frac{{c}_{v}}{{c}_{v}^{({{{{{{{\rm{fit}}}}}}}})}}\,.$$
(7)

Our normalised closeness measure \({c}_{v}^{({{{{{{{\rm{norm}}}}}}}})}\) will highlight the outliers which would then be of most interest. See Supplementary Note 4 to see some examples of the fluctuations in closeness values around our predicted value.

One could also compare closeness to degree by running the configuration model, which keeps node degree constant, and measuring the closeness of each node in such a null-model network. However our method, requiring a simple linear fit to data already acquired, will be much faster than running the configuration model.

General network insights

The success of our conjecture also suggests that most networks satisfy two key assumptions built into our derivation.

First we assumed that the number of nodes a distance from any node grows exponentially. Such exponential growth is common in networks as it is the mechanism behind the concepts of the “six degrees of separation” and the “small-world” effect45 often reported in networks. More formally, this is linked to length scales in networks models with N vertices which grow as \(O(\ln (N))\). This is to be contrasted with a network controlled by the geometry of a d-dimensional Euclidean space (such as Random Geometric Graphs) where the number of nodes at distance from a node grows as a power law d−1 and length scales in such networks grow as O(N1/d). We will consider the length scales in more detail in the Network Length Scales subsection below.

This non-Euclidean behaviour of most networks highlights one situation where closeness is a useful centrality measure independent of and uncorrelated with degree, so a situation where our relationship Eq. (5) will fail. That is for graphs embedded in Euclidean space. This is somewhat ironic as this type of planar network, such as a network representing the connections between intersections of streets in a city, is the prototypical example used to motivate the idea that the network measure closeness is related to our intuition about the concept of centrality in a network. Indeed, Bavelas2 only used planar graph examples to develop closeness so a link between closeness and degree in most networks was never an issue in the original motivation for the closeness measure.

The second assumption that our work supports is that the branches of the shortest-path trees are statistically similar as illustrated in Fig. 1. The success of our analysis suggests this assumption works well whenever we are looking at measurements that depend on the bulk of the network. This simple approximation may therefore help analyse other network measurements, and we consider some of these in the Network Length Scales subsection below.

Closeness and real-world networks

Given the simple assumptions and approximations used in deriving our relationship Eq. (5), it is perhaps surprising to find that this closeness-degree relation works well for so many networks based on real-world data. So how often is our relationship a success? Also, can we understand when our closeness-degree relationship may succeed and when it is likely to fail?

For the first set of eighteen Konect-SNAP networks shown in Fig. 3 and Table 2, we had eleven (61%) with a reduced chi-square of <2.0 and all but two (89%) have a reduced chi-square with <3.0. The success rate for the 112 Netzschleuder networks, as measured by reduced chi-square, is lower. Of the ninety-nine Netzschleuder networks where we had a reduced chi-square measurement, fifty (51%) had a reduced chi-square of <2.0 while sixty-three (64%) had a a reduced chi-square of below 3.0. The lower success rate for the Netzschleuder networks can be understood as we are sampling without bias a varied collection of network data sets and, in many cases, we have information on the networks which suggests we might never have expected success.

The simplest issue is network density, the fraction of possible edges that are actually present in a network. If the density is very high, the distances between nodes will be low, perhaps just one or two. This leads to very small variations in the distances encoded in the closeness of each node making it harder to distinguish meaningful patterns in closeness measurements so our relationship is more likely to struggle with high-density networks. This trend is clear in Fig. 4. Of the two Konect-SNAP networks with poor chi-square, one, the social-jazz network, has a high density of 0.13. Conversely, we find that the success rate is much higher in our Netzschleuder networks for networks with density below 0.04: roughly two-thirds of these low-density networks fit our degree-closeness relationship well with \({\chi }_{r}^{2}\ < \ 2.0\) and this rises to 80% if we accept anything with \({\chi }_{r}^{2}\ < \ 3.0\), see Table 3.

More generally, a failure of our relationship is a powerful tool to highlight where there is an additional structure in the network. Our theoretical analysis assumes a generic network where every node sees the same exponential growth in the number of nodes at distance away as discussed in the General Network Insights subsection. If this generic structure is not present, we have no reason to expect our relationship to work well.

To see this structural issue, consider some of the notable exceptions to our low-density criterion. Looking at Fig. 4, the low-density networks with N between 700 and 2000 and 〈k〉 between 2.0 and 3.5, you can see two good networks with good fits (triangles) and three with poor fits (two circles and a square). One circle comes from the crime network with N = 1263, 〈k〉 = 2.2 for which we find a reduced chi-square of \({\chi }_{r}^{2}=3.8\). The description of the crime network mentions that the data was obtained using “snowball sampling from five initial homicides” so we are looking at a biassed sample of a much larger network. We suspect this sampling produces a different structure from a typical complex network leading to a failure of our relationship in this case. The other circle is the plant_pol_kato network with N = 768, 〈k〉 = 3.1 and \({\chi }_{r}^{2}=34.1\) which is a “bipartite network of plants and pollinators” from a forest. Our analysis does not allow for a bipartite structure and in this case the difference between the two types of nodes is large with 91 plant species nodes but 715 nodes representing species of insects. The square represents the unicodelang network with N = 858, 〈k〉 = 2.9 and \({\chi }_{r}^{2}=2.4\). This is still a good fit but it is also a “bipartite network of languages and the countries” so again, a better result might be achieved if we adapted our approach for two-mode networks.

In some cases, the meta-data we have for a network may already tell us about large-scale structure, such as a bipartite network, that will invalidate our relationship Eq. (5). In simple cases, it may be possible to adjust our derivation to find a more appropriate relation that matches the known structure. For instance, for bipartite networks we could represent our shortest-path trees using two growth rate parameters, \({\bar{z}}_{a}\) and \({\bar{z}}_{b}\), for odd and even distances from the root node.

In other cases, failure may indicate the presence of a structure that is not already known. For example, the network may have strong inhomogeneities such as high-degree nodes clustering together in a dense core. That would invalidate our assumption that all the branches of the shortest-path trees look statistically similar. We can see this type of problem in discussions of the average path length in random graph models46,47. There it was noted that the typical \(\ln (N)\) dependence of length scales in complex networks is not seen when these models have a degree distribution of the form p(k) ~ kγ with 2 < γ < 3. In these cases this failure is linked to the network taking on a rather different structure, perhaps an “octopus”46, a star-like structure with one dense core connected to many legs. This type of network invalidates our assumptions and we would not expect to see our closeness-degree relationship.

Network length scales

We can compare our analytical approach to that of other theoretical work on length scales in networks. The vast majority of the analytical results on distance in networks is for well-defined simple theoretical models, typically the Erdős-Réyni graph17,46,48,49, the Barabási-Albert model50,51, or scale-free random graphs52 (our BA-Configuration model). The focus in the literature tends to be on characteristic length scales for a graph, such as the average path length 〈〉 or the diameter, and not on length scales associated with each vertex, such as the closeness that we consider.

What is distinctive about our approach is that we focus on generic network properties that appear to hold in many networks. So our approach can be used on a much wider range of networks and in particular work for real-world networks not just for one simple model. The downside is that we do not have the mathematical rigour of those working with simple models. So, let us consider how our approach can be related to other length scale calculations in the literature.

We have already noted that the concepts of the “six degrees of separation” and the “small-world” effect45 are linked to a network models where the length scale grows as \(O(\ln (N))\), much slower that for O(N1/d) behaviour expected for networks embedded in d-dimensional Euclidean space. One common way to study this more precisely is to consider the average distance between nodes 〈〉, where each node pair contributes equally to the average. This is what we obtain if we take half the average of the inverse closeness Eq. (1) over all vertices, \(\langle \ell \rangle ={(2N)}^{-1}{\sum }_{r}{({c}_{r})}^{-1}\). From our results for closeness, we can see that the N dependence in 〈〉 comes from the \(\ln (N)\) term in the expression for β of Eq. (6), see the section on average shortest-path length in Supplementary Note 1 for more details. So our result is consistent with the behaviour found more rigorously for the average path length in random graph models46,47. The success of our method in terms of this average path length result shows that small-world behaviour can be linked to the simple network features built into our method, those discussed in the General Network Insights subsection.

Our theoretical cutoff L(N, k) of Eq. (3) gives another length scale for each node and again this has the same small-world \(\ln (N)\) behaviour. One might conjecture that this cutoff length scale L(N, k) could be linked to the eccentricity length scale. The eccentricity6,16er of a node r is the largest distance from r to any other node, \({e}_{r}=\max \{{d}_{rv}| v\in {{{{{{{\mathcal{V}}}}}}}}\}\). In that case we are making a new conjecture that eccentricity should also be linearly dependent of \(\ln (k)\) with a gradient of \(-1/\ln ({\bar{z}}^{{{{{{{{\rm{(fit)}}}}}}}}})\) to match our theoretical cutoff L(N, k). As seen in Fig. 1 of Wuchty & Stadler6 and confirmed in our own analysis, eccentricity does depend linearly on \(\ln (k)\) but the gradient does not seem to match the prediction from our theoretical cutoff L(N, k). Roughly speaking our cutoff L(N, k) represents a typical average large scale, not an extremal value of a distribution, so we should never expect a close link with eccentricity. For more details see the section on eccentricity and L(N, k) in Supplementary Note 1.

As the largest eccentricity is the diameter of a graph, we can see the \(O(\ln (N))\) behaviour of our L(N, k) expression as matching this behaviour seen analytically in the diameter in simple models. However, just as for eccentricity and L(N, k), we don’t expect to be able to get a precise handle on diameter in our approach.

We also see strong similarities between our approach and that used in several papers48,52 where the network is reduced to a set of rings of nodes, each ring containing all the nodes at the same distance from a root node. For instance, the mean first-passage time of random walkers on a network to a given vertex v is just the inverse of closeness where now closeness is defined in terms of a new distance function duv in Eq. (1), where duv is the average first-passage time for a random walker to move from vertex u to vertex v (this is also known as Markov centrality53 or Random walk closeness centrality). Mean first-passage time has been observed to be proportional to degree48 and applying our simple approximations to the “ring” method of Baronchelli & Loreto48 quickly reveals this feature, see the section on the ring calculations of first passage times in Supplementary Note 1 for more details.

Overall, our approach can give insights into the behaviour of many network length scales, sometimes only very roughly, sometimes with more precision. It can never match the precision of the analytic calculations done for the simplest models but our approach can be used in a much broader range of networks.

Improvements

We have already noted one simple improvement when working with bipartite networks. That is to use two growth rate parameters, \({\bar{z}}_{a}\) and \({\bar{z}}_{b}\) for odd and even distances from the root node in the shortest-path tree of bipartite networks.

Other extensions are suggested by probing our numerical results in more detail. For our smaller set of 18 Konect-SNAP networks, beyond the two poorly fitting cases, our relationship Eq. (5) is very successful for most individual nodes within a 5% margin, a success which may not be expected given the simple analytical derivation. However, we can see some clear if small trends in the deviations in Fig. 3. We suggest these trends highlight the limitations of our analytical approach but it is possible to improve our theoretical methods.

At the simplest level, we could replace the sharp cutoff used for n(r) where n(r) = 0 for  > L. There are examples of these distributions for some simple models in Baronchelli & Loreto48. A better cutoff may well lead to better predictions for β allowing one to fit a function with one independent parameter rather than two that we used by keeping β as a fitted parameter. However, while fitting one rather than two parameters may be theoretically satisfying, it does not seem much of a gain for the analysis of real-world networks.

Another option might be to calculate a different network parameter, namely the second degree \({k}_{r}^{(2)}={n}_{\ell = 2}(r)\)54 for each node r. By finding the number of nodes two steps away from every node, we can make a better approximation for n(r), that is n0(r) = 1, n1(r) = kr, and \({n}_{\ell }(r)={k}_{r}^{(2)}{\bar{z}}^{\ell -2}\) for 2 ≤  ≤ Lr and n(r) = 0 for  > Lr. This approach cannot be worse than the method used here as the latter is included as a special case where the second degree \({k}_{r}^{(2)}=\bar{z}{k}_{r}\) for all nodes r. To leading order we get the same type of result, namely that \(1/{c}_{r}={(\bar{z})}^{-1}\ln ({k}_{r}^{(2)})+\beta\) since the degree kr now only contributes a small number of terms to closeness. So in this approach using second degree we need to measure a different set of N parameters, the second degree of each node. Finding second degree is slower numerically than degree but both scale in the same way with increasing network size. The success of our simpler method here points to the idea that second degree and degree may often be correlated so it is likely that using second degree may only enhance results in a few cases.

More serious changes will be needed to the calculation if other effects neglected here, such as community structure or degree assortativity, are to be included.

Distance and logarithm of degree

The logarithm of degree \(\ln (k)\) has been found to play an important role in network analysis before. A large fraction of papers on networks will show degree distributions where the horizontal axis is the dependent variable \(\ln (k)\) and not simply the degree k. A more specific example comes from Zhou et al.55 where the ratio of the degrees of nodes at the two ends of each edge (largest value in the numerator) is used to assign a ‘distance’ η(u, v) to each edge (u, v). This is equivalent to defining \(\lambda (u,v)=\ln (\eta (u,v))=| \ln ({k}_{u})-\ln ({k}_{v})|\). In fact one can quickly see that while both η and λ are semi-distances on the set of edges in the formal mathematical sense, only λ is also a semi-metric and so λ is in some sense the more natural ‘distance’ measure in a qualitative sense. Our work suggests that an alternative view is to replace the logarithm of degree by the inverse of closeness. Since \({({c}_{u})}^{-1}\) is the actual average of the shortest-path distances from u to all other nodes, we can immediately see it is natural to work with inverse closeness when considering distances. For instance Zhou et al.55 we could look at a different edge measure \(\tilde{\lambda }(u,v)=| {({c}_{u})}^{-1}-{({c}_{v})}^{-1}|\). While the inverse closeness is a more natural distance, the degree is much easier to calculate in practice. Our work allows researchers to move between these two pictures.

Closeness and Gromov centrality

It has been noted56 that the inverse of closeness is related to another centrality measure based on the average of the Gromov product. This centrality measure captures the extent to which the triangle inequality is not saturated between three nodes in a network, and so this is deeply connected to the geometry of a network. Our result leads to a natural prediction for this Gromov Centrality \({G}_{r}^{D}\) of node r defined on the scale of the network’s diameter D56. This Gromov Centrality is defined on other network length scales, \({G}_{v}^{\ell }\), and Babul et al.56 suggest there are useful generalisation of closeness. Our approximations will prove just as effective for such generalisations of closeness.

Conclusions

We have derived a non-linear relationship between degree and closeness Eq. (5), two of the most important centrality measures in Network Science. This was achieved by assuming that every node can be seen as the root of a shortest-path spanning tree where for all nodes these trees are statistically similar two or more steps away from the root. We have shown that this degree-closeness relationship works on a large number of artificial and real-world networks, particularly low-density networks where there is no special macroscopic structure.

Our results mean that measuring closeness is of little use unless our relationship is used to remove the dependence on degree from closeness. Further, most networks can be seen as shortest-path spanning trees which are statistically similar two or more steps away from their root nodes. Equally, if our relationship fails, it is an indication that our basic assumptions about network structure are wrong and so there are strong inhomogeneities and macroscopic structure in the data.