The last two decades of network science have discovered stunning similarities in the topological characteristics of real life networks (many biological, social, transportation and organizational networks) on a strong empirical basis. However our knowledge about the operational paths used in these networks is very limited, which prohibits the proper understanding of the principles of their functioning. Today, the most widely adopted hypothesis about the structure of the operational paths is the shortest path assumption. Here we present a striking result that the paths in various networks are significantly stretched compared to their shortest counterparts. Stretch distributions are also found to be extremely similar. This phenomenon is empirically confirmed on four networks from diverse areas of life. We also identify the high-level path selection rules nature seems to use when picking its paths.
The implicit “shortest path” assumption, meaning that the used communication path in a network is the one with the shortest length, seems to dominate the network science community and most of the fundamental network metrics (diameter, average path length, centrality metrics1, etc.) are computed using this assumption. There are other works supposing various models2, network metrics (e.g. degree, centrality, congestion, homophily3,4,5) and hidden structures (e.g. hidden hierarchies and metric spaces6,7,8) guiding path selection. The contribution of these works is considerable in the modeling and understanding of simple routing strategies that can recover near shortest paths without requiring global knowledge of the topology. However, a lack of confirmation with empirical data leaves an important question open: What kind of paths are actually chosen by nature in real world networks?
Here we approach the question of path selection in networks from this lacking empirical angle. Using existing and newly created datasets of the traffic flow on real world networks (see Methods), we compare the topology of the networks to the structure of empirically-determined paths extracted from these datasets. From this comparison we infer common characteristic rules of path selection in different networks, which we call routing policies. Our study here presents the analysis of empirically-determined paths in air transportation networks, the Internet, the fit-fat-cat word morph game, and empirically-inferred paths in the human brain (see Fig. 1a for an illustration). For the remainder of this text we will refer to empirically-determined and inferred paths as empirical paths. The main topological features of our networks and the statistics of the empirical paths are shown in Table 1.
Our first finding is that traffic in networks does not necessarily follow shortest paths. Figure 2 presents the stretch of the paths which is computed as the length of the empirical path minus the length of the corresponding shortest path having the same source and destination pair. The figure shows a significant resemblance in the distribution of path stretch across our networks. While around 60–80% of the empirical paths exhibit zero stretch, the remaining paths show path stretch which can exceed up to 4–5 hops. From this result two things follow. First, the plot confirms the shortest path assumption of previous studies in the sense that most of the empirical paths are shortest indeed. In this respect nature’s routing policy definitely “prefers short paths”. However, the non-negligible portion (20–40%) of inflated paths suggests that there may be other policies at use simultaneously.
One such policy our measurements support is the “conform hierarchy” (CH) policy, meaning that the used paths follow the topological hierarchy of the network. For showing this we have computed the closeness centrality of the nodes comprising the empirical paths indicating which (inner or outer) parts of the network the information flows through. The closeness centrality of the node is computed as: , where d(y,x) is the distance between vertices x and y, while N refers to the number of nodes in the network. We found that most of the empirical paths do not contain a large-small-large pattern forming a “valley” anywhere in their closeness centrality sequence. This informally means that higher level nodes do not prefer the exchange of information through their subordinates even if there are short paths through them. On a CH path the closeness centrality increases monotonically at first up to a point (upstream), then starts to decrease (downstream) until it reaches the destination, or it is just go upstream or downstream all the way. Figure 1b illustrates this graphically. One could argue that maybe short paths on real networks have this property as a default, but Fig. 3a–d verify that this is not the case. For comparison we picked random paths between the source-destination pairs of our empirical paths with the same stretch distribution and plotted the results for that case too. One can see that, while the path length distribution is the same for the two datasets, a much larger fraction of stretch-equivalent random paths violate the CH policy.
There can be subtle differences between CH paths of similar length. For example, a path can contain upstream then downstream steps or downstream steps only. Recall that an upstream step goes towards the core, while a downstream step goes towards the periphery of the network. Is there a preference among these? For answering this we plotted the Cumulative Distribution Function (CDF) of CH paths with respect to the number of upstream steps preceding the downstream phase (Fig. 3e–h). For comparison we have also plotted the results of a random policy which picks randomly from the possible CH paths of the given length. The plots confirm that the empirical paths contain much less upstream steps, which means that these paths try to avoid stepping towards the core. This finding adds “prefer downstream” as a third identifiable policy component (see Fig. 1b for an illustration). We note that such behaviour is easy to interpret on the Internet, since stepping towards the core of the network implies paying for a transit provider for carrying the traffic, while going downstream comes for (almost) free. However, at this time it is not clear what causes the same behaviour in the other networks.
Our datasets hint at the operation of the “prefer short paths”, “conform hierarchy” and “prefer downstream” policies, with no clear relative priorities among them. In what follows we set up a synthetic toy routing policy which uses these components in a prioritized fashion and, we show that using these simple ingredients we can approximate the empirical paths much better than simple shortest paths do. According to Fig. 2 the prefer shortest path policy can only have lower priority than the “conform hierarchy” and the “prefer downstream” otherwise we would not experience stretch at all. Since “prefer downstream” implies the conform hierarchy policy, the only reasonable choice is to: prefer hierarchically conform paths at first, then prefer downstream if there is a downstream path and from the remaining paths prefer the short paths. More specifically we define our synthetic routing policy to (i) use CH paths only, (ii) use downstream if applicable and from the paths remaining after (i) and (ii) use the shortest one (if there are still multiple choices break ties randomly). We note that such routing policy is not unfamiliar in the literature9, 10. Figure 4a shows that this simple routing policy immediately gives very realistic path inflation, close to the stretch computed for the real paths. What this simple algorithm cannot reproduce is that the empirical paths sometimes violate the “conform hierarchy” and the “prefer downstream” policies. Figure 4b shows the CH distribution of the paths generated with our synthetic routing policy over the same network, when the closeness values are slightly randomized. This randomization can be interpreted as simulating the case in which nodes do not have full global information about the network and therefore, the precise closeness values of the nodes are not known. Instead the nodes can only have an approximate picture about the closeness hierarchy. One can see that this modification recovers both the stretch and CH distributions exhibited by the empirical paths (see Figs 2 and 3a–d).
An immediate application of these findings is the more accurate estimation of the network’s response to outer stress (e.g. sudden load increase). Figure 4c–f show the load of the nodes when carrying the traffic of the empirical paths in comparison with shortest paths and our synthetic toy policy. One can see that real traffic has larger volume (due to the stretch) and it is even more concentrated in the core of the network. Our synthetic routing policy recovers this behavior, thus, compared to a simple shortest path policy, the synthetic policy can more accurately assess the load footprint of real paths on the network.
Despite the simplicity and intuitive nature of the identified policies, readers may have the feeling that there should be a more simple explanation out there. For example one could imagine a weighting of the edges over which shortest path computation will give exactly the same paths that we find in our data. We had two reasons for going this way. First, our data hints for policies that are not unfamiliar in the literature. For example, the authors of ref. 9 use a very similar routing algorithm for modeling the flow of information in organizational networks, after an in-depth analysis of such networks. Secondly, we have run experiments to find an appropriate weighting using edge betweenness (see Supplementary Note 1 for a brief listing of the results). These experiments suggest that finding reasonable weights that are able to generate paths matching all of our statistics is far from trivial.
Collecting or inferring paths in networks is a non-trivial problem. Here we list our methods for every specific networks analyzed in our paper.
Internet AS topology and real AS paths
The Internet protocols permit the tracing of packets. We have downloaded an AS level Internet topology and full AS level packet traces from CAIDA (Center for Applied Internet Data Analysis11, recorded on 09/29/2015. The topology contains 52194 nodes and 117251 connections. For this topology we had around 2.5 million traces.
Air transportation network and flight travels
The world’s flight map is available from OpenFlights12, from which the topology of the air transportation network can be reconstructed. For a realistic estimation of the flights used by customers, we used the Rome2Rio13 trip planner and generated routes between 27444 randomly chosen pairs of airports. From the offered paths we have chosen the cheapest one in the analysis in the paper. However, we note that picking according to other parameters (lowest number of transfers, lowest travel time) did not qualitatively change our results. To achieve a more realistic topology we used airport connections extracted from traceroutes to increase the accuracy of the OpenFlight topology. The reconstructed map contained 3433 airports and 20347 flights connecting them.
fit-fat-cat word ladder game app and word chains
For collecting paths from word networks we have implemented a word ladder game named “fit-fat-cat” for smartphones. The goal of the game is to transform a source word into a target word through meaningful intermediate words by changing only one letter at a time. The word chain fit-fat-cat is a good solution of a game with source word fit and target word cat. These word chains, collected anonymously from our users, can be considered as the footprints of human navigation over the word-maze of the English language. For the reconstruction of the word graph we have downloaded the official three-letter English Scrabble words from WordFind14 and created an edge between all the words differing only in one letter. The collected three-letter word chains were considered as our traces. For capturing only the “working” paths we have filtered out the first 20 games (the warming up phase) and the games taking more than 30 seconds (when the players are not just using a known path but discover an unknown one) of every player. After all, we have a dataset of more than 2500 paths from 100+ players.
Human brain and estimated paths
Getting realistic paths from inside the human brain is extremely hard, if not impossible. As a consequence, almost all studies in the literature concerning path-related analysis assume shortest path signaling paths. Taking into account the extreme non-triviality of path estimation in the brain we ask here if we can use empirical anatomical and functional data to infer feasible communication traces. Our dataset comprises 40 healthy human subjects who underwent an MRI session where Diffusion Spectrum Imaging (DSI) and resting-state functional MRI data were acquired for each subject. DSI data was processed following the procedures described in refs 15–17, resulting in 40 weighted, undirected structural connectivity maps (GS) comprising 1015 nodes, where each node represents a parcel of cortical or subcortical gray matter, and connections represent white matter streamlines connecting a pair of brain regions. Connection weights determine the average density of white matter streamlines and here only consider connections with density above 0.0001, resulting in GS with an average of 12596.2 connections per subject. Functional MRI data was processed following state of the art pipelines described in refs 18 and 19, yielding a BOLD signal time-series per node, each with 276 points that were sampled every 1920 ms. The magnitude of the BOLD signal is an indicator of the degree of neural activity at a node. Combining structural and functional data, we infer feasible structural pathways through which neural signals might propagate using the following process. (i) Identify source-destination pairs with high statistically-dependent brain activity. We searched for pairs of nodes such that the Pearson correlation of the BOLD signal time series - without global regression - was above 90%. These nodes were used as the source-destination pairs of our paths. (ii) Determine which nodes are active at every time-step. We say that a node is “active” at a given time-step if the BOLD signal is >γ and “inactive” otherwise. We construct activity vectors for each time-step indicating which nodes were active. Here we use γ = 0 but we get qualitatively similar results for near zero γ. (iii) Construct subgraphs of active nodes. We constructed a subgraph GS i of GS for each time-step by considering only the nodes that are active at a given time-step i. (iv) Define paths between source-destination node pairs. For all of our source-destination pairs (generated in step (i)), we considered the shortest path in the GS i graphs, if the path existed. If there was multiple shortest paths between a source-destination pair we choose one randomly. Our source-destination traces include the paths found across all GS i subgraphs. It is worth noting that this method assumes that information can only traverse active nodes. Furthermore, we are considering here a model for large spatial and temporal scale communication in brain networks that is not necessarily applicable to neural networks at smaller scales. While we cannot validate with empirical data whether these paths are actually used for the flow of neural signals, from a path inflation perspective we can consider these paths as a lower bound on the length of the real signaling pathways.
The data supporting the findings of this study are either available from public data repositories or can be requested from the authors via e-mail. In particular, the topology of the AS level Internet and the corresponding AS traceroutes are available at CAIDA (Center for Applied Internet Data Analysis, www.caida.org). We have downloaded the airport network from the OpenFlights database (www.openflights.org), while the reconstruction of the word graph was based on the list of official three-letter English Scrabble words from WordFind14. The remaining datasets (air, brain, fit-fat-cat paths and the structural and functional brain dataset) can be requested via e-mail to the corresponding author.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The research work leading to these results was partially supported by HSNLab and Ericsson. Project no. 108947 and 123957 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the OTKA-K (2013/1) funding scheme. A. Gulyás was supported by the János Bolyai Fellowship of the Hungarian Academy of Sciences.
Electronic supplementary material
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.