Simplification of networks by conserving path diversity and minimisation of the search information

Alternative paths in a network play an important role in its functionality, as they can maintain the information flow under node/link failures. In this paper we explore the navigation of a network taking into account the alternative paths, and in particular how we can describe this navigation in a concise way. Our approach is to simplify the network by aggregating into groups the nodes that do not contribute to alternative paths. We refer to these groups as super-nodes, and to the post-aggregation network of super-nodes as the skeleton network. We present a method to describe the paths in the super-nodes and the skeleton network with the least amount of information. Applying our method to several real networks, we observed a scaling behaviour between the information required to describe all the paths in a network and the minimal information needed to describe the paths of its skeleton. We show how this scaling allows us to evaluate the information of the paths for large networks at a lower computational cost.

Scientific Reports | (2020) 10:19150 | https://doi.org/10.1038/s41598-020-75741-y www.nature.com/scientificreports/

Our approach to obtain the simplified network with minimal information is to assign random weights to the links of the network. The contraction is done by aggregating links in increasing order of their weights. Two nodes are aggregated if their aggregation does not introduce a multilink in the simplified network. The tree-contraction finishes when all the links have been visited, yielding the skeleton network and the super-nodes. Then the search information $H_{\mathrm{simp}}$ is evaluated. This process is repeated with different random seeds, keeping track of the simplified network with minimal search information.

The concept of search information was first introduced to consider the 'hide-and-seek' problem 20 in a network, that is, how much information is needed to describe a shortest path from one node to another. It is known that a shortest path is not necessarily the path with minimal information 2. Our method is not restricted by the assumption that the relevant paths are the shortest paths.
In the next section, we show how the partition of a network affects the minimal search information, how our method avoids the constraint of looking only at the shortest paths, and how we can approximate the search information of large networks at a small computational cost.
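The tree-contraction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tree_contraction`, the edge-list input format and the common-neighbour test for detecting a would-be multilink are choices made here for illustration, and the input is assumed to be a simple graph (no self-loops or repeated links).

```python
import random

def tree_contraction(edges, weights=None, seed=0):
    """Contract links in increasing weight order unless a multilink would appear.

    Returns (members, skeleton): members maps each super-node label to the set
    of original nodes it contains; skeleton is a set of frozenset links between
    super-node labels. Assumes a simple graph (hypothetical input format).
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    if weights is None:
        # Random weights, as in the text; a different seed gives a different contraction.
        weights = {e: rng.random() for e in edges}

    group = {}   # original node -> label of its super-node
    adj = {}     # super-node label -> set of neighbouring super-node labels
    for a, b in edges:
        for v in (a, b):
            group.setdefault(v, v)
            adj.setdefault(v, set())
        adj[a].add(b)
        adj[b].add(a)
    members = {v: {v} for v in group}

    for a, b in sorted(edges, key=lambda e: weights[e]):
        ga, gb = group[a], group[b]
        if ga == gb:
            continue  # guard against a link listed twice
        if adj[ga] & adj[gb]:
            continue  # a common neighbour would become a multilink after merging
        # Merge super-node gb into ga.
        for v in members[gb]:
            group[v] = ga
        members[ga] |= members.pop(gb)
        for g in adj.pop(gb):
            adj[g].discard(gb)
            if g != ga:
                adj[g].add(ga)
                adj[ga].add(g)

    skeleton = {frozenset((g, h)) for g in adj for h in adj[g]}
    return members, skeleton

# Usage: a ring of 9 nodes contracts to a triangle of three chains.
ring = [(i, (i + 1) % 9) for i in range(9)]
members, skeleton = tree_contraction(ring, seed=1)
print(sorted(len(m) for m in members.values()), len(skeleton))
```

Each accepted merge removes one node and one link from the simplified network, so every super-node stays a tree and the skeleton stays a simple graph. Repeating the call with different seeds and keeping the result with the smallest search information would mirror the search described in the text.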

Results
The skeleton and super-nodes both contribute to the search information of the simplified network. Figure 2 shows the search information for two real networks against the number of super-nodes. The first data set is the Adjacent-Nouns network and the second is the Transport for London (TfL) network, which describes the London underground railway network. For all the real networks that we considered (Supplementary Information Table S1), we notice that the search information of the skeleton is proportional to the number of super-nodes (Fig. 2a,d), whereas the total search information of the super-nodes shows large variations (Fig. 2b,e). Also, depending on the network, the main contribution to the search information sometimes comes from the skeleton network (e.g. the Adjacent-Nouns network, Fig. 2a-c), while for other networks it comes from the information describing the super-nodes (the TfL network, Fig. 2d-f).
It is known that the search information increases with the size of the network 4. The subnetwork contained inside a super-node is, by construction, a tree, and we expect that the search information of these trees also increases with the number of nodes. The search information for a tree, $H_{\mathrm{tree}}$, tends to increase as a function of the number of nodes, but it fluctuates depending on the tree connectivity. To verify the increase of $H_{\mathrm{tree}}$ with the number of nodes we evaluated the average search information from a random selection of connected trees with $N$ nodes. From numerical simulations (Fig. 3a) we observed a remarkable property: the average search information for a tree scales as $H_{\mathrm{tree}} \approx \alpha N^{\beta}$, where $\alpha = 0.721 \pm 0.019$ and $\beta = 2.550 \pm 0.006$.
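A scaling of the form $H_{\mathrm{tree}} \approx \alpha N^{\beta}$ is a straight line in log-log coordinates, so the exponents can be estimated with an ordinary least-squares fit. The sketch below illustrates that idea only; the text does not state how the fit was actually performed, and the function names `fit_power_law` and `h_tree_estimate` are hypothetical.

```python
from math import exp, log

def fit_power_law(sizes, values):
    """Least-squares fit of log(values) = log(alpha) + beta * log(sizes)."""
    xs = [log(n) for n in sizes]
    ys = [log(h) for h in values]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = exp(y_mean - beta * x_mean)
    return alpha, beta

def h_tree_estimate(n, alpha=0.721, beta=2.550):
    """Average tree search information predicted by the reported scaling."""
    return alpha * n ** beta

# Usage: synthetic values generated from the reported scaling are recovered by the fit.
sizes = [10, 20, 50, 100]
alpha, beta = fit_power_law(sizes, [h_tree_estimate(n) for n in sizes])
print(alpha, beta)
```

With measured averages in place of the synthetic values, the same call would return estimates of α and β; the uncertainties quoted in the text would need a separate error analysis.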
In a network the number of nodes contained inside the super-nodes depends on how the contraction is carried out, which can create large fluctuations in the number of nodes contained in the super-nodes and hence in their search information (Fig. 2b,e). This large variability of the super-nodes' search information can be illustrated with a ring network, which is the simplest network with an alternative path (Fig. 3b-e). In this case there are two possible routes from any node to any other node. The tree-contraction will produce a skeleton network that is a triangle. For ring networks it is possible to show analytically (see "Methods" section) that the minimal search information is obtained when the nodes of the network are evenly distributed between the three super-nodes. The other extreme, evaluated numerically, is when two super-nodes contain only one node each and the rest of the nodes are included in the third super-node; that is, longer chains have larger search information.
It is known that the shortest path is not necessarily the path with minimal search information, and it is also expected that a minimal information path would tend to avoid network hubs 1. Our method extends these observations to the general description of the network. The condition of searching for the simplified network with minimal search information produces a simplified network where super-nodes with a large number of nodes tend to be avoided and the hubs of the skeleton network are now the well-connected super-nodes, as they are important to the path diversity. As an example, the TfL network (Fig. 4a), when simplified using the condition of maximal search information, produces a skeleton network with 23 super-nodes (Fig. 4b), the largest of which contains 98 nodes (Fig. 4c). In comparison, the minimal search information produces a smaller skeleton of 15 super-nodes (Fig. 4d), the largest of which contains 41 nodes (Fig. 4e). The minimal search information is used to split the network into groups (super-nodes), where there is only one path between any members of a group and different paths for members of different groups. In Fig. 4d the red, green and blue nodes are the three largest super-nodes in the skeleton network. Figure 4f shows the original network with these three super-nodes expanded to their original red, green and blue tree subgraphs. In previous research, the search information is calculated assuming that a 'traveller' follows one of the possible shortest paths from the start of the traveller's walk to its destination. Here we are interested in the existence of alternative paths, which are not necessarily the shortest. For example, in Fig. 4f a traveller has different options if she wants to go from any red station to any blue station: she can take a red-green-blue route or a red-blue route.
The minimal search information of the simplified network depends on the structure of the network. For a fully connected network the tree-contraction would not simplify the network, and the search information of the original and simplified networks is the same. If the original network has long chains of nodes in its structure, these subgraphs are aggregated via the tree-contraction and the simplified network has a small search information. Figure 5a compares the ratio between the minimal search information and the search information of the original network ($H_{\mathrm{simp}}/H_{o}$) with the normalised number of nodes ($N_{\mathrm{skeleton}}/N_{o}$) for many real networks. Networks like the Bison network tend to be almost fully connected, and the simplified network and original network have very similar minimal search information. The other extreme is the Transport for London (TfL) network, which contains long chains in its structure. Again, as in the case of the search information for the simplified networks, we observe a scaling behaviour for the normalised search information of the simplified network (Fig. 5a) and the skeleton network (Fig. 5b). However, there is no obvious scaling for the trees (Fig. 5c).

The evaluation of the search information can be computationally slow due to the evaluation of all the shortest paths (Dijkstra's algorithm). For large networks this process becomes slow, and even slower if we need to search for a simplified network with the minimal search information. Our previous results provide a method to estimate the search information via the scaling found above. For example, obtaining an approximation to the search information of the Rome-road network, which has over 3353 nodes and 4831 links, from its skeleton network is 50 times faster than evaluating it directly.
However, in this approximation the skeleton was obtained by selecting the links in the tree-contraction at random, and we cannot guarantee that the structure of the resulting simplified network is similar to the structure of the simplified network with minimal search information. To overcome this shortcoming the tree-contraction process was modified as follows.
To each link $l_{ab}$ connecting nodes $a$ and $b$ we assign the weight $W_{l_{ab}} = k_a + k_b$, where $k_a$ and $k_b$ are the degrees of the nodes. The tree-contraction is done by contracting the links in increasing order of their weight. This strategy reduces the search information of the simplified network, as it tends to aggregate chains first. Next we consider three possible ways to approximate the search information of the original network from the simplified network: (i) use the search information obtained from the skeleton network and the super-nodes, (ii) use only the search information of the skeleton network, and (iii) use the search information obtained from an "average" tree that has the same number of nodes as the skeleton network. Figure 6 shows the relative error when approximating the minimal search information of a network via the simplified network. The best approximation is obtained when using the search information of the skeleton network.
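The degree-based ordering can be illustrated with a short sketch. It computes the weight $k_a + k_b$ of every link and the order in which the links would be visited; the function names and the edge-list format are assumptions made here for illustration, and ties between equal weights are left in input order because the text does not say how they are broken.

```python
from collections import Counter

def degree_sum_weights(edges):
    """Weight of each link (a, b) as the sum of the degrees of its end nodes."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {(a, b): degree[a] + degree[b] for a, b in edges}

def contraction_order(edges):
    """Links sorted by increasing weight (stable for equal weights)."""
    weights = degree_sum_weights(edges)
    return sorted(edges, key=lambda e: weights[e])

# Usage: a triangle (0, 1, 2) with a chain 2-3-4 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
print(contraction_order(edges))
```

In this small example the link at the end of the chain has the smallest weight, so it is visited first, which is consistent with the remark that the strategy tends to aggregate chains first.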

Discussion
The structure of a network can be studied by partitioning it into communities. Loosely speaking, a community is a set of nodes which have higher connectivity to nodes within their community than to nodes outside it. It is expected that these communities reflect properties of the network, e.g. friendships in social networks. Since in this paper we are interested in the existence of alternative paths between different parts of the network, we used a different approach to partitioning a network. Our approach is to aggregate the nodes that do not contribute to alternative paths into a group (super-node), reducing the network to a network of super-nodes (skeleton network). To decide which nodes should belong to a group we used the search information to find the paths between nodes which are described with minimal information.

We envisage that the description of a network using our method can have applications when describing alternative paths in a communication network. The network structure inside a super-node is a tree, and the routing decision inside a tree is unique and not difficult to compute, as there is only one route between two nodes in the super-node. The path diversity is captured via the skeleton network, where routing decisions are made. This path diversity can be used to design maps of networks that present information in a simpler and more usable way 18,21.
By searching for a simplified network via the minimal search information we obtained a partition where there is a balance between the information describing the super-nodes and the information describing the skeleton network. Remarkably, for all the networks studied here, there seems to be a scaling of the search information relating the original network and the minimal search information of the skeleton of the simplified network. Moreover, for some networks this scaling can apparently be obtained by approximating the search information of the skeleton network via the search information of an "average" tree.
For large networks the simplification of a network via the minimal search information becomes computationally expensive due to the evaluation of all the shortest paths for all pairs of nodes. The scaling we observed here allows us to approximate the minimal search information for large networks from the smaller skeleton network, where the skeleton network is obtained by doing only one tree-contraction. This tree-contraction is biased, contracting first the links whose end nodes have relatively small degree. This allows us to evaluate the search information of large networks with a small computational effort.
The work presented here can be extended by basing the contraction of the links on other relevant properties, for example distance or travelling time in a transport network, instead of on a random decision or on the degrees of the nodes at the ends of the link.

Methods
The search information of networks. Rosvall et al. 1,2,19 introduced the search information $H$ to judge whether a network is difficult to navigate. It measures the amount of information needed to route a signal from a source node to a destination node via the shortest paths. This assumes that traffic flow on a network is closely related to the shortest paths 2. Search information is employed in various areas, such as social networks, biological networks and computer networks, to quantify network complexity. Let $\ell(s, d)$ be a set of linked nodes describing a shortest path starting at source $s$ and ending at destination $d$. The probability that this path is followed by a random walker who avoids exactly reversing their path is given by

$$P(\ell(s,d)) = \frac{1}{k_s} \prod_{j \in \ell(s,d)} \frac{1}{k_j - 1}, \qquad (1)$$

where $j$ denotes the nodes in the shortest path $\ell(s, d)$ excluding the source $s$ and destination $d$, and $k_j$ is the degree of node $j$. In Eq. (1), the probability of choosing the correct link at the starting node $s$ with degree $k_s$ is $1/k_s$ (as there are $k_s$ possible links to choose from). For any other node in the shortest path, with the exception of the destination node, the probability of choosing the correct link when at node $j$ is $p_j = 1/(k_j - 1)$, as it is assumed that the random walker does not return to the last node visited. As there can be many shortest paths between the source and destination pair, the probability to locate node $d$ using a shortest path is $P(s \to d) = \sum_{\{\ell(s,d)\}} P(\ell(s,d))$, where the sum is over all possible shortest paths $\ell(s, d)$ from $s$ to $d$. The search information from $s$ to $d$ is defined as 1,2,19

$$S(s \to d) = -\log_2 P(s \to d) = -\log_2 \sum_{\{\ell(s,d)\}} P(\ell(s,d)).$$

The search information of a network, $H$, is obtained by adding $S(s \to d)$ over all source-destination pairs.

Search information for the simplified ring network. We consider that a simplified network consists of the skeleton network and its super-nodes. For a ring network the tree-contraction will always produce a simplified network where the skeleton network is a triangle which connects three super-nodes (Fig. 3b,c).
The connectivity of the nodes forming a super-node is a chain or a single node. The search information of the simplified network is $H_{\mathrm{simp}} = H_{\mathrm{skeleton}} + H_{\mathrm{chain1}} + H_{\mathrm{chain2}} + H_{\mathrm{chain3}}$. The search information of the skeleton depends only on the source node, which has degree 2, so $H_{\mathrm{skeleton}} = -6\log_2(1/2) = 6$, where the factor 6 arises because each of the three nodes can reach its two neighbours. The search information for a chain of $n$ nodes is

$$H_{\mathrm{chain}}(n) = (n-2)(n-1),$$

where we used that the search information of the two end nodes is zero and the search information for each of the other $n - 2$ nodes is $n - 1$. If the three super-nodes contain $a$, $b$ and $N - a - b$ nodes, the total search information for the ring network is

$$H_{\mathrm{ring}}(a,b) = 6 + (a-1)(a-2) + (b-1)(b-2) + (N-a-b-1)(N-a-b-2).$$

For fixed $b$, setting the derivative with respect to $a$ to zero gives $a = (N - b)/2$. Finally, the overall minimal information is given by the condition $H'_{\mathrm{ring}}(b) = 0$, which gives $b = N/3$ and $a = N/3$; that is, the minimal search information for the simplified ring network is obtained when each super-node contains $N/3$ nodes. If $N$ is divisible by 3 then the minimal search information is $H_{\mathrm{ring}} = N^2/3 - 3N + 12$. If $N$ is not divisible by 3 then the nodes are divided as evenly as possible between the three super-nodes.
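The definitions above can be checked numerically on the ring example. The following is a minimal sketch, not the authors' code: it assumes an undirected simple graph stored as an adjacency dictionary, sums the search information over all ordered source-destination pairs, and silently skips unreachable pairs. The names `search_information`, `chain` and `ring_simplified` are hypothetical.

```python
from collections import deque
from math import log2

def search_information(adj):
    """Sum of -log2 P(s -> d) over ordered pairs, using all shortest paths."""
    total = 0.0
    for s in adj:
        dist = {s: 0}
        prob = {s: 1.0}   # prob[v]: summed probability of the shortest paths s -> v
        queue = deque([s])
        while queue:
            u = queue.popleft()
            k = len(adj[u])
            # The source chooses among k links; any other node excludes the link it came from.
            choices = k if u == s else k - 1
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    prob[v] = 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                    prob[v] += prob[u] / choices
        total += sum(-log2(p) for d, p in prob.items() if d != s)
    return total

def chain(n):
    """Adjacency dictionary of a chain (path graph) with n nodes."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

TRIANGLE = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

def ring_simplified(sizes):
    """H_simp of a ring split into three chains with the given numbers of nodes."""
    return search_information(TRIANGLE) + sum(search_information(chain(n)) for n in sizes)

# Usage: an even split of N = 9 nodes against the most uneven split.
print(ring_simplified((3, 3, 3)), ring_simplified((1, 1, 7)))
```

For $N = 9$ the even split agrees with $N^2/3 - 3N + 12 = 12$, while the split with two single-node super-nodes gives a larger value, in line with the statement that longer chains carry more search information.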
Path diversity of tree-contraction. Large networks can be difficult to understand, and there are therefore different techniques to simplify them, leaving behind only the relevant structure. An example is the partition of a network into several clusters, where the nodes within a cluster have many connections while there are few connections between clusters. There are many other methods to decompose a network into clusters; however, they do not conserve the cyclomatic number ($C = L - N + P$, where $N$ is the number of nodes, $L$ the number of links and $P$ the number of connected components). The tree-contraction conserves the cyclomatic number, that is, the first Betti number of the graph. For comparison we used the Louvain method 22 (Louvain), the Fast Greedy method 23 (FG), the Information Map method 3 (IM), the Walk Trap method 24 (WT) and the Betweenness Centrality method 25 (BC), and evaluated the cyclomatic number of the network of clusters, which is the equivalent of our skeleton network. The tree-contraction method maintains all the path diversity of the original network, whereas the other clustering methods prune edges and, as a consequence, the cyclomatic numbers of the original network and of the network of clusters differ. For an example see Table S2 in the Supplementary Information.
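The cyclomatic number $C = L - N + P$ is straightforward to evaluate. The sketch below counts the connected components with a small union-find structure; the function name and the node-list and edge-list inputs are assumptions made for illustration, and the graph is assumed to be simple.

```python
def cyclomatic_number(nodes, edges):
    """C = L - N + P for an undirected simple graph."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the component root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(parent)
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    return len(edges) - len(parent) + components

# Usage: a ring of 9 nodes and the triangle it contracts to have the same value.
ring = [(i, (i + 1) % 9) for i in range(9)]
triangle = [(0, 1), (1, 2), (0, 2)]
print(cyclomatic_number(range(9), ring), cyclomatic_number(range(3), triangle))
```

Comparing the value for an original network with the value for a network of clusters gives a quick check of whether a partition has preserved the number of independent cycles.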
Examples of using the tree contraction to estimate the search information. Search information has been used to characterise how difficult it is to navigate a city. To take into account that cities have different sizes, Rosvall et al. used the network average search information $\hat{H}$ 1. They noticed that modern cities like Manhattan are easier to navigate than older cities like Umeå, i.e. $\hat{H}(\mathrm{Manhattan}) < \hat{H}(\mathrm{Umeå})$. To decide if a city is difficult to navigate or not, Rosvall et al. compared the average search information of the city against its random counterpart, where the random counterpart has the same degree distribution as the original network but not the geometrical constraints. This comparison indicates how easy it is to find a destination in a network. Rosvall et al. found that many cities are more difficult to navigate than their random counterparts, i.e. $\hat{H} > \hat{H}_R$. We extend their results by considering not only cities but many other real networks, and investigate whether the difficulty of navigating a real network is also captured in the skeleton networks. In this case we evaluate the skeleton network and its random counterpart. Figure 7 shows that $\hat{H} > \hat{H}_R$ for almost all networks and their skeletons, as the ratios are bounded in the unit square. Hence the skeleton and its randomised version also capture that real networks are more difficult to navigate than their random counterparts.
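A random counterpart that keeps every node's degree and stays connected can be generated in several ways. The text does not say which procedure was used, so the sketch below swaps in a common technique, the double-edge swap with a connectivity check, purely as an illustration; the function names, the number of swaps and the seed are assumptions.

```python
import random

def is_connected(adj):
    """True if every node can be reached from an arbitrary start node."""
    start = next(iter(adj))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

def randomise(edges, n_swaps=100, seed=0):
    """Degree-preserving rewiring: (a,b),(c,d) -> (a,d),(c,b), keeping the graph connected."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Skip swaps that would create a self-loop or a multilink.
        if len({a, b, c, d}) < 4 or d in adj[a] or b in adj[c]:
            continue
        for x, y in ((a, b), (c, d)):
            adj[x].remove(y)
            adj[y].remove(x)
        for x, y in ((a, d), (c, b)):
            adj[x].add(y)
            adj[y].add(x)
        if is_connected(adj):
            edges[i], edges[j] = (a, d), (c, b)
        else:
            # Undo the swap if it disconnected the network.
            for x, y in ((a, d), (c, b)):
                adj[x].remove(y)
                adj[y].remove(x)
            for x, y in ((a, b), (c, d)):
                adj[x].add(y)
                adj[y].add(x)
    return edges

# Usage: rewire a ring of 8 nodes with two chords.
edges = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
print(randomise(edges, n_swaps=200, seed=3))
```

The connectivity check after every proposed swap is simple but costly for large networks; it is kept here only to make the two stated rules explicit.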
The second application is to approximate the search information for large networks. For large networks the evaluation of their search information is challenging, as it requires the evaluation of all shortest paths, including degeneracies, for all source-destination pairs, which in general is computationally expensive. As the search information scales with the size of the network, we use the tree-contraction to search for a skeleton network with a small number of nodes and then approximate the search information from the scaling. Figure 8a shows the search information for thirteen large networks, which follow the scaling behaviour noticed in Fig. 5b, indicating that Eq. (6) can be used to approximate the search information of the original network. Figure 8b compares the relative error of the search information and its approximation as a function of the ratio $N_{sk}/N_{o}$. The approximation is good for large values of $N_{sk}/N_{o}$ but the error increases for small values of $N_{sk}/N_{o}$. The ratio $N_{sk}/N_{o}$ is small if the number of nodes in the skeleton network is much smaller than the number of nodes in the original network. This happens when the original network is more tree-like, that is, when the super-nodes contain large trees. In this case the approximation based on the scaling Eq. (6) will be inaccurate. Figure 8c shows the ratio of the computational time between evaluating the search information for the original and the skeleton network against the ratio $N_{sk}/N_{o}$: the computational time can be reduced by up to two orders of magnitude. As a rule of thumb, our method gives a reasonable approximation if $N_{sk}/N_{o} \geq 0.3$.

Figure 7.
The ratio of the average search information of the original network $H^{\mathrm{ave}}_{O}$ and its random counterpart $H^{\mathrm{ave}}_{R}$ against the ratio of the average search information of the skeleton network with the minimal search information obtained from the original network $H^{\mathrm{ave}}_{\mathrm{skeleton}}$ and its random counterpart $H^{\mathrm{ave}}_{R\mathrm{skeleton}}$. All the random counterparts are strictly randomised following two rules: they conserve the degree distribution of the original network and they remain connected.

Table S1) and red dots represent large networks which have more than 4000 nodes (shown in Supplementary Information Table S3).

License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.