Ecological networks: Pursuing the shortest path, however narrow and crooked

Representing data as networks cuts across all sub-disciplines in ecology and evolutionary biology. Besides providing a compact representation of the interconnections between agents, network analysis allows the identification of especially important nodes, according to various metrics that often rely on the calculation of the shortest paths connecting any two nodes. While the interpretation of a shortest path is straightforward in binary, unweighted networks, whenever weights are reported, the calculation could yield unexpected results. We analyzed 129 studies of ecological networks published in the last decade that use shortest paths, and discovered a methodological inaccuracy related to the edge weights used to calculate shortest paths (and related centrality measures), particularly in interaction networks. Specifically, 49% of the studies do not report sufficient information on the calculation to allow their replication, and 61% of the studies on weighted networks may contain errors in how shortest paths are calculated. Using toy models and empirical ecological data, we show how to transform the data prior to calculation and illustrate the pitfalls that need to be avoided. We conclude by proposing a five-point checklist to foster best practices in the calculation and reporting of centrality measures in ecology and evolution studies.

The last two decades have witnessed an exponential increase in the use of graph analysis in ecological and conservation studies (see refs. 1,2 for recent introductions to network theory in ecology and evolution). Networks (graphs) represent agents as nodes linked by edges representing pairwise relationships. For instance, a food web can be represented as a network of species (nodes) and their feeding relationships (edges) 3 . Similarly, the spatial dynamics of a metapopulation can be analyzed by connecting the patches of suitable habitat (nodes) with edges measuring dispersal between patches 4 . Data might either simply report the presence/absence of an edge (binary, unweighted networks), or provide a strength for each edge (weighted networks). In turn, these weights can represent a variety of ecologically-relevant quantities, depending on the system being described. For instance, edge weights can quantify interaction frequency (e.g., visitation networks 5 ), interaction strength (e.g., per-capita effect of one species on the growth rate of another 3 ), carbon-flow between trophic levels 6 , genetic similarity 7 , niche overlap (e.g., number of shared resources between two species 8 ), affinity 9 , dispersal probabilities (e.g., the rate at which individuals of a population move between patches 10 ), cost of dispersal between patches (e.g., resistance 11 ), etc.
Despite such a large variety of ecological network representations, a common task is the identification of nodes of high importance, such as keystone species in a food web, patches acting as stepping stones in a dispersal network, or genes with pleiotropic effects. The identification of important nodes is typically accomplished through centrality measures 5,12 . Many centrality measures have been proposed, each probing complementary aspects of node-to-node relationships 13 . For instance, Closeness centrality 14,15 highlights nodes that are "near" to all other nodes [...]

The shortest path is full of pitfalls
In network analysis, the interaction between nodes can be thought of as a flow of information between the nodes that are linked by edges. The sequence of edges that information must cross in order to reach a specific node is called a path. It is generally assumed that the bulk of information between any two given nodes (among all the possible paths between these two nodes) passes through the shortest path connecting them (i.e., the one with "lowest weight"). However, it should be emphasized that while the concept of information flow is general, its concrete meaning can differ dramatically from case to case, depending on which network feature the weights quantify. For reference, Table 1 lists the principal types of networks found in the ecological literature and the proportionality of their edge weights to the information flow between nodes.
The interpretation of a shortest path as the path that funnels the bulk of information flow relies on it being the least-weight path (i.e., the path of least resistance) between two nodes. Indeed, all the shortest path algorithms currently available 31,32 and generally implemented in graph theory software (Table 2) seek to minimize the value of the path between two nodes, calculated as the sum of the edge weights. The reason is that a minimization problem converges, while maximization can fail (see ref. 24 for a detailed explanation). Nevertheless, the identification of the shortest paths is far from trivial, as one must pay attention to what edge weights represent. That is, one must ensure that the edge weight is inversely proportional to the flow of information between the nodes. This condition is automatically fulfilled if the natural weight suggested by the network under study is already inversely proportional to the information flow (e.g., resistance distance, dispersal time). However, when the weight is directly proportional to the information flow (e.g., interaction frequency, individual transfer, dispersal probability, pathogen transmission, energy transfer across food webs), it is necessary to transform the edge weight in order to calculate the shortest paths and the centrality measures that rely on them (Table 3). This is particularly important when using user-friendly software packages that automate the calculation of centrality measures (Table 2).
[Figure caption, beginning missing] [...] whether the information provided was sufficient or insufficient, and network type (as "landscape", "interaction", and "others", which include social, co-occurrence, etc. networks). We only categorize as "correct" or as "wrong" studies on which we had enough information to support such a claim, and as "seems correct" and "seems wrong" studies on which the insufficient information available suggests that calculations are correct or wrong, respectively. Notice that (1) most of the correct or probably correct studies use unweighted edges (n = 52, 66%); (2) for all network types half of the studies do not report enough information to validate whether calculations were correct (n = 63, 49%); and that (3) numerous studies report unclear calculations (n = 19). Differences in the number of oversights in centrality calculations between weighted landscape and interaction networks are probably due to the fact that in interaction networks weighted edges typically require transformation (see Table 1), whereas in landscape networks edge weights tend to be inversely proportional, requiring no transformation.
www.nature.com/scientificreports
There is a wide range of functions that can accomplish this transformation. For example, if a_ij measures the flow of information between nodes i and j, the functions 1 − a_ij, exp(−a_ij), 1/a_ij, log(1/a_ij) and log(a_ij/(1 − a_ij)) are all found in the graph-theoretical literature 32-35 . Note that one must pay attention to the range of values that the edge weights can span, and to the values induced by the transformation; for example, one must avoid the use of negative edges (e.g., log transformation of values between 0 and 1), as these can greatly hamper the interpretation of the results. Finally, if the edge weights represent probabilities, one should account for the independence (or lack of independence) of the different edges.
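As a minimal sketch of the range check described above, the following Python snippet applies each of the listed functions to a set of weights and reports whether the result reverses the ordering of the weights and whether it produces negative values. The function name `check_transform` and the example weights are hypothetical and serve only as illustration; they are not taken from any of the reviewed studies.

```python
import math

# Candidate transformations from flow-like weights a_ij to distance-like weights.
TRANSFORMS = {
    "1 - a": lambda a: 1.0 - a,
    "exp(-a)": lambda a: math.exp(-a),
    "1 / a": lambda a: 1.0 / a,
    "log(1 / a)": lambda a: math.log(1.0 / a),
    "log(a / (1 - a))": lambda a: math.log(a / (1.0 - a)),
}


def check_transform(name, weights):
    """Apply one transformation to the weights sorted in increasing order.

    Returns (values, is_decreasing, has_negative): is_decreasing tells whether
    larger flows map to strictly smaller distances, has_negative whether any
    transformed weight is below zero.
    """
    f = TRANSFORMS[name]
    values = [f(a) for a in sorted(weights)]
    is_decreasing = all(x > y for x, y in zip(values, values[1:]))
    has_negative = any(v < 0 for v in values)
    return values, is_decreasing, has_negative


# Hypothetical probability-like weights in (0, 1).
example_weights = [0.05, 0.2, 0.5, 0.8, 0.95]

for name in TRANSFORMS:
    _, decreasing, negative = check_transform(name, example_weights)
    print(f"{name:18s} reverses order: {decreasing}  negative values: {negative}")
```

Running the same check on the actual range of weights in a dataset (for example, counts larger than 1 rather than probabilities) is a quick way to see whether a given function is safe to use before any centrality measure is computed.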
Avoiding the pitfalls. When computing any shortest path-related measure, various decisions need to be made, and the wrong decision could lead to unexpected results, as we will illustrate in the following examples. We focus on the effects on Betweenness (BC) and Closeness centralities (CC) as these are the most commonly used centrality measures. In Fig. 2 we summarize this decision process.
Binary or weighted? The first methodological choice when computing shortest paths is whether to consider edge weights. Binary data can be highly informative: for example, they have been used to identify species' fundamental niches 36 , and key species in pollination networks 18 . Nonetheless, studies must clearly state whether the analysis is performed on weighted or unweighted networks, and provide an ecological justification supporting either choice 37 . Seminal studies 38 as well as recent ones 10 analyzed unweighted versions of their data, arguing that paths with fewer edges would inherently be stronger than those composed of multiple ones. In our analysis of the literature, we have found nine studies that, despite using weighted data for some calculations, revert to binary versions of the network for the calculation of centrality measures without providing a justification (Fig. 1). Indeed, note that unipartite projections of bipartite networks result in weighted networks even when the original network is binary. Seven articles out of nine use binary unipartite projections without any justification. Although calculations in these are technically correct, one must be aware that the calculation of shortest paths and related measures in binary and weighted versions of the same network can lead to dramatically different conclusions.
Table 3. Shortest path-related centrality measures commonly used in ecological network analysis. Notice that other common centrality measures are based on eigenvectors or dissimilarity scores instead of on the identification of shortest paths and are hence not considered in this work.
Betweenness, BC. Quantifies the proportion of shortest paths g between any two nodes i, j, that pass through a focal node v. Network type: all types. Ref. 15.
Stress, SC. Measures the number of shortest paths g between any two nodes i, j, that pass through a focal node v. Network type: all types. Ref. 17.
Closeness, CC. Measures the average length of the shortest paths from a node v to all the other nodes in the network. Network type: all types. Ref. 15.
Integral Index of Connectivity, IIC. Measures the degree of connectivity of the entire landscape (of total area A_L) through the calculation of the number of edges in the shortest path nl_ij between patches with area a_i and a_j. Network type: binary landscape networks. Ref. 61.
Probability of Connectivity Index, PC. Quantifies the probability that two species randomly placed across a patchy landscape (of total area A_L) fall into habitat patches a_i and a_j that are reachable from each other with a maximum connectivity probability p_ij, defined as the maximum product probability of all possible paths between patches i and j (including single-step paths). Network type: landscape networks. Ref. 62.
To see how discarding the edge weights can significantly change the results of network analysis, let us start with a highly idealized example. In Fig. 3a we show a network in which there are two possible paths from Busan (South Korea) to Almería (Spain): one containing two edges (Busan-Chicago-Almería), and the other one three (Busan-Copenhagen-Marseille-Almería). If one considers only the binary data (implicitly assuming that the distance between any two cities is the same), the shortest path from Busan to Almería would cross Chicago (one stopover vs. the two stopovers of the other possible path). However, if the distance between cities is considered, it can be easily seen that the path Busan-Copenhagen-Marseille-Almería is much shorter (~11000 vs. ~18000 km).
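The city example can be sketched in a few lines of Python with a plain Dijkstra search. The per-edge distances below are rounded, illustrative values chosen only so that the two routes add up to roughly 11000 and 18000 km as quoted in the text; they are assumptions, not the values used in Fig. 3a, and `shortest_path` is a hypothetical helper.

```python
import heapq


def shortest_path(edges, source, target, use_weights=True):
    """Dijkstra search on an undirected graph given as (u, v, weight) triples.

    With use_weights=False every edge costs 1, i.e. the binary network.
    Returns (total cost, list of nodes along the path).
    """
    graph = {}
    for u, v, w in edges:
        cost = w if use_weights else 1
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))

    best = {source: 0}
    queue = [(0, source, [source])]
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            new_dist = dist + cost
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(queue, (new_dist, nxt, path + [nxt]))
    return float("inf"), []


# Illustrative distances in km (assumed, rounded).
city_edges = [
    ("Busan", "Chicago", 10500),
    ("Chicago", "Almería", 7500),
    ("Busan", "Copenhagen", 8500),
    ("Copenhagen", "Marseille", 1500),
    ("Marseille", "Almería", 1000),
]

binary_cost, binary_route = shortest_path(city_edges, "Busan", "Almería", use_weights=False)
weighted_cost, weighted_route = shortest_path(city_edges, "Busan", "Almería", use_weights=True)
print("binary:  ", binary_cost, binary_route)
print("weighted:", weighted_cost, weighted_route)
```

Under these assumed distances, the binary search selects the route through Chicago and the weighted search the route through Copenhagen and Marseille, which is the discrepancy the example is meant to expose.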
For an ecologically relevant example, we constructed a connectivity matrix for a hypothetical bird species living between 500 and 2000 m above sea level, and with a typical habitat size of about 15 km². For this purpose, the Global Relief Database ETOPO1 39 data in the region we considered were coarse-grained to 15 km² horizontal resolution, resulting in 787 habitat patches (Fig. 4a). Dispersal probability between patches was calculated as p_ij = exp(−α d_ij) following ref. 35 , where d_ij is the geographical distance between the patch boundaries, and α is a parameter chosen to be 0.03 in order to have a (hypothetical) median dispersal distance of 100 km. This probability can be stored in a connectivity matrix (see Supporting Information, Fig. S1) and graph theory can be used to identify the habitat patches that have high Betweenness and Closeness centrality scores. Calculating BC and CC on binary and weighted versions of this dataset resulted in markedly different outcomes. None of the 20 habitat patches with highest BC and CC in the binary network (Fig. 4b,c) match the ones obtained from the analysis of the weighted network (Fig. 4d,e). Note that to calculate Betweenness and Closeness centralities in weighted data we first inverted the edge weights using log(1/p_ij) (see next section for a detailed explanation).
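The relationship between the dispersal kernel and the log(1/p_ij) transformation can be illustrated with a short sketch. It assumes independent hops between patches and uses hypothetical hop distances; the helper names `dispersal_probability` and `to_distance` are introduced here for illustration only.

```python
import math

ALPHA = 0.03  # decay parameter quoted in the text (per km)


def dispersal_probability(distance_km, alpha=ALPHA):
    """Dispersal kernel p_ij = exp(-alpha * d_ij)."""
    return math.exp(-alpha * distance_km)


def to_distance(probability):
    """Transform a probability into a distance-like weight, log(1 / p_ij)."""
    return math.log(1.0 / probability)


# Hypothetical path made of three hops between patches (distances in km).
hop_distances_km = [20.0, 35.0, 50.0]
hop_probabilities = [dispersal_probability(d) for d in hop_distances_km]

# Probability of the whole path if the hops are independent: product of the hops.
path_probability = math.prod(hop_probabilities)

# Length of the same path after the transformation: sum of log(1 / p_ij).
path_length = sum(to_distance(p) for p in hop_probabilities)

print("path probability:       ", path_probability)
print("exp(-transformed length):", math.exp(-path_length))
print("transformed length:      ", path_length)
print("alpha * total distance:  ", ALPHA * sum(hop_distances_km))
```

With this kernel the transformed weight of each edge is simply α d_ij, so minimizing the summed transformed weights is equivalent to maximizing the product of the dispersal probabilities along the path.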
Modifying edge weights. The next question one should answer when computing shortest paths is whether edge weights are inversely proportional to the information flow between the nodes in the network (Fig. 2). If this is the case, the shortest path between nodes can be calculated directly using the edge weights. If, on the other hand, edge weights are proportional to the flow of information, one must transform them before using shortest path algorithms. If one does not modify the edge weights, the shortest path algorithms will either fail (identifying the longest, rather than the shortest, path) or be unable to identify a shortest path at all. We use the transformations 1/a_ij or log(1/a_ij) for the edges (although there are several alternative options for this operation, reviewed below).
For example, let us consider a small network of four primates sharing a certain number of parasites (Fig. 5a). In order to detect the primate mediating the transmission of infectious diseases in this network, one could identify the primate displaying the largest number of parasites common to other primates, indicated by stronger edges. In this simple network it is easy to verify that most of the paths with the highest edge weight (largest number of shared parasites) pass through primate A. However, if BC and CC were calculated directly on unmodified edge weights, we would conclude that primate B is the key primate in this network (Fig. 5a). If, on the other hand, we transform edge weights using the function 1/a_ij, we correctly identify primate A as the primate with the highest BC and CC (Fig. 5b).
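A minimal sketch of this effect, restricted to Closeness centrality, is given below. The four-node network and its edge weights are hypothetical (they are not the values of Fig. 5); they were chosen so that node A carries the strongest links. Closeness is computed here as (n − 1) divided by the sum of shortest-path lengths, which assumes a connected network.

```python
import heapq


def distances_from(graph, source):
    """Dijkstra distances from source in a dict-of-dicts weighted graph."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    return dist


def closeness(edges, transform=lambda a: a):
    """Closeness centrality of every node after applying transform to each weight."""
    graph = {}
    for (u, v), a in edges.items():
        graph.setdefault(u, {})[v] = transform(a)
        graph.setdefault(v, {})[u] = transform(a)
    scores = {}
    for node in graph:
        dist = distances_from(graph, node)
        scores[node] = (len(graph) - 1) / sum(dist.values())
    return scores


# Hypothetical numbers of shared parasites between four primates.
shared_parasites = {
    ("A", "B"): 1,
    ("A", "C"): 10,
    ("A", "D"): 10,
    ("B", "C"): 1,
    ("B", "D"): 1,
}

raw_scores = closeness(shared_parasites)
inverted_scores = closeness(shared_parasites, transform=lambda a: 1.0 / a)

print("most central, raw weights:     ", max(raw_scores, key=raw_scores.get))
print("most central, inverted weights:", max(inverted_scores, key=inverted_scores.get))
```

In this toy case the raw weights single out B, the node with only weak links, whereas the inverted weights single out A, mirroring the reversal described for Fig. 5.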
Using a real-world case, the impact of not reversing edge weights can be illustrated using the Global Mammal Parasites Database (GMPD, https://parasites.nunn.lab.org) containing data on 542 primate species and their 750 parasites (ref. 40 and references therein). To identify the species mediating the transmission of parasites (similarly to ref. 41 ), a connectivity matrix linking primates that have been found to host the same parasite was built (see Supporting Information, Fig. S2). Edge weights were defined as the number of shared parasites between any two species. Therefore, edge weight is directly proportional to the relationship strength between two species and, as in the previous example, the quantity should be transformed before calculating the shortest paths. In Fig. 6a we show the lack of correlation between species rankings based on the centrality scores calculated on modified and unmodified edge weights. For this example, we use the inverse of the edge weights, i.e., 1/a_ij. Interestingly, the BC scores calculated using the modified edge weights highlight only a few species, one of which has by far the highest BC score. Instead, if we directly use the unmodified values, several more species have comparable BC scores. This is not surprising if we consider that, when using the raw weights, shortest paths pass through weak connections, which are likely to be numerous. Differences in ranks are substantial. For instance, among the top ten high-Betweenness species identified using modified weights, only one is also among the high-Betweenness species identified using raw weights (see Supporting Information, Table S1).
Likewise, the results from the Closeness-based rankings (Fig. 6b) show that species rankings based on CC also differ significantly between modified and unmodified edge weights. The CC scores calculated on modified edge weights also support the importance of a handful of species (see Supporting Information, Table S1). Unlike with BC, nine of the top 10 high-Closeness species are the same ones when using the unmodified weights (also in Supporting Information), but the exact ranking differs between the two cases.
Other modifying functions. Adding constants. When edges are directly proportional to the information flow, it is frequent practice to make them inversely proportional by subtracting their value from a theoretical maximum or some other meaningful constant 32,35 . For example, in the case of transfer probabilities (migration, mass, or energy networks), one could choose to subtract the edge weights a_ij from 1. However, the new edge weight 1 − a_ij biases the calculation of the shortest paths towards the path with the lowest number of edges (because probabilities sum to one, nodes with many edges tend to have lower values).
Again, we will use the simple toy network presented in Fig. 3, which connects different cities. We then consider three different quantities to weight the edges, each representing the movement of researchers between these cities. In the first case (Fig. 3b), edge weights quantify the number of researchers who moved from one site to another. In this case, the path sustaining the largest "flow of researchers" between Busan and Almería is the three-step path Busan-Copenhagen-Marseille-Almería (30 vs 10). However, applying a shortest path algorithm directly to this network would identify the two-step path as more important.
One possible way to reverse the edge weight is to subtract the edge weights from a large (and in most instances arbitrary) constant C, de facto adding a constant to all edges. For example, if one chooses C = 100, now the largest edge weights (those representing largest flows) are the smallest and would hence be identified by the shortest-path centrality algorithm as more central. However, one can easily verify that such a transformation did not change the fact that the two-step path is the shortest between Busan and Almería (190 vs 270). The reason is that adding a constant to all the edge weights biases the shortest paths algorithm towards paths with fewer edges.
Scientific Reports | (2019) 9:17826 | https://doi.org/10.1038/s41598-019-54206-x
In Fig. 3c we see that subtracting the edge weights from a theoretical maximum C (in the case of transfer probabilities or interaction frequencies, C = 1) makes the three-step path with higher information flow the shortest path (3 × (1 − 0.9) vs 2 × (1 − 0.1)). However, this transformation does not work in all cases. In fact, if the edge weights, as frequently occurs in ecological studies, span different orders of magnitude (e.g., Fig. 3d), the three-step path will not be the shortest path anymore (3 × (1 − 10⁻³) vs 2 × (1 − 10⁻⁵)), as in Fig. 3b. Furthermore, we note that this type of transformation, even when it is likely to work, cannot be used for all the edge weights. For example, it cannot be used for probabilities, as the values of log(1 − a_ij) are negative and, consequently, cannot be used to find shortest paths (see next section).
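The arithmetic of the three toy cases can be reproduced with a short sketch. The per-edge values are assumptions inferred from the totals quoted in the text (for instance, three edges of 10 and two edges of 5 for the 30 vs 10 case); the helper names are hypothetical.

```python
import math


def path_length(edge_weights, transform):
    """Sum of the transformed edge weights along one path."""
    return sum(transform(a) for a in edge_weights)


def shorter_path(three_step, two_step, transform):
    """Name of the path that a shortest-path algorithm would prefer."""
    long_route = path_length(three_step, transform)
    short_route = path_length(two_step, transform)
    return "three-step" if long_route < short_route else "two-step"


# (three-step path edges, two-step path edges); per-edge values are assumed.
counts = ([10, 10, 10], [5, 5])              # researchers moved (30 vs 10 in total)
probabilities = ([0.9, 0.9, 0.9], [0.1, 0.1])
small_probabilities = ([1e-3, 1e-3, 1e-3], [1e-5, 1e-5])

# Subtracting from a constant: C = 100 for counts, C = 1 for probabilities.
print("C - a, counts:             ", shorter_path(*counts, lambda a: 100 - a))
print("1 - a, probabilities:      ", shorter_path(*probabilities, lambda a: 1 - a))
print("1 - a, small probabilities:", shorter_path(*small_probabilities, lambda a: 1 - a))

# Transformations that do not add a constant to every edge.
print("1 / a, counts:                  ", shorter_path(*counts, lambda a: 1 / a))
print("log(1 / a), small probabilities:",
      shorter_path(*small_probabilities, lambda a: math.log(1 / a)))
```

In every case the three-step path carries the larger flow, yet under these assumed values the constant-based transformations recover it only when the weights are close to the constant, whereas 1/a and log(1/a) recover it in the cases shown.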
A real-world example of the effect of adding constants to the edge weights is provided in the Supporting Information (Figs. S3-5).
Negative weights and loops. Another way to reverse the edge weights is to reverse the sign of the weights (i.e., using −a_ij). However, as shortest path algorithms seek to minimize the value of a path, they would keep looping around closed paths (cycles) ad infinitum, without ever converging. It must be noted that, although there are alternative algorithms that can handle negative edge values (e.g., the Bellman-Ford-Moore algorithm 42 ), these cannot handle cycles whose total weight is negative. As cycles are essentially ubiquitous in ecological applications, edge weight transformations that result in negative values should therefore be avoided. As an example, consider a simple toy network depicting the carbon flow between different layers of a food chain (Fig. 7a). In this case, given that the flow of carbon is directly proportional to the strength of the connection between two layers of the food chain, we need to transform the weights. However, if one uses −a_ij (Fig. 7b), the shortest path algorithms would never converge, and would keep circling the loop. On the other hand, using another weight-reversing function, such as 1/a_ij, would correctly identify the 2nd-order consumers as key species pivoting the carbon flow in this network example (Fig. 7c).
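The problem with sign reversal can be made concrete with a small sketch of the Bellman-Ford relaxation scheme, which tolerates negative edges but must stop when it detects a cycle of negative total weight. The directed loop and the carbon-flow values below are hypothetical and do not reproduce Fig. 7; `try_transform` is an illustrative helper.

```python
def bellman_ford(nodes, edges, source):
    """Shortest distances from source on a directed graph of (u, v, weight) edges.

    Raises ValueError if a negative-weight cycle is reachable from source,
    because in that case no finite shortest path exists.
    """
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0.0
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # One extra pass: any further improvement reveals a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from source")
    return dist


# Hypothetical carbon flows forming a loop through the decomposers.
nodes = ["producers", "consumers_1", "consumers_2", "decomposers"]
carbon_flow = [
    ("producers", "consumers_1", 100.0),
    ("consumers_1", "consumers_2", 10.0),
    ("consumers_2", "decomposers", 5.0),
    ("decomposers", "producers", 2.0),
]


def try_transform(transform):
    """Return the distances from the producers, or None if no solution exists."""
    edges = [(u, v, transform(a)) for u, v, a in carbon_flow]
    try:
        return bellman_ford(nodes, edges, "producers")
    except ValueError:
        return None


print("using -a_ij: ", try_transform(lambda a: -a))
print("using 1/a_ij:", try_transform(lambda a: 1.0 / a))
```

With the sign-reversed weights every lap around the loop lowers the path value, so no shortest path exists; with 1/a_ij all weights stay positive and the distances are well defined.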
Independence of probabilities. We should note an important aspect to consider when calculating the lengths of paths in networks: when edges represent probabilities, as for instance dispersal probabilities, we must question the independence of the edges in order to calculate meaningful values for the overall probability of the entire path. From a practical point of view, this means that when calculating the value of a path from node A to node C passing through node B (path ABC), we need to postulate that the path BC does not depend on the path used to reach B. When edges represent independent probabilities, the probability along a path containing multiple nodes is the product of the probabilities of all the edges along it. Interestingly, in the case of independent probability edges, converting edge weight a_ij into distance using log(1/a_ij) transforms the product of probabilities along a multi-node path into a sum of distances along this path, conserving the relative contribution of each weight to the path and avoiding the distortion of weights in shortest path algorithms. Without that edge transformation, the length of path ABC would be the sum of the probabilities of AB and BC, so that the most improbable paths would be identified as the most central (see ref. 24 for a detailed explanation, and Fig. S6 for an example using the ETOPO1 dataset).
One or all shortest paths? In most networks, whether binary or weighted, there may be more than one shortest path connecting any two nodes. To account for this fact, an alternative definition of Betweenness centrality based on random walks 43 and a generalization of node centrality that considers both edge weight and number when calculating centrality measures 35 have been developed. Although the discussion of the pros and cons of the different formulations falls beyond the scope of this study, we encourage the reader to be aware that considering a single or all shortest paths may also introduce differences in the resulting centrality values, and the decision should hence also be reported. This is of particular importance when comparing centrality metrics with metrics related to food chain length, where all paths may be considered.
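To make the existence of multiple shortest paths tangible, the sketch below counts them between two nodes of a binary network with a breadth-first search. The function name and the small example network are hypothetical; weighted networks would need a Dijkstra-based variant, which is not shown.

```python
from collections import deque


def count_shortest_paths(adjacency, source, target):
    """Breadth-first search on an unweighted graph.

    Returns (length of the shortest path in edges, number of distinct
    shortest paths); length is None and the count is 0 if target is unreachable.
    """
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                # First time nxt is reached: it inherits the path count of node.
                dist[nxt] = dist[node] + 1
                count[nxt] = count[node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                # Another equally short route into nxt.
                count[nxt] += count[node]
    return dist.get(target), count.get(target, 0)


# Hypothetical binary network: two equally short routes from A to D.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

print(count_shortest_paths(network, "A", "D"))
print(count_shortest_paths(network, "A", "E"))
```

When the count is larger than one, a method that follows a single shortest path will assign all the traffic to B or to C, while a method that considers all shortest paths will split it between them, which is why the choice should be reported.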

Widening the path
Network analysis has been developing quite independently in different branches of ecology. However, dissemination between ecological disciplines and reproduction of published studies are being hampered, at least partially, by the lack of transparency when describing the methodologies used. Establishing a protocol for the analysis and reporting of calculations would ease these obstacles, and boost the use of centrality metrics for unconventional uses. For example, in a species-interaction network (where species are typically considered closer if they interact with higher frequencies), one could purposely choose to calculate shortest path-centrality measures without transforming the weights in order to study the effect of weak interactions across the network. For this reason, we urge all researchers applying graph theory to ecological data to pay special attention when reporting their calculations, and, in particular, to provide a description of the network and edge weights they used.
Here, we provide a checklist of crucial methodological information that should always be reported (Fig. 2). Following this guide ensures that a study reports sufficient information to allow reproducibility and a quick understanding of the methods by readers from other fields, and that the decisions preceding the calculations are made sequentially.
(1) A clear definition of what nodes and edges represent. Nodes and edges depict different entities and relationships in different ecological studies. Nodes may represent proteins, genes, individuals, populations, species, sites, etc., and edges may depict interactions of different kinds, or movement measured in numerous ways. A clear definition of nodes and edges enables a faster and deeper understanding of the rationale and methodology of the analysis by readers from different disciplines.
(2) Are edges binary or weighted? If edges are weighted, one needs to report the proportionality of the edge weight to the information flow between the nodes in order to evaluate whether edges need to be modified. In particular, one should ensure that there is no contradiction between the weights of a network and the interpretation of shortest paths.

Conclusions
Graph theory provides valuable insights into ecological networks. For this reason, it has gained popularity in ecology and has developed quite independently in different disciplines, becoming a routine analysis in ecological studies. Our analysis of the literature revealed, however, that this familiarity is associated with a lack of methodological rigor in the published studies. Indeed, by reading the methodological sections of a large portion of the published studies, we were not able to clearly ascertain what edges represented when centrality measures were calculated. The increasing popularity of packages for the analysis of ecological networks will only boost the use of tools and methodologies researchers may be unfamiliar with. Using both theoretical and real-world case studies, we showed that oversights in the methods and calculations can lead to radically different results. Hence it is fundamental to establish a code of good practice that guides researchers through the calculations, ensuring the correct calculation of metrics across fields and aiding both understanding by readers from other fields and the reproducibility of results. For that reason, in this article we provide an overview of different methods to meaningfully calculate shortest paths and related centrality measures in ecological systems, and a checklist to ensure clear and sufficient reporting of such calculations. We hope that following the protocol we suggest will further increase the popularity of centrality measures in ecology and, at the same time, guarantee the reproducibility of these studies.

Data availability
This article uses no data.