Abstract
The different factors involved in the growth process of complex networks imprint valuable information in their observable topologies. How to exploit this information to accurately predict structural network changes is the subject of active research. A recent model of network growth holds that the emergence of properties common to most complex systems is the result of certain trade-offs between node birthtime and similarity. This model has a geometric interpretation in hyperbolic space, where distances between nodes abstract this optimisation process. Current methods for network hyperbolic embedding search for node coordinates that maximise the likelihood that the network was produced by the aforementioned model. Here, a different strategy is followed in the form of the Laplacian-based Network Embedding, a simple yet accurate, efficient and data-driven manifold learning approach, which allows for the quick geometric analysis of big networks. Comparisons against existing embedding and prediction techniques highlight its applicability to network evolution and link prediction.
Introduction
The gradual addition of nodes and edges to a network, a common representation of the relationships between complex system components, imprints valuable information in its topology. Consequently, development of techniques to mine the structure of networks is crucial to understand the factors that play a role in the formation of their observable architecture.
Strong clustering and scale-free node degree distribution, properties common to most complex networks, have served as the basis for the establishment of tools for link prediction^{1}, community detection^{2}, identification of salient nodes^{3} and so on. In addition, several models that aim to mimic the evolution and formation of networks with the above-mentioned characteristics have been introduced^{4}. Of special interest is a series of models that assume the existence of a hidden geometry underlying the structure of a network, shaping its topology^{5,6,7,8,9,10,11,12} (we refer the reader to^{13} for an extensive review on the subject). This is justified by the fact that complex networks possess characteristics commonly present in geometric objects, like scale invariance and self-similarity^{7,14,15,16}.
One such model is the so-called Popularity-Similarity (PS) model, which holds that clustering and hierarchy are the result of an optimisation process involving two measures of attractiveness: node popularity and similarity between nodes^{12}. Popularity reflects the ability of a node to attract connections from other nodes over time and it is thus associated with a node’s seniority status in the system. On the other hand, nodes that are similar have a high likelihood of getting connected, regardless of their rank.
The PS model has a geometric interpretation in hyperbolic space, where the tradeoff between popularity and similarity is abstracted by the hyperbolic distance between nodes^{9,12}. Short hyperbolic distances between them correlate strikingly well with high probabilities of link formation^{12}. This means that mapping a network to hyperbolic space unveils the value of the variables in charge of shaping its topology (popularity and similarity in this case), allowing for a better understanding of the dynamics accountable for the system’s growth process.
Current efforts to infer the hyperbolic geometry of complex networks rely on a Maximum Likelihood Estimation (MLE) approach, in which the space of PS models with the same structural properties as the network of interest is explored, in search of the one that best fits the network topology^{12,17,18}. This search is computationally demanding, which means that these methods require correction steps or heuristics in order to make them suitable for big networks^{18}.
In this paper, a different strategy is followed. Inspired by the well-established field of nonlinear dimensionality reduction in Machine Learning^{19}, an adaptation of the Laplacian Eigenmaps algorithm, introduced by Belkin and Niyogi for the low-dimensional representation of complex data^{20}, is put forward for the embedding of complex networks into the two-dimensional hyperbolic plane. The proposed approach is based on the approximate eigen-decomposition of the network’s Laplacian matrix, which makes it quite simple yet accurate and efficient, allowing for the analysis of big networks in a matter of seconds. Furthermore, it is data driven, which means that no assumptions are made about the model that constructed the network of interest. Finally, benchmarking of the method against existing embedding and prediction techniques highlights the advantages of using it for the prediction of new links between nodes and for the study of network evolution.
Results
Preliminaries and the proposed method
In this paper, only undirected, unweighted, single-component networks are considered, as the proposed embedding approach is only applicable to networks with these properties^{20}. Moreover, these networks are assumed to be scale-free (with scaling exponent γ ∈ [2, 3]) and with clustering coefficient significantly larger than expected by chance. These networks are graphs G = (V, E) with N = |V| nodes and L = |E| edges connecting them. An undirected, unweighted graph can be represented by an N × N adjacency matrix A_{i,j} = A_{j,i} ∀i, j, whose entries are 1 if there is an edge between nodes i and j and 0 otherwise. The graph Laplacian is a transformation of A given by L = D − A, where D is a matrix with the node degrees on its diagonal and 0 elsewhere.
Let us now consider a real-valued function on the set of network nodes, f: V → ℝ, which assigns a real number f(i) to each graph node. If the Laplacian acts as an operator over this function, Lf(i) = ∑_{j} A_{i,j}(f(i) − f(j)), one can see that it gives information about how the value of f for each node i compares to that of its neighbours j^{21}. This is the discrete analogue of the Laplace operator in vector calculus and its generalisation in differential geometry, the Laplace-Beltrami operator^{20}, which measure how much the curvature of a surface is changing at a given point. This is more evident if one thinks of a function as being approximated by a graph, such that nodes have more edges where the value of the Laplacian is greater (see Fig. 1).
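The two views of the Laplacian described above, as the matrix L = D − A and as an operator acting on a node function f, can be checked against each other on a toy graph. The following is a minimal sketch in plain Python; the function names `graph_laplacian` and `apply_laplacian`, the four-node example graph and the values of f are illustrative assumptions and not part of the original work.

```python
def graph_laplacian(n, edges):
    """Return (A, D, L) as nested lists for an undirected, unweighted graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i != j:  # self-loops are ignored
            A[i][j] = 1
            A[j][i] = 1
    degrees = [sum(row) for row in A]
    D = [[degrees[i] if i == j else 0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    return A, D, L


def apply_laplacian(A, f):
    """Evaluate Lf(i) = sum_j A[i][j] * (f[i] - f[j]) for every node i."""
    n = len(A)
    return [sum(A[i][j] * (f[i] - f[j]) for j in range(n)) for i in range(n)]


# Toy graph (assumed for illustration): node 1 is linked to nodes 0, 2 and 3.
edges = [(0, 1), (1, 2), (1, 3)]
A, D, L = graph_laplacian(4, edges)
f = [1.0, 4.0, 2.0, 0.0]

Lf_operator = apply_laplacian(A, f)                                   # operator form
Lf_matrix = [sum(L[i][j] * f[j] for j in range(4)) for i in range(4)]  # matrix form
print(Lf_operator, Lf_matrix)
```

Both forms give the same vector, and node 1, whose value of f exceeds that of all its neighbours, receives the largest entry.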
In particular, the embedding of the network to the two-dimensional hyperbolic plane ℍ^{2}, represented by the interior of a Euclidean circle^{9}, is given by the N × 2 matrix Y = [y_{1}, y_{2}], where the ith row, Y_{i}, provides the embedding coordinates of node i. Using the Laplacian operator (see above), this corresponds to minimising (1/2) ∑_{i,j} A_{i,j} ‖Y_{i} − Y_{j}‖^{2}, which reduces to minimising tr(Y^{T}LY) subject to Y^{T}DY = I, with D as defined above, I the identity matrix, M^{T} the transpose of M and tr(M) the trace of M. Finally, Y_{emb}, the matrix that minimises this objective function, is formed by the two eigenvectors with smallest non-zero eigenvalues that solve the generalised eigenvalue problem LY = λDY^{20} (see Section 1 in the Supplementary Information for a detailed justification of this embedding approach).
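The step from the pairwise objective to the trace form can be made explicit. The short derivation below follows the standard Laplacian Eigenmaps argument; the factor 1/2 and the constraint Y^{T}DY = I are taken from that standard formulation and are stated here as assumptions. It only uses the symmetry of A and the definition D_{i,i} = ∑_{j} A_{i,j}.

```latex
\begin{aligned}
\frac{1}{2}\sum_{i,j} A_{i,j}\,\lVert Y_i - Y_j\rVert^{2}
  &= \frac{1}{2}\sum_{i,j} A_{i,j}\left(\lVert Y_i\rVert^{2} + \lVert Y_j\rVert^{2} - 2\,Y_i Y_j^{T}\right) \\
  &= \sum_{i} D_{i,i}\,\lVert Y_i\rVert^{2} - \sum_{i,j} A_{i,j}\,Y_i Y_j^{T} \\
  &= \operatorname{tr}\!\left(Y^{T} D Y\right) - \operatorname{tr}\!\left(Y^{T} A Y\right)
   = \operatorname{tr}\!\left(Y^{T} L Y\right), \qquad L = D - A .
\end{aligned}
```

Minimising tr(Y^{T}LY) under Y^{T}DY = I with Lagrange multipliers then leads to the generalised eigenvalue problem LY = λDY, and the constraint excludes the trivial solution in which all nodes collapse to a single point.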
In the context of manifold learning, most algorithms rely on the construction of a mesh or network over the high-dimensional manifold containing the samples of interest^{19,22}. When pairwise distances between samples are computed, they correspond to shortest paths over the constructed network, allowing for a better preservation of the sample relationships when the data is embedded to low dimensions^{19,20,22,23}. If there is really a hyperbolic geometry underlying a complex network, it should lie on a hyperbolic plane, with nodes drifting away from the space origin. If the network itself is seen as the mesh that connects samples (nodes in this case) that are close to each other^{12}, it can be used as in manifold learning to recover the hyperbolic coordinates of its nodes. Connected pairs of nodes in the network should be very close to each other in the target, low-dimensional space (hence the minimisation problem presented above) and, consequently, their angular separation (governed by their similarity dimension according to the PS model) should also be small. Figure 2a shows that, if the described embedding approach is employed, this is indeed the case for an artificial network generated with the PS model (see the Methods for more details).
As a result, to complete the mapping to ℍ^{2}, angular node coordinates are obtained via θ = arctan(y_{2}/y_{1}) and radial coordinates (abstracted by the popularity or seniority dimension in the PS model) are chosen so as to resemble the rank of each node according to its degree. This is achieved via r_{i} = 2β ln(i) + 2(1 − β) ln(N), where nodes i = {1, 2, …, N} are the network nodes sorted decreasingly by degree and β = 1/(γ − 1)^{9,12} (see Fig. 2b and the Methods for further details).
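The conversion from the two embedding eigenvectors to polar coordinates can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `labne_coordinates` is hypothetical, `atan2` is used instead of a plain arctangent so that angles cover the full range [0, 2π) (an assumption about how the quadrant is resolved), and ties in degree are broken by node index.

```python
import math


def labne_coordinates(y1, y2, degrees, gamma):
    """Turn two embedding vectors and node degrees into (r, theta) lists."""
    n = len(degrees)
    beta = 1.0 / (gamma - 1.0)

    # Angular coordinates; atan2 resolves the quadrant (assumption, see lead-in).
    theta = [math.atan2(b, a) % (2.0 * math.pi) for a, b in zip(y1, y2)]

    # Rank nodes by decreasing degree; rank 1 is the highest-degree node.
    order = sorted(range(n), key=lambda v: -degrees[v])

    # r_i = 2*beta*ln(i) + 2*(1 - beta)*ln(N), with i the degree rank.
    r = [0.0] * n
    for rank, v in enumerate(order, start=1):
        r[v] = 2.0 * beta * math.log(rank) + 2.0 * (1.0 - beta) * math.log(n)
    return r, theta


# Toy input (assumed): four nodes placed on the coordinate axes.
y1 = [1.0, 0.0, -1.0, 0.0]
y2 = [0.0, 1.0, 0.0, -1.0]
degrees = [1, 3, 2, 1]
r, theta = labne_coordinates(y1, y2, degrees, gamma=2.5)
print(r, theta)
```

The highest-degree node receives the smallest radius and the lowest-ranked node sits at r = 2 ln(N), the edge of the disc.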
This strategy is valid, because the native representation of ℍ^{2}, in which the hyperbolic space is contained in a Euclidean disc and Euclidean and hyperbolic distances from the origin are equivalent, is a conformal model. This means that Euclidean angular separations between nodes are also equivalent to hyperbolic ones^{9}. On the other hand, the radial arrangement of nodes corresponds to a quasi-uniform distribution of radial coordinates in the disc^{9}. It is also important to mention that, due to rotational invariance of distances, the set of hyperbolic coordinates responsible for the edges observed in a network is not unique (see Fig. 2b). Therefore, the goal of the proposed technique is not to find a specific set of coordinates, but the one that corresponds better with the hyperbolic, distance-dependent connection probabilities that produce the network of interest.
The network embedding approach described in this section and in Table 1 is hereafter referred to as Laplacian-based Network Embedding (LaBNE).
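The rotational invariance mentioned above can be verified numerically. The sketch below uses the hyperbolic law of cosines for curvature K = −1, a standard formula that is not spelled out in this text; the function names and the three toy coordinates are illustrative assumptions.

```python
import math


def angular_separation(theta_a, theta_b):
    """Smallest angle between two angular coordinates given in [0, 2*pi)."""
    return math.pi - abs(math.pi - abs(theta_a - theta_b))


def hyperbolic_distance(r_a, theta_a, r_b, theta_b):
    """Hyperbolic law of cosines on the plane of curvature K = -1."""
    d_theta = angular_separation(theta_a, theta_b)
    arg = (math.cosh(r_a) * math.cosh(r_b)
           - math.sinh(r_a) * math.sinh(r_b) * math.cos(d_theta))
    return math.acosh(max(arg, 1.0))  # guard against round-off below 1


def pairwise_distances(coords):
    """Distances for all unordered pairs of (r, theta) coordinates."""
    return [hyperbolic_distance(*coords[a], *coords[b])
            for a in range(len(coords)) for b in range(a + 1, len(coords))]


coords = [(1.0, 0.3), (2.5, 2.0), (4.0, 5.5)]  # toy (r, theta) pairs
shift = 1.234                                   # arbitrary global rotation
rotated = [(r, (t + shift) % (2.0 * math.pi)) for r, t in coords]

max_change = max(abs(a - b) for a, b in
                 zip(pairwise_distances(coords), pairwise_distances(rotated)))
print(max_change)
```

Rotating every node by the same angle leaves all pairwise distances unchanged up to floating-point error, which is why an embedding can only be recovered up to a global rotation.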
Benchmarking LaBNE
In order to test the ability of LaBNE to infer the hyperbolic geometry of complex networks, artificial networks were generated using the PS model. This model allows for the construction of networks with known hyperbolic coordinates for each of its N nodes and target average node degree 2m, scaling exponent γ and clustering coefficient \bar{c}. The latter is controlled by the network temperature T, is reduced almost linearly with T and is the strongest possible at T = 0^{9} (see Methods for more details). One hundred synthetic networks were grown for each combination of parameters T = {0, 0.3, 0.6, 0.9}, 2m = {4, 6, 8, 10} and γ = {2.25, 2.50, 2.75}; the number of nodes N was fixed to 500. These networks were then mapped to ℍ^{2} with LaBNE and Pearson correlations between the inferred hyperbolic distances and the ones measured with the real node coordinates were computed and averaged across the one hundred networks for each parameter configuration. This same procedure was followed for the most recent and fastest version of HyperMap, an MLE method for network embedding to hyperbolic space that finds node coordinates by maximising the likelihood that the network is produced by the PS model^{17,18} (see Methods for details).
Figure 3a shows that the average correlation obtained by LaBNE for each parameter combination is as high as or higher than the one obtained by HyperMap. This is especially evident in dense, strongly clustered networks (i.e. networks with large 2m and T → 0, which implies high \bar{c}). These results are supported by the fact that angular coordinates inferred by LaBNE are closer to the angles from the generated PS networks, compared to the ones inferred by HyperMap (see Figure S1).
To provide an in-depth analysis of the differences in time performance of the two algorithms, an experiment similar to the one presented in Fig. 3a was carried out, but this time parameter T was fixed to 0 and the network size changed as N = {250, 500, 750, 1000}. Figure 3b shows that the fold-change from the average time needed by LaBNE to embed each of the one hundred networks per parameter configuration to the average time needed by HyperMap to perform the same task is very large, the latter being 500 times slower than LaBNE for the smallest networks. These results highlight one of the great benefits of LaBNE, its time performance, which stems from its computational complexity of O(kN^{2}) with k = 3. In a network with a single connected component, the smallest eigenvalue of L is always 0^{21} and its corresponding eigenvector is discarded. The next two, which correspond to the smallest non-zero eigenvalues, form the resulting node coordinates. Since k is always constant, LaBNE can be considered an O(N^{2}) algorithm.
Although the version of HyperMap considered here is an O(N^{2}) method as well^{18} (see Methods for details), Fig. 3b shows that it is slower than LaBNE. This is because the heuristic used to speed up this algorithm is not actually applied to all N nodes in the network of interest, but only to those with degrees smaller than a parameter k_{speedup}^{18} (see Methods for details). The larger the value of this parameter, the faster the algorithm, but also the higher the impact on its accuracy. It was set to 10 throughout this paper, which produced good results in a reasonable amount of time. On the other hand, LaBNE takes advantage of the efficiency and portability of the R package igraph^{24} and its interface with high-performance subroutines designed to solve large-scale eigenvalue problems^{25}.
Hyperbolic embedding for link prediction in real networks
Given the accuracy and time performance achieved by LaBNE and considering that short hyperbolic distances correlate well with high probabilities of connection between nodes^{12}, LaBNE was used to infer the hyperbolic coordinates for the nodes of three real networks (see Table 2 and Methods) to then carry out link prediction. As discussed in the previous section, Figure 3 shows that the accuracy of LaBNE is higher for networks with more topological information, i.e. with more edges between nodes, which occurs at high average node degrees (2m) and low temperatures (which imply high clustering \bar{c}). Consequently, the three real networks analysed here were chosen with the aim of investigating the performance of LaBNE in the low, medium and high clustering coefficient scenarios (see Table 2). Furthermore, these network datasets represent complex systems from different domains: the high-quality human protein interaction network (PIN) models the relationships between proteins within the human cell (low \bar{c}), in the Pretty-Good-Privacy network (PGP) users share encryption keys with people they trust (medium \bar{c}) and the autonomous systems Internet (ASI) corresponds to the communication network between groups of routers (high \bar{c}, see the Methods for more details).
Topological link prediction deals with the task of predicting links that are not present in an observable network, based merely on its structure. The standard way to evaluate the performance of a link predictor is to randomly remove a certain number of links from the network under study, use a predictor to assign likelihood scores to all non-adjacent node pairs in the pruned topology, sort the candidate links from best to worst based on their scores and, finally, scan this sorted list of candidates with a moving threshold to compute Precision (fraction of the candidate links passing the current score threshold that are in the set of removed links) and Recall (fraction of the removed links that pass the current score threshold) statistics^{1,26}.
The above-described evaluation framework comes, however, with a critical caveat. Pruning edges at random can remove important information from the observable network topology in unpredictable ways and this most certainly affects link predictors differently^{26}. To avoid this problem, historical data of the evolution of the network must be used to test the ability of a link predictor to assign high likelihood scores to edges in a network G_{t + 1} that are not yet present in a snapshot G_{t}, to which the predictor is applied. For the ASI^{27} and the PGP, temporal snapshots of their topology were available and this evaluation method was used (see networks with subscript t + 1 in Table 2). For the PIN^{28}, it was necessary to resort to the so-called Guilt-by-association Principle^{29}, which states that two proteins are highly likely to interact if they are involved in the same biological process. In this scenario, link predictors are applied to the observable protein network and discrimination between good and bad candidate interactions is based on a stringent cutoff on a measure of the similarity between the biological processes of the non-adjacent proteins (see the Methods for more details).
Figure 4a–c shows the performance in link prediction of LaBNE, HyperMap and a set of neighbourhood-based link predictors that are commonly considered for benchmarking in this context (see Methods). As expected from the results on artificial networks (Fig. 3a), the performance of both LaBNE and HyperMap improves as clustering increases. The worst results are thus obtained in the PIN (Fig. 4a). However, LaBNE is the only prediction technique that allows for more Recall without sacrificing as much Precision as the others. The performance of HyperMap in this network is poor because, as can be seen in Fig. 3a and S1b, its application should be restricted to highly clustered networks. In the other two cases (Fig. 4b,c), LaBNE is the second best performing method in terms of area under the Precision-Recall curve, only slightly behind HyperMap. These are very good results considering that LaBNE can obtain reliable link predictions in these big networks in a matter of seconds, while the current implementation of the most recent and fastest version of HyperMap requires days to produce the results (see Table 2).
Regarding the performance of the neighbourhood-based link predictors, it is important to highlight their good Precision at low Recall levels. This is mainly due to the fact that they are only able to assign meaningful scores to node pairs separated by at most two hops, which is a very small fraction of all possible candidate edges. The rest obtain the exact same score, which makes it impossible to differentiate between good and bad candidate interactions and results in the rapid drop of Precision observed in Fig. 4a–c.
Hyperbolic embedding for real network evolution analysis
In the PS model, radial coordinates increase monotonically with node birthtimes (r_{t} = 2 ln t), i.e. if a node i is close to the origin of the hyperbolic circle containing the network (r_{i} → 0), it means it was born early in the evolution of the complex system^{12}. To test whether this is the case in the most recent temporal snapshot of the three real networks considered (PIN, PGP and ASI), node radial coordinates inferred by LaBNE were compared to actual node birthtimes (see the Methods for details on how birthtimes are defined for each real network; HyperMap was not used here as it produces practically the same radial coordinates as LaBNE).
Figure 5a–c shows that in the three cases, nodes that are close to the centre of the hyperbolic space are older than those located in its periphery. Even all nodes from the first network snapshot in the PGP and ASI, which represent a mix of nodes that appeared at that time and nodes from older, not available timepoints, possess small radial coordinates (Fig. 5b,c).
These are very important results, because they exhibit the close relationship between node popularity and seniority in networks of very different objects and time scales. The results obtained with LaBNE suggest that, even when the identity of the network nodes is unknown, one can have an idea of their history in the system under study, based merely on their degree and, consequently, their inferred radial positions.
Conclusions
Scale-invariance, self-similarity and strong clustering, properties present in complex systems and geometric objects alike, have led to the proposal that the network representations of the former lie on a geometric space where distance constraints play important roles in the formation of links between system components^{8,9,12,30}. One such proposal advocates for the hyperbolic space as a good candidate to host complex networks, given that their skeletons (trees abstracting their underlying hierarchical structure) require an exponential space to branch and only hyperbolic spaces expand exponentially^{9}.
In consequence, efficient and accurate methods to embed networks to hyperbolic space are needed. In this article, a novel approach to perform this task is proposed: the Laplacian-based Network Embedding or LaBNE. Since it is based on a transformation of the adjacency matrix representation of a network, namely the graph Laplacian, it highly depends on topological information to carry out good embeddings. This was confirmed when applied to artificial and real networks with differing structural characteristics. The higher the average node degree (2m) and clustering coefficient (\bar{c}) of a network, the better the results achieved by LaBNE. Nevertheless, its low computational complexity allows for the study of the hyperbolic geometry of big networks in a matter of seconds. This means that LaBNE is suited to draft a geometric configuration of a network, which can then be used by more involved and time-consuming techniques, thus reducing the space of possible node coordinates they have to explore.
Notwithstanding the fact that techniques for embedding networks to generic low-dimensional spaces have been proposed to facilitate their visualisation and analysis^{19,20,22,23,30,31,32,33}, it is important to stress that LaBNE deals specifically with the embedding to the two-dimensional hyperbolic plane. This space has been shown to provide an accurate reflection of the geometry of real networks^{9,12} (see Figs 4 and 5) and allows for their visual inspection in two or three dimensions (see Figure S2). However, LaBNE does not make any a priori assumption about the model or mechanism that led to the formation of the network of interest. Thus, the distance-dependent connection probabilities resulting from the mapping to ℍ^{2} serve as the basis to determine if such a space is suitable for the network or not. For example, networks grown with the Barabási-Albert model^{14} are infinite-dimensional hyperbolic networks^{34}, but short distances between their nodes, measured with the coordinates inferred by LaBNE or HyperMap in ℍ^{2}, are not indicative of link formation in this space (see Figure S3). The in-depth study of networks with high-dimensional latent spaces or their embedding to ℍ^{d} (with d > 2) are of great interest, but beyond the scope of this article.
Finally, although this work did not intend to provide an extensive comparison between link predictors or a thorough analysis of the evolution of real networks, it is important to note that LaBNE performed very well in these two types of studies when applied to a biological, a social and a technological network. These represent a few example scenarios in which the inference of the hyperbolic geometry underlying a network could be useful.
Methods
The PS model
The PS model^{12} on the hyperbolic plane of curvature K = −1 is formulated as follows: (1) initially the network is empty; (2) at time t ≥ 1, a new node t appears at coordinates (r_{t}, θ_{t}), with r_{t} = 2 ln t and θ_{t} uniformly distributed on [0, 2π], and every existing node s < t increases its radial coordinate according to r_{s}(t) = β r_{s} + (1 − β) r_{t}, with β = 1/(γ − 1) ∈ [0, 1]; (3) new node t picks a randomly chosen node s < t that is not already connected to it and links with it with probability p(x_{st}) = 1/(1 + exp((x_{st} − R_{t})/(2T))), where parameter T, the network temperature, controls the network’s clustering coefficient, R_{t} is the current radius of the hyperbolic circle containing the network, x_{st} = r_{s} + r_{t} + 2 ln(θ_{st}/2) is the hyperbolic distance between nodes s and t and θ_{st} is the angle between the nodes; (4) repeat step 3 until node t gets connected to m different nodes; (5) repeat steps 2–4 until the network is comprised of N nodes. Note that if T → 0, p(x_{st}) approaches a step function that equals 1 for x_{st} < R_{t} and 0 for x_{st} > R_{t}. In addition, if β = 1/(γ − 1) = 1, existing nodes do not change their radial coordinates and R_{t} = 2 ln t.
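The growth process can be illustrated with a simplified simulation of the zero-temperature limit. The sketch below is an assumption-laden simplification, not the generator used for the experiments: at T → 0 the probabilistic linking of step 3 is replaced by connecting each new node to its m hyperbolically closest predecessors, so that neither R_{t} nor T is needed, and the function name `ps_model_zero_temperature` is hypothetical.

```python
import math
import random


def ps_model_zero_temperature(n, m, gamma, seed=0):
    """Grow a PS-like network in the T -> 0 limit (simplified sketch)."""
    rng = random.Random(seed)
    beta = 1.0 / (gamma - 1.0)
    theta = []   # theta[s - 1] is the angle of node s
    edges = []   # (s, t) pairs with s < t
    for t in range(1, n + 1):
        r_t = 2.0 * math.log(t)
        theta_t = rng.uniform(0.0, 2.0 * math.pi)
        candidates = []
        for s in range(1, t):
            # Popularity fading: r_s(t) = beta * r_s + (1 - beta) * r_t
            r_s = beta * 2.0 * math.log(s) + (1.0 - beta) * r_t
            d_theta = math.pi - abs(math.pi - abs(theta[s - 1] - theta_t))
            d_theta = max(d_theta, 1e-12)  # avoid log(0) for equal angles
            x_st = r_s + r_t + 2.0 * math.log(d_theta / 2.0)
            candidates.append((x_st, s))
        candidates.sort()
        # Assumption: at T -> 0 the new node links to its m closest nodes.
        for _, s in candidates[:m]:
            edges.append((s, t))
        theta.append(theta_t)
    # Final radial coordinates after the last node has appeared.
    r_final = [beta * 2.0 * math.log(s) + (1.0 - beta) * 2.0 * math.log(n)
               for s in range(1, n + 1)]
    return edges, r_final, theta


edges, r_final, theta = ps_model_zero_temperature(n=50, m=2, gamma=2.5, seed=1)
print(len(edges), r_final[0], r_final[-1])
```

The final radii coincide with the expression r_{i} = 2β ln(i) + 2(1 − β) ln(N) that LaBNE applies to degree ranks, which makes explicit why degree rank is used as a proxy for birthtime.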
Radial arrangement of nodes in LaBNE
As described above, new nodes in the PS model acquire radial coordinates r_{t} = 2lnt that depend on their birthtime t. This means that the probability of finding a node that is close to the centre of the hyperbolic circle containing the growing network, is exponentially lower than the probability to find a node on the periphery. When a new node is added to the system and the existing ones change their radial position according to r_{s}(t) = β r_{s} + (1 − β) r_{t}, where β = 1/(γ − 1), their seniority is attenuated by increasing their distances to every newly added node^{12}. Consequently, the N angular coordinates found by LaBNE are complemented with the nodes’ radial coordinates obtained via r_{i} = 2β ln(i) + 2(1 − β) ln(N), where nodes i = {1, 2, …, N} are the network nodes sorted decreasingly by degree.
HyperMap
HyperMap^{17} is a Maximum Likelihood Estimation method to embed a network to hyperbolic space. It finds node coordinates by replaying the network’s hyperbolic growth and, at each step, maximising the likelihood that it was produced by the PS model^{17}. For embedding to the hyperbolic plane of curvature K = −1 it works as follows: (1) nodes are sorted decreasingly by degree and labelled i = {1, 2, …, N} from the top of the sorted list; (2) node i = 1 is born and assigned radial coordinate r_{1} = 0 and a random angular coordinate θ_{1} ∈ [0, 2π]; (3) for each node i = {2, 3, …, N}: (3.1) node i is born and assigned radial coordinate r_{i} = 2 ln i; (3.2) the radial coordinate of every existing node j < i is increased according to r_{j}(i) = β r_{j} + (1 − β) r_{i}; (3.3) node i is assigned the angular coordinate θ_{i} maximising the likelihood \mathcal{L}_{i} = ∏_{1 ≤ j < i} p(x_{ij})^{e_{ij}} [1 − p(x_{ij})]^{1 − e_{ij}}. β and p(x_{ij}) are defined as in the PS model and e_{ij} is 1 if nodes i and j are connected and 0 otherwise. The maximisation of \mathcal{L}_{i} is performed numerically by trying different values of θ in [0, 2π], separated by intervals Δθ = 1/i, and then choosing the one that produces the greatest \mathcal{L}_{i}.
Since the angular coordinates yielded by this link-based likelihood are not very accurate for small i (i.e. for high degree nodes)^{17}, the fast version of HyperMap used in this paper uses information on the final number of common neighbours between these old nodes via the maximisation of a log-likelihood based on common neighbours, where μ is the mean number of common neighbours n_{ij} between i and j and σ^{2} is the associated variance^{18}. This hybrid version of HyperMap is O(N^{3}) and to speed it up, Papadopoulos and colleagues resort to the following heuristic: for nodes i with degree k_{i} < k_{speedup}, an initial estimate of their angular coordinate is computed by considering only the previous nodes j < i that are their neighbours; these estimates are then refined, searching for the final θ_{i} within a small region around the initial estimate. The fast hybrid version of HyperMap with k_{speedup} = 10 is the one used throughout this work. We refer the reader to^{18} for more details on the speed-up heuristic and the derivation of the common-neighbours log-likelihood. Finally, even though correction steps can be used together with the fast hybrid HyperMap, their effect on this method has been reported not to be significant^{18} and they are not considered here.
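Step 3.3, the numerical search for the angle of a newly born node, can be sketched as a grid search over candidate angles spaced by Δθ = 1/i. This is a hedged illustration rather than HyperMap's actual code: it assumes the Fermi-Dirac connection probability of the PS model with the disc radius and the temperature supplied by the caller, it works with the logarithm of the link-based likelihood, and all function names and toy inputs are hypothetical.

```python
import math


def connection_probability(x, radius, temperature):
    """Assumed Fermi-Dirac form p(x) = 1 / (1 + exp((x - R) / (2T)))."""
    z = (x - radius) / (2.0 * temperature)
    z = max(min(z, 700.0), -700.0)           # keep exp() within float range
    p = 1.0 / (1.0 + math.exp(z))
    return min(max(p, 1e-12), 1.0 - 1e-12)   # keep log() finite


def log_likelihood(theta_i, r_i, existing, neighbours, radius, temperature):
    """Sum over older nodes j of e_ij*ln p(x_ij) + (1 - e_ij)*ln(1 - p(x_ij))."""
    total = 0.0
    for j, (r_j, theta_j) in existing.items():
        d_theta = math.pi - abs(math.pi - abs(theta_i - theta_j))
        d_theta = max(d_theta, 1e-12)        # avoid log(0) for equal angles
        x_ij = r_i + r_j + 2.0 * math.log(d_theta / 2.0)
        p = connection_probability(x_ij, radius, temperature)
        total += math.log(p) if j in neighbours else math.log(1.0 - p)
    return total


def best_angle(i, r_i, existing, neighbours, radius, temperature):
    """Try angles 0, 1/i, 2/i, ... below 2*pi and keep the most likely one."""
    best_theta, best_ll = 0.0, -math.inf
    k = 0
    while k / i < 2.0 * math.pi:
        candidate = k / i
        ll = log_likelihood(candidate, r_i, existing, neighbours,
                            radius, temperature)
        if ll > best_ll:
            best_theta, best_ll = candidate, ll
        k += 1
    return best_theta


# Toy input (assumed): node 3 is linked to node 1 but not to node 2.
existing = {1: (0.5, 1.0), 2: (1.0, 4.0)}   # j -> (faded radius, angle)
theta_3 = best_angle(3, 2.0 * math.log(3), existing, {1},
                     radius=2.0, temperature=0.1)
print(theta_3)
```

In this toy case the search places node 3 next to its only neighbour and away from the node it is not linked to, which is the behaviour the likelihood is designed to reward.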
Network datasets
For the three network datasets used in this paper, self-loops and multiple edges were discarded and only the largest connected component was considered.
The high-quality protein interaction network (PIN) is a stringent subset of the Human Integrated Protein-Protein Interaction rEference (HIPPIE)^{28}. HIPPIE retrieves interactions between human proteins from major expert-curated databases and calculates a score for each one, reflecting its combined experimental evidence. This score is a function of the number of studies supporting the interaction, the quality of the experimental techniques used to measure it and the number of organisms in which the orthologs of the interacting human proteins interact as well. In this paper, only interactions with confidence scores ≥0.73 (the upper quartile of all scores) in release 1.7 were considered. The raw version of this network is available at http://cbdm01.zdv.unimainz.de/mschaefer/hippie/download.php. To determine the birthtime of the PIN nodes, proteins from the manually curated database SwissProt were clustered based on near full-length similarity and/or high threshold of sequence identity using FastaHerder2^{35}. If proteins from two evolutionarily distant organisms are present in one cluster, this suggests that the protein family is ancient. The minimum common taxonomy from all proteins that are part of a cluster was taken as an indication of the cluster’s age. Each node of the PIN was assigned to one of the following age clusters: Tree of Life Root, Metazoa, Chordata, Mammalia, Euarchontoglires or Primates/Human.
Pretty-Good-Privacy (PGP) is a data encryption and decryption program for secure data communication. In a PGP web of trust, each user (node) knows the public key of a group of people he trusts. When user A wants to send information to user B, this information is encrypted with B’s public key and signed with A’s private key. When B receives the information, he verifies that the message is coming from one of the users he trusts and decrypts it with his private key^{36}. This encryption and decryption event forms a directed link between users A and B. In this article, however, the edge directionality of this network is not considered. This is not a problem for the interpretation of the network if we assume that by sharing a key, two users reciprocally endorse their trust in each other^{12}. The four temporal snapshots of the undirected PGP network used here, which were collected by Jörgen Cederlöf^{37}, were used to assign a birthtime for each user based on the snapshot in which he first appeared. The snapshots correspond to April and October 2003, December 2005 and December 2006. The raw PGP data is available at http://www.lysator.liu.se/jc/wotsap/wots2/.
The autonomous systems Internet (ASI) corresponds to the communication network between IPv4 Internet subgraphs comprised of routers, as collected by the Center for Applied Internet Data Analysis^{27}. The six available network snapshots, spanning the period from September 2009 to December 2010 in 3-month intervals, were used to determine the birthtime of each autonomous system based on the snapshot in which it first appeared. These Internet topologies are available for download at https://bitbucket.org/dklab/2015_code_hypermap.
Link prediction
The performance of LaBNE and HyperMap in link prediction was compared to that of reference neighbourhood-based link predictors. They receive this name because the scores they produce are usually based on how much overlap there is between the neighbourhoods of non-adjacent pairs of nodes in a network. These unlinked nodes are often called seed nodes.
The simplest predictor considered was the Common Neighbours (CN) index, which just counts the number of common neighbours between seed nodes^{38}. The other indices examined were the Dice Similarity (DS), which is one of the possible normalisations of CN^{39}; the Adamic and Adar (AA) index, which assigns higher likelihood scores to seed nodes whose CNs do not interact with other components^{40}; and the Preferential Attachment index (PA), which is simply the degree product of the seed nodes^{38}. The formulae for these indices are, respectively: CN(x, y) = |Γ(x) ∩ Γ(y)|, DS(x, y) = 2CN(x, y)/(|Γ(x)| + |Γ(y)|), AA(x, y) = ∑_{z ∈ Γ(x) ∩ Γ(y)} 1/log|Γ(z)| and PA(x, y) = |Γ(x)||Γ(y)|, where Γ(x) is the set of neighbours of x and |Γ(x)| is the set cardinality.
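The four neighbourhood-based indices can be computed directly from adjacency sets. The following is a minimal sketch using the standard definitions of these indices; the function name `neighbourhood_scores`, the use of the natural logarithm in AA and the five-node toy network are illustrative assumptions.

```python
import math


def neighbourhood_scores(adj, x, y):
    """Return CN, DS, AA and PA for the seed nodes x and y.

    adj maps each node to the set of its neighbours (undirected graph).
    """
    gx, gy = adj[x], adj[y]
    common = gx & gy
    cn = len(common)
    total = len(gx) + len(gy)
    ds = 2.0 * cn / total if total else 0.0
    # A common neighbour has degree >= 2, so the logarithm is positive.
    aa = sum(1.0 / math.log(len(adj[z])) for z in common)
    pa = len(gx) * len(gy)
    return {"CN": cn, "DS": ds, "AA": aa, "PA": pa}


# Toy network (assumed): edges a-c, a-d, b-c, b-d, b-e, d-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
scores = neighbourhood_scores(adj, "a", "b")  # a and b are not adjacent
print(scores)
```

For seed nodes that share no neighbours, CN, DS and AA are all zero, which is the source of the many tied scores that limit these predictors beyond two hops.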
The embedding and neighbourhood-based predictors were applied to all the seed nodes of a network snapshot G_{t} and these node pairs were later sorted from best to worst score. This sorted list was scanned with a moving score threshold from top to bottom to compute the proportion of candidate interactions taken that coincide with the set of new edges in G_{t+1} (Precision) and the proportion of new edges present in G_{t+1} that are among the candidate interactions taken at each threshold (Recall). This allowed for the construction of a Precision-Recall curve for each of the predictors considered.
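The threshold scan can be sketched in a few lines. This illustration uses the standard definitions, Precision as the fraction of candidates taken that are new edges and Recall as the fraction of new edges recovered among the candidates taken; the function name `precision_recall_curve`, the one-candidate-per-step treatment of ties and the toy scores are assumptions.

```python
def precision_recall_curve(scored_pairs, new_edges):
    """Scan candidates from best to worst score.

    scored_pairs: list of (score, (u, v)) for non-adjacent pairs in G_t.
    new_edges:    iterable of (u, v) edges that appear in G_{t+1}.
    Returns a list of (precision, recall), one entry per threshold position.
    """
    positives = {frozenset(edge) for edge in new_edges}  # undirected edges
    if not positives:
        raise ValueError("at least one new edge is required")
    ranked = sorted(scored_pairs, key=lambda item: -item[0])
    hits = 0
    curve = []
    for taken, (_, pair) in enumerate(ranked, start=1):
        if frozenset(pair) in positives:
            hits += 1
        curve.append((hits / taken, hits / len(positives)))
    return curve


# Toy example (assumed scores): higher score means more likely link.
scored = [(0.9, (1, 2)), (0.8, (1, 3)), (0.4, (2, 4)), (0.1, (3, 4))]
new_edges = [(2, 1), (3, 4)]
curve = precision_recall_curve(scored, new_edges)
print(curve)
```

For hyperbolic embeddings the score of a candidate pair would be a decreasing function of its hyperbolic distance, for instance the negative distance, so that the closest pairs are scanned first.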
For the PGP, G_{t} has 14367 nodes and 37900 edges (snapshot from April 2003), while G_{t+1} has 31524 nodes and 168559 edges (all the nodes and edges from April 2003 to December 2006). Note, however, that only the 62547 new links between the same set of 14367 nodes present in G_{t} are considered.
For the ASI, G_{t} has 24091 nodes and 59531 edges (snapshot from September 2009), while G_{t+1} has 34320 nodes and 128839 edges (all the nodes and edges from September 2009 to December 2010). Note, however, that only the 48119 new links between the same set of 24091 nodes present in G_{t} are considered.
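The restriction to new links between nodes already present in G_{t} can be sketched as below. The function name and the toy edge lists are hypothetical, and edges are assumed to be undirected pairs of node identifiers.

```python
def new_edges_among_known_nodes(edges_t, edges_t1):
    """Edges of G_{t+1} that are absent from G_t and join two nodes of G_t."""
    nodes_t = {node for edge in edges_t for node in edge}
    old_edges = {frozenset(edge) for edge in edges_t}
    return {
        frozenset(edge)
        for edge in edges_t1
        if frozenset(edge) not in old_edges and set(edge) <= nodes_t
    }


# Hypothetical toy snapshots: nodes d and e only appear in the later one.
toy_edges_t = [("a", "b"), ("b", "c")]
toy_edges_t1 = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
toy_targets = new_edges_among_known_nodes(toy_edges_t, toy_edges_t1)
```

Only the edge between a and c survives the filter, because the other new edges involve nodes that were not part of the earlier snapshot.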
For the PIN, it was necessary to follow a different procedure, as edge timestamps are not available for this network. When the performance of a link predictor is assessed in protein networks, researchers have opted for using Gene Ontology (GO) similarities to discriminate between good and bad candidate interactions^{30,32,41,42}. This is based on the Guilt-by-association Principle, which states that if two proteins are involved in similar biological processes, they are more likely to interact^{29}. So, the link predictors were applied to the non-adjacent protein pairs in the observable network topology of the PIN, and these pairs were then sorted from best to worst score. The GO similarity of the top 10% candidate links was then computed, together with the proportion of protein pairs with similarities ≥0.544. This percentage corresponds to the precision of the link predictors reported for the PIN.
To compute GO similarities the R package GOSemSim was utilised^{43}. Although this package provides different indices to measure similarities between proteins, Wang’s index was used because it was formulated specifically for the GO^{44}. GO similarities on the high end of the range [0, 1] are normally good indicators of a potential protein interaction^{44}. However, a threshold of 0.544 was preferred in this study, as it corresponds to the upper quartile of all the GO similarities of connected protein pairs in the PIN.
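The precision computation for the PIN can be illustrated with a short Python sketch. It assumes that GO similarities have already been computed elsewhere (for example with GOSemSim in R) and exported as a lookup table; the function name, the `similarity` dictionary and the toy values are hypothetical.

```python
def top_fraction_precision(scored_pairs, similarity, threshold, fraction=0.10):
    """Share of the top-ranked candidates whose GO similarity is >= threshold.

    scored_pairs: list of ((x, y), score); a higher score is a better candidate.
    similarity: dict mapping frozenset({x, y}) to a precomputed GO similarity.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    # Keep at least one candidate so that the ratio below is always defined.
    top_k = max(1, round(fraction * len(ranked)))
    top = ranked[:top_k]
    good = sum(1 for pair, _score in top if similarity[frozenset(pair)] >= threshold)
    return good / top_k


# Hypothetical toy data: 20 candidate pairs, scored in decreasing order.
toy_pairs = [(f"p{i}", f"q{i}") for i in range(20)]
toy_scored = [(pair, 20 - i) for i, pair in enumerate(toy_pairs)]
toy_similarity = {frozenset(pair): 0.6 for pair in toy_pairs}
toy_similarity[frozenset(toy_pairs[0])] = 0.9
toy_similarity[frozenset(toy_pairs[1])] = 0.3
toy_precision = top_fraction_precision(toy_scored, toy_similarity, threshold=0.544)
```

With 20 candidates, the top 10% contains two pairs; one of them reaches the 0.544 threshold, giving a precision of 0.5 in this toy case.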
Hardware used for experiments
All the experiments presented in this paper were executed on a 64-bit Lenovo ThinkPad with 7.7 GB of RAM and an Intel Core i7-4600U CPU @ 2.10 GHz × 4, running Ubuntu 14.04 LTS. The only exceptions were the link prediction experiments, which were executed on nodes with 100 GB of RAM within the Mogon computer cluster at Johannes Gutenberg Universität in Mainz.
Availability
R implementations of the PS model and LaBNE are available at http://www.gregal.info/code. The network data used in this paper are also available at the same website. The C++ implementation of the fast version of HyperMap used in this paper is available at https://bitbucket.org/dklab/2015_code_hypermap.
Additional Information
How to cite this article: Alanis-Lobato, G. et al. Efficient embedding of complex networks to hyperbolic space via their Laplacian. Sci. Rep. 6, 30108; doi: 10.1038/srep30108 (2016).
References
Lü, L. & Zhou, T. Link prediction in complex networks: a survey. Physica A 390, 1150–1170 (2011).
Harenberg, S. et al. Community detection in large-scale networks: a survey and empirical evaluation. Wiley Interdiscip. Rev. Comput. Stat. 6, 426–439 (2014).
Opsahl, T., Agneessens, F. & Skvoretz, J. Node centrality in weighted networks: generalizing degree and shortest paths. Soc. Networks 32, 245–251 (2010).
Shore, J. & Lubin, B. Spectral goodness of fit for network models. Soc. Networks 43, 16–27 (2015).
Dall, J. & Christensen, M. Random geometric graphs. Phys. Rev. E 66, 016121 (2002).
Aste, T., Di Matteo, T. & Hyde, S. Complex networks on hyperbolic surfaces. Physica A 346, 20–26 (2005).
Serrano, M. A., Krioukov, D. & Boguñá, M. Self-similarity of complex networks and hidden metric spaces. Phys. Rev. Lett. 100, 078701 (2008).
Boguñá, M., Krioukov, D. & Claffy, K. C. Navigability of complex networks. Nat. Phys. 5, 74–80 (2009).
Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A. & Boguñá, M. Hyperbolic geometry of complex networks. Phys. Rev. E 82, 036106 (2010).
Ferretti, L. & Cortelezzi, M. Preferential attachment in growing spatial networks. Phys. Rev. E 84, 016103 (2011).
Aste, T., Gramatica, R. & Di Matteo, T. Exploring complex networks via topological embedding on surfaces. Phys. Rev. E 86, 036109 (2012).
Papadopoulos, F., Kitsak, M., Serrano, M. A., Boguñá, M. & Krioukov, D. Popularity versus similarity in growing networks. Nature 489, 537–540 (2012).
Barthélemy, M. Spatial networks. Phys. Rep. 499, 1–101 (2011).
Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512 (1999).
Song, C., Havlin, S. & Makse, H. A. Origins of fractality in the growth of complex networks. Nat. Phys. 2, 275–281 (2006).
Goh, K.-I., Salvi, G., Kahng, B. & Kim, D. Skeleton and fractal scaling in complex networks. Phys. Rev. Lett. 96, 018701 (2006).
Papadopoulos, F., Psomas, C. & Krioukov, D. Network mapping by replaying hyperbolic growth. IEEE ACM T. Network. 23, 198–211 (2015).
Papadopoulos, F., Aldecoa, R. & Krioukov, D. Network geometry inference using common neighbors. Phys. Rev. E 92, 022807 (2015).
Cayton, L. Algorithms for manifold learning. UCSD tech report CS2008-0923, 1–17 (2005). URL http://www.lcayton.com/resexam.pdf. Last visited: 2016-03-30.
Belkin, M. & Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neur. In. 14, 585–591 (2001).
Brouwer, A. E. & Haemers, W. H. Spectra of graphs (Springer-Verlag: New York, 2012).
Zemel, R. S. & Carreira-Perpiñán, M. A. Proximity graphs for clustering and manifold learning. Adv. Neur. In. 17, 225–232 (2004).
Tenenbaum, J. B. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 2319–2323 (2000).
Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Compl. Syst. 1695, 1695 (2006).
Lehoucq, R., Sorensen, D. & Yang, C. ARPACK Users’ Guide (Society for Industrial and Applied Mathematics, 1998).
Yang, Y., Lichtenwalter, R. N. & Chawla, N. V. Evaluating link prediction methods. Knowl. Inf. Syst. 45, 751–782 (2014).
Claffy, K., Hyun, Y., Keys, K., Fomenkov, M. & Krioukov, D. Internet mapping: from art to science. 205–211 (IEEE, 2009).
Schaefer, M. H. et al. HIPPIE: integrating protein interaction networks with experiment based quality scores. PLoS ONE 7, e31826 (2012).
Oliver, S. Guilt-by-association goes global. Nature 403, 601–603 (2000).
Cannistraci, C. V., Alanis-Lobato, G. & Ravasi, T. Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding. Bioinformatics 29, i199–i209 (2013).
Kuchaiev, O., Rašajski, M., Higham, D. J. & Pržulj, N. Geometric Denoising of Protein-Protein Interaction Networks. PLoS Comput. Biol. 5, e1000454 (2009).
You, Z.-H., Lei, Y.-K., Gui, J., Huang, D.-S. & Zhou, X. Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics 26, 2744–2751 (2010).
Newman, M. & Peixoto, T. P. Generalized communities in networks. Phys. Rev. Lett. 115, 088701 (2015).
Ferretti, L., Cortelezzi, M. & Mamino, M. Duality between preferential attachment and static networks on hyperbolic spaces. Europhys. Lett. 105, 38001 (2014).
Mier, P. & Andrade-Navarro, M. A. FastaHerder2: four ways to research protein function and evolution with clustering and clustered databases. J. Comput. Biol. 23, 270–278 (2016).
Schneier, B. Applied cryptography (John Wiley & Sons, 1996).
Cederlöf, J. The OpenPGP web of trust (2003). URL http://www.lysator.liu.se/jc/wotsap/wots2/. Last visited: 2015-09-08.
Newman, M. Clustering and preferential attachment in growing networks. Phys. Rev. E 64, 1–4 (2001).
Dice, L. R. Measures of the amount of ecologic association between species. Ecology 26, 297–302 (1945).
Adamic, L. & Adar, E. Friends and neighbors on the web. Soc. Networks 25, 211–230 (2003).
Chen, J., Hsu, W., Lee, M. L. & Ng, S.-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics 22, 1998–2004 (2006).
Cannistraci, C. V., Alanis-Lobato, G. & Ravasi, T. From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks. Sci. Rep. 3, 1613 (2013).
Yu, G. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).
Wang, J., Du, Z., Payattakool, R., Yu, P. & Chen, C.-F. A new method to measure the semantic similarity of GO terms. Bioinformatics 23, 1274–1281 (2007).
Acknowledgements
The authors would like to thank Carlo V. Cannistraci and Josephine Thomas for useful comments and suggestions. In addition, the posts published by Alex Kritchevsky and Arpan Saha on http://www.quora.com/ were very important for a clear and intuitive presentation of the Laplacian as an operator. Finally, the authors also thank the Zentrum für Datenverarbeitung at the Johannes Gutenberg Universität in Mainz, which made the analysis of big networks possible in the Mogon computer cluster.
Author information
Contributions
G.A.L. created LaBNE, designed, implemented and carried out the experiments. P.M. was in charge of the protein age assignment. M.A.A.N. supervised the research. G.A.L. wrote the manuscript, incorporating comments, contributions and corrections from P.M. and M.A.A.N.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Cite this article
Alanis-Lobato, G., Mier, P. & Andrade-Navarro, M. Efficient embedding of complex networks to hyperbolic space via their Laplacian. Sci Rep 6, 30108 (2016). https://doi.org/10.1038/srep30108