Abstract
Network embedding is a general-purpose machine learning technique that encodes network structure in vector spaces with tunable dimension. Choosing an appropriate embedding dimension – small enough to be efficient and large enough to be effective – is challenging but necessary to generate embeddings applicable to a multitude of tasks. Existing strategies for the selection of the embedding dimension rely on performance maximization in downstream tasks. Here, we propose a principled method such that all structural information of a network is parsimoniously encoded. The method is validated on various embedding algorithms and a large corpus of real-world networks. The embedding dimensions selected by our method for real-world networks suggest that efficient encoding in low-dimensional spaces is usually possible.
Introduction
Neural embedding methods are machine learning techniques that learn geometric representations of entities. For instance, word embedding methods leverage the relationships between words captured in large corpora to map each word to a vector^{1,2,3}; graph embedding methods map each node to a vector by capturing structural information in the graph^{4,5,6,7}. Embedding approaches have not only pushed the performance envelope in many tasks, such as the prediction of word analogies and network links, but also provide a novel way to capture semantic and structural relationships geometrically^{3,8,9,10,11}.
In neural embedding of networks, the embedding space and dimension do not have a special meaning as in traditional embedding methods such as Laplacian Eigenmaps^{12}, or hyperbolic space embedding^{13,14,15,16,17}. Instead, the dimension is considered as a hyperparameter of the model, which is either optimized through a model selection process or simply chosen based on the common practice (e.g., 100, 200, or 300 dimensions).
In word embedding, because we expect that the semantic space of human language does not drastically vary across corpora or languages, using common default parameters is reasonable, and the behavior of embedding models as a function of the dimension parameter is rather well studied. For instance, it is common to use 100 to 300 dimensions, which are known to provide excellent performance in various tasks without much risk of overparametrization^{6,18,19,20,21}. By contrast, the space of graphs is vast, and we expect no strong universal structure that may lead to similar optimal hyperparameters. Moreover, it is unclear how the structural properties of networks, such as community structure, affect the right dimension of the model. For instance, imagine a road network, which is naturally embedded in a two-dimensional space. Because of its geometric nature, a suitable value for the embedding dimension should be close to two. Now, imagine another network that consists of two densely connected clusters of nodes, with only a few connections between the clusters. If there is little structural difference between the nodes within the same cluster, then even one dimension would suffice to represent the nodes effectively. Although the embedding dimension is one of the most important hyperparameters of the embedding models, particularly in the graph embedding literature, it is difficult to find principled approaches to the selection of this hyperparameter. Most existing methods use performance in downstream tasks, such as node classification and link prediction, to perform the model selection, rather than establishing direct connections between the embedding dimension and the structural properties of the network. However, the performance for different tasks may be optimized with a different number of dimensions. Furthermore, it is natural to expect that the total information content of a network dataset is task independent.
Self-contained approaches to the selection of the embedding dimension of a network have been considered in previous papers. Classical embedding methods based on the spectral decomposition of the adjacency matrix or other graph operators, for example, take advantage of features of the spectrum of the operator to select the appropriate number of eigencomponents required to represent the graph with sufficient accuracy^{12,22,23}. Other approaches are based on the underlying assumption that the network under observation is compatible with an a priori given generative model defined in the geometric space. The goodness of fit of the observed network against the model then provides an indication of the intrinsic dimension of the network itself. Irrespective of the specific model considered, the common message is that low-dimensional spaces are generally sufficient to obtain an accurate embedding of a network. In network embedding in Blau space, for instance, the number of dimensions required to appropriately represent a network is expected to grow logarithmically with the network size^{24,25,26}. Empirical analyses of large-scale social networks support this claim^{26}. Also, Hoff et al. considered generative network models where the probability of connection between pairs of nodes is a function of their distance in the embedding space^{27}. They developed nearly optimal inference algorithms to perform the embedding, and showed that some small-size social networks can be quasi-optimally represented in three dimensions. We stress that expecting a low-dimensional embedding to represent a network well does not mean that the network information is fully and exactly encoded by the embedding. This fact was emphasized by Seshadhri et al., who considered generic models where the probability of connection between pairs of nodes is a function of the dot product between the vector representations of the nodes in the embedding space^{28}.
The stochastic block model used in community detection^{29,30,31} and the random dot product model^{32,33} belong to this class of models. Seshadhri et al. mathematically proved that low-dimensional embeddings cannot generate graphs with both low average degree and large clustering coefficients, thus failing to capture significant structural aspects of real-world complex networks. The result seems, however, dependent on the specific model considered, as a minor relaxation of the model by Seshadhri et al. leads to exact low-dimensional factorizations of many real-world networks^{34}.
Here, we contribute to the literature on the subject by proposing a principled technique to choose a proper dimension value for a generic network embedding, and we examine the relationships between structural characteristics and the chosen value of the embedding dimension. Our method does not aim at identifying the intrinsic dimension of a given network, but rather the proper value of the space dimension of a given network according to a given embedding algorithm. Different embedding models applied to the same network may lead to different values of the selected embedding dimension. To this end, we follow a route similar to the recent study on word embedding dimension by Yin et al.^{20}. Yin et al. proposed a metric, the Pairwise Inner Product (PIP) loss function, which compares embeddings across different dimensions by measuring pairwise distances between entities within an embedding. By using this metric, they argued that a suitable value for the embedding dimension can be identified as the one corresponding to the optimal balance between bias and variance. Here, we extend the approach to the embedding of networks by proposing an alternative metric to quantify the amount of information shared by embeddings of different dimensions. We show that the content of structural information encoded in the embedding saturates in a power-law fashion as the dimension of the embedding space increases, and we identify the proper value of the embedding dimension as the point of saturation of our metric. We evaluate our method by employing multiple embedding techniques, a host of real-world networks, and downstream prediction tasks.
Results
Embedding dimension and performance in detection tasks
To compare embeddings of the same network obtained for different values of the embedding dimension d, we use the normalized embedding loss. We first set a reference dimension d_{r}, which we consider sufficiently large to capture the information stored in the original network. We then compute the embedding matrix V^{(d)} for d < d_{r} and use Eq. (6) to evaluate the normalized embedding loss function at dimension d as \(L(d):=L({V}^{(d)},{V}^{({d}_{r})})\). Specifically, for graphs with N < 500 nodes we set the reference value d_{r} = N, and for graphs with N ≥ 500 nodes we set d_{r} = 500. The latter choice is made for convenience, to avoid computational issues. Also, we stress that the choice of the reference value d_{r} is not crucial as long as d_{r} is large enough (Supplementary Fig. 1).
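The comparison step can be sketched in a few lines. Eq. (6) is not reproduced in this section; the sketch below assumes, for concreteness, that the normalized embedding loss compares the pairwise cosine-similarity matrices of the two embeddings, normalized by the reference matrix. The function name and the normalization choice are illustrative, not the exact definition of Eq. (6).

```python
import numpy as np

def normalized_embedding_loss(V_d, V_ref):
    # Pairwise cosine-similarity matrix of each embedding; comparing these
    # matrices makes the loss invariant to rotations of the embedding space.
    def cosine_matrix(V):
        U = V / np.clip(np.linalg.norm(V, axis=1, keepdims=True), 1e-12, None)
        return U @ U.T
    S_d, S_ref = cosine_matrix(V_d), cosine_matrix(V_ref)
    # Frobenius-norm discrepancy, normalized by the reference similarities.
    return np.linalg.norm(S_d - S_ref) / np.linalg.norm(S_ref)

# Toy usage: a low-dimensional embedding compared against a d_r = 10 reference.
rng = np.random.default_rng(0)
V_ref = rng.normal(size=(50, 10))   # reference embedding of 50 nodes
V_low = V_ref[:, :3]                # truncation stands in for a 3-d embedding
loss = normalized_embedding_loss(V_low, V_ref)
```

The loss is zero when the two embeddings induce identical pairwise similarities, and grows as the low-dimensional representation loses relational information.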
We stress that some embedding algorithms contain stochastic components (e.g., due to random sampling procedures), so that the embedding V^{(d)} obtained for a given dimension d may not be the same over different runs of the algorithm on the same network. We neglect this fact in the analysis performed in this section; we illustrate how fluctuations affect our method for the selection of the embedding dimension below.
In the above recipe, we assume that the geometric representation obtained for d = d_{r} is the most accurate representation that the embedding algorithm can obtain for the network. Such an assumption is obviously true for spectral algorithms of dimensionality reduction such as LE in d_{r} = N dimensions, since the network topology can be exactly reconstructed from the full spectral decomposition of the normalized graph Laplacian. Any LE embedding in d < N dimensions is necessarily a lossy representation of the network. For LE in particular, the amount of topological information lost by suppressing some eigencomponents is a function of the eigenvalues of the normalized graph Laplacian. Similar assumptions should hold for other, non-spectral embedding methods: the best geometric representation of a network, ideally a lossless representation, that an embedding method can achieve requires a space of dimension equal to the size of the network itself. The quality of a lower-dimensional embedding is quantified in terms of the information lost with respect to the most accurate embedding achievable by the embedding method on the network at hand.
An example of the application of our method is reported in Fig. 1a for the American college football network. Here, we use node2vec to obtain the embeddings. The reference value for this graph is d_{r} = 115, where N = 115 is the number of nodes. As d increases, L(d) smoothly decreases toward an asymptotic value close to zero. This fact indicates that, as we increase the dimension d of the embedding, the resulting geometric representations become similar to the one obtained for the reference value d_{r}. The saturation at high values of the embedding dimension is not due to degeneracy of the cosine similarity metric (Supplementary Fig. 2): as the embedding dimension grows, the distribution of pairwise cosine similarities converges to a stable distribution with finite variance. Note that L(d) is already very close to the asymptotic value for d ≃ 10, suggesting that the representation obtained by node2vec at d = 10 is already able to capture the structural features (pairwise distance relationships) of the network from the perspective of node2vec. Of course, this observation does not mean that 10 dimensions are sufficient to fully describe the observed network. It simply tells us that increasing the dimension d of the embedding space is superfluous, in the sense that it does not significantly augment what node2vec is able to learn about the network.
This statement is further supported by our experiments on link prediction performance for different values of the embedding dimension d, as illustrated in Fig. 1c. The accuracy in predicting missing links based on d-dimensional embeddings behaves almost exactly as L(d): increasing d is useful up to a certain point, after which there is no additional gain in prediction accuracy. The saturation of the link prediction accuracy arises earlier than the one observed for L(d). This is to be expected, as there should be potentially more information in a geometric embedding than is actually used in the link prediction task.
The analysis of the Cora citation network allows us to draw similar conclusions (Fig. 1b). Again, L(d) quickly decreases to an asymptotic value as d increases. Furthermore, it seems that d ≃ 45 dimensions are more than enough to provide a sufficiently accurate geometric representation of the network, containing almost all the structural features that node2vec is able to extract from it. Also, the behavior of L(d) predicts well how the accuracy in link prediction grows as a function of the embedding dimension d (Fig. 1d).
If we repeat the same analysis using the LINE algorithm^{35}, we find analogous behavior, with L(d) smoothly decaying toward an asymptotic value as d increases, although the rate of decay of the function L(d) is not identical for the two embedding algorithms (Fig. 1a, b). However, a faster decay does not necessarily imply that the corresponding algorithm is able to represent the network more efficiently than the other algorithm. The reference embeddings used in the definition of L(d) are in fact algorithm dependent, and the quantitative comparison between the L(d) functions of two different algorithms may not be informative.
Similar observations can be made when we perform the geometric embedding of synthetic networks. In Fig. 2, we display L(d) for networks generated according to the SB model. We still observe a saturation of the function L(d). Graph clustering reaches maximal performance at slightly smaller d values than those necessary for the saturation of L(d), and performance slowly decreases afterwards. The drop in performance for values of the embedding dimension that are too large has already been observed in other papers, see for example refs. ^{35,36}. Results based on LE embedding provide similar insights (Supplementary Figs. 3–6): L(d) saturates quickly as d increases; the performance in the graph clustering task is optimal before the saturation point of L(d); further, the saturation point is compatible with the presence of a significant gap in the sorted Laplacian eigenvalues.
As already mentioned, we tested the robustness of our findings for different choices of the value of the reference dimension d_{r} (Supplementary Fig. 1). We find that, as long as d_{r} is large enough, the function L(d) is basically unaffected by the specific choice made. Furthermore, we compared our loss function with the PIP loss function^{20} (Supplementary Fig. 7). The functions behave in a qualitatively similar manner, displaying a clear decay toward an asymptotic value as the embedding dimension increases. The normalized loss function introduced here has the advantage of being smoother than the PIP loss function, thus allowing for a mathematical description that requires fewer parameters.
Finally, we fed the k-means algorithm with GraphSAGE embeddings to detect the community structure of some real-world networks (Supplementary Fig. 8). We found that the identified partition of a network does not significantly change if the network is embedded in a dimension larger than a certain threshold, and that such a threshold value is well predicted by the point of saturation of the normalized loss function associated with the GraphSAGE embeddings of the network. Similarly, node classification tasks based on GraphRNA embeddings reach stable performance if the dimension of the embedding is larger than a threshold (Supplementary Fig. 9). Such a threshold is well localized by the point of saturation of the normalized loss function of the GraphRNA network embeddings.
Principled selection of a suitable value for the embedding dimension
The observed behavior of the loss function L(d) indicates that embedding algorithms may generate sufficiently accurate geometric descriptions with a relatively small value of d. Assuming that the plateau value L_{∞} of the loss function L(d) corresponds to the best geometric description that the embedding algorithm can achieve, we indicate with

d_{o}(ϵ) := min{d : L(d) − L_{∞} < ϵ}     (1)

the minimal d value such that the difference between L(d) and the optimum L_{∞} is less than ϵ. As already stated, a key implicit assumption here is that the empirically trained maximum-dimensional embedding is the best achievable by the algorithm. d_{o}(ϵ) is then the best choice for the embedding dimension: it yields a geometric representation that uses less training time and fewer computational resources but is still sufficiently similar (i.e., within an ϵ margin) to the best achievable representation of the network. Our formulation of the problem does not explicitly account for the fact that the network at hand may be a specific realization of some underlying stochastic model on which the embedding method may rely. Rather, Eq. (1) defines a low-rank approximation problem, meaning that we aim at finding the low-dimensional (i.e., low-rank) approximation that best describes the high-dimensional (i.e., full-rank) embedding matrix. We differ from the standard setting considered in low-rank approximation problems in two main respects. First, we do not rely on a standard metric of distance (e.g., Frobenius distance) between the low- and the full-rank embedding matrices. Instead, we make use of a measure of collective congruence between the embeddings, i.e., Eq. (6). The normalized embedding loss function is equivalent to a metric of distance between matrices only for spectral embedding methods such as LE; the equivalence is not immediate for arbitrary embedding methods. Second and more important, in a standard low-rank approximation one generally aims at finding the best low-rank reduction of a full-rank matrix by constraining the maximum rank of the approximation. In our case, instead, we seek the best rank reduction constrained by a maximum amount of tolerated discrepancy between the low- and the high-dimensional embeddings.
There could be several ways of finding the solution of Eq. (1). Here, given the smoothness of the empirically observed loss functions, we opted for a parametric approach. Specifically, we found that

L(d) = L_{∞} + s d^{−α}     (2)

where s and α are fitting parameters, represents the data well for various combinations of embedding methods and embedded networks. Equation (2) is meant to describe the mathematical behavior of the loss function only in the range d ∈ [1, d_{r}). In particular, in the definition of Eq. (2), L_{∞} appears as the value of the normalized loss function in the limit of infinitely large embedding dimensions. However, such a limit has no physical meaning and is used only to facilitate the mathematical description of the empirical loss function in the regime d ≪ d_{r}. Best fits of the function of Eq. (2), obtained by minimizing the mean-squared error, are shown in Figs. 1a, b and 2a. We performed a systematic analysis on a corpus of 135 real-world networks. We found that the best estimates \({\hat{L}}_{\infty }\) of the asymptotic value L_{∞} of the normalized embedding loss function are generally close to zero (Supplementary Table 1). Best estimates \(\hat{s}\) of the factor s, which regulates the rate of the power-law decay toward L_{∞}, are very similar to each other, irrespective of the network that is actually embedded. Measured values of \(\hat{s}\) indicate only a mild dependence on the underlying network (see Fig. 3). For uncorrelated clouds of points in d dimensions, the central limit theorem predicts that L(d) ~ 1/d^{1/2}. Our best estimates \(\hat{\alpha }\) of the decay exponent α are generally greater than those expected for uncorrelated clouds of data points, indicating that the embedding algorithm correctly retains network structural information even in high-dimensional space. In systematic analyses performed on random networks constructed using either the ER or the BA model, we find that the size of the network is not an important factor in determining the plateau value L_{∞}. The decay exponent α and the rate s are positively correlated with the density of the embedded network, but their values become constant in the limit of large network sizes (see Fig. 4).
Assuming the validity of Eq. (2), the solution of Eq. (1) for d_{o}(ϵ) can be written as

d_{o}(ϵ) = (s/ϵ)^{1/α}     (3)

The best estimate \({\hat{d}}_{o}(\epsilon )\) is calculated using the knowledge of the best estimates \(\hat{s}\) and \(\hat{\alpha }\) as \({\hat{d}}_{o}(\epsilon )={\left(\hat{s}/\epsilon \right)}^{1/\hat{\alpha }}\). Values of \({\hat{d}}_{o}(\epsilon )\) measured in real-world networks for ϵ = 0.05 are reported in Supplementary Table 1. In Fig. 3, we display the cumulative distribution of \({\hat{d}}_{o}(\epsilon =0.05)\) over the entire corpus. For all networks in our dataset, we find that \({\hat{d}}_{o}(\epsilon =0.05)\,\leq \,218\). The size N of the network is an upper bound for d_{o}(ϵ). However, for sufficiently large networks, estimated values of \({\hat{d}}_{o}(\epsilon )\) do not display any clear dependence on N. Specifically, for roughly 40% of the real networks \({\hat{d}}_{o}(\epsilon =0.05)/N\,<\,0.01\), and for 80% of the real networks \({\hat{d}}_{o}(\epsilon =0.05)/N\,<\,0.1\) (see Fig. 3).
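A minimal sketch of this fitting procedure follows. As a simplification, it fixes L_{∞} = 0 (motivated by the near-zero estimates \({\hat{L}}_{\infty }\) reported above; the analysis in the paper fits the full three-parameter form by minimizing the mean-squared error), which makes Eq. (2) linear in log-log space.

```python
import numpy as np

def fit_and_select_dimension(d_vals, L_vals, eps=0.05):
    # With L_inf fixed at 0, Eq. (2) reads L(d) = s * d**(-alpha), which is
    # linear in log-log space: log L = log s - alpha * log d.
    slope, intercept = np.polyfit(np.log(d_vals), np.log(L_vals), 1)
    alpha, s = -slope, np.exp(intercept)
    d_o = (s / eps) ** (1.0 / alpha)   # Eq. (3)
    return s, alpha, d_o

# Synthetic loss curve with known parameters s = 0.8, alpha = 0.5.
d = np.arange(1, 101)
L = 0.8 * d ** -0.5
s, alpha, d_o = fit_and_select_dimension(d, L, eps=0.05)
# The fit recovers alpha = 0.5 and d_o(0.05) = (0.8 / 0.05)**2 = 256.
```

In practice, one would replace the synthetic curve with the empirical L(d) values measured as in the previous section.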
The definition of the loss function provided in Eq. (6) uses cosine similarity to compare pairs of embeddings, tacitly assuming that this metric well represents distances between nodes in the embedding space. However, one may argue that this is not the case^{28}, and the outcome of the analysis may depend on the specific metric of distance adopted. For example, if the embedding is obtained by fitting an observed network against a generative network model, then the exact definition of connection probability in the model as a function of the distance between points in the space may play a fundamental role in properly embedding a network^{27}. We therefore tested the robustness of our findings against different choices of the distance metric in the embedding space (Supplementary Fig. 10). In our tests, we simply replace cosine similarity with another metric (e.g., Euclidean, correlation, or Chebyshev distance), and then look at how the normalized embedding loss changes as a function of the embedding dimension. We find that some differences are visible in small networks (e.g., the American college football network), whereas the curves obtained for large networks (e.g., the Cora and Citeseer citation networks) are robust against the choice of the metric. We stress that the asymptotic value of the loss function differs from metric to metric, suggesting that some metrics may be better than others at encoding pairwise similarity among nodes in the embedding space. This finding is perfectly in line with the results by Hoff and collaborators^{27}. However, the saturation of the content of information is similar irrespective of the actual distance metric adopted, supporting our main claims.
Similar conclusions can be drawn by looking at the values of \({\hat{d}}_{o}(\epsilon )\) obtained for node2vec and LE embeddings of synthetic networks with pre-imposed community structure (Supplementary Figs. 3–6). While there is a distinction between the actual values of the estimates \({\hat{d}}_{o}(\epsilon )\) for the two embedding methods, everything else is consistent. Specifically, the loss function L(d) saturates as the embedding dimension increases. After the saturation, the performance in graph clustering does not improve anymore. For the LE embedding, the dimension value at saturation roughly corresponds to the point where a neat gap between the eigenvalues of the normalized graph Laplacian is visible. The location of a neat gap among the eigenvalues of a graph operator is the standard approach for the identification of a suitable value for the embedding dimension in spectral dimensionality reduction methods. Our results indicate that the value of \({\hat{d}}_{o}(\epsilon )\) does not depend on the specific metric used.
Embedding dimension and variance of network embeddings
Our definition of d_{o}(ϵ) in Eq. (1) corresponds to the minimal embedding dimension necessary to learn the structure of a network with a sufficient level of accuracy. However, we are not in a position to tell whether the desired level of accuracy is reached because the embedding algorithm is actually encoding the network structure in an optimal way, or instead because the algorithm is overfitting the network. In this section, we perform a simple analysis in an attempt to provide some clarification regarding this aspect of the problem.
We rely on the embedding coherence function of Eq. (7) to quantify the accuracy with which a given algorithm embeds a network in d dimensions. Specifically, we apply node2vec K = 10 times to the same network for every value of the embedding dimension d to obtain a set of embeddings \({V}_{d}^{(1)},\ldots ,{V}_{d}^{(K)}\). Here, the diversity of the embeddings is due to the stochastic nature of the algorithm, which relies on finite-size samples of random walks to embed the network. We then quantify the embedding coherence S^{2}(d) of the algorithm at dimension d using the embedding coherence function of Eq. (7) as \({S}^{2}(d):={S}^{2}({V}_{d}^{(1)},\ldots ,{V}_{d}^{(K)})\).
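Eq. (7) is not reproduced in this section; the sketch below assumes one plausible rotation-invariant definition of embedding coherence, namely the element-wise variance, across the K runs, of the pairwise cosine-similarity matrices. The actual Eq. (7) may differ; the sketch is meant only to convey the idea of measuring run-to-run variability.

```python
import numpy as np

def embedding_coherence(embeddings):
    # Each run is summarized by its pairwise cosine-similarity matrix, which
    # is invariant under rotations of the embedding space; S^2 is then the
    # mean element-wise variance of these matrices across the K runs
    # (zero variance = perfectly coherent runs).
    sims = []
    for V in embeddings:
        U = V / np.clip(np.linalg.norm(V, axis=1, keepdims=True), 1e-12, None)
        sims.append(U @ U.T)
    return float(np.mean(np.var(np.stack(sims), axis=0)))

rng = np.random.default_rng(1)
runs_same = [rng.normal(size=(30, 8))] * 3           # K = 3 identical runs
runs_diff = [rng.normal(size=(30, 8)) for _ in range(3)]  # K = 3 independent runs
```

With this definition, identical (or merely rotated) runs yield a coherence of zero, while independent random embeddings yield a strictly positive value.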
Figure 5 shows how S^{2}(d) behaves as a function of d for the same networks as considered earlier. S^{2}(d) is a non-monotonic function of d, showing a minimum value at a certain dimension d. We note that S^{2}(d) decreases quickly toward its minimum, but increases only slowly afterwards. This finding concurs with previous observations that using higher-than-necessary dimensions does not critically affect the usefulness of the embeddings, and may explain why the performance of embedding algorithms in tasks such as link prediction (Fig. 1) and graph clustering (Fig. 2) does not deteriorate as d grows.
Discussion
In this paper, we proposed a principled solution to the problem of defining and identifying a suitable value of the embedding dimension of graphs. Our method is an extension to network data of the technique introduced by Yin et al.^{20} for word embedding. The spirit of the approach is similar to the one adopted in spectral methods for dimensionality reduction, where a lossy representation of a network is obtained by suppressing an arbitrary number of eigencomponents of a graph operator (e.g., the combinatorial and normalized Laplacians, or the adjacency matrix). The amount of information lost by neglecting some of the eigencomponents is a function of the eigenvalues of the operator used in the reduction, and represents the main criterion to assess the effectiveness of the reduction itself. We generalize this idea to an arbitrary graph embedding method. Accordingly, the effectiveness of an embedding in a low-dimensional space is measured as the ability to reproduce as closely as possible the best network representation that the embedding method can achieve. We assume that the best representation is obtained when the network is embedded in a space with dimension equal to the size of the network itself. We then measure the amount of information lost by reducing the dimension of the network representation. Finally, we identify a reasonable value of the embedding dimension, namely d_{o}(ϵ), as the smallest dimension value able to provide a sufficiently accurate (i.e., within an ϵ margin) representation of the network structure.
We validated the approach in two different ways. First, we applied it to embeddings based on the spectrum of the normalized graph Laplacian^{12}, and recovered values of d_{o}(ϵ) compatible with those obtained through well-established spectral criteria. Second, we validated the method by applying it to cases where we could estimate the performance of embeddings in downstream tasks. We found that the embedding dimension value determined by our method roughly corresponds to the dimension value at which the performance in standard tasks such as link prediction and graph clustering saturates to its maximum value. After validating the method, we applied it systematically to a large collection of real-world networks, finding that the estimated d_{o}(ϵ) is much smaller than the number of nodes in the network.
In relation to the existing debate on the dimensionality of networks^{28,34}, we believe that our method may serve to provide an ex-post estimate of the network's intrinsic dimension. If the loss function approaches zero for \({\hat{d}}_{o}={\hat{d}}_{o}(\epsilon =0)<N\), then the exact representation of the network (i.e., the one valid for d = N) may be obtained in a \({\hat{d}}_{o}\)-dimensional space. However, one cannot exclude the existence of a better method able to perfectly embed the same network in a space with \(d\,<\,{\hat{d}}_{o}\) dimensions. Our systematic analysis of a corpus of 135 real-world networks bolsters the idea that the number of dimensions actually needed to describe the structure of a network is typically low^{25,26,27}.
Methods
Networks
In this paper, we focus our attention on unweighted and undirected networks. The topology of a given network with N nodes is described by its N × N adjacency matrix A, with entries A_{ij} = A_{ji} = 1 if a connection between nodes i and j exists, and A_{ij} = A_{ji} = 0 otherwise.
Empirical networks
We consider a corpus of 135 network datasets. The sizes of these networks range from N = 16 to N = 1,965,206 nodes. We consider networks from different domains, including social, technological, information, biological, and transportation networks (Supplementary Table 1). We ignore directions and weights of the edges, thus making every network undirected and unweighted. For illustrative purposes, we explicitly consider two real-world networks: the American college football network^{37} and the Cora citation network^{38}. The American college football network is composed of N = 115 nodes and M = 613 edges. Each node is a college football team, and two teams are connected by an edge if they played against each other during the 2000 season. The Cora citation network is composed of N = 2,708 nodes and M = 5,429 edges. Each node is a scientific paper, and edges represent citations among papers.
Network models
In addition to the empirical networks, we consider three generative network models: the Erdős–Rényi (ER) model^{39}, the Barabási–Albert (BA) model^{40}, and the stochastic block (SB) model^{29}.
First, the ER model generates random networks with a Poisson degree distribution with average 〈k〉 = Np. Here, N is the total number of nodes and p is the connection probability among pairs of nodes.
Second, the BA model is a growing network model that generates graphs obeying power-law degree distributions P(k) ~ k^{−γ}, with degree exponent γ = 3. To generate a single instance of the model, we specify the total number of nodes N of the network and the number of edges m that each node introduces to the network during the growth process (the total number of edges in the network is M = Nm).
Finally, the SB model generates random networks with pre-imposed block structure, which solely determines the connection probabilities between nodes. Here, we implement the simplest version of the model, where N nodes are divided into C groups of identical size N/C. Pairs of nodes within the same group are connected with probability p_{in}; pairs of nodes belonging to different groups are connected with probability p_{out}. The expected total number of edges in the network is approximately M = (N/2)[(N/C)p_{in} + (C − 1)(N/C)p_{out}].
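The three models can be generated directly from the definitions above. A minimal sketch producing adjacency matrices (function names are illustrative; the SB sketch assumes N divisible by C):

```python
import numpy as np

def er_graph(N, p, rng):
    """Erdos-Renyi G(N, p): every pair of nodes is linked with probability p."""
    A = np.triu((rng.random((N, N)) < p).astype(int), 1)
    return A + A.T

def sb_graph(N, C, p_in, p_out, rng):
    """Stochastic block model with C equal-sized groups (N divisible by C)."""
    g = np.repeat(np.arange(C), N // C)                    # group labels
    P = np.where(g[:, None] == g[None, :], p_in, p_out)    # pairwise link probs
    A = np.triu((rng.random((N, N)) < P).astype(int), 1)
    return A + A.T

def ba_graph(N, m, rng):
    """Barabasi-Albert growth: each new node adds m edges preferentially."""
    A = np.zeros((N, N), dtype=int)
    A[:m + 1, :m + 1] = 1 - np.eye(m + 1, dtype=int)       # seed clique
    targets = [i for i in range(m + 1) for _ in range(m)]  # degree-weighted list
    for new in range(m + 1, N):
        chosen = set()
        while len(chosen) < m:                 # m distinct, degree-biased picks
            chosen.add(targets[rng.integers(len(targets))])
        for t in chosen:
            A[new, t] = A[t, new] = 1
            targets += [new, t]
    return A

rng = np.random.default_rng(42)
A_er = er_graph(500, 0.02, rng)           # Poisson degrees, <k> ≈ Np = 10
A_sb = sb_graph(300, 3, 0.1, 0.01, rng)   # three planted communities
A_ba = ba_graph(500, 4, rng)              # scale-free degree distribution
```

These generators are sufficient for reproducing the synthetic experiments in spirit; dedicated libraries offer equivalent, more heavily tested implementations.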
Network embedding algorithms
The embedding in a ddimensional space of a network with N nodes is fully specified by the embedding matrix \(V\in {{\mathbb{R}}}^{N\times d}\). The ith row of the matrix V is the vector V_{i,⋅} containing the coordinate values of node i in the embedding. The entire procedure described in this paper can be applied to any embedding technique. Here, as a proof of concept and to ensure the robustness of the results, we consider five different embedding algorithms: node2vec^{5}, LINE^{35}, Laplacian Eigenmaps (LE)^{12}, GraphSAGE^{19} and GraphRNA^{18}. We center the embedding matrix V so that \(\mathop{\sum }\nolimits_{i = 1}^{N}{V}_{i,s}=0\), for all s = 1, …, d.
node2vec^{5} is a popular network embedding algorithm that builds on the word2vec algorithm^{3} through the following analogy: nodes in the network are considered as “words,” and a sequence of nodes explored during a biased random walk is considered as a “sentence.” In our analysis, we fix the number of walks per node to 20, the walk length to 10, and the number of iterations to 10, and we set the two parameters that bias the random walk toward a breadth-first or depth-first exploration both equal to 1. The latter condition makes the embedding algorithm similar to DeepWalk^{4}.
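The “sentence” generation step can be sketched for the unbiased case used here (both bias parameters equal to 1, i.e., first-order, DeepWalk-like walks). The adjacency dictionary below is a hypothetical toy input, not data from the paper.

```python
import random

def random_walks(adj, num_walks=20, walk_length=10, seed=0):
    """Generate unbiased random-walk 'sentences': num_walks walks of walk_length per node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # isolated node: the walk ends early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# A 4-node path graph: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
walks = random_walks(adj, num_walks=20, walk_length=10)
```

These walks would then be fed to a word2vec-style skip-gram model, with node identifiers playing the role of words.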
LINE^{35} is another popular embedding algorithm; it aims at identifying embeddings that preserve first- and second-order neighborhood information. We use the default values for the algorithm parameters: the order of the proximity is set to 2, the number of negative samples to 5, and the initial learning rate to 0.025.
Laplacian Eigenmaps (LE)^{12} is a classical embedding method based on the spectral decomposition of the graph Laplacian up to a desired eigencomponent. The importance of the various eigencomponents is inversely proportional to the magnitude of the corresponding eigenvalue (excluding the first trivial eigenvalue equal to zero), so that an LE embedding in a given dimension d represents each node by its components along the d ranked eigenvectors.
GraphSAGE^{19} is an inductive framework for computing node embeddings in an unsupervised, supervised, or semi-supervised way. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from the nodes’ local neighborhoods. There are several ways of aggregating feature information from the nodes’ neighbors. Here, we choose the mean aggregator (GraphSAGE-mean), which takes the element-wise mean of the representation vectors of a node’s neighbors. GraphSAGE is based on the Graph Neural Network (GNN) framework. We interpret the number of hidden units in the last layer of the unsupervised GraphSAGE as the embedding dimension. In our experiments, we vary the number of units of the last layer while keeping all other parameters unchanged.
GraphRNA^{18} learns node representations for networks with attributes by feeding bidirectional Recurrent Neural Networks (RNNs) with random-walk sequences. In our experiments, we use the node classification task to evaluate its embedding performance. The number of neural units of the last layer of the bidirectional RNNs is set equal to the number of classes of the nodes in the network. All parameters of the embedding algorithm are kept unchanged, but we vary the number of hidden units in the second-to-last layer of the bidirectional RNNs. We interpret the number of units in the second-to-last layer as the embedding dimension.
Task-based evaluation of network embedding methods
We consider three quantitative evaluation tasks: link prediction, graph clustering and node classification. All tasks are considered as standard tests for the evaluation of the performance of graph embedding methods^{7,18,41}.
Link prediction
For the link prediction task, we use repeated random subsampling validation, applying the following procedure 10 times and reporting the AUC-ROC score (the area under the receiver operating characteristic curve) over these repetitions. We first hide 10% of randomly chosen edges of the original graph; these hidden edges constitute the “positive” sample set. We then sample an equal number of disconnected vertex pairs as the “negative” sample set. The union of the “positive” and “negative” sample sets forms our test set. The training set consists of the remaining 90% of connected node pairs and an equal number of randomly chosen disconnected node pairs. We use the training set both to learn the embedding of the nodes and to learn the probability of connection between pairs of nodes given their embedding. Specifically, for every pair of nodes i and j in the training set, we evaluate the Hadamard product e_{ij} = V_{i,⋅} ⊙ V_{j,⋅}. Note that e_{ij} is a d-dimensional vector. We assume that the probability of connection between nodes i and j is given by

\({P}_{ij}={[1+\exp (-{e}_{ij}^{T}\theta )]}^{-1},\)  (4)
where θ is a d-dimensional parameter vector, and \({e}_{ij}^{T}\theta\) is the dot product between the vectors e_{ij} and θ^{42}. The best estimates of the entries of the vector θ are obtained from the training set via logistic regression. In the absence of training, when all components of the vector θ are identical and positive, the score \({e}_{ij}^{T}\theta\) is proportional to the dot product between the vectors V_{i,⋅} and V_{j,⋅}, so Eq. (4) ranks node pairs by their dot-product similarity. We use Eq. (4) to rank the edges of the test set and evaluate the AUC-ROC score of the link prediction task.
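Under this model, scoring a candidate edge reduces to a logistic function of the Hadamard product; a minimal sketch, with illustrative vector values:

```python
import math

def hadamard(u, v):
    """Element-wise (Hadamard) product of two embedding vectors."""
    return [a * b for a, b in zip(u, v)]

def link_probability(u, v, theta):
    """Logistic link-probability model: P = sigmoid(e_ij^T theta), e_ij = V_i (.) V_j."""
    e = hadamard(u, v)
    score = sum(x * t for x, t in zip(e, theta))
    return 1.0 / (1.0 + math.exp(-score))

# With all theta components identical and positive, the score is a scaled
# dot product of the two node vectors, so the ranking is the same.
u, v = [0.5, -0.2, 0.1], [0.4, -0.3, 0.2]
p = link_probability(u, v, theta=[1.0, 1.0, 1.0])
```

In the paper's pipeline, θ is fitted by logistic regression on the training pairs before probabilities are computed on the test pairs.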
Graph clustering
We take advantage of the SB model to generate graphs with pre-imposed cluster structure. We consider various combinations of the model parameters N, C, p_{in} and p_{out}.
We then perform the embedding of the graph using information from its adjacency matrix only. To retrieve clusters of nodes emerging from the embedding, we use the k-means clustering algorithm^{43}. In the application of the clustering algorithm, we provide additional information by specifying that we are looking for exactly C clusters. The retrieved clusters of nodes are compared against the true clusters imposed in the construction of the SB model. Specifically, we rely on the normalized mutual information (NMI) to measure the overlap between the ground-truth clusters and the detected clusters^{44}. Values of NMI close to one indicate an almost perfect match between the two sets; NMI equals zero if detected and ground-truth clusters have no relation.
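The NMI between a detected and a ground-truth partition can be computed from joint label counts. The sketch below uses the normalization of ref. 44, NMI = 2I(A, B)/[H(A) + H(B)]; it is an illustrative implementation, not the paper's code.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same node set."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information I(A, B) from joint and marginal label frequencies.
    mi = sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), c in joint.items())
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    if ha + hb == 0:
        return 1.0  # both partitions are trivial (a single cluster) and agree
    return 2 * mi / (ha + hb)

# Relabeled but otherwise identical partitions have NMI = 1.
perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])
```

Note that NMI is invariant under relabeling of the clusters, which is what makes it suitable for comparing k-means output to ground-truth groups.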
Also, we apply graph clustering to real-world networks. We still identify clusters via the k-means algorithm applied to graph embeddings. The retrieved clusters of nodes are compared against the clusters identified by the popular community detection method Infomap^{45}. In the application of the k-means algorithm, we specify that we are looking for a number of clusters equal to the number of communities found by Infomap. NMI is used to quantify the level of similarity between the clusters identified by the two methods.
Node classification
In the node classification task, given the embedding matrix V and the labels of all nodes, we first randomly select 90% of the node representations from V to form the training set. The remaining 10% of the nodes constitute the test set. Within the training set, we randomly select 10% of the node representations to form the validation set. To conduct the node classification task, we leverage the training set and the corresponding labels to train a multilayer perceptron classifier, and we use the validation set to fine-tune some hyperparameters. The micro-average metric is used to quantify the accuracy of the node classification over the test set. The results of the node classification task reported in our paper are given by the arithmetic average over 10 independent runs of the above procedure.
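The nested 90/10 split can be sketched generically as follows; the classifier training itself, done with standard tooling, is omitted, and the split sizes are the only assumption taken from the text.

```python
import random

def split_nodes(nodes, seed=0):
    """90/10 train/test split, then hold out 10% of the training set for validation."""
    rng = random.Random(seed)
    nodes = list(nodes)
    rng.shuffle(nodes)
    n_test = len(nodes) // 10
    test, train = nodes[:n_test], nodes[n_test:]
    n_val = len(train) // 10
    val, train = train[:n_val], train[n_val:]
    return train, val, test

train, val, test = split_nodes(range(1000))
```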
Evaluation metrics of network embeddings
Here, we define two metrics for assessing the quality of network embeddings.
Normalized embedding loss function
We define a loss function similar to the Pairwise Inner Product (PIP) loss function used for word embeddings^{20}. The metric can be used to compare any pair of embeddings, regardless of the way they are obtained, as long as they refer to the same network with the same set of nodes. The function takes two inputs V^{(a)} and V^{(b)}, respectively representing the matrices of the embeddings a and b that we want to compare. We preprocess each of these matrices by calculating the cosine similarity between all node pairs. For the embedding matrix V^{(a)}, we compute

\({[C({V}^{(a)})]}_{i,j}=\frac{{V}_{i,\cdot }^{(a)}\cdot {V}_{j,\cdot }^{(a)}}{\parallel {V}_{i,\cdot }^{(a)}\parallel \ \parallel {V}_{j,\cdot }^{(a)}\parallel }.\)  (5)
A similar expression is used to compute C(V^{(b)}) for the embedding matrix V^{(b)}. As already stated, the embedding matrices V^{(a)} and V^{(b)} are appropriately centered to account for the fact that cosine similarity is not a translation-invariant metric. C(V^{(a)}) is an N × N matrix that captures the pairwise similarity between nodes; \({[C({V}^{(a)})]}_{i,j}\) corresponds to the cosine similarity between nodes i and j in the embedding V^{(a)}.
The normalized embedding loss function L(V^{(a)}, V^{(b)}) is defined as the average, over all possible node pairs, of the absolute differences between the cosine similarities of the two embeddings, i.e.,

\(L({V}^{(a)},{V}^{(b)})=\frac{2}{N(N-1)}\mathop{\sum}\nolimits_{i < j}| {[C({V}^{(a)})]}_{i,j}-{[C({V}^{(b)})]}_{i,j}| .\)  (6)
L(V^{(a)}, V^{(b)}) = 0 if the two embeddings a and b are equivalent. By contrast, we expect L(V^{(a)}, V^{(b)}) ≃ 2 if the two embeddings provide radically different representations of the network.
Cosine similarity is chosen for its simplicity. We verify that using other similarity/distance metrics in place of cosine similarity in the normalized embedding loss function of Eq. (6) provides outcomes qualitatively similar to those reported in the paper (Supplementary Fig. 10).
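A minimal sketch of the normalized embedding loss: compute the two cosine similarity matrices and average the absolute differences over node pairs. The toy vectors below are illustrative and skip the centering step; because cosine similarity is rotation invariant, a rotated copy of an embedding yields zero loss.

```python
import math

def cosine_matrix(V):
    """Pairwise cosine similarities between the rows (node vectors) of V."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in V]
    return [[sum(a * b for a, b in zip(V[i], V[j])) / (norms[i] * norms[j])
             for j in range(len(V))] for i in range(len(V))]

def embedding_loss(Va, Vb):
    """Mean absolute difference of pairwise cosine similarities between two embeddings."""
    Ca, Cb = cosine_matrix(Va), cosine_matrix(Vb)
    N = len(Va)
    pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
    return sum(abs(Ca[i][j] - Cb[i][j]) for i, j in pairs) / len(pairs)

Va = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Vb = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]  # Va rotated by 90 degrees
loss = embedding_loss(Va, Vb)  # all pairwise cosines are preserved
```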
Embedding variance
Another metric that we use is “embedding variance,” which estimates the level of coherence among a set of multiple embeddings of the same network. The metric takes a set {V^{(1)}, V^{(2)}, …, V^{(K)}} of K embedding matrices of the same network as its input and calculates the average value of the variance of the node-pair similarities across the embeddings. We obtain the cosine similarity matrix C(V^{(k)}) for each of the k = 1, …, K embeddings using Eq. (5). We compute the embedding variance as

\({S}^{2}({V}^{(1)},{V}^{(2)},\ldots ,{V}^{(K)})=\frac{2}{KN(N-1)}\mathop{\sum}\nolimits_{k = 1}^{K}\mathop{\sum}\nolimits_{i < j}{({[C({V}^{(k)})]}_{i,j}-{\bar{C}}_{i,j})}^{2},\)  (7)
where

\({\bar{C}}_{i,j}=\frac{1}{K}\mathop{\sum}\nolimits_{k = 1}^{K}{[C({V}^{(k)})]}_{i,j}\)  (8)
is the average value of the cosine similarity of nodes i and j over the entire set of embeddings. Equation (7) equals the variance of the cosine similarity over all pairs of nodes in all embeddings. S^{2}(V^{(1)}, V^{(2)}, …, V^{(K)}) = 0 if the embeddings are such that the cosine similarity of all node pairs is always the same across the entire set of embeddings. High values of S^{2}(V^{(1)}, V^{(2)}, …, V^{(K)}) indicate low coherence among the embeddings in the set. Note that, contrary to Eq. (6), Eq. (7) applies only to embeddings of the same dimension.
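The embedding variance admits an equally short sketch: for each node pair, average the cosine similarity across the K embeddings, then accumulate the squared deviations. As above, this is an illustrative pure-Python implementation on toy vectors.

```python
import math

def cosine_matrix(V):
    """Pairwise cosine similarities between the rows (node vectors) of V."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in V]
    return [[sum(a * b for a, b in zip(V[i], V[j])) / (norms[i] * norms[j])
             for j in range(len(V))] for i in range(len(V))]

def embedding_variance(Vs):
    """Variance of node-pair cosine similarity across a set of K embeddings."""
    K, N = len(Vs), len(Vs[0])
    Cs = [cosine_matrix(V) for V in Vs]
    pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
    total = 0.0
    for i, j in pairs:
        mean = sum(C[i][j] for C in Cs) / K  # average similarity of pair (i, j)
        total += sum((C[i][j] - mean) ** 2 for C in Cs) / K
    return total / len(pairs)

# Identical embeddings: every pair similarity is constant, so the variance is zero.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s2 = embedding_variance([V, V, V])
```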
Data availability
All network datasets considered in this paper have been obtained from http://konect.cc/networks/. We include complete information on how to obtain individual datasets in Supplementary Table 1.
Code availability
The Python script used for this project is available online at https://zenodo.org/record/4757121#.YKKPLCcZp8^{46}.
References
Bengio, Y., Ducharme, R., Vincent, P. & Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003).
Pennington, J., Socher, R. & Manning, C. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543 (2014).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119 (2013).
Perozzi, B., Al-Rfou, R. & Skiena, S. DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710 (2014).
Grover, A. & Leskovec, J. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864 (2016).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR) (2017).
Hamilton, W. L., Ying, R. & Leskovec, J. Representation learning on graphs: Methods and applications. IEEE Data Eng. Bull. 40, 52–74 (2017).
An, J., Kwak, H. & Ahn, Y.-Y. SemAxis: A lightweight framework to characterize domain-specific word semantics beyond sentiment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics 1, 2450–2461 (2018).
Hamilton, W. L., Leskovec, J. & Jurafsky, D. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics 1, 1489–1501 (2016).
Garg, N., Schiebinger, L., Jurafsky, D. & Zou, J. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proc. Natl. Acad. Sci. USA 115, E3635–E3644 (2018).
Kozlowski, A. C., Taddy, M. & Evans, J. A. The geometry of culture: Analyzing the meanings of class through word embeddings. Am. Sociol. Rev. 84, 905–949 (2019).
Belkin, M. & Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14, 585–591 (2001).
Serrano, M. A., Krioukov, D. & Boguná, M. Selfsimilarity of complex networks and hidden metric spaces. Phys. Rev. Lett. 100, 078701 (2008).
Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A. & Boguná, M. Hyperbolic geometry of complex networks. Phys. Rev. E 82, 036106 (2010).
Boguná, M., Papadopoulos, F. & Krioukov, D. Sustaining the internet with hyperbolic mapping. Nat. Commun. 1, 62 (2010).
Muscoloni, A., Thomas, J. M., Ciucci, S., Bianconi, G. & Cannistraci, C. V. Machine learning meets complex networks via coalescent embedding in the hyperbolic space. Nat. Commun. 8, 1–19 (2017).
Faqeeh, A., Osat, S. & Radicchi, F. Characterizing the analogy between hyperbolic embedding and community structure of complex networks. Phys. Rev. Lett. 121, 098301 (2018).
Huang, X., Song, Q., Li, Y. & Hu, X. Graph recurrent networks with attributed random walks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 732–740 (2019).
Hamilton, W. L., Ying, R. & Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 1025–1035 (2017).
Yin, Z. & Shen, Y. On the dimensionality of word embedding. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 895–906 (2018).
Pan, S. et al. Adversarially regularized graph autoencoder for graph embedding. In International Joint Conference on Artificial Intelligence, 2609–2615 (2018).
Lee, J. R., Gharan, S. O. & Trevisan, L. Multiway spectral partitioning and higherorder Cheeger inequalities. J. ACM 61, 1–30 (2014).
Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V. & Smola, A. J. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web, 37–48 (2013).
McPherson, J. M. & Ranger-Moore, J. R. Evolution on a dancing landscape: organizations and networks in dynamic Blau space. Soc. Forces 70, 19–42 (1991).
Bonato, A. et al. Dimensionality of social networks using motifs and eigenvalues. PloS One 9, e106052 (2014).
Bonato, A. The geometry of social networks. Notes Can. Math. Soc. 48, 12–13 (2017).
Hoff, P. D., Raftery, A. E. & Handcock, M. S. Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97, 1090–1098 (2002).
Seshadhri, C., Sharma, A., Stolman, A. & Goel, A. The impossibility of lowrank representations for trianglerich complex networks. Proc. Natl. Acad. Sci. USA 117, 5631–5637 (2020).
Holland, P. W., Laskey, K. B. & Leinhardt, S. Stochastic blockmodels: first steps. Soc. Netw. 5, 109–137 (1983).
Karrer, B. & Newman, M. E. Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107 (2011).
Newman, M. Equivalence between modularity optimization and maximum likelihood methods for community detection. Phys. Rev. E 94, 052315 (2016).
Young, S. J. & Scheinerman, E. R. Random dot product graph models for social networks. In International Workshop on Algorithms and Models for the Web-Graph, 138–149 (2007).
Athreya, A. et al. Statistical inference on random dot product graphs: a survey. J. Mach. Learn. Res. 18, 8393–8484 (2017).
Chanpuriya, S., Musco, C., Sotiropoulos, K. & Tsourakakis, C. E. Node embeddings and exact low-rank representations of complex networks. In Advances in Neural Information Processing Systems 33 (2020).
Tang, J. et al. LINE: Large-scale information network embedding. In International Conference on World Wide Web, 1067–1077 (2015).
Ma, Y., Ren, Z., Jiang, Z., Tang, J. & Yin, D. Multi-dimensional network embedding with hierarchical structure. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 387–395 (2018).
Girvan, M. & Newman, M. E. Community structure in social and biological networks. Proc. Natl Acad. Sci. USA 99, 7821–7826 (2002).
Sen, P. et al. Collective classification in network data. AI Magazine 29, 93 (2008).
Erdős, P. & Rényi, A. On random graphs I. Publ. Math. Debrecen 6, 290–297 (1959).
Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512 (1999).
Goyal, P. & Ferrara, E. Graph embedding techniques, applications, and performance: a survey. Knowl. Based Syst. 151, 78–94 (2018).
Stellargraph package. https://stellargraph.readthedocs.io/en/stable/demos/linkprediction/node2veclinkprediction.html.
MacQueen, J. et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1, 281–297 (Oakland, CA, USA, 1967).
Danon, L., Díaz-Guilera, A., Duch, J. & Arenas, A. Comparing community structure identification. J. Stat. Mech. Theory Exp. 2005, P09008 (2005).
Rosvall, M. & Bergstrom, C. T. Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. USA 105, 1118–1123 (2008).
Gu, W. W., Tandon, A., Ahn, Y. Y. & Radicchi, F. Principled approach to the selection of the embedding dimension of networks, Zenodo, https://doi.org/10.5281/zenodo.4757120 (2021).
Acknowledgements
W.W. acknowledges financial support from the National Science Foundation of China (61903020), the Swarma Club and the program of China Scholarships Council (No. 201806040107). Y.-Y.A. acknowledges support from the Air Force Office of Scientific Research (FA9550-19-1-0391). F.R. acknowledges support from the National Science Foundation (CMMI-1552487) and the US Army Research Office (W911NF-16-1-0104).
Author information
Authors and Affiliations
Contributions
W.W., Y.Y.A., and F.R. wrote the paper. W.W., A.T., Y.Y.A., and F.R. designed the experiments. W.W. and A.T. performed the experiments.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gu, W., Tandon, A., Ahn, Y.-Y. et al. Principled approach to the selection of the embedding dimension of networks. Nat Commun 12, 3772 (2021). https://doi.org/10.1038/s41467-021-23795-5