The Hidden Flow Structure and Metric Space of Network Embedding Algorithms Based on Random Walks

Network embedding, which encodes all vertices in a network as a set of numerical vectors in accordance with its local and global structures, has drawn widespread attention. Network embedding not only captures significant structural features of a network but also learns latent vector representations of the nodes, which provide support for a variety of applications, such as visualization, clustering, link prediction, node classification, and recommendation. As the latest progress in this line of research, several algorithms based on random walks have been devised. Although these algorithms have drawn much attention for their high learning efficiency and accuracy, they still lack a theoretical explanation, and their transparency has been doubted. Here, we propose an approach based on the open-flow network model to reveal the underlying flow structure and its hidden metric space for different random walk strategies on networks. We show that the essence of embedding based on random walks is the latent metric structure defined on the open-flow network. This not only deepens our understanding of random-walk-based embedding algorithms but also helps in finding new potential applications in network embedding.


Introduction
Complex networks, as high-level abstractions of complex systems, have been widely applied in different areas, such as biology, sociology, economics and technology 1-6 . Recent progress has revealed a hidden geometric structure in networks 7,8 that not only deepens our understanding of the multiscale nature and intrinsic heterogeneity of networks but also provides a useful tool to unravel the regularity of some dynamic processes on networks 7,9-14 . At the same time, researchers in the machine learning community have developed several techniques to embed a whole network in a high-dimensional space 15-20 such that the vector of each node can be used as abstract features fed into neural networks for downstream tasks. It has been demonstrated that such network embeddings have wide applications, such as community detection, node classification and link prediction 16,21 . Various methods have been proposed in the network embedding field, such as Principal Component Analysis, Multi-Dimensional Scaling, IsoMap and their extensions 22-26 . These embedding methods give good performance when the network is small, but most of them cannot be effectively applied to networks containing millions of nodes and billions of edges.
Recently, there has been a surge of works proposing alternative ways to embed networks by training neural networks 15,16 in various approaches inspired by natural language processing techniques 27-30 . To build a connection between a language and a network, a random walk is implemented on the network such that the node sequences generated by the walks are treated as sentences in which nodes resemble words. After the sequences are generated, skip-gram in word2vec 30 , one of the most famous word embedding algorithms developed in the deep learning community, can be efficiently applied to the sequences. Among these random-walk-based approaches, deepwalk 15 and node2vec 16 have drawn wide attention for their high training speed and high classification accuracy. Both algorithms regard random walks as a paradigmatic dynamic process on a network that can reveal both the local and global network structures. Several extended works unravel the relation between the fundamental word-context co-occurrence matrix in skip-gram-based embedding algorithms and the multiple-step transition matrix. Levy et al. 31 proved that skip-gram models implicitly factorize a word-context matrix. Tang et al. 17 take 1-step and 2-step local relational co-occurrence into consideration, and Cao et al. 18 argue that skip-gram is an equally weighted linear combination of k-step relational information. These works were proposed soon after word2vec and deepwalk were presented.
Although word2vec and network embedding have been successfully applied to real problems, several drawbacks remain. First, explicit and fundamental explanations are needed for why neural-based algorithms work so well, since these algorithms are essentially black boxes. Second, how the values of the hyper-parameters should be set is still poorly understood. Third, explicit and intuitive interpretations of the embedding vector of each node and of the inner structure of the embedding space are needed. An explanation that provides a general framework unifying deepwalk, node2vec and other random-walk-based algorithms is therefore desirable.
In this paper, we put forward a novel framework based on open-flow networks to deepen the understanding of network embedding algorithms based on random walks. We first use the so-called open-flow network model to characterize different random walk strategies on a single background network. Then, we note that there is a natural metric, called the flow distance, defined on these flow networks. Finally, the hidden metric space framed by the flow distances can be derived and, interestingly, this metric space is similar to the embedding spaces produced by the deepwalk and node2vec algorithms. We uncover that the embedding algorithms based on neural networks are merely attempting to realize the hidden metric of the flow networks, and the correlation between the flow distance and the node2vec embedding distance is up to 0.91. With this understanding, we propose a new method, called Flow-based Geometric Embedding (FGE), which has no free parameters and performs excellently in applications such as clustering and node centrality ranking.

Methods
Both deepwalk and node2vec aim to learn continuous feature representations of nodes by sampling truncated random walk sequences from the graph as mimic sentences to feed into the skip-gram algorithm of word2vec. The difference lies in the random walk strategy: deepwalk implements a common unbiased random walk on the graph, in which the edges are visited in accordance with their relative weights at the local node, while node2vec employs a biased random walk in which the visiting probability is adjusted by two parameters p and q. Since node2vec reduces to deepwalk when p = 1 and q = 1, it can uncover much richer structures of a network; thus, we discuss only node2vec in the rest of this paper. Please refer to Algorithm 3 for more concrete details of node2vec.
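As an illustration, the second-order biased walk described above can be sketched as follows. This is a minimal Python sketch of the sampling rule with parameters p and q on an unweighted graph; the graph representation and function name are our own, not the authors' implementation:

```python
import random

def biased_random_walk(adj, start, length, p=1.0, q=1.0):
    """node2vec-style second-order random walk (sketch).

    adj: dict mapping each node to a list of its neighbours.
    p:   return parameter; q: in-out parameter.
    p = q = 1 recovers the unbiased walk used by deepwalk.
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = adj[cur]
        if not neighbours:
            break                           # dead end: truncate the walk
        if len(walk) == 1:
            walk.append(random.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)     # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)         # stays at distance 1 from prev
            else:
                weights.append(1.0 / q)     # moves outward, distance 2
        walk.append(random.choices(neighbours, weights=weights)[0])
    return walk
```

Small p keeps the walker close to its starting region (BFS-like sampling), while small q pushes it outward (DFS-like sampling).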

Constructing Open-flow Networks
To reveal the flow structure behind a random walk strategy (for a given p and q), we constructed an open-flow network model 32 in accordance with the random walk strategy. An open-flow network is a special directed weighted network in which the nodes are identical to those of the original network and the weighted edges represent the actual fluxes realized by a large number of random walks. There are two special nodes, the source and the sink, representing the environment; that is why the network is called open. When a random walker is generated at a given node, a unit of flux is injected into the flow network from the source to that node, and the walker contributes one unit of flux to every edge it visits. When the random walk is truncated, a unit of flux is added from the last node to the sink. A large number of random walkers jumping on the network according to a specific strategy forms a flow structure that can be characterized by the open-flow network model, in which the weight of edge i → j is the number of walkers that moved from i to j.
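The flux-counting procedure above can be sketched as follows. We assume original nodes are labeled 0..n−1 and, as an indexing convention of our own, store the source at index 0 and the sink at index n+1 of the flux matrix:

```python
import numpy as np

def build_flow_network(walks, n):
    """Accumulate the fluxes realized by a set of truncated random walks.

    Returns an (n + 2) x (n + 2) flux matrix F: row/column 0 is the
    source, row/column n + 1 is the sink, and original node i sits at
    index i + 1 (our indexing convention, not the paper's).
    """
    F = np.zeros((n + 2, n + 2))
    for walk in walks:
        F[0, walk[0] + 1] += 1            # source -> first node of the walk
        for a, b in zip(walk, walk[1:]):
            F[a + 1, b + 1] += 1          # one unit of flux per visited edge
        F[walk[-1] + 1, n + 1] += 1       # truncation: last node -> sink
    return F
```

Every walk injects exactly one unit from the source and drains exactly one unit into the sink, so flux is conserved at each internal node.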

Calculating Flow Distance
For a given flow network F with (N + 2) × (N + 2) entries, where the value in the i-th row and j-th column represents the flux from i to j, node 0 represents the source, and the last node represents the sink, the flow distance c_ij between any pair of nodes i and j is defined as the average number of steps needed for a random walker to jump from i to j and finally return back to i along the network. It can be expressed as

c_ij = c_ji = l_ij + l_ji,    l_ij = (U^2)_ij / u_ij − (U^2)_jj / u_jj,

where l_ij is the average number of steps to first reach j from i, and m_ij is the transition probability from i to j, defined as

m_ij = f_ij / Σ_k f_ik,

where f_ij is the total flow from node i to node j. The pseudo probability matrix U is defined as 32

U = Σ_{k=0}^{∞} M^k = (I − M)^{−1},

where I is the identity matrix of order N + 2 and M = (m_ij). The entry u_ij is the pseudo probability that a random walker jumps from i to j along all possible paths. Figure 2 is a sample flow network constructed under condition 1 in Figure 1. Algorithm 1 shows the concrete details of how to calculate the flow distance from the matrix F.
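A minimal numerical sketch of this calculation, assuming the first-passage formula l_ij = (U^2)_ij/u_ij − (U^2)_jj/u_jj from the open-flow network model of ref. 32:

```python
import numpy as np

def flow_distance(F):
    """Symmetric flow distance matrix C from a flux matrix F (sketch).

    Builds M (m_ij = f_ij / sum_k f_ik), computes U = (I - M)^-1,
    then the assumed first-passage distances
    l_ij = (U^2)_ij / u_ij - (U^2)_jj / u_jj, and symmetrizes:
    c_ij = l_ij + l_ji. Unreachable pairs get distance infinity.
    """
    F = np.asarray(F, dtype=float)
    rowsum = F.sum(axis=1, keepdims=True)
    # rows with no outflow (e.g. the sink) keep all-zero transition rows
    M = np.divide(F, rowsum, out=np.zeros_like(F), where=rowsum > 0)
    U = np.linalg.inv(np.eye(len(F)) - M)
    U2 = U @ U
    with np.errstate(divide="ignore", invalid="ignore"):
        L = U2 / U - np.diag(U2) / np.diag(U)
    L[~np.isfinite(L)] = np.inf           # u_ij = 0: j never reached from i
    return L + L.T                        # c_ij = c_ji = l_ij + l_ji
```

The matrix I − M is invertible because the network is open: every walker eventually leaks to the sink, so the spectral radius of M is below 1.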

Embedding Networks
To display the hidden information in an open-flow network and visualize the node relationships, we embed the flow distances c_ij into a high-dimensional Euclidean space. We use the SMACOF algorithm 3,33,34 to perform the embedding. This algorithm takes the distance matrix and the number of dimensions as input and tries to place each node in the N-dimensional space such that the between-node distances are preserved as well as possible. Through this embedding, we find a proper vector representation for the n nodes of the network. Please refer to Algorithm 2 for more concrete details of this embedding method. Based on Algorithms 1 and 2, we propose a new network embedding algorithm named Flow-based Geometric Embedding (FGE). With it, we discovered the hidden metric space of random-walk-based network embedding algorithms, such as node2vec, deepwalk, and GraRep. The node vectors obtained from the node2vec algorithm are highly correlated with the Euclidean embedding vectors derived from the flow network; this strong correlation is shown in the "Results" section.
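A minimal SMACOF sketch in pure NumPy, using the standard Guttman transform; this is an illustrative stand-in for Algorithm 2 under our own choice of initialization and iteration count, not the authors' implementation:

```python
import numpy as np

def smacof_embed(D, dim=2, n_iter=1000, seed=0):
    """Place points in `dim` dimensions so that pairwise Euclidean
    distances approximate the dissimilarity matrix D (SMACOF sketch)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    X = rng.standard_normal((n, dim))
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(d, 1.0)          # avoid division by zero on diagonal
        ratio = D / d
        np.fill_diagonal(ratio, 0.0)
        B = -ratio
        np.fill_diagonal(B, ratio.sum(axis=1))
        X = B @ X / n                     # Guttman transform (unit weights)
    return X
```

Each Guttman step is a majorization update, so the stress (sum of squared differences between target and realized distances) never increases from one iteration to the next.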

Results
In this section, we present our results on several empirical networks. First, we applied the FGE algorithm to the Karate network and plotted the open-flow network models behind two different random walk strategies (with different p and q). Next, we compared the FGE algorithm with node2vec by embedding the networks into two-dimensional planes. After that, we correlated the two distances, the flow distance and the Euclidean distance obtained for any given node pair from the node2vec algorithm, to show that the node2vec embedding algorithm attempts to realize the metric of the flow distances. Then, we compared FGE and node2vec on clustering and centrality measuring tasks. Finally, we studied how the parameters of random-walk-based embedding algorithms affect the flow structure and the correlation between the two distances. An overview of the networks considered in our experiments is given in Table 1.

Flow Structure and Representation
This section describes experiments on the Karate graph. The figure shows different flow structures of node2vec with different p and q, where the thickness of a line indicates the amount of flow between nodes. To capture the hidden metric on the flow structures, we fed random walk sequences into node2vec and FGE with the number of walks per node r = 1024, walk length l = 10, and embedding size d = 16. After this training process, each node acquired two vector representations, denoted by θ in FGE and π in node2vec. We then visualized the vector representations using t-SNE 35 , which provided both qualitative and quantitative results for the learned representations. Figure 4 shows the flow structure generated by the unbiased random walk strategy (p = 1, q = 1) together with the visualization of θ and π. Intuitively, we observed that the nodes embedded by the two methods almost overlapped each other, which indicates that the flow distance embedding captured the essence of node2vec. Additionally, the latent relationships between nodes were well expressed. For example, we found that nodes 4, 5, 10 and 16 were all close to each other and belonged to the same community in both algorithms. By analysing the network structure, we also discovered that nodes 14, 15, 20 and 22 were much closer to each other in the node2vec embedding than in the FGE embedding. That is because node2vec only considers n-step connections between nodes, whereas the relationships change when infinite-step connections to other nodes are considered. This change can be captured by the FGE algorithm since it considers all pathways.

Correlations between Distances
To confirm our conclusion that the skip-gram algorithm merely tries to realize the hidden metric of the flow distance defined by random walks, we plotted, for every node pair on the same background network, the flow distance of the flow network generated by the random walks of the FGE algorithm against the Euclidean distance in the embedding space given by the node2vec algorithm trained on the same node sequences. The results showed strong correlations between the two distances. Figure 5 is a heat map in which the X-axis represents the flow distance between nodes i and j and the Y-axis is the node2vec distance. The Pearson correlation between the two distances was 0.90 with a p-value of 0.001 in Figure 5(a) and 0.83 with a p-value of 0 in Figure 5(b). The correlation indicates a highly linear relationship between the paired data.
To show the generality of our results, we performed the same experiments on different datasets, and the accuracy of the experiments was enhanced by averaging the correlation values within each dataset. The results in Table 2 show that there is a strong connection between the flow distance and the node2vec distance. We also found that the correlation is not sensitive to the walking strategy, even though different walking strategies generate different neighborhoods and thus different metric distances: all these walking strategies are captured by the flow distance. The flow distance can therefore reveal the latent space of random-walk-based network embedding algorithms such as node2vec and deepwalk, and the FGE algorithm can reveal the latent relationships between nodes in graph embedding.
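The pairwise comparison behind these correlation values can be sketched as follows, taking only the upper triangle of each distance matrix so that every node pair is counted once (the function name is ours):

```python
import numpy as np

def distance_correlation(C_flow, X_emb):
    """Pearson correlation between flow distances and the Euclidean
    distances in an embedding, over all unordered node pairs (sketch).

    C_flow: symmetric flow distance matrix.
    X_emb:  node-by-dimension matrix of embedding vectors.
    """
    diff = X_emb[:, None, :] - X_emb[None, :, :]
    D_emb = np.sqrt((diff ** 2).sum(-1))      # pairwise Euclidean distances
    iu = np.triu_indices(len(C_flow), k=1)    # upper triangle, no diagonal
    return np.corrcoef(C_flow[iu], D_emb[iu])[0, 1]
```

A value near 1 means the embedding distances are an affine image of the flow distances, which is exactly the relationship reported in Table 2.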

Node Clustering
To further show the similarity between our method and node2vec, we compared the two approaches on node clustering. In complex network studies, node clustering essentially amounts to community detection, which is important in various contexts 36-38 . We performed k-means clustering on the node vectors θ and π with r = 1024, d = 2 and l = 10. The number of clusters was determined using the average silhouette coefficient as the criterion. According to the silhouette value, we aggregated the graph into 4 clusters, each of which is regarded as a community. Here, our method was applied to the karate club graph, as shown in Figure 4, in which different vertex colors represent different communities of the input graph. The clusters obtained using the two methods overlapped to a degree of 100%. We also performed clustering experiments on other datasets, such as China Click Websites and the Airline Network; the clustering results were likewise identical.
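The clustering procedure, with the number of clusters selected by the average silhouette coefficient, can be sketched as follows using scikit-learn; the candidate range of k and the function name are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_embeddings(X, k_range=range(2, 8), seed=0):
    """Cluster node vectors (theta from FGE or pi from node2vec),
    choosing k by the average silhouette coefficient (sketch)."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)   # mean silhouette over nodes
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

Running this on both θ and π and comparing the label partitions (up to label permutation) gives the overlap figure reported above.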

Centrality Measurement
We showed that understanding network embedding from the angle of the flow structure provides not only new insights but also new applications, such as centrality measurement. The centrality of nodes is a key issue in network analysis 39-41 , and a variety of centrality measures have been developed. Here, we showed that the average distance from a focus node to all other nodes can be treated as a new type of node centrality measure. Formally, we defined a new metric to measure the centrality of node i based on FGE as

C(i) = (1 / (N − 1)) Σ_{j ≠ i} c_ij,

that is, the average flow distance from the focus node to all other nodes. This definition is useful because nodes close to all other nodes always have tight connections and high traffic. Because the flow distances are highly correlated with the Euclidean distances in the node2vec embedding, this definition also works for the node2vec algorithm; that is, we can measure each node's centrality through its distances to all other nodes in the embedding Euclidean space. Furthermore, we can read the centrality information directly from the embedding graph, because the nodes with high centrality (small average distances) are always concentrated in the central area of the embedding graph.
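The centrality defined above can be sketched as follows (the helper names are ours):

```python
import numpy as np

def flow_centrality(C):
    """Average flow distance from each node to all others (sketch).
    Smaller values mean a more central node."""
    n = len(C)
    return C.sum(axis=1) / (n - 1)

def rank_by_centrality(C, names):
    """Return node names ordered from most to least central."""
    cent = flow_centrality(C)
    return [names[i] for i in np.argsort(cent)]
```

The same two functions apply unchanged to a matrix of pairwise Euclidean distances in the node2vec embedding space, which is how the two rankings in Table 3 can be compared.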
We tested the node centrality on the China Click Websites dataset, which contains approximately 5 years of browsing data from more than 30,000 online volunteers. We calculated each website's centrality based on its flow distance matrix and its node2vec distance. We found that popular websites always have a small average distance because they usually have more traveling paths to other websites; therefore, the smaller the average flow distance, the more central the website's position. We ranked the websites in accordance with their centrality and then compared the two methods with other ranking methods, such as PageRank and total traffic (the number of clicks for each website). The rankings for the top 10 websites are listed in Table 3. We found that the ranking orders of the flow distance and node2vec were nearly the same. We also discovered that some high-traffic websites, such as Tmall (a popular shopping website) and 163.com, have lower ranks, whereas baidu.com and qq.com have high ranks even though their total traffic is not as heavy. That is because baidu.com and qq.com are bridges between the real and virtual worlds.

Parameter Sensitivity
Random-walk-based embedding algorithms involve a number of sensitive parameters. To evaluate how the parameters affect the correlation between the two distances, we conducted several experiments on the Karate club network. We examined how the embedding size d, the number of walks started per node r, the window size w, and the walk length l influence the correlation. As shown in Figure 6(a), the correlation grew as the number of walks increased and tended to saturate when that number reached 512. This indicates that the node2vec embedding algorithm merely tries to realize the hidden metric of the flow structure of the random walk, and its performance increases as more samples are drawn. When the number of walks is small, the neural network of the skip-gram algorithm behind node2vec is over-fitted: a higher embedding size d implies more parameters to train, and accordingly the correlation decreased with the embedding size d (Figure 6(a)). However, there was a slight decreasing trend in the correlation coefficient when the number of walks exceeded 512. We speculate that this decrease is due to errors introduced by substituting the open-flow network for the large sample of random walks. The FGE algorithm assumes that the random walk is a Markovian process on the network, meaning that each jump is exclusively determined by the previous position; the second-order random walk of node2vec does not satisfy this condition. Even though this difference exists, as seen in Figure 6(b), we believe that the hidden metric of flows is the more essential reflection of the structural properties of the network. We also evaluated how changes to the window size w and walk length l affect the correlation.
We fixed the embedding size and the number of walks to sensible values (d = 128, r = 512) and varied the window size w and walk length l for each node. The correlation did not change much as w varied, but once the walk length l exceeded 10, the correlation declined rapidly with further increases in l.

Conclusions and Discussions
In this paper, we reveal the hidden flow structure and metric space of random-walk-based network embedding algorithms by introducing the FGE algorithm. This algorithm takes the node-to-node flows as input; after calculating the flow distances, it learns node representations that encode both global structural and local regularities. The high Pearson correlation between the node2vec representations and the FGE vectors indicates that there is a hidden metric behind random-walk-based network embedding algorithms. The FGE algorithm not only helps in finding this hidden metric space but also works as a novel approach to learn the latent relations between vertices. Experiments on a variety of networks illustrate the effectiveness of this method in revealing the hidden metric space of random-walk-based network embedding algorithms. This finding is important because it not only provides a novel perspective for understanding the essence of network embedding based on random walks but also reveals that skip-gram (the main algorithm in node2vec) is trying to find node representations that match this metric between nodes. With this finding, we first applied node2vec to a centrality measuring task, using the Euclidean distance instead of the cosine distance between nodes to measure node importance. We then validated the Euclidean distances of the node vectors of FGE and node2vec in a clustering task. The outcomes show that the two algorithms give similar clustering and centrality measuring results. The FGE algorithm has no free parameters, so it can serve as a criterion for parameter setting in node2vec. Levy et al. 31 showed that the skip-gram in word2vec implicitly factorizes a word-context matrix of pointwise mutual information (PPMI); in the future, we would like to explore the hidden relationship between the flow distance and pointwise mutual information. Both node2vec and FGE regard the random walk as a paradigmatic dynamic process for revealing network structures.
This sampling strategy consumes a large amount of computing resources to reach a stationary state for each node. Further extensions of FGE could calculate the nodes' flow distances analytically, without sampling.

Additional information
Competing financial interests: The authors declare no competing financial interests.

Table 3. Centrality ranking of the top 10 websites according to the flow distance and the node2vec distance, with comparisons to other ranking methods.

Algorithm 1 FlowDistance(F, N)
Input: node-to-node total flow matrix F; number of nodes N
Output: flow distance matrix C
1: Build the transition matrix M from F
2: Compute the pseudo probability matrix U = (I − M)^−1
3: Compute the first-passage distances l_ij
4: Symmetrize the flow distance matrix C: c_ij = c_ji = l_ij + l_ji

Algorithm 2 SMACOF embedding (fragment): iteratively update the stress function until convergence.

Algorithm 3 node2vec sampling (fragment)
1: for each v_i ∈ V do
2:   Ω_{v_i} = RandomWalk(G, v_i, t)
3: end for
4: Learn features by SkipGram:
5: for each v_j ∈ Ω_{v_i} do
6:   for each u_k ∈ Ω_{v_i}[j − w : j + w] do update the representation