weg2vec: Event embedding for temporal networks

Network embedding techniques are powerful tools for capturing structural regularities in networks and for identifying similarities between their local fabrics. However, conventional network embedding models were developed for static structures, commonly consider nodes only, and are seriously challenged when the network varies in time. Temporal networks may provide an advantage in the description of real systems, but they code more complex information, which only a handful of methods have so far been able to represent effectively. Here, we propose a new method of event embedding of temporal networks, called weg2vec, which builds on temporal and structural similarities of events to learn a low dimensional representation of a temporal network. This projection successfully captures latent structures and similarities between events involving different nodes at different times, and provides ways to predict the final outcome of spreading processes unfolding on the temporal structure.


Results
An embedding method for temporal networks may take a list of temporal interactions as input and provide a lower dimensional representation, in which vectors corresponding to similar nodes or events in the original structure ideally point close to each other. In our pipeline we solve this problem in three consecutive methodological steps. First, we turn the original temporal network into a higher-order representation, which captures pairs of adjacent and consecutive events that may be in a causal relationship. Second, we use this representation to generate environments for each event, sampled from their adjacent neighbours. Finally, we obtain an embedding of events by training a Skip-Gram model on the obtained environments. These steps are schematically drawn in Fig. 1 and introduced in the coming sections.
To demonstrate our method we used four different datasets, all obtained from the SocioPatterns project 36 . These open datasets record the time-evolving physical proximity interactions of a large number of people in different settings such as a conference, a high school, a hospital, or a primary school. The data comes as a long sequence of network snapshots recording simultaneous interactions between any participants every 20 seconds. While for demonstration most of the results in the paper are shown only for the conference and primary school settings, a detailed data description together with the analysis of the other networks is presented in the Supplementary Information.

Figure 1. Schematic presentation of the methodological pipeline of the presented temporal network embedding method, which takes a temporal network to (a) project it into a weighted event graph; (b) to sample a set of environments for each event; (c) and uses it as input for a Skip-Gram model; (d) to obtain an event embedding of the original network.

Weighted event graph representation. A temporal network can be described as a set of events E, where each event e = (i, j, t) records an interaction between nodes i and j at time t. Two events e1 = (i, j, t1) and e2 = (k, l, t2) are adjacent (e1 → e2) if they share at least one node ({i, j} ∩ {k, l} ≠ ∅) and t1 < t2. Note that adjacency in a static network G can be similarly defined between links l1 = (i, j) and l2 = (k, l), which share at least one node. More restrictively, we call two events δt-adjacent (e1 →δt e2) if they are adjacent and t2 − t1 ≤ δt, thus follow each other within a given period of time (0 ≤ δt ≤ τ), where τ corresponds to the total period of time of the interactions. Adjacency introduces a directed relation between events, with orientation respecting their order in time. Using this notion we can formally define a static directed network representation D = (E, E_D), whose nodes are the events and whose directed links E_D connect adjacent events. The obtained network is a directed acyclic graph called the event graph, defined earlier in 28,29 . It can be interpreted as a temporal line graph, which provides a higher-order representation mapping out simultaneously all time-respecting paths of the original temporal network without any loss of information.
Note that, to simplify our representation, if a given event has multiple future adjacent events involving the same pair of nodes, we only consider the earliest one.
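To make the construction concrete, the following minimal Python sketch builds the directed δt-adjacency list of the event graph from a list of (i, j, t) events, keeping only the earliest adjacent event per node pair as just described. Function and variable names are ours, not the authors' reference implementation.

```python
def build_event_graph(events, delta_t):
    """Directed delta t-adjacency between events of a temporal network.

    `events` is a list of (i, j, t) tuples. Two events are adjacent if they
    share at least one node and follow each other in time; they are
    delta t-adjacent if, additionally, their time gap is at most `delta_t`.
    Illustrative sketch, not the authors' reference implementation.
    """
    events = sorted(events, key=lambda e: e[2])
    edges = []  # directed event-graph links (e1 -> e2)
    for a, (i, j, t1) in enumerate(events):
        seen = set()  # keep only the earliest adjacent event per node pair
        for k, l, t2 in events[a + 1:]:
            if t2 - t1 > delta_t:
                break  # events are time-sorted: all later ones are too far
            if t2 > t1 and ({i, j} & {k, l}) and (k, l) not in seen:
                edges.append(((i, j, t1), (k, l, t2)))
                seen.add((k, l))
    return edges
```

The quadratic scan over time-sorted events is enough for illustration; the early `break` keeps it practical for sparse activity.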
Event graphs can be easily enriched with various types of link weights reflecting temporal or structural information coded in the original structure. Here, to better capture the strength of potential causal relationships, we first consider a temporal weight w_t(e1, e2), inversely proportional to the absolute time difference between adjacent events at t1 and t2. This definition of the weight allows us to include the temporality of interactions, such as long decays in social activities. At the same time, we define a second weight for adjacent events (links of the event graph), based on the total number of co-occurring events on the underlying adjacent links in the static network. More precisely, the co-occurrence weight w_co(e1, e2) counts the number of δt-adjacent events in G_T appearing on a given pair of adjacent links l1 = (i, j) and l2 = (k, l) in the static graph G. Note that adjacent events connected in D, which correspond to the same links in the underlying network G, will have the same w_co values. Datasets analysed in this paper are defined as sequences of snapshots aggregating temporal interactions in consecutive time windows of size Δt. In these systems we compute w_co for adjacent links as the number of co-occurrences of corresponding events within a single snapshot. This definition may slightly underestimate the real co-occurrence (if δt ≤ Δt), but provides the best plausible solution due to the ambiguity of timings of events within a single snapshot.
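The two weights can be sketched for snapshot data as follows. The exact functional form of w_t is only stated up to proportionality in the text, so the 1/|t2 − t1| form below is an assumption, and all names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations

def temporal_weight(e1, e2):
    """w_t: inversely proportional to the time gap between adjacent events.

    Assumed form 1 / |t2 - t1|; the adjacency relation guarantees t2 > t1,
    so the denominator is never zero for adjacent events.
    """
    return 1.0 / abs(e2[2] - e1[2])

def cooccurrence_weights(events):
    """w_co per pair of adjacent static links: the number of snapshots in
    which events appear simultaneously on both links."""
    by_snapshot = defaultdict(list)
    for i, j, t in events:
        by_snapshot[t].append((i, j))
    co = Counter()
    for links in by_snapshot.values():
        for l1, l2 in combinations(links, 2):
            if set(l1) & set(l2):  # adjacent static links share a node
                co[frozenset((l1, l2))] += 1
    return co
```

All event-graph links connecting events on the same pair of static links then share one w_co value, as noted in the text.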
Neighbourhood sampling strategy. In the same spirit as recent node embedding techniques 25,35 based on the Skip-Gram model, we propose an event embedding method that samples neighbourhoods of events from the weighted event graph representation to map them to a lower dimensional space. To assign an environment to an event e_k, we sample from its local neighbourhood set N_k, which consists of its first in- (past) and out- (future) neighbours (also called its predecessors and successors from now on). The sampling is done according to probabilities determined by the two types of weights of the links that connect the actual event to its neighbours.
In order to consider not only the past but also the future of an event in its environment, during sampling we use the combined set of its predecessor and successor events. Further, to balance the contribution of causality and temporal co-occurrence, we introduce a neighbourhood sampling strategy such that the probability p(e_l) of picking an event e_l from the combined neighbourhood set N_k of the central event e_k is given by

    p(e_l) = α F(w_t(e_k, e_l)) + (1 − α) F(w_co(e_k, e_l)),

where α is a coefficient between 0 and 1 scaling the contribution of the two types of weights, and F is a normalised weight function defined as

    F(w(e_k, e_l)) = w(e_k, e_l) / Σ_{e_n ∈ N_k} w(e_k, e_n).

Using such probabilities computed for each neighbour, we sample nb environments randomly for each event for an effective training of the model explained next. Each environment contains s events; we call the number s the length of the environment.
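A minimal sketch of this sampling rule (identifier names are ours): given a central event's combined neighbourhood and its two weight maps, it draws one environment of length s.

```python
import random

def sample_environment(neighbours, w_t, w_co, alpha, s, rng=random):
    """Draw one environment of length s for a central event.

    `neighbours` combines predecessors and successors; `w_t` and `w_co` map
    each neighbour to its temporal and co-occurrence weights. Each type of
    weight is normalised separately over the neighbourhood (the function F),
    then the two are mixed with coefficient alpha.
    """
    z_t = sum(w_t[n] for n in neighbours)
    z_co = sum(w_co[n] for n in neighbours)
    probs = [alpha * w_t[n] / z_t + (1 - alpha) * w_co[n] / z_co
             for n in neighbours]
    return rng.choices(neighbours, weights=probs, k=s)
```

Calling this nb times per event produces the nb environments used for training.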
Embedding of empirical temporal networks. Once the environments have been created, they are passed as inputs to the Skip-Gram model, with parameters fixed to different values according to the analysis conducted. The result is a d-dimensional vector space in which events are represented by Cartesian coordinates. As an illustration, Fig. 2 shows a 3-dimensional embedded representation of two of the empirical networks we used for the analysis, recorded in conference and primary school settings. Our aim with this setting was to investigate the performance of a low-dimensional embedding on the one hand, and on the other to take into account equally the effects of causal temporal paths and co-occurrences by setting α = 0.5. The environment parameters nb and s have both been set to 10 as an example. In Fig. 2(a,b) each event is represented as a point in the embedded space, with colour indicating the time at which it occurred in the original temporal network. Interestingly, the gradient change of colours indicates that these embeddings capture in large part the time ordering of the events. At the same time, Fig. 2(c,d) shows the same 3-dimensional embedded representations, but with colours representing the membership in mesoscale structures detected by tensor factorisation methods applied to the original temporal network 9 (see Section on Tensor Factorisation for mesoscale structure extraction). Evidently, as colours are not distributed randomly but similar colours are somewhat grouped together in space, this suggests that our embedding is able to capture some of these mesoscale structures as well.
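The training step above can be sketched with a compact, numpy-only Skip-Gram with negative sampling over (event, environment) pairs. This is an educational toy (parameter names and defaults are ours); real experiments would rely on an optimised word2vec-style implementation.

```python
import numpy as np

def skipgram_embed(samples, dim=3, epochs=20, lr=0.05, neg=3, seed=0):
    """Toy Skip-Gram with negative sampling over (event, environment) pairs.

    `samples` is a list of (centre_event, environment) tuples, where each
    environment is a list of event ids. Returns one dim-dimensional vector
    per event id. Sketch only, not the authors' training setup.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted({e for c, env in samples for e in [c, *env]})
    idx = {e: v for v, e in enumerate(vocab)}
    W = rng.normal(scale=0.1, size=(len(vocab), dim))  # event vectors
    C = rng.normal(scale=0.1, size=(len(vocab), dim))  # context vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for centre, env in samples:
            w = idx[centre]
            for ctx in env:
                # one positive pair plus `neg` random negative pairs
                targets = [(idx[ctx], 1.0)] + [
                    (int(rng.integers(len(vocab))), 0.0) for _ in range(neg)]
                for c, label in targets:
                    g = lr * (label - sigmoid(W[w] @ C[c]))
                    W[w], C[c] = W[w] + g * C[c], C[c] + g * W[w]
    return {e: W[idx[e]] for e in vocab}
```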
We propose an additional microscopic-scale analysis of our embedding method in the Supplementary Information. There we studied pairs of events, focusing on the relation between their time difference measured in the temporal network and their euclidean distance observed in the embedding. Indeed, we found a clear correlation between these quantities and demonstrated that the euclidean distances among linked events in the temporal network are significantly smaller than the distances measured between randomly selected event pairs. These observations demonstrate that our method simultaneously captures the structural and temporal vicinity of events.
Effects of the dimension and of the neighbourhood sampling. One of the most important hyperparameters of our method is the number of dimensions of the embedding vector space. A lower than optimal number of dimensions may neglect otherwise relevant latent correlations in the temporal structure, while overestimating this number may give us a highly redundant embedded space. We test here the consistency and robustness of our embedding technique in terms of this parameter. We argue that as we increase the number of dimensions, once it reaches and passes an optimal value, it starts coding increasingly redundant information in the embedding. As a consequence, further dimensions would not alter the relative positions of embedded events and the pairwise euclidean distances among them would stabilise. To check this assumption, we use an entropy measure capturing the fluctuations of pairwise euclidean distances over several realisations with the same number of embedding dimensions. More precisely, for selected event pairs, we measured the probability distribution of their pairwise euclidean distances over 10 realisations (see Section Entropy Computation) and used it to compute an entropy score. Averaging these scores over all the event pairs provided us an indicator of the stability of the embedding.
We show results in Fig. 3 for two empirical networks, using a balanced embedding with α = 0.5 in both cases. As we expected, the entropy decreases as we increase the number of dimensions. This is due to the stabilisation of the distribution of pairwise euclidean distances, which in turn gives us a hint about the optimal dimension at which the local neighbourhoods (as defined by the sampling) are well captured by the embedding. To select this optimal dimension, we identified the lower bound at which the entropy starts to fluctuate around a constant value (see Methods and Supplementary Information). Note that to identify the optimal number of dimensions we also considered other algorithms 37 , as explained in the Supplementary Information. However, our final choice fell on the entropy-based method introduced above, as it maintained a good trade-off between the compactness (i.e. low dimensionality) and stability of the embedding. Other tested methods suggested an unrealistically high number of dimensions to be optimal, probably due to their incompatibility with the actual setting, as they were developed for word embedding problems.

Spreading process prediction with embedded events.
Beyond the demonstrated capacities of our model to capture the temporal ordering and the underlying mesoscopic structures, it may provide further useful information about the embedded events. As a temporal network embedding, it positions events with similar neighbourhoods in proximity. In other words, it can help to identify similar events, possibly involving different nodes at different times, but influencing a similar set of other nodes via overlapping temporal paths. As a consequence, this information can be used to predict the outcome of information diffusion processes on temporal networks.
To explore this problem, we model a Susceptible-Infected (SI) process, the simplest schematic model of epidemic or information spreading (see Section Spreading Process in Methods). Defined on networks, this model assumes that each node can be in one of two mutually exclusive states, susceptible (S) or infected (I), at a given time. While initially each node is susceptible, infection can spread from a selected infected seed node/event via temporal interactions and can reach all other nodes via connected valid temporal paths. To obtain the expected outcome of an SI process on a temporal network, we took each event as the seed and simulated the spreading on the empirical temporal network to measure the final epidemic size in each case. Note that our aim to investigate the final outcome of an epidemic differs from the one pursued with DyANE 34 . In our case, we are not interested in the status of each node time by time, but in the final outcome of the epidemic originated by a specific event. To test the versatility of our embedding method, we trained a model using the embedded coordinates of events for epidemic size predictions and compared results directly to the corresponding simulated outcomes. We used linear regression to approximate the correspondence between the embedding coordinates of each event and the size of the epidemic initiated from it. As the goodness of the prediction, we simply computed the r^2 scores between the predicted and simulated epidemic sizes. Note that we also tested more complicated non-linear models but obtained lower performance in prediction (not shown here). We report our results in Table 1, where we fixed the environment parameters s and nb both to 10 and chose the optimal embedding dimension for each real network, detected as explained in Section Effects of the dimension and of the neighbourhood sampling.
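The deterministic SI simulation used to generate the prediction targets can be sketched as follows. With β = 1 the process is deterministic; resolving ties between events at equal timestamps in list order is an assumption of this sketch.

```python
def si_epidemic_size(events, seed_event):
    """Deterministic SI (beta = 1) seeded at one event of a temporal network.

    `seed_event` = (i, j, t0): both endpoints become infected, and every
    event at time >= t0 that touches an infected node infects its other
    endpoint. Returns the final number of infected nodes.
    """
    i0, j0, t0 = seed_event
    infected = {i0, j0}
    for i, j, t in sorted(events, key=lambda e: e[2]):
        if t >= t0 and ({i, j} & infected):
            infected |= {i, j}  # infection passes over the contact
    return len(infected)
```

Running this with every event as seed yields the per-event epidemic sizes that the regression model is trained to predict.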
Real temporal networks are interwoven by several temporal and structural correlations. First, there are local temporal correlations induced by intrinsic or environmental effects emerging between events on the same link, leading to bursty behaviour, event trains or circadian activity patterns. Another type is higher-order temporal correlations, leading to the emergence of causal temporal motifs. Structural correlations are responsible for the emerging assortative patterns, communities, or any non-random connection pattern in a social network, while weight-structural correlations induce the non-random distribution of strong and weak ties inside and between communities. In Table 2 we summarise which RRM preserves and which eliminates each type of correlation.
To get a better sense of the effects of these correlations, we used three types of randomised reference models (RRM) 38 to eliminate combinations of temporal and structural correlations, and to identify which of them are determinant for the prediction task. These RRMs were:
• Snapshot shuffling, which randomises the timestamps of events in order to eliminate any temporal correlation between them, such as burstiness, causal motifs, or group activation. This model is denoted as P[p_Γ(t), f(T)] using the notation convention introduced in 38 .
• Timeline shuffling, which takes the complete timeline of events between a connected pair of nodes in the temporal network and swaps it with the timeline of another randomly selected connected pair of nodes. This shuffling method 38 eliminates all correlations between the underlying structure and the timelines, while also removing any causal correlations between events on adjacent links.
• Link shuffling 38 , which first randomises links of the underlying aggregated static network to obtain a Bernoulli random structure, and then randomly reassigns the original timelines of events to the new links without replacement. In this way, it destroys any structural and structural-temporal correlations in the network, while keeping local temporal correlations like burstiness unaltered.
For a summary of present and eliminated correlations in the different RRMs see Table 2.
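The three shufflings above can be sketched as follows. These are hedged simplifications (for instance, the link shuffling here only places the same number of links between random node pairs rather than drawing an exact Bernoulli graph), and all names are ours.

```python
import random
from collections import defaultdict

def snapshot_shuffle(events, rng=random):
    """Randomly permute timestamps across events: the static structure is
    kept while all temporal correlations are destroyed."""
    times = [t for _, _, t in events]
    rng.shuffle(times)
    return [(i, j, t) for (i, j, _), t in zip(events, times)]

def timeline_shuffle(events, rng=random):
    """Swap whole per-link timelines between randomly paired links."""
    timelines = defaultdict(list)
    for i, j, t in events:
        timelines[(i, j)].append(t)
    links = list(timelines)
    donors = links[:]
    rng.shuffle(donors)
    return [(i, j, t) for (i, j), src in zip(links, donors)
            for t in timelines[src]]

def link_shuffle(events, rng=random):
    """Place the same number of links between random node pairs, then
    reattach the original timelines (local temporal correlations survive)."""
    timelines = defaultdict(list)
    nodes = set()
    for i, j, t in events:
        timelines[(i, j)].append(t)
        nodes.update((i, j))
    nodes = sorted(nodes)
    new_links = set()
    while len(new_links) < len(timelines):
        i, j = rng.sample(nodes, 2)
        new_links.add((min(i, j), max(i, j)))
    return [(i, j, t) for (i, j), old in zip(sorted(new_links), timelines)
            for t in timelines[old]]
```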
The prediction results for the original and the RRM networks are summarised in Table 1, where we report the observed average r^2 values with their standard deviations computed over the embedding realisations. These results suggest that, in general, the randomised model embeddings perform worse in predicting the final epidemic size than the original network embeddings. On the one hand this is expected, as some correlations that might be determinant for the prediction task have been eliminated from the RRMs. On the other hand, RRMs also appear with a less complex structure and fewer irrelevant dependencies and noise, which in turn may help the prediction. It is the snapshot shuffling method that leads consistently to a significant drop in performance, suggesting that temporal correlations (local or higher-order) are very important for the spreading process and that our embedding captures these dependencies successfully. Timeline shuffling, which destroys weight-structural and higher-order temporal correlations but conserves the dynamics on links and the underlying network, seems to perform better than the snapshot shuffling method but still worse than the original network. This suggests that while structural correlations can be captured by the embedding, local temporal correlations might be better predictors than weight-structural correlations. Interestingly, the link shuffling method, which conserves only local timeline dynamics on links, performs best among the RRMs, sometimes even better than the original dataset. Consequently, local temporal event dynamics is indeed the most important feature of the temporal network, while once structural correlations are removed, the system becomes homogeneous and easier to predict (for a supporting analysis see Supplementary Information). Although these general conclusions seem to be consistent over the several analysed datasets, results computed in different settings may show some fluctuations.
As explained in the Supplementary Information, this can be partially explained by the variance of the epidemic size distribution, reflecting the fluctuation of epidemic sizes started from different times and events. In most cases, a smaller variance of epidemic size correlates with higher predictive performance, except for the link shuffling method, as explained in the Supplementary Information.
As a general conclusion, we showed that the embedding successfully captures a combination of temporal and structural features of the network. On the other hand, the fact that temporal and structural features can be entangled has an impact on the embedding performance, but not on what the embedding is able to learn about them. For instance, the model can under-perform in a community-rich network where nodes of the same community have totally uncorrelated activities, while it can provide precise predictions in shuffled datasets if structural and temporal correlations code redundant information.
In Methods, in Section Parameter dependencies, we also analyse the impact of changing the environment parameters nb and s, the dimension of the embedding d and the hyperparameter α on the predictability of the epidemic spreading outcome (i.e. with respect to the r^2 score). Tuning the parameters for the best prediction gives a hint about the sensitivity of the prediction task to local properties of the networks and the sampling parameters.

Table 2. Summary of the correlation types (local temporal, weight-structural, higher-order temporal, structural) preserved or eliminated by each RRM.

Comparison with other methods. There are a few other recently proposed dynamical network embedding methods which can be used for the prediction of spreading outcomes. Here we consider two of the most promising ones, the STWalk 30,39 and the Online-Node2vec embedding methods 40,41 , to compare their predictive performances to weg2vec. Both methods build node embeddings for dynamic graphs using the Skip-Gram model, which marks a significant difference from our event embedding method.
STWalk is designed to learn trajectory representations of nodes in temporal graphs by operating with two graph representations: a graph at a given time step and a graph from past time steps. It performs random walks, respectively called space-walks and time-walks, to sample environments used as input for the Skip-Gram embedding. The authors propose two variants of STWalk, differing in the way the environment is built. In STWalk1, space-walk and time-walk are performed as part of a single step on a combined graph, while in STWalk2, space- and time-walks are done separately.
Online-Node2vec is a node embedding method that updates coordinates each time a new event appears in a temporal network. It also applies random walks to generate environments, possibly using two strategies: the Temporal Walk algorithm and the Temporal Neighbourhood algorithm. In the Temporal Walk algorithm 42 , a temporal path based centrality metric is used to capture similarity between nodes by projecting nodes on the same temporal path close to each other in the embedding. In the Temporal Neighbourhood algorithm 43 , node similarity is inferred via a fingerprinting method, which projects nodes with similar neighbourhoods close to each other.
To compare the performance of the different methods, we test all of them on the four empirical networks introduced earlier. The environment parameters nb and s have both been set to 10 in all cases, to give the methods the same amount of information to learn from and allow a fair comparison of the outcomes. Further, we fix the balance parameter α to 0.5. We then compute the average r^2 scores of simulated spreading outcomes as we vary the number of embedding dimensions. Since STWalk and Online-Node2vec use only the past and the present as the basis for a node's environment, we run the simulation for our method using only the predecessors of each event as well (see Section Neighbourhood sampling strategy). Finally, as previously, we estimate the epidemic size by using the coordinates of the actual embedding in a linear regression model (see Section Spreading Process).
According to the results in Fig. 4, our method outperforms all the other methods on all of the networks for a broad range of dimensions. The performance improves if we also consider the successors and not only the predecessors in building the environment, as expected. The exception is the hospital network, where our method obtains lower scores than Online-Node2vec for dimensions 50 or larger. In general, the difference in the scores can be explained by the advantage of event embedding over node embedding. Indeed, if we are looking at epidemic spreading mediated by temporal interactions, it becomes more natural to work with events. In the case of STWalk, the lower scores can be partly explained by the selection of environments that are allowed to include higher-order correlations among nodes. This more complex information coded in the environments can appear less relevant, or noisy, for the learning task here. In the case of Online-Node2vec, the relative under-performance can be due to the fact that temporal and neighbourhood information are considered separately. Failing to join these two aspects can lead to limited information and prediction capacities.

Discussion
Embedding of networks has recently drawn a lot of attention, as it both provides lower dimensional representations of networks and proves to be efficient for resolving tasks such as link prediction, node classification or anomaly detection. Here we proposed an embedding technique for temporal networks, a domain still little explored, as most low dimensional representations of networks have been introduced for static networks. Moreover, instead of generating a node embedding, we create an event embedding, which we show to be much more efficient than node embedding techniques when solving a task linked to spreading processes.
The embedding method we introduced has the advantage of being very simple: it relies on the sampling of neighbourhoods on a higher-order static representation of the temporal network and the use of the Skip-Gram model, largely developed for word embedding. The neighbourhood sampling was built so that it takes into account both the notions of causal temporal paths and co-occurrences, meaning that events that are on the same temporal paths and that tend to co-occur are projected close in space. We have shown that an embedding based on this neighbourhood sampling is particularly efficient at providing compact representations of temporal networks that retain essential features such as the time ordering and the organisation of networks into mesoscale structures. Here, we also provide a way to choose a relevant dimension for the embedding. Along with this, we show that the learned representations retain enough information about the original network to give a relevant estimate of the outcome of a spreading process. Interestingly, this observation remains true even if the sampling strategy uses only information from the past of each event. This means that the technique introduced can also be used as an online method, taking temporal events into account on the run. Moreover, tuning the neighbourhood sampling, i.e. playing on the trade-off between including causal temporal paths and co-occurrences, to get the best performance for the prediction of the outcome of the spreading process can be used to detect the properties of the original network relevant for the spreading process.
For future work, it would be worth exploring other sampling strategies that decouple the purely structural properties, i.e. the presence of communities in the aggregated network, from the temporal properties. Another important follow-up of this work would be the application of this embedding technique to questions such as the detection of key events in misinformation spreading.

Methods
Entropy Computation. To measure the entropy over the distributions of the euclidean distances between pairs of event coordinates in each of the temporal network embeddings, we build 10 embeddings for each of 50 different dimensions (from 2 to 100 in steps of 2), setting α = 0.5. The hyperparameters nb and s were both set to 10. For each embedding, we then divide our dataset into 10 consecutive samples of 1000 consecutive events each, both to avoid computing all the pairwise distances (which would be very costly computationally) and at the same time to have a representative set of events.
We compute the euclidean distance of each pair of events in the samples for each of the 10 different embeddings. We then bin the distances into k = 10 bins ranging between the global minimum and maximum values over all the possible dimensions and realisations of the embedding for the same network, and measure the entropy over these sets of distances as

    H = − Σ_k p_k log(p_k),

where p_k represents the probability associated with the k-th bin. In Fig. 3 (see also Supplementary Information) the blue curve represents the entropy values with respect to the number of embedding dimensions, averaged over the 10 samples as described above, and the shaded surrounding area shows the variance among the 10 samples. The vertical dashed line corresponds to the dimension at which the embedding stabilises. To determine this point we looked for the best fit of a horizontal line to the average entropy curve and took the value of the first intersection of the curve with its fit.
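The entropy score can be sketched as follows (binning and averaging as described above; variable names are ours):

```python
import numpy as np

def distance_entropy(distance_sets, k=10):
    """Average entropy of pairwise-distance distributions.

    `distance_sets[p]` holds, for event pair p, its euclidean distances over
    the embedding realisations. Distances are binned into k bins spanning
    the global min/max; H = -sum_k p_k log(p_k) per pair, then averaged.
    """
    all_d = np.concatenate(distance_sets)
    edges = np.linspace(all_d.min(), all_d.max(), k + 1)
    entropies = []
    for d in distance_sets:
        counts, _ = np.histogram(d, bins=edges)
        p = counts / counts.sum()
        p = p[p > 0]  # 0 * log(0) contributes nothing
        entropies.append(-(p * np.log(p)).sum())
    return float(np.mean(entropies))
```

A stable embedding concentrates each pair's distances in one bin, driving the score towards zero.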
Spreading Process. The simplest epidemic models are based on the assumption that the population can be divided into compartments, each representing a phase of the disease [44][45][46][47][48] . The one we used for our analysis, the Susceptible-Infected model, assumes that once a healthy node (in state S) is exposed to the infection, it becomes infected (state I) with a given rate β and never returns to the original healthy state. In our specific case, the SI model has been implemented with β = 1, which defines a deterministic process from the spreading point of view. Taking each event as the start of the epidemic, we simulate the epidemic spreading on the temporal network and assign the final epidemic size to each seed event. We build a dataset containing, for each event, its embedding coordinates, the square of each coordinate and its euclidean distance from the first event in time, which are used as regressors, together with the associated epidemic size, which is the target to be predicted. In other terms, we assume a linear relationship between the epidemic size y and the regressors, defined by the regression equation

    y = β_0 + β_1 x_1 + … + β_r x_r,

where x_1, …, x_r are the predictors (the embedding coordinates, their squares and the euclidean distance) and β_0, …, β_r are the regression coefficients. For training we operate with a 10-fold cross-validation, i.e. we first randomly partition the original sample into 10 equal-sized sub-samples and retain a single sub-sample as validation data for testing the model, while using the remaining 9 sub-samples for the actual training. We repeat this process 10 times, so that each sub-sample serves once as the validation set. We use the coefficient of determination, denoted as r^2, to understand how much of the variation in y can be explained by the dependence on the predictors using the particular regression model.
A larger r^2 indicates a better fit and means that the model can better explain the variation of the output with different inputs.
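A numpy-only sketch of the 10-fold cross-validated linear regression and its r^2 evaluation (the study may have used a standard library for this; function and variable names are ours):

```python
import numpy as np

def cross_validated_r2(X, y, folds=10, seed=0):
    """Ordinary least squares with k-fold cross-validation.

    X holds the per-event regressors (coordinates, their squares, distance
    to the first event), y the simulated epidemic sizes. Returns the mean
    out-of-fold r^2 = 1 - SS_res / SS_tot.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    scores = []
    for f in range(folds):
        test = order[f::folds]
        train = np.setdiff1d(order, test)
        A = np.c_[np.ones(len(train)), X[train]]  # prepend intercept beta_0
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[np.ones(len(test)), X[test]] @ beta
        ss_res = ((y[test] - pred) ** 2).sum()
        ss_tot = ((y[test] - y[test].mean()) ** 2).sum()
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))
```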
Tensor Factorisation for mesoscale structure extraction. A temporal network can be fully described by a time-dependent adjacency matrix, whose entries are either one or zero depending on the presence or absence of an interaction between a pair of nodes at a given time. This matrix can be seen as a three-way tensor of size n × n × t (where n indicates the number of nodes and t the number of snapshots in the temporal network). This tensor can then be factorised into a sum of r rank-one tensors, where r is the number of mesoscale structures we search for in the temporal network. Using this technique we can group events into mesoscale structures. Indeed, the decomposition provides a value for each link, at each time, for each rank-one tensor, and we assign a mesoscale structure to each event based on these values: the structure assigned to an event is the one for which the corresponding link at the corresponding time has the highest value. To find the optimal number of mesoscale structures, we used the core consistency metric 49 . It is based on scrutinising the appropriateness of the structural model given the data and the estimated parameters of gradually augmented models. A model is called appropriate if adding other combinations of the same components does not improve the fit considerably. In practice, we run tensor decompositions for different values of the rank (ranging from 2 to 20, for all the networks) in order to estimate its best value.
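Given factor matrices from such a decomposition (obtained with any CP/PARAFAC routine, not shown here), the assignment step can be sketched as follows; matrix names are illustrative.

```python
import numpy as np

def assign_structures(events, A, B, T):
    """Map each event (i, j, t) to a mesoscale structure.

    A and B are node factor matrices (n x r) and T the temporal factor
    matrix (t x r) of a rank-r CP decomposition of the interaction tensor.
    The chosen structure is the component maximising the reconstructed
    entry A[i, c] * B[j, c] * T[t, c].
    """
    return {(i, j, t): int(np.argmax(A[i] * B[j] * T[t]))
            for i, j, t in events}
```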
Parameter dependencies. Here, we explain how we investigated the way hyperparameters of the environment sampling may impact the prediction score on different real networks. Figure 5 shows the r^2 scores computed for the conference and primary school networks with respect to the length s and the number nb of environments sampled for each event. For these computations we fixed α = 0.5 and set the embedding dimensions to their optimal values. According to Fig. 5, on the conference network, except for very small numbers or lengths of environments, we see an emerging plateau of r^2 values. This means that the environment size compensates for the number of environments (or vice versa) when we measure the embedding performance. In other terms, increasing the length of the environment has the same effect on the r^2 score as increasing the number of environments. For the primary school network we observe a similar but somewhat weaker compensation effect (Fig. 5), while we observed the same behaviour for the hospital and the high school networks as well (see Supplementary Information). These results suggest that these two parameters are highly redundant, thus we can effectively reduce our parameter set by fixing both to a large enough value.
In order to investigate the influence of the embedding dimension and the α sampling balance parameter, we next fix the environment parameters to s = 10 and nb = 10 based on our evaluation above. As shown in Fig. 6 (and in the Supplementary Information for the other networks), both increasing the number of embedding dimensions and increasing α lead to better performance in predicting the spreading outcome. As a function of the number of dimensions, each case reaches a plateau, in accordance with our earlier results presented in Section Effects of the dimension and of the neighbourhood sampling. On the other hand, we observe somewhat stronger dependencies on α. While for the conference and the hospital networks the more one increases α the better the prediction gets, for the primary school and the high school networks the score reaches a plateau and becomes less sensitive to the change of α (see Fig. 6 and Supplementary Information). If we consider lower values of α, the similarity we capture between adjacent events is mainly based on co-occurrences, which are more relevant in school networks where participants might be active at the same time (e.g. in breaks between classes). This argument only