Understanding microbiome dynamics via interpretable graph representation learning

Large-scale perturbations in the microbiome constitution are strongly correlated, whether as a driver or a consequence, with the health and functioning of human physiology. However, understanding the difference in the microbiome profiles of healthy and ill individuals can be complicated due to the large number of complex interactions among microbes. We propose to model these interactions as a time-evolving graph where nodes represent microbes and edges represent interactions among them. Motivated by the need to analyse such complex interactions, we develop a method that can learn a low-dimensional representation of the time-evolving graph while maintaining the dynamics occurring in the high-dimensional space. Through our experiments, we show that we can extract graph features, such as clusters of nodes or edges, that have the highest impact on how the model learns the low-dimensional representation. This information is crucial for identifying microbes, and interactions among them, that are strongly correlated with clinical diseases. We conduct our experiments on both synthetic and real-world microbiome datasets.


Introduction
Complex microbiome ecosystems have a strong impact on the health and functioning of human physiology. Large-scale perturbations in the microbiome constitution are strongly correlated, whether as a driver or a consequence, with clinical diseases, such as inflammatory bowel disease 1,2 , obesity 3 , and some types of cancer [4][5][6][7] . Many studies have aimed at accurately differentiating the disease state and at understanding the difference in the microbiome profiles of healthy and ill individuals 8,9 . However, most of them focus mainly on various statistical approaches, omitting microbe-microbe interactions between a large number of microbiome species that, in principle, drive microbiome dynamics. In addition, some studies make use of the concept of a potential landscape in physics [10][11][12] , giving a completely new insight into the analysis of microbiome dynamics. Namely, a healthy human microbiome can be considered as a metastable state lying in a minimum of some potential landscape. The system of time-evolving interactions of species appears to be equilibrated for a short timescale, but at larger timescales a disease or other strongly impacting factors, such as antibiotic exposure, make the system undergo transitions from one metastable state (healthy) to other metastable states (diseased).
Detecting metastable states, and the associated interactions of species that undergo changes from one metastable state to others, is complicated by the high dimensionality and the compositional nature of microbiome data. Therefore, we propose a method that simplifies analysis and prediction of the large-scale dynamics of microbiome composition by projecting this system onto a low-dimensional space. First, to allow interactions between species to change over time, we represent the system as a time-evolving graph with nodes being microbes and edges being interactions between microbes. Second, we define two key components of our method: 1) the Transformer 13 , which learns both the structural patterns of the time-evolving graph and the temporal changes of the microbiome system, and 2) contrastive learning, which makes the model maintain metastability in the low-dimensional space. To assess the performance of our method, we apply it to synthetic data from 14 , which has known underlying dynamics, and to two real-world microbiome datasets, i.e., MovingPic 9 and Cholera Infection 15 . Furthermore, we show that it is feasible to extract topological features of the time-evolving graph that are associated with metastable states and have the highest impact on how the model learns the low-dimensional representation of the time-evolving graph with metastability. This information can help in differentiating the microbiome profiles of healthy and diseased individuals.
Our main contribution is a model that learns a low-dimensional representation of a time-evolving graph with metastable behaviour in an unsupervised manner. We show in experiments that the metastability governing the time-evolving graph is preserved by the model. Through interpreting the output of the model with respect to the input, we demonstrate that it is feasible to extract the topological features of the time-evolving graph that define each metastable state. These features can be used to identify a set of microbes that drive the microbiome constitution to undergo transitions from one metastable state to others.

Related work
We can broadly categorize methods for graph representation learning into semi-supervised or unsupervised methods, and into methods for static or time-evolving (dynamic) graphs. A good overview of the current state of methods for time-evolving and static graph representation techniques can be found in 16,17 and 18,19 , respectively. The most recent survey covering both time-evolving and static graphs is presented in 20 .

Static graph representation.
Representation approaches for static graphs can be classified into two categories: those which learn the representation of nodes and those which learn the representation of sub-structures of the graphs. The first category tends to encode nodes of the graph in a low-dimensional space such that their topological properties are reflected in the new space (node2vec 21 , DeepWalk 22 ). Most studies focus on node representation learning, and only a few learn the representation of the whole graph (graph2vec 23 ).

Dynamic graph representation.
Representing a time-evolving graph in a low-dimensional space is an emerging topic that is still being investigated. Among recent approaches, DynGEM 24 uses the learned representation from the previous time step to initialize the current time step representation. Such initialization keeps the representation at the current time step close to the learned representation at the previous time step. An extension of this method is dyngraph2vec 25 , where the authors make it possible to choose the number of previous time steps that are used to learn the representation at the next time step. Moreover, dyngraph2vec uses recurrent layers to learn the temporal transitions in the graph. Unlike this method, we utilize the multi-head attention mechanism 13 to capture the temporal changes in the time-evolving graph.
Another category of methods that are successful in graph representation learning is Graph Neural Networks. One such method is EvolveGCN 26 , which extends graph neural networks for static graphs to dynamic graphs by introducing a recurrent mechanism to update the network parameters. The authors focus on the graph convolutional network and incorporate a recurrent neural network to capture the dynamics of the graph. Recently, attention-based methods have been extensively proposed; one of them is DynSAT 27 , which learns a dynamic node representation by considering the topological structure (neighbourhood) and historical representations following the self-attention mechanism. However, one disadvantage of these methods for our problem is that time-evolving graphs with metastability usually consist of many time steps, and it is crucial to have a computationally efficient method in order to learn a low-dimensional representation. Another disadvantage is that all these methods capture the dynamics of nodes and, as a result, output a low-dimensional representation of nodes rather than of whole graphs.

Datasets
Here, we briefly describe the datasets used to evaluate the model. Besides experiments with synthetic datasets, we show the application of our method to real-world microbiome data. Both the generation of the synthetic datasets and the pre-processing of the real-world datasets are explained in more detail in 14 . An overview of the datasets used in this paper is shown in Table 1.

Synthetic data
To estimate how well the proposed method can capture the dynamics of the time-evolving graph and learn a proper low-dimensional representation, we generate synthetic datasets with both understandable topological structures and temporal patterns. We use the following Stochastic Differential Equation (SDE) to sample a trajectory with metastability, based on which time-snapshot graphs $\{G_1, \dots, G_T\}$ are built:

$dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t, \qquad (1)$

with the potential function:

$V(x_1, x_2) = \cos\big(s \cdot \operatorname{arctan2}(x_2, x_1)\big) + 10\big(\sqrt{x_1^2 + x_2^2} - 1\big)^2, \qquad (2)$

where s is the number of states or wells, $X_t = (x^1_t, x^2_t)$, $W_t$ is a standard Wiener process, and β is the inverse temperature that controls the transitions between states. The higher β, the less likely is the transition from one state to another. We use the potential function (2) to generate pos_3WellGraph.
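As a minimal sketch, this trajectory sampling can be implemented with an Euler–Maruyama discretization of the SDE. The double-well potential $V(x_1, x_2) = (x_1^2 - 1)^2 + x_2^2$ and all parameter values below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def grad_V(x):
    # Gradient of the (assumed) double-well potential V(x) = (x1^2 - 1)^2 + x2^2.
    return np.array([4.0 * x[0] * (x[0] ** 2 - 1.0), 2.0 * x[1]])

def sample_trajectory(T=5000, dt=1e-2, beta=4.0, seed=0):
    """Euler-Maruyama discretization of dX_t = -grad V(X_t) dt + sqrt(2/beta) dW_t."""
    rng = np.random.default_rng(seed)
    X = np.zeros((T, 2))
    X[0] = np.array([-1.0, 0.0])  # start in the left well
    for t in range(1, T):
        noise = rng.normal(size=2) * np.sqrt(2.0 * dt / beta)
        X[t] = X[t - 1] - grad_V(X[t - 1]) * dt + noise
    return X

traj = sample_trajectory()
states = (traj[:, 0] > 0).astype(int)  # crude state label: which well the point is in
```

Increasing `beta` (the inverse temperature) makes transitions between the two wells rarer, matching the role of β in equation (1).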
For the synthetic datasets with 2 states (pos_2WellGraph and npos_2WellGraph), we use the following potential function, which is often called a double-well potential:

$V(x_1, x_2) = (x_1^2 - 1)^2 + x_2^2. \qquad (3)$

We further split our synthetic datasets into two categories: positional and non-positional data. Positional data means that we use the positional encoding explained in the Method section before training the model. Non-positional data means that we do not use the positional encoding, as the topological structure of the time-evolving graph can be understood without providing the positional information of the nodes. We define these two categories to show empirically that our model does not rely on positions of nodes if the topological patterns of each state are clearly defined.
Positional data. This synthetic data has a topological structure that is difficult to distinguish without the positions of nodes. We briefly describe the three-step process which generates the positional time-evolving graph:
• We sample the trajectory $S = \{(x^1_t, x^2_t)\}_{t=1}^T$ using SDE (1) and the corresponding potential function (2) or (3).
• Then we choose the number of nodes n and assign random coordinates (a_j, b_j), j = 1, ..., n, to each of these nodes. The reason for this is that we need to know the locations of the discriminating features of the metastable states.
• In the final step, we define discriminating topological features for each state. Let G_0 be a complete graph. In the case of the s-well potential, we generate G_t for all t = 1, ..., T by drawing a circle with the centre at $(x^1_t, x^2_t)$ and radius r and randomly removing edges between nodes that are inside the current circle. In addition, to mimic the noise in real-world data, we also remove edges outside the current circle. For the double-well potential, we remove edges between nodes whose coordinates satisfy a state-dependent condition.
Without the positions of nodes, the graph features of different states are difficult to distinguish, and the model will fail to discriminate between metastable states in the time-evolving graph. As an illustration, consider Figure 2. There are two states A and B of the time-evolving graph G that are characterized by removed edges in the right part of G for state A and in the left part of the graph for state B. Topologically, the states have the same neighbourhood structures, which results in the same points in the low-dimensional space. The same occurs for our synthetic dataset: nodes in the circles determine the metastable states, and the neighbourhoods of these nodes are almost identical from the model's point of view.
Non-positional data. This type of data has a topological pattern dissimilar to the positional data. The time-evolving graph is generated in the same way as the positional synthetic dataset, except that instead of removing a random number of edges between nodes that fall in the circle, we remove edges between nodes in the circle in such a way that each node has a particular number of neighbours. We set the number of removed neighbours arbitrarily and differently for each state.
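The circle-based edge-removal step for one time-snapshot can be sketched as follows; the node coordinates, radius, and removal probability are all illustrative assumptions:

```python
import numpy as np

def snapshot_adjacency(coords, center, r=0.5, p_remove=0.7, seed=0):
    """Build one time-snapshot adjacency matrix from a complete graph G_0:
    edges between nodes inside the circle around `center` are removed with
    probability `p_remove` (assumed; the paper removes a random subset)."""
    rng = np.random.default_rng(seed)
    n = len(coords)
    A = np.ones((n, n), dtype=int) - np.eye(n, dtype=int)  # complete graph G_0
    inside = np.linalg.norm(coords - center, axis=1) < r
    for i in range(n):
        for j in range(i + 1, n):
            if inside[i] and inside[j] and rng.random() < p_remove:
                A[i, j] = A[j, i] = 0  # remove edge inside the circle
    return A

rng = np.random.default_rng(1)
coords = rng.uniform(-2, 2, size=(30, 2))  # random node positions (a_j, b_j)
A = snapshot_adjacency(coords, center=np.array([-1.0, 0.0]))
```

Sliding the circle centre along the sampled SDE trajectory and repeating this step yields the sequence of time-snapshot graphs.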

Real-world dataset
MovingPic. This dataset, originally introduced in 9 , is the first real-world dataset on which we evaluate our model. In this study, one male and one female were sampled daily at three body sites (gut, skin, and mouth) for 15 months and for 6 months, respectively. To obtain a time-evolving graph, we pre-process Operational Taxonomic Units (OTUs): an OTU table contains the number of 16S rDNA marker gene sequences observed for each taxonomic unit in each sample. Let D ∈ R^{T×p} be an OTU table, where T is the number of time points and p is the number of OTUs. As this data does not have any obvious perturbations, such as antibiotic exposure or diseases, which could potentially create a metastable structure, an artificial noisy signal is added to the data. The Pearson correlation between each pair of OTUs is computed, and then the initial time-snapshot graph is constructed. To construct time-snapshot graphs at each time step, the authors of 14 use the OTU table to remove edges between nodes: if the OTU count for a particular node is zero, then the edges between this node and its neighbouring nodes are removed.
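A minimal sketch of this graph construction, assuming a simple absolute-correlation threshold for the initial graph (the exact pre-processing in 14 may differ in detail):

```python
import numpy as np

def build_time_evolving_graph(D, threshold=0.3):
    """From an OTU table D (T x p), build an initial graph from Pearson
    correlations between OTUs, then, at each time step, drop all edges
    incident to OTUs with zero count (the threshold is an assumed cutoff)."""
    T, p = D.shape
    corr = np.corrcoef(D.T)                 # p x p Pearson correlation matrix
    A0 = (np.abs(corr) > threshold).astype(int)
    np.fill_diagonal(A0, 0)                 # no self-loops
    snapshots = []
    for t in range(T):
        A_t = A0.copy()
        zero = np.where(D[t] == 0)[0]       # OTUs absent at time t
        A_t[zero, :] = 0                    # remove their incident edges
        A_t[:, zero] = 0
        snapshots.append(A_t)
    return snapshots

D = np.random.default_rng(0).poisson(2.0, size=(10, 6))  # toy OTU counts
graphs = build_time_evolving_graph(D)
```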
CholeraInf. This dataset was introduced in a study of recovery from Vibrio cholerae infection 15 . Faecal microbiota were collected from seven cholera patients from the disease period (state 1) through the recovery period (state 2). In our experiment, we use the microbiome of one patient, since the variation in microbiome constitution among individuals can have an impact on the result of the model. The time-evolving graph is obtained in the same way as for the MovingPic dataset.

Visualization and comparative analysis
In this part, we focus on verifying the qualitative performance of our model. As a first experiment, we visualize the resulting graph embedding to evaluate how well separated the metastable states are in the low-dimensional space. As a second experiment, we compare our model with the following methods: a simple baseline chosen from state-of-the-art methods for dimensionality reduction, namely Principal Component Analysis (PCA); two kernel-based methods, graphKKE 14 and the WL kernel 28 ; and two graph representation learning methods, node2vec 21 and graph2vec 23 .
• Principal Component Analysis (PCA) is a method for dimensionality reduction. To be able to apply this method to the time-evolving graph, we flatten the adjacency matrix of each time-snapshot graph into a vector.
• The graphKKE approach is proposed for learning the embedding of a time-evolving graph with metastability. It is a graph kernel-based method that combines transfer operator theory with a graph kernel technique.
• The WL kernel decomposes graphs into rooted subgraphs using a relabelling process and computes feature vectors based on the number of initial and updated labels.
• The graph2vec approach learns embeddings of a set of static graphs and comprises two main components: 1) the Weisfeiler-Lehman relabelling process and 2) the Skip-gram procedure from doc2vec 29 .
• The node2vec algorithm is a node representation method that uses breadth-first search and depth-first search to extract local and global information from the static graph.
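The PCA baseline above can be sketched as follows: each adjacency matrix is flattened into a vector and projected onto the top principal components with a plain SVD-based PCA (an illustrative implementation, not a specific library call):

```python
import numpy as np

def pca_embed(adjacencies, d=2):
    """Flatten each n x n adjacency matrix to a vector, center the resulting
    T x n^2 data matrix, and project onto the top-d principal components."""
    X = np.stack([A.reshape(-1) for A in adjacencies]).astype(float)
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                          # T x d embedding

rng = np.random.default_rng(0)
adjs = [(rng.random((8, 8)) > 0.5).astype(int) for _ in range(20)]
emb = pca_embed(adjs, d=2)
```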

Evaluation metric. To conduct the comparison analysis, we use a standard clustering evaluation metric, the Adjusted Rand Index (ARI). ARI values lie in the range [−1, 1], with 0 representing random clustering and 1 being the highest correspondence to the ground-truth data.
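For reference, the ARI can be computed from the contingency table of the two labellings; a minimal implementation (equivalent in spirit to library versions such as scikit-learn's `adjusted_rand_score`):

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency table:
    ~0 for random clustering, 1 for perfect agreement (up to relabelling)."""
    _, y = np.unique(np.asarray(labels_true), return_inverse=True)
    _, c = np.unique(np.asarray(labels_pred), return_inverse=True)
    n = len(y)
    table = np.zeros((y.max() + 1, c.max() + 1), dtype=int)
    for i, j in zip(y, c):
        table[i, j] += 1
    comb = lambda x: x * (x - 1) / 2.0            # "n choose 2", elementwise
    sum_ij = comb(table).sum()
    sum_a = comb(table.sum(axis=1)).sum()
    sum_b = comb(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb(n)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0 (labels swapped)
```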
Experimental setup. First, we examine the evolution of the graph embedding by visualizing it at the beginning, in the middle, and at the end of the training of the model. To do so, we use the graph embedding ĝ = {ĝ_1, ..., ĝ_T}, where ĝ_i ∈ R^d with d = 2. For all synthetic datasets, we set the hyperparameter l to 3 and the batch size to 64, with 200 epochs for both pos_3WellGraph and npos_2WellGraph. For real-world data, the batch size is set to 64 for MovingPic and to 6 for the CholeraInf dataset. We use the Adam optimizer with default parameters, 4 attention heads, and 3 Transformer layers. As the graphKKE method approximates the eigenfunctions of a transfer operator, the dimension of its graph embedding equals the number of metastable states in the time-evolving graph. This means that we need to apply a dimensionality reduction method to be able to visualize it. Thus, PCA with 2 components is applied to the output of graphKKE. We also apply PCA with 2 components to the flattened adjacency matrices.
Moreover, since we are interested in whether metastable states of the original space correspond to clusters of points in the reduced space, the points of the graph embeddings are coloured according to the original ground-truth metastable states. For the comparison analysis, we obtain graph embeddings from the other methods in the following way. We set the number of dimensions of the graph embeddings to 32 for our method, node2vec, graph2vec, and PCA. We use the implementations of node2vec and graph2vec with the default hyperparameters provided by the authors. In the graphKKE method, the number of dimensions of the final graph embedding equals the number of metastable states in the time-evolving graph. Finally, we apply the k-means method to cluster the points of the final graph embeddings of each method. However, node2vec is developed to learn node representations; therefore, to obtain embeddings of entire time-snapshot graphs, we average the node embeddings of each time-snapshot graph.
Result and Discussion: synthetic data. During the training, our model tends to capture the underlying metastable structure in the time-evolving graph. Moreover, at the end of the training, we see that our model learns the graph embedding while maintaining the initial metastable dynamics. In the case of the npos_2WellGraph dataset, there is no obvious split between the two metastable states; the reason is that the initial SDE trajectory has points that are located on the boundary between the two metastable states. Furthermore, we compare the initial SDE trajectory and the final graph embedding obtained from our model with d = 1 for npos_2WellGraph and with d = 2 for pos_3WellGraph. The result for npos_2WellGraph is presented in Figure 4(a), which shows that the two trajectories are almost identical. The same can be seen for pos_3WellGraph in Figure 4(b). These results indicate that the model is capable of extracting the underlying metastable dynamics in the time-evolving graph.
In Table 2, we can see that the results of the visualization are reinforced by the high ARI values of our model. The table also shows that the graphKKE method outperforms our model in the case of the pos_2WellGraph and pos_3WellGraph datasets. However, if we aim for a lower dimensionality of the graph embedding, this method fails to produce the same clustering accuracy. As evidence, the visualization of the graph embedding obtained with graphKKE (Figure 6) shows that graphKKE+PCA fails to produce a visualization with clearly separated metastable states. Moreover, considering the results of the other graph representation learning methods (Table 2), node2vec fails completely to learn the graph embedding of the time-evolving graph, and graph2vec performs poorly on all synthetic datasets except npos_2WellGraph. It remains unclear whether graph2vec struggles to identify states in the positional data because the states do not have unique topological patterns, or because this method is not meant to capture temporal changes.
Result and Discussion: real-world data. For the real-world datasets, the evolution of the graph embedding during the training for MovingPic and CholeraInf is presented in Figure 5. As in the case of the synthetic datasets, our method is able to identify the metastable behaviour in the time-evolving graph and preserve it in the new space. For CholeraInf, we have added time points from the original dataset to check whether the new space has the same time order as the original high-dimensional space. Comparing the visualizations of graph embeddings from the other methods, the result for MovingPic shown in Figure 6(c) indicates that all methods give relatively similar visualizations. However, for CholeraInf (Figure 6(d)), our model preserves consecutive time points in the new space, which indicates that one metastable state (healthy) follows the other metastable state (ill).
The second part of this experiment compares our model with other dimensionality reduction methods on the clustering task. Again, from Table 2 it is evident that our model performs significantly better than the WL kernel, graph2vec, and node2vec. Node2vec performs poorly across all datasets, which is a consequence of it being a lower-order substructure embedding method: it can model only local similarities and fails to learn global topological similarities.

Interpretability
An improved understanding of how the microbiome contributes to health and well-being can drive and accelerate the development of microbiome-based treatments. The most important question, which has not been answered yet, is which species or interactions of species are responsible for, or affected by, the changes the microbiome undergoes from one state (healthy) to another state (diseased or antibiotic exposure). Such information can significantly improve modern treatments of various diseases. We assume that if the model can successfully find and discriminate metastable states, then there might be topological features in the time-evolving graph that make these metastable states different. Therefore, the main objective of this section is to provide insight into the extent to which the model learns metastable states based on true discriminating topological features. With regard to the real-world data, we aim to find the topological features of the time-evolving graph that make the two states, the cholera infection period and the recovery period, different.
To achieve this, we use an approach 30 based on layer-wise relevance propagation (LRP) 31 . LRP is a family of explanation methods that leverage the layered structure of the network. It explains the prediction of a neural network classifier by backpropagating the neuron activation on the output layer to the previous layers until the input layer is reached.
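As an illustration of the basic redistribution step, the LRP-ε rule for a single linear layer can be sketched as follows (a generic LRP building block, not the full Transformer-specific procedure of 30):

```python
import numpy as np

def lrp_linear(a, W, R_out, eps=1e-9):
    """LRP-epsilon rule for one linear layer z = W a: redistribute the output
    relevance R_out to the inputs proportionally to each input's contribution,
    approximately conserving the total relevance."""
    z = W @ a                               # forward pre-activations
    s = R_out / (z + eps * np.sign(z))      # stabilized output-relevance ratio
    return a * (W.T @ s)                    # relevance of each input neuron

rng = np.random.default_rng(0)
a = rng.random(4)                           # activations of the lower layer
W = rng.normal(size=(3, 4))                 # layer weights
R_in = lrp_linear(a, W, R_out=np.ones(3))   # relevance flowing to the inputs
```

Applying such a rule layer by layer, from the embedding layer back to the input, yields a relevance score per input node, which is the quantity summed per state in the experiments below.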
The authors of 30 address the lack of the conservation property in the attention mechanism, which is an essential assumption of LRP, as well as the numerical issues of the skip connections, by applying a normalization to the computed relevance scores. Moreover, they make use of the attention weights and propose to compute the final relevance scores by multiplying the relevance score of each Transformer layer with the gradient of the corresponding attention matrix, summed up across the "head" dimension.
Unlike the original LRP and the approach mentioned above, where the decomposition starts from the classifier output corresponding to the target class, we have a similarity model that measures how similar the graph embeddings of the time-snapshot graphs G_t and G_{t+1} are. For this reason, we start the redistribution from the layer where we compute the graph embedding ĝ_t and proceed until the input layer is reached, where the final relevance is computed. We compute a relevance score for each time-snapshot graph in the test set. To obtain the discriminating features of a whole state, we sum up the relevance scores of the time-snapshot graphs of that state.
Result and Discussion: synthetic dataset. We conduct this experiment on the npos_2WellGraph and CholeraInf datasets. The result for npos_2WellGraph is shown in Figure 7. From the result, it is clear that the model can find topological features in the time-evolving graph that are unique for each metastable state. However, the interpretation of state 1 (Figure 7b) highlights all nodes in the upper part of the time-snapshot graph, which is a true discriminating feature, whereas the interpretation of state 0 shows only 4 nodes in the lower part of the time-snapshot graph.
It is worth mentioning that we have modelled the synthetic datasets in such a way that we know the locations of nodes in the time-snapshot graphs. In the case of the real-world datasets, we do not have coordinates of nodes.
Result and Discussion: real-world dataset. Having assessed the interpretation of our model on the synthetic dataset, we now focus on obtaining relevance scores for the real-world dataset, namely CholeraInf. Unlike the synthetic dataset, we do not know the ground-truth discriminating features for this dataset. To visualize the interpretation, we use a correlation matrix computed from the OTU table (see 14 for details on how this data has been pre-processed).
The result of LRP is presented in Figure 8a for the diarrhea period (state 1) and in Figure 8b for the recovery period (state 2). We can see two different sets of nodes with high relevance scores, yet some nodes are significant for the model in both states. For example, the relevance scores of node 73 and node 4 are high in both states.
Further study of these results is needed to investigate whether these interpretations have biological meaning. For instance, numerous works 32 have focused mainly on statistical analysis to justify which bacteria/species are affected by, or on the contrary cause, shifts in microbiome compositions. Using the species detected in these works, we can compare them with the nodes that have the biggest impact on the model output according to both the LRP approach and the visualization of attention weights.

Discussion
We have presented a new approach that can simplify the analysis of time-evolving graphs with assumed metastability. Through an extensive set of experiments on both synthetic and real-world datasets, we have demonstrated that our approach is capable of projecting a time-evolving graph into a low-dimensional space while retaining the metastable properties of the system. Moreover, we have illustrated one possible application of this approach to microbiome data, enhancing the analysis of metagenomic data by taking into account the huge number of interactions among species. We have shown that by explaining the output of the model, we can find topological graph features, such as nodes or edges, that make the model arrive at a certain graph embedding. For microbiome data, this means that our method, coupled with a proper interpretation strategy, can help to reveal underlying disease patterns in the data.
There are several directions for future work: 1) how to construct a time-evolving graph from metagenomic data such that it captures the real dynamics occurring in the microbiome; 2) further biological analysis of the results obtained from the interpretability of the model; 3) visualization of the topological graph features, such as nodes and edges, that have impacted the model the most; and 4) a mathematical explanation of how the model learns a graph embedding of the time-evolving graph while maintaining metastable dynamics.

Method
We first briefly introduce all necessary notations and definitions used in the paper and then state the problem.
Definitions. A graph G is a pair (V, E) with a non-empty set of nodes V(G) and a set of edges $E(G) \subseteq V(G) \times V(G)$. The set V(G) often represents the objects in the data and E(G) the relations between objects. We define the adjacency matrix of the graph G as the n × n matrix A with $A_{ij} = 1$ if the edge $(v_i, v_j) \in E(G)$, and 0 otherwise, where n = |V|.
Next, we define the metastability property, which was first mentioned in 14 . Consider a time-evolving graph G as a sequence of graphs G = (G_1, ..., G_T) at consecutive time points {1, ..., T} for some T ∈ N, with G_t being a time-snapshot of G at time t. The time-evolving graph G exhibits metastable behaviour if G can be partitioned into s subsets $G = \mathcal{G}_1 \cup \dots \cup \mathcal{G}_s$ for some $s \ll T$ such that for each time point t ∈ {1, ..., T} and i, j = 1, ..., s with i ≠ j, we have:

$P(G_{t+1} \in \mathcal{G}_i \mid G_t \in \mathcal{G}_i) \approx 1 \quad \text{and} \quad P(G_{t+1} \in \mathcal{G}_j \mid G_t \in \mathcal{G}_i) \approx 0, \qquad (4)$

where P(·) is a transition probability; $\mathcal{G}_1, \dots, \mathcal{G}_s$ are called the metastable states of the time-evolving graph G, and s is the number of states.
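Given ground-truth state labels for the time-snapshots, this metastability condition can be checked empirically by estimating the transition probabilities between states; this is an illustrative diagnostic, not part of the model itself:

```python
import numpy as np

def transition_matrix(state_labels, s):
    """Empirical transition probabilities P(G_{t+1} in state j | G_t in state i).
    For a metastable time-evolving graph, the diagonal entries are close to 1."""
    P = np.zeros((s, s))
    for a, b in zip(state_labels[:-1], state_labels[1:]):
        P[a, b] += 1                        # count observed transitions
    row = P.sum(axis=1, keepdims=True)
    return P / np.where(row == 0, 1, row)   # normalize rows to probabilities

labels = [0] * 50 + [1] * 50                # toy sequence with one transition
P = transition_matrix(labels, s=2)
```

Here the diagonal of `P` is close to 1 and the off-diagonal entries close to 0, exactly the pattern required by (4).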

Problem statement. We define our problem as follows: given a time-evolving graph G = (G_1, ..., G_T) with the assumed metastability property (4), we aim to represent each time-snapshot G_t as a vector in a low-dimensional space R^d, maintaining the metastable behaviour of G, where d is the number of dimensions of the reduced space.
In this section, we describe how we train the model to embed time-snapshot graphs into a low-dimensional space while maintaining the metastable behaviour of the graph. We first use the Transformer 13 to compute the embedding of a time-snapshot graph. Further, we add a recurrent mechanism to the Transformer that facilitates the learning of temporal changes across consecutive time-snapshot graphs. Finally, we use contrastive learning to keep the representations of consecutive time-snapshot graphs, which share metastable behaviour, close to each other.

Algorithm 1: Main learning algorithm
Input: l: the number of historical representations; G = {G_1, ..., G_T}: a time-evolving graph such that each time-snapshot graph G_t = (x_t, A_t); z_init: a randomly initialized learnable master node; R: the Transformer; W, W^(1), W^(2), b: parameters of the model.
L = 0
for each sampled mini-batch of indices {t_k}, k = 1, ..., N do
    z_curr = z_init; z_next = z_init
    for j = 1 to l do
        # time-snapshot graphs at time t
        select consecutive time-snapshot graphs G_t and G_{t+1}
        z_curr = R(x_t, A_t, z_curr); z_next = R(x_{t+1}, A_{t+1}, z_next)
    ĝ_curr = W z_curr + b; ĝ_next = W z_next + b
    # projection for contrastive learning
    g_curr = W^(2) σ(W^(1) ĝ_curr)
    g_next = W^(2) σ(W^(1) ĝ_next)
    update L with the contrastive loss on (g_curr, g_next)

Transformer. The Transformer is currently the state-of-the-art method in the field of NLP, where it has shown tremendous success in handling long-term sequential data. Recently, it has become a leading tool in other domains such as computer vision 33 and graph representation learning 34,35 . We use the encoder part of the Transformer to learn node embeddings in each time-snapshot graph. The encoder has several stacked multi-head attention layers, each followed by a feed-forward layer. There is a residual connection around each of these two sub-layers, followed by a normalization layer.
Intuitively, the self-attention in time-snapshot graphs relates different nodes of the graph in order to compute a new representation of every node in the graph, which we refer to as a node embedding.
Input. Let G = {G_1, ..., G_T} be a time-evolving graph with node features $\{x^v_t\}_{v \in V(G_t)}$, t = 1, ..., T. The input node features of each time-snapshot graph G_t are embedded into d_m-dimensional latent features via a linear projection and added to pre-computed node positional encodings. We demonstrate on two synthetic datasets with different topological structures that our model performs well in both cases: with positional information of nodes and without it. Moreover, in order to capture the topological structure of a single time-snapshot graph G_t, we feed the adjacency matrix A_t to the Transformer as a mask, and we set attention weights to 0 wherever the corresponding adjacency matrix entries are 0.
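The adjacency-mask step can be sketched for a single attention head as follows; masking the pre-softmax scores with −∞ yields attention weights that are exactly 0 on non-edges (a simplified numpy sketch of the mechanism, not the full multi-head model):

```python
import numpy as np

def masked_attention(Q, K, V, A):
    """Single-head self-attention where scores between non-adjacent nodes are
    masked out, so each node only attends to its graph neighbours."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(A > 0, scores, -np.inf)      # mask non-edges
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over neighbours
    return w @ V, w

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
A = np.ones((n, n), dtype=int)    # dense toy graph incl. self-loops,
A[0, 2] = A[2, 0] = 0             # with one edge removed
out, w = masked_attention(X, X, X, A)
```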
Model architecture.Further, we explain important details of the training of the model.The overview of the model architecture can be found in Figure 1.
Let $T = \{t_k\}_{k=1}^N$ be a set of randomly sampled time points, and $G_T = \{G_{t_1}, \dots, G_{t_N}\}$ be a mini-batch of time-snapshot graphs sampled from G. To facilitate the learning of temporal changes, we share the embedding of a time-snapshot graph with the consecutive time-snapshot graph in the temporal sequence. We define a master node that is connected to all nodes in the time-snapshot graph. Initially, the master node is represented as a learnable, randomly initialized vector. The Transformer computes the embedding of the master node, which we consider as the graph embedding. This graph embedding is then passed as the initial master node to the consecutive time-snapshot graph in the temporal sequence. We control the length of the temporal sequence with the hyperparameter l. Moreover, since we connect the master node with all other nodes in each time-snapshot graph, the size of the adjacency matrix changes to $A_t \in \mathbb{R}^{(n+1) \times (n+1)}$.
Formally, we update the graph embedding z t of the time-snapshot graph G t recursively as follows: z t = R(z t−1 , x t , A t ), where R is the Transformer that updates node embeddings as discussed above, z t ∈ R d m is the master node, x t is a vector of node features, and A t ∈ R (n+1)×(n+1) is the adjacency matrix of G t . Finally, we project the graph embedding z t ∈ R d m of the time-snapshot graph G t into the space with dimension d, where a downstream task is defined. We denote the final embedding of the time-snapshot graph by ĝ t ∈ R d : ĝ t = W z t + b, with W and b being learnable parameters.
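The recursion above can be sketched as follows. Here `R` stands in for the Transformer: any callable mapping (z_{t−1}, x_t, A_t) to z_t; the function name and the dummy update used in the usage example are illustrative, not part of the method:

```python
import numpy as np

def evolve_graph_embeddings(snapshots, R, z0, W, b):
    """Recursively embed a temporal sequence of time-snapshot graphs.

    The master-node embedding of snapshot t-1 initializes snapshot t,
    and each z_t is projected to the downstream dimension d.
    """
    z = z0
    g_hats = []
    for X_t, A_t in snapshots:
        z = R(z, X_t, A_t)          # z_t = R(z_{t-1}, x_t, A_t)
        g_hats.append(W @ z + b)    # g_hat_t = W z_t + b
    return g_hats
```

For example, with a toy update `R = lambda z, X, A: z + X.mean(axis=0)` and five snapshots of constant features, the embedding accumulates one mean-feature step per snapshot, illustrating how temporal information flows through the master node.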
We use two hidden layers and a non-linear activation function to project the graph embedding ĝ t into the space where the contrastive learning is defined, as is done in 36 .
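A minimal sketch of such a projection head is given below, assuming ReLU as the non-linearity and a linear output layer (the specific layer sizes and activation are our assumptions; the paper only specifies two hidden layers):

```python
import numpy as np

def projection_head(g, W1, b1, W2, b2, W3, b3):
    """Project the graph embedding into the contrastive space:
    two hidden layers with ReLU activations followed by a linear
    output, mirroring the SimCLR-style head referenced in the text."""
    h1 = np.maximum(W1 @ g + b1, 0.0)   # first hidden layer
    h2 = np.maximum(W2 @ h1 + b2, 0.0)  # second hidden layer
    return W3 @ h2 + b3                 # linear output
```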
Furthermore, we explain how we use contrastive learning to make embeddings of consecutive time-snapshot graphs preserve the metastable behaviour in the low-dimensional space.
Contrastive learning. Intuitively, contrastive representation learning can be considered as learning by comparing. The goal is to find a low-dimensional space where samples from the same instance are pulled closer together and samples from different instances are pushed apart. Formally, given input samples x i , i = 1, . . ., N with corresponding labels y i ∈ {1, . . ., C} among C classes, contrastive learning aims to learn a function f θ (x) that finds a low-dimensional representation of x such that examples from the same class have similar representations and samples from different classes are far away from each other in the new space. One always needs negative and positive samples to apply contrastive learning. For this reason, we make the following assumption.
Assumption. According to the definition of metastability (4), the probability of two consecutive time-snapshot graphs G t and G t+1 being similar is almost 1, and so should be the probability for their graph embeddings ĝ t and ĝ t+1 .
In other words, we consider a pair of graph embeddings ( ĝ t , ĝ t+1 ) as a positive pair and pairs ( ĝ t , ĝ t+τ ), t + τ ∈ {2, . . ., N}, as negative pairs, where τ is randomly sampled. Negative samples may come from the same metastable state, but at different time points.
Given graph embeddings g t and g t+1 for G t and G t+1 as the output of our model, we first compute the cosine similarity between g t and g t+1 : sim t,t+1 = (g t · g t+1 ) / (‖g t ‖ ‖g t+1 ‖).
With the above assumption, diagonal elements of sim represent positive pairs and off-diagonal elements negative pairs. The similarity score is then fed to the Noise Contrastive Estimation (NCE) loss: L = − log [ exp(sim t,t+1 /τ) / ∑ T j=1 exp(sim t, j /τ) ].
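The loss can be sketched as follows for a mini-batch of graph embeddings. This is our illustrative implementation, not the paper's code: the temperature value is arbitrary, and excluding the anchor itself from the denominator is a choice of this sketch.

```python
import numpy as np

def nce_loss(G, temperature=0.5):
    """InfoNCE-style loss over a sequence of graph embeddings G (T x d).

    For each anchor g_t the positive is g_{t+1}; all other embeddings in
    the mini-batch act as negatives. Similarities are cosine similarities.
    """
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    sim = Gn @ Gn.T                           # sim[t, j] = cos(g_t, g_j)
    T = G.shape[0]
    losses = []
    for t in range(T - 1):
        logits = np.array([sim[t, j] / temperature
                           for j in range(T) if j != t])
        pos = sim[t, t + 1] / temperature
        # -log( exp(pos) / sum_j exp(logits_j) )
        losses.append(np.log(np.exp(logits).sum()) - pos)
    return float(np.mean(losses))
```

As a sanity check, a trajectory whose consecutive embeddings stay aligned should incur a lower loss than one that jumps between opposite directions at every step.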
Minimizing this loss function tunes the parameters of the model such that the graph embeddings of two consecutive time-snapshot graphs are as close as possible.

Positional Encoding. Most graph neural networks learn structural node information that is invariant to node positions. However, there are cases where topological information is not enough. To demonstrate this, we conduct experiments on two different synthetic datasets. The first dataset has metastable states with defined graph features that can be distinguished only with the positional information of nodes. Each metastable state in the second dataset has specific graph features, which are easily distinguished with topological information alone.
To incorporate the positional information, we use the same positional encoding as in 37 : p pos,2i = sin(pos/10000 2i/d m ), p pos,2i+1 = cos(pos/10000 2i/d m ), where pos, i and d m denote the position of the node in the time-snapshot graph, the dimension in the positional encoding and the dimension of the node embedding, respectively. Through a set of experiments in the next section, we demonstrate on synthetic and real-world datasets that our method is capable of learning a graph embedding of the time-evolving graph.
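These encodings can be computed vectorized as below (a standard sketch of the sinusoidal scheme; d m is assumed even for simplicity):

```python
import numpy as np

def positional_encoding(num_positions, d_m):
    """Sinusoidal positional encodings:
    p[pos, 2i]   = sin(pos / 10000^(2i / d_m))
    p[pos, 2i+1] = cos(pos / 10000^(2i / d_m))
    """
    pos = np.arange(num_positions)[:, None]
    i = np.arange(0, d_m, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_m)
    P = np.zeros((num_positions, d_m))
    P[:, 0::2] = np.sin(angles)   # even dimensions
    P[:, 1::2] = np.cos(angles)   # odd dimensions
    return P
```

At position 0 the even dimensions are all 0 (sin(0)) and the odd dimensions all 1 (cos(0)), which is a quick way to verify the layout.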

Figure 1. Overview of the method proposed in this paper.

Figure 2. An example of a time-evolving graph with metastability where two states A and B are difficult to distinguish since they are topologically the same. Red dashed edges are removed from the time-evolving graph G.

Figure 3. The evolution of the graph embedding of the time-evolving graph G during the training of our model on (a) npos_2WellGraph and (b) pos_3WellGraph. The points are colored according to ground-truth labels.

Figure 4. The comparison of an initial trajectory sampled from the SDE (1) (top) and the final graph embedding (bottom) for: (a) the npos_2WellGraph dataset and (b) the pos_3WellGraph dataset.

Figure 5. The evolution of the graph embedding of the time-evolving graph G during the training of our model on (a) CholeraInf and (b) MovingPic. The points are colored according to ground-truth labels.

Figure 6. The graph embeddings of the time-evolving graph G for (a) npos_2WellGraph, (b) pos_3WellGraph, (c) MovingPic and (d) CholeraInf. From left to right: PCA on adjacency matrices, PCA on eigenfunctions of graphKKE, and the result of our model.

Figure 7. Fully-connected graphs for the npos_2WellGraph dataset with nodes colored based on the relevance scores of the LRP method, summed across the 2 states: (a) State 1, (b) State 2. The locations of the 2 states are obtained by clustering points of the graph embedding of the time-evolving graph via k-means.

Figure 8. Co-occurrence interaction graphs of the CholeraInf dataset with nodes colored based on the relevance scores of the LRP method, summed across each state: (a) diarrhea period and (b) recovery period. The locations of the states are obtained by clustering points of the graph embedding of the time-evolving graph via k-means. The dark brown colour indicates nodes with the highest importance.

Table 1. Statistics of each dataset used in this paper.

Table 2. Adjusted Rand Index (ARI) for the comparative analysis on the graph clustering task. An ARI close to 1 corresponds to greater accuracy in correctly identifying the ground-truth states, and an ARI value close to 0 indicates random clustering.
Result and Discussion: synthetic data. The evolution of the graph embedding during the training for both synthetic datasets, npos_2WellGraph and pos_3WellGraph, is illustrated in Figure 3. The visualization demonstrates that during