## Main

The biological functions of living systems rely on interactions that dynamically change in response to endogenous and exogenous stimuli. Studying the motion of the components of these systems sets the basis for mechanistic insights to understand health and disease1. Over the past 20 years, microscopy has advanced to the point where it can monitor dynamic processes at multiple scales with unprecedented spatiotemporal resolution. Time-lapse microscopy experiments have unveiled the strategies that unicellular organisms employ to search for food or to avoid adverse conditions, and have helped to understand tissue growth and repair, cancer metastasis, quorum sensing, the emergence of multicellularity and immune responses in multicellular organisms2,3. Fluorescence microscopy has monitored biological motion down to the nanoscale, detailing the diffusion of individual organelles and molecules within the cellular environment and disclosing their role, for example, in the fundamental processes of signalling and function regulation4,5.

The momentous improvement of microscopy acquisition techniques has led to a substantial effort to develop and improve algorithms to automatically extract quantitative information from these experiments6,7. The standard analysis pipeline of tracking-by-detection methods entails the following steps4,7,8: (1) video frames are partitioned (segmentation) and/or otherwise processed to detect and locate objects of interest (detection/localization); (2) detected positions at different times are connected into trajectories (linking); (3) reconstructed trajectories are finally analysed to quantify dynamical parameters (motion characterization). The first two steps are often presented together and referred to as tracking. Several factors complicate the analysis of biological experiments, such as imaging noise, high object density, fusion or splitting events, random and heterogeneous motion, and shape-changing objects. Errors at each step propagate along the pipeline and ultimately impact the extraction of dynamic information.

Numerous algorithmic solutions have been proposed to tackle the limitations of tracking algorithms and their performance has been compared in open challenges6,7. However, most of these methods are specific to a given experiment or dynamic model, and often require manual tuning of parameters. The current deep-learning revolution has fostered the development of various methods for both tracking9,10,11,12,13 and motion characterization14.

Geometric deep learning provides compelling approaches to tackle tracking and motion characterization from a different perspective. It generalizes neural networks to problems that can be described by mathematical objects such as graphs that encode information about the structure of the input15. Deep-learning methods based on graphs are typically referred to as graph neural networks (GNNs)16 and have been successfully applied, for example, to molecular property prediction17, drug discovery18, computer-assisted retrosynthesis19 and human trajectory prediction20. Besides being ubiquitously used in science to represent complex systems21,22, graphs provide a natural and intuitive way to represent the information contained in tracking experiments23,24.

Here we describe a framework for Motion Analysis through GNN Inductive Knowledge (MAGIK), which provides the accurate estimation of dynamical properties from time-lapse microscopy. MAGIK models the system’s motion and interactions through a graph representation. This graph is processed through an interpretable and adaptive attention-based GNN that estimates the associations among the objects and provides insights into the intrinsic dynamics of the systems. We demonstrate the flexibility and reliability of MAGIK by quantifying its performance on real and simulated data corresponding to a broad range of biological experiments. First, we benchmark it on its most natural application, that is, trajectory linking, in a variety of challenging experimental scenarios. Then, we show that MAGIK can estimate local and global dynamical properties without explicit linking even in highly heterogeneous scenarios.

## Results

### MAGIK represents spatiotemporal relations in a graph

MAGIK provides a GNN framework to estimate the dynamical properties of moving objects from time-lapse experiments. MAGIK models the objects’ motion and physical interactions using a graph representation. The details of the algorithm are given in Methods (‘Description of MAGIK’) and Extended Data Fig. 1. In this section, we provide a high-level description of the architecture (Fig. 1).

Graphs can define arbitrary relational structures between nodes connecting them pairwise through edges. When training a GNN, the graph architecture guides the learning process about the objects and their relations by introducing a relational inductive bias16. In MAGIK, each node describes an object detection at a specific time, the edges connect spatiotemporally close objects, and a set of global attributes encodes system-level properties. As an example, for subsequent frames of a cell migration experiment, each detected object (orange crosses in Fig. 1a) is represented as a node with a vector of node features (Fig. 1b). Directed edges with relational features connect each node to objects detected in the future in its proximity (Fig. 1b). There are no intrinsic restrictions on the type or number of descriptors (for example, location and morphological features, image-based quantities, biological events, interaction strength, distance, direction) that can be encoded in the graph feature representation. The basic graph relational structure is established through a set of rules that link nodes pairwise based on distance metrics between features. Node and edge features are encoded through learnable functions implemented by neural networks (Fig. 1c,d). An extra learnable token is added to aggregate global attributes from the whole graph25,26,27 (Fig. 1e).

The graph is processed through a sequence of attention-based fingerprinting graph neural networks (FGNNs; see also Methods ‘Description of MAGIK’ and Extended Data Fig. 1) that propagate information via message-passing steps (Fig. 1f–i). The relational inductive knowledge implemented in the graph structure sketches a network of redundant object associations. The objective of the FGNN is to modulate the association strength to identify the edges that most strongly influence the dynamic properties of each object. For this, the FGNN implements two mechanisms that combine information from multiple objects at the local and global levels. The first mechanism acts when aggregating edge features to a node (equation (3)). The contribution of each edge is weighted according to the distance between the connected nodes through a function with learnable parameters (equation (2)), thus defining a learnable local receptive field28,29. The second is a gated self-attention mechanism30 that acts when updating the latent representation of nodes (equation (4)). The node update operation also involves information from beyond each node’s topological neighbourhood, thus effectively expanding the receptive field to objects that, although not physically connected, can offer relevant information about the overall dynamics. The FGNN further updates the extra token for global attributes using information from all the nodes; thus, this extra token serves as an antenna to provide system-level insights.

The output of the FGNN is decoded by the last block of the GNN into an output graph, whose nodes, edges, and global attributes can be used to solve specific problems (Fig. 1j–l).

We benchmark MAGIK performance on a classical trajectory linking task, consisting of establishing temporal associations between identified objects. The graph structure includes a redundant number of edges with respect to the actual associations between objects. MAGIK aims to prune the wrong edges while retaining the true connections. We thus model this task as an edge-classification problem with a binary label (linked/unlinked) by minimizing the binary cross-entropy. From the predicted edge features, trajectories are built through a postprocessing algorithm that eliminates spurious connections (Methods ‘Postprocessing algorithm for trajectory linking’).
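The edge-classification formulation can be illustrated with a minimal sketch. It assumes a toy set of candidate edges with hypothetical predicted link probabilities, and it uses a plain 0.5 threshold as a stand-in for the postprocessing algorithm described in Methods, which is not reproduced here. The function names are illustrative only.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 edge labels and predicted link probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def prune_edges(edges, p_pred, threshold=0.5):
    """Keep candidate edges whose predicted link probability exceeds the threshold.

    The fixed threshold is an assumption of this sketch, not the paper's postprocessing.
    """
    return edges[p_pred > threshold]

# Hypothetical candidate edges (source node, target node), labels and predictions.
edges = np.array([[0, 2], [0, 3], [1, 2], [1, 3]])
labels = np.array([1.0, 0.0, 0.0, 1.0])   # 1 = linked, 0 = unlinked
probs = np.array([0.9, 0.2, 0.1, 0.8])    # toy network outputs

loss = binary_cross_entropy(labels, probs)
linked = prune_edges(edges, probs)
```

In this toy case the redundant edges (0, 3) and (1, 2) are discarded and the two true associations are retained.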

To test MAGIK, we use the silver-standard segmentation datasets provided for the training of the sixth edition of the Cell Tracking Challenge7. A representative segmentation of the dataset DIC-C2DH-HELA, corresponding to HeLa cells on a flat glass imaged through differential interference contrast, is shown in Fig. 2a. From the segmentation, we calculate the mean pixel intensity, area, perimeter, eccentricity and solidity of the segmented objects, which we use as input node features. The Euclidean distance between neighbouring objects is used as the sole edge feature. To limit memory usage, we generate graphs by drawing edges only between objects within a limited spatial and temporal reach (Fig. 2b).

The DIC-C2DH-HELA dataset presents several challenges, such as the heterogeneity in cell shape and dynamics as a consequence of migration and proliferation (Fig. 2g). Examples of ground-truth and predicted graphs are shown in Fig. 2c,e; they are in good agreement, as confirmed by an F1 score of 99.4% in edge prediction. For the evaluation of performance at the trajectory level (Fig. 2d,f), we calculated the tracking accuracy measure (TRA), a normalized weighted distance between the tracking prediction and the reference tracking ground truth31 (Methods, ‘Quantification of cell-tracking results’). MAGIK reached TRA = 99.2%, demonstrating its ability to correctly follow objects despite shape changes and cell divisions (Fig. 2g and Supplementary Video 1).

We applied MAGIK to several other datasets of the 6th Cell Tracking Challenge, obtaining outstanding results for different microscopy techniques and cell types. Representative video frames with segmentation are shown in Fig. 3. A strict objective comparison of MAGIK’s linking capability with other methods is limited by the fact that different algorithms rely on different segmentations, whose errors influence linking and thus indirectly affect the value of the TRA metric. Even so, MAGIK obtained TRA values that are competitive with, and often superior to, those of the best-in-class methods of the 6th Cell Tracking Challenge. The influence of segmentation on linking is particularly relevant when dealing with touching or overlapping cells; thus, specific isolation strategies might be adopted to prevent over- or under-segmentation32,33,34,35. While we attempted to account for these errors through the augmentation procedure, segmentation errors could produce systematic changes in node features that might impair the linking performance.

### MAGIK quantifies motion parameters without trajectory linking

In most applications, the ultimate objective of tracking is the characterization of the dynamics of the systems under investigation to gain insights into their underlying biological mechanisms. In this process, trajectory linking is often just an intermediate step necessary to obtain meaningful information from the data, but not the end goal itself.

MAGIK can characterize essentially any dynamic aspect without requiring the actual linking, owing to its capability of accounting for the whole spatiotemporal complexity contained in the associations between objects at multiple scales. Such linking-free analysis produces a twofold advantage. First, it bypasses the error-prone linking step, thus inherently preventing linking errors from propagating to the quantification of the ultimately relevant parameters. Second, it enables the analysis of experiments for which linking cannot be performed due to, for example, a high object density or low signal-to-noise ratio.

We apply MAGIK to analyse simulated data reproducing the diffusion of fluorescently labelled single molecules such as lipids or receptors in the plasma membrane of living cells (Fig. 4). We first consider the task of determining the diffusion coefficient from a heterogeneous ensemble of diffusing objects (Fig. 4a). We feed the network the centroid coordinates and the intensity of the localized fluorescence spots as node features and the Euclidean distance between neighbouring centroids as the edge feature. We define the problem as a node regression and minimize the mean absolute error (MAE). The target feature is the displacement scaling factor $$\sqrt{2D}$$, with D being the diffusion coefficient. Graphs are built by connecting localized objects with neighbours in space and time (Fig. 4b). Ground-truth and predicted graphs are shown in Fig. 4b,c, respectively. All the edges of the graph structure are drawn, representing the network of associations used to infer dynamic properties without direct linking. Nodes are colour-coded according to the value of the displacement scaling factor $$\sqrt{2D}$$. Their visual comparison suggests excellent agreement, further confirmed by the quantification in Fig. 4d. Notably, the same architecture can be applied at the single-trajectory level, opening interesting perspectives for the detection of dynamic changes and trajectory segmentation (Extended Data Fig. 2). The same approach can also be extended to estimate other parameters. In Extended Data Fig. 3a–d, we show the results of its application to the inference of the scaling exponent for objects undergoing anomalous diffusion, achieving similarly good results. It is also interesting to note that MAGIK’s performance for node regression is less influenced by crowding than the performance for the linking task (Extended Data Fig. 4).
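To make the regression target concrete: for Brownian motion, the displacement along each coordinate over a time step Δt is Gaussian with standard deviation $$\sqrt{2D\Delta t}$$, which is why $$\sqrt{2D}$$ acts as a displacement scaling factor. The following minimal sketch simulates such a track and recovers D from its displacements. The simulation settings and function names are assumptions made for illustration, and the estimator shown is a conventional one that requires a linked trajectory, unlike MAGIK.

```python
import numpy as np

def simulate_brownian(n_steps, D, dt=1.0, n_dim=2, rng=None):
    """Simulate a Brownian track: each step is Gaussian with std sqrt(2 * D * dt) per axis."""
    rng = np.random.default_rng() if rng is None else rng
    steps = np.sqrt(2.0 * D * dt) * rng.standard_normal((n_steps, n_dim))
    return np.cumsum(steps, axis=0)

def estimate_D(track, dt=1.0):
    """Estimate D from the per-axis variance of single-step displacements."""
    disp = np.diff(track, axis=0)
    return float(np.mean(disp ** 2) / (2.0 * dt))

rng = np.random.default_rng(0)
track = simulate_brownian(20000, D=0.5, rng=rng)   # hypothetical units: px^2 per frame
D_hat = estimate_D(track)
scaling_factor = np.sqrt(2.0 * D_hat)              # displacement scaling factor sqrt(2D)
```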

Fluorescence microscopy experiments for object tracking must ensure that the number of visualized molecules is low enough to unambiguously link the trajectories; thus, they are often performed at low labelling density4. However, these conditions are not optimal for probing the interactions between particles, and they make it difficult to infer spatial patterns of diffusion36. Enabling the quantification of diffusion properties without linking offers the possibility to process high-density videos to determine the underlying topology and spatial heterogeneity.

As an example, we used MAGIK to resolve a spatially modulated landscape with diffusion continuously varying over more than two orders of magnitude from the localizations of diffusing particles (Fig. 4e–h), treating the problem as a node feature regression, as above. At a number density of about 0.02 px⁻², MAGIK is capable of correctly retrieving the spatial map of D (Fig. 4f). Remarkably, most spatial features can already be resolved with a 100-frame-long video (Fig. 4g). The spatial resolution of the predicted map can be further improved using longer videos (1,500 frames; Fig. 4h), a duration typical of single-molecule fluorescence microscopy experiments for measuring diffusion4.

### MAGIK quantifies global dynamic properties

We applied MAGIK to directly extract ensemble information through the inference of global attributes, skipping the trajectory linking step. We considered two biologically relevant scenarios. First, we analysed fluorescence microscopy experiments in which objects in the same video undergo diffusion according to different microscopic models (namely, fractional Brownian motion (FBM), annealed transient time motion (ATTM) and continuous-time random walk (CTRW); Fig. 5a–e). Although these diffusion models can give rise to anomalous diffusion, in this example they are parameterized to have the same scaling of the mean-squared displacement as Brownian motion (α = 1)14. Graphs are built as described above using centroid coordinates and intensity of the localized fluorescence spots as node features and the Euclidean distance between neighbouring centroids as the edge feature. MAGIK estimates the relative fraction of objects in each category, which varies from experiment to experiment, as a regression problem by minimizing the MAE of the global attribute. The results are summarized in Fig. 5a–e, showing an outstanding accuracy in predicting the correct fractions, even when the number of objects performing the same class of motion is very low. In Extended Data Fig. 3e–h, we further demonstrate that the same approach can also estimate the fraction of objects moving according to different diffusion modes (subdiffusion with α < 1, normal diffusion with α = 1 and superdiffusion with α > 1).

The second example refers to simulations of holographic imaging of microorganisms diffusing in a liquid environment, such as plankton (Fig. 5f–k). We model diffusion as either FBM (Fig. 5f,g), ATTM (Fig. 5h,i) or CTRW (Fig. 5j,k) with α = 1. Objects in the same experiment follow the same physical model but with random diffusivity. Centroid three-dimensional coordinates, mean intensity, area and refractive index of the objects are used as node features in a classification problem to determine the common diffusion model of the objects in the same video, encoded as a global attribute. As shown in Fig. 5l, MAGIK correctly classifies the diffusion model even with largely overlapping objects. We find this result quite remarkable (as is the one illustrated in Fig. 5e) since, for α = 1, all models converge to Brownian motion and share many statistical properties, making their classification rather challenging even when linked trajectories are available14. We believe that MAGIK achieves this capability by detecting the fingerprint of each model’s generative dynamics at the microscopic level.

Last, we explore MAGIK’s performance for quantifying anomalous diffusion through the estimation of the exponent α (ref. 14) from a sequence of holographic images reproducing the motion of microorganisms. All the objects in the same video undergo FBM with random diffusivity and the same exponent α, varying from sequence to sequence (Fig. 5m). Also in this case, MAGIK provides remarkable results (MAE = 0.11) from short videos (about 50 frames) containing only a few objects.
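For context, when linked trajectories are available, α is conventionally estimated from the scaling of the mean-squared displacement with lag time, MSD(τ) ∝ τ^α. The sketch below illustrates that conventional, linking-dependent estimator on a simulated Brownian track, for which α should be close to 1. It is not part of MAGIK, which infers α without linking, and the function names are hypothetical.

```python
import numpy as np

def time_averaged_msd(track, lags):
    """Time-averaged mean-squared displacement of one track at the given integer lags."""
    return np.array([
        np.mean(np.sum((track[lag:] - track[:-lag]) ** 2, axis=1)) for lag in lags
    ])

def fit_alpha(track, max_lag=10):
    """Slope of log(MSD) versus log(lag), an estimate of the anomalous exponent alpha."""
    lags = np.arange(1, max_lag + 1)
    msd = time_averaged_msd(track, lags)
    slope, _intercept = np.polyfit(np.log(lags), np.log(msd), 1)
    return float(slope)

rng = np.random.default_rng(0)
# Brownian track built from independent Gaussian steps (an illustrative test case).
brownian_track = np.cumsum(rng.standard_normal((20000, 2)), axis=0)
alpha_hat = fit_alpha(brownian_track)
```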

### Interpreting MAGIK

To determine the mechanisms that contribute most to MAGIK’s performance, we evaluated models obtained by ablating key components of the MAGIK architecture (Extended Data Table 1). As a baseline, we consider a version of MAGIK lacking both the learnable local receptive field and the gated self-attention, and compare it with a model without gated self-attention and with the full MAGIK. As shown in Extended Data Table 1, both components contribute to improving the performance, but their relative weight depends on the specific task. The learnable local receptive field seems to have little influence on trajectory linking and on the estimation of local diffusion properties. In these cases, the gated self-attention significantly affects the results, as both problems can benefit from information originating from nodes that are distant in time and/or space. In contrast, the learnable local receptive field is responsible for most of the gain in performance when inferring the fraction of objects performing different kinds of motion. Differences between diffusion modes can in fact be detected from short-time displacements, reducing the contribution of the gated self-attention to this task.

The relative distance between objects, encoded in MAGIK as an edge feature, is crucial to tackling the tasks presented in this work. In fact, we first attempted to establish a baseline through an ablated model without edge features. When trained for the node-regression problem, such a model did not converge and produced an MAE ≈ 0.2, compatible with random predictions of the diffusion coefficient.

We also explored positional encoding to provide distance-aware information, by appending Laplacian eigenvectors to node features27,37. However, Laplace positional encoding did not produce any improvement in this architecture. To further explore the role of edge and node features in MAGIK, we trained a model where distances between detected objects are encoded as edge features but absolute coordinates are removed as node features. Interestingly, this model shows degradation of performance with an MAE = 0.0748 ± 0.0115 (compared with MAE = 0.0538 ± 0.0022 for the complete model), pointing towards the importance of absolute coordinates for capturing spatial dynamics through the calculation of further parameters (for example, directionality) beyond Euclidean distance.

The use of gated self-attention offers a feature-wise discriminatory power to the node update operation as it weights individual features of the attention node embedding with respect to their importance to the overall graph structure. Through this mechanism, MAGIK identifies only the meaningful features of each node. This leads MAGIK to apply non-uniform attention over the neighbourhood38 and to differentially consider the contribution from nodes of the same trajectory with respect to other neighbouring nodes belonging to different objects (Fig. 6).

### Related works

Among the tasks explored in this work, trajectory linking in biological systems is undoubtedly the most popular and has been tackled with a variety of methods6,7. These methods typically employ Kalman filtering39, multiframe and/or multitrack optimization based on greedy algorithms that approximate the multiple-hypothesis tracking solution36,40,41 or combinatorial optimization42. Most of these approaches offer their best performance when knowledge of the motion is explicitly used6.

Recently, deep-learning approaches have also been introduced to track biological objects using recurrent neural networks43,44 and long short-term memory networks45. From a computer vision point of view, trajectory linking is equivalent to what is generally referred to as data association in multi-object tracking. In this context, several approaches have used GNNs to solve data association as an edge-classification problem46 or to jointly optimize detection and data association47,48. More generally, the problem of classifying nodes in a graph has also been tackled using spectral graph convolutional neural networks49. Very recently, the extension of the standard transformer to the graph domain has produced state-of-the-art performance on a wide range of tasks27.

MAGIK jointly processes spatial and temporal information in a static GNN. However, other approaches treat space and time differently50,51. In MAGIK, similar to other architectures, edge features are used together with node features in the aggregation of each node25,49. In these cases, unless a message-passing framework16,25 or an attention layer27 is used, the edge features only propagate to the associated node. As MAGIK leverages edge information through a message-passing framework, we compared its performance with other methods using the same mechanism, namely a message-passing neural network25 and a gated graph sequence neural network52. To assess differences in performance between the use of global and masked attention, we also compared MAGIK with a message-passing neural network having a graph transformer37 as a node update function. The results of the comparison are shown in Extended Data Table 1. For all the tasks and datasets, these methods perform better than or in line with the baseline but are outperformed by the full MAGIK. The performance gain of MAGIK is particularly striking for the node-regression task, once more stressing the importance of the attention layer for this problem.

## Discussion

MAGIK is a versatile framework for the characterization of dynamic properties from time-lapse microscopy that exploits geometric deep-learning capability to capture the full spatiotemporal complexity of biological experiments. MAGIK strongly relies on an attention-based GNN that can extract dynamic parameters from image-based features by assuming relational constraints between objects.

The examples analysed in this work highlight the wide versatility of MAGIK in different biological contexts. Remarkably, the same architecture can be applied to investigate other observables, can be trained to simultaneously estimate several parameters, and can even be used for applications beyond time-lapse microscopy, where time is substituted by another variable.

MAGIK provides a key enabling technology to estimate dynamic parameters from segmentation/localization in a completely linking-free fashion, whereas other methods require some level of knowledge about the linking between objects53,54. As such, it provides a powerful solution for those experiments where trajectory linking cannot be reliably performed, for example, as a consequence of high object density or probe blinking.

## Methods

### Description of MAGIK

The input to MAGIK is the graph representation of the movement and interactions of an ensemble of objects. The nodes ($${{{\mathcal{V}}}}$$) contain features encoding meaningful information about the objects, whereas the edges ($${{{\mathcal{E}}}}$$) connect spatiotemporally neighbouring nodes codifying relational features, such as the Euclidean distance between them (Fig. 1a,b). To improve efficiency, each node is connected to a limited number of spatial neighbours at subsequent frames. This is achieved through the choice of two parameters that set the maximum distance at which nodes must be located in space and time to be connected. We thus obtain a directed graph that is generally not fully connected. For most of the examples discussed in this work, we connect nodes within a maximum distance equal to 20% of the full field of view and up to 2 frames in the future. Exceptions are the reconstruction of the spatially varying diffusion map for which the maximum distance was set to 12% of the full field of view and the time-varying diffusion (segmentation) for which we connected nodes up to 4 frames in the future.
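The connectivity rule described above can be sketched as follows. This is a minimal illustration rather than the actual MAGIK implementation: it assumes centroid coordinates normalized to the field of view (so that a maximum distance of 0.2 corresponds to 20% of the field of view), and `build_edges` is a hypothetical helper name.

```python
import numpy as np

def build_edges(positions, frames, max_dist, max_frame_gap):
    """Connect each node i to nodes j detected in later frames.

    An edge i -> j is drawn when 0 < frame_j - frame_i <= max_frame_gap and the
    Euclidean distance between the centroids is at most max_dist.
    Returns the edge index array (n_edges, 2) and the distance of each edge.
    """
    edges, dists = [], []
    for i in range(len(frames)):
        frame_gap = frames - frames[i]
        d = np.linalg.norm(positions - positions[i], axis=1)
        targets = np.where((frame_gap > 0) & (frame_gap <= max_frame_gap) & (d <= max_dist))[0]
        for j in targets:
            edges.append((i, int(j)))
            dists.append(float(d[j]))
    return np.array(edges, dtype=int).reshape(-1, 2), np.array(dists)

# Toy detections: two objects seen in frames 0 and 1, one of them also in frame 2.
positions = np.array([[0.10, 0.10], [0.80, 0.80], [0.12, 0.11], [0.82, 0.79], [0.15, 0.13]])
frames = np.array([0, 0, 1, 1, 2])
edges, dists = build_edges(positions, frames, max_dist=0.2, max_frame_gap=2)
```

The resulting graph is directed (edges always point forward in time) and not fully connected: the two distant objects never share an edge, while the detection in frame 0 of the first object is connected to its detections in both frame 1 and frame 2.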

The architecture comprises three main blocks. First, an encoder neural network $${\phi }_{v}:{{\mathbb{R}}}^{l}\to \,{{\mathbb{R}}}^{l{\prime} }$$ converts each node feature representation $$\bf{{v}}_{i}\in {{{\mathcal{V}}}}$$ of dimension l into an l′-dimensional feature representation $${\bf{{v}}_{i}^{{\prime}} }$$ (Fig. 1c). In parallel, another encoder neural network function $${\phi }_{e}:{{\mathbb{R}}}^{f}\to {{\mathbb{R}}}^{f{\prime} }$$ transforms each edge feature $$\bf{{e}}_{k}\in {{{\mathcal{E}}}}$$ into a high-level feature vector $${\bf{{e}}_{k}^{{\prime}} }$$ of dimension $${f}^{{\prime} }$$ (Fig. 1d). ϕv and ϕe are a series of multilayer perceptrons (MLPs) composed of a linear layer followed by a Gaussian error linear unit55 as activation function and layer normalization.

Second, the resultant graph representation $${{{\mathcal{G}}}}=\left\{{{{{\mathcal{V}}}}}^{{\prime} },{{{{\mathcal{E}}}}}^{{\prime} }\right\}$$ (Fig. 1e) is processed through repeated fingerprinting graph blocks (FGNN, described in detail in Extended Data Fig. 1). Each FGNN updates each edge in the graph by applying an MLP to the concatenation of the features of two neighbouring nodes and their connecting edge, that is

$$\bf{e}_{ij}^{{\prime\prime} }={{{\rm{MLP}}}}\left(\left[\bf{v}_{i}^{\prime} ,\bf{v}_{j}^{\prime},\bf{e}_{ij}^{\prime}\right]\right)$$
(1)

for $$j\in {{{{\mathcal{N}}}}}_{i}$$, where $${{{{\mathcal{N}}}}}_{i}$$ is the neighbourhood of node i, and [·] represents the concatenation operation (Extended Data Fig. 1b). Subsequently, the learned representation $$\bf{e}_{ij}^{{\prime\prime} }$$ (of dimension f′) is weighted by a Gaussian attention mechanism

$${w}_{ij}=\exp \left(-{\left(\frac{{d}_{ij}^{2}}{2{\sigma }^{2}}\right)}^{\beta }\right)$$
(2)

where dij is the Euclidean distance between the centroids of the nodes i and j, and the standard deviation σ and the Gaussian order β are learnable parameters that allow the FGNN to adapt to varied object dynamics (Extended Data Fig. 1c,d). The FGNN computes a local representation for the topological neighbourhood $${{{{\mathcal{N}}}}}_{i}$$ by applying a linear transformation to the concatenation of the current state of node i and the aggregate of the weighted edge features, according to

$${\tilde{\bf{h}}}_{i}={{{{\bf{W}}}}}_{\tilde{H}}\left[\bf{v}_{i}^{{\prime} },\mathop{\sum}\limits_{j\in {{{{\mathcal{N}}}}}_{i}}{w}_{ij} \bf{e}_{ij}^{{\prime\prime} }\right]$$
(3)

where $${{{{\bf{W}}}}}_{\tilde{H}}$$ is an $${l}^{{\prime} }\times ({l}^{{\prime} }+{f}^{{\prime} })$$ linear projection matrix. The $$\tilde{\bf{h}}_i$$ are stored in a local representation matrix $$\tilde{\bf{H}}$$. Importantly, we prepend to this matrix a learnable node embedding $$\bf{U}\in {{\mathbb{R}}}^{{l}^{{\prime} }}$$ through row-wise concatenation, that is, $$\bf{H} =\left[\bf{U};\tilde{\bf{H}}\right]$$, whose state serves as a graph-level representation (Extended Data Fig. 1e)26. Finally, gated self-attention layers30 are used to update the hidden states of the node features

$$\begin{array}{rcl}{{{{{\bf{V}}}}}}^{{\prime\prime} \left(z\right)}&=&{{{{\mathrm{attn}}}}}^{\left(z\right)}\left(\mathbf{H}\right)\\ &=&{{{{\mathbf{G}}}}}^{\left({{{z}}}\right)}\odot \left({{{\mathrm{softmax}}}}\left(\frac{1}{\sqrt{c}}{{{{\mathbf{Q}}}}}^{\left({{{z}}}\right)}{{{{{\mathbf{K}}}}}^{\left({{{z}}}\right)}}^{\top }\right){{{{\mathbf{P}}}}}^{\left({{{z}}}\right)}\right)\end{array}$$
(4)

where z = 1, …, Z, with Z representing the number of attention heads; $$\mathbf{Q}^{(z)}=\mathbf{H}\mathbf{W}_{Q}^{(z)}$$, $$\mathbf{K}^{(z)}=\mathbf{H}\mathbf{W}_{K}^{(z)}$$ and $$\mathbf{P}^{(z)}=\mathbf{H}\mathbf{W}_{P}^{(z)}$$ are the query, key and value embedding matrices of dimension c, obtained through the $${l}^{\prime}\times {l}^{\prime}$$ linear projection matrices $$\mathbf{W}_{Q}^{(z)}$$, $$\mathbf{W}_{K}^{(z)}$$ and $$\mathbf{W}_{P}^{(z)}$$, respectively; $$\mathbf{G}^{(z)}=\sigma \left(\mathbf{H}\mathbf{W}_{G}^{(z)}\right)$$ is the gate vector parameterized by the linear projection matrix $$\mathbf{W}_{G}^{(z)}\in {\mathbb{R}}^{{l}^{\prime}\times {l}^{\prime}}$$, followed by an element-wise sigmoid function σ; $$\odot$$ denotes the Hadamard product; and softmax normalizes the self-attention weights to be positive and add up to 1. The multi-head outputs $$\mathbf{V}^{\prime\prime (z)}$$ are concatenated and passed through an MLP to capture nonlinear interactions between the node features to provide the set of updated node embeddings $${\mathcal{V}}^{\prime\prime}$$ (Extended Data Fig. 1f). Note that $$\mathbf{U}^{\prime}$$ needs to be retrieved from $${\mathcal{V}}^{\prime\prime}$$ to obtain the updated global features.
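Equations (2) to (4) can be illustrated with a small NumPy sketch of a single attention head. It is a simplified reading of the equations, not the released MAGIK code: the weights are random rather than learned, σ and β are fixed by hand, the first column of `edges` is assumed to hold the aggregating node i, and the multi-head concatenation and final MLP are omitted.

```python
import numpy as np

def gaussian_weight(d, sigma, beta):
    """Equation (2): w = exp(-(d^2 / (2 sigma^2))^beta)."""
    return np.exp(-((d ** 2) / (2.0 * sigma ** 2)) ** beta)

def aggregate(v, e, edges, d, sigma, beta, W_h):
    """Equation (3): project [v_i, sum_j w_ij e_ij] with W_h of shape (l, l + f)."""
    agg = np.zeros((v.shape[0], e.shape[1]))
    w = gaussian_weight(d, sigma, beta)
    # Sum the weighted edge features over the neighbourhood of each node i.
    np.add.at(agg, edges[:, 0], w[:, None] * e)
    return np.concatenate([v, agg], axis=1) @ W_h.T

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the maximum for numerical stability
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def gated_self_attention(H, W_q, W_k, W_p, W_g):
    """Equation (4) for one head: G * (softmax(Q K^T / sqrt(c)) P)."""
    Q, K, P = H @ W_q, H @ W_k, H @ W_p
    G = 1.0 / (1.0 + np.exp(-(H @ W_g)))     # element-wise sigmoid gate
    c = Q.shape[1]
    A = softmax(Q @ K.T / np.sqrt(c), axis=-1)
    return G * (A @ P), A

rng = np.random.default_rng(0)
l, f = 4, 3                                   # toy latent dimensions (96 in the paper)
v = rng.standard_normal((5, l))               # encoded node features
edges = np.array([[0, 2], [0, 4], [1, 3], [2, 4]])
e = rng.standard_normal((len(edges), f))      # edge representations after equation (1)
d = np.array([0.02, 0.06, 0.02, 0.04])        # centroid distances of the edges

W_h = rng.standard_normal((l, l + f))
H_tilde = aggregate(v, e, edges, d, sigma=0.05, beta=1.0, W_h=W_h)

U = np.zeros((1, l))                          # zero-initialized global token
H = np.concatenate([U, H_tilde], axis=0)      # H = [U; H_tilde]

W_q, W_k, W_p, W_g = (rng.standard_normal((l, l)) for _ in range(4))
V_new, A = gated_self_attention(H, W_q, W_k, W_p, W_g)
U_new = V_new[0]                              # updated global token, read from the first row
```

Because the attention matrix `A` is dense, every node (and the global token in the first row) receives information from all other nodes, not only from its topological neighbours, while the Gaussian weights restrict the edge aggregation to a local receptive field.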

Third, the final node ($${{{{\mathcal{V}}}}}^{{\prime\prime} }$$), edge ($${{{{\mathcal{E}}}}}^{{\prime\prime} }$$) and global features ($$\bf{U}'$$) are decoded to obtain node, edge and global-level predictions. The node features $${{{{\mathcal{V}}}}}^{{\prime\prime} }$$ are processed using the decoding neural network φv to obtain predictions for nodes. Similarly, the decoder neural network φe receives $${{{{\mathcal{E}}}}}^{{\prime\prime} }$$ and yields a prediction for each edge in the graph. φv and φe mirror the encoder networks ϕv and ϕe, respectively, with an additional (prediction) layer comprising a linear transformation followed by an output activation function (for example, softmax or logistic sigmoid for classification problems, or linear activation for regression tasks). To compute global attributes, $$\bf{U}'$$ is processed by φu, an MLP followed by a linear layer and a task-dependent nonlinear activation.

To demonstrate the versatility of MAGIK, we use the same model architecture for all examples. The encoding neural networks ϕv and ϕe consist of a series of MLPs of dimensions 32, 64 and 96, respectively. The latent dimension for nodes and edges (that is, l′ = f′ = 96) is maintained across two FGNN layers in the trunk of the network and is chosen such that it is divisible by the number of self-attention heads in each layer (Z = 6 or Z = 12). The global embedding vector U is zero-initialized. The node and edge decoding neural networks φv and φe consist of three MLPs of dimensions 96, 64 and 32, followed by a final linear layer and an activation function that map the decoded node and edge features to the output dimension. φu consists of a 64-dimensional MLP followed by a linear output layer and an activation function that returns the global-level predictions.
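As a rough illustration of the encoder described above, the sketch below stacks three blocks of dimensions 32, 64 and 96, each consisting of a linear layer, a Gaussian error linear unit and layer normalization. It is only a forward pass with random weights; the tanh approximation of the GELU, the weight initialization and the omission of the learnable scale and offset of layer normalization are simplifying assumptions of this example.

```python
import numpy as np

def gelu(x):
    """Gaussian error linear unit (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance (no learnable parameters)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_encoder(in_dim, dims=(32, 64, 96), rng=None):
    """Create random weights for a stack of linear layers with the given output dimensions."""
    rng = np.random.default_rng() if rng is None else rng
    params, prev = [], in_dim
    for dim in dims:
        W = rng.standard_normal((prev, dim)) / np.sqrt(prev)
        params.append((W, np.zeros(dim)))
        prev = dim
    return params

def encode(x, params):
    """Apply linear -> GELU -> layer normalization for each block in turn."""
    for W, b in params:
        x = layer_norm(gelu(x @ W + b))
    return x

rng = np.random.default_rng(0)
# Five node features, as in the linking task (mean intensity, area, perimeter, eccentricity, solidity).
node_params = init_encoder(in_dim=5, rng=rng)
nodes = rng.standard_normal((10, 5))
latent = encode(nodes, node_params)   # shape (10, 96): one 96-dimensional embedding per node
```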

### MAGIK training

Once the network architecture is defined, MAGIK is trained using a set of graph feature representations and task-dependent targets. The input graphs follow the same relational structure regardless of the task, with nodes describing object detections and edges connecting the objects in time and space. Targets, in turn, represent different parameters depending on the specific task.

For trajectory linking (Figs. 2 and 3), MAGIK is trained to predict the probability of having a connection/link between two objects. This task is modelled as an edge-classification problem with a binary label (linked, labelled with 1, or unlinked, labelled with 0). Thus, during training, the network aims to minimize the binary cross-entropy between the predicted probabilities and the ground-truth label for each edge. Accordingly, φe uses a sigmoid function as the final activation to produce probability estimates. For the training of the linking task, we use a single annotated video for each dataset/cell type, from which we stochastically extract 512 samples as sequences of consecutive frames with a duration of 10% to 20% of the whole video duration. Graphs are created using features obtained from video segmentation as described in Fig. 1a. Object coordinates are augmented by translations, rotations and mirroring. Other object descriptors are augmented by adding random noise to their values. To account for missed detections, we assign nodes a random number between 0 and 1 and remove those with values smaller than 0.05, together with the associated connections. For all the trajectory linking examples, the network is trained for 100 epochs using the 512 training samples processed in batches of 8 samples per iteration. Network performance was evaluated on samples extracted from videos different from those used for training.
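The sampling, missed-detection augmentation and loss described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the function names (`sample_window`, `drop_nodes`, `binary_cross_entropy`), the chain-shaped toy graph and the example probabilities are hypothetical and do not reproduce the actual MAGIK implementation.

```python
import math
import random

def sample_window(n_frames, rng=random):
    """Pick a run of consecutive frames lasting 10% to 20% of the video."""
    length = max(1, round(n_frames * rng.uniform(0.10, 0.20)))
    start = rng.randint(0, n_frames - length)
    return start, start + length  # frames [start, end)

def drop_nodes(nodes, edges, drop_prob=0.05, rng=random):
    """Remove nodes whose random draw is below drop_prob, plus their edges."""
    kept = {n for n in nodes if rng.random() >= drop_prob}
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return sorted(kept), kept_edges

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between link probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

rng = random.Random(1)
nodes = list(range(100))                 # toy detections
edges = [(i, i + 1) for i in range(99)]  # toy candidate links
kept, kept_edges = drop_nodes(nodes, edges, rng=rng)
start, end = sample_window(100, rng=rng)
loss = binary_cross_entropy([0.9, 0.2, 0.7], [1, 0, 1])
```

Removing a node also removes every edge that touches it, so the augmented graph never contains dangling connections.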

The inference of local properties is modelled as a node-regression problem (Fig. 4a–d), where MAGIK is trained to minimize the MAE between node predictions and ground truth. As a target feature, we use either the diffusion coefficient (Fig. 4b–d) or the anomalous diffusion exponent (Extended Data Fig. 3b–d) of the object at the node level. Here, φv uses a linear activation function as the output activation. For the training, we generate a dataset of 2,000 videos with a duration of 50 to 55 frames corresponding to heterogeneous sets of moving objects (further details are provided in the ‘Simulations’ section). At each epoch, 1,024 samples are randomly extracted from the training dataset and their graph representations are augmented by translations, rotations and mirroring of the nodes’ centroids. The network was trained for 100 epochs, processing the 1,024 samples in batches of 8 per iteration. Network performance was evaluated on independently simulated samples generated with a random seed different from that used for training.

The quantification of global dynamic properties requires MAGIK to be trained to estimate global-level attributes from the input graphs. We have approached this problem from different perspectives: a classification problem to determine the underlying diffusion model of a set of particles (Fig. 4e–l) and a regression problem to estimate the relative fraction of objects moving according to different diffusion modes (Extended Data Fig. 3e–h). For classification tasks, the network is trained to minimize the sparse categorical cross-entropy between class predictions and ground-truth labels, with a softmax as the output activation of φu. For regression tasks, MAGIK minimizes the MAE between the network estimates and the target features, and φu uses a linear activation function as the output activation. As target features, we use either class labels (for classification tasks) or continuous features (for regression tasks). In each of these examples, the training data come from 2,000 simulated videos. At each epoch, 1,024 samples are randomly extracted from the training dataset, from which we extract graph representations and augment their topological structure by translations, rotations and mirroring of the nodes’ centroids. The network was trained for 100 epochs, processing 1,024 samples in batches of 8 per iteration. Network performance was evaluated on independently simulated samples generated with a random seed different from that used for training.
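For reference, the two losses used for node-level and global-level tasks can be written in a few lines. The sketch below is a plain-Python illustration with hypothetical function names and toy inputs; it follows the standard definitions of the MAE and of the sparse categorical cross-entropy (integer class labels, softmax over the network outputs) rather than any specific library implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vector of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_cross_entropy(logits_batch, class_ids):
    """Mean of -log(probability assigned to the true class index)."""
    losses = [-math.log(softmax(logits)[k])
              for logits, k in zip(logits_batch, class_ids)]
    return sum(losses) / len(losses)

def mean_absolute_error(preds, targets):
    """Mean absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

# Toy classification example with three diffusion models (class ids 0, 1, 2)
cls_loss = sparse_categorical_cross_entropy(
    [[2.0, 0.1, -1.0], [0.0, 0.0, 0.0]], [0, 2])

# Toy regression example, e.g. predicted versus true diffusion coefficients
reg_loss = mean_absolute_error([0.10, 0.35, 0.60], [0.12, 0.30, 0.65])
```

The labels are "sparse" in the sense that each target is a single class index rather than a one-hot vector.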

For all examples, the trainable parameters of MAGIK (that is, the weights of the artificial neurons in the neural networks and the parameters of the Gaussian edge-weighting function) were iteratively optimized using the backpropagation algorithm56 and the Adam optimizer (with a learning rate of 0.001)57. The training time of MAGIK ranges from 1 min to 5 min for trajectory linking and from 30 min to 60 min for node and global-level regression on an NVIDIA A100 GPU (40 GB VRAM, 2,430 MHz effective core clock, 6,912 CUDA cores).

The ability to train on a minimal amount of annotated data, without requiring prior knowledge of the underlying dynamics, enables the application of MAGIK to a wide range of real data for which large labelled datasets are not available. In addition, we have tested transfer learning for the linking task between migration experiments employing different cell types, as shown in the GitHub repository.

### Quantification of cell-tracking results

The performance of the method for cell tracking was quantified by calculating the TRA metric based on the acyclic oriented graph matching (AOGM) measure discussed in ref. 31. First, images corresponding to the incomplete cell segmentation provided for the 6th Cell Tracking Challenge were annotated according to their ground truth and then transformed into an acyclic oriented graph according to the instructions for participation in the challenge7. A similar graph was also obtained for the trajectories predicted by our method. The matching between the two graphs quantified by the AOGM corresponds to the weighted sum of the operations executed to transform the predicted graph into the ground-truth one31. For this, we used the AOGM-A measure, which corresponds to the AOGM measure calculated by keeping only the edge-related weights positive (wNS = wFN = wFP = 0; wED = 1, wEA = 1.5, wEC = 1)31. The AOGM-A thus evaluates the ability of an algorithm to follow objects in time (that is, its linking capability). The AOGM-A measure is normalized to obtain the tracking accuracy (TRA):

$${{{\rm{TRA}}}}=1-\frac{\min ({{{\rm{AOGM}}{\mbox{-}}{\rm{A}}}},{{{\rm{AOGM}}}}{\mbox{-}}{{{{\rm{A}}}}}_{0})}{{{{\rm{AOGM}}}}{\mbox{-}}{{{{\rm{A}}}}}_{0}}$$
(5)

where AOGM-A0 corresponds to the cost of linking the graph from scratch (that is, the cost of adding all the edges multiplied by the corresponding weights). The normalization bounds TRA in the interval [0, 1], with higher values corresponding to better tracking performance.
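The normalization in equation (5) can be illustrated with a short worked example. The sketch below is a minimal illustration, assuming that the edge operations have already been counted by a graph-matching step (not shown); the function names and the operation counts are hypothetical, whereas the weights are those stated above.

```python
def aogm_a(n_deleted, n_added, n_changed, w_ed=1.0, w_ea=1.5, w_ec=1.0):
    """Weighted sum of edge operations: deletions, additions, semantic changes."""
    return w_ed * n_deleted + w_ea * n_added + w_ec * n_changed

def tra(aogm_a_value, aogm_a_0):
    """Tracking accuracy: 1 - min(AOGM-A, AOGM-A0) / AOGM-A0, bounded in [0, 1]."""
    return 1.0 - min(aogm_a_value, aogm_a_0) / aogm_a_0

# Hypothetical example: the ground-truth graph has 200 edges, so linking it
# from scratch costs 200 edge additions.
aogm_a_0 = aogm_a(n_deleted=0, n_added=200, n_changed=0)

# Hypothetical prediction needing 4 deletions, 6 additions and 2 semantic fixes.
cost = aogm_a(n_deleted=4, n_added=6, n_changed=2)
score = tra(cost, aogm_a_0)

# A prediction costlier to fix than to rebuild from scratch is clipped to 0.
worst = tra(10 * aogm_a_0, aogm_a_0)
```

Here the correction cost is 4 + 9 + 2 = 15 against a from-scratch cost of 300, so the example prediction scores TRA = 0.95.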

### Simulations

Trajectories were simulated using the andi-datasets Python package58. In addition, we used DeepTrack 2.159 to render imaged objects in different illumination modalities (fluorescence and holographic microscopy), reproducing optical conditions to provide realistic node features (Fig. 4). Localization crowding was estimated as c = ρDΔt, a dimensionless parameter in two dimensions that simultaneously accounts for changes in the number density ρ, the diffusion coefficient D and the sampling time Δt.

For the fluorescence microscopy experiments of Fig. 4a–d, we simulated objects performing Brownian motion in two dimensions with random diffusivities (0.005 ≤ D ≤ 0.7). For Fig. 4e–h, the diffusivity was defined by a random spatial map, smoothed with a Gaussian filter. For training, we typically use videos of 50–55 frames containing 30–35 objects for the inference of D and 70–80 frames for the diffusivity maps, initially positioned at random locations within a 32 × 32 px² area. Due to the variability of D and of the number of objects, the crowding factor for these examples could vary from video to video in the range [0.003, 0.04]. Each object was rendered as a diffraction-limited spot through the optics module of DeepTrack 2.159, with a random intensity from a uniform distribution between 20 and 80 counts, varying over time with a standard deviation of 3 counts.
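To make the simulated dynamics concrete, the following sketch generates a two-dimensional Brownian trajectory and recovers D from the one-step mean squared displacement (MSD = 4DΔt in two dimensions). It is a plain-Python stand-in, not a call to the andi-datasets package; the function names are hypothetical, and drawing D uniformly from the stated range is an assumption, since the text only specifies the bounds.

```python
import math
import random

def simulate_brownian_2d(n_steps, D, dt=1.0, box=32.0, rng=random):
    """2D Brownian trajectory; each axis step is Gaussian with variance 2*D*dt."""
    sigma = math.sqrt(2.0 * D * dt)
    # Random initial position inside a box-by-box region
    x, y = rng.uniform(0.0, box), rng.uniform(0.0, box)
    traj = [(x, y)]
    for _ in range(n_steps):
        x += rng.gauss(0.0, sigma)
        y += rng.gauss(0.0, sigma)
        traj.append((x, y))
    return traj

def estimate_D(traj, dt=1.0):
    """Estimate D from the one-step mean squared displacement (MSD = 4*D*dt)."""
    sq = [(x1 - x0) ** 2 + (y1 - y0) ** 2
          for (x0, y0), (x1, y1) in zip(traj, traj[1:])]
    return sum(sq) / len(sq) / (4.0 * dt)

rng = random.Random(42)
true_D = rng.uniform(0.005, 0.7)  # assumed uniform; only the range is stated
# A long trajectory is used here only to make the estimate tight; the
# training videos described above are much shorter (50-55 frames).
traj = simulate_brownian_2d(20000, true_D, rng=rng)
est_D = estimate_D(traj)
```

With trajectories as short as those used for training, the same estimator becomes considerably noisier.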

For all the experiments of Fig. 5, we generated trajectories undergoing three different diffusion models, namely FBM, ATTM and CTRW, with a constant anomalous exponent α = 1 and random diffusivity. For Fig. 5a–e, each object in the video undergoes two-dimensional diffusion with a randomly assigned model, with all other properties (sequence length, number of particles, intensity) being the same as described for the data in Fig. 4.

For the plankton trajectories illustrated in Fig. 5f–m, all microorganisms in the same video move according to the same three-dimensional model, which varies from video to video. We generated holographic videos of 100 frames including 3–7 microorganisms, each with a refractive index randomly sampled from a uniform distribution between 1.35 and 1.55, covering a wide variety of plankton species reported in the literature60.

For the data of Extended Data Fig. 2, we generated trajectories undergoing Brownian motion with random diffusivity drawn from an exponential distribution with an average of 0.1 px² per frame (truncated at 0.001 and 1 px² per frame) and with a random intensity from a uniform distribution between 20 and 80 counts, varying over time with a standard deviation of 3 counts. The diffusivity was kept constant over dwell times extracted from a geometric distribution with p = 0.05 truncated at values >5 frames. Sequence length was 400 frames.
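A piecewise-constant diffusivity sequence of this kind can be sketched as follows. This plain-Python illustration makes two assumptions that the text does not settle: truncation is implemented by rejection sampling, and "truncated at values >5 frames" is read as keeping only dwell times longer than 5 frames. All function names are hypothetical.

```python
import math
import random

def sample_truncated_exponential(mean, low, high, rng):
    """Exponential draw with the given mean, redrawn until it lies in [low, high]."""
    while True:
        d = rng.expovariate(1.0 / mean)
        if low <= d <= high:
            return d

def sample_dwell_time(p, min_frames, rng):
    """Geometric dwell time (support 1, 2, ...), redrawn until > min_frames."""
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        k = int(math.log(u) / math.log(1.0 - p)) + 1  # inverse-CDF sampling
        if k > min_frames:
            return k

def diffusivity_sequence(n_frames, rng, mean_d=0.1, low=0.001, high=1.0,
                         p=0.05, min_dwell=5):
    """Per-frame diffusivity that stays constant within each dwell segment."""
    seq = []
    while len(seq) < n_frames:
        d = sample_truncated_exponential(mean_d, low, high, rng)
        dwell = sample_dwell_time(p, min_dwell, rng)
        seq.extend([d] * dwell)
    return seq[:n_frames]  # the last segment may be cut by the video end

rng = random.Random(7)
d_per_frame = diffusivity_sequence(400, rng)
```

The resulting list can drive a Brownian simulation frame by frame, using a step variance of 2DΔt per axis with the D of the current frame.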

For the examples in Extended Data Fig. 3a–d, we simulated objects performing FBM in two dimensions with random anomalous exponents (0.2 ≤ α < 1.8) and diffusivities (0.005 ≤ D < 0.7). For the examples illustrated in Extended Data Fig. 3e–h, we generated fluorescence images of objects undergoing FBM in two dimensions in sub-diffusive (0.2 ≤ α ≤ 0.6), normal (α = 1) and super-diffusive (1.4 ≤ α ≤ 1.8) modes. All other properties (sequence length, number of particles, intensity) are the same as described for the data in Fig. 4.

For the data of Extended Data Fig. 4, we generated trajectories undergoing FBM with a constant anomalous exponent α = 1 and random diffusivity drawn from an exponential distribution with an average of 0.1 px² per frame (truncated at 0.001 and 1 px² per frame) and with a random intensity from a uniform distribution between 20 and 80 counts, varying over time with a standard deviation of 3 counts. Particles undergo diffusion with reflecting boundaries in a square box of 32 × 32 px². The number of particles was kept constant at 30. Trajectories were generated at sampling times Δt = 0.5, 1, 2, 4, 8, 16, 32, corresponding to crowding factors c = 0.0015, 0.0029, 0.0059, 0.0117, 0.0234, 0.0468, 0.0936. Sequence length was 100 frames.
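The crowding factors listed above follow from c = ρDΔt with ρ = 30/(32 × 32) px⁻². The sketch below reproduces the calculation, assuming the average diffusivity of 0.1 px² per frame as the representative value of D; the function name is hypothetical.

```python
def crowding_factor(n_objects, area, D, dt):
    """Dimensionless crowding c = rho * D * dt, with rho = n_objects / area."""
    rho = n_objects / area
    return rho * D * dt

sampling_times = [0.5, 1, 2, 4, 8, 16, 32]
reported = [0.0015, 0.0029, 0.0059, 0.0117, 0.0234, 0.0468, 0.0936]

# 30 particles in a 32 x 32 px^2 box; D = 0.1 px^2 per frame is the assumed
# representative (average) diffusivity.
crowding = [crowding_factor(30, 32 * 32, 0.1, dt) for dt in sampling_times]

for dt, c, ref in zip(sampling_times, crowding, reported):
    print(f"dt = {dt:>4}: c = {c:.4f} (reported {ref})")
```

The computed values agree with the reported ones to within rounding of the last digit, and each doubling of Δt doubles c.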