Geometric deep learning reveals the spatiotemporal features of microscopic motion

The characterization of dynamical processes in living systems provides important clues for their mechanistic interpretation and link to biological functions. Owing to recent advances in microscopy techniques, it is now possible to routinely record the motion of cells, organelles and individual molecules at multiple spatiotemporal scales in physiological conditions. However, the automated analysis of dynamics occurring in crowded and complex environments still lags behind the acquisition of microscopic image sequences. Here we present a framework based on geometric deep learning that achieves the accurate estimation of dynamical properties in various biologically relevant scenarios. This deep-learning approach relies on a graph neural network enhanced by attention-based components. By processing object features with geometric priors, the network is capable of performing multiple tasks, from linking coordinates into trajectories to inferring local and global dynamic properties. We demonstrate the flexibility and reliability of this approach by applying it to real and simulated data corresponding to a broad range of biological experiments.


INTRODUCTION
The biological functions of living systems rely on interactions that dynamically change in response to endogenous and exogenous stimuli. Studying the motion of the individual components of these systems sets the basis for mechanistic insights to understand health and disease [1]. Over the last 20 years, microscopy has advanced to the point where it can monitor dynamic processes at multiple scales with unprecedented spatiotemporal resolution. Time-lapse microscopy experiments have unveiled the strategies that unicellular organisms employ to search for food or to avoid adverse conditions, and have helped to understand tissue growth and repair, cancer metastasis, quorum sensing, the emergence of multicellularity, and immune responses in multicellular organisms [2,3]. Fluorescence microscopy has monitored biological motion down to the nanoscale, detailing the diffusion of individual organelles and molecules within the cellular environment and disclosing their role, e.g., in the fundamental processes of signaling and function regulation [4,5]. Tethered-particle microscopy as well as optical and magnetic tweezers have used the motion of micron-sized beads as a proxy to infer changes in the kinetics of proteins and nucleic acids at the single-molecule level [6].
The momentous improvement of microscopy acquisition techniques has led to a substantial effort to develop and improve algorithms to automatically extract quantitative information from these experiments [7,8]. The standard analysis pipeline of tracking-by-detection methods entails the following steps [4,8,9]:
1. Objects of interest are detected in movie frames (segmentation).
2. Object positions and other state parameters are estimated (localization).
3. Detected positions at different times are connected into trajectories (linking).
4. Reconstructed trajectories are finally analyzed to quantify dynamical parameters (estimation).
The first 3 steps are often presented together and referred to as tracking. In contrast, likely due to the broad variety of different parameters it might be required to evaluate, the estimation step is usually considered separately. For biological experiments, this analysis is made more difficult by various factors, such as imaging noise, high object density, fusion or splitting events, random and heterogeneous motion, and shape-changing objects. Errors at each step propagate along the pipeline and ultimately impact the extraction of dynamic information.

FIG. 1. Spatiotemporal characterization of trajectories using MAGIK. a, Sequence of images illustrating the evolution of a group of cells over time, corresponding to frame numbers t − 1, t, t + 1, t + 2. The orange crosses indicate a detection. b, The movement of the cells and their interactions are geometrically modeled using a directed graph, where the nodes (V) represent the detections and the edges (E) connect spatiotemporally-close detections. Each node contains features (orange squares) such as the cell's centroid and some relevant descriptors (e.g., the cell's morphological and intensity attributes). The edges contain features (blue squares) too, in this case encoding the Euclidean distance between the centroids of the cells. In this example, the node of interest, labeled with the subindex i, is connected to neighboring nodes in the future, labeled with the subindex j, within a distance-based likelihood radius (the edge between nodes v_i^t and v_{j=4}^{t+1} is dumped). Meaningful biological events (e.g., cell divisions) are naturally encoded in the graph. c, d, The input node and edge features are mapped to a higher-level feature representation using learnable encoding functions implemented by the neural networks φv and φe, respectively. e, Importantly, we also append an extra learnable token U to the graph latent representation G = {V, E, U}, whose function is to provide global insights about the dynamics of the cells. f, MAGIK relies on attention-based fingerprinting graph blocks (FGNN) to process G and provide an updated representation for nodes (V, g), edges (E, h), and global information (U, i) (for further details regarding the FGNN architecture, refer to Methods, "Description of MAGIK", and Suppl. Fig. S1). Finally, V, E, and U are decoded by applying learnable functions implemented by the neural networks ϕv, ϕe, and ϕu, respectively, to obtain the sought-after node (j), edge (k), and global information (l).
Several algorithmic solutions have been proposed to tackle limitations of tracking algorithms and their performance has been compared in open challenges [7,8]. However, most of these methods are specific to a given experiment or dynamic model, and often require manual tuning of parameters. The current deep-learning revolution has fostered the development of various methods for both tracking [10-13] and estimation [14]. However, deep-learning-powered approaches have so far been bound to follow the standard analysis pipeline, providing data-driven versions of conventional approaches without taking full advantage of their possibilities [13].
Geometric deep learning provides compelling approaches to tackle tracking and estimation from a different perspective. It generalizes neural networks to problems that can be described by mathematical objects such as graphs that encode information about the structure of the input [15]. Deep learning methods based on graphs are typically referred to as graph neural networks (GNNs) [16] and have been successfully applied, e.g., to molecular property prediction [17], drug discovery [18], and computer-assisted retrosynthesis [19]. Besides being ubiquitously used in science to represent complex systems [20], graphs provide a natural and intuitive way to represent the information contained in tracking experiments [21,22].
Here, we describe a framework for Motion Analysis through GNN Inductive Knowledge (MAGIK), which provides the accurate estimation of dynamical properties from time-lapse microscopy. MAGIK models the system's motion and interactions through a graph representation. This graph is processed through an interpretable and adaptive attention-based GNN that estimates the associations among the objects and provides insights into the intrinsic dynamics of the systems. We demonstrate the flexibility and reliability of MAGIK by quantifying its performance on real and simulated data corresponding to a broad range of biological experiments. First, we benchmark it on its most natural application, i.e., trajectory linking, in a variety of challenging experimental scenarios, including high-density experiments, merge/split events, and shape-changing objects, where MAGIK features gap-closing capabilities, gracefully handles segmentation errors, and even overrides imperfect annotation of training datasets. Then, going beyond optimal trajectory linking, we show that MAGIK can further estimate local and global dynamical properties, such as diffusion coefficients, diffusion modes, and anomalous diffusion exponents, even in highly heterogeneous scenarios at the ensemble and single-object levels.

MAGIK represents spatiotemporal relations in a graph.
MAGIK provides a GNN framework to estimate the dynamical properties of moving objects from time-lapse experiments, which are relevant in different biological scenarios. MAGIK models the objects' motion and physical interactions using a graph representation. The details of the algorithm are given in Methods ("Description of MAGIK") and Fig. S1. In this section, we provide a high-level description of the architecture, as represented in Fig. 1.
Graphs can define arbitrary relational structures between nodes, connecting them pairwise through edges. When training a GNN, the graph architecture guides the learning process about the objects and their relations by introducing a relational inductive bias [16]. In MAGIK, each node describes an object detection at a specific time, the edges connect spatiotemporally close objects, and a set of global attributes encodes system-level properties. As an example, for subsequent frames of a cell migration experiment, each detected object (orange crosses in Fig. 1a) is associated with a node with a vector of node features (Fig. 1b). Directed edges with relational features connect each node to objects detected in the future in its proximity (Fig. 1b). There are no intrinsic restrictions on the type or number of descriptors (e.g., location and morphological features, image-based quantities, biological events, interaction strength, distance, direction) that can be encoded in the graph feature representation. The basic graph relational structure is established through a set of rules that link nodes pairwise based on distance metrics between features. Node and edge features are encoded through learnable functions implemented by neural networks (Figs. 1c, d). An extra learnable token is added to aggregate global attributes from the whole graph [23] (Fig. 1e).
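The graph construction described above can be illustrated with a minimal sketch. It assumes that detections are given as dictionaries with `frame`, `x`, and `y` keys and that edges are drawn to later detections within a fixed spatial radius and frame gap. The function name `build_graph` and its parameters are hypothetical and chosen for illustration; they are not part of MAGIK's code.

```python
import math

def build_graph(detections, radius, max_frame_gap=1):
    """Connect each detection to later detections that are close in space.

    detections: list of dicts with keys "frame", "x", "y" (any extra keys
    are kept as node features). Returns (nodes, edges), where each edge is
    (source_index, target_index, distance) and always points forward in time.
    """
    nodes = list(detections)
    edges = []
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            gap = b["frame"] - a["frame"]
            if gap < 1 or gap > max_frame_gap:
                continue  # keep only edges towards the future, within the time window
            dist = math.hypot(b["x"] - a["x"], b["y"] - a["y"])
            if dist <= radius:
                edges.append((i, j, dist))  # the distance is the sole edge feature
    return nodes, edges

# Toy example (hypothetical coordinates, in pixels).
detections = [
    {"frame": 0, "x": 0.0, "y": 0.0, "area": 50.0},
    {"frame": 1, "x": 3.0, "y": 4.0, "area": 52.0},
    {"frame": 1, "x": 30.0, "y": 40.0, "area": 47.0},
    {"frame": 2, "x": 6.0, "y": 8.0, "area": 55.0},
]
nodes, edges = build_graph(detections, radius=12.0, max_frame_gap=2)
print(edges)
```

Allowing a frame gap larger than one keeps candidate edges across a missed detection, at the cost of a more redundant graph.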
The graph is processed through a sequence of attention-based fingerprinting graph neural networks (FGNN, see also "Description of MAGIK" and Suppl. Fig. S1) that propagate information through the graph via message-passing steps (Figs. 1f-i). The relational inductive knowledge implemented in the graph structure sketches a network of redundant object associations. The objective of the FGNN is to modulate the association strength to identify the edges that most strongly influence the dynamic properties of each object. For this, the FGNN implements two attention mechanisms that combine information from multiple objects while considering the presence of heterogeneity at the local and global levels.

FIG. 2. Trajectory linking using MAGIK. a, Representative frame of the HeLa cells in the DIC-C2DH-HELA validation video of the 6th Cell Tracking Challenge [8]. Segmentations (colored regions) are used to extract relevant information from each cell along the sequence of images to build b, the input graph structure including a redundant number of edges with respect to the actual associations between objects. c, Ground-truth graph, and d, ground-truth cell trajectories. e, The predicted graph agrees well with the expected solution, achieving an F1-score equal to 99.4%. f, The predicted trajectories reach TRA = 99.2% compared to the ground truth. Cell divisions are detected correctly, and the network performs well also in edge regions where cells are only partially observed and move out of the field of view. g, Zoomed-in view of the inset in f showing the heterogeneity in cell shape and dynamics. A cell changes morphology during migration (frames 0-28) and divides into 2 daughter cells (frame 30) that spread and migrate apart (frames 30-83). h, MAGIK exhibits emergent gap-closing capabilities. MAGIK is able to identify the white and blue cells at frame 66 as the product of a cell division despite the missing segmentation of one of the daughter cells (frame 64) that led to the misclassification of the green cell (frames 65-66) in the annotation of the ground-truth trajectory. A video visualizing the tracked cells can be found in the supplementary material (Suppl. Video S1).
The first mechanism intervenes when aggregating edge features to a node (Eq. 3). The contribution of each edge has a weight that depends on the distance between the connected nodes through a function with learnable parameters (Eq. 2), thus defining a learnable local receptive field that allows the network to adapt to heterogeneous dynamics and to robustly account for relevant relations between the nodes. The second is a gated self-attention mechanism [24] that sets in when updating the latent representation of nodes (Eq. 4). The node update operation also involves information stemming from beyond each node's topological neighborhood, thus effectively expanding the receptive field to objects that, although not physically connected, can offer relevant information about the overall dynamics. The use of gated self-attention offers a feature-wise discriminatory power to the node update operation since it weights individual features of the attention node embedding with respect to their importance to the overall graph structure. Through this mechanism, MAGIK identifies only the meaningful features of each node. Unreliable or incomplete features (e.g., the morphology of objects at the edge of the image or partially outside the field of view) are thus prevented from adding noise to the correct prediction of the network. The FGNN further updates the extra token for global attributes using information from all the nodes; thus, this extra token serves as an antenna to provide system-level insights.
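To make the first mechanism concrete, the following sketch shows a distance-dependent weighting of edge features before they are summed at a node. The generalized Gaussian form used in `edge_weight` is an assumption compatible with a width σ and an order β; in a trained network both would be learned, whereas here they are fixed numbers. The function names are hypothetical.

```python
import math

def edge_weight(distance, sigma, beta):
    """Assumed generalized Gaussian weight: 1 at zero distance, decaying with distance."""
    return math.exp(-((distance ** 2) / (2.0 * sigma ** 2)) ** beta)

def aggregate_edges(edge_features, distances, sigma, beta):
    """Weighted sum of the edge feature vectors attached to one node."""
    dim = len(edge_features[0])
    total = [0.0] * dim
    for feat, dist in zip(edge_features, distances):
        w = edge_weight(dist, sigma, beta)
        for k in range(dim):
            total[k] += w * feat[k]
    return total

# Two edges with identical features: the closer one dominates the aggregate.
features = [[1.0, 2.0], [1.0, 2.0]]
near_and_far = aggregate_edges(features, distances=[1.0, 10.0], sigma=2.0, beta=1.0)
print(near_and_far)
```

A larger β flattens the weight near zero distance and sharpens the cutoff, so σ and β together set the size and shape of the local receptive field.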
The output of the FGNN is decoded by the last block of the GNN into an output graph, whose nodes, edges, and global attributes can be used to solve specific problems (Figs. 1j-l). The flexibility of the graph representation and the possibility to use various types and numbers of input features make MAGIK suitable for determining multiple parameters associated with various experimental scenarios where there are objects of interest moving in space and time.
In the following sections, we exemplify the application of MAGIK to: (i) the analysis of experiments of cell migration to determine trajectories in the presence of proliferation; (ii) fluorescence imaging of single molecules to determine parameters of heterogeneous and anomalous diffusion; and (iii) holographic imaging of microorganisms to classify their diffusion mode. In all cases, MAGIK is trained for ≈ 100 epochs with small datasets (a single video of ≈ 100 frames for linking; at most ≈ 1000 videos of ≈ 50 frames each for the other cases) thanks to the use of an ad hoc augmentation procedure that combines feature corruption and node dropping, thus enabling transfer learning to more challenging conditions with no loss of performance.
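The augmentation idea mentioned above (feature corruption combined with node dropping) can be sketched as follows. The additive Gaussian noise, the independent per-node drop probability, and the name `augment` are illustrative assumptions; the details of the procedure actually used for training are given in Methods.

```python
import random

def augment(nodes, edges, noise_std=0.05, drop_prob=0.1, rng=None):
    """Return a corrupted copy of a graph.

    nodes: list of feature lists; edges: list of (source, target) index pairs.
    Each feature receives additive Gaussian noise, and each node is removed
    with probability drop_prob together with every edge that touches it.
    """
    rng = rng or random.Random()
    keep = [i for i in range(len(nodes)) if rng.random() >= drop_prob]
    new_index = {old: new for new, old in enumerate(keep)}  # re-index survivors
    new_nodes = [[f + rng.gauss(0.0, noise_std) for f in nodes[i]] for i in keep]
    new_edges = [(new_index[s], new_index[t]) for s, t in edges
                 if s in new_index and t in new_index]
    return new_nodes, new_edges

nodes = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
edges = [(0, 1), (1, 2)]
aug_nodes, aug_edges = augment(nodes, edges, rng=random.Random(0))
```

Dropping nodes during training mimics missed detections, which is one plausible reason why a network trained this way tolerates imperfect segmentation.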
The training is typically completed in minutes on a GPU-enhanced computer (see Methods, "MAGIK training").

MAGIK accurately links trajectories.
We first benchmark MAGIK performance on a classical trajectory linking task, consisting of establishing temporal associations between identified objects. For object linking, the graph structure includes a redundant number of edges with respect to the actual associations between objects. The aim of MAGIK is to prune the wrong edges while retaining the true connections by using all the available spatiotemporal information. We thus model this task as an edge-classification problem with a binary label (linked/unlinked). From the predicted edge features, trajectories are built through a postprocessing algorithm that eliminates spurious connections (Methods, "Postprocessing algorithm for trajectory linking").
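As an illustration of how binary edge predictions can be turned into trajectories, the sketch below applies a simple greedy rule. This is not the postprocessing algorithm described in Methods. It assumes that each detection has at most one parent, that a parent can have at most two children (to allow a division), and that edges point forward in time. The names `link_edges` and `trajectory_of` are hypothetical.

```python
def link_edges(edge_scores, threshold=0.5, max_children=2):
    """Select which candidate edges to keep.

    edge_scores: dict mapping (source, target) -> predicted link probability.
    Each target keeps at most one parent; each source keeps at most
    max_children targets. Returns a dict mapping target -> source.
    """
    parent = {}
    children = {}
    # Visit candidate edges from most to least confident.
    for (s, t), p in sorted(edge_scores.items(), key=lambda kv: -kv[1]):
        if p < threshold:
            break  # all remaining edges are below the threshold
        if t in parent or children.get(s, 0) >= max_children:
            continue
        parent[t] = s
        children[s] = children.get(s, 0) + 1
    return parent

def trajectory_of(node, parent):
    """Follow parents backwards to list the ancestry of a node, oldest first."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

# Hypothetical scores: node 1 divides into nodes 3 and 4.
scores = {(0, 1): 0.98, (0, 2): 0.10, (1, 3): 0.95,
          (1, 4): 0.90, (2, 3): 0.40, (0, 3): 0.20}
parent = link_edges(scores)
print(trajectory_of(4, parent))
```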
To test MAGIK, we use the silver-standard segmentation datasets provided for the training of the sixth edition of the Cell Tracking Challenge [8] (which have been created by combining results of several automatic analysis methods following a majority-voting scheme). A representative segmentation of the dataset DIC-C2DH-HELA, corresponding to HeLa cells on a flat glass imaged through differential interference contrast, is shown in Fig. 2a. From the segmentation, we calculate the mean pixel intensity, area, perimeter, eccentricity, and solidity of the segmented objects, which we use as input node features. The Euclidean distance between neighboring objects is used as the sole edge feature. To limit memory usage, we generate graphs by drawing edges only between objects within a limited spatial and temporal reach (Fig. 2b).
The DIC-C2DH-HELA dataset presents several challenges, namely, high packing density, low signal-to-noise ratio, and a highly heterogeneous intracellular signal due to DIC-highlighted internal structures and organelles. The heterogeneity further extends to cell shape and dynamics over time as a consequence of migration and proliferation (Figs. 2g,h). Examples of ground-truth and predicted graphs are shown in Figs. 2c,e, showing good agreement, as confirmed by an F1-score of 99.4% in edge prediction. For the evaluation of performance at the trajectory level (Figs. 2d,f), we calculated the tracking accuracy measure (TRA), a normalized weighted distance between the tracking prediction and the reference tracking ground truth [25] (Methods, "Quantification of cell tracking results"). When evaluated with respect to trajectories, MAGIK reached TRA = 99.2%, showing a great capability of correctly following objects despite imperfect segmentation, shape changes, and cell divisions (Fig. 2g, Suppl. Video S1).
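For reference, the F1-score used for edge prediction is the harmonic mean of precision and recall over the binary edge labels. The sketch below uses the standard definition with made-up labels; it does not reproduce the evaluation code or the data behind the values reported above.

```python
def f1_score(true_labels, predicted_labels):
    """F1-score for binary labels, where 1 means linked and 0 means unlinked."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true links kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # wrong links kept
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # true links pruned
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical labels for eight candidate edges.
true_edges = [1, 1, 1, 0, 0, 0, 1, 0]
pred_edges = [1, 1, 0, 0, 0, 1, 1, 0]
print(f1_score(true_edges, pred_edges))  # 3 TP, 1 FP, 1 FN
```

Because the input graph is deliberately redundant, most candidate edges are negatives. The F1-score ignores true negatives, which makes it more informative than plain accuracy in this setting.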
An interesting emergent capability of the method is highlighted in Fig. 2h. The video microscopy at frame 63 captures a cell (orange shadow) dividing into two daughter cells (frames 64-66). The ground-truth segmentation at frame 64 misses one of the daughter cells, preventing the identification of the division event at this frame. This kind of error is not uncommon because, for actual experiments, although most of the annotations can be considered as true positives, they are a subset of the unknown ground truth. When the second daughter cell is identified (green shadow at frame 65), both MAGIK and the annotations used as ground truth associate it to a new trajectory. However, at frame 66, MAGIK is able to identify the white and blue cells as the product of the cell division of the orange cell at frame 65, highlighting a general learning ability of the network based on the propagation of topological and morphological information over time. Although the detection of these events is not reflected in the computation of tracking metrics, their identification is relevant for the biological interpretation of the experiment (e.g., calculating the cell division rate) [8].
We applied MAGIK to several other datasets of the 6th Cell Tracking Challenge, obtaining outstanding results for different microscopy techniques and cell types. Representative video frames with segmentation are shown in Fig. 3 for confocal microscopy imaging of GFP-GOWT1 mouse stem cells (Fig. 3a, F1-score = 99.8%, TRA = 99.2%), phase-contrast imaging of glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate (Fig. 3b, F1-score = 99.8%, TRA = 100%), epifluorescence imaging of HeLa cells stably expressing H2b-GFP (Fig. 3c, F1-score = 98.8%, TRA = 98.4%), and phase-contrast imaging of pancreatic stem cells on a polystyrene substrate (Fig. 3d, F1-score = 99.3%, TRA = 98.5%) (see Suppl. Videos S2-S5 for full movies). Even though a strict objective comparison of MAGIK linking capability with other methods is limited by the fact that different algorithms rely on different segmentations (whose errors influence linking and thus indirectly affect the value of the TRA metric), MAGIK obtained TRA values that are competitive, if not superior, to the best-in-class methods of the 6th Cell Tracking Challenge.

MAGIK quantifies motion parameters without trajectory linking.
In most applications, the ultimate objective of tracking is the characterization of the dynamics of the systems under investigation to gain insights into their underlying biological mechanisms. In this process, trajectory linking is often just an intermediate step necessary to get meaningful information from the data, but not the end goal itself. For example, in single-molecule fluorescence microscopy, trajectory analysis is often performed to quantify dynamic parameters such as diffusion coefficients, to determine the extent of mixed diffusive behaviors (e.g., slow/fast, mobile/confined), and to classify the diffusion mode [4,5,14].
Unlike most other estimation techniques, MAGIK can characterize essentially any dynamic aspect of the system under investigation without requiring the actual linking, thanks to its capability of accounting for the whole spatiotemporal complexity contained in the associations between objects at multiple scales. Such linking-free analysis produces a two-fold advantage. First, it bypasses the error-prone linking step, thus inherently preventing linking errors from propagating to the quantification of the ultimately relevant parameters. Second, it enables the analysis of experiments for which linking cannot be performed due to, e.g., a high object density or low signal-to-noise ratio.
To highlight its capabilities and quantitatively assess its performance, in Fig. 4, we apply MAGIK to analyze simulated data reproducing the diffusion of fluorescently-labeled single molecules such as lipids or receptors in the plasma membrane of living cells. We first consider the task of determining the diffusion coefficient from a heterogeneous ensemble of diffusing objects (Fig. 4a). We feed the network the centroid coordinates and intensity of the localized fluorescence spots as node features and the Euclidean distance between neighboring centroids as the edge feature. We define the problem as a node regression where the target feature is the displacement scaling factor √(2D), with D being the diffusion coefficient of the molecule associated with each node. Graphs are built by connecting localized objects with neighbors in space and time (Fig. 4b). Ground-truth and predicted graphs are shown in Fig. 4b and Fig. 4c, respectively. All the edges of the graph structure are drawn, representing the network of associations used to infer dynamic properties without direct linking. Nodes are color-coded according to the value of the displacement scaling factor √(2D). Their visual comparison suggests excellent agreement, further confirmed by the quantification in Fig. 4d. The same approach can also be extended to estimate other parameters. In Suppl. Fig. S2a-d, we show the results of its application to the inference of the scaling exponent for objects undergoing anomalous diffusion, achieving similarly good results.

Fluorescence microscopy experiments for single-object tracking must ensure that the number of visualized molecules is low enough to unambiguously link the trajectories, and thus are often performed at low labeling density [4]. However, these conditions are not optimal to probe the interactions between particles and make it difficult to infer spatial patterns of diffusion [26]. Enabling the quantification of diffusion properties without linking offers the possibility to process high-density videos to determine the underlying topology and spatial heterogeneity. As an example, we used MAGIK to resolve a spatially-modulated landscape with diffusion continuously varying over more than 2 orders of magnitude from the localizations of diffusing particles (Fig. 4e-h), treating the problem as a node feature regression, as above. At a number density of ≈ 0.02 px⁻², about one order of magnitude higher than the limit for reliable tracking [7], MAGIK is capable of correctly retrieving the spatial map of D (Fig. 4f). Remarkably, most spatial features can already be resolved with a 100-frame-long movie (Fig. 4g). The spatial resolution of the predicted map can be further improved using longer videos (1500 frames, Fig. 4h), with the typical duration of single-molecule fluorescence microscopy experiments for measuring diffusion [4].
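The meaning of the displacement scaling factor √(2D) used as the regression target can be checked numerically. For free Brownian motion, each Cartesian displacement over a time step Δt is Gaussian with standard deviation √(2DΔt). The sketch below simulates such steps and recovers D from their mean squared length. It is a conventional estimate that needs known displacements of one particle, so it illustrates the target quantity rather than MAGIK's linking-free inference. The function names and parameter values are hypothetical.

```python
import math
import random

def simulate_brownian_steps(D, dt, n_steps, rng):
    """Displacements (dx, dy) of free 2D Brownian motion with diffusion coefficient D."""
    scale = math.sqrt(2.0 * D * dt)  # per-axis standard deviation of one step
    return [(rng.gauss(0.0, scale), rng.gauss(0.0, scale)) for _ in range(n_steps)]

def estimate_D(steps, dt):
    """Estimate D from the 2D relation <dx^2 + dy^2> = 4 D dt."""
    mean_sq = sum(dx * dx + dy * dy for dx, dy in steps) / len(steps)
    return mean_sq / (4.0 * dt)

rng = random.Random(0)
steps = simulate_brownian_steps(D=0.5, dt=0.1, n_steps=20000, rng=rng)
D_hat = estimate_D(steps, dt=0.1)
print(D_hat)  # close to the simulated value of 0.5
```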

MAGIK quantifies global dynamic properties.
We further applied MAGIK to extract ensemble information through the inference of global attributes, skipping direct trajectory linking, in two biologically-relevant experimental scenarios. As a first example, we considered fluorescence microscopy experiments in which objects in the same video undergo diffusion according to different microscopic models (namely, fractional Brownian motion (FBM), annealed transient time motion (ATTM), and continuous-time random walk (CTRW), Figs. 5a-e). Although these diffusion models can give rise to anomalous diffusion, in this example they are parametrized so as to have the same scaling of the mean-squared displacement as Brownian motion (α = 1) [14]. Graphs are built as described above using centroid coordinates and intensity of the localized fluorescence spots as node features and the Euclidean distance between neighboring centroids as the edge feature. MAGIK estimates the relative fraction of objects in each category, varying from experiment to experiment, as a regression problem on the global attribute. Results obtained over a large set of experiments are summarized in Fig. 5a-e, showing an outstanding accuracy in predicting the correct fractions, even when the number of objects performing the same class of motion in the experiment is very low. In Suppl. Fig. S2e-h, we further demonstrate that the same approach can also estimate the fraction of objects moving according to different diffusion modes (subdiffusion with α < 1, normal diffusion with α = 1, and superdiffusion with α > 1).
The second example refers to simulations of holographic imaging of microorganisms diffusing in a liquid environment, such as plankton (Fig. 5f-k). In this case, we model diffusion as either FBM (Fig. 5f-g), ATTM (Fig. 5h-i), or CTRW (Fig. 5j-k) with α = 1. Objects in the same experiments move according to the same physical model but with random diffusivity. Centroid 3D coordinates, mean intensity, area, and refractive index of the objects are used as node features in a classification problem to determine the common diffusion model of the objects in the same video, encoded as a global attribute. As shown in Fig. 5l, MAGIK correctly classifies the generative dynamics even with largely overlapping objects. We find this result quite remarkable (equally so as that illustrated in Suppl. Fig. 4e) since, for α = 1, all models converge to Brownian motion and feature large similarities in their statistical properties, making their classification rather challenging even when linked trajectories are available [14].
Last, we explore MAGIK's performance for quantifying anomalous diffusion through the estimation of the exponent α [14] from a sequence of holographic images reproducing the motion of microorganisms. All the objects in the same movie undergo FBM with random diffusivity and the same exponent α, varying from sequence to sequence (Fig. 5m). Also in this case, MAGIK provides remarkable results (MAE = 0.11) from short movies (≈ 50 frames) containing only a few objects.
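The exponent α is defined through the scaling of the mean-squared displacement with time lag, MSD(τ) ∝ τ^α. As a point of reference only, the sketch below computes the conventional trajectory-based estimate, i.e., the slope of log MSD versus log τ for one linked trajectory. This is the kind of analysis that requires linking, not what MAGIK does. The function names are hypothetical, and the example uses a deterministic ballistic trajectory for which α = 2 exactly.

```python
import math

def msd(xs, lag):
    """Time-averaged mean squared displacement of a 1D trajectory at one lag."""
    diffs = [(xs[i + lag] - xs[i]) ** 2 for i in range(len(xs) - lag)]
    return sum(diffs) / len(diffs)

def anomalous_exponent(xs, lags):
    """Least-squares slope of log MSD versus log lag."""
    lx = [math.log(lag) for lag in lags]
    ly = [math.log(msd(xs, lag)) for lag in lags]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Ballistic motion x(t) = 0.5 t gives MSD = (0.5 * lag)^2, hence alpha = 2.
ballistic = [0.5 * t for t in range(100)]
alpha = anomalous_exponent(ballistic, lags=[1, 2, 4, 8, 16])
print(alpha)
```

With short and noisy trajectories this fit becomes unreliable, which is part of the motivation for data-driven estimators of α [14].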

DISCUSSION
MAGIK is a general and versatile approach for the characterization of dynamic properties from time-lapse microscopy that exploits the capability of geometric deep learning to capture the full spatiotemporal complexity of biological experiments. MAGIK strongly relies on an attention-based GNN that can extract dynamic parameters from image-based features by assuming relational constraints between objects. The use of these relational properties at multiple scales makes MAGIK robust with respect to missed detections, object appearance/disappearance, merge/split events, and high object densities. In this article, we have shown several applications that highlight its most important features in different biological contexts.
For trajectory linking, currently used methods employ Kalman filters [27], multiframe and/or multitrack optimization based on greedy algorithms that approximate the multiple-hypothesis tracking (MHT) solution [26,28,29], or combinatorial optimization [30]. Most of these approaches offer their best performance when knowledge of the motion is explicitly used [7]. MAGIK provides the first efficient data-driven alternative to these approaches and can be used to analyze any kind of motion and interaction pattern. Training can be performed on a minimal amount of annotated data or simulations. The inductive bias encoded in the graph structure offers an inherent gap-closing capability and reduces the combinatorial space of potential solutions. Flexible hyperparameters allow the tuning of the maximum distance and time lag for node connection to avoid missed links even for highly heterogeneous motion.
MAGIK provides a key enabling technology to estimate dynamic parameters from segmentation/localization in a completely linking-free fashion, whereas other methods require some level of knowledge about the linking between objects [31,32]. As such, it provides a powerful solution for those experiments where trajectory linking cannot be reliably performed, e.g., as a consequence of high object density or probe blinking. Through the inference of node properties, we demonstrate the capability to resolve spatial patterns of diffusivity, but the same approach can be used to reveal other phenomena such as flows, scaffolding networks, and areas of trapping or confinement.
The examples analyzed in this work highlight the wide versatility of MAGIK. Remarkably, the same architecture can be applied to investigate other dynamical observables, can be trained to simultaneously estimate several parameters, and can even be used for applications beyond time-lapse microscopy, where time is substituted by another variable. As such, it will enable new experimental designs and high-throughput analysis to decipher biological mechanisms underlying the modulation of spatiotemporal behavior.

Description of MAGIK.
The input to MAGIK is the graph representation of the movement and interactions of an ensemble of objects. The nodes (V) contain features encoding meaningful information of the objects, and the edges (E) connect spatiotemporally-neighboring nodes codifying relational features, such as the Euclidean distance between them (Fig. 1a-b).
The architecture comprises three main blocks. First, an encoder neural network $\varphi_v$ converts each node feature representation $v_i \in V$ of dimension $l$ into an $l'$-dimensional feature representation $v'_i$ (Fig. 1c). In parallel, another encoder neural network function $\varphi_e$ transforms each 1-D edge feature $e_k \in E$ into a high-level feature vector $e'_k$ of dimension $f'$ (Fig. 1d). $\varphi_v$ and $\varphi_e$ are a series of multi-layer perceptrons (MLPs), each composed of a linear layer followed by a Gaussian Error Linear Unit (GELU) [33] as activation function and layer normalization.
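One layer of such an encoder can be sketched in plain Python as follows. The ordering (linear map, then GELU, then layer normalization) follows the sentence above. The weights are fixed toy numbers rather than learned parameters, the learnable scale and shift of layer normalization are omitted, and the name `encoder_layer` is hypothetical.

```python
import math

def gelu(x):
    """Exact Gaussian Error Linear Unit: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(values, eps=1e-5):
    """Normalize a feature vector to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def encoder_layer(features, weights, bias):
    """Linear layer, GELU activation, then layer normalization.

    weights: one row per output feature, each row as long as `features`.
    """
    linear = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return layer_norm([gelu(v) for v in linear])

# Map a 2-feature node to a 3-dimensional latent vector (toy weights).
node_features = [1.0, 2.0]
weights = [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]]
bias = [0.0, 0.0, 0.0]
latent = encoder_layer(node_features, weights, bias)
print(latent)
```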
Second, the resultant graph representation $G' = \{V', E'\}$ (Fig. 1e) is processed through repeated fingerprinting graph blocks (FGNN, described in detail in Suppl. Fig. S1a-f). Each FGNN updates each edge in the graph by applying an MLP to the concatenation of the features of two neighboring nodes and their connecting edge, i.e.,
$$e'_{ij} \leftarrow \mathrm{MLP}\left([v'_i, v'_j, e'_{ij}]\right)$$
for $j \in N_i$, where $N_i$ is the neighborhood of node $i$, and $[\cdot,\cdot]$ represents the concatenation operation (Suppl. Fig. S1b). Subsequently, the learned representation $e'_{ij}$ (of dimension $f'$) is weighted by a Gaussian attention mechanism,
$$w_{ij} = \exp\left(-\left(\frac{d_{ij}^2}{2\sigma^2}\right)^{\beta}\right),$$
where $d_{ij}$ is the Euclidean distance between the centroids of the nodes $i$ and $j$, and the standard deviation $\sigma$ and the Gaussian order $\beta$ are learnable parameters that allow the FGNN to adapt to varied object dynamics (Suppl. Figs. S1c and S1d). The FGNN computes a local representation for the topological neighborhood $N_i$ by applying a linear transformation to the concatenation of the current state of node $i$ and the aggregate of the weighted edge features, according to
$$h_i = W_H \left[v'_i, \sum_{j \in N_i} w_{ij}\, e'_{ij}\right],$$
where $W_H$ is an $l' \times (l' + f')$ linear projection matrix. Importantly, we prepend a learnable node embedding $U \in \mathbb{R}^{l'}$ to the local representation matrix, i.e., $H = [U; H]$, whose state serves as a graph-level representation (Suppl. Fig. S1e) [23]. Finally, gated self-attention layers [24] are used to update the hidden states of the node features,
$$V^{(z)} = \mathrm{softmax}\left(\frac{Q^{(z)} {K^{(z)}}^{\top}}{\sqrt{c}}\right) P^{(z)} \odot G^{(z)},$$
where $z = 1, \dots, Z$, with $Z$ representing the number of attention heads; $Q^{(z)} = H W_Q^{(z)}$, $K^{(z)} = H W_K^{(z)}$, and $P^{(z)} = H W_P^{(z)}$ are the query, key, and value embedding matrices of dimension $c$ obtained by the $l' \times l'$ linear projection matrices $W_Q^{(z)}$, $W_K^{(z)}$, and $W_P^{(z)}$; $G^{(z)} = \sigma(H W_G^{(z)})$ is the gate vector parametrized by the linear projection matrix $W_G^{(z)} \in \mathbb{R}^{l' \times l'}$, followed by an element-wise sigmoid function $\sigma$; $\odot$ denotes the Hadamard product; and softmax normalizes the self-attention weights to be positive and add up to 1. The multi-head outputs $V^{(z)}$ are concatenated and passed through an MLP to capture non-linear interactions between the node features and provide the updated node embeddings $V'$ (Suppl. Fig. S1f). Note that $U$ needs to be retrieved from $V'$ to obtain the updated global features.
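As an illustration of the distance-based edge weighting, the following minimal sketch assumes a super-Gaussian weight of the form exp(-(d^2 / (2 sigma^2))^beta) and a plain sum as the neighborhood aggregate. Both choices, as well as the function names, are assumptions made for the example; in MAGIK, sigma and beta are learned rather than fixed.

```python
import math

def gaussian_weight(d, sigma, beta):
    """Assumed super-Gaussian weight: exp(-(d^2 / (2 sigma^2))^beta)."""
    return math.exp(-((d * d) / (2.0 * sigma * sigma)) ** beta)

def aggregate_edges(edge_features, distances, sigma, beta):
    """Sum the edge feature vectors of one neighborhood, weighted by distance."""
    dim = len(edge_features[0])
    out = [0.0] * dim
    for feat, d in zip(edge_features, distances):
        w = gaussian_weight(d, sigma, beta)
        for k in range(dim):
            out[k] += w * feat[k]
    return out

# Nearby neighbors dominate the aggregate; distant ones are suppressed.
features = [[1.0, 2.0], [3.0, 4.0]]
aggregate = aggregate_edges(features, distances=[0.5, 5.0], sigma=1.0, beta=1.0)
```

A larger beta flattens the top of the weighting function and sharpens its cutoff, which is one way the block could adapt its effective search radius to the object dynamics.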
Third, the final node ($V'$), edge ($E'$), and global ($U'$) features are decoded to obtain node-, edge-, and global-level predictions. The node features $V'$ are processed using the decoding neural network $\phi_v$ to obtain predictions for nodes. Similarly, the decoder neural network $\phi_e$ receives $E'$ and yields a prediction for each edge in the graph. $\phi_v$ and $\phi_e$ mirror the encoder networks $\varphi_v$ and $\varphi_e$, respectively, with an additional (prediction) layer comprising a linear transformation followed by an output activation function (e.g., softmax or logistic sigmoid for classification problems, or linear activation for regression tasks). To compute global attributes, $U'$ is processed by $\phi_u$, an MLP followed by a linear layer and a task-dependent nonlinear activation.
To demonstrate the versatility of MAGIK, we use the same model architecture for all examples. The encoding neural networks $\varphi_v$ and $\varphi_e$ each consist of a series of three MLPs of dimensions 32, 64, and 96. The latent dimension for nodes and edges (i.e., $l' = f' = 96$) is maintained across two FGNN layers in the trunk of the network and is chosen such that it is divisible by the number of self-attention heads in each layer ($Z = 12$). The global embedding vector $U$ is zero-initialized. The node and edge decoding neural networks $\phi_v$ and $\phi_e$ consist of three MLPs of dimensions 96, 64, and 32, followed by a final linear layer and an activation function that map the decoded node and edge features to the output dimension. $\phi_u$ consists of a 64-dimensional MLP followed by a linear output layer and an activation function that returns the global-level predictions.

MAGIK training.
Once the network architecture is defined, MAGIK is trained using a set of graph feature representations and task-dependent targets. The input graphs follow the same relational structure regardless of the task, with nodes describing object detections and edges connecting the objects in time and space. Targets, in turn, represent different parameters depending on the specific task.
For trajectory linking (Figs. 2-3), MAGIK is trained to predict the probability of a connection (link) between two objects. This task is modeled as an edge-classification problem with a binary label (linked, labeled with 1, or unlinked, labeled with 0). Thus, during training, the network aims to minimize the binary cross-entropy between the predicted probabilities and the ground-truth label for each edge. Accordingly, $\phi_e$ uses a sigmoid function as the final activation to produce probability estimates. For the trajectory linking tasks, MAGIK processes input graphs in batches of 8 samples while training. Each sample is obtained from a fraction of frames (10 to 20%), stochastically extracted from the same training video. Graphs are created according to Fig. 1a and augmented by translations, rotations, and mirroring of the set of nodes' centroids. Likewise, the object descriptors are augmented by adding random noise to their values. Moreover, we randomly remove nodes and their connections to account for detection blinking. For all the trajectory linking examples, the network was trained for 100 epochs, each consisting of 512 unique training samples split into batches of 8.
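The geometric augmentation of the centroids can be sketched as follows. This is a minimal example under stated assumptions: the function name `augment_centroids`, the uniform sampling of the rotation angle, and the translation range of ±0.5 are illustrative choices, not values reported here.

```python
import math
import random

def augment_centroids(points, rng):
    """Apply one random mirroring, rotation, and translation to all centroids.

    The same rigid transformation is applied to every node, so pairwise
    distances (and hence distance-based edge features) are preserved.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)                   # assumed: uniform angle
    dx, dy = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)   # assumed range
    flip = rng.random() < 0.5                                 # mirror about the y axis
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y in points:
        if flip:
            x = -x
        out.append((c * x - s * y + dx, s * x + c * y + dy))
    return out

rng = random.Random(1)
centroids = [(0.0, 0.0), (3.0, 4.0), (1.0, -2.0)]
augmented = augment_centroids(centroids, rng)
```

Because the transformation is rigid, only the absolute node positions change while the relational structure of the graph is untouched.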
The inference of local properties is modeled as a node-regression problem (Figs. 4a-d), where MAGIK is trained to minimize the mean absolute error (MAE) between node predictions and ground truth. Here, $\phi_v$ uses a linear activation function as the output activation. The training data come from 2000 videos simulated with heterogeneous sets of moving objects and varying lengths (between 50 and 55 frames), and their graph representations are augmented by translations, rotations, and mirroring of the nodes' centroids (further details are provided in the "Simulations" section). As ground truth, we used either the diffusion coefficient (Figs. 4b-d) or the anomalous diffusion exponent (Figs. S2b-d) of the object at the node level. In these examples, the network was trained for 100 epochs, each consisting of 1024 unique training samples split into batches of 8.
The quantification of global dynamic properties requires MAGIK to be trained to estimate global-level attributes from the input graphs. Throughout the examples, we have approached this problem from different perspectives, from a classification problem to determine the underlying diffusion model of a set of particles (Figs. 4e-l) to a regression problem to estimate the relative fraction of objects moving according to different diffusion modes (Suppl. Figs. S2e-h). For classification tasks, the network is trained to minimize the sparse categorical cross-entropy between class predictions and ground-truth labels, with a softmax as the output activation of $\phi_u$. For regression tasks, MAGIK minimizes the MAE between the network estimates and the target features. Here, $\phi_u$ uses a linear activation function as the output activation. In each of these examples, the training data come from 2000 simulated videos from which we extracted graph representations and augmented their topological structure by translations, rotations, and mirroring of the nodes' centroids. As target features, we used either class labels (for classification tasks) or continuous features (for regression tasks). The network was trained for 100 epochs, each consisting of 1024 unique training samples split into batches of 8.
For all examples, the trainable parameters of MAGIK (i.e., the weights of the artificial neurons in the neural networks and the parameters of the Gaussian edge weighting function) were iteratively optimized using the backpropagation training algorithm [34] and the Adam optimizer (with a learning rate of 0.001) [35].

Postprocessing algorithm for trajectory linking.
Cell trajectories are built from the scores obtained for the edge classification problem through a simple postprocessing algorithm. The algorithm starts from a random node at the initial frame t = 0 and connects it over time with other nodes at subsequent frames, considering only edges that have been classified as "linked" by MAGIK. If no "linked" edges connect the sender node at t with any receiver nodes at t+1, the algorithm checks future frames, up to a maximum time lag. If no "linked" edges are found within this lag, the trajectory is interrupted. If a sender node has two "linked" edges connecting it to two receiver nodes at a later frame, the event is identified as a division. At this point, the algorithm treats the two nodes as independent and attempts to build two new trajectories. In the rare event that more than two "linked" edges originate from the same sender, the one connecting the furthest receiver is dropped. The procedure is iterated until all the "linked" edges have been taken into account.
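The postprocessing steps above can be sketched in a few lines. The example below is a simplified illustration, not the authors' implementation: it visits start nodes in frame order instead of picking them at random, keeps the two nearest receivers when more than two "linked" edges leave a sender, and resolves receivers claimed by several senders on a first-come basis. All names (`build_trajectories`, `max_lag`, the input layout) are hypothetical.

```python
import math
from collections import defaultdict

def build_trajectories(frames, positions, linked_edges, max_lag=2):
    """frames[i]: frame index of node i; positions[i]: (x, y) centroid;
    linked_edges: (sender, receiver) pairs classified as "linked"."""
    successors = defaultdict(list)
    for sender, receiver in linked_edges:
        successors[sender].append(receiver)

    def next_nodes(node):
        # Check t+1 first, then later frames, up to the maximum time lag.
        for lag in range(1, max_lag + 1):
            cands = [r for r in successors[node] if frames[r] - frames[node] == lag]
            if cands:
                # Keep at most two receivers (a division), dropping the furthest ones.
                cands.sort(key=lambda r: math.dist(positions[node], positions[r]))
                return cands[:2]
        return []

    visited, trajectories = set(), []
    for start in sorted(range(len(frames)), key=lambda i: frames[i]):
        stack = [start]
        while stack:
            node = stack.pop()
            track = []
            while node is not None and node not in visited:
                visited.add(node)
                track.append(node)
                nxt = [n for n in next_nodes(node) if n not in visited]
                if len(nxt) == 1:
                    node = nxt[0]          # simple continuation
                else:
                    stack.extend(nxt)      # division: daughters start new trajectories
                    node = None            # no candidates: trajectory is interrupted
            if track:
                trajectories.append(track)
    return trajectories

# Node 1 divides into nodes 2 and 3; node 4 is relinked to node 5 after a missed frame.
frames = [0, 1, 2, 2, 0, 2]
positions = [(0.0, 0.0), (1.0, 0.0), (1.5, 1.0), (1.5, -1.0), (10.0, 10.0), (10.0, 11.0)]
tracks = build_trajectories(frames, positions, [(0, 1), (1, 2), (1, 3), (4, 5)])
```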

Quantification of cell tracking results.
Quantification of the method performance for cell tracking was obtained by calculating the TRA metric based on the acyclic oriented graph matching (AOGM) measure discussed in Ref. [25]. First, images corresponding to the incomplete cell segmentation provided for the 6th Cell Tracking Challenge were annotated according to their ground truth and then transformed into an acyclic oriented graph according to the instructions for participation in the challenge [8]. A similar graph was also obtained for the trajectories predicted by our method. The quantification of the matching between the two graphs performed by the AOGM corresponds to the weighted sum of the operations executed to transform the predicted graph into the ground-truth one [25]. For this, we used the AOGM-A measure, which corresponds to the AOGM measure calculated by keeping only the edge-related weights positive ($w_{NS} = w_{FN} = w_{FP} = 0$; $w_{ED} = 1$, $w_{EA} = 1.5$, $w_{EC} = 1$) [25]. The AOGM-A thus evaluates the ability of an algorithm to follow objects in time (i.e., its linking capability). The AOGM-A measure is normalized to obtain the tracking accuracy (TRA):
$$\mathrm{TRA} = 1 - \frac{\min(\text{AOGM-A}, \text{AOGM-A}_0)}{\text{AOGM-A}_0},$$
where AOGM-A$_0$ corresponds to the cost of linking the graph from scratch (i.e., the cost of adding all the edges multiplied by the corresponding weights). The normalization bounds TRA in the interval [0, 1], with higher values corresponding to better tracking performance.
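As a worked example of this normalization, the sketch below assumes the usual form TRA = 1 - min(AOGM-A, AOGM-A_0) / AOGM-A_0 and takes the numbers of edge operations as given; finding those operations requires the graph matching of Ref. [25] and is not shown. The function names and the operation counts are illustrative.

```python
def aogm_a(n_delete, n_add, n_change, w_ed=1.0, w_ea=1.5, w_ec=1.0):
    """Weighted sum of edge operations: deletions, additions, semantic changes."""
    return w_ed * n_delete + w_ea * n_add + w_ec * n_change

def tra(aogm, aogm_0):
    """Assumed normalization: TRA = 1 - min(AOGM-A, AOGM-A_0) / AOGM-A_0."""
    return 1.0 - min(aogm, aogm_0) / aogm_0

# Hypothetical case: the reference graph has 100 edges, so building it from
# scratch costs 100 edge additions; the prediction needs 2 deletions,
# 4 additions, and 1 semantic change to match the reference.
cost_from_scratch = aogm_a(0, 100, 0)   # 150.0
cost_prediction = aogm_a(2, 4, 1)       # 2 + 6 + 1 = 9.0
score = tra(cost_prediction, cost_from_scratch)
```

The `min` in the numerator clamps the score at 0 for predictions that are more expensive to repair than to rebuild.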

Simulations
Trajectories were simulated using the andi-datasets Python package [36]. Additionally, we used DeepTrack 2.1 to render imaged objects in different illumination modalities (fluorescence and holographic microscopy), reproducing optical conditions to provide realistic node features (Fig. 4).
For the fluorescence microscopy experiments of Fig. 4a-d and Suppl. Fig. S2a-d, we simulated objects performing FBM in two dimensions with random anomalous exponents (0.2 ≤ α < 1.8) and diffusivities (0.005 ≤ D < 0.7). For Fig. 4e-h, the diffusivity was defined by a random spatial map, smoothed with a Gaussian filter. For training, we typically use videos of 50-55 frames containing 30-35 objects (70-80 for varying diffusivity and diffusivity maps) initially positioned at random locations. Each object is rendered as a diffraction-limited spot through the optics module of DeepTrack 2.1, with a random intensity from a uniform distribution between 20 and 80 counts, varying over time with a standard deviation of 3 counts.
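For readers who want to reproduce the flavor of these trajectories without the andi-datasets package, the following sketch samples two-dimensional FBM exactly through a Cholesky factorization of its covariance. It is not the generator used in this work: the covariance convention (mean squared displacement 2Dt^α per dimension, time in frames) and all function names are assumptions made for the example.

```python
import math
import random

def fbm_covariance(n, alpha, D):
    """Cov[x(s), x(t)] = D (s^alpha + t^alpha - |t - s|^alpha) for s, t = 1..n,
    so that the mean squared displacement per dimension is 2 D t^alpha."""
    return [[D * (s ** alpha + t ** alpha - abs(t - s) ** alpha)
             for t in range(1, n + 1)] for s in range(1, n + 1)]

def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def simulate_fbm_2d(n, alpha, D, rng):
    """Return n + 1 positions (x, y), starting at the origin."""
    L = cholesky(fbm_covariance(n, alpha, D))
    coords = []
    for _ in range(2):  # independent FBM along x and y
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        coords.append([0.0] + [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)])
    return list(zip(coords[0], coords[1]))

rng = random.Random(0)
alpha = rng.uniform(0.2, 1.8)    # anomalous exponent range quoted in the text
D = rng.uniform(0.005, 0.7)      # diffusivity range quoted in the text
trajectory = simulate_fbm_2d(50, alpha, D, rng)
```

For α = 1 the covariance reduces to 2D min(s, t), i.e., ordinary Brownian motion with per-frame displacement scale √(2D).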
For all the experiments of Fig. 5, we generated trajectories undergoing three different diffusion models, namely FBM, ATTM, and CTRW, with a constant anomalous exponent α = 1 and random diffusivity. For Fig. 5a-e, each object in the video undergoes 2D diffusion with a randomly assigned model, with all other properties (sequence length, number of particles, intensity) being the same as described for Fig. 4.
For the plankton trajectories illustrated in Fig. 5f-m, all microorganisms in the same video move according to the same 3D model, which varies from video to video. We generate holographic videos of 100 frames including 3-7 microorganisms, each with a refractive index randomly sampled from a uniform distribution between 1.35 and 1.55, covering a wide variety of plankton species in the literature [37].

DATA AVAILABILITY
The cell tracking datasets were obtained from the Cell Tracking Challenge webpage, http://celltrackingchallenge.net/2d-datasets/, where they can be accessed.

CODE AVAILABILITY
All source code and examples are made publicly available at the DeepTrack-2.1 GitHub repository [38].

FIG. 3. MAGIK reliably links trajectories in various experimental scenarios. a, Confocal microscopy of GFP-GOWT1 mouse stem cells. MAGIK achieves F1-score = 99.8% and TRA = 99.2% despite the fact that the cells frequently leave the field of observation. b, Phase-contrast imaging of glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate. MAGIK reaches F1-score = 99.8% and TRA = 100% even though the cells greatly change shape over time. c, Epifluorescence imaging of HeLa cells stably expressing H2b-GFP. MAGIK achieves F1-score = 98.8% and TRA = 98.4% despite the dense sample and frequent mitosis and collisions. d, Phase-contrast imaging of pancreatic stem cells on a polystyrene substrate. MAGIK obtains F1-score = 99.3% and TRA = 98.5% despite high cell density, elongated shapes, pronounced cell displacements, and a significant number of division events. Interrupted trajectories correspond to cases where cells left the field of view or were missed by the segmentation in the image sequence. All videos belong to the dataset of the 6th Cell Tracking Challenge [8]. Results can be observed in greater detail in Supplementary Videos S2-S5.

FIG. 4. MAGIK determines local diffusion properties. a, Simulated single-object tracking experiment where fluorescence microscopy is used to follow the motion of single molecules performing Brownian motion with diffusivity D varying from particle to particle. b, c, Ground-truth and predicted graphs. The edges depict the network of associations used to infer dynamic properties without direct linking. The nodes are color-coded according to the value of the target feature, i.e., the displacement scaling factor √(2D) measured in pixels per frame (color bar in b). d, Probability distribution of the predicted vs. ground-truth diffusion coefficient D, showing good agreement. e, Simulated single-object tracking experiment where fluorescence microscopy is used to follow the motion of single molecules performing Brownian motion with diffusivity D randomly varying in space. f, g, h, Ground-truth and predicted diffusion maps. Ground-truth spatial diffusivity pattern (f) and predictions obtained by MAGIK using 100- (g) and 1500-frame-long (h) movies with ≈ 0.02 localizations per px² per frame. The analysis is performed by breaking down the sequence into 2 and 30 videos of 50 frames each, respectively. Predicted maps are obtained by interpolating the values of diffusivity obtained for the nodes over the 64 px × 64 px grid through a triangulation-based nearest-neighbor algorithm.

FIG. 5. MAGIK estimates local and global dynamic properties at the ensemble and single-object levels. a, Simulated single-object tracking experiment where objects with different underlying diffusion models coexist (i.e., fractional Brownian motion (FBM), annealed transient time motion (ATTM), and continuous-time random walk (CTRW), with anomalous diffusion exponent α = 1). b-d, Probability distribution of predicted vs. ground-truth model fraction for FBM, ATTM, and CTRW, respectively. e, Confusion matrix demonstrating how the network classifies the underlying diffusion model exhibited by objects in 1199 validation videos. The diagonal represents the percentage of correctly classified graph representations, constituting most cases. The off-diagonal cells represent incorrectly classified examples. Column-based normalization is applied, such that the sum along the columns adds up to 1, with minor deviations due to rounding. f-k, Representative frames of simulated holographic videos and corresponding graph representations for the whole image sequence, where objects follow FBM (f-g), ATTM (h-i), and CTRW (j-k), with α = 1. In the graphs, the edges depict the association network used to infer dynamic properties without trajectory linking. l, Confusion matrix showing how the network classifies the underlying diffusion model presented in 1496 validation videos. Column-based normalization is applied. m, MAGIK predicts the anomalous diffusion exponent governing the motion of ensembles of objects performing FBM in 1097 holographic videos. The probability distribution of the predicted vs. ground-truth anomalous diffusion exponent (α) shows good performance throughout the evaluated range.

SUPPL. FIG. S2. MAGIK estimates local and global anomalous diffusion properties at the ensemble and single-object levels. a, Simulated single-object tracking experiment. Fluorescence microscopy is used to follow the motion of single molecules characterized by a fractional Brownian motion (FBM) with varying anomalous diffusion exponent α. b, c, Ground-truth and predicted graphs. Edges depict the network of associations used to directly infer dynamic properties without explicit linking. Nodes are color-coded according to the value of the target feature α. The predicted node values agree with the ground truth also in crowded areas (e.g., zoomed regions I and II). d, Probability distribution of the predicted vs. ground-truth anomalous diffusion exponent α. e-h, MAGIK estimates the relative fraction of objects following different diffusion modes, i.e., sub-diffusion (0.2 ≤ α ≤ 0.6), normal diffusion (α = 1), and super-diffusion (1.4 ≤ α ≤ 1.8). e-g, Probability distribution of predicted vs. ground-truth fraction for sub-diffusion, normal diffusion, and super-diffusion, respectively. h, Confusion matrix demonstrating how the network classifies the underlying diffusion model exhibited by objects in 1199 validation videos. Column-based normalization is applied, such that the sum along the columns adds up to 1, with minor deviations due to rounding.