Objective comparison of methods to decode anomalous diffusion

Deviations from Brownian motion leading to anomalous diffusion are found in transport dynamics from quantum physics to life sciences. The characterization of anomalous diffusion from the measurement of an individual trajectory is a challenging task, which traditionally relies on calculating the trajectory mean squared displacement. However, this approach breaks down for cases of practical interest, e.g., short or noisy trajectories, heterogeneous behaviour, or non-ergodic processes. Recently, several new approaches have been proposed, mostly building on the ongoing machine-learning revolution. To perform an objective comparison of methods, we gathered the community and organized an open competition, the Anomalous Diffusion challenge (AnDi). Participating teams applied their algorithms to a commonly-defined dataset including diverse conditions. Although no single method performed best across all scenarios, machine-learning-based approaches achieved superior performance for all tasks. The discussion of the challenge results provides practical advice for users and a benchmark for developers.


SUPPLEMENTARY FIGURES AND TABLES
Supplementary Figure 1. Screenshots of the interactive tool for performance comparison. a, Summary of the results obtained for T1 and T2 according to the corresponding challenge metrics. Hovering over each symbol reveals the team name and scores. b-d, Plots of the metrics and estimators used to assess methods' performance for T1 (b), T2 (c), and T3 (d). For each task, plots can be displayed for user-selected subsets of the datasets. Sliders and buttons allow data selection based on task dimension, team, trajectory length, noise, α, diffusion model, or changepoint position. The interactive tool is available at http://andi-challenge.org/interactive-tool/.

[Confusion matrices of ground-truth vs. predicted model for teams E, M, and J; entries not reliably recoverable from the extracted text.]

Bayesian inference using annealed importance sampling to sample from the posterior distribution. We used Bayes' theorem to calculate the posterior probability distributions for the models and parameters. The likelihood functions, and to a large extent also the priors, could be derived from the descriptions and code provided by the organizers. Effective Bayesian inference could be achieved for the SBM and FBM [8] models. However, the need to integrate out hidden waiting times impaired effective inference for ATTM, CTRW, and LW. For ATTM and CTRW, we attempted to integrate out the waiting times together with the model parameters using Monte Carlo techniques.
For LW, in 1D we used the forward algorithm on a hidden Markov model (but without including measurement noise) [9], while in 2D and 3D we used a goodness-of-fit test after inference with the other four models to exclude them, followed by a fit to the TA-MSD to obtain the anomalous diffusion exponent of the LW. Tasks: All

First, each trajectory is turned into a graph, where nodes are the positions and edges connect positions following a pattern based on their time difference. Then, features computed from normalized positions are attached to the nodes (e.g., cumulative distance covered since the origin, distance to the origin, maximal step size since the origin). These graphs are then passed as input to a graph convolution module (a graph neural network), which outputs, for each trajectory, a latent representation in a high-dimensional space. This fixed-size latent vector is then passed as input to task-specific modules, which can predict the anomalous exponent or the random-walk type. Several output modules can be trained at the same time, using the same graph convolution module, by summing the task-specific losses. The model can receive trajectories of any size as input. The high-dimensional latent representation of trajectories can be projected down to a 2D space for visualization and provides interesting insights into the information extracted by the model (see details in Ref. [10]).

We build our machine in the context of ensembles and hybrid structures. The preprocessing consists of three steps: 1) the noise is reduced by a 3-point moving average, 2) the input trajectories are re-scaled to 100 points by a spline interpolation, and 3) the trajectories are normalized to the range [0, 1]. First, we prepare each normalized trajectory and extract user-defined features from it as input for the ensemble modules. Then, we construct an ensemble of ten identical modules based on a residual network (ResNet) [11] and a multi-layer perceptron (MLP).
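The three preprocessing steps above can be sketched in a few lines (a minimal illustration using numpy and scipy; `preprocess` and its defaults are ours, not the team's code):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def preprocess(traj, target_len=100):
    traj = np.asarray(traj, dtype=float)
    # 1) reduce noise with a 3-point moving average
    smooth = np.convolve(traj, np.ones(3) / 3, mode="valid")
    # 2) re-scale to a fixed length with a spline interpolation
    t_old = np.linspace(0, 1, len(smooth))
    t_new = np.linspace(0, 1, target_len)
    resampled = CubicSpline(t_old, smooth)(t_new)
    # 3) normalize positions to the range [0, 1]
    span = resampled.max() - resampled.min()
    return (resampled - resampled.min()) / span if span > 0 else resampled * 0.0

out = preprocess(np.cumsum(np.random.randn(250)))
print(out.shape)  # -> (100,)
```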
The ResNet input is the normalized trajectory, and the following MLP receives both the output of the ResNet and the prepared features. Finally, the ten outputs from the ResNet-MLP modules are analyzed by a scalable tree boosting system (XGBoost) [12]. Tasks: T1.1D, T2.1D

The method is based on recurrent neural networks (RNN). The RNN used in all tasks share the same basic architecture and differ only in the last layer or two. All the RNN have two long short-term memory (LSTM) layers (of dimension 250 and 50, respectively). For inference tasks (T1 and T3), the last output of the second LSTM layer is directly connected to the output layer. For classification tasks (T2 and T3), the last output of the second LSTM layer is followed by a dense layer of 20 nodes, which is then connected to the five-dimensional output layer (representing each model, with softmax activation). We train multiple RNN that specialize in analyzing trajectories of a certain length. When presented with a trajectory of length l, we use the predictions of the two RNN trained on the nearest lengths (one on longer trajectories of length L+ and one on shorter ones of length L−) and weigh them according to their distance from l. For T1, we train 14 RNN for different lengths in 1D and 9 RNN for different lengths in 2D. For T2, we train 6 RNN for different lengths in 1D and 4 RNN for different lengths in 2D. In T3, all trajectories have the same length; we train 4 RNN: the first to classify the model of the first segment, the second to classify the model of the second segment, and two inference RNN; each inference RNN predicts the switching time, the first exponent, and the second exponent, and their predictions are then averaged. We follow the same approach in 2D (but there we use a single RNN for the inference). We do not train RNN on 3D trajectories. For 3D data, we take projections onto lower dimensions, use RNN trained on 2D and 1D data, and average their outputs.
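The length-interpolation rule above can be written as a small helper: the two RNNs trained on the nearest lengths L− and L+ are weighted by their distance to the actual length l. The linear weighting is our assumption of how "weigh them according to their distance" is implemented:

```python
def interpolate_predictions(l, L_minus, L_plus, pred_minus, pred_plus):
    """Weight the predictions of the two nearest-length RNNs by distance to l."""
    if L_plus == L_minus:
        return pred_minus
    # closer to L- -> larger weight on the shorter-length RNN's prediction
    w = (L_plus - l) / (L_plus - L_minus)
    return w * pred_minus + (1 - w) * pred_plus

# e.g. a trajectory of length 75 between RNNs trained at lengths 50 and 100
print(interpolate_predictions(75, 50, 100, 1.0, 2.0))  # -> 1.5
```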
All RNN are trained using 3 × 10^6 trajectories generated with the andi-datasets package [14]. To avoid overtraining, we split these trajectories into 30 datasets (each containing 10^5 trajectories) which are successively presented to the RNN. We use the first dataset to train for 5 epochs, splitting it into batches of size 32. We then switch to another dataset, split it into batches of size 128, and train for 4 epochs. We repeat this procedure for 3 other datasets. We iterate the procedure using 5 datasets split into batches of size 512, each considered for 3 epochs, and finally use 20 datasets split into batches of size 2048 for 2 epochs each. For memory reasons, we did not use the batches of size 2048 for trajectories containing large numbers of measurements, such as long or high-dimensional trajectories. We use recurrent dropout (20%) in both LSTM layers.
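The 30-dataset schedule above can be summarized as (number of datasets, batch size, epochs) triples; `train_rnn` is a hypothetical stand-in for the team's actual training call:

```python
# (n_datasets, batch_size, epochs) for each stage of the schedule described above
schedule = [(1, 32, 5), (4, 128, 4), (5, 512, 3), (20, 2048, 2)]

def run_schedule(schedule, train_rnn):
    """Present each of the 30 datasets to the RNN with its stage's batch size and epochs."""
    dataset_id = 0
    for n_datasets, batch_size, epochs in schedule:
        for _ in range(n_datasets):
            train_rnn(dataset_id, batch_size=batch_size, epochs=epochs)
            dataset_id += 1
    return dataset_id

n = run_schedule(schedule, lambda i, batch_size, epochs: None)
print(n)  # -> 30
```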
We preprocess the input data as follows: 1) We take the increment values of the trajectory. 2) We normalize the increments so that they have zero mean and unit standard deviation for each trajectory. 3) To optimize the training, we re-shape the input trajectories into shorter trajectories of higher dimension.

The Feature Extraction Stack long short-term memory (FEST) method was used to solve T1 and T2 and was applied to one-, two-, and three-dimensional trajectory data. The method is divided into two parts: i) measurement of features at each point along the trajectories, and ii) training of a neural network consisting of a stack of bidirectional long short-term memory (LSTM) and fully connected ("Dense") layers [16].
The following features were computed: the displacements ∆r_n(t) = (∆x_n(t), ∆y_n(t), ∆z_n(t)) of a particle between times t and t + n (i.e., the difference between the two particle positions r_t and r_{t+n}, where r_t = (x_t, y_t, z_t) and n ≥ 1) and the distances d_n(t) = sqrt(∆x_n(t)^2 + ∆y_n(t)^2 + ∆z_n(t)^2). The features for the 1D and 2D cases were defined similarly. Subsequently, a mean of the distances between times t − p and t + p, d_{n,p}(t), was calculated as d_{n,p}(t) = (1/(2p+1)) Σ_{k=t−p}^{t+p} d_n(k), where p ≥ 1. All the mentioned features characterize how fast particles move. To gain information on the direction of motion, for the 2D and 3D cases, the angles θ_n(t) between the two displacement vectors ∆r_n(t) and ∆r_n(t − n) were computed. The number of features used as input to the neural network depended greatly on the number of dimensions. For the 1D case, only displacements could be computed; therefore we used ∆x_n, n = {1, 2}. Larger values of n led to smaller feature vectors. For the 2D case, we computed six features: ∆x_1, ∆y_1, d_1, d_{1,1}, d_{2,1}, and θ_1. For the 3D case, 6 other features were used: ∆x_1, ∆y_1, ∆z_1, d_1, d_{1,1}, d_{2,1}. We built two similar neural network architectures for T1 and T2. Using the above-mentioned features, the output for T1 was a predicted value of α, and the outputs for T2 were the probabilities of the input trajectory belonging to one of the 5 diffusive models. The architectures of both neural networks were built using functions from the Keras library [17]. In both cases, we used 3 bidirectional LSTM layers (with 2^6, 2^5, and 2^4 hidden nodes, respectively), followed by 4 Dense layers (with 2^5, 2^4, 2^3, and 1 (or 5) hidden nodes) with Dropout layers in between (with a dropout rate of 0.2 or 0.1). For T1, the ReLU activation function was applied to each Dense layer, while for T2 tanh was applied, with a softmax at the output layer.
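The features defined above can be sketched in numpy for the 2D case (illustrative names, not the FEST implementation):

```python
import numpy as np

def fest_features_2d(x, y, n=1, p=1):
    """Compute the displacement, distance, windowed-mean, and angle features for a 2D trajectory."""
    dx = x[n:] - x[:-n]                      # Δx_n(t)
    dy = y[n:] - y[:-n]                      # Δy_n(t)
    d = np.sqrt(dx**2 + dy**2)               # d_n(t)
    # mean of d_n over the window [t-p, t+p]
    kernel = np.ones(2 * p + 1) / (2 * p + 1)
    d_np = np.convolve(d, kernel, mode="valid")
    # angle between the displacement vectors Δr_n(t-n) and Δr_n(t)
    v1 = np.stack([dx[:-n], dy[:-n]])
    v2 = np.stack([dx[n:], dy[n:]])
    norms = np.linalg.norm(v1, axis=0) * np.linalg.norm(v2, axis=0)
    cos = (v1 * v2).sum(axis=0) / np.maximum(norms, 1e-12)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return dx, dy, d, d_np, theta

x = np.cumsum(np.random.randn(50))
y = np.cumsum(np.random.randn(50))
dx, dy, d, d_np, theta = fest_features_2d(x, y)
print(len(dx), len(d), len(d_np), len(theta))  # -> 49 49 47 48
```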
During training, the models were optimized using the Adam optimizer and, as loss functions, we used the mean squared error (MSE) for T1 and categorical cross-entropy for T2. The described networks had to be trained using trajectories with a fixed number of time points. For that, new datasets were created with the tool provided by the organizers (https://github.com/AnDiChallenge/ANDI_datasets [14]). To cover the variety of lengths that can be encountered in the challenge data, 4 different datasets were generated for each task, each consisting of a different trajectory length: 50, 200, 400, or 600 time points. Thereby, each network was trained 4 times in order to create 4 distinct models. For each case (1D, 2D, and 3D), we created 30000 tracks of length 50 for training and 6000 for validation (denoted 30000/6000) to keep a ratio of 8:2, 7500/1500 trajectories of length 200, 3750/750 of length 400, and 2500/500 of length 600. Training and validation datasets were generated separately to ensure that all combined cases of α and diffusive models were present in both datasets. Training was carried out on a Linux system with a GeForce GTX 1650 GPU and a 2.60 GHz 12-core Intel i7 processor. An early-stopping criterion was used to monitor the validation loss and prevent over-fitting. Finally, during the prediction phase and depending on the trajectory length, a combination of the different models was used to predict the outcome. Any track with a length below 100 was predicted with the model trained on 50 time points (denoted model50), any length between 100 and 300 with model200, between 300 and 500 with model400, and above 500 with model600. This approach increases the accuracy of the prediction when trajectory lengths in a dataset are very diverse. Tasks: T1, T2

The training dataset, consisting of 1D trajectories, is generated at 43 specific lengths (see the open-access link for details).
The total size of the training dataset is about 330 GB. Each trajectory is normalized before training so that its positions have an average of 0 and a standard deviation of 1. A long short-term memory (LSTM)-based recurrent neural network (RNN) model is used for this competition task, where the dimension of the hidden layer is 64 and the number of stacked LSTM is 3. Models for each specific length are trained separately; 80% of the training data is used for training, while the rest is used for validation. We implement the LSTM-based model in PyTorch 1.6.0. The model is trained with a batch size of 512, where the loss function is the mean squared error (MSE). The optimizer is Adam with a learning rate l = 0.001. The learning rate is changed as l ← l/5 if the validation loss does not decrease for 2 epochs. When the number of such changes exceeds 1, training is stopped early to save time and avoid overfitting. The best epoch for a specific length is determined by the lowest mean absolute error (MAE) on the validation set. Inference on the challenge data is guided by the following rule: 1) If the original length of a trajectory is one of the 43 specific lengths, the trajectory is directly used for inference. 2) Otherwise, the new length of the trajectory is set to the closest smaller specific length. For instance, the new length of a trajectory with an original length of 49 would be 45. The trajectory is subsequently transformed into 2 sequences. For clarity, we write the trajectory data as [x_1, x_2, ..., x_T], where T is the original length, and denote by T_n the new length, with T_n < T. The two sequences are [x_1, x_2, ..., x_{T_n}] and [x_{T−T_n+1}, x_{T−T_n+2}, ..., x_T], respectively. Both sequences are used for inference, with model predictions α_1 and α_2. The predicted exponent α of the original trajectory is given by α = (α_1 + α_2)/2.
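The two-subsequence inference rule can be sketched as follows; `predict` stands in for the trained length-specific model, and the list of supported lengths is illustrative:

```python
import bisect

def infer(traj, supported_lengths, predict):
    """Predict α, cutting to the closest smaller supported length when needed."""
    T = len(traj)
    if T in supported_lengths:
        return predict(traj)
    # closest smaller supported length (supported_lengths must be sorted)
    Tn = supported_lengths[bisect.bisect_right(supported_lengths, T) - 1]
    a1 = predict(traj[:Tn])        # [x_1, ..., x_Tn]
    a2 = predict(traj[T - Tn:])    # [x_{T-Tn+1}, ..., x_T]
    return (a1 + a2) / 2

# e.g. with supported lengths [40, 45, 50], a length-49 trajectory uses Tn = 45;
# the dummy model below just returns the subsequence length
alpha = infer(list(range(49)), [40, 45, 50], lambda s: float(len(s)))
print(alpha)  # -> 45.0
```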
To further improve the model performance, 5-fold cross-validation is used. However, due to the time limit of the competition, we only use a 3-fold average. In addition, by analyzing an external validation dataset containing 100000 1D trajectories, the predicted results for the challenge data are multiplied by 1.011 and finally clipped to ensure reasonable predictions. The methods for the 2D and 3D tasks are both based on the 1D solution. We separate the dimensions of the trajectories and treat the data of each dimension as 1D trajectories. Thus, we get predicted exponents α_x, α_y, and α_z for the x, y, and z dimensions, respectively. The final results are α_2D = (α_x + α_y)/2 for 2D trajectories and α_3D = (α_x + α_y + α_z)/3 for 3D trajectories. Tasks: T1

The convolutional long short-term memory (convLSTM) approach combines convolutional neural networks (CNN) and long short-term memory networks (LSTM), similarly to Ref. [19]. An additional linear block placed after the LSTM uses the flattened LSTM output to predict the type of anomalous diffusion of the trajectory. In more detail, the network consists of a convolutional block (ConvBlock), a bidirectional LSTM, and a linear block (LinearOuts). The ConvBlock consists primarily of two one-dimensional convolutions with a filter size of two, each followed by a ReLU. The first convolutional layer is coarser and outputs 20 features, while the second layer takes the output of the first and outputs 64 features. At the end of the convolutional block, we have a dropout with probability p = 0.2, to avoid overfitting, and a one-dimensional MaxPooling layer, which cuts the output size in half by selecting the larger of two adjacent entries. The bidirectional LSTM has three layers, each followed by a dropout with probability p = 0.2.
The final block (LinearOuts) takes the flattened (2D tensor to 1D) output of the LSTM as its input and passes it to a fully connected linear layer, which has five output units corresponding to the five models used to produce the trajectories. The first two linear layers are followed by a ReLU activation and the final layer is not, as the non-linearity is handled during training by an instance of nn.CrossEntropyLoss, called the "criterion". Training of our method for the AnDi challenge was done using a hidden size of 32 and a learning rate of 0.001. However, later testing has shown that the model accuracy can be improved by increasing the hidden size to 128; beyond that point we see a drop in accuracy. Training was performed by merging two datasets generated with the andi-datasets package [14], the first containing 189810 trajectories and the second 150000. The resulting combined dataset was split into 75% training data and 25% test data. From the training data, an additional 20% was reserved as validation data for our early-stopping algorithm, which saves the parameter state if there is an improvement in the mean validation loss, computed at the end of each epoch. We used 100 epochs and a patience of 10 for early stopping. Tasks: T1.1D

Our model combines feature engineering and the use of an extreme learning machine (ELM). In brief, raw trajectories were first standardized to set their starting coordinates to zero and to have a unit standard deviation of displacements for t_lag = 1. For each t_lag = 1, ..., 7, two features were calculated, corresponding to log⟨|x(t + t_lag) − x(t)|^k⟩ / log(t_lag + 1) for k = 1, 2. In addition, the correlation of the absolute displacements obtained for t_lag = 1 was also included, for a total of 15 features per trajectory. Features were standardized using the z-score over the training dataset.
The mean and standard deviation obtained for each feature of the training dataset were saved and later used to standardize the validation and test datasets. For a training dataset of n trajectories and f features with target values T, the n × f feature matrix X is fed into an ELM composed of a single-hidden-layer feedforward network (SLFN) with m = 1000 hidden nodes [21,22]. A matrix of initial weights W of size f × m and a bias vector b of size 1 × m are randomly initialized to connect observations to targets through T = f(XW + ub)B, where f(·) represents the sigmoid activation function, u is a unitary vector of size n × 1, and B is the matrix of output weights. Denoting the hidden-layer matrix H = f(XW + ub), the training of the SLFN is converted into solving an over-determined linear problem, whose least-squares solution corresponds to the Moore-Penrose pseudoinverse of H [21,22]. The SLFN was trained either as a regressor or as a classifier to provide predictions for T1 and T2 for 1D trajectories. Training was performed using only the dataset provided by the organizers (10000 trajectories per subtask) during the Development phase of the challenge. Training typically took 5 seconds on a MacBook Pro with an 8-core 2.4 GHz Intel Core i9 processor. Tasks: T1.1D, T2.1D

We use a convolutional neural network structure adapted from the models used in Refs. [23,24]. For T1 and T2, it consists of a series of convolutional blocks, followed by a global max-pooling layer over the temporal dimension, which feeds into a dense network. For T1, the model outputs a single number representing the predicted anomalous exponent. For T2, the model outputs 5 numbers, representing a probability (from 0 to 1) for each diffusion type. For T3, the convolutional blocks are followed by a 1 × 1 convolutional network, which outputs an array of size (1, n), where n is the number of steps in the trajectory, representing the probability of a switch at each position in the trajectory.
The same network architectures can be used in 1D and higher dimensions, varying only the number of input features. Models were built using TensorFlow in Python, and the code is available on GitHub.
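The reason a global max-pool over the temporal dimension lets one architecture handle all trajectory lengths can be illustrated in a few lines of numpy (a conceptual sketch, not the team's TensorFlow model): after convolution, the feature map has shape (filters, time), and taking the max over the time axis yields a fixed-size vector regardless of the input length.

```python
import numpy as np

def conv_then_global_maxpool(x, kernels):
    """Apply 1D convolutions to a trajectory, then max-pool over the time axis."""
    # x: (time,) trajectory increments; kernels: (n_filters, k) conv filters
    feats = np.stack([np.convolve(x, k, mode="valid") for k in kernels])
    return feats.max(axis=1)  # (n_filters,) -- independent of len(x)

kernels = np.random.randn(8, 3)
v_short = conv_then_global_maxpool(np.random.randn(50), kernels)
v_long = conv_then_global_maxpool(np.random.randn(1000), kernels)
print(v_short.shape, v_long.shape)  # -> (8,) (8,)
```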
Training data were generated using the andi-datasets package [14]. Trajectories were first preprocessed by taking the difference between successive positions, and normalized by dividing by the mean step size. For T1 and T2, a single model was simultaneously trained on trajectories of all lengths (ranging from 5 to 1000 steps).

This approach is based on theory, as opposed to purely data-driven methods. Anomalous diffusion can be described by more than just the Hurst exponent. The assumptions of the central limit theorem, which lead to standard diffusion, can be violated in three distinct ways: increment correlations (as in FBM), a fat-tailed increment distribution (as in CTRW), and non-stationarity of the increments' distribution (as in SBM). Each of these three paths can be characterized by its own scaling exponent, which can be measured directly from data using methods of time-series analysis. The exponent J, describing the first violation, can be measured, e.g., using detrended fluctuation analysis. The exponent L, for the second violation, is measured from the temporal scaling of the time average of the squared increments of the process. Finally, the exponent M is measured from the scaling of the time average of the increments' absolute value. These exponents can be measured in any number of dimensions. Their sum leads to the Hurst exponent: H = M + L + J − 1 [25][26][27].
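Extracting such a scaling exponent amounts to a log-log fit of a time-averaged observable against the lag. A generic sketch, using the TA-MSD of Brownian motion rather than the team's noise-filtered estimators:

```python
import numpy as np

def loglog_slope(x, lags):
    """Fit the log-log slope of the time-averaged MSD versus lag."""
    x = np.asarray(x, dtype=float)
    msd = [np.mean((x[l:] - x[:-l]) ** 2) for l in lags]
    slope, _ = np.polyfit(np.log(lags), np.log(msd), 1)
    return slope  # for Brownian motion this estimates 2H = 1

rng = np.random.default_rng(0)
bm = np.cumsum(rng.standard_normal(100000))  # ordinary Brownian motion
slope = loglog_slope(bm, lags=np.arange(1, 20))
print(slope)  # close to 1
```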
To estimate the Hurst exponent for T1, we evaluate J, L, and M using methods specifically adapted for noise filtering. Importantly, this approach is not model-dependent, and our algorithm can also be applied to other types of data, not generated by one of the five models in the AnDi challenge. For T2, we construct a small set of educated questions, targeted to characterize different properties of the paths in the dataset via precise analysis of the increments of the process. When comparing various models outside of the AnDi challenge, we would need to construct a new set of questions for the new models. Some of the questions are aimed at various general relations between the three exponents described above; others, at more specific properties of the individual types of paths involved in the challenge. The answer to each question can be "yes" (= 1), "no" (= 0), or "maybe" (= 2). An example of a question about the exponents: Is (J − 0.5) > (M − 0.5) + (L − 0.5)? Namely, is the effect of autocorrelations on the Hurst exponent stronger than the combined effect of the increment distribution? This question separates FBM and LW on one side from ATTM and SBM on the other. An example of a question beyond the exponents is given by the comparison of the autocorrelations of the increments of the process versus those of their absolute value. This question is highly selective for distinguishing Lévy walk from all the others. For each trajectory in the competition dataset, we generate a set of answers using the same algorithm, and then generate an array of probabilities for this set to be either ATTM, CTRW, FBM, LW, or SBM. This is done by counting how many times a similar line of answers appeared in the training set for each type of process, divided by the total number of occurrences.
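The counting scheme above can be sketched directly: each trajectory's answer vector is looked up in the training set, and the class probabilities are the per-model counts of that pattern divided by its total count. The answer patterns and labels below are hypothetical, not the team's data:

```python
from collections import Counter, defaultdict

MODELS = ["ATTM", "CTRW", "FBM", "LW", "SBM"]

def build_lookup(train_answers, train_labels):
    """Count, for each answer pattern, how often it appears under each model."""
    counts = defaultdict(Counter)
    for ans, label in zip(train_answers, train_labels):
        counts[tuple(ans)][label] += 1
    return counts

def classify(ans, counts):
    """Per-model probabilities = pattern counts / total occurrences of the pattern."""
    c = counts[tuple(ans)]
    total = sum(c.values())
    return [c[m] / total if total else 1 / len(MODELS) for m in MODELS]

# toy training set: pattern (1, 0) seen twice for FBM and once for LW
counts = build_lookup([(1, 0), (1, 0), (1, 0), (0, 1)], ["FBM", "FBM", "LW", "CTRW"])
probs = classify((1, 0), counts)
print(probs)  # FBM gets 2/3, LW gets 1/3, the rest 0
```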
Each of these categories is then branched into a different network with 5 equally spaced α categories in the corresponding α range. The (overestimated) predicted value of α is the average value in that category. In a second method, the result of the classification is not used as an input but is used to split the data into 5 categories, each analyzed by a different network (architecture and training as above). In particular, the networks for ATTM and CTRW have 5 α categories in the range 0.05 to 1. The network for LW has 5 α categories in the range 1 to 2. Finally, the prediction for FBM and SBM is based on a 1 × 2 tree of networks, with the parent network having 2 equally spaced categories in the range 0.05 to 2, each then refined by a 5-category network in the corresponding range. The (underestimated) predicted value of α is the average value of the corresponding α range. Tasks: T1, T2

We have defined a recurrent neural network (RNN) architecture based on a convolutional layer for feature extraction, a bidirectional long short-term memory (LSTM) to learn the characteristics of the trajectory, and Dense layers to smooth the signal toward the final result. For T1, we have followed the same approach, but building up to 12 different models for trajectories of different lengths. We have built models for trajectories in the length intervals [10,20], (20,30], (30,40], and so on.

RISE makes use of several series-to-series feature extraction transformers (fitted auto-regressive coefficients, estimated autocorrelation coefficients, power spectrum coefficients), which provide data to build a time-series forest classifier. MrSEQL converts the numeric time-series vector into strings to create multiple symbolic representations of the time series. The symbolic representations are then used as input for a sequence learning algorithm, to select the most discriminative subsequence features for training a classifier using logistic regression.
Tasks: T2

Our approach is related to the feature-based methods described in Refs. [32][33][34], with an extended list of features used to extract the trajectories' characteristics. We used the gradient boosting algorithm in the XGBoost (T1) and Gradient Boosting (T2) architectures. Such procedures allow us to examine trajectories of different lengths by extracting characteristics such as the diffusion coefficient, anomalous diffusion exponent, fractal dimension, or Gaussianity. The full set of features is listed in the GitHub repository. Each task and dimension gets a different set of features, depending on the problem behind the task. Both algorithms (Gradient Boosting, XGBoost) belong to the class of ensemble learning, i.e., methods that generate many base classifiers/regressors (decision trees in this case) and aggregate their results. We decided to use these classifiers as the idea behind them is easy to understand and interpret. The training was performed on 70000 trajectories generated using the andi-datasets package [14] (for each task and subtask). Each set was balanced with respect to the anomalous exponent value (T1) or the model (T2).

The atoms are radially confined by a running-wave optical trap. Axially, the atoms are trapped within the sites of the lattice formed by two counter-propagating laser beams. During the experimental sequence, only the lattice potential is lowered, while the radial confinement is held constant at all times. This allows one to limit the diffusion of the atoms along the lattice axis, justifying an effective one-dimensional description. Number of trajectories: 3331. Trajectory length: ≈ 10 frames. Frame rate: 2 frames/s. Localization precision: 2 µm.