Fast detection of slender bodies in high density microscopy data

Computer-aided analysis of biological microscopy data has seen a massive improvement with the utilization of general-purpose deep learning techniques. Yet, in microscopy studies of multi-organism systems, the problem of collision and overlap remains challenging. This is particularly true for systems composed of slender bodies such as swimming nematodes, swimming spermatozoa, or the beating of eukaryotic or prokaryotic flagella. Here, we develop a end-to-end deep learning approach to extract precise shape trajectories of generally motile and overlapping slender bodies. Our method works in low resolution settings where feature keypoints are hard to define and detect. Detection is fast and we demonstrate the ability to track thousands of overlapping organisms simultaneously. While our approach is agnostic to area of application, we present it in the setting of and exemplify its usability on dense experiments of swimming Caenorhabditis elegans. The model training is achieved purely on synthetic data, utilizing a physics-based model for nematode motility, and we demonstrate the model’s ability to generalize from simulations to experimental videos.

Multi-organism detection can be achieved at increasing levels of fidelity: at the crudest, only center-of-mass locations or bounding boxes are predicted [36] which does enable tracking of organisms but provide little morphological information.In contrast, pixel-wise segmentation models [12] and pose estimation using keypoints [37] reveal accurate shape dynamics when employed on high-resolution data.However, these methods rely on high definition objects, as segmentation and prediction are highly sensitive to noise.In particular, for organisms that are long and slender, pixel-wise segmentation fails at low resolution, as correct predictions require sub-pixel accuracy.Moreover, at high densities, these methods may fail due to their inability to properly handle overlap between organisms.
Here, we consider the problem of studying slender organisms at low resolution and high density with the goal to enable both accurate identity tracking and quantification of shape dynamics.This problem has traditionally been approached by employing pixel-wise segmentation and subsequent skeletonization procedures [38][39][40][41][42][43], an approach that requires ad-hoc procedures to solve the problem of correctly identifying overlapping organisms [44], the combinatorial complexity of which blows up at high densities.To this end we abandon pixel-wise output and instead construct a neural network architecture that predicts, potentially overlapping, splines directly [45][46][47].Our method enables both accurate shape prediction and tracking in dense experiments of slender objects.This is applicable to a broad class of systems [Fig.1], including tracking of nematode worms [48][49][50], spiral or elongated bacteria [51][52][53][54], spermatozoa [55,56], the flagella of both eukaryotes [42,43] and prokaryotes [57], and freely swimming flagella such those of microgametes [58].Ref. [56].
Our method relies on recent advances in deep learning [59][60][61][62][63] and extends these by a few simple ideas: We found that humans are better at correctly resolving overlap between moving bodies when given access to videos rather than still micrographs.Thus, to allow the neural network to encode the identity of individual bodies as a function of their motion, the input to our neural network is taken to be short video clips rather than single frames.Our network outputs multiple independent predictions, and for each produces (1) the spline representing the centre-line of an object, (2) an estimated confidence score for the prediction, and (3) a latent vector, the space of which we induce a metric on that measures whether two predictions are trying to predict the same body.To train the network, each output quantity is associated with a specific loss term, where, importantly, the spline loss term is permutationinvariant in the labels.To resolve overlap, we do non-max suppression [36], but rather than measuring distances between spline predictions, we use the latent space output, which allows two predictions to be kept even though they are close in physical space.This enables correct predictions for data in which objects overlap very closely.Our method is further tailored to support the subsequent tracking process, which must link unique predictions from frame to frame.To that end, we not only predict the object location at a single timepoint but also predict consecutive past and future splines.Using these time-resolved predictions in the linking process enables high-precision tracking even through dense regions.
Our method is principally applicable to all microscopy datasets that involve slender bodies.In this paper, we focus on its applications for tracking dense experiments of crawling C. elegans worms, a popular model system in neuroscience [64], human diseases [65], drug discovery [32], motor control [66], memory [67], and ageing [68].Studies of C. elegans often rely on phenotypic assays that measure the motility of the nematode worms as a function of some environmental condition or treatment [35,[69][70][71][72][73][74][75][76][77][78][79][80][81], the throughput of which can be massively increased if the overlap between organisms can be tolerated.Likewise, resolving identities of organisms during overlap is crucial for studies of interactions between organisms [82].Previous work on tracking C. elegans have generally employed classical computer vision approaches to accurately track single or a few high-definition worms [39,[83][84][85][86], or many low-resolution worms at non-overlapping densities [40,87,88], in some cases by utilizing a computational model of the worm motion for hypothesis tracking [39,83].Recently, deep learning techniques have been utilized to track C. elegans worms using e.g.bounding box predictions [89][90][91] and fully resolved centre-line splines in the case of isolated worms [92], allowing for detection also during periods of self-overlap.
With this paper, we publish a dataset of videos of motile C. elegans worms imaged at a wide range of densities.The dataset includes ∼ 1,500 labelled splines that we use to evaluate, but not train, our detection model.We demonstrate that our model can be trained exclusively using synthetically generated data and yet generalizes well to real videos.Our method leverages the parallel capabilities of convolutional neural networks and is thus able to handle thousands of detections in a single pass, resulting in real-time detection at ∼ 90 Hz at 512 × 512 resolution on a single GPU.The code is open source and available at https://github.com/kirkegaardlab/deeptangle.

Architecture
Figure 2 illustrates the overall structure of our approach.Our model is based on single-stage detection models [36,59] that output many candidate predictions per target in a single forward pass and rely on a score system to prune until a single candidate is left for each target object.The performance of such single-stage models has been shown to enable accurate real-time bounding box detection [62].The backbone of our neural network [Fig.2a] consists of convolutional residual networks [60] with the small modification that we employ average pooling rather than max-pooling to avoid translational invariance in the spline predictions, which need to be accurate to a sub-pixel degree.
We take the input to our model to be a stack of consecutive frames in order to provide the model with a temporal context [Fig.2c].This has previously been shown to improve the detection of e.g.partially hidden objects [93].In particular, in the present case of motile slender objects where dynamic crossings and overlap between objects are very common, a temporal context can provide the necessary information to resolve the problem of correct identification.Furthermore, the temporal context allows the output of our model to include information on the motion of the splines, which we will further exploit for tracking purposes.
The backbone of our neural network performs a 16 2 -fold reduction in resolution when mapping the input images to feature space, from which the network outputs multiple anchored predictions.We choose the resulting number of candidates to be considerably larger than the number of objects in the frame, thus ensuring that all objects have suggestions.The anchored approach further means that the only restriction on input size is that its dimensions be divisible by 16, and, in particular, it allows training at a certain resolution H × W and subsequent inference at another H × W without loss of accuracy.
The output of our model is composed of spline predictions, confidence scores and latent vectors: Spline predictions We choose to represent the centre line of the slender bodies of interest by arrays consisting of k equidistant points [Fig.2d].These coordinate arrays, which we refer to as splines, become high-precision descriptors even for complex shapes when k is chosen large.To reduce the complexity of predicting k points, we embed the   1), the minimum of which corresponds to correct head-tail alignment and is the one that will be used in the model.spline representation with a principal component (PCA) transform A, the dimension κ of which can be much smaller than k [94].The PCA components λ represent the shape, and in addition hereto, the network also predicts the offset x 0 of the spline, the internal calculation of which is done in a local coordinate system defined by the anchor points.Thus, instead of predicting 2k floating point values per spline, the network needs only output κ + 2.
The temporal context of the input image stack permits output spline prediction also for the non-central images.In our approach, we predict a set of three splines z = [x − , x, x + ] corresponding to the three central frames [I − , I, I + ] of the input stack [Fig.2c].We consider the central spline x the main output, whereas the past x − and future x + splines are considered auxiliary predictions whose main purpose lies in their use during the latent space encoding as well as the tracking procedure.
We define the similarity measure between two splines by the standard Euclidean distance.In the case of splines that look symmetric from either end, we exploit this symmetry and employ the flip-invariant distance defined by as illustrated in Figure 2e.Likewise, we define a distance between two collections of consecutive splines z, z by their weighted average , where the weights can be adjusted to give focus to central predictions, and for the present case we choose ω = 2ω − = 2ω + .The neural network is trained to minimize the distance d 2 s between predictions and labels.To do so, we let the independent predictors specialize for different shapes.This is achieved by using a permutation-invariant loss such that the total loss is computed as a sum over the labels only, each using the predictor that best match the labels.Thus many spline predictions will not contribute to the spline loss.
Confidence scores Each independent prediction of the network includes a confidence score s, which is used to filter out bad candidates.In bounding box or mask detection, intersection over union (IoU) is commonly used to evaluate the accuracy of a prediction, however, this metric does not generalize well to spline predictions when there is overlap.Instead, we introduce a custom metric to define the goodness of a spline set z by comparing it to its label ẑ, Here, σ s is a parameter that sets the scale over which the score varies.The metric is sensitive to perturbations on accurate predictions, i.e. predictions close to labels where d s → 0, but loses sensitivity the worse the predictions are.This is a useful feature as correct scoring for good predictions is crucial for choosing the best one, whereas low-scoring predictions are discarded in any case and their relative scoring is therefore unimportant.The score prediction is trained using L2 loss.To avoid conflicting backwards error propagation between this task and that of spline prediction (as scoring bad predictions is easier), we stop the gradient flow in the computational graph on the last layer of the score-predicting part of f θ [Fig.2a] such that it does not interfere with the accuracy of the predicted splines.
Latent space for candidates suppression Finally, we need to ensure that there is only one prediction per object.Bounding box detectors let the user decide the fraction of overlap between prediction boxes of the same class that should be considered to be targeting the same object.As our method must work at high densities, this task is complicated by the fact that two predictions might be very close, even completely overlapping in the central frame, and yet represent different objects.The task of choosing a suitable cutoff distance is therefore difficult, and we make this a trainable task.We do so by embedding each prediction in a low-dimensional latent space in which comparison between predictions is cheap, thus allowing efficient and fast candidate suppression also at high densities.
Our method computes the latent vectors p for predictions using an auxiliary neural network, q φ which acts directly on the eigenvalues λ and offsets x 0 rather than the more redundant spline coordinate points.We induce a Euclidean metric on the latent space with the interpretation that two predictions i, j are predicting the same object with probability Here, σ l is a real-space visibility cutoff that prevents far predictions to interact in the encoded space, thus avoiding the need to scale the dimensionality of the latent space with the number of candidates or the input size.We note that when using the flip-invariant metric d s on splines, we explicitly construct the latent space encoder to likewise be flip-invariant (see Methods).
To train the latent space, we make the assumption that during training predictors are 'trying' to predict the label closest to the prediction spline.Combined with the probability interpretation, this allows us to use binary cross entropy as a loss function for the probability defined in Eq. (3).To avoid wrong clustering between undefined close-by predictions, the loss contribution of each prediction is scaled by the product of their real scores ŝi ŝj , thus ensuring that the network focuses its attention on good predictions that will not be filtered out.Finally, since the encoder should not alter the performance of the spline suggestions, the loss on the latent space representations only updates the weights q φ of the encoder but is trained concurrently with the main model.
We employ non-max suppression to choose the best prediction of each object, but with distances measured in latent space, as illustrated in Fig. 2b.Concretely: Once all the predictions whose score is lower than a threshold τ s have been discarded, multiple candidates are likely to still remain for each target object.The lack of low score predictions exposes clusters in the latent space that correspond to single objects.We sort the remaining predictions by their score, automatically accepting the highest-scored one.Once a prediction i is accepted, all predictions j that have a high probability P(i ↔ j) > τ o of being the same object are removed.This is equivalent to setting an exclusion radius r l in the latent space as shown in Fig. 2b.We keep iterating on the remaining predictions, pruning the latent space until all candidates have been iterated.The final number of accepted predictions should equal the number of objects in the frame.

Detection on dense C. elegans experiments
To evaluate our approach, we study microscopy videos of crawling C. elegans worms.We are particularly interested in videos captured at much higher densities than those typically used in motility experiments.Thus we evaluate our model on wide-field videos captured under approximately uniform illumination [40], exemplified in Fig. 3a.In our dataset, the number of nematode worms varies ranging from ∼ 400 with a small probability of overlap occurring ( ≈ 0.05 average overlaps per worm) to extremely densely packed plates with up to ∼ 6,000 nematodes, where there is, on average, one overlap per worm.This means that in the dense plates, detection methods that stop tracking after contact between worms happens are rendered completely ineffective.
Defining worm density ρ as the number of worms in a region per square millimeter, we find, as expected, a linear relation between the average amount of overlap per worm and the density [Fig.4a].Due to the spatial heterogeneity of the worm distribution inside the plate, higher densities can be observed when considering small regions.On 100 mm 2 scales, the highest density in the dataset is ρ ∼ 2.5 mm −1 , but this jumps to an extreme ρ ∼ 3.5 mm −1 when considering 10 mm 2 regions, where humans begin to struggle to correctly identify worms.For quantitative evaluation of our model, ∼ 200 random regions of the videos were sampled and hand-labelled resulting in ∼ 1,500 labelled worm splines.A sample of frames is shown in Fig. 3b to provide a sense of the different densities encountered in the evaluation dataset, with the predictions of the model overlaid.Simulation-based training.To train our network, we implement a physics-based synthetic dataset generator to exploit perfectly defined labels.This approach removes the need for a supervised dataset, and also allows labelled videos in situations where manual labelling may not be reliable, or where the subjectivity of the human labellers can result in inconsistent labels.Physics-based synthetic datasets have successfully been used to train systems on similar conditions, for instance where manual labelling may introduce unnecessary noise or bias to the model [16].
Our in-silico data generator has two main components: a physics-based model for the organism and a synthetic frame generator.
In-silico worms are generated on demand every training step which removes the possibility of overfitting to the generated frames.In order to train the model to work effectively with a range of worm densities, we generate batches with different numbers of worms in a uniform manner, without bias towards low or high worm counts.This teaches the model to handle a variety of densities without overfitting to any specific case.And to make the model more robust, training also happens on densities whose manual annotation would be extremely challenging.The simulation and video synthesis are implemented in a GPU framework which enables fast end-to-end training without the performance penalization of data transferring between the accelerator and the host machine.
We base the worm simulation on resistive force theory, as it has previously been shown to correctly predict the position of the skeleton for short spans of time [95].Since the network only perceives the frames surrounding the target frames, we found the total duration of the clip to be short enough that a linear crawling model approximation fits our needs.The physics-based model should encapsulate all types of organism behavior.This can be achieved by oversampling the behavior, i.e. by making the simulations more diverse in the behavior than reality and thus hope to include all types of real behavior as well.Details on the worm simulation and video synthesis can be found in the methods section.
Despite the potential for physics-based simulations to be used for synthetic training data, discrepancies with real data may lead to inaccuracies when applied to real microscopy images.This reality gap can be the result of an overly simplified motility model or physics model or the result of imprecise video synthesis.The gap may be further increased by the fact that the model relies on the PCA transformation matrix A obtained on synthetic data, where the number of PCA components used have been chosen to accurately reproduce all synthetic patterns, but not necessarily to generalize to out-of-sample videos.Thus we find that our model is limited to accurate skeleton predictions only on shapes that resemble those produced by our simulations, and the goal of the simulations is therefore to reproduce a broad spectrum of possible motility patterns.Likewise, we find that our model is susceptible to the brightness of the videos, and accordingly, we adjust the real videos to increase their resemblance to the training data (see Methods).
Metrics Despite being trained exclusively on synthetic data, the model's inference performance is very good on real clips.From visual inspection, no immediate discrepancies are observed between detections in low density clips and at high density [Fig.3b].Likewise, per design, the network accuracy is independent of the input clip dimensions, and the parallel structure of convolutions permits the use of large videos covering thousands of nematodes to be processed simultaneously in a single forward pass [Fig.3a].We note, however, that even though no quality impact on detections is observed when using large fields-of-view clips, there can be a dependency if non-uniform illumination is used as different sections of the frame may have different requirements for preprocessing.
For a quantitative assessment of the method accuracy, we compare to the manually labelled dataset, an example of which alongside the model predictions can be seen in Fig. 4c.As the predictions are densely defined splines (here, ∼ 50 points), we introduce a custom metric to suitably evaluate the accuracy of the predictions using labels with lower fidelity.The metric used must be shift-invariant, as having points anywhere along the spline should yield zero error regardless of whether the label points precisely coincide with the prediction points or not.Likewise, label points should be monotonically assigned along the spline in order to avoid artificially reducing the error for strongly bent or self-coiling worms.Finally, it must be robust against the subjectivity of the labellers, as manual annotations might miss or avoid spots where visibility is low such as the end-points of the worms.
To satisfy all these requirements, we introduce a metric based on the dynamical time warping (DTW) distance used to measure the similarity between temporal curves.In our modified version, asymmetric DTW, summation only runs over label points.Thus, the metric δ adtw is defined as follows: Let d(i, j) be the Euclidean distance between label point i and prediction line segment j, then where α : [1, N ] → [1, M ] is a monotonic (non-decreasing or non-increasing) assignment of the N label points to the M prediction line segments.A visual representation of the metric is shown in Figure 4b, and the O(N M ) algorithm for its calculation is detailed in the Methods section.The results of evaluating the trained model on the labelled dataset are shown in Fig. 4. For reliable comparisons, we first solve the assignment algorithm for the label-prediction pairs.This means that in the case of two completely overlapped worms, two predictions need to be present to not count as a miss, and likewise, two predictions cannot be considered to target the same label.We find an average error of δ adtw ≈ 0.54 px with no strong dependency between accuracy and density of worms [Fig.4d], with the exception of a slight increase in error for extremely dense clips (∼ 3.5 mm −2 ).The average error corresponds to less than the width of a worm (≈ 2 px ≈ 50 µm), and part of this can be attributed to the fact that humans accuracy is also near the half-pixel level [Fig.4c].Some outliers can be seen, however, which can mostly be attributed to an artefact of the model, where the network mistakes a single long worm for two overlapping shorter predictions.This effect seems particularly sensitive to incorrect intensity normalization of the videos.
Let σ be a cutoff distance above which we no longer consider the predictions to be targeting the closest label.For all the figures in Figure 4, this cutoff is assumed to be σ = 3.0 px, and we observe no significant changes by tuning it within the range of sensible values.We define the True Positive (TP) rate as the fraction of predictions that both get assigned a label and this label is within the distance σ of the prediction.Figure 4e shows that the model rarely predicts a spline where there is nothing with a TP rate of 0.999.Nevertheless, there are some predictions that do not get assigned a label which can be attributed to the double-prediction artefacts just mentioned.The likelihood of this happening decreases with density, but the rate is so low that it is almost negligible.Similarly, we define the False Negative (FN) rate as the fraction of labels that are not assigned a prediction closer than σ .Fig. 4f shows that the model in general manages a low FN rate at around ∼ 0.015, but that this increases to a rate of ∼ 0.06 at extreme densities such as ρ ≥ 3.0 mm −2 , where clusters tend to be densely packed and manual labelling likewise becomes challenging.
The filtering part of the model depends on the previously introduced thresholds τ s and τ o .The score threshold, 0 < τ s < 1, is used to prune low score predictions [Fig.2b(1)], while the overlap threshold, 0 < τ o < 1, is used to decide the probability of two independent predictions to be targeting the same object [Fig.2b(2)].Throughout this paper, we have set these to τ s = τ o = 0.5.However, due to their relevance in modifying the filtering process, we evaluate how different combinations of thresholds may alter the performance results.Figures 4g-i show the average performance obtained across all densities when filtering the predictions with variable thresholds.In spite of some dependency between worm density and TP/FN rates, we consider the average metric to be a good indicator of the performance in each case.
Fig. 4g shows the effect of the thresholds on accuracy.No significant dependency on the thresholds is observed.This can be explained by the fact that accuracy is determined by the best predictors only, which are not discarded until a high τ s is used, and once those are removed, τ o becomes irrelevant.Further, the fact that there is no notable difference between different values of τ o indicates that the clusters are highly compact.
In contrast, Fig. 4h shows that the TP rate has a stronger dependency on τ o at low τ s because low score predictions do not form compact clusters, and therefore a larger exclusion radius is required to discard them.Finally, Fig. 4i shows that misses only begin to occur once the best predictions are discarded, and a strong dependence on the τ s is not observed before that point.

Tracking from consecutive detections
Motility assays require not only accurate detections but also the ability to link these across frames to form timeresolved tracks of individual organisms.This is challenging at high densities where we have the breakdown of the assumption that the closest detected object to the previous frame corresponds to the same identity.In general, greedy approaches to particle tracking such as assigning directly the closest particle in consecutive frames frequently leads to failed tracks.Instead, the process of tracking can be efficiently formulated as a set of linear assignment problems [96].Naturally, here we can expand upon particle tracking by using a metric that measures distances not between the center-of-mass of the worms, but between the full splines as defined in Eq. ( 1).This works well for most predictions but can fail for fast-moving worms or in dense clusters.
A separate approach to tracking is Kalman filtering.This would require separate detection of entry and exit events of worms, as well as a probabilistic model for worm motility, which would most likely have to be highly nonlinear.Kalman filtering is viable for the tracking of few organisms, but for present large-scale systems, we require a more efficient approach.As previously mentioned, splines from adjacent frames are also predicted in order to embed temporal information into the latent vector.We propose a directed metric that leverages both past x − and future x + spline predictions [Fig.5a].Thus to find a mapping σ from one frame to the next, we solve Identity assignment can be seen as a network flow global optimization where nodes represent detections and edges carry the cost of assignment.To avoid having to perform all possible combinations of assignments, we include a physical distance threshold on the midpoint of the central spline.This threshold significantly simplifies the assignment scheme and improves the runtime of the filtering process [Fig.5b].
To quantify the performance of these methods, we define the tracking integrity ι as a scalar that indicates how consistent the assignment of a label to a prediction is along the tracked video.Perfect tracks have ι = 1, whereas labels that get assigned two different identities for half of the duration of the video have ι = 1  2 , and so on (see Methods for a detailed definition).We evaluate this on synthetically generated videos of 10 seconds (200 frames) that have perfectly labelled tracks, the results of which are shown in Fig. 5c.On videos with densities up to 2.0 mm −2 , we achieve an average integrity of ι ≈ 0.97.This is a ∼ 30 % improvement of the error over using direct spline assignment defined in Eq. ( 1).We observe that the integrity is almost perfect at low densities, but drops to ι ≈ 0.93 at the highest densities.
When applied to high density videos of C. elegans, the tracking method is able to keep track of individual worms as they pass through clusters of other worms [Fig.5d].In contrast to pixel-level classification of worms, our approach outputs splines directly, and thus subsequent analysis is straightforward.For instance, one may directly study the worm undulations [Fig.5f] or extract the worm spline angle ψ = arctan(y(s, t) − y 0 (t), x(s, t) − x 0 (t)) to provide insight into the movement patterns and kinematics of the worm [Fig.5e].
One of the key advantages of our methods is its ability to collect a larger number of samples compared to traditional techniques, while still obtaining reliable results.As the standard error decreases with the number of samples, using our methods allows for metrics to be gathered with less uncertainty while still requiring the same experimental setup.For instance, Figure 5g) shows how the error of estimating the average speed of the center of mass of the nematodes decreases with density.This advantage can be extended to tracking large numbers of nematodes in crowded environments, such as extremely dense petri dishes where more than 6,000 concurrent tracks can be simultaneously computed [Fig.5h].Thus, with our method, we are able to collect a larger number of samples and obtain more precise and reliable results, even in challenging conditions.

Discussion
We have introduced a novel deep learning approach for detecting and tracking slender bodies, such as crawling nematodes, in microscopy data.The presented convolutional neural network architecture is capable of accurately detecting a large number of overlapping organisms, a task that can be particularly challenging for standard methods such as bounding boxes and pixel-level classifiers due to the issue of occlusion and overlap.To address this, we have implemented a latent space encoding which allows us to filter by non-maximum suppression and effectively handle overlapping objects.Not only is our method capable of accurately detecting and tracking slender bodies, but it also demonstrates strong scalability, performing well across a range of input frame sizes and densities of bodies.This makes it an ideal tool for a variety of experimental settings where splines are useful descriptors, including studies of crawling nematodes, swimming spermatozoa and beating eukaryotic or prokaryotic flagella.
Besides a suitable detector model, labelled training data is also needed.We have demonstrated that relying on a physics-based model to generate synthetic data is adequate to train our network to perform well on real data.This is a key achievement as it means that applications of our system for different experimental studies do not require large datasets to be procured, but rather the implementation of a suitable simulation.Our approach for synthetic data generation relies on over-sampling the behavior of the worms.This is naturally a trade-off as too extreme behavior can lead to datasets that are too hard for the neural network to replicate.For our model, we found that we slightly undersampled certain worm shapes such as strong coiling, which the model therefore could struggle with identifying.Though we did not look into this here, an interesting avenue for future research would be to bootstrap synthetic motility models on small datasets of real organisms.In a similar fashion, the frame-generator procedure should oversample the textures, pixel intensities and noise of real videos.Here, it could be interesting to study whether style transfer [15] or diffusion models [97] could be used to further reduce the gap between training and inference data.
For tracking, we introduced a directed metric that employs past and future spline predictions to link them across time.At very high densities this may still fail, in particular, because the directed metric yields little advantage if predictions are missing in some frames.A potential way to improve on this could come from utilizing the latent space encoding as well.This would require temporal continuity in the latent space representation, which is achievable by modifying the associated loss function.This should enhance the integrity of tracking, as it could potentially be used to resolve issues such as switches by leveraging the separation of close physical predictions with different temporal behaviour that characterises the latent encoding.We believe that these suggestions might be fruitful avenues for further research for improving deep learning models for dense detection of splines.
In this paper, we have proposed a new approach for fast and precise detection and tracking of slender bodies in microscopy data.Its speed and accurate performance across a range of densities and sizes, combined with the ability to handle overlapping objects, make it a valuable tool for a variety of experimental settings where precise tracking is essential for obtaining quantitative metrics.

Methods
Convolutional neural network Most of the weights of the network are at the feature detection convolutional network whose backbone is made of four ResNet groups consisting of 2, 4, 4, 2 blocks with strides 1, 2, 1, 2, respectively.We modify the original ResNet architecture by replacing the initial max-pooling layer with an average-pool layer to avoid translational invariance.The final shape of the feature space is [H/16, W/16, C], with C being the number of candidates each cell proposes.We have set C = 8 for this project in order to fulfil the condition of M N even at high densities.All in all, there will always be C candidates per cell regardless of input size, which leads to a large number of candidates being sorted in the filtering process.The head of the convolutional neural network is composed of two fully connected layers of 512 and C • (3(m + 2) + 1) cells, respectively, with batch normalization in between.Due to the orientation invariance of the loss function on the spline predictions, it is possible that the splines in the predicted set x − , x, x + are not aligned.To remedy this, we aligned them by comparing them with the eigenvalues of the flipped spline.In order to get the flipped eigenvalues λ f , we use where A is the PCA transformation matrix and J is the exchange matrix.

Latent space encoder
The encoder q φ is composed of two fully connected layers with batch normalization inbetween.The input of the encoder is the vector of size 3(m + 2) characterizing the splines predictions and the output is D floating point values, corresponding to the coordinates of p in the D-dimensional latent space.We have found D = 8 to be a well-performing dimension in our experiments.Due to the orientation invariance of the splines predictions, we need to construct the encoder to cluster those splines regardless of orientations as well.To do so, the input values are expanded to include those of the flipped splines λ → (λ, λ f ) and both are fed to the same layer.To ensure symmetry, the output is then summed before passing through the last layer.In doing so, the encoder becomes independent of spline orientation.
Input clips pre-processing The images used to train the model have dark (small pixel intensity) backgrounds, as we employ zero-padded convolutional layers.This is relevant for real recordings, where a negative flip may be necessary to match the network requirements.During training, generated clips are normalized using a 1-99 percentile normalization.For real clips, we have found that accuracy is improved if we apply CLAHE (adaptive histogram equalization) before prediction.Likewise, a simple intensity correction factor µ may need to be applied to the videos in order to match the pixel profile of the simulated data.For our dataset, we use correction factors of µ ≈ 1.2, to get the best results.Note that we match real data to the synthetic as this avoids the need to retrain the network for different experimental setups.
Loss functions Spline descriptors are trained as a regression problem.Thus, the loss contribution is given by the custom distance defined in Eq. ( 1).To enforce specialization on the predictors, and due to the number of predictions M being considerably larger than the number of bodies N , only the best predictors are accounted for in the loss.Nevertheless, there may be labels x completely or partially outside the frame at t c , despite being inside at t 0 .To make sure not to punish bad predictions at the boundaries for not matching invisible splines, instead of using the number of simulated bodies N , the subset of bodies completely inside the frame N v is used and the final loss expression is given by: The score L2 loss is computed as the difference between the values predicted and the score the spline proposals should have.Thus, using Eq. ( 2), we train the predicted score of all predictions using: Finally, the loss function for the latent space encoder is a modified cross entropy loss scaled by the product of scores.Denote P i,j = P(i ↔ j) as defined in Eq. ( 3), then the encoder loss is defined as an average over all pairs of predictions i, j that are physically within the cutoff σ l , where S = ŝi ŝj , and t i,j indicates whether i and j are targeting the same label k, and is set by with k i , k j being the closest labels to the predictions z i , z j respectively.
Training details Training has been done from scratch, i.e. without the use of a pretrained backbone.During training, the frame size for the input clips used was 256×256, but due to the anchored approach, this does not constrain inference to happen at the same resolution.Synthetic input is generated on demand and on device rather than using a fixed pre-generated dataset.Thus, the network never sees the same frame twice and there is no hostto-device data transfer.As mentioned in the main text, all networks are trained simultaneously, despite the weights of each one depending on different cost functions.The code has been written in Jax using Haiku and training has been carried out on a cluster of 8 × NVIDIA A5000's.
Inference Inference happens at any resolution whose dimensions are multiple of 16.The input frames need to be slightly pre-processed as described in the previous sections.Candidate predictions are chosen using a score threshold, and non-maximum suppression in latent space is used for filtering.Due to the sequential nature of the filtering process, the implementation is written to use the CPU using numba.
Worm simulation Worm trajectories are computed by employing a resistive force theory crawling model used to predict rigid body motions of C. elegans from the undulations [95].Thus, we ensure that from a given set of generated undulations, the produced motions will match those of real worms.From empirical observations, we propose a simple equation (Eq.( 11)) to generate the undulation of the worms.We define the motions by the spline angle ψ k (s) with s ∈ [0, 1] [Fig.2d], and decompose this into a linear combination: This logically separates the worm undulations into two types of motion: one corresponding to a sinusoidal motion ψ s and one in which the whole body bends ψ u .These we define by where Ã = 1 2 (1 + | sin (2πt) |) A and the rest of parameters are sampled from random distributions.Although many improvements for the above equations can be suggested, we prefer to keep the model simple.
Once the values of the parameters for ψ are generated all for the timesteps of the simulation, the positional coordinates are obtained using x(s, t) = L s 0 cos (ψ(s , t) + γ) sin (ψ(s , t) + γ) ds (14) where γ is a random orientation and L is the length of the worm (also sampled).Once the skeleton is defined, the rigid body motions are predicted by solving [95] where the force f can be calculated from the spline velocity U = ∂ t x + V + Ω × ( x − x CoM ) by Here, V and Ω are the center-of-mass velocity and rotational velocity (that we are solving for), and α t and α n = α α t is the tangential and normal drag coefficients, which is also sampled for (α > 1).We did not find a need for using a non-linear force theory.The simulation is run with Python 3.9 using the Jax library.
Video synthesis Given the labels for the splines positions, synthetic videos are generated to be used as input during training.In order to add width to each worm, we vary the local body radius r by a function of the form r(s) = R |sin(arccos(as The pixel values of those circles are calculated with anti-aliasing.Once the worms have been rendered, noise artefacts such as uneven background, blurring, Gaussian noise, etc. are added to replicate the observed conditions of real experiments.During training, standard augmentation techniques are applied as well.In the same manner as the simulation of the motion and the neural network training, frame generation is also written in Python using the Jax library in order to leverage GPU capabilities.
Experimental dataset Videos of crawling C. elegans were filmed using the protocol described in Ref. [40].
Manually annotated dataset The evaluation dataset is annotated using a custom tool that can be found at https://github.com/kirkegaardlab/deeptanglelabel.Around ∼ 1,500 splines have been annotated and this dataset (videos and labels) is included in the SI.
Asymmetric dynamic time-warped error distance In order to evaluate the manually labelled dataset, we introduce an error distance that compares the similarity between two curves by calculating the distance between each point on one curve and the nearest segment on the other.The error distance used is a variation of the dynamic time warping distance, which is widely used for comparing time series data.We note that, just as is the case for the dynamic time warping distance, this is not a true distance in the mathematical sense.Tracking implementation Tracking is done by sequentially predicting individual frames.For better performance, batching of frames allows for parallel detections and can drastically reduce execution time.Nevertheless, due to the requirement of including surrounding frames for each detection, a considerable increase in memory usage is observed.Once a collection of spline detections is obtained, each prediction is adapted in order to make it work with the TrackPy Python library.Due to the peculiarity of our distance metric, we implement a custom neighbor strategy (see Code Availability) that avoids the assumption of a symmetric distance function.
It may happen that some detection artefacts appear during the sequential detection performed during tracking.We have implemented a quick check on the resulting tracks to make sure not to have stubs, and fix the obvious branching of tracks due to these artefacts.A slight increase in integrity is observed on dense clips.
Tracking integrity Given a true label of a track of length N , we associate to this track at each time point i a prediction identity I i .We may then define the integrity of the track as ι = 1/N 2 N i=1 N j=1 [I i = I j ].For instance, if a label is given identities I = [1, 1, 1, 5, 5, 5, 3, 3, 3] during the track, i.e. there have been two identity swaps, we find ι = 1  3 , which has the interpretation that the track was correct for a third of the time.This measure will in general scale like ι ∼ N −1 , as longer tracks will have a higher likelihood of identity swaps.

Figure 1 :
Figure 1: Microscopy images of different microorganisms whose slender structure and frequent overlaps makes them hard to detect using classical approaches.a. C. elegans motility experiment from the dataset of this paper.b.Motile, flexuous, thin, spiral-shaped B. pilosicoli bacteria.Still from Ref. [53].c.Beating flagella of the green alga C. reinhardtii, provided by Kirsty Wan, University of Exeter.d.Swimming human spermatozoa.From dataset in Ref. [56].

Figure 2 :
Figure 2: (a) Structure of the detection method.Trainable neural networks are colored in gray and represent the convolutional neural network f (I; θ) and the latent space encoder q(λ, x 0 ; φ).(b) Procedure to prune unfiltered predictions to final detections with the use of the encoded latent space vectors.(c) Method overview from the input clip I (we use a stack of 11 frames in this work) to the final matrix of splines x.The target frames [I − , I, I + ] (center frames from the clip, orange) are explicitly shown for both the synthetic and real videos.Additionally, the training setup is represented using lighter color arrows; from synthetic data to loss backpropagation.After detection, direct visualization of the predicted splines x is possible.(d) Diagram with a spline descriptor composed of k equidistant points along the skeleton of the nematode.(e) Visual representation of the two distances used in Eq. (1), the minimum of which corresponds to correct head-tail alignment and is the one that will be used in the model.

Figure 3 :
Figure 3: Showcase of the capabilities of the method.(a) Detected splines predicted on an entire densely populated well plate with a single forward pass through the neural network.Inset shows a zoom-in section to demonstrate the accuracy of detection across the entire plate (except near borders, where the plate interferes).The total plate contains around 6,000 splines.(b) Close-up evaluation of different experimental clips with different densities of worms.

Figure 4 :
Figure 4: (a) Average number of overlaps counted on frames of pixel size 512 × 512 with different densities of worms (N = 90).(b) Illustration of the asymmetric dynamic time warping distance error corresponding to the average value of the orange euclidean distances between the prediction (green) and the labelled points (white).(c) Example frame with manually labelled points (white) and models predictions (colored).The metric is only evaluated in the lighter area of size 100 × 100.(d) Quantified accuracy of the detections by showing the distance to the manually labelled splines.Distributions for different densities are shown.The violin plots represent the 99 percentile of the data whereas outliers are plotted individually.(e-f ) Rates for True Positive and False Negative on the manually annotated dataset.(g-i) Performance of the model with different combinations of score (τ s ) and overlap (τ o ) thresholds.N = 1,420.

Figure 5 :
Figure 5: (a) Illustration of the directed distance used to assign the same identity to consecutive detections.The simplified drawing shows two independent predictions at adjacent frames and showcases how the assignment scheme computes the identity by comparing future-present and past-present distances and choosing the assignment that minimizes their sum.(b) Diagram showcasing how using a location cutoff simplifies the assignment problem.Nodes represent independent detections at each frame whereas edge values are given by the directed distance measure.The assignment happens by minimizing the sum of edges at each timestep.(c) Comparison of using the straightforward spline distance and the proposed directed approach.The accuracy is evaluated by measuring the integrity of the tracks.In contrast to other metrics in this paper, this plot has been obtained using synthetic worms as longterm, accurate tracks are required to evaluate the tracking integrity (See Methods for details on Tracking integrity).(d) Qualitative example of 30 s trajectories of the center of mass of the nematodes in a dense experiment.The still background image represents the last frame of the video.To improve the visualization, a small subset of the trajectories is shown.In contrast, a corner of the frame is used to display all the trajectories to showcase the density of simultaneous tracks.(e) Two samples of the spline angle ψ of two randomly sampled nematodes from (d). (f ) Undulations corresponding to 30 s of the detections relative to the center of mass coordinate of nine randomly sampled nematodes from (d).(g) Standard error value of the measurements of the center of mass speed as a function of density.(h) Showcase the possible throughput of the method, by simultaneously tracking more than 6,000 tracks from a full dense plate.A small window on the tracks is shown to showcase their continuity.

Algorithm 1 :
Algorithm for asymmetric dynamic time warping Data: Label curve defined by N points {p i }, and prediction curve defined by M line segments {s j }.Result: The asymmetric dynamically time-warped distance from label to prediction.Initialize matrices C, D with size [N, M ].for i = 1 to N do for j = 1 to M do D i,j ← distance from point to segment(p i , s j ) C 1,1 ← D 1,1 for i = 2 to N do C i,1 = C i−1,1 + D i,1 for j = 2 to M do C 1,j = min (C 1,j−1 , D 1,j−1 ) for i = 2 to N do for j = 2 to M do C i,j = min (C i,j−1 , C i−1,j + D i,j ) return C N,M /N