Robust mouse tracking in complex environments using neural networks

The ability to track animals accurately is critical for behavioral experiments. For video-based assays, this is often accomplished by manipulating environmental conditions to increase the contrast between the animal and the background, thereby achieving proper foreground/background detection (segmentation). However, modifying environmental conditions to make experiments scalable comes at the cost of ethological relevance. The biobehavioral research community needs methods to monitor behaviors over long periods of time, under dynamic environmental conditions, and in animals that are genetically and behaviorally heterogeneous. To address this need, we applied a state-of-the-art neural network-based tracker for single mice. We compare three different neural network architectures across visually diverse mice and environmental conditions. We find that an encoder-decoder segmentation neural network achieves high accuracy and speed with minimal training data. Furthermore, we provide a labeling interface, labeled training data, tuned hyperparameters, and a pretrained network for the behavior and neuroscience communities.


Supplementary Note 1 Fitting an Ellipse to a Mask
We used the same ellipse-fit algorithm described in supplemental section 4.4.2 of the Ctrax paper [1]. While that paper uses a weighted sample mean and variance for these calculations, the segmentation neural network is already robust to the situations that the weighting was designed to improve. Accordingly, we observe no difference between using weighted and unweighted sample means and variances.
Given a segmentation mask, the sample mean of pixel locations is calculated to represent the center position.
Similarly, the sample covariance matrix of pixel locations determines the major axis length (a), minor axis length (b), and angle (θ).
To obtain the axis lengths and angle, an eigenvalue decomposition of this 2×2 covariance matrix is solved: the eigenvalues give the axis lengths, and the eigenvector associated with the larger eigenvalue gives the angle.
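For concreteness, a minimal sketch of this procedure in Python/NumPy is shown below, assuming the mask is a 2D boolean array. The function name is illustrative rather than taken from the released code, and the axis convention (semi-axis length equal to twice the standard deviation along each principal direction, which is exact for a uniformly filled ellipse) is one common choice.

```python
import numpy as np

def fit_ellipse(mask):
    """Fit an ellipse to a binary segmentation mask.

    Returns (center_x, center_y, a, b, theta), where a and b are the
    semi-major and semi-minor axis lengths (pixels) and theta is the
    orientation of the major axis in radians.
    """
    ys, xs = np.nonzero(mask)                  # row/column indices of mouse pixels
    center_x, center_y = xs.mean(), ys.mean()  # sample mean -> center position

    # 2x2 sample covariance of pixel locations; its eigendecomposition
    # gives the principal axes of the mask.
    cov = np.cov(np.stack([xs, ys]))
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order

    # For a uniformly filled ellipse, the variance along a principal axis
    # is (semi-axis length)^2 / 4, so each semi-axis is 2 * sqrt(eigenvalue).
    b, a = 2.0 * np.sqrt(eigvals)              # semi-minor, semi-major
    vx, vy = eigvecs[:, 1]                     # eigenvector of the larger eigenvalue
    theta = np.arctan2(vy, vx)                 # angle of the major axis

    return center_x, center_y, a, b, theta
```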

Supplementary Note 2 Annotated Datasets
We created three annotated datasets for training neural networks, each example consisting of a reference frame (input), a segmentation mask, and an ellipse fit. Each dataset was generated to track mice in a different environment. An additional model was trained on all annotated examples for comparison. The exact number of frames in each dataset split, as well as model performance, can be found in Supplementary Tables 3 and 4.
The first annotated dataset uses images sampled from our standard open field arena video experiment and contains 16,802 annotated frames, randomly split into a training set of 16,234 frames and a validation set of 568 frames. The first 16,000 annotated frames were selected at random from 65 separate videos, each acquired from one of 24 testing arenas. We trained a model and found a small fraction of tracking issues when applying it to the 1,845 strain survey videos (0.007% of frames). We define tracking issues as follows: no mouse is identified in the arena (eq 5), or the mouse becomes much larger than its median size within an individual video (eq 6).
An additional 802 frames across 50 new videos in which the model performed poorly were identified, correctly annotated, and incorporated into the annotated dataset. The addition of these frames corrected the remaining 0.007% of frames in the strain survey.
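As an illustration of the two tracking-issue criteria, the check can be sketched as a per-video pass over the predicted mask areas. This is a hedged sketch: the function name is hypothetical and the size-ratio threshold is a placeholder, not the value defined by eq 5 and eq 6.

```python
import numpy as np

def find_tracking_issues(mask_areas, size_ratio=1.5):
    """Flag suspect frames in one video from per-frame mask areas (pixels).

    Returns a boolean array marking frames with either tracking issue:
    no mouse identified, or a mouse much larger than the video's median size.
    """
    areas = np.asarray(mask_areas, dtype=float)
    no_mouse = areas == 0                         # criterion 1: empty mask
    median_area = np.median(areas[~no_mouse])     # typical mouse size in this video
    too_large = areas > size_ratio * median_area  # criterion 2: placeholder threshold
    return no_mouse | too_large
```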
The second annotated dataset uses images sampled from our 24-hour experiment, which uses the standard open field arena with ALPHA-dri bedding and a food cup under two distinct lighting conditions (visible illumination during the day and infrared illumination at night). For this environment, we annotated a total of 2,192 frames across six videos, each four days in duration. Of these, 916 frames were taken under night illumination and 1,276 under day illumination.
The third annotated dataset uses images sampled from the Accuscan Versamax Activity Monitoring Cages used in the KOMP2 experiment. The dataset for this environment comprises 1,083 annotated frames, each sampled from a different video (one labeled frame per video) across 8 different arenas.

Center Hypotenuse Prediction Error
We apply a log10 transformation to the data from independent images (n samples) to achieve an approximately normal distribution. For mean comparisons, we use a paired t-test. For variance comparisons, we use a paired F-test.
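For illustration, the comparison can be sketched with NumPy and SciPy as below. The function and its inputs are hypothetical; the variance comparison is shown as a simple variance-ratio F-test, whereas a paired variant (e.g., the Pitman-Morgan test) would additionally account for the pairing.

```python
import numpy as np
from scipy import stats

def compare_center_errors(errors_a, errors_b):
    """Compare paired per-image center prediction errors of two models."""
    # log10 transform to make the error distributions approximately normal
    log_a = np.log10(errors_a)
    log_b = np.log10(errors_b)

    # Paired t-test for a difference in means of the transformed errors.
    t_stat, t_pval = stats.ttest_rel(log_a, log_b)

    # Two-sided F-test on the variance ratio; under the null of equal
    # variances the statistic follows F(n - 1, n - 1).
    n = len(log_a)
    f_stat = np.var(log_a, ddof=1) / np.var(log_b, ddof=1)
    tail = min(stats.f.cdf(f_stat, n - 1, n - 1),
               stats.f.sf(f_stat, n - 1, n - 1))
    f_pval = 2.0 * tail

    return t_stat, t_pval, f_stat, f_pval
```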