Abstract
Keypoint tracking algorithms can flexibly quantify animal movement from videos obtained in a wide variety of settings. However, it remains unclear how to parse continuous keypoint data into discrete actions. This challenge is particularly acute because keypoint data are susceptible to high-frequency jitter that clustering algorithms can mistake for transitions between actions. Here we present keypoint-MoSeq, a machine learning-based platform for identifying behavioral modules (‘syllables’) from keypoint data without human supervision. Keypoint-MoSeq uses a generative model to distinguish keypoint noise from behavior, enabling it to identify syllables whose boundaries correspond to natural sub-second discontinuities in pose dynamics. Keypoint-MoSeq outperforms commonly used alternative clustering methods at identifying these transitions, at capturing correlations between neural activity and behavior and at classifying either solitary or social behaviors in accordance with human annotations. Keypoint-MoSeq also works in multiple species and generalizes beyond the syllable timescale, identifying fast sniff-aligned movements in mice and a spectrum of oscillatory behaviors in fruit flies. Keypoint-MoSeq, therefore, renders accessible the modular structure of behavior through standard video recordings.
Main
Work from ethology demonstrates that behavior—a chain of actions traced by the body’s movement over time—is both continuous and discrete^{1,2,3}. The rapid advance of keypoint tracking methods (including SLEAP^{4}, DeepLabCut^{5} and others^{6,7}) has given researchers broad access to the continuous dynamics that underlie animal behavior^{8}. But parsing these dynamics into chains of discrete actions remains an open problem^{9,10,11}. While several action segmentation approaches exist^{12,13,14,15,16,17}, their underlying logic and assumptions differ, with different methods often giving distinct descriptions of the same behavior^{13,15}. An important gap, therefore, exists between our access to movement kinematics and our ability to understand their underlying structure.
One method for parsing behavior in mice is Motion Sequencing (MoSeq)^{16,18,19,20,21}. MoSeq uses unsupervised machine learning to transform its inputs—which are not keypoints, but three-dimensional (3D) depth videos—into a set of behavioral motifs (like rears, turns and pauses) called syllables. To identify syllables, MoSeq searches for discontinuities in behavioral data at a timescale that is set by the user; this timescale is specified through a ‘stickiness’ hyperparameter that influences the frequency with which syllables can transition. In the mouse, where MoSeq has been extensively applied, pervasive discontinuities at the sub-second-to-second timescale mark boundaries between syllables, and the stickiness hyperparameter is explicitly set to capture this timescale^{16}.
Previous studies have applied MoSeq to characterize the effects of genetic mutations, drugs, neural manipulations and changes in the sensory or physical environment^{16,22,23,24}. MoSeq syllables are encoded in the dorsolateral striatum (DLS)—an area important for action selection—and can be individually reinforced through closed-loop dopamine stimulation^{22,23}, arguing that MoSeq-identified syllables are meaningful units of behavior used by the brain to organize action sequences. But MoSeq’s reliance on depth cameras is a substantial constraint; depth cameras are difficult to deploy, suffer from high sensitivity to reflections and have limited temporal resolution^{25}. In principle, these limits could be overcome by applying MoSeq to keypoint data. But attempts to do so have thus far failed: researchers applying MoSeq-like models to keypoint data have reported flickering state sequences that switch much faster than the animal’s actual behavior^{13}.
Here we confirm this finding and trace its cause to jitter in the keypoint estimates, which is mistaken by MoSeq for behavioral transitions. To address this challenge, we reformulated the model underlying MoSeq to simultaneously infer correct pose dynamics (from noisy or even missing data) and the behavioral syllables they represent. We validate this model, called keypoint-MoSeq, using accelerometry measurements, neural activity recordings and supervised behavior labels from expert observers, and show that it generalizes beyond mouse syllables to capture behaviors at multiple timescales and in several species. Because keypoint tracking can be applied in diverse settings (including natural environments), requires no specialized hardware and affords direct control over which body parts to track and at what resolution, we anticipate that keypoint-MoSeq will serve as a general tool for parsing the structure of behavior. To facilitate broad adoption, we have directly integrated keypoint-MoSeq with widely used tracking methods (including SLEAP and DeepLabCut) and made the code freely accessible for academic users at http://www.moseq4all.org/.
Results
Mouse syllables are evident in depth-based video recordings as discontinuities of movement that reoccur with sub-second cadence^{16}. To test if the same sub-second structure is present in keypoint data, we recorded conventional videos of mice exploring an open field arena and used a neural network to track eight keypoints (two ears and six points along the dorsal midline). We also captured simultaneous depth videos for comparison to depth-based MoSeq (Fig. 1a).
Similar sub-second discontinuities appeared in both the depth and keypoint data, with a keypoint-based change score (total velocity of keypoints after egocentric alignment) spiking at the transitions between depth-based MoSeq syllables (Fig. 1b). Yet when we applied MoSeq directly to the keypoint data, it failed to recognize these discontinuities as syllable transitions, instead generating implausibly brief syllables that aligned poorly with the keypoint change score (Fig. 1b,c). These observations are consistent with prior work showing that MoSeq underperforms alternative clustering methods when applied to keypoints^{13,26}.
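A change score of this kind (total velocity of keypoints after egocentric alignment) can be computed in a few lines. The sketch below is illustrative rather than the paper's exact preprocessing; in particular, the alignment convention (centroid centering plus rotation of an assumed tail-to-nose axis onto +x) is an assumption.

```python
import numpy as np

def egocentric_align(keypoints):
    """Center each frame on its centroid, then rotate so an assumed
    tail-to-nose axis points along +x (keypoints: frames x parts x 2)."""
    centered = keypoints - keypoints.mean(axis=1, keepdims=True)
    heading = centered[:, 0] - centered[:, -1]   # part 0 = nose, part -1 = tail (assumed)
    angle = np.arctan2(heading[:, 1], heading[:, 0])
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.stack([np.stack([c, -s], -1), np.stack([s, c], -1)], -2)
    return np.einsum('tij,tkj->tki', rot, centered)

def change_score(keypoints):
    """Total per-frame velocity of the egocentrically aligned keypoints."""
    aligned = egocentric_align(keypoints)
    velocity = np.diff(aligned, axis=0)
    return np.linalg.norm(velocity, axis=-1).sum(axis=-1)
```

Because alignment removes translation and rotation, the score spikes only when the animal's posture itself changes, which is why it peaks at syllable boundaries.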
We wondered whether this poor performance could be explained by noise in the keypoint data, which might introduce subtle discontinuities that are falsely recognized by MoSeq as behavioral transitions. In our data, this noise took the form of high-frequency jitter that reflected errors in body part detection or rapid jumps in the inferred location of a stationary body part (Fig. 1d,e, Extended Data Fig. 1a,b and Supplementary Video 1). Much of the jitter—which was pervasive across camera angles and tracking methods—seemed to reflect inherent ambiguity in the true location of a keypoint, as frame-to-frame fluctuations in detected keypoint position had a similar scale as the variability in human labeling (Fig. 1f and Extended Data Fig. 1b–e). We confirmed that the jitter was unrelated to true movement by tracking the same body part using multiple cameras; although overall movement trajectories were almost identical across cameras, high-frequency fluctuations around those trajectories were uncorrelated, suggesting that the fluctuations are a tracking artifact (Extended Data Fig. 1f,g).
Consistent with the possibility that keypoint noise dominates MoSeq’s view of behavior, syllable transitions derived from keypoints—but not depth—frequently overlapped with jitter and low-confidence estimates of keypoint position (Fig. 1b,g). We were unable to correct this defect through simple smoothing: application of a low-pass filter—while removing jitter—also blurred true transitions, preventing MoSeq from identifying syllable boundaries (Extended Data Fig. 1h). Median filtering and Gaussian smoothing also yielded no improvement. These data reveal that high-frequency tracking noise prevents MoSeq from accurately segmenting behavior.
Hierarchical modeling decouples noise from behavior
Keypoint jitter contaminates MoSeq syllables because MoSeq assumes that each keypoint is a faithful representation of a point on the animal, and thus cannot distinguish noise from real behavior. To address this issue, we rebuilt MoSeq as a switching linear dynamical system (SLDS)—a class of model that explicitly disentangles signal from noise in time-series data^{27,28}. This model—called ‘keypoint-MoSeq’—has three hierarchical levels (Fig. 2a): a discrete state sequence that governs trajectories in a low-dimensional pose space, which then combines with location and heading information to yield actual keypoint coordinates. When fit to data, keypoint-MoSeq estimates for each frame the animal’s location and pose, the noise in each keypoint^{29} and the identity of the current behavioral syllable (Fig. 2a). Because of its structure, when a single keypoint implausibly jumps from one location to another, the model can attribute this sudden displacement to noise and preserve a smooth pose trajectory; if all the keypoints suddenly rotate within the egocentric reference frame, the model can adjust the inferred heading for that frame and restore a plausible sequence of coordinates (Fig. 2b).
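The three-level generative structure can be illustrated with a toy simulation. The dimensions, dynamics matrices and noise scales below are arbitrary stand-ins rather than keypoint-MoSeq's actual parameterization, and location and heading are omitted for brevity: a sticky discrete syllable sequence drives syllable-specific autoregressive dynamics in a low-dimensional pose space, which project to noisy keypoint observations.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 500, 4, 3   # frames, pose-space dimensions, syllables (toy sizes)

# Level 1: a 'sticky' discrete syllable sequence.
stay = 0.95
P = np.full((K, K), (1 - stay) / (K - 1))
np.fill_diagonal(P, stay)
z = np.zeros(T, dtype=int)
for t in range(1, T):
    z[t] = rng.choice(K, p=P[z[t - 1]])

# Level 2: syllable-specific autoregressive dynamics in pose space.
A = [0.95 * np.eye(D) + rng.normal(0, 0.02, (D, D)) for _ in range(K)]
x = np.zeros((T, D))
for t in range(1, T):
    x[t] = A[z[t]] @ x[t - 1] + rng.normal(0, 0.1, D)

# Level 3: noisy keypoint observations of the latent pose.
C = rng.normal(0, 1, (16, D))   # pose -> 8 keypoints x 2 coordinates
y = x @ C.T + rng.normal(0, 0.5, (T, 16))
```

Inference inverts this process: given only the noisy observations `y`, the model jointly estimates the smooth pose trajectory `x` and the syllable sequence `z`, which is what lets it attribute an isolated keypoint jump to observation noise rather than a behavioral transition.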
Unlike traditional MoSeq, keypoint-MoSeq homed in on behavioral syllables rather than noise in the keypoint data, yielding syllable transitions that overlapped more strongly with changepoints in pose, correlated better with syllable transitions from depth MoSeq and clustered less around low-confidence neural network detections (Fig. 2c). Keypoint-MoSeq also outperformed traditional MoSeq when the latter was applied to filtered keypoint data, or to keypoints inferred with a pose estimation method (Lightning Pose) that includes a jitter penalty in its training objective (Extended Data Fig. 2a,b). Furthermore, when we simulated missing data by ablating subsets of keypoints within random (0–3 s) intervals, keypoint-MoSeq was better able to preserve syllable labels and boundaries than traditional MoSeq (Extended Data Fig. 2c–f). From a modeling perspective, the output of keypoint-MoSeq was sensible: cross-likelihood analysis revealed that keypoint-based syllables were mathematically distinct trajectories in pose space, and submitting synthetic keypoint data that lacked any underlying block structure to keypoint-MoSeq resulted in models that failed to identify distinct syllables (Extended Data Fig. 2g,h).
Because keypoint-MoSeq produces slightly different syllable segmentations when run multiple times with different random seeds, we developed a likelihood-based metric that allows post hoc ranking of model runs (Extended Data Fig. 3a–g); the metric tends to be lowest for outlier models and highest for those that are consensus-like, providing a rational basis for model selection (Extended Data Fig. 3h–k). The metric revealed that 500 fitting iterations (~30 min of compute time on a GPU for ~5 h of data) are sufficient to achieve a good model fit with our open field dataset. Rather than choosing a single best model, users can also estimate an approximate probability distribution over syllable labels, although full Bayesian convergence remains impractical (Extended Data Fig. 3l).
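The spirit of this ranking can be sketched with a hypothetical cross-scoring matrix: each run's syllable sequence is scored under every other run's fitted model, and runs that most other models explain well rank as consensus-like. The function below is an illustration of that idea, not the paper's exact metric.

```python
import numpy as np

def rank_model_runs(cross_loglikes):
    """cross_loglikes[i, j]: log-likelihood of run j's syllable sequence
    scored under run i's fitted parameters (hypothetical matrix).
    Runs whose output is well explained by most peers rank first."""
    scores = cross_loglikes.astype(float).copy()
    np.fill_diagonal(scores, np.nan)            # ignore self-scores
    consensus = np.nanmedian(scores, axis=0)    # typical score from peers
    return np.argsort(consensus)[::-1]          # best run first
```

Under this scheme an outlier run, whose segmentation no other model can explain, receives a low peer score and falls to the bottom of the ranking.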
In our open field data, keypoint-MoSeq identified 25 syllables that were easily distinguishable to human observers (Extended Data Fig. 4a and Supplementary Videos 2 and 3). These included categories of behavior (for example, rearing, grooming and walking), and variations within categories (for example, turn angle, speed; Fig. 2e). Importantly, keypoint-MoSeq preserved access to the kinematic and morphological parameters that underlie each behavioral syllable (Extended Data Fig. 4b). Thus, keypoint-MoSeq can provide an interpretable segmentation of behavior from standard two-dimensional (2D) keypoint tracking data.
Keypoint-MoSeq is sensitive to behavioral transitions
To characterize keypoint-MoSeq, we related the discovered syllables to orthogonal measures of behavior and neural activity and compared them to the behavioral states identified by alternative behavior analysis methods. These alternatives, which include VAME, MotionMapper and B-SOiD, all work by transforming keypoint data into a feature space that reflects the local dynamics around each frame, and then clustering frames according to those features^{12,13,17,30}.
When applied to our open field data, behavioral states from VAME, B-SOiD and MotionMapper were usually brief (median duration 33–100 ms, compared to ~400 ms for keypoint-MoSeq) and their transitions aligned poorly with changepoints in keypoint data, suggesting diminished sensitivity to the natural breakpoints in mouse behavior (Fig. 3a–c). This observation was not parameter dependent: it remained true across a broad range of temporal windows (used by B-SOiD and MotionMapper) and after comprehensive scans over latent dimension, state number, clustering mode and preprocessing options (across all methods as applicable; Extended Data Fig. 5a).
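Median state durations like those quoted above follow directly from a frame-wise label sequence; the helper below is a minimal sketch that assumes a fixed frame rate.

```python
import numpy as np

def state_durations(labels, fps):
    """Durations (in seconds) of each run of identical labels in a
    frame-wise state sequence sampled at `fps` frames per second."""
    changes = np.flatnonzero(np.diff(labels)) + 1
    bounds = np.concatenate([[0], changes, [len(labels)]])
    return np.diff(bounds) / fps
```

Applying `np.median` to the result gives the summary statistic compared across methods here: fragmented segmentations yield many short runs and hence a small median duration.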
Rearing offers a clear example of the differing sensitivity of each method to temporal structure. B-SOiD and keypoint-MoSeq both learned a specific set of rear states, and each encoded the mouse’s height with comparable accuracy (Fig. 3d,e). Yet the rear states had different dynamics. Whereas keypoint-MoSeq typically detected two syllable transitions per rear (one entering the rear and one exiting), B-SOiD detected five to ten different transitions per rear, including switches between distinct rear states as well as flickering between rear and non-rear states (Fig. 3f and Extended Data Fig. 5b). Whereas mouse height increased at transitions into keypoint-MoSeq’s rear state and fell at transitions out of it, height tended to peak symmetrically at transitions into and out of B-SOiD’s rear states (Fig. 3g); this observation suggests that—at least in this example—B-SOiD does not effectively identify the boundaries between syllables, but instead fragments them throughout their execution.
We also evaluated each method using an orthogonal kinematic measurement: 3D head angle and acceleration from head-mounted inertial measurement units (IMUs; Fig. 3h). Behavioral transitions were identifiable in the IMU data as sudden changes in acceleration (quantified by jerk) and orientation (quantified by angular velocity). These measures tended to overlap with state transitions from keypoint-MoSeq, but less so (or not at all) with those from B-SOiD, MotionMapper and VAME (Fig. 3i). Furthermore, IMU-extracted behavioral features (like head pitch or acceleration) typically rose and fell symmetrically around transitions identified by B-SOiD, MotionMapper and VAME, while keypoint-MoSeq identified asymmetrical changes in these features (Fig. 3i and Extended Data Fig. 6a).
The fact that keypoint-MoSeq more clearly identifies behavioral boundaries does not necessarily mean that it is better at capturing the overall content of behavior. Indeed, coarse kinematic parameters were captured equally well by all four of the tested methods (Extended Data Fig. 5c). However, the fact that movement parameters—as measured by accelerometry—change suddenly at the onset of keypoint-MoSeq syllables, but not at the onset of B-SOiD, VAME or MotionMapper states, provides evidence that these methods afford fundamentally different views of temporal structure in behavior.
State transitions align with fluctuations in neural data
A core use case for unsupervised behavioral classification is to understand how the brain generates self-motivated behaviors outside a rigid task structure^{9}; in this setting, boundaries between behavioral states serve as surrogate timestamps for alignment of neural data. For example, we recently used depth MoSeq to show that dopamine fluctuations in DLS are temporally aligned to syllable transitions during spontaneous behavior^{22}. Here we asked whether the same result was apparent in keypoint-based segmentations of behavior (Fig. 4a).
Syllable-associated dopamine fluctuations (as captured by dLight photometry) were remarkably similar between depth MoSeq and keypoint-MoSeq, but much lower in amplitude (or nonexistent) when assessed using B-SOiD, VAME and MotionMapper (Fig. 4b and Extended Data Fig. 7a). We wondered if this apparent discrepancy in syllable-associated dopamine could be explained by differences in how each method represents the temporal structure of behavior. If, as we have shown, B-SOiD, VAME and MotionMapper can capture the content of behavior but not the timing of transitions, then average dopamine levels should vary consistently across their behavior states but lack clear dynamics (increases or decreases) at state onsets. Indeed, for all four methods, almost every state was associated with a consistent above-average or below-average dopamine level (Fig. 4c,d and Extended Data Fig. 7b), and yet dopamine dynamics varied widely. Whereas dopamine usually increased at the initiation of keypoint-MoSeq syllables, it was usually flat (having just reached a peak or nadir) at state onsets identified by alternative methods (Fig. 4c–e). Furthermore, aligning the dopamine signal to randomly sampled times throughout the execution of each behavioral state—rather than its onset—altered state-associated dopamine dynamics for keypoint-MoSeq, but made little difference for alternative methods (Fig. 4f and Extended Data Fig. 7c,d). These results suggest that keypoint-MoSeq syllable onsets are meaningful landmarks for neural data analysis, while state onsets identified by alternative methods are often functionally indistinguishable from random timepoints during a behavior.
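Event-triggered averaging of this kind, aligning a photometry trace to state onsets (or to randomly sampled control times), can be sketched as follows; the window size and variable names are illustrative.

```python
import numpy as np

def event_triggered_average(signal, event_frames, half_window):
    """Mean of a 1-D trace in a window centered on each event frame;
    events too close to the recording edges are dropped."""
    snippets = [signal[t - half_window:t + half_window]
                for t in event_frames
                if half_window <= t <= len(signal) - half_window]
    return np.asarray(snippets).mean(axis=0)
```

A trace that reliably ramps up after syllable onsets produces an asymmetric average around the event, whereas onsets that are functionally random timepoints yield a flat average.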
Keypoint-MoSeq generalizes across experimental setups and behaviors
Keypoint tracking is a powerful means of pose estimation because it generalizes widely across experimental setups. To test whether keypoint-MoSeq inherits this flexibility, we asked if it could quantify changes in behavior induced by environmental enrichment. Mice were recorded in either an empty arena or one that contained bedding, chew toys and a transparent shelter (Extended Data Fig. 8a). The enriched environment was too complex for traditional depth MoSeq but yielded easily to keypoint-based pose estimation. Based on these poses, keypoint-MoSeq identified 39 syllables, of which 21 varied between environments: syllables upregulated in the enriched environment tended to involve manipulation and orientation toward nearby affordances (for example, ‘investigation’, ‘stationary right turn’ and ‘stop and dig’), whereas those upregulated in the empty box were limited to locomotion and rearing (‘dart forward’ and ‘rear-up in corner’; Extended Data Fig. 8b,c). These results suggest that keypoint-MoSeq may be useful in a broad range of experimental contexts, including those whose cluttered structure precludes the effective use of depth cameras.
To test if keypoint-MoSeq can also generalize across laboratories—and to better understand the mapping between syllables and human-identified behaviors—we next analyzed a pair of published benchmark datasets^{31,32}. The first dataset included human annotations for four mouse behaviors in an open field (locomotion, rearing, face grooming and body grooming) and keypoint detections from the TopViewMouse model in the DLC Model Zoo^{33} (Fig. 5a–c). The second dataset (part of the CalMS21 benchmark^{32}) included a set of three manually annotated social behaviors (mounting, investigation and attack) as well as keypoints for a pair of interacting mice (Fig. 5d–f). Keypoint-MoSeq recovered syllables from both datasets whose average duration was ~400 ms, while, as before, B-SOiD, MotionMapper and VAME identified behavioral states that were much shorter (Extended Data Fig. 9a). Keypoint-MoSeq states also conformed more closely to human-identified behavioral states (Fig. 5c,f and Extended Data Fig. 9b). Although this advantage was modest overall, there were some important differences: in the CalMS21 dataset, for example, MotionMapper, B-SOiD and VAME only identified a single behavior consistently, with B-SOiD and VAME only capturing mounting and MotionMapper only capturing investigation in 100% of model fits; keypoint-MoSeq, in contrast, defined at least one state specific to each of the three behaviors in 100% of model fits (Extended Data Fig. 9c).
The above benchmark datasets differed widely in the number of keypoints tracked (7 for CalMS21 versus 21 for the TopViewMouse model; Fig. 5a,d), raising the question of how the pose representation fed to keypoint-MoSeq influences its outputs. One possibility—suggested by the higher syllable count for depth MoSeq (~50) compared to keypoint-MoSeq fit to 2D keypoints (~25)—is that higher-dimensional input data allow MoSeq to make finer distinctions between behaviors. To test this rigorously, we used multiple cameras to estimate keypoints in 3D (including six keypoints that were not visible in the overhead-camera 2D dataset) and confirmed that the 3D keypoints had higher intrinsic dimensionality than 2D keypoints (Fig. 5g and Extended Data Fig. 9d,e). Despite this difference in dimensionality, similar changepoints were evident in both datasets, and keypoint-MoSeq identified syllables with similarly timed transitions (Fig. 5h and Extended Data Fig. 9f).
There was a bigger change, however, in how behaviors were categorized. Keypoint-MoSeq made finer-grained behavior distinctions based on 3D data as compared to 2D data, especially for behaviors that varied in height (Fig. 5i–l and Supplementary Video 4). Turning, for example, was grouped as a single state based on the 2D keypoint data but partitioned into three states with different head positions based on the 3D keypoint data (nose to the ground versus nose in the air; Fig. 5j–l). Rearing was even more fractionated, with a single 2D syllable splitting six ways based on body angle and trajectory in the 3D keypoint data. Depth-based MoSeq fractionated these behaviors still further. This analysis suggests that higher-dimensional input data permit richer descriptions of behavior, but even relatively low-dimensional 2D keypoint data still capture the timing of behavioral transitions.
Keypoint-MoSeq parses behavior across species and timescales
To test if keypoint-MoSeq generalizes across rodent species, we analyzed previously published 3D motion capture data derived from rats. In this dataset, rats were adorned with reflective body piercings and recorded in a circular home cage arena with a lever and water spout for operant training (Fig. 5m; Rat7M dataset^{34}). As with mice, keypoint-MoSeq syllables aligned with changepoints in the keypoint data (Fig. 5n) and included a diversity of behaviors, including a syllable specific to lever pressing in the arena (Fig. 5o and Supplementary Video 5).
Mice combine postural movements, respiration and whisking to sense their environment. Recent work suggests that rodents coordinate these behaviors in time, generating rhythmic head movements that synchronize with the sniff cycle^{35,36}. One study, for example, used an autoregressive hidden Markov model (AR-HMM) to discover head-movement motifs that align to respiration and arise during olfactory navigation^{21}. Respiration, therefore, defines a fast timescale of mouse behavior that coexists with—but is distinct from—the ~400-ms timescale of behavioral syllables.
To test if keypoint-MoSeq can capture behavioral motifs at this faster timescale, we used 120-Hz cameras to track 3D keypoints of mice and measured respiration with an implanted thermistor^{37} (Fig. 6a). Consistent with prior work, we observed respiration-synchronized fluctuations in nose velocity, although synchrony was weak or absent in other parts of the body (Fig. 6b,c). We then fit keypoint-MoSeq models with a range of target timescales (~35 ms to ~300 ms; Extended Data Fig. 10a). Motifs were defined as ‘respiration coupled’ if they consistently aligned with transitions in respiration state (inhale-to-exhale or exhale-to-inhale; Fig. 6d,e). Although respiration coupling was evident across all models, its prominence peaked at shorter timescales (Extended Data Fig. 10a), especially when fit to a subset of anterior keypoints that emphasized neck and nose movements (Extended Data Fig. 10b). The best-synchronized motifs (from the full-body model) tended to coincide with exhalation and involved isolated movements in which the nose flutters down (Fig. 6e,f). These results suggest that keypoint-MoSeq can characterize fast, sniff-aligned movements in the mouse.
Given that keypoint-MoSeq can parse two different timescales of mouse behavior, we wondered if it could also segment fly behavior, which similarly occurs at multiple well-defined timescales. Flies tend to switch between distinct, oscillatory pose trajectories^{17}. These movements can be finely subdivided, as in the coordinated stance and swing phases of locomotion^{38}, or more coarsely segmented at the transitions between distinct oscillatory modes (for example, locomotion versus grooming), as they are by MotionMapper^{17}. To capture these distinct levels of organization, we fit keypoint-MoSeq to 2D keypoints from flies exploring a flat substrate^{17,39} (Extended Data Fig. 10c). The resulting behavioral motifs varied from tens to hundreds of milliseconds depending on keypoint-MoSeq’s target timescale. At longer timescales, keypoint-MoSeq identified recognizable behaviors such as locomotion, head grooming or left-wing grooming, similarly to the behaviors reported by MotionMapper (Fig. 6g, Supplementary Video 6 and Extended Data Fig. 10d–f).
At shorter timescales, keypoint-MoSeq divided these behaviors into their constituent phases. Fast locomotion, for example, was split between six phase-locked motifs that tiled the stride cycle (Fig. 6h). As target timescales grew longer, locomotion merged from six phases to two (corresponding to the alternating swings and stances of a canonical tripod gait) before eventually collapsing to a single motif that encompassed the full stride cycle (Fig. 6h–j and Extended Data Fig. 10g). This shift was evident in the power spectral density of keypoint-MoSeq’s output, which initially showed a prominent peak at ~12 Hz during fast locomotion (corresponding to the stride cycle) that gradually disappeared as keypoint-MoSeq’s target timescale was increased (Fig. 6k). The same hierarchy of timescales appeared for non-locomotion behaviors as well (Extended Data Fig. 10h). These results demonstrate that keypoint-MoSeq is useful as a tool for fly behavior analysis and suggest a principle for setting its target timescale that depends on whether researchers wish to subdivide the distinct phases of oscillatory behaviors.
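A stride-frequency analysis of this kind can be reproduced in miniature with a periodogram. The 150-Hz frame rate and synthetic 12-Hz leg-tip trace below are stand-ins for the real data, chosen only to show how the spectral peak is located.

```python
import numpy as np

fs = 150.0                         # assumed camera frame rate (Hz)
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
# Toy leg-tip trace: a 12-Hz stride oscillation buried in noise.
trace = np.sin(2 * np.pi * 12 * t) + 0.1 * rng.normal(size=t.size)

# Periodogram via the FFT; the stride cycle appears as a spectral peak.
power = np.abs(np.fft.rfft(trace - trace.mean())) ** 2
freqs = np.fft.rfftfreq(trace.size, 1 / fs)
mask = freqs > 2                   # ignore slow drift below 2 Hz
stride_hz = freqs[mask][np.argmax(power[mask])]
```

In the real analysis, the disappearance of this peak as the target timescale grows reflects motifs that span entire stride cycles rather than individual phases.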
Discussion
Syllables are broadly useful for understanding behavior^{16,22,23,24}, but their scope has been limited by the past requirement for depth data. Here we show that keypoint-MoSeq affords similar insight as depth-based MoSeq while benefiting from the generality of markerless keypoint tracking. Whereas depth MoSeq was limited to a narrow range of spatial scales and frame rates, keypoint-MoSeq can be applied to mammals and insects, parsing behaviors at the second or millisecond timescale. And because keypoint tracking is more robust to occlusion and environmental clutter, it is now possible to parse syllables amid environmental enrichment, in animals behaving alone or socially, with or without headgear and neural implants.
The core innovation enabling keypoint-MoSeq is a probabilistic model that effectively handles occlusions, tracking errors and high-frequency jitter. These noise sources are pervasive in pose tracking^{5,26}; because standard methods like SLEAP and DLC process each frame separately, keypoint coordinates tend to jump from frame to frame even when the subject’s pose has not discernibly changed. A newer generation of pose tracking methods, such as GIMBAL^{29}, Deep Graph Pose^{26} and Lightning Pose^{40}, correct for some of these errors; and two-step pipelines that build on these methods may be less prone to keypoint jitter. Here, we describe a different solution: combining noise correction and behavior segmentation in a single end-to-end model that leverages learned patterns of animal motion to infer the most plausible pose trajectory from noisy or missing data.
Keypoint-MoSeq is somewhat resilient to noise, but it will perform best with clean keypoint data that capture most parts of the body. Although directly modeling the raw pixel intensities of depth^{16} or 2D video^{41} provides the most detailed access to spontaneous behavior, technical challenges like reflections, occlusions and variation in perspective and illumination remain in those settings. The development of keypoint-MoSeq—together with advances in markerless pose tracking—should enable MoSeq to be used in a variety of these adversarial circumstances, such as when animals are obstructed from a single axis of view, when multiple animals are interacting simultaneously, when the environment changes dynamically and when animals wear elaborate headgear.
Compared to keypoint-MoSeq, the alternative methods for unsupervised behavior segmentation that we tested (B-SOiD^{12}, MotionMapper^{17} and VAME^{13}) tend to emit shorter behavior motifs that often start or stop in the middle of what humans might identify as a behavioral module or motif (for example, a rear). Our analysis suggests two possible reasons for this difference. First, unlike alternative methods, MoSeq can discretize behavior at a particular user-defined timescale and, therefore, is better able to identify clear boundaries between behavioral elements that respect the natural rhythmicity in movements associated with syllables, sniffs or steps. The resulting parsimony prevents over-fractionation of individual behaviors. Second, the hierarchical structure of keypoint-MoSeq’s underlying generative model means it can detect noise in keypoint trajectories and distinguish this noise from actual behavior without smoothing away meaningful behavioral transitions.
That said, we stress that there is no one best approach for behavioral analysis, as all methods involve tradeoffs^{42,43}. For example, keypoint-MoSeq does not yield a single fixed description of behavior, since its output is probabilistic. In principle, one could summarize this uncertainty in the form of a posterior distribution. Because proper posterior estimation is impractical using our current fitting procedure, we have defined an alternative approach whereby users generate an ensemble of candidate model fits and identify a consensus model for downstream analysis. Users wishing to better quantify model uncertainty can also apply subsequent analyses to the full ensemble of models. Keypoint-MoSeq is also limited to describing behavior at a single timescale. Although users may vary this timescale across a broad range, keypoint-MoSeq cannot simultaneously analyze behavior across multiple timescales or explicitly represent the hierarchical nesting of behavior motifs. Finally, because keypoint-MoSeq learns the identity of syllables from the data itself, it may miss especially rare behavioral events that could otherwise be captured using supervised methods.
To facilitate the adoption of keypoint-MoSeq, we built a website (http://www.moseq4all.org/) that includes free access to the code for academics as well as extensive documentation and guidance for implementation. As demonstrated here, the model underlying MoSeq is modular and thus accessible to extensions and modifications that can increase its alignment to behavioral data. For example, a time-warped version of MoSeq was recently reported that incorporates a term to explicitly model variation in movement vigor^{19}. We anticipate that the application of keypoint-MoSeq to a wide variety of experimental datasets will both yield important information about the strengths and failure modes of model-based methods for behavioral classification, and prompt continued innovation.
Methods
Ethical compliance
All experimental procedures were approved by the Harvard Medical School Institutional Animal Care and Use Committee (protocol number 04930) and were performed in compliance with the ethical regulations of Harvard University as well as the Guide for the Care and Use of Laboratory Animals.
Animal care and behavioral experiments
Unless otherwise noted, behavioral recordings were performed on 8–16-week-old C57BL/6 mice (The Jackson Laboratory stock no. 000664). Mice were transferred to our colony at 6–8 weeks of age and housed on a reverse 12-h light/12-h dark cycle. We single-housed mice after stereotactic surgery and group-housed them otherwise. On recording days, mice were brought to the laboratory, habituated in darkness for at least 20 min, and then placed in the behavioral arena for 30–60 min. We recorded 6 male mice for 10 sessions (6 h) in the initial round of open field recordings; 5 male mice for 52 sessions (50 h) during the accelerometry recordings; 16 male mice for 16 sessions (8 h) during the environmental enrichment experiment; and 5 male mice for 9 sessions (6 h) during the thermistor recordings. The dopamine photometry recordings were obtained from a recent study^{22}. They include 6 C57BL/6 mice and 8 DAT-IRES-cre (The Jackson Laboratory stock no. 006660) mice of both sexes, recorded for 378 sessions. Of these, we selected a random subset of 95 sessions (~50 h) for benchmarking keypointMoSeq.
Stereotactic surgery procedures
For all stereotactic surgeries, mice were anesthetized using 1–2% isoflurane in oxygen at a flow rate of 1 l min^{−1} for the duration of the procedure. Anteroposterior (AP) and mediolateral (ML) coordinates (in millimeters) were zeroed relative to bregma, and the dorsoventral (DV) coordinate was zeroed relative to the pial surface. All mice were monitored daily for 4 days following surgery and were allowed to recover for at least 1 week. Mice were then habituated to handling and brief head-fixation before beginning recordings.
For dopamine recordings, 400 nl of AAV5.CAG.dLight1.1 (Addgene, 111067; titer: 4.85 × 10^{12}) was injected at a 1:2 dilution into the dorsolateral striatum (DLS; AP 0.260; ML 2.550; DV −2.40), and a single 200-μm-diameter, 0.37–0.57-NA fiber cannula was implanted 200 μm above the injection site (see ref. ^{22} for additional details).
For accelerometry recordings, we surgically attached a Mill-Max connector (DigiKey, ED8450-ND) and head bar to the skull and secured it with dental cement (Metabond). A nine-degree-of-freedom absolute orientation IMU (Bosch, BNO055) was mounted on the Mill-Max connector using a custom printed circuit board (PCB) with a net weight below 1 g.
For thermistor surgeries, we adapted a previously described protocol^{37}. We first prepared the implant (GAG22K7MCD419, TE Connectivity) by stripping the leads and soldering them to two male Mill-Max pins (0.05-inch pitch, 8519305010001000). The pins and their solder joints were then entirely covered in Prime-Dent light-curable cement and cured for 10–20 s to ensure the longevity and stability of the electrical connection. Each implant was tested by touching the two leads of a multimeter (set to measure resistance) to the female side of the Mill-Max connector, breathing gently on the thermistor, and checking that the resistance dropped from roughly 20 kΩ to 18 kΩ.
To implant the thermistor, a midline incision was made from ~1 mm behind lambda to ~1 mm anterior to the nasal suture, and the skull was cleaned and lightly scored. A craniotomy was made just anterior to the nasal suture (well posterior to the position originally reported^{37}), large enough for the thermistor to fit fully inside. The thermistor was fully inserted along the AP axis so that it lay flat in the horizontal plane inside the nasal cavity. The craniotomy was then sealed with Kwik-Sil, and the thermistor wire was secured to the skull 1–2 mm posterior to the craniotomy with cyanoacrylate glue (Loctite 454). Dental cement (Metabond) was then used to attach the Mill-Max connector in an upright position between bregma and lambda, and a head bar was cemented to the skull at lambda.
Microsoft Azure recording setup
For the initial set of open field recordings (Figs. 1, 2, 3a–g and 5g–l), mice were recorded in a square arena with a transparent floor and walls (30 cm length and width). Microsoft Azure Kinect cameras captured simultaneous depth and near-IR video at 30 Hz. Six cameras were used in total: one above, one below and four side cameras at right angles, positioned at the same height as the mouse.
Accelerometry recordings
For the accelerometry recordings, we used a single Microsoft Azure Kinect camera placed above the mouse and an arena with a transparent floor and opaque circular walls (45-cm diameter). Data were transferred from the IMU using a lightweight tether attached to a custom-built active commutator. The IMU was connected to a Teensy microcontroller, which was programmed using the Adafruit BNO055 library with default settings (sample rate: 100 Hz; units: m/s^{2}). To synchronize the IMU measurements and video recordings, we used an array of near-IR LEDs to display a rapid sequence of random 4-bit codes that updated throughout the recording. The code sequence was later extracted from the behavioral videos and used to fit a piecewise linear model between timestamps from the videos and timestamps from the IMU.
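The synchronization step can be sketched as follows. This is an illustration only, assuming that the times of the 4-bit code transitions have already been matched between the two streams; the function name is ours, not part of the released pipeline.

```python
import numpy as np

def map_imu_to_video_time(imu_times, imu_code_times, video_code_times):
    """Map IMU timestamps onto the video clock by piecewise-linear
    interpolation between matched code-transition times.

    imu_code_times and video_code_times are the timestamps (on each
    device's own clock) at which the same 4-bit codes appeared."""
    return np.interp(imu_times, imu_code_times, video_code_times)
```

For example, if the IMU clock runs at twice the rate of the video clock, an IMU timestamp halfway between two code transitions maps to the video time halfway between the corresponding video transitions.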
Thermistor recordings
To record mouse respiration and movement at high frame rates, we built a multicamera recording arena using six Basler ace acA1300-200um monochrome USB 3.0 cameras (Edmund Optics, 33978) that recorded from above, from below and from four side views. The cameras were triggered at 120 Hz using an Arduino. Video compression was performed in real time on a GPU using a custom library (https://github.com/calebweinreb/multicamera_acquisition/). Mice were recorded in an open-top glass cube and illuminated with 32 near-IR high-power LED stars (LEDSupply, CREEXPEFRD3). To avoid reflections and saturation effects, the bottom camera was triggered slightly out of phase with the top cameras, and the LEDs were split into two groups: one group below the arena that turned on during the bottom camera’s exposure, and one group above the arena that turned on during the top and side cameras’ exposure.
To record the thermistor signal, we designed a custom PCB that used an op-amp (INA330AIDGST, Texas Instruments) to transform the thermistor’s resistance fluctuations into voltages, and another circuit element to keep the voltage within the 0–3.3 V range. The PCB was connected to an Arduino (separate from the one controlling the cameras) that recorded the output. The PCB parts list, schematic and microcontroller code are available upon reasonable request to the laboratory of S.R.D.
Before behavioral recording sessions with the thermistor, mice were briefly head-fixed, and a cable with a custom headstage was inserted into the head-mounted Mill-Max adaptor. The cable was commutated with an assisted electric commutator from Doric Lenses and connected to the input of the op-amp on the custom PCB. To synchronize the thermistor and video data, we piped a copy of the camera trigger signal from the camera Arduino to the thermistor Arduino and recorded this signal alongside the thermistor output.
Environmental enrichment recordings
To test the effects of environmental enrichment on behavior, we built an arena for overhead video recording of an open-topped home cage. The home cage was surrounded on each side by a 16-inch vertical barrier, illuminated from above by three near-IR LED stars (LEDSupply, CREEXPEFRD3) and recorded with a Basler ace acA1300-200um monochrome USB 3.0 camera (Edmund Optics, 33978). For half the recordings, the cage was filled with bedding, nesting material, chew sticks and a transparent, dome-shaped hut. For the other half, the cage was completely empty (except for the mouse).
Software
The following publicly available software packages were used for analysis: Python (version 3.8), NumPy (version 1.24.3), scikit-learn (version 1.2.2), PyTorch (version 1.9), JAX (version 0.3.22), SciPy (version 1.10.1), Matplotlib (version 3.7.1), Statsmodels (version 0.13.5), motionmapperpy (version 1.0), DeepLabCut (version 2.2.1), SLEAP (version 1.2.3), B-SOiD (version 1.5.1), VAME (version 1.1), GIMBAL (version 0.0.1), HRNet (unversioned), Lightning Pose (version 0.0.4) and segmentation_models_pytorch (version 0.3.3).
Statistics
All reported P values for comparisons between distributions were derived from Mann–Whitney U tests unless stated otherwise. In all comparisons to ‘shuffle’, the shuffle represents a cyclic permutation of the data.
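The shuffle control described above can be illustrated as follows. This is a minimal sketch: the helper name is ours, and the example data are synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def cyclic_shuffle(x, rng):
    # cyclic permutation: roll the series by a random offset, preserving
    # autocorrelation structure while breaking alignment with other variables
    return np.roll(x, rng.integers(1, len(x)))

# hypothetical usage: compare a per-session metric between two groups
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, 100)
group_b = rng.normal(0.5, 1.0, 100)
u_stat, p_value = mannwhitneyu(group_a, group_b)
```

A cyclic permutation is preferred over an element-wise shuffle when the data are temporally autocorrelated, since it preserves the local statistics of the series.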
Processing depth videos
Applying MoSeq to depth videos involves: (1) mouse tracking and background subtraction; (2) egocentric alignment and cropping; (3) PCA; and (4) probabilistic modeling. We applied steps 2–4 as described in the MoSeq2 pipeline^{25}. For step 1, we trained a convolutional neural network (CNN) with a U-Net++^{44} architecture to segment the mouse from the background using ~5,000 hand-labeled frames as training data.
Keypoint tracking for Microsoft Azure IR recordings
We used CNNs with an HRNet^{45} architecture (https://github.com/stefanopini/simpleHRNet/) with a final stride of two for pose tracking. The networks were trained on ~1,000 hand-labeled frames each for the overhead, below-floor and side-view Microsoft Azure cameras. Frame labeling was crowdsourced through a commercial service (Scale AI). The crowdsourced labels were comparable to those from experts in our laboratory (Extended Data Fig. 1d). For the overhead camera, we tracked the two ears and six points along the dorsal midline (tail base, lumbar spine, thoracic spine, cervical spine, head and nose). For the below-floor camera, we tracked the tip of each forepaw, the tip and base of each hind paw, and four points along the ventral midline (tail base, genitals, abdomen and nose). For the side cameras, we tracked the same eight points as for the overhead camera plus the six limb points used for the below-floor camera (14 total). We trained a separate CNN for each camera angle. Target activations were formed by centering a Gaussian with a 10-pixel (px) standard deviation on each keypoint. We used the location of the maximum pixel in each output channel of the neural network to determine keypoint coordinates and used the value at that pixel to set the confidence score. The resulting mean absolute error (MAE) between network detections and manual annotations was 2.9 px for the training data and 3.2 px for held-out data. We also trained DeepLabCut and SLEAP models on the overhead-camera and below-floor-camera datasets. For DeepLabCut, we used version 2.2.1, setting the architecture to ResNet-50 and the ‘pos_dist_thresh’ parameter to 10, resulting in train and test MAEs of 3.4 px and 3.8 px, respectively. For SLEAP, we used version 1.2.3 with the baseline_large_rf.single.json configuration, resulting in train and test MAEs of 3.5 px and 4.7 px.
For Lightning Pose^{40}, we used version 0.0.4 and default parameters with ‘pca_singleview’ and ‘temporal’ loss terms.
Keypoint tracking for thermistor recordings
We trained separate keypoint detection networks for the Basler camera arena (used for the thermistor recordings). CNNs with an HRNet architecture were trained on ~1,000 hand-labeled frames each for the overhead and below-floor cameras and ~3,000 hand-labeled frames for the side-view cameras. We tracked the same keypoints as for the Microsoft Azure dataset.
3D pose inference
Using 2D keypoint detections from six cameras, 3D keypoint coordinates were triangulated and then refined using GIMBAL, a model-based approach that leverages anatomical constraints and motion continuity^{29}. To fit GIMBAL, we computed initial 3D keypoint estimates using robust triangulation (that is, by taking the median across all camera pairs, as in 3D DeepLabCut^{46}) and then filtered to remove outliers using the EllipticEnvelope method from scikit-learn; we then fit the skeletal parameters and directional priors for GIMBAL using expectation maximization with 50 pose states. Finally, we applied the fitted GIMBAL model to each recording, using the following parameters for all keypoints: obs_outlier_variance = 1e6, obs_inlier_variance = 10, pos_dt_variance = 10. The latter parameters were chosen based on the accuracy of the resulting 3D keypoint estimates, as assessed by visual inspection. Camera calibration and initial triangulation were performed using a custom library (https://github.com/calebweinreb/multicamcalibration/tree/main/multicam_calibration/).
Keypoint change score
We defined the keypoint ‘change score’ as the total velocity of the keypoints after egocentric alignment. The goal of the change score is to highlight sudden shifts in pose. It was calculated by: (1) transforming keypoints into egocentric coordinates; (2) smoothing the transformed coordinates with a Gaussian kernel (sigma = 1 frame); (3) calculating the total change in coordinates across each frame; and (4) z-scoring. Formally, the score can be defined as
\[{\rm{change}}\,{\rm{score}}_{t}={\rm{zscore}}\left(\sum _{k}\left\Vert {y}_{t+1,k}-{y}_{t,k}\right\Vert \right)\]
where \({y}_{t,k}\) are the coordinates of keypoint k after Gaussian smoothing.
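The four steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the keypoints have already been egocentrically aligned; the function name and array layout are ours, not the released package’s.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def change_score(y, sigma=1.0):
    """Keypoint change score: total per-frame displacement of egocentrically
    aligned keypoints after temporal Gaussian smoothing, z-scored.

    y: array of shape (T, K, 2) of aligned keypoint coordinates."""
    y_smooth = gaussian_filter1d(y, sigma, axis=0)          # smooth over time
    change = np.linalg.norm(np.diff(y_smooth, axis=0), axis=-1).sum(axis=1)
    return (change - change.mean()) / change.std()          # z-score
```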
Spectral analysis of keypoint jitter
To analyze keypoint jitter, we quantified the magnitude of fluctuations across a range of frequencies by computing a spectrogram for each keypoint along each coordinate axis. Spectrograms were computed using the Python function scipy.signal.spectrogram with nperseg = 128 and noverlap = 124. The spectrograms were then combined by averaging: each keypoint was assigned a spectrogram by averaging over the two coordinate axes, and the entire animal was assigned a spectrogram by averaging over all keypoints.
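The averaging scheme can be sketched as follows, looping over keypoints and coordinate axes for clarity. The function name is ours; nperseg and noverlap match the values stated above.

```python
import numpy as np
from scipy.signal import spectrogram

def jitter_spectrograms(y, fs=30):
    """Per-keypoint and whole-animal spectrograms of keypoint trajectories.

    y: array of shape (T, K, 2) of keypoint coordinates."""
    T, K, D = y.shape
    per_keypoint = []
    for k in range(K):
        S_sum = 0.0
        for d in range(D):
            f, t, Sxx = spectrogram(y[:, k, d], fs=fs, nperseg=128, noverlap=124)
            S_sum = S_sum + Sxx
        per_keypoint.append(S_sum / D)          # average over the two axes
    per_keypoint = np.stack(per_keypoint)       # shape (K, n_freqs, n_times)
    whole_animal = per_keypoint.mean(axis=0)    # average over keypoints
    return f, t, per_keypoint, whole_animal
```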
We used the keypoint-specific spectrograms to calculate cross-correlations with −log_{10} (neural network detection confidence), as well as with the ‘error magnitude’ (Fig. 1g). Error magnitude was defined as the distance between the detected 2D location of a keypoint (based on a single camera angle) and a reprojection of its 3D position (based on consensus across six camera angles; see ‘3D pose inference’ above). We also computed the cross-correlation between nose and tail-base fluctuations at each frequency, as measured by the overhead and below-floor cameras, respectively. Finally, we averaged spectral power across keypoints to compute the cross-correlation with model transition probabilities (Fig. 1g). The model transition probabilities were defined for each frame as the fraction of N = 20 model fits in which a transition occurred on that frame. Formally, if z^{(i)} denotes the syllable sequence learned by model fit i, then the transition probability at time t is calculated as
\[{p}_{t}=\frac{1}{N}\sum _{i=1}^{N}{\mathbb{1}}\left[{z}_{t}^{(i)}\ne {z}_{t-1}^{(i)}\right]\]
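The transition-probability calculation reduces to a one-liner over the stacked state sequences. The function name is ours; a minimal sketch:

```python
import numpy as np

def transition_probability(Z):
    """Fraction of model fits with a syllable transition at each frame.

    Z: (N, T) array of state sequences from N model fits.
    Returns an array of length T-1, where entry t covers the
    transition between frames t and t+1."""
    return (np.diff(Z, axis=1) != 0).mean(axis=0)
```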
Applying keypointMoSeq
Datasets were modeled separately, and multiple models with different random seeds were fit for each dataset (see Supplementary Table 1 for the number of fits per dataset).
Modeling consisted of two phases: (1) fitting an ARHMM to a fixed pose trajectory derived from PCA of egocentrically aligned keypoints; and (2) fitting a full keypointMoSeq model initialized from the ARHMM. References in the text to ‘MoSeq applied to keypoints’ or ‘MoSeq (keypoints)’, for example, in Figs. 1 and 2, refer to the output of step 1. Both steps are described below, followed by a detailed description of the model and inference algorithm in the ‘mathematical notation’ section. In all cases, we excluded rare states (frequency < 0.5%) from downstream analysis. We have made the code available as a user-friendly package via https://keypointmoseq.readthedocs.io/en/latest/. With a consumer GPU, keypointMoSeq requires 30–60 min of computation time to model 5 h of data; the computation time scales linearly with dataset size.
Fitting an initial ARHMM
We first modified the keypoint coordinates, defining keypoints with confidence below 0.5 as missing data and imputing their values via linear interpolation, and then augmenting all coordinates with a small amount of random noise; the noise values were uniformly sampled from the interval [−0.1, 0.1] and helped prevent degeneracy during model fitting. Importantly, these preprocessing steps were only applied during ARHMM fitting: the original coordinates were used when fitting the full keypointMoSeq model.
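This preprocessing step can be sketched as follows. The function name is ours, and the sketch assumes each keypoint has at least one confident frame per recording.

```python
import numpy as np

def preprocess_for_arhmm(coords, confidence, threshold=0.5, jitter=0.1, seed=0):
    """Treat low-confidence detections as missing, impute them by linear
    interpolation over time, then add small uniform noise to prevent
    degeneracy during ARHMM fitting.

    coords: (T, K, 2) keypoint coordinates; confidence: (T, K) scores."""
    out = coords.astype(float).copy()
    frames = np.arange(len(coords))
    for k in range(coords.shape[1]):
        good = confidence[:, k] >= threshold
        for d in range(coords.shape[2]):
            # linear interpolation through the confident frames
            out[:, k, d] = np.interp(frames, frames[good], coords[good, k, d])
    rng = np.random.default_rng(seed)
    return out + rng.uniform(-jitter, jitter, out.shape)
```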
Next, we centered the coordinates on each frame, aligned them using the tail–nose angle and then transformed them using PCA with whitening. The number of principal components (PCs) was chosen for each dataset as the minimum required to explain 90% of total variance. This resulted in four PCs for the overhead-camera 2D datasets, six PCs for the below-floor-camera 2D datasets and six PCs for the 3D dataset.
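The PC-selection rule can be sketched with scikit-learn. This is an illustration under our own naming, assuming the aligned coordinates have been flattened to one row per frame:

```python
import numpy as np
from sklearn.decomposition import PCA

def whitened_pcs(aligned_poses, var_threshold=0.90):
    """Project aligned poses onto the minimum number of whitened PCs
    needed to explain at least `var_threshold` of total variance.

    aligned_poses: (T, 2K) flattened, egocentrically aligned coordinates."""
    full = PCA().fit(aligned_poses)
    cumvar = np.cumsum(full.explained_variance_ratio_)
    n = int(np.searchsorted(cumvar, var_threshold)) + 1
    return PCA(n_components=n, whiten=True).fit_transform(aligned_poses), n
```

With whitening, each retained component has approximately unit variance, which puts all pose dimensions on a comparable scale before modeling.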
We then used Gibbs sampling to infer the states and parameters of an ARHMM, including the state sequence z, the autoregressive parameters A, b and Q, and the transition parameters π and β. The hyperparameters for this step, listed in ‘mathematical notation’ below, were generally identical to those in the original depth MoSeq model. The one exception was the stickiness hyperparameter κ, which we adjusted separately for each dataset to ensure a median state duration of 400 ms.
Fitting a full keypointMoSeq model
We next fit the full set of variables for keypointMoSeq, which include the ARHMM variables mentioned above, as well as the location v and heading h, the latent pose trajectory x, the per-keypoint noise level σ^{2} and the per-frame, per-keypoint noise scale s. Fitting was performed using Gibbs sampling for 500 iterations, at which point the log joint probability appeared to have stabilized.
The hyperparameters for this step are enumerated in ‘mathematical notation’. In general, we used the same hyperparameter values across datasets. The two exceptions were the stickiness hyperparameter κ, which again had to be adjusted to maintain a median state duration of 400 ms, and s_{0}, which determines the prior on the noise scale. Because low-confidence keypoint detections often have high error, we set s_{0} using a logistic curve that transitions between a high-noise regime (s_{0} = 100) for detections with low confidence and a low-noise regime (s_{0} = 1) for detections with high confidence.
The κ value used for each dataset is reported in Supplementary Table 2.
Trajectory plots
To visualize the modal trajectory associated with each syllable (Fig. 2e), we (1) computed the full set of trajectories for all instances of all syllables, (2) used a local density criterion to identify a single representative instance of each syllable and (3) computed a final trajectory using the nearest neighbors of the representative trajectory.
Computing the trajectory of individual syllable instances
Let y_{t}, v_{t} and h_{t} denote the keypoint coordinates, centroid and heading of the mouse at time t, and let F(v, h; y) denote the rigid transformation that egocentrically aligns y using centroid v and heading h. Given a syllable instance with onset time T, we computed the corresponding trajectory X_{T} by centering and aligning the sequence of poses \(({y}_{T-5},\ldots ,{y}_{T+15})\) using the centroid and heading at time T. In other words,
\[{X}_{T}=\left(F({v}_{T},{h}_{T};{y}_{T-5}),\ldots ,F({v}_{T},{h}_{T};{y}_{T+15})\right)\]
Identifying a representative instance of each syllable
The collection of trajectories computed above can be thought of as a set of points in a high-dimensional trajectory space (for K keypoints in 2D, this space would have dimension 40K). Each point has a syllable label, and the segregation of these labels in the trajectory space reflects the kinematic differences between syllables. To capture these differences, we computed a local probability density function for each syllable and a global density function across all syllables. We then selected a representative trajectory X for each syllable by maximizing the ratio of local to global density:
\[X={\rm{argmax}}_{{X}_{T}}\frac{{\rho }_{{\rm{local}}}({X}_{T})}{{\rho }_{{\rm{global}}}({X}_{T})}\]
The density functions were computed as the mean distance from each point to its 50 nearest neighbors. For the global density, the nearest neighbors were selected from among all instances of all syllables. For the local densities, the nearest neighbors were selected from among instances of the target syllable.
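The selection procedure can be sketched with scikit-learn’s nearest-neighbor utilities. The function name is ours, and we use a small neighbor count for illustration (the paper uses 50); density is taken to be inversely proportional to the mean neighbor distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def representative_instance(trajs, labels, syllable, k=5):
    """Pick the instance of `syllable` that maximizes the ratio of local
    (within-syllable) density to global (all-syllable) density.

    trajs: (M, D) flattened trajectories; labels: (M,) syllable labels."""
    idx = np.flatnonzero(labels == syllable)
    # global: mean distance to k nearest neighbors among all instances
    g = NearestNeighbors(n_neighbors=k + 1).fit(trajs)
    g_dist = g.kneighbors(trajs[idx])[0][:, 1:].mean(axis=1)  # drop self
    # local: mean distance to k nearest neighbors within the syllable
    l = NearestNeighbors(n_neighbors=k + 1).fit(trajs[idx])
    l_dist = l.kneighbors(trajs[idx])[0][:, 1:].mean(axis=1)
    # local/global density ratio ~ global distance / local distance
    return idx[np.argmax(g_dist / l_dist)]
```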
Computing final trajectories for each syllable
For each syllable and its representative trajectory X, we identified the 50 nearest neighbors of X from among other instances of the same syllable and then computed a final trajectory as the mean across these nearest neighbors. The trajectory plots in Fig. 2e consist of ten evenly spaced poses along this trajectory, that is, the poses at times \(T-5,T-3,\ldots ,T+13\).
Testing robustness to missing data
To test the ability of keypointMoSeq to infer syllables and sequences in the face of missing data, we artificially ablated random subsets of keypoints at randomly timed intervals and then modeled the ablated data (Extended Data Fig. 2c–f). The ablation intervals began on every 10th second of the recording and lasted between 33 ms and 3 s (chosen uniformly at random). For each interval, between 1 and 8 keypoints were selected (uniformly at random). Ablation entailed (1) erasing the keypoint coordinates and filling the gap by linear interpolation and (2) setting the corresponding confidence values to 0. We then applied keypointMoSeq 20 times with different random seeds, using a single, fixed set of parameters derived previously from standard model fitting on the unablated dataset. Fixing the parameters ensured that syllable labels would be comparable across repeated model fits.
Cross-syllable likelihoods
We defined each cross-syllable likelihood as the probability (on average) that instances of one syllable could have arisen from the dynamics of another syllable. The probabilities were computed based on the discrete latent states z_{t}, continuous latent states x_{t} and autoregressive parameters A, b and Q output by keypointMoSeq. The instances I(n) of syllable n were defined as the set of all sequences \(({t}_{\rm{s}},\ldots ,{t}_{\rm{e}})\) of consecutive timepoints such that z_{t} = n for all \({t}_{\rm{s}}\le t\le {t}_{\rm{e}}\) and \({z}_{{t}_{\rm{s}}-1}\ne n\ne {z}_{{t}_{\rm{e}}+1}\). For each such instance, one can calculate the probability \(P({x}_{{t}_{\rm{s}}},\ldots ,{x}_{{t}_{\rm{e}}}\mid {A}_{m},{b}_{m},{Q}_{m})\) that the corresponding sequence of latent states arose from the autoregressive dynamics of syllable m. The cross-syllable likelihood C_{nm} is defined as the average of these probabilities across instances of syllable n:
\[{C}_{nm}=\frac{1}{| I(n)| }\sum _{({t}_{\rm{s}},\ldots ,{t}_{\rm{e}})\in I(n)}P\left({x}_{{t}_{\rm{s}}},\ldots ,{x}_{{t}_{\rm{e}}}\mid {A}_{m},{b}_{m},{Q}_{m}\right)\]
Generating synthetic keypoint data
To generate the synthetic keypoint trajectories used for Extended Data Fig. 2h, we fit a linear dynamical system (LDS) to egocentrically aligned keypoint trajectories and then sampled randomly generated outputs from the fitted model. The LDS was identical to the model underlying keypointMoSeq (see ‘mathematical notation’), except that it only had one discrete state, lacked centroid and heading variables and allowed separate noise terms for the x and y coordinates of each keypoint.
Expected marginal likelihood score
Because keypointMoSeq can at best produce point estimates of the model parameters (which will differ from run to run), users typically run the model several times and then rank the resulting fits. For ranking model fits, we defined a custom metric called the expected marginal likelihood score. The score evaluates a given set of autoregressive parameters (A, b, Q) by the expected value of the marginal log likelihood \({E}_{x \sim P(x\mid y)}\left[\log P(x\mid A,b,Q)\right]\). In practice, given an ensemble of pose trajectories x^{(i)} and parameters \({\theta }^{(i)}=({A}^{(i)},{b}^{(i)},{Q}^{(i)})\) derived from N separate MCMC chains, the scores are computed as
\[{\rm{score}}\left({\theta }^{(i)}\right)=\frac{1}{N-1}\sum _{j\ne i}\log P\left({x}^{(j)}\mid {\theta }^{(i)}\right)\]
The scores shown in Extended Data Fig. 3j–m were computed using an ensemble of N = 20 chains. We chose this custom score instead of a more standard metric (such as held-out likelihood) because computing the latter is intractable for the keypointMoSeq model.
Environmental enrichment analysis
We fit a single keypointMoSeq model to the environmental enrichment dataset, which included recordings in an enriched home cage and control recordings in an empty cage. The transition graph (Extended Data Fig. 8b) was generated with keypointMoSeq’s analysis pipeline (https://keypointmoseq.readthedocs.io/en/latest/analysis.html#syllabletransitiongraph/) using node positions from a force-directed layout. Detection of differentially used syllables was also performed using the analysis pipeline, which applies a Kruskal–Wallis test for significant differences in the per-session frequency of each syllable (https://keypointmoseq.readthedocs.io/en/latest/analysis.html#comparebetweengroups/). Syllables were clustered into three groups by applying community detection (networkx.community.louvain_communities) to a complete graph in which nodes are syllables and edges are weighted by the bigram probabilities \({b}_{ij}=P({z}_{t}=i,\,{z}_{t+1}=j)\).
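The clustering step can be sketched as follows. This is a simplified illustration under our own naming; we symmetrize the bigram weights since the Louvain algorithm operates on undirected graphs.

```python
import numpy as np
import networkx as nx

def syllable_communities(z, seed=0):
    """Cluster syllables via Louvain community detection on a complete
    graph weighted by symmetrized bigram probabilities.

    z: 1D integer array of syllable labels over time."""
    n = int(z.max()) + 1
    B = np.zeros((n, n))
    np.add.at(B, (z[:-1], z[1:]), 1)          # count bigrams
    B /= B.sum()                              # normalize to probabilities
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=B[i, j] + B[j, i])
    return nx.community.louvain_communities(G, seed=seed)
```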
Applying published methods for behavior analysis
We applied B-SOiD, VAME and MotionMapper using default parameters, except for the parameter scans in Extended Data Fig. 5 (see Supplementary Table 3 for a summary of all parameter choices). In general, we were unable to uniformly improve the performance of any method by deviating from these default parameters. For example, switching VAME’s state-partition method from a hidden Markov model (HMM) to k-means led to higher change score alignment (Extended Data Fig. 5a) but caused a decrease in alignment to supervised behavior labels (Fig. 5e,f shows performance under an HMM; performance under k-means is not shown). Our application of each method is described in detail below.
B-SOiD is an automated pipeline for behavioral clustering that: (1) preprocesses keypoint trajectories to generate pose and movement features; (2) performs dimensionality reduction on a subset of frames using uniform manifold approximation and projection (UMAP); (3) clusters points in the UMAP space; and (4) uses a classifier to extend the clustering to all frames^{12}. We fit B-SOiD separately for each dataset. In each case, steps 2–4 were performed multiple times with different random seeds (see Supplementary Table 1 for the number of fits per dataset), and the pipeline was applied with standard parameters; 50,000 randomly sampled frames were used for dimensionality reduction and clustering, and the min_cluster_size range was set to 0.5–1%. Because B-SOiD uses a hardcoded window of 100 ms to calculate pose and movement features, we reran the pipeline with falsely inflated frame rates for the window-size scan in Extended Data Fig. 5a. In all analyses involving B-SOiD, rare states (frequency < 0.5%) were excluded from the analysis.
VAME is a pipeline for behavioral clustering that: (1) preprocesses keypoint trajectories and transforms them into egocentric coordinates; (2) fits a recurrent neural network; and (3) clusters the latent code of the recurrent neural network^{13}. We applied these steps separately to each dataset, in each case running step 3 multiple times with different random seeds (see Supplementary Table 1 for the number of fits per dataset). For step 1, we used the same parameters as in keypointMoSeq: egocentric alignment was performed along the tail–nose axis, and we set the pose_confidence threshold to 0.5. For step 2, we set time_window = 30 and zdims = 30 for all datasets, except for the zdims scan in Extended Data Fig. 5a. VAME provides two different options for step 3: fitting an HMM (default) or applying k-means (alternative). We fit an HMM for all datasets and additionally applied k-means to the initial open field dataset. In general, we approximately matched the number of states/clusters in VAME to the number identified by keypointMoSeq, except when scanning over state number in Extended Data Fig. 5a. In all analyses involving VAME, rare states (frequency < 0.5%) were excluded from analysis.
MotionMapper performs unsupervised behavioral segmentation by: (1) applying a wavelet transform to preprocessed pose data; (2) nonlinearly embedding the transformed data in 2D; and (3) clustering the 2D data with a watershed transform^{17}. We applied these steps separately to each dataset, in each case running steps 2 and 3 multiple times with different random seeds (see Supplementary Table 1 for the number of fits per dataset). There are several published implementations of MotionMapper, which perform essentially the same set of transformations but differ in programming language. We obtained similar results from a recent Python implementation from the Berman laboratory (https://github.com/bermanlabemory/motionmapperpy/) and a published MATLAB implementation^{30}. All results in the paper are from the Python implementation, which we applied as follows. Data were first egocentrically aligned along the tail–nose axis and then projected into eight dimensions using PCA. Ten log-spaced frequencies between 0.25 Hz and 15 Hz were used for the wavelet transform, and dimensionality reduction was performed using t-distributed stochastic neighbor embedding. The threshold for watershedding was chosen to produce at least 25 clusters, consistent with keypointMoSeq for the overhead-camera data. Rare states (frequency < 0.5%) were excluded from analysis. For the parameter scan in Extended Data Fig. 5a, we varied each of these parameters while holding the others fixed, including the threshold for watershedding, the number of initial PCA dimensions and the frequency range of wavelet analysis. We also repeated a subset of these analyses using an alternative autoencoder-based dimensionality reduction approach, as described in the motionmapperpy tutorial (https://github.com/bermanlabemory/motionmapperpy/blob/master/demo/motionmapperpy_mouse_demo.ipynb/).
Predicting kinematics from state sequences
We trained decoding models based on spline regression to predict kinematic parameters (height, velocity and turn speed) from state sequences output by keypointMoSeq and other behavior segmentation methods (Fig. 3e and Extended Data Fig. 5c). Let z_{t} represent an unsupervised behavioral state sequence and let B denote a spline basis, where B_{t,i} is the value of spline i at frame t. We generated such a basis using the ‘bs’ function from the Python package ‘patsy’, passing in six log-spaced knot locations (1.0, 2.0, 3.9, 7.7, 15.2 and 30.0) and obtaining basis values over a 300-frame interval. This resulted in a 300-by-5 basis matrix B. The spline basis and state sequence were combined to form a 5N-dimensional design matrix, where N is the number of distinct behavioral states. Specifically, for each instance \(({t}_{\rm{s}},\ldots ,{t}_{\rm{e}})\) of state n (see ‘Cross-syllable likelihoods’ for a definition of state instances), we inserted the first \({t}_{\rm{e}}-{t}_{\rm{s}}\) frames of B into dimensions \(5n,\ldots ,5n+4\) of the design matrix, aligning the first frame of B to frame t_{s} in the design matrix. Kinematic features were regressed against the design matrix using Ridge regression from scikit-learn and fivefold cross-validation. We used a range of values from 10^{−3} to 10^{3} for the regularization parameter α and reported the results with the greatest accuracy.
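The design-matrix assembly, which is the nonstandard part of this procedure, can be sketched as follows. The function name is ours, and a generic basis B stands in for the patsy output described above.

```python
import numpy as np

def build_design_matrix(z, B):
    """Assemble a state-specific spline design matrix.

    For each run of state n starting at frame t_s, insert the first rows
    of basis B into the columns allocated to state n, aligned to t_s.

    z: (T,) integer state sequence; B: (L, S) spline basis."""
    T = len(z)
    L, S = B.shape
    N = int(z.max()) + 1
    X = np.zeros((T, S * N))
    t = 0
    while t < T:
        n, start = z[t], t
        while t < T and z[t] == n:      # advance to the end of this run
            t += 1
        m = min(t - start, L)           # truncate runs longer than the basis
        X[start:start + m, S * n:S * (n + 1)] = B[:m]
    return X
```

The resulting matrix can then be passed directly to a cross-validated Ridge regression against each kinematic feature.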
Rearing analysis
To compare the dynamics of rear-associated states across methods, we systematically identified all instances of rearing in our initial open field dataset. During a stereotypical rear, mice briefly stood on their hind legs and extended their head upwards, leading to a transient increase in height from its modal value of 3–5 cm to a peak of 7–10 cm. Rears were typically brief, with mice exiting and then returning to a prone position within a few seconds. We encoded these features using the following criteria. First, rear onsets were defined as increases in height from below 5 cm to above 7 cm that occurred within the span of a second, with onset formally defined as the first frame where the height exceeded 5 cm. Next, rear offsets were defined as decreases in height from above 7 cm to below 5 cm that occurred within the span of a second, with offset formally defined as the first frame where the height fell below 7 cm. Finally, we defined complete rears as onset–offset pairs delimiting an interval with a length between 0.5 s and 2 s. Height was determined from the distribution of depth values in cropped, aligned and background-segmented videos. Specifically, we used the 98th percentile of the distribution in each frame.
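These criteria can be approximated in a short detector. This is an illustrative sketch under our own naming, not the released analysis code, and it simplifies the crossing logic slightly (each threshold crossing is checked against the opposite threshold within a 1-s window).

```python
import numpy as np

def find_rears(height, fps=30, lo=5.0, hi=7.0):
    """Detect rears from a per-frame height trace (cm): onsets cross `lo`
    upward and reach `hi` within 1 s; offsets fall below `hi` and reach
    `lo` within 1 s; complete rears last 0.5-2 s."""
    onsets, offsets = [], []
    for t in range(1, len(height)):
        if height[t - 1] < lo <= height[t] and (height[t:t + fps] > hi).any():
            onsets.append(t)                  # first frame exceeding 5 cm
        if height[t - 1] >= hi > height[t] and (height[t:t + fps] < lo).any():
            offsets.append(t)                 # first frame below 7 cm
    rears = []
    for on in onsets:
        off = next((x for x in offsets if x > on), None)
        if off is not None and 0.5 * fps <= off - on <= 2 * fps:
            rears.append((on, off))
    return rears
```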
Accelerometry processing
From the IMU, we obtained absolute rotations r_{y}, r_{p} and r_{r} (yaw, pitch and roll) and accelerations a_{x}, a_{y} and a_{z} (dorsal/ventral, posterior/anterior and left/right). To control for subtle variations in implant geometry and chip calibration, we centered the distribution of sensor readings for each variable on each session. We defined total acceleration as the norm of the three acceleration components, \(\sqrt{{a}_{x}^{2}+{a}_{y}^{2}+{a}_{z}^{2}}\).
Similarly, we defined total angular velocity as the norm ω of the rotation derivative, \(\omega =\left\Vert \frac{d}{{dt}}({r}_{y},{r}_{p},{r}_{r})\right\Vert\).
Finally, to calculate jerk, we smoothed the acceleration signal with a 50-ms Gaussian kernel, generating a time series \(\widetilde{a}\), and then computed the norm of its derivative, \(\left\Vert \frac{d\widetilde{a}}{{dt}}\right\Vert\).
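The three accelerometry features can be computed as sketched below. This is our own minimal implementation (assuming a (T, 3) array per sensor stream), using scipy.ndimage.gaussian_filter1d for the 50-ms smoothing and a finite-difference gradient for derivatives.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def imu_features(acc, rot, fs):
    """acc: (T, 3) accelerations; rot: (T, 3) yaw/pitch/roll; fs: sampling rate (Hz).
    Returns total acceleration, total angular velocity and jerk per frame."""
    acc = acc - acc.mean(axis=0)  # center each channel per session
    rot = rot - rot.mean(axis=0)
    total_acc = np.linalg.norm(acc, axis=1)
    # angular velocity: norm of the time derivative of the rotation channels
    ang_vel = np.linalg.norm(np.gradient(rot, axis=0) * fs, axis=1)
    # jerk: smooth acceleration with a 50-ms Gaussian kernel, then differentiate
    acc_smooth = gaussian_filter1d(acc, sigma=0.05 * fs, axis=0)
    jerk = np.linalg.norm(np.gradient(acc_smooth, axis=0) * fs, axis=1)
    return total_acc, ang_vel, jerk
```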
Aligning dopamine fluctuations to behavior states
For a detailed description of photometry data acquisition and preprocessing, see ref. ^{22}. Briefly, photometry signals were: (1) normalized using ΔF/F_{0} with a 5-s window; (2) adjusted against a reference to remove motion artifacts and other non-ligand-associated fluctuations; (3) z-scored using a 20-s sliding window; and (4) temporally aligned to the 30-Hz behavioral videos.
Given a set of state onsets (either for a single state or across all states), we computed the onset-aligned dopamine trace by averaging the dopamine signal across onset-centered windows. From the resulting traces, each of which can be denoted as a time series of dopamine signal values (\({d}_{-T},\ldots ,{d}_{T}\)), we defined the total fluctuation size (Fig. 4d) and temporal asymmetry (Fig. 4e) as
A third metric—the average dopamine during each state (Extended Data Fig. 7b)—was defined simply as the mean of the dopamine signal across all frames bearing that state label. For each metric, shuffle distributions were generated by repeating the calculation with a temporally reversed copy of the dopamine time series.
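The onset-aligned averaging step can be sketched as follows (a minimal NumPy implementation; the function name is ours). Windows that would extend past the edges of the recording are dropped.

```python
import numpy as np

def onset_aligned_average(signal, onsets, window):
    """Average `signal` across windows of +/- `window` frames centered on
    each onset, skipping onsets too close to the recording boundaries."""
    traces = [signal[t - window:t + window + 1]
              for t in onsets
              if t - window >= 0 and t + window + 1 <= len(signal)]
    return np.mean(traces, axis=0)
```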
Supervised behavior benchmark
Videos and behavioral annotations for the supervised open field behavior benchmark (Fig. 5a–c) were obtained from ref. ^{31}. The dataset contains 20 videos that are each 10–20 min long. Each video includes frame-by-frame annotations of five possible behaviors: locomote, rear, face groom, body groom and defecate. We excluded ‘defecate’ from the analysis because it was extremely rare (<0.1% of frames).
For pose tracking, we used DeepLabCut’s SuperAnimal inference API, which performs inference on videos without requiring pose annotations for those videos^{47}. Specifically, we used the SuperAnimal-TopViewMouse model, which applies DLCRNet-50 for pose estimation. Keypoint detections were obtained using DeepLabCut’s API function deeplabcut.video_inference_superanimal. This function loads the pretrained SuperAnimal-TopViewMouse model and performs video adaptation, applying a multi-resolution ensemble (that is, the image height is resized to 400, 500 and 600 pixels with a fixed aspect ratio) and rapid self-training (the model is trained for 1,000 iterations on its own zero-shot predictions with confidence above 0.1) to counter domain shift and reduce jitter in the predictions.
Keypoint coordinates and behavioral annotations for the supervised social behavior benchmark (Fig. 5d–f) were obtained from the CalMS21 dataset^{32} (task 1). The dataset contains 70 videos of resident–intruder interactions with frame-by-frame annotations of four possible behaviors: attack, investigate, mount or other. All unsupervised behavior segmentation methods were fitted to 2D keypoint data for the resident mouse.
We used four metrics^{13} to compare supervised annotations and unsupervised states from each method. These included NMI, homogeneity, adjusted Rand score and purity. All metrics besides purity were computed using the Python library scikit-learn (that is, with the functions normalized_mutual_info_score, homogeneity_score and adjusted_rand_score). The purity score was defined as in ref. ^{13}.
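The purity score admits a short implementation. The sketch below uses the standard definition (each cluster is credited with its most common true label), which we take to match ref. ^{13}; the function name and details are ours.

```python
import numpy as np

def purity_score(true_labels, cluster_labels):
    """Purity: each cluster votes for its most common true label; the score
    is the fraction of frames carrying their cluster's majority label."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)
```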
Thermistor signal processing
During respiration, the movement of air through a mouse’s nasal cavity generates fluctuations in temperature that can be detected by a thermistor; temperature decreases during inhalations (because the mouse is warmer than the air around it) and rises between inhalations. Below we refer to the between-inhalation intervals as ‘exhales’ but note that they may also contain pauses in respiration—pauses and exhales likely cannot be distinguished because warming of the thermistor occurs whether or not air is flowing.
To segment inhales and exhales using the thermistor signal, we first applied a 60-Hz notch filter (scipy.signal.iirnotch, q = 10) and a low-pass filter (scipy.signal.butter, order = 3, cutoff = 40 Hz, analog = False) to the raw signal, and then used a median filter to subtract the slow DC offset component of the signal. We then performed peak detection using scipy.signal.find_peaks (minimum inter-peak distance of 50 ms, minimum and maximum widths of 10 ms and 1,500 ms, respectively). To distinguish true peaks (inhalation onsets) from spurious peaks (noise), we varied the minimum prominence parameter from 10^{−4} to 1 while keeping other parameters fixed, and then used the value at which the number of peaks stabilized. Using the chosen minimum prominence, the signal was then analyzed twice—once at the chosen value, and again with a slightly more permissive minimum prominence (1/8 of the chosen value). Any low-amplitude breaths detected with the more permissive setting that overlapped with periods of breathing between 1 Hz and 6 Hz were added to the detections. This same process was then repeated to find exhale onsets but with the thermistor signal inverted. Finally, inhales and exhales were paired, and any instances of two inhales/exhales in a row were patched by inserting an exhale/inhale at the local extremum between them. Detections were then inspected manually, and any recordings with excessive noise, unusually high breathing rates (>14 Hz), or unusual autocorrelation profiles were removed from further analyses.
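The prominence-stabilization step can be sketched as follows. This is a simplified implementation (the function name and the exact stabilization criterion—the first threshold whose peak count matches the next one—are ours): candidate prominence thresholds are scanned with scipy.signal.find_peaks while other parameters stay fixed.

```python
import numpy as np
from scipy.signal import find_peaks

def stable_prominence(signal, prominences, distance):
    """Scan candidate prominence thresholds and return the one at which
    the detected peak count stops changing (the 'stabilized' value)."""
    counts = [len(find_peaks(signal, prominence=p, distance=distance)[0])
              for p in prominences]
    # pick the first threshold whose count matches the next one
    for i in range(len(counts) - 1):
        if counts[i] == counts[i + 1]:
            return prominences[i], counts[i]
    return prominences[-1], counts[-1]
```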
Classifying sniffaligned syllables
To test whether syllables were significantly sniff-aligned, we compared the probability of inhalation in the 50 ms before versus the 50 ms after syllable onset. Specifically, for each syllable, we quantified the pre-onset versus post-onset inhalation fraction across all instances of that syllable, and then compared the pre- and post-onset distributions using a paired t-test. Syllables with P < 0.001 were considered significant.
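The per-syllable test can be sketched with scipy.stats.ttest_rel (the wrapper name is ours). The inputs are the pre- and post-onset inhalation fractions across the instances of one syllable, paired by instance.

```python
from scipy.stats import ttest_rel

def sniff_alignment_pvalue(pre_fractions, post_fractions):
    """Paired t-test comparing inhalation probability in the 50 ms before
    vs. after syllable onset, across instances of one syllable."""
    return ttest_rel(pre_fractions, post_fractions).pvalue
```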
Fly gait analysis
For the analysis of fly behavior, we used a published dataset of keypoint coordinates^{39}, which were derived from behavioral videos originally reported in ref. ^{17}. The full dataset contains 1-h recordings (100 fps) of single flies moving freely on a backlit 100-mm-diameter arena. Keypoints were tracked using LEAP (test accuracy ~2.5 px). MotionMapper results (including names for each cluster) were also included in the published dataset. We chose four 1-h sessions (uniformly at random) for analysis with keypointMoSeq. All results reported here were derived from this 4-h dataset.
The analysis of syllable probabilities across the stride cycle (Fig. 6i–k) was limited to periods of ‘fast locomotion’, as defined by the MotionMapper labeling (state label 7). To identify the start and end of each stride cycle, we applied PCA to egocentric keypoint coordinates (restricted to fast locomotion frames). We found that the first PC oscillated in a manner reflecting the fly’s gait, and thus smoothed the first PC using a one-frame Gaussian filter and performed peak detection on the smoothed signal. Each inter-peak interval was defined as one stride. Stances and swings (Fig. 6j and Extended Data Fig. 10g) were defined by backward and forward motion of the leg tips, respectively (in egocentric coordinates).
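The stride-segmentation procedure can be sketched as below (a minimal implementation using NumPy's SVD for PCA; the function name is ours). Note that the sign of the first principal component is arbitrary, which can shift peak locations by half a cycle but leaves stride durations unchanged.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def stride_boundaries(keypoints, smooth_sigma=1.0):
    """Segment strides from egocentric keypoints: project onto the first
    principal component, smooth, and detect peaks; inter-peak intervals
    are strides. keypoints: (T, K, D) array."""
    X = keypoints.reshape(len(keypoints), -1)
    X = X - X.mean(axis=0)
    # first principal component via SVD
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pc1 = X @ Vt[0]
    pc1_smooth = gaussian_filter1d(pc1, smooth_sigma)
    peaks, _ = find_peaks(pc1_smooth)
    return list(zip(peaks[:-1], peaks[1:]))
```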
Mathematical notation

1. χ^{−2}(ν, τ^{2}) denotes the scaled inverse chi-squared distribution.
2. \(\otimes\) denotes the Kronecker product.
3. Δ^{N} is the N-dimensional simplex.
4. I_{N} is the N × N identity matrix.
5. 1_{N × M} is the N × M matrix of ones.
6. \(\bf{x}_{{t}_{1}:{t}_{2}}\) denotes the concatenation \(\left[\bf{x}_{{t}_{1}},\bf{x}_{{t}_{1}+1},\ldots ,\bf{x}_{{t}_{2}}\right]\), where t_{1} < t_{2}.
Generative model
KeypointMoSeq learns syllables by fitting a switching linear dynamical systems (SLDS) model^{48}, which decomposes an animal’s pose trajectory into a sequence of stereotyped dynamical motifs. In general, SLDS models explain time-series observations y_{1}, …, y_{T} through a hierarchy of latent states, including continuous states \(\bf{x}_{{t}}\in {{\mathbb{R}}}^{{M}}\) that represent the observations y_{t} in a low-dimensional space, and discrete states z_{t} ∈ {1, …, N} that govern the dynamics of x_{t} over time. In keypointMoSeq, the discrete states correspond to syllables, the continuous states correspond to pose, and the observations are keypoint coordinates. We further adapted SLDS by (1) including a sticky hierarchical Dirichlet process (HDP) prior; (2) explicitly modeling the animal’s location and heading; and (3) including a robust (heavy-tailed) observation distribution for keypoints. Below we review SLDS models in general and then describe each of the customizations implemented in keypointMoSeq.
SLDSs
The discrete states z_{t} ∈ {1, …, N} are assumed to form a Markov chain, meaning \({z}_{t+1}\sim \text{Cat}\left({\pi }_{{z}_{t}}\right)\),
where \({\pi }_{i}\in {\Delta }^{N}\) is the probability of transitioning from discrete state i to each other state. Conditional on the discrete states z_{t}, the continuous states x_{t} follow an L-th-order vector autoregressive process with Gaussian noise. This means that the expected value of each x_{t} is a linear function of the previous L states \(\bf{x}_{t-L:t-1}\), as shown below: \({\bf{x}}_{t}\sim {\mathscr{N}}\left({A}_{{z}_{t}}{\bf{x}}_{t-L:t-1}+{\bf{b}}_{{z}_{t}},{Q}_{{z}_{t}}\right)\)
where \({A}_{{i}}\in {{\mathbb{R}}}^{{M\times {LM}}}\) is the autoregressive dynamics matrix, \(\bf{b}_{{i}}\in {{\mathbb{R}}}^{{M}}\) is the dynamics bias vector and \({Q}_{{i}}\in {{\mathbb{R}}}^{{M\times M}}\) is the dynamics noise matrix for each discrete state i = 1, …, N. The dynamics parameters A_{i}, b_{i} and Q_{i} have a matrix normal inverse Wishart (MNIW) prior, \(\left(\left[{A}_{i},{\bf{b}}_{i}\right],{Q}_{i}\right)\sim \text{MNIW}\left({\nu }_{0},{S}_{0},{M}_{0},{K}_{0}\right)\),
where ν_{0} > M − 1 is the degrees of freedom, \({S}_{0}\in {{\mathbb{R}}}^{{M\times M}}\) is the prior covariance matrix, \({M}_{0}\in {{\mathbb{R}}}^{{M\times \left({LM}+1\right)}}\) is the prior mean dynamics matrix, and \({K}_{0}\in {{\mathbb{R}}}^{\left({{LM}}+1\right)\times \left({{L}M}+1\right)}\) is the prior scale matrix. Finally, in the standard formulation of SLDS (which we modify for keypoint data, as described below), each observation \(\bf{y}_{{t}}\in {{\mathbb{R}}}^{{D}}\) is a linear function of x_{t} plus noise: \({\bf{y}}_{t}\sim {\mathscr{N}}\left(C{\bf{x}}_{t}+{\bf{d}},S\right)\)
Here we assume that the observation parameters C, d and S do not depend on z_{t}.
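A minimal forward sampler for this generative model (with L = 1 for simplicity) illustrates the hierarchy of discrete states, continuous states and observations. All names are ours, and this is a sketch of the standard SLDS formulation, not the fitted keypointMoSeq model:

```python
import numpy as np

def sample_slds(pi, A, b, Q, C, d, S, T, rng):
    """Forward-sample an SLDS: discrete states z (Markov chain), continuous
    states x (per-state AR(1) dynamics), observations y (linear + noise)."""
    N, M = len(pi), A.shape[-1]
    D = C.shape[0]
    z = np.zeros(T, dtype=int)
    x = np.zeros((T, M))
    y = np.zeros((T, D))
    for t in range(T):
        if t > 0:
            z[t] = rng.choice(N, p=pi[z[t - 1]])  # Markov transition
        mean = A[z[t]] @ x[t - 1] + b[z[t]] if t > 0 else b[z[t]]
        x[t] = rng.multivariate_normal(mean, Q[z[t]])   # state dynamics
        y[t] = rng.multivariate_normal(C @ x[t] + d, S)  # emission
    return z, x, y
```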
Sticky HDP
A key feature of depth MoSeq is the use of a sticky-HDP prior for the transition matrix. In general, HDP priors allow the number of distinct states in an HMM to be inferred directly from the data. The ‘sticky’ variant of the HDP prior includes an additional hyperparameter κ that tunes the frequency of self-transitions in the discrete state sequence z_{t}, and thus the distribution of syllable durations. As in depth MoSeq, we implement a sticky-HDP prior using the weak limit approximation^{49}, as shown below: \(\beta \sim \text{Dir}\left(\gamma /N,\ldots ,\gamma /N\right),\qquad {\pi }_{i}\sim \text{Dir}\left(\alpha {\beta }_{1},\ldots ,\alpha {\beta }_{i}+\kappa ,\ldots ,\alpha {\beta }_{N}\right)\)
where κ is added in the ith position. Here \(\beta \in {\Delta }^{N}\) is a global vector of augmented syllable transition probabilities, and the hyperparameters γ, α and κ control the sparsity of states, the weight of the sparsity prior and the bias toward self-transitions, respectively.
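Sampling from the weak-limit sticky-HDP prior can be sketched as follows (a minimal NumPy implementation; the function name is ours): the global weights β are drawn once, and each row of the transition matrix is a Dirichlet draw with extra self-transition mass κ.

```python
import numpy as np

def sample_sticky_transitions(N, gamma, alpha, kappa, rng):
    """Weak-limit sticky-HDP prior: draw global state weights beta, then
    each transition-matrix row with extra self-transition mass kappa."""
    beta = rng.dirichlet(np.full(N, gamma / N))
    pi = np.zeros((N, N))
    for i in range(N):
        conc = alpha * beta.copy()
        conc[i] += kappa  # kappa added in the i-th position
        pi[i] = rng.dirichlet(conc)
    return beta, pi
```

Larger κ concentrates each row's mass on its diagonal entry, lengthening state durations on average.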
SLDS for postural dynamics
Keypoint coordinates reflect not only the pose of an animal, but also its location and heading. To disambiguate these factors, we define a canonical, egocentric reference frame in which the postural dynamics are modeled. The canonically aligned poses are then transformed into global coordinates using explicit centroid and heading variables that are learned by the model.
Concretely, let \({Y}_{\rm{t}}\in {{\mathbb{R}}}^{\rm{K\times D}}\) represent the coordinates of K keypoints at time t, where \(D\in \{2,3\}\). We define latent variables \(\bf{v}_{t}\in {{\mathbb{R}}}^{D}\) and \({h}_{{t}}\in \left[0,2\pi \right]\) to represent the animal’s centroid and heading angle. We assume that each heading angle h_{t} has an independent, uniform prior and that the centroid is autocorrelated as follows:
At each time point t, the pose Y_{t} is generated via rotation and translation of a centered and oriented pose \({\widetilde{Y}}_{{t}}\) that depends on the current continuous latent state x_{t}:
where R(h_{t}) is a matrix that rotates by angle h_{t} in the xy plane, and \(\varGamma \in {{\mathbb{R}}}^{K\times \left(K-1\right)}\) is defined by the truncated singular value decomposition \(\varGamma \Delta {\varGamma }^{{\rm{\top }}}={I}_{K}-{{\bf{1}}}_{K\times K}/K\). Note that Γ encodes a linear transformation that isometrically maps \({{\mathbb{R}}}^{\left(K-1\right)\times D}\) to the set of all centered keypoint arrangements in \({{\mathbb{R}}}^{K\times D}\), and thus ensures that \({\mathbb{E}}\left({\widetilde{Y}}_{{t}}\right)\) is always centered^{50}. The parameters \(C\in {{\mathbb{R}}}^{\left(K-1\right)D\times M}\) and \(\bf{d}\in {{\mathbb{R}}}^{\left(K-1\right)D}\) are initialized using PCA applied to the transformed keypoint coordinates \({\varGamma }^{\top }{\widetilde{Y}}_{{t}}\). In principle, C and d can be adjusted further during model fitting, and we describe the corresponding Gibbs updates in the inference section below. In practice, however, we keep C and d fixed to their initial values when fitting keypointMoSeq.
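The matrix Γ can be constructed directly from the truncated SVD of the centering matrix, as sketched below (the function name is ours). The test confirms the key properties: Γ is an isometry (Γ^⊤Γ = I), its columns sum to zero (outputs are centered) and ΓΓ^⊤ recovers the centering matrix.

```python
import numpy as np

def centering_projection(K):
    """Construct Gamma from the truncated SVD of the centering matrix
    I_K - 1_{KxK}/K, so that Gamma maps (K-1)-dimensional coefficients
    isometrically onto centered K-keypoint arrangements."""
    J = np.eye(K) - np.ones((K, K)) / K
    U, s, _ = np.linalg.svd(J)
    Gamma = U[:, :K - 1]  # drop the null direction (the constant vector)
    return Gamma
```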
Robust observations
To account for occasional large errors during keypoint tracking, we use the heavy-tailed Student’s t-distribution, which corresponds to a normal distribution whose variance is itself a random variable. Here, we instantiate the random variances explicitly as a product of two parameters: a baseline variance σ_{k} for each keypoint and a time-varying scale s_{t,k}. We assume \({\sigma }_{k}^{2}\sim {\chi }^{-2}\left({\nu }_{\sigma },{\sigma }_{0}^{2}\right)\) and \({s}_{t,k}\sim {\chi }^{-2}\left({\nu }_{s},{s}_{0,t,k}\right)\),
where ν_{σ} > 0 and ν_{s} > 0 are degrees of freedom, \({\sigma }_{0}^{2} > 0\) is a baseline scaling parameter, and \({s}_{0,t,k} > 0\) is a local scaling parameter, which encodes a prior on the scale of error for each keypoint on each frame. Where possible, we calculated the local scaling parameters as a function of the neural network confidences for each keypoint. The function was calibrated using the empirical relationship between confidence values and error sizes. The overall noise covariance S_{t} is generated from σ_{k} and s_{t,k} as follows:
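The scale-mixture construction can be sketched by sampling the scaled inverse chi-squared variances and mixing them into Gaussian noise. This is a minimal sketch for a single keypoint (all function names are ours), using the identity χ^{−2}(ν, τ^{2}) = ντ^{2}/χ^{2}_{ν}.

```python
import numpy as np

def scaled_inv_chi2(nu, tau2, size, rng):
    """Draw from the scaled inverse chi-squared distribution chi^-2(nu, tau2)."""
    return nu * tau2 / rng.chisquare(nu, size)

def t_noise(nu_sigma, sigma0_2, nu_s, s0, T, rng):
    """Heavy-tailed noise for one keypoint: a per-keypoint baseline variance
    times per-frame scales, mixed into Gaussian noise (a Student's t mixture)."""
    sigma2 = scaled_inv_chi2(nu_sigma, sigma0_2, 1, rng)[0]  # baseline variance
    s = scaled_inv_chi2(nu_s, s0, T, rng)                    # per-frame scales
    return rng.normal(0.0, np.sqrt(s * sigma2))
```

Occasional large draws of s_{t,k} produce the heavy tails that absorb tracking errors without distorting the dynamics.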
Related work
KeypointMoSeq extends the model used in depth MoSeq^{16}, where a low-dimensional pose trajectory x_{t} (derived from egocentrically aligned depth videos) is used to fit an ARHMM with a transition matrix π, autoregressive parameters A_{i}, b_{i} and Q_{i} and discrete states z_{t} like those described here. Indeed, conditional on x_{t}, the models for keypointMoSeq and depth MoSeq are identical. The main differences are that keypointMoSeq treats x_{t} as a latent variable (that is, updates it during fitting), includes explicit centroid and heading variables, and uses a robust noise model.
Disambiguating poses from position and heading is a common task in unsupervised behavior algorithms, and researchers have adopted a variety of approaches. VAME^{13}, for example, isolates pose by centering and aligning data ahead of time, whereas B-SOiD^{12} transforms the keypoint data into a vector of relative distances and angles. The statistical pose model GIMBAL^{29}, on the other hand, introduces latent heading and centroid variables that are inferred simultaneously with the rest of the model. KeypointMoSeq adopts this latter approach, which can remove spurious correlations between egocentric features that can arise from errors in keypoint localization.
Inference algorithm
Our full model contains latent variables v, h, x, z and s and parameters A, b, Q, C, d, σ, β and π. We fit each of these variables—except for C and d—using Gibbs sampling, in which each variable is iteratively resampled from its posterior distribution conditional on the current values of all the other variables. The posterior distributions P(π, β∣z) and P(A, b, Q∣z, x) are unchanged from the original MoSeq paper and will not be reproduced here (see ref. ^{16}, pages 42–44, and note the changes of notation Q → Σ, z → x and x → y). The Gibbs updates for variables C, d, σ, s, v and h are described below.
Resampling P(C, d∣s, σ, x, v, h, Y)
Let \({\tilde{\bf{x}}}_{\rm{t}}\) represent x_{t} with a 1 appended and define
The posterior update is \({\text{vec}}\left(C,{\bf{d}}\right)\sim {\mathscr{N}}\left({\mu }_{n},{\varSigma }_{n}\right)\) where
with
Resampling P(s∣C, d, σ, x, v, h, Y)
Each s_{t,k} is conditionally independent with posterior
Resampling P(σ∣C, d, s, x, v, h, Y)
Each σ_{k} is conditionally independent with posterior
where \({S}_{y}={\sum }_{t=1}^{T}{\Vert \varGamma {(C\bf{x}_{t}+\bf{d})}_{k}-{\tilde{Y}}_{t,k}\Vert }^{2}/{s}_{t,k}\)
Resampling P(v∣C, d, σ, s, x, h, Y)
Because the translations v_{1}, …, v_{T} form an LDS, they can be updated by Kalman sampling. The observation potentials have the form \({\mathscr{N}}\left(\bf{v}_{t}\mid \mu ,{\gamma }^{2}{I}_{D}\right)\) where
Resampling P(h∣C, d, σ, s, x, v, Y)
The posterior of h_{t} is the von Mises distribution \(\text{vM}\)(θ, κ), where κ and \(\theta \in \left[0,2\pi \right]\) are the unique parameters satisfying \(\left[\kappa \cos \left(\theta \right),\kappa \sin \left(\theta \right)\right]=\left[{S}_{1,1}+{S}_{2,2},{S}_{1,2}-{S}_{2,1}\right]\) for
Resampling P(x∣C, d, σ, s, v, h, Y)
To resample x, we first express its temporal dependencies as a firstorder autoregressive process, and then apply Kalman sampling. The change of variables is
Kalman sampling can then be applied to sample from the conditional distribution
(Assume x′ is leftpadded with zeros for negative time indices.)
Hyperparameters
We used the following hyperparameter values throughout the paper.
Transition matrix
Autoregressive process
Observation process
Centroid autocorrelation
Derivation of Gibbs updates
Derivation of C, d updates
To simplify notation, define
The likelihood of the centered and aligned keypoint locations \(\tilde{Y}\) can be expanded as follows
where
Multiplying by the prior \(\text{vec}\left(\tilde{C}\right){\mathscr{\sim }}{\mathscr{N}}\left(0,{\sigma }_{C}^{2}I\right)\) yields
where
Derivation of σ _{k}, s _{t,k} updates
For each time t and keypoint k, let \({\bar{Y}}_{t,k}=\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k}\). The likelihood of the centered and aligned keypoint location \({\tilde{Y}}_{t,k}\) is
We can then calculate the posteriors \(P\left({s}_{t,k}\mid {\sigma }_{k}\right)\) and \(P\left({\sigma }_{k}\mid {s}_{t,k}\right)\) as follows
where \({S}_{y}={\sum }_{t}{\Vert {\tilde{Y}}_{t,k}-{\bar{Y}}_{t,k}\Vert }^{2}/{s}_{t,k}\)
Derivation of v _{t} update
We assume an improper uniform prior on v_{t}, hence
where
Derivation of h _{t} update
We assume a proper uniform prior on h_{t}, hence
Let \(\left[\kappa \cos \left(\theta \right),\kappa \sin \left(\theta \right)\right]\) represent \(\left[{S}_{1,1}+{S}_{2,2},{S}_{1,2}-{S}_{2,1}\right]\) in polar coordinates. Then
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
This study used the following publicly available datasets: CalMS21 (https://data.caltech.edu/records/s0vdx0k302)^{32}; DeepEthogram benchmark data^{31} (https://github.com/jbohnslav/deepethogram/); Rat7M (https://doi.org/10.6084/m9.figshare.c.5295370.v3)^{51}; and fly keypoint tracking (https://doi.org/10.1038/s41592-018-0234-5). Other raw data generated in this study have been deposited in Zenodo (https://doi.org/10.5281/zenodo.10636983)^{52}. The thermistor recordings generated for this study are not publicly available at this time as they are being used for a follow-up paper. We plan to make these data publicly accessible upon publication of the follow-up study and in the meantime will provide them upon reasonable request.
Code availability
Software links and user support for both depth and keypoint data are available at http://www.moseq4all.org/. Data loading, project configuration and visualization are enabled through the keypointmoseq^{53} Python library (https://github.com/dattalab/keypointmoseq/). We also developed a standalone library called jaxmoseq^{54} for core model inference (https://github.com/dattalab/jaxmoseq/). Both libraries are freely available to the research community under an academic and noncommercial research use license. This license permits free academic and noncommercial use, explicitly prohibits redistribution and commercial use, and requires users to agree to terms including limitations on liability and indemnity. Full license details can be viewed on the respective GitHub repository pages.
References
Tinbergen, N. The Study of Instinct (Clarendon Press, 1951).
Dawkins, R. In Growing Points in Ethology (Bateson, P. P. G. & Hinde, R. A. eds.) Chap 1 (Cambridge University Press, 1976).
Baerends, G. P. The functional organization of behaviour. Anim. Behav. 24, 726–738 (1976).
Pereira, T. D. et al. SLEAP: a deep learning system for multianimal pose tracking. Nat. Methods 19, 486–495 (2022).
Mathis, A. et al. DeepLabCut: markerless pose estimation of userdefined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018).
Sun, J. J. et al. Selfsupervised keypoint discovery in behavioral videos. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2022, 2161–2170 (2022).
Graving, J. M. et al. DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife 8, e47994 (2019).
Mathis, A., Schneider, S., Lauer, J. & Mathis, M. W. A primer on motion capture with deep learning: principles, pitfalls, and perspectives. Neuron 108, 44–65 (2020).
Datta, S. R., Anderson, D. J., Branson, K., Perona, P. & Leifer, A. Computational neuroethology: a call to action. Neuron 104, 11–24 (2019).
Anderson, D. J. & Perona, P. Toward a science of computational ethology. Neuron 84, 18–31 (2014).
Pereira, T. D., Shaevitz, J. W. & Murthy, M. Quantifying behavior to understand the brain. Nat. Neurosci. 23, 1537–1549 (2020).
Hsu, A. I. & Yttri, E. A. B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors. Nat. Commun. 12, 5188 (2021).
Luxem, K. et al. Identifying behavioral structure from deep variational embeddings of animal motion. Commun. Biol. 5, 1267 (2022).
Marques, J. C., Lackner, S., Félix, R. & Orger, M. B. Structure of the zebrafish locomotor repertoire revealed with unsupervised behavioral clustering. Curr. Biol. 28, 181–195 (2018).
Todd, J. G., Kain, J. S. & de Bivort, B. L. Systematic exploration of unsupervised methods for mapping behavior. Phys. Biol. 14, 015002 (2017).
Wiltschko, A. B. et al. Mapping subsecond structure in mouse behavior. Neuron 88, 1121–1135 (2015).
Berman, G. J., Choi, D. M., Bialek, W. & Shaevitz, J. W. Mapping the stereotyped behaviour of freely moving fruit flies. J. R. Soc. Interface https://doi.org/10.1098/rsif.2014.0672 (2014).
Batty, E. et al. BehaveNet: nonlinear embedding and Bayesian neural decoding of behavioral videos. in Advances in Neural Information Processing Systems 32 (eds H. Larochelle et al.) 15706–15717 (Curran Associates, 2019).
Costacurta, J. C. et al. Distinguishing discrete and continuous behavioral variability using warped autoregressive HMMs. in Advances in Neural Information Processing Systems 35 (eds S. Koyejo et al.) 23838–23850 (Curran Associates, 2022).
Jia, Y. et al. Selfee, selfsupervised features extraction of animal behaviors. eLife 11, e76218 (2022).
Findley, T. M. et al. Sniffsynchronized, gradientguided olfactory search by freely moving mice. eLife 10, e58523 (2021).
Markowitz, J. E. et al. Spontaneous behaviour is structured by reinforcement without explicit reward. Nature 614, 108–117 (2023).
Markowitz, J. E. et al. The striatum organizes 3D behavior via momenttomoment action selection. Cell 174, 44–58 (2018).
Wiltschko, A. B. et al. Revealing the structure of pharmacobehavioral space through motion sequencing. Nat. Neurosci. https://doi.org/10.1038/s41593-020-00706-3 (2020).
Lin, S. et al. Characterizing the structure of mouse behavior using motion sequencing. Preprint at https://arxiv.org/abs/2211.08497 (2022).
Wu, A. et al. Deep Graph Pose: a semisupervised deep graphical model for improved animal pose tracking. in Proceedings of the 34th International Conference on Neural Information Processing Systems (Curran Associates, 2020).
Murphy, K. P. Machine Learning (MIT Press, 2012).
Linderman, S. et al. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics Vol. 54 (eds Aarti, S. et al.) 914–922 (PMLR, Proceedings of Machine Learning Research, 2017).
Zhang, L., Dunn, T., Marshall, J., Olveczky, B. & Linderman, S. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics Vol. 130 (eds Banerjee Arindam & Fukumizu Kenji) 2800–2808 (PMLR, Proceedings of Machine Learning Research, 2021).
Klibaite, U. et al. Deep phenotyping reveals movement phenotypes in mouse neurodevelopmental models. Mol. Autism 13, 12 (2022).
Bohnslav, J. P. et al. DeepEthogram, a machine learning pipeline for supervised behavior classification from raw pixels. eLife 10, e63377 (2021).
Sun, J. J. et al. Caltech mouse social interactions (CalMS21) dataset. https://doi.org/10.22002/D1.1991 (2021).
Ye, S., Mathis, A. & Mathis, M. W. Panoptic animal pose estimators are zeroshot performers. Preprint at https://arxiv.org/abs/2203.07436 (2022).
Marshall, J. D. et al. Continuous wholebody 3D kinematic recordings across the rodent behavioral repertoire. Neuron 109, 420–437 (2021).
Moore, J. D. et al. Hierarchy of orofacial rhythms revealed through whisking and breathing. Nature 497, 205–210 (2013).
Kurnikova, A., Moore, J. D., Liao, S. M., Deschênes, M. & Kleinfeld, D. Coordination of orofacial motor actions into exploratory behavior by rat. Curr. Biol. 27, 688–696 (2017).
McAfee, S. S. et al. Minimally invasive highly precise monitoring of respiratory rhythm in the mouse using an epithelial temperature probe. J. Neurosci. Methods 263, 89–94 (2016).
DeAngelis, B. D., Zavatone-Veth, J. A. & Clark, D. A. The manifold structure of limb coordination in walking Drosophila. eLife https://doi.org/10.7554/eLife.46409 (2019).
Pereira, T. D. et al. Fast animal pose estimation using deep neural networks. Nat. Methods 16, 117–125 (2019).
Biderman, D. et al. Lightning Pose: improved animal pose estimation via semi-supervised learning, Bayesian ensembling, and cloud-native open-source tools. Preprint at bioRxiv https://doi.org/10.1101/2023.04.28.538703 (2023).
Batty, E. et al. In NeurIPS vol. 32 (eds H. Wallach et al.) (Curran Associates, 2019).
Berman, G. J., Bialek, W. & Shaevitz, J. W. Predictability and hierarchy in Drosophila behavior. Proc. Natl Acad. Sci. USA 113, 11943–11948 (2016).
Berman, G. J. Measuring behavior across scales. BMC Biol. 16, 23 (2018).
Zhou, Z. et al. UNet++: a nested U-Net architecture for medical image segmentation. in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. DLMIA MLCDS 2018 (eds Stoyanov, D. et al.) Lecture Notes in Computer Science, vol. 11045, 3–11 (Springer International Publishing, 2018). https://doi.org/10.1007/978-3-030-00889-5_1
Sun, K., Xiao, B., Liu, D. & Wang, J. Deep highresolution representation learning for human pose estimation. in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5686–5696 (2019).
Nath, T. et al. Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nat. Protoc. 14, 2152–2176 (2019).
Ye, S. et al. SuperAnimal pretrained pose estimation models for behavioral analysis. Preprint at https://arxiv.org/abs/2203.07436 (2023).
Ackerson, G. A. & Fu, K.S. On state estimation in switching environments. IEEE Trans. Autom. Control. 15, 10–17 (1970).
Fox, E. B., Sudderth, E. B., Jordan, M. I. & Willsky, A. S. A sticky HDPHMM with application to speaker diarization. Ann. Appl. Stat. 5, 1020–1056 (2009).
Andreella, A. & Finos, L. Procrustes analysis for highdimensional data. Psychometrika 87, 1422–1438 (2022).
Marshall, J. D. et al. Rat 7M. figshare https://doi.org/10.6084/m9.figshare.c.5295370.v3 (2021).
Weinreb, C. et al. KeypointMoSeq: parsing behavior by linking point tracking to pose dynamics. Zenodo https://doi.org/10.5281/zenodo.10636983 (2024).
Weinreb, C. et al. dattalab/keypointmoseq: Keypoint MoSeq 0.4.3. Zenodo https://doi.org/10.5281/zenodo.10524840 (2024).
Weinreb, C. et al. dattalab/jaxmoseq: JAX MoSeq 0.2.1. Zenodo https://doi.org/10.5281/zenodo.10403244 (2023).
Acknowledgements
S.R.D. is supported by National Institutes of Health (NIH) grants RF1AG073625, R01NS114020 and U24NS109520, the Simons Foundation Autism Research Initiative and the Simons Collaboration on Plasticity and the Aging Brain. S.R.D. and S.W.L. are supported by NIH grant U19NS113201 and the Simons Collaboration on the Global Brain. C.W. is a Fellow of the Jane Coffin Childs Memorial Fund for Medical Research. W.F.G. is supported by NIH grant F31NS113385. M.J. is supported by NIH grant F31NS122155. S.W.L. is supported by the Alfred P. Sloan Foundation. T.P. is supported by a Salk Collaboration Grant. We thank J. Araki for administrative support; the HMS Research Instrumentation Core, which is supported by the Bertarelli Program in Translational Neuroscience and Neuroengineering and by NEI grant EY012196; and members of the laboratory of S.R.D. for useful comments on the paper. Portions of this research were conducted on the O2 High Performance Compute Cluster at Harvard Medical School. Mouse illustrations were downloaded from https://www.scidraw.io/.
Author information
Authors and Affiliations
Contributions
C.W. and S.R.D. conceived the project and designed the experiments. C.W. and S.W.L. designed the algorithm. C.W. implemented the algorithm with contributions from S.L., M.A.M.O., L.Z. and T.P. C.W. and J.P. collected data and S.M., W.F.G., M.J., S.A., E.C. and R.H. assisted. C.W., J.P., M.A.M.O., Y.S., A.M., M.W.M. and T.P. performed analyses. C.W. and S.R.D. wrote the manuscript with input from all authors. S.R.D. supervised the project.
Corresponding authors
Ethics declarations
Competing interests
S.R.D. sits on the scientific advisory boards of Neumora and Gilgamesh Therapeutics, which have licensed or sublicensed the MoSeq technology. The other authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Matthew Smear and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Nina Vogt, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Markerless pose tracking exhibits fast fluctuations that are independent of behavior yet affect MoSeq output.
a) Example of a 5-second interval during which the mouse is still yet the keypoint coordinates fluctuate, as shown in Fig. 1e, but here for SLEAP and DeepLabCut, respectively. Left: egocentrically aligned keypoint trajectories. Right: path traced by each keypoint during the 5-second interval. b) Cross-correlation between the spectral content of keypoint fluctuations and either error magnitude (left) or a measure of low-confidence keypoint detections (right). c) Magnitude of fast fluctuations in keypoint position for three different tracking methods, calculated as the per-frame distance from the detected trajectory of a keypoint to a smoothed version of the same trajectory, where smoothing was performed using a Gaussian kernel with width 100 ms (N=4 million keypoint detections). d) Inter-annotator variability, shown as the distribution of distances between multiple annotations of the same keypoint. Annotations were either crowdsourced or obtained from experts (N=200 frames and N=4 labelers). e) Train and test error distributions for each keypoint tracking method (N=800 held-out keypoint annotations). f) Top: position of the nose and tail base over a 10-second interval, shown for both the overhead and below-floor cameras. Bottom: fast fluctuations in each coordinate, obtained as residuals after median filtering. g) Cross-correlation between spectrograms obtained from two different camera angles for either the tail base or the nose, shown for each tracking method. h) Cross-correlation of transition rates, comparing MoSeq applied to depth and MoSeq applied to keypoints with various levels of smoothing using a low-pass, Gaussian or median filter (N=1 model fit per filtering parameter).
Extended Data Fig. 2 Keypoint-MoSeq is robust to noise and missing data.
a) Mean change score values at syllable transitions. Syllables were either derived from keypoint-MoSeq applied to (unfiltered) keypoints from our custom neural network, or from traditional MoSeq applied to several versions of the keypoint data, including keypoints inferred from Lightning Pose, or keypoints from our custom neural network followed by low-pass filtering, median filtering, or no filtering. Error bars show standard deviation across N=20 model fits. The change scores are highest for keypoint-MoSeq (P < 10^{−4} over N=20 model fits, Mann-Whitney U test). b) Correlations of transition probabilities (that is, the probability of a new syllable starting at each frame), comparing depth MoSeq with each of the keypoint models shown in (a). c) Example of model responses to a one-second-long ablation of keypoint observations, shown for keypoint-MoSeq (right) and traditional ARHMM-based MoSeq (left). Top: change in syllable sequences. Each heatmap row represents an independent modeling run and each column represents a frame. The set of labels on each frame defines a distribution, and the Kullback-Leibler divergence (KL div.) between the ablated and unablated distributions is plotted below. Bottom: change in low-dimensional pose state. Estimated pose trajectories derived from unablated (black) or ablated (blue) data. Each dimension of the latent pose space is plotted separately. Lines reflect the mean across modeling runs. d) Cross-correlation of transition probabilities for ablated vs. unablated data (computed over frames that were included in an ablation), shown for keypoint-MoSeq (red) and traditional ARHMM-based MoSeq (blue). Shading shows bootstrap 95% confidence intervals for N=20 model fits. Solid line shows cross-correlation using all N=20 models (without bootstrapping). e) Mean Kullback-Leibler divergence [as described in (c)] across all ablation intervals, stratified by number of ablated keypoints (left) or duration of the ablation (right).
Shading represents the 99% confidence interval of the mean. f) Mean distance between pose states estimated from ablated vs. unablated data, with colors and shading as in (e). g) Syllable cross-likelihoods, defined as the probability, on average, that time intervals assigned to one syllable (column) could have arisen from another syllable (row). Cross-likelihoods were calculated for keypoint-MoSeq and for depth MoSeq. The results for both methods are plotted twice, using either an absolute scale (left) or a log scale (right). h) Modeling results for synthetic keypoint data with a similar statistical structure to the real data but lacking changepoints. Left: example of synthetic keypoint trajectories. Middle: autocorrelation of keypoint coordinates for real vs. synthetic data, showing similar dynamics at short timescales. Right: distribution of syllable frequencies for keypoint-MoSeq models trained on real vs. synthetic data.
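The per-frame KL divergence in (c) — treating the syllable labels assigned by an ensemble of model runs at each frame as a distribution — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the small `eps` smoothing term is an assumption added to keep the divergence finite when a label appears in only one ensemble:

```python
import numpy as np

def frame_kl(labels_a, labels_b, n_syllables, eps=1e-6):
    """KL divergence between the per-frame syllable distributions
    implied by two ensembles of model runs (sketch of Ext. Data Fig. 2c).

    labels_a, labels_b : (n_runs, T) integer label arrays.
    Returns a (T,) array of KL(a || b) values.
    """
    def frame_probs(labels):
        # one-hot encode, average over runs to get per-frame label frequencies
        one_hot = np.eye(n_syllables)[labels]          # (n_runs, T, S)
        p = one_hot.mean(axis=0) + eps                 # (T, S), smoothed
        return p / p.sum(axis=1, keepdims=True)
    p, q = frame_probs(labels_a), frame_probs(labels_b)
    return (p * np.log(p / q)).sum(axis=1)
```

Identical ensembles give zero divergence at every frame; frames where the ablated and unablated runs disagree stand out as peaks.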
Extended Data Fig. 3 Convergence and model selection.
a) Probabilistic graphical model (PGM) for keypoint-MoSeq highlighting the discrete syllable state. b) Number of syllables identified by keypoint-MoSeq as a function of fitting iteration, shown for multiple independent runs of fitting (referred to as ‘chains’). c) Confusion matrices depicting closer agreement between syllables from the same chain at different stages of fitting (left) compared to syllables from different chains at the final stage of fitting (right). d) Distributions of syllable sequence similarity [quantified by normalized mutual information (NMI)], either within chains at different iterations (N=20) or across chains (N=190). e) PGM highlighting the pose state. f) Left: within- and between-chain variation in pose state, shown for each dimension of pose (rows) across an example 10-second interval. Gray lines represent the variation across fitting iterations within each chain, and black lines represent the total variation across chains and fitting iterations. Right: zoom-in on a 2-second interval showing close agreement in the final pose trajectory learned by each chain. g) Distribution of the Gelman-Rubin statistic (ratio of within-chain variance to total variance) across timepoints and dimensions of the pose state. h) Expected marginal likelihood (EML) scores (defined as a mean over marginal likelihoods) for the final model parameters learned by each chain. Vertical bars represent standard error based on N=20 chains. i) The scores shown in (h) correlate with the mean NMI for each model, which is low when a model’s syllable sequences are dissimilar from those of other models (P=0.005, Pearson test). j) EML scores are higher for models fit with an autoregressive-only (AR-only) initialization stage (left) compared to those without (right; P = 0.004, N=20 fits for each method, Mann-Whitney U test). Plotted as in (h). k) EML scores (bottom) plateau within 500 iterations of Gibbs sampling and have a similar trajectory to the model log joint probability (top).
Black lines represent the median across N=20 chains and shaded regions represent the interquartile interval. l) Illustration of uncertainty in the syllable sequence given a fixed set of syllable definitions. Top: syllable sequences derived from Gibbs sampling (conditioning on fixed autoregressive parameters and transition probabilities), shown for an example 10-second window. Bottom: per-frame marginal probability estimates for each syllable. Each line is one syllable, with colors as in the heatmap above.
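The convergence diagnostic in (g) is described as the ratio of within-chain variance to total variance, which a minimal sketch can compute directly. Note that this follows the legend's definition rather than the more common Gelman-Rubin R-hat formula (which rescales between- and within-chain variance differently); all names here are illustrative:

```python
import numpy as np

def within_to_total_variance(chains):
    """Ratio of within-chain variance to total variance for one scalar
    quantity (e.g. one pose dimension at one timepoint), as defined in
    the legend of Ext. Data Fig. 3g.

    chains : (n_chains, n_samples) array of posterior samples.
    Values near 1 suggest the chains have mixed; values near 0
    indicate chains stuck at different solutions.
    """
    within = chains.var(axis=1, ddof=1).mean()   # average variance inside each chain
    total = chains.reshape(-1).var(ddof=1)       # variance pooling all chains together
    return within / total
```

When chains sample the same posterior mode, within-chain spread accounts for essentially all of the pooled spread and the ratio approaches 1.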
Extended Data Fig. 4 Behaviors captured by keypoint-MoSeq syllables.
a) Average pose trajectories for syllables identified by keypoint-MoSeq. Each trajectory includes ten evenly timed poses from 165 ms before to 500 ms after syllable onset. b) Kinematic and morphological parameters for each syllable. Left: average values of five parameters (rows) for each syllable (column). Middle: mean and interquartile range of each parameter for one example syllable. Right: cartoons illustrating the computation of the three morphological parameters.
Extended Data Fig. 5 Method-to-method differences in sensitivity to behavioral changepoints are robust to parameter settings.
a) Output of unsupervised behavior segmentation algorithms across a range of parameter settings, applied to 2D keypoint data from two different camera angles (N=1 model fit per parameter set). The median state duration (left) and the average (z-scored) keypoint change score aligned to state transitions (right) are shown for each method and parameter value. Gray pointers indicate default parameter values used for subsequent analysis (see Supplementary Table 3 for a summary of parameters). b) Distributions showing the number of transitions that occur during each rear. c) Accuracy of kinematic decoding models that were fit to state sequences from each method.
Extended Data Fig. 6 Accelerometry reveals kinematic transitions at the onsets of keypoint-MoSeq states.
a) IMU signals aligned to state onsets from several behavior segmentation methods. Each row corresponds to a behavior state and shows the average across all onset times for that state. A single model fit is shown for each method.
Extended Data Fig. 7 Striatal dopamine fluctuations are enriched at keypoint-MoSeq syllable onsets.
a) Derivative of the dopamine signal aligned to the onsets of high-velocity or low-velocity behavior states. States from each method were divided evenly into high- and low-velocity groups based on the mean centroid velocity during their respective frames. Plots show the mean and inter-95% range across N=20 model fits. b) Distributions capturing the average absolute value of the dopamine signal across states from each method. c) Relationship between state durations and the correlations from Fig. 5f. d) Average dopamine fluctuations aligned to state onsets (left) or aligned to random frames throughout the execution of each state (middle), as well as the absolute difference between the two alignment approaches (right), shown for each unsupervised behavior segmentation approach.
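The onset-aligned averages in (a) and (d) are event-triggered averages: the signal is cut into windows around each state-onset frame and the windows are averaged. A minimal sketch, with an assumed ±30-frame window and hypothetical function names:

```python
import numpy as np

def onset_triggered_average(signal, onsets, window=(-30, 30)):
    """Average a 1-D signal (e.g. the derivative of a dopamine trace)
    in a window around each state-onset frame.

    signal : (T,) array; onsets : iterable of frame indices.
    Returns (offsets, mean_trace), where offsets[i] gives the frame
    offset relative to onset for mean_trace[i].
    """
    lo, hi = window
    offsets = np.arange(lo, hi)
    # keep only onsets whose full window fits inside the recording
    snippets = [signal[t + lo: t + hi] for t in onsets
                if t + lo >= 0 and t + hi <= len(signal)]
    return offsets, np.stack(snippets).mean(axis=0)
```

Aligning to random frames within each state (as in panel d, middle) uses the same routine with randomly drawn indices in place of `onsets`, so the difference between the two traces isolates what is specific to onsets.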
Extended Data Fig. 8 Changes in behavior caused by environmental enrichment.
a) Example frames from conventional 2D videos of the empty bin (left) and enriched environment (middle), as well as depth video of the enriched environment (right). b) Graph showing changes in syllable-to-syllable transition statistics across environments. Edge color and width indicate the sign and magnitude of the change in frequency of each syllable pair. c) Right: changes in syllable frequency across environments, with stars indicating significant differences (P < 0.05, N=16, Mann-Whitney U test). Error bars show the standard error of the mean. Left: syllable groupings defined by clustering of the transition graph shown in (b).
Extended Data Fig. 9 Supervised behavior benchmark.
a) Distribution of state durations from each behavior segmentation method for the open field benchmark (top) and the CalMS21 social behavior benchmark (bottom). b) Three different similarity measures applied to the output of each unsupervised behavior analysis method, showing the median (gray bars) and interquartile interval (black lines) across independent model fits (N=20; *P < 10^{−5} for keypoint-MoSeq vs. each other method, Mann-Whitney U test). c) Number of unsupervised states specific to each human-annotated behavior in the CalMS21 dataset, shown for 20 independent fits of each unsupervised method. A state was defined as specific if > 50% of its frames bore the annotation. d) Left: keypoints tracked in 2D (top) or 3D (bottom) and corresponding egocentric coordinate axes. Right: example keypoint trajectories and transition probabilities from keypoint-MoSeq. Transition probability is defined, for each frame, as the probability of a syllable transition occurring on that frame. e) Cumulative fraction of explained variance for an increasing number of principal components (PCs). PCs were fit to egocentrically aligned 2D keypoints, egocentrically aligned 3D keypoints, or depth videos respectively. f) Cross-correlation between the 3D keypoint change score and change scores derived from 2D keypoints and depth respectively (based on N=20 model fits).
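The specificity criterion in (c) — a state counts as specific to an annotation when more than half of its frames carry that annotation — reduces to a simple per-state frame count. A minimal sketch with illustrative names:

```python
import numpy as np

def specific_states(state_seq, annot_seq, annotation):
    """Unsupervised states 'specific' to a human annotation: those for
    which > 50% of the state's frames bear the annotation, per the
    criterion in Ext. Data Fig. 9c.

    state_seq, annot_seq : (T,) arrays of per-frame labels.
    Returns the list of specific state labels.
    """
    specific = []
    for s in np.unique(state_seq):
        frames = annot_seq[state_seq == s]          # annotations during state s
        if (frames == annotation).mean() > 0.5:     # strict majority of frames
            specific.append(s)
    return specific
```

Counting these specific states per annotation, across the 20 fits of each method, yields the distributions compared in the panel.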
Extended Data Fig. 10 Keypoint-MoSeq identifies behavioral motifs across timescales.
a-b) Alignment of mouse behavior motifs to respiration. Figure created with SciDraw under a CC BY 4.0 license. a) Left: keypoints used for model fitting. Middle: median motif durations for models fit with a range of stickiness hyperparameters. Right: proportion of significantly respiration-aligned motifs, stratified by stickiness hyperparameter, showing the mean and standard deviation across N=5 model fits. b) As (a), but restricted to the upper spine, neck, head, and nose keypoints. c-h) Keypoint-MoSeq partitions fly behavior across timescales. c) Fly keypoints used for fitting keypoint-MoSeq and MotionMapper. d) Motif durations (left) and number of motifs (right) for models trained with a range of target timescales. Ten separate models were fit for each timescale. For motif durations, we pooled the duration distributions from all 20 models and plotted the median duration in black and the interquartile range in gray. For motif number, we counted the number of motifs with frequency above 0.5% for each of the 20 models and plotted the mean of this count in black and the standard deviation in gray. e) Density of points in the 2D ‘behavior space’ generated by MotionMapper. Each white-line-delimited region corresponds to a MotionMapper state label. f) Confusion matrices showing the frequency of each MotionMapper state during each keypoint-MoSeq motif. g) Example of swing and stance annotations over a 600-ms window. Lines show the egocentric coordinate of each leg tip (anterior-posterior axis only). Gray shading denotes the swing phase, defined as the interval of posterior-to-anterior limb motion. h) Cross-correlation between the spectrograms of keypoints and motif labels respectively. Heatmap rows correspond to frequency bands of the spectrograms and columns correspond to models with different target timescales.
Supplementary information
Supplementary Table 1
Number of model fits for each dataset. Each row corresponds to a particular analysis (defined by a dataset and the figure(s) in which the analysis is shown) and shows the number of times each method was applied to the dataset in the context of that analysis.
Supplementary Table 2
Parameters for fitting keypoint-MoSeq. Number of PCs and values of the stickiness hyperparameter used for each dataset. The table excludes datasets and/or analyses where the stickiness was scanned over.
Supplementary Table 3
Parameter combinations tested in Extended Data Fig. 5. Each inscribed box corresponds to one parameter scan. Rows highlighted in orange represent parameters shown in Fig. 3 and used throughout the rest of the paper. Some methods have several highlighted rows because the same parameter combination arose in multiple parameter scans.
Supplementary Video 1
Example output from three different tracking methods, illustrating noise in keypoint detections.
Supplementary Video 2
Average pose trajectories for keypoint-MoSeq syllables derived from 2D open field recordings. Each trajectory includes ten evenly timed poses from 165 ms before to 500 ms after syllable onset.
Supplementary Video 3
Example syllable instances from the 2D open field dataset. For each syllable, the video shows an array of randomly selected examples. A white dot appears in each example at the moment of syllable onset and disappears when the syllable is over.
Supplementary Video 4
Average pose trajectories for keypoint-MoSeq syllables derived from 3D open field recordings, showing an xy projection on the left and an xz projection on the right. Each trajectory includes ten evenly timed poses from 165 ms before to 500 ms after syllable onset.
Supplementary Video 5
Average pose trajectories for keypoint-MoSeq syllables derived from rat motion capture data, shown in the same format as Supplementary Video 4.
Supplementary Video 6
Example fly syllables, shown in the same format as Supplementary Video 3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Weinreb, C., Pearl, J.E., Lin, S. et al. Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics. Nat Methods 21, 1329–1339 (2024). https://doi.org/10.1038/s41592-024-02318-2
DOI: https://doi.org/10.1038/s41592-024-02318-2