SuperAnimal pretrained pose estimation models for behavioral analysis

Quantification of behavior is critical in diverse applications, from neuroscience and veterinary medicine to animal conservation. A key step in behavioral analysis is first extracting relevant keypoints on animals, known as pose estimation. However, reliable inference of poses currently requires domain knowledge and manual labeling effort to build supervised models. We present SuperAnimal, a method to develop unified foundation models that can be used on over 45 species, without additional manual labels. These models show excellent performance across six pose estimation benchmarks. We demonstrate how to fine-tune the models (if needed) on differently labeled data and provide tooling for unsupervised video adaptation to boost performance and decrease jitter across frames. If fine-tuned, SuperAnimal models are 10–100× more data efficient than prior transfer-learning-based approaches. We illustrate the utility of our models in behavioral classification and kinematic analysis. Collectively, we present a data-efficient solution for animal pose estimation.

A key part of many behavioral analysis pipelines is animal pose estimation, yet this requires domain knowledge and labeling efforts to obtain reliable pose models (2,3,7,8). Open-source pose estimation software, such as DeepLabCut (9,10) and other tools (11-14), also reviewed in (7), have gained popularity in the research community interested in understanding animal behavior. Compared to commercial solutions constrained to fixed cage and camera settings (15), DeepLabCut offers flexibility to train customized pose models of various animals in diverse settings. Notably, it requires few human-labeled images (around 100-800) to train a typical lab animal pose estimator that matches human-level accuracy (9,10) due to its transfer-learning abilities (9,16).
However, regardless of the data efficiency of current solutions, their flexibility still comes at the cost of requiring users to label data if they want to define keypoints and then train deep neural networks, an effort that is often duplicated across labs. A solution is to build generalized, foundation-like models (17) for common model organisms across labs and in-the-wild settings (discussed in (7)). Such models, once trained, can be used across labs and settings without further training, or with little fine-tuning.
To provide the research community with easy access to such high-performance models, we present a new panoptic paradigm, which we call the SuperAnimal method, for building pre-trained pose models that can be used (and fine-tuned if needed) across many species, environments, and video sizes, together with the ability to perform video adaptation (Figure 1a). SuperAnimal combines diverse datasets into two broad unified pose models that cover over 45 species of mammals with 27-39 keypoints. In brief, our new approach allows for merging and training diverse, differently labeled datasets. We developed an optimal keypoint matching algorithm to automatically align out-of-distribution datasets with our models. Then, at inference time, to minimize domain shifts, we developed a spatial-pyramid search method to account for changes in animal size (or use a top-down detector). We also developed a rapid, unsupervised video-adaptation method that uses pseudo-labeling to minimize temporal jitter in videos and allows users to fine-tune on videos without any data labeling.
We developed models based on state-of-the-art convolutional neural networks (CNNs), such as HRNet (18) and DLCRNet (10), and introduce AnimalTokenPose, which uses transformers (19-21). We show that the resulting models have excellent zero-shot performance (i.e., with no additional training, tested on new data), and our approach outperforms ImageNet pretraining on five benchmarks. If users then want to use these new weights for fine-tuning, we show they are 10× more data efficient, and our video adaptation method allows for smooth, refined videos that can be used in behavioral analysis pipelines.

Results
The SuperAnimal method comprises generalized data conversion, training with keypoint gradient masking and memory replay, a keypoint matching algorithm, and the ability to fine-tune with video adaptation (Figure 1a), which will be explained below. Collectively, this is a formulation that treats diverse pose datasets as if they collectively formed one single super-pose template. This effectively allowed us to overcome a major challenge with combining datasets that are
not identically labeled across labs or datasets, as is often the case even for the same species (Figure 1b, c and Extended Data Figure S1a). SuperAnimal enables multi-dataset training, allowing the model to receive richer learning signals (Figure 1c, d), resulting in a model with "pose priors" (whereas ImageNet pre-training, the current state of the art in animal pose (10,16,22), has no pose-specific features).
In order to demonstrate the strength of our SuperAnimal method, we present two datasets that cover over 45 species: TopViewMouse-5K and Quadruped-40K, which are built from over 45,000 images sourced from diverse laboratory settings and in-the-wild data (Extended Data Figure S1a, b). First, we used a new generalized data converter (see Methods) to unify the annotation space of those datasets and named the first dataset TopViewMouse-5K (as it contains approximately 5k images). Specifically, we merged 13 overhead-camera-viewpoint lab mice datasets from across the research community (9,10,15,23,24) (see Methods) and from our own experiments (Figure 1e, h). Similarly, we collected side-view quadruped datasets (16, 25-30), including a new annotated rodent dataset with images sourced from iNaturalist (see Methods), to form Quadruped-40K (Extended Data Figure S1b). However, below we leave out all benchmark datasets in order to show performance of the model on unseen data. Our released weights are trained on all available data described above (Extended Data Figure S1b).
The SuperAnimal method is a series of solutions that show excellent generalization on unseen images and videos (Figure 1a-e, Tables S1, S2). For training, we developed keypoint gradient masking (Extended Data Figure S1c, d) to train neural networks across disjoint datasets without penalizing "missing" ground-truth data from the superset of keypoints (Figure 1b). Of note, transfer learning typically involves fine-tuning a pre-trained encoder but using a randomly initialized decoder on the downstream dataset (9). In contrast, we fine-tuned both the pre-trained encoder and decoder (Figure 1c). Additionally, inspired by the excellent zero-shot inference of pretrained models (31) and by continual learning (32), we developed a tailored fine-tuning approach that combines zero-shot inference and few-shot learning, which we call "memory replay" (Figure 1d). This allowed us to combine datasets and extract all keypoints independently of how many were labeled in any one dataset (Figure 1d). We also developed a keypoint matching algorithm (Methods and Extended Data Figure S2a, b) to help minimize the mismatch caused by annotator bias in the ground-truth datasets (see Suppl. Note).

SuperAnimal-TopViewMouse
To evaluate our models we tested performance "within distribution", also known as "independent and identically distributed" (IID), and on images considered "out-of-distribution" (OOD). IID images are similar in appearance, but not identical, to those used in training. OOD data stems from images that were never included in training and differ from the training data (33).
One use case of the new models we provide is to run video inference without any additional training, called "zero-shot". Therefore, to test performance, we built a SuperAnimal-TopViewMouse model that did not contain data from the DLC-Openfield (9) or related TriMouse dataset (9,10). Collectively, we find that the SuperAnimal methods are critical to avoid catastrophic forgetting (see Suppl. Video 1), and show excellent zero-shot performance (Figure 1e) on the DLC-Openfield benchmark. SuperAnimal-TopViewMouse performed well within distribution (IID) and OOD across diverse camera and cage settings (Figure 1g, h).
Zero-shot SuperAnimal-TopViewMouse shows an RMSE of 14.31 pixels on the DLC-Openfield dataset, where the average mouse's nose width is approximately 10 pixels (9) (Figure 1e, f). Namely, we found that without any labeling we could outperform ImageNet-based transfer learning (Figure 1e; mixed-effects model; in the low-data regime d=3.06 [1.99, 4.13]; p<.0001; see Tables S4-S12). A user would need to label 30-700 frames to achieve the same performance as our SuperAnimal pre-trained models without any data labeling (see Figure 1e, where zero-shot performance intersects with either SuperAnimal fine-tuning or ImageNet fine-tuning; Extended Data S3b). This also effectively removes the time needed for labeling and training models (which typically comprises several hours (34)). Note that the performance of zero-shot inference is likely underestimated by annotator bias (see Suppl. Note). We also show our method is not limited to convolutional neural networks (CNNs), but can be used with transformers (Figure 1h, and see Methods), which, we find, have especially good zero-shot performance and can outperform CNNs.
Depending on the zero-shot performance, one might still want to label their own frames to increase performance. We demonstrate that our SuperAnimal weights are better starting points than ImageNet weights. When we fine-tuned with labeled images and evaluated performance (called few-shot), SuperAnimal-TopViewMouse pre-trained models significantly outperformed ImageNet pre-trained models by a 10× data-efficiency factor and a large margin in performance (Figure 1e, Suppl. Tables). We found that both keypoint gradient masking and memory replay boosted performance (Figure 1e-h, Extended Data Figure S1, Suppl. Videos 1 & 2), as keypoints that were not annotated would otherwise cause a false penalty and therefore degrade performance (see Methods). This combined encoder-decoder fine-tuning often improved efficiency over naive fine-tuning (Figure 1e, f).
Using this optimal SuperAnimal setting, if the model is fine-tuned with only 10 (randomly selected) images on DLC-Openfield, the SuperAnimal pre-trained model obtained an RMSE of 7.68 pixels, whereas ImageNet pre-training reached 18.14 pixels. The baseline ImageNet pre-trained model required 101 (randomly selected) images to reach a similar performance (6.28 pixels) to the SuperAnimal pre-trained models (Figure 1e). Therefore, we outperformed DeepLabCut-ResNet-50 (i.e., the ImageNet baseline) by over 2× in the low-data regime (i.e., with 10 frames of labeling), and we can achieve the same performance as DeepLabCut-ImageNet weights with 10× less data.
One important point is that the SuperAnimal pre-trained model is now imbued with a "pose prior". Historically, fine-tuning strategies assumed no "task priors" in the pre-trained model, a paradigm adopted from task-agnostic transfer learning (35). Yet, here we show that naively fine-tuning on datasets that do not have the full superset of points can cause catastrophic forgetting (Figure 1g, see Methods and Suppl. Video 2). Specifically, if we fine-tuned with the four-keypoint dataset from DLC-Openfield, the model would forget the full 27 keypoints. We show this is prevented by our memory-replay method. This method can also improve fine-tuning performance (e.g., by 3.9 pixels on the DLC-Openfield dataset at N=10 training images; Cohen's d=-1.7494, Figure 1f).
Next, we tested our SuperAnimal-Quadruped model on four OOD benchmarks with various metrics: Horse-10 (16), which reports the normalized error (normalized by the animal's size, see inset in Figure 2b); a new dataset we present called iRodent; AP-10K (30); and lastly AnimalPose (27), all of which we report as mAP. To probe model performance, we did not include Horse-10, iRodent, or AP-10K in the base model, which we call SuperAnimal-Quadruped here.
Horse-10 is a benchmark challenge that tests OOD robustness. We evaluated on the official splits and show that zero-shot performance is similar across the IID and OOD splits, both of which are OOD to our model (Figure 2c). Next, we show that with minimal fine-tuning using our SuperAnimal method we can match ImageNet-based transfer learning with 10× less data (Figure 2c, i.e., fine-tuning with 73 frames vs. 734 for the same level of performance). iRodent is a challenging new dataset comprising a diverse set of images of rodents, yet with our top-down HRNet CNN we achieve excellent zero-shot performance (Figure 2d, h), although there is still a gap to close in future work, as even fine-tuning on over 350 frames only slightly improves the mAP.
Next, we tested our model on the AP-10K benchmark. Here we show that when fine-tuned, our SuperAnimal-Quadruped HRNet outperforms ImageNet-based fine-tuning (Figure 2e), and our zero-shot performance was better than a model that performs well on the AnimalPose benchmark (27). AnimalPose is a benchmark dataset of dogs, cats, cows, horses, and sheep with 20 keypoints (27). Lastly, we benchmarked our SuperAnimal model directly on AnimalPose. For testing on AnimalPose we made a SuperAnimal model variant that dropped AnimalPose during training, for both the HRNet and AnimalTokenPose backbones (Figure 2f). Notably, we find that our zero-shot performance is almost on par with fully supervised models (Figure 2e), and beats the zero-shot performance of a model trained only on the AP-10K dataset (39) (Figure 2f).
Collectively, the SuperAnimal method presents an efficient way to achieve strong zero-shot performance and also provides better starting weights for fine-tuning (vs. ImageNet fine-tuning). Also note that both sets of SuperAnimal models, TopViewMouse and Quadruped, learned to predict the union of all keypoints defined across multiple datasets, even if no single dataset had defined all of these keypoints. Of course, despite strong generalization, there can still be failures (Extended Data Figure S2c).

Unsupervised Video Adaptation.
Independent of the use case (i.e., zero-shot or few-shot fine-tuning), to optimize performance on unseen user data we also developed two unsupervised methods for video inference that help overcome differences between the data SuperAnimal models were trained on and the data users might have. These so-called distribution shifts can come in various forms (e.g., spatial or temporal; see Methods). For example, the model cannot perform well if the video is dramatically different in size from what we trained on (Figure 3a, b, c). Therefore, inspired by (40), we developed an unsupervised test-time augmentation called spatial-pyramid search that significantly boosted performance on three OOD videos (Figure 3b, c, Suppl. Video 3, Table S13; see Methods). This is unsupervised, as the user does not need to label any data; they simply give a range of video sizes. Note that in practice this does slow down inference, depending on the size of the search parameter space.
Secondly, to improve temporal video performance we propose a new unsupervised domain adaptation method (41,42) tailored for pose estimation, called video adaptation, that mitigates jitter in the predictions (Figure 3d-f, Suppl. Video 4). The method runs pose inference on the videos, treats the output predictions as pseudo ground-truth labels, and then fine-tunes the model. This does not take extensive training time and can be run during video analysis. For example, if a video (of a given size) can be run at 40 FPS, video adaptation would slow down processing to approximately 12 FPS. In Figure 3e-i we demonstrate the qualitative gains in video smoothness across frames with and without video adaptation, and show quantification in Extended Data Figure S4e.

Unsupervised behavioral analysis.
To illustrate the value of zero-shot predictions for behavioral quantification (Figure 4a), we turned to an open-source dataset that was used to benchmark the performance of open-source machine learning tools vs. some commercially available solutions (15). Specifically, we used the open-field test (OFT) dataset presented in Sturman et al. (15). We evaluated the performance of SuperAnimal weights on an action segmentation task. To make OFT out-of-distribution, we made a variant of the SuperAnimal-TopViewMouse model that excluded these data from training. As a strong baseline, we used the DeepLabCut keypoints trained by Sturman et al., who trained in a supervised way on each video specifically, thus making it in-domain (Figure 4a, b). We asked whether the SuperAnimal model variant, which had never been trained on the 20 videos they present, is sufficient to classify two critical kinematic-based postures: unsupported rearing in the open field, and supported rearing against the box wall (Figure 4a, b, see also Suppl. Video 5). If the keypoints were too noisy, this task would be very challenging.
In order to transform keypoints into behavioral actions via segmentation, we used skeleton-based features to convert keypoints into feature vectors (see Methods). We then either used only an MLP-based classifier, as in Sturman et al., or used a newly described non-linear embedding algorithm called CEBRA (36) to further improve the feature space, followed by the same classifier (see Methods and Figure 4c-e).
We found that SuperAnimal zero-shot could be as good as the supervised keypoint model in predicting both behaviors (Figure 4d-f, g; linear mixed-effects model, fixed effect of 'method': F=0.999, p=0.393; see Table S14). Moreover, using CEBRA slightly improved upon the behavior classification, independent of which keypoints were used (Figure 4e, f). We also compared the correlation of our results, based on SuperAnimal or Sturman keypoint data, against the three annotators per video, and find that our model is well correlated with the ground-truth annotations, particularly when using CEBRA (Figure 4f). SuperAnimal-TopViewMouse also performed well with post-hoc unsupervised analysis of behavior with the newly introduced Keypoint-MoSeq (43) and AmadeusGPT (44). Collectively, this demonstrates that without any pose training data, SuperAnimal models can be used for downstream behavioral analysis on out-of-distribution data.

Lastly, to show the utility of the SuperAnimal-Quadruped model in video analysis, we performed gait analysis in horses. Here, we turn to a ground-truth video dataset where every frame of the video was annotated by an equine expert (16). We computed the stride and swing phases of the gait and show that the SuperAnimal-Quadruped model with video adaptation can match ground truth (Figure 4g) in 24 out of 30 videos, where we only miss one stride detection (either over or under; Figure 4h). A failure case is shown in Extended Data Figure S4. We also computed the hoof-ground contacts and find generally good agreement between ground truth and predictions (Figure 4i). The fraction of contacts within 1-5 frames of ground truth was 60%-80.5%, respectively, averaged across front and hind limbs across all videos. Collectively, this suggests our SuperAnimal models can be used in real-world tasks both in and outside the laboratory.

Discussion
Biology often uses a few common model organisms, such as rodents, zebrafish, Drosophila, and non-human primates (45,46). Here we propose an approach to create robust, cross-lab neural network models that are applicable to rodents and many other quadrupeds (>40 species). Which keypoints are relevant also depends on the experiment: in reaching experiments (9,47), for instance, different keypoints are of interest than in open-field studies. Our approach is general, and it will be an important future goal to expand the DeepLabCut Model Zoo to additional animals (e.g., insects, birds, or fish) and behavioral contexts.
Building a pretrained pose model via supervised learning benefits from the availability of annotated pose datasets, and we show that our formulation removes the obstacles to leveraging inhomogeneous pose datasets, which enables SuperAnimal models to benefit from learned pose priors and larger datasets. Alternatively, unsupervised keypoint discovery can be used (48,49). While the unsupervised approach requires no pose annotations, the learned keypoints might lack interpretability, and it is not clear whether it allows zero-shot inference on OOD data. Therefore, both approaches, predicting the superset of annotated keypoints from different studies and unsupervised keypoint discovery, are promising, complementary directions.
Taken together, we aimed to reduce the (human and computing) resources needed to create or adapt animal pose models in both lab and in-the-wild animal studies, thereby increasing access to critical tools in animal behavior quantification. We developed a new framework called panoptic pose estimation, where models can be used across various environments in a zero-shot manner and, if fine-tuned, require 10× less labeled data than previous models. Our method might lead to the homogenization of animal pose estimation models, making them useful to a broader range of users, thus lowering the entry barrier and reducing resource consumption. Moreover, labs can now share their data and we can leverage a global community effort to build more powerful models. The DeepLabCut Model Zoo web platform provides access to SuperAnimal pre-trained models, aids in collecting and labeling more data (Extended Data Fig. S2d), and hosts other user-shared models at http://modelzoo.deeplabcut.org.

Datasets
We collected publicly available datasets from the community, as well as in-house datasets for building the SuperAnimal models.Thereby, we sought to cover diverse individuals, backgrounds, scenarios, and postures.In the following we detail the references for those datasets.
TopViewMouse-5K. 3CSI, BM, EPM, LDB, OFT: see full details at (15) and in (50). BlackMice: see full details at (23). WhiteMice: courtesy of Prof. Sam Golden and Nastacia Goodwin; see details in SIMBA (24). TriMouse: see full details at (10). DLC-Openfield: see full details at (9). Kiehn-Lab-Openfield, Swimming, and Treadmill: courtesy of Prof. Ole Kiehn, Dr. Jared Cregg, and Prof. Carmelo Bellardita; see details at (51). MausHaus: We collected video data from five single-housed C57BL/6J male and female mice in an extended home cage, carried out in the laboratory of Mackenzie Mathis at Harvard University and at EPFL (housing temperature 20-25 C, humidity 20-50%). Data were recorded at 30 Hz with 640 × 480 pixel resolution, acquired with White Matter, LLC eV cameras. Annotators localized 26 keypoints across 322 frames sampled from within DeepLabCut using the k-means clustering approach (34). All experimental procedures for mice were in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and were approved by the Harvard Institutional Animal Care and Use Committee (IACUC) (n=1 mouse) and by the Veterinary Office of the Canton of Geneva (Switzerland; license GE01) (n=4 mice).

Quadruped-40K.
AwA-Pose: quadruped dataset; see full details at (52). AnimalPose: see full details at (27). AcinoSet: see full details at (25). Horse-30: Horse-30 dataset, whose benchmark task is called Horse-10; see full details at (16). StanfordDogs: see full details at (53,54). AP-10K: see full details at (30). iRodent: We utilized the iNaturalist API functions to scrape observations with the taxon ID of Suborder Myomorpha (55). The functions allowed us to filter the large number of observations down to those with photos under the CC BY-NC creative license. The most common types of rodents among the collected observations are Muskrat (Ondatra zibethicus), Brown Rat (Rattus norvegicus), House Mouse (Mus musculus), Black Rat (Rattus rattus), Hispid Cotton Rat (Sigmodon hispidus), Meadow Vole (Microtus pennsylvanicus), Bank Vole (Clethrionomys glareolus), Deer Mouse (Peromyscus maniculatus), White-footed Mouse (Peromyscus leucopus), and Striped Field Mouse (Apodemus agrarius). We then generated segmentation masks over the target animals by processing the media through an algorithm we designed that uses a Mask Region-based Convolutional Neural Network (Mask R-CNN) (56) model with a ResNet-50-FPN backbone (57), pretrained on the COCO dataset (58). The processed 443 images were then manually labeled with both pose annotations and segmentation masks. Data are banked at https://zenodo.org/record/8250392.

Additional OOD videos. In Figure 3, for video testing we additionally used the following data: Golden Lab mouse: see details at (59). Smear Lab mouse: see details at (60). Mathis Lab MausHaus: new video conditions, but the same MausHaus ethics approval as above.

Panoptic pose estimation
We cast animal pose estimation as panoptic segmentation (61) on the animal body; i.e., every pixel on the body is potentially a semantically meaningful keypoint with an individual identity. Ideally, an infinite collection of diverse pose datasets covers this, and the union of keypoints defined across datasets makes up the label space of panoptic pose estimation.

Data conversion and panoptic vocabulary mapping (generalized data converter).
Data came from multiple sources and in multiple formats. To homogenize the different annotation formats (COCO-style, DeepLabCut format, etc.), we implemented a generalized data converter. We parsed more than 20 public datasets and re-formatted them into DeepLabCut projects. Besides data conversion, the generalized data converter also implements key steps of the panoptic animal pose estimation task formulation. These steps include: 1. Hand-crafted conversion mapping. The same anatomical keypoint might be named differently in different datasets, or different anatomical locations might share the same label across datasets. Thus, the generalized data converter used a hand-crafted conversion mapping (see Extended Data Figs. S1a, S3) to enforce a shared vocabulary among datasets. We checked the visual appearance of keypoints to determine whether two keypoints (in different datasets) should be regarded as identical. In such cases, the model had to learn (possible) dataset bias in a data-driven way. We can also think of this as a form of data augmentation that randomly shifts keypoint coordinates by a small magnitude, which is the case for keypoints on which most dataset creators agree (e.g., keypoints on the face). For keypoints on the body, the quality of the conversion table can be critical for the model to learn a stable representation of animal bodyparts.
2. Vocabulary projection. After the conversion mapping is made, keypoints from the various datasets were projected into a super-set keypoint space. Every keypoint became a one-hot vector in the union of the keypoint spaces of all datasets. Thereby the animal pose vocabularies were unified. 3. Dataset merging. After annotations were unified into the super-set annotation space, we merged annotations from the datasets by concatenating them into a collection of annotation vectors. Note that if the images only displayed a single species, we essentially built a specialized dataset for that species in different cage and camera settings. If multiple species were present, we essentially grouped them in a species-invariant way to encourage the model to learn species-agnostic keypoint representations, as is the case for our SuperAnimal-Quadruped model.
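The listing below is a minimal sketch of the conversion-mapping and vocabulary-projection steps described above, under stated assumptions; the dataset names, keypoint names, and helper functions are illustrative and are not the released converter's API.

import numpy as np

# Hand-crafted conversion mapping: dataset-specific names -> shared vocabulary (illustrative).
CONVERSION_MAPPING = {
    "dataset_A": {"snout": "nose", "tailbase": "tail_base"},
    "dataset_B": {"nose_tip": "nose", "tail_root": "tail_base"},
}

# Super-set vocabulary: union of keypoints across all datasets (toy subset).
SUPERSET = ["nose", "left_ear", "right_ear", "tail_base"]
NOT_DEFINED = -1  # flag for keypoints the source dataset does not define

def project_to_superset(dataset_name, annotation):
    # Project one image's annotation {name: (x, y, visibility)} into the
    # super-set keypoint space; undefined keypoints keep the NOT_DEFINED flag.
    mapping = CONVERSION_MAPPING[dataset_name]
    projected = np.full((len(SUPERSET), 3), NOT_DEFINED, dtype=float)
    for name, (x, y, vis) in annotation.items():
        shared_name = mapping.get(name, name)
        if shared_name in SUPERSET:
            projected[SUPERSET.index(shared_name)] = (x, y, vis)
    return projected

# Dataset merging then simply concatenates the projected annotations.
merged = [project_to_superset("dataset_A", {"snout": (10.0, 12.0, 1)}),
          project_to_superset("dataset_B", {"nose_tip": (55.0, 40.0, 1)})]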

Data split and training fraction.
We used an 80:20 and a 95:5 training:test ratio to split Quadruped-40K and TopViewMouse-5K, respectively. To measure variability in model performance, we used three data splits for all our experiments.
For all benchmark testing we used the data splits of the original papers. One exception is AnimalPose (27), for which there is no available data split that includes all 5 species. Therefore, we randomly split that dataset 80:20 to create our own split.
Training protocol. The SuperAnimal-Quadruped model used the same training protocol described in the AP-10K paper (30). Specifically, we used the Adam optimizer (62) with an initial learning rate of 5e-4. The total number of training epochs was 210, and we applied a step decay to the learning rate at epochs 170 and 200. We used a batch size of 64.
For fine-tuning models with a very small number of unique images (e.g., fewer than 64 images in the training set), we did not train with batch norm and used an initial learning rate of 5e-5; we observed stable training with this setting.
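As a concrete illustration of the schedule described above (Adam, initial learning rate 5e-4, 210 epochs with step decay at epochs 170 and 200, batch size 64), a minimal PyTorch sketch follows; the toy model, toy loader, and the decay factor of 0.1 are assumptions for illustration, not the released training code.

import torch

# Toy stand-ins so the schedule runs end-to-end; the real model is an HRNet-w32
# pose network and the real loader yields images and masked heatmap targets.
model = torch.nn.Conv2d(3, 39, kernel_size=3, padding=1)
train_loader = [(torch.randn(64, 3, 64, 64), torch.randn(64, 39, 64, 64))] * 2

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[170, 200], gamma=0.1)

for epoch in range(210):
    for images, targets in train_loader:  # batch size 64
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(images), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # step decay at epochs 170 and 200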

Model architectures
To show that our methods are consistent regardless of the choice of architecture and data split, we provide results using both convolutional neural networks (CNNs) and several Transformer architectures.
DLCRNet. We use DLCRNet_ms5 (10) as the baseline network architecture for its excellent performance on animal pose estimation. A batch size of 8 was used, and the SuperAnimal-TopViewMouse model was trained for a total of 750k iterations. In the fine-tuning stage, a batch size of 8 was used for 70k iterations. The Adam optimizer (62) was used for all training instances, and we otherwise used default parameters. We follow DeepLabCut's multi-step learning rate scheduler to drop the learning rate three times, from 1e-4 to 1e-5. Cross-entropy is used for learning heatmaps. For the fine-tuning experiments, we keep the same optimizer, batch size, and learning rate scheduler; the total number of training steps is adjusted to 70k iterations.
During video adaptation, we keep the same optimizer and learning rate scheduler, but with batch size 1 and a total of 1000 training steps. We observe that this low computational budget is sufficient for the model to adapt.
HRNet-w32. We also used HRNet (18) for SuperAnimal-Quadruped and for the further experiments shown in Extended Data Figure S3b and Figures 1h and 2. Note that in the Extended Data the data splits are different from the ones in the main text; this split was also used for the transformers and can therefore be considered another fully independent replication.

Transformers. Inspired by recent results of Vision Transformers for pose estimation (19-21), we also adapted the TokenPose model by Yang et al. (20), which adds information about each keypoint in learnable queries called keypoint embeddings. The model was originally used for human pose estimation with a fixed number of keypoints.
Combining TokenPose and panoptic animal pose estimation, we obtain AnimalTokenPose models that achieve high zero-shot performance on the OOD datasets we prepared (Figures 1 and 2).
For keypoint estimation, 12 transformer encoder blocks with a feature vector size of 192 were stacked. While the ViT encoder receives raw pixels as input, in TokenPose (20) images of size 256 × 256 are first processed by a convolutional backbone, and the resulting abstract features are then split into patches of size 4 × 4. As in TokenPose (20), we used the first 3 stages of HRNet (65) and 2 stacked residual blocks from a ResNet (66).

Algorithmic enhancements for training and inference
Keypoint gradient masking. First, we manually verified a semantic mapping of the datasets with diverse naming (i.e., nose in dataset 1 and snout in dataset 2). Then, we defined a master keypoint space, where no single dataset needed to include all of the named keypoints. This yielded sparse keypoint annotations in the super-set keypoint space (Extended Data Figs. S1b, c). Training naively on these projected annotations would harm training stability, as the loss function would penalize undefined keypoints as if they were not visible (i.e., occluded).
For stable training of our panoptic pose estimation model, we mask components of the loss function across keypoints. The keypoint mask $n_k$ is set to 1 if keypoint $k$ is present in the annotation of the image and to 0 if it is absent. We denote the predicted probability for keypoint $k$ at pixel $(i, j)$ as $p_k(i,j) \in [0,1)$ and the respective label as $t_k(i,j) \in \{0,1\}$, and formulate the masked $\ell_q$ error loss function as

$$\mathcal{L}_{\ell_q} = \sum_{k} n_k \sum_{i,j} \left| p_k(i,j) - t_k(i,j) \right|^{q},$$

with $q = 2$ for mean square error and $q = 1$ for the L1 loss (e.g., used for locref maps in DLCRNet (10)), and the masked cross-entropy loss function as

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{k} n_k \sum_{i,j} t_k(i,j)\, \log p_k(i,j).$$

Note that we distinguish between "not annotated" and "not defined" in the original dataset, and we only mask undefined keypoints. This is important because, in the case of side-view animals, "not annotated" could also mean occluded/invisible; masking not-annotated keypoints would encourage the model to assign high likelihood to occluded keypoints.
Also note that the network predictions $p_k(i,j)$ are generated by applying a softmax to the logits $l_k(i,j)$ across all possible keypoints, including masked ones:

$$p_k(i,j) = \frac{\exp\!\left(l_k(i,j)\right)}{\sum_{k'} \exp\!\left(l_{k'}(i,j)\right)}.$$

The masking in the loss function then ensures that the probability assigned to non-defined keypoints is neither penalized nor encouraged during training.
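A minimal PyTorch sketch of the masked losses above follows; the function signatures and tensor shapes are illustrative assumptions, not the released implementation.

import torch

def masked_heatmap_loss(logits, targets, keypoint_mask, q=2):
    # logits, targets: (K, H, W); keypoint_mask: (K,) with 1 where the keypoint
    # is defined in the source dataset and 0 otherwise.
    # q=2 gives the masked mean-square error, q=1 the masked L1 loss.
    probs = torch.softmax(logits, dim=0)  # softmax over the keypoint dimension
    per_keypoint = (probs - targets).abs().pow(q).sum(dim=(1, 2))
    # undefined keypoints contribute nothing to the gradient
    return (keypoint_mask * per_keypoint).sum()

def masked_cross_entropy(logits, targets, keypoint_mask, eps=1e-8):
    probs = torch.softmax(logits, dim=0)
    ce = -(targets * torch.log(probs + eps)).sum(dim=(1, 2))
    return (keypoint_mask * ce).sum()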
Handling the train- and test-time resolution discrepancy. One notable challenge our models face at inference time is the discrepancy in image resolution between the train and test stages. Even though scale-jitter augmentation is part of most pose estimation frameworks' data augmentation pipelines, including DeepLabCut's (10,34,67), one does not expect it to handle dramatic changes in resolution. Indeed, it is well known that scale augmentation greatly helps robustness to image and animal sizes (67). However, for shared models, we cannot anticipate the animal scale of user datasets.
In the case of fine-tuning, the downstream dataset (and the animals present in it) could have a very different resolution from the pre-training datasets. To handle those outliers, we resize downstream datasets (to a height of 400 pixels, keeping the aspect ratio) if their sizes are drastically different from our training images. To further deal with scale changes, we employ spatial-pyramid search at test time (see below).
Domain shifts and unsupervised adaptation. Domain shifts (68) describe a classical vulnerability of neural networks, where a model takes inputs from a data domain that is dissimilar from the training data domain, which usually leads to a large performance degradation. We empirically observe three types of domain shift when applying our models in a zero-shot manner, ranging from pixel-statistics shift (69), to spatial shift (70), to semantic shift (68,69). To mitigate these, we applied two methods: test-time spatial-pyramid search and video adaptation. For spatial-pyramid search, during inference we build a spatial pyramid composed of the model's predictions for multiple copies of the original image at different resolutions. We use the model's confidence as the criterion to filter out the resolutions that give sub-optimal performance, and aggregate (by taking the median) the predictions from resolutions with above-threshold confidence as our final prediction.
The train-test resolution discrepancy (71) has been studied actively, and most work approaches it through multi-resolution fusion (10,40,65). Previous work mostly focuses on the IID setting, where the resolution of test images does not vary considerably from the training images. Moreover, prior work approaches multi-resolution fusion via deep features, requiring modifications to the architecture and adding more parameters. In contrast, the proposed spatial-pyramid search is designed to aid SuperAnimal models in the zero-shot scenario, where the resolutions of test images are most likely out of distribution relative to our training images. We did not apply multi-resolution fusion via deep features, as that requires fixing the choice of architecture. On the other hand, the multi-scale testing commonly used in the IID setting does not need to carefully filter out very noisy predictions. This method can also be used for calibration, to find the optimal scale. Spatial-pyramid pseudo-code:
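The sketch below illustrates the procedure described above under stated assumptions (it is not the original pseudo-code listing); the predict_at_scale callable, the example scales, and the 0.5 confidence threshold are hypothetical.

import numpy as np

def spatial_pyramid_search(predict_at_scale, image, scales, confidence_threshold=0.5):
    # predict_at_scale(image, scale) is assumed to resize the image by `scale`,
    # run the pose model, and return (keypoints, confidences) already mapped
    # back to original-image coordinates.
    kpts, confs = zip(*(predict_at_scale(image, s) for s in scales))
    kpts = np.stack(kpts)    # (n_scales, n_keypoints, 2)
    confs = np.stack(confs)  # (n_scales, n_keypoints)
    mask = confs >= confidence_threshold
    final = np.full(kpts.shape[1:], np.nan)
    for k in range(kpts.shape[1]):
        if mask[:, k].any():
            # aggregate (median) over resolutions with above-threshold confidence
            final[k] = np.median(kpts[mask[:, k], k], axis=0)
    return final

# Example usage: spatial_pyramid_search(my_predictor, frame, scales=[0.5, 0.75, 1.0, 1.25, 1.5])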
Video adaptation. To help SuperAnimal models adapt to novel videos, we run inference on the videos and treat the resulting predictions as pseudo ground-truth labels (72) to train on. We remove predictions with low confidence to filter out unreliable pseudo-labels. Empirically, 1000 iterations with batch size 1 are sufficient to greatly reduce jitter. The optimal number of iterations and the confidence threshold are hyperparameters that vary across videos. Combined together. The two methods above can be combined for better results: the pseudo-labels can come from test-time spatial-pyramid search, which gives more accurate pseudo-labels for the model to learn from.
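A minimal sketch of this pseudo-labeling loop follows, assuming an illustrative model interface (model.predict and model.loss) and an assumed learning rate of 1e-4; it is not the DeepLabCut API.

import torch

def video_adaptation(model, frames, confidence_threshold=0.5, iterations=1000):
    # 1) Run inference and keep only confident predictions as pseudo-labels.
    pseudo_labels = []
    with torch.no_grad():
        for frame in frames:
            keypoints, confidence = model.predict(frame)
            if confidence.mean() >= confidence_threshold:
                pseudo_labels.append((frame, keypoints))
    if not pseudo_labels:
        return model

    # 2) Briefly fine-tune on the pseudo-labels (batch size 1, ~1000 iterations).
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for step in range(iterations):
        frame, target = pseudo_labels[step % len(pseudo_labels)]
        optimizer.zero_grad()
        loss = model.loss(frame, target)
        loss.backward()
        optimizer.step()
    return model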

Memory replay. Catastrophic forgetting (73) describes a classic problem in continual learning (32): a model gradually loses its ability to solve previous tasks after it learns to solve new ones.
Fine-tuning a SuperAnimal model falls into the category of continual learning: the downstream dataset potentially defines different keypoints than those learned by the models. Thus, the models might forget the keypoints they learned and only pick up those defined in the target dataset. Here, retraining with the original dataset plus the new one is not a feasible option, as datasets cannot always be easily shared and more computational resources would be required.
To counter this, we treat the zero-shot inference of the model as a memory buffer that stores knowledge from the original model. When we fine-tune a SuperAnimal model, we replace the model-predicted keypoints with the ground-truth annotations, resulting in hybrid learning of old and new knowledge. The quality of the zero-shot predictions can vary, so we use the prediction confidence as a threshold to filter out low-confidence predictions. With the threshold set to one, memory-replay fine-tuning reduces to naive fine-tuning.

Memory replay pseudo-code:
def is_defined(keypoint):
    # Check whether the original dataset defines this keypoint. We use the flag
    # -1 to denote that a keypoint is not defined in the original dataset.
    # Note this is different from "not annotated", which uses the flag 0.
    return keypoint[2] >= 0

def load_pseudo_keypoints(image_ids):
    # Get the pseudo keypoints (zero-shot predictions) by image IDs. Pseudo
    # keypoints are loaded from disk and fixed throughout the process, so they
    # do not drift as is expected in typical online pseudo-labeling.
    return pseudo_keypoints_from_disk(image_ids)

def get_confidence(keypoint):
    # Get the model confidence of a predicted keypoint. Unlike ground-truth data,
    # which carry 3 discrete flags, predicted keypoints have a confidence that
    # can also be used as a likelihood readout for post-inference analysis.
    return keypoint[2]

def memory_replay(model, superset_gt_data_loader, optimizer, threshold):
    # Ground-truth data are preprocessed so that annotations are in the superset
    # keypoint space; every GT keypoint carries one of 3 flags
    # (-1: not defined, 0: not labeled, 1: annotated).
    for batch_data in superset_gt_data_loader:
        gt_keypoints = batch_data["keypoints"]
        pseudo_keypoints = load_pseudo_keypoints(batch_data["image_ids"])
        for b_id in range(len(gt_keypoints)):
            for kpt_id in range(gt_keypoints.shape[1]):
                # If this body part is not defined in the new dataset, use the
                # saved pseudo label (zero-shot prediction) as GT. This prevents
                # catastrophic forgetting and drifting; confidence is used to
                # filter the pseudo keypoints. We assume a single-animal scenario
                # for simplicity; for multiple animals, matching between GT and
                # pseudo keypoints would need to be completed first.
                if (not is_defined(gt_keypoints[b_id, kpt_id])
                        and get_confidence(pseudo_keypoints[b_id][kpt_id]) > threshold):
                    gt_keypoints[b_id, kpt_id] = pseudo_keypoints[b_id][kpt_id]
        # Standard supervised update on the hybrid (GT + pseudo) annotations.
        loss = model(batch_data["images"], gt_keypoints)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Automatic keypoint matching.
In cases where users want to apply our models to an existing annotated pose dataset, we recommend using our keypoint matching algorithm. This step is important because our models define their own vocabulary of keypoints, which might differ from that of the novel pose dataset. To minimize the gap between the models' vocabulary and the dataset's vocabulary, we propose a matching algorithm. First, we use our model to perform zero-shot inference on the whole dataset, which gives pairs of predictions and ground truth for every image. Then, we cast the matching between the model's predictions (2D coordinates) and the ground truth as bipartite matching, using the Euclidean distance as the cost between pairs of keypoints, and solve it with the Hungarian algorithm. For every image, we thus obtain a matching matrix, where 1 denotes a match and 0 a non-match. Because the model's predictions can be noisy from image to image, we average this matching matrix across all images and perform another bipartite matching, resulting in the final keypoint conversion table between the model and the dataset (example affinity matrices are shown in Figure S2a, b).
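A minimal sketch of this matching procedure, using SciPy's Hungarian-algorithm solver, follows; function and variable names are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_keypoint_vocabularies(predictions, ground_truths):
    # predictions / ground_truths: lists of (n_model_kpts, 2) and (n_dataset_kpts, 2)
    # arrays, one pair per image; returns a model-to-dataset keypoint mapping.
    n_model = predictions[0].shape[0]
    n_data = ground_truths[0].shape[0]
    match_counts = np.zeros((n_model, n_data))
    for pred, gt in zip(predictions, ground_truths):
        # Euclidean distance between every predicted and ground-truth keypoint.
        cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)  # per-image bipartite matching
        match_counts[rows, cols] += 1
    # Average across images, then solve a final assignment on the affinity matrix.
    rows, cols = linear_sum_assignment(-match_counts / len(predictions))
    return dict(zip(rows.tolist(), cols.tolist()))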
Note that the quality of the matching will impact the performance of the model, especially in fine-tuning. If, for example, the annotation nose is mistakenly converted to the keypoint tail and vice versa, the model will have to unlearn the channels that correspond to nose and tail.

Evaluation metrics
Supervised metrics for pose estimation.

RMSE.
Root mean squared error (RMSE) is a metric measuring the distance between predictions and ground-truth annotations in pixel space (7,9). For pose estimation, however, it does not take the scale of the image and individuals into consideration, so the distance is non-normalized. As our data are highly variable, we also sometimes use normalized errors. We use RMSE for the DLC-Openfield benchmarking, as this was the original authors' main reported metric. Note that when evaluating RMSE, we do not remove predictions that have low confidence due to occlusion; therefore, all predictions, including outliers, are penalized by RMSE.
Normalized Error. For the Horse-10 experiments we use the eye-to-nose distance for normalization (the computed ground truth is available in Horse-30 (16)).

mAP. Mean average precision (mAP) is the averaged precision of the object keypoint similarity (OKS) (74):

$$\mathrm{OKS} = \frac{\sum_i \exp\!\left(-\frac{d_i^2}{2 s^2 k_i^2}\right)\, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)},$$

where $d_i$ is the Euclidean distance between each corresponding ground-truth and detected keypoint, $v_i$ is the visibility flag of the ground truth, $s$ is the object scale, and $k_i$ is a per-keypoint constant that controls falloff (see full implementation details at (58), and see Methods Table 1). $s$ is the square root of the bounding-box area (the product of the bounding box's width × height).
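The short sketch below shows one way to compute OKS for a single animal from the quantities defined above; argument names are illustrative, not a library API.

import numpy as np

def object_keypoint_similarity(pred, gt, visibility, bbox_area, k_i):
    # pred, gt: (K, 2) arrays; visibility: (K,) ground-truth flags;
    # bbox_area: bounding-box width x height; k_i: (K,) per-keypoint falloff constants.
    d = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint Euclidean distance
    s = np.sqrt(bbox_area)                  # object scale
    visible = visibility > 0
    if not visible.any():
        return np.nan
    oks = np.exp(-d**2 / (2 * s**2 * k_i**2))
    return oks[visible].mean()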

Unsupervised metrics for video prediction smoothness.

Convex hull body area measurement
To evaluate the smoothness of SuperAnimal model predictions in video, we utilize a simple unsupervised heuristic: we compute the area of a polygon encompassing all keypoints, the idea being that the smoother the detections, the lower the variance of this polygon's area. Formally, we estimate the animal body area $A_{\mathrm{body}}$ using the convex hull containing all keypoints over time. Let $K$ represent the set of all keypoints for the animal at each time step, and $\mathrm{conv}(K)$ denote the convex hull containing all keypoints. The animal body area is then given by the area of the convex hull:

$$A_{\mathrm{body}} = \mathrm{Area}\!\left(\mathrm{conv}(K)\right),$$

where $\mathrm{Area}(\mathrm{conv}(K))$ calculates the area of the convex hull $\mathrm{conv}(K)$ containing all keypoints over time.
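A minimal sketch of this heuristic using SciPy's convex-hull routine follows; the toy keypoint array is a placeholder for model predictions.

import numpy as np
from scipy.spatial import ConvexHull

def body_area(keypoints):
    # A_body for one frame: area of the convex hull of all keypoints
    # (`keypoints` is a (K, 2) array of x, y coordinates).
    return ConvexHull(keypoints).volume  # for 2D input, .volume is the polygon area

# Toy example: frame-to-frame variance of A_body is the smoothness heuristic
# described above (lower variance = smoother tracking).
video_keypoints = np.random.rand(100, 27, 2) * 480  # placeholder predictions
areas = np.array([body_area(frame) for frame in video_keypoints])
print(areas.var())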

Jittering metric
We define jittering, denoted by $J$, as the average of the absolute values of the centered, non-signed speeds across all examples and all keypoints. For a given keypoint $k$ and example $e$, the jittering value is computed as

$$J_{k,e} = \frac{1}{N_{k,e}} \sum_{i=1}^{N_{k,e}} \left| v_{k,e,i} \right|,$$

where $J_{k,e}$ is the jittering value for keypoint $k$ in example $e$, $N_{k,e}$ is the total number of centered, non-signed speed measurements for keypoint $k$ in example $e$, and $v_{k,e,i}$ is the $i$-th centered, non-signed speed measurement for keypoint $k$ in example $e$.
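A short sketch of this metric for a single example follows; the trajectory array layout is an assumption for illustration.

import numpy as np

def jitter(trajectories):
    # `trajectories` is a (T, K, 2) array of keypoint coordinates over T frames.
    speeds = np.linalg.norm(np.diff(trajectories, axis=0), axis=-1)  # (T-1, K) non-signed speeds
    centered = speeds - speeds.mean(axis=0, keepdims=True)           # per-keypoint centering
    return np.abs(centered).mean()                                    # average over frames and keypoints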

Keypoint dropping metric
Let $K_{\mathrm{total}}$ be the total number of keypoints in the video sequence, and $K_{\mathrm{dropped}}$ the count of keypoints that fall below a defined confidence threshold $T_{\mathrm{threshold}}$ and are therefore considered dropped; this metric is meaningful in environments with little occlusion and a top view, where low confidence indicates a prediction failure rather than occlusion.

We define "keypoint dropping" as tracking the number of keypoints below the threshold for each video frame:

$$K_{\mathrm{dropped}}(t) = \sum_{i=1}^{K_{\mathrm{total}}} \delta_i(t),$$

where $K_{\mathrm{dropped}}(t)$ is the count of keypoints dropped at time $t$, and $\delta_i(t)$ is an indicator function that returns 1 if the $i$-th keypoint is below the threshold at time $t$, and 0 otherwise:

$$\delta_i(t) = \begin{cases} 1 & \text{if } \mathrm{score}_i(t) < T_{\mathrm{threshold}}, \\ 0 & \text{otherwise}, \end{cases}$$

where $\mathrm{score}_i(t)$ is the confidence score of the $i$-th keypoint at time $t$.
Statistical analysis. Linear mixed-effects models were fitted in R (75) using the lme4 package (v1.1.31; (76)). Training data fraction (or, equivalently, the number of images) and fine-tuning methods were defined as fixed effects, whereas the various datasets and shuffles were treated as random effects; random intercepts and slopes were also added at the dataset level. The best models were selected based on the Akaike Information Criterion (AIC); adding complexity did not result in a lower AIC, and even led to singular fits, indicative of overfitting. The weight of evidence for an effect was computed using likelihood ratio tests, as well as with p-values provided by lmerTest (v3.1.3). Pairwise contrasts and Cohen's d standardized effect sizes were computed with the emmeans package (v1.8.3), and degrees of freedom were estimated with the Kenward-Roger method. Distributions of prediction errors with and without spatial-pyramid search were compared with the two-sample, one-sided (alternative hypothesis: "less") Kolmogorov-Smirnov test. The significance threshold was set at 0.05.

Behavioral Action Segmentation
As our benchmark dataset, we used the open-field test (OFT) task from Sturman et al. (15). We calculated the same skeleton-based features by concatenating 10 distances between keypoints, 6 angles, 4 body areas, and two additional boolean variables coding whether the nose and head center were inside the arena, resulting in a 22D vector at each time step. For the action classifier, we used an MLP neural network as the action decoder, applied as a sliding window across 31 time steps to perform action segmentation, and used the F1 score on supported and unsupported rears as the evaluation metric. As in the original paper, we performed leave-one-out cross-validation on 20 videos and across 3 annotators.
For CEBRA (36), we used the model architecture 'offset10-model'.The output dimension was set to 32, as found via a simple grid search over the following values: [4,8,16,32].We trained it for 5000 iterations with batch size 4096, the Adam optimizer, and learning rate 1e-4.
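As an illustration, a minimal sketch of this configuration using the CEBRA Python API follows; the feature array is a placeholder for the 22D per-frame skeleton features, and the exact fitting calls used in our pipeline may differ.

import numpy as np
import cebra

features = np.random.rand(10000, 22).astype("float32")  # placeholder skeleton features

model = cebra.CEBRA(
    model_architecture="offset10-model",
    output_dimension=32,      # chosen via grid search over [4, 8, 16, 32]
    max_iterations=5000,
    batch_size=4096,
    learning_rate=1e-4,
)
model.fit(features)                    # unsupervised (time-contrastive) training
embedding = model.transform(features)  # (n_frames, 32) embedding fed to the MLP classifier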
Note that the original model for the OFT task from Sturman et al. includes the center and four corners of the mouse cage, which are critical for their handcrafted features to determine the relative distance between the mouse and the walls. As our SuperAnimal models focus on animal body parts only, we took the corner coordinates from their released data for the sake of comparison. In practice, those static environmental keypoints can be provided by users via an interactive GUI for videos.
Gait analysis. Our SuperAnimal-Quadruped model was run on the videos from Horse-30 (16). The start (2 s) and end (2 s) of each of the 30 videos were removed from the analysis, to ignore instants when the horse is only partially visible. Front and back hoof contacts and lifts were identified using peak and valley detection, respectively, from the 2D kinematic traces of the front and back hooves. Beforehand, these trajectories were smoothed using a 2nd-order, low-pass, zero-lag Butterworth filter (cutoff = 3 Hz) and centered on a keypoint located on the animal's back; this effectively expresses keypoint coordinates in a reference frame stationary relative to the moving horse, facilitating event detection. We extracted fore- and hindlimb strides between consecutive ground contacts, and stance phases from the contact of one hoof until it is lifted off the ground. Stride lengths (in pixels), stances, and the number of identified hoof contacts were then computed and qualitatively compared to those obtained using the densely annotated (ground-truth) keypoints (Figure 4g, h, i).
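A minimal SciPy sketch of this event-detection step follows; the 30 Hz frame rate, the use of the vertical coordinate, and the bare peak-detection settings are simplifying assumptions for illustration.

import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def detect_hoof_events(hoof_y, back_y, fs=30.0, cutoff=3.0):
    # Smooth the hoof trajectory with a 2nd-order, zero-lag (forward-backward)
    # Butterworth low-pass filter and express it relative to a back keypoint,
    # then detect contacts (peaks) and lifts (valleys), as described above.
    b, a = butter(2, cutoff / (fs / 2), btype="low")
    smoothed = filtfilt(b, a, np.asarray(hoof_y) - np.asarray(back_y))
    contacts, _ = find_peaks(smoothed)   # hoof-ground contacts
    lifts, _ = find_peaks(-smoothed)     # hoof lift-offs
    return contacts, lifts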
Annotation. Additionally, the website hosts a labeling web app that allows users to annotate curated images. The datasets currently available for annotation are from iNaturalist (77) and the OpenImages Dataset (78). After selecting which dataset to label, images are displayed successively with the target animal prominently shown in front of an opaque masked background (which can be toggled off). The keypoint set is selected taking into account the species' morphology and the keypoints' value in subsequent analysis. Once the annotation is complete, the data are saved to the database and made available for use in further research.
Online inference. To allow testing DeepLabCut models in the browser, the user selects a few images and which model to run, and receives predictions along with confidence scores for each keypoint. Users are then able to adjust or delete keypoints, as well as download the model weights. This allows for a quick and hassle-free evaluation of DeepLabCut's capabilities and suitability for specific tasks, making it available to a wider range of users. We also provide data converters (see Methods) that convert data of various formats to the DeepLabCut pose format. We call the models we provide SuperAnimal models for their generalization capabilities. After users download these models from our website or via DeepLabCut APIs, they can either use the models as a plug-and-play solution or choose to adapt or fine-tune these models on videos or pose datasets.

Considerations on building general datasets for pretraining
To build generalizable pose models, a large-scale pre-training dataset is key. It has been shown in both computer vision and natural language processing that pre-trained models significantly improve the generalization of models and the data efficiency on downstream datasets (63,80). However, data on lab animals are not ubiquitous on the internet. To get large-scale animal pose data, it is critical to gather the data directly from the research community in a responsible and transparent way; a platform that actively interacts with the community is thus required to build such a pre-training dataset. As the shared keypoint vocabulary is built on top of a wide range of pose datasets, it can be used across different research needs, and it is also key for useful zero-shot inference (see Methods).
We acknowledge that these SuperAnimal models would not have been possible without the accumulated data from the community. In the future, feedback from the community on the models' efficacy and failure modes (Extended Data Fig. S2) on different downstream data will be critical for updated model releases and algorithmic updates. As publicly available data increase, we expect the performance to improve.
Annotator bias in labeled data. Unlike previous works that require labeling data to create a working model, our models can be used as they are. For the purpose of evaluation, we can use the ground truth of the target dataset or label frames of a novel video. We note that when it comes to evaluating the performance of zero-shot inference, there will always be a systematic error between the model and the annotator of the target dataset. We refer to this type of error as annotator bias, meaning that annotators of different datasets place keypoints in slightly different locations. Therefore, the supervised metrics will tend to overestimate the error.
Conversely, SuperAnimal models can be used to monitor annotator bias, as the model's predictions are consistent across frames while, in many cases, human annotators label keypoints inconsistently.

Supervised metrics do not capture the richness of SuperAnimal models

In the pose estimation literature, works mostly report supervised metrics (RMSE, normalized error, and mAP). What these metrics have in common is that they do not penalize predictions for keypoints that are not annotated in the dataset. In contrast to other pose models, our SuperAnimal models can predict keypoints that are not annotated in the labeled dataset. For instance, if we only apply supervised metrics to evaluate SuperAnimal models, catastrophic forgetting is not detected, as the metrics do not penalize keypoint predictions that are not annotated.

Figure 1.
Figure 1. The DeepLabCut Model Zoo, the SuperAnimal method, and SuperAnimal-TopViewMouse model performance. a: The website can collect data shared by the research community; SuperAnimal models are trained, and can be used for inference on novel images and videos with or without further training. b: The panoptic animal pose estimation approach unifies the vocabulary of pose data across labs, such that each individual dataset is a subset of a super-set keypoint space, independently of its naming. c: For canonical, task-agnostic transfer learning, the encoder learns universal visual features from ImageNet, and a randomly initialized decoder is used to learn the pose from the downstream dataset. For task-aware fine-tuning, both encoder and decoder learn task-related visual-pose features on the pre-training datasets and the decoder is fine-tuned to update pose priors on downstream datasets. Crucially, the network has pose-estimation-specific weights. d: Memory replay combines the strengths of SuperAnimal models' zero-shot inference, the data combination strategy, and leveraging labeled data for fine-tuning (if needed). It achieves better data efficiency. e: Data efficiency of the baseline (ImageNet) and various SuperAnimal fine-tuning methods on the DLC-Openfield OOD dataset. The grey shadow represents the minimum and maximum, and the blue dash the mean, of zero-shot performance across three shuffles. Large, connected dots represent mean results across three shuffles and smaller dots represent results for individual shuffles. f: Using memory replay avoids catastrophic forgetting; here all keypoints are predicted. g: Top: SuperAnimal-TopViewMouse qualitative results on within-distribution test images (IID). They were randomly selected based on visibility of the keypoints within the figure (but not on performance). The full keypoint color mapping is available in Extended Data Figure S1. h: Top: Transformers also perform well on OOD tasks. Bottom: visualization of model performance on OOD images using DLCRNet.

Figure 2. SuperAnimal-Quadruped. a: Conceptual diagram of top-down detection followed by CNN-based pose estimation (here we used HRNet), and of the AnimalTokenPose architecture (see Methods and the diagram in Extended Data Figure S4a-d). b: Qualitative SuperAnimal-Quadruped (HRNet) performance on images randomly selected based on visibility of the keypoints within the figure (but not on performance). A likelihood cutoff of 0.65 was applied for keypoint visualization. The full keypoint color mapping is available in Extended Data Figure S1. c: Performance on the OOD Horse-10 dataset, using the official IID and OOD splits, reported as error normalized by the eye-to-nose distance; see inset adopted from (16). Results are with HRNet, unless noted. d: Performance on the OOD iRodent dataset, reported as a normalized bounding-box error. Colors as in c, with additions noted in c. e: Performance on the OOD AP-10K dataset, reported as mAP; legend to the right of f. f: Performance on AnimalPose (AP), where AP was removed from the SuperAnimal-Quadruped base model, reported as mAP. f, g, h: Qualitative performance on various OOD datasets, as noted and selected as in b. A likelihood cutoff of 0.6 was applied for keypoint visualization in all datasets except for iRodent, where 0.5 was used.

Figure 3. Unsupervised video adaptation methods. a: Conceptual diagram illustrating that the spatial-pyramid search leverages predictions from multiple resolutions. b: The SuperAnimal-TopViewMouse model was used to infer poses on three OOD videos. Visual inspection compares zero-shot inference with vs. without the spatial-pyramid search. c: Quantification of performance with and without the spatial-pyramid search. d: Illustration of the unsupervised video adaptation algorithm. e-h: Animal size described by the convex hull of the keypoints. Frequent changes of the convex hull indicate non-smooth keypoint predictions; below are example images with and without video adaptation showing the trailing keypoints for the 10 past frames (to demonstrate motion smoothness). i-j: Same as e-h, with additional examples of median filtering applied after video adaptation (dark purple line).

Figure 4. Zero-shot behavioral quantification with SuperAnimal. a: Workflow overview for behavioral analysis with SuperAnimal. b: Images of the open-source dataset from Sturman et al. (15) with their DeepLabCut "in distribution" model and our SuperAnimal zero-shot, out-of-distribution, results. c: Ethogram comparing ground-truth annotations vs. zero-shot predictions from SuperAnimal-TopViewMouse. d: F1 score computed for IID (Sturman) and SuperAnimal keypoints, with or without CEBRA, on the two behavioral classes. e: CEBRA (36) embedding on Sturman keypoints and SuperAnimal-based keypoints in 3D, transformed with FastICA for visualization. f: Correlation matrix showing the correlation between SuperAnimal-TopViewMouse predictions and ground-truth annotations, averaged across 3 annotators and across the model and keypoint configurations. g: We analyzed 30 horse videos where every frame had a ground-truth (GT) annotation of keypoints (16) (left) vs. our SuperAnimal-Quadruped model (representative of performance on most videos, right). Light grey zones denote the swing phase and black denotes the stance phase, for both ground truth (left) and classified (right). Only the right legs (closest to the camera) are shown. h: Histogram delineating the number of videos where the ground contacts by the hooves were identical to the GT vs. over- or under-counted by 1 stride (no error larger than 1 was found). i: We computed the error between the GT stride length and the model prediction for the hooves (i.e., right_back_paw vs. Offhindfoot, etc.). Each dot represents a stride; color denotes hindlimb vs. forelimb; near legs only.

Figure S2. Dataset and WebApp considerations. a: The affinity matrix represents the semantic similarity between keypoints defined by the model and keypoints defined by the dataset annotations across images. The affinity matrix is obtained by hard voting; the vote per image is obtained via the pairwise Euclidean distances between the SuperAnimal-Quadruped model's zero-shot predictions and the Horse-30 dataset ground truth. b: Affinity matrix for the Golden Lab Mouse (see Methods) video (bottom of Figure 3), where we deliberately tried to match the keypoint space to the model's zero-shot predictions. The noise in the affinity matrix suggests annotator bias for hard keypoints (e.g., points along the tail, where the exact position is not visually well defined, as opposed to, say, the nose). For this analysis we annotated 20 frames of the Golden Lab Mouse data to illustrate our matching process. c: Examples of OOD failures from several datasets, as noted, from our SuperAnimal-Quadruped model. d: Top left: An example of the current WebApp interface at contrib.deeplabcut.org. Users can add and edit the annotations from images we collect, following an anatomical figure that indicates the expected location of bodyparts. Top right: Example of the current Gradio App on HuggingFace. Bottom left: our current stand-alone GUI for local computer use, showing a simple ModelZoo with SuperAnimal weights. Bottom right: example of the Google Colaboratory interface with ModelZoo inference using SuperAnimal weights.

Figure S3. Quadruped dataset creation and additional performance results on held-out (OOD) data. a: Visual illustration of the datasets that compose Quadruped-40K (see Extended Data Figure S1 for TopViewMouse). b: Zero-shot examples and performance quantification. SuperAnimal models outperform ImageNet pretraining, independently of the data (image) split or framework used (here we used HRNet implemented within mmpose). Here, the elephant, giraffe, iRodent, and two mouse datasets were each held out from model training; the models still show excellent zero-shot performance (higher mAP) and good fine-tuning with as little as 1% of added data.

Figure S4. SuperAnimal pre-trained transformers for animal pose estimation. a: Vision Transformer architecture with an additional MLP head for heatmap estimation. In our experiments we use ViT-Base with 12 heads, 12 layers, and a hidden size of 768. b: Transformer encoder architecture. c: MLP head architecture. d: The model complexity for the transformer models and HRNet. e: Quantification of video adaptation in terms of jitter and keypoint dropping (see Methods) across the four test videos. Grey is without and pink is with video adaptation. Quartiles are shown as dashed lines. f: Kinematic plots as in Figure 4g, for cases where SuperAnimal-Quadruped tracking performance was poor.
Motivated by the success of vision transformers (19) on human pose estimation tasks (21), we assessed ViT's zero-shot performance. We conducted experiments with the original ViT architecture in three setups: with masked auto-encoder (MAE) (63) initialization, DeiT (64) initialization, and truncated normal initialization with a standard deviation of 0.02 and a mean of 0. Following the original setup (19), we did not use a convolutional backbone. The input image of size 224 × 224 was split into patches of 16 × 16 pixels, the depth of the transformer encoder was 12, and each attention layer had 12 heads with a feature dimension of 768. It was crucial to use a pre-trained vision transformer; without pre-training, the model did not converge for either dataset (data not shown).
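For reference, a minimal PyTorch sketch of this configuration (16 × 16 patches of a 224 × 224 image, 12 layers, 12 heads, 768-dim features, and a simple head producing one heatmap per keypoint) is given below. It is not the AnimalTokenPose implementation; the class name, head design, and heatmap resolution are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTPoseSketch(nn.Module):
    """ViT-Base-style encoder (16x16 patches, 12 layers, 12 heads, 768-dim)
    with a simple head that regresses one heatmap per keypoint."""
    def __init__(self, n_keypoints=27, img_size=224, patch=16,
                 dim=768, depth=12, heads=12, heatmap_size=56):
        super().__init__()
        self.grid = img_size // patch                           # 14 x 14 patches
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_keypoints)                 # per-token keypoint scores
        self.n_keypoints = n_keypoints
        self.heatmap_size = heatmap_size

    def forward(self, x):                                       # x: (B, 3, 224, 224)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        tokens = self.encoder(tokens)                           # (B, 196, 768)
        maps = self.head(tokens).transpose(1, 2).reshape(
            x.shape[0], self.n_keypoints, self.grid, self.grid)
        return F.interpolate(maps, size=self.heatmap_size,
                             mode="bilinear", align_corners=False)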
Test-time spatial-pyramid search. Even though our model sees images of various resolutions during training, it may only perform best if the apparent size of the animal is close to that of the training images. For inference on OOD data, we propose to search over candidate resolutions at test time and aggregate the good candidates into the final predictions.
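A minimal sketch of such a search is given below; the confidence-weighted averaging over resolutions is a simplification for illustration rather than the exact aggregation rule used in our method, and predict is a placeholder for any pose model that returns keypoints and confidences.

import numpy as np
import cv2

def pyramid_predict(image, predict, scales=(0.5, 0.75, 1.0, 1.25, 1.5),
                    conf_threshold=0.5):
    """predict(img) is assumed to return (keypoints of shape (K, 2),
    confidences of shape (K,)) in the coordinate frame of img."""
    all_kpts, all_conf = [], []
    for s in scales:
        resized = cv2.resize(image, None, fx=s, fy=s)
        kpts, conf = predict(resized)
        all_kpts.append(kpts / s)                     # map back to original coordinates
        all_conf.append(conf)
    kpts = np.stack(all_kpts)                         # (n_scales, K, 2)
    conf = np.stack(all_conf)                         # (n_scales, K)
    # Keep only confident candidates and take a confidence-weighted average.
    w = np.where(conf >= conf_threshold, conf, 0.0)
    w = w / np.clip(w.sum(axis=0, keepdims=True), 1e-8, None)
    return (kpts * w[..., None]).sum(axis=0), conf.max(axis=0)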

Table 1. We used the following k values per bodypart for the SuperAnimal-Quadruped evaluation.
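For context, if these per-bodypart k values are used as the per-keypoint constants of a COCO-style object keypoint similarity (OKS), from which mAP is computed, the similarity for a single animal would look like the sketch below (this usage is an assumption for illustration; the function is not our evaluation code).

import numpy as np

def oks(pred, gt, visible, k, area):
    """pred, gt: (K, 2); visible: (K,) boolean; k: (K,) per-bodypart constants;
    area: object scale (e.g., bounding-box area) used for normalization."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    sim = np.exp(-d2 / (2.0 * area * k ** 2 + np.finfo(float).eps))
    return float(sim[visible].mean()) if visible.any() else 0.0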

Table S7. Type-III Analysis of Variance table for the Horse-10 OOD mixed model.

Table S9. Type-III Analysis of Variance table for the DLC-Openfield mixed model.