Deep-learning-based identification, tracking, pose estimation and behaviour classification of interacting primates and mice in complex environments

A preprint version of the article is available at bioRxiv.


Quantification of behaviours of interest from video data is commonly used to study brain function, the effects of pharmacological interventions and genetic alterations. Existing approaches lack the capability to analyse the behaviour of groups of animals in complex environments. We present a novel deep learning architecture that classifies individual and social animal behaviour directly from raw video frames, even in complex environments, and requires no intervention after initial human supervision. Our behavioural classifier is embedded in a pipeline (SIPEC) that performs segmentation, identification, pose estimation and classification of complex behaviour, outperforming the state of the art. SIPEC successfully recognizes multiple behaviours of freely moving individual mice, as well as socially interacting non-human primates in three dimensions, using data only from simple mono-vision cameras in home-cage set-ups.


Fig. 1: Overview of the SIPEC workflow and modules.
Fig. 2: Performance of SIPEC:SegNet, SIPEC:PoseNet and SIPEC:IdNet under demanding video conditions while using few labels.
Fig. 3: SIPEC:BehaveNet outperforms DLC.
Fig. 4: SIPEC can recognize social interactions of multiple primates and infer their three-dimensional positions using a single camera.

Data availability

Mouse data from Sturman and colleagues20 are available online. Example mouse data for training are available through our GitHub repository. The primate videos are available to the scientific community on request to V.M.

Code availability

We provide the code for SIPEC, as well as the GUI for the identification of animals, online.


  1. Datta, S. R., Anderson, D. J., Branson, K., Perona, P. & Leifer, A. Computational neuroethology: a call to action. Neuron 104, 11–24 (2019).

  2. Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018).

  3. Geuther, B. Q. et al. Robust mouse tracking in complex environments using neural networks. Commun. Biol. 2, 124 (2019).

  4. Romero-Ferrero, F., Bergomi, M. G., Hinz, R. C., Heras, F. J. & de Polavieja, G. Tracking all individuals in small or large collectives of unmarked animals. Nat. Methods 16, 179 (2019).

  5. Forys, B. J., Xiao, D., Gupta, P. & Murphy, T. H. Real-time selective markerless tracking of forepaws of head fixed mice using deep neural networks. eNeuro 7, ENEURO.0096-20.2020 (2020).

  6. Pereira, T. D. et al. Fast animal pose estimation using deep neural networks. Nat. Methods 16, 117 (2019).

  7. Graving, J. M. et al. DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife 8, e47994 (2019).

  8. Bala, P. C. et al. Automated markerless pose estimation in freely moving macaques with OpenMonkeyStudio. Nat. Commun. 11, 4560 (2020).

  9. Günel, S. et al. DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila. eLife 8, e48571 (2019).

  10. Chen, Z. et al. AlphaTracker: a multi-animal tracking and behavioral analysis tool. Preprint (2020).

  11. Lauer, J. et al. Multi-animal pose estimation and tracking with DeepLabCut. Preprint (2021).

  12. Wiltschko, A. B. et al. Mapping sub-second structure in mouse behavior. Neuron 88, 1121–1135 (2015).

  13. Hsu, A. I. & Yttri, E. A. B-SOiD: an open source unsupervised algorithm for discovery of spontaneous behaviors. Nat. Commun. 12, 5188 (2021).

  14. Berman, G. J., Choi, D. M., Bialek, W. & Shaevitz, J. W. Mapping the stereotyped behaviour of freely moving fruit flies. J. R. Soc. Interface 11, 20140672 (2014).

  15. Whiteway, M. R. et al. Partitioning variability in animal behavioral videos using semi-supervised variational autoencoders. PLoS Comput. Biol. 17, e1009439 (2021).

  16. Calhoun, A. J., Pillow, J. W. & Murthy, M. Unsupervised identification of the internal states that shape natural behavior. Nat. Neurosci. 22, 2040–2049 (2019).

  17. Batty, E. et al. BehaveNet: Nonlinear Embedding and Bayesian Neural Decoding of Behavioral Videos (NeurIPS, 2019).

  18. Nilsson, S. R. et al. Simple behavioral analysis (SimBA)—an open source toolkit for computer classification of complex social behaviors in experimental animals. Preprint (2020).

  19. Segalin, C. et al. The Mouse Action Recognition System (MARS) software pipeline for automated analysis of social behaviors in mice. eLife 10, e63720 (2021).

  20. Sturman, O. et al. Deep learning-based behavioral analysis reaches human accuracy and is capable of outperforming commercial solutions. Neuropsychopharmacology 45, 1942–1952 (2020).

  21. Nourizonoz, A. et al. EthoLoop: automated closed-loop neuroethology in naturalistic environments. Nat. Methods 17, 1052–1059 (2020).

  22. Branson, K., Robie, A. A., Bender, J., Perona, P. & Dickinson, M. H. High-throughput ethomics in large groups of Drosophila. Nat. Methods 6, 451–457 (2009).

  23. Dankert, H., Wang, L., Hoopfer, E. D., Anderson, D. J. & Perona, P. Automated monitoring and analysis of social behavior in Drosophila. Nat. Methods 6, 297–303 (2009).

  24. Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S. & Branson, K. JAABA: interactive machine learning for automatic annotation of animal behavior. Nat. Methods 10, 64 (2013).

  25. Jhuang, H. et al. Automated home-cage behavioural phenotyping of mice. Nat. Commun. 1, 68 (2010).

  26. Hayden, B. Y., Park, H. S. & Zimmermann, J. Automated pose estimation in primates. Am. J. Primatol. (2021).

  27. He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. In Proc. IEEE International Conference on Computer Vision 2961–2969 (IEEE, 2017).

  28. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 4700–4708 (IEEE, 2017).

  29. Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1724–1734 (Association for Computational Linguistics, 2014).

  30. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NeurIPS 2014 Workshop on Deep Learning (2014).

  31. Deb, D. et al. Face recognition: primates in the wild. Preprint (2018).

  32. Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (IEEE, 2017).

  33. Van den Oord, A. et al. WaveNet: a generative model for raw audio. Preprint (2016).

  34. Bai, S., Kolter, J. Z. & Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. Preprint (2018).

  35. Jung, A. B. et al. imgaug (GitHub, 2020).

  36. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 3320–3328 (NeurIPS, 2014).

  37. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  38. Lin, T.-Y. et al. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision 740–755 (Springer, 2014).

  39. Dutta, A. & Zisserman, A. The VIA annotation software for images, audio and video. In Proc. 27th ACM International Conference on Multimedia (ACM, 2019).

  40. Xiao, B., Wu, H. & Wei, Y. Simple baselines for human pose estimation and tracking. In Computer Vision – ECCV 2018 (eds. Ferrari, V., Hebert, M., Sminchisescu, C. & Weiss, Y.) 472–487 (Springer International Publishing, 2018).

  41. Tan, M. & Le, Q. V. EfficientNet: rethinking model scaling for convolutional neural networks. Preprint (2020).

  42. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 1097–1105 (NeurIPS, 2012).

  43. Vidal, M., Wolf, N., Rosenberg, B., Harris, B. P. & Mathis, A. Perspectives on individual animal identification from biology and computer vision. Integr. Comp. Biol. 61, 900–916 (2021).

  44. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006).

  45. Tenenbaum, J. B. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).

  46. Lin, T.-Y. et al. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 936–944 (IEEE, 2017).

  47. Girshick, R. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV) 1440–1448 (IEEE, 2015).

  48. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929−1958 (2014).

  49. Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning Vol. 37, 448–456 (2015).

  50. Maas, A. L., Hannun, A. Y. & Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In Proc. 30th International Conference on Machine Learning (ICML, 2013).

  51. Xu, B., Wang, N., Chen, T. & Li, M. Empirical evaluation of rectified activations in convolutional network. Preprint (2015).

  52. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR, 2014).

  53. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollar, P. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV, 2017).

  54. Bohnslav, J. P. et al. DeepEthogram: a machine learning pipeline for supervised behavior classification from raw pixels. eLife 10, e63377 (2021).

  55. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint (2016).

  56. Chollet, F. Keras (GitHub, 2015).


This project was funded by the Swiss Federal Institute of Technology (ETH) Zurich and the European Research Council (ERC) under the ERC Consolidator Award (grant no. 818179 to M.F.Y.), the SNSF (grant no. CRSII5_198739/1 to M.F.Y.; grant no. 310030_172889/1 to J.B.; grant no. PP00P3_157539 to V.M.), an ETH Research Grant (grant no. ETH-20 19-1 to J.B.), the 3RCC (grant no. OC-2019-009 to J.B. and M.F.Y.), the Simons Foundation (award nos. 328189 and 543013 to V.M.) and the Botnar Foundation (to J.B.). We would like to thank P. Tornmalm and V. de La Rochefoucauld for annotating primate data and providing feedback on primate behaviour, and P. Johnson, B. Yasar, B. Wu and A. Shah for helpful discussions and feedback.

Author information

M.M. developed, implemented and evaluated the SIPEC modules and framework. J.Q. developed segmentation filtering, tracking and three-dimensional estimation. M.M., W.B. and M.F.Y. wrote the manuscript. M.M., O.S., L.v.Z., S.K., W.B., V.M., J.B. and M.F.Y. conceptualized the study. All authors gave feedback on the manuscript.

Corresponding author

Correspondence to Mehmet Fatih Yanik.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Adam Kepecs and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Individual mouse segmentation.

For mice, SIPEC:SegNet performance in mAP, dice and IoU for a single mouse as a function of the number of labels. The lines indicate the means for 5-fold CV, while circles, squares and triangles indicate the mAP, dice and IoU, respectively, for individual folds. All data are represented by the mean, showing all points.

Extended Data Fig. 2 Identification performance of mice across days and interventions.

Identification accuracy across days for models trained on day 1. While the performance for the day the model is trained on is very high, it drops when tested on day 2, but remains significantly above chance level. When tested on day 3, after a forced swim test intervention, the performance drops significantly. All data are represented by the mean, showing all points.

Extended Data Fig. 3 Identification of typical vs difficult frames.

a) Examples of very difficult frames, which are beyond human single-frame recognition and are excluded from the 'typical'-frame evaluation. b) Example frames used for the 'typical'-frame analysis. c) Identification performance is significantly higher on 'typical' frames than on all frames. All data are represented by the mean, showing all points.

Extended Data Fig. 4 Additional behavioural evaluation.

a) The overall increased F1 score is caused by an increased recall for grooming events and an increased precision for unsupported rearing events. b) Comparison of F1 scores as well as Pearson correlations of SIPEC:BehaveNet with human-to-human performance, as well as with the combined model. Using pose estimates in conjunction with raw-pixel classification increases precision compared with raw-pixel classification alone, at the cost of a decrease in recall. All data are represented by a Tukey box-and-whisker plot, showing all points. Wilcoxon paired test: *P≤0.05; ***P≤0.001; ****P≤0.0001.

Extended Data Fig. 5 3D depth estimates based on mask size.

The inverse of the square root of the mask size (based on SIPEC:SegNet output) correlates highly with the depth of the individual in 3D space.
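
This relation follows from a pinhole-camera model: for an animal of roughly fixed physical size, the projected mask area scales as 1/z² with distance z from the camera, so z ∝ 1/√area. A minimal sketch of this idea is shown below; the function name and least-squares calibration are illustrative assumptions, not SIPEC's actual implementation.

```python
import numpy as np

def estimate_relative_depth(mask_areas, calib_areas=None, calib_depths=None):
    """Estimate depth from segmentation mask areas (in pixels).

    Under a pinhole-camera model, projected area ~ 1/z^2, so z ~ 1/sqrt(area).
    Illustrative sketch only; the paper's exact calibration may differ.
    """
    inv_sqrt = 1.0 / np.sqrt(np.asarray(mask_areas, dtype=float))
    if calib_areas is None:
        # Without calibration, depth is only defined up to an unknown scale.
        return inv_sqrt
    # Fit a single scale factor k by least squares: depth ~= k / sqrt(area)
    x = 1.0 / np.sqrt(np.asarray(calib_areas, dtype=float))
    k = np.dot(x, np.asarray(calib_depths, dtype=float)) / np.dot(x, x)
    return k * inv_sqrt
```

Given a few frames in which the true camera distance is known, the scale factor k turns the relative estimate into metric depth; without calibration, only relative ordering in depth is recovered.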

Extended Data Fig. 6 Comparison of counts of behaviours between SIPEC:BehaveNet, pose estimation based approach and human raters.

Unsupported and supported rears and grooming events were counted per video for n = 20 different mouse videos. Behaviours were integrated over multiple frames, as described in Sturman et al. Behavioural counts of three different human expert annotators were averaged ('human ground truth' in the legend). No significant differences were found when comparing the numbers of behaviours between SIPEC:BehaveNet and human annotators, or between Sturman et al. and human annotators (Tukey's multiple comparison test). All data are represented by the mean, showing all points.

Supplementary information

Supplementary Information

Supplementary Figs. 1–9 and Table 1.

Reporting Summary

Supplementary Video 1

A short example video of behaving primates in their home-cage environment. SIPEC:SegNet is used to mask the different primates and SIPEC:IdNet is used to identify them. During occlusions, the identity assigned to a primate can switch, but SIPEC:IdNet quickly recovers the correct identity over the next frames as the animal becomes more visible and therefore easier to identify.

Supplementary Video 2

A comparison of tracking four mice by an alternative tool (left) and SIPEC (right). We used the tool's publicly available data as well as its publicly available inference results for the tracking comparison. Left: the alternative tool exhibits prolonged label-switching errors, in which the labels of two or more animals are swapped for some time. Right: tracking is performed by SIPEC:SegNet in conjunction with greedy mask matching to track the identities of the animals. In this example video, SIPEC is more robust to these kinds of errors (see also Supplementary Video 4).

Supplementary Video 3

Tracking of four mice by SIPEC in an open-field test. The masks generated by SIPEC:SegNet in conjunction with greedy mask matching are used to robustly track identities of four mice in an open-field test (see Methods).
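
The greedy mask matching used for tracking can be sketched as follows: compute the intersection-over-union (IoU) between every pair of masks in consecutive frames, then repeatedly assign the highest-IoU pair until no sufficiently overlapping pairs remain. The function name and threshold below are illustrative assumptions, not SIPEC's actual API.

```python
import numpy as np

def greedy_mask_matching(prev_masks, curr_masks, iou_threshold=0.1):
    """Match segmentation masks across consecutive frames.

    prev_masks, curr_masks: lists of boolean arrays of identical shape.
    Returns a dict mapping previous-mask index -> current-mask index.
    Illustrative sketch; SIPEC's implementation may differ in detail.
    """
    # Pairwise IoU between all previous and current masks.
    ious = np.zeros((len(prev_masks), len(curr_masks)))
    for i, p in enumerate(prev_masks):
        for j, c in enumerate(curr_masks):
            inter = np.logical_and(p, c).sum()
            union = np.logical_or(p, c).sum()
            ious[i, j] = inter / union if union else 0.0
    matches, used_prev, used_curr = {}, set(), set()
    # Greedily take the best remaining pair, skipping already-matched masks
    # and pairs whose overlap falls below the threshold.
    for i, j in zip(*np.unravel_index(np.argsort(-ious, axis=None), ious.shape)):
        if i in used_prev or j in used_curr or ious[i, j] < iou_threshold:
            continue
        matches[i] = j
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

Because assignments are made in descending IoU order, a strongly overlapping pair cannot be pre-empted by a weaker one, which makes identities robust to brief partial occlusions between frames.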

Supplementary Video 4

SIPEC tracking over a 52 min video. We used publicly available data and tracked four mice. The masks generated by SIPEC:SegNet in conjunction with greedy mask matching are used to robustly track the identities of the four mice in an open-field test (see Methods).


About this article


Cite this article

Marks, M., Jin, Q., Sturman, O. et al. Deep-learning-based identification, tracking, pose estimation and behaviour classification of interacting primates and mice in complex environments. Nat Mach Intell 4, 331–340 (2022).
