Fast animal pose estimation using deep neural networks


The need for automated and efficient systems for tracking full animal pose has increased with the complexity of behavioral data and analyses. Here we introduce LEAP (LEAP estimates animal pose), a deep-learning-based method for predicting the positions of animal body parts. This framework consists of a graphical interface for labeling of body parts and training the network. LEAP offers fast prediction on new data, and training with as few as 100 frames results in 95% of peak performance. We validated LEAP using videos of freely behaving fruit flies and tracked 32 distinct points to describe the pose of the head, body, wings and legs, with an error rate of <3% of body length. We recapitulated reported findings on insect gait dynamics and demonstrated LEAP’s applicability for unsupervised behavioral classification. Finally, we extended the method to more challenging imaging situations and videos of freely moving mice.


Fig. 1: Body-part tracking via LEAP, a deep learning framework for animal pose estimation.
Fig. 2: LEAP is accurate and requires little training or labeled data.
Fig. 3: LEAP recapitulates known gait patterning in flies.
Fig. 4: Unsupervised embedding of body position dynamics.
Fig. 5: Locomotor clusters in behavior space separate distinct gait modes.
Fig. 6: LEAP generalizes to images with complex backgrounds or of other animals.

Data availability

The entire primary dataset of 59 aligned, high-resolution behavioral videos is available online for reproducibility or for further studies based on this method, together with labeled data to train and ground-truth the networks, the pre-trained networks used for all analyses, and estimated body-part positions for all 21 million frames. This dataset (~170 GiB) is freely available. Data from additional fly and mouse datasets used in Fig. 6 can be made available upon reasonable request.


References

  1. Anderson, D. J. & Perona, P. Toward a science of computational ethology. Neuron 84, 18–31 (2014).

  2. Szigeti, B., Stone, T. & Webb, B. Inconsistencies in C. elegans behavioural annotation. Preprint at bioRxiv (2016).

  3. Branson, K., Robie, A. A., Bender, J., Perona, P. & Dickinson, M. H. High-throughput ethomics in large groups of Drosophila. Nat. Methods 6, 451–457 (2009).

  4. Swierczek, N. A., Giles, A. C., Rankin, C. H. & Kerr, R. A. High-throughput behavioral analysis in C. elegans. Nat. Methods 8, 592–598 (2011).

  5. Deng, Y., Coen, P., Sun, M. & Shaevitz, J. W. Efficient multiple object tracking using mutually repulsive active membranes. PLoS ONE 8, e65769 (2013).

  6. Dankert, H., Wang, L., Hoopfer, E. D., Anderson, D. J. & Perona, P. Automated monitoring and analysis of social behavior in Drosophila. Nat. Methods 6, 297–303 (2009).

  7. Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S. & Branson, K. JAABA: interactive machine learning for automatic annotation of animal behavior. Nat. Methods 10, 64–67 (2013).

  8. Arthur, B. J., Sunayama-Morita, T., Coen, P., Murthy, M. & Stern, D. L. Multi-channel acoustic recording and automated analysis of Drosophila courtship songs. BMC Biol. 11, 11 (2013).

  9. Anderson, S. E., Dave, A. S. & Margoliash, D. Template-based automatic recognition of birdsong syllables from continuous recordings. J. Acoust. Soc. Am. 100, 1209–1219 (1996).

  10. Tachibana, R. O., Oosugi, N. & Okanoya, K. Semi-automatic classification of birdsong elements using a linear support vector machine. PLoS ONE 9, e92584 (2014).

  11. Berman, G. J., Choi, D. M., Bialek, W. & Shaevitz, J. W. Mapping the stereotyped behaviour of freely moving fruit flies. J. R. Soc. Interface 11, 20140672 (2014).

  12. Wiltschko, A. B. et al. Mapping sub-second structure in mouse behavior. Neuron 88, 1121–1135 (2015).

  13. Berman, G. J., Bialek, W. & Shaevitz, J. W. Predictability and hierarchy in Drosophila behavior. Proc. Natl Acad. Sci. USA 113, 11943–11948 (2016).

  14. Klibaite, U., Berman, G. J., Cande, J., Stern, D. L. & Shaevitz, J. W. An unsupervised method for quantifying the behavior of paired animals. Phys. Biol. 14, 015006 (2017).

  15. Wang, Q. et al. The PSI-U1 snRNP interaction regulates male mating behavior in Drosophila. Proc. Natl Acad. Sci. USA 113, 5269–5274 (2016).

  16. Vogelstein, J. T. et al. Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344, 386–392 (2014).

  17. Cande, J. et al. Optogenetic dissection of descending behavioral control in Drosophila. eLife 7, e34275 (2018).

  18. Uhlmann, V., Ramdya, P., Delgado-Gonzalo, R., Benton, R. & Unser, M. FlyLimbTracker: an active contour based approach for leg segment tracking in unmarked, freely behaving Drosophila. PLoS ONE 12, e0173433 (2017).

  19. Kain, J. et al. Leg-tracking and automated behavioural classification in Drosophila. Nat. Commun. 4, 1910 (2013).

  20. Machado, A. S., Darmohray, D. M., Fayad, J., Marques, H. G. & Carey, M. R. A quantitative framework for whole-body coordination reveals specific deficits in freely walking ataxic mice. eLife 4, e07892 (2015).

  21. Nashaat, M. A. et al. Pixying behavior: a versatile real-time and post hoc automated optical tracking method for freely moving and head fixed animals. eNeuro 4, e34275 (2017).

  22. Nanjappa, A. et al. Mouse pose estimation from depth images. arXiv Preprint at (2015).

  23. Nakamura, A. et al. Low-cost three-dimensional gait analysis system for mice with an infrared depth sensor. Neurosci. Res. 100, 55–62 (2015).

  24. Wang, Z., Mirbozorgi, S. A. & Ghovanloo, M. An automated behavior analysis system for freely moving rodents using depth image. Med. Biol. Eng. Comput. 56, 1807–1821 (2018).

  25. Mendes, C. S., Bartos, I., Akay, T., Márka, S. & Mann, R. S. Quantification of gait parameters in freely walking wild type and sensory deprived Drosophila melanogaster. eLife 2, e00231 (2013).

  26. Mendes, C. S. et al. Quantification of gait parameters in freely walking rodents. BMC Biol. 13, 50 (2015).

  27. Petrou, G. & Webb, B. Detailed tracking of body and leg movements of a freely walking female cricket during phonotaxis. J. Neurosci. Methods 203, 56–68 (2012).

  28. Toshev, A. & Szegedy, C. DeepPose: human pose estimation via deep neural networks. arXiv Preprint at (2013).

  29. Tompson, J. J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems Vol. 27 (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. & Weinberger, K. Q.) 1799–1807 (Curran Associates, Inc., Red Hook, 2014).

  30. Carreira, J., Agrawal, P., Fragkiadaki, K. & Malik, J. Human pose estimation with iterative error feedback. arXiv Preprint at (2015).

  31. Wei, S.-E., Ramakrishna, V., Kanade, T. & Sheikh, Y. Convolutional pose machines. arXiv Preprint at (2016).

  32. Bulat, A. & Tzimiropoulos, G. Human pose estimation via convolutional part heatmap regression. arXiv Preprint at (2016).

  33. Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. arXiv Preprint at (2016).

  34. Tome, D., Russell, C. & Agapito, L. Lifting from the deep: convolutional 3D pose estimation from a single image. arXiv Preprint at (2017).

  35. Shelhamer, E., Long, J. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651 (2017).

  36. Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 234–241 (Springer International Publishing, Cham, Switzerland, 2015).

  37. Lin, T.-Y. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014 740–755 (Springer International Publishing, Cham, Switzerland, 2014).

  38. Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2D human pose estimation: new benchmark and state of the art analysis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3686–3693 (IEEE Computer Society, 2014).

  39. Güler, R. A., Neverova, N. & Kokkinos, I. DensePose: dense human pose estimation in the wild. arXiv Preprint at (2018).

  40. Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018).

  41. Isakov, A. et al. Recovery of locomotion after injury in Drosophila melanogaster depends on proprioception. J. Exp. Biol. 219, 1760–1771 (2016).

  42. Wosnitza, A., Bockemühl, T., Dübbert, M., Scholz, H. & Büschges, A. Inter-leg coordination in the control of walking speed in Drosophila. J. Exp. Biol. 216, 480–491 (2013).

  43. Qiao, B., Li, C., Allen, V. W., Shirasu-Hiza, M. & Syed, S. Automated analysis of long-term grooming behavior in Drosophila using a k-nearest neighbors classifier. eLife 7, e34497 (2018).

  44. Dombeck, D. A., Khabbaz, A. N., Collman, F., Adelman, T. L. & Tank, D. W. Imaging large-scale neural activity with cellular resolution in awake, mobile mice. Neuron 56, 43–57 (2007).

  45. Seelig, J. D. & Jayaraman, V. Neural dynamics for landmark orientation and angular path integration. Nature 521, 186–191 (2015).

  46. Pérez-Escudero, A., Vicente-Page, J., Hinz, R. C., Arganda, S. & de Polavieja, G. G. idTracker: tracking individuals in a group by automatic identification of unmarked animals. Nat. Methods 11, 743–748 (2014).

  47. Newell, A., Yang, K. & Deng, J. Stacked hourglass networks for human pose estimation. arXiv Preprint at (2016).

  48. Chyb, S. & Gompel, N. Atlas of Drosophila Morphology: Wild-type and Classical Mutants (Academic Press, London, Waltham and San Diego, 2013).

  49. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. arXiv Preprint at (2014).

  50. Morel, P. Gramm: grammar of graphics plotting in MATLAB. J. Open Source Softw. 3, 568 (2018).

  51. Baum, L. E., Petrie, T., Soules, G. & Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Ann. Math. Stat. 41, 164–171 (1970).

  52. Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13, 260–269 (1967).

  53. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).



Acknowledgements

The authors acknowledge J. Pillow for discussions; B.C. Cho for contributions to the acquisition and preprocessing pipeline for mouse experiments; P. Chen for a previous version of a neural network for pose estimation that was useful in designing our method; H. Jang, M. Murugan, and I. Witten for feedback on the GUI and other discussions; G. Guan for assistance maintaining flies; and the Murthy, Shaevitz and Wang labs for general feedback. This work was supported by the NIH R01 NS104899-01 BRAIN Initiative Award and an NSF BRAIN Initiative EAGER Award (to M.M. and J.W.S.), NIH R01 MH115750 BRAIN Initiative Award (to S.S.-H.W. and J.W.S.), the Nancy Lurie Marks Family Foundation and NIH R01 NS045193 (to S.S.-H.W.), an HHMI Faculty Scholar Award (to M.M.), NSF GRFP DGE-1148900 (to T.D.P.), and the Center for the Physics of Biological Function sponsored by the National Science Foundation (NSF PHY-1734030).

Author information

Authors and Affiliations



Contributions

T.D.P., D.E.A., S.S.-H.W., J.W.S. and M.M. designed the study. T.D.P., D.E.A., L.W. and M.K. conducted experiments. T.D.P. and D.E.A. developed the GUI and analyzed data. T.D.P., D.E.A., J.W.S. and M.M. wrote the manuscript.

Corresponding authors

Correspondence to Mala Murthy or Joshua W. Shaevitz.

Ethics declarations

Competing interests

T.D.P., D.E.A., J.W.S. and M.M. are named as inventors on US provisional patent no. 62/741,643 filed by Princeton University.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Rotational invariance is learned at the cost of prediction accuracy.

a, The augmentation procedure consists of random rotations about the center of egocentrically aligned labeled frames. Labeled frames are split into training, validation and test sets. Colors are used to indicate unique images. Only training and validation sets are augmented and used for training. During training, images are drawn sequentially from the training and validation sets to form batches of 32 images, cycling back to the beginning if there are fewer images than required, and then rotated randomly within a range of angles. After each epoch, the ordering of the datasets is shuffled so as to create new combinations of batches. The test set images are not augmented before computing accuracy metrics reported throughout. b, Egocentric alignment accuracy of the preprocessing algorithm from ref. 9 when compared to manual labels of head and thorax. The error is the absolute deviation of the angle formed between the thorax and head from the horizontal centerline in the image. c, The accuracy measured as the r.m.s. error of position estimates when evaluated on data artificially rotated at a fixed angle (rows) with networks trained on data augmented by rotations between a range of angles (columns). Red boxes denote the best accuracy for each data angle.
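The batching and rotation scheme in panel a can be sketched in a few lines of Python. This is a minimal illustration, not the LEAP implementation: the function names, the ±15° default range and the use of label coordinates instead of pixel data are all assumptions made for the example. In practice the image would be rotated by the same angle as its labels.

```python
import math
import random


def rotate_points(points, angle_deg, center):
    """Rotate (x, y) label coordinates about `center` by `angle_deg` degrees.

    The rotation is counterclockwise in standard math axes; the visual sense
    flips when y points down, as in image coordinates.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        (cx + cos_t * (x - cx) - sin_t * (y - cy),
         cy + sin_t * (x - cx) + cos_t * (y - cy))
        for x, y in points
    ]


def make_batches(order, batch_size=32):
    """Draw items sequentially from `order`, cycling back to the start
    when fewer items remain than a full batch requires."""
    n = len(order)
    n_batches = math.ceil(n / batch_size)
    return [
        [order[(b * batch_size + i) % n] for i in range(batch_size)]
        for b in range(n_batches)
    ]


def training_epochs(n_frames, n_epochs, rng, batch_size=32, max_angle=15.0):
    """Yield (epoch, batch) pairs; each batch lists (frame_index, angle).

    `max_angle` is a hypothetical rotation range chosen for illustration.
    """
    order = list(range(n_frames))
    for epoch in range(n_epochs):
        for batch in make_batches(order, batch_size):
            yield epoch, [(i, rng.uniform(-max_angle, max_angle)) for i in batch]
        rng.shuffle(order)  # reshuffle after each epoch for new batch combinations
```

With 40 frames and a batch size of 32, the second batch wraps around and reuses the first 24 frames, which is the cycling behavior the caption describes.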

Supplementary Figure 2 Cluster sampling to promote pose diversity in the labeling dataset.

a, PCA of unlabeled images captures 80% of the variance in the data (gray line; n = 29,500 images) within 50 components (blue bars). b, Top PCA eigenmodes visualized as coefficient images. Red and blue shading denote positive and negative coefficients at each pixel, respectively. Areas of similar colors indicate correlated pixel intensities within a given mode. c, Cluster centroids identified by k-means after PCA. Red and blue shading denote pixels with higher or lower intensity than the overall mean, respectively.
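A rough sketch of cluster sampling follows, assuming each image has already been reduced to a short vector of PCA scores. The k-means routine is a plain Lloyd iteration written for illustration; the function names, the explicit `init_centroids` argument and the equal per-cluster quota are assumptions, not details taken from the figure.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def sample_per_cluster(labels, per_cluster, rng):
    """Pick up to `per_cluster` frame indices from every cluster."""
    chosen = []
    for j in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        chosen.extend(rng.sample(idx, min(per_cluster, len(idx))))
    return sorted(chosen)
```

Drawing the same number of frames from every cluster keeps rare poses from being swamped by common ones in the labeling set, which is the stated purpose of the procedure.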

Supplementary Figure 3 Comparison of neural network architecture.

a, Diagram of our neural network architecture. Raw images are provided as input to the network, which then computes a set of confidence maps of the same height and width as the input image (top row). The network consists of a set of convolutions, max pooling and transposed convolutions whose weights are learned during training (top middle). Estimated confidence maps are compared to ground truth maps generated from user labels using a mean squared error loss function, which is then minimized during training (bottom row). b, Accuracy comparison between network architectures. We compared the accuracy of our architecture to the hourglass and stacked hourglass versions of the network described in ref. 10. The accuracy of our network is equivalent to or better than that achieved when training with the hourglass (over all 32 body parts, n = 300 held-out frames, P < 1 × 10^−10, Wilcoxon rank sum test, one-tailed, z = −74.65) and stacked hourglass (over all 32 body parts, n = 300 held-out frames, P < 1 × 10^−10, Wilcoxon rank sum test, one-tailed, z = −53.21) versions. Dots and error bars denote median and 25th and 75th percentiles; violin plots denote full distributions of errors.
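The confidence-map target and loss in panel a can be illustrated with a small sketch. It assumes that each ground truth map is a 2D Gaussian centered on the user label and that a position is read out as the location of the map's global maximum. Both choices, along with the `sigma` value and the function names, are assumptions made for this example.

```python
import math


def confidence_map(height, width, x, y, sigma=2.0):
    """Ground truth map: a Gaussian bump centered on the labeled (x, y).

    `sigma` (in pixels) is a hypothetical value chosen for illustration.
    """
    return [
        [math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2 * sigma ** 2))
         for c in range(width)]
        for r in range(height)
    ]


def mse(map_a, map_b):
    """Mean squared error between two maps of identical shape."""
    n = len(map_a) * len(map_a[0])
    return sum((u - v) ** 2
               for row_a, row_b in zip(map_a, map_b)
               for u, v in zip(row_a, row_b)) / n


def peak(cmap):
    """Return the (x, y) pixel location of the global maximum."""
    row, col = max(
        ((r, c) for r in range(len(cmap)) for c in range(len(cmap[0]))),
        key=lambda rc: cmap[rc[0]][rc[1]],
    )
    return col, row
```

For example, `peak(confidence_map(8, 10, 6, 3))` recovers the labeled position, and the loss between a target map and a copy shifted by one pixel is small but nonzero, so minimizing it pulls the predicted peak toward the label.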

Supplementary Figure 4 User-defined skeleton.

The 32 selected points approximately match the set of visible joints and interest points in the anatomy of the animal.

Supplementary Figure 5 Estimation accuracy improves with few samples.

a,b, Error distance distributions per body part when estimated with networks trained for 15 epochs on 10 (a) or 250 (b) labeled frames. c, Time spent labeling each frame decreases with the quality of initialization. Line and shaded regions correspond to mean and s.d., respectively. Starting frames require 115.4 ± 45.0 (mean ± s.d.) seconds to label, decreasing to 6.1 ± 7.7 s after initialization with a network trained on 1,000 labeled frames (n = 1,500 total labeled frames). d, Accuracy improvements are observed with very few labeled samples. A plateau is observed at around 150–200 frames, with marginal improvements with additional labeling. Circles denote the test set r.m.s. error for one replicate of fast training (15 epochs) at each dataset size; lines denote mean of all replicates.
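The r.m.s. error plotted in panel d can be computed along the following lines. The caption does not spell out how errors are pooled, so this sketch assumes the root of the mean squared Euclidean distance over all body-part estimates. The `body_length` normalization mirrors the abstract's "<3% of body length" figure and is a hypothetical parameter here.

```python
import math


def rms_error(predicted, truth):
    """Root-mean-square Euclidean distance between paired (x, y) points."""
    squared = [(px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))


def relative_error(predicted, truth, body_length):
    """r.m.s. error expressed as a fraction of body length (same units)."""
    return rms_error(predicted, truth) / body_length
```

Calling `rms_error` once per training replicate and dataset size gives the circles in a learning curve of this kind, and averaging over replicates gives the line.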

Supplementary Figure 6 Comparison of behavioral space distributions generated from compressed images versus body-part positions.

a, Behavioral space distribution from 59 male flies calculated using the original MotionMapper pipeline (data and pipeline from ref. 12), including Radon-transform compression and PCA-based projection onto the first 50 principal components followed by a nonlinear embedding of the resultant spectrograms. b, Behavioral space distribution from 59 male flies (data and pipeline from ref. 12) calculated using spectrograms generated from tracked body-part positions rather than PCA modes (see Methods). c, Joint probability distribution of the cluster labels from a and b, sorted by row and column peaks.
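The joint distribution in panel c compares two cluster labelings of the same frames. A minimal sketch is shown below, assuming both labelings are given as equal-length sequences with one label per frame. Ordering rows by the column in which each row peaks is one possible reading of "sorted by row and column peaks", and the function names are illustrative.

```python
from collections import Counter


def joint_distribution(labels_a, labels_b):
    """Joint probability table of two per-frame labelings.

    Returns (row_labels, column_labels, table) with table[i][j] equal to
    the fraction of frames labeled rows[i] in a and cols[j] in b.
    """
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    counts = Counter(zip(labels_a, labels_b))
    n = len(labels_a)
    table = [[counts[(r, c)] / n for c in cols] for r in rows]
    return rows, cols, table


def sort_rows_by_peak(rows, table):
    """Reorder rows by the column index at which each row is largest."""
    order = sorted(range(len(rows)),
                   key=lambda i: table[i].index(max(table[i])))
    return [rows[i] for i in order], [table[i] for i in order]
```

After sorting, a labeling that agrees well with the other produces mass concentrated near the diagonal of the table.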

Supplementary Figure 7 Generalization to more diverse morphologies with a single network trades off with accuracy.

a,b, Male and female flies differ in anatomical morphology, in part because of differences in their body length. a, The males more often extend their wings as they are used to produce courtship song. b, The females rarely extend their wings in this context. c, Training on labeled images of just males results in high accuracy on male test set images. d, Training on both males and females still results in high accuracy on male test set images. e, Quantification of r.m.s. error on the male test set shows that generalization to two different morphologies increases the error metric. Circles denote training replicates, diamonds denote median r.m.s. error for all replicates and solid and open markers correspond to specialized and generalized training, respectively.

Supplementary information

Supplementary Text and Figures

Supplementary Figs. 1–7 and Supplementary Results

Reporting Summary

Supplementary Video 1

Body-part tracking is reliable over long periods without temporal constraints. Raw images (left), max projection of all confidence maps (center) and tracked images (right) during a 20-s bout of free movement. Video playback at ×0.2 real-time speed.

Supplementary Video 2

Body-part tracking during free-moving locomotion. Raw images (left), max projection of all confidence maps (center) and tracked images (right) during a bout of locomotion. Video playback at ×0.15 real-time speed. Video corresponds to Fig. 1d.

Supplementary Video 3

Body-part tracking during head grooming. Raw images (left), max projection of all confidence maps (center), and tracked images (right) during a bout of head grooming. Video playback at ×0.15 real-time speed. Video corresponds to Fig. 1e.

Supplementary Video 4

Tracking joints robustly in images with heterogeneous background and noisy segmentation. Raw images (left), max projection of all confidence maps (center) and tracked images (right) of a freely moving courting male fly. Rows correspond to results from a network trained on unmasked and masked images, respectively. Video playback at ×0.2 real-time speed.

Supplementary Video 5

Tracking joints in freely moving rodents. Raw images (left), max projection of all confidence maps (center) and tracked images (right) of a freely moving mouse in an open field arena imaged from below through a clear acrylic floor. Video playback at ×0.2 real-time speed. Tracking is reliable over time but degrades when certain parts are occluded, such as when the animal rears.

Supplementary Software

LEAP (LEAP estimates animal pose) software for estimation of animal body-part position.

Cite this article

Pereira, T.D., Aldarondo, D.E., Willmore, L. et al. Fast animal pose estimation using deep neural networks. Nat Methods 16, 117–125 (2019).
