Fast animal pose estimation using deep neural networks

Pereira, Talmo D.; Aldarondo, Diego E.; Willmore, Lindsay; Kislin, Mikhail; Wang, Samuel S.-H.; Murthy, Mala; Shaevitz, Joshua W.

doi:10.1038/s41592-018-0234-5

Article
Published: 20 December 2018

Nature Methods volume 16, pages 117–125 (2019)

Abstract

The need for automated and efficient systems for tracking full animal pose has increased with the complexity of behavioral data and analyses. Here we introduce LEAP (LEAP estimates animal pose), a deep-learning-based method for predicting the positions of animal body parts. This framework consists of a graphical interface for labeling of body parts and training the network. LEAP offers fast prediction on new data, and training with as few as 100 frames results in 95% of peak performance. We validated LEAP using videos of freely behaving fruit flies and tracked 32 distinct points to describe the pose of the head, body, wings and legs, with an error rate of <3% of body length. We recapitulated reported findings on insect gait dynamics and demonstrated LEAP’s applicability for unsupervised behavioral classification. Finally, we extended the method to more challenging imaging situations and videos of freely moving mice.

**Fig. 1: Body-part tracking via LEAP, a deep learning framework for animal pose estimation.**

**Fig. 2: LEAP is accurate and requires little training or labeled data.**

**Fig. 3: LEAP recapitulates known gait patterning in flies.**

**Fig. 4: Unsupervised embedding of body position dynamics.**

**Fig. 5: Locomotor clusters in behavior space separate distinct gait modes.**

**Fig. 6: LEAP generalizes to images with complex backgrounds or of other animals.**

Data availability

The entire primary dataset of 59 aligned, high-resolution behavioral videos is available online for reproducibility or further studies based on this method, along with the labeled data used to train and ground-truth the networks, the pre-trained networks used for all analyses, and estimated body-part positions for all 21 million frames. This dataset (~170 GiB) is freely available at http://arks.princeton.edu/ark:/88435/dsp01pz50gz79z. Data from additional fly and mouse datasets used in Fig. 6 can be made available upon reasonable request.

References

Anderson, D. J. & Perona, P. Toward a science of computational ethology. Neuron 84, 18–31 (2014).
Szigeti, B., Stone, T. & Webb, B. Inconsistencies in C. elegans behavioural annotation. Preprint at bioRxiv https://www.biorxiv.org/content/early/2016/07/29/066787 (2016).
Branson, K., Robie, A. A., Bender, J., Perona, P. & Dickinson, M. H. High-throughput ethomics in large groups of Drosophila. Nat. Methods 6, 451–457 (2009).
Swierczek, N. A., Giles, A. C., Rankin, C. H. & Kerr, R. A. High-throughput behavioral analysis in C. elegans. Nat. Methods 8, 592–598 (2011).
Deng, Y., Coen, P., Sun, M. & Shaevitz, J. W. Efficient multiple object tracking using mutually repulsive active membranes. PLoS ONE 8, e65769 (2013).
Dankert, H., Wang, L., Hoopfer, E. D., Anderson, D. J. & Perona, P. Automated monitoring and analysis of social behavior in Drosophila. Nat. Methods 6, 297–303 (2009).
Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S. & Branson, K. JAABA: interactive machine learning for automatic annotation of animal behavior. Nat. Methods 10, 64–67 (2013).
Arthur, B. J., Sunayama-Morita, T., Coen, P., Murthy, M. & Stern, D. L. Multi-channel acoustic recording and automated analysis of Drosophila courtship songs. BMC Biol. 11, 11 (2013).
Anderson, S. E., Dave, A. S. & Margoliash, D. Template-based automatic recognition of birdsong syllables from continuous recordings. J. Acoust. Soc. Am. 100, 1209–1219 (1996).
Tachibana, R. O., Oosugi, N. & Okanoya, K. Semi-automatic classification of birdsong elements using a linear support vector machine. PLoS ONE 9, e92584 (2014).
Berman, G. J., Choi, D. M., Bialek, W. & Shaevitz, J. W. Mapping the stereotyped behaviour of freely moving fruit flies. J. R. Soc. Interface 11, 20140672 (2014).
Wiltschko, A. B. et al. Mapping sub-second structure in mouse behavior. Neuron 88, 1121–1135 (2015).
Berman, G. J., Bialek, W. & Shaevitz, J. W. Predictability and hierarchy in Drosophila behavior. Proc. Natl Acad. Sci. USA 113, 11943–11948 (2016).
Klibaite, U., Berman, G. J., Cande, J., Stern, D. L. & Shaevitz, J. W. An unsupervised method for quantifying the behavior of paired animals. Phys. Biol. 14, 015006 (2017).
Wang, Q. et al. The PSI-U1 snRNP interaction regulates male mating behavior in Drosophila. Proc. Natl Acad. Sci. USA 113, 5269–5274 (2016).
Vogelstein, J. T. et al. Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344, 386–392 (2014).
Cande, J. et al. Optogenetic dissection of descending behavioral control in Drosophila. eLife 7, e34275 (2018).
Uhlmann, V., Ramdya, P., Delgado-Gonzalo, R., Benton, R. & Unser, M. FlyLimbTracker: an active contour based approach for leg segment tracking in unmarked, freely behaving Drosophila. PLoS ONE 12, e0173433 (2017).
Kain, J. et al. Leg-tracking and automated behavioural classification in Drosophila. Nat. Commun. 4, 1910 (2013).
Machado, A. S., Darmohray, D. M., Fayad, J., Marques, H. G. & Carey, M. R. A quantitative framework for whole-body coordination reveals specific deficits in freely walking ataxic mice. eLife 4, e07892 (2015).
Nashaat, M. A. et al. Pixying behavior: a versatile real-time and post hoc automated optical tracking method for freely moving and head fixed animals. eNeuro 4, e34275 (2017).
Nanjappa, A. et al. Mouse pose estimation from depth images. arXiv Preprint at https://arxiv.org/abs/1511.07611 (2015).
Nakamura, A. et al. Low-cost three-dimensional gait analysis system for mice with an infrared depth sensor. Neurosci. Res. 100, 55–62 (2015).
Wang, Z., Mirbozorgi, S. A. & Ghovanloo, M. An automated behavior analysis system for freely moving rodents using depth image. Med. Biol. Eng. Comput. 56, 1807–1821 (2018).
Mendes, C. S., Bartos, I., Akay, T., Márka, S. & Mann, R. S. Quantification of gait parameters in freely walking wild type and sensory deprived Drosophila melanogaster. eLife 2, e00231 (2013).
Mendes, C. S. et al. Quantification of gait parameters in freely walking rodents. BMC Biol. 13, 50 (2015).
Petrou, G. & Webb, B. Detailed tracking of body and leg movements of a freely walking female cricket during phonotaxis. J. Neurosci. Methods 203, 56–68 (2012).
Toshev, A. & Szegedy, C. DeepPose: human pose estimation via deep neural networks. arXiv Preprint at https://arxiv.org/abs/1312.4659 (2013).
Tompson, J. J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems Vol. 27 (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. & Weinberger, K. Q.) 1799–1807 (Curran Associates, Inc., Red Hook, 2014).
Carreira, J., Agrawal, P., Fragkiadaki, K. & Malik, J. Human pose estimation with iterative error feedback. arXiv Preprint at https://arxiv.org/abs/1507.06550 (2015).
Wei, S.-E., Ramakrishna, V., Kanade, T. & Sheikh, Y. Convolutional pose machines. arXiv Preprint at https://arxiv.org/abs/1602.00134 (2016).
Bulat, A. & Tzimiropoulos, G. Human pose estimation via convolutional part heatmap regression. arXiv Preprint at https://arxiv.org/abs/1609.01743 (2016).
Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. arXiv Preprint at https://arxiv.org/abs/1611.08050 (2016).
Tome, D., Russell, C. & Agapito, L. Lifting from the deep: convolutional 3D pose estimation from a single image. arXiv Preprint at https://arxiv.org/abs/1701.00295 (2017).
Shelhamer, E., Long, J. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651 (2017).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 234–241 (Springer International Publishing, Cham, Switzerland, 2015).
Lin, T.-Y. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014 740–755 (Springer International Publishing, Cham, Switzerland, 2014).
Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2D human pose estimation: new benchmark and state of the art analysis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3686–3693 (IEEE Computer Society, 2014).
Güler, R. A., Neverova, N. & Kokkinos, I. DensePose: dense human pose estimation in the wild. arXiv Preprint at https://arxiv.org/abs/1802.00434 (2018).
Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018).
Isakov, A. et al. Recovery of locomotion after injury in Drosophila melanogaster depends on proprioception. J. Exp. Biol. 219, 1760–1771 (2016).
Wosnitza, A., Bockemühl, T., Dübbert, M., Scholz, H. & Büschges, A. Inter-leg coordination in the control of walking speed in Drosophila. J. Exp. Biol. 216, 480–491 (2013).
Qiao, B., Li, C., Allen, V. W., Shirasu-Hiza, M. & Syed, S. Automated analysis of long-term grooming behavior in Drosophila using a k-nearest neighbors classifier. eLife 7, e34497 (2018).
Dombeck, D. A., Khabbaz, A. N., Collman, F., Adelman, T. L. & Tank, D. W. Imaging large-scale neural activity with cellular resolution in awake, mobile mice. Neuron 56, 43–57 (2007).
Seelig, J. D. & Jayaraman, V. Neural dynamics for landmark orientation and angular path integration. Nature 521, 186–191 (2015).
Pérez-Escudero, A., Vicente-Page, J., Hinz, R. C., Arganda, S. & de Polavieja, G. G. idTracker: tracking individuals in a group by automatic identification of unmarked animals. Nat. Methods 11, 743–748 (2014).
Newell, A., Yang, K. & Deng, J. Stacked hourglass networks for human pose estimation. arXiv Preprint at https://arxiv.org/abs/1603.06937 (2016).
Chyb, S. & Gompel, N. Atlas of Drosophila Morphology: Wild-type and Classical Mutants (Academic Press, London, Waltham and San Diego, 2013).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. arXiv Preprint at https://arxiv.org/abs/1412.6980 (2014).
Morel, P. Gramm: grammar of graphics plotting in MATLAB. J. Open Source Softw. 3, 568 (2018).
Baum, L. E., Petrie, T., Soules, G. & Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Ann. Math. Stat. 41, 164–171 (1970).
Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13, 260–269 (1967).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

Acknowledgements

The authors acknowledge J. Pillow for discussions; B.C. Cho for contributions to the acquisition and preprocessing pipeline for mouse experiments; P. Chen for a previous version of a neural network for pose estimation that was useful in designing our method; H. Jang, M. Murugan, and I. Witten for feedback on the GUI and other discussions; G. Guan for assistance maintaining flies; and the Murthy, Shaevitz and Wang labs for general feedback. This work was supported by the NIH R01 NS104899-01 BRAIN Initiative Award and an NSF BRAIN Initiative EAGER Award (to M.M. and J.W.S.), NIH R01 MH115750 BRAIN Initiative Award (to S.S.-H.W. and J.W.S.), the Nancy Lurie Marks Family Foundation and NIH R01 NS045193 (to S.S.-H.W.), an HHMI Faculty Scholar Award (to M.M.), NSF GRFP DGE-1148900 (to T.D.P.), and the Center for the Physics of Biological Function sponsored by the National Science Foundation (NSF PHY-1734030).

Author information

Diego E. Aldarondo
Present address: Program in Neuroscience, Harvard University, Cambridge, MA, USA
These authors contributed equally: Talmo D. Pereira, Diego E. Aldarondo.

Authors and Affiliations

Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA
Talmo D. Pereira, Diego E. Aldarondo, Lindsay Willmore, Mikhail Kislin, Samuel S.-H. Wang, Mala Murthy & Joshua W. Shaevitz
Department of Molecular Biology, Princeton University, Princeton, NJ, USA
Samuel S.-H. Wang & Mala Murthy
Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
Joshua W. Shaevitz
Department of Physics, Princeton University, Princeton, NJ, USA
Joshua W. Shaevitz

Contributions

T.D.P., D.E.A., S.S.-H.W., J.W.S. and M.M. designed the study. T.D.P., D.E.A., L.W. and M.K. conducted experiments. T.D.P. and D.E.A. developed the GUI and analyzed data. T.D.P., D.E.A., J.W.S. and M.M. wrote the manuscript.

Corresponding authors

Correspondence to Mala Murthy or Joshua W. Shaevitz.

Ethics declarations

Competing interests

T.D.P., D.E.A., J.W.S. and M.M. are named as inventors on US provisional patent no. 62/741,643 filed by Princeton University.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Rotational invariance is learned at the cost of prediction accuracy.

a, The augmentation procedure consists of random rotations about the center of egocentrically aligned labeled frames. Labeled frames are split into training, validation and test sets. Colors are used to indicate unique images. Only training and validation sets are augmented and used for training. During training, images are drawn sequentially from the training and validation sets to form batches of 32 images, cycling back to the beginning if there are fewer images than required, and then rotated randomly within a range of angles. After each epoch, the ordering of the datasets is shuffled so as to create new combinations of batches. The test set images are not augmented before computing accuracy metrics reported throughout. b, Egocentric alignment accuracy of the preprocessing algorithm from ref. ⁹ when compared to manual labels of head and thorax. The error is the absolute deviation of the angle formed between the thorax and head from the horizontal centerline in the image. c, The accuracy measured as the r.m.s. error of position estimates when evaluated on data artificially rotated at a fixed angle (rows) with networks trained on data augmented by rotations between a range of angles (columns). Red boxes denote the best accuracy for each data angle.
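The batching scheme in panel a can be sketched roughly as follows. This is a minimal illustration and not the LEAP implementation: it shuffles the sample order at the start of each epoch, cycles back to the beginning when a batch runs past the end of the dataset, and draws one random rotation angle per sample. Rotating the pixel data is left to an image library, so only the label coordinates are rotated here. The names `rotate_points` and `augmented_batches`, the ±15° range and the (96, 96) rotation center are assumptions chosen for illustration.

```python
import math
import random


def rotate_points(points, angle_deg, center):
    """Rotate (x, y) points by angle_deg about center."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        (cx + cos_t * (x - cx) - sin_t * (y - cy),
         cy + sin_t * (x - cx) + cos_t * (y - cy))
        for x, y in points
    ]


def augmented_batches(samples, batch_size=32, max_angle=15.0,
                      center=(96.0, 96.0), rng=None):
    """Yield one epoch of batches of (sample_index, angle, rotated_points).

    `samples` is a list of label sets, each a list of (x, y) points.
    max_angle and center are hypothetical defaults, not values from the paper.
    """
    rng = rng or random.Random()
    order = list(range(len(samples)))
    rng.shuffle(order)  # a new ordering for every epoch
    n_batches = math.ceil(len(order) / batch_size)
    for b in range(n_batches):
        batch = []
        for k in range(batch_size):
            # Cycle back to the beginning if there are too few samples.
            idx = order[(b * batch_size + k) % len(order)]
            angle = rng.uniform(-max_angle, max_angle)
            batch.append((idx, angle, rotate_points(samples[idx], angle, center)))
        yield batch
```

Calling `augmented_batches` once per epoch gives a fresh shuffle each time; the test set would bypass this generator entirely, consistent with the caption's statement that test images are not augmented.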

Supplementary Figure 2 Cluster sampling to promote pose diversity in the labeling dataset.

a, PCA of unlabeled images captures 80% of the variance in the data (gray line; n = 29,500 images) within 50 components (blue bars). b, Top PCA eigenmodes visualized as coefficient images. Red and blue shading denote positive and negative coefficients at each pixel, respectively. Areas of similar colors indicate correlated pixel intensities within a given mode. c, Cluster centroids identified by k-means after PCA. Red and blue shading denote pixels with higher or lower intensity than the overall mean, respectively.
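As a rough sketch of how cluster sampling might work, the code below runs a plain k-means on per-image feature vectors and then picks the same number of images from every cluster, so that rare poses are not swamped by common ones. It assumes the features are already PCA projections of the images (the PCA step itself is omitted), and drawing an equal number per cluster is one plausible reading of the figure title rather than a documented detail. The function names `kmeans` and `cluster_sample` are hypothetical.

```python
import random


def kmeans(features, k, n_iter=20, rng=None):
    """Plain Lloyd's k-means; returns a cluster index for each feature vector."""
    rng = rng or random.Random(0)
    centroids = [list(f) for f in rng.sample(features, k)]
    labels = [0] * len(features)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, f in enumerate(features):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [features[i] for i in range(len(features)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_sample(features, k, per_cluster, rng=None):
    """Return sorted indices of up to per_cluster images from each cluster."""
    rng = rng or random.Random(0)
    labels = kmeans(features, k, rng=rng)
    chosen = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        rng.shuffle(members)
        chosen.extend(members[:per_cluster])
    return sorted(chosen)
```

In practice a library implementation of PCA and k-means would replace the hand-written loop; the pure-Python version is used here only to keep the sketch self-contained.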

Supplementary Figure 3 Comparison of neural network architecture.

a, Diagram of our neural network architecture. Raw images are provided as input to the network, which then computes a set of confidence maps of the same height and width as the input image (top row). The network consists of a set of convolutions, max pooling and transposed convolutions whose weights are learned during training (top middle). Estimated confidence maps are compared to ground truth maps generated from user labels using a mean squared error loss function, which is then minimized during training (bottom row). b, Accuracy comparison between network architectures. We compared the accuracy of our architecture to the hourglass and stacked hourglass versions of the network described in ref. ¹⁰. The accuracy of our network is equivalent to or better than that achieved when training with the hourglass (over all 32 body parts, n = 300 held-out frames, P < 1 × 10^–10, Wilcoxon rank sum test, one-tailed, z = –74.65) and stacked hourglass (over all 32 body parts, n = 300 held-out frames, P < 1 × 10^–10, Wilcoxon rank sum test, one-tailed, z = –53.21) versions. Dots and error bars denote median and 25th and 75th percentiles; violin plots denote full distributions of errors.
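The confidence-map comparison in panel a can be illustrated with a small sketch. It builds a ground-truth map from a single labeled point, scores an estimated map against it with mean squared error, and reads a position back out as the location of the global maximum. The Gaussian shape, the `sigma` value and the use of the global peak as the predicted position are assumptions made for illustration; the caption specifies only that ground-truth maps are generated from user labels and compared using a mean squared error loss.

```python
import math


def confidence_map(height, width, x, y, sigma=2.0):
    """2D Gaussian centered on the labeled point (x, y), with peak value 1."""
    return [
        [math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2.0 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]


def mse(pred, target):
    """Mean squared error between two maps of equal size."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n


def peak(conf_map):
    """Return (x, y) of the global maximum of a confidence map."""
    best_row = max(range(len(conf_map)), key=lambda r: max(conf_map[r]))
    best_col = max(range(len(conf_map[best_row])),
                   key=lambda c: conf_map[best_row][c])
    return best_col, best_row
```

A network with 32 body parts would output 32 such maps per image, and the loss would be averaged over all of them.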

Supplementary Figure 4 User-defined skeleton.

The 32 selected points approximately match the set of visible joints and interest points in the anatomy of the animal.

Supplementary Figure 5 Estimation accuracy improves with few samples.

a,b, Error distance distributions per body part when estimated with networks trained for 15 epochs on 10 (a) or 250 (b) labeled frames. c, Time spent labeling each frame decreases with the quality of initialization. Line and shaded regions correspond to mean and s.d., respectively. Starting frames require 115.4 ± 45.0 (mean ± s.d.) seconds to label, decreasing to 6.1 ± 7.7 s after initialization with a network trained on 1,000 labeled frames (n = 1,500 total labeled frames). d, Accuracy improvements are observed with very few labeled samples. A plateau is observed at around 150–200 frames, with marginal improvements with additional labeling. Circles denote the test set r.m.s. error for one replicate of fast training (15 epochs) at each dataset size; lines denote mean of all replicates.
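For reference, one common way to compute an r.m.s. error over position estimates is shown below. This is a sketch of the general metric, assuming matched lists of predicted and ground-truth (x, y) points; the exact averaging used for panel d (per body part, per frame or pooled) is not stated in the caption, and the function name `rms_error` is hypothetical.

```python
import math


def rms_error(predicted, truth):
    """Root-mean-square Euclidean distance between matched (x, y) points."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    squared = [
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(predicted, truth)
    ]
    return math.sqrt(sum(squared) / len(squared))
```

Dividing the result by the animal's body length in pixels would express the error as a fraction of body length, the unit used in the abstract.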

Supplementary Figure 6 Comparison of behavioral space distributions generated from compressed images versus body-part positions.

a, Behavioral space distribution from 59 male flies calculated using the original MotionMapper pipeline (data and pipeline from ref. 12), including Radon-transform compression and PCA-based projection onto the first 50 principal components followed by a nonlinear embedding of the resultant spectrograms. b, Behavioral space distribution from 59 male flies (data and pipeline from ref. 12) calculated using spectrograms generated from tracked body-part positions rather than PCA modes (see Methods). c, Joint probability distribution of the cluster labels from a and b; sorted by row and column peaks.
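The joint distribution in panel c can be estimated from two per-frame label sequences by counting co-occurring label pairs. The sketch below assumes one cluster label per frame from each pipeline, aligned frame by frame; the function name `joint_distribution` is hypothetical and the sorting by row and column peaks is not reproduced.

```python
from collections import Counter


def joint_distribution(labels_a, labels_b):
    """Return {(label_a, label_b): probability} for frame-aligned labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    counts = Counter(zip(labels_a, labels_b))
    total = len(labels_a)
    return {pair: n / total for pair, n in counts.items()}
```

If the two embeddings agree, most of the probability mass concentrates on a small number of pairs, which shows up as a near-diagonal structure once rows and columns are sorted.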

Supplementary Figure 7 Generalization to more diverse morphologies with a single network trades off with accuracy.

a,b, Male and female flies differ in anatomical morphology, in part because of differences in their body length. a, The males more often extend their wings as they are used to produce courtship song. b, The females rarely extend their wings in this context. c, Training on labeled images of just males results in high accuracy on male test set images. d, Training on both males and females still results in high accuracy on male test set images. e, Quantification of r.m.s. error on the male test set shows that generalization to two different morphologies increases the error metric. Circles denote training replicates, diamonds denote median r.m.s. error for all replicates and solid and open markers correspond to specialized and generalized training, respectively.

Supplementary information

Supplementary Text and Figures

Supplementary Figs. 1–7 and Supplementary Results

Reporting Summary

Supplementary Video 1

Body-part tracking is reliable over long periods without temporal constraints. Raw images (left), max projection of all confidence maps (center) and tracked images (right) during a 20-s bout of free movement. Video playback at ×0.2 real-time speed.

Supplementary Video 2

Body-part tracking during free-moving locomotion. Raw images (left), max projection of all confidence maps (center) and tracked images (right) during a bout of locomotion. Video playback at ×0.15 real-time speed. Video corresponds to Fig. 1d.

Supplementary Video 3

Body-part tracking during head grooming. Raw images (left), max projection of all confidence maps (center), and tracked images (right) during a bout of head grooming. Video playback at ×0.15 real-time speed. Video corresponds to Fig. 1e.

Supplementary Video 4

Tracking joints robustly in images with heterogeneous background and noisy segmentation. Raw images (left), max projection of all confidence maps (center) and tracked images (right) of a freely moving courting male fly. Rows correspond to results from a network trained on unmasked and masked images, respectively. Video playback at ×0.2 real-time speed.

Supplementary Video 5

Tracking joints in freely moving rodents. Raw images (left), max projection of all confidence maps (center) and tracked images (right) of a freely moving mouse in an open field arena imaged from below through a clear acrylic floor. Video playback at ×0.2 real-time speed. Tracking is reliable over time but degrades when certain parts are occluded, such as when the animal rears.

Supplementary Software

LEAP (LEAP estimates animal pose) software for estimation of animal body-part position.

About this article

Cite this article

Pereira, T.D., Aldarondo, D.E., Willmore, L. et al. Fast animal pose estimation using deep neural networks. Nat Methods 16, 117–125 (2019). https://doi.org/10.1038/s41592-018-0234-5

Received: 25 May 2018
Accepted: 31 October 2018
Published: 20 December 2018
Issue Date: January 2019
DOI: https://doi.org/10.1038/s41592-018-0234-5
