Article | Published:

Fast animal pose estimation using deep neural networks

Abstract

The need for automated and efficient systems for tracking full animal pose has increased with the complexity of behavioral data and analyses. Here we introduce LEAP (LEAP estimates animal pose), a deep-learning-based method for predicting the positions of animal body parts. This framework consists of a graphical interface for labeling body parts and training the network. LEAP offers fast prediction on new data, and training with as few as 100 frames results in 95% of peak performance. We validated LEAP using videos of freely behaving fruit flies and tracked 32 distinct points to describe the pose of the head, body, wings and legs, with an error rate of <3% of body length. We recapitulated reported findings on insect gait dynamics and demonstrated LEAP’s applicability for unsupervised behavioral classification. Finally, we extended the method to more challenging imaging situations and videos of freely moving mice.


Data availability

The entire primary dataset of 59 aligned, high-resolution behavioral videos is available online for reproducibility and for further studies based on this method, together with the labeled data used to train and ground-truth the networks, the pre-trained networks used for all analyses, and the estimated body-part positions for all 21 million frames. This dataset (~170 GiB) is freely available at http://arks.princeton.edu/ark:/88435/dsp01pz50gz79z. Data from the additional fly and mouse datasets used in Fig. 6 can be made available upon reasonable request.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Anderson, D. J. & Perona, P. Toward a science of computational ethology. Neuron 84, 18–31 (2014).
  2. Szigeti, B., Stone, T. & Webb, B. Inconsistencies in C. elegans behavioural annotation. Preprint at https://www.biorxiv.org/content/early/2016/07/29/066787 (2016).
  3. Branson, K., Robie, A. A., Bender, J., Perona, P. & Dickinson, M. H. High-throughput ethomics in large groups of Drosophila. Nat. Methods 6, 451–457 (2009).
  4. Swierczek, N. A., Giles, A. C., Rankin, C. H. & Kerr, R. A. High-throughput behavioral analysis in C. elegans. Nat. Methods 8, 592–598 (2011).
  5. Deng, Y., Coen, P., Sun, M. & Shaevitz, J. W. Efficient multiple object tracking using mutually repulsive active membranes. PLoS ONE 8, e65769 (2013).
  6. Dankert, H., Wang, L., Hoopfer, E. D., Anderson, D. J. & Perona, P. Automated monitoring and analysis of social behavior in Drosophila. Nat. Methods 6, 297–303 (2009).
  7. Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S. & Branson, K. JAABA: interactive machine learning for automatic annotation of animal behavior. Nat. Methods 10, 64–67 (2013).
  8. Arthur, B. J., Sunayama-Morita, T., Coen, P., Murthy, M. & Stern, D. L. Multi-channel acoustic recording and automated analysis of Drosophila courtship songs. BMC Biol. 11, 11 (2013).
  9. Anderson, S. E., Dave, A. S. & Margoliash, D. Template-based automatic recognition of birdsong syllables from continuous recordings. J. Acoust. Soc. Am. 100, 1209–1219 (1996).
  10. Tachibana, R. O., Oosugi, N. & Okanoya, K. Semi-automatic classification of birdsong elements using a linear support vector machine. PLoS ONE 9, e92584 (2014).
  11. Berman, G. J., Choi, D. M., Bialek, W. & Shaevitz, J. W. Mapping the stereotyped behaviour of freely moving fruit flies. J. R. Soc. Interface 11, 20140672 (2014).
  12. Wiltschko, A. B. et al. Mapping sub-second structure in mouse behavior. Neuron 88, 1121–1135 (2015).
  13. Berman, G. J., Bialek, W. & Shaevitz, J. W. Predictability and hierarchy in Drosophila behavior. Proc. Natl Acad. Sci. USA 113, 11943–11948 (2016).
  14. Klibaite, U., Berman, G. J., Cande, J., Stern, D. L. & Shaevitz, J. W. An unsupervised method for quantifying the behavior of paired animals. Phys. Biol. 14, 015006 (2017).
  15. Wang, Q. et al. The PSI-U1 snRNP interaction regulates male mating behavior in Drosophila. Proc. Natl Acad. Sci. USA 113, 5269–5274 (2016).
  16. Vogelstein, J. T. et al. Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344, 386–392 (2014).
  17. Cande, J. et al. Optogenetic dissection of descending behavioral control in Drosophila. eLife 7, e34275 (2018).
  18. Uhlmann, V., Ramdya, P., Delgado-Gonzalo, R., Benton, R. & Unser, M. FlyLimbTracker: an active contour based approach for leg segment tracking in unmarked, freely behaving Drosophila. PLoS ONE 12, e0173433 (2017).
  19. Kain, J. et al. Leg-tracking and automated behavioural classification in Drosophila. Nat. Commun. 4, 1910 (2013).
  20. Machado, A. S., Darmohray, D. M., Fayad, J., Marques, H. G. & Carey, M. R. A quantitative framework for whole-body coordination reveals specific deficits in freely walking ataxic mice. eLife 4, e07892 (2015).
  21. Nashaat, M. A. et al. Pixying behavior: a versatile real-time and post hoc automated optical tracking method for freely moving and head fixed animals. eNeuro 4, e34275 (2017).
  22. Nanjappa, A. et al. Mouse pose estimation from depth images. Preprint at https://arxiv.org/abs/1511.07611 (2015).
  23. Nakamura, A. et al. Low-cost three-dimensional gait analysis system for mice with an infrared depth sensor. Neurosci. Res. 100, 55–62 (2015).
  24. Wang, Z., Mirbozorgi, S. A. & Ghovanloo, M. An automated behavior analysis system for freely moving rodents using depth image. Med. Biol. Eng. Comput. 56, 1807–1821 (2018).
  25. Mendes, C. S., Bartos, I., Akay, T., Márka, S. & Mann, R. S. Quantification of gait parameters in freely walking wild type and sensory deprived Drosophila melanogaster. eLife 2, e00231 (2013).
  26. Mendes, C. S. et al. Quantification of gait parameters in freely walking rodents. BMC Biol. 13, 50 (2015).
  27. Petrou, G. & Webb, B. Detailed tracking of body and leg movements of a freely walking female cricket during phonotaxis. J. Neurosci. Methods 203, 56–68 (2012).
  28. Toshev, A. & Szegedy, C. DeepPose: human pose estimation via deep neural networks. Preprint at https://arxiv.org/abs/1312.4659 (2013).
  29. Tompson, J. J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems Vol. 27 (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. & Weinberger, K. Q.) 1799–1807 (Curran Associates, Inc., Red Hook, 2014).
  30. Carreira, J., Agrawal, P., Fragkiadaki, K. & Malik, J. Human pose estimation with iterative error feedback. Preprint at https://arxiv.org/abs/1507.06550 (2015).
  31. Wei, S.-E., Ramakrishna, V., Kanade, T. & Sheikh, Y. Convolutional pose machines. Preprint at https://arxiv.org/abs/1602.00134 (2016).
  32. Bulat, A. & Tzimiropoulos, G. Human pose estimation via convolutional part heatmap regression. Preprint at https://arxiv.org/abs/1609.01743 (2016).
  33. Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. Preprint at https://arxiv.org/abs/1611.08050 (2016).
  34. Tome, D., Russell, C. & Agapito, L. Lifting from the deep: convolutional 3D pose estimation from a single image. Preprint at https://arxiv.org/abs/1701.00295 (2017).
  35. Shelhamer, E., Long, J. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651 (2017).
  36. Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 234–241 (Springer International Publishing, Cham, Switzerland, 2015).
  37. Lin, T.-Y. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014 740–755 (Springer International Publishing, Cham, Switzerland, 2014).
  38. Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2D human pose estimation: new benchmark and state of the art analysis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3686–3693 (IEEE Computer Society, 2014).
  39. Güler, R. A., Neverova, N. & Kokkinos, I. DensePose: dense human pose estimation in the wild. Preprint at https://arxiv.org/abs/1802.00434 (2018).
  40. Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018).
  41. Isakov, A. et al. Recovery of locomotion after injury in Drosophila melanogaster depends on proprioception. J. Exp. Biol. 219, 1760–1771 (2016).
  42. Wosnitza, A., Bockemühl, T., Dübbert, M., Scholz, H. & Büschges, A. Inter-leg coordination in the control of walking speed in Drosophila. J. Exp. Biol. 216, 480–491 (2013).
  43. Qiao, B., Li, C., Allen, V. W., Shirasu-Hiza, M. & Syed, S. Automated analysis of long-term grooming behavior in Drosophila using a k-nearest neighbors classifier. eLife 7, e34497 (2018).
  44. Dombeck, D. A., Khabbaz, A. N., Collman, F., Adelman, T. L. & Tank, D. W. Imaging large-scale neural activity with cellular resolution in awake, mobile mice. Neuron 56, 43–57 (2007).
  45. Seelig, J. D. & Jayaraman, V. Neural dynamics for landmark orientation and angular path integration. Nature 521, 186–191 (2015).
  46. Pérez-Escudero, A., Vicente-Page, J., Hinz, R. C., Arganda, S. & de Polavieja, G. G. idTracker: tracking individuals in a group by automatic identification of unmarked animals. Nat. Methods 11, 743–748 (2014).
  47. Newell, A., Yang, K. & Deng, J. Stacked hourglass networks for human pose estimation. Preprint at https://arxiv.org/abs/1603.06937 (2016).
  48. Chyb, S. & Gompel, N. Atlas of Drosophila Morphology: Wild-type and Classical Mutants (Academic Press, London, Waltham and San Diego, 2013).
  49. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
  50. Morel, P. Gramm: grammar of graphics plotting in MATLAB. J. Open Source Softw. 3, 568 (2018).
  51. Baum, L. E., Petrie, T., Soules, G. & Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41, 164–171 (1970).
  52. Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13, 260–269 (1967).
  53. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).


Acknowledgements

The authors acknowledge J. Pillow for discussions; B.C. Cho for contributions to the acquisition and preprocessing pipeline for mouse experiments; P. Chen for a previous version of a neural network for pose estimation that was useful in designing our method; H. Jang, M. Murugan, and I. Witten for feedback on the GUI and other discussions; G. Guan for assistance maintaining flies; and the Murthy, Shaevitz and Wang labs for general feedback. This work was supported by the NIH R01 NS104899-01 BRAIN Initiative Award and an NSF BRAIN Initiative EAGER Award (to M.M. and J.W.S.), NIH R01 MH115750 BRAIN Initiative Award (to S.S.-H.W. and J.W.S.), the Nancy Lurie Marks Family Foundation and NIH R01 NS045193 (to S.S.-H.W.), an HHMI Faculty Scholar Award (to M.M.), NSF GRFP DGE-1148900 (to T.D.P.), and the Center for the Physics of Biological Function sponsored by the National Science Foundation (NSF PHY-1734030).

Author information

T.D.P., D.E.A., S.S.-H.W., J.W.S. and M.M. designed the study. T.D.P., D.E.A., L.W. and M.K. conducted experiments. T.D.P. and D.E.A. developed the GUI and analyzed data. T.D.P., D.E.A., J.W.S. and M.M. wrote the manuscript.

Competing interests

T.D.P., D.E.A., J.W.S. and M.M. are named as inventors on US provisional patent no. 62/741,643 filed by Princeton University.

Correspondence to Mala Murthy or Joshua W. Shaevitz.

Integrated supplementary information

  1. Supplementary Figure 1 Rotational invariance is learned at the cost of prediction accuracy.

    a, The augmentation procedure consists of random rotations about the center of egocentrically aligned labeled frames. Labeled frames are split into training, validation and test sets. Colors indicate unique images. Only the training and validation sets are augmented and used for training. During training, images are drawn sequentially from the training and validation sets to form batches of 32 images, cycling back to the beginning if there are fewer images than required, and then rotated randomly within a range of angles. After each epoch, the ordering of the datasets is shuffled to create new combinations of batches. Test set images are not augmented before computing the accuracy metrics reported throughout. b, Egocentric alignment accuracy of the preprocessing algorithm from ref. 9 when compared to manual labels of head and thorax. The error is the absolute deviation of the angle formed between the thorax and head from the horizontal centerline in the image. c, Accuracy, measured as the r.m.s. error of position estimates, when evaluated on data artificially rotated by a fixed angle (rows) with networks trained on data augmented by rotations within a range of angles (columns). Red boxes denote the best accuracy for each data angle.

  2. Supplementary Figure 2 Cluster sampling to promote pose diversity in the labeling dataset.

    a, PCA of unlabeled images captures 80% of the variance in the data (gray line; n = 29,500 images) within 50 components (blue bars). b, Top PCA eigenmodes visualized as coefficient images. Red and blue shading denote positive and negative coefficients at each pixel, respectively. Areas of similar colors indicate correlated pixel intensities within a given mode. c, Cluster centroids identified by k-means after PCA. Red and blue shading denote pixels with higher or lower intensity than the overall mean, respectively.

  3. Supplementary Figure 3 Comparison of neural network architecture.

    a, Diagram of our neural network architecture. Raw images are provided as input to the network, which computes a set of confidence maps of the same height and width as the input image (top row). The network consists of a series of convolutions, max pooling and transposed convolutions whose weights are learned during training (top middle). Estimated confidence maps are compared to ground-truth maps generated from user labels via a mean squared error loss function, which is minimized during training (bottom row). b, Accuracy comparison between network architectures. We compared the accuracy of our architecture to the hourglass and stacked hourglass versions of the network described in ref. 10. Our network is equivalent to or better than the hourglass (over all 32 body parts, n = 300 held-out frames, P < 1 × 10–10, one-tailed Wilcoxon rank sum test, z = –74.65) and stacked hourglass (over all 32 body parts, n = 300 held-out frames, P < 1 × 10–10, one-tailed Wilcoxon rank sum test, z = –53.21) versions. Dots and error bars denote median and 25th and 75th percentiles; violin plots denote full distributions of errors.

  4. Supplementary Figure 4 User-defined skeleton.

    The 32 selected points approximately match the set of visible joints and interest points in the anatomy of the animal.

  5. Supplementary Figure 5 Estimation accuracy improves with few samples.

    a,b, Error distance distributions per body part when estimated with networks trained for 15 epochs on 10 (a) or 250 (b) labeled frames. c, Time spent labeling each frame decreases with the quality of initialization. Line and shaded regions correspond to mean and s.d., respectively. Starting frames require 115.4 ± 45.0 (mean ± s.d.) seconds to label, decreasing to 6.1 ± 7.7 s after initialization with a network trained on 1,000 labeled frames (n = 1,500 total labeled frames). d, Accuracy improvements are observed with very few labeled samples. A plateau is observed at around 150–200 frames, with marginal improvements with additional labeling. Circles denote the test set r.m.s. error for one replicate of fast training (15 epochs) at each dataset size; lines denote mean of all replicates.

  6. Supplementary Figure 6 Comparison of behavioral space distributions generated from compressed images versus body-part positions.

    a, Behavioral space distribution from 59 male flies calculated using the original MotionMapper pipeline (data and pipeline from ref. 12), including Radon-transform compression and PCA-based projection onto the first 50 principal components, followed by a nonlinear embedding of the resultant spectrograms. b, Behavioral space distribution from the same 59 male flies calculated using spectrograms generated from tracked body-part positions rather than PCA modes (see Methods). c, Joint probability distribution of the cluster labels from a and b, sorted by row and column peaks.

  7. Supplementary Figure 7 Generalization to more diverse morphologies with a single network trades off with accuracy.

    a,b, Male and female flies differ in anatomical morphology, in part because of differences in body length. a, Males often extend their wings, which they use to produce courtship song. b, Females rarely extend their wings in this context. c, Training on labeled images of males only results in high accuracy on male test set images. d, Training on both males and females still results in high accuracy on male test set images. e, Quantification of r.m.s. error on the male test set shows that generalizing to two different morphologies increases the error. Circles denote training replicates, diamonds denote median r.m.s. error across all replicates, and solid and open markers correspond to specialized and generalized training, respectively.
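The rotation-augmentation loop described in Supplementary Fig. 1a can be sketched in numpy. This is an illustrative single-epoch version, not LEAP's actual implementation: the function and argument names are ours, and nearest-neighbor resampling stands in for whatever interpolation the real pipeline uses. The key point it demonstrates is that images and keypoint labels are rotated together about the image center, with batches cycling back to the start when the dataset is smaller than a full batch.

```python
import numpy as np

def augment_batches(images, keypoints, batch_size=32,
                    theta_range=(-180.0, 180.0), seed=0):
    """Yield batches of randomly rotated images and matching keypoints.

    images: (n, h, w) array; keypoints: (n, k, 2) array of (x, y) labels.
    Each batch is rotated by one random angle drawn from theta_range, and
    the keypoints are rotated by the same angle so labels stay consistent.
    """
    rng = np.random.default_rng(seed)
    n = len(images)
    order = rng.permutation(n)  # shuffled once here; reshuffle per epoch
    h, w = images.shape[1:3]
    center = np.array([(w - 1) / 2, (h - 1) / 2])
    for start in range(0, n, batch_size):
        # cycle back to the beginning if fewer images remain than needed
        idx = order[np.arange(start, start + batch_size) % n]
        theta = np.deg2rad(rng.uniform(*theta_range))
        c, s = np.cos(theta), np.sin(theta)
        rot = np.array([[c, -s], [s, c]])
        # rotate keypoints about the image center
        kps = (keypoints[idx] - center) @ rot.T + center
        # rotate images by inverse-mapping each output pixel (nearest neighbor)
        yy, xx = np.mgrid[0:h, 0:w]
        coords = np.stack([xx.ravel() - center[0], yy.ravel() - center[1]])
        src = rot.T @ coords
        sx = np.clip(np.round(src[0] + center[0]).astype(int), 0, w - 1)
        sy = np.clip(np.round(src[1] + center[1]).astype(int), 0, h - 1)
        batch = images[idx][:, sy, sx].reshape(-1, h, w)
        yield batch, kps
```

In this sketch, the test set would simply never be passed through `augment_batches`, matching the legend's note that test images are left unrotated when computing accuracy metrics.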
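The cluster-sampling procedure of Supplementary Fig. 2 (PCA followed by k-means, then drawing labeling candidates from the clusters) can also be sketched compactly. This is a hedged illustration under our own naming: PCA is done via an SVD, k-means is a plain Lloyd's-algorithm loop, and we pick the single frame nearest each centroid as its representative; the paper's actual sampling within clusters may differ.

```python
import numpy as np

def cluster_sample(frames, n_components=50, n_clusters=10, seed=0):
    """Return indices of one representative frame per k-means cluster.

    frames: (n, h, w) array of unlabeled images. Frames are flattened,
    mean-centered, projected onto the top PCA components, clustered with
    k-means, and the frame closest to each centroid is selected.
    """
    rng = np.random.default_rng(seed)
    X = frames.reshape(len(frames), -1).astype(float)
    X -= X.mean(axis=0)
    # PCA via SVD; rows of Vt are the eigenmode coefficient images
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:n_components].T
    # plain k-means (Lloyd's algorithm) in the reduced space
    centers = Z[rng.choice(len(Z), n_clusters, replace=False)]
    for _ in range(20):
        d = ((Z[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_clusters):
            members = Z[labels == k]
            if len(members):
                centers[k] = members.mean(0)
    # pick the frame nearest each centroid as its representative
    d = ((Z[:, None, :] - centers[None]) ** 2).sum(-1)
    return np.unique(d.argmin(0))
```

Sampling one frame per cluster biases the labeling set toward pose diversity rather than toward the most common poses, which is the legend's stated goal.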
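The ground-truth confidence maps of Supplementary Fig. 3a are commonly implemented as a 2D Gaussian bump centered on each labeled body part, compared to the network's output with a mean squared error loss. Below is a minimal sketch of that target construction; the function names and the choice of sigma are illustrative, not taken from the paper.

```python
import numpy as np

def confidence_maps(keypoints, height, width, sigma=5.0):
    """Ground-truth confidence maps: one unit-peak 2D Gaussian per part.

    keypoints: (n_parts, 2) array of (x, y) positions.
    Returns an array of shape (n_parts, height, width).
    """
    yy, xx = np.mgrid[0:height, 0:width]
    maps = np.empty((len(keypoints), height, width))
    for i, (x, y) in enumerate(keypoints):
        # squared distance of every pixel from the labeled position
        maps[i] = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
    return maps

def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth maps."""
    return np.mean((pred - target) ** 2)
```

At prediction time, the estimated position of each part is then simply the location of the maximum of its confidence map, which is why the maps must share the height and width of the input image.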

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figs. 1–7 and Supplementary Results

  2. Reporting Summary

  3. Supplementary Video 1

    Body-part tracking is reliable over long periods without temporal constraints. Raw images (left), max projection of all confidence maps (center) and tracked images (right) during a 20-s bout of free movement. Video playback at ×0.2 real-time speed.

  4. Supplementary Video 2

    Body-part tracking during free-moving locomotion. Raw images (left), max projection of all confidence maps (center) and tracked images (right) during a bout of locomotion. Video playback at ×0.15 real-time speed. Video corresponds to Fig. 1d.

  5. Supplementary Video 3

    Body-part tracking during head grooming. Raw images (left), max projection of all confidence maps (center), and tracked images (right) during a bout of head grooming. Video playback at ×0.15 real-time speed. Video corresponds to Fig. 1e.

  6. Supplementary Video 4

    Tracking joints robustly in images with heterogeneous background and noisy segmentation. Raw images (left), max projection of all confidence maps (center) and tracked images (right) of a freely moving courting male fly. Rows correspond to results from networks trained on unmasked and masked images, respectively. Video playback at ×0.2 real-time speed.

  7. Supplementary Video 5

    Tracking joints in freely moving rodents. Raw images (left), max projection of all confidence maps (center) and tracked images (right) of a freely moving mouse in an open field arena imaged from below through a clear acrylic floor. Video playback at ×0.2 real-time speed. Tracking is reliable over time but degenerate when certain parts are occluded, such as when the animal rears.

  8. Supplementary Software

    LEAP (LEAP estimates animal pose) software for estimation of animal body-part position.




Fig. 1: Body-part tracking via LEAP, a deep learning framework for animal pose estimation.
Fig. 2: LEAP is accurate and requires little training or labeled data.
Fig. 3: LEAP recapitulates known gait patterning in flies.
Fig. 4: Unsupervised embedding of body position dynamics.
Fig. 5: Locomotor clusters in behavior space separate distinct gait modes.
Fig. 6: LEAP generalizes to images with complex backgrounds or of other animals.