Learning high-level visual representations from a child’s perspective without strong inductive biases

A preprint version of the article is available at arXiv.

Abstract

Young children develop sophisticated internal models of the world based on their visual experience. Can such models be learned from a child’s visual experience without strong inductive biases? To investigate this, we train state-of-the-art neural networks on a realistic proxy of a child’s visual experience without any explicit supervision or domain-specific inductive biases. Specifically, we train both embedding models and generative models on 200 hours of headcam video from a single child collected over two years and comprehensively evaluate their performance in downstream tasks using various reference models as yardsticks. On average, the best embedding models perform at a respectable 70% of a high-performance ImageNet-trained model, despite substantial differences in training data. They also learn broad semantic categories and object localization capabilities without explicit supervision, but they are less object-centric than models trained on all of ImageNet. Generative models trained with the same data successfully extrapolate simple properties of partially masked objects, like their rough outline, texture, colour or orientation, but struggle with finer object details. We replicate our experiments with two other children and find remarkably consistent results. Broadly useful high-level visual representations are thus robustly learnable from a sample of a child’s visual experience without strong inductive biases.
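
To make the evaluation protocol easier to picture, the sketch below shows a generic linear-probe evaluation in PyTorch: the pretrained encoder is frozen and only a linear readout is trained on its embeddings. This is a minimal illustration rather than the authors' evaluation code; the backbone, data loaders, feature dimension and hyperparameters are placeholders.

```python
# Minimal sketch of a linear-probe evaluation on frozen embeddings (illustrative only;
# not the paper's actual evaluation code). Assumes `backbone` maps an image batch to a
# feature vector and that the loaders yield (image, label) batches from a downstream
# classification dataset.
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, train_loader, val_loader,
                 epochs=10, lr=1e-3, device="cuda"):
    backbone = backbone.to(device).eval()                 # freeze the pretrained encoder
    for p in backbone.parameters():
        p.requires_grad = False

    head = nn.Linear(feat_dim, num_classes).to(device)    # trainable linear readout
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feats = backbone(x)                        # frozen features
            loss = loss_fn(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # top-1 accuracy on the held-out split
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            pred = head(backbone(x)).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total
```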


Fig. 1: Schematic overview of the experiments.
Fig. 2: Quantitative evaluation of the embedding models.
Fig. 3: Qualitative evaluation of the embedding models.
Fig. 4: t-distributed stochastic neighbor embeddings of the ImageNet classes.
Fig. 5: Nearest neighbours in the embedding space.
Fig. 6: Qualitative evaluation of the generative models.

Data availability

Except for SAYCam, all data used in this study are publicly available. Instructions for accessing the public datasets are detailed in Methods. The SAYCam dataset can be accessed by authorized users with an institutional affiliation from the following Databrary repository: https://doi.org/10.17910/b7.564. The ‘Labeled S’ evaluation dataset, which is a subset of SAYCam, is also available from the same repository under the session name ‘Labeled S’.

Code availability

All of our pretrained models (over 70 different models), as well as a variety of tools to use and analyse them, are available from the following public repository: https://github.com/eminorhan/silicon-menagerie (ref. 63). The repository also contains further examples of (1) attention and class activation maps, (2) t-SNE visualizations of embeddings, (3) nearest neighbour retrievals from the embedding models and (4) unconditional and conditional samples from the generative models. The code used for training and evaluating all the models is also publicly available from the same repository.
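
As an illustration of how such pretrained checkpoints are typically used, the hypothetical sketch below builds a ViT backbone with timm, loads a saved state dict and extracts an image embedding. The checkpoint path, architecture name and state-dict layout are assumptions made for the example; the repository's README documents its actual loading utilities.

```python
# Hypothetical sketch of loading a pretrained vision transformer checkpoint and
# extracting an image embedding. Paths, the architecture name and the state-dict
# layout are placeholders, not the repository's actual API.
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Build a ViT backbone without a classification head (num_classes=0 returns features).
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

state = torch.load("checkpoint.pth", map_location="cpu")  # placeholder checkpoint path
model.load_state_dict(state, strict=False)                # keys may need remapping
model.eval()

# Preprocess an image the way timm expects for this architecture.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

img = Image.open("example.jpg").convert("RGB")            # placeholder image
with torch.no_grad():
    embedding = model(transform(img).unsqueeze(0))        # shape: (1, feat_dim)
print(embedding.shape)
```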

References

  1. Bomba, P. & Siqueland, E. The nature and structure of infant form categories. J. Exp. Child Psychol. 35, 294–328 (1983).

  2. Murphy, G. The Big Book of Concepts (MIT, 2002).

  3. Kellman, P. & Spelke, E. Perception of partly occluded objects in infancy. Cogn. Psychol. 15, 483–524 (1983).

  4. Spelke, E., Breinlinger, K., Macomber, J. & Jacobson, K. Origins of knowledge. Psychol. Rev. 99, 605–632 (1992).

  5. Ayzenberg, V. & Lourenco, S. Young children outperform feed-forward and recurrent neural networks on challenging object recognition tasks. J. Vis. 20, 310 (2020).

  6. Huber, L. S., Geirhos, R. & Wichmann, F. A. The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks. J. Vis. 23, 4 (2023).

  7. Locke, J. An Essay Concerning Human Understanding (ed. Fraser, A. C.) (Clarendon Press, 1894).

  8. Leibniz, G. New Essays on Human Understanding 2nd edn (eds Remnant, P. & Bennett, J.) (Cambridge Univ. Press, 1996).

  9. Spelke, E. Initial knowledge: six suggestions. Cognition 50, 431–445 (1994).

  10. Markman, E. Categorization and Naming in Children (MIT, 1989).

  11. Merriman, W., Bowman, L. & MacWhinney, B. The mutual exclusivity bias in children’s word learning. Monogr. Soc. Res. Child Dev. 54, 1–132 (1989).

  12. Elman, J., Bates, E. & Johnson, M. Rethinking Innateness: A Connectionist Perspective on Development (MIT, 1996).

  13. Sullivan, J., Mei, M., Perfors, A., Wojcik, E. & Frank, M. SAYCam: a large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind 5, 20–29 (2022).

  14. Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9650–9660 (IEEE, 2021).

  15. Zhou, P. et al. Mugs: a multi-granular self-supervised learning framework. Preprint at https://arxiv.org/abs/2203.14415 (2022).

  16. He, K. et al. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition 15979–15988 (IEEE, 2022).

  17. Dosovitskiy, A. et al. An image is worth 16×16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2021).

  18. Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1492–1500 (IEEE, 2017).

  19. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).

  20. Smaira, L. et al. A short note on the Kinetics-700-2020 human action dataset. Preprint at https://arxiv.org/abs/2010.10864 (2020).

  21. Grauman, K. et al. Ego4D: around the world in 3,000 hours of egocentric video. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 18995–19012 (IEEE, 2022).

  22. Esser, P., Rombach, R. & Ommer, B. Taming transformers for high-resolution image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 12873–12883 (IEEE, 2021).

  23. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019).

  24. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2921–2929 (IEEE, 2016).

  25. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  26. Kuznetsova, A. et al. The Open Images Dataset V4. Int. J. Comput. Vis. 128, 1956–1981 (2020).

  27. Smith, L. & Slone, L. A developmental approach to machine learning? Front. Psychol. 8, 2124 (2017).

  28. Bambach, S., Crandall, D., Smith, L. & Yu, C. Toddler-inspired visual object learning. Adv. Neural Inf. Process. Syst. 31, 1209–1218 (2018).

  29. Zaadnoordijk, L., Besold, T. & Cusack, R. Lessons from infant learning for unsupervised machine learning. Nat. Mach. Intell. 4, 510–520 (2022).

  30. Orhan, E., Gupta, V. & Lake, B. Self-supervised learning through the eyes of a child. Adv. Neural Inf. Process. Syst. 33, 9960–9971 (2020).

  31. Lee, D., Gujarathi, P. & Wood, J. Controlled-rearing studies of newborn chicks and deep neural networks. Preprint at https://arxiv.org/abs/2112.06106 (2021).

  32. Zhuang, C. et al. Unsupervised neural network models of the ventral visual stream. Proc. Natl Acad. Sci. USA 118, e2014196118 (2021).

  33. Zhuang, C. et al. How well do unsupervised learning algorithms model human real-time and life-long learning? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022).

  34. Vong, W. K., Wang, W., Orhan, A. E. & Lake, B. M. Grounded language acquisition through the eyes and ears of a single child. Science 383, 504–511 (2024).

  35. Locatello, F. et al. Object-centric learning with slot attention. Adv. Neural Inf. Process. Syst. 33, 11525–11538 (2020).

  36. Lillicrap, T., Santoro, A., Marris, L., Akerman, C. & Hinton, G. Backpropagation and the brain. Nat. Rev. Neurosci. 21, 335–346 (2020).

  37. Gureckis, T. & Markant, D. Self-directed learning: a cognitive and computational perspective. Perspect. Psychol. Sci. 7, 464–481 (2012).

  38. Long, B. et al. The BabyView camera: designing a new head-mounted camera to capture children’s early social and visual environments. Behav. Res. Methods https://doi.org/10.3758/s13428-023-02206-1 (2023).

  39. Moore, D., Oakes, L., Romero, V. & McCrink, K. Leveraging developmental psychology to evaluate artificial intelligence. In 2022 IEEE International Conference on Development and Learning (ICDL) 36–41 (IEEE, 2022).

  40. Frank, M. C. Bridging the data gap between children and large language models. Trends Cogn. Sci. 27, 990–992 (2023).

  41. Object stimuli. Brady Lab https://bradylab.ucsd.edu/stimuli/ObjectCategories.zip

  42. Konkle, T., Brady, T., Alvarez, G. & Oliva, A. Conceptual distinctiveness supports detailed visual long-term memory for real-world objects. J. Exp. Psychol. Gen. 139, 558 (2010).

  43. Lomonaco, V. & Maltoni, D. CORe50 Dataset. GitHub https://vlomonaco.github.io/core50 (2017).

  44. Lomonaco, V. & Maltoni, D. CORe50: a new dataset and benchmark for continuous object recognition. In Proc. 1st Annual Conference on Robot Learning (eds Levine, S. et al.) 17–26 (PMLR, 2017).

  45. Russakovsky, O. et al. ImageNet Dataset. https://www.image-net.org/download.php (2015).

  46. Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).

  47. Geirhos, R. et al. Partial success in closing the gap between human and machine vision. Adv. Neural Inf. Process. Syst. 34, 23885–23899 (2021).

  48. Geirhos, R. et al. ImageNet OOD Dataset. GitHub https://github.com/bethgelab/model-vs-human (2021).

  49. Mehrer, J., Spoerer, C., Jones, E., Kriegeskorte, N. & Kietzmann, T. An ecologically motivated image dataset for deep learning yields better models of human vision. Proc. Natl Acad. Sci. USA 118, e2011417118 (2021).

  50. Mehrer, J., Spoerer, C., Jones, E., Kriegeskorte, N. & Kietzmann, T. Ecoset Dataset. Hugging Face https://huggingface.co/datasets/kietzmannlab/ecoset (2021).

  51. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A. & Torralba, A. Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452–1464 (2017).

  52. Zhou, B. et al. Places365 Dataset. http://places2.csail.mit.edu (2017).

  53. Pont-Tuset, J. et al. The 2017 DAVIS challenge on video object segmentation. Preprint at https://arxiv.org/abs/1704.00675 (2017).

  54. Pont-Tuset, J. et al. DAVIS-2017 evaluation code, dataset and results. https://davischallenge.org/davis2017/code.html (2017).

  55. Lin, T. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014 (eds Fleet, D. et al.) 740–755 (Springer, 2014).

  56. COCO Dataset. https://cocodataset.org/#download (2014).

  57. Jabri, A., Owens, A. & Efros, A. Space-time correspondence as a contrastive random walk. Adv. Neural Inf. Process. Syst. 33, 19545–19560 (2020).

  58. Kinetics-700-2020 Dataset. https://github.com/cvdfoundation/kinetics-dataset#kinetics-700-2020 (2020).

  59. Ego4D Dataset. https://ego4d-data.org/ (2022).

  60. Kingma, D. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).

  61. VQGAN resources. GitHub https://github.com/CompVis/taming-transformers (2021).

  62. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 30, 6629–6640 (2017).

  63. Orhan, A. E. eminorhan/silicon-menagerie: v1.0.0-alpha. Zenodo https://doi.org/10.5281/zenodo.8322408 (2023).

Acknowledgements

We thank W. K. Vong, A. Tartaglini and M. Ren for helpful discussions and comments on an earlier version of this paper. This work was supported by the DARPA Machine Common Sense program (B.M.L.) and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation and Responsibility for Data Science (B.M.L.).

Author information

Contributions

A.E.O. and B.M.L. conceptualized and designed the study. A.E.O. implemented the experiments. A.E.O. analysed the results with feedback from B.M.L. A.E.O. wrote the first draft. B.M.L. reviewed and edited the paper.

Corresponding author

Correspondence to A. Emin Orhan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Rhodri Cusack, Masataka Sawayama and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–8 and Tables 1 and 2.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Orhan, A.E., Lake, B.M. Learning high-level visual representations from a child’s perspective without strong inductive biases. Nat Mach Intell 6, 271–283 (2024). https://doi.org/10.1038/s42256-024-00802-0
