Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset

A preprint version of the article is available at bioRxiv.

Abstract

High-performing neural networks for vision have dramatically advanced our ability to account for neural data in biological systems. Recently, further improvement in the performance of these neural networks has been catalysed by joint training on images and natural language, increased dataset sizes and data diversity. We explored whether the same factors (joint training, dataset size and diversity) support similar improvements in the prediction of visual responses in the human brain. We used models pretrained with Contrastive Language-Image Pretraining (CLIP)—which learns image embeddings that best match text embeddings of image captions from diverse, large-scale datasets—to study visual representations. We built voxelwise encoding models based on CLIP image features to predict brain responses to real-world images. We found that ResNet50 with CLIP is a better model of high-level visual cortex, explaining up to R² = 79% of variance in voxel responses in held-out test data, a substantial increase from models trained only with image/label pairs (ImageNet-trained ResNet) or text (BERT). Comparisons across different model backbones ruled out network architecture as a factor in the performance improvements. Comparisons across models that controlled for dataset size and data diversity demonstrated that language feedback and large, diverse datasets are important factors in explaining neural responses in high-level visual brain regions. Visualizations of model embeddings and principal component analysis revealed that our models capture both global and fine-grained semantic dimensions represented within human visual cortex.
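
As a rough, hedged sketch of the analysis described above (not the authors' released pipeline, which is linked under Code availability), the following Python fragment fits a voxelwise encoding model: CLIP ResNet50 image embeddings are ridge-regressed onto voxel responses and scored by per-voxel R² on held-out images. The inputs `images` (stimulus images) and `voxels` (an image-by-voxel matrix of fMRI responses) are hypothetical placeholders.

    # Minimal sketch of a voxelwise encoding model on CLIP image features.
    # Assumes hypothetical inputs: `images`, a list of PIL images, and
    # `voxels`, an (n_images, n_voxels) array of fMRI responses.
    import numpy as np
    import torch
    import clip                                   # OpenAI CLIP package
    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import r2_score

    def clip_image_features(images, device=None):
        """Embed PIL images with the ResNet50 CLIP visual encoder."""
        device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        model, preprocess = clip.load("RN50", device=device)
        with torch.no_grad():
            batch = torch.stack([preprocess(im) for im in images]).to(device)
            feats = model.encode_image(batch)
        return feats.float().cpu().numpy()

    def fit_voxelwise_encoder(images, voxels, test_frac=0.2, seed=0):
        """Ridge-regress voxel responses on CLIP features; return per-voxel R²."""
        X = clip_image_features(images)
        idx = np.random.default_rng(seed).permutation(len(images))
        n_test = int(len(images) * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        reg = RidgeCV(alphas=np.logspace(-2, 5, 8))   # regularisation chosen by CV
        reg.fit(X[train], voxels[train])
        return r2_score(voxels[test], reg.predict(X[test]), multioutput="raw_values")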

Fig. 1: Model pipeline, motivation and prediction performance for the ResNetCLIP visual encoder.
Fig. 2: Prediction performance for the CLIP text encoder.
Fig. 3: Performance for the CLIP visual encoder using a ResNet backbone as compared to ResNetImageNet.
Fig. 4: Better representations of scenes with people in a model trained with CLIP can account for gains in unique variance.
Fig. 5: Variance partitioning analyses controlling for model architecture, data distribution and dataset size indicate that dataset size and diversity have comparatively smaller effects on voxel prediction than language input.
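
The variance partitioning summarized in Fig. 5 compares how much voxel variance two feature spaces (for example, CLIP features and ImageNet-trained features) explain uniquely versus jointly. A minimal sketch of the arithmetic, assuming per-voxel held-out R² values from encoding models fitted on each feature space alone and on their concatenation (all inputs hypothetical):

    # Set-theoretic variance partitioning between two feature spaces A and B.
    # r2_a, r2_b: per-voxel R² of models using only A or only B;
    # r2_joint: per-voxel R² of a model using A and B concatenated.
    def partition_variance(r2_a, r2_b, r2_joint):
        unique_a = r2_joint - r2_b        # variance explained only by A
        unique_b = r2_joint - r2_a        # variance explained only by B
        shared = r2_a + r2_b - r2_joint   # variance explained by both A and B
        return unique_a, unique_b, shared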

Data availability

We use the Natural Scenes Dataset (NSD), a large-scale fMRI dataset of participants viewing thousands of natural images. The NSD was made publicly available by the authors of ref. 24.

Code availability

Our code is available in a public GitHub repository at https://github.com/ariaaay/clip2brain.git (ref. 60).

References

  1. Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).

  2. Yamins, D. L. K. & DiCarlo, J. J. Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365 (2016).

  3. Toneva, M., Mitchell, T. M. & Wehbe, L. Combining computational controls with natural text reveals aspects of meaning composition. Nat. Comput. Sci. 2, 745–757 (2022).

  4. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).

  5. Aminoff, E. M. & Tarr, M. J. Associative processing is inherent in scene perception. PLoS ONE 10, e0128840 (2015).

  6. Gauthier, I., James, T. W., Curby, K. M. & Tarr, M. J. The influence of conceptual knowledge on visual discrimination. Cogn. Neuropsychol. 20, 507–523 (2003).

  7. Schaffner, J., Bao, S. D., Tobler, P. N., Hare, T. A. & Polania, R. Sensory perception relies on fitness-maximizing codes. Nat. Hum. Behav. 7, 1135–1151 (2023).

  8. Lupyan, G., Thompson-Schill, S. L. & Swingley, D. Conceptual penetration of visual processing. Psychol. Sci. 21, 682–691 (2010).

  9. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (eds. Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

  10. Li, L. H. et al. Grounded language-image pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 10955–10965 (IEEE, 2022).

  11. Yuan, L. et al. Florence: a new foundation model for computer vision. Preprint at https://doi.org/10.48550/arxiv.2111.11432 (2021).

  12. Jia, C. et al. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (eds. Meila, M. & Zhang, T.) 4904–4916 (PMLR, 2021).

  13. Wu Dao 2.0. https://gpt3demo.com/apps/wu-dao-20 (accessed 20 October 2022).

  14. Pinker, S. The Language Instinct: How the Mind Creates Language (HarperCollins, 2007).

  15. Fang, A. et al. Data determines distributional robustness in contrastive language image pre-training (CLIP). In Proceedings of the International Conference on Machine Learning (eds. Chaudhuri, K. et al.) 6216–6234 (PMLR, 2022).

  16. Mu, N., Kirillov, A., Wagner, D. & Xie, S. SLIP: self-supervision meets language-image pre-training. In Proceedings of the 17th European Conference on Computer Vision (eds. Avidan, S. & Brostow, G.) 529–544 (Springer Nature, 2022).

  17. Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J. & Chang, K.-W. VisualBERT: a simple and performant baseline for vision and language. Preprint at https://doi.org/10.48550/arXiv.1908.03557 (2019).

  18. Tan, H. & Bansal, M. LXMERT: learning cross-modality encoder representations from transformers. In Conference on Empirical Methods in Natural Language Processing (eds Inui, K. et al.) 5099–5110 (Association for Computational Linguistics, 2019).

  19. Murray, S. O., Boyaci, H. & Kersten, D. The representation of perceived angular size in human primary visual cortex. Nat. Neurosci. 9, 429–434 (2006).

  20. Gilbert, C. D. & Li, W. Top-down influences on visual processing. Nat. Rev. Neurosci. 14, 350–363 (2013).

  21. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  22. Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds. Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).

  23. Naselaris, T., Kay, K. N., Nishimoto, S. & Gallant, J. L. Encoding and decoding in fMRI. Neuroimage 56, 400–410 (2011).

  24. Allen, E. J. et al. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nat. Neurosci. 25, 116–126 (2022).

  25. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (eds. Daumé III, H. & Singh, A.) 1597–1607 (PMLR, 2020).

  26. Schuhmann, C. et al. LAION-5B: an open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 35, 25278–25294 (2022).

  27. Thomee, B. et al. YFCC100M: the new data in multimedia research. Commun. ACM 59, 64–73 (2016).

  28. Güçlü, U. & van Gerven, M. A. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J. Neurosci. 35, 10005–10014 (2015).

  29. Huth, A. G., De Heer, W. A., Griffiths, T. L., Theunissen, F. E. & Gallant, J. L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458 (2016).

  30. Epstein, R. A. & Baker, C. I. Scene perception in the human brain. Annu. Rev. Vis. Sci. 5, 373–397 (2019).

  31. Downing, P. E., Jiang, Y., Shuman, M. & Kanwisher, N. A cortical area selective for visual processing of the human body. Science 293, 2470–2473 (2001).

  32. Sergent, J., Ohta, S. & MacDonald, B. Functional neuroanatomy of face and object processing: a positron emission tomography study. Brain 115, 15–36 (1992).

  33. Kanwisher, N., McDermott, J. & Chun, M. M. The fusiform face area: a module in human extrastriate cortex specialized for face perception. J. Neurosci. 17, 4302–4311 (1997).

  34. Lescroart, M. D., Stansbury, D. E. & Gallant, J. L. Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Front. Comput. Neurosci. 9, 135 (2015).

  35. de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L. & Theunissen, F. E. The hierarchical cortical organization of human speech processing. J. Neurosci. 37, 6539–6557 (2017).

  36. Saxe, R. & Kanwisher, N. People thinking about thinking people: the role of the temporo-parietal junction in “theory of mind”. NeuroImage 19, 1835–1842 (2003).

  37. Çukur, T., Nishimoto, S., Huth, A. G. & Gallant, J. L. Attention during natural vision warps semantic representation across the human brain. Nat. Neurosci. 16, 763–770 (2013).

  38. Jain, N. et al. Selectivity for food in human ventral visual cortex. Commun. Biol. 6, 175 (2023).

  39. Pennock, I. M. L. et al. Color-biased regions in the ventral visual pathway are food selective. Curr. Biol. 33, 134–146.e4 (2023).

  40. Khosla, M., Apurva Ratan Murty, N. & Kanwisher, N. A highly selective response to food in human visual cortex revealed by hypothesis-free voxel decomposition. Curr. Biol. 32, 4159–4171.e9 (2022).

  41. Conwell, C., Prince, J. S., Hamblin, C. J. & Alvarez, G. A. Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models (eds. Kumar, A. et al.) (2023).

  42. Conwell, C., Prince, J. S., Alvarez, G. A. & Konkle, T. Large-scale benchmarking of diverse artificial vision models in prediction of 7T human neuroimaging data. Preprint at https://doi.org/10.1101/2022.03.28.485868 (2022).

  43. Conwell, C., Prince, J., Alvarez, G., Konkle, T. & Kay, K. Opportunistic experiments on a large-scale survey of diverse artificial vision models in prediction of 7T human fMRI data. In Conference on Cognitive Computational Neuroscience (2022).

  44. Bracci, S. & Op de Beeck, H. P. Understanding human object vision: a picture is worth a thousand representations. Annu. Rev. Psychol. 74, 113–135 (2023).

  45. Chang, N., Pyles, J. A., Marcus, A., Gupta, A., Tarr, M. J. & Aminoff, E. M. BOLD5000, a public fMRI dataset while viewing 5000 visual images. Sci. Data 6, 49 (2019).

  46. Hebart, M. N., Contier, O., Teichmann, L., Rockter, A. H., Zheng, C. Y., Kidder, A., Corriveau, A., Vaziri-Pashkam, M. & Baker, C. I. THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife 12, e82580 (2023).

  47. Maier, M. & Abdel Rahman, R. No matter how: top-down effects of verbal and semantic category knowledge on early visual perception. Cogn. Affect. Behav. Neurosci. 19, 859–876 (2019).

  48. Charest, I., Allen, E., Wu, Y., Naselaris, T. & Kay, K. Precise identification of semantic representations in the human brain. J. Vis. 20, 539–539 (2020).

  49. Devereux, B. J., Clarke, A. & Tyler, L. K. Integrated deep visual and semantic attractor neural networks predict fMRI pattern-information along the ventral object processing pathway. Sci. Rep. 8, 10636 (2018).

  50. Nappa, R., Wessel, A., McEldoon, K. L., Gleitman, L. R. & Trueswell, J. C. Use of speaker’s gaze and syntax in verb learning. Lang. Learn. Dev. 5, 203–234 (2009).

  51. Waxman, S. R. & Markow, D. B. Words as invitations to form categories: evidence from 12- to 13-month-old infants. Cogn. Psychol. 29, 257–302 (1995).

  52. Lupyan, G., Rakison, D. H. & McClelland, J. L. Language is not just for talking: redundant labels facilitate learning of novel categories. Psychol. Sci. 18, 1077–1083 (2007).

  53. Shusterman, A. & Spelke, E. in The Innate Mind: Structure and Contents (eds Carruthers, P. et al.) Ch. 6, 89–106 (Oxford Univ. Press, 2005).

  54. Lin, T. Y. et al. Microsoft COCO: common objects in context. In European Conference on Computer Vision – ECCV 2014. Lecture Notes in Computer Science, 8693 (eds. Fleet, D., Pajdla, T., Schiele, B., & Tuytelaars, T.) 740–755 (Springer, 2014).

  55. Dale, A. M., Fischl, B. & Sereno, M. I. Cortical surface-based analysis: I. Segmentation and surface reconstruction. NeuroImage 9, 179–194 (1999).

  56. Fischl, B., Sereno, M. I. & Dale, A. M. Cortical surface-based analysis: II. Inflation, flattening, and a surface-based coordinate system. NeuroImage 9, 195–207 (1999).

  57. Gao, J. S., Huth, A. G., Lescroart, M. D. & Gallant, J. L. Pycortex: an interactive surface visualizer for fMRI. Front. Neuroinform. 9 (2015).

  58. Koushik, J. torch-gel. GitHub https://github.com/jayanthkoushik/torch-gel (2017).

  59. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Methodol. 57, 289–300 (1995).

  60. Wang, A. ariaaay/clip2brain: initial release. Zenodo https://doi.org/10.5281/zenodo.8234313 (2023).

Acknowledgements

A.Y.W. and M.J.T. were supported by the AFRL/AFOSR award FA9550-18-1-0251. The NSD was supported by NSF IIS-1822683 and NSF IIS-1822929. We would like to thank the following people for contributing technical assistance, ideas and commentary to this project: J. Koushik, N. Chang and M. Henderson.

Author information

Contributions

A.Y.W., M.J.T. and L.W. conceived the experiments. K.K. and T.N. collected the neuroimaging data. A.Y.W. conducted the experiments and analysed the results. All authors wrote and edited the manuscript.

Corresponding author

Correspondence to Leila Wehbe.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–13.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, A.Y., Kay, K., Naselaris, T. et al. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nat Mach Intell 5, 1415–1426 (2023). https://doi.org/10.1038/s42256-023-00753-y
