Limits to visual representational correspondence between convolutional neural networks and the human brain

Convolutional neural networks (CNNs) are increasingly used to model human vision due to their high object categorization capabilities and general correspondence with human brain responses. Here we evaluate the performance of 14 different CNNs compared with human fMRI responses to natural and artificial images using representational similarity analysis. Despite the presence of some CNN-brain correspondence and CNNs’ impressive ability to fully capture lower-level visual representations of real-world objects, we show that CNNs do not fully capture higher-level visual representations of real-world objects, nor those of artificial objects, at either lower or higher levels of visual representation. The latter is particularly critical, as the processing of both real-world and artificial visual stimuli engages the same neural circuits. We report similar results regardless of differences in CNN architecture, training, or the presence of recurrent processing. This indicates that some fundamental differences exist in how the brain and CNNs represent visual information.
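The representational similarity analysis pipeline described above can be sketched in a few lines. The data, dimensions, and function names below are illustrative assumptions, not the authors' actual analysis code:

```python
import numpy as np

def rdm(responses):
    """Lower-triangle dissimilarity vector from a conditions x channels
    matrix, using pairwise Euclidean distance between condition rows."""
    diffs = responses[:, None, :] - responses[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    i, j = np.tril_indices(len(responses), k=-1)
    return dist[i, j]

def spearman(a, b):
    """Spearman rho: Pearson correlation of rank-transformed vectors
    (assumes no ties, which holds for continuous distances)."""
    ranks = lambda v: v.argsort().argsort()
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

def brain_cnn_correspondence(brain_responses, cnn_responses):
    """Correlate a brain region's RDM with a CNN layer's RDM."""
    return spearman(rdm(brain_responses), rdm(cnn_responses))

# Toy data: 8 object categories x 100 channels (voxels or CNN units).
rng = np.random.default_rng(0)
brain = rng.standard_normal((8, 100))
layer = brain + rng.standard_normal((8, 100))   # a partially matching layer
rho = brain_cnn_correspondence(brain, layer)
```

In the actual analyses, `brain` would hold a visual region's voxel responses to the object categories and `layer` a CNN layer's unit activations to the same images; the resulting correlation is then compared against the noise ceiling of the fMRI data.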


Quantifying the brain-CNN correspondence in Experiment 2 with the Full, High SF, and Low SF images from the real-world object categories. (A) The upper and lower bounds of the noise ceiling of the fMRI responses for each image condition. (B) to (D) RDM correlations of each brain region with each sampled layer in each CNN for the Full, High SF, and Low SF images, respectively. N = 10 human participants. For illustration purposes, the lower bounds of the noise ceiling from all brain regions were shifted to 0.7, and the difference between the actual noise ceiling and 0.7 was subtracted from each brain-CNN correlation value. The asterisks at the top of each plot mark the significance levels of the differences between the highest correlations (Fisher-transformed) of brain regions and CNN layers and the corresponding lower bound of the noise ceiling; one-tailed t-tests were used, and all p values reported were corrected for multiple comparisons for the 6 brain regions included using the Benjamini-Hochberg procedure. Error bars indicate standard errors of the means. † p < .1, * p < .05, ** p < .01, *** p < .001. Source data are provided as a Source Data file.

Quantifying the brain-CNN correspondence in Experiment 1. Three different measures were used: Spearman brain-CNN correlation with Euclidean distance measures for RDM construction (same as those shown in Supplementary Figure 4), Spearman brain-CNN correlation with correlation measures for RDM construction, and Pearson brain-CNN correlation with Euclidean distance measures for RDM construction. N = 6 human participants. Virtually the same results were obtained with all three measures. For illustration purposes, the lower bounds of the noise ceiling from all brain regions were shifted to 0.7, and the difference between the actual noise ceiling and 0.7 was subtracted from each brain-CNN correlation value. Error bars indicate standard errors of the means. Source data are provided as a Source Data file.

Quantifying the brain-CNN correspondence in Experiment 2. Three different measures were used: Spearman brain-CNN correlation with Euclidean distance measures for RDM construction (same as those shown in Supplementary Figure 5), Spearman brain-CNN correlation with correlation measures for RDM construction, and Pearson brain-CNN correlation with Euclidean distance measures for RDM construction. N = 10 human participants. Virtually the same results were obtained with all three measures. For illustration purposes, the lower bounds of the noise ceiling from all brain regions were shifted to 0.7, and the difference between the actual noise ceiling and 0.7 was subtracted from each brain-CNN correlation value. Error bars indicate standard errors of the means. Source data are provided as a Source Data file.

Results for the human visual regions are shown with category-selective voxels either included (All Voxels) or excluded (No Category Voxels). Since rotations and flips preserve distances on these MDS plots, to make these plots more informative and to see how the representational structure evolves across brain regions and CNN layers, we manually rotated and/or flipped each MDS when necessary. Source data are provided as a Source Data file.
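The lower and upper bounds of the noise ceiling referenced in these captions are commonly estimated with a leave-one-out scheme across subjects; the following is a minimal sketch under that assumption, with toy subject RDMs (the function names and data are hypothetical):

```python
import numpy as np

def spearman(a, b):
    """Spearman rho via rank transform (no ties expected for toy data)."""
    ranks = lambda v: v.argsort().argsort()
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

def noise_ceiling(subject_rdms):
    """Upper bound: each subject's RDM vs. the group-mean RDM (self included).
    Lower bound: each subject's RDM vs. the mean RDM of the other subjects."""
    group_mean = subject_rdms.mean(axis=0)
    upper = np.mean([spearman(s, group_mean) for s in subject_rdms])
    lower = np.mean([
        spearman(subject_rdms[i],
                 np.delete(subject_rdms, i, axis=0).mean(axis=0))
        for i in range(len(subject_rdms))
    ])
    return lower, upper

# Toy RDMs: 10 subjects sharing a common structure plus subject noise.
rng = np.random.default_rng(1)
shared = rng.standard_normal(28)                     # 28 category pairs
rdms = shared + 0.5 * rng.standard_normal((10, 28))
lower, upper = noise_ceiling(rdms)
```

A brain-CNN correlation that reaches this lower bound is as high as could be expected given the reliability of the fMRI data, which is the comparison the one-tailed t-tests in these captions perform.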

Supplementary Figure 9. Visualizing the representational structures for the controlled real-world object images in all human visual regions and CNN layers examined. Results for the human visual regions are shown with category-selective voxels either included (All Voxels) or excluded (No Category Voxels). Since rotations and flips preserve distances on these MDS plots, to make these plots more informative and to see how the representational structure evolves across brain regions and CNN layers, we manually rotated and/or flipped each MDS when necessary. Source data are provided as a Source Data file.
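The MDS visualizations in these captions, and the rotation/flip step they mention, can be illustrated with classical MDS. This is a toy sketch (the function names and data are hypothetical), showing that rigid rotations and flips leave all pairwise distances on the plot unchanged:

```python
import numpy as np

def classical_mds(dist, ndim=2):
    """Classical MDS: double-center the squared distances, then take the
    top eigenvectors of the resulting Gram matrix as coordinates."""
    n = len(dist)
    j = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * j @ (dist ** 2) @ j
    vals, vecs = np.linalg.eigh(gram)              # ascending eigenvalues
    top = np.argsort(vals)[::-1][:ndim]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

def rotate_flip(points, angle_deg=0.0, flip_x=False):
    """Rigid transform of a 2-D embedding; preserves all pairwise distances."""
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    out = points @ rot.T
    if flip_x:
        out[:, 0] *= -1.0
    return out

# Toy example: 6 "categories" in a 4-D response space.
rng = np.random.default_rng(2)
pts = rng.standard_normal((6, 4))
dist = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(-1))
emb = classical_mds(dist)                          # 6 x 2 plot coordinates
emb_rotated = rotate_flip(emb, angle_deg=30.0, flip_x=True)
```

Because the transform is rigid, the representational structure shown on the plot is unaffected; only its orientation changes, which is why the manual rotations and flips described above are legitimate.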

[Figure panel titles: Brain (Experiment 2, Full SF Images); CNNs (Full SF Images)]

Supplementary Figure 10. Visualizing the representational structures for the full SF real-world object images in all human visual regions and CNN layers examined. Results for the human visual regions are shown with category-selective voxels either included (All Voxels) or excluded (No Category Voxels). Since rotations and flips preserve distances on these MDS plots, to make these plots more informative and to see how the representational structure evolves across brain regions and CNN layers, we manually rotated and/or flipped each MDS when necessary. Source data are provided as a Source Data file.

[Figure panel titles: CNNs (High SF Images); Brain (Experiment 2, High SF Images)]

Supplementary Figure 11. Visualizing the representational structures for the high SF real-world object images in all human visual regions and CNN layers examined. Results for the human visual regions are shown with category-selective voxels either included (All Voxels) or excluded (No Category Voxels). Since rotations and flips preserve distances on these MDS plots, to make these plots more informative and to see how the representational structure evolves across brain regions and CNN layers, we manually rotated and/or flipped each MDS when necessary. Source data are provided as a Source Data file.

Supplementary Figure 12. Visualizing the representational structures for the low SF real-world object images in all human visual regions and CNN layers examined. Results for the human visual regions are shown with category-selective voxels either included (All Voxels) or excluded (No Category Voxels). Since rotations and flips preserve distances on these MDS plots, to make these plots more informative and to see how the representational structure evolves across brain regions and CNN layers, we manually rotated and/or flipped each MDS when necessary. Source data are provided as a Source Data file.

Supplementary Figure 14. The amount of LOT and VOT RDM variance explained by CNNs including and excluding category-selective voxels in LOT and VOT for Experiment 2 with real-world object images. N = 10 human participants. Within each CNN, the maximum amount of variance explained for each brain region (i.e., the peak value) was compared between when category-selective voxels were included and when they were excluded. (A) Results from the full-SF images. In LOT, a difference was found for InceptRes-v2, p = .02. In VOT, differences were found for Alexnet, p = .02; InceptRes-v2, p = .03; and Xception, p = .02. (B) Results from the high-SF images. In LOT, a difference was found for Squeezenet, p = .03. (C) Results from the low-SF images. In VOT, differences were found for Resnet-101, p = .03, and Vgg-16, p = .001. No other differences were found. Two-tailed t-tests were used, and all p values reported were uncorrected. Error bars indicate standard errors of the means. Source data are provided as a Source Data file.

(B) and (C) RDM correlations of each brain region with each sampled layer in each CNN for the real-world and artificial object images, respectively. N = 6 human participants. For illustration purposes, the lower bounds of the noise ceiling from all brain regions were shifted to 0.7, and the difference between the actual noise ceiling and 0.7 was subtracted from each brain-CNN correlation value.
The asterisks at the top of each plot mark the significance levels of the differences between the highest correlations (Fisher-transformed) of brain regions and CNN layers and the corresponding lower bound of the noise ceiling; one-tailed t-tests were used, and all p values reported were corrected for multiple comparisons for the 6 brain regions included using the Benjamini-Hochberg procedure. Error bars indicate standard errors of the means. † p < .1, * p < .05, ** p < .01, *** p < .001. Source data are provided as a Source Data file. (We note that the lower bounds of the noise ceiling were higher for natural than for artificial object categories. This could be due to several reasons. First, visual neurons may be more responsive to natural than to artificial objects, as our brain has evolved to process natural rather than artificial stimuli. Second, the natural images were overall more variable in identity within a category and thus more interesting to look at than the similar-looking artificial images. This could potentially increase the SNR for the natural objects. Lastly, the natural categories were more distinctive from each other than the artificial ones were. Thus, any distortion in the representational structure due to noise would be smaller for the natural than for the artificial objects, potentially increasing RDM consistency among the subjects.)

Quantifying the brain-CNN correspondence in Experiment 3. Three different measures were used: Spearman brain-CNN correlation with Euclidean distance measures for RDM construction (same as those shown in Supplementary Figure 15), Spearman brain-CNN correlation with correlation measures for RDM construction, and Pearson brain-CNN correlation with Euclidean distance measures for RDM construction. N = 6 human participants. Virtually the same results were obtained with all three measures. For illustration purposes, the lower bounds of the noise ceiling from all brain regions were shifted to 0.7, and the difference between the actual noise ceiling and 0.7 was subtracted from each brain-CNN correlation value. Error bars indicate standard errors of the means. Source data are provided as a Source Data file.
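The Fisher transformation and Benjamini-Hochberg correction mentioned in these captions can be sketched as follows; the toy p values are illustrative, and this is not the authors' analysis code:

```python
import numpy as np

def fisher_z(r):
    """Fisher transformation of a correlation: variance-stabilizing and
    approximately normal, applied before the t-tests described above."""
    return np.arctanh(r)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values (step-up FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p value downward
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out
```

In the setting above, the six p values from the six brain regions would be passed to `benjamini_hochberg` together, and each adjusted value compared with the significance thresholds shown in the figure legends.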

[Figure panel title: Brain (Experiment 3, Artificial Images)]
Supplementary Figure 17. Visualizing the representational structures for the artificial object images in all human visual regions and CNN layers examined. Results for the human visual regions are shown with category-selective voxels either included (All Voxels) or excluded (No Category Voxels). Since rotations and flips preserve distances on these MDS plots, to make these plots more informative and to see how the representational structure evolves across brain regions and CNN layers, we manually rotated and/or flipped each MDS when necessary. Source data are provided as a Source Data file.

Supplementary Figure 18. The amount of LOT and VOT RDM variance explained by CNNs including and excluding category-selective voxels in LOT and VOT for Experiment 3. N = 6 human participants. Within each CNN, the maximum amount of variance explained for each brain region (i.e., the peak value) was compared between when category-selective voxels were included and when they were excluded. (A) Results from the real-world natural object images. In VOT, a difference was found for Resnet-18, p = .04. No other comparisons reached significance. (B) Results from the artificial object images. No differences reached significance. Two-tailed t-tests were used, and all p values reported were uncorrected. Error bars indicate standard errors of the means. Source data are provided as a Source Data file.
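The peak variance-explained comparison described in these captions (maximum r² across layers per region, compared between voxel selections with a two-tailed paired t-test across participants) can be sketched as follows; all data and names below are toy assumptions. Computing the p value from the t statistic would additionally require the t distribution's CDF, which is omitted here:

```python
import numpy as np

def peak_r2(per_layer_r):
    """Peak variance explained: square each layer's brain-CNN correlation
    and take the maximum across layers (per subject)."""
    return np.max(np.asarray(per_layer_r) ** 2, axis=-1)

def paired_t(x, y):
    """Two-tailed paired t statistic across subjects (the p value would
    come from the t distribution with len(x) - 1 degrees of freedom)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Toy correlations: 2 subjects x 3 layers, for two voxel selections
# (all voxels vs. category-selective voxels excluded).
r_all = np.array([[0.2, 0.5, -0.6], [0.1, 0.4, 0.3]])
r_no_cat = np.array([[0.1, 0.3, -0.4], [0.2, 0.2, 0.1]])
t_stat = paired_t(peak_r2(r_all), peak_r2(r_no_cat))
```

Pairing by subject matters here: each participant contributes one peak value per voxel selection, and the test asks whether excluding category-selective voxels systematically changes that peak.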